This post presents the latest list of papers retrieved from Arxiv.org on 2024-09-04. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically every morning at around 10:30.

Reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically at around 10:30 each day.

Overview (2024-09-04)

A total of 1,104 papers were updated today, including:

  • Natural Language Processing: 170 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 279 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 290 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 310 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
[NLP-0] 制作您的数据集:通过数据库检索和增强生成特定于任务的合成数据集

链接: https://arxiv.org/abs/2409.02098
作者: Ingo Ziegler,Abdullatif Köksal,Desmond Elliott,Hinrich Schütze
关键词-EN: Building high-quality datasets, specialized domain knowledge, requires specialized domain, Building high-quality, domain knowledge
关键词-ZH: 构建高质量的数据集、专业领域知识,需要专业领域,构建高质量的领域知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
摘要:为专门的任务构建高质量的数据集是一个耗时和资源密集型的过程,通常需要专门的领域知识。我们提出了一种用于精细调整的语料库检索和增强(CREATE),这是一种生成合成数据集的方法,给定少量用户编写的少数几个镜头来演示要执行的任务。考虑到少数几个例子,我们使用大规模公共网络爬行语料库和基于相似度的文档检索来查找其他相关的人类编写的文档。最后,指令调优的大型语言模型(LLM)将检索到的文档扩充为定制格式的任务样本,然后可以使用这些样本进行微调。我们证明了CREATE可以有效地为四个不同的任务生成大规模的特定于任务的训练数据集:生物问答、医学问答和常识问答以及摘要。我们的实验表明,在QA任务中,基于工艺的模型的性能优于或达到与一般LLM相当的性能,而基于工艺的摘要模型的性能比基于人工挑选的数据训练的模型高出46个偏好点。
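
The two-stage recipe above (similarity-based retrieval over a large web corpus, then LLM-based reformatting into task samples) can be illustrated with a minimal sketch. The embeddings below are random stand-ins and `augment_with_llm` is a hypothetical placeholder for the instruction-tuned LLM call; CRAFT's actual prompts, corpora, and filtering are specified in the paper.

```python
import numpy as np

def retrieve_similar(few_shot_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 5):
    """Return indices of the k corpus documents closest (cosine) to the centroid
    of the user-written few-shot embeddings."""
    query = few_shot_emb.mean(axis=0)
    query /= np.linalg.norm(query)
    corpus_norm = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    return np.argsort(-(corpus_norm @ query))[:k]

def augment_with_llm(document: str, task_instruction: str) -> dict:
    """Hypothetical placeholder for the instruction-tuned LLM call that rewrites
    a retrieved document into a custom-formatted task sample (e.g., a QA pair)."""
    return {"input": f"{task_instruction}\n\n{document}", "output": "<LLM-generated answer>"}

# Toy data: 3 few-shot examples and a 100-document corpus with 384-d embeddings.
rng = np.random.default_rng(0)
few_shot_emb = rng.normal(size=(3, 384))
corpus_emb = rng.normal(size=(100, 384))
corpus_texts = [f"document {i}" for i in range(100)]

top_idx = retrieve_similar(few_shot_emb, corpus_emb, k=5)
synthetic = [augment_with_llm(corpus_texts[i], "Answer the biology question.") for i in top_idx]
print(len(synthetic), "synthetic training samples")
```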

[NLP-1] Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
[NLP-1] 政治辩论:政治文本的高效零镜头和少镜头分类器

链接: https://arxiv.org/abs/2409.02078
作者: Michael Burnham,Kayla Kahn,Ryan Yank Wang,Rachel X. Peng
关键词-EN: Social scientists quickly, scientists quickly adopted, Social scientists, quickly adopted large, adopted large language
关键词-ZH: 社会科学家迅速采用,科学家迅速采用,社会科学家迅速采用大型语言
类目: Computation and Language (cs.CL)
备注: 26 pages, 5 figures

Abstract:Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good, or better than, state-of-the art large language models at zero and few-shot classification, but are orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models – a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
摘要:社会科学家很快采用了大型语言模型,因为它们能够在没有监督训练的情况下对文档进行注释,这种能力被称为零机会学习。然而,由于其计算需求、成本以及通常是专有性质,这些模型经常与复制和开放科学标准相冲突。介绍了政治辩论(DeBERTa算法的文本蕴涵)语言模型,用于政治文档的零镜头分类和少镜头分类。这些模型不仅与最先进的大型语言模型一样好,甚至更好,而且效率更高,而且完全开放源代码。通过在10-25个文档的简单随机样本上训练模型,它们可以胜过对数百或数千个文档训练的监督分类器,以及具有复杂、工程提示的最先进的生成模型。此外,我们还发布了用于训练这些模型的PolNLI数据集–一个由200,000多个政治文档组成的语料库,这些文档的标签非常准确,涉及800多个分类任务。
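
Because the models above are NLI-style (textual entailment) classifiers, they plug directly into the standard zero-shot classification interface. A minimal sketch with the Hugging Face pipeline follows; the checkpoint id is a generic DeBERTa NLI model used as a stand-in, not necessarily the released Political DEBATE weights.

```python
from transformers import pipeline

# Any NLI-trained DeBERTa checkpoint can be dropped in here; this id is a
# generic stand-in rather than the released Political DEBATE weights.
classifier = pipeline("zero-shot-classification", model="microsoft/deberta-large-mnli")

text = "The senator introduced a bill to expand rural broadband access."
labels = ["economic policy", "foreign policy", "healthcare", "technology policy"]

result = classifier(text, candidate_labels=labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```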

[NLP-2] Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
[NLP-2] 编织金线:语言模型中的长形式生成基准

链接: https://arxiv.org/abs/2409.02076
作者: Yuhao Wu,Ming Shan Hee,Zhiqing Hu,Roy Ka-Wei Lee
关键词-EN: comprises tasks designed, large text sequences, identify specific information, Golden Thread, Spinning the Golden
关键词-ZH: 包括设计的任务、大型文本序列、识别特定信息、金线、旋转金线
类目: Computation and Language (cs.CL)
备注:

Abstract:The abilities of long-context language models (LMs) are often evaluated using the “Needle-in-a-Haystack” (NIAH) test, which comprises tasks designed to assess a model’s ability to identify specific information (“needle”) within large text sequences (“haystack”). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation–a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models’ ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
摘要:长语境语言模型(LMS)的能力通常使用“干草堆中的针”(NIAH)测试来评估,该测试包括旨在评估模型在大型文本序列(“干草堆”)中识别特定信息(“针”)的能力的任务。虽然这些基准测试衡量模型理解长上下文输入序列的程度,但它们不能有效地衡量长格式文本生成的质量–这是设计提案和创造性写作等应用程序的关键方面。为了弥补这一差距,我们引入了一个新的长格式文本评估基准,旋转黄金线索(SGT),它测试模型识别生成的长文本序列中特定事件的能力。在这个基准测试中,我们提示长上下文LMS创建必须包括特定事件或约束的长格式文本,并评估它们合并这些元素的能力。我们在四个不同的场景、三种类型的提示指令和两种不同的代长设置(16K和32K)中评估了10个长上下文LMS。尽管这些模型在Niah基准上表现良好,但没有一个模型在旋转的黄金线上表现出令人满意的表现,这引发了人们对它们生成遵循指令的连贯长文本能力的担忧。此外,随着生成的文本长度的增加,所有模型的性能都会显著下降。
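
At its core, the benchmark prompts a model to write long-form text that must contain specified events and then checks whether those events actually appear. A naive checker is sketched below using literal keyword matching; the paper's actual evaluation protocol may differ and would more plausibly rely on an LLM judge or entailment model.

```python
def event_coverage(generated_text: str, required_events: list[str]) -> float:
    """Fraction of required events whose key phrase appears in the generation.
    Literal substring matching is used only for illustration; an LLM judge or
    entailment model would be a more faithful evaluator."""
    text = generated_text.lower()
    hits = sum(1 for event in required_events if event.lower() in text)
    return hits / len(required_events)

story = "After the storm, Mara repaired the lighthouse lamp and signalled the rescue boat."
events = ["repaired the lighthouse", "signalled the rescue boat", "found the treasure"]
print(f"event coverage: {event_coverage(story, events):.2f}")  # 0.67
```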

[NLP-3] OLMoE: Open Mixture-of-Experts Language Models
[NLP-3] OLMoE:开放式专家混合语言模型

链接: https://arxiv.org/abs/2409.02060
作者: Niklas Muennighoff,Luca Soldaini,Dirk Groeneveld,Kyle Lo,Jacob Morrison,Sewon Min,Weijia Shi,Pete Walsh,Oyvind Tafjord,Nathan Lambert,Yuling Gu,Shane Arora,Akshita Bhagia,Dustin Schwenk,David Wadden,Alexander Wettig,Binyuan Hui,Tim Dettmers,Douwe Kiela,Ali Farhadi,Noah A. Smith,Pang Wei Koh,Amanpreet Singh,Hannaneh Hajishirzi
关键词-EN: language model leveraging, model leveraging sparse, introduce OLMoE, fully open, leveraging sparse
关键词-ZH: 语言模型利用,模型利用稀疏,引入OLMoE,完全开放,利用稀疏
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 61 pages (24 main), 36 figures, 14 tables

Abstract:We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
摘要:我们引入OLMoE,这是一种完全开放的、最先进的语言模型,利用稀疏的专家混合(MoE)。OLMoE-1B-7 B具有70亿(B)个参数,但每个输入令牌仅使用1B。我们在5万亿个代币上预训练它,并进一步调整它以创建OLMoE-1B-7 B-Direct。我们的型号优于所有具有类似活动参数的可用型号,甚至超过Llama 2 - 13 B-Chat和DeepSeekMoE-16 B等较大型号。我们展示了有关MoE培训的各种实验,分析了我们模型中的路由,显示出高度专业化,并开源了我们工作的各个方面:模型权重、训练数据、代码和日志。
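
The efficiency claim above (7B total parameters, roughly 1B active per token) comes from sparse top-k expert routing. The toy layer below shows that generic routing mechanism in PyTorch; the dimensions, expert count, and k are illustrative and are not OLMoE's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy Mixture-of-Experts FFN with top-k token routing (sizes are illustrative)."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                         # x: (tokens, d_model)
        logits = self.router(x)                   # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):            # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(SparseMoE()(tokens).shape)                  # torch.Size([10, 64])
```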

[NLP-4] Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model
[NLP-4] 利用基于LID的协作混合专家模型增强代码转换语音识别

链接: https://arxiv.org/abs/2409.02050
作者: Hukai Huang,Jiayan Lin,Kaidi Wang,Yishuang Li,Wenhao Guan,Qingyang Hong,Lin Li
关键词-EN: code-switching speech recognition, modeling phonetic similarities, speech recognition presents, code-switching speech, formidable challenge
关键词-ZH: 代码转换语音识别,语音相似性建模,语音识别呈现,代码转换语音,艰巨的挑战
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE SLT 2024

Abstract:Due to the inherent difficulty in modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes a Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. Initially, a preceding routing network explicitly learns Language Identification (LID) tasks and selects experts based on acquired LID weights. This process ensures robust routing information to the MoE layer, mitigating interference from diverse language domains on expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of our approach, achieving significant performance enhancements compared to alternative methods. Importantly, our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.
摘要:由于对不同语言之间的语音相似性进行建模存在固有的困难,代码转换语音识别是一个巨大的挑战。这项研究提出了一种协作-MOE,一种利用专家组之间的协作机制的专家混合(MOE)模型。最初,先前的路由网络明确地学习语言识别(LID)任务,并基于所获取的LID权重来选择专家。这一过程确保了到MOE层的可靠的路由信息,减少了来自不同语言领域对专家网络参数更新的干扰。LID权重还用于促进组间协作,从而能够集成特定于语言的表示。此外,在每个语言专家组内,门控网络在无人监督的情况下运作,以促进在语言以外的属性上的合作。广泛的实验证明了我们的方法的有效性,与其他方法相比,实现了显著的性能提升。重要的是,我们的方法保留了MOE模型的高效推理能力,而不需要额外的预训练。

[NLP-5] BEAVER: An Enterprise Benchmark for Text-to-SQL
[NLP-5] BEAVER:文本到SQL的企业基准

链接: https://arxiv.org/abs/2409.02038
作者: Peter Baile Chen,Fabian Wenz,Yi Zhang,Moe Kayali,Nesime Tatbul,Michael Cafarella,Çağatay Demiralp,Michael Stonebraker
关键词-EN: SQL statement pairs, constructed using publicly, human-generated tests, Existing, data
关键词-ZH: SQL声明对,使用公开的、人类生成的测试、现有的、数据构建
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

Abstract:Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the “dark web”, (2) schemas of enterprise tables are more complex than the schemas in public data, which leads the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.
摘要:现有的文本到SQL基准测试在很大程度上是使用Web上公开可用的表格构建的,其中包含问题和SQL语句对的人工生成测试。它们通常显示非常好的结果,并使人们认为LLM在文本到SQL任务中是有效的。在本文中,我们将现成的LLMS应用于包含企业数据仓库数据的基准测试。在这种环境下,即使使用了标准的快速工程和RAG技术,LLM的性能也很差。正如我们将展示的那样,性能差的原因主要是由于三个特征:(1)公共LLM不能针对企业数据仓库进行培训,因为它们很大程度上处于“黑暗网络”中;(2)企业表的模式比公共数据中的模式更复杂,这导致SQL生成任务天生更难;(3)面向业务的问题通常更复杂,需要跨多个表和聚合进行连接。因此,我们提出了一种新的数据集Beaver,它来源于真实的企业数据仓库,并结合了我们从实际用户历史中收集的自然语言查询及其正确的SQL语句。我们使用最近的LLM评估了这个数据集,并展示了它们在这项任务中的糟糕表现。我们希望这个数据集将有助于未来的研究人员构建更复杂的文本到SQL系统,以便更好地处理这类重要的数据。
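
Benchmarks like this are typically scored by execution accuracy: run the predicted and gold SQL against the database and compare result sets. The sqlite3 sketch below shows that metric on a toy table; BEAVER's actual databases and evaluation protocol may differ.

```python
import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Execution accuracy: run both queries and compare result sets order-insensitively.
    A query that fails to execute counts as wrong."""
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    return sorted(pred) == sorted(gold)

# Build a tiny on-disk demo table standing in for a warehouse schema.
conn = sqlite3.connect("demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders(region TEXT, amount REAL)")
conn.execute("DELETE FROM orders")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("EU", 10.0), ("EU", 5.0), ("US", 7.0)])
conn.commit()
conn.close()

print(execution_match(
    "demo.db",
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "SELECT region, TOTAL(amount) FROM orders GROUP BY region"))  # True
```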

[NLP-6] Foundations of Large Language Model Compression – Part 1: Weight Quantization
[NLP-6] 大型语言模型压缩的基础–第1部分:权重量化

链接: https://arxiv.org/abs/2409.02026
作者: Sean I. Young
关键词-EN: reduce computational costs, large language models, language model deployment, large language, recent years
关键词-ZH: 降低计算成本,大型语言模型,语言模型部署,大型语言,近年来
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint

Abstract:In recent years, compression of large language models (LLMs) has emerged as an important problem to allow language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of large-scale AI infrastructure. In this paper, we present the foundations of LLM quantization from a convex optimization perspective and propose a quantization method that builds on these foundations and outperforms previous methods. Our quantization framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size, post-training. A reference implementation of CVXQ can be obtained from this https URL.
摘要:近年来,大型语言模型(LLM)的压缩已成为允许在资源受限的设备上部署语言模型、降低计算成本并减轻大规模人工智能基础设施的环境足迹的一个重要问题。本文从凸优化的角度介绍了LLM量化的基础,并提出了一种基于这些基础且优于以前方法的量化方法。我们的量化框架CVXQ可扩展到包含数千亿个权重参数的模型,并为用户提供了在训练后将模型压缩到任何指定模型大小的灵活性。CVXQ的参考实现可以从此https URL中获取。
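
For intuition about what weight quantization does, the sketch below implements a plain per-channel round-to-nearest baseline in NumPy. It is not CVXQ; the paper's convex-optimization formulation and bit allocation are exactly what distinguish the proposed method from this kind of baseline.

```python
import numpy as np

def quantize_per_channel(W: np.ndarray, bits: int = 4):
    """Symmetric per-output-channel round-to-nearest quantization.
    Returns integer codes plus the per-channel scales needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed codes
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # guard against all-zero rows
    codes = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(256, 512)).astype(np.float32)
codes, scale = quantize_per_channel(W, bits=4)
print("mean |error|:", float(np.abs(W - dequantize(codes, scale)).mean()))
```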

[NLP-7] FuzzCoder: Byte-level Fuzzing Test via Large Language Model
[NLP-7] FuzzCoder:通过大型语言模型进行字节级模糊测试

链接: https://arxiv.org/abs/2409.01944
作者: Liqun Yang,Jian Yang,Chaoren Wei,Guanglin Niu,Ge Zhang,Yunli Wang,Linzheng ChaI,Wanxu Xia,Hongcheng Guo,Shun Zhang,Jiaheng Liu,Yuwei Yin,Junran Peng,Jiaxin Ma,Liang Sun,Zhoujun Li
关键词-EN: analysis technique designed, important dynamic program, dynamic program analysis, program analysis technique, complex software
关键词-ZH: 设计分析技术,重要动态程序,动态程序分析,程序分析技术,复杂软件
类目: Computation and Language (cs.CL)
备注: 11 pages

Abstract:Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage the code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as the sequence-to-sequence modeling, where LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies locations in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gain significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
摘要:模糊是一种重要的动态程序分析技术,旨在发现复杂软件中的漏洞。Fuzing涉及向目标程序呈现精心编制的恶意输入,以导致崩溃、缓冲区溢出、内存错误和异常。以有效的方式创建恶意输入是一个困难的开放问题,最好的方法通常会对预先存在的有效输入应用统一的随机突变。在这项工作中,我们建议采用微调的大型语言模型(FuzzCoder)来从成功的攻击中学习输入文件中的模式,以指导未来的模糊探索。具体地说,我们开发了一个框架来利用代码LLMS来指导模糊化中输入的突变过程。突变过程被描述为序列到序列的建模,其中LLM接收字节序列,然后输出突变的字节序列。FuzzCoder在创建的指令数据集(Fuzz-Indict)上进行了微调,其中成功的模糊历史是从启发式模糊工具收集的。FuzzCoder可以预测输入文件中的突变位置和策略位置,从而触发程序的异常行为。实验结果表明,对于ELF、JPG、MP3和XML等多种输入格式,基于AFL(American FuzzLop)的FuzzCoder在有效变异比例(EPM)和崩溃次数(NC)方面都有明显的改善。
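
The framing is byte-level mutation in which a model proposes where and how to mutate a seed input. In the sketch below the model call is replaced by a random stand-in (`suggest_mutation`); in FuzzCoder that decision comes from the fine-tuned sequence-to-sequence LLM.

```python
import random

def suggest_mutation(data: bytes):
    """Stand-in for the model: FuzzCoder predicts the mutation position and
    strategy with a fine-tuned sequence-to-sequence LLM; here both are random."""
    return random.randrange(len(data)), random.choice(["flip", "replace", "insert"])

def mutate(data: bytes) -> bytes:
    pos, strategy = suggest_mutation(data)
    buf = bytearray(data)
    if strategy == "flip":
        buf[pos] ^= 1 << random.randrange(8)        # flip a single bit
    elif strategy == "replace":
        buf[pos] = random.randrange(256)            # overwrite one byte
    else:
        buf.insert(pos, random.randrange(256))      # insert a new byte
    return bytes(buf)

seed = b"\x89PNG\r\n\x1a\n"                         # example seed: a PNG file header
for _ in range(3):
    seed = mutate(seed)
print(seed.hex())
```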

[NLP-8] Towards Leveraging Large Language Models for Automated Medical QA Evaluation
[NLP-8] 利用大型语言模型进行自动化医疗QA评估

链接: https://arxiv.org/abs/2409.01941
作者: Jack Krolik,Herprit Mahal,Feroz Ahmad,Gaurav Trivedi,Bahador Saket
关键词-EN: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
关键词-ZH: 大型语言模型、自然语言处理、语言模型、语言处理、大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures, 3 tables

Abstract:This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.
摘要:本文探讨了使用大型语言模型(LLM)自动评估医学问答(Q\A)系统(自然语言处理的重要形式)中回答的潜力。传统上,人为评估对于评估这些反应的质量至关重要。然而,医疗专业人员的手动评估既耗时又昂贵。我们的研究考察了LLM是否可以通过使用从患者数据中得出的问题可靠地复制人类评估,从而为医学专家节省宝贵的时间。虽然研究结果显示出有希望的结果,但还需要进一步的研究来解决超出初步调查范围的更具体或复杂的问题。

[NLP-9] 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands
[NLP-9] 3D-LEX v1.0:美国手语和荷兰手语的3D词典

链接: https://arxiv.org/abs/2409.01901
作者: Oline Ranum,Gomer Otterspeer,Jari I. Andersen,Robert G. Belleman,Floris Roelofsen
关键词-EN: American Sign Language, sign language, capturing sign language, sign, language
关键词-ZH: 美国手语,手语,捕捉手语,手语,语言
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.
摘要:本文提出了一种有效的三维手语捕获方法,介绍了3D-Lex v1.0数据集,并详细介绍了一种半自动标注语音属性的方法。我们的程序集成了三种运动捕捉技术,包括高分辨率3D姿势、3D手形和深度感知面部特征,并获得了平均每10秒一个手势的采样率。这包括展示标志示例、执行和记录标志以及将捕获存档的时间。3D-Lex数据集包括1000个来自美国手语的手势和另外1000个来自荷兰手语的手势。我们通过提供一种直接从3D-lex生成手形批注的简单方法来展示DataSet实用程序。我们从美国手语中为1000个手势制作手形标签,并在手势识别任务中对这些标签进行评估。与不使用手形注释相比,标签将光泽识别准确率提高了5%,比使用专家注释提高了1%。我们的运动捕捉数据支持对符号特征的深入分析,并便于从任何角度生成2D投影。3D-Lex集合与现有的手语基准和语言资源保持一致,以支持3D感知手语处理的研究。

[NLP-10] What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
[NLP-10] 制作有效的长上下文多跳指令数据集的基本因素是什么?见解和最佳实践

链接: https://arxiv.org/abs/2409.01893
作者: Zhi Chen,Qiguang Chen,Libo Qin,Qipeng Guo,Haijun Lv,Yicheng Zou,Wanxiang Che,Hang Yan,Kai Chen,Dahua Lin
关键词-EN: complex planning scenarios, Recent advancements, extended context windows, information extraction, planning scenarios
关键词-ZH: 复杂的规划场景、最新进展、扩展上下文窗口、信息提取、规划场景
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress

Abstract:Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: this https URL.
摘要:具有扩展上下文窗口的大型语言模型(LLM)的最新进展显著改进了诸如信息提取、问题回答和复杂规划场景等任务。为了在长语境任务中取得成功,人们做了大量的工作来通过合成数据来增强模型的长语境能力。现有方法通常利用自指令框架来生成指令调整数据,以便更好地改进长上下文能力。然而,我们的初步实验表明,只有不到35%的生成样本是多跳的,超过40%的样本表现出较差的质量,限制了对该问题的全面理解和进一步研究。为了提高合成数据的质量,我们提出了多智能体交互多跳生成(MIMG)框架,其中包括质量验证代理、单跳问题生成代理、多问题抽样策略和多跳问题合并代理。该框架提高了数据质量,高质量、多跳、多样的数据比例超过85%。此外,我们通过跨各种模型的广泛实验,系统地研究了文档选择、问题合并和验证技术的策略。我们的发现表明,我们合成的高质量长上下文教学数据显著提高了模型的性能,甚至超过了基于更大数量的人类注释数据训练的模型。我们的代码可从以下网址获得:This https URL。

[NLP-11] Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis
[NLP-11] 研究古代互文分析的循环专家LLM话语模式

链接: https://arxiv.org/abs/2409.01882
作者: Ray Umphrey,Jesse Roberts,Lindsey Roberts
关键词-EN: Koine Greek texts, Koine Greek, large language models, examining intertextual relationships, Greek texts
关键词-ZH: Koine希腊文本,Koine希腊语,大型语言模型,检查互文关系,希腊文本
类目: Computation and Language (cs.CL)
备注:

Abstract:This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, Koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM’s ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependences, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.
摘要:本研究探讨了大型语言模型(LLM)识别和检查圣经、Koine希腊文本中互文关系的潜力。通过评估LLM在各种互文场景下的性能,该研究表明这些模型可以检测文本之间的直接引用、影射和呼应。法学硕士产生新颖的互文观察和联系的能力凸显了其发现新见解的潜力。然而,该模型也难以应对冗长的查询段落和包含错误的互文依赖,这强调了专家评估的重要性。所提出的专家在循环方法论为圣经文集内外的复杂互文网络的互文研究提供了一种可扩展的方法。

[NLP-12] The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?
[NLP-12] 大型语言模型在音乐学中的作用:我们准备好信任机器了吗?

链接: https://arxiv.org/abs/2409.01864
作者: Pedro Ramoneda,Emilia Parada-Cabaleiro,Benno Weck,Xavier Serra
关键词-EN: Large Language Models, Large Language, reliability of Large, Language Models, Large
关键词-ZH: 大型语言模型,大型语言,大型的可靠性,语言模型,大型
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Audio and Speech Processing (eess.AS)
备注:

Abstract:In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval augmented generation from music dictionaries. This paper suggests that the potential of LLMs in musicology requires musicology driven research that can specialized LLMs by including accurate and reliable domain knowledge.
摘要:在这项工作中,我们探索了大型语言模型(LLM)在音乐学中的使用和可靠性。通过与专家和学生的讨论,我们评估了目前对这项如今无处不在的技术的接受程度和担忧。我们的目标是更进一步,提出一种半自动方法,使用检索增强生成模型和经人类专家验证的多项选择问题生成来创建初始基准。我们对400个经过人类验证的问题的评估表明,当前的vanilla LLM不如音乐词典中的检索增强生成可靠。本文表明,法学硕士在音乐学中的潜力需要音乐学驱动的研究,通过包含准确可靠的领域知识来专业化法学硕士。

[NLP-13] AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction CIKM2024
[NLP-13] AgentRE:一个基于Agent的框架,用于在关系提取中导航复杂信息景观

链接: https://arxiv.org/abs/2409.01854
作者: Yuchen Shi,Guochao Jiang,Tian Qiu,Deqing Yang
关键词-EN: diverse relation types, complex scenarios faces, scenarios faces challenges, relation extraction, language models
关键词-ZH: 多样化的关系类型,面临复杂的场景,面临挑战的场景,关系提取,语言模型
类目: Computation and Language (cs.CL)
备注: Accepted by CIKM 2024

Abstract:The relation extraction (RE) in complex scenarios faces challenges such as diverse relation types and ambiguous relations between entities within a single sentence, leading to the poor performance of pure “text-in, text-out” language models (LMs). To address these challenges, in this paper, we propose an agent-based RE framework, namely AgentRE, which fully leverages the potential of large language models (LLMs) including memory, retrieval and reflection, to achieve RE in complex scenarios. Specifically, three major modules are built in AgentRE serving as the tools to help the agent acquire and process various useful information, thereby obtaining improved RE performance. Our extensive experimental results upon two datasets in English and Chinese demonstrate our AgentRE’s superior performance, especially in low-resource scenarios. Additionally, the trajectories generated by AgentRE can be refined to construct a high-quality training dataset incorporating different reasoning methods, which can be used to fine-tune smaller models. Code is available at this https URL.
摘要:复杂场景下的关系抽取面临着关系类型多样、单句内实体间关系不明确等挑战,导致纯文本输入、文本输出语言模型的性能较差。为了应对这些挑战,本文提出了一种基于代理的逆向工程框架AgentRE,该框架充分利用大型语言模型的潜力,包括记忆、检索和反射,以实现复杂场景下的逆向工程。具体地说,在AgentRE中构建了三个主要模块,作为帮助代理获取和处理各种有用信息的工具,从而提高了RE的性能。我们在两个英文和中文数据集上的大量实验结果表明,我们的AgentRE具有优越的性能,特别是在低资源场景下。此外,AgentRE生成的轨迹可以被细化,以构建包含不同推理方法的高质量训练数据集,该数据集可用于微调较小的模型。代码可在此HTTPS URL上找到。

[NLP-14] Towards Generative Class Prompt Learning for Few-shot Visual Recognition BMVC2024
[NLP-14] 面向少样本视觉识别的生成式类别提示学习

链接: https://arxiv.org/abs/2409.01835
作者: Soumitri Chattopadhyay,Sanket Biswas,Emanuele Vivoli,Josep Lladós
关键词-EN: semantic discrimination tasks, discrimination tasks, Class Prompt Learning, foundational vision-language models, struggle to perform
关键词-ZH: 语义辨别任务、辨别任务、课堂提示学习、基础视觉语言模型、难以执行
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at BMVC 2024

Abstract:Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well on a different domain without fine-tuning. We attribute these to the limitations of the VLM’s semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperform existing methods, offering a better alternative to few shot image recognition challenges. The source code will be made available at: this https URL.
摘要:尽管基本视觉-语言模型已被证明在各种语义辨别任务中非常成功,但它们仍然难以忠实地执行细粒度分类任务。此外,在没有微调的情况下,在一个领域上训练的基础模型不能在不同的领域上很好地推广。我们将这些归因于VLM的语义表示的局限性,并尝试使用生成式建模来提高其细粒度视觉感知。具体地说,我们提出了两种新的方法:生成性课堂快速学习(GCPL)和对比性多课堂快速学习(COMPE)。GCPL利用文本到图像的扩散模型,通过对带有可学习的课堂提示的少量样本的条件化,显著提高了课堂嵌入中的视觉语言协同效应。Comple在此基础上引入了对比学习组件,该组件鼓励在生成性优化过程中进行类间分离。我们的实验结果表明,这种产生式课堂快速学习方法的性能大大优于现有方法,为少数镜头图像识别提供了一种更好的选择。源代码将在以下地址提供:This HTTPS URL。

[NLP-15] Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations ALT
[NLP-15] 您可以信任的对话:人类和人工智能对生成对话的看法

链接: https://arxiv.org/abs/2409.01808
作者: Ike Ebubechukwu,Johane Takeuchi,Antonello Ceravola,Frank Joublin
关键词-EN: chatbots increasingly integrate, accurate evaluation methods, Goal Contribution, Incorrect Fact, everyday interactions
关键词-ZH: 聊天机器人越来越多地集成准确的评估方法、目标贡献、错误事实、日常互动
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 15 figures, shorter version submitted to 22nd Annual Workshop of the Australasian Language Technology Association (ALTA’24)

Abstract:As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.
摘要:随着对话系统和聊天机器人越来越多地融入日常互动,对高效和准确的评估方法的需求变得至关重要。这项研究探索了人类和人工智能评估在一系列对话场景中的比较表现,重点放在七个关键绩效指标(KPI)上:一致性、创新、具体性、目标贡献、常识矛盾、不正确的事实和冗余。利用GPT-4o API,我们生成了一个不同的对话数据集,并进行了两部分的实验分析。在实验1中,我们评估了关于连贯性、创新性、具体性和目标贡献的多方对话,发现GPT模型与人类的判断密切一致。值得注意的是,人类和人工智能评估者都表现出一种二元判断的倾向,而不是线性定标,这突显了这些评估中的共同挑战。实验2扩展了Finch等人的工作。(2023)通过关注二元对话和评估常识矛盾、不正确事实和冗余。结果表明,尽管GPT-4o在保持事实准确性和常识性推理方面表现出了很强的性能,但它仍然在减少冗余和自相矛盾方面做着努力。我们的发现强调了GPT模型在对话系统中密切复制人类评价的潜力,同时也指出了需要改进的领域。这项研究为推进更精细的对话评估方法的开发和实施提供了有价值的见解,有助于发展更有效和更人性化的人工智能交流工具。

[NLP-16] LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning
[NLP-16] LASP:调查大型语言模型辅助人工智能规划的最新水平

链接: https://arxiv.org/abs/2409.01806
作者: Haoming Li,Zhaoliang Chen,Jonathan Zhang,Fei Liu
关键词-EN: developing corporate strategies, routing autonomous vehicles, corporate strategies, organizing a vacation, vacation to routing
关键词-ZH: 制定企业战略,路线自动驾驶车辆,企业战略,组织假期,假期到路线
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.
摘要:有效的规划对于任何任务的成功都至关重要,从组织度假到安排自动驾驶车辆和制定企业战略。它涉及设定目标、制定计划以及分配资源来实现这些目标。LLM因其强大的常识推理能力而特别适合自动化规划。他们可以从给定状态推断出实现目标所需的一系列行动,并确定有效的行动方案。然而,人们经常观察到,通过直接提示生成的计划经常在执行时失败。我们的调查旨在强调使用语言模型进行规划方面的现有挑战,重点关注具体环境、最佳调度、竞争和合作游戏、任务分解、推理和规划等关键领域。通过这项研究,我们探索LLM如何改变人工智能规划,并为LM辅助规划的未来提供独特的见解。

[NLP-17] Training on the Benchmark Is Not All You Need
[NLP-17] 在基准上训练并不是你需要的全部

链接: https://arxiv.org/abs/2409.01790
作者: Shiwen Ni,Xiangtao Kong,Chengming Li,Xiping Hu,Ruifeng Xu,Jia Zhu,Min Yang
关键词-EN: Large Language Models, pre-training data learned, data, data leakage, model pre-training data
关键词-ZH: 大型语言模型、学习的预训练数据、数据、数据泄露、模型预训练数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model’s log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
摘要:大型语言模型的成功在很大程度上依赖于在预训练阶段学习的大量预训练数据。训练前过程和训练数据的不透明导致许多基准测试的结果变得不可靠。如果任何模型在基准测试集上接受过培训,它可能会严重阻碍该领域的健康发展。为了自动化和高效地测试大型语言模型的能力,许多主流基准采用多项选择格式。由于多项选择题的内容互换并不影响问题本身的意义,基于这一性质,我们提出了一种简单有效的数据泄漏检测方法。具体地说,我们将数据中的选项内容置乱以生成相应的派生数据集,然后基于模型在派生数据集上的对数概率分布来检测数据泄漏。如果日志概率集中存在最大值和异常值,则表示数据已泄露。我们的方法能够在不访问模型训练数据或权重的情况下工作在黑盒条件下,有效地识别模型预训练数据中基准测试集的数据泄漏,包括正常场景和复杂场景,其中选项可能被有意或无意地洗牌。通过基于两个LLMS和Benchmark设计的实验,验证了该方法的有效性。此外,我们在四个基准数据集上对31个主流开源LLM的数据泄漏程度进行了评估,并对每个基准的LLM进行了泄漏排名,发现Qwen系列LLM的数据泄漏程度最高。
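
The detection rule described above is concrete enough to sketch: score every reordering of the option contents with the model's log-probability and flag the item when the best-scoring ordering is an extreme outlier. `sequence_logprob` is a placeholder for the actual model scoring, and the z-score threshold is an illustrative choice, not the paper's.

```python
import itertools
import statistics

def sequence_logprob(question: str, options: list[str]) -> float:
    """Placeholder: in practice, the sum of token log-probabilities the model
    assigns to the item rendered with the options in this order."""
    raise NotImplementedError

def looks_leaked(question, options, score=sequence_logprob, z_threshold=2.0) -> bool:
    """Score every ordering of the option contents; if the best-scoring ordering
    is a strong outlier, the item was plausibly seen during pre-training."""
    scores = [score(question, list(perm)) for perm in itertools.permutations(options)]
    mean, stdev = statistics.mean(scores), statistics.pstdev(scores)
    return stdev > 0 and (max(scores) - mean) / stdev > z_threshold

# Toy scorer that strongly prefers one specific ordering, mimicking memorisation.
def fake_score(question, options):
    return -5.0 if options == ["Paris", "Rome", "Berlin", "Madrid"] else -30.0

print(looks_leaked("Capital of France?", ["Paris", "Rome", "Berlin", "Madrid"],
                   score=fake_score))  # True
```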

[NLP-18] LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection
[NLP-18] LLM-GAN:通过大型语言模型构建生成对抗网络,用于可解释的假新闻检测

链接: https://arxiv.org/abs/2409.01787
作者: Yifeng Wang,Zhouhong Gu,Siwei Zhang,Suhang Zheng,Tao Wang,Tianyu Li,Hongwei Feng,Yanghua Xiao
关键词-EN: Large Language Models, predicts the authenticity, items with annotated, Explainable fake, Large Language
关键词-ZH: 大型语言模型,预测真实性,带注释的项目,可解释的假货,大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, presenting LLMs for explainable fake news detection remains two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leaving them unable to understand the complex news-faking process. Secondly, utilizing LLMs for this task would generate both correct and incorrect explanations, which necessitates abundant labor in the loop. In this paper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms to enable an LLM to become Generator and Detector and for realistic fake news generation and detection. Our results demonstrate LLM-GAN’s effectiveness in both prediction performance and explanation quality. We further showcase the integration of LLM-GAN to a cloud-native AI platform to provide better fake news detection service in the cloud.
摘要:可解释假新闻检测是用带注释的解释来预测新闻的真实性。今天,大型语言模型以其强大的自然语言理解和解释生成能力而闻名。然而,提出用于可解释的假新闻检测的LLMS仍然是两个主要挑战。首先,假新闻看起来很合理,很容易误导小岛屿发展中国家,使他们无法理解复杂的新闻造假过程。其次,使用LLMS执行此任务将生成正确和不正确的解释,这需要在循环中投入大量劳动力。在本文中,我们提出了一种新的框架LLM-GAN,它利用提示机制使LLM成为生成器和检测器,并用于现实假新闻的生成和检测。我们的结果证明了LLM-GAN在预测性能和解释质量方面的有效性。我们进一步展示了LLM-GAN与云原生AI平台的集成,以在云中提供更好的假新闻检测服务。
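
One way to read the prompting-level generator/detector loop is sketched below: the Generator rewrites an item into fake news while trying to dodge the Detector's last explanation, and the Detector labels the result and explains its verdict. The `llm` callable and the prompts are hypothetical stand-ins; the paper's actual prompting mechanisms may differ.

```python
def adversarial_round(real_example: str, llm, n_rounds: int = 3):
    """Prompting-level generator/detector loop: the Generator rewrites the item
    into fake news while avoiding the Detector's last explanation; the Detector
    labels the result and explains its verdict."""
    feedback = ""
    fake = verdict = ""
    for _ in range(n_rounds):
        fake = llm("Rewrite this news item so it becomes subtly fake. "
                   f"Avoid the giveaways mentioned here: {feedback}\n\n{real_example}")
        verdict = llm("Is the following news item real or fake? "
                      f"Give a label and a short explanation.\n\n{fake}")
        feedback = verdict            # the explanation steers the next attempt
    return fake, verdict

# Toy stand-in for a chat-completion call, just to exercise the loop.
def toy_llm(prompt: str) -> str:
    if "real or fake" in prompt:
        return "fake - the quoted statistics do not match any official source"
    return "A rewritten version of the story with subtly altered figures."

print(adversarial_round("City council approves the new budget.", toy_llm))
```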

[NLP-19] State-of-the-art Advances of Deep-learning Linguistic Steganalysis Research
[NLP-19] 深度学习语言隐写分析研究的最新进展

链接: https://arxiv.org/abs/2409.01780
作者: Yihao Wang,Ru Zhang,Yifan Tang,Jianyi Liu
关键词-EN: linguistic steganography techniques, generative linguistic steganography, conventional steganalysis falls, steganalysis falls short, steganography techniques
关键词-ZH: 语言隐写术技术,生成语言隐写术,传统隐写分析失败,隐写分析失败,隐写技术
类目: Computation and Language (cs.CL)
备注: Accepted by 2023 International Conference on Data, Information and Computing Science

Abstract:With the evolution of generative linguistic steganography techniques, conventional steganalysis falls short in robustly quantifying the alterations induced by steganography, thereby complicating detection. Consequently, the research paradigm has pivoted towards deep-learning-based linguistic steganalysis. This study offers a comprehensive review of existing contributions and evaluates prevailing developmental trajectories. Specifically, we first provided a formalized exposition of the general formulas for linguistic steganalysis, while comparing the differences between this field and the domain of text classification. Subsequently, we classified the existing work into two levels based on vector space mapping and feature extraction models, thereby comparing the research motivations, model advantages, and other details. A comparative analysis of the experiments is conducted to assess the performances. Finally, the challenges faced by this field are discussed, and several directions for future development and key issues that urgently need to be addressed are proposed.
摘要:随着生成性语言隐写技术的发展,传统的隐写分析不能很好地量化隐写引起的变化,从而使检测变得复杂。因此,研究范式转向了基于深度学习的语言隐写分析。这项研究全面审查了现有的贡献,并评估了当前的发展轨迹。具体地说,我们首先形式化地阐述了语言隐写分析的一般公式,同时比较了该领域与文本分类领域的区别。随后,基于向量空间映射和特征提取模型将已有的研究工作分为两个层次,从而比较了研究动机、模型优势等细节。对实验进行了对比分析,以评估其性能。最后,讨论了该领域面临的挑战,并提出了未来发展的几个方向和迫切需要解决的关键问题。

[NLP-20] FC-KAN: Function Combinations in Kolmogorov-Arnold Networks
[NLP-20] FC-KAN:Kolmogorov-Arnold网络中的函数组合

链接: https://arxiv.org/abs/2409.01763
作者: Hoang-Thang Ta,Duy-Quy Thai,Abu Bakar Siddiqur Rahman,Grigori Sidorov,Alexander Gelbukh
关键词-EN: popular mathematical functions, radial basis functions, popular mathematical, radial basis, low-dimensional data
关键词-ZH: 流行的数学函数、辐射基函数、流行的数学、辐射基、低维数据
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 1 figure

Abstract:In this paper, we introduce FC-KAN, a Kolmogorov-Arnold Network (KAN) that leverages combinations of popular mathematical functions such as B-splines, wavelets, and radial basis functions on low-dimensional data through element-wise operations. We explore several methods for combining the outputs of these functions, including sum, element-wise product, the addition of sum and element-wise product, quadratic function representation, and concatenation. In our experiments, we compare FC-KAN with multi-layer perceptron network (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models on the average of 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: this https URL.
摘要:在本文中,我们介绍了一种Kolmogorov-Arnold网络(KANN),它通过元素级运算,在低维数据上利用了常用的数学函数的组合,如B-Spline、小波和径向基函数。我们探索了几种组合这些函数的输出的方法,包括求和、逐元素乘积、求和与逐元素乘积的相加、二次函数表示和连接。在我们的实验中,我们在MNIST和Fashion-MNIST数据集上比较了FC-KAN和多层感知器网络(MLP)以及其他现有的KAN,如BSRBF-KAN、EfficientKAN、FastKAN和FasterKan。FC-KAN的一个变种使用了B-Spline的输出和以二次函数形式存在的高斯差(DOG)的组合,在平均5次独立训练中优于所有其他模型。我们期望FC-KAN能够利用功能组合来设计未来的KAN。我们的存储库可通过以下网址公开获取:This https URL。
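
The element-wise combination schemes listed in the abstract (sum, product, sum plus product, quadratic form, concatenation) can be sketched over two simple basis-function families. Gaussian RBFs stand in for B-splines here for brevity, and the exact quadratic parameterisation is illustrative rather than the paper's.

```python
import numpy as np

def gaussian_rbf(x, centers, width=1.0):
    """Radial-basis features: one Gaussian bump per center (stands in for B-splines)."""
    return np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)

def difference_of_gaussians(x, centers, w1=0.5, w2=1.5):
    """DoG features, one of the function families combined in FC-KAN."""
    return gaussian_rbf(x, centers, w1) - gaussian_rbf(x, centers, w2)

def combine(f, g, mode="quadratic"):
    """Element-wise combinations named in the abstract; parameterisation is illustrative."""
    if mode == "sum":
        return f + g
    if mode == "product":
        return f * g
    if mode == "sum_product":
        return f + g + f * g
    if mode == "quadratic":
        return f ** 2 + g ** 2 + f * g
    if mode == "concat":
        return np.concatenate([f, g], axis=-1)
    raise ValueError(f"unknown mode: {mode}")

x = np.linspace(-2, 2, 5)
centers = np.linspace(-2, 2, 4)
f, g = gaussian_rbf(x, centers), difference_of_gaussians(x, centers)
print(combine(f, g, "quadratic").shape)  # (5, 4)
print(combine(f, g, "concat").shape)     # (5, 8)
```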

[NLP-21] Empirical evidence of Large Language Models influence on human spoken communication
[NLP-21] 大型语言模型对人类口语交流影响的经验证据

链接: https://arxiv.org/abs/2409.01754
作者: Hiromu Yakura,Ezequiel Lopez-Lopez,Levin Brinkmann,Ignacio Serna,Prateek Gupta,Iyad Rahwan
关键词-EN: Large Language Models, Artificial Intelligence, advances in Large, Language Models, Large Language
关键词-ZH: 大型语言模型、人工智能、大型进步、语言模型、大型语言
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

Abstract:Artificial Intelligence (AI) agents now interact with billions of humans in natural language, thanks to advances in Large Language Models (LLMs) like ChatGPT. This raises the question of whether AI has the potential to shape a fundamental aspect of human culture: the way we speak. Recent analyses revealed that scientific publications already exhibit evidence of AI-specific language. But this evidence is inconclusive, since scientists may simply be using AI to copy-edit their writing. To explore whether AI has influenced human spoken communication, we transcribed and analyzed about 280,000 English-language videos of presentations, talks, and speeches from more than 20,000 YouTube channels of academic institutions. We find a significant shift in the trend of word usage specific to words distinctively associated with ChatGPT following its release. These findings provide the first empirical evidence that humans increasingly imitate LLMs in their spoken language. Our results raise societal and policy-relevant concerns about the potential of AI to unintentionally reduce linguistic diversity, or to be deliberately misused for mass manipulation. They also highlight the need for further investigation into the feedback loops between machine behavior and human culture.
摘要:由于ChatGPT等大型语言模型的进步,人工智能(AI)代理现在可以用自然语言与数十亿人交互。这引发了一个问题:人工智能是否有潜力塑造人类文化的一个基本方面:我们的说话方式。最近的分析表明,科学出版物已经展示了人工智能特有语言的证据。但这一证据并不确凿,因为科学家可能只是在使用人工智能来复制编辑他们的作品。为了探索人工智能是否影响了人类的口语交流,我们转录并分析了来自20,000多个学术机构YouTube频道的约28万个演讲、演讲和演讲的英文视频。我们发现,在ChatGPT发布后,与ChatGPT明显相关的单词的使用趋势发生了显著变化。这些发现提供了第一个实证证据,证明人类在口语中越来越多地模仿LLM。我们的结果引发了社会和政策相关的担忧,即人工智能可能会无意中减少语言多样性,或者被故意滥用于大规模操纵。他们还强调了进一步研究机器行为和人类文化之间反馈循环的必要性。

[NLP-22] Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits ECCV2024
[NLP-22] 用于对博物馆展品进行细粒度和结构化视觉理解的剪辑

链接: https://arxiv.org/abs/2409.01690
作者: Ada-Astrid Balauca,Danda Pani Paudel,Kristina Toutanova,Luc Van Gool
关键词-EN: perform nuanced tasks, natural language descriptions, nuanced tasks, widely used tool, natural language
关键词-ZH: 执行细致入微的任务、自然语言描述、细致入微的任务、广泛使用的工具、自然语言
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ECCV 2024

Abstract:CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured – in the form of tabular data – visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that allows predicting tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP’s powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP’s image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, by achieving encouraging results in a newly established benchmark. Our dataset and source-code can be found at: this https URL
摘要:CLIP是一种功能强大且被广泛使用的工具,用于在自然语言描述的上下文中理解图像以执行细微差别的任务。但是,由于其通用性,它不提供特定于应用程序的细粒度和结构化理解。在这项工作中,我们的目标是使CLIP适应细粒度和结构化的–以表格数据的形式–对博物馆展品的视觉理解。为了促进这样的理解,我们(A)收集、整理和基准一个200K以上的图像-表格对的数据集,并且(B)开发一种允许预测输入图像的表格输出的方法。我们的数据集是公共领域中的第一个此类数据集。同时,该方法在利用CLIP的强大表示能力进行细粒度和表格理解方面是新颖的。该方法(Muze)通过提出的基于变换的解析网络(ParseNet)学习将Clip的图像嵌入映射到表格结构。更具体地说,parseNet支持预测缺失的属性值,同时集成输入图像的已知属性-值对的上下文。我们表明,这导致了精度的显著提高。通过详尽的实验,我们证明了该方法在细粒度和结构化的博物馆展品理解方面的有效性,并在新建立的基准上取得了令人鼓舞的结果。我们的数据集和源代码可在以下HTTPS URL中找到

[NLP-23] In Defense of RAG in the Era of Long-Context Language Models
[NLP-23] 在长上下文语言模型时代捍卫RAG

链接: https://arxiv.org/abs/2409.01666
作者: Tan Yu,Anbang Xu,Rama Akkiraju
关键词-EN: Overcoming the limited, limited context limitations, RAG, limitations in early-generation, reliable solution
关键词-ZH: 克服有限的上下文限制、RAG、早期可靠解决方案的局限性
类目: Computation and Language (cs.CL)
备注:

Abstract:Overcoming the limited context limitations in early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past. Recently, the emergence of long-context LLMs allows the models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike the existing works favoring the long-context LLM over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits the RAG in long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises, and then declines, forming an inverted U-shaped curve. There exist sweet points where OP-RAG could achieve higher answer quality with much less tokens than long-context LLM taking the whole context as input. Extensive experiments on public benchmark demonstrate the superiority of our OP-RAG.
摘要:检索-增强生成(RAG)克服了早期LLMS的上下文限制,是一种可靠的基于上下文的答案生成方案。最近,长语境LLM的出现使模型能够包含更长的文本序列,从而降低了RAG的吸引力。最近的研究表明,长语境LLM在长语境应用中的表现明显优于RAG。与现有的倾向于长语境LLM而不是RAG的研究不同,我们认为LLMS中极长的语境会削弱对相关信息的关注,并导致答案质量的潜在下降。本文回顾了长上下文答案生成中的RAG。提出了一种保序检索增强生成机制(OP-RAG),显著提高了RAG在长上下文问答应用中的性能。对于OP-RAG,随着检索到的组块数量的增加,答案质量最初会上升,然后下降,形成倒U形曲线。与采用整个上下文作为输入的长上下文LLM相比,OP-RAG可以用更少的标记获得更高的答案质量。在PUBLIC基准上的大量实验证明了该算法的优越性。
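
The order-preserving step is simple to state: select chunks by relevance, but concatenate them in their original document order instead of by score. A minimal sketch follows, with random embeddings standing in for a real retriever.

```python
import numpy as np

def op_rag_context(query_emb, chunk_embs, chunks, k=4) -> str:
    """Pick the k most relevant chunks, then concatenate them in original
    document order (the order-preserving step) instead of by relevance score."""
    sims = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb))
    top = np.argsort(-sims)[:k]     # ranked by relevance
    keep = sorted(top)              # restored to document order
    return "\n\n".join(chunks[i] for i in keep)

rng = np.random.default_rng(1)
chunks = [f"[chunk {i}]" for i in range(10)]
chunk_embs = rng.normal(size=(10, 32))
query_emb = rng.normal(size=32)
print(op_rag_context(query_emb, chunk_embs, chunks, k=3))
```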

[NLP-24] Interpreting and Improving Large Language Models in Arithmetic Calculation ICML2024
[NLP-24] 解释和改进算术计算中的大型语言模型

链接: https://arxiv.org/abs/2409.01659
作者: Wei Zhang,Chaoqun Wan,Yonggang Zhang,Yiu-ming Cheung,Xinmei Tian,Xu Shen,Jieping Ye
关键词-EN: Large language models, tackle complex reasoning, Large language, complex reasoning tasks, demonstrated remarkable potential
关键词-ZH: 大型语言模型,解决复杂推理,大型语言,复杂推理任务,展现出显着的潜力
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2024 (oral)

Abstract:Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remain mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs frequently involve a small fraction ( 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs’ computational performance. We empirically find that such precise tuning can yield notable enhancements on mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.
摘要:大型语言模型在众多应用中显示出了巨大的潜力,并显示出处理复杂推理任务(如数学计算)的紧急能力。然而,即使对于最简单的算术计算,LLMS背后的内在机制仍然是个谜,这使得确保可靠性具有挑战性。在这项工作中,我们深入揭示了LLM执行计算的特定机制。通过综合实验,我们发现LLMS经常涉及一小部分(5%)的注意头部,这些注意头部在计算过程中对集中在操作数和操作符上起着关键作用。随后,来自这些操作数的信息通过多层感知器(MLP)进行处理,逐步导致最终解决方案。这些关键头部/MLP虽然确定在特定的数据集上,但展示了跨不同数据集甚至不同任务的可转移性。这种洞察力促使我们调查了有选择地微调这些基本头部/MLP以提高LLMS计算性能的潜在好处。我们经验性地发现,这种精确的调整可以显著提高数学能力,而不会影响非数学任务的性能。我们的工作是对LLMS固有的算术计算能力的初步探索,为揭示更复杂的数学任务奠定了坚实的基础。
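
A common way to locate a small set of pivotal attention heads like the roughly 5% reported above is ablation: mask one head at a time and rank heads by the resulting accuracy drop. The sketch below shows that generic procedure with a hypothetical `eval_fn`; it illustrates the idea rather than the paper's exact attribution method.

```python
def find_pivotal_heads(eval_fn, n_layers, n_heads, drop_threshold=0.05):
    """Rank attention heads by how much masking each one alone hurts arithmetic
    accuracy. `eval_fn(masked=[(layer, head), ...])` is a hypothetical helper that
    returns task accuracy with the listed heads zeroed out."""
    baseline = eval_fn(masked=[])
    drops = {}
    for layer in range(n_layers):
        for head in range(n_heads):
            drops[(layer, head)] = baseline - eval_fn(masked=[(layer, head)])
    ranked = sorted(drops.items(), key=lambda kv: -kv[1])
    return [head for head, drop in ranked if drop > drop_threshold]

# Toy evaluator: pretend heads (2, 3) and (5, 1) carry the arithmetic circuit.
def toy_eval(masked):
    return 0.9 - 0.4 * sum(1 for h in masked if h in {(2, 3), (5, 1)})

print(find_pivotal_heads(toy_eval, n_layers=8, n_heads=8))  # [(2, 3), (5, 1)]
```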

[NLP-25] From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning ICML2024
[NLP-25] 从唯唯诺诺的人到讲真话的人:通过精确调优解决大型语言模型中的谄媚问题

链接: https://arxiv.org/abs/2409.01658
作者: Wei Chen,Zhen Huang,Liang Xie,Binbin Lin,Houqiang Li,Le Lu,Xinmei Tian,Deng Cai,Yonggang Zhang,Wenxiao Wan,Xu Shen,Jieping Ye
关键词-EN: Large Language Models, Large Language, Language Models, providing veracious responses, prioritize adherence
关键词-ZH: 大型语言模型,大型语言,语言模型,提供准确的响应,优先考虑遵守
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2024

Abstract:Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, while it typically leads to the degeneration of LLMs’ general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (5%) of the basic modules, which significantly affect a particular behavior of LLMs. i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.
摘要:大型语言模型倾向于优先遵守用户提示,而不是提供准确的响应,这导致了奉承问题。当受到用户的质疑时,LLMS倾向于承认错误并提供不准确的回答,即使它们最初提供了正确的答案。最近的工作提出使用有监督的微调(SFT)来缓解拍马屁的问题,但这通常会导致LLMS的总体性能退化。为了解决这一挑战,我们提出了一种新的有监督的精确点调整(SPT),其中感兴趣区域模块针对给定的目标进行调整。具体地说,SPT首先揭示和验证一小部分(5%)基本模块,这些模块显著影响LLM的特定行为。也就是说,奉承。随后,SPT只是微调这些已识别的模块,而冻结其余模块。为了验证SPT的有效性,我们进行了全面的实验,表明SPT显著地缓解了LLMS的奉承问题(甚至比SFT更好)。此外,SPT对LLMS的总体性能产生的副作用有限,甚至没有。我们的结果为如何准确、有效、高效地解释和提高LLMS的目标能力提供了启示。
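
The tuning scheme itself (freeze everything except the small set of verified modules) is easy to sketch in PyTorch. How the region-of-interest modules are identified and verified is the substance of the paper and is not shown; the module names below are toy placeholders.

```python
import torch.nn as nn

def pinpoint_tune(model: nn.Module, target_substrings: list[str]) -> list[str]:
    """Freeze every parameter except those whose name matches a pinpointed module,
    and return the names left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in target_substrings)
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Toy stack of blocks; pretend analysis flagged layer 2's MLP as the sycophancy region.
model = nn.ModuleDict({
    f"layer{i}": nn.ModuleDict({"attn": nn.Linear(16, 16), "mlp": nn.Linear(16, 16)})
    for i in range(4)
})
print(pinpoint_tune(model, ["layer2.mlp"]))  # ['layer2.mlp.weight', 'layer2.mlp.bias']
```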

[NLP-26] CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding
[NLP-26] CTG-KrEW:通过具有K均值集群和高效词嵌入的条件表格GAN生成合成结构化上下文相关内容

链接: https://arxiv.org/abs/2409.01628
作者: Riya Samanta,Bidyut Saha,Soumya K. Ghosh,Sajal K. Das
关键词-EN: Generative Adversarial Networks, Tabular Generative Adversarial, Adversarial Networks, Conditional Tabular Generative, Generative Adversarial
关键词-ZH: 生成对抗网络、表格生成对抗、对抗网络、条件表格生成、生成对抗
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Conditional Tabular Generative Adversarial Networks (CTGAN) and their various derivatives are attractive for their ability to efficiently and flexibly create synthetic tabular data, showcasing strong performance and adaptability. However, there are certain critical limitations to such models. The first is their inability to preserve the semantic integrity of contextually correlated words or phrases. For instance, skillset in freelancer profiles is one such attribute where individual skills are semantically interconnected and indicative of specific domain interests or qualifications. The second challenge of traditional approaches is that, when applied to generate contextually correlated tabular content, besides generating semantically shallow content, they consume huge memory resources and CPU time during the training stage. To address these problems, we introduce a novel framework, CTGKrEW (Conditional Tabular GAN with KMeans Clustering and Word Embedding), which is adept at generating realistic synthetic tabular data where attributes are collections of semantically and contextually coherent words. CTGKrEW is trained and evaluated using a dataset from Upwork, a realworld freelancing platform. Comprehensive experiments were conducted to analyze the variability, contextual similarity, frequency distribution, and associativity of the generated data, along with testing the framework’s system feasibility. CTGKrEW also takes around 99% less CPU time and 33% less memory footprints than the conventional approach. Furthermore, we developed KrEW, a web application to facilitate the generation of realistic data containing skill-related information. This application, available at this https URL, is freely accessible to both the general public and the research community.
摘要:条件表格生成对抗网络(CTGAN)及其各种衍生工具能够高效、灵活地生成合成表格数据,表现出很强的性能和适应性,因而具有很大的吸引力。然而,这些模型有一些关键的限制。第一个问题是他们无法保持上下文相关单词或短语的语义完整性。例如,自由职业者个人资料中的技能集就是这样一个属性,其中各个技能在语义上相互关联,并指示特定领域的兴趣或资格。传统方法的第二个挑战是,当应用于生成上下文相关的表格内容时,除了生成语义较浅的内容外,它们在训练阶段消耗了大量的内存资源和CPU时间。为了解决这些问题,我们引入了一种新的框架CTGKrEW(Conditional Tabular GAN With KMeans Cluging And Word Embedding),该框架擅长生成真实的合成表格数据,其中属性是语义和上下文连贯的单词的集合。CTGKrEW是使用来自Upwork的数据集进行训练和评估的,Upwork是一个现实世界的自由职业平台。综合实验分析了生成数据的可变性、上下文相似性、频率分布和关联性,并测试了该框架的系统可行性。与传统方法相比,CTGKrEW占用的CPU时间减少了约99%,内存占用减少了33%。此外,我们开发了Krew,这是一个网络应用程序,可以帮助生成包含技能相关信息的真实数据。这个应用程序可以在这个HTTPS URL上找到,公众和研究社区都可以免费访问。
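
A sketch of the K-means idea in the pipeline: cluster the skill vocabulary in embedding space so that each profile can be encoded as compact per-cluster counts, which a conditional tabular GAN can then model. Random vectors stand in for real word embeddings, and the encoding shown is an assumption about the representation, not necessarily the paper's exact one.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
skills = ["python", "sql", "pandas", "photoshop", "illustrator", "copywriting"]
skill_vecs = {s: rng.normal(size=50) for s in skills}   # stand-ins for word embeddings

# Cluster the skill vocabulary so related skills share a cluster id.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    np.stack([skill_vecs[s] for s in skills]))
cluster_of = dict(zip(skills, kmeans.labels_))

def profile_to_row(profile_skills, n_clusters=3):
    """Encode a freelancer profile as per-cluster skill counts - a compact column
    representation that a conditional tabular GAN can then learn to generate."""
    row = np.zeros(n_clusters, dtype=int)
    for s in profile_skills:
        row[cluster_of[s]] += 1
    return row

print(profile_to_row(["python", "sql", "photoshop"]))
```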

[NLP-27] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
[NLP-27] 助推器:通过减弱有害扰动来解决大型语言模型的有害微调

链接: https://arxiv.org/abs/2409.01586
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词-EN: Large language models’, concerns for Large, Large language, Harmful fine-tuning issue, poses serious safety
关键词-ZH: 大型语言模型,对大型语言的担忧,有害的微调问题,带来严重的安全性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for large language models’ fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} over the model weights is the root cause of the alignment breakage caused by harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage’s optimization. The regularizer ensures that the model’s harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at this https URL.
摘要:有害的微调问题给大型语言模型的微调即服务带来了严重的安全隐患。虽然已经提出了现有的防御措施\Citephuang2024疫苗、罗萨斯2024代表疫苗来缓解这一问题,但它们的表现仍然差强人意,问题的根本原因尚未完全恢复。在文献中,我们首次证明了对模型权重的文本摄动应该是有害微调的对准断裂的根本原因。为了减弱有害扰动的负面影响,我们提出了一种对准阶段的解决方案,称为Booster。在技术上,除了原始的对准损失外,我们还在对准阶段的优化中加入了损失正则化。正则化确保了模型在模拟有害扰动之前/之后的有害损失减小,从而减轻了随后的微调风险。实验结果表明,Booster在保持下游任务性能的同时,能有效降低微调模型的有害分数。我们的代码位于此HTTPS URL。
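
A minimal sketch of the "attenuate the harmful loss reduction" idea, assuming a recent PyTorch with torch.func: one harmful gradient step is simulated, and a regularizer penalizes how much that step lowers a toy harmful loss. The loss functions, step size, and weighting are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                       # toy stand-in for an LLM
x_align, y_align = torch.randn(4, 8), torch.randint(0, 2, (4,))
x_harm, y_harm = torch.randn(4, 8), torch.randint(0, 2, (4,))
alignment_loss_fn = nn.CrossEntropyLoss()     # stands in for the alignment objective
harmful_loss_fn = nn.CrossEntropyLoss()       # stands in for the loss on harmful demos
alpha, lam = 0.01, 1.0                        # assumed inner step size and regularizer weight

def harmful_loss(param_dict):
    out = torch.func.functional_call(model, param_dict, (x_harm,))
    return harmful_loss_fn(out, y_harm)

params = dict(model.named_parameters())

# 1) Alignment loss on the current weights.
align = alignment_loss_fn(model(x_align), y_align)

# 2) Simulate one harmful fine-tuning step (a gradient step on the harmful loss).
h_before = harmful_loss(params)
grads = torch.autograd.grad(h_before, list(params.values()), create_graph=True)
perturbed = {name: p - alpha * g for (name, p), g in zip(params.items(), grads)}
h_after = harmful_loss(perturbed)

# 3) Regularizer: penalize how much that simulated step *reduces* the harmful loss.
booster_reg = torch.clamp(h_before - h_after, min=0.0)
loss = align + lam * booster_reg
loss.backward()
print(float(align), float(booster_reg))
```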

[NLP-28] Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
[NLP-28] 迈向大规模视觉语言模型中艺术作品的跨语言解释

链接: https://arxiv.org/abs/2409.01584
作者: Shintaro Ozaki,Kazuki Hayashi,Yusuke Sakai,Hidetaka Kamigaito,Katsuhiko Hayashi,Taro Watanabe
关键词-EN: Vision Language Models, Large-scale Vision Language, Large-scale Vision, Vision Encoder, Language Models
关键词-ZH: 视觉语言模型,大规模视觉语言,大规模视觉,视觉编码器,语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, pre-training of Vision Encoder and the integrated training of LLMs with Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can completely handle their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation have cultural differences and biases, remaining issues for use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then used to evaluate the generation explanation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
摘要:随着大规模视觉语言模型性能的提高,它们对多种语言的响应能力越来越强,人们对大规模视觉语言模型产生的解释的需求将会增长。然而,视觉编码器的预训练和LLMS与Vision编码器的集成训练主要是使用英语训练数据进行的,这使得LLMS在生成英语以外的语言解释时是否能够完全发挥其潜力是不确定的。此外,使用机器翻译创建数据集的多语言QA基准存在文化差异和偏见,仍存在用作评估任务的问题。为了应对这些挑战,这项研究创建了一个不依赖机器翻译的多语言扩展数据集。这个考虑了细微差别和特定国家短语的数据集随后被用来评估LVLMS的生成解释能力。此外,这项研究还考察了在资源丰富的英语中进行教学调整是否会提高其他语言的表现。我们的发现表明,与英语相比,LVLM在英语以外的其他语言中的表现更差。此外,据观察,低收入者难以有效地管理从英文数据中学到的知识。

[NLP-29] AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
[NLP-29] AdaComp:使用自适应预测器的提取上下文压缩用于检索增强大型语言模型

链接: https://arxiv.org/abs/2409.01579
作者: Qianchi Zhang,Hainan Zhang,Liang Pang,Hongwei Zheng,Zhiming Zheng
关键词-EN: detecting answer clues, inference process slow, slow and expensive, compression rate, context compression
关键词-ZH: 检测答案线索、推理过程缓慢、缓慢且昂贵、压缩率、上下文压缩
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, code available at https://anonymous.4open.science/r/AdaComp-8C0C/

点击查看摘要

Abstract:Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
摘要:检索到的含有噪声的文档会阻碍RAG检测答案线索,并使推理过程变得缓慢和昂贵。因此,为了提高其准确性和效率,有必要对上下文进行压缩。现有的上下文压缩方法使用提取或生成模型来保留与查询最相关的句子,或者应用信息瓶颈理论来保存足够的信息。然而,这些方法可能会面临过度压缩或计算成本高等问题。我们观察到,检索者通常将相关文档排在最前面,但由于查询复杂性和检索质量的影响,回答查询所需的确切文档数量是不确定的:像多跳问题这样的复杂查询可能需要保留比更简单的查询更多的文档,而低质量的检索可能需要依赖更多的文档来生成准确的输出。因此,确定所需的最小文档数(压缩率)仍然是RAG面临的挑战。本文介绍了一种低成本的抽取上下文压缩方法AdaComp,该方法根据查询复杂度和检索质量自适应地确定压缩率。具体地说,我们首先将RAG系统回答当前查询所需的最小top-k文档注释为压缩比,然后构造查询、检索到的文档及其压缩比的三元组。然后,我们使用这个三元组数据集来训练压缩率预测器。在三个QA数据集和一个会话多文档QA数据集上的实验表明,AdaComp在保持与未压缩模型几乎相同的性能的同时,显著降低了推理代价,实现了效率和性能之间的平衡。
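
The triplet-construction step can be sketched as follows; the answers_correctly oracle is a hypothetical stand-in for checking whether the RAG system answers the query from only the top-k documents, and the downstream compression-rate predictor is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triplet:
    query: str
    docs: List[str]          # documents, already ranked by the retriever
    compression_rate: int    # minimal top-k needed to answer correctly

def build_triplet(query: str, ranked_docs: List[str],
                  answers_correctly: Callable[[str, List[str]], bool]) -> Triplet:
    """Find the smallest top-k prefix of the ranked documents that still lets the
    RAG system answer the query correctly (hypothetical oracle)."""
    for k in range(1, len(ranked_docs) + 1):
        if answers_correctly(query, ranked_docs[:k]):
            return Triplet(query, ranked_docs, compression_rate=k)
    return Triplet(query, ranked_docs, compression_rate=len(ranked_docs))

# Toy oracle: pretend the answer clue sits in the third-ranked document.
def oracle(query: str, docs: List[str]) -> bool:
    return any("answer clue" in d for d in docs)

docs = ["noise A", "noise B", "the answer clue", "noise C"]
t = build_triplet("what is X?", docs, oracle)
print(t.compression_rate)   # -> 3; a predictor is then trained on such triplets
```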

[NLP-30] An Implementation of Werewolf Agent That does not Truly Trust LLMs
[NLP-30] 不真正信任LLM的狼人代理的实现

链接: https://arxiv.org/abs/2409.01575
作者: Takehiro Sato,Shintaro Ozaki,Daisaku Yokoyama
关键词-EN: Large Language Model, incomplete information game, situational lying, incomplete information, challenges when creating
关键词-ZH: 大语言模型、不完整信息游戏、情景撒谎、不完整信息、创建时的挑战
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Werewolf is an incomplete information game, which has several challenges when creating a computer agent as a player given the lack of understanding of the situation and individuality of utterance (e.g., computer agents are not capable of characterful utterance or situational lying). We propose a werewolf agent that solves some of those difficulties by combining a Large Language Model (LLM) and a rule-based algorithm. In particular, our agent uses a rule-based algorithm to select an output either from an LLM or a template prepared beforehand based on the results of analyzing conversation history using an LLM. It allows the agent to refute in specific situations, identify when to end the conversation, and behave with persona. This approach mitigated conversational inconsistencies and facilitated logical utterance as a result. We also conducted a qualitative evaluation, which resulted in our agent being perceived as more human-like compared to an unmodified LLM. The agent is freely available for contributing to advance the research in the field of Werewolf game.
摘要:狼人是一个不完全信息游戏,在缺乏对话语的情境和个性的理解的情况下(例如,计算机代理人不能进行有特征的话语或情境撒谎),在创建计算机代理作为玩家时面临着几个挑战。我们提出了一个狼人代理,通过结合大型语言模型(LLM)和基于规则的算法来解决其中的一些困难。特别是,我们的代理使用基于规则的算法从LLM或基于使用LLM分析对话历史的结果预先准备的模板中选择输出。它允许代理在特定情况下反驳,确定何时结束对话,并以人物角色的方式行事。这种方法减少了会话中的不一致,从而促进了逻辑表达。我们还进行了定性评估,结果是与未经修改的LLM相比,我们的代理更像人类。该代理可以免费获得,为推进狼人游戏领域的研究做出贡献。
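
A minimal sketch of the rule-based dispatch idea: inspect an (LLM-derived) analysis of the conversation and decide between a prepared template and free-form LLM generation. The analysis fields, templates, and call_llm stub are hypothetical.

```python
from typing import Dict

TEMPLATES = {
    "refute_accusation": "I was with {ally} all night; your accusation makes no sense.",
    "end_conversation": "I have nothing more to add. Let's vote.",
}

def call_llm(prompt: str) -> str:
    # Hypothetical stub for an LLM API call used for free-form utterances.
    return f"[LLM reply to: {prompt[:40]}...]"

def respond(analysis: Dict[str, object], history: str) -> str:
    """Rule-based selection between a prepared template and LLM output,
    driven by an LLM-derived analysis of the conversation history."""
    if analysis.get("accused_me") and analysis.get("ally"):
        return TEMPLATES["refute_accusation"].format(ally=analysis["ally"])
    if analysis.get("turns_remaining", 1) == 0:
        return TEMPLATES["end_conversation"]
    return call_llm(f"Reply in persona as a calm villager.\n{history}")

print(respond({"accused_me": True, "ally": "Bob"}, "Alice: I think you are the werewolf!"))
```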

[NLP-31] Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture
[NLP-31] LLM认知领域基准:台湾客文化的见解

链接: https://arxiv.org/abs/2409.01556
作者: Chen-Chi Chang,Ching-Yuan Chen,Hung-Shin Lee,Chih-Cheng Lee
关键词-EN: large language models, focus on Hakka, Hakka culture, comprehensive benchmark designed, Leveraging Bloom Taxonomy
关键词-ZH: 大型语言模型,关注客语、客语文化,设计全面的基准,利用Bloom分类学
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom’s Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs’ abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models’ performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.
摘要:本研究以客家文化为例,介绍了一项旨在评估大型语言模型在理解和处理文化知识方面的性能的综合基准。利用Bloom的分类学,该研究开发了一个多维框架,系统地评估了六个认知领域的LLM:记忆、理解、应用、分析、评估和创造。这一基准超越了传统的单一维度评估,对LLMS处理特定文化内容的能力进行了更深入的分析,范围从基本的事实回忆到创造性合成等更高级别的认知任务。此外,研究还整合了检索增强生成(RAG)技术来解决LLMS中少数民族文化知识表示的挑战,展示了RAG如何通过动态整合相关外部信息来提高模型的性能。这一结果突出了RAG在提高所有认知域的准确性方面的有效性,特别是在需要准确提取和应用文化知识的任务中。然而,这些发现也揭示了RAG在创造性任务中的局限性,强调了进一步优化的必要性。这一基准为评估和比较不同文化背景下的LLM提供了一个强大的工具,为未来人工智能驱动的文化知识保存和传播方面的研究和开发提供了有价值的见解。

[NLP-32] Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
[NLP-32] 自学衍生提示生成满足上下文学习:释放黑匣子LLM的新潜力

链接: https://arxiv.org/abs/2409.01552
作者: Zhuo Li,Yuhao Du,Jinpeng Hu,Xiang Wan,Anningzhe Gao
关键词-EN: Large language models, Large language, shown success, Large, LLMs
关键词-ZH: 大型语言模型,大型语言,显示成功,大型,LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown success in generating high-quality responses. In order to achieve better alignment with LLMs with human preference, various works are proposed based on specific optimization process, which, however, is not suitable to Black-Box LLMs like GPT-4, due to inaccessible parameters. In Black-Box LLMs case, their performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts, and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs’ in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, including Black-Box models such as GPT-4.
摘要:大型语言模型(LLM)在生成高质量响应方面取得了成功。为了更好地与人类偏好的LLMS对准,人们根据具体的优化过程提出了各种工作,但由于参数不可达,不适合于GPT-4这样的黑盒LLMS。在黑盒LLMS的情况下,它们的表现高度依赖于所提供的提示的质量。现有的提高响应质量的方法通常涉及即时精化模型,然而这些方法潜在地受到精化提示和原始提示之间的语义不一致的影响,并且通常忽略了它们之间的关系。为了应对这些挑战,我们引入了一个自学的上下文学习框架,该框架通过生成可靠的派生提示来构建信息丰富的上下文环境,从而使LLM能够提供更有效的响应。我们的方法结合了一种自我指导的强化学习机制,允许在派生提示生成期间与响应模型直接交互,以实现更好的一致性。然后,我们将查询作为一项上下文学习任务,使用来自LLMS的响应与派生的提示相结合,为原始提示建立上下文演示。该策略确保了与原始查询的一致性,减少了精化提示的差异,并最大化了LLMS的上下文学习能力。大量实验表明,该方法不仅生成了更可靠的派生提示,而且显著增强了LLMS提供更有效响应的能力,包括GPT-4等黑盒模型。
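
The querying pipeline can be sketched roughly as below; call_llm is a hypothetical stub for a black-box LLM API, and the reinforcement-learning step used to train the derived-prompt generator is omitted.

```python
from typing import List

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a black-box LLM API (e.g., a chat completion call).
    return f"[answer to: {prompt[:50]}...]"

def generate_derived_prompts(original: str, n: int = 3) -> List[str]:
    # Ask the LLM to rephrase / elaborate the query into derived prompts.
    return [call_llm(f"Rewrite the question to be clearer (variant {i}): {original}")
            for i in range(n)]

def answer_with_derived_context(original: str) -> str:
    derived = generate_derived_prompts(original)
    # Answer each derived prompt, then use the (prompt, answer) pairs as
    # in-context demonstrations for the original query.
    demos = [f"Q: {p}\nA: {call_llm(p)}" for p in derived]
    context = "\n\n".join(demos)
    return call_llm(f"{context}\n\nQ: {original}\nA:")

print(answer_with_derived_context("Why does the sky appear blue at noon?"))
```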

[NLP-33] VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
[NLP-33] VoxHakka:台湾客语方言多样化的多说话人文本到语音系统

链接: https://arxiv.org/abs/2409.01548
作者: Li-Wei Chen,Hung-Shin Lee,Chen-Chi Chang
关键词-EN: spoken in Taiwan, designed for Taiwanese, paper introduces VoxHakka, Taiwanese Hakka, critically under-resourced language
关键词-ZH: 在台湾使用,专为台湾人设计,论文介绍VoxHakka,台湾客语,资源严重不足的语言
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in terms of pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advancement in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.
摘要:本文介绍了VoxHakka,一个为严重缺乏资源的台湾客家语而设计的文本到语音(TTS)系统。利用YourTTS框架,VoxHakka在语音合成中实现了高自然度和准确性以及低实时因素,同时支持六种不同的客家方言。这是通过用特定于方言的数据训练模型来实现的,从而允许生成说话人感知的客家话。为了解决公开可用的客家语语音语料库的稀缺问题,我们采用了一种具有成本效益的方法,利用网络抓取管道结合基于自动语音识别(ASR)的数据清理技术。这一过程确保了获得适合TTS培训的高质量、多说话人、多方言的数据集。使用比较平均意见分数(CMOS)进行的主观听力测试显示,VoxHakka在发音准确性、语调正确性和整体自然度方面明显优于现有的公开提供的客家TTS系统。这项工作标志着客家语言技术的重大进步,为语言保存和振兴工作提供了宝贵的资源。

[NLP-34] Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
[NLP-34] 利用动态随机扰动的域自适应语音增强的有效噪音感知数据模拟

链接: https://arxiv.org/abs/2409.01545
作者: Chien-Chun Wang,Li-Wei Chen,Hung-Shin Lee,Berlin Chen,Hsin-Min Wang
关键词-EN: Cross-domain speech enhancement, severe challenges due, Cross-domain speech, faced with severe, severe challenges
关键词-ZH: 跨域语音增强,面临严峻挑战,跨域语音,面临严峻、严峻的挑战
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
摘要:由于未知目标域中噪声和背景信息的稀缺,导致训练条件和测试条件不匹配,跨域语音增强面临严峻的挑战。为了解决这一问题,本研究提出了一种新的数据模拟方法,该方法利用噪声提取技术和生成性对抗网络(GANS),只有有限的目标含噪语音数据。值得注意的是,我们的方法使用了一个噪声编码器来从目标域数据中提取噪声嵌入。这些嵌入恰如其分地引导生成器合成声学上适合于目标域的发声,同时真实地保留输入纯净语音的语音内容。此外,我们引入了动态随机扰动的概念,它可以在推理过程中向噪声嵌入注入受控扰动,从而使模型能够很好地推广到看不见的噪声条件。在Voicebank-Demand基准数据集上的实验表明,我们的领域自适应SE方法的性能优于现有的基于数据模拟的强基线。

[NLP-35] It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots
[NLP-35] 是时候开发审计框架来促进价值意识聊天机器人了

链接: https://arxiv.org/abs/2409.01539
作者: Yanchen Wang,Lisa Singh
关键词-EN: marked the beginning, availability of generative, generative AI tools, ChatGPT in November, November
关键词-ZH: ChatGPT于11月推出,标志着生成性、生成性人工智能工具的开始,11月
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2306.07500

点击查看摘要

Abstract:The launch of ChatGPT in November 2022 marked the beginning of a new era in AI, the availability of generative AI tools for everyone to use. ChatGPT and other similar chatbots boast a wide range of capabilities from answering student homework questions to creating music and art. Given the large amounts of human data chatbots are built on, it is inevitable that they will inherit human errors and biases. These biases have the potential to inflict significant harm or increase inequity on different subpopulations. Because chatbots do not have an inherent understanding of societal values, they may create new content that is contrary to established norms. Examples of concerning generated content include child pornography, inaccurate facts, and discriminatory posts. In this position paper, we argue that the speed of advancement of this technology requires us, as computer and data scientists, to mobilize and develop a values-based auditing framework containing a community established standard set of measurements to monitor the health of different chatbots and LLMs. To support our argument, we use a simple audit template to share the results of basic audits we conduct that are focused on measuring potential bias in search engine style tasks, code generation, and story generation. We identify responses from GPT 3.5 and GPT 4 that are both consistent and not consistent with values derived from existing law. While the findings come as no surprise, they do underscore the urgency of developing a robust auditing framework for openly sharing results in a consistent way so that mitigation strategies can be developed by the academic community, government agencies, and companies when our values are not being adhered to. We conclude this paper with recommendations for value-based strategies for improving the technologies.
摘要:2022年11月推出的ChatGPT标志着人工智能新纪元的开始,每个人都可以使用生成性人工智能工具。ChatGPT和其他类似的聊天机器人拥有从回答学生作业问题到创作音乐和艺术的广泛能力。考虑到聊天机器人建立在大量人类数据之上,它们不可避免地会继承人类的错误和偏见。这些偏见有可能对不同的亚群造成重大伤害或增加不平等。因为聊天机器人对社会价值观没有与生俱来的理解,它们可能会创造与既定规范背道而驰的新内容。有关生成的内容的示例包括儿童色情、不准确的事实和歧视性帖子。在这份立场文件中,我们认为,这项技术的进步速度要求我们,作为计算机和数据科学家,动员和开发一个基于价值的审计框架,其中包含一套社区建立的标准衡量标准,以监控不同聊天机器人和LLM的健康状况。为了支持我们的论点,我们使用一个简单的审计模板来分享我们进行的基本审计的结果,这些审计的重点是衡量搜索引擎风格任务、代码生成和故事生成中的潜在偏差。我们确定了GPT 3.5和GPT 4的答复既与现有法律得出的值一致,也与现有法律得出的值不一致。虽然这些发现并不令人惊讶,但它们确实突显了开发一个强大的审计框架的紧迫性,以便以一致的方式公开分享结果,以便在我们的价值观未得到遵守时,学术界、政府机构和公司可以制定缓解策略。最后,我们对改进技术的基于价值的战略提出了建议。

[NLP-36] S3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
[NLP-36] S3 c-Math:自发的分步自我纠正使大型语言模型成为更好的数学推理

链接: https://arxiv.org/abs/2409.01524
作者: Yuchen Yan,Jin Jiang,Yang Liu,Yixin Cao,Xin Xu,Mengdi zhang,Xunliang Cai,Jian Shao
关键词-EN: large language models, potential reasoning abilities, Spontaneous Step-level Self-correction, language models, stimulate the potential
关键词-ZH: 大型语言模型、潜在推理能力、自发分步自我纠正、语言模型、激发潜力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs a step-level sampling approach to construct step-wise self-correction data for achieving such ability. Additionally, we implement a training strategy that uses the above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
摘要:自校正是一种能够激发大语言模型潜在推理能力的新方法。它涉及到当LLMS解决推理问题时,在推理过程中检测和纠正错误。然而,最近的研究并没有将自我纠正视为LLMS的一种自发和内在的能力。相反,这种修正是通过后自组织生成、外部知识引入、多模型协作和类似技术实现的。在本文中,我们提出了一系列称为S^3c-Math的数学最小二乘模型,它们能够对数学推理进行自发的步长级自校正。此功能帮助LLMS识别其正在进行的推理是否倾向于包含错误,并同时纠正这些错误以产生更可靠的响应。为了实现这种能力,我们提出了一种方法,该方法采用步进式抽样方法来构造步进式自校正数据。此外,我们实施了一种训练策略,该策略使用上述构建的数据来装备LLM具有自发的步长级别的自校正能力。我们的数据和方法已被证明在各种Foundation LLM中是有效的,在GSM8K、数学和其他数学基准的评估中一直显示出显著的进步。据我们所知,我们是第一个在数学推理中引入LLMS的自发步长自校正能力的人。

[NLP-37] DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models
[NLP-37] DiversityMedQA:使用大型语言模型评估医疗诊断中的人口统计学偏见

链接: https://arxiv.org/abs/2409.01497
作者: Rajat Rawat,Hudson McBride,Dhiyaan Nirmal,Rajarshi Ghosh,Jong Moon,Dhruv Alamuri,Sean O’Brien,Kevin Zhu
关键词-EN: large language models, gain traction, traction in healthcare, biases are growing, large language
关键词-ZH: 大型语言模型,获得吸引力,医疗保健领域的吸引力,偏见正在增长,大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
摘要:随着大型语言模型(LLM)在医疗保健领域越来越受欢迎,人们对它们容易受到人口统计偏见的担忧也越来越严重。我们引入了DiversityMedQA,这是一种新颖的基准,旨在评估LLM对不同患者人口统计数据(例如性别和种族)的医疗询问的反应。通过扰乱MedQA数据集(包括医疗委员会考试问题)中的问题,我们创建了一个基准,可以捕捉不同患者特征之间医疗诊断的细微差异。我们的研究结果显示,当针对这些人口统计差异进行测试时,模型性能存在显着差异。此外,为了确保扰动的准确性,我们还提出了一种验证每个扰动的过滤策略。通过发布DiversityMedQA,我们为评估和减轻LLM医学诊断中的人口统计学偏见提供了资源。

[NLP-38] The Compressor-Retriever Architecture for Language Model OS
[NLP-38] 语言模型操作系统的压缩机-检索器架构

链接: https://arxiv.org/abs/2409.01495
作者: Yuan Yang,Siheng Xiong,Ehsan Shareghi,Faramarz Fekri
关键词-EN: handling long documents, multimodal data querying, Recent advancements, tool usage, large language models
关键词-ZH: 处理长文档、多模式数据查询、最新进展、工具使用、大型语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced their capacity to aggregate and process information across multiple modalities, enabling them to perform a wide range of tasks such as multimodal data querying, tool usage, web interactions, and handling long documents. These capabilities pave the way for transforming LLMs from mere chatbots into general-purpose agents capable of interacting with the real world. This paper explores the concept of using a language model as the core component of an operating system (OS), effectively acting as a CPU that processes data stored in a context window, which functions as RAM. A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions, a feature limited by the current session-based interaction paradigm due to context window size limit. To address this, we introduce compressor-retriever, a model-agnostic architecture designed for life-long context management. Unlike other long-context solutions such as retrieval-augmented generation, our approach exclusively uses the base model’s forward function to compress and retrieve context, ensuring end-to-end differentiability. Preliminary experiments demonstrate the effectiveness of this architecture in in-context learning tasks, marking a step towards the development of a fully stateful LLM OS. Project repo available at: this https URL
摘要:大型语言模型(LLM)的最新进展显著增强了它们跨多个通道聚合和处理信息的能力,使它们能够执行广泛的任务,如多通道数据查询、工具使用、Web交互和处理长文档。这些能力为将LLM从纯粹的聊天机器人转变为能够与现实世界互动的通用代理铺平了道路。本文探讨了使用语言模型作为操作系统(OS)的核心组件的概念,有效地充当处理存储在上下文窗口中的数据的CPU,该上下文窗口起到RAM的作用。实现这种LM OS的一个关键挑战是管理持续时间的上下文并确保跨会话的状态,由于上下文窗口大小的限制,这一功能受到当前基于会话的交互范例的限制。为了解决这个问题,我们引入了压缩器-检索器,这是一个为终身上下文管理而设计的与模型无关的体系结构。与其他长上下文解决方案(如检索-增强生成)不同,我们的方法只使用基本模型的前向函数来压缩和检索上下文,确保了端到端的可区分性。初步实验证明了该体系结构在情景学习任务中的有效性,标志着朝着开发完全有状态的LLm操作系统迈出了一步。项目回购地址:此HTTPS URL

[NLP-39] Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
[NLP-39] 通过使用任务特定专家修剪评估不有效性来重新审视SMoE语言模型

链接: https://arxiv.org/abs/2409.01483
作者: Soumajyoti Sarkar,Leonard Lausen,Volkan Cevher,Sheng Zha,Thomas Brox,George Karypis
关键词-EN: Sparse Mixture, Mixture of Expert, language modeling, scalable alternative, alternative to dense
关键词-ZH: 稀疏混合、专家混合、语言建模、可扩展替代方案、密集替代方案
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. These models use conditionally activated feedforward subnetworks in transformer blocks, allowing for a separation between total model parameters and per-example computation. However, large token-routed SMoE models face a significant challenge: during inference, the entire model must be used for a sequence or a batch, resulting in high latencies in a distributed setting that offsets the advantages of per-token sparse activation. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures, mainly modulating the choice of expert counts in pretraining. We investigate whether such pruned models offer advantages over smaller SMoE models trained from scratch, when evaluating and comparing them individually on tasks. To that end, we introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training. Our findings reveal a threshold pruning factor for the reduction that depends on the number of experts used in pretraining, above which, the reduction starts to degrade model performance. These insights contribute to our understanding of model design choices when pretraining with SMoE architectures, particularly useful when considering task-specific inference optimization for later stages.
摘要:专家稀疏混合(SMOE)模型已成为语言建模中密集模型的一种可扩展替代方案。这些模型在变压器块中使用有条件激活的前馈子网络,允许总模型参数和逐个实例计算之间的分离。然而,大型令牌路由SMOE模型面临一个重大挑战:在推理过程中,整个模型必须用于序列或批处理,这导致分布式设置中的高延迟抵消了按令牌稀疏激活的优势。我们的研究探索了任务特定的模型修剪,以提供有关设计SMOE体系结构的决策信息,主要是在预培训中调节专家数量的选择。我们调查这种修剪后的模型在对任务进行单独评估和比较时,是否比从头开始训练的较小的SMOE模型具有优势。为此,我们引入了一种自适应的任务感知剪枝技术UnCurl,以在训练后以离线方式减少每个MOE层的专家数量。我们的发现揭示了一个阈值剪枝因子,该因子取决于预训练中使用的专家数量,超过这个值,减少开始降低模型的性能。这些见解有助于我们在使用SMOE架构进行预培训时理解模型设计选择,在考虑后续阶段的特定于任务的推理优化时尤其有用。

[NLP-40] Masked Mixers for Language Generation and Retrieval
[NLP-40] 用于语言生成和检索的掩蔽混合器

链接: https://arxiv.org/abs/2409.01482
作者: Benjamin L. Badger
关键词-EN: confer selective focus, mechanisms that confer, confer selective, selective focus, strict subset
关键词-ZH: 赋予选择性焦点,赋予机制,赋予选择性,选择性焦点,严格子集
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 15 figures (11 primary, 4 supplementary)

点击查看摘要

Abstract:Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit there to be downside to the use of attention: most information present in the input is necessarily lost. In support of this idea we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers which replace self-attention with masked convolutions. Applied to TinyStories the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.
摘要:在当今的语言模型中,对输入元素的严格子集给予选择性关注的注意机制几乎无处不在。我们假设注意力的使用有不利的一面:输入中存在的大多数信息必然会丢失。为了支持这一想法,我们观察到变压器中输入表示的准确性较差,但在我们所称的屏蔽混合器中找到了更准确的表示,它用屏蔽卷积取代了自我注意。应用于TinyStories,屏蔽混合器学习因果语言任务的效率高于早期的转换器实现,但略低于优化的当前实现。对于这个数据集,观察到的最有效的学习算法是变压器-屏蔽混合器混合,这表明这些模型以正交方式学习。我们假设变压器表现出的信息损失对检索的损害比生成更大,为了验证这一点,我们在现有生成模型嵌入的基础上引入了一种有效的检索模型训练方法。使用这种方法,与来自转换器的嵌入相比,来自屏蔽混合器的嵌入被发现导致了更好的从摘要到故事的检索。
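
A minimal PyTorch sketch of a masked-mixer-style block, where self-attention is replaced by a causally (left-)padded 1-D convolution over the sequence; the dimensions and block layout are assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConvMixerBlock(nn.Module):
    """Token mixing via a causal (left-padded) 1-D convolution instead of self-attention."""
    def __init__(self, d_model: int, kernel_size: int = 8):
        super().__init__()
        self.kernel_size = kernel_size
        self.mix = nn.Conv1d(d_model, d_model, kernel_size)   # mixes information across positions
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (batch, seq, d_model)
        h = self.norm1(x).transpose(1, 2)                       # (batch, d_model, seq)
        h = F.pad(h, (self.kernel_size - 1, 0))                 # left padding acts as the causal mask
        x = x + self.mix(h).transpose(1, 2)                     # residual token mixing
        return x + self.ff(self.norm2(x))                       # position-wise feed-forward

block = MaskedConvMixerBlock(d_model=64)
tokens = torch.randn(2, 16, 64)
print(block(tokens).shape)                                      # torch.Size([2, 16, 64])
```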

[NLP-41] PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science
[NLP-41] PoliPrompt:一个高性能、经济高效的基于LLM的政治学文本分类框架

链接: https://arxiv.org/abs/2409.01466
作者: Menglin Liu,Ge Shi
关键词-EN: large language models, extensive feature engineering, require extensive feature, Recent advancements, language models
关键词-ZH: 大型语言模型、广泛的功能工程、需要广泛的功能、最新进展、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
摘要:大型语言模型的最新进展为提高政治学中的文本分类效率开辟了新的途径,超越了传统的机器学习方法,后者通常需要广泛的特征工程、人类标记和特定任务的训练。然而,它们在实现高分类精度方面的有效性仍然值得怀疑。本文介绍了一种三阶段上下文中学习方法,该方法利用LLMS来提高分类精度,同时将实验成本降至最低。我们的方法结合了自动增强的提示生成、自适应样本选择和共识机制,该机制解决了由高级LLM改进的两个较弱的LLM之间的差异。我们使用来自BBC新闻报道、卡瓦诺最高法院确认和2018年竞选广告的数据集来验证我们的方法。结果表明,在可管理的经济代价(与人工标注相比-78%)的情况下,分类F1得分(零镜头分类+0.36)显著提高,表明该方法有效地解决了传统机器学习的局限性,同时为政治学中的文本分析提供了一种可扩展的可靠解决方案。

[NLP-42] GenAgent : Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI
[NLP-42] GenAgent:通过自动化工作流生成构建协作人工智能系统–ComfyUI案例研究

链接: https://arxiv.org/abs/2409.01392
作者: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
关键词-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
关键词-ZH: 开发整体模型,之前的人工智能研究,研究已经专注于,专注于开发,最大化他们的智能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.
摘要:以前的许多人工智能研究都集中在开发单一模型,以最大限度地提高它们的智能和能力,主要目标是提高特定任务的性能。相反,本文探索了另一种方法:协作式人工智能系统,它使用工作流来集成模型、数据源和管道,以解决复杂和多样化的任务。我们引入了GenAgent,这是一个基于LLM的框架,可以自动生成复杂的工作流,与单一模型相比,提供了更大的灵活性和可伸缩性。GenAgent的核心创新在于用代码表示工作流,同时用协作代理循序渐进地构建工作流。我们在ComfyUI平台上实现了GenAgent,并提出了一个新的基准测试程序OpenComfy。结果表明,GenAgent在运行级和任务级的评估中都优于基准方法,显示了其生成具有卓越有效性和稳定性的复杂工作流的能力。

[NLP-43] CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
[NLP-43] CV-Probes:研究词汇和世界知识在视觉基础动词理解中的相互作用

链接: https://arxiv.org/abs/2409.01389
作者: Ivana Beňová,Michal Gregor,Albert Gatt
关键词-EN: study investigates, investigates the ability, ground context-dependent, ground context-dependent verb, context-dependent verb phrases
关键词-ZH: 研究调查,调查能力,背景上下文相关,背景上下文相关动词,背景上下文相关动词短语
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure, 11 tables, LIMO Workshop at KONVENS 2024

点击查看摘要

Abstract:This study investigates the ability of various vision-language (VL) models to ground context-dependent and non-context-dependent verb phrases. To do that, we introduce the CV-Probes dataset, designed explicitly for studying context understanding, containing image-caption pairs with context-dependent verbs (e.g., “beg”) and non-context-dependent verbs (e.g., “sit”). We employ the MM-SHAP evaluation to assess the contribution of verb tokens towards model predictions. Our results indicate that VL models struggle to ground context-dependent verb phrases effectively. These findings highlight the challenges in training VL models to integrate context accurately, suggesting a need for improved methodologies in VL model training and evaluation.
摘要:本研究调查了各种视觉语言(VD)模型建立上下文相关和非上下文相关动词短语的能力。为此,我们引入了CV-Probes数据集,该数据集专门为研究上下文理解而设计,包含带有上下文相关动词的图像标题对(例如,“beg”)和非上下文相关动词(例如,“坐下”)。我们使用MM-SHAP评估来评估动词标记对模型预测的贡献。我们的结果表明,VD模型很难有效地建立依赖上下文的动词短语。这些研究结果凸显了训练DL模型以准确整合上下文所面临的挑战,表明需要改进DL模型训练和评估的方法。

[NLP-44] Membership Inference Attacks Against In-Context Learning
[NLP-44] 针对上下文内学习的成员推理攻击

链接: https://arxiv.org/abs/2409.01380
作者: Rui Wen,Zheng Li,Michael Backes,Yang Zhang
关键词-EN: Adapting Large Language, specific tasks introduces, tasks introduces concerns, Large Language Models, In-Context Learning
关键词-ZH: 适应大型语言、特定任务介绍、任务介绍关注点、大型语言模型、上下文学习
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: To Appear in the ACM Conference on Computer and Communications Security, October 14-18, 2024

点击查看摘要

Abstract:Adapting Large Language Models (LLMs) to specific tasks introduces concerns about computational efficiency, prompting an exploration of efficient methods such as In-Context Learning (ICL). However, the vulnerability of ICL to privacy attacks under realistic assumptions remains largely unexplored. In this work, we present the first membership inference attack tailored for ICL, relying solely on generated texts without their associated probabilities. We propose four attack strategies tailored to various constrained scenarios and conduct extensive experiments on four popular large language models. Empirical results show that our attacks can accurately determine membership status in most cases, e.g., 95% accuracy advantage against LLaMA, indicating that the associated risks are much higher than those shown by existing probability-based attacks. Additionally, we propose a hybrid attack that synthesizes the strengths of the aforementioned strategies, achieving an accuracy advantage of over 95% in most cases. Furthermore, we investigate three potential defenses targeting data, instruction, and output. Results demonstrate combining defenses from orthogonal dimensions significantly reduces privacy leakage and offers enhanced privacy assurances.
摘要:将大型语言模型(LLM)适应于特定的任务会引起对计算效率的担忧,促使人们探索高效的方法,如上下文中学习(ICL)。然而,ICL在现实假设下对隐私攻击的脆弱性在很大程度上仍未被探索。在这项工作中,我们提出了第一个为ICL量身定做的成员关系推理攻击,仅依赖于没有关联概率的生成文本。针对不同的约束场景,我们提出了四种攻击策略,并在四个流行的大型语言模型上进行了广泛的实验。实验结果表明,我们的攻击在大多数情况下都可以准确地确定成员身份,例如对骆驼的95%的准确率优势,表明关联的风险比现有的基于概率的攻击要高得多。此外,我们还提出了一种综合上述策略优点的混合攻击方法,在大多数情况下获得了95%以上的准确率优势。此外,我们还研究了针对数据、指令和输出的三种潜在防御措施。结果表明,从正交维组合防御显着减少隐私泄漏,并提供增强的隐私保证。

[NLP-45] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
[NLP-45] 国际象棋:通过逐行分组和选择性稀疏化优化LLM推理

链接: https://arxiv.org/abs/2409.01366
作者: Junhui He,Shangyu Wu,Weidong Wen,Chun Jason Xue,Qingan Li
关键词-EN: Deploying large language, edge devices presents, devices presents significant, substantial computational overhead, Deploying large
关键词-ZH: 部署大型语言,边缘设备存在,设备存在大量的计算负担,部署大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not explicitly model the impact of activation sparsification on performance, leading to suboptimal performance degradation. To address this issue, this paper reformulates the activation sparsification problem by introducing a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification involves applying thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over 8 downstream tasks while activating fewer parameters compared to existing methods, thus speeding up the LLM inference by up to 1.27x.
摘要:由于大量的计算开销和内存需求,在边缘设备上部署大型语言模型(LLM)是一个巨大的挑战。激活稀疏可以通过减少推理过程中激活的神经元数量来缓解这些挑战。现有的方法通常采用基于阈值的稀疏化,基于激活张量的统计。然而,这些方法没有显式建模激活稀疏化对性能的影响,从而导致次优性能下降。为了解决这个问题,本文通过引入一个优化稀疏决策的新目标,对激活稀疏问题进行了重新描述。在此基础上,我们提出了CHESS,一种基于通道阈值和选择性稀疏化的通用激活稀疏方法。首先,基于信道的阈值为前馈网络(FFN)层中的每个激活信道分配唯一的阈值。然后,选择性稀疏化涉及将基于阈值的激活稀疏化应用于注意模块内的特定层。最后,我们详细介绍了稀疏核的实现方法,以加速LLM推理。实验结果表明,与已有方法相比,该方法在8个下游任务上的性能降幅更小,激活的参数更少,从而使LLM推理的速度提高了1.27倍。
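
A rough sketch of channel-wise thresholded activation sparsification for a single FFN layer; the calibration rule (a per-channel quantile on a held-out batch) and the keep ratio are assumptions, and the selective application to attention layers is omitted.

```python
import torch
import torch.nn as nn

class ThresholdedFFN(nn.Module):
    """FFN whose hidden activations are zeroed when they fall below a per-channel threshold."""
    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.register_buffer("threshold", torch.zeros(d_hidden))  # one threshold per channel

    @torch.no_grad()
    def calibrate(self, sample: torch.Tensor, keep_ratio: float = 0.3):
        # Assumed calibration rule: on a held-out batch, pick each channel's threshold
        # so that roughly `keep_ratio` of its activations survive.
        h = torch.relu(self.up(sample))                            # (n, d_hidden)
        self.threshold = torch.quantile(h, 1.0 - keep_ratio, dim=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x))
        h = torch.where(h >= self.threshold, h, torch.zeros_like(h))  # channel-wise sparsification
        return self.down(h)

ffn = ThresholdedFFN()
ffn.calibrate(torch.randn(256, 64))
print(ffn(torch.randn(4, 64)).shape)
```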

[NLP-46] Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain
[NLP-46] 知道何时检索:研究法律领域的非英语混合检索

链接: https://arxiv.org/abs/2409.01357
作者: Antoine Louis,Gijs van Dijck,Gerasimos Spanakis
关键词-EN: matching paradigms, Hybrid search, effective strategy, strategy to offset, offset the limitations
关键词-ZH: 匹配范式、混合搜索、有效策略、抵消策略、抵消限制
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under review

点击查看摘要

Abstract:Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
摘要:混合搜索已经成为一种有效的策略来弥补不同匹配范例的局限性,特别是在检索质量有显著改善的域外上下文中。然而,现有的研究主要集中在一组有限的检索方法上,仅在英语领域通用数据集上对其进行配对评估。在这项工作中,我们研究了混合搜索在法语尚未探索的法律领域内各种著名的检索模型中的效率,评估了零命中和领域内两种情况。我们的发现表明,在零镜头环境下,与使用独立模型相比,融合不同的领域通用模型始终可以提高性能,而不考虑融合方法。令人惊讶的是,当模型在领域内训练时,我们发现融合通常会比使用最好的单一系统降低性能,除非将分数与仔细调整的权重进行融合。这些新颖的见解扩展了先前发现在一个新领域和新语言中的适用性,并有助于更深入地理解非英语专业领域的混合搜索。

[NLP-47] Language Models Benefit from Preparation with Elicited Knowledge
[NLP-47] 语言模型受益于具有引出知识的准备

链接: https://arxiv.org/abs/2409.01345
作者: Jiacan Yu,Hannah An,Lenhart K. Schubert
关键词-EN: require multiple reasoning, multiple reasoning steps, reasoning steps, chain of thought, language models
关键词-ZH: 需要多重推理、多个推理步骤、推理步骤、思维链、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps, typically enhanced by the prompt “Let’s think step by step.” However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple general prompting technique, called PREP, that involves using two instances of LMs: the first (LM1) generates relevant information, and the second (LM2) answers the question based on this information. PREP is designed to be general and independent of the user’s domain knowledge, making it applicable across various QA tasks without the need for specialized prompt engineering. To evaluate the effectiveness of our prompting method, we create a dataset of 100 binary-choice questions, derived from an extensive schematic dataset on artifact parts and material composition. These questions ask which of two artifacts is less likely to share materials with another artifact. Such questions probe the LM’s knowledge of shared materials in the part structure of different artifacts. We test our method on our dataset and three published commonsense reasoning datasets. The average accuracy of our method is consistently higher than that of all the other tested methods across all the tested datasets.
摘要:对于需要多个推理步骤的任务,语言模型(LMS)经常在问题回答(QA)中使用零命中思想链(COT)方法,典型的是通过“让我们逐步思考”的提示来增强。然而,一些QA任务更多地依赖于获取相关知识,而不是链接推理步骤。我们介绍了一种简单的通用提示技术,称为PREP,它涉及使用LMS的两个实例:第一个(LM1)生成相关信息,第二个(LM2)根据该信息回答问题。PREP被设计为通用的,独立于用户的领域知识,使其适用于各种QA任务,而不需要专门的提示工程。为了评估我们的提示方法的有效性,我们创建了一个包含100个二元选择问题的数据集,这些问题来自关于人工制品部件和材料组成的大量示意图数据集。这些问题询问两个文物中哪一个不太可能与另一个文物共享材料。这样的问题探索了LM对不同文物的部件结构中共享材料的知识。我们在我们的数据集和三个已发表的常识推理数据集上测试了我们的方法。在所有测试的数据集上,我们的方法的平均准确率始终高于所有其他测试方法。
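
A minimal sketch of the two-instance PREP setup; call_llm is a hypothetical stub for whichever LM API is used, and the prompt wording is illustrative rather than the paper's exact template.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LM API call (the same model can serve both roles).
    return f"[generated text for: {prompt[:60]}...]"

def prep_answer(question: str) -> str:
    # Stage 1 (LM1): elicit background knowledge relevant to the question.
    knowledge = call_llm(
        "List facts about the materials and parts of the artifacts mentioned "
        f"in this question:\n{question}"
    )
    # Stage 2 (LM2): answer the question conditioned on the elicited knowledge.
    return call_llm(
        f"Background information:\n{knowledge}\n\n"
        f"Using the background information, answer the question:\n{question}"
    )

print(prep_answer("Which is less likely to share materials with a glass bottle: "
                  "a ceramic mug or a drinking glass?"))
```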

[NLP-48] Pairing Analogy-Augmented Generation with Procedural Memory for Procedural QA
[NLP-48] 将模拟增强生成与程序内存配对以实现程序QA

链接: https://arxiv.org/abs/2409.01344
作者: K Roth,Rushil Gupta,Simon Halle,Bang Liu
关键词-EN: procedural question answering, shown remarkable performance, question answering, complex tasks, paradigm have shown
关键词-ZH: 程序性问题回答,表现出色,问题回答,复杂任务,范式已显示
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While LLMs in the RAG paradigm have shown remarkable performance on a variety of tasks, they still under-perform on unseen domains, especially on complex tasks like procedural question answering. In this work, we introduce a novel formalism and structure for manipulating text-based procedures. Based on this formalism, we further present a novel dataset called LCStep, scraped from the LangChain Python docs. Moreover, we extend the traditional RAG system to propose a novel system called analogy-augmented generation (AAG), that draws inspiration from human analogical reasoning and ability to assimilate past experiences to solve unseen problems. The proposed method uses a frozen language model with a custom procedure memory store to adapt to specialized knowledge. We demonstrate that AAG outperforms few-shot and RAG baselines on LCStep, RecipeNLG, and CHAMP datasets under a pairwise LLM-based evaluation, corroborated by human evaluation in the case of RecipeNLG.
摘要:虽然RAG范式中的LLM在各种任务中表现出色,但它们在未知领域仍然表现不佳,尤其是在程序性问题回答等复杂任务中。在这项工作中,我们引入了一种新颖的形式主义和结构来操作基于文本的过程。基于这种形式主义,我们进一步提出了一个名为LCStep的新颖数据集,该数据集从LangChain Python文档中抓取。此外,我们扩展了传统的RAG系统,提出了一种名为类比增强生成(AAG)的新型系统,该系统从人类类比推理和吸收过去经验以解决看不见的问题的能力中汲取灵感。所提出的方法使用具有自定义过程内存存储的冻结语言模型来适应专业知识。我们证明,在基于成对LLM的评估下,AAG在LCStep、RecipeNLG和CHMP数据集上的表现优于少数镜头和RAG基线,RecipeNLG的人类评估也证实了这一点。

[NLP-49] Path-Consistency: Prefix Enhancement for Efficient Inference in LLM
[NLP-49] 路径一致性:LLM中有效推理的前置增强

链接: https://arxiv.org/abs/2409.01281
作者: Jiace Zhu,Yingtao Shen,Jie Zhao,An Zou
关键词-EN: large language models, gained significant popularity, combining multiple sampling, language models, majority voting
关键词-ZH: 大型语言模型,结合多重抽样、语言模型、多数投票,获得了广泛的欢迎
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To enhance the reasoning capabilities of large language models (LLMs), self-consistency has gained significant popularity by combining multiple sampling with majority voting. However, the state-of-the-art self-consistency approaches consume substantial computational resources and lead to significant additional time costs due to the multiple sampling. This prevents its full potential from being realized in scenarios where computational resources are critical. To improve the inference efficiency, this paper introduces \textit{path-consistency}, a method that leverages the confidence of answers generated in earlier branches to identify the prefix of the most promising path. By dynamically guiding the generation of subsequent branches based on this prefix, the \textit{path-consistency} mitigates both the errors and redundancies from random or less useful sampling in self-consistency. As a result, it can significantly accelerate the inference process by reducing the number of tokens generated. Our extensive empirical evaluation shows that the \textit{path-consistency} achieves significant acceleration in inference latency ranging from 7.8% to 40.5%, while maintaining or even improving task accuracy across different datasets, including mathematical reasoning, common sense reasoning, symbolic reasoning, and code generation.
摘要:为了提高大型语言模型的推理能力,通过将多重抽样和多数投票相结合,自我一致性得到了广泛的应用。然而,最先进的自洽方法消耗了大量的计算资源,并且由于多次采样而导致显著的额外时间开销。这使其无法在计算资源至关重要的情况下充分发挥其潜力。为了提高推理效率,本文引入了文本路径一致性方法,该方法利用早期分支产生的答案的置信度来确定最有希望路径的前缀。通过基于该前缀动态地指导后续分支的生成,文本标题路径一致性在自一致性中减少了来自随机或不太有用的采样的错误和冗余。因此,它可以通过减少生成的令牌数量来显著加快推理过程。大量的实验结果表明,在保持甚至提高不同数据集上的任务精度的同时,文本标题路径一致性在推理延迟上获得了显著的加速,从7.8到40.5%,包括数学推理、常识推理、符号推理和代码生成。
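
A rough sketch of the prefix-reuse idea under simple assumptions: a few early branches are sampled with a hypothetical confidence score, the most confident branch donates its early steps as a shared prefix for later branches, and answers are aggregated by majority vote. The paper's actual confidence estimation and decoding integration are not reproduced.

```python
import random
from collections import Counter
from typing import Tuple

def sample_path(prompt: str) -> Tuple[str, float]:
    # Hypothetical stub: returns (reasoning path ending in "answer: X", confidence).
    answer = random.choice(["4", "4", "5"])
    return f"step 1. step 2. answer: {answer}", random.random()

def path_consistency(question: str, n_early: int = 3, n_late: int = 5) -> str:
    # 1) Sample a few full reasoning paths and keep the most confident one.
    early = [sample_path(question) for _ in range(n_early)]
    best_path, _ = max(early, key=lambda t: t[1])
    # 2) Reuse the first steps of the best path as a shared prefix, so the
    #    remaining branches only generate continuations (fewer new tokens).
    prefix = ". ".join(best_path.split(". ")[:-1])
    late = [sample_path(f"{question}\n{prefix}")[0] for _ in range(n_late)]
    # 3) Aggregate final answers across all branches by majority vote.
    answers = [p.rsplit("answer:", 1)[-1].strip() for p, _ in early]
    answers += [p.rsplit("answer:", 1)[-1].strip() for p in late]
    return Counter(answers).most_common(1)[0][0]

print(path_consistency("What is 2 + 2?"))
```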

[NLP-50] THInC: A Theory-Driven Framework for Computational Humor Detection
[NLP-50] THInC:理论驱动的计算幽默检测框架

链接: https://arxiv.org/abs/2409.01232
作者: Victor De Marez,Thomas Winters,Ayla Rigouts Terryn
关键词-EN: Humor, humor theories, communication and cognition, social engagement, human communication
关键词-ZH: 幽默、幽默理论、沟通与认知、社会参与、人际沟通
类目: Computation and Language (cs.CL)
备注: Accepted at CREAI 2024 (International Workshop on Artificial Intelligence and Creativity)

点击查看摘要

Abstract:Humor is a fundamental aspect of human communication and cognition, as it plays a crucial role in social engagement. Although theories about humor have evolved over centuries, there is still no agreement on a single, comprehensive humor theory. Likewise, computationally recognizing humor remains a significant challenge despite recent advances in large language models. Moreover, most computational approaches to detecting humor are not based on existing humor theories. This paper contributes to bridging this long-standing gap between humor theory research and computational humor detection by creating an interpretable framework for humor classification, grounded in multiple humor theories, called THInC (Theory-driven Humor Interpretation and Classification). THInC ensembles interpretable GA2M classifiers, each representing a different humor theory. We engineered a transparent flow to actively create proxy features that quantitatively reflect different aspects of theories. An implementation of this framework achieves an F1 score of 0.85. The associative interpretability of the framework enables analysis of proxy efficacy, alignment of joke features with theories, and identification of globally contributing features. This paper marks a pioneering effort in creating a humor detection framework that is informed by diverse humor theories and offers a foundation for future advancements in theory-driven humor classification. It also serves as a first step in automatically comparing humor theories in a quantitative manner.
摘要:幽默是人类交流和认知的一个基本方面,因为它在社会参与中起着至关重要的作用。尽管关于幽默的理论已经发展了几个世纪,但对于一个单一的、全面的幽默理论仍然没有达成一致意见。同样,尽管最近在大型语言模型方面取得了进展,但在计算上识别幽默仍然是一个巨大的挑战。此外,大多数用于检测幽默的计算方法并不是基于现有的幽默理论。本文以多种幽默理论为基础,构建了一个可解释的幽默分类框架,称为THINC,旨在弥合幽默理论研究和计算幽默检测之间的长期差距。THINC集合了可解释的GA2M分类器,每个分类器代表不同的幽默理论。我们设计了一个透明的流程,以积极地创建代理特征,定量地反映理论的不同方面。该框架的一个实现实现了F1分数为0.85。该框架的关联可解释性使得能够分析代理效力、将笑话特征与理论对齐、以及识别全局贡献特征。这篇论文标志着在创建一个基于不同幽默理论的幽默检测框架方面的开创性努力,并为未来理论驱动的幽默分类的发展奠定了基础。这也是自动定量比较幽默理论的第一步。

[NLP-51] Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
[NLP-51] 使用上下文感知句子编码的提示压缩,以实现快速且改进的LLM推理

链接: https://arxiv.org/abs/2409.01227
作者: Barys Liskavets,Maxim Ushakov,Shuvendu Roy,Mark Klibanov,Ali Etemad,Shane Luke
关键词-EN: Large language models, Large language, language models, stream of research, research focusing
关键词-ZH: 大型语言模型,大型语言,语言模型,研究流,研究重点
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2002.01664 by other authors

点击查看摘要

Abstract:Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: this https URL.
摘要:大型语言模型引发了一股新的研究热潮,其重点是压缩上下文长度以降低计算成本,同时确保保留有用的信息以回答给定的问题。基于令牌的删除方法是这一方向最突出的方法之一,但由于中间令牌删除,特别是在高压缩比的情况下,可能会丢失上下文的语义,同时也面临计算效率方面的挑战。在这项工作中,我们提出了上下文感知提示压缩(CPC),这是一种句子级别的提示压缩技术,其关键创新是一种新颖的上下文感知句子编码器,它为给定问题的每个句子提供一个相关性分数。为了训练这个编码器,我们生成了一个新的数据集,由问题、肯定词和否定词组成,其中肯定词是与问题相关的句子,而否定词是无关的上下文句子。我们在对比设置中训练编码者学习上下文感知的句子表示。我们的方法在基准数据集上的即时压缩性能大大优于以前的工作,并且与最好的令牌级压缩方法相比,推理速度最高可快10.93倍。我们还发现,在大多数基准测试中,较短的长度约束都有更好的改进,这表明了我们所提出的解决方案在较短的上下文中压缩相关信息的有效性。最后,我们发布了代码和数据集,以便于快速重现和进一步开发:此HTTPS URL。
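
The inference-time behavior can be sketched as follows, with a toy lexical-overlap scorer standing in for the trained context-aware sentence encoder; the scorer and the keep ratio are assumptions.

```python
import re

def relevance_score(question: str, sentence: str) -> float:
    # Toy stand-in for the trained context-aware sentence encoder: here the
    # score is simple word overlap with the question.
    q = set(re.findall(r"\w+", question.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / (len(s) or 1)

def compress_prompt(question: str, context: str, keep_ratio: float = 0.5) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    ranked = sorted(sentences, key=lambda s: relevance_score(question, s), reverse=True)
    kept = set(ranked[: max(1, int(len(sentences) * keep_ratio))])
    # Preserve the original sentence order of the retained sentences.
    return " ".join(s for s in sentences if s in kept)

context = ("The Amazon river is in South America. Pandas eat bamboo. "
           "The river's discharge is the largest in the world. Cats purr.")
print(compress_prompt("How large is the Amazon river's discharge?", context))
```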

[NLP-52] A multilingual training strategy for low resource Text to Speech
[NLP-52] 低资源文本到语音的多语言培训策略

链接: https://arxiv.org/abs/2409.01217
作者: Asma Amalas,Mounir Ghogho,Mohamed Chetouani,Rachid Oulad Haj Thami
关键词-EN: high quality synthesised, Recent speech technologies, produce high quality, neural Text, synthesised speech due
关键词-ZH: 高质量合成,最新语音技术,产生高质量的神经文本,合成语音
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Recent advances in neural Text to Speech (TTS) have made it possible to produce high quality synthesised speech. However, such TTS models depend on extensive amounts of data that can be costly to produce and is hardly scalable to all existing languages, especially since little attention is given to low resource languages. With techniques such as knowledge transfer, the burden of creating datasets can be alleviated. In this paper, we therefore investigate two aspects: firstly, whether data from social media can be used for a small TTS dataset construction, and secondly whether cross lingual transfer learning (TL) for a low resource language can work with this type of data. In this aspect, we specifically assess to what extent multilingual modeling can be leveraged as an alternative to training on monolingual corpora. To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low resource language. Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.
摘要:由于神经文语转换(TTS)的最新进展,最近的语音技术已经导致产生高质量的合成语音。然而,这样的TTS模型依赖于大量的数据,这些数据的产生成本很高,而且很难扩展到所有现有的语言,特别是很少关注低资源语言。通过知识转移等技术,可以减轻创建数据集的负担。因此,本文从两个方面进行了研究:第一,社交媒体上的数据是否可以用于小语料库的构建;第二,低资源语言的跨语言迁移学习是否可以处理这种类型的数据。在这方面,我们具体评估在多大程度上可以利用多语言建模作为单语言语料库培训的替代方案。为此,我们探索了如何选择和汇集来自外语的数据来为目标低资源语言训练TTS模型。我们的研究结果表明,在提高语音的可理解性和自然度方面,多语预训练优于单语预训练。

[NLP-53] CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models NDSS
[NLP-53] CLIBE:在基于转换器的NLP模型中检测动态后门

链接: https://arxiv.org/abs/2409.01193
作者: Rui Zeng,Xi Chen,Yuwen Pu,Xuhong Zhang,Tianyu Du,Shouling Ji
关键词-EN: attacker secretly selects, NLP dynamic backdoor, NLP, NLP models, CLIBE
关键词-ZH: 攻击者秘密选择,NLP动态后门,NLP,NLP模型,CLMBE
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To appear in the Network and Distributed System Security (NDSS) Symposium, February, 2025

点击查看摘要

Abstract:Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed words, phrases, or sentences used in the static text trigger, NLP dynamic backdoor attacks design triggers associated with abstract and latent text features, making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while detecting dynamic backdoors in NLP models remains largely unexplored. This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. CLIBE injects a “few-shot perturbation” into the suspect Transformer model by crafting optimized weight perturbation in the attention layers to make the perturbed model classify a limited number of reference samples as a target label. Subsequently, CLIBE leverages the generalization ability of this few-shot perturbation to determine whether the original model contains a dynamic backdoor. Extensive evaluation on three advanced NLP dynamic backdoor attacks, two widely-used Transformer frameworks, and four real-world classification tasks strongly validates the effectiveness of CLIBE. We also demonstrate the robustness of CLIBE against various adaptive attacks. Furthermore, we employ CLIBE to scrutinize 49 popular Transformer models on Hugging Face and discover one exhibiting a high probability of containing a dynamic backdoor. We have contacted Hugging Face and provided detailed evidence of this model’s backdoor behavior. Moreover, we extend CLIBE to detect backdoor text generation models modified to exhibit toxic behavior. To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.
摘要:当输入文本包含攻击者秘密选择的特定特征(称为触发器)时,可以将后门注入NLP模型以诱导不当行为。与静态文本触发器中使用的固定单词、短语或句子不同,NLP动态后门攻击设计与抽象和潜在文本特征相关联的触发器,使其比传统的静态后门攻击具有相当大的隐蔽性。然而,现有的关于NLP后门检测的研究主要集中在对静态后门攻击的防御上,而对NLP模型中动态后门的检测在很大程度上还没有被探索。本文提出了第一个检测基于Transformer的NLP模型中动态后门的框架CLIBE。CLIBE通过在注意力层中精心设计优化的权重扰动,向可疑的Transformer模型注入一种“少样本扰动”,以使扰动后的模型将有限数量的参考样本分类为目标标签。随后,CLIBE利用这种少样本扰动的泛化能力来确定原始模型是否包含动态后门。对三种高级NLP动态后门攻击、两个广泛使用的Transformer框架和四个真实世界分类任务的广泛评估有力地验证了CLIBE的有效性。我们还证明了CLIBE对各种自适应攻击的健壮性。此外,我们使用CLIBE仔细检查了Hugging Face上的49个流行Transformer模型,发现其中一个模型显示出包含动态后门的高概率。我们已经联系了Hugging Face,并提供了该模型后门行为的详细证据。此外,我们扩展了CLIBE,用于检测被修改为表现出有毒行为的后门文本生成模型。据我们所知,CLIBE是第一个能够在不访问触发器输入测试样本的情况下检测文本生成模型中后门的框架。
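
The detection step described above can be pictured as: optimize a small weight perturbation so that a handful of reference samples flip to a target label, then test whether the perturbation generalizes to unseen inputs (strong generalization is treated as backdoor evidence). The sketch below is a heavily simplified, hypothetical stand-in: a single linear projection plays the role of an attention projection inside a Transformer, and the loss weighting is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in "suspect model": a projection layer plus a classifier head.
# In CLIBE the perturbation lives inside the attention layers of a
# Transformer; the single linear layer here is only an illustration.
d, n_classes, target_label = 16, 3, 2
proj, head = nn.Linear(d, d), nn.Linear(d, n_classes)
for p in list(proj.parameters()) + list(head.parameters()):
    p.requires_grad_(False)

refs = torch.randn(8, d)       # few reference samples used to craft the perturbation
held_out = torch.randn(64, d)  # unseen samples used to test generalization

delta = torch.zeros_like(proj.weight, requires_grad=True)  # "few-shot" weight perturbation
opt = torch.optim.Adam([delta], lr=5e-2)
targets = torch.full((refs.size(0),), target_label, dtype=torch.long)

for _ in range(200):
    logits = head(F.linear(refs, proj.weight + delta, proj.bias))
    # Push the references toward the target label while keeping delta small.
    loss = F.cross_entropy(logits, targets) + 0.1 * delta.norm()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = head(F.linear(held_out, proj.weight + delta, proj.bias)).argmax(-1)
    gen_rate = (preds == target_label).float().mean().item()

# A perturbation that generalizes to many unseen samples would be taken
# as evidence that the suspect model contains a (dynamic) backdoor.
print(f"generalization rate to target label: {gen_rate:.2f}")
```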

[NLP-54] Real World Conversational Entity Linking Requires More Than Zeroshots
[NLP-54] 现实世界对话实体链接需要的不仅仅是Zeroshots

链接: https://arxiv.org/abs/2409.01152
作者: Mohanna Hoveyda,Arjen P. de Vries,Maarten de Rijke,Faegheh Hasibi
关键词-EN: sparse knowledge bases, conversations faces notable, faces notable challenges, practical applications, primarily due
关键词-ZH: 知识库稀疏,对话面临显着,面临显着挑战,实际应用,主要是由于
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to the scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models’ ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance. Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation setup and the dataset proposed in this research are made publicly available.
摘要:会话中的实体链接(EL)在实际应用中面临着显著的挑战,这主要是由于缺乏实体标注的会话数据集,以及包含特定领域长尾实体的知识库较为稀疏。我们设计了有针对性的评估场景来衡量资源约束下EL模型的有效性。我们的评估使用了两个知识库:体现真实世界EL复杂性的Fandom,以及广泛使用的维基百科。首先,我们利用Fandom和一个新的零样本对话实体链接数据集,评估EL模型泛化到新的陌生知识库的能力,该数据集是基于Reddit上关于Fandom实体的讨论整理而成的。然后,我们评估了EL模型在没有事先训练的情况下对会话环境的适应性。我们的结果表明,当前的零样本EL模型在没有事先训练的情况下被引入新的、特定领域的知识库时会步履蹒跚,性能显著下降。我们的研究结果表明,以前的评估方法不能捕捉到真实世界中零样本EL的复杂性,这突显了需要新的方法来设计和评估能够适应有限资源的会话EL模型。本研究中提出的评估设置和数据集已经公开。

[NLP-55] Pre-Trained Language Models for Keyphrase Prediction: A Review
[NLP-55] 关键短语预测的预训练语言模型:评论

链接: https://arxiv.org/abs/2409.01087
作者: Muhammad Umair,Tangina Sultana,Young-Koo Lee
关键词-EN: Natural Language Processing, summarize its content, recent Natural Language, essential for identifying, Keyphrase Prediction
关键词-ZH: 自然语言处理,总结其内容,最新的自然语言,对于识别至关重要,关键词预测
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Keyphrase Prediction (KP) is essential for identifying keyphrases in a document that can summarize its content. Recent Natural Language Processing (NLP) advances have developed more efficient KP models using deep learning techniques. The lack of a comprehensive exploration that jointly covers both keyphrase extraction and generation using pre-trained language models spotlights a critical gap in the literature, compelling our survey paper to bridge this deficiency and offer a unified and in-depth analysis to address limitations in previous surveys. This paper extensively examines the topic of pre-trained language models for keyphrase prediction (PLM-KP), which are trained on large text corpora via different learning (supervised, unsupervised, semi-supervised, and self-supervised) techniques, to provide respective insights into these two types of tasks in NLP, namely, Keyphrase Extraction (KPE) and Keyphrase Generation (KPG). We introduce appropriate taxonomies for PLM-KPE and KPG to highlight these two main tasks of NLP. Moreover, we point out some promising future directions for predicting keyphrases.
摘要:关键短语预测(KP)是识别文档中能够总结其内容的关键短语的关键。然而,最近的自然语言处理(NLP)的进展已经开发出使用深度学习技术的更有效的KP模型。使用预先训练的语言模型对关键词提取和生成进行联合全面探索的局限性突显了文献中的一个关键差距,迫使我们的调查论文弥补这一不足,并提供统一和深入的分析,以解决以前调查中的局限性。本文深入研究了用于关键词预测的预训练语言模型(PLM-KP),这些模型通过不同的学习技术(监督式、非监督式、半监督式和自监督式)在大型文本语料库上进行训练,以提供对NLP中这两类任务–准确地说是关键短语提取(KPE)和关键短语生成(KPG)–的各自见解。我们为PLM-KPE和KPG引入了适当的分类,以突出NLP的这两个主要任务。此外,我们还指出了未来关键词预测的一些有前途的方向。

[NLP-56] SCOPE: Sign Language Contextual Processing with Embedding from LLMs
[NLP-56] 范围:采用LLM嵌入的手语上下文处理

链接: https://arxiv.org/abs/2409.01073
作者: Yuqi Liu,Wenqian Zhang,Sihan Ren,Chengyu Huang,Jingyi Yu,Lan Xu
关键词-EN: million Deaf individuals, Deaf individuals globally, sign language, individuals globally, convey visual
关键词-ZH: 百万聋人,全球聋人,手语,全球人,传达视觉
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.
摘要:手语是一种传达视觉和语境信息的视觉语言,全球约有7000万聋人使用手语。目前基于视觉的手语识别(SLR)和手语翻译(SLT)方法由于数据集多样性有限以及忽视了与上下文相关的信息,难以处理对话场景。为了应对这些挑战,我们引入了SCOPE(利用LLM嵌入的手语上下文处理),这是一种新颖的上下文感知、基于视觉的SLR和SLT框架。对于SLR,我们通过多模态编码器利用对话上下文来增强gloss级别的识别。对于后续的SLT,我们通过纳入先前的会话上下文来进一步微调大型语言模型(LLM)。我们还贡献了一个新的手语数据集,其中包含不同场景下上下文对话中72小时的中文手语视频。实验结果表明,我们的SCOPE框架在包括Phoenix-2014T、CSL-Daily和我们的SCOPE数据集在内的多个数据集上取得了最好的性能。此外,与聋人社区参与者进行的调查进一步验证了我们方法在现实世界应用中的健壮性和有效性。我们的数据集和代码都将开源,以便于进一步的研究。

[NLP-57] VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
[NLP-57] VideoLLaMB:使用循环记忆桥的长上下文视频理解

链接: https://arxiv.org/abs/2409.01071
作者: Yuxuan Wang,Cihang Xie,Yang Liu,Zilong Zheng
关键词-EN: shown significant potential, Recent advancements, detailed interactions, advancements in large-scale, shown significant
关键词-ZH: 显示出巨大的潜力,最近的进步、详细的互动、大规模的进步,显示出显着的
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5 points improvement over its competitors across three VideoQA benchmarks, and 2.06 points on egocentric planning. Comprehensive results on the MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models of same LLM. Remarkably, it maintains robust performance as PLLaVA even as video length increases up to 8 times. Besides, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark, further validate VideoLLaMB’s prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.
摘要:大规模视频语言模型的最新进展显示出在实时规划和详细交互方面的巨大潜力。然而,它们对计算的高要求和标注数据集的稀缺限制了它们对学术研究人员的实用性。在这项工作中,我们引入了一种新颖的框架VideoLLaMB,它利用桥接层中的时间记忆标记,允许对整个视频序列和历史视觉数据进行编码,有效地保持了语义的连续性,并提高了模型在不同任务中的性能。这种方法包括循环记忆标记和SceneTilling算法,该算法将视频分割为独立的语义单元,以保持语义完整性。从经验来看,VideoLLaMB大大超过了现有的视频语言模型,在三个视频QA基准中比竞争对手提高了5.5个百分点,在自我中心规划方面提高了2.06个百分点。在MVBench上的综合结果表明,VideoLLaMB-7B的效果明显好于基于相同LLM的以前的7B模型。值得注意的是,即使视频长度增加到8倍,它仍保持与PLLaVA一样的稳健性能。此外,在我们专门构建的Needle in a Video Haystack(NIAVH)基准上的帧检索结果,进一步验证了VideoLLaMB在准确识别较长视频中特定帧方面的能力。我们的SceneTilling算法还支持直接生成流视频字幕,而不需要额外的训练。在效率方面,在16帧上训练的VideoLLaMB在单个NVIDIA A100 GPU上支持多达320帧,且GPU显存线性扩展,确保了高性能和高性价比,从而为学术和实际应用中的长视频语言模型奠定了新的基础。
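
To make the segmentation idea behind SceneTilling more concrete, here is a toy boundary detector over per-frame features: a new semantic unit starts wherever adjacent frames stop being similar. The cosine threshold and the synthetic "scenes" are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def segment_by_similarity(frame_feats: np.ndarray, threshold: float = 0.8):
    """Split a (T, D) sequence of frame features into contiguous segments,
    starting a new segment whenever the cosine similarity between
    neighbouring frames falls below `threshold`."""
    normed = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-9)
    sims = (normed[1:] * normed[:-1]).sum(axis=1)  # cosine similarity of adjacent frames
    cuts = [0] + [t + 1 for t, s in enumerate(sims) if s < threshold] + [len(frame_feats)]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

# Toy usage: three synthetic "scenes" of 10 frames each with distinct features.
rng = np.random.default_rng(0)
scenes = [rng.normal(loc=c, scale=0.05, size=(10, 32)) for c in (1.0, -1.0, 3.0)]
video = np.concatenate(scenes)
print(segment_by_similarity(video))  # expected: [(0, 10), (10, 20), (20, 30)]
```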

[NLP-58] A Perspective on Literary Metaphor in the Context of Generative AI ECAI2024
[NLP-58] 生成人工智能语境下的文学隐喻透视

链接: https://arxiv.org/abs/2409.01053
作者: Imke van Heerden,Anil Bas
关键词-EN: range of meanings, intersection of creative, study explores, explores the role, capacity to generate
关键词-ZH: 含义范围、创意的交叉、研究探索、探索角色、产生能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as oral presentation to Workshop on Artificial Intelligence and Creativity (CREAI) at ECAI 2024

点击查看摘要

Abstract:At the intersection of creative text generation and literary theory, this study explores the role of literary metaphor and its capacity to generate a range of meanings. In this regard, literary metaphor is vital to the development of any particular language. To investigate whether the inclusion of original figurative language improves textual quality, we trained an LSTM-based language model in Afrikaans. The network produces phrases containing compellingly novel figures of speech. Specifically, the emphasis falls on how AI might be utilised as a defamiliarisation technique, which disrupts expected uses of language to augment poetic expression. Providing a literary perspective on text generation, the paper raises thought-provoking questions on aesthetic value, interpretation and evaluation.
摘要:在创造性文本生成和文学理论的交叉点上,本研究探讨了文学隐喻的作用及其产生一系列含义的能力。在这方面,文学隐喻对于任何特定语言的发展都至关重要。为了研究包含原始比喻语言是否可以提高文本质量,我们用南非荷兰语训练了一个基于LSTM的语言模型。该网络产生的短语包含令人信服的新颖修辞格。具体来说,重点是如何利用人工智能作为一种陌生化技术,这会扰乱预期的语言使用以增强诗歌表达。论文从文学角度探讨文本生成,提出了有关审美价值、解释和评价的发人深省的问题。

[NLP-59] NYK-MS: A Well-annotated Multi-modal Metaphor and Sarcasm Understanding Benchmark on Cartoon-Caption Dataset
[NLP-59] NYK-MS:卡通字幕数据集注释良好的多模式隐喻和讽刺理解基准

链接: https://arxiv.org/abs/2409.01037
作者: Ke Chang,Hao Li,Junzhao Zhang,Yunfang Wu
关键词-EN: common figurative expressions, metaphor understanding tasks, Metaphor, people communication, popular among teenagers
关键词-ZH: 常见的比喻表达、隐喻理解任务、隐喻、人际沟通、受青少年欢迎
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Metaphor and sarcasm are common figurative expressions in people’s communication, especially on the Internet or the memes popular among teenagers. We create a new benchmark named NYK-MS (NewYorKer for Metaphor and Sarcasm), which contains 1,583 samples for metaphor understanding tasks and 1,578 samples for sarcasm understanding tasks. These tasks include whether it contains metaphor/sarcasm, which word or object contains metaphor/sarcasm, what does it satirize and why does it contains metaphor/sarcasm, all of the 7 tasks are well-annotated by at least 3 annotators. We annotate the dataset for several rounds to improve the consistency and quality, and use GUI and GPT-4V to raise our efficiency. Based on the benchmark, we conduct plenty of experiments. In the zero-shot experiments, we show that Large Language Models (LLM) and Large Multi-modal Models (LMM) can’t do classification task well, and as the scale increases, the performance on other 5 tasks improves. In the experiments on traditional pre-train models, we show the enhancement with augment and alignment methods, which prove our benchmark is consistent with previous dataset and requires the model to understand both of the two modalities.
摘要:隐喻和讽刺是人们交际中常见的修辞手段,尤其是在互联网上或青少年中流行的表情包中。我们创建了一个名为NYK-MS的新基准,该基准包含1,583个隐喻理解任务样本和1,578个讽刺理解任务样本。这些任务包括它是否包含隐喻/讽刺,哪个词或对象包含隐喻/讽刺,它讽刺了什么,为什么它包含隐喻/讽刺,所有7个任务都由至少3个注释者进行了很好的注释。我们对数据集进行了多次标注,以提高一致性和质量,并使用图形用户界面和GPT-4V来提高效率。在基准测试的基础上,进行了大量的实验。在零样本实验中,我们发现大语言模型(LLM)和大型多模态模型(LMM)不能很好地完成分类任务,并且随着规模的增加,在其他5个任务上的性能都有所提高。在传统预训练模型上的实验中,我们展示了数据增强和对齐方法带来的提升,这证明了我们的基准与先前的数据集是一致的,并且要求模型能够同时理解这两种模态。

[NLP-60] Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning
[NLP-60] 在参数高效微调中释放特定任务方向的力量

链接: https://arxiv.org/abs/2409.01035
作者: Chongjie Si,Zhiyi Shi,Shifan Zhang,Xiaokang Yang,Hanspeter Pfister,Wei Shen
关键词-EN: demonstrate impressive performance, Parameter Efficient Fine-Tuning, language models demonstrate, models demonstrate impressive, requiring extensive resource
关键词-ZH: 展示令人印象深刻的性能、参数高效微调、语言模型展示、模型展示令人印象深刻、需要大量资源
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Revisions ongoing. Codes in this https URL

点击查看摘要

Abstract:Large language models demonstrate impressive performance on downstream tasks, yet requiring extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions–critical for transitioning large models from pre-trained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties, and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of task-specific directions during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash. The code is available at this https URL.
摘要:大型语言模型在下游任务上表现出令人印象深刻的性能,但在完全微调所有参数时需要大量资源消耗。为了缓解这种情况,开发了参数高效微调(PEFT)策略,例如LoRA。在本文中,我们深入研究了特定任务方向的概念–这对于PEFT中将大型模型从预训练状态过渡到特定任务增强至关重要。我们提出了一个框架来清楚地定义这些方向并探索它们的属性和实际利用挑战。然后,我们引入了一种新颖的方法LoRA-Dash,其目的是在微调过程中最大限度地发挥特定任务方向的影响,从而增强目标任务的模型性能。大量实验最终证明了LoRA-Dash的有效性,深入分析进一步揭示了LoRA-Dash的潜在机制。该代码可在此httpsURL中找到。
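
For readers unfamiliar with the PEFT setting the abstract builds on, a minimal LoRA-style layer looks like the sketch below: the pre-trained weight stays frozen and only a low-rank update B·A is trained. LoRA-Dash's extra step of identifying and amplifying task-specific directions is only described at a high level in the abstract, so it is not reproduced here; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x.
    Plain LoRA; LoRA-Dash would additionally steer the update along
    task-specific directions identified during fine-tuning (not shown)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Toy usage: only the low-rank factors A and B receive gradients.
layer = LoRALinear(nn.Linear(64, 64))
layer(torch.randn(4, 64)).sum().backward()
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```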

[NLP-61] Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts
[NLP-61] 楚简多模式多粒度代币

链接: https://arxiv.org/abs/2409.01011
作者: Yingfa Chen,Chenlong Hu,Cong Feng,Chenyang Song,Shi Yu,Xu Han,Zhiyuan Liu,Maosong Sun
关键词-EN: Warring States period, Chu bamboo slip, ancient Chinese scripts, analyzing ancient Chinese, Spring and Autumn
关键词-ZH: 春秋时期,楚简,古代文字,分析古代汉语,春秋
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts.
摘要:本研究以中国古代春秋战国时期(公元前771年-前256年)的楚简(CBS)文字为研究对象,提出了一种专门用于分析中国古代文字的多模态多粒度分词器。考虑到古汉字复杂的层次结构,单个字符可能是多个子字符的组合,我们的分词器首先采用字符检测来定位字符边界,然后在字符和子字符两个级别上进行字符识别。此外,为了支持学术界,我们还构建了第一个大规模楚简数据集,其中包含超过10万个带注释的字符图像扫描。在基于我们数据集构建的词性标注任务上,与主流的子词分词器相比,使用我们的分词器使F1分数相对提高了5.5%。我们的工作不仅有助于对这一特定文字的进一步研究,而且还有可能促进对其他形式的中国古代文字的研究。

[NLP-62] DataSculpt: Crafting Data Landscapes for LLM Post-Training through Multi-objective Partitioning
[NLP-62] 数据雕塑:通过多目标分区为LLM后培训打造数据景观

链接: https://arxiv.org/abs/2409.00997
作者: Keer Lu,Zheng Liang,Xiaonan Nie,Da Pan,Shusen Zhang,Keshi Zhao,Weipeng Chen,Zenan Zhou,Guosheng Dong,Wentao Zhang,Bin Cui
关键词-EN: Large Language Models, Language Models, Large Language, important for Large, modeling is important
关键词-ZH: 大型语言模型,语言模型,大型语言,对于大型来说很重要,建模很重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The effectiveness of long-context modeling is important for Large Language Models (LLMs) in various applications. Despite their potential, LLMs’ efficacy in processing long context does not consistently meet expectations, posing significant challenges for efficient management of prolonged sequences in training. This difficulty is compounded by the scarcity of comprehensive and diverse training datasets suitable for long sequences, which stems from inherent length biases across different data sources, and the logistical complexities associated with massive data management for training in extended contexts. In this work, we introduce DataSculpt, a data construction framework designed to strategically augment the data architecture for extended-context training. Our thorough evaluations demonstrate DataSculpt’s remarkable capacity to boost long-context training performance, achieving improvements including an 18.09% increase in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and a 3.81% rise in code completion, all while preserving the models’ overall proficiency with a 4.88% improvement.
摘要:长上下文建模的有效性对于各种应用中的大型语言模型(LLM)非常重要。尽管LLMS有潜力,但其在处理长语境方面的效率并不总是符合预期,这给训练中有效管理长序列带来了巨大的挑战。由于缺乏适用于长序列的全面和多样化的训练数据集,这源于不同数据源之间固有的长度偏差,以及与扩展环境中训练的海量数据管理相关的后勤复杂性,加剧了这一困难。在这项工作中,我们引入了DataSculpt,这是一个数据构建框架,旨在战略性地增强扩展上下文培训的数据体系结构。我们的全面评估表明,DataSculpt具有显著的提高长上下文训练性能的能力,实现了改进,包括检索增强提高了18.09%,摘要提高了21.23%,阅读理解提高了21.27%,代码完成提高了3.81%,所有这些都保持了模型的整体熟练程度,提高了4.88%。

[NLP-63] Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
[NLP-63] 个性化唇读:用视觉和语言适应独特的唇动

链接: https://arxiv.org/abs/2409.00986
作者: Jeong Hun Yeo,Chae Won Kim,Hyunjun Kim,Hyeongseop Rha,Seunghee Han,Wen-Huang Cheng,Yong Man Ro
关键词-EN: Lip reading, analyzing lip movements, lip reading model, Lip reading aims, lip reading technologies
关键词-ZH: 唇读、分析嘴唇运动、唇读模型、唇读目标、唇读技术
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: Code available: this https URL

点击查看摘要

Abstract:Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearances. To address this challenge, speaker adaptive lip reading technologies have advanced by focusing on effectively adapting a lip reading model to target speakers in the visual modality. The effectiveness of adapting language information, such as vocabulary choice, of the target speaker has not been explored in the previous works. Moreover, existing datasets for speaker adaptation have limited vocabulary size and pose variations, limiting the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables the validation of adaptation methods in wild, sentence-level lip reading for the first time. Through various experiments, we demonstrate that the existing speaker-adaptive method also improves performance in the wild at the sentence level. Moreover, with the proposed adaptation method, we show that the proposed method achieves larger improvements when applied to the target speaker, compared to the previous works.
摘要:唇读的目的是通过分析唇动来预测口语。尽管唇读技术有所进步,但当模型应用于未见过的说话者时,由于它们对嘴唇外观等视觉信息的变化敏感,性能会下降。为了应对这一挑战,说话人自适应唇读技术通过专注于在视觉模态上有效地使唇读模型适应目标说话人而不断进步。然而,以往的研究并没有探讨对目标说话人的词汇选择等语言信息进行适应的有效性。此外,现有的说话人自适应数据集的词汇量和姿势变化有限,限制了以往说话人自适应方法在现实场景中的验证。为了解决这些问题,我们提出了一种新颖的说话人自适应唇读方法,该方法在视觉和语言两个层面上使预训练模型适应目标说话人。具体地说,我们将提示调整和LoRA方法相结合,将它们应用于预先训练的唇读模型,以有效地使该模型适应目标说话人。此外,为了在真实场景中验证其有效性,我们引入了一个新的数据集VoxLRS-SA,它源自VoxCeleb2和LRS3。它包含大约100K个单词的词汇量,提供了多样的姿势变化,并首次支持在真实场景的句子级别唇读中验证自适应方法。通过各种实验,我们证明了现有的说话人自适应方法在真实场景的句子级别任务上同样能够提升性能。此外,与以往工作相比,本文提出的自适应方法在应用于目标说话人时取得了更大的改进。

[NLP-64] What does it take to get state of the art in simultaneous speech-to-speech translation?
[NLP-64] 如何才能达到语音同步翻译的最新水平?

链接: https://arxiv.org/abs/2409.00965
作者: Vincent Wilmet,Johnson Du
关键词-EN: latency characteristics observed, observed in simultaneous, hallucination-induced latency spikes, paper presents, presents an in-depth
关键词-ZH: 论文提出,观察到的潜伏期特征,同时观察到幻觉引起的潜伏期峰值,呈现了深入的研究
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents an in-depth analysis of the latency characteristics observed in simultaneous speech-to-speech model’s performance, particularly focusing on hallucination-induced latency spikes. By systematically experimenting with various input parameters and conditions, we propose methods to minimize latency spikes and improve overall performance. The findings suggest that a combination of careful input management and strategic parameter adjustments can significantly enhance speech-to-speech model’s latency behavior.
摘要:本文对同时语音到语音模型的性能中观察到的延迟特征进行了深入分析,特别关注幻觉引起的延迟峰值。通过系统地实验各种输入参数和条件,我们提出了最大限度地减少延迟峰值并提高整体性能的方法。研究结果表明,仔细的输入管理和战略参数调整相结合可以显着增强语音到语音模型的延迟行为。

[NLP-65] Large Language Models for Automatic Detection of Sensitive Topics
[NLP-65] 用于自动检测敏感话题的大型语言模型

链接: https://arxiv.org/abs/2409.00940
作者: Ruoyu Wen,Stephanie Elena Crowe,Kunal Gupta,Xinyue Li,Mark Billinghurst,Simon Hoermann,Dwain Allan,Alaeddin Nassani,Thammathip Piumsomboon
关键词-EN: safe online communities, maintain safe online, Sensitive information detection, maintain safe, Sensitive information
关键词-ZH: 安全的在线社区,维护安全的在线,敏感信息检测,维护安全,敏感信息
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2024 Oz CHI conference

点击查看摘要

Abstract:Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus solely on flagged content that may pose potential risks. Rapidly advancing large language models (LLMs) are known for their capability to understand and process natural language and so present a potential solution to support this process. This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain within two online datasets and assesses their performance in terms of accuracy, precision, recall, F1 scores, and consistency. Our findings indicate that LLMs have the potential to be integrated into the moderation workflow as a convenient and precise detection tool. The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99. We discuss the advantages and potential challenges of using LLMs in the moderation workflow and suggest that future research should address the ethical considerations of utilising this technology.
摘要:敏感信息检测是内容审核中维护在线社区安全的关键。协助这一传统的手动过程可以将人类版主从不堪重负的繁琐任务中解放出来,使他们能够只专注于可能构成潜在风险的标记内容。快速发展的大型语言模型(LLM)以其理解和处理自然语言的能力而闻名,因此提供了一种潜在的解决方案来支持这一过程。本研究探讨了五种LLM在两个在线数据集中检测心理健康领域敏感信息的能力,并从准确率、精确率、召回率、F1分数和一致性方面评估了它们的表现。我们的发现表明,LLM作为一种方便和精确的检测工具,有可能被整合到审核工作流程中。性能最好的GPT-4o模型的平均准确率为99.5%,F1得分为0.99。我们讨论了在审核工作流程中使用LLM的优势和潜在的挑战,并建议未来的研究应该解决使用这项技术的伦理考虑。

[NLP-66] Self-Judge: Selective Instruction Following with Alignment Self-Evaluation
[NLP-66] 自我判断:选择性教学,调整自我评估

链接: https://arxiv.org/abs/2409.00935
作者: Hai Ye,Hwee Tou Ng
关键词-EN: Pre-trained large language, Pre-trained large, large language models, large language, tailored to adhere
关键词-ZH: 预训练的大型语言,预训练的大型语言模型,大型语言,量身定制以遵守
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model’s inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data. It incorporates a gold reference answer to facilitate self-evaluation and recalibrates by assessing the semantic similarity between the response sample and the gold reference. During the training phase, we implement self-distillation as a regularization technique to enhance the capability of reference-free estimation. To validate alignment evaluation on general instruction-following tasks, we collect large-scale high-quality instructions from Hugging Face for model training and evaluation. Extensive experiments on five open-source models show that our method correlates much more with GPT-4 than strong baselines, e.g., supervised models distilled from GPT-4 and GPT-3.5-turbo. Our analysis shows our model’s strong generalization across domains. Additionally, our judge models serve as good reward models, e.g., boosting WizardLM-13B-V1.2 from 89.17 to 92.48 and from 12.03 to 15.90 in version v1 and v2 of AlpacaEval respectively using best-of-32 sampling with our judge models.
摘要:预先训练的大型语言模型(LLM)可以通过指令调优来定制,以符合人类的指令。然而,由于测试时数据分布的变化,它们可能并不总是准确地执行指令,在充当聊天助手时可能会产生事实错误或内容不对齐。为了提高LLM在遵循指令时的可靠性,我们提出了选择性指令遵循的研究,即如果预期响应质量较低,则系统拒绝执行指令。我们训练评判模型,这些模型可以预测模型响应的数值质量分数。为了解决数据稀缺的问题,我们引入了Self-J,这是一个新的自我训练框架,用于在不需要人工标注质量分数的情况下开发评判模型。我们的方法利用模型固有的自我评估能力,从标注的指令调优数据中提取关于响应质量的信息。它结合了金标准(gold)参考答案,以便于自我评估,并通过评估响应样本与金标准参考之间的语义相似性进行重新校准。在训练阶段,我们将自蒸馏作为一种正则化技术来增强无参考估计的能力。为了验证在一般指令遵循任务上的对齐评估,我们从Hugging Face收集了大量高质量指令进行模型训练和评估。在五个开源模型上的广泛实验表明,我们的方法与GPT-4的相关性远远高于强基线,例如从GPT-4和GPT-3.5-Turbo蒸馏得到的监督模型。我们的分析表明我们的模型具有很强的跨域泛化能力。此外,我们的评判模型可以作为良好的奖励模型,例如,在AlpacaEval的v1和v2版本中,使用我们的评判模型进行best-of-32采样,分别将WizardLM-13B-V1.2的得分从89.17提高到92.48、从12.03提高到15.90。
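
The recalibration idea described above can be sketched as blending the model's own quality estimate with the semantic similarity between its response and the gold reference. The linear blend, the 0-1 rescaling of the cosine and the random stand-in embeddings below are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def recalibrated_quality(self_eval_score: float,
                         response_emb: np.ndarray,
                         gold_emb: np.ndarray,
                         weight: float = 0.5) -> float:
    """Blend a model's self-evaluation score (assumed in [0, 1]) with the
    semantic similarity between its response and a gold reference answer."""
    cos = float(response_emb @ gold_emb /
                (np.linalg.norm(response_emb) * np.linalg.norm(gold_emb) + 1e-9))
    similarity = (cos + 1.0) / 2.0  # map cosine from [-1, 1] to [0, 1]
    return weight * self_eval_score + (1.0 - weight) * similarity

# Toy usage with random stand-in vectors; a real system would embed the
# response and the gold reference answer with a sentence encoder.
rng = np.random.default_rng(0)
resp, gold = rng.normal(size=128), rng.normal(size=128)
print(round(recalibrated_quality(0.8, resp, gold), 3))
```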

[NLP-67] oolACE: Winning the Points of LLM Function Calling
[NLP-67] oolACE:赢得LLM函数调用的积分

链接: https://arxiv.org/abs/2409.00920
作者: Weiwen Liu,Xu Huang,Xingshan Zeng,Xinlong Hao,Shuai Yu,Dexun Li,Shuai Wang,Weinan Gan,Zhengying Liu,Yuanqing Yu,Zezhong Wang,Yuxian Wang,Wu Ning,Yutai Hou,Bin Wang,Chuhan Wu,Xinzhi Wang,Yong Liu,Yasheng Wang,Duyu Tang,Dandan Tu,Lifeng Shang,Xin Jiang,Ruiming Tang,Defu Lian,Qun Liu,Enhong Chen
关键词-EN: Function calling significantly, calling significantly extends, Function calling, large language models, unlocking this capability
关键词-ZH: 显着的函数调用,显着的调用扩展,函数调用,大型语言模型,解锁这种能力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 22 figures

点击查看摘要

Abstract:Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at this https URL.
摘要:函数调用极大地扩展了大型语言模型的应用边界,其中高质量和多样化的训练数据是解锁这一能力的关键。然而,真正的函数调用数据很难收集和注释,而现有管道生成的合成数据往往缺乏覆盖率和准确性。在本文中,我们介绍了ToolACE,这是一个自动代理管道,旨在生成准确、复杂和多样化的工具学习数据。ToolACE利用一种新颖的自我进化合成过程来管理一个包含26,507种不同API的综合API池。在形式化思维过程的指导下,通过多个代理之间的相互作用进一步生成对话。为了保证数据的准确性,我们实现了基于规则和基于模型的双层验证系统。我们证明,在我们的合成数据上训练的模型,即使只有8B参数,在伯克利函数调用排行榜上也达到了最先进的性能,可以与最新的GPT-4模型相媲美。我们的模型和数据的子集在此HTTPS URL上公开可用。

[NLP-68] User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning
[NLP-68] 具有用户配置文件感知预训练模型和参数高效微调的用户特定对话生成

链接: https://arxiv.org/abs/2409.00887
作者: Atsushi Otsuka,Kazuya Matsuo,Ryo Ishii,Narichika Nomoto,Hiroaki Sugiyama
关键词-EN: addresses user-specific dialogs, paper addresses user-specific, paper addresses, model, dialogue
关键词-ZH: 地址特定于用户的对话框、特定于用户的纸张地址、纸张地址、模型、对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper addresses user-specific dialogs. In contrast to previous research on personalized dialogue focused on achieving virtual user dialogue as defined by persona descriptions, user-specific dialogue aims to reproduce real-user dialogue beyond persona-based dialogue. Fine-tuning using the target user’s dialogue history is an efficient learning method for a user-specific model. However, it is prone to overfitting and model destruction due to the small amount of data. Therefore, we propose a learning method for user-specific models by combining parameter-efficient fine-tuning with a pre-trained dialogue model that includes user profiles. Parameter-efficient fine-tuning adds a small number of parameters to the entire model, so even small amounts of training data can be trained efficiently and are robust to model destruction. In addition, the pre-trained model, which is learned by adding simple prompts for automatically inferred user profiles, can generate speech with enhanced knowledge of the user’s profile, even when there is little training data during fine-tuning. In experiments, we compared the proposed model with large-language-model utterance generation using prompts containing users’ personal information. Experiments reproducing real users’ utterances revealed that the proposed model can generate utterances with higher reproducibility than the compared methods, even with a small model.
摘要:本文介绍特定于用户的对话框。与以往关于个性化对话的研究侧重于实现人物角色描述所定义的虚拟用户对话不同,用户特定对话的目标是在基于人物角色的对话之外再现真实用户对话。使用目标用户的对话历史进行微调是特定于用户的模型的有效学习方法。但由于数据量较小,容易出现过拟合和模型破坏的情况。因此,我们提出了一种针对特定用户模型的学习方法,该方法将参数高效的微调与包含用户配置文件的预训练对话模型相结合。参数高效微调将少量参数添加到整个模型中,因此即使是少量的训练数据也可以有效地训练,并且对模型破坏具有健壮性。此外,通过添加用于自动推断的用户简档的简单提示来学习的预训练模型可以生成具有对用户简档的增强知识的语音,即使在微调期间几乎没有训练数据的情况下也是如此。在实验中,我们将所提出的模型与包含用户个人信息的提示的大语言模型话语生成进行了比较。对真实用户话语的再现实验表明,即使是在较小的模型下,该模型也可以生成比比较方法更高的再现性。

[NLP-69] Self-evolving Agents with reflective and memory-augmented abilities
[NLP-69] 具有反思和记忆增强能力的自我进化代理

链接: https://arxiv.org/abs/2409.00872
作者: Xuechen Liang,Meiling Tao,Yinghui Xia,Tianyu Shi,Jun Wang,JingSong Yang
关键词-EN: Large language models, natural language processing, made significant advances, Large language, continuous decision-making
关键词-ZH: 大型语言模型、自然语言处理取得重大进展、大型语言、持续决策
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making. In this research, we propose a novel framework by integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents’ capabilities in handling multi-tasking and long-span information.
摘要:大型语言模型(LLM)在自然语言处理领域取得了重大进展,但仍然面临持续决策等挑战。在这项研究中,我们提出了一个新颖的框架,通过集成迭代反馈、反思机制和基于艾宾浩斯遗忘曲线的记忆优化机制,显著增强了智能体处理多任务和长跨度信息的能力。
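
As a concrete reading of the Ebbinghaus-style memory mechanism mentioned above, the toy store below lets entries decay along a retention curve R(t) = exp(-t / S) and prunes those that fall below a threshold, while recalling an entry resets its clock and strengthens it. The decay constant, the 1.5x reinforcement factor and the pruning threshold are illustrative assumptions, not the paper's implementation.

```python
import math
import time
from typing import Dict, Optional, Tuple

class ForgettingMemory:
    """Toy agent memory whose entries decay along an Ebbinghaus-style
    retention curve R(t) = exp(-t / S); recall resets the clock and
    strengthens the entry so it is forgotten more slowly."""
    def __init__(self, base_strength: float = 60.0, keep_threshold: float = 0.3):
        self.items: Dict[str, Tuple[str, float, float]] = {}  # key -> (content, last_access, strength)
        self.base_strength = base_strength    # seconds until retention drops to ~0.37
        self.keep_threshold = keep_threshold

    def store(self, key: str, content: str) -> None:
        self.items[key] = (content, time.time(), self.base_strength)

    def recall(self, key: str) -> Optional[str]:
        if key not in self.items:
            return None
        content, _, strength = self.items[key]
        # Reinforcement: reset the clock and increase strength (slower forgetting).
        self.items[key] = (content, time.time(), strength * 1.5)
        return content

    def prune(self) -> None:
        now = time.time()
        self.items = {k: v for k, v in self.items.items()
                      if math.exp(-(now - v[1]) / v[2]) >= self.keep_threshold}

mem = ForgettingMemory()
mem.store("task", "summarise the meeting notes")
print(mem.recall("task"))   # recently stored and reinforced -> retained
mem.prune()
print(list(mem.items))      # ['task']
```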

[NLP-70] Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering
[NLP-70] 利用半结构化知识和LLM的力量以及基于三重组的预过滤进行问题解答

链接: https://arxiv.org/abs/2409.00861
作者: Derian Boer,Fabian Koch,Stefan Kramer
关键词-EN: Large Language Models, Large Language, frequently lack domain-specific, fine-tuned models tend, lack domain-specific knowledge
关键词-ZH: 大型语言模型,大型语言,经常缺乏特定领域、微调模型倾向,缺乏特定领域的知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 9 pages, published at IJCLR 2024

点击查看摘要

Abstract:Large Language Models (LLMs) frequently lack domain-specific knowledge and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge making use of the model’s ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers by triplets-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the steps: 1) Triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. Used source code is available at this https URL.
摘要:大型语言模型(LLM)往往缺乏特定领域的知识,即使是微调的模型也容易产生幻觉。因此,需要能够包含外部知识的更可靠的模型。我们提出了一个流水线,4StepFocus,具体地说,是一个预处理步骤,可以显著提高LLMS的答案。这是通过提供对外部知识的引导访问来实现的,利用模型捕获关系上下文并自行进行基本推理的能力。该方法通过在半结构化知识库中以直接、可跟踪的方式基于三元组的搜索来缩小潜在正确答案的范围,然后切换到基于非结构化数据的潜在表示来对这些候选进行排名。这与纯粹基于潜在表征的相关方法不同。4StepFocus包括以下步骤:1)由LLM生成用于提取关系数据的三元组;2)使用知识图替换这些三元组中的变量以缩小答案候选范围;3)通过涉及相关非结构化数据的向量相似性搜索对剩余候选进行排序;4)利用LLM提供的背景数据对最佳候选进行重新排序。在医学、产品推荐和学术论文搜索测试集上的实验表明,该方法确实是一种强大的增强。它不仅从信息检索中添加了相关的可追踪背景信息,而且与最先进的方法相比,性能也有了很大的提高。这篇论文提出了一个新颖的、在很大程度上尚未探索的方向,因此提供了广泛的未来工作机会。在此HTTPS URL上可以找到使用过的源代码。
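
The four steps described above can be read as a simple pipeline. The sketch below only shows the orchestration, with placeholder callables standing in for the LLM, the knowledge graph lookup, the vector store and the reranking call; all names and the toy example are hypothetical.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object), possibly with variables such as "?x"

def four_step_focus(question: str,
                    extract_triplets: Callable[[str], List[Triplet]],    # step 1: LLM
                    ground_in_kg: Callable[[List[Triplet]], List[str]],  # step 2: knowledge graph
                    vector_rank: Callable[[str, List[str]], List[str]],  # step 3: similarity search
                    llm_rerank: Callable[[str, List[str]], List[str]],   # step 4: LLM rerank
                    top_k: int = 5) -> List[str]:
    """Chain the four stages of a triplet-based prefiltering pipeline.
    Only the orchestration is shown; each callable is a placeholder for an
    LLM call, a KG lookup, a vector store query or a reranking call."""
    triplets = extract_triplets(question)            # 1) relational structure of the question
    candidates = ground_in_kg(triplets)              # 2) substitute variables via the KG
    shortlist = vector_rank(question, candidates)    # 3) narrow down with unstructured similarity
    return llm_rerank(question, shortlist)[:top_k]   # 4) final rerank with background data

# Toy usage with trivial stand-ins for each stage.
answers = four_step_focus(
    "Which drug treats condition X?",
    extract_triplets=lambda q: [("?drug", "treats", "condition X")],
    ground_in_kg=lambda ts: ["drug A", "drug B", "drug C"],
    vector_rank=lambda q, cs: cs,
    llm_rerank=lambda q, cs: cs[::-1],
)
print(answers)  # ['drug C', 'drug B', 'drug A']
```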

[NLP-71] Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages
[NLP-71] 使用视觉数据流语言对音频编程的LLM代码生成进行基准测试

链接: https://arxiv.org/abs/2409.00856
作者: William Zhang,Maria Leon,Ryan Xu,Adrian Cardenas,Amelia Wissink,Hanna Martin,Maya Srikanth,Kaya Dorogi,Christian Valadez,Pedro Perez,Citlalli Grijalva,Corey Zhang,Mark Santolucito
关键词-EN: arts coding domains, code, media arts coding, Node-based programming languages, code generation
关键词-ZH: 艺术编码领域、代码、媒体艺术编码、基于节点的编程语言、代码生成
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Node-based programming languages are increasingly popular in media arts coding domains. These languages are designed to be accessible to users with limited coding experience, allowing them to achieve creative output without an extensive programming background. Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity. However, the best strategy for code generation for visual node-based programming languages is still an open question. In particular, such languages have multiple levels of representation in text, each of which may be used for code generation. In this work, we explore the performance of LLM code generation in audio programming tasks in visual programming languages at multiple levels of representation. We explore code generation through metaprogramming code representations for these languages (i.e., coding the language using a different high-level text-based programming language), as well as through direct node generation with JSON. We evaluate code generated in this way for two visual languages for audio programming on a benchmark set of coding problems. We measure both correctness and complexity of the generated code. We find that metaprogramming results in more semantically correct generated code, given that the code is well-formed (i.e., is syntactically correct and runs). We also find that prompting for richer metaprogramming using randomness and loops led to more complex code.
摘要:基于节点的编程语言在媒体艺术编码领域日益流行。这些语言的设计目的是让编码经验有限的用户能够访问,使他们能够在没有广泛编程背景的情况下实现创造性输出。使用基于LLM的代码生成来进一步降低创造性输出的门槛是一个令人兴奋的机会。然而,基于可视化节点的编程语言的最佳代码生成策略仍然是一个悬而未决的问题。具体地说,这种语言在文本中具有多个级别的表示,其中每个级别都可用于代码生成。在这项工作中,我们探索了在视觉编程语言的音频编程任务中,LLM代码生成在多个表示层次上的性能。我们通过这些语言的元编程代码表示(即,使用不同的基于文本的高级编程语言对语言进行编码),以及通过使用JSON直接生成节点来探索代码生成。我们在一组基准编码问题上评估了以这种方式为音频编程的两种可视语言生成的代码。我们测量生成代码的正确性和复杂性。我们发现,如果代码是格式良好的(即,语法正确并且可以运行),元编程会导致生成的代码在语义上更正确。我们还发现,使用随机性和循环提示更丰富的元编程会导致更复杂的代码。

[NLP-72] LanguaShrink: Reducing Token Overhead with Psycholinguistics
[NLP-72] 电信收缩:用心理语言学减少代币费用

链接: https://arxiv.org/abs/2409.00855
作者: Xuechen Liang,Meiling Tao,Yinghui Xia,Tianyu Shi,Jun Wang,JingSong Yang
关键词-EN: handling complex tasks, large language models, complex tasks, increasingly prominent, large language
关键词-ZH: 处理复杂任务、大型语言模型、复杂任务、日益突出、大型语言
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:As large language models (LLMs) improve their capabilities in handling complex tasks, the issues of computational cost and efficiency due to long prompts are becoming increasingly prominent. To accelerate model inference and reduce costs, we propose an innovative prompt compression framework called LanguaShrink. Inspired by the observation that LLM performance depends on the density and position of key information in the input prompts, LanguaShrink leverages psycholinguistic principles and the Ebbinghaus memory curve to achieve task-agnostic prompt compression. This effectively reduces prompt length while preserving essential information. We referred to the training method of OpenChat (Wang et al., 2023). The framework introduces part-of-speech priority compression and data distillation techniques, using smaller models to learn compression targets and employing a KL-regularized reinforcement learning strategy for training. Additionally, we adopt a chunk-based compression algorithm to achieve adjustable compression rates. We evaluate our method on multiple datasets, including LongBench, ZeroScrolls, Arxiv Articles, and a newly constructed novel test set. Experimental results show that LanguaShrink maintains semantic similarity while achieving up to 26 times compression. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.
摘要:随着大型语言模型处理复杂任务能力的提高,长提示带来的计算代价和效率问题日益突出。为了加快模型推理速度和降低成本,我们提出了一种创新的提示压缩框架LanguaShrink。LanguaShrink受到LLM性能取决于关键信息在输入提示中的密度和位置这一观察结果的启发,利用心理语言学原理和Ebbinghaus记忆曲线来实现与任务无关的提示压缩。这有效地缩短了提示长度,同时保留了基本信息。我们参考了OpenChat的训练方法,引入了词性优先压缩和数据蒸馏技术,使用较小的模型学习压缩目标,并使用KL正则化强化学习策略进行训练。此外,我们还采用了基于块的压缩算法来实现可调的压缩比。我们在多个数据集上对我们的方法进行了评估,包括LongBench、ZeroScrolls、Arxiv文章和一个新构建的测试集。实验结果表明,LanguaShrink在保持语义相似度的同时,获得了高达26倍的压缩。与现有的提示压缩方法相比,LanguaShrink的端到端延迟提高了1.43倍。
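
To make the part-of-speech priority idea concrete, here is a toy compressor that drops the lowest-priority words first until a target fraction of the tokens remains. The priority table, the hand-tagged example sentence and the target ratio are all illustrative assumptions, not LanguaShrink's actual configuration.

```python
from typing import Dict, List, Tuple

# Illustrative priority table: higher value = more important to keep. A real
# system would obtain part-of-speech tags from a tagger; the tiny hand-tagged
# sentence below stands in for that.
POS_PRIORITY: Dict[str, int] = {"NOUN": 3, "VERB": 3, "ADJ": 2, "ADV": 2,
                                "PRON": 1, "DET": 0, "ADP": 0, "CCONJ": 0}

def pos_priority_compress(tagged_tokens: List[Tuple[str, str]], target_ratio: float) -> str:
    """Drop the lowest-priority tokens first until only `target_ratio`
    of the original tokens remain, preserving the original word order."""
    keep_n = max(1, int(len(tagged_tokens) * target_ratio))
    # Sort indices by (priority, position) so ties keep earlier words.
    ranked = sorted(range(len(tagged_tokens)),
                    key=lambda i: (-POS_PRIORITY.get(tagged_tokens[i][1], 1), i))
    kept = sorted(ranked[:keep_n])
    return " ".join(tagged_tokens[i][0] for i in kept)

sentence = [("the", "DET"), ("model", "NOUN"), ("quickly", "ADV"),
            ("compresses", "VERB"), ("the", "DET"), ("long", "ADJ"),
            ("prompt", "NOUN"), ("and", "CCONJ"), ("it", "PRON"),
            ("works", "VERB")]
print(pos_priority_compress(sentence, target_ratio=0.5))
# -> "model quickly compresses prompt works"
```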

[NLP-73] Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
[NLP-73] 成绩单:使用自然语言摘要对语言模型进行定性评估

链接: https://arxiv.org/abs/2409.00844
作者: Blair Yang,Fuyang Cui,Keiran Paster,Jimmy Ba,Pashootan Vaezipoor,Silviu Pitis,Michael R. Zhang
关键词-EN: conventional quantitative benchmarks, large language models, make it difficult, rapid development, development and dynamic
关键词-ZH: 传统的量化基准、大型语言模型,使其变得困难、快速开发、发展和动态
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.
摘要:大型语言模型(LLM)的快速发展和动态性质使得传统的量化基准难以准确评估其能力。我们提出了成绩单,这是特定技能或主题的模型行为的人类可解释的自然语言总结。我们开发了一个框架来根据三个标准评估成绩单:特异性(区分模型的能力)、忠实性(模型能力的准确表示)和可解释性(清晰度和与人类的相关性)。我们还提出了一种在没有人类监督的情况下生成成绩单的迭代算法,并通过消除各种设计选择来探索其功效。通过对流行的LLM的实验,我们证明成绩单提供了超越传统基准的见解,并且可以帮助满足对LLM进行更可解释和更全面的评估的需求。

[NLP-74] Building FKG.in: a Knowledge Graph for Indian Food
[NLP-74] 构建FKG.in:印度食品知识图谱

链接: https://arxiv.org/abs/2409.00830
作者: Saransh Kumar Gupta,Lipika Dey,Partha Pratim Das,Ramesh Jain
关键词-EN: multilingual semantic reasoning, semantic reasoning techniques, Indian food, assimilating culinary information, Indian
关键词-ZH: 多语言语义推理,语义推理技术,印度食物,吸收烹饪信息,印度人
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop

点击查看摘要

Abstract:This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.
摘要:本文提出了一种本体设计,结合知识工程和多语言语义推理技术,以知识图的形式建立了一个自动吸收印度食物烹饪信息的系统。主要的重点是设计智能方法来推导本体设计,并大规模获取关于食物、食谱、配料、烹饪特性以及最重要的营养方面的全面知识。我们在这篇研讨会论文中介绍了我们正在进行的工作,较为详细地描述了整理印度食品知识方面的相关挑战,并提出了我们的高级本体设计。我们还提出了一个新的工作流程,使用人工智能、LLM和语言技术来整理公共领域中食谱博客网站的信息,以构建印度食品的知识图谱。本文提出的知识整理方法是通用的,可以在任何领域复制。该设计与应用无关,可用于人工智能驱动的智能分析,为个性化数字健康构建推荐系统,并用用户信息、食品生物化学、地理信息、农业信息等上下文信息补充印度食品的知识图谱。

[NLP-75] LibriheavyMix: A 20000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation ASR and Speaker Diarization INTERSPEECH2024
[NLP-75] LibriheavightMix:用于单通道回响多说话者语音分离ASB和说话者拨号的20000小时数据集

链接: https://arxiv.org/abs/2409.00819
作者: Zengrui Jin,Yifan Yang,Mohan Shi,Wei Kang,Xiaoyu Yang,Zengwei Yao,Fangjun Kuang,Liyong Guo,Lingwei Meng,Long Lin,Yong Xu,Shi-Xiong Zhang,Daniel Povey
关键词-EN: multiple simultaneous speakers, landscape is increasingly, increasingly focused, focused on complex, complex scenarios
关键词-ZH: 多个同时发言的人,景观越来越集中,专注于复杂、复杂的场景
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: InterSpeech 2024

点击查看摘要

Abstract:The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding ``Who said What and When'' in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.
摘要:不断发展的语音处理领域越来越关注复杂的场景,例如存在多个同时发言者和远场条件的会议或鸡尾酒会。应对这些挑战的现有方法分为两类:多通道解决方案和单通道解决方案。单通道方法以其通用性和便利性而著称,不需要关于麦克风阵列的具体信息。本文提出了一个大规模的远场重叠语音数据集,旨在推进语音分离、识别和说话人日志(diarization)的研究。该数据集是在多说话人的混响环境中破译“谁在何时说了什么”的关键资源,这是该领域一项艰巨的挑战。此外,我们还引入了一个包含语音分离、识别和说话人日志的流水线系统作为基础基准。在WHAMR!数据集上的评估验证了所提出数据的广泛适用性。

[NLP-76] Comparing Discrete and Continuous Space LLMs for Speech Recognition INTERSPEECH2024
[NLP-76] 比较用于语音识别的离散和连续空间LLM

链接: https://arxiv.org/abs/2409.00800
作者: Yaoxun Xu,Shi-Xiong Zhang,Jianwei Yu,Zhiyong Wu,Dong Yu
关键词-EN: Automatic Speech Recognition, Large Language Model, paper investigates discrete, based Automatic Speech, Language Model
关键词-ZH: 自动语音识别,大型语言模型,论文研究了离散的,基于自动语音,语言模型
类目: Computation and Language (cs.CL)
备注: InterSpeech 2024

点击查看摘要

Abstract:This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We present an open-sourced achievement of a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
摘要:本文研究了基于大语言模型(LLM)的自动语音识别(ASR)中的离散和连续语音表示,按照特征连续性和训练方法将它们组织成四类:离散和连续类型各自的有监督与无监督方法。我们根据LLM的输入和自回归反馈,进一步将其分为连续空间和离散空间模型。通过使用专门的编码器,并与从零开始联合训练的语言模型(JTFS LM)和预训练的LLaMA2-7b进行比较分析,我们详细检验了它们的有效性。我们的工作是对基于LLM的ASR中语音表示的首次广泛比较,并探索了各种建模技术。我们使用HuBERT编码器在LibriSpeech上取得了1.69%的最先进词错误率(WER)并将其开源,为推进ASR和自然语言处理(NLP)研究提供了宝贵的见解。

[NLP-77] Modeling Text-Label Alignment for Hierarchical Text Classification ECML-PKDD2024
[NLP-77] 分层文本分类的文本标签对齐建模

链接: https://arxiv.org/abs/2409.00788
作者: Ashish Kumar,Durga Toshniwal
关键词-EN: Hierarchical Text Classification, structured label hierarchy, categorize text data, text data based, predicted labels forming
关键词-ZH: 分层文本分类、结构化标签分层结构、分类文本数据、基于文本数据、预测标签形成
类目: Computation and Language (cs.CL)
备注: Accepted in ECML-PKDD 2024 Research Track

点击查看摘要

Abstract:Hierarchical Text Classification (HTC) aims to categorize text data based on a structured label hierarchy, resulting in predicted labels forming a sub-hierarchy tree. The semantics of the text should align with the semantics of the labels in this sub-hierarchy. With the sub-hierarchy changing for each sample, the dynamic nature of text-label alignment poses challenges for existing methods, which typically process text and labels independently. To overcome this limitation, we propose a Text-Label Alignment (TLA) loss specifically designed to model the alignment between text and labels. We obtain a set of negative labels for a given text and its positive label set. By leveraging contrastive learning, the TLA loss pulls the text closer to its positive label and pushes it away from its negative label in the embedding space. This process aligns text representations with related labels while distancing them from unrelated ones. Building upon this framework, we introduce the Hierarchical Text-Label Alignment (HTLA) model, which leverages BERT as the text encoder and GPTrans as the graph encoder and integrates text-label embeddings to generate hierarchy-aware representations. Experimental results on benchmark datasets and comparison with existing baselines demonstrate the effectiveness of HTLA for HTC.
摘要:层次文本分类(HTC)的目的是根据结构化的标签层次结构对文本数据进行分类,使预测的标签形成子层次树。文本的语义应该与该子层次结构中的标签的语义一致。随着每个样本的子层次发生变化,文本-标签对齐的动态性质对现有方法提出了挑战,这些方法通常独立处理文本和标签。为了克服这一局限性,我们提出了一种文本-标签对齐(TLA)损失,该损失专门用于建模文本和标签之间的对齐。对于给定文本,我们得到其正标签集以及一组负标签。通过利用对比学习,TLA损失在嵌入空间中将文本拉近其正标签,并将其推离负标签。此过程将文本表示与相关标签对齐,同时使其远离不相关的标签。在这个框架的基础上,我们引入了层次文本-标签对齐(HTLA)模型,该模型利用BERT作为文本编码器、GPTrans作为图编码器,并结合文本-标签嵌入来生成层次感知表示。在基准数据集上的实验结果以及与现有基线的比较表明了HTLA在HTC任务上的有效性。
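
A generic contrastive formulation of the text-label alignment idea can be written as an InfoNCE-style loss: for a given text embedding, similarities to its positive labels should dominate similarities to sampled negative labels. The temperature, the random stand-in embeddings and the exact way each positive is contrasted against the negatives below are illustrative choices, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def tla_style_loss(text_emb: torch.Tensor,
                   pos_label_embs: torch.Tensor,
                   neg_label_embs: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style alignment loss for a single text: pull the text towards
    each of its positive labels and push it away from sampled negatives."""
    text = F.normalize(text_emb, dim=-1)
    pos = F.normalize(pos_label_embs, dim=-1)
    neg = F.normalize(neg_label_embs, dim=-1)
    pos_sim = (text @ pos.T) / temperature   # shape (P,)
    neg_sim = (text @ neg.T) / temperature   # shape (N,)
    # Each positive label is contrasted against all negative labels.
    logits = torch.cat([pos_sim.unsqueeze(1),
                        neg_sim.expand(pos_sim.size(0), -1)], dim=1)
    targets = torch.zeros(pos_sim.size(0), dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, targets)

# Toy usage: 3 positive labels on the text's label path, 8 sampled negatives,
# with random vectors standing in for BERT/GPTrans embeddings.
torch.manual_seed(0)
loss = tla_style_loss(torch.randn(256), torch.randn(3, 256), torch.randn(8, 256))
print(loss.item())
```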

[NLP-78] he Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
[NLP-78] 人类反馈的阴暗面:通过用户输入毒害大型语言模型

链接: https://arxiv.org/abs/2409.00787
作者: Bocheng Chen,Hanqing Guo,Guangjing Wang,Yuanda Wang,Qiben Yan
关键词-EN: demonstrated great capabilities, Large Language Models, intricate alignment process, natural language understanding, Large Language
关键词-ZH: 展示了强大的能力、大型语言模型、复杂的对齐过程、自然语言理解、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.
摘要:大型语言模型在自然语言理解和生成方面表现出了强大的能力,这在很大程度上归功于利用人类反馈进行的复杂对齐过程。虽然对齐已成为利用从用户查询中收集的数据的基本训练环节,但它无意中为一种新型的用户引导中毒攻击开辟了途径。在本文中,我们对近期LLM训练管道中的潜在漏洞进行了新的探索,揭示了一种微妙而有效的中毒攻击:通过用户提供的提示来突破对齐训练的保护。我们的攻击即使在黑盒设置下没有关于目标LLM的明确知识,也会微妙地改变奖励反馈机制,以降低与特定关键字关联的模型性能,同时保持不引人注意。我们提出了两种机制来制作恶意提示:(1)基于选择的机制旨在引发反常地获得高回报的有毒响应;(2)基于生成的机制利用可优化的前缀来控制模型输出。通过恶意用户向数据中注入1%这类精心设计的提示,我们证明了使用特定触发词时,毒性分数最高可高出两倍。我们发现了一个严重的漏洞,并强调:无论采用何种奖励模型、何种奖励或何种基础语言模型,只要训练利用了用户生成的提示,对LLM进行隐蔽破坏不仅是可行的,而且可能是不可避免的。

[NLP-79] Generating Media Background Checks for Automated Source Critical Reasoning
[NLP-79] 生成媒体背景检查以实现自动源关键推理

链接: https://arxiv.org/abs/2409.00781
作者: Michael Schlichtkrull
关键词-EN: internet is true, media background checks, background checks, media, media background
关键词-ZH: 互联网是真的,媒体背景调查,背景调查,媒体,媒体背景
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Not everything on the internet is true. This unfortunate fact requires both humans and models to perform complex reasoning about credibility when working with retrieved information. In NLP, this problem has seen little attention. Indeed, retrieval-augmented models are not typically expected to distrust retrieved documents. Human experts overcome the challenge by gathering signals about the context, reliability, and tendency of source documents - that is, they perform source criticism. We propose a novel NLP task focused on finding and summarising such signals. We introduce a new dataset of 6,709 “media background checks” derived from Media Bias / Fact Check, a volunteer-run website documenting media bias. We test open-source and closed-source LLM baselines with and without retrieval on this dataset, finding that retrieval greatly improves performance. We furthermore carry out human evaluation, demonstrating that 1) media background checks are helpful for humans, and 2) media background checks are helpful for retrieval-augmented models.
摘要:互联网上并不是所有的东西都是真的。这一不幸的事实要求人类和模型在处理检索到的信息时都要对可信度进行复杂的推理。在NLP,这个问题几乎没有受到关注。事实上,检索增强的模型通常不会不信任检索到的文档。人类专家通过收集有关源文档的上下文、可靠性和趋势的信号来克服这一挑战–也就是说,他们执行源批评。我们提出了一个新的NLP任务,专注于发现和总结这样的信号。我们介绍了一个包含6,709个“媒体背景调查”的新数据集,该数据集来自一个记录媒体偏见的志愿者运营的网站–媒体偏见/事实检查。我们在此数据集上测试了具有和不具有检索的开放源代码和封闭源代码的LLM基线,发现检索极大地提高了性能。此外,我们还进行了人工评估,表明1)媒体背景调查对人类有帮助,2)媒体背景调查对检索增强模型有帮助。

[NLP-80] ContextCite: Attributing Model Generation to Context
[NLP-80] ContextCite:将模型生成归因于上下文

链接: https://arxiv.org/abs/2409.00729
作者: Benjamin Cohen-Wang,Harshay Shah,Kristian Georgiev,Aleksander Madry
关键词-EN: information provided, context, Abstract, context attribution, ContextCite
关键词-ZH: 提供的信息、上下文、摘要、上下文属性、ContextCite
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements (2) improving response quality by pruning the context and (3) detecting poisoning attacks. We provide code for ContextCite at this https URL.
摘要:语言模型在生成响应时如何使用作为上下文提供的信息?我们能否推断生成的特定陈述实际上是基于上下文、误解还是捏造的?为了帮助回答这些问题,我们引入了上下文归因问题:确定导致模型生成特定陈述的上下文部分(如果有的话)。然后,我们介绍了ContextCite,这是一种简单且可扩展的上下文归因方法,可以应用在任何现有语言模型之上。最后,我们通过三个应用程序展示了ContextCite的实用性:(1)帮助验证生成的陈述(2)通过修剪上下文来提高响应质量以及(3)检测中毒攻击。我们在此https URL中提供ContextCite的代码。
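
为了直观理解"上下文归因"的含义,下面给出一个留一法(leave-one-out)的简化示意(Python;score_fn 为假设的外部打分接口)。需要说明的是,ContextCite 本身的具体算法以原论文为准,此处仅演示归因分数的基本思想:

```python
def attribute_context(score_fn, context_parts, statement):
    """上下文归因的简化示意(留一法):逐个去掉上下文片段,观察目标陈述得分的下降幅度。
    score_fn(parts, statement) -> 模型给该陈述的对数概率,为假设的外部接口。"""
    full = score_fn(context_parts, statement)
    scores = []
    for i in range(len(context_parts)):
        ablated = context_parts[:i] + context_parts[i + 1:]
        scores.append(full - score_fn(ablated, statement))  # 得分下降越多,归因越强
    return scores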

[NLP-81] Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph
[NLP-81] 猎犬:狩猎监督信号,在文本属性图上对少数和零镜头节点分类

链接: https://arxiv.org/abs/2409.00727
作者: Yuxiang Wang,Xiao Yan,Shiyu Jin,Quanqing Xu,Chuanhui Yang,Yuanyuan Zhu,Chuang Hu,Bo Du,Jiawei Jiang
关键词-EN: Text-attributed graph, graph structured data, graph structured, important type, node
关键词-ZH: 文本属性图、图结构化数据、图结构化、重要类型、节点
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
摘要:文本属性图(Tag)是一种重要的图结构数据,每个节点都有文本描述。标签的少镜头和零镜头节点分类在学术界和社会网络等领域有着广泛的应用。然而,由于缺乏监督信号,这两个任务都具有挑战性,现有的方法只利用对比损失来对齐基于图的节点嵌入和基于语言的文本嵌入。在本文中,我们提出Hound通过引入更多的监督信号来提高准确率,其核心思想是超越伴随数据而来的节点-文本对。具体来说,我们设计了三种增强技术,即节点扰动、文本匹配和语义否定,为每个文本提供更多的参考节点,反之亦然。节点扰动添加/删除边以产生可与文本匹配的多样化节点嵌入。文本匹配检索具有类似嵌入的文本以与节点匹配。语义否定使用否定提示来构建语义相反的否定文本,它与原始的节点和文本形成对比。我们在5个数据集上对Hound进行了评估,并与13个最先进的基线进行了比较。结果表明,Hound的性能一直优于所有基线,其精度比表现最好的基线提高了5%以上。
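
下面用一段示意代码(Python;参数与函数名均为假设)说明摘要中"节点扰动"这类增强的直观做法:随机增删边以得到多样化的图结构输入。这只是简化版本,并非论文官方实现:

```python
import random

def perturb_edges(edges, num_nodes, drop_prob=0.1, add_prob=0.05):
    """示意性的"节点扰动"增强:随机删去部分边,再随机加入少量新边,
    从而为同一段文本提供更多样化的图结构参考(简化版本,非论文官方实现)。"""
    kept = [e for e in edges if random.random() > drop_prob]
    n_add = int(add_prob * len(edges))
    added = [(random.randrange(num_nodes), random.randrange(num_nodes))
             for _ in range(n_add)]
    return kept + added

# 用法示例:一个 5 节点的小图
print(perturb_edges([(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)], num_nodes=5))
```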

[NLP-82] Who Would Chatbots Vote For? Political Preferences of ChatGPT and Gemini in the 2024 European Union Elections
[NLP-82] 聊天机器人会投票给谁?ChatGPT和Gemini在2024年欧盟选举中的政治偏好

链接: https://arxiv.org/abs/2409.00721
作者: Michael Haman,Milan Školník
关键词-EN: European Parliament elections, large language models, European Parliament, Parliament elections, European Free Alliance
关键词-ZH: 欧洲议会选举,大型语言模型,欧洲议会,议会选举,欧洲自由联盟
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study examines the political bias of chatbots powered by large language models, namely ChatGPT and Gemini, in the context of the 2024 European Parliament elections. The research focused on the evaluation of political parties represented in the European Parliament across 27 EU Member States by these generative artificial intelligence (AI) systems. The methodology involved daily data collection through standardized prompts on both platforms. The results revealed a stark contrast: while Gemini mostly refused to answer political questions, ChatGPT provided consistent ratings. The analysis showed a significant bias in ChatGPT in favor of left-wing and centrist parties, with the highest ratings for the Greens/European Free Alliance. In contrast, right-wing parties, particularly the Identity and Democracy group, received the lowest ratings. The study identified key factors influencing the ratings, including attitudes toward European integration and perceptions of democratic values. The findings highlight the need for a critical approach to information provided by generative AI systems in a political context and call for more transparency and regulation in this area.
摘要:在2024年欧洲议会选举的背景下,这项研究考察了由大语言模型(即ChatGPT和Gemini)驱动的聊天机器人的政治偏见。这项研究的重点是通过这些生成性人工智能(AI)系统对27个欧盟成员国在欧洲议会中代表的政党进行评估。该方法包括通过两个平台上的标准化提示进行日常数据收集。结果显示出了鲜明的对比:虽然双子座大多拒绝回答政治问题,但ChatGPT提供了一致的评级。分析显示,ChatGPT明显倾向于左翼和中间派政党,绿党/欧洲自由联盟的支持率最高。相比之下,右翼政党,特别是认同与民主团体,得分最低。这项研究确定了影响评级的关键因素,包括对欧洲一体化的态度和对民主价值观的看法。这些发现突显了在政治背景下对生成性人工智能系统提供的信息采取批判性方法的必要性,并呼吁在这一领域加强透明度和监管。

[NLP-83] Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
[NLP-83] Polyrating:一个具有成本效益且具有偏见意识的LLM评估评级系统

链接: https://arxiv.org/abs/2409.00696
作者: Jasper Dekoninck,Maximilian Baader,Martin Vechev
关键词-EN: Rating-based human evaluation, Rating-based human, Large language models, essential tool, tool to accurately
关键词-ZH: 基于评级的人类评估,基于评级的人类,大型语言模型,必要的工具,准确的工具
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of Large language models (LLMs). However, current rating systems suffer from several critical limitations. Specifically, they fail to account for human biases that significantly influence evaluation results, require large and expensive preference datasets to obtain accurate ratings, and do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Furthermore, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLMs’ strengths, weaknesses, and relative performance across different applications.
摘要:基于评分的人工评价已经成为准确评价大型语言模型令人印象深刻的性能的重要工具。然而,当前的评级体系存在几个严重的局限性。具体地说,它们没有考虑到显著影响评估结果的人为偏见,需要大量且昂贵的偏好数据集才能获得准确的评级,并且无法促进对不同任务的模型评级进行有意义的比较。为了解决这些问题,我们引入了PolyRating,这是一种基于最大后验估计的表现力和灵活的评级系统,能够以更低的成本对模型性能进行更细微和彻底的分析。多元评级可以检测和量化影响人类偏好的偏差,确保更公平的模型比较。此外,PolyRating通过利用现有的基准分数,可以将新模型的人工评估成本降低高达41%,将新任务的人工评估成本降低高达77%。最后,PolyRating允许直接比较不同任务的评级,全面了解LLMS的优势、劣势和不同应用程序的相对表现。
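
摘要提到该评级系统基于最大后验估计。下面给出一个与之思路相近、但并非论文模型本身的极简示意(Python/NumPy):在 Bradley-Terry 似然上加高斯先验,用梯度上升求 MAP 评分;Polyrating 对人类偏差等因素的显式建模此处未包含:

```python
import numpy as np

def map_ratings(comparisons, n_models, prior_var=1.0, lr=0.01, steps=2000):
    """最大后验(MAP)评分的极简示意:Bradley-Terry 似然 + 高斯先验,用梯度上升求解。
    comparisons 为 (胜者下标, 负者下标) 列表。"""
    r = np.zeros(n_models)
    for _ in range(steps):
        grad = -r / prior_var                              # 高斯先验项的梯度
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(-(r[w] - r[l])))       # 胜者获胜的概率
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        r += lr * grad
    return r

# 用法示例:模型 0 多次战胜模型 1
print(map_ratings([(0, 1), (0, 1), (1, 0)], n_models=2))
```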

[NLP-84] Correcting FLORES Evaluation Dataset for Four African Languages
[NLP-84] 更正四种非洲语言的FLORES评估数据集

链接: https://arxiv.org/abs/2409.00626
作者: Idris Abdulmumin,Sthembiso Mkhwanazi,Mahlatse S. Mbooi,Shamsuddeen Hassan Muhammad,Ibrahim Said Ahmad,Neo Putini,Miehleketo Mathebula,Matimba Shingange,Tajuddeen Gwadabe,Vukosi Marivate
关键词-EN: Northern Sotho, Xitsonga and isiZulu, FLORES evaluation, dev and devtest, paper describes
关键词-ZH: Northern Scrum、Xitsonga和isiZulu,FLORES评估、开发和开发测试,论文描述
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper describes the corrections made to the FLORES evaluation (dev and devtest) dataset for four African languages, namely Hausa, Northern Sotho (Sepedi), Xitsonga and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies in the reviewed languages that could potentially hinder the integrity of the evaluation of downstream tasks in natural language processing (NLP), especially machine translation. Through a meticulous review process by native speakers, several corrections were identified and implemented, improving the dataset’s overall quality and reliability. For each language, we provide a concise summary of the errors encountered and corrected, and also present some statistical analysis that measure the difference between the existing and corrected datasets. We believe that our corrections enhance the linguistic accuracy and reliability of the data and, thereby, contributing to more effective evaluation of NLP tasks involving the four African languages.
摘要:本文描述了对四种非洲语言,即豪萨语、北索托语(塞佩迪语)、西松加语和伊西祖鲁语的FLORES评估(dev和devtest)数据集所做的修正。原始数据集虽然在覆盖低资源语言方面具有开创性,但在审查的语言中显示出各种不一致和不准确之处,这可能会阻碍自然语言处理(NLP),特别是机器翻译中下游任务评估的完整性。通过母语人士的仔细审查过程,确定并实施了几项更正,提高了数据集的整体质量和可靠性。对于每种语言,我们提供了遇到和更正的错误的简明摘要,并提供了一些统计分析,以衡量现有数据集和更正后的数据集之间的差异。我们认为,我们的更正提高了数据的语言准确性和可靠性,从而有助于更有效地评价涉及四种非洲语言的自然语言处理任务。

[NLP-85] Entity-Aware Biaffine Attention Model for Improved Constituent Parsing with Reduced Entity Violations
[NLP-85] 改进成分解析并减少实体违规的实体感知偏仿射注意模型

链接: https://arxiv.org/abs/2409.00625
作者: Xinyi Bai
关键词-EN: Constituency parsing involves, parsing involves analyzing, Constituency parsing, involves analyzing, Constituency
关键词-ZH: 选区解析涉及,解析涉及分析,选区解析,涉及分析,选区
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Constituency parsing involves analyzing a sentence by breaking it into sub-phrases, or constituents. While many deep neural models have achieved state-of-the-art performance in this task, they often overlook the entity-violating issue, where an entity fails to form a complete sub-tree in the resultant parsing tree. To address this, we propose an entity-aware biaffine attention model for constituent parsing. This model incorporates entity information into the biaffine attention mechanism by using additional entity role vectors for potential phrases, which enhances the parsing accuracy. We introduce a new metric, the Entity Violating Rate (EVR), to quantify the extent of entity violations in parsing results. Experiments on three popular datasets-ONTONOTES, PTB, and CTB-demonstrate that our model achieves the lowest EVR while maintaining high precision, recall, and F1-scores comparable to existing models. Further evaluation in downstream tasks, such as sentence sentiment analysis, highlights the effectiveness of our model and the validity of the proposed EVR metric.
摘要:词组分析涉及通过将句子分解为子短语或成分来分析句子。虽然许多深度神经模型在这项任务中取得了最先进的性能,但它们往往忽略了实体违反问题,即实体无法在结果分析树中形成完整的子树。为了解决这个问题,我们提出了一种实体感知的双仿射注意模型来进行成分分析。该模型通过对潜在短语使用额外的实体角色向量,将实体信息融入到双仿射注意机制中,从而提高了句法分析的准确性。我们引入了一个新的度量–实体违规率(EVR)来量化分析结果中实体违规的程度。在三个流行的数据集-ONTONOTES、PTB和CTB上的实验表明,我们的模型实现了最低的EVR,同时保持了与现有模型相当的高精度、召回率和F1分数。在后续任务中的进一步评估,如句子情感分析,突出了我们模型的有效性和所提出的EVR度量的有效性。
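
按摘要对"实体违反"的直观定义(实体未构成完整子树),可以写出如下示意性的 EVR 计算(Python;跨度表示方式为本文假设,论文中的精确定义以原文为准):

```python
def entity_violating_rate(entity_spans, constituent_spans):
    """按直观定义示意计算实体违反率(EVR):
    若某实体跨度未对应任何完整成分(子树)的跨度,则计为一次违反。
    跨度以 (start, end) 词下标表示,为本文假设的表示方式。"""
    if not entity_spans:
        return 0.0
    constituents = set(constituent_spans)
    violated = sum(1 for span in entity_spans if span not in constituents)
    return violated / len(entity_spans)

# 用法示例:两个实体中有一个不构成完整成分
print(entity_violating_rate([(0, 2), (3, 5)], [(0, 2), (0, 5), (2, 5)]))  # 0.5
```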

[NLP-86] Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models CIKM2024
[NLP-86] 知识本地化正确吗?语言模型中实体视角和关系视角之间的惊人差异

链接: https://arxiv.org/abs/2409.00617
作者: Yifan Wei,Xiaoyan Yu,Yixuan Weng,Huanhuan Ma,Yuanzhe Zhang,Jun Zhao,Kang Liu
关键词-EN: demonstrated superior performance, Large language models, language processing tasks, natural language processing, Large language
关键词-ZH: 表现出卓越的性能、大型语言模型、语言处理任务、自然语言处理、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CIKM 2024

点击查看摘要

Abstract:Large language models encapsulate knowledge and have demonstrated superior performance on various natural language processing tasks. Recent studies have localized this knowledge to specific model parameters, such as the MLP weights in intermediate layers. This study investigates the differences between entity and relational knowledge through knowledge editing. Our findings reveal that entity and relational knowledge cannot be directly transferred or mapped to each other. This result is unexpected, as logically, modifying the entity or the relation within the same knowledge triplet should yield equivalent outcomes. To further elucidate the differences between entity and relational knowledge, we employ causal analysis to investigate how relational knowledge is stored in pre-trained models. Contrary to prior research suggesting that knowledge is stored in MLP weights, our experiments demonstrate that relational knowledge is also significantly encoded in attention modules. This insight highlights the multifaceted nature of knowledge storage in language models, underscoring the complexity of manipulating specific types of knowledge within these models.
摘要:大型语言模型封装了知识,在各种自然语言处理任务中表现出了优异的性能。最近的研究已经将这种知识局限于特定的模型参数,例如中间层中的MLP权重。本研究通过知识编辑来考察实体知识和关系知识之间的差异。我们的发现表明,实体知识和关系知识不能直接相互转移或映射。这一结果是意想不到的,因为从逻辑上讲,修改同一知识三元组中的实体或关系应该会产生相同的结果。为了进一步阐明实体知识和关系知识之间的区别,我们使用因果分析来调查关系知识是如何存储在预先训练的模型中的。与之前的研究表明知识存储在MLP权重中相反,我们的实验表明,关系知识也显著编码在注意模块中。这一见解突出了语言模型中知识存储的多面性,强调了在这些模型中操作特定类型的知识的复杂性。

[NLP-87] DAMe: Personalized Federated Social Event Detection with Dual Aggregation Mechanism CIKM2024
[NLP-87] DAMe:具有双重聚合机制的个性化联邦社会事件检测

链接: https://arxiv.org/abs/2409.00614
作者: Xiaoyan Yu,Yifan Wei,Pu Li,Shuaishuai Zhou,Hao Peng,Li Sun,Liehuang Zhu,Philip S. Yu
关键词-EN: improve participants’ performance, Training social event, Training social, event detection models, social event detection
关键词-ZH: 提高参与者表现,训练社交事件,训练社交,事件检测模型,社交事件检测
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CIKM 2024

点击查看摘要

Abstract:Training social event detection models through federated learning (FedSED) aims to improve participants’ performance on the task. However, existing federated learning paradigms are inadequate for achieving FedSED’s objective and exhibit limitations in handling the inherent heterogeneity in social data. This paper proposes a personalized federated learning framework with a dual aggregation mechanism for social event detection, namely DAMe. We present a novel local aggregation strategy utilizing Bayesian optimization to incorporate global knowledge while retaining local characteristics. Moreover, we introduce a global aggregation strategy to provide clients with maximum external knowledge of their preferences. In addition, we incorporate a global-local event-centric constraint to prevent local overfitting and "client-drift". Experiments within a realistic simulation of a natural federated setting, utilizing six social event datasets spanning six languages and two social media platforms, along with an ablation study, have demonstrated the effectiveness of the proposed framework. Further robustness analyses have shown that DAMe is resistant to injection attacks.
摘要:通过联合学习训练社交事件检测模型的目的是提高参与者在任务中的表现。然而,现有的联合学习范例不足以实现FedSED的目标,并且在处理社交数据中固有的异质性方面显示出局限性。提出了一种具有双重聚合机制的个性化联合学习框架DAME。我们提出了一种新的局部聚集策略,该策略利用贝叶斯优化来融合全局知识,同时保留局部特征。此外,我们引入了全球聚合策略,为客户提供关于其偏好的最大外部知识。此外,我们还加入了以全局-局部事件为中心的约束,以防止局部过度匹配和“客户漂移”。利用跨越六种语言和两个社交媒体平台的六个社会事件数据集,以及一项消融研究,在自然联邦环境的真实模拟中进行的实验,已经证明了所提出的框架的有效性。进一步的健壮性分析表明,DAME能够抵抗注入攻击。

[NLP-88] TinyAgent: Function Calling at the Edge
[NLP-88] TinyAgent:边缘的函数调用

链接: https://arxiv.org/abs/2409.00608
作者: Lutfi Eren Erdogan,Nicholas Lee,Siddharth Jha,Sehoon Kim,Ryan Tabrizi,Suhong Moon,Coleman Hooper,Gopala Anumanchipalli,Kurt Keutzer,Amir Gholami
关键词-EN: Recent large language, Recent large, fulfill user queries, function calling, advanced agentic systems
关键词-ZH: 最近的大型语言、最近的大型、满足用户查询、函数调用、先进的代理系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple’s MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package and provide a demo video for our MacBook assistant agent.
摘要:最近的大型语言模型使高级代理系统的发展成为可能,这些代理系统可以集成各种工具和API来通过函数调用来完成用户查询。然而,这些LLM在边缘的部署还没有被探索过,因为它们通常需要基于云的基础设施,因为它们的模型大小和计算需求很大。为此,我们提出了TinyAgent,一个端到端的框架,用于训练和部署特定于任务的小语言模型代理,这些代理能够在边缘调用驱动代理系统。我们首先展示如何通过LLMCompiler框架为开放源码模型启用准确的函数调用。然后,我们系统地为函数调用管理一个高质量的数据集,我们使用它来微调两个小型语言模型TinyAgent-1.1B和7B。为了有效地进行推理,我们引入了一种新的工具检索方法来缩短输入提示长度,并利用量化来进一步加快推理速度。作为一个驾驶应用程序,我们展示了一个适用于苹果MacBook的本地类似Siri的系统,它可以通过文本或语音输入执行用户命令。我们的结果表明,我们的模型可以达到甚至超过GPT-4-Turbo等更大型号的函数调用能力,同时完全部署在边缘。我们将我们的数据集、模型和可安装包开源,并为我们的MacBook助理代理提供演示视频。
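
下面是摘要中"工具检索以缩短提示长度"这一思路的简化示意(Python/NumPy;接口与工具名均为假设,并非 TinyAgent 的官方实现):

```python
import numpy as np

def retrieve_tools(query_emb, tool_embs, tool_names, k=3):
    """工具检索思路的示意:按查询与各工具描述向量的余弦相似度取 top-k,
    仅把这些工具的说明放入提示,从而缩短输入长度。"""
    q = query_emb / np.linalg.norm(query_emb)
    t = tool_embs / np.linalg.norm(tool_embs, axis=1, keepdims=True)
    top = np.argsort(-(t @ q))[:k]
    return [tool_names[i] for i in top]

# 用法示例:随机向量 + 假设的工具名
names = ["open_app", "send_email", "create_event", "search_files"]
print(retrieve_tools(np.random.rand(16), np.random.rand(4, 16), names, k=2))
```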

[NLP-89] Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
[NLP-89] 用于评估大型语言模型中虚假拒绝的自动伪有害提示生成

链接: https://arxiv.org/abs/2409.00598
作者: Bang An,Sicheng Zhu,Ruiyi Zhang,Michael-Andrei Panaitescu-Liess,Yuancheng Xu,Furong Huang
关键词-EN: Safety-aligned large language, large language models, Safety-aligned large, falsely refuse pseudo-harmful, refuse pseudo-harmful prompts
关键词-ZH: 安全对齐的大型语言、大型语言模型、安全对齐的大型、错误拒绝伪有害、拒绝伪有害提示
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like “how to kill a mosquito,” which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at this https URL
摘要:基于安全的大型语言模型(LLM)有时会错误地拒绝虚假有害的提示,例如“如何杀死蚊子”,而这些提示实际上是无害的。频繁的虚假拒绝不仅让用户感到沮丧,还会引发公众对Align试图保护的价值观的强烈反对。在本文中,我们提出了第一种方法来自动生成多样化的、内容受控的、依赖于模型的伪有害提示。使用该方法,我们构建了一个评价数据集PHTest,它比现有的数据集大了十倍,覆盖了更多的错误拒绝模式,并分别对有争议的提示进行了标注。我们在PHTest上评估了20个LLM,发现了由于其规模和标签而产生的新见解。我们的发现揭示了在尽量减少虚假拒绝和提高针对越狱攻击的安全性之间的权衡。此外,我们发现许多越狱防御措施显著增加了错误拒绝率,从而破坏了可用性。我们的方法和数据集可以帮助开发人员评估和微调更安全、更可用的LLM。我们的代码和数据集可在此HTTPS URL中获得

[NLP-90] Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model ACM-MM2024
[NLP-90] 多模式多轮对话姿态检测:挑战数据集和有效模型

链接: https://arxiv.org/abs/2409.00597
作者: Fuqiang Niu,Zebang Cheng,Xianghua Fu,Xiaojiang Peng,Genan Dai,Yin Chen,Hu Huang,Bowen Zhang
关键词-EN: identify public opinion, social media data, Stance detection, social media, aims to identify
关键词-ZH: 识别公众舆论、社交媒体数据、立场检测、社交媒体、旨在识别
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注: ACM MM2024

点击查看摘要

Abstract:Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content including text, and images multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD), that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.
摘要:姿态检测是一项重要而又具有挑战性的任务,其目的是利用社交媒体数据识别公众对特定目标的看法。随着包括文本和图像在内的各种多通道社交媒体内容的激增,多通道姿态检测(MSD)成为一个重要的研究领域。然而,现有的MSD研究集中在单个文本-图像对中的建模立场,而忽略了社交媒体上自然发生的多方对话上下文。这一限制源于缺乏真实捕捉此类对话场景的数据集,阻碍了对话MSD的进展。为了解决这个问题,我们引入了一个新的多通道多轮对话姿态检测数据集(MmMtCSD)。为了从这个具有挑战性的数据集中获取姿态,我们提出了一种新的多通道大型语言模型姿态检测框架(MLLM-SD),该框架从文本和视觉通道中学习关节姿态表示。在MmMtCSD上的实验表明,我们提出的MLLM-SD方法在多模式姿态检测中具有最先进的性能。我们相信,MmMtCSD将为推进姿态检测研究的实际应用做出贡献。

[NLP-91] Learning to Ask: When LLMs Meet Unclear Instruction
[NLP-91] 学会询问:当LLM遇到不明确的指示时

链接: https://arxiv.org/abs/2409.00557
作者: Wenxuan Wang,Juluan Shi,Chaozheng Wang,Cheryl Lee,Youliang Yuan,Jen-tse Huang,Michael R. Lyu
关键词-EN: modern large language, large language models, leverage external tools, language models, large language
关键词-ZH: 现代大型语言、大型语言模型、利用外部工具、语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.
摘要:配备了调用函数能力的现代大型语言模型(LLM)可以利用外部工具来处理仅靠语言技能无法完成的一系列任务。然而,这些工具的有效执行不仅在很大程度上依赖于LLM的高级功能,而且还依赖于准确的用户指令,而这在现实世界中往往无法得到保证。为了评估LLM工具在不完美指令下的使用性能,我们仔细检查了用户查询的真实指令,分析了错误模式,并构建了一个具有挑战性的工具使用基准测试,称为NoisyToolBench。我们发现,由于下一个词元预测的训练目标,LLM往往会随意生成缺失的参数,这可能会导致幻觉和风险。为了解决这个问题,我们提出了一种新颖的框架,即"需要时再问"(AwN):每当LLM因指令不清而遇到障碍时,该框架会提示它向用户提问。此外,为了减少用户与LLM交互的人工工作量,并从精度和效率两个角度评估LLM的工具使用性能,我们设计了一个自动化评估工具ToolEvaluator。我们的实验表明,AwN的性能明显优于NoisyToolBench中现有的工具学习框架。我们将发布所有相关代码和数据集,以支持未来的研究。
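
下面用一段示意代码(Python;llm、ask_user 为假设的外部接口)勾勒"需要时再问"(AwN)的交互循环,仅表达其基本思想,并非论文实现:

```python
def ask_when_needed(llm, ask_user, user_instruction, max_rounds=3):
    """AwN("需要时再问")思路的示意:若模型认为指令缺少调用工具所需的信息,
    先生成一个澄清问题交给用户,再把回答并入指令;llm、ask_user 均为假设的外部接口。"""
    instruction = user_instruction
    for _ in range(max_rounds):
        reply = llm("若下述指令缺少调用工具所需的参数,请只输出一个澄清问题;"
                    "否则只输出 READY。\n指令:" + instruction)
        if reply.strip() == "READY":
            break
        instruction += "\n用户补充:" + ask_user(reply)   # 把用户的回答并入指令
    return llm("根据下述完整指令生成工具调用:\n" + instruction)
```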

[NLP-92] Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity and Fairness
[NLP-92] 大型语言模型的测试和评估:正确性、无毒性和公平性

链接: https://arxiv.org/abs/2409.00551
作者: Wenxuan Wang
关键词-EN: extraordinary conversational skills, Large language models, Large language, past few years, rapidly penetrated
关键词-ZH: 非凡的对话技巧,大型语言模型,大型语言,过去几年,迅速渗透
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: PhD Thesis

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT, have rapidly penetrated into people’s work and daily lives over the past few years, due to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in terms of user numbers in human history and become an important foundational model for the next generation of artificial intelligence applications. However, the generations of LLMs are not entirely reliable, often producing content with factual errors, biases, and toxicity. Given their vast number of users and wide range of application scenarios, these unreliable responses can lead to many serious negative impacts. This thesis introduces the exploratory works in the field of language model reliability during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, to evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works for red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, to measure the social bias and cultural bias of LLMs, respectively.
摘要:在过去的几年里,ChatGPT等大型语言模型凭借其非凡的会话技能和智能迅速渗透到人们的工作和日常生活中。ChatGPT已经成为人类历史上用户数量增长最快的软件,并成为下一代人工智能应用的重要基础模型。然而,LLM的生成结果并不完全可靠,经常产生带有事实错误、偏见和毒性的内容。鉴于其庞大的用户数量和广泛的应用场景,这些不可靠的响应可能会导致许多严重的负面影响。本文介绍了博士期间在语言模型可靠性方面所做的探索性工作,重点从软件测试和自然语言处理两个角度对LLM的正确性、无毒性和公平性进行了研究。首先,为了度量LLM的正确性,我们引入了两个测试框架FactChecker和LogicAsker,分别用于评估事实知识和逻辑推理的准确性。其次,针对LLM的无毒性,我们介绍了两项用于红队测试LLM的工作。第三,为了评价LLM的公平性,我们引入了两个评价框架BiasAsker和XCulturalBench,分别衡量LLM的社会偏见和文化偏见。

[NLP-93] Large Language Models-Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors
[NLP-93] 支持大语言模型的数字双胞胎用于罕见妇科肿瘤的精准医疗

链接: https://arxiv.org/abs/2409.00544
作者: Jacqueline Lammert,Nicole Pfarr,Leonid Kuligin,Sonja Mathes,Tobias Dreyer,Luise Modersohn,Patrick Metzger,Dyke Ferber,Jakob Nikolas Kather,Daniel Truhn,Lisa Christine Adams,Keno Kyrill Bressem,Sebastian Lange,Kristina Schwamborn,Martin Boeker,Marion Kiechle,Ulrich A. Schatz,Holger Bronger,Maximilian Tschochohei
关键词-EN: Rare gynecological tumors, Rare gynecological, present major clinical, major clinical challenges, clinical challenges due
关键词-ZH: 罕见妇科肿瘤,罕见妇科,目前主要临床,主要临床挑战,临床挑战,由于
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注: 20 pages, 2 figures, 3 tables, supplements, original article

点击查看摘要

Abstract:Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. The lack of clear guidelines leads to suboptimal management and poor prognosis. Molecular tumor boards accelerate access to effective therapies by tailoring treatment based on biomarkers, beyond cancer type. Unstructured data that requires manual curation hinders efficient use of biomarker profiling for therapy matching. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n=21) and literature-derived data (n=655 publications with n=404,265 patients) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.
摘要:罕见妇科肿瘤(RGTS)发病率低、异质性强,是临床面临的主要挑战。缺乏明确的指导方针,导致治疗效果不佳,预后不佳。分子肿瘤委员会通过根据癌症类型以外的生物标记物量身定做治疗方法,加速了有效治疗的获得。需要手动管理的非结构化数据阻碍了生物标记物分析用于治疗匹配的有效使用。这项研究探索了在RGTS中使用大语言模型(LLM)来构建用于精确医学的数字双胞胎。我们的概念验证数字双胞胎系统集成了来自机构和已发表病例(n=21)的临床和生物标记物数据以及来自文献的数据(n=655篇出版物,n=404,265名患者),以创建转移性子宫癌肉瘤的定制治疗计划,确定传统的单一来源分析可能遗漏的选择。启用LLM的数字双胞胎有效地对单个患者的轨迹进行建模。转向基于生物学而不是基于器官的肿瘤定义可以实现个性化护理,从而促进RGT管理,从而提高患者的预后。

[NLP-94] How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?
[NLP-94] 文本提示的多样化可解释性如何影响医学视觉语言零镜头任务?

链接: https://arxiv.org/abs/2409.00543
作者: Sicheng Wang,Che Liu,Rossella Arcucci
关键词-EN: image-text pair pre-training, Recent advancements, medical vision-language pre-training, vision-language pre-training, pair pre-training
关键词-ZH: 图像-文本配对预训练,最新进展,医学视觉-语言预训练,视觉-语言预训练,配对预训练
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely-used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models’ performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.
摘要:医学视觉-语言预训练(MedVLP)的最新进展通过利用大规模的医学图文对预训练,显著增强了诸如图像分类等零镜头医学视觉任务。然而,这些任务的性能可能会受到描述类别的文本提示的可变性的严重影响,因此需要在MedVLP模型中对不同的提示风格保持稳健性。然而,这种敏感性仍未得到充分挖掘。在这项工作中,我们首次系统地评估了三种广泛使用的MedVLP方法对15种不同疾病的各种提示的敏感性。为了实现这一点,我们设计了六种独特的提示风格来反映真实的临床场景,并随后根据可解释性进行了排名。我们的研究结果表明,所有被评估的MedVLP模型在不同的提示风格上表现出不稳定的表现,这表明缺乏稳健性。此外,模型的表现随着即时可解释性的增加而变化,揭示了理解复杂医学概念的困难。这项研究强调了进一步发展MedVLP方法的必要性,以增强其对各种零射提示的稳健性。
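
摘要考察的是零样本打分对提示模板风格的敏感性。下面给出一个评估思路的简化示意(Python/NumPy;text_encoder 为假设的接口,模板仅为示例):

```python
import numpy as np

def zero_shot_scores(image_emb, text_encoder, class_names, templates):
    """示意:对同一组类别套用不同风格的提示模板,比较零样本打分随提示风格的波动。
    text_encoder(str) -> 文本向量,为假设的接口;实际 MedVLP 模型各不相同。"""
    scores = {}
    for tpl in templates:
        class_embs = np.stack([text_encoder(tpl.format(c)) for c in class_names])
        scores[tpl] = class_embs @ image_emb        # 每个类别的相似度得分
    return scores

# 用法示例(以随机向量代替真实编码器)
enc = lambda s: np.random.rand(64)
templates = ["a chest X-ray showing {}", "findings consistent with {}"]
print(zero_shot_scores(np.random.rand(64), enc, ["pneumonia", "edema"], templates))
```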

[NLP-95] Post-OCR Text Correction for Bulgarian Historical Documents
[NLP-95] 保加利亚历史文件的OCR后文本更正

链接: https://arxiv.org/abs/2409.00527
作者: Angel Beshirov,Milena Dobreva,Dimitar Dimitrov,Momchil Hardalov,Ivan Koychev,Preslav Nakov
关键词-EN: Optical Character Recognition, OCR text correction, crucial for preserving, preserving the cultural, cultural heritage
关键词-ZH: 光学字符识别、OCR文本纠正,对于保存、保存文化、文化遗产至关重要
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: Accepted for publication in the International Journal on Digital Libraries

点击查看摘要

Abstract:The digitization of historical documents is crucial for preserving the cultural heritage of the society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem as standard OCR tools are not tailored to deal with historical orthography as well as with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary literature Bulgarian texts. We then use state-of-the-art LLMs and encoder-decoder framework which we augment with diagonal attention loss and copy and coverage mechanisms to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at this https URL.
摘要:历史文献的数字化是保护社会文化遗产的关键。这一过程中的一个重要步骤是使用光学字符识别(OCR)将扫描的图像转换为文本,这可以实现进一步的搜索、信息提取等。不幸的是,这是一个困难的问题,因为标准的OCR工具不能处理历史正字法以及具有挑战性的布局。因此,在处理此类文档时,对OCR输出应用额外的文本校正步骤是标准的。在这项工作中,我们专注于保加利亚语,我们创建了第一个基准数据集,用于评估用保加利亚第一个标准化正字法–19世纪的Drinov正字法–书写的历史保加利亚文献的OCR文本校正。我们进一步开发了一种方法,通过利用大量的当代保加利亚文学文本,自动生成该正字法以及随后的伊万切夫正字法中的合成数据。然后,我们使用最先进的LLMS和编解码器框架,并通过对角线注意力损失和复制和覆盖机制来增强OCR后的文本校正。该方法减少了识别过程中引入的错误,文档质量提高了25%,与ICDAR 2019年保加利亚数据集的最新水平相比提高了16%。我们在此HTTPS URL上发布我们的数据和代码。

[NLP-96] LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models KR
[NLP-96] LongRecipe:在大型语言模型中进行高效长上下文概括的食谱

链接: https://arxiv.org/abs/2409.00509
作者: Zhiyuan Hu,Yuliang Liu,Jinman Zhao,Suyuchen Wang,Yan Wang,Wei Shen,Qing Gu,Anh Tuan Luu,See-Kiong Ng,Zhiwei Jiang,Bryan Hooi
关键词-EN: Large language models, face significant challenges, Large language, context window, context window size
关键词-ZH: 大型语言模型,面临重大挑战,大型语言,上下文窗口,上下文窗口大小
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model’s understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resource over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM’s capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at the [link](this https URL).
摘要:大型语言模型在处理长上下文任务时面临着巨大的挑战,因为它们在预训练期间的有效上下文窗口大小有限,这限制了它们在扩展序列上的泛化能力。同时,通过后置培训来扩展LLMS中的上下文窗口是高度资源密集型的。针对这一问题,我们引入了一种扩展LLMS上下文窗口的有效训练策略LongRecipe,包括有效的令牌分析、位置索引转换和训练优化策略。它在保持训练效率的同时模拟长序列输入,并显著提高了模型对长期相关性的理解。在三种类型的LLMS上的实验表明,LongRecipe可以在只需要30%的目标上下文窗口大小的情况下利用长序列,并且与全序列训练相比,减少了85%以上的计算训练资源。此外,LongRecipe还保留了原始LLM在一般任务中的能力。最终,我们可以将开源LLMS的有效上下文窗口从8k扩展到128k,只需使用80G内存的单个GPU进行一天的专门培训,就可以实现接近GPT-4的性能。我们的代码在[链接](此HTTPS URL)上发布。

[NLP-97] With Good MT There is No Need For End-to-End: A Case for Translate-then-Summarize Cross-lingual Summarization
[NLP-97] 有了好的MT,就不需要端到端:翻译然后总结跨语言总结的案例

链接: https://arxiv.org/abs/2409.00414
作者: Daniel Varab,Christian Hardmeier
关键词-EN: traditional pipelined designs, competitive solutions, traditional pipelined, cross-lingual summarization, pipelined designs
关键词-ZH: 传统流水线设计、竞争解决方案、传统流水线、跨语言总结、流水线设计
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work has suggested that end-to-end system designs for cross-lingual summarization are competitive solutions that perform on par or even better than traditional pipelined designs. A closer look at the evidence reveals that this intuition is based on the results of only a handful of languages or using underpowered pipeline baselines. In this work, we compare these two paradigms for cross-lingual summarization on 39 source languages into English and show that a simple translate-then-summarize pipeline design consistently outperforms even an end-to-end system with access to enormous amounts of parallel data. For languages where our pipeline model does not perform well, we show that system performance is highly correlated with publicly distributed BLEU scores, allowing practitioners to establish the feasibility of a language pair a priori. Contrary to recent publication trends, our result suggests that the combination of individual progress of monolingual summarization and translation tasks offers better performance than an end-to-end system, suggesting that end-to-end designs should be considered with care.
摘要:最近的研究表明,用于跨语言摘要的端到端系统设计是具有竞争力的解决方案,其性能与传统流水线设计不相上下,甚至更好。仔细看一下证据就会发现,这种直觉是基于少数几种语言的结果,或者是使用了动力不足的流水线基线。在这项工作中,我们比较了将39种源语言摘要为英语的这两种跨语言摘要范式,结果表明,简单的"先翻译、后摘要"流水线设计始终优于即使能够访问大量并行数据的端到端系统。对于我们的流水线模型表现不佳的语言,我们表明系统性能与公开分布的BLEU分数高度相关,从而允许实践者先验地建立语言对的可行性。与最近的出版趋势相反,我们的结果表明,单语摘要和翻译任务的个人进展相结合的表现比端到端系统更好,这表明端到端设计应该被仔细考虑。
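
摘要主张的"先翻译、后摘要"流水线本身非常简单,可用如下示意代码表达(Python;translate、summarize 为假设的外部系统,例如一个机器翻译模型和一个英文单语摘要模型):

```python
def translate_then_summarize(source_text, translate, summarize):
    """最小示意:"先翻译、后摘要"的跨语言摘要流水线;
    translate 与 summarize 为假设的外部系统。"""
    english_text = translate(source_text)   # 源语言 -> 英语
    return summarize(english_text)          # 在英语端做单语摘要

# 用法示例(用占位函数代替真实系统)
print(translate_then_summarize("...", lambda x: "translated text",
                               lambda x: "summary of " + x))
```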

[NLP-98] Rethinking Backdoor Detection Evaluation for Language Models
[NLP-98] 重新思考语言模型的后门检测评估

链接: https://arxiv.org/abs/2409.00399
作者: Jun Yan,Wenjie Jacky Mo,Xiang Ren,Robin Jia
关键词-EN: major security risk, publicly released language, model behaves maliciously, released language models, attacker-specified trigger
关键词-ZH: 重大安全风险、公开发布的语言、模型行为恶意、已发布的语言模型、攻击者指定的触发器
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.
摘要:后门攻击是指模型在给定攻击者指定的触发器后进行恶意行为,对于依赖公开发布的语言模型的从业者来说,这是一个主要的安全风险。后门检测方法旨在检测发布的模型是否包含后门,以便从业者可以避免此类漏洞。虽然现有的后门检测方法在检测标准基准上的后门模型方面具有很高的准确性,但尚不清楚它们是否能够有力地识别野外的后门。在本文中,我们通过操纵后门种植过程中的不同因素来检查后门检测器的稳健性。我们发现,现有方法的成功高度取决于模型在后门种植期间对有毒数据的训练程度。具体地说,植入了更具攻击性或更保守训练的后门比默认后门更难检测到。我们的结果突显了现有后门检测器缺乏稳健性,以及当前基准构建的局限性。

[NLP-99] An Empirical Study on Information Extraction using Large Language Models
[NLP-99] 使用大型语言模型的信息提取实证研究

链接: https://arxiv.org/abs/2409.00369
作者: Ridong Han,Chaohao Yang,Tao Peng,Prayag Tiwari,Xiang Wan,Lu Liu,Benyou Wang
关键词-EN: large language models, natural language processing, OpenAI GPT family, Human-like large language, information extraction ability
关键词-ZH: 大型语言模型、自然语言处理、OpenAI GPT家族、类人大型语言、信息提取能力
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2305.14450

点击查看摘要

Abstract:Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI’s GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs’ information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs’ human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods’ effectiveness and some of their remaining issues in improving GPT-4’s information extraction ability.
摘要:类人大语言模型,特别是OpenAI的GPT家族中最强大、最流行的语言模型,已经被证明对许多与自然语言处理(NLP)相关的任务非常有帮助。因此,已经进行了各种尝试来将LLMS应用于信息提取(IE),这是涉及从非结构化纯文本中提取信息的基本NLP任务。为了展示LLMS信息提取能力的最新代表性进展,我们从四个角度对GPT-4(本文写作时的最新版本)的信息提取能力进行了评估:性能、评估标准、稳健性和错误类型。我们的结果表明,GPT-4和最先进的IE方法(SOTA)之间存在明显的性能差距。为了缓解这一问题,考虑到LLMS的类人特性,我们提出并分析了一系列简单的基于提示的方法的效果,这些方法可以推广到其他LLMS和NLP任务。大量实验表明,这些方法在提高GPT-4信息抽取能力方面是有效的,但也仍存在一些有待解决的问题。
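
下面给出一个基于提示的信息抽取的极简示意(Python;llm 为假设的文本生成接口,提示词为本文自拟),仅用于说明这类方法的基本形式,论文中实际使用的提示设计更为细致:

```python
def extract_entities(llm, sentence, entity_types=("PERSON", "ORG", "LOC")):
    """基于提示的信息抽取示意:要求模型按 '类型: 实体' 的固定格式逐行输出,再解析。
    llm 为假设的文本生成接口。"""
    prompt = ("请从下面的句子中抽取命名实体,每行输出一条,格式为 '类型: 实体',"
              "类型限于 " + ", ".join(entity_types) + "。\n句子:" + sentence + "\n输出:")
    results = []
    for line in llm(prompt).splitlines():
        if ":" in line:
            etype, text = line.split(":", 1)
            if etype.strip() in entity_types:
                results.append((etype.strip(), text.strip()))
    return results
```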

[NLP-100] Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models
[NLP-100] 使用面向解码器模型的低秩方言适配器预测游戏对话的目标词

链接: https://arxiv.org/abs/2409.00358
作者: Dipankar Srirag,Aditya Joshi,Jacob Eisenstein
关键词-EN: LLMs for NLU, national varieties, sake of brevity, NLU tasks, reported for encoder
关键词-ZH: NLU的LLM、国家品种、为了简洁起见、NLU任务、为编码器报告
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 Figures, 5 Tables

点击查看摘要

Abstract:Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties (‘dialects’ for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speakers, our task is Target Word Prediction (TWP) from a masked conversation. LoRDD combines task adapters and dialect adapters where the latter employ contrastive learning on pseudo-parallel conversations from MD-3. Our results for en-IN conversations on two models (Mistral and Gemma) show that LoRDD outperforms four baselines on TWP, while bridging the performance gap with en-US by 12% on word similarity and 25% on accuracy. The focused contribution of LoRDD is in its promise for dialect adaptation of decoder models.
摘要:已针对编码器模型报告了方言适配器,可以提高LLM在某些社会观点/方言/国家变种(为了简洁起见,称为“方言”)上执行NLU任务的性能。在本文中,我们将方言适配器的想法扩展到称为LoRDD的架构中的解码器模型。使用MD-3(一个公开可用的方言说话者之间玩文字游戏对话的数据集),我们的任务是从蒙面对话中预测目标词(TWP)。LoRDD结合了任务适配器和方言适配器,后者对MD-3的伪并行对话采用对比学习。我们对两种模型(Mistral和Gemma)的en-IN对话结果显示,LoRDD在TWP上的表现优于四个基线,同时在单词相似度方面与en-US的性能差距缩小了12%,在准确性方面缩小了25%。LoRDD的重点贡献在于其对解码器模型的方言改编的承诺。

[NLP-101] YA-TA: Towards Personalized Question-Answering Teaching Assistants using Instructor-Student Dual Retrieval-augmented Knowledge Fusion
[NLP-101] YA-TA:利用师生双重检索增强知识融合实现个性化问答助教

链接: https://arxiv.org/abs/2409.00355
作者: Dongil Yang,Suyeon Lee,Minjin Kim,Jungsoo Won,Namyoung Kim,Dongha Lee,Jinyoung Yeo
关键词-EN: enhancing students’academic performance, Virtual Teaching Assistant, students’academic performance, plays a crucial, crucial role
关键词-ZH: 提高学生的学习成绩,虚拟助教,学生的学习成绩,发挥着至关重要的作用
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Engagement between instructors and students plays a crucial role in enhancing students’ academic performance. However, instructors often struggle to provide timely and personalized support in large classes. To address this challenge, we propose a novel Virtual Teaching Assistant (VTA) named YA-TA, designed to offer responses to students that are grounded in lectures and are easy to understand. To facilitate YA-TA, we introduce the Dual Retrieval-augmented Knowledge Fusion (DRAKE) framework, which incorporates dual retrieval of instructor and student knowledge and knowledge fusion for tailored response generation. Experiments conducted in real-world classroom settings demonstrate that the DRAKE framework excels in aligning responses with knowledge retrieved from both instructor and student sides. Furthermore, we offer additional extensions of YA-TA, such as a QA board and self-practice tools to enhance the overall learning experience. Our video is publicly available.
摘要:教师和学生之间的互动对于提高学生的学业成绩发挥着至关重要的作用。然而,教师往往很难在大班中提供及时和个性化的支持。为了应对这一挑战,我们提出了一种名为YA-TA的新型虚拟助教(VTA),旨在为学生提供基于讲座且易于理解的回复。为了促进YA-TA,我们引入了双重检索增强知识融合(DRAKE)框架,该框架结合了教师和学生知识的双重检索以及知识融合以生成定制响应。在现实世界的课堂环境中进行的实验表明,DRAKE框架擅长将反应与从教师和学生方面检索到的知识保持一致。此外,我们还提供YA-TA的额外扩展,例如QA板和自我练习工具,以增强整体学习体验。我们的视频已公开。

[NLP-102] Does Alignment Tuning Really Break LLMs Internal Confidence?
[NLP-102] 对齐调整真的会破坏LLM的内部信心吗?

链接: https://arxiv.org/abs/2409.00352
作者: Hongseok Oh,Wonseok Hwang
关键词-EN: Large Language Models, Large Language, shown remarkable progress, real-world application necessitates, application necessitates reliable
关键词-ZH: 大型语言模型,大型语言,显示出显着的进步,现实世界的应用需要,应用需要可靠
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration. This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods. Initial analysis showed that the relationship between alignment and calibration is not always a trade-off, but under stricter analysis conditions, we found the alignment process consistently harms calibration. This highlights the need for (1) a careful approach when measuring model confidences and calibration errors and (2) future research into algorithms that can help LLMs to achieve both instruction-following and calibration without sacrificing either.
摘要:大型语言模型(LLM)已经取得了显著的进步,但其现实世界的应用需要可靠的校准。本研究对LLM的校准退化进行了四个维度的全面分析:模型、校准指标、任务和置信度提取方法。初步分析表明,对齐和校准之间的关系并不总是一种权衡,但在更严格的分析条件下,我们发现对齐过程始终会损害校准。这凸显了以下方面的必要性:(1)在测量模型置信度和校准误差时采取谨慎的方法;(2)未来对算法进行研究,以帮助LLM同时实现指令遵循和校准,而不牺牲其中任何一方。
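
作为参考,下面给出常用的期望校准误差(ECE)的一种标准计算方式(Python/NumPy),可用于量化摘要中讨论的校准退化;论文具体采用的校准指标以原文为准:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """期望校准误差(ECE):按置信度分箱,累加各箱内"平均置信度与准确率之差"的加权和。
    confidences: 模型置信度 (0~1);correct: 对应预测是否正确 (0/1)。"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# 用法示例
print(expected_calibration_error([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 1]))
```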

[NLP-103] Chatting Up Attachment: Using LLMs to Predict Adult Bonds
[NLP-103] 聊天依恋:使用LLM预测成人关系

链接: https://arxiv.org/abs/2409.00347
作者: Paulo Soares,Sean McCurdy,Andrew J. Gerber,Peter Fonagy
关键词-EN: Obtaining data, field is challenging, making the adoption, slow and high-risk, medical field
关键词-ZH: 获取数据,该领域具有挑战性,使医疗领域的采用缓慢且高风险
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Obtaining data in the medical field is challenging, making the adoption of AI technology within the space slow and high-risk. We evaluate whether we can overcome this obstacle with synthetic data generated by large language models (LLMs). In particular, we use GPT-4 and Claude 3 Opus to create agents that simulate adults with varying profiles, childhood memories, and attachment styles. These agents participate in simulated Adult Attachment Interviews (AAI), and we use their responses to train models for predicting their underlying attachment styles. We evaluate our models using a transcript dataset from 9 humans who underwent the same interview protocol, analyzed and labeled by mental health professionals. Our findings indicate that training the models using only synthetic data achieves performance comparable to training the models on human data. Additionally, while the raw embeddings from synthetic answers occupy a distinct space compared to those from real human responses, the introduction of unlabeled human data and a simple standardization allows for a closer alignment of these representations. This adjustment is supported by qualitative analyses and is reflected in the enhanced predictive accuracy of the standardized embeddings.
摘要:在医疗领域获取数据具有挑战性,使得AI技术在空间内的采用速度慢且风险高。我们评估是否可以用大型语言模型(LLM)生成的合成数据克服这一障碍。特别是,我们使用GPT-4和Claude 3 Opus来创建代理,以模拟具有不同档案、童年记忆和依恋风格的成年人。这些代理参与模拟成人依恋访谈(AAI),我们使用他们的反应来训练模型,以预测他们潜在的依恋风格。我们使用9名接受相同采访方案的人的文字记录数据集来评估我们的模型,这些人由心理健康专业人员分析和标记。我们的发现表明,仅使用合成数据训练模型获得的性能与使用人类数据训练模型的性能相当。此外,尽管来自合成答案的原始嵌入与来自真实人类反应的原始嵌入占据了不同的空间,但引入未标记的人类数据和简单的标准化允许这些表示更紧密地对齐。这一调整得到了定性分析的支持,并反映在标准化嵌入的预测精度提高上。

[NLP-104] Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories
[NLP-104] 评估大型语言模型在表示和理解运动轨迹方面的有效性

链接: https://arxiv.org/abs/2409.00335
作者: Yuhan Ji,Song Gao
关键词-EN: Dynamic Time Warping, focuses on assessing, assessing the ability, Time Warping distances, foundation models
关键词-ZH: 动态时间扭曲,专注于评估、评估能力、时间扭曲距离、基础模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.
摘要:本研究的重点是评估人工智能基础模型表示运动轨迹的能力。我们利用一种大型语言模型(LLM)(即GPT-J)来编码轨迹的字符串格式,然后评估基于LLM的表示对于轨迹数据分析的有效性。实验表明,虽然基于LLM的嵌入能够保持特定的轨迹距离度量(即由GPT-J嵌入得到的余弦距离与原始轨迹上的Hausdorff距离和动态时间弯曲距离之间的相关系数超过0.74),但在运动轨迹分析中恢复数值和检索空间邻域仍然存在挑战。此外,LLMS能够理解轨迹中包含的时空相关性,并在位置预测任务中具有良好的精度。这项研究突出了在捕捉底层地理空间数据的细微差别和复杂性以及整合领域知识以支持使用LLMS的各种GeoAI应用方面需要改进的必要性。
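
摘要中的评估思路(比较嵌入空间余弦距离与原始轨迹几何距离的相关性)可以用如下示意代码复现一个简化版本(Python,依赖 SciPy;此处以 Hausdorff 距离为例,DTW 可类似替换):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.stats import pearsonr

def embedding_vs_hausdorff_corr(embeddings, trajectories):
    """评估思路的示意:计算"嵌入空间余弦距离"与"原始轨迹 Hausdorff 距离"的相关系数。
    embeddings: (n, d) 数组;trajectories: n 条形如 (m_i, 2) 的坐标序列。"""
    cos_d, geo_d = [], []
    for i in range(len(trajectories)):
        for j in range(i + 1, len(trajectories)):
            a, b = embeddings[i], embeddings[j]
            cos_d.append(1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            h = max(directed_hausdorff(trajectories[i], trajectories[j])[0],
                    directed_hausdorff(trajectories[j], trajectories[i])[0])
            geo_d.append(h)
    return pearsonr(cos_d, geo_d)[0]
```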

[NLP-105] WikiCausal: Corpus and Evaluation Framework for Causal Knowledge Graph Construction ISWC2024
[NLP-105] WikiCausal:因果知识图构建的数据库和评估框架

链接: https://arxiv.org/abs/2409.00331
作者: Oktie Hassanzadeh
关键词-EN: causal knowledge graphs, knowledge graph construction, causal knowledge, domain-specific causal knowledge, knowledge graphs
关键词-ZH: 因果知识图、知识图构建、因果知识、特定领域因果知识、知识图
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Extended version; poster paper accepted at ISWC 2024

点击查看摘要

Abstract:Recently, there has been an increasing interest in the construction of general-domain and domain-specific causal knowledge graphs. Such knowledge graphs enable reasoning for causal analysis and event prediction, and so have a range of applications across different domains. While great progress has been made toward automated construction of causal knowledge graphs, the evaluation of such solutions has either focused on low-level tasks (e.g., cause-effect phrase extraction) or on ad hoc evaluation data and small manual evaluations. In this paper, we present a corpus, task, and evaluation framework for causal knowledge graph construction. Our corpus consists of Wikipedia articles for a collection of event-related concepts in Wikidata. The task is to extract causal relations between event concepts from the corpus. The evaluation is performed in part using existing causal relations in Wikidata to measure recall, and in part using Large Language Models to avoid the need for manual or crowd-sourced evaluation. We evaluate a pipeline for causal knowledge graph construction that relies on neural models for question answering and concept linking, and show how the corpus and the evaluation framework allow us to effectively find the right model for each task. The corpus and the evaluation framework are publicly available.
摘要:最近,人们对构造一般领域和特定领域的因果知识图越来越感兴趣。这样的知识图支持因果分析和事件预测的推理,因此在不同的领域有一系列的应用。虽然在自动构建因果知识图方面取得了很大进展,但这种解决方案的评估要么集中在低级别任务(例如,因果短语提取)上,要么集中在临时评估数据和小型手动评估上。在本文中,我们提出了一个构建因果知识图的语料库、任务和评估框架。我们的语料库由维基百科文章组成,这些文章收集了维基数据中与事件相关的概念。任务是从语料库中提取事件概念之间的因果关系。评估部分是使用维基数据中现有的因果关系来衡量回忆,部分是使用大型语言模型来避免手动或众包评估的需要。我们评估了依赖神经模型进行问题回答和概念链接的因果知识图构建管道,并展示了语料库和评估框架如何使我们能够有效地为每个任务找到正确的模型。语料库和评价框架是公开提供的。

[NLP-106] From Prediction to Application: Language Model-based Code Knowledge Tracing with Domain Adaptive Pre-Training and Automatic Feedback System with Pedagogical Prompting for Comprehensive Programming Education
[NLP-106] 从预测到应用:基于语言模型的代码知识追踪与领域自适应预训练,以及面向综合编程教育的教学提示自动反馈系统

链接: https://arxiv.org/abs/2409.00323
作者: Unggi Lee,Jiyeong Bae,Yeonji Jung,Minji Kang,Gyuri Byun,Yeonseo Lee,Dohee Kim,Sookbun Lee,Jaekwon Park,Taekyung Ahn,Gunho Lee,Hyeoncheol Kim
关键词-EN: Code Knowledge Tracing, Knowledge Tracing, traditional approaches face, approaches face limitations, model-based Knowledge Tracing
关键词-ZH: 代码知识追踪,知识追踪,传统方法面临,方法面临局限,基于模型的知识追踪
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Knowledge Tracing (KT) is a critical component in online learning, but traditional approaches face limitations in interpretability and cross-domain adaptability. This paper introduces Language Model-based Code Knowledge Tracing (CodeLKT), an innovative application of Language model-based Knowledge Tracing (LKT) to programming education. CodeLKT leverages pre-trained language models to process learning data, demonstrating superior performance over existing KT and Code KT models. We explore Domain Adaptive Pre-Training (DAPT) and Task Adaptive Pre-Training (TAPT), showing enhanced performance in the coding domain and investigating cross-domain transfer between mathematics and coding. Additionally, we present an theoretically-informed integrated system combining CodeLKT with large language models to generate personalized, in-depth feedback to support students’ programming learning. This work advances the field of Code Knowledge Tracing by expanding the knowledge base with language model-based approach and offering practical implications for programming education through data-informed feedback.
摘要:知识追踪(KT)是在线学习的重要组成部分,但传统方法在可解释性和跨域适应性方面存在局限性。本文介绍了基于语言模型的代码知识跟踪(CodeLKT),它是基于语言模型的知识跟踪(LKT)在编程教育中的一种创新应用。CodeLKT利用预先训练的语言模型来处理学习数据,表现出优于现有KT和Code KT模型的性能。我们探索了领域自适应预训练(DAPT)和任务自适应预训练(TAPT),展示了编码领域的增强性能,并研究了数学和编码之间的跨域迁移。此外,我们提出了一个以理论为依据的集成系统,将CodeLKT与大型语言模型相结合,以生成个性化的、深入的反馈,支持学生的编程学习。这项工作通过基于语言模型的方法扩展了知识库,并通过基于数据的反馈为编程教育提供了实践意义,从而推动了代码知识跟踪领域的发展。

[NLP-107] An Empirical Study on Context Length for Open-Domain Dialog Generation
[NLP-107] 开放域对话生成的上下文长度实证研究

链接: https://arxiv.org/abs/2409.00315
作者: Xinyi Shen,Zuoquan Lin
关键词-EN: recent years, increasingly popular, popular in recent, context, Transformer-based open-domain dialog
关键词-ZH: 近年来,越来越受欢迎,在最近的背景下,基于Transformer的开放领域对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of a dialog history. However, there is no criterion to decide how many utterances should be kept adequate in a context. We try to figure out how the choice of context length affects the model. We experiment on three questions from coarse to fine: (i) Does longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models.
摘要:近年来,基于Transformer的开放域对话模型越来越受欢迎。这些模型通常将上下文表示为对话历史的拼接。然而,目前没有标准来决定上下文中应保留多少条历史话语才算合适。我们试图弄清楚上下文长度的选择如何影响模型。我们对从粗到细的三个问题进行了实验:(i)更长的上下文是否有助于模型训练?(ii)处理不同上下文长度的对话时,是否有必要改变训练上下文长度?(iii)不同的对话样本对上下文长度的偏好是否相同?我们的实验结果表明,在实现基于Transformer的对话模型时,上下文长度是一个经常被忽视的设置,值得关注。

[NLP-108] REFFLY: Melody-Constrained Lyrics Editing Model
[NLP-108] REFFLY:旋律约束歌词编辑模型

链接: https://arxiv.org/abs/2409.00292
作者: Songyan Zhao,Bingxuan Li,Yufei Tian,Nanyun Peng
关键词-EN: generation aims, aims to produce, produce lyrics, lyrics, Automatic
关键词-ZH: 一代目标,旨在制作,制作歌词,歌词,自动
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic melody-to-lyric generation aims to produce lyrics that align with a given melody. Although previous work can generate lyrics based on high-level control signals, such as keywords or genre, they often struggle with three challenges: (1) lack of controllability, as prior works are only able to produce lyrics from scratch, with little or no control over the content; (2) inability to generate fully structured songs with the desired format; and (3) failure to align prominent words in the lyrics with prominent notes in the melody, resulting in poor lyrics-melody alignment. In this work, we introduce REFFLY (REvision Framework For Lyrics), the first revision framework designed to edit arbitrary forms of plain text draft into high-quality, full-fledged song lyrics. Our approach ensures that the generated lyrics retain the original meaning of the draft, align with the melody, and adhere to the desired song structures. We demonstrate that REFFLY performs well in diverse task settings, such as lyrics revision and song translation. Experimental results show that our model outperforms strong baselines, such as Lyra (Tian et al. 2023) and GPT-4, by 25% in both musicality and text quality.
摘要:自动旋律到歌词生成的目标是生成与给定旋律一致的歌词。虽然以前的工作可以基于诸如关键字或流派等高级控制信号来生成歌词,但它们经常面临三个挑战:(1)缺乏可控性,因为以前的作品只能从头开始生成歌词,对内容几乎没有控制;(2)无法生成具有所需格式的完整结构的歌曲;以及(3)未能将歌词中的突出词语与旋律中的突出音符对齐,导致歌词-旋律对齐不良。在这项工作中,我们引入了REFFLY(歌词修订框架),这是第一个修订框架,旨在将任意形式的纯文本草稿编辑成高质量、成熟的歌词。我们的方法确保生成的歌词保留了草稿的原始含义,与旋律保持一致,并符合所需的歌曲结构。我们证明了REFFLY在不同的任务设置中表现良好,例如歌词修改和歌曲翻译。实验结果表明,我们的模型在音乐性和文本质量方面均比Lyra(Tian et al. 2023)和GPT-4等强基线高出25%。

[NLP-109] OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters
[NLP-109] OnlySportsLM:在十亿个参数下利用SOTA性能优化体育领域语言模型

链接: https://arxiv.org/abs/2409.00286
作者: Zexin Chen,Chengxi Li,Xiangyu Xie,Parijat Dube
关键词-EN: model trained exclusively, OnlySports Dataset, paper explores, explores the potential, trained exclusively
关键词-ZH: 独家训练的模型,OnlySports Dataset,论文探索,探索潜力,独家训练
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark. Our approach involves: 1) creating a massive 600 billion tokens OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M parameters model with 20-layer, 640-dimension structure, 3) training the OnlySportsLM on part of OnlySports Dataset, and 4) testing the resultant model on OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SomlLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.
摘要:本文探讨了一种专门针对体育相关数据进行训练的小型特定领域语言模型的潜力。我们研究了具有特殊设计的小模型结构的大量训练数据是否能够克服模型大小的限制。这项研究介绍了OnlySports集合,包括OnlySportsLM、OnlySports数据集和OnlySports基准。我们的方法包括:1)从FineWeb创建一个包含6000亿个Token的海量OnlySports数据集;2)针对体育相关任务优化RWKV架构,得到一个20层640维结构的196M参数模型;3)在OnlySports数据集的部分数据上训练OnlySportsLM;4)在OnlySports基准上测试得到的模型。OnlySportsLM的准确率比之前的135M/360M最先进模型提高了37.62%/34.08%,在体育领域与SomlLM 1.7B和Qwen 1.5B等更大模型的性能不相上下。此外,OnlySports集合提供了用于构建高质量、特定于领域的语言模型的全面工作流程,为跨各个专业领域的高效人工智能开发提供了可复制的蓝图。

[NLP-110] Simple stochastic processes behind Menzeraths Law
[NLP-110] 孟泽拉斯定律背后的简单随机过程

链接: https://arxiv.org/abs/2409.00279
作者: Jiří Milička
关键词-EN: revisits Menzerath Law, paper revisits Menzerath, Menzerath Law, revisits Menzerath, Menzerath-Altmann Law
关键词-ZH: 重温门泽拉斯定律,论文重温门泽拉斯、门泽拉斯定律,重温门泽拉斯、门泽拉斯-奥尔特曼定律
类目: Computation and Language (cs.CL)
备注: The paper was presented at QUALICO 2023, Lausanne. This manuscript has been submitted to the proceedings of this conference. Full scale figures: this http URL

点击查看摘要

Abstract:This paper revisits Menzerath’s Law, also known as the Menzerath-Altmann Law, which models a relationship between the length of a linguistic construct and the average length of its constituents. Recent findings indicate that simple stochastic processes can display Menzerathian behaviour, though existing models fail to accurately reflect real-world data. If we adopt the basic principle that a word can change its length in both syllables and phonemes, where the correlation between these variables is not perfect and these changes are of a multiplicative nature, we get bivariate log-normal distribution. The present paper shows, that from this very simple principle, we obtain the classic Altmann model of the Menzerath-Altmann Law. If we model the joint distribution separately and independently from the marginal distributions, we can obtain an even more accurate model by using a Gaussian copula. The models are confronted with empirical data, and alternative approaches are discussed.
摘要:本文重新探讨了门泽拉斯定律,也称为门泽拉斯-奥尔特曼定律,该定律建模了语言结构的长度与其成分平均长度之间的关系。最近的研究结果表明,简单的随机过程可以表现出门泽拉斯式的行为,尽管现有模型无法准确反映现实世界的数据。如果我们采用一个词可以在音节数和音素数两个层面上改变其长度的基本原则,而这些变量之间的相关性并不完美,并且这些变化具有乘性,我们就会得到二元对数正态分布。本文表明,从这个非常简单的原理,我们得到了门泽拉斯-奥尔特曼定律的经典奥尔特曼模型。如果我们单独且独立于边缘分布对联合分布进行建模,我们可以通过使用高斯Copula获得更准确的模型。我们将这些模型与经验数据进行对照检验,并讨论了替代方法。
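For readers who want the concrete functional form referred to above, the classic Altmann model of the Menzerath-Altmann Law is usually written as the fitted curve below. This is the standard textbook formulation quoted for orientation, not an equation reproduced from this particular paper; a, b, c are free parameters estimated from data, with b and c typically coming out negative so that constituents shorten as constructs grow.

```latex
% Classic Altmann model of the Menzerath-Altmann Law (standard textbook form)
% x: length of the construct (e.g., word length in syllables)
% y: mean length of its constituents (e.g., syllable length in phonemes)
y(x) = a \, x^{b} \, e^{c x}
```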

[NLP-111] Towards a dynamical model of English vowels. Evidence from diphthongisation
[NLP-111] 迈向英语元音的动态模型:来自双元音化的证据

链接: https://arxiv.org/abs/2409.00275
作者: Patrycja Strycharczuk,Sam Kirkham,Emily Gorman,Takayuki Nagamine
关键词-EN: inherent dynamic change, synchronically and diachronically, vice versa, Diphthong vowels exhibit, exhibit a degree
关键词-ZH: 内在的动态变化,共时和历时,反之亦然,双元音表现出,表现出一定程度
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Diphthong vowels exhibit a degree of inherent dynamic change, the extent of which can vary synchronically and diachronically, such that diphthong vowels can become monophthongs and vice versa. Modelling this type of change requires defining diphthongs in opposition to monophthongs. However, formulating an explicit definition has proven elusive in acoustics and articulation, as diphthongisation is often gradient in these domains. In this study, we consider whether diphthong vowels form a coherent phonetic category from the articulatory point of view. We present articulometry and acoustic data from six speakers of Northern Anglo-English producing a full set of phonologically long vowels. We analyse several measures of diphthongisation, all of which suggest that diphthongs are not categorically distinct from long monophthongs. We account for this observation with an Articulatory Phonology/Task Dynamic model in which diphthongs and long monophthongs have a common gestural representation, comprising two articulatory targets in each case, but they differ according to gestural constriction and location of the component gestures. We argue that a two-target representation for all long vowels is independently supported by phonological weight, as well as by the nature of historical diphthongisation and present-day dynamic vowel variation in British English.
摘要:双元音表现出一定程度的内在动态变化,其程度在共时和历时层面都可能不同,使得双元音可以变成单元音,反之亦然。对这种变化进行建模,需要在与单元音相对立的意义上定义双元音。然而,在声学和发音层面一直难以给出明确的定义,因为双元音化在这些领域往往是渐变的。在这项研究中,我们从发音的角度来考虑双元音是否构成一个连贯的语音范畴。我们提供了六位英格兰北部英语使用者发出全套音系长元音时的发音测量和声学数据。我们分析了几种双元音化的衡量指标,所有这些指标都表明,双元音与长单元音并没有范畴上的区别。我们用发音音系学/任务动力学模型解释了这一观察结果:在该模型中,双元音和长单元音具有共同的手势表征,每种情况下都包括两个发音目标,但二者在手势收缩程度和组成手势的位置上有所不同。我们认为,所有长元音的双目标表征不仅得到音系权重的独立支持,也与英国英语中历史上的双元音化以及当今动态元音变异的性质相一致。

[NLP-112] Finding frames with BERT: A transformer-based approach to generic news frame detection
[NLP-112] 使用BERT查找框架:基于变换器的通用新闻框架检测方法

链接: https://arxiv.org/abs/2409.00272
作者: Vihang Jumle,Mykola Makhortykh,Maryna Sydorova,Victoria Vziatysheva
关键词-EN: extensively used concepts, communication science, Anglophone online content, raises challenges related, societally relevant issues
关键词-ZH: 广泛使用的概念、传播科学、英语在线内容,提出了相关挑战、社会相关问题
类目: Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:Framing is among the most extensively used concepts in the field of communication science. The availability of digital data offers new possibilities for studying how specific aspects of social reality are made more salient in online communication but also raises challenges related to the scaling of framing analysis and its adoption to new research areas (e.g. studying the impact of artificial intelligence-powered systems on representation of societally relevant issues). To address these challenges, we introduce a transformer-based approach for generic news frame detection in Anglophone online content. While doing so, we discuss the composition of the training and test datasets, the model architecture, and the validation of the approach and reflect on the possibilities and limitations of the automated detection of generic news frames.
摘要:框架是传播科学领域使用最广泛的概念之一。数字数据的可用性为研究如何在在线通信中使社会现实的特定方面更加突出提供了新的可能性,但也提出了与框架分析的扩展及其应用到新研究领域相关的挑战(例如研究人工智能驱动的系统对社会相关问题的表示的影响)。为了解决这些挑战,我们引入了一种基于变换器的方法,用于英语在线内容中的通用新闻框架检测。在此过程中,我们讨论了训练和测试数据集的组成、模型架构以及方法的验证,并反思了通用新闻框架自动检测的可能性和局限性。

[NLP-113] Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content
[NLP-113] 利用认知模型来衡量人类和GPT-4书面内容的主观相似性

链接: https://arxiv.org/abs/2409.00269
作者: Tyler Malloy,Maria José Ferreira,Fei Fang,Cleotilde Gonzalez
关键词-EN: Large Language Models, Large Language, formed by Large, token embeddings formed, Cosine similarity
关键词-ZH: 大型语言模型,大型语言,由大型形成,形成的标记嵌入,Cosine相似性
类目: Computation and Language (cs.CL)
备注: 7 Figures, 1 table

点击查看摘要

Abstract:Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4, and used to categorize those documents across a range of uses. However, these similarities are ultimately dependent on the corpora used to train these LLMs, and may not reflect subjective similarity of individuals or how their biases and constraints impact similarity metrics. This lack of cognitively-aware personalization of similarity metrics can be particularly problematic in educational and recommendation settings where there is a limited number of individual judgements of category or preference, and biases can be particularly relevant. To address this, we rely on an integration of an Instance-Based Learning (IBL) cognitive model with LLM embeddings to develop the Instance-Based Individualized Similarity (IBIS) metric. This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making. To evaluate the IBIS metric, we also introduce a dataset of human categorizations of emails as being either dangerous (phishing) or safe (ham). This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.
摘要:两个文档之间的余弦相似度可以使用由GPT-4等大型语言模型形成的令牌嵌入来计算,并用于对这些文档进行分类。然而,这些相似性最终取决于用于训练这些LLM的语料库,可能不反映个人的主观相似性,也不反映他们的偏见和约束如何影响相似性度量。这种对相似性度量缺乏认知意识的个性化在教育和推荐环境中可能特别成问题,在这些环境中,对类别或偏好的个人判断数量有限,并且偏见可能特别相关。为了解决这个问题,我们依赖于基于实例的学习(IBL)认知模型与LLM嵌入的集成来开发基于实例的个性化相似性(IBIS)度量。这种相似性度量是有益的,因为它以一种植根于决策的认知机制的方式考虑了个人的偏见和限制。为了评估IBIS指标,我们还引入了一个由人工将电子邮件标注为危险邮件(钓鱼邮件)或安全邮件(ham,即正常邮件)的数据集。这个数据集被用来展示利用认知模型来衡量教育环境中人类参与者的主观相似性的好处。
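As a minimal illustration of the embedding-based similarity that the abstract starts from, the sketch below computes cosine similarity between two documents represented by embedding vectors. The vectors here are hypothetical placeholders for whatever an LLM embedding model would return; the code is not taken from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document embeddings (in practice these would come from an
# LLM embedding model such as the ones discussed in the abstract).
doc_a = np.array([0.12, -0.40, 0.88, 0.05])
doc_b = np.array([0.10, -0.35, 0.80, 0.10])

print(f"cosine similarity: {cosine_similarity(doc_a, doc_b):.3f}")
```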

[NLP-114] DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity
[NLP-114] DiverseDialogue:设计具有类人多样性的聊天机器人的方法

链接: https://arxiv.org/abs/2409.00262
作者: Xiaoyu Lin,Xinkai Yu,Ankit Aich,Salvatore Giorgi,Lyle Ungar
关键词-EN: Large Language Models, Large Language, Language Models, customer service, frequently employed
关键词-ZH: 大型语言模型,大型语言,语言模型,客户服务,经常雇用
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), which simulate human users, are frequently employed to evaluate chatbots in applications such as tutoring and customer service. Effective evaluation necessitates a high degree of human-like diversity within these simulations. In this paper, we demonstrate that conversations generated by GPT-4o mini, when used as simulated human participants, systematically differ from those between actual humans across multiple linguistic features. These features include topic variation, lexical attributes, and both the average behavior and diversity (variance) of the language used. To address these discrepancies, we propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions, such as age, gender, emotional tone, and the topics discussed. We assess our approach using differential language analysis combined with deep linguistic inquiry. Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements. Specifically, it enhances the human-likeness of LLM chatbot conversations, increasing their linguistic diversity. On average, we observe a 54 percent reduction in the error of average features between human and LLM-generated conversations. This method of constructing chatbot sets with human-like diversity holds great potential for enhancing the evaluation process of user-facing bots.
摘要:大语言模型可以模拟人类用户,经常被用来评估辅导、客户服务等应用中的聊天机器人。有效的评估需要在这些模拟中具有高度的类人多样性。在这篇文章中,我们证明了GPT-4o mini生成的会话作为模拟人类参与者时,在多个语言特征上与真实人类之间的会话存在系统性差异。这些特征包括主题变化、词汇属性以及所用语言的平均行为和多样性(方差)。为了解决这些差异,我们提出了一种方法,通过结合来自真实人类交互的特征,如年龄、性别、情感基调和讨论的主题,自动为用户模拟生成提示。我们使用差异语言分析结合深入的语言学考察来评估我们的方法。我们的提示优化方法针对特定的语言特征量身定做,显示出显著的改进。具体地说,它增强了LLM聊天机器人对话的类人程度,增加了它们的语言多样性。平均而言,我们观察到人类对话与LLM生成的对话之间的平均特征误差减少了54%。这种构造具有类人多样性的聊天机器人集合的方法,对于增强面向用户的机器人的评估过程具有很大的潜力。

[NLP-115] MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
[NLP-115] MAPWise:评估视觉语言模型的高级地图查询能力

链接: https://arxiv.org/abs/2409.00255
作者: Srija Mukhopadhyay,Abhishek Rajgaria,Prerana Khatiwada,Vivek Gupta,Dan Roth
关键词-EN: tasks requiring joint, Vision-language models, excel at tasks, linguistic information, answering questions based
关键词-ZH: 需要联合的任务、视觉语言模型、擅长任务、语言信息、回答问题
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 30 Pages, 46 Tables, 6 Figure

点击查看摘要

Abstract:Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models.
摘要:视觉-语言模型(VLM)擅长于联合理解视觉和语言信息的任务。这些模型的一个特别有前景但未得到充分探索的应用是根据各种地图回答问题。本研究考察了视觉-语言模型基于分级统计地图(choropleth map)回答问题的能力,这类地图被广泛用于数据分析与展示。为了促进和鼓励这一领域的研究,我们引入了一个新的基于地图的问答基准,由来自三个地理区域(美国、印度、中国)的地图组成,各包含1000个问题。我们的基准包含43个不同的问题模板,需要对相对空间关系、复杂的地图要素和复杂的推理进行细致入微的理解。它还包括具有离散值和连续值的地图,涵盖颜色映射、类别排序和风格模式的变化,从而支持全面分析。我们在这个基准上评估了多个VLM的性能,强调了它们在能力上的差距,并为改进这些模型提供了见解。

[NLP-116] Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
[NLP-116] 预训练具有损坏接地数据的多模式幻觉检测器

链接: https://arxiv.org/abs/2409.00238
作者: Spencer Whitehead,Jacob Phillips,Sean Hendryx
关键词-EN: limits their reliability, Multimodal language models, Multimodal language, Abstract, exhibit hallucinations
关键词-ZH: 限制其可靠性,多模式语言模型,多模式语言,抽象,表现出幻觉
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal language models can exhibit hallucinations in their outputs, which limits their reliability. The ability to automatically detect these errors is important for mitigating them, but has been less explored and existing efforts do not localize hallucinations, instead framing this as a classification task. In this work, we first pose multimodal hallucination detection as a sequence labeling task where models must localize hallucinated text spans and present a strong baseline model. Given the high cost of human annotations for this task, we propose an approach to improve the sample efficiency of these models by creating corrupted grounding data, which we use for pre-training. Leveraging phrase grounding data, we generate hallucinations to replace grounded spans and create hallucinated text. Experiments show that pre-training on this data improves sample efficiency when fine-tuning, and that the learning signal from the grounding data plays an important role in these improvements.
摘要:多模态语言模型在其输出中会表现出幻觉,这限制了其可靠性。自动检测这些错误的能力对于缓解这些错误很重要,但人们对此探索较少,现有的努力并未对幻觉进行定位,而是将其视为一项分类任务。在这项工作中,我们首先将多模态幻觉检测表述为一项序列标注任务,要求模型定位出现幻觉的文本跨度,并给出了一个强基线模型。鉴于这项任务的人工标注成本很高,我们提出了一种方法,通过构造被破坏的接地(grounding)数据来提高这些模型的样本效率,并将其用于预训练。利用短语接地数据,我们生成幻觉内容来替换有依据的文本跨度,从而创建幻觉文本。实验表明,对这些数据的预训练提高了微调时的样本效率,而来自接地数据的学习信号在这些改进中起到了重要作用。

[NLP-117] Can Large Language Models Address Open-Target Stance Detection?
[NLP-117] 大型语言模型能否解决开放目标立场检测问题?

链接: https://arxiv.org/abs/2409.00222
作者: Abu Ubaida Akash,Ahmed Fahmy,Amine Trabelsi
关键词-EN: Stance detection, Open-Target Stance Detection, Large Language Models, typically labeled, detection
关键词-ZH: 姿态检测、开放目标姿态检测、大型语言模型(通常标记)检测
类目: Computation and Language (cs.CL)
备注: 10 pages, currently under submission

点击查看摘要

Abstract:Stance detection (SD) assesses a text’s position towards a target, typically labeled as “favor,” “against,” or “neutral.” We introduce Open-Target Stance Detection (OTSD), where targets are neither seen during training nor provided as input. Evaluating Large Language Models (LLMs) like GPT-3.5, Llama 3, and Mistral, we compare their performance with the Target-Stance Extraction (TSE) approach, which has the advantage of using predefined targets. LLMs perform better than TSE in target generation when the real target is explicitly and not explicitly mentioned in the text. For stance detection, LLMs perform better in explicit scenarios but fail in non-explicit ones.
摘要:姿态检测(SD)评估文本相对于目标的立场,通常标记为“赞成”、“反对”或“中立”。我们引入了开放目标姿态检测(OTSD),其中目标在训练期间既不会被看到,也不会作为输入提供。我们评估了GPT-3.5、Llama 3和Mistral等大型语言模型(LLM),并将其性能与目标姿态提取(TSE)方法进行了比较,后者具有使用预定义目标的优势。无论真实目标在文本中是否被明确提及,LLM在目标生成方面的表现都优于TSE。对于姿态检测,LLM在显式场景中表现更好,但在非显式场景中表现不佳。

[NLP-118] ProGRes: Prompted Generative Rescoring on ASR n-Best
[NLP-118] ProGRes:在ASB n-Best上进行的预定生成重新评分

链接: https://arxiv.org/abs/2409.00217
作者: Ada Defne Tur,Adel Moumen,Mirco Ravanelli
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: IEEE Spoken Language Technology Workshop

点击查看摘要

Translation interface exception

[NLP-119] Enhancing Document-level Argument Extraction with Definition-augmented Heuristic-driven Prompting for LLMs
[NLP-119] 通过LLM的描述增强启发式驱动的描述来增强文档级参数提取

链接: https://arxiv.org/abs/2409.00214
作者: Tongyue Sun,Jiayi Xiao
关键词-EN: Event Argument Extraction, extracting structured information, remains challenging due, Event Argument, Large Language Models
关键词-ZH: 由于事件参数、大型语言模型,事件参数提取(提取结构化信息)仍然具有挑战性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Event Argument Extraction (EAE) is pivotal for extracting structured information from unstructured text, yet it remains challenging due to the complexity of real-world document-level EAE. We propose a novel Definition-augmented Heuristic-driven Prompting (DHP) method to enhance the performance of Large Language Models (LLMs) in document-level EAE. Our method integrates argument extraction-related definitions and heuristic rules to guide the extraction process, reducing error propagation and improving task accuracy. We also employ the Chain-of-Thought (CoT) method to simulate human reasoning, breaking down complex problems into manageable sub-problems. Experiments have shown that our method achieves a certain improvement in performance over existing prompting methods and few-shot supervised learning on document-level EAE datasets. The DHP method enhances the generalization capability of LLMs and reduces reliance on large annotated datasets, offering a novel research perspective for document-level EAE.
摘要:事件参数提取是从非结构化文本中提取结构化信息的关键,但由于现实世界文档级事件参数提取的复杂性,事件参数提取仍然具有挑战性。为了提高大语言模型在文档级EAE中的性能,提出了一种新的基于定义的启发式提示方法。该方法综合了参数抽取相关定义和启发式规则来指导抽取过程,减少了错误传播,提高了任务精度。我们还使用了思想链(COT)方法来模拟人类推理,将复杂问题分解为可管理的子问题。实验表明,我们的方法在文档级EAE数据集上的性能比现有的提示方法和少镜头监督学习方法都有一定的提高。DHP方法增强了LLMS的泛化能力,减少了对大型标注数据集的依赖,为文档级EAE提供了一个新的研究视角。

[NLP-120] Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs
[NLP-120] 通过使用语义因果图进行指令微调来增强大型语言模型中的事件推理

链接: https://arxiv.org/abs/2409.00209
作者: Mazal Bethany,Emet Bethany,Brandon Wherry,Cho-Yu Chiang,Nishant Vishwamitra,Anthony Rios,Peyman Najafirad
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-121] The creative psychometric item generator: a framework for item generation and validation using large language models
[NLP-121] 创意心理测量项目生成器:使用大型语言模型的项目生成和验证框架

链接: https://arxiv.org/abs/2409.00202
作者: Antonio Laverghetta Jr.,Simone Luchini,Averie Linell,Roni Reiter-Palmon,Roger Beaty
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: CREAI 2024

点击查看摘要

Translation interface exception

[NLP-122] Facilitating phenotyping from clinical texts: the medkit library
[NLP-122] 促进临床文本中的表型分型:medKit图书馆

链接: https://arxiv.org/abs/2409.00164
作者: Antoine Neuraz,Ghislain Vaillant,Camila Arias,Olivier Birot,Kim-Tam Huynh,Thibaut Fabacher,Alice Rogier,Nicolas Garcelon,Ivan Lerner,Bastien Rance,Adrien Coulet
关键词-EN: Electronic Health Records, Health Records, Electronic Health, collection of Electronic, potentially complex
关键词-ZH: 电子健康记录、健康记录、电子健康、电子收集,潜在复杂
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Phenotyping consists in applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because a lot of the clinical information of EHRs are lying in texts, phenotyping from text takes an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized aspect of both the content and form of clinical texts makes this task particularly tedious, and is the source of time and cost constraints in observational studies. To facilitate the development, evaluation and reproductibility of phenotyping pipelines, we developed an open-source Python library named medkit. It enables composing data processing pipelines made of easy-to-reuse software bricks, named medkit operations. In addition to the core of the library, we share the operations and pipelines we already developed and invite the phenotyping community for their reuse and enrichment. medkit is available at this https URL
摘要:表型分析(phenotyping)是指应用算法(通常基于电子健康记录(EHR)集合)来识别与某种特定的、可能较为复杂的特征或病症相关的个体。由于EHR的许多临床信息都在文本中,因此从文本中进行表型分析在依赖EHR二次使用的研究中扮演着重要的角色。然而,临床文本的内容和形式的异质性和高度专业化使得这项任务特别乏味,也是观察性研究中时间和成本限制的来源。为了方便表型分析管道的开发、评估和复现,我们开发了一个名为medkit的开源Python库。它能够组成由易于重复使用的软件模块构成的数据处理管道,这些模块称为medkit操作。除了库的核心,我们还共享我们已经开发的操作和管道,并邀请表型分析社区重新使用和丰富它们。medkit可通过此https URL获得。

[NLP-123] Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
[NLP-123] 序列到序列奖励建模:通过语言反馈改进WLHF

链接: https://arxiv.org/abs/2409.00162
作者: Jiayi Zhou,Jiaming Ji,Juntao Dai,Yaodong Yang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Translation interface exception

[NLP-124] LLMs hallucinate graphs too: a structural perspective
[NLP-124] 法学硕士也会幻觉图表:结构性视角

链接: https://arxiv.org/abs/2409.00159
作者: Erwan Le Merrer,Gilles Tredan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Translation interface exception

[NLP-125] Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder INTERSPEECH2024
[NLP-125] 开发端到端框架来预测自闭症谱系障碍儿童的社交沟通严重程度评分

链接: https://arxiv.org/abs/2409.00158
作者: Jihyun Mun,Sunhee Kim,Minhwa Chung
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for Interspeech 2024

点击查看摘要

Translation interface exception

[NLP-126] Speaker Tagging Correction With Non-Autoregressive Language Models
[NLP-126] 使用非自回归语言模型进行说话者标记纠正

链接: https://arxiv.org/abs/2409.00151
作者: Grigor Kirakosyan,Davit Karamyan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 6 pages, 7 tables

点击查看摘要

Translation interface exception

[NLP-127] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
[NLP-127] MultiMath:为大型语言模型搭建视觉和数学推理的桥梁

链接: https://arxiv.org/abs/2409.00147
作者: Shuai Peng,Di Fu,Liangcai Gao,Xiuqin Zhong,Hongguang Fu,Zhi Tang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-128] Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
[NLP-128] 动态深度解码:LLM的更快推测解码

链接: https://arxiv.org/abs/2409.00142
作者: Oscar Brown,Zhengjie Wang,Andrea Do,Nikhil Mathew,Cheng Yu
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-129] PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action
[NLP-129] PrivacyLens:评估实际语言模型的隐私规范意识

链接: https://arxiv.org/abs/2409.00138
作者: Yijia Shao,Tianshi Li,Weiyan Shi,Yanchen Liu,Diyi Yang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Under review

点击查看摘要

Translation interface exception

[NLP-130] Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
[NLP-130] 前沿模型中新出现的漏洞:多回合越狱攻击

链接: https://arxiv.org/abs/2409.00137
作者: Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Jason Zhang,Julius Broomfield,Sara Pieri,Reihaneh Iranmanesh,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
关键词-ZH: Translation interface exception
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-131] HoneyComb: A Flexible LLM-Based Agent System for Materials Science EMNLP2024
[NLP-131] HoneyComb:一个基于LLM的灵活材料科学代理系统

链接: https://arxiv.org/abs/2409.00135
作者: Huan Zhang,Yu Song,Ziyu Hou,Santiago Miret,Bang Liu
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review on EMNLP 2024

点击查看摘要

Translation interface exception

[NLP-132] A Survey for Large Language Models in Biomedicine
[NLP-132] 生物医学中大型语言模型的调查

链接: https://arxiv.org/abs/2409.00133
作者: Chong Wang,Mengyao Li,Junjun He,Zhongruo Wang,Erfan Darzi,Zan Chen,Jin Ye,Tianbin Li,Yanzhou Su,Jing Ke,Kaili Qu,Shuxin Li,Yi Yu,Pietro Liò,Tianyun Wang,Yu Guang Wang,Yiqing Shen
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-133] Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems
[NLP-133] 数学单词问题的轻量级大语言模型逻辑对比推理

链接: https://arxiv.org/abs/2409.00131
作者: Ding Kai,Ma Zhenguo,Yan Xiaoran
关键词-EN: lightweight Large Language, Large Language Models, study focuses, focuses on improving, Large Language
关键词-ZH: 轻量级大型语言,大型语言模型,研究重点,专注于改进,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks. We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism to construct a set of reference problems that integrate both semantic and logical similarity. By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic. To the best of our knowledge, this is the first attempt to utilize retrieval-enhanced generation for mathematical problem-solving. Experimental results demonstrate that our method achieves a 15.8% improvement over the Chain of Thought approach on the SVAMP dataset and a 21.5 % improvement on the GSM8K dataset. Further application of this method to a large-scale model with 175 billion parameters yields performance comparable to the best results on both aforementioned datasets. Finally, we conduct an analysis of errors during the reasoning process, providing valuable insights and directions for future research on reasoning tasks using large language models.
摘要:本研究旨在提高轻量级大语言模型在数学推理任务中的性能。本文提出了一种度量数理逻辑相似度的新方法,并设计了一种自动筛选机制来构造一组综合了语义和逻辑相似度的参考题。通过使用精心设计的正面和负面示例提示,我们引导模型采用合理的推理逻辑。据我们所知,这是第一次尝试利用检索增强的生成来解决数学问题。实验结果表明,该方法在SVAMP数据集和GSM8K数据集上分别比思想链方法提高了15.8%和21.5%。将该方法进一步应用于具有1750亿个参数的大规模模型,其性能可与上述两个数据集的最佳结果相媲美。最后,我们对推理过程中的错误进行了分析,为未来使用大型语言模型进行推理任务的研究提供了有价值的见解和方向。

[NLP-134] Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
[NLP-134] 人工智能可以取代人类吗?LLM大规模复制心理实验

链接: https://arxiv.org/abs/2409.00128
作者: Ziyan Cui,Ning Li,Huaikang Zhou
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 5 figures, 2 tables

点击查看摘要

Translation interface exception

[NLP-135] ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings ICPR2024
[NLP-135] ConCSE:代码交换嵌入的统一对比学习和增强

链接: https://arxiv.org/abs/2409.00120
作者: Jangyeong Jeon,Sangyeon Cho,Minuk Ma,Junyoung Kim
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICPR 2024

点击查看摘要

Translation interface exception

[NLP-136] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning Efficient Batching and Composability
[NLP-136] 3合1:2D旋转调整,实现高效微调高效批量和可组合性

链接: https://arxiv.org/abs/2409.00119
作者: Baohao Liao,Christof Monz
关键词-EN:
关键词-ZH: Translation interface exception
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages, 6 figures, 13 tables

点击查看摘要

Translation interface exception

[NLP-137] FedMCP: Parameter-Efficient Federated Learning with Model-Contrastive Personalization
[NLP-137] FedHCP:具有模型对比个性化的参数高效联邦学习

链接: https://arxiv.org/abs/2409.00116
作者: Qianyi Zhao,Chen Qu,Cen Chen,Mingyuan Fan,Yanhao Wang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-138] When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options
[NLP-138] 当所有选项都错误时:使用不正确的多项选择选项评估大型语言模型的稳健性

链接: https://arxiv.org/abs/2409.00113
作者: Gracjan Góral,Emilia Wiśnios
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-139] Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy
[NLP-139] 迈向作为治疗工具的大型语言模型:比较提示技术以改进GPT提供的问题解决疗法

链接: https://arxiv.org/abs/2409.00112
作者: Daniil Filienko,Yinzhou Wang,Caroline El Jazmi,Serena Xie,Trevor Cohen,Martine De Cock,Weichao Yuwen
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted for AMIA 2024 proceedings

点击查看摘要

Translation interface exception

[NLP-140] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
[NLP-140] 视觉语言模型的零镜头视觉推理:基准和分析

链接: https://arxiv.org/abs/2409.00106
作者: Aishik Nagar,Shantanu Jaiswal,Cheston Tan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Translation interface exception

[NLP-141] Negation Blindness in Large Language Models : Unveiling the NO Syndrome in Image Generation
[NLP-141] 大型语言模型中的否定盲:揭开图像生成中的NO综合症

链接: https://arxiv.org/abs/2409.00105
作者: Mohammad Nadeem,Shahab Saquib Sohail,Erik Cambria,Björn W. Schuller,Amir Hussain
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures

点击查看摘要

Translation interface exception

[NLP-142] Nuance Matters: Probing Epistemic Consistency in Causal Reasoning
[NLP-142] 细微差别很重要:探索因果推理中的认识一致性

链接: https://arxiv.org/abs/2409.00103
作者: Shaobo Cui,Junyou Li,Luca Mouchel,Yiyang Feng,Boi Faltings
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Translation interface exception

[NLP-143] Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning
[NLP-143] 使用谱-时态图专注池和多任务学习的逐例查询关键词发现

链接: https://arxiv.org/abs/2409.00099
作者: Zhenyu Wang,Shuyu Kong,Li Wan,Biqiao Zhang,Yiteng Huang,Mumin Jin,Ming Sun,Xin Lei,Zhaojun Yang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Translation interface exception

[NLP-144] How to Train Text Summarization Model with Weak Supervisions
[NLP-144] 如何训练弱监督的文本摘要模型

链接: https://arxiv.org/abs/2409.00098
作者: Yanbo Wang,Wenyu Chen,Shimin Shan
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-145] Large Language Models for Disease Diagnosis: A Scoping Review
[NLP-145] 疾病诊断的大型语言模型:范围界定评论

链接: https://arxiv.org/abs/2409.00097
作者: Shuang Zhou,Zidu Xu,Mian Zhang,Chunpu Xu,Yawen Guo,Zaifu Zhan,Sirui Ding,Jiashuo Wang,Kaishuai Xu,Yi Fang,Liqiao Xia,Jeremy Yeung,Daochen Zha,Mingquan Lin,Rui Zhang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 57 pages

点击查看摘要

Translation interface exception

[NLP-146] Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data
[NLP-146] 非教学微调:在没有教学遵循数据的情况下在预训练的语言模型中启用教学遵循能力

链接: https://arxiv.org/abs/2409.00096
作者: Juncheng Xie,Shensian Syu,Hung-yi Lee
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 2 figures, 15 tables

点击查看摘要

Translation interface exception

[NLP-147] Examining Independence in Ensemble Sentiment Analysis: A Study on the Limits of Large Language Models Using the Condorcet Jury Theorem
[NLP-147] 审查集合情绪分析中的独立性:使用孔多塞陪审团定理研究大型语言模型的局限性

链接: https://arxiv.org/abs/2409.00094
作者: Baptiste Lefort,Eric Benhamou,Jean-Jacques Ohana,Beatrice Guez,David Saltiel,Thomas Jacquot
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-148] PatentGPT: A Large Language Model for Patent Drafting Using Knowledge-based Fine-tuning Method
[NLP-148] PatentGPT:使用基于知识的微调方法的专利起草大型语言模型

链接: https://arxiv.org/abs/2409.00092
作者: Runtao Ren,Jian Ma
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures

点击查看摘要

Translation interface exception

[NLP-149] Classification of Safety Events at Nuclear Sites using Large Language Models
[NLP-149] 使用大型语言模型对核电站安全事件进行分类

链接: https://arxiv.org/abs/2409.00091
作者: Mishca de Costa,Muhammad Anwar,Daniel Lau,Issam Hammad
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-150] Evaluating ChatGPT on Nuclear Domain-Specific Data
[NLP-150] 根据核领域特定数据评估ChatGPT

链接: https://arxiv.org/abs/2409.00090
作者: Muhammad Anwar,Mischa de Costa,Issam Hammad,Daniel Lau
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-151] On-Device Language Models: A Comprehensive Review
[NLP-151] 设备上语言模型:全面评论

链接: https://arxiv.org/abs/2409.00088
作者: Jiajun Xu,Zhiyuan Li,Wei Chen,Qun Wang,Xin Gao,Qi Cai,Ziyuan Ling
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: 38 pages, 6 figures

点击查看摘要

Translation interface exception

[NLP-152] Genetic Approach to Mitigate Hallucination in Generative IR SIGIR2024
[NLP-152] 减轻生成性IR中幻觉的遗传学方法

链接: https://arxiv.org/abs/2409.00085
作者: Hrishikesh Kulkarni,Nazli Goharian,Ophir Frieder,Sean MacAvaney
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Gen-IR@SIGIR 2024

点击查看摘要

Translation interface exception

[NLP-153] Vision-Language and Large Language Model Performance in Gastroenterology: GPT Claude Llama Phi Mistral Gemma and Quantized Models
[NLP-153] 胃肠道病学中的视觉语言和大型语言模型性能:GPT Claude Llama Phi Mistral Gemma和量化模型

链接: https://arxiv.org/abs/2409.00084
作者: Seyed Amir Ahmad Safavi-Naini,Shuhaib Ali,Omer Shahab,Zahra Shahhoseini,Thomas Savage,Sara Rafiee,Jamil S Samaan,Reem Al Shabeeb,Farah Ladak,Jamie O Yang,Juan Echavarria,Sumbal Babar,Aasma Shaukat,Samuel Margolis,Nicholas P Tatonetti,Girish Nadkarni,Bara El Kurdi,Ali Soroush
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: this https URL . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission

点击查看摘要

Translation interface exception

[NLP-154] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical Introspective Multi-Agent Framework for Open-Domain Question Answering ECML KDD2024
[NLP-154] 迈向对复杂流程工程原理图的人类水平理解:面向开放域问答的教学式内省多主体框架

链接: https://arxiv.org/abs/2409.00082
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Our paper is accepted for publication at ML4CCE workshop at ECML PKDD 2024

点击查看摘要

Translation interface exception

[NLP-155] Are LLM-based methods good enough for detecting unfair terms of service?
[NLP-155] 基于LLM的方法是否足以检测不公平的服务条款?

链接: https://arxiv.org/abs/2409.00077
作者: Mirgita Frasheri,Arian Bakhtiarnia,Lukas Esterle,Aleksandros Iosifidis
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Translation interface exception

[NLP-156] Generative-Adversarial Networks for Low-Resource Language Data Augmentation in Machine Translation
[NLP-156] 用于机器翻译中低资源语言数据增强的生成对抗网络

链接: https://arxiv.org/abs/2409.00071
作者: Linda Zeng
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, 4 tables, presented at ICNLP 2024, to be published in IEEE Explore

点击查看摘要

Translation interface exception

[NLP-157] Learning to Plan Long-Term for Language Modeling
[NLP-157] 学习为语言建模制定长期计划

链接: https://arxiv.org/abs/2409.00070
作者: Florian Mai,Nathan Cornille,Marie-Francine Moens
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Translation interface exception

[NLP-158] An alternative formulation of attention pooling function in translation
[NLP-158] 翻译任务中注意力池化函数的一种替代表述

链接: https://arxiv.org/abs/2409.00068
作者: Eddie Conti
关键词-EN: attention scoring function, attention scoring, attention scoring matrix, translation tasks, attention
关键词-ZH: 注意力评分功能、注意力评分、注意力评分矩阵、翻译任务、注意力
类目: Computation and Language (cs.CL); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The aim of this paper is to present an alternative formulation of the attention scoring function in translation tasks. Generally speaking, language is deeply structured, and this is reflected in the attention scoring matrix. We exploit this property to define the attention pooling function, taking this aspect into account. In the first chapters, we introduce the attention mechanism in mathematical terms and explain its limitations and alternative formulations. Next, we focus on the experimental session that led to the alternative formulation. Essentially, we guide queries and keys to interact in a specific manner, encoding the distinct roles of attention heads and directing values on where to seek context. In mathematical terms, we can think of this formula as projecting the attention scores matrix, say H , onto the space of band matrices with fixed bandwidth. This convex subspace is clearly finite-dimensional and therefore closed. As a consequence, the projection on this space is well-posed and unique. However, at the price of losing the uniqueness of the projection (i.e., the best approximation for H ), we defined a new space consisting of band matrices plus error sparse matrices. We prove that this is a compact subspace which guarantees the existence of a matrix that best approximates H . We conclude the thesis by validating the new formula, namely calculating how well the new formula for attention scores approximates the original one. Additionally, we explore the impact of different parameters such as w (context windows) and num-pos (number of relevant words in a sentence). These analyses provide deeper insights into how languages are processed and translated, revealing nuances in the roles of context and word relevance.
摘要:本文的目的是提出翻译任务中注意力评分函数的一种替代公式。一般来说,语言是深层结构化的,这一点反映在注意力评分矩阵中。我们利用这一性质来定义注意力池化函数,并考虑到这一方面。在前几章中,我们用数学术语介绍了注意力机制,并解释了它的局限性和替代公式。接下来,我们将重点放在引出该替代公式的实验部分。本质上,我们引导查询(query)和键(key)以特定的方式交互,编码不同注意力头的角色,并指导值(value)到何处寻找上下文。用数学语言来说,我们可以把这个公式理解为将注意力得分矩阵(记为H)投影到具有固定带宽的带状矩阵空间上。这个凸子空间显然是有限维的,因此是封闭的。因此,在该空间上的投影是适定且唯一的。然而,以失去投影的唯一性(即对H的最佳逼近)为代价,我们定义了一个由带状矩阵加稀疏误差矩阵组成的新空间。我们证明了这是一个紧致子空间,它保证存在一个最接近H的矩阵。最后,我们对新公式进行了验证,即计算新的注意力分数公式与原公式的逼近程度。此外,我们还探讨了不同参数(如w(上下文窗口)和num-pos(句子中相关单词的数量))的影响。这些分析提供了对语言如何被处理和翻译的更深层次的洞察,揭示了语境和词语相关性所起作用的细微差别。
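To make the band-matrix projection described above concrete, here is a small numpy sketch written for this post (it is not the authors' code): since band matrices of fixed bandwidth form a coordinate subspace, the Frobenius-norm projection of an attention-score matrix H simply zeroes the entries outside the band.

```python
import numpy as np

def project_to_band(H: np.ndarray, w: int) -> np.ndarray:
    """Project a square matrix onto band matrices with bandwidth w.

    Entries with |i - j| > w are set to zero, which gives the closest
    band matrix to H in Frobenius norm.
    """
    n = H.shape[0]
    i, j = np.indices((n, n))
    mask = np.abs(i - j) <= w
    return np.where(mask, H, 0.0)

# Example: a 5x5 matrix of attention scores and a context window of w = 1.
H = np.random.rand(5, 5)
H_band = project_to_band(H, w=1)
print(H_band)
```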

[NLP-159] LCA and energy efficiency in buildings: mapping more than twenty years of research
[NLP-159] 建筑物的生命周期评估和能源效率:绘制二十多年的研究

链接: https://arxiv.org/abs/2409.00065
作者: F. Asdrubali,A. Fronzetti Colladon,L. Segneri,D.M. Gandola
关键词-EN:
关键词-ZH: Translation interface exception
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-160] Phrasing for UX: Enhancing Information Engagement through Computational Linguistics and Creative Analytics
[NLP-160] 用户体验短语:通过计算语言学和创意分析增强信息参与度

链接: https://arxiv.org/abs/2409.00064
作者: Nimrod Dvir
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Translation interface exception

[NLP-161] Urban Mobility Assessment Using LLMs
[NLP-161] 使用LLM进行城市流动性评估

链接: https://arxiv.org/abs/2409.00063
作者: Prabin Bhandari,Antonios Anastasopoulos,Dieter Pfoser
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注: 13 pages, 10 Figures

点击查看摘要

Translation interface exception

[NLP-162] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language
[NLP-162] 利用印度尼西亚语COVID-19自动事实核查的知识图增强自然语言推理性能

链接: https://arxiv.org/abs/2409.00061
作者: Arief Purnama Muharram,Ayu Purwarianti
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Translation interface exception

[NLP-163] Understanding Literary Texts by LLMs: A Case Study of Ancient Chinese Poetry
[NLP-163] 法学硕士理解文学文本:中国古代诗歌的案例研究

链接: https://arxiv.org/abs/2409.00060
作者: Cheng Zhao,Bin Wang,Zhen Wang
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-164] Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting
[NLP-164] 通过LLM从科学文献中自动发现知识:一种采用渐进式本体提示的双代理方法

链接: https://arxiv.org/abs/2409.00054
作者: Yuting Hu,Dancheng Liu,Qingyun Wang,Charles Yu,Heng Ji,Jinjun Xiong
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: in submission

点击查看摘要

Translation interface exception

[NLP-165] Evolving Text Data Stream Mining
[NLP-165] 不断发展的文本数据流挖掘

链接: https://arxiv.org/abs/2409.00010
作者: Jay Kumar
关键词-EN:
关键词-ZH: Translation interface exception
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 134 Pages, 7 Chapters, 38 Figures, 10 Tables

点击查看摘要

Translation interface exception

[NLP-166] Measuring Human Contribution in AI-Assisted Content Generation
[NLP-166] 衡量人类在人工智能辅助内容生成中的贡献

链接: https://arxiv.org/abs/2408.14792
作者: Yueqi Xie,Tao Qi,Jingwei Yi,Ryan Whalen,Junming Huang,Qian Ding,Yu Xie,Xing Xie,Fangzhao Wu
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Translation interface exception

[NLP-167] Zero-shot Bilingual App Reviews Mining with Large Language Models
[NLP-167] 零镜头双语应用程序评论使用大型语言模型进行挖掘

链接: https://arxiv.org/abs/2311.03058
作者: Jialiang Wei,Anne-Lise Courbis,Thomas Lambolais,Binbin Xu,Pierre Louis Bernard,Gérard Dray
关键词-EN:
关键词-ZH: Translation interface exception
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted for The 35th IEEE International Conference on Tools with Artificial Intelligence

点击查看摘要

Translation interface exception

[NLP-168] Statistics of punctuation in experimental literature – the remarkable case of “Finnegans Wake” by James Joyce
[NLP-168] 实验文学中标点符号的统计–詹姆斯·乔伊斯《芬兰人守灵》的非凡案例

链接: https://arxiv.org/abs/2409.00483
作者: Tomasz Stanisz,Stanisław Drożdż,Jarosław Kwapień
关键词-EN:
关键词-ZH: Translation interface exception
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Applications (stat.AP)
备注:

点击查看摘要

Translation interface exception

[NLP-169] Leveraging Large Language Models for Wireless Symbol Detection via In-Context Learning
[NLP-169] 通过上下文内学习利用大型语言模型进行无线符号检测

链接: https://arxiv.org/abs/2409.00124
作者: Momin Abbas,Koushik Kar,Tianyi Chen
关键词-EN:
关键词-ZH: Translation interface exception
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at IEEE GLOBECOM 2024

点击查看摘要

Translation interface exception

人工智能

[AI-0] On a heuristic approach to the description of consciousness as a hypercomplex system state and the possibility of machine consciousness (German edition)

链接: https://arxiv.org/abs/2409.02100
作者: Ralf Otte
关键词-EN: imaginary hypercomplex basis, presents a heuristic, heuristic view, view that shows, physical but imaginary
类目: Artificial Intelligence (cs.AI); Commutative Algebra (math.AC); Applied Physics (physics.app-ph)
*备注: 7 pages, in German language. 1 figure

点击查看摘要

Abstract:This article presents a heuristic view that shows that the inner states of consciousness experienced by every human being have a physical but imaginary hypercomplex basis. The hypercomplex description is necessary because certain processes of consciousness cannot be physically measured in principle, but nevertheless exist. Based on theoretical considerations, it could be possible - as a result of mathematical investigations into a so-called bicomplex algebra - to generate and use hypercomplex system states on machines in a targeted manner. The hypothesis of the existence of hypercomplex system states on machines is already supported by the surprising performance of highly complex AI systems. However, this has yet to be proven. In particular, there is a lack of experimental data that distinguishes such systems from other systems, which is why this question will be addressed in later articles. This paper describes the developed bicomplex algebra and possible applications of these findings to generate hypercomplex energy states on machines. In the literature, such system states are often referred to as machine consciousness. The article uses mathematical considerations to explain how artificial consciousness could be generated and what advantages this would have for such AI systems.

[AI-1] CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

链接: https://arxiv.org/abs/2409.02098
作者: Ingo Ziegler,Abdullatif Köksal,Desmond Elliott,Hinrich Schütze
关键词-EN: Building high-quality datasets, specialized domain knowledge, requires specialized domain, Building high-quality, domain knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.

[AI-2] DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

链接: https://arxiv.org/abs/2409.02095
作者: Wenbo Hu,Xiangjun Gao,Xiaoyu Li,Sijie Zhao,Xiaodong Cun,Yong Zhang,Long Quan,Ying Shan
关键词-EN: world remains challenging, open world remains, static images, remains challenging, significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.

[AI-3] A Deployed Online Reinforcement Learning Algorithm In An Oral Health Clinical Trial

链接: https://arxiv.org/abs/2409.02069
作者: Anna L. Trella,Kelly W. Zhang,Hinal Jajal,Inbal Nahum-Shani,Vivek Shetty,Finale Doshi-Velez,Susan A. Murphy
关键词-EN: substantial financial burden, prevalent chronic condition, personal suffering, financial burden, prevalent chronic
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twice-daily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, a mHealth intervention system designed to complement clinician-delivered preventative care for marginalized individuals at risk for dental disease. Oralytics incorporates an online reinforcement learning algorithm to determine optimal times to deliver intervention prompts that encourage oral self-care behaviors. We have deployed Oralytics in a registered clinical trial. The deployment required careful design to manage challenges specific to the clinical trials setting in the U.S. In this paper, we (1) highlight key design decisions of the RL algorithm that address these challenges and (2) conduct a re-sampling analysis to evaluate algorithm design decisions. A second phase (randomized control trial) of Oralytics is planned to start in spring 2025.
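The deployed algorithm itself is specified in the trial's design papers; purely as a generic illustration of the kind of online decision rule involved (send or withhold an intervention prompt at each decision time), the sketch below runs Bernoulli Thompson sampling against a simulated user. The reward definition, action set, and simulated response rates are placeholders, not the study's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta posteriors over the success probability of "send" vs. "skip".
# (Placeholder reward: 1 if the user brushed after the decision point.)
posteriors = {"send": [1.0, 1.0], "skip": [1.0, 1.0]}  # [alpha, beta]

def choose_action() -> str:
    """Thompson sampling: sample each arm's posterior and pick the best draw."""
    samples = {a: rng.beta(*ab) for a, ab in posteriors.items()}
    return max(samples, key=samples.get)

def update(action: str, reward: int) -> None:
    posteriors[action][0] += reward        # successes
    posteriors[action][1] += 1 - reward    # failures

for t in range(100):
    action = choose_action()
    # Simulated user: brushing is slightly more likely after a prompt.
    reward = int(rng.random() < (0.6 if action == "send" else 0.4))
    update(action, reward)

print(posteriors)
```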

[AI-4] OLMoE: Open Mixture-of-Experts Language Models

链接: https://arxiv.org/abs/2409.02060
作者: Niklas Muennighoff,Luca Soldaini,Dirk Groeneveld,Kyle Lo,Jacob Morrison,Sewon Min,Weijia Shi,Pete Walsh,Oyvind Tafjord,Nathan Lambert,Yuling Gu,Shane Arora,Akshita Bhagia,Dustin Schwenk,David Wadden,Alexander Wettig,Binyuan Hui,Tim Dettmers,Douwe Kiela,Ali Farhadi,Noah A. Smith,Pang Wei Koh,Amanpreet Singh,Hannaneh Hajishirzi
关键词-EN: language model leveraging, model leveraging sparse, introduce OLMoE, fully open, leveraging sparse
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 61 pages (24 main), 36 figures, 14 tables

点击查看摘要

Abstract:We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
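
As background for readers unfamiliar with sparse Mixture-of-Experts layers, the sketch below shows the basic top-k routing idea in which only a few experts are activated per token. The layer sizes, expert count, and top-k value are placeholders and do not reflect OLMoE's actual configuration or auxiliary losses.

```python
# Minimal top-k sparse MoE layer (illustrative placeholder, not OLMoE's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # produces token-to-expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)      # routing probabilities per token
        weights, idx = gates.topk(self.top_k, dim=-1)  # only top-k experts are active
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(4, 512)).shape)          # torch.Size([4, 512])
```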

[AI-5] Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation IJCNN2024

链接: https://arxiv.org/abs/2409.02049
作者: Ruixin Shi,Weijia Guo,Shiming Ge
关键词-EN: Low-resolution face recognition, challenging task due, Low-resolution face, face recognition, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted by IJCNN 2024

点击查看摘要

Abstract:Low-resolution face recognition is a challenging task due to the lack of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can effectively guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address this, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In the approach, the student distills knowledge from the high-resolution teacher at both the instance level and the relation level, providing sufficient cross-resolution knowledge transfer. Then, the learned student can adapt to recognize low-resolution faces with adaptive batch normalization in inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
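
The adaptation step relies on adaptive batch normalization at inference. As a minimal sketch of that general idea (the AdaBN-style recipe below is an assumption about the mechanism, not the paper's exact procedure; `target_loader` is a hypothetical loader of unlabeled low-resolution faces):

```python
# AdaBN-style adaptation sketch: re-estimate BatchNorm running statistics on
# unlabeled target-domain (low-resolution) images before evaluation.
import torch
import torch.nn as nn

def adapt_bn_statistics(model: nn.Module, target_loader, device="cpu"):
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()      # forget source-domain statistics
            m.momentum = None            # use a cumulative moving average instead
    model.train()                        # BN layers update running stats in train mode
    with torch.no_grad():                # forward passes only; no parameter updates
        for images in target_loader:
            model(images.to(device))
    return model.eval()
```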

[AI-6] AllWeatherNet: Unified Image Enhancement for Autonomous Driving under Adverse Weather and Low-Light Conditions

链接: https://arxiv.org/abs/2409.02045
作者: Chenghao Qian,Mahdi Rezaei,Saeed Anwar,Wenjing Li,Tanveer Hussain,Mohsen Azarmi,Wei Wang
关键词-EN: pose challenges, driving perception systems, Adverse conditions, Illumination-aware Attention Mechanism, autonomous driving perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Adverse conditions like snow, rain, nighttime, and fog pose challenges for autonomous driving perception systems. Existing methods have limited effectiveness in improving essential computer vision tasks, such as semantic segmentation, and often focus on only one specific condition, such as removing rain or translating nighttime images into daytime ones. To address these limitations, we propose a method to improve the visual quality and clarity degraded by such adverse conditions. Our method, AllWeather-Net, utilizes a novel hierarchical architecture to enhance images across all adverse conditions. This architecture incorporates information at three semantic levels: scene, object, and texture, by discriminating patches at each level. Furthermore, we introduce a Scaled Illumination-aware Attention Mechanism (SIAM) that guides the learning towards road elements critical for autonomous driving perception. SIAM exhibits robustness, remaining unaffected by changes in weather conditions or environmental scenes. AllWeather-Net effectively transforms images into normal weather and daytime scenes, demonstrating superior image enhancement results and subsequently enhancing the performance of semantic segmentation, with up to a 5.3% improvement in mIoU in the trained domain. We also show our model’s generalization ability by applying it to unseen domains without re-training, achieving up to 3.9% mIoU improvement. Code can be accessed at: this https URL.

[AI-7] BEAVER: An Enterprise Benchmark for Text-to-SQL

链接: https://arxiv.org/abs/2409.02038
作者: Peter Baile Chen,Fabian Wenz,Yi Zhang,Moe Kayali,Nesime Tatbul,Michael Cafarella,Çağatay Demiralp,Michael Stonebraker
关键词-EN: SQL statement pairs, constructed using publicly, human-generated tests, Existing, data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the “dark web”, (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset, BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers in building more sophisticated text-to-SQL systems that can do better on this important class of data.

[AI-8] TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation

链接: https://arxiv.org/abs/2409.02018
作者: Bobby Azad,Pourya Adibfar,Kaiqun Fu
关键词-EN: effective treatment strategies, accurate disease diagnosis, medical image segmentation, medical image, treatment strategies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synaps multi-organ dataset, even without relying on pre-trained weights.

[AI-9] AI Governance in Higher Education: Case Studies of Guidance at Big Ten Universities

链接: https://arxiv.org/abs/2409.02017
作者: Chuhao Wu,He Zhang,John M. Carroll
关键词-EN: drawn significant attention, drawn significant, significant attention, attention from stakeholders, higher education
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI has drawn significant attention from stakeholders in higher education. As it introduces new opportunities for personalized learning and tutoring support, it simultaneously poses challenges to academic integrity and leads to ethical issues. Consequently, governing responsible AI usage within higher education institutions (HEIs) becomes increasingly important. Leading universities have already published guidelines on Generative AI, with most attempting to embrace this technology responsibly. This study provides a new perspective by focusing on strategies for responsible AI governance as demonstrated in these guidelines. Through a case study of 14 prestigious universities in the United States, we identified the multi-unit governance of AI, the role-specific governance of AI, and the academic characteristics of AI governance from their AI guidelines. The strengths and potential limitations of these strategies and characteristics are discussed. The findings offer practical implications for guiding responsible AI usage in HEIs and beyond.

[AI-10] When Digital Twin Meets 6G: Concepts, Obstacles and Research Prospects

链接: https://arxiv.org/abs/2409.02008
作者: Wenshuai Liu,Yaru Fu,Zheng Shi,Hong Wang
关键词-EN: digital twin technology, digital twin, numerous research opportunities, leveraging digital twin, twin technology
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:The convergence of digital twin technology and the emerging 6G network presents both challenges and numerous research opportunities. This article explores the potential synergies between digital twin and 6G, highlighting the key challenges and proposing fundamental principles for their integration. We discuss the unique requirements and capabilities of digital twin in the context of 6G networks, such as sustainable deployment, real-time synchronization, seamless migration, predictive analytics, and closed-loop control. Furthermore, we identify research opportunities for leveraging digital twin and artificial intelligence to enhance various aspects of 6G, including network optimization, resource allocation, security, and intelligent service provisioning. This article aims to stimulate further research and innovation at the intersection of digital twin and 6G, paving the way for transformative applications and services in the future.

[AI-11] QueryCheetah: Fast Automated Discovery of Attribute Inference Attacks Against Query-Based Systems CCS

链接: https://arxiv.org/abs/2409.01992
作者: Bozhidar Stevanoski,Ana-Maria Cretu,Yves-Alexandre de Montjoye
关键词-EN: Query-based systems, sharing data, Query-based, Attacks, QBSs
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: This is an extended version of the ACM CCS paper which includes appendices

点击查看摘要

Abstract:Query-based systems (QBSs) are one of the key approaches for sharing data. QBSs allow analysts to request aggregate information from a private protected dataset. Attacks are a crucial part of ensuring QBSs are truly privacy-preserving. The development and testing of attacks is however very labor-intensive and unable to cope with the increasing complexity of systems. Automated approaches have been shown to be promising but are currently extremely computationally intensive, limiting their applicability in practice. We here propose QueryCheetah, a fast and effective method for automated discovery of privacy attacks against QBSs. We instantiate QueryCheetah on attribute inference attacks and show it to discover stronger attacks than previous methods while being 18 times faster than the state-of-the-art automated approach. We then show how QueryCheetah allows system developers to thoroughly evaluate the privacy risk, including for various attacker strengths and target individuals. We finally show how QueryCheetah can be used out-of-the-box to find attacks in larger syntaxes and workarounds around ad-hoc defenses.

[AI-12] Planning to avoid ambiguous states through Gaussian approximations to non-linear sensors in active inference agents

链接: https://arxiv.org/abs/2409.01974
作者: Wouter M. Kouw
关键词-EN: active inference agents, world represent, active inference, measurement function, measurement
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: 13 pages, 3 figures. Accepted to the International Workshop on Active Inference 2024

点击查看摘要

Abstract:In nature, active inference agents must learn how observations of the world represent the state of the agent. In engineering, the physics behind sensors is often known reasonably accurately and measurement functions can be incorporated into generative models. When a measurement function is non-linear, the transformed variable is typically approximated with a Gaussian distribution to ensure tractable inference. We show that Gaussian approximations that are sensitive to the curvature of the measurement function, such as a second-order Taylor approximation, produce a state-dependent ambiguity term. This induces a preference over states, based on how accurately the state can be inferred from the observation. We demonstrate this preference with a robot navigation experiment where agents plan trajectories.
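
As a concrete illustration of a curvature-aware Gaussian approximation, the one-dimensional sketch below propagates x ~ N(mu, sigma^2) through a non-linear sensor g with a second-order Taylor expansion; the state-dependent curvature term is what induces the ambiguity-based preference mentioned in the abstract. The sensor g(x) = sin(x) and the noise level are arbitrary choices for illustration, not the agents' generative model.

```python
# Second-order Taylor moments of g(x) for x ~ N(mu, sigma2):
#   E[g(x)]   ≈ g(mu) + 0.5 * g''(mu) * sigma2
#   Var[g(x)] ≈ g'(mu)^2 * sigma2 + 0.5 * g''(mu)^2 * sigma2^2
import numpy as np

def taylor2_gaussian(g, dg, d2g, mu, sigma2):
    mean = g(mu) + 0.5 * d2g(mu) * sigma2
    var = dg(mu) ** 2 * sigma2 + 0.5 * d2g(mu) ** 2 * sigma2 ** 2
    return mean, var

g, dg, d2g = np.sin, np.cos, lambda x: -np.sin(x)   # toy sensor and its derivatives
for mu in (0.0, np.pi / 2):
    mean, var = taylor2_gaussian(g, dg, d2g, mu, sigma2=0.1)
    print(f"state mu={mu:.2f} -> predicted measurement N({mean:.3f}, {var:.4f})")
```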

[AI-13] Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

链接: https://arxiv.org/abs/2409.01952
作者: Abdullah Arafat Miah,Yu Bi
关键词-EN: Deep neural networks, Deep neural, neural networks, long been recognized, recognized as vulnerable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker’s choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on the white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike state-of-the-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e. BDDR). All the code and data are available at this https URL.

[AI-14] Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics ICPR

链接: https://arxiv.org/abs/2409.01928
作者: Imanol Solano,Alejandro Peña,Aythami Morales,Julian Fierrez,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
关键词-EN: quantify biased behaviors, biased behaviors, metric, systems, metric designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted paper for the 27th International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

Abstract:We present a novel metric designed, among other applications, to quantify biased behaviors of machine learning models. At its core, the metric is a new similarity measure between score distributions that balances both their general shapes and their tails’ probabilities. In that sense, our proposed metric may be useful in many application areas. Here we focus on and apply it to the operational evaluation of face recognition systems, with special attention to quantifying demographic biases; an application where our metric is especially useful. The topic of demographic bias and fairness in biometric recognition systems has gained major attention in recent years. The usage of these systems has spread in society, raising concerns about the extent to which these systems treat different population groups. A relevant step to prevent and mitigate demographic biases is first to detect and quantify them. Traditionally, two approaches have been studied to quantify differences between population groups in the machine learning literature: 1) measuring differences in error rates, and 2) measuring differences in recognition score distributions. Our proposed Comprehensive Equity Index (CEI) trades off both approaches, combining errors from distribution tails and general distribution shapes. This new metric is well suited to real-world scenarios, as measured on NIST FRVT evaluations, involving high-performance systems and realistic face databases including a wide range of covariates and demographic groups. We first show the limitations of existing metrics to correctly assess the presence of biases in realistic setups and then propose our new metric to tackle these limitations. We tested the proposed metric with two state-of-the-art models and four widely used databases, showing its capacity to overcome the main flaws of previous bias metrics.
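
The abstract does not give the CEI formula; the sketch below only illustrates the general idea of combining a distribution-shape distance with a tail-probability comparison for two groups' score distributions. The choice of the Wasserstein distance, the 5% tail fraction, and the blending weight are all assumptions for illustration, not the paper's definition.

```python
# Illustrative shape-plus-tail comparison of two score distributions (not the CEI itself).
import numpy as np
from scipy.stats import wasserstein_distance

def shape_plus_tail_distance(scores_a, scores_b, tail_fraction=0.05, alpha=0.5):
    shape_term = wasserstein_distance(scores_a, scores_b)            # overall shape difference
    # Compare the probability mass each group places below a common low-score threshold.
    threshold = np.quantile(np.concatenate([scores_a, scores_b]), tail_fraction)
    tail_term = abs(np.mean(scores_a <= threshold) - np.mean(scores_b <= threshold))
    return alpha * shape_term + (1 - alpha) * tail_term

rng = np.random.default_rng(0)
group_a = rng.normal(0.70, 0.10, 5000)   # genuine-score distribution, demographic group A
group_b = rng.normal(0.65, 0.15, 5000)   # genuine-score distribution, demographic group B
print(shape_plus_tail_distance(group_a, group_b))
```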

[AI-15] From Grounding to Planning: Benchmarking Bottlenecks in Web Agents

链接: https://arxiv.org/abs/2409.01927
作者: Segev Shlomov,Ben wiesel,Aviad Sela,Ido Levy,Liane Galanti,Roy Abitbol
关键词-EN: General web-based agents, applications remains poor, yielding extremely low, extremely low accuracy, General web-based
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations which hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.

[AI-16] GradINN: Gradient Informed Neural Network

链接: https://arxiv.org/abs/2409.01914
作者: Filippo Aglietti,Francesco Della Santa,Andrea Piano,Virginia Aglietti
关键词-EN: Physics Informed Neural, Informed Neural Networks, propose Gradient Informed, Physics Informed, Gradient Informed Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose Gradient Informed Neural Networks (GradINNs), a methodology inspired by Physics Informed Neural Networks (PINNs) that can be used to efficiently approximate a wide range of physical systems for which the underlying governing equations are completely unknown or cannot be defined, a condition that is often met in complex engineering problems. GradINNs leverage prior beliefs about a system’s gradient to constrain the predicted function’s gradient across all input dimensions. This is achieved using two neural networks: one modeling the target function and an auxiliary network expressing prior beliefs, e.g., smoothness. A customized loss function enables training the first network while enforcing gradient constraints derived from the auxiliary network. We demonstrate the advantages of GradINNs, particularly in low-data regimes, on diverse problems spanning non-time-dependent systems (Friedman function, Stokes Flow) and time-dependent systems (Lotka-Volterra, Burgers’ equation). Experimental results showcase strong performance compared to standard neural networks and PINN-like approaches across all tested scenarios.
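
A minimal sketch of the two-network setup, under stated assumptions: a primary network f is fit to toy data while its input gradient is softly penalized toward the output of an auxiliary network g that stands in for the gradient prior. The architectures, the penalty weight, and the joint training of both networks are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))  # models the target function
g = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))  # stands in for the gradient prior

x = torch.rand(256, 2, requires_grad=True)
y = (x[:, :1] * x[:, 1:]).detach()                                 # toy observations of f(x) = x1 * x2

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
for _ in range(500):
    pred = f(x)
    # Gradient of the prediction w.r.t. every input dimension, kept in the graph.
    grad = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    loss = F.mse_loss(pred, y) + 0.1 * F.mse_loss(grad, g(x))      # data term + gradient constraint
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.4f}")
```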

[AI-17] LUK: Empowering Log Understanding with Expert Knowledge from Large Language Models

链接: https://arxiv.org/abs/2409.01909
作者: Lipeng Ma,Weidong Yang,Sihang Jiang,Ben Fei,Mingjie Zhou,Shuhao Li,Bo Xu,Yanghua Xiao
关键词-EN: providing essential information, monitoring and troubleshooting, expert knowledge, play a critical, providing essential
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Logs play a critical role in providing essential information for system monitoring and troubleshooting. Recently, with the success of pre-trained language models (PLMs) and large language models (LLMs) in natural language processing (NLP), smaller PLMs (such as BERT) and LLMs (like ChatGPT) have become the current mainstream approaches for log analysis. While LLMs possess rich knowledge, their high computational costs and unstable performance make LLMs impractical for analyzing logs directly. In contrast, smaller PLMs can be fine-tuned for specific tasks even with limited computational resources, making them more practical. However, these smaller PLMs face challenges in understanding logs comprehensively due to their limited expert knowledge. To better utilize the knowledge embedded within LLMs for log understanding, this paper introduces a novel knowledge enhancement framework, called LUK, which acquires expert knowledge from LLMs to empower log understanding on a smaller PLM. Specifically, we design a multi-expert collaboration framework based on LLMs consisting of different roles to acquire expert knowledge. In addition, we propose two novel pre-training tasks to enhance the log pre-training with expert knowledge. LUK achieves state-of-the-art results on different log analysis tasks and extensive experiments demonstrate expert knowledge from LLMs can be utilized more effectively to understand logs.

[AI-18] A randomized simulation trial evaluating ABiMed, a clinical decision support system for medication reviews and polypharmacy management

链接: https://arxiv.org/abs/2409.01903
作者: Abdelmalek Mouazer,Sophie Dubois,Romain Léguillon,Nada Boudegzdame,Thibaud Levrard,Yoann Le Bars,Christian Simon,Brigitte Séroussi,Julien Grosjean,Romain Lelong,Catherine Letord,Stéfan Darmoni,Karima Sedki,Pierre Meneton,Rosy Tsopra,Hector Falcoff,Jean-Baptiste Lamy
关键词-EN: Medication review, Medication, structured interview, aimed at optimizing, ABiMed
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Background: Medication review is a structured interview of the patient, performed by the pharmacist and aimed at optimizing drug treatments. In practice, medication review is a long and cognitively-demanding task that requires specific knowledge. Clinical practice guidelines have been proposed, but their application is tedious. Methods: We designed ABiMed, a clinical decision support system for medication reviews, based on the implementation of the STOPP/START v2 guidelines and on the visual presentation of aggregated drug knowledge using tables, graphs and flower glyphs. We evaluated ABiMed with 39 community pharmacists during a randomized simulation trial, each pharmacist performing a medication review for two fictitious patients without ABiMed, and two others with ABiMed. We recorded the problems identified by the pharmacists, the interventions proposed, the response time, the perceived usability and the comments. Pharmacists’ medication reviews were compared to an expert-designed gold standard. Results: With ABiMed, pharmacists found 1.6 times more relevant drug-related problems during the medication review (p=1.1e-12) and proposed better interventions (p=9.8e-9), without needing more time (p=0.56). The System Usability Scale score is 82.7, which is ranked “excellent”. In their comments, pharmacists appreciated the visual aspect of ABiMed and its ability to compare the current treatment with the proposed one. A multifactor analysis showed no difference in the support offered by ABiMed according to the pharmacist’s age or sex, in terms of percentage of problems identified or quality of the proposed interventions. Conclusions: The use of an intelligent and visual clinical decision support system can help pharmacists when they perform medication reviews. Our main perspective is the validation of the system in clinical conditions.

[AI-19] 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands

链接: https://arxiv.org/abs/2409.01901
作者: Oline Ranum,Gomer Otterspeer,Jari I. Andersen,Robert G. Belleman,Floris Roelofsen
关键词-EN: American Sign Language, sign language, capturing sign language, sign, language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.

[AI-20] What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

链接: https://arxiv.org/abs/2409.01893
作者: Zhi Chen,Qiguang Chen,Libo Qin,Qipeng Guo,Haijun Lv,Yicheng Zou,Wanxiang Che,Hang Yan,Kai Chen,Dahua Lin
关键词-EN: complex planning scenarios, Recent advancements, extended context windows, information extraction, planning scenarios
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Work in progress

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: this https URL.

[AI-21] CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

链接: https://arxiv.org/abs/2409.01876
作者: Gaojie Lin,Jianwen Jiang,Chao Liang,Tianyun Zhong,Jiaqi Yang,Yanbo Zheng
关键词-EN: Diffusion-based video generation, Diffusion-based video, advanced significantly, catalyzing a proliferation, technology has advanced
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of facilitating zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.

[AI-22] Latent Distillation for Continual Object Detection at the Edge ECCV

链接: https://arxiv.org/abs/2409.01872
作者: Francesco Pasti,Marina Ceccon,Davide Dalle Pezze,Francesco Paissan,Elisabetta Farella,Gian Antonio Susto,Nicola Bellotto
关键词-EN: shifts remains challenging, distribution shifts remains, addressing data distribution, data distribution shifts, achieving remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV workshops, Computational Aspects of Deep Learning (CADL) 2024

点击查看摘要

Abstract:While numerous methods achieving remarkable performance exist in the Object Detection literature, addressing data distribution shifts remains challenging. Continual Learning (CL) offers solutions to this issue, enabling models to adapt to new data while maintaining performance on previous data. This is particularly pertinent for edge devices, common in dynamic environments like automotive and robotics. In this work, we address the memory and computation constraints of edge devices in the Continual Learning for Object Detection (CLOD) scenario. Specifically, (i) we investigate the suitability of an open-source, lightweight, and fast detector, namely NanoDet, for CLOD on edge devices, improving upon larger architectures used in the literature. Moreover, (ii) we propose a novel CL method, called Latent Distillation (LD), that reduces the number of operations and the memory required by state-of-the-art CL approaches without significantly compromising detection performance. Our approach is validated using the well-known VOC and COCO benchmarks, reducing the distillation parameter overhead by 74% and the Floating Point Operations (FLOPs) by 56% per model update compared to other distillation methods.

[AI-23] Real-Time Indoor Object Detection based on hybrid CNN-Transformer Approach

链接: https://arxiv.org/abs/2409.01871
作者: Salah Eddine Laidoudi,Madjid Maidi,Samir Otmane
关键词-EN: computer vision, faced with unique, complex backgrounds, challenging area, area of computer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time object detection in indoor settings is a challenging area of computer vision, faced with unique obstacles such as variable lighting and complex backgrounds. This field holds significant potential to revolutionize applications like augmented and mixed reality by enabling more seamless interactions between digital content and the physical world. However, the scarcity of research specifically tailored to the intricacies of indoor environments has highlighted a clear gap in the literature. To address this, our study delves into the evaluation of existing datasets and computational models, leading to the creation of a refined dataset. This new dataset is derived from OpenImages v7, focusing exclusively on 32 indoor categories selected for their relevance to real-world applications. Alongside this, we present an adaptation of a CNN detection model, incorporating an attention mechanism to enhance the model’s ability to discern and prioritize critical features within cluttered indoor scenes. Our findings demonstrate that this approach is not just competitive with existing state-of-the-art models in accuracy and speed but also opens new avenues for research and application in the field of real-time indoor object detection.

[AI-24] The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?

链接: https://arxiv.org/abs/2409.01864
作者: Pedro Ramoneda,Emilia Parada-Cabaleiro,Benno Weck,Xavier Serra
关键词-EN: Large Language Models, Large Language, reliability of Large, Language Models, Large
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance of and concerns regarding this now-ubiquitous technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by incorporating accurate and reliable domain knowledge.

[AI-25] Learning State-Dependent Policy Parametrizations for Dynamic Technician Routing with Rework

链接: https://arxiv.org/abs/2409.01815
作者: Jonas Stein,Florentin D Hildebrandt,Barrett W Thomas,Marlin W Ulmer
关键词-EN: Home repair, repair and installation, Home, installation services require, resolve tasks
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Home repair and installation services require technicians to visit customers and resolve tasks of different complexity. Technicians often have heterogeneous skills and working experiences. The geographical spread of customers makes achieving only perfect matches between technician skills and task requirements impractical. Additionally, technicians are regularly absent due to sickness. With non-perfect assignments of task requirements to technician skills, some tasks may remain unresolved and require a revisit and rework. Companies seek to minimize customer inconvenience due to delay. We model the problem as a sequential decision process where, over a number of service days, customers request service while heterogeneously skilled technicians are routed to serve customers in the system. Each day, our policy iteratively builds tours by adding “important” customers. The importance is based on analytical considerations and is measured by accounting for routing efficiency, urgency of service, and risk of rework in an integrated fashion. We propose a state-dependent balance of these factors via reinforcement learning. A comprehensive study shows that taking a few non-perfect assignments can be quite beneficial for the overall service quality. We further demonstrate the value provided by a state-dependent parametrization.

[AI-26] Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations ALT

链接: https://arxiv.org/abs/2409.01808
作者: Ike Ebubechukwu,Johane Takeuchi,Antonello Ceravola,Frank Joublin
关键词-EN: chatbots increasingly integrate, accurate evaluation methods, Goal Contribution, Incorrect Fact, everyday interactions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 17 pages, 15 figures, shorter version submitted to 22nd Annual Workshop of the Australasian Language Technology Association (ALTA’24)

点击查看摘要

Abstract:As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.
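
For readers who want a feel for how such LLM-based ratings are typically collected, here is a hypothetical sketch using the OpenAI chat API; the model name, prompt wording, and 1-5 scale are assumptions and do not reproduce the study's actual protocol.

```python
# Hypothetical KPI-style dialogue scoring via the OpenAI chat API (illustration only).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
KPIS = ["Coherence", "Innovation", "Concreteness", "Goal Contribution"]

def rate_dialogue(dialogue: str, kpi: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Rate the following dialogue for {kpi} on a 1-5 scale. "
                       f"Reply with the number only.\n\n{dialogue}",
        }],
    )
    return response.choices[0].message.content.strip()

sample = "A: Shall we book the venue today?\nB: Yes, I will call them this afternoon."
print({kpi: rate_dialogue(sample, kpi) for kpi in KPIS})
```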

[AI-27] LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning

链接: https://arxiv.org/abs/2409.01806
作者: Haoming Li,Zhaoliang Chen,Jonathan Zhang,Fei Liu
关键词-EN: developing corporate strategies, routing autonomous vehicles, corporate strategies, organizing a vacation, vacation to routing
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.

[AI-28] Training on the Benchmark Is Not All You Need

链接: https://arxiv.org/abs/2409.01790
作者: Shiwen Ni,Xiangtao Kong,Chengming Li,Xiping Hu,Ruifeng Xu,Jia Zhu,Min Yang
关键词-EN: Large Language Models, pre-training data learned, data, data leakage, model pre-training data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model’s log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
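
A rough sketch of the shuffled-options test described above, assuming a Hugging Face causal LM for scoring and a simple z-score outlier rule; the paper's exact scoring and outlier criterion may differ, and GPT-2 here is only a stand-in model.

```python
# Leakage probe: score every permutation of the options and check whether the
# original ordering is both the maximum and an outlier in log-probability.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_prob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)   # sum of token log-probabilities

question = "Which planet is largest?"
options = ["A. Jupiter", "B. Earth", "C. Mars", "D. Venus"]
scores = [log_prob(question + "\n" + "\n".join(p)) for p in itertools.permutations(options)]

original, rest = scores[0], torch.tensor(scores[1:])   # first permutation is the original order
z = (original - rest.mean()) / (rest.std() + 1e-8)
print("possible leakage" if original == max(scores) and z > 3 else "no clear signal")
```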

[AI-29] Empirical evidence of Large Language Models influence on human spoken communication

链接: https://arxiv.org/abs/2409.01754
作者: Hiromu Yakura,Ezequiel Lopez-Lopez,Levin Brinkmann,Ignacio Serna,Prateek Gupta,Iyad Rahwan
关键词-EN: Large Language Models, Artificial Intelligence, advances in Large, Language Models, Large Language
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) agents now interact with billions of humans in natural language, thanks to advances in Large Language Models (LLMs) like ChatGPT. This raises the question of whether AI has the potential to shape a fundamental aspect of human culture: the way we speak. Recent analyses revealed that scientific publications already exhibit evidence of AI-specific language. But this evidence is inconclusive, since scientists may simply be using AI to copy-edit their writing. To explore whether AI has influenced human spoken communication, we transcribed and analyzed about 280,000 English-language videos of presentations, talks, and speeches from more than 20,000 YouTube channels of academic institutions. We find a significant shift in the trend of word usage specific to words distinctively associated with ChatGPT following its release. These findings provide the first empirical evidence that humans increasingly imitate LLMs in their spoken language. Our results raise societal and policy-relevant concerns about the potential of AI to unintentionally reduce linguistic diversity, or to be deliberately misused for mass manipulation. They also highlight the need for further investigation into the feedback loops between machine behavior and human culture.

[AI-30] Interpreting Outliers in Time Series Data through Decoding Autoencoder ECML-PKDD

链接: https://arxiv.org/abs/2409.01713
作者: Patrick Knab,Sascha Marton,Christian Bartelt,Robert Fuder
关键词-EN: crucial analytical tool, crucial analytical, analytical tool, Aggregated Explanatory Ensemble, Outlier detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figures, accepted at TempXAI @ ECML-PKDD

点击查看摘要

Abstract:Outlier detection is a crucial analytical tool in various fields. In critical systems like manufacturing, malfunctioning outlier detection can be costly and safety-critical. Therefore, there is a significant need for explainable artificial intelligence (XAI) when deploying opaque models in such environments. This study focuses on manufacturing time series data from a German automotive supply industry. We utilize autoencoders to compress the entire time series and then apply anomaly detection techniques to its latent features. For outlier interpretation, we (i) adapt widely used XAI techniques to the autoencoder’s encoder. Additionally, (ii) we propose AEE, Aggregated Explanatory Ensemble, a novel approach that fuses explanations of multiple XAI techniques into a single, more expressive interpretation. For evaluation of explanations, (iii) we propose a technique to measure the quality of encoder explanations quantitatively. Furthermore, we qualitatively assess the effectiveness of outlier explanations with domain expertise.
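
The compress-then-detect pipeline can be sketched in a few lines; the architecture, window sizes, and the choice of IsolationForest as the latent-space detector are illustrative assumptions, not the paper's setup.

```python
# Compress time-series windows with an autoencoder, then flag outliers on the latents.
import torch
import torch.nn as nn
from sklearn.ensemble import IsolationForest

series = torch.randn(500, 128)                         # 500 windows, 128 timesteps each
encoder = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 8))
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 128))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(200):                                   # reconstruction training
    loss = nn.functional.mse_loss(decoder(encoder(series)), series)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    latents = encoder(series).numpy()
labels = IsolationForest(random_state=0).fit_predict(latents)   # -1 marks outliers
print((labels == -1).sum(), "windows flagged as outliers")
```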

[AI-31] USTC-KXDIGIT System Description for ASVspoof5 Challenge

链接: https://arxiv.org/abs/2409.01695
作者: Yihao Chen,Haochen Wu,Nan Jiang,Xiang Xia,Qing Gu,Yunqi Hao,Pengfei Cai,Yu Guan,Jialong Wang,Weilin Xie,Lei Fang,Sian Fang,Yan Song,Wu Guo,Lin Liu,Minqiang Xu
关键词-EN: spoofing-robust automatic speaker, automatic speaker verification, Track, spoofing-robust automatic, speaker verification
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: ASVspoof5 workshop paper

点击查看摘要

Abstract:This paper describes the USTC-KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a frontend feature extractor and a back-end classifier. We focus on extensive embedding engineering and enhancing the generalization of the back-end classifier model. Specifically, the embedding engineering is based on hand-crafted features and speech representations from a self-supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the close condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN-based ASV system. This approach achieved 0.2814 min-aDCF in the closed condition and 0.0756 min-aDCF in the open condition, showcasing superior performance in the SASV system.

[AI-32] Differentially Private Kernel Density Estimation

链接: https://arxiv.org/abs/2409.01688
作者: Erzhi Liu,Jerry Yao-Chieh Hu,Alex Reneau,Zhao Song,Han Liu
关键词-EN: refined differentially private, differentially private, improved privacy-utility tradeoff, Toggle, Differentially Private Kernel
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only an improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the following mathematical problem: given a similarity function $f$ (or DP KDE) and a private dataset $X \subset \mathbb{R}^d$, our goal is to preprocess $X$ so that for any query $y \in \mathbb{R}^d$, we approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) = \|x - y\|_1$ is the node-contaminated balanced binary tree by [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing with $n = |X|$. For any query point, the query time is $d \log n$, with an error guarantee of a $(1+\alpha)$-approximation and $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$ additive error. In this paper, we improve the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects: (i) we reduce the query time by a factor of $\alpha^{-1} \log n$; (ii) we improve the approximation ratio from $(1+\alpha)$ to $1$; (iii) we reduce the error dependence by a factor of $\alpha^{-0.5}$. From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into $\alpha^{-1} \log n$ numbers, each derived from the summation of $\log n$ values in interval tree countings. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, where each is a smart combination of two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.
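
For context on the problem statement (though not the paper's tree-based construction), the simplest differentially private release of $\sum_{x \in X} f(x, y)$ is the Laplace mechanism, valid when $f$ is bounded in $[0, 1]$ so that the sum has sensitivity 1. The bounded kernel and parameters below are assumptions chosen only to make the baseline concrete.

```python
# Naive epsilon-DP release of a kernel sum via the Laplace mechanism (baseline sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                       # private dataset
y = np.zeros(4)                                      # query point
eps = 1.0

f_vals = np.exp(-np.abs(X - y).sum(axis=1))          # f(x, y) = exp(-||x - y||_1), in (0, 1]
true_sum = f_vals.sum()
dp_sum = true_sum + rng.laplace(scale=1.0 / eps)     # noise calibrated to sensitivity 1
print(true_sum, dp_sum)
```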

[AI-33] Adaptive Explicit Knowledge Transfer for Knowledge Distillation

链接: https://arxiv.org/abs/2409.01679
作者: Hyungkeun Park,Jong-seok Lee
关键词-EN: Logit-based knowledge distillation, subject to inferior, knowledge, inferior performance, Logit-based knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Logit-based knowledge distillation (KD) for classification is cost-efficient compared to feature-based KD but is often subject to inferior performance. Recently, it was shown that the performance of logit-based KD can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model, which is known as 'implicit (dark) knowledge', to the student model. Through gradient analysis, we first show that this actually has an effect of adaptively controlling the learning of implicit knowledge. Then, we propose a new loss that enables the student to learn explicit knowledge (i.e., the teacher’s confidence about the target class) along with implicit knowledge in an adaptive manner. Furthermore, we propose to separate the classification and distillation tasks for effective distillation and inter-class relationship modeling. Experimental results demonstrate that the proposed method, called the adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to state-of-the-art KD methods on the CIFAR-100 and ImageNet datasets.
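
As orientation, a vanilla temperature-scaled logit distillation loss is shown below, with one illustrative way of weighting the dark-knowledge term by the teacher's confidence on the target class. This is not the paper's AEKT loss, whose adaptive formulation is defined in the paper; the temperature, weights, and confidence weighting are assumptions.

```python
# Standard logit-based KD loss plus an illustrative teacher-confidence weighting.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, targets)                       # hard-label term
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="none").sum(dim=1) * T * T                # dark-knowledge term
    teacher_conf = F.softmax(teacher_logits, dim=1).gather(1, targets[:, None]).squeeze(1)
    return alpha * ce + (1 - alpha) * (teacher_conf * soft).mean()      # confidence-weighted

logits_s, logits_t = torch.randn(8, 100), torch.randn(8, 100)
print(kd_loss(logits_s, logits_t, torch.randint(0, 100, (8,))))
```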

[AI-34] Classifier-Free Diffusion-Based Weakly-Supervised Approach for Health Indicator Derivation in Rotating Machines: Advancing Early Fault Detection and Condition Monitoring

链接: https://arxiv.org/abs/2409.01676
作者: Wenyang Hu,Gaetan Frusque,Tianyang Wang,Fulei Chu,Olga Fink
关键词-EN: Deriving health indicators, health indicators, Deriving health, rotating machines, indicators of rotating
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Deriving health indicators of rotating machines is crucial for their maintenance. However, this process is challenging for prevalently adopted intelligent methods, since they tend to model the whole data distribution, which not only introduces noise interference but also lacks explainability. To address these issues, we propose a diffusion-based weakly-supervised approach for deriving health indicators of rotating machines, enabling early fault detection and continuous monitoring of condition evolution. This approach relies on a classifier-free diffusion model trained using healthy samples and a few anomalies. This model generates healthy samples, and by comparing the differences between the original samples and the generated ones in the envelope spectrum, we construct an anomaly map that clearly identifies faults. Health indicators are then derived, which can explain the fault types and mitigate noise interference. Comparative studies on two cases demonstrate that the proposed method offers superior health monitoring effectiveness and robustness compared to baseline models.
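
The envelope-spectrum comparison step can be sketched with synthetic signals; the sampling rate, fault frequency, and the generated "healthy" stand-in below are placeholders, not the diffusion model's actual output.

```python
# Compare envelope spectra of an observed signal and a healthy reference as an anomaly map.
import numpy as np
from scipy.signal import hilbert

fs = 12_000                                            # sampling rate (Hz), assumed
t = np.arange(0, 1.0, 1 / fs)
healthy = np.sin(2 * np.pi * 1500 * t)                 # stand-in for the generated healthy signal
faulty = healthy * (1 + 0.3 * (np.sin(2 * np.pi * 97 * t) > 0.95))  # impacts repeating at ~97 Hz

def envelope_spectrum(x):
    env = np.abs(hilbert(x))                           # amplitude envelope via Hilbert transform
    spec = np.abs(np.fft.rfft(env - env.mean()))
    return np.fft.rfftfreq(len(x), 1 / fs), spec

freqs, spec_healthy = envelope_spectrum(healthy)
_, spec_faulty = envelope_spectrum(faulty)
anomaly_map = np.abs(spec_faulty - spec_healthy)
print("dominant frequency in the anomaly map ≈", freqs[anomaly_map.argmax()], "Hz")
```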

[AI-35] Enhancing Fine-Grained Visual Recognition in the Low-Data Regime Through Feature Magnitude Regularization

链接: https://arxiv.org/abs/2409.01672
作者: Avraham Chapman,Haiming Xu,Lingqiao Liu
关键词-EN: distracting noise patterns, easily discernible amidst, discernible amidst distracting, amidst distracting noise, limited data presents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training a fine-grained image recognition model with limited data presents a significant challenge, as the subtle differences between categories may not be easily discernible amidst distracting noise patterns. One commonly employed strategy is to leverage pretrained neural networks, which can generate effective feature representations for constructing an image classification model with a restricted dataset. However, these pretrained neural networks are typically trained for different tasks than the fine-grained visual recognition (FGVR) task at hand, which can lead to the extraction of less relevant features. Moreover, in the context of building FGVR models with limited data, these irrelevant features can dominate the training process, overshadowing more useful, generalizable discriminative features. Our research has identified a surprisingly simple solution to this challenge: we introduce a regularization technique to ensure that the magnitudes of the extracted features are evenly distributed. This regularization is achieved by maximizing the uniformity of feature magnitude distribution, measured through the entropy of the normalized features. The motivation behind this regularization is to remove bias in feature magnitudes from pretrained models, where some features may be more prominent and, consequently, more likely to be used for classification. Additionally, we have developed a dynamic weighting mechanism to adjust the strength of this regularization throughout the learning process. Despite its apparent simplicity, our approach has demonstrated significant performance improvements across various fine-grained visual recognition datasets.
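
下面是对“特征幅值熵正则”思想的一个极简 numpy 示意:把各特征维度的平均幅值归一化为分布并最大化其熵。归一化细节与训练时的权重调度均为假设,并非论文原始实现:

```python
import numpy as np

def magnitude_entropy_regularizer(features, eps=1e-12):
    """示意:把一个 batch 特征各维度的平均幅值归一化成概率分布并计算其熵;
    训练时最大化该熵(即在总损失中加上 -entropy 项),
    使特征幅值分布更均匀、减少预训练模型带来的幅值偏置。"""
    mags = np.abs(features).mean(axis=0)   # 每个特征维度的平均幅值
    p = mags / (mags.sum() + eps)          # 归一化为分布
    entropy = -np.sum(p * np.log(p + eps))
    return -entropy                        # 作为正则项加入总损失

feats = np.random.randn(32, 512) * np.linspace(0.1, 3.0, 512)  # 幅值有偏的特征
print("regularizer:", magnitude_entropy_regularizer(feats))
```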

[AI-36] Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training ICASSP2025

链接: https://arxiv.org/abs/2409.01668
作者: Wenhan Yao,Zedong Xing,Xiarun Chen,Jia Liu,Yongqiang He,Weiping Wen
关键词-EN: unseen target speaker, aims to change, change the timbre, unseen target, One-shot voice conversion
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: submmited to ICASSP 2025

点击查看摘要

Abstract:One-shot voice conversion (VC) aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing style transfer-based VC methods rely on speech representation disentanglement and struggle to accurately and independently encode each speech component and recompose them into converted speech effectively. To tackle this, we propose Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, we used effective styleformer blocks to integrate speaker characteristics into the generated speech effectively. The models used the generative VAE loss for encoding components and triplet loss for unsupervised discriminative training. We applied the styleformer method to Zipformer’s shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.

[AI-37] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

链接: https://arxiv.org/abs/2409.01652
作者: Wenlong Huang,Chen Wang,Yunzhu Li,Ruohan Zhang,Li Fei-Fei
关键词-EN: Relational Keypoint Constraints, encode desired robot, desired robot behaviors, Keypoint Constraints, Relational Keypoint
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Representing robotic manipulation tasks as constraints that associate the robot and the environment is a promising way to encode desired robot behaviors. However, it remains unclear how to formulate the constraints such that they are 1) versatile to diverse tasks, 2) free of manual labeling, and 3) optimizable by off-the-shelf solvers to produce robot actions in real-time. In this work, we introduce Relational Keypoint Constraints (ReKep), a visually-grounded representation for constraints in robotic manipulation. Specifically, ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We demonstrate that by representing a manipulation task as a sequence of Relational Keypoint Constraints, we can employ a hierarchical optimization procedure to solve for robot actions (represented by a sequence of end-effector poses in SE(3)) with a perception-action loop at a real-time frequency. Furthermore, in order to circumvent the need for manual specification of ReKep for each new task, we devise an automated procedure that leverages large vision models and vision-language models to produce ReKep from free-form language instructions and RGB-D observations. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform that can perform a large variety of manipulation tasks, featuring multi-stage, in-the-wild, bimanual, and reactive behaviors, all without task-specific data or environment models. Website at this https URL.
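
摘要提到 ReKep 以“把一组三维关键点映射为数值代价的 Python 函数”来表达约束。下面给出一个假设性的约束示例(要求 0 号关键点位于 1 号关键点上方约 0.10 m),仅作说明,并非论文中的真实约束:

```python
import numpy as np

def rekep_above_constraint(keypoints: np.ndarray) -> float:
    """一个假设的 ReKep 约束示例:要求 0 号关键点位于 1 号关键点
    正上方约 0.10 m 处。输入为 (N, 3) 的三维关键点坐标,输出为数值代价,
    代价为 0 表示约束满足。"""
    offset = keypoints[0] - keypoints[1]
    horizontal_err = np.linalg.norm(offset[:2])   # 水平方向应当对齐
    vertical_err = abs(offset[2] - 0.10)          # 垂直方向相差约 0.10 m
    return float(horizontal_err + vertical_err)

kps = np.array([[0.31, 0.02, 0.55],   # 例如:杯盖上的关键点(假设坐标)
                [0.30, 0.00, 0.45]])  # 例如:杯口上的关键点(假设坐标)
print("cost:", rekep_above_constraint(kps))
```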

[AI-38] PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

链接: https://arxiv.org/abs/2409.01635
作者: Ricardo Knauer,Marvin Grimm,Erik Rodner
关键词-EN: faced with small-sized, small-sized tabular data, small-sized tabular, tabular, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: AutoML 2024 Workshop Track

点击查看摘要

Abstract:In practice, we are often faced with small-sized tabular data. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes ≤ 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, as well as classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available on this https URL, allows researchers and practitioners to analyze their own methods and challenge their data efficiency.
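
下面示意摘要中所说的低数据基线对比:用带交叉验证的逻辑回归作为小样本表格数据的基线。这里用 sklearn 的合成数据集代替 PMLBmini 套件(其真实调用接口以官方仓库为准,此处为假设性替代):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 用一个 ≤500 样本的合成二分类数据集充当 PMLBmini 中的数据集(占位假设)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print(f"logistic regression baseline AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```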

[AI-39] Dreaming is All You Need

链接: https://arxiv.org/abs/2409.01633
作者: Mingze Ni,Wei Liu
关键词-EN: achieving a harmonious, paramount importance, harmonious balance, SleepNet, classification tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In classification tasks, achieving a harmonious balance between exploration and precision is of paramount importance. To this end, this research introduces two novel deep learning models, SleepNet and DreamNet, to strike this balance. SleepNet seamlessly integrates supervised learning with unsupervised “sleep” stages using pre-trained encoder models. Dedicated neurons within SleepNet are embedded in these unsupervised features, forming intermittent “sleep” blocks that facilitate exploratory learning. Building upon the foundation of SleepNet, DreamNet employs full encoder-decoder frameworks to reconstruct the hidden states, mimicking the human “dreaming” process. This reconstruction process enables further exploration and refinement of the learned representations. Moreover, the principal ideas of our SleepNet and DreamNet are generic and can be applied to both computer vision and natural language processing downstream tasks. Through extensive empirical evaluations on diverse image and text datasets, SleepNet and DreamNet have demonstrated superior performance compared to state-of-the-art models, showcasing the strengths of unsupervised exploration and supervised precision afforded by our innovative approaches.

[AI-40] SafeEmbodAI: a Safety Framework for Mobile Robots in Embodied AI Systems

链接: https://arxiv.org/abs/2409.01630
作者: Wenxiao Zhang,Xiangrui Kong,Thomas Braunl,Jin B. Hong
关键词-EN: Large Language Models, Language Models, Large Language, understand complex language, perform advanced tasks
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Embodied AI systems, including AI-powered robots that autonomously interact with the physical world, stand to be significantly advanced by Large Language Models (LLMs), which enable robots to better understand complex language commands and perform advanced tasks with enhanced comprehension and adaptability, highlighting their potential to improve embodied AI capabilities. However, this advancement also introduces safety challenges, particularly in robotic navigation tasks. Improper safety management can lead to failures in complex environments and make the system vulnerable to malicious command injections, resulting in unsafe behaviours such as detours or collisions. To address these issues, we propose SafeEmbodAI, a safety framework for integrating mobile robots into embodied AI systems. SafeEmbodAI incorporates secure prompting, state management, and safety validation mechanisms to secure and assist LLMs in reasoning through multi-modal data and validating responses. We designed a metric to evaluate mission-oriented exploration, and evaluations in simulated environments demonstrate that our framework effectively mitigates threats from malicious commands and improves performance in various environment settings, ensuring the safety of embodied AI systems. Notably, in complex environments with mixed obstacles, our method demonstrates a significant performance increase of 267% compared to the baseline in attack scenarios, highlighting its robustness in challenging conditions.

[AI-41] Lexicographic optimization-based approaches to learning a representative model for multi-criteria sorting with non-monotonic criteria

链接: https://arxiv.org/abs/2409.01612
作者: Zhen Zhang,Zhuolin Li,Wenyu Yu
关键词-EN: MCS problems, Deriving a representative, representative model, MCS problems traditionally, MCS
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 45 pages, 12 figures

点击查看摘要

Abstract:Deriving a representative model using value function-based methods from the perspective of preference disaggregation has emerged as a prominent and growing topic in multi-criteria sorting (MCS) problems. A noteworthy observation is that many existing approaches to learning a representative model for MCS problems traditionally assume the monotonicity of criteria, which may not always align with the complexities found in real-world MCS scenarios. Consequently, this paper proposes some approaches to learning a representative model for MCS problems with non-monotonic criteria through the integration of the threshold-based value-driven sorting procedure. To do so, we first define some transformation functions to map the marginal values and category thresholds into a UTA-like functional space. Subsequently, we construct constraint sets to model non-monotonic criteria in MCS problems and develop optimization models to check and rectify the inconsistency of the decision maker’s assignment example preference information. By simultaneously considering the complexity and discriminative power of the models, two distinct lexicographic optimization-based approaches are developed to derive a representative model for MCS problems with non-monotonic criteria. Eventually, we offer an illustrative example and conduct comprehensive simulation experiments to elaborate the feasibility and validity of the proposed approaches.

[AI-42] Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

链接: https://arxiv.org/abs/2409.01610
作者: Yearim Kim,Sangyu Han,Sangbum Han,Nojun Kwak
关键词-EN: local explanations, global explanations, mechanistic interpretability, exact operations, progression from local
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of eXplainable AI (XAI) in language models, the progression from local explanations of individual decisions to global explanations with high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode the exact operations. However, this paradigm has not been adequately explored in image models, where existing methods have primarily focused on class-specific interpretations. This paper introduces a novel approach to systematically trace the entire pathway from input through all intermediate layers to the final output within the whole dataset. We utilize Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors. Then, we calculate the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior. We validate our method of concept extraction and concept attribution in both qualitative and quantitative evaluations. Our approach advances the understanding of semantic significance within image models, offering a holistic view of their operational mechanics.

[AI-43] Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information

链接: https://arxiv.org/abs/2409.01605
作者: Xinyu Zhang,Linmei Hu,Luhao Zhang,Dandan Song,Heyan Huang,Liqiang Nie
关键词-EN: facilitating targeted recommendations, Large Language Models, discerning user preferences, Large Language, employing Large Language
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Sequential recommender systems are essential for discerning user preferences from historical interactions and facilitating targeted recommendations. Recent innovations employing Large Language Models (LLMs) have advanced the field by encoding item semantics, yet they often necessitate substantial parameter tuning and are resource-demanding. Moreover, these works fail to consider the diverse characteristics of different types of users, which diminishes the recommendation accuracy. In this paper, we propose a parameter-efficient Large Language Model Bi-Tuning framework for sequential recommendation with collaborative information (Laser). Specifically, Bi-Tuning works by inserting trainable virtual tokens at both the prefix and suffix of the input sequence and freezing the LLM parameters, thus optimizing the LLM for the sequential recommendation. In our Laser, the prefix is utilized to incorporate user-item collaborative information and adapt the LLM to the recommendation task, while the suffix converts the output embeddings of the LLM from the language space to the recommendation space for the follow-up item recommendation. Furthermore, to capture the characteristics of different types of users when integrating the collaborative information via the prefix, we introduce M-Former, a lightweight MoE-based querying transformer that uses a set of query experts to integrate diverse user-specific collaborative information encoded by frozen ID-based sequential recommender systems, significantly improving the accuracy of recommendations. Extensive experiments on real-world datasets demonstrate that Laser can parameter-efficiently adapt LLMs to effective recommender systems, significantly outperforming state-of-the-art methods.

[AI-44] A Time-Intensity Aware Pipeline for Generating Late-Stage Breast DCE-MRI using Generative Adversarial Models

链接: https://arxiv.org/abs/2409.01596
作者: Ruben D. Fonnegra,Maria Liliana Hernández,Juan C. Caicedo,Gloria M. Díaz
关键词-EN: Contrast-enhancement pattern analysis, magnetic resonance imaging, breast magnetic resonance, contrast-enhanced breast MRI, Contrast-enhancement pattern
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrast-enhancement pattern analysis is critical in breast magnetic resonance imaging (MRI) to distinguish benign from probably malignant tumors. However, contrast-enhanced image acquisitions are time-consuming and very expensive. As an alternative to physical acquisition, this paper proposes a comprehensive pipeline for the generation of accurate long-term (late) contrast-enhanced breast MRI from the early counterpart. The proposed strategy focuses on preserving the contrast agent pattern in the enhanced regions while maintaining visual properties in the entire synthesized images. To that end, a novel loss function that leverages the biological behavior of contrast agent (CA) in tissue, given by the Time-Intensity (TI) enhancement curve, is proposed to optimize a pixel-attention based generative model. In addition, unlike traditional normalization and standardization methods, we developed a new normalization strategy that maintains the contrast enhancement pattern across the image sequences at multiple timestamps. This ensures the prevalence of the CA pattern after image preprocessing, unlike conventional approaches. Furthermore, in order to objectively evaluate the clinical quality of the synthesized images, two metrics are also introduced to measure the differences between the TI curves of enhanced regions of the acquired and synthesized images. The experimental results showed that the proposed strategy generates images that significantly outperform diagnostic quality in contrast-enhanced regions while maintaining the spatial features of the entire image. These results suggest a potential use of synthetic late enhanced images generated via deep learning in clinical scenarios.
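
下面给出一个 TI(Time-Intensity)曲线一致性损失的 numpy 示意:对真实与合成序列在增强区域的平均强度曲线求均方误差。掩码来源与损失形式均为假设,并非论文原始定义:

```python
import numpy as np

def ti_curve(images, mask):
    """images: (T, H, W) 的时间序列,mask: 增强区域的布尔掩码。
    返回每个时间点上增强区域的平均强度,即 TI 曲线。"""
    return np.array([img[mask].mean() for img in images])

def ti_curve_loss(real_seq, synth_seq, mask):
    """示意性的 TI 曲线一致性损失:真实与合成序列 TI 曲线的均方误差。"""
    return float(np.mean((ti_curve(real_seq, mask) - ti_curve(synth_seq, mask)) ** 2))

T, H, W = 5, 64, 64
mask = np.zeros((H, W), dtype=bool)
mask[20:40, 20:40] = True                                  # 假设的增强区域
real = np.stack([np.full((H, W), 0.2 + 0.1 * t) for t in range(T)])
synth = real + 0.02 * np.random.randn(T, H, W)
print("TI loss:", ti_curve_loss(real, synth, mask))
```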

[AI-45] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

链接: https://arxiv.org/abs/2409.01586
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词-EN: Large language models’, concerns for Large, Large language, Harmful fine-tuning issue, poses serious safety
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The harmful fine-tuning issue (qi2023fine) poses serious safety concerns for large language models’ fine-tuning-as-a-service. While existing defenses (huang2024vaccine; rosati2024representation) have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that harmful perturbation over the model weights should be the root cause of the alignment breaking caused by harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage’s optimization. The regularizer ensures that the model’s harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at this https URL.
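
下面用一个玩具二次损失来示意“削弱模拟有害扰动带来的有害损失下降”这一正则思想;步长、合页(hinge)形式与权重均为假设,并非 Booster 的原始公式:

```python
import numpy as np

def harmful_loss(w, X_h, y_h):
    # 玩具"有害损失":在模拟有害数据上的均方误差
    return float(np.mean((X_h @ w - y_h) ** 2))

def booster_style_regularizer(w, X_h, y_h, step=0.1):
    """示意:先沿有害损失的梯度方向做一步模拟有害扰动,
    再惩罚扰动前后有害损失的下降量,使这种下降被削弱。"""
    grad = 2 * X_h.T @ (X_h @ w - y_h) / len(y_h)
    w_perturbed = w - step * grad / (np.linalg.norm(grad) + 1e-12)
    drop = harmful_loss(w, X_h, y_h) - harmful_loss(w_perturbed, X_h, y_h)
    return max(drop, 0.0)   # 与原始对齐损失相加构成总损失

rng = np.random.default_rng(0)
X_h, y_h = rng.normal(size=(64, 8)), rng.normal(size=64)
w = rng.normal(size=8)
print("regularizer:", booster_style_regularizer(w, X_h, y_h))
```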

[AI-46] GaussianPU: A Hybrid 2D-3D Upsampling Framework for Enhancing Color Point Clouds via 3D Gaussian Splatting

链接: https://arxiv.org/abs/2409.01581
作者: Zixuan Guo,Yifan Xie,Weijing Xie,Peng Huang,Fei Ma,Fei Richard Yu
关键词-EN: point clouds, colored point clouds, clouds enhance visual, point, point clouds enhance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Dense colored point clouds enhance visual perception and are of significant value in various robotic applications. However, existing learning-based point cloud upsampling methods are constrained by computational resources and batch processing strategies, which often require subdividing point clouds into smaller patches, leading to distortions that degrade perceptual quality. To address this challenge, we propose a novel 2D-3D hybrid colored point cloud upsampling framework (GaussianPU) based on 3D Gaussian Splatting (3DGS) for robotic perception. This approach leverages 3DGS to bridge 3D point clouds with their 2D rendered images in robot vision systems. A dual scale rendered image restoration network transforms sparse point cloud renderings into dense representations, which are then input into 3DGS along with precise robot camera poses and interpolated sparse point clouds to reconstruct dense 3D point clouds. We have made a series of enhancements to the vanilla 3DGS, enabling precise control over the number of points and significantly boosting the quality of the upsampled point cloud for robotic scene understanding. Our framework supports processing entire point clouds on a single consumer-grade GPU, such as the NVIDIA GeForce RTX 3090, eliminating the need for segmentation and thus producing high-quality, dense colored point clouds with millions of points for robot navigation and manipulation tasks. Extensive experimental results on generating million-level point cloud data validate the effectiveness of our method, substantially improving the quality of colored point clouds and demonstrating significant potential for applications involving large-scale point clouds in autonomous robotics and human-robot interaction scenarios.

[AI-47] AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models

链接: https://arxiv.org/abs/2409.01579
作者: Qianchi Zhang,Hainan Zhang,Liang Pang,Hongwei Zheng,Zhiming Zheng
关键词-EN: detecting answer clues, inference process slow, slow and expensive, compression rate, context compression
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures, code available at https://anonymous.4open.science/r/AdaComp-8C0C/

点击查看摘要

Abstract:Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
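
下面示意压缩率预测器的构造方式:把(查询、检索结果)的简单统计特征作为输入、标注出的最少 top-k 文档数作为标签来训练分类器。特征设计、示例数据与模型选择均为假设,仅说明训练流程:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def featurize(query_len, n_hops, top1_score, mean_score):
    # 假设的特征:查询长度、推理跳数估计、检索得分统计(代表查询复杂度与检索质量)
    return [query_len, n_hops, top1_score, mean_score]

X = np.array([
    featurize(8, 1, 0.92, 0.55),   # 简单事实型查询、检索质量高 -> 少量文档即可
    featurize(23, 3, 0.61, 0.40),  # 多跳复杂查询 -> 需要更多文档
    featurize(12, 1, 0.35, 0.20),  # 检索质量差 -> 也需要更多文档
    featurize(6, 1, 0.88, 0.60),
    featurize(25, 2, 0.58, 0.42),
    featurize(10, 1, 0.30, 0.18),
])
y = np.array([1, 4, 3, 1, 4, 3])   # 标签:回答所需的最少 top-k(即压缩率)

predictor = GradientBoostingClassifier().fit(X, y)
print("predicted top-k:", predictor.predict([featurize(20, 2, 0.5, 0.3)]))
```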

[AI-48] Improving Apple Object Detection with Occlusion-Enhanced Distillation

链接: https://arxiv.org/abs/2409.01573
作者: Liang Geng
关键词-EN: face severe visual, severe visual obstructions, Apples growing, environments often face, face severe
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Apples growing in natural environments often face severe visual obstructions from leaves and branches. This significantly increases the risk of false detections in object detection tasks, thereby escalating the challenge. Addressing this issue, we introduce a technique called “Occlusion-Enhanced Distillation” (OED). This approach utilizes occlusion information to regularize the learning of semantically aligned features on occluded datasets and employs Exponential Moving Average (EMA) to enhance training stability. Specifically, we first design an occlusion-enhanced dataset that integrates Grounding DINO and SAM methods to extract occluding elements such as leaves and branches from each sample, creating occlusion examples that reflect the natural growth state of fruits. Additionally, we propose a multi-scale knowledge distillation strategy, where the student network uses images with increased occlusions as inputs, while the teacher network employs images without natural occlusions. Through this setup, the strategy guides the student network to learn from the teacher across scales of semantic and local features alignment, effectively narrowing the feature distance between occluded and non-occluded targets and enhancing the robustness of object detection. Lastly, to improve the stability of the student network, we introduce the EMA strategy, which aids the student network in learning more generalized feature expressions that are less affected by the noise of individual image occlusions. Our method significantly outperforms current state-of-the-art techniques through extensive comparative experiments.
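
下面是文中提到的 EMA(指数滑动平均)更新的一个极简示意,用普通字典模拟参数张量;decay 取值为假设:

```python
def ema_update(ema_params, model_params, decay=0.999):
    """示意性的 EMA 更新:用当前模型参数平滑地更新 EMA 副本,
    以获得更稳定、受单张图像遮挡噪声影响更小的参数/特征。"""
    return {k: decay * ema_params[k] + (1 - decay) * model_params[k]
            for k in ema_params}

ema = {"w": 1.0, "b": 0.0}
for current in ({"w": 1.2, "b": 0.1}, {"w": 0.8, "b": -0.1}):
    ema = ema_update(ema, current)
print(ema)
```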

[AI-49] LSSF-Net: Lightweight Segmentation with Self-Awareness Spatial Attention and Focal Modulation

链接: https://arxiv.org/abs/2409.01572
作者: Hamza Farooq,Zuhair Zafar,Ahsan Saadat,Tariq M Khan,Shahzaib Iqbal,Imran Razzak
关键词-EN: dermoscopic images plays, skin lesion segmentation, Accurate segmentation, skin lesions, dermoscopic images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate segmentation of skin lesions within dermoscopic images plays a crucial role in the timely identification of skin cancer for computer-aided diagnosis on mobile platforms. However, varying shapes of the lesions, lack of defined edges, and the presence of obstructions such as hair strands and marker colors make this challenge more complex. Additionally, skin lesions often exhibit subtle variations in texture and color that are difficult to differentiate from surrounding healthy skin, necessitating models that can capture both fine-grained details and broader contextual information. Currently, melanoma segmentation models are commonly based on fully connected networks and U-Nets. However, these models often struggle with capturing the complex and varied characteristics of skin lesions, such as the presence of indistinct boundaries and diverse lesion appearances, which can lead to suboptimal segmentation. To address these challenges, we propose a novel lightweight network specifically designed for skin lesion segmentation utilizing mobile devices, featuring a minimal number of learnable parameters (only 0.8 million). This network comprises an encoder-decoder architecture that incorporates conformer-based focal modulation attention, self-aware local and global spatial attention, and split channel-shuffle. The efficacy of our model has been evaluated on four well-established benchmark datasets for skin lesion segmentation: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Empirical findings substantiate its state-of-the-art performance, notably reflected in a high Jaccard index.

[AI-50] Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models BMVC2024

链接: https://arxiv.org/abs/2409.01560
作者: Bin Fu,Qiyang Wan,Jialin Li,Ruiping Wang,Xilin Chen
关键词-EN: organizes objects based, Large Multimodal Models, common features, computer vision, core cognitive ability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 39 pages, 28 figures, 4 tables. Accepted at The 35th British Machine Vision Conference (BMVC 2024). Project page at this https URL

点击查看摘要

Abstract:Categorization, a core cognitive ability in humans that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy tasks on recognition from datasets to open world scenarios have been proposed. Recent development of Large Multimodal Models (LMMs) has demonstrated impressive results in high-level visual tasks, such as visual question answering, video temporal reasoning, etc., utilizing the advanced architectures and large-scale multimodal instruction tuning. Previous researchers have developed holistic benchmarks to measure the high-level visual capability of LMMs, but there is still a lack of pure and in-depth quantitative evaluation of the most fundamental categorization ability. According to the research on human cognitive process, categorization can be seen as including two parts: category learning and category use. Inspired by this, we propose a novel, challenging, and efficient benchmark based on composite blocks, called ComBo, which provides a disentangled evaluation framework and covers the entire categorization process from learning to use. By analyzing the results of multiple evaluation tasks, we find that although LMMs exhibit acceptable generalization ability in learning new categories, there are still gaps compared to humans in many ways, such as fine-grained perception of spatial relationship and abstract category understanding. Through the study of categorization, we can provide inspiration for the further development of LMMs in terms of interpretability and generalization.

[AI-51] Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture

链接: https://arxiv.org/abs/2409.01556
作者: Chen-Chi Chang,Ching-Yuan Chen,Hung-Shin Lee,Chih-Cheng Lee
关键词-EN: large language models, focus on Hakka, Hakka culture, comprehensive benchmark designed, Leveraging Bloom Taxonomy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom’s Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs’ abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models’ performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.

[AI-52] EA-RAS: Towards Efficient and Accurate End-to-End Reconstruction of Anatomical Skeleton

链接: https://arxiv.org/abs/2409.01555
作者: Zhiheng Peng,Kai Zhao,Xiaoran Chen,Li Ma,Siyu Xia,Changjie Fan,Weijian Shang,Wei Jing
关键词-EN: human skeletal information, human-computer interaction, low-cost estimation, estimation of human, human skeletal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages,15 figures

点击查看摘要

Abstract:Efficient, accurate and low-cost estimation of human skeletal information is crucial for a range of applications such as biology education and human-computer interaction. However, current simple skeleton models, which are typically based on 2D-3D joint points, fall short in terms of anatomical fidelity, restricting their utility in such fields. On the other hand, more complex models, while anatomically precise, are hindered by sophisticated multi-stage processing and the need for extra data like skin meshes, making them unsuitable for real-time applications. To this end, we propose EA-RAS (Towards Efficient and Accurate End-to-End Reconstruction of Anatomical Skeleton), a single-stage, lightweight, and plug-and-play anatomical skeleton estimator that can provide real-time, accurate anatomically realistic skeletons with arbitrary pose using only a single RGB image input. Additionally, EA-RAS estimates the conventional human-mesh model explicitly, which not only enhances the functionality but also leverages the outside skin information by integrating features into the inside skeleton modeling process. In this work, we also develop a progressive training strategy and integrate it with an enhanced optimization process, enabling the network to obtain initial weights using only a small skin dataset and achieve self-supervision in skeleton reconstruction. Besides, we also provide an optional lightweight post-processing optimization strategy to further improve accuracy for scenarios that prioritize precision over real-time processing. The experiments demonstrated that our regression method is over 800 times faster than existing methods, meeting real-time requirements. Additionally, the post-processing optimization strategy provided can enhance reconstruction accuracy by over 50% and achieve a speed increase of more than 7 times.

[AI-53] Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs

链接: https://arxiv.org/abs/2409.01552
作者: Zhuo Li,Yuhao Du,Jinpeng Hu,Xiang Wan,Anningzhe Gao
关键词-EN: Large language models, Large language, shown success, Large, LLMs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown success in generating high-quality responses. In order to better align LLMs with human preference, various works have been proposed based on specific optimization processes, which, however, are not suitable for Black-Box LLMs like GPT-4 due to inaccessible parameters. In the Black-Box LLMs case, their performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts, and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs’ in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, including Black-Box models such as GPT-4.

[AI-54] VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka

链接: https://arxiv.org/abs/2409.01548
作者: Li-Wei Chen,Hung-Shin Lee,Chen-Chi Chang
关键词-EN: spoken in Taiwan, designed for Taiwanese, paper introduces VoxHakka, Taiwanese Hakka, critically under-resourced language
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注: Submitted to O-COCOSDA 2024

点击查看摘要

Abstract:This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in terms of pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advancement in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.

[AI-55] Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation

链接: https://arxiv.org/abs/2409.01545
作者: Chien-Chun Wang,Li-Wei Chen,Hung-Shin Lee,Berlin Chen,Hsin-Min Wang
关键词-EN: Cross-domain speech enhancement, severe challenges due, Cross-domain speech, faced with severe, severe challenges
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
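
下面示意“动态随机扰动”的基本做法:推理时向噪声嵌入注入可控的高斯扰动。扰动幅度 scale 与归一化方式均为假设,仅用于说明思路:

```python
import numpy as np

def perturb_noise_embedding(noise_emb, scale=0.05, rng=None):
    """示意:推理阶段向噪声嵌入注入可控扰动,帮助模型泛化到未见噪声条件。
    扰动被归一化到与原嵌入范数成比例(比例为 scale,假设值)。"""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(noise_emb.shape)
    noise *= np.linalg.norm(noise_emb) / (np.linalg.norm(noise) + 1e-12)
    return noise_emb + scale * noise

emb = np.random.randn(256)
perturbed = perturb_noise_embedding(emb)
print("relative change:", np.linalg.norm(perturbed - emb) / np.linalg.norm(emb))
```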

[AI-56] Long-Range Biometric Identification in Real World Scenarios: A Comprehensive Evaluation Framework Based on Missions

链接: https://arxiv.org/abs/2409.01540
作者: Deniz Aykac,Joel Brogan,Nell Barber,Ryan Shivers,Bob Zhang,Dallas Sacca,Ryan Tipton,Gavin Jager,Austin Garret,Matthew Love,Jim Goddard,David Cornett III,David S. Bolme
关键词-EN: increasingly common problem, target performance mismatch, environments has contributed, increasingly common, target performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The considerable body of data available for evaluating biometric recognition systems in Research and Development (R&D) environments has contributed to the increasingly common problem of target performance mismatch. Biometric algorithms are frequently tested against data that may not reflect the real world applications they target. From a Testing and Evaluation (T&E) standpoint, this domain mismatch causes difficulty assessing when improvements in State-of-the-Art (SOTA) research actually translate to improved applied outcomes. This problem can be addressed with thoughtful preparation of data and experimental methods to reflect specific use-cases and scenarios. To that end, this paper evaluates research solutions for identifying individuals at ranges and altitudes, which could support various application areas such as counterterrorism, protection of critical infrastructure facilities, military force protection, and border security. We address challenges including image quality issues and reliance on face recognition as the sole biometric modality. By fusing face and body features, we propose developing robust biometric systems for effective long-range identification from both the ground and steep pitch angles. Preliminary results show promising progress in whole-body recognition. This paper presents these early findings and discusses potential future directions for advancing long-range biometric identification systems based on mission-driven metrics.

[AI-57] Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition

链接: https://arxiv.org/abs/2409.01534
作者: Yaozong Gan,Guang Li,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
关键词-EN: Fine-grained TSR, improve fine-grained traffic, TSR, effective fine-grained TSR, recognizing to improve
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:We propose a new strategy called think twice before recognizing to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to the complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions with center coordinate prompt optimization help the LMM to locate the target traffic sign in the original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.

[AI-58] Improving Robustness of Spectrogram Classifiers with Neural Stochastic Differential Equations

链接: https://arxiv.org/abs/2409.01532
作者: Joel Brogan,Olivera Kotevska,Anibely Torres,Sumit Jha,Mark Adams
关键词-EN: noise and perturbation, fraught with high, high levels, levels of noise, Signal analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Signal analysis and classification is fraught with high levels of noise and perturbation. Computer-vision-based deep learning models applied to spectrograms have proven useful in the field of signal classification and detection; however, these methods aren’t designed to handle the low signal-to-noise ratios inherent within non-vision signal processing tasks. While they are powerful, they are currently not the method of choice in the inherently noisy and dynamic critical infrastructure domain, such as smart-grid sensing, anomaly detection, and non-intrusive load monitoring.

[AI-59] On the Design Space Between Transformers and Recursive Neural Nets

链接: https://arxiv.org/abs/2409.01531
作者: Jishnu Ray Chowdhury,Cornelia Caragea
关键词-EN: Recursive Neural Networks, Continuous Recursive Neural, Neural Data Routers, Neural Networks, Recursive Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we study two classes of models, Recursive Neural Networks (RvNNs) and Transformers, and show that a tight connection between them emerges from the recent development of two models: Continuous Recursive Neural Networks (CRvNN) and Neural Data Routers (NDR). On one hand, CRvNN pushes the boundaries of traditional RvNN, relaxing its discrete structure-wise composition and ends up with a Transformer-like structure. On the other hand, NDR constrains the original Transformer to induce better structural inductive bias, ending up with a model that is close to CRvNN. Both models, CRvNN and NDR, show strong performance in algorithmic tasks and generalization in which simpler forms of RvNNs and Transformers fail. We explore these “bridge” models in the design space between RvNNs and Transformers, formalize their tight connections, discuss their limitations, and propose ideas for future research.

[AI-60] S3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners

链接: https://arxiv.org/abs/2409.01524
作者: Yuchen Yan,Jin Jiang,Yang Liu,Yixin Cao,Xin Xu,Mengdi zhang,Xunliang Cai,Jian Shao
关键词-EN: large language models, potential reasoning abilities, Spontaneous Step-level Self-correction, language models, stimulate the potential
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S^3c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We propose a method that employs a step-level sampling approach to construct step-wise self-correction data for achieving such ability. Additionally, we implement a training strategy that uses the above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.

[AI-61] From Data to Insights: A Covariate Analysis of the IARPA BRIAR Dataset for Multimodal Biometric Recognition Algorithms at Altitude and Range

链接: https://arxiv.org/abs/2409.01514
作者: David S. Bolme,Deniz Aykac,Ryan Shivers,Joel Brogan,Nell Barber,Bob Zhang,Laura Davies,David Cornett III
关键词-EN: IARPA BRIAR dataset, IARPA BRIAR, paper examines covariate, examines covariate effects, BRIAR dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines covariate effects on fused whole body biometrics performance in the IARPA BRIAR dataset, specifically focusing on UAV platforms, elevated positions, and distances up to 1000 meters. The dataset includes outdoor videos compared with indoor images and controlled gait recordings. Normalized raw fusion scores relate directly to predicted false accept rates (FAR), offering an intuitive means for interpreting model results. A linear model is developed to predict biometric algorithm scores, analyzing their performance to identify the most influential covariates on accuracy at altitude and range. Weather factors like temperature, wind speed, solar loading, and turbulence are also investigated in this analysis. The study found that resolution and camera distance best predicted accuracy; these findings can guide future research and development efforts in long-range/elevated/UAV biometrics and support the creation of more reliable and robust systems for national security and other critical domains.
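
下面用合成数据示意文中“用线性模型由协变量预测生物特征得分并考察各协变量影响”的做法;协变量名称与数据均为占位假设,真实分析基于 IARPA BRIAR 数据集:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
# 占位协变量:距离、分辨率、温度、风速(列名为假设)
df = pd.DataFrame({
    "distance_m": rng.uniform(100, 1000, n),
    "resolution_px": rng.uniform(20, 120, n),
    "temperature_c": rng.uniform(0, 40, n),
    "wind_speed_ms": rng.uniform(0, 15, n),
})
# 合成得分:假设主要由分辨率与距离决定,并叠加噪声
score = 0.02 * df["resolution_px"] - 0.002 * df["distance_m"] + rng.normal(0, 0.3, n)

model = LinearRegression().fit(df, score)
for name, coef in zip(df.columns, model.coef_):
    print(f"{name:>15s}: {coef:+.4f}")   # 系数大小反映各协变量的影响
```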

[AI-62] AMG: Avatar Motion Guided Video Generation

链接: https://arxiv.org/abs/2409.01502
作者: Zhangsihao Yang,Mengyi Shan,Mohammad Farazi,Wenhui Zhu,Yanxi Chen,Xuanzhao Dong,Yalin Wang
关键词-EN: gained significant attention, deep generative models, task has gained, gained significant, significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: The project page is at this https URL

点击查看摘要

Abstract:Human video generation task has gained significant attention with the advancement of deep generative models. Generating realistic videos with human movements is challenging in nature, due to the intricacies of human body topology and sensitivity to visual artifacts. The extensively studied 2D media generation methods take advantage of massive human media datasets, but struggle with 3D-aware control; whereas 3D avatar-based approaches, while offering more freedom in control, lack photorealism and cannot be harmonized seamlessly with background scene. We propose AMG, a method that combines the 2D photorealism and 3D controllability by conditioning video diffusion models on controlled rendering of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.

[AI-63] EarthGen: Generating the World from Top-Down Views

链接: https://arxiv.org/abs/2409.01491
作者: Ansh Sharma,Albert Xiao,Praneet Rathi,Rohit Kundu,Albert Zhai,Yuan Shen,Shenlong Wang
关键词-EN: generative terrain modeling, extensive multi-scale generative, multi-scale generative terrain, terrain modeling, extensive multi-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of superresolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications including controllable world generation and 3D scene generation.

[AI-64] PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science

链接: https://arxiv.org/abs/2409.01466
作者: Menglin Liu,Ge Shi
关键词-EN: large language models, extensive feature engineering, require extensive feature, Recent advancements, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
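
下面示意文中的共识机制:两个较弱模型标签一致则直接采用,不一致时才交给更强的模型仲裁,以控制成本;三个模型调用均为占位函数(假设),真实系统中应替换为相应的 LLM 调用:

```python
from collections import Counter

def consensus_classify(text, weak_llm_a, weak_llm_b, strong_llm):
    """示意性的共识机制:弱模型一致则直接采用其标签,
    不一致时由更强的模型在候选标签中仲裁。"""
    a, b = weak_llm_a(text), weak_llm_b(text)
    if a == b:
        return a
    return strong_llm(text, candidates=[a, b])

# 用法示例:用简单的关键词规则充当占位"模型"
weak1 = lambda t: "politics" if "election" in t else "other"
weak2 = lambda t: "politics" if "vote" in t else "other"
strong = lambda t, candidates: Counter(candidates).most_common(1)[0][0]
print(consensus_classify("the election campaign ads", weak1, weak2, strong))
```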

[AI-65] Kvasir-VQA: A Text-Image Pair GI Tract Dataset ACM-MM

链接: https://arxiv.org/abs/2409.01437
作者: Sushant Gautam,Andrea Storås,Cise Midoglu,Steven A. Hicks,Vajira Thambawita,Pål Halvorsen,Michael A. Riegler
关键词-EN: facilitate advanced machine, advanced machine learning, extended dataset derived, machine learning tasks, Visual Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: to be published in VLM4Bio 2024, part of the ACM Multimedia (ACM MM) conference 2024

点击查看摘要

Abstract:We introduce Kvasir-VQA, an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations to facilitate advanced machine learning tasks in Gastrointestinal (GI) diagnostics. This dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments, and it supports multiple question types including yes/no, choice, location, and numerical count. The dataset is intended for applications such as image captioning, Visual Question Answering (VQA), text-based generation of synthetic medical images, object detection, and classification. Our experiments demonstrate the dataset’s effectiveness in training models for three selected tasks, showcasing significant applications in medical image analysis and diagnostics. We also present evaluation metrics for each task, highlighting the usability and versatility of our dataset. The dataset and supporting artifacts are available at this https URL.

[AI-66] Performance-Aware Self-Configurable Multi-Agent Networks: A Distributed Submodular Approach for Simultaneous Coordination and Network Design

链接: https://arxiv.org/abs/2409.01411
作者: Zirui Xu,Vasileios Tzoumas
关键词-EN: multi-agent planning, enables multi-agent networks, rigorous approach, topology to balance, balance the trade-off
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: Accepted to CDC 2024

点击查看摘要

Abstract:We introduce the first, to our knowledge, rigorous approach that enables multi-agent networks to self-configure their communication topology to balance the trade-off between scalability and optimality during multi-agent planning. We are motivated by the future of ubiquitous collaborative autonomy where numerous distributed agents will be coordinating via agent-to-agent communication to execute complex tasks such as traffic monitoring, event detection, and environmental exploration. But the explosion of information in such large-scale networks currently curtails their deployment due to impractical decision times induced by the computational and communication requirements of the existing near-optimal coordination algorithms. To overcome this challenge, we present the AlterNAting COordination and Network-Design Algorithm (Anaconda), a scalable algorithm that also enjoys near-optimality guarantees. Subject to the agents’ bandwidth constraints, Anaconda enables the agents to optimize their local communication neighborhoods such that the action-coordination approximation performance of the network is maximized. Compared to the state of the art, Anaconda is an anytime self-configurable algorithm that quantifies its suboptimality guarantee for any type of network, from fully disconnected to fully centralized, and that, for sparse networks, is one order faster in terms of decision speed. To develop the algorithm, we quantify the suboptimality cost due to decentralization, i.e., due to communication-minimal distributed coordination. We also employ tools inspired by the literature on multi-armed bandits and submodular maximization subject to cardinality constraints. We demonstrate Anaconda in simulated scenarios of area monitoring and compare it with a state-of-the-art algorithm.

[AI-67] GenAgent : Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI

链接: https://arxiv.org/abs/2409.01392
作者: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
关键词-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.

[AI-68] VLSI Hypergraph Partitioning with Deep Learning

链接: https://arxiv.org/abs/2409.01387
作者: Muhammad Hadir Khan,Bugra Onal,Eren Dogan,Matthew R. Guthaus
关键词-EN: chip design workflows, significantly influence design, influence design quality, Graph Neural Networks, design workflows
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Partitioning is a known problem in computer science and is critical in chip design workflows, as advancements in this area can significantly influence design quality and efficiency. Deep Learning (DL) techniques, particularly those involving Graph Neural Networks (GNNs), have demonstrated strong performance in various node, edge, and graph prediction tasks using both inductive and transductive learning methods. A notable area of recent interest within GNNs are pooling layers and their application to graph partitioning. While these methods have yielded promising results across social, computational, and other random graphs, their effectiveness has not yet been explored in the context of VLSI hypergraph netlists. In this study, we introduce a new set of synthetic partitioning benchmarks that emulate real-world netlist characteristics and possess a known upper bound for solution cut quality. We distinguish these benchmarks with the prior work and evaluate existing state-of-the-art partitioning algorithms alongside GNN-based approaches, highlighting their respective advantages and disadvantages.

[AI-69] Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

链接: https://arxiv.org/abs/2409.01382
作者: Musfiqur Rahman,SayedHassan Khatoonabadi,Ahmad Abdellatif,Emad Shihab
关键词-EN: Large Language Models, Large Language, Claude, generating source code, Language Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Submitted to a journal for potential publication

点击查看摘要

Abstract:Using Large Language Models (LLMs) has gained popularity among software developers for generating source code. However, the use of LLM-generated code can introduce risks of adding suboptimal, defective, and vulnerable code. This makes it necessary to devise methods for the accurate detection of LLM-generated code. Toward this goal, we perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset. We divide our analyses into two parts: function-level and class-level. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We then analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is. In the following step, we use the unique characteristics of Claude 3-generated code to build Machine Learning (ML) models and identify which features of the code snippets make them more detectable by ML models. Our results indicate that Claude 3 tends to generate longer functions, but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models with 82% and 66% accuracies for function-level and class-level snippets, respectively.
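
As a rough illustration of this detection setup, the sketch below trains a generic classifier on software-metric features with binary labels (LLM-generated vs. human-authored). The feature set, model family, and evaluation protocol here are assumptions for illustration, not the paper's exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_detector(features, labels):
    """features: (n_snippets, n_metrics) array of software metrics
    (e.g. code lines, cyclomatic complexity); labels: 1 = LLM-generated."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, features, labels, cv=5, scoring="accuracy").mean()
```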

[AI-70] H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark

链接: https://arxiv.org/abs/2409.01374
作者: Solim LeGris,Wai Keen Vong,Brenden M. Lake,Todd M. Gureckis
关键词-EN: Reasoning Corpus, Abstraction and Reasoning, visual program synthesis, program synthesis benchmark, synthesis benchmark designed
类目: Artificial Intelligence (cs.AI)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus (ARC) is a visual program synthesis benchmark designed to test challenging out-of-distribution generalization in humans and machines. Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods. Comparing human and machine performance is important for the validity of the benchmark. While previous work explored how well humans can solve tasks from the ARC benchmark, they either did so using only a subset of tasks from the original dataset, or from variants of ARC, and therefore only provided a tentative estimate of human performance. In this work, we obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks from the original ARC problem set. We estimate that average human performance lies between 73.3% and 77.2% correct with a reported empirical average of 76.2% on the training set, and between 55.9% and 68.9% correct with a reported empirical average of 64.2% on the public evaluation set. However, we also find that 790 out of the 800 tasks were solvable by at least one person in three attempts, suggesting that the vast majority of the publicly available ARC tasks are in principle solvable by typical crowd-workers recruited over the internet. Notably, while these numbers are slightly lower than earlier estimates, human performance still greatly exceeds current state-of-the-art approaches for solving ARC. To facilitate research on ARC, we publicly release our dataset, called H-ARC (human-ARC), which includes all of the submissions and action traces from human participants.

[AI-71] Imitating Language via Scalable Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2409.01369
作者: Markus Wulfmeier,Michael Bloesch,Nino Vieillard,Arun Ahuja,Jorg Bornschein,Sandy Huang,Artem Sokolov,Matt Barnes,Guillaume Desjardins,Alex Bewley,Sarah Maria Elisabeth Bechtle,Jost Tobias Springenberg,Nikola Momchev,Olivier Bachem,Matthieu Geist,Martin Riedmiller
关键词-EN: model training builds, training builds, language model training, model training, learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.

[AI-72] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification

链接: https://arxiv.org/abs/2409.01366
作者: Junhui He,Shangyu Wu,Weidong Wen,Chun Jason Xue,Qingan Li
关键词-EN: Deploying large language, edge devices presents, devices presents significant, substantial computational overhead, Deploying large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically employ thresholding-based sparsification based on the statistics of activation tensors. However, these methods do not explicitly model the impact of activation sparsification on performance, leading to suboptimal performance degradation. To address this issue, this paper reformulates the activation sparsification problem by introducing a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Then, selective sparsification involves applying thresholding-based activation sparsification to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels to accelerate LLM inference. Experimental results demonstrate that the proposed CHESS achieves lower performance degradation over 8 downstream tasks while activating fewer parameters compared to existing methods, thus speeding up the LLM inference by up to 1.27x.
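
To make the channel-wise thresholding idea concrete, here is a minimal sketch in which each FFN activation channel keeps only the entries whose magnitude exceeds its own threshold. How CHESS actually chooses the thresholds and applies selective sparsification in the attention modules is not reproduced.

```python
import torch

def channel_wise_threshold(activations: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """activations: (tokens, channels); thresholds: (channels,) per-channel cutoffs."""
    mask = activations.abs() > thresholds   # broadcast one threshold per channel
    return activations * mask               # zero out sub-threshold activations
```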

[AI-73] Correlating Time Series with Interpretable Convolutional Kernels

链接: https://arxiv.org/abs/2409.01362
作者: Xinyu Chen,HanQin Cai,Fuqiang Liu,Jinhua Zhao
关键词-EN: supporting downstream machine, convolutional kernel learning, time series, time series data, downstream machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:This study addresses the problem of convolutional kernel learning in univariate, multivariate, and multidimensional time series data, which is crucial for interpreting temporal patterns in time series and supporting downstream machine learning tasks. First, we propose formulating convolutional kernel learning for univariate time series as a sparse regression problem with a non-negative constraint, leveraging the properties of circular convolution and circulant matrices. Second, to generalize this approach to multivariate and multidimensional time series data, we use tensor computations, reformulating the convolutional kernel learning problem in the form of tensors. This is further converted into a standard sparse regression problem through vectorization and tensor unfolding operations. In the proposed methodology, the optimization problem is addressed using the existing non-negative subspace pursuit method, enabling the convolutional kernel to capture temporal correlations and patterns. To evaluate the proposed model, we apply it to several real-world time series datasets. On the multidimensional rideshare and taxi trip data from New York City and Chicago, the convolutional kernels reveal interpretable local correlations and cyclical patterns, such as weekly seasonality. In the context of multidimensional fluid flow data, both local and nonlocal correlations captured by the convolutional kernels can reinforce tensor factorization, leading to performance improvements in fluid flow reconstruction tasks. Thus, this study lays an insightful foundation for automatically learning convolutional kernels from time series data, with an emphasis on interpretability through sparsity and non-negativity constraints.
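
The circular-convolution building block referenced above can be evaluated efficiently with the FFT; the snippet below shows this standard identity (it is not the paper's full sparse-regression solver).

```python
import numpy as np

def circular_convolve(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular convolution of x with kernel via the FFT; kernel is zero-padded to len(x)."""
    k = np.zeros(len(x))
    k[:len(kernel)] = kernel
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))
```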

[AI-74] Language Models Benefit from Preparation with Elicited Knowledge

链接: https://arxiv.org/abs/2409.01345
作者: Jiacan Yu,Hannah An,Lenhart K. Schubert
关键词-EN: require multiple reasoning, multiple reasoning steps, reasoning steps, chain of thought, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps, typically enhanced by the prompt “Let’s think step by step.” However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple general prompting technique, called PREP, that involves using two instances of LMs: the first (LM1) generates relevant information, and the second (LM2) answers the question based on this information. PREP is designed to be general and independent of the user’s domain knowledge, making it applicable across various QA tasks without the need for specialized prompt engineering. To evaluate the effectiveness of our prompting method, we create a dataset of 100 binary-choice questions, derived from an extensive schematic dataset on artifact parts and material composition. These questions ask which of two artifacts is less likely to share materials with another artifact. Such questions probe the LM’s knowledge of shared materials in the part structure of different artifacts. We test our method on our dataset and three published commonsense reasoning datasets. The average accuracy of our method is consistently higher than that of all the other tested methods across all the tested datasets.
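
The two-instance setup is easy to sketch. In the toy code below, `generate_knowledge` and `generate_answer` stand for calls to the two LM instances (LM1 and LM2); the prompts are illustrative assumptions, not the paper's exact wording.

```python
def prep_answer(question, generate_knowledge, generate_answer):
    # Stage 1 (LM1): elicit background knowledge relevant to the question.
    knowledge = generate_knowledge(
        f"List facts that are relevant to answering this question.\nQuestion: {question}\nFacts:"
    )
    # Stage 2 (LM2): answer the question conditioned on the elicited knowledge.
    return generate_answer(
        f"Facts: {knowledge}\nUsing the facts above, answer the question.\nQuestion: {question}\nAnswer:"
    )
```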

[AI-75] Pairing Analogy-Augmented Generation with Procedural Memory for Procedural QA

链接: https://arxiv.org/abs/2409.01344
作者: K Roth,Rushil Gupta,Simon Halle,Bang Liu
关键词-EN: procedural question answering, shown remarkable performance, question answering, complex tasks, paradigm have shown
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:While LLMs in the RAG paradigm have shown remarkable performance on a variety of tasks, they still under-perform on unseen domains, especially on complex tasks like procedural question answering. In this work, we introduce a novel formalism and structure for manipulating text-based procedures. Based on this formalism, we further present a novel dataset called LCStep, scraped from the LangChain Python docs. Moreover, we extend the traditional RAG system to propose a novel system called analogy-augmented generation (AAG), that draws inspiration from human analogical reasoning and ability to assimilate past experiences to solve unseen problems. The proposed method uses a frozen language model with a custom procedure memory store to adapt to specialized knowledge. We demonstrate that AAG outperforms few-shot and RAG baselines on LCStep, RecipeNLG, and CHAMP datasets under a pairwise LLM-based evaluation, corroborated by human evaluation in the case of RecipeNLG.

[AI-76] Pediatric brain tumor classification using digital histopathology and deep learning: evaluation of SOTA methods on a multi-center Swedish cohort

链接: https://arxiv.org/abs/2409.01330
作者: Iulian Emil Tampu,Per Nyman,Christoforos Spyretos,Ida Blystad,Alia Shamikh,Gabriela Prochazka,Teresita Díaz de Ståhl,Johanna Sandgren,Peter Lundberg,Neda Haj-Hosseini
关键词-EN: pediatric brain tumors, common solid tumors, Brain tumors, large histopathology datasets, pediatric brain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain tumors are the most common solid tumors in children and young adults, but the scarcity of large histopathology datasets has limited the application of computational pathology in this group. This study implements two weakly supervised multiple-instance learning (MIL) approaches on patch-features obtained from state-of-the-art histology-specific foundation models to classify pediatric brain tumors in hematoxylin and eosin whole slide images (WSIs) from a multi-center Swedish cohort. WSIs from 540 subjects (age 8.5 \pm 4.9 years) diagnosed with brain tumor were gathered from the six Swedish university hospitals. Instance (patch)-level features were obtained from WSIs using three pre-trained feature extractors: ResNet50, UNI and CONCH. Instances were aggregated using attention-based MIL (ABMIL) or clustering-constrained attention MIL (CLAM) for patient-level classification. Models were evaluated on three classification tasks based on the hierarchical classification of pediatric brain tumors: tumor category, family and type. Model generalization was assessed by training on data from two of the centers and testing on data from four other centers. Model interpretability was evaluated through attention-mapping. The highest classification performance was achieved using UNI features and ABMIL aggregation, with Matthew’s correlation coefficient of 0.86 \pm 0.04, 0.63 \pm 0.04, and 0.53 \pm 0.05, for tumor category, family and type classification, respectively. When evaluating generalization, models utilizing UNI and CONCH features outperformed those using ResNet50. However, the drop in performance from the in-site to out-of-site testing was similar across feature extractors. These results show the potential of state-of-the-art computational pathology methods in diagnosing pediatric brain tumors at different hierarchical levels with fair generalizability on a multi-center national dataset.
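
For readers unfamiliar with attention-based MIL, the sketch below shows the standard (non-gated) ABMIL pooling that aggregates patch features into a slide-level representation; the feature and hidden dimensions are illustrative, and the CLAM variant used in the paper adds clustering constraints not shown here.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Standard (non-gated) attention-based MIL pooling over patch features."""
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (n_patches, feat_dim) -> slide embedding: (feat_dim,)
        weights = torch.softmax(self.attention(patch_feats), dim=0)
        return (weights * patch_feats).sum(dim=0)
```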

[AI-77] Grounding Language Models in Autonomous Loco-manipulation Tasks ICRA

链接: https://arxiv.org/abs/2409.01326
作者: Jin Wang,Nikos Tsagarakis
关键词-EN: Humanoid robots, embodied intelligence, consistently been regarded, regarded as ideal, ideal collaborators
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICRA@40. arXiv admin note: substantial text overlap with arXiv:2406.14655

点击查看摘要

Abstract:Humanoid robots with behavioral autonomy have consistently been regarded as ideal collaborators in our daily lives and promising representations of embodied intelligence. Compared to fixed-based robotic arms, humanoid robots offer a larger operational space while significantly increasing the difficulty of control and planning. Despite the rapid progress towards general-purpose humanoid robots, most studies remain focused on locomotion ability with few investigations into whole-body coordination and tasks planning, thus limiting the potential to demonstrate long-horizon tasks involving both mobility and manipulation under open-ended verbal instructions. In this work, we propose a novel framework that learns, selects, and plans behaviors based on tasks in different scenarios. We combine reinforcement learning (RL) with whole-body optimization to generate robot motions and store them into a motion library. We further leverage the planning and reasoning features of the large language model (LLM), constructing a hierarchical task graph that comprises a series of motion primitives to bridge lower-level execution with higher-level planning. Experiments in simulation and real-world using the CENTAURO robot show that the language model based planner can efficiently adapt to new loco-manipulation tasks, demonstrating high autonomy from free-text commands in unstructured scenes.

[AI-78] Topological degree as a discrete diagnostic for disentanglement with applications to the DeltaVAE

链接: https://arxiv.org/abs/2409.01303
作者: Mahefa Ratsisetraina Ravelonanosy,Vlado Menkovski,Jacobus W. Portegies
关键词-EN: Diffusion Variational Autoencoder, Variational Autoencoder, Diffusion Variational, disentangle latent factors, ability of Diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:We investigate the ability of Diffusion Variational Autoencoder (\Delta VAE) with unit sphere \mathcal{S}^2 as latent space to capture topological and geometrical structure and disentangle latent factors in datasets. For this, we introduce a new diagnostic of disentanglement: namely the topological degree of the encoder, which is a map from the data manifold to the latent space. By using tools from homology theory, we derive and implement an algorithm that computes this degree. We use the algorithm to compute the degree of the encoder of models that result from the training procedure. Our experimental results show that the \Delta VAE achieves relatively small LSBD scores, and that regardless of the degree after initialization, the degree of the encoder after training becomes -1 or +1, which implies that the resulting encoder is at least homotopic to a homeomorphism.

[AI-79] Path-Consistency: Prefix Enhancement for Efficient Inference in LLM

链接: https://arxiv.org/abs/2409.01281
作者: Jiace Zhu,Yingtao Shen,Jie Zhao,An Zou
关键词-EN: large language models, gained significant popularity, combining multiple sampling, language models, majority voting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To enhance the reasoning capabilities of large language models (LLMs), self-consistency has gained significant popularity by combining multiple sampling with majority voting. However, the state-of-the-art self-consistency approaches consume substantial computational resources and lead to significant additional time costs due to the multiple sampling. This prevents its full potential from being realized in scenarios where computational resources are critical. To improve the inference efficiency, this paper introduces \textit{path-consistency}, a method that leverages the confidence of answers generated in earlier branches to identify the prefix of the most promising path. By dynamically guiding the generation of subsequent branches based on this prefix, the \textit{path-consistency} mitigates both the errors and redundancies from random or less useful sampling in self-consistency. As a result, it can significantly accelerate the inference process by reducing the number of tokens generated. Our extensive empirical evaluation shows that the \textit{path-consistency} achieves significant acceleration in inference latency ranging from 7.8% to 40.5%, while maintaining or even improving task accuracy across different datasets, including mathematical reasoning, common sense reasoning, symbolic reasoning, and code generation.

[AI-80] Real-time Accident Anticipation for Autonomous Driving Through Monocular Depth-Enhanced 3D Modeling

链接: https://arxiv.org/abs/2409.01256
作者: Haicheng Liao,Yongkang Li,Chengyue Wang,Songning Lai,Zhenning Li,Zilin Bian,Jaeyoung Lee,Zhiyong Cui,Guohui Zhang,Chengzhong Xu
关键词-EN: autonomous driving technologies, foresee potential accidents, traffic accident datasets, Dashcam Accident Dataset, traffic accident anticipation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The primary goal of traffic accident anticipation is to foresee potential accidents in real time using dashcam videos, a task that is pivotal for enhancing the safety and reliability of autonomous driving technologies. In this study, we introduce an innovative framework, AccNet, which significantly advances the prediction capabilities beyond the current state-of-the-art (SOTA) 2D-based methods by incorporating monocular depth cues for sophisticated 3D scene modeling. Addressing the prevalent challenge of skewed data distribution in traffic accident datasets, we propose the Binary Adaptive Loss for Early Anticipation (BA-LEA). This novel loss function, together with a multi-task learning strategy, shifts the focus of the predictive model towards the critical moments preceding an accident. We rigorously evaluate the performance of our framework on four benchmark datasets–Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), AnAn Accident Detection (A3D), and DADA-2000–demonstrating its superior predictive accuracy through key metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA).

[AI-81] Conversational Complexity for Assessing Risk in Large Language Models

链接: https://arxiv.org/abs/2409.01247
作者: John Burden,Manuel Cebrian,Jose Hernandez-Orallo
关键词-EN: Large Language Models, Language Models, enable beneficial applications, Large Language, present a dual-use
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) present a dual-use dilemma: they enable beneficial applications while harboring potential for harm, particularly through conversational interactions. Despite various safeguards, advanced LLMs remain vulnerable. A watershed case was Kevin Roose’s notable conversation with Bing, which elicited harmful outputs after extended interaction. This contrasts with simpler early jailbreaks that produced similar content more easily, raising the question: How much conversational effort is needed to elicit harmful information from LLMs? We propose two measures: Conversational Length (CL), which quantifies the conversation length used to obtain a specific response, and Conversational Complexity (CC), defined as the Kolmogorov complexity of the user’s instruction sequence leading to the response. To address the incomputability of Kolmogorov complexity, we approximate CC using a reference LLM to estimate the compressibility of user instructions. Applying this approach to a large red-teaming dataset, we perform a quantitative analysis examining the statistical distribution of harmful and harmless conversational lengths and complexities. Our empirical findings suggest that this distributional analysis and the minimisation of CC serve as valuable tools for understanding AI safety, offering insights into the accessibility of harmful information. This work establishes a foundation for a new perspective on LLM safety, centered around the algorithmic complexity of pathways to harm.
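
A toy version of the two measures might look like the following; note that the paper approximates Kolmogorov complexity with a reference LLM, whereas this sketch uses zlib compression purely as a crude stand-in.

```python
import zlib

def conversational_length(user_turns):
    # CL: number of user turns needed to obtain the target response.
    return len(user_turns)

def conversational_complexity_proxy(user_turns):
    # CC proxy: compressed size of the concatenated user instructions.
    instructions = "\n".join(user_turns).encode("utf-8")
    return len(zlib.compress(instructions))
```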

[AI-82] Revisiting Safe Exploration in Safe Reinforcement learning

链接: https://arxiv.org/abs/2409.01245
作者: David Eckel,Baohe Zhang,Joschka Bödecker
关键词-EN: extends standard reinforcement, standard reinforcement learning, Safe reinforcement learning, reinforcement learning, extends standard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined through the constraint of the expected cost return of a trajectory being below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC in both on- and off-policy algorithms for benchmarking their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.
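
A trajectory-level version of this metric is straightforward to compute; the sketch below counts the longest run of consecutive costly steps and averages it over sampled trajectories. The exact definition used in the paper may differ in details such as the cost threshold.

```python
def max_consecutive_cost_steps(costs, threshold=0.0):
    longest = current = 0
    for c in costs:
        current = current + 1 if c > threshold else 0   # extend or reset the run
        longest = max(longest, current)
    return longest

def emcc(trajectories, threshold=0.0):
    # Empirical expectation over trajectories (each a sequence of per-step costs).
    return sum(max_consecutive_cost_steps(t, threshold) for t in trajectories) / len(trajectories)
```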

[AI-83] CyberCortex.AI: An AI-based Operating System for Autonomous Robotics and Complex Automation

链接: https://arxiv.org/abs/2409.01241
作者: Sorin Grigorescu,Mihai Zaha
关键词-EN: complex automation applications, http URL, complex automation, remote cloud computers, Operating Systems
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
*备注:

点击查看摘要

Abstract:The underlying frameworks for controlling autonomous robots and complex automation applications are Operating Systems (OS) capable of scheduling perception-and-control tasks, as well as providing real-time data communication to other robotic peers and remote cloud computers. In this paper, we introduce this http URL, a robotics OS designed to enable heterogeneous AI-based robotics and complex automation applications. this http URL is a decentralized distributed OS which enables robots to talk to each other, as well as to High Performance Computers (HPC) in the cloud. Sensory and control data from the robots is streamed towards HPC systems with the purpose of training AI algorithms, which are afterwards deployed on the robots. Each functionality of a robot (e.g. sensory data acquisition, path planning, motion control, etc.) is executed within a so-called DataBlock of Filters shared through the internet, where each filter is computed either locally on the robot itself, or remotely on a different robotic system. The data is stored and accessed via a so-called \textit{Temporal Addressable Memory} (TAM), which acts as a gateway between each filter’s input and output. this http URL has two main components: i) the CyberCortex.AI.inference system, which is a real-time implementation of the DataBlock running on the robots’ embedded hardware, and ii) the CyberCortex.AI.dojo, which runs on an HPC computer in the cloud, and it is used to design, train and deploy AI algorithms. We present a quantitative and qualitative performance analysis of the proposed approach using two collaborative robotics applications: \textit{i)} a forest fire prevention system based on a Unitree A1 legged robot and an Anafi Parrot 4K drone, as well as \textit{ii)} an autonomous driving system which uses this http URL for collaborative perception and motion control.

[AI-84] ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

链接: https://arxiv.org/abs/2409.01216
作者: Luoyu Mei,Shuai Wang,Yun Cheng,Ruofeng Liu,Zhimeng Yin,Wenchao Jiang,Shuai Wang,Wei Gong
关键词-EN: Semantic recognition, point cloud, virtual reality, enabling immersive, interactive experiences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Semantic recognition is pivotal in virtual reality (VR) applications, enabling immersive and interactive experiences. A promising approach is utilizing millimeter-wave (mmWave) signals to generate point clouds. However, the high computational and memory demands of current mmWave point cloud models hinder their efficiency and reliability. To address this limitation, our paper introduces ESP-PCT, a novel Enhanced Semantic Performance Point Cloud Transformer with a two-stage semantic recognition framework tailored for VR applications. ESP-PCT takes advantage of the accuracy of sensory point cloud data and optimizes the semantic recognition process, where the localization and focus stages are trained jointly in an end-to-end manner. We evaluate ESP-PCT on various VR semantic recognition conditions, demonstrating substantial enhancements in recognition efficiency. Notably, ESP-PCT achieves a remarkable accuracy of 93.2% while reducing the computational requirements (FLOPs) by 76.9% and memory usage by 78.2% compared to the existing Point Transformer model simultaneously. These results underscore ESP-PCT’s potential in VR semantic recognition by achieving high accuracy and reducing redundancy. The code and data of this project are available at this https URL.

[AI-85] Integrating End-to-End and Modular Driving Approaches for Online Corner Case Detection in Autonomous Driving

链接: https://arxiv.org/abs/2409.01178
作者: Gemb Kaljavesi,Xiyan Su,Frank Diermeyer
关键词-EN: corner case detection, Online corner case, corner case, case detection, Online corner
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: IEEE SMC 2024

点击查看摘要

Abstract:Online corner case detection is crucial for ensuring safety in autonomous driving vehicles. Current autonomous driving approaches can be categorized into modular approaches and end-to-end approaches. To leverage the advantages of both, we propose a method for online corner case detection that integrates an end-to-end approach into a modular system. The modular system takes over the primary driving task and the end-to-end network runs in parallel as a secondary one; the disagreement between the systems is then used for corner case detection. We implement this method on a real vehicle and evaluate it qualitatively. Our results demonstrate that end-to-end networks, known for their superior situational awareness, as secondary driving systems, can effectively contribute to corner case detection. These findings suggest that such an approach holds potential for enhancing the safety of autonomous vehicles.
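
Conceptually, the corner-case signal reduces to a disagreement measure between the two systems' outputs. The toy check below compares planned trajectories; the actual disagreement measure and alert threshold used in the paper are not specified in the abstract.

```python
import numpy as np

def corner_case_flag(traj_modular: np.ndarray, traj_e2e: np.ndarray, threshold: float = 1.0) -> bool:
    """Both trajectories: (timesteps, 2) arrays of planned x/y positions."""
    deviation = np.linalg.norm(traj_modular - traj_e2e, axis=1).max()
    return bool(deviation > threshold)   # large disagreement -> potential corner case
```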

[AI-86] Logit Scaling for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.01175
作者: Andrija Djurisic,Rosanne Liu,Mladen Nikolic
关键词-EN: open-world settings hinges, settings hinges critically, OOD detection, ability to detect, OOD
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The safe deployment of machine learning and AI models in open-world settings hinges critically on the ability to detect out-of-distribution (OOD) data accurately, data samples that contrast vastly from what the model was trained with. Current approaches to OOD detection often require further training the model, and/or statistics about the training data which may no longer be accessible. Additionally, many existing OOD detection methods struggle to maintain performance when transferred across different architectures. Our research tackles these issues by proposing a simple, post-hoc method that does not require access to the training data distribution, keeps a trained network intact, and holds strong performance across a variety of architectures. Our method, Logit Scaling (LTS), as the name suggests, simply scales the logits in a manner that effectively distinguishes between in-distribution (ID) and OOD samples. We tested our method on benchmarks across various scales, including CIFAR-10, CIFAR-100, ImageNet and OpenOOD. The experiments cover 3 ID and 14 OOD datasets, as well as 9 model architectures. Overall, we demonstrate state-of-the-art performance, robustness and adaptability across different architectures, paving the way towards a universally applicable solution for advanced OOD detection.
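
The abstract does not spell out the exact scaling rule, so the snippet below uses the familiar energy score only as a placeholder for a post-hoc, logit-based OOD detector of the same flavor; it is not the paper's LTS method.

```python
import torch

def energy_ood_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Higher score suggests in-distribution; pick a threshold on held-out ID data.
    return temperature * torch.logsumexp(logits / temperature, dim=-1)
```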

[AI-87] FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish Tracking

链接: https://arxiv.org/abs/2409.01148
作者: Mingyuan Yao,Yukang Huo,Qingbin Tian,Jiayin Zhao,Xiao Liu,Ruifeng Wang,Haihua Wang
关键词-EN: abnormal behavior, monitoring fish tracking, early detected, detected by monitoring, method of image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages,14 figures

点击查看摘要

Abstract:Growth, abnormal behavior, and diseases of fish can be detected early by monitoring fish through image processing, which is of great significance for factory aquaculture. However, underwater reflections and fish-related factors, such as high similarity, rapid swimming caused by stimuli, and multi-object occlusion, bring challenges to multi-target tracking of fish. To address these challenges, this paper establishes a complex multi-scene sturgeon tracking dataset and proposes a real-time end-to-end fish tracking model, FMRFT. In this model, the Mamba In Mamba (MIM) architecture with low memory consumption is introduced into the tracking algorithm to realize multi-frame video timing memory and fast feature extraction, which improves the efficiency of correlation analysis for contiguous frames in multi-fish video. Additionally, the superior feature interaction and a priori frame processing capabilities of RT-DETR are leveraged to provide an effective tracking algorithm. By incorporating the QTSI query interaction processing module, the model effectively handles occluded objects and redundant tracking frames, resulting in more accurate and stable fish tracking. Trained and tested on the dataset, the model achieves an IDF1 score of 90.3% and a MOTA accuracy of 94.3%. Experimental results demonstrate that the proposed FMRFT model effectively addresses the challenges of high similarity and mutual occlusion in fish populations, enabling accurate tracking in factory farming environments.

[AI-88] LATEX-GCL: Large Language Models (LLMs)-Based Data Augmentation for Text-Attributed Graph Contrastive Learning

链接: https://arxiv.org/abs/2409.01145
作者: Haoran Yang,Xiangyu Zhao,Sirui Huang,Qing Li,Guandong Xu
关键词-EN: Graph Contrastive Learning, self-supervised graph learning, Graph Contrastive, Contrastive Learning, self-supervised graph
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised graph learning that has attracted attention across various application scenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet to be explored, because conventional augmentation techniques like feature embedding masking cannot directly process textual attributes on TAGs. A naive strategy for applying GCL to TAGs is to encode the textual attributes into feature embeddings via a language model and then feed the embeddings into the following GCL module for processing. Such a strategy faces three key challenges: I) failure to avoid information loss, II) semantic loss during the text encoding phase, and III) implicit augmentation constraints that lead to uncontrollable and incomprehensible results. In this paper, we propose a novel GCL framework named LATEX-GCL to utilize Large Language Models (LLMs) to produce textual augmentations and LLMs’ powerful natural language processing (NLP) abilities to address the three aforementioned limitations and pave the way for applying GCL to TAG tasks. Extensive experiments on four high-quality TAG datasets illustrate the superiority of the proposed LATEX-GCL method. The source codes and datasets are released to ease reproducibility, which can be accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.

[AI-89] Generating Synthetic Satellite Imagery for Rare Objects: An Empirical Comparison of Models and Metrics

链接: https://arxiv.org/abs/2409.01138
作者: Tuong Vy Nguyen,Johannes Hoster,Alexander Glaser,Kristian Hildebrand,Felix Biessmann
关键词-EN: drastic societal implications, potentially drastic societal, high-resolution fake imagery, Generative deep learning, deep learning architectures
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Presented at KI 2024 - 47th German Conference on AI, 2nd Workshop on Public Interest AI, 23 September, 2024, Wuerzburg, DE

点击查看摘要

Abstract:Generative deep learning architectures can produce realistic, high-resolution fake imagery – with potentially drastic societal implications. A key question in this context is: How easy is it to generate realistic imagery, in particular for niche domains? The iterative process required to achieve specific image content is difficult to automate and control. Especially for rare classes, it remains difficult to assess fidelity, meaning whether generative approaches produce realistic imagery, and alignment, meaning how (well) the generation can be guided by human input. In this work, we present a large-scale empirical evaluation of generative architectures which we fine-tuned to generate synthetic satellite imagery. We focus on nuclear power plants as an example of a rare object category - as there are only around 400 facilities worldwide, this restriction is exemplary for many other scenarios in which training and test data is limited by the restricted number of occurrences of real-world examples. We generate synthetic imagery by conditioning on two kinds of modalities, textual input and image input obtained from a game engine that allows for detailed specification of the building layout. The generated images are assessed by commonly used metrics for automatic evaluation and then compared with human judgement from our conducted user studies to assess their trustworthiness. Our results demonstrate that even for rare objects, generation of authentic synthetic satellite imagery with textual or detailed building layouts is feasible. In line with previous work, we find that automated metrics are often not aligned with human perception – in fact, we find strong negative correlations between commonly used image quality metrics and human ratings.

[AI-90] Smart E-commerce Recommendations with Semantic AI

链接: https://arxiv.org/abs/2409.01137
作者: M. Badouch,M. Boutaounte
关键词-EN: fails to meet, web mining, semantic web mining, neural network, user
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system’s ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
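
A back-propagation network over the five extracted features is small enough to sketch directly; the layer sizes below are illustrative assumptions rather than the paper's configuration.

```python
import torch.nn as nn

class PagePriorityNet(nn.Module):
    """Tiny feed-forward (BP) classifier over the five page features."""
    def __init__(self, n_features: int = 5, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 16), nn.ReLU(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):
        return self.net(x)   # logits over page-priority classes
```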

[AI-91] Large Language Models Can Understanding Depth from Monocular Images

链接: https://arxiv.org/abs/2409.01133
作者: Zhongyi Xia,Tianzhao Wu
关键词-EN: computer vision applications, critical function, function in computer, vision applications, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Monocular depth estimation is a critical function in computer vision applications. This paper shows that large language models (LLMs) can effectively interpret depth with minimal supervision, using efficient resource utilization and a consistent neural network architecture. We introduce LLM-MDE, a multimodal framework that deciphers depth through language comprehension. Specifically, LLM-MDE employs two main strategies to enhance the pretrained LLM’s capability for depth estimation: cross-modal reprogramming and an adaptive prompt estimation module. These strategies align vision representations with text prototypes and automatically generate prompts based on monocular images, respectively. Comprehensive experiments on real-world MDE datasets confirm the effectiveness and superiority of LLM-MDE, which excels in few-/zero-shot tasks while minimizing resource use. The source code is available.

[AI-92] AI Olympics challenge with Evolutionary Soft Actor Critic

链接: https://arxiv.org/abs/2409.01104
作者: Marco Calì,Alberto Sinigaglia,Niccolò Turcato,Ruggero Carli,Gian Antonio Susto
关键词-EN: Olympics competition held, held at IROS, Model-free Deep Reinforcement, Deep Reinforcement Learning, Olympics competition
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the following report, we describe the solution we propose for the AI Olympics competition held at IROS 2024. Our solution is based on a Model-free Deep Reinforcement Learning approach combined with an evolutionary strategy. We will briefly describe the algorithms that have been used and then provide details of the approach.

[AI-93] DS MYOLO: A Reliable Object Detector Based on SSMs for Driving Scenarios ICPR

链接: https://arxiv.org/abs/2409.01093
作者: Yang Li,Jianli Xiao
关键词-EN: advanced driver-assistance systems, Accurate real-time object, Accurate real-time, driver-assistance systems, real-time object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 27th International Conference on Pattern Recognition (ICPR)

点击查看摘要

Abstract:Accurate real-time object detection enhances the safety of advanced driver-assistance systems, making it an essential component in driving scenarios. With the rapid development of deep learning technology, CNN-based YOLO real-time object detectors have gained significant attention. However, the local focus of CNNs results in performance bottlenecks. To further enhance detector performance, researchers have introduced Transformer-based self-attention mechanisms to leverage global receptive fields, but their quadratic complexity incurs substantial computational costs. Recently, Mamba, with its linear complexity, has made significant progress through global selective scanning. Inspired by Mamba’s outstanding performance, we propose a novel object detector: DS MYOLO. This detector captures global feature information through a simplified selective scanning fusion block (SimVSS Block) and effectively integrates the network’s deep features. Additionally, we introduce an efficient channel attention convolution (ECAConv) that enhances cross-channel feature interaction while maintaining low computational complexity. Extensive experiments on the CCTSDB 2021 and VLD-45 driving scenarios datasets demonstrate that DS MYOLO exhibits significant potential and competitive advantage among similarly scaled YOLO series real-time object detectors.
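
For context, a standard ECA-style channel attention block looks like the sketch below (global pooling followed by a cheap 1D convolution across channels); how DS MYOLO fuses this with its convolution in ECAConv may differ from this illustration.

```python
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    """Efficient channel attention: per-channel weights from a 1D conv over pooled features."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        w = x.mean(dim=(2, 3))                          # global average pool -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)        # 1D conv across channels
        w = torch.sigmoid(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                                    # reweight channels
```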

[AI-94] Two-Timescale Synchronization and Migration for Digital Twin Networks: A Multi-Agent Deep Reinforcement Learning Approach

链接: https://arxiv.org/abs/2409.01092
作者: Wenshuai Liu,Yaru Fu,Yongna Guo,Fu Lee Wang,Wen Sun,Yan Zhang
关键词-EN: realizing self-sustaining systems, Digital twins, self-sustaining systems, promising enabler, enabler for representing
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 15 pages, 14 figures

点击查看摘要

Abstract:Digital twins (DTs) have emerged as a promising enabler for representing the real-time states of physical worlds and realizing self-sustaining systems. In practice, DTs of physical devices, such as mobile users (MUs), are commonly deployed in multi-access edge computing (MEC) networks for the sake of reducing latency. To ensure the accuracy and fidelity of DTs, it is essential for MUs to regularly synchronize their status with their DTs. However, MU mobility introduces significant challenges to DT synchronization. Firstly, MU mobility triggers DT migration which could cause synchronization failures. Secondly, MUs require frequent synchronization with their DTs to ensure DT fidelity. Nonetheless, DT migration among MEC servers, caused by MU mobility, may occur infrequently. Accordingly, we propose a two-timescale DT synchronization and migration framework with reliability consideration by establishing a non-convex stochastic problem to minimize the long-term average energy consumption of MUs. We use Lyapunov theory to convert the reliability constraints and reformulate the new problem as a partially observable Markov decision-making process (POMDP). Furthermore, we develop a heterogeneous agent proximal policy optimization with Beta distribution (Beta-HAPPO) method to solve it. Numerical results show that our proposed Beta-HAPPO method achieves significant improvements in energy savings when compared with other benchmarks.

[AI-95] Pre-Trained Language Models for Keyphrase Prediction: A Review

链接: https://arxiv.org/abs/2409.01087
作者: Muhammad Umair,Tangina Sultana,Young-Koo Lee
关键词-EN: Natural Language Processing, summarize its content, recent Natural Language, essential for identifying, Keyphrase Prediction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Keyphrase Prediction (KP) is essential for identifying keyphrases in a document that can summarize its content. However, recent Natural Language Processing (NLP) advances have developed more efficient KP models using deep learning techniques. The lack of a comprehensive exploration that jointly covers both keyphrase extraction and generation using pre-trained language models spotlights a critical gap in the literature, compelling our survey paper to bridge this deficiency and offer a unified and in-depth analysis to address limitations in previous surveys. This paper extensively examines the topic of pre-trained language models for keyphrase prediction (PLM-KP), which are trained on large text corpora via different learning (supervised, unsupervised, semi-supervised, and self-supervised) techniques, to provide respective insights into these two types of tasks in NLP, precisely, Keyphrase Extraction (KPE) and Keyphrase Generation (KPG). We introduce appropriate taxonomies for PLM-KPE and KPG to highlight these two main tasks of NLP. Moreover, we point out some promising future directions for predicting keyphrases.

[AI-96] DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

链接: https://arxiv.org/abs/2409.01086
作者: Xiaolong Wang,Zhi-Qi Cheng,Jue Wang,Xiaojiang Peng
关键词-EN: design concepts interactively, visualizing design concepts, Fashion image editing, Fashion image, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages,12 figures

点击查看摘要

Abstract:Fashion image editing is a crucial tool for designers to convey their creative ideas by visualizing design concepts interactively. Current fashion image editing techniques, though advanced with multimodal prompts and powerful diffusion models, often struggle to accurately identify editing regions and preserve the desired garment texture detail. To address these challenges, we introduce a new multimodal fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit). DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images. To precisely locate the editing region, we first introduce Grounded-SAM to predict the editing region based on the user’s textual description, and then combine it with other conditions to perform local editing. To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism. Specifically, this mechanism employs a decoupled cross-attention layer to integrate textual descriptions and texture images, and incorporates an auxiliary U-Net to preserve the high-frequency details of generated garment texture. Additionally, we extend the VITON-HD dataset using a multimodal large language model to generate paired samples with texture images and textual descriptions. Extensive experiments show that our DPDEdit outperforms state-of-the-art methods in terms of image fidelity and coherence with the given multimodal inputs.

[AI-97] Affordance-based Robot Manipulation with Flow Matching

链接: https://arxiv.org/abs/2409.01083
作者: Fan Zhang,Michael Gienger
关键词-EN: efficiently adapting large-scale, requires strenuous effort, involving humans requires, humans requires strenuous, adapting large-scale models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. Then we propose to learn robot trajectories guided by affordances in a supervised Flow Matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance with language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while satisfying parameter efficiency. Learning multi-task robot trajectories with a single flow matching policy also leads to consistently better performance than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
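
The flow-matching objective mentioned above admits a compact training step; the sketch below uses the common linear-interpolation (rectified-flow style) formulation and treats the network, conditioning, and trajectory encoding as illustrative assumptions.

```python
import torch

def flow_matching_loss(v_theta, demo_traj, cond):
    """demo_traj: (batch, traj_dim) flattened demonstration trajectories."""
    x0 = torch.randn_like(demo_traj)                       # random waypoints
    t = torch.rand(demo_traj.shape[0], 1, device=demo_traj.device)
    x_t = (1 - t) * x0 + t * demo_traj                     # point on the linear path
    target_velocity = demo_traj - x0                       # constant velocity of that path
    return ((v_theta(x_t, t, cond) - target_velocity) ** 2).mean()
```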

[AI-98] Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

链接: https://arxiv.org/abs/2409.01081
作者: Dingshuo Chen,Zhixun Li,Yuyan Ni,Guibin Zhang,Ding Wang,Qiang Liu,Shu Wu,Jeffrey Xu Yu,Liang Wang
关键词-EN: perform efficient training, perform efficient, urgent yet under-explored, under-explored issue, Data pruning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 20 pages, under review

点击查看摘要

Abstract:With the emergence of various molecular tasks and massive datasets, how to perform efficient training has become an urgent yet under-explored issue in the area. Data pruning (DP), as an oft-stated approach to saving training burdens, filters out less influential samples to form a coreset for training. However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. Therefore, we propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models. By maintaining two models with different updating paces during training, we introduce a novel scoring function to measure the informativeness of samples based on the loss discrepancy. As a plug-and-play framework, MolPeg realizes the perception of both source and target domain and consistently outperforms existing DP methods across four downstream tasks. Remarkably, it can surpass the performance obtained from full-dataset training, even when pruning up to 60-70% of the data on HIV and PCBA dataset. Our work suggests that the discovery of effective data-pruning metrics could provide a viable path to both enhanced efficiency and superior generalization in transfer learning.
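
The loss-discrepancy idea can be sketched as follows, assuming the two models are an online copy and a slower-updating copy (e.g. an EMA) of the same pretrained network; the actual scoring function and selection rule in MolPeg may differ.

```python
import torch

@torch.no_grad()
def loss_discrepancy_scores(fast_model, slow_model, loss_fn, inputs, targets):
    """loss_fn must return per-sample losses (reduction='none')."""
    fast_loss = loss_fn(fast_model(inputs), targets)
    slow_loss = loss_fn(slow_model(inputs), targets)
    return (fast_loss - slow_loss).abs()   # larger gap = more informative sample
```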

[AI-99] SCOPE: Sign Language Contextual Processing with Embedding from LLMs

链接: https://arxiv.org/abs/2409.01073
作者: Yuqi Liu,Wenqian Zhang,Sihan Ren,Chengyu Huang,Jingyi Yu,Lan Xu
关键词-EN: million Deaf individuals, Deaf individuals globally, sign language, individuals globally, convey visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.

[AI-100] Learning in Hybrid Active Inference Models

链接: https://arxiv.org/abs/2409.01066
作者: Poppy Collis,Ryan Singh,Paul F Kinghorn,Christopher L Buckley
关键词-EN: solving inherently continuous, flexibly learn discrete, learn discrete abstractions, Parr Friston, active inference
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 11 pages (+ appendix). Accepted to the International Workshop on Active Inference 2024. arXiv admin note: substantial text overlap with arXiv:2408.10970

点击查看摘要

Abstract:An open problem in artificial intelligence is how systems can flexibly learn discrete abstractions that are useful for solving inherently continuous problems. Previous work in computational neuroscience has considered this functional integration of discrete and continuous variables during decision-making under the formalism of active inference (Parr, Friston & de Vries, 2017; Parr & Friston, 2018). However, their focus is on the expressive physical implementation of categorical decisions and the hierarchical mixed generative model is assumed to be known. As a consequence, it is unclear how this framework might be extended to learning. We therefore present a novel hierarchical hybrid active inference agent in which a high-level discrete active inference planner sits above a low-level continuous active inference controller. We make use of recent work in recurrent switching linear dynamical systems (rSLDS) which implement end-to-end learning of meaningful discrete representations via the piecewise linear decomposition of complex continuous dynamics (Linderman et al., 2016). The representations learned by the rSLDS inform the structure of the hybrid decision-making agent and allow us to (1) specify temporally-abstracted sub-goals in a method reminiscent of the options framework, (2) lift the exploration into discrete space allowing us to exploit information-theoretic exploration bonuses and (3) 'cache' the approximate solutions to low-level problems in the discrete planner. We apply our model to the sparse Continuous Mountain Car task, demonstrating fast system identification via enhanced exploration and successful planning through the delineation of abstract sub-goals.

[AI-101] A Perspective on Literary Metaphor in the Context of Generative AI ECAI2024

链接: https://arxiv.org/abs/2409.01053
作者: Imke van Heerden,Anil Bas
关键词-EN: range of meanings, intersection of creative, study explores, explores the role, capacity to generate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted as oral presentation to Workshop on Artificial Intelligence and Creativity (CREAI) at ECAI 2024

点击查看摘要

Abstract:At the intersection of creative text generation and literary theory, this study explores the role of literary metaphor and its capacity to generate a range of meanings. In this regard, literary metaphor is vital to the development of any particular language. To investigate whether the inclusion of original figurative language improves textual quality, we trained an LSTM-based language model in Afrikaans. The network produces phrases containing compellingly novel figures of speech. Specifically, the emphasis falls on how AI might be utilised as a defamiliarisation technique, which disrupts expected uses of language to augment poetic expression. Providing a literary perspective on text generation, the paper raises thought-provoking questions on aesthetic value, interpretation and evaluation.

[AI-102] Accelerated Multi-objective Task Learning using Modified Q-learning Algorithm

链接: https://arxiv.org/abs/2409.01046
作者: Varun Prakash Rajamohan,Senthil Kumar Jagatheesaperumal
关键词-EN: Robots find extensive, find extensive applications, Q-learning algorithm, Robots find, applications in industry
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 9 pages, 9 figures, 7 tables

点击查看摘要

Abstract:Robots find extensive applications in industry. In recent years, the influence of robots has also increased rapidly in domestic scenarios. The Q-learning algorithm aims to maximise the reward for reaching the goal. This paper proposes a modified version of the Q-learning algorithm, known as Q-learning with scaled distance metric (Q-SD). This algorithm enhances task learning and makes task completion more meaningful. A robotic manipulator (agent) applies the Q-SD algorithm to the task of table cleaning. Using Q-SD, the agent acquires the sequence of steps necessary to accomplish the task while minimising the manipulator's movement distance. We partition the table into grids of different dimensions: the first has a grid size of 3 × 3, and the second 4 × 4. Using the Q-SD algorithm, the maximum success rate obtained in these two environments was 86% and 59%, respectively. Moreover, compared to the conventional Q-learning algorithm, the average distance moved by the agent in these two environments decreased by 8.61% and 6.7%, respectively, when using Q-SD.
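
A minimal sketch of Q-learning with a distance-scaled reward, in the spirit of Q-SD, is shown below on a 3 × 3 grid. The penalty weight, reward shaping, and toy environment are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative tabular Q-learning with a scaled distance penalty, in the spirit
# of Q-SD: the reward for each step is reduced in proportion to the distance the
# manipulator moves, so the learned policy favors short cleaning paths. The
# exact scaling used by the authors is not reproduced here.
import random
from collections import defaultdict

GRID = 3                                      # 3x3 table grid
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

def step(state, action):
    r, c = state
    dr, dc = action
    nr, nc = min(max(r + dr, 0), GRID - 1), min(max(c + dc, 0), GRID - 1)
    distance = abs(nr - r) + abs(nc - c)      # movement distance this step
    reward = 1.0 if (nr, nc) == (GRID - 1, GRID - 1) else 0.0
    reward -= 0.1 * distance                  # scaled distance penalty (assumed weight)
    return (nr, nc), reward, (nr, nc) == (GRID - 1, GRID - 1)

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.95, 0.2
for episode in range(500):
    s = (0, 0)
    for _ in range(200):                      # cap episode length
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break
```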

[AI-103] Robust Vehicle Localization and Tracking in Rain using Street Maps

链接: https://arxiv.org/abs/2409.01038
作者: Yu Xiang Tan,Malika Meghjani
关键词-EN: dense urban areas, unstable positional information, positional information commonly, information commonly experienced, Visual Inertial Odometry
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:GPS-based vehicle localization and tracking suffers from unstable positional information commonly experienced in tunnel segments and in dense urban areas. Also, both Visual Odometry (VO) and Visual Inertial Odometry (VIO) are susceptible to adverse weather conditions that cause occlusions or blur on the visual input. In this paper, we propose a novel approach for vehicle localization that uses street network based map information to correct drifting odometry estimates and intermittent GPS measurements, especially in adversarial scenarios such as driving in rain and tunnels. Specifically, our approach is a flexible fusion algorithm that integrates intermittent GPS, drifting IMU and VO estimates together with 2D map information for robust vehicle localization and tracking. We refer to our approach as Map-Fusion. We robustly evaluate our proposed approach on four geographically diverse datasets from different countries ranging across clear and rain weather conditions. These datasets also include challenging visual segments in tunnels and underpasses. We show that with the integration of the map information, our Map-Fusion algorithm reduces the error of the state-of-the-art VO and VIO approaches across all datasets. We also validate our proposed algorithm in a real-world environment and in real-time on a hardware constrained mobile robot. Map-Fusion achieved 2.46m error in clear weather and 6.05m error in rain weather for a 150m route.

[AI-104] From Birds-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model ICRA

链接: https://arxiv.org/abs/2409.01014
作者: Xiaojie Xu,Tianshuo Xu,Fulong Ma,Yingcong Chen
关键词-EN: BEV, BEV map, Neural View Transformation, Street Image Generation, image generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at International Conference on Robotics and Automation(ICRA)

点击查看摘要

Abstract:We explore Bird’s-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine-tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This finetuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street view images.

[AI-105] Unlocking the Wisdom of Large Language Models: An Introduction to The Path to Artificial General Intelligence

链接: https://arxiv.org/abs/2409.01007
作者: Edward Y. Chang
关键词-EN: Large Language Models, Artificial General Intelligence, Unlocking the Wisdom, Language Models, Wisdom of Large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This booklet, “Unlocking the Wisdom of Large Language Models,” serves as an introduction to the comprehensive work “The Path to Artificial General Intelligence.” Through a series of nine aphorisms, we distill key insights and principles that underpin the larger exploration of AI’s future through adversarial LLM dialogue. We propose this approach as a potential path to realizing artificial general intelligence (AGI). This booklet also includes the titles, abstracts, and introductions of the chapters in the main book, and presents the first two chapters in their entirety.

[AI-106] 3D Priors-Guided Diffusion for Blind Face Restoration

链接: https://arxiv.org/abs/2409.00991
作者: Xiaobin Lu,Xiaobin Hu,Jun Luo,Ben Zhu,Yaping Ruan,Wenqi Ren
关键词-EN: degraded counterpart, Generative Adversarial Networks, endeavors to restore, restore a clear, employing Generative Adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Blind face restoration endeavors to restore a clear face image from a degraded counterpart. Recent approaches employing Generative Adversarial Networks (GANs) as priors have demonstrated remarkable success in this field. However, these methods encounter challenges in achieving a balance between realism and fidelity, particularly in complex degradation scenarios. To inherit the exceptional realism generative ability of the diffusion model and also constrained by the identity-aware fidelity, we propose a novel diffusion-based framework by embedding the 3D facial priors as structure and identity constraints into a denoising diffusion process. Specifically, in order to obtain more accurate 3D prior representations, the 3D facial image is reconstructed by a 3D Morphable Model (3DMM) using an initial restored face image that has been processed by a pretrained restoration network. A customized multi-level feature extraction method is employed to exploit both structural and identity information of 3D facial images, which are then mapped into the noise estimation process. In order to enhance the fusion of identity information into the noise estimation, we propose a Time-Aware Fusion Block (TAFB). This module offers a more efficient and adaptive fusion of weights for denoising, considering the dynamic nature of the denoising process in the diffusion model, which involves initial structure refinement followed by texture detail enhancement. Extensive experiments demonstrate that our network performs favorably against state-of-the-art algorithms on synthetic and real-world datasets for blind face restoration.

[AI-107] Co-Learning: Code Learning for Multi-Agent Reinforcement Collaborative Framework with Conversational Natural Language Interfaces

链接: https://arxiv.org/abs/2409.00985
作者: Jiapeng Yu,Yuqian Wu,Yajing Zhan,Wenhao Guo,Zhou Xu,Raymond Lee
关键词-EN: Large Language Model, Language Model, Large Language, systems based, progressively diverged
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Online question-and-answer (Q&A) systems based on Large Language Models (LLMs) have progressively diverged from recreational to professional use. This paper proposes a Multi-Agent framework with environmental reinforcement learning (E-RL) for code correction, called the Code Learning (Co-Learning) community, which assists beginners in correcting code errors independently. It evaluates the performance of multiple LLMs on an original dataset of 702 error codes and uses the results as the reward or punishment criterion for E-RL; it analyzes the input error code with the current agent and selects the appropriate LLM-based agent to achieve optimal error-correction accuracy and reduced correction time. Experimental results showed a 3% improvement in Precision score and a 15% reduction in time cost compared with the method without E-RL. Our source code is available at this https URL.

[AI-108] DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data

链接: https://arxiv.org/abs/2409.00980
作者: Priyanka Chudasama,Anil Surisetty,Aakarsh Malhotra,Alok Singh
关键词-EN: tasks present challenges, present challenges due, Classification tasks present, evolving data distributions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages

点击查看摘要

Abstract:Classification tasks present challenges due to class imbalances and evolving data distributions. Addressing these issues requires a robust method to handle imbalances while effectively detecting out-of-distribution (OOD) samples not encountered during training. This study introduces a novel OOD detection algorithm designed for tabular datasets, titled Deep Neural Network-based Gaussian Descriptor for Imbalanced Tabular Data (DNN-GDITD). The DNN-GDITD algorithm can be placed on top of any DNN to facilitate better classification of imbalanced data and OOD detection using spherical decision boundaries. Using a combination of Push, Score-based, and focal losses, DNN-GDITD assigns confidence scores to test data points, categorizing them as known classes or as an OOD sample. Extensive experimentation on tabular datasets demonstrates the effectiveness of DNN-GDITD compared to three OOD algorithms. Evaluation encompasses imbalanced and balanced scenarios on diverse tabular datasets, including a synthetic financial dispute dataset and publicly available tabular datasets like Gas Sensor, Drive Diagnosis, and MNIST, showcasing DNN-GDITD's versatility.
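
To make the spherical-boundary idea concrete, the sketch below fits one spherical Gaussian per known class in an embedding space and flags points outside every class boundary as OOD. The actual DNN-GDITD training losses (Push, Score-based, focal) and scoring rule are not reproduced; the radius threshold and all names are assumptions.

```python
# Hedged sketch of Gaussian-descriptor OOD scoring on top of a frozen network:
# fit one spherical Gaussian per known class in embedding space, then flag test
# points that fall outside every class boundary. The actual DNN-GDITD training
# losses (Push / Score-based / focal) are not reproduced here.
import numpy as np

def fit_class_gaussians(emb: np.ndarray, labels: np.ndarray):
    stats = {}
    for c in np.unique(labels):
        z = emb[labels == c]
        mu = z.mean(axis=0)
        sigma = z.std() + 1e-6                # single radius -> spherical boundary
        stats[int(c)] = (mu, sigma)
    return stats

def score_and_classify(stats, z: np.ndarray, radius_in_sigmas: float = 3.0):
    """Return (predicted class or -1 for OOD, confidence score) for one embedding."""
    dists = {c: np.linalg.norm(z - mu) / sigma for c, (mu, sigma) in stats.items()}
    c_best = min(dists, key=dists.get)
    confidence = float(np.exp(-dists[c_best]))
    if dists[c_best] > radius_in_sigmas:      # outside every spherical boundary
        return -1, confidence
    return c_best, confidence

# toy usage: 2-D embeddings for two known classes plus an OOD point
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
stats = fit_class_gaussians(emb, labels)
print(score_and_classify(stats, np.array([0.5, -0.2])))   # known class 0
print(score_and_classify(stats, np.array([40.0, 40.0])))  # flagged as OOD (-1)
```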

[AI-109] Enhancing Privacy in Federated Learning: Secure Aggregation for Real-World Healthcare Applications MICCAI 2024

链接: https://arxiv.org/abs/2409.00974
作者: Riccardo Taiello,Sergen Cansiz,Marc Vesin,Francesco Cremonesi,Lucia Innocenti,Melek Önen,Marco Lorenzi
关键词-EN: Deploying federated learning, Deploying federated, poses challenges, federated learning, federated aggregation procedure
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Accepted at the 5-th MICCAI Workshop on Distributed, Collaborative and Federated Learning in Conjunction with MICCAI 2024

点击查看摘要

Abstract:Deploying federated learning (FL) in real-world scenarios, particularly in healthcare, poses challenges in communication and security. In particular, with respect to the federated aggregation procedure, researchers have been focusing on the study of secure aggregation (SA) schemes to provide privacy guarantees over the model's parameters transmitted by the clients. Nevertheless, the practical availability of SA in currently available FL frameworks is limited, due to computational and communication bottlenecks. To fill this gap, this study explores the implementation of SA within the open-source Fed-BioMed framework. We implement and compare two SA protocols, Joye-Libert (JL) and Low Overhead Masking (LOM), by providing extensive benchmarks in a panel of healthcare data analysis problems. Our theoretical and experimental evaluations on four datasets demonstrate that SA protocols effectively protect privacy while maintaining task accuracy. Computational overhead during training is less than 1% on a CPU and less than 50% on a GPU for large models, with protection phases taking less than 10 seconds. Incorporating SA into Fed-BioMed impacts task accuracy by no more than 2% compared to non-SA scenarios. Overall, this study demonstrates the feasibility of SA in real-world healthcare applications and contributes to reducing the gap towards the adoption of privacy-preserving technologies in sensitive applications.
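
The cancellation idea behind mask-based secure aggregation can be illustrated in a few lines. The toy sketch below uses pairwise additive masks that cancel in the server's sum; real protocols such as Joye-Libert or LOM rely on proper key agreement and modular arithmetic, so this is only a conceptual illustration, not the Fed-BioMed implementation.

```python
# Minimal sketch of mask-based secure aggregation: each pair of clients agrees
# on a shared random mask that one adds and the other subtracts, so individual
# updates are hidden but the server's sum is unchanged. Real protocols such as
# Joye-Libert or LOM use proper key agreement and modular arithmetic; this toy
# version only illustrates the cancellation idea.
import numpy as np

def pairwise_masks(n_clients: int, dim: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    masks = [np.zeros(dim) for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = rng.normal(size=dim)     # stands in for a PRG keyed by (i, j)
            masks[i] += shared                # client i adds the shared mask
            masks[j] -= shared                # client j subtracts it
    return masks

n_clients, dim = 4, 8
updates = [np.random.randn(dim) for _ in range(n_clients)]
masks = pairwise_masks(n_clients, dim)
masked = [u + m for u, m in zip(updates, masks)]     # what the server receives
server_sum = np.sum(masked, axis=0)                  # masks cancel in the sum
assert np.allclose(server_sum, np.sum(updates, axis=0))
```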

[AI-110] Semantically Controllable Augmentations for Generalizable Robot Learning

链接: https://arxiv.org/abs/2409.00951
作者: Zoey Chen,Zhao Mandi,Homanga Bharadhwaj,Mohit Sharma,Shuran Song,Abhishek Gupta,Vikash Kumar
关键词-EN: manipulation requires exposure, requires exposure, robot, real-world, generative
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted for publication by IJRR. First 3 authors contributed equally. Last 3 authors advised equally

点击查看摘要

Abstract:Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. However, collecting large real-world datasets is intractable due to high operational costs. For robot learning to generalize despite these challenges, it is essential to leverage sources of data or priors beyond the robot's direct experience. In this work, we posit that image-text generative models, which are pre-trained on large corpora of web-scraped data, can serve as such a data source. These generative models encompass a broad range of real-world scenarios beyond a robot's direct experience and can synthesize novel synthetic experiences that expose robotic agents to additional world priors aiding real-world generalization at no extra cost. In particular, our approach leverages pre-trained generative models as an effective tool for data augmentation. We propose a generative augmentation framework for semantically controllable augmentations and rapidly multiplying robot datasets while inducing rich variations that enable real-world generalization. Based on diverse augmentations of robot data, we show how scalable robot manipulation policies can be trained and deployed both in simulation and in unseen real-world environments such as kitchens and table-tops. By demonstrating the effectiveness of image-text generative models in diverse real-world robotic applications, our generative augmentation framework provides a scalable and efficient path for boosting generalization in robot learning at no extra human cost.

[AI-111] XNet v2: Fewer Limitations, Better Results and Greater Universality

链接: https://arxiv.org/abs/2409.00947
作者: Yanfeng Zhou,Lingrui Li,Zichen Wang,Guole Liu,Ziwen Liu,Ge Yang
关键词-EN: X-shaped unified architecture, wavelet-based X-shaped unified, X-shaped unified, wavelet-based X-shaped, architecture for fully
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion, feeding the fusion results along with the raw images into three different sub-networks to construct a consistency loss. Furthermore, we introduce a feature-level fusion module to enhance the transfer of low-frequency (LF) information and HF information. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared to XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. Code is available at this https URL.

[AI-112] A Framework for Synthetic Audio Conversations Generation using Large Language Models

链接: https://arxiv.org/abs/2409.00946
作者: Kaung Myat Kyaw,Jonathan Hoyin Chan
关键词-EN: multiple persona settings, large language models, persona settings, large language, multiple persona
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: This work has been submitted for consideration at the WI-IAT’24 to be held in December 2024

点击查看摘要

Abstract:In this paper, we introduce ConversaSynth, a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings. The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems. Our experiments demonstrate that ConversaSynth effectively generates high-quality synthetic audio datasets, which can significantly enhance the training and evaluation of models for audio tagging, audio classification, and multi-speaker speech recognition. The results indicate that the synthetic datasets generated by ConversaSynth exhibit substantial diversity and realism, making them suitable for developing robust, adaptable audio-based AI systems.

[AI-113] Large Language Models for Automatic Detection of Sensitive Topics

链接: https://arxiv.org/abs/2409.00940
作者: Ruoyu Wen,Stephanie Elena Crowe,Kunal Gupta,Xinyue Li,Mark Billinghurst,Simon Hoermann,Dwain Allan,Alaeddin Nassani,Thammathip Piumsomboon
关键词-EN: safe online communities, maintain safe online, Sensitive information detection, maintain safe, Sensitive information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 2024 Oz CHI conference

点击查看摘要

Abstract:Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus solely on flagged content that may pose potential risks. Rapidly advancing large language models (LLMs) are known for their capability to understand and process natural language and so present a potential solution to support this process. This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain within two online datasets and assesses their performance in terms of accuracy, precision, recall, F1 scores, and consistency. Our findings indicate that LLMs have the potential to be integrated into the moderation workflow as a convenient and precise detection tool. The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99. We discuss the advantages and potential challenges of using LLMs in the moderation workflow and suggest that future research should address the ethical considerations of utilising this technology.

[AI-114] Development of Occupancy Prediction Algorithm for Underground Parking Lots

链接: https://arxiv.org/abs/2409.00923
作者: Shijie Wang
关键词-EN: perception challenges faced, core objective, challenges faced, underground garage, Transformer-based Occupancy Network
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The core objective of this study is to address the perception challenges faced by autonomous driving in adverse environments like basements. The study begins with data collection in an underground garage. A simulated underground garage model is established within the CARLA simulation environment, and SemanticKITTI format occupancy ground truth data is collected in this simulated setting. Subsequently, the study integrates a Transformer-based Occupancy Network model to complete the occupancy grid prediction task within this scenario. A comprehensive BEV perception framework is designed to enhance the accuracy of neural network models in dimly lit, challenging autonomous driving environments. Finally, experiments validate the accuracy of the proposed solution's perception performance in basement scenarios. The proposed solution is tested on our self-constructed underground garage dataset, SUSTech-COE-ParkingLot, yielding satisfactory results.

[AI-115] Statically Contextualizing Large Language Models with Typed Holes

链接: https://arxiv.org/abs/2409.00921
作者: Andrew Blinn,Xiang Li,June Hyung Kim,Cyrus Omar
关键词-EN: Large language models, Large language, language server, Hazel Language Server, reshaped the landscape
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: To appear at OOPSLA2024

点击查看摘要

Abstract:Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate context, particularly when working with definitions not in the training data nor near the cursor. This paper demonstrates that tight integration with the type and binding structure of a language, as exposed by its language server, can address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server identifies the type and typing context of the hole being filled, even in the presence of errors, ensuring that a meaningful program sketch is always available. This allows prompting with codebase-wide contextual information not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer’s goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications. These applications serve as challenge problems due to their reliance on application-specific data structures. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.

[AI-116] ToolACE: Winning the Points of LLM Function Calling

链接: https://arxiv.org/abs/2409.00920
作者: Weiwen Liu,Xu Huang,Xingshan Zeng,Xinlong Hao,Shuai Yu,Dexun Li,Shuai Wang,Weinan Gan,Zhengying Liu,Yuanqing Yu,Zezhong Wang,Yuxian Wang,Wu Ning,Yutai Hou,Bin Wang,Chuhan Wu,Xinzhi Wang,Yong Liu,Yasheng Wang,Duyu Tang,Dandan Tu,Lifeng Shang,Xin Jiang,Ruiming Tang,Defu Lian,Qun Liu,Enhong Chen
关键词-EN: Function calling significantly, calling significantly extends, Function calling, large language models, unlocking this capability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 21 pages, 22 figures

点击查看摘要

Abstract:Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at this https URL.
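
A dual-layer verification step of the kind described might look like the sketch below: a rule-based schema check on the generated function call followed by a pluggable model-based judge. The concrete rules, schema format, and judge used by ToolACE are assumptions here.

```python
# Hedged sketch of a dual-layer check for synthetic function-calling samples:
# layer 1 is a rule-based validation of the call against its API schema, layer 2
# defers to a model-based judge (any callable, e.g. an LLM wrapper) for semantic
# consistency with the dialog. The concrete rules and judge used by ToolACE are
# assumptions here.
from typing import Callable, Dict, Any

def rule_check(call: Dict[str, Any], schema: Dict[str, Any]) -> bool:
    """Layer 1: the call must name a known tool and pass all required, typed args."""
    if call.get("name") != schema["name"]:
        return False
    params = schema["parameters"]
    for arg in params.get("required", []):
        if arg not in call.get("arguments", {}):
            return False
    for arg, value in call.get("arguments", {}).items():
        expected = params["properties"].get(arg, {}).get("type")
        if expected == "integer" and not isinstance(value, int):
            return False
        if expected == "string" and not isinstance(value, str):
            return False
    return True

def verify(call, schema, dialog: str, judge: Callable[[str, Dict], bool]) -> bool:
    return rule_check(call, schema) and judge(dialog, call)   # layer 2: model-based

schema = {"name": "get_weather",
          "parameters": {"required": ["city"],
                         "properties": {"city": {"type": "string"},
                                        "days": {"type": "integer"}}}}
call = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
always_yes_judge = lambda dialog, call: True     # placeholder for an LLM judge
print(verify(call, schema, "What's the weather in Paris for 3 days?", always_yes_judge))
```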

[AI-117] MMT-BERT: Chord-aware Symbolic Music Generation Based on Multitrack Music Transformer and MusicBERT

链接: https://arxiv.org/abs/2409.00919
作者: Jinlong Zhu,Keigo Sakurai,Ren Togo,Takahiro Ogawa,Miki Haseyama
关键词-EN: Generative Adversarial Network, Adversarial Network, Generative Adversarial, symbolic music representation, symbolic music generation
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

点击查看摘要

Abstract:We propose a novel symbolic music representation and Generative Adversarial Network (GAN) framework specially designed for symbolic multitrack music generation. The main theme of symbolic music generation primarily encompasses the preprocessing of music data and the implementation of a deep learning framework. Current techniques dedicated to symbolic music generation generally encounter two significant challenges: training data’s lack of information about chords and scales and the requirement of specially designed model architecture adapted to the unique format of symbolic music representation. In this paper, we solve the above problems by introducing new symbolic music representation with MusicLang chord analysis model. We propose our MMT-BERT architecture adapting to the representation. To build a robust multitrack music generator, we fine-tune a pre-trained MusicBERT model to serve as the discriminator, and incorporate relativistic standard loss. This approach, supported by the in-depth understanding of symbolic music encoded within MusicBERT, fortifies the consonance and humanity of music generated by our method. Experimental results demonstrate the effectiveness of our approach which strictly follows the state-of-the-art methods.
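
For reference, the relativistic standard GAN loss mentioned above compares discriminator scores of real and generated samples rather than judging each in isolation. The sketch below is a generic formulation with placeholder logits, not the MMT-BERT training code.

```python
# Hedged sketch of the relativistic standard GAN loss (RSGAN) that the paper
# reports using: the discriminator is trained to judge real data as more
# realistic than fake data on average, rather than in absolute terms. Shapes
# and the discriminator itself are placeholders.
import torch
import torch.nn.functional as F

def rsgan_d_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    # D wants C(x_real) - C(x_fake) to be large
    return F.binary_cross_entropy_with_logits(
        real_logits - fake_logits, torch.ones_like(real_logits))

def rsgan_g_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    # G wants the opposite ordering
    return F.binary_cross_entropy_with_logits(
        fake_logits - real_logits, torch.ones_like(fake_logits))

# toy usage with random "discriminator scores" for a batch of sequences
real_logits = torch.randn(16, 1)
fake_logits = torch.randn(16, 1)
print(rsgan_d_loss(real_logits, fake_logits).item(),
      rsgan_g_loss(real_logits, fake_logits).item())
```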

[AI-118] ViRED: Prediction of Visual Relations in Engineering Drawings

链接: https://arxiv.org/abs/2409.00909
作者: Chao Gu,Ke Lin,Yiyang Luo,Jiahui Hou,Xiang-Yang Li
关键词-EN: accurately understand engineering, understand engineering drawings, accurately understand, essential to establish, establish the correspondence
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can run inference quickly even when there are numerous objects in a single engineering drawing.

[AI-119] Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction

链接: https://arxiv.org/abs/2409.00904
作者: Zhanwen Liu,Chao Li,Yang Wang,Nan Yang,Xing Fan,Jiaqi Ma,Xiangmo Zhao
关键词-EN: autonomous driving systems, driving decisions based, enabling autonomous vehicles, vehicle trajectory prediction, multi-scale motion representation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to achieve more accurate local-path planning and driving decisions based on predictions of the surrounding vehicles. However, existing methods neglect the potential missing values caused by object occlusion, perception failures, etc., which inevitably degrades the trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to parallelly capture multi-scale motion representation of trajectory from different temporal granularities, thus mitigating the adverse effect of missing values on prediction. Furthermore, the multi-scale motion representation is input into the CRMF module for multi-scale fusion to obtain the robust temporal feature of the vehicle. During the fusion process, the continuity representation of vehicle motion is first extracted across time steps to guide the fusion, ensuring that the resulting temporal feature incorporates both detailed information and the overall trend of vehicle motion, which facilitates the accurate decoding of future trajectory that is consistent with the vehicle’s motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance in the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.

[AI-120] MarsCode Agent: AI-native Automated Bug Fixing

链接: https://arxiv.org/abs/2409.00899
作者: Yizhou Liu,Pengfei Gao,Xinchen Wang,Chao Peng,Zhao Zhang
关键词-EN: large language models, shown significant potential, Recent advances, including code completion, software development tasks
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Yizhou Liu and Pengfei Gao contributed equally and the order is determined by rolling the dice. Chao Peng is the corresponding author

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown significant potential to automate various software development tasks, including code completion, test generation, and bug fixing. However, the application of LLMs for automated bug fixing remains challenging due to the complexity and diversity of real-world software systems. In this paper, we introduce MarsCode Agent, a novel framework that leverages LLMs to automatically identify and repair bugs in software code. MarsCode Agent combines the power of LLMs with advanced code analysis techniques to accurately localize faults and generate patches. Our approach follows a systematic process of planning, bug reproduction, fault localization, candidate patch generation, and validation to ensure high-quality bug fixes. We evaluated MarsCode Agent on SWE-bench, a comprehensive benchmark of real-world software projects, and our results show that MarsCode Agent achieves a high success rate in bug fixing compared to most of the existing automated approaches.

[AI-121] User-Specific Dialogue Generation with User Profile-Aware Pre-Training Model and Parameter-Efficient Fine-Tuning

链接: https://arxiv.org/abs/2409.00887
作者: Atsushi Otsuka,Kazuya Matsuo,Ryo Ishii,Narichika Nomoto,Hiroaki Sugiyama
关键词-EN: addresses user-specific dialogs, paper addresses user-specific, paper addresses, model, dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper addresses user-specific dialogs. In contrast to previous research on personalized dialogue focused on achieving virtual user dialogue as defined by persona descriptions, user-specific dialogue aims to reproduce real-user dialogue beyond persona-based dialogue. Fine-tuning using the target user’s dialogue history is an efficient learning method for a user-specific model. However, it is prone to overfitting and model destruction due to the small amount of data. Therefore, we propose a learning method for user-specific models by combining parameter-efficient fine-tuning with a pre-trained dialogue model that includes user profiles. Parameter-efficient fine-tuning adds a small number of parameters to the entire model, so even small amounts of training data can be trained efficiently and are robust to model destruction. In addition, the pre-trained model, which is learned by adding simple prompts for automatically inferred user profiles, can generate speech with enhanced knowledge of the user’s profile, even when there is little training data during fine-tuning. In experiments, we compared the proposed model with large-language-model utterance generation using prompts containing users’ personal information. Experiments reproducing real users’ utterances revealed that the proposed model can generate utterances with higher reproducibility than the compared methods, even with a small model.
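
One common instantiation of parameter-efficient fine-tuning is soft prompt tuning, sketched below: a handful of learnable embeddings are prepended to a frozen dialogue model's inputs, so only a small number of parameters are trained per user. The paper's exact PEFT method and base model are not specified here, so the wrapper below is purely illustrative.

```python
# Hedged sketch of parameter-efficient fine-tuning via soft prompts: a small set
# of learnable embeddings is prepended to the (frozen) dialogue model's input,
# so only a few parameters are trained per user. The paper's exact PEFT method
# may differ (e.g. adapters or LoRA); this is one common instantiation.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, base_model: nn.Module, embed_dim: int, prompt_len: int = 10):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False                      # freeze the pretrained model
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings of the dialogue
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))

# toy usage with a stand-in "dialogue model"
base = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
model = SoftPromptWrapper(base, embed_dim=64)
out = model(torch.randn(2, 20, 64))          # only the 10x64 prompt is trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(out.shape, trainable)                  # torch.Size([2, 30, 64]) 640
```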

[AI-122] Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts

链接: https://arxiv.org/abs/2409.00879
作者: Youngseog Chung,Dhruv Malik,Jeff Schneider,Yuanzhi Li,Aarti Singh
关键词-EN: Soft MoE, Sparse Mixture, large expert, single large expert, small experts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures, 13 tables

点击查看摘要

Abstract:The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts. The hope is that if the total parameter count of the small experts equals that of the singular large expert, then we retain the representation power of the large expert while gaining computational tractability and promoting expert specialization. The recently introduced Soft MoE replaces the Sparse MoE’s discrete routing mechanism with a differentiable gating function that smoothly mixes tokens. While this smooth gating function successfully mitigates the various training instabilities associated with Sparse MoE, it is unclear whether it induces implicit biases that affect Soft MoE’s representation power or potential for expert specialization. We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This justifies that Soft MoE’s success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). Continuing along this line of investigation, we introduce a notion of expert specialization for Soft MoE, and while varying the number of experts yet fixing the total parameter count, we consider the following (computationally intractable) task. Given any input, how can we discover the expert subset that is specialized to predict this input’s label? We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference.
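
A Soft MoE layer as discussed above can be written compactly: a single gating matrix yields dispatch weights (a softmax over tokens) and combine weights (a softmax over slots), so every token is smoothly mixed into expert slots and back. The sketch below follows the published Soft MoE formulation with illustrative dimensions and expert MLPs.

```python
# Hedged sketch of a Soft MoE layer: every input token is softly dispatched to
# expert "slots" via a differentiable gating matrix, each expert processes its
# slots, and outputs are softly combined back per token. Dimensions and expert
# MLPs are illustrative.
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 4, slots_per_expert: int = 1):
        super().__init__()
        self.n_experts, self.slots = n_experts, slots_per_expert
        self.phi = nn.Parameter(torch.randn(dim, n_experts * slots_per_expert) * dim ** -0.5)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim)
        logits = x @ self.phi                          # (b, n, e*s)
        dispatch = logits.softmax(dim=1)               # normalize over tokens per slot
        combine = logits.softmax(dim=2)                # normalize over slots per token
        slot_inputs = dispatch.transpose(1, 2) @ x     # (b, e*s, dim): weighted token mixes
        slot_inputs = slot_inputs.view(x.size(0), self.n_experts, self.slots, -1)
        slot_outputs = torch.stack(
            [exp(slot_inputs[:, i]) for i, exp in enumerate(self.experts)], dim=1)
        slot_outputs = slot_outputs.reshape(x.size(0), self.n_experts * self.slots, -1)
        return combine @ slot_outputs                  # (b, n, dim)

layer = SoftMoE(dim=32, n_experts=4, slots_per_expert=2)
print(layer(torch.randn(2, 16, 32)).shape)             # torch.Size([2, 16, 32])
```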

[AI-123] Equitable Skin Disease Prediction Using Transfer Learning and Domain Adaptation

链接: https://arxiv.org/abs/2409.00873
作者: Sajib Acharjee Dip,Kazi Hasan Ibn Arif,Uddip Acharjee Shuvo,Ishtiaque Ahmed Khan,Na Meng
关键词-EN: conditions manually necessitates, diverse skin tones, expertise of dermatologists, manually necessitates, necessitates the expertise
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the realm of dermatology, the complexity of diagnosing skin conditions manually necessitates the expertise of dermatologists. Accurate identification of various skin ailments, ranging from cancer to inflammatory diseases, is paramount. However, existing artificial intelligence (AI) models in dermatology face challenges, particularly in accurately diagnosing diseases across diverse skin tones, with a notable performance gap in darker skin. Additionally, the scarcity of publicly available, unbiased datasets hampers the development of inclusive AI diagnostic tools. To tackle the challenges in accurately predicting skin conditions across diverse skin tones, we employ a transfer-learning approach that capitalizes on the rich, transferable knowledge from various image domains. Our method integrates multiple pre-trained models from a wide range of sources, including general and specific medical images, to improve the robustness and inclusiveness of the skin condition predictions. We rigorously evaluated the effectiveness of these models using the Diverse Dermatology Images (DDI) dataset, which uniquely encompasses both underrepresented and common skin tones, making it an ideal benchmark for assessing our approach. Among all methods, Med-ViT emerged as the top performer due to its comprehensive feature representation learned from diverse image sources. To further enhance performance, we conducted domain adaptation using additional skin image datasets such as HAM10000. This adaptation significantly improved model performance across all models.

[AI-124] Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering

链接: https://arxiv.org/abs/2409.00861
作者: Derian Boer,Fabian Koch,Stefan Kramer
关键词-EN: Large Language Models, Large Language, frequently lack domain-specific, fine-tuned models tend, lack domain-specific knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 9 pages, published at IJCLR 2024

点击查看摘要

Abstract:Large Language Models (LLMs) frequently lack domain-specific knowledge and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge making use of the model's ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers by triplet-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of the steps: 1) Triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting remaining candidates with a vector similarity search involving associated non-structured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. The source code used is available at this https URL.

[AI-125] Trustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving

链接: https://arxiv.org/abs/2409.00858
作者: Zilin Huang,Zihao Sheng,Lei Shi,Sikai Chen
关键词-EN: Human Feedback, Reinforcement Learning, driving policies remains, Physics-enhanced Reinforcement Learning, Human
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 33 pages, 20 figures

点击查看摘要

Abstract:In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its potential to enhance training safety and sampling efficiency. Nevertheless, existing RLHF-enabled methods often falter when faced with imperfect human demonstrations, potentially leading to training oscillations or even worse performance than rule-based approaches. Inspired by the human learning process, we propose Physics-enhanced Reinforcement Learning with Human Feedback (PE-RLHF). This novel framework synergistically integrates human feedback (e.g., human intervention and demonstration) and physics knowledge (e.g., traffic flow model) into the training loop of reinforcement learning. The key advantage of PE-RLHF is its guarantee that the learned policy will perform at least as well as the given physics-based policy, even when human feedback quality deteriorates, thus ensuring trustworthy safety improvements. PE-RLHF introduces a Physics-enhanced Human-AI (PE-HAI) collaborative paradigm for dynamic action selection between human and physics-based actions, employs a reward-free approach with a proxy value function to capture human preferences, and incorporates a minimal intervention mechanism to reduce the cognitive load on human mentors. Extensive experiments across diverse driving scenarios demonstrate that PE-RLHF significantly outperforms traditional methods, achieving state-of-the-art (SOTA) performance in safety, efficiency, and generalizability, even with varying quality of human feedback. The philosophy behind PE-RLHF not only advances autonomous driving technology but can also offer valuable insights for other safety-critical domains. Demo video and code are available at: this https URL
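
The dynamic action-selection idea in PE-HAI can be pictured as a simple arbitration rule: a human intervention always wins; otherwise the RL action is kept only if a proxy value function rates it at least as highly as the physics-based action. The functions and value proxy below are placeholders, not the authors' implementation.

```python
# Hedged sketch of the PE-HAI idea of dynamic action selection: if a human
# intervenes, take the human action; otherwise compare the RL policy's action
# with the physics-based (e.g. traffic-flow model) action under a proxy value
# function and keep whichever scores higher, so the executed policy never falls
# below the physics-based baseline under that value estimate. All functions
# here are placeholders.
from typing import Callable, Optional
import numpy as np

def select_action(state: np.ndarray,
                  rl_action: np.ndarray,
                  physics_action: np.ndarray,
                  proxy_value: Callable[[np.ndarray, np.ndarray], float],
                  human_action: Optional[np.ndarray] = None) -> np.ndarray:
    if human_action is not None:                       # human intervention wins
        return human_action
    if proxy_value(state, rl_action) >= proxy_value(state, physics_action):
        return rl_action
    return physics_action                              # fall back to physics baseline

# toy usage: a proxy value that prefers mild steering/acceleration
proxy_value = lambda s, a: -float(np.abs(a).sum())
state = np.zeros(4)
print(select_action(state, rl_action=np.array([0.9, 0.1]),
                    physics_action=np.array([0.2, 0.0]),
                    proxy_value=proxy_value))          # physics action is chosen
```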

[AI-126] Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages

链接: https://arxiv.org/abs/2409.00856
作者: William Zhang,Maria Leon,Ryan Xu,Adrian Cardenas,Amelia Wissink,Hanna Martin,Maya Srikanth,Kaya Dorogi,Christian Valadez,Pedro Perez,Citlalli Grijalva,Corey Zhang,Mark Santolucito
关键词-EN: arts coding domains, code, media arts coding, Node-based programming languages, code generation
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Node-based programming languages are increasingly popular in media arts coding domains. These languages are designed to be accessible to users with limited coding experience, allowing them to achieve creative output without an extensive programming background. Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity. However, the best strategy for code generation for visual node-based programming languages is still an open question. In particular, such languages have multiple levels of representation in text, each of which may be used for code generation. In this work, we explore the performance of LLM code generation in audio programming tasks in visual programming languages at multiple levels of representation. We explore code generation through metaprogramming code representations for these languages (i.e., coding the language using a different high-level text-based programming language), as well as through direct node generation with JSON. We evaluate code generated in this way for two visual languages for audio programming on a benchmark set of coding problems. We measure both correctness and complexity of the generated code. We find that metaprogramming results in more semantically correct generated code, given that the code is well-formed (i.e., is syntactically correct and runs). We also find that prompting for richer metaprogramming using randomness and loops led to more complex code.

[AI-127] JaxLife: An Open-Ended Agentic Simulator

链接: https://arxiv.org/abs/2409.00853
作者: Chris Lu,Michael Beukman,Michael Matthews,Jakob Foerster
关键词-EN: Human intelligence emerged, Human intelligence, evolution on Earth, intelligence emerged, natural selection
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Human intelligence emerged through the process of natural selection and evolution on Earth. We investigate what it would take to re-create this process in silico. While past work has often focused on low-level processes (such as simulating physics or chemistry), we instead take a more targeted approach, aiming to evolve agents that can accumulate open-ended culture and technologies across generations. Towards this, we present JaxLife: an artificial life simulator in which embodied agents, parameterized by deep neural networks, must learn to survive in an expressive world containing programmable systems. First, we describe the environment and show that it can facilitate meaningful Turing-complete computation. We then analyze the evolved emergent agents’ behavior, such as rudimentary communication protocols, agriculture, and tool use. Finally, we investigate how complexity scales with the amount of compute used. We believe JaxLife takes a step towards studying evolved behavior in more open-ended simulations. Our code is available at this https URL

[AI-128] The Design of an LLM-powered Unstructured Analytics System

链接: https://arxiv.org/abs/2409.00847
作者: Eric Anderson,Jonathan Fritz,Austin Lee,Bohou Li,Mark Lindblad,Henry Lindeman,Alex Meyer,Parth Parmar,Tanvi Ranade,Mehul A. Shah,Benjamin Sowell,Dan Tecuci,Vinayak Thapliyal,Matt Welsh
关键词-EN: process unstructured data, uncanny ability, ability to process, search and run, Aryn
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.

[AI-129] Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

链接: https://arxiv.org/abs/2409.00844
作者: Blair Yang,Fuyang Cui,Keiran Paster,Jimmy Ba,Pashootan Vaezipoor,Silviu Pitis,Michael R. Zhang
关键词-EN: conventional quantitative benchmarks, large language models, make it difficult, rapid development, development and dynamic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.

[AI-130] Entropy Loss: An Interpretability Amplifier of 3D Object Detection Network for Intelligent Driving

链接: https://arxiv.org/abs/2409.00839
作者: Haobo Yang,Shiyan Zhang,Zhuoyi Yang,Xinyu Zhang,Li Wang,Yifan Tang,Jilong Guo,Jun Li
关键词-EN: Entropy Loss, intelligent driving perception, loss, intelligent driving, Entropy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a "black box." This paper introduces a novel type of loss function, termed "Entropy Loss," along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47% compared to models without Entropy Loss, underscoring the method's efficacy. The implementation code is available at this https URL.
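
One way to operationalize such an entropy loss, sketched below under simplifying assumptions, is to model each compression layer's output as a Gaussian random variable, estimate its differential entropy, and penalize any increase in entropy from one layer to the next. The paper's exact probabilistic model and weighting are not reproduced.

```python
# Hedged sketch of an "entropy loss" in the spirit described above: model each
# feature-compression layer's output as a Gaussian random variable, estimate its
# differential entropy, and penalize layers whose entropy fails to decrease
# relative to the previous layer. The exact formulation in the paper may differ.
import math
import torch

def gaussian_entropy(features: torch.Tensor) -> torch.Tensor:
    """Differential entropy of a Gaussian fit to the flattened activations."""
    var = features.float().var() + 1e-8
    return 0.5 * torch.log(2 * math.pi * math.e * var)

def entropy_loss(layer_outputs) -> torch.Tensor:
    """Penalize any increase of entropy from one compression layer to the next."""
    entropies = [gaussian_entropy(f) for f in layer_outputs]
    penalty = torch.zeros(())
    for h_prev, h_next in zip(entropies[:-1], entropies[1:]):
        penalty = penalty + torch.relu(h_next - h_prev)   # should be non-increasing
    return penalty

# toy usage: three "layer outputs" with shrinking variance incur ~zero penalty
outs = [torch.randn(8, 64) * s for s in (1.0, 0.7, 0.5)]
print(entropy_loss(outs).item())
```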

[AI-131] You-Only-Randomize-Once: Shaping Statistical Properties in Constraint-based PCG

链接: https://arxiv.org/abs/2409.00837
作者: Jediah Katz,Bahar Bateni,Adam M. Smith
关键词-EN: procedural content generation, procedural content, define local, constraint satisfaction problem, constraint
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Published in Foundations of Digital Games (FDG) 2024. 10 pages, 6 figures

点击查看摘要

Abstract:In procedural content generation, modeling the generation task as a constraint satisfaction problem lets us define local and global constraints on the generated output. However, a generator’s perceived quality often involves statistics rather than just hard constraints. For example, we may desire that generated outputs use design elements with a similar distribution to that of reference designs. However, such statistical properties cannot be expressed directly as a hard constraint on the generation of any one output. In contrast, methods which do not use a general-purpose constraint solver, such as Gumin’s implementation of the WaveFunctionCollapse (WFC) algorithm, can control output statistics but have limited constraint propagation ability and cannot express non-local constraints. In this paper, we introduce You-Only-Randomize-Once (YORO) pre-rolling, a method for crafting a decision variable ordering for a constraint solver that encodes desired statistics in a constraint-based generator. Using a solver-based WFC as an example, we show that this technique effectively controls the statistics of tile-grid outputs generated by several off-the-shelf SAT solvers, while still enforcing global constraints on the outputs. Our approach is immediately applicable to WFC-like generation problems and it offers a conceptual starting point for controlling the design element statistics in other constraint-based generators.
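To make the pre-rolling idea concrete, here is a small sketch under our own assumptions (a toy tile vocabulary and a generic weighted ordering; the paper targets branching heuristics for SAT solvers): each grid cell gets a value ordering sampled exactly once from the target tile distribution, and a solver that honors this ordering tends to reproduce those statistics while hard constraints remain enforced.

```python
# Illustrative sketch of "You Only Randomize Once" pre-rolling (not the paper's code).
import random

def preroll_ordering(width, height, tile_weights, seed=0):
    """Return, for each cell, a tile-value ordering sampled once from the target distribution."""
    rng = random.Random(seed)
    tiles = list(tile_weights)
    ordering = {}
    for x in range(width):
        for y in range(height):
            # Weighted sampling without replacement gives a per-cell preference list.
            remaining = tiles[:]
            weights = [tile_weights[t] for t in remaining]
            order = []
            while remaining:
                t = rng.choices(remaining, weights=weights, k=1)[0]
                i = remaining.index(t)
                remaining.pop(i)
                weights.pop(i)
                order.append(t)
            ordering[(x, y)] = order
    return ordering

# Example: favour 'grass' tiles 8:1 over 'water' in a 4x4 grid.
print(preroll_ordering(4, 4, {"grass": 8.0, "water": 1.0})[(0, 0)])
```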

[AI-132] Building FKG.in: a Knowledge Graph for Indian Food

链接: https://arxiv.org/abs/2409.00830
作者: Saransh Kumar Gupta,Lipika Dey,Partha Pratim Das,Ramesh Jain
关键词-EN: multilingual semantic reasoning, semantic reasoning techniques, Indian food, assimilating culinary information, Indian
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop

点击查看摘要

Abstract:This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.

[AI-133] Accelerating Hybrid Agent-Based Models and Fuzzy Cognitive Maps: How to Combine Agents who Think Alike?

链接: https://arxiv.org/abs/2409.00824
作者: Philippe J. Giabbanelli,Jack T. Beerman
关键词-EN: create detailed artificial, detailed artificial societies, artificial societies based, local context, computationally intensive
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: To appear at the 2024 Winter Simulation Conference

点击查看摘要

Abstract:While Agent-Based Models can create detailed artificial societies based on individual differences and local context, they can be computationally intensive. Modelers may offset these costs through a parsimonious use of the model, for example by using smaller population sizes (which limits analyses in sub-populations), running fewer what-if scenarios, or accepting more uncertainty by performing fewer simulations. Alternatively, researchers may accelerate simulations via hardware solutions (e.g., GPU parallelism) or approximation approaches that trade accuracy against compute time. In this paper, we present an approximation that combines agents who “think alike”, thus reducing the population size and the compute time. Our innovation relies on representing agent behaviors as networks of rules (Fuzzy Cognitive Maps) and empirically evaluating different measures of distance between these networks. Then, we form groups of think-alike agents via community detection and simplify them to a representative agent. Case studies show that our simplifications preserve accuracy.
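The sketch below illustrates the grouping step under our own simplifying assumptions (a mean absolute difference stands in for the distance measures the paper compares, and NetworkX's greedy modularity communities stand in for its community detection); it is not the authors' code.

```python
# Hedged sketch: group "think-alike" agents by distance between their
# Fuzzy Cognitive Map weight matrices, then merge each group into one representative.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def fcm_distance(A, B):
    # One simple choice among many possible distance measures between FCM matrices.
    return float(np.abs(A - B).mean())

def merge_think_alike(fcms, threshold=0.35):
    n = len(fcms)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if fcm_distance(fcms[i], fcms[j]) < threshold:
                G.add_edge(i, j)
    if G.number_of_edges() == 0:          # nobody thinks alike: nothing to merge
        return list(fcms)
    groups = greedy_modularity_communities(G)
    # One representative FCM per community: the element-wise mean.
    return [np.mean([fcms[i] for i in g], axis=0) for g in groups]

rng = np.random.default_rng(0)
agents = [rng.random((4, 4)) for _ in range(10)]
print(len(merge_think_alike(agents)))     # number of representative agents
```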

[AI-134] Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

链接: https://arxiv.org/abs/2409.00815
作者: Hao Shi,Yuan Gao,Zhaoheng Ni,Tatsuya Kawahara
关键词-EN: Serialized output training, attracts increasing attention, increasing attention due, automatic speech recognition, output training
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Serialized output training (SOT) attracts increasing attention due to its convenience and flexibility for multi-speaker automatic speech recognition (ASR). However, it is not easy to train with attention loss only. In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. This additional separator is inserted after the encoder to extract the multi-speaker information with CTC losses. Furthermore, we propose the serialized speech information guidance SOT (GEncSep) to further utilize the separated encodings. The separated streams are concatenated to provide single-speaker information to guide attention during decoding. The experimental results on LibriMix show that the single-speaker encoding can be separated from the overlapped encoding. The CTC loss helps to improve the encoder representation under complex scenarios. GEncSep further improved performance.

[AI-135] A Novel Self-Attention-Enabled Weighted Ensemble-Based Convolutional Neural Network Framework for Distributed Denial of Service Attack Classification

链接: https://arxiv.org/abs/2409.00810
作者: Kanthimathi S,Shravan Venkatraman,Jayasankar K S,Pranay Jiljith T,Jashwanth R
关键词-EN: disrupt network services, Distributed Denial, compromise sensitive data, Denial of Service, network services
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 3 tables, 9 figures

点击查看摘要

Abstract:Distributed Denial of Service (DDoS) attacks are a major concern in network security, as they overwhelm systems with excessive traffic, compromise sensitive data, and disrupt network services. Accurately detecting these attacks is crucial to protecting network infrastructure. Traditional approaches, such as single Convolutional Neural Networks (CNNs) or conventional Machine Learning (ML) algorithms like Decision Trees (DTs) and Support Vector Machines (SVMs), struggle to extract the diverse features needed for precise classification, resulting in suboptimal performance. This research addresses this gap by introducing a novel approach for DDoS attack detection. The proposed method combines three distinct CNN architectures: SA-Enabled CNN with XGBoost, SA-Enabled CNN with LSTM, and SA-Enabled CNN with Random Forest. Each model extracts features at multiple scales, while self-attention mechanisms enhance feature integration and relevance. The weighted ensemble approach ensures that both prominent and subtle features contribute to the final classification, improving adaptability to evolving attack patterns and novel threats. The proposed method achieves a precision of 98.71%, an F1-score of 98.66%, a recall of 98.63%, and an accuracy of 98.69%, outperforming traditional methods and setting a new benchmark in DDoS attack detection. This innovative approach addresses critical limitations in current models and advances the state of the art in network security.

[AI-136] Diffusion based multi-domain neuroimaging harmonization method with preservation of anatomical details

链接: https://arxiv.org/abs/2409.00807
作者: Haoyu Lan,Bino A. Varghese,Nasim Sheikh-Bahaei,Farshid Sepehrband,Arthur W Toga,Jeiran Choupan
关键词-EN: face technical variability, technical variability due, reduce technical variability, studies face technical, Multi-center neuroimaging studies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Multi-center neuroimaging studies face technical variability due to batch differences across sites, which potentially hinders data aggregation and impacts study reliability. Recent efforts in neuroimaging harmonization have aimed to minimize these technical gaps and reduce technical variability across batches. While Generative Adversarial Networks (GANs) have been a prominent method for addressing image harmonization tasks, GAN-harmonized images suffer from artifacts or anatomical distortions. Given the advancements of denoising diffusion probabilistic models, which produce high-fidelity images, we have assessed the efficacy of the diffusion model for neuroimaging harmonization. We have demonstrated the diffusion model’s superior capability in harmonizing images from multiple domains, while GAN-based methods are limited to harmonizing images between two domains per model. Our experiments highlight that the learned domain-invariant anatomical condition reinforces the model to accurately preserve the anatomical details while differentiating batch differences at each diffusion step. Our proposed method has been tested on two public neuroimaging datasets, ADNI1 and ABIDE II, yielding harmonization results with consistent anatomy preservation and superior FID scores compared to the GAN-based methods. We have conducted multiple analyses, including extensive quantitative and qualitative evaluations against the baseline models, an ablation study showcasing the benefits of the learned conditions, and improvements in the consistency of perivascular spaces (PVS) segmentation through harmonization.

[AI-137] The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

链接: https://arxiv.org/abs/2409.00787
作者: Bocheng Chen,Hanqing Guo,Guangjing Wang,Yuanda Wang,Qiben Yan
关键词-EN: demonstrated great capabilities, Large Language Models, intricate alignment process, natural language understanding, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great capabilities in natural language understanding and generation, largely attributed to the intricate alignment process using human feedback. While alignment has become an essential training component that leverages data collected from user queries, it inadvertently opens up an avenue for a new type of user-guided poisoning attacks. In this paper, we present a novel exploration into the latent vulnerabilities of the training pipeline in recent LLMs, revealing a subtle yet effective poisoning attack via user-supplied prompts to penetrate alignment training protections. Our attack, even without explicit knowledge about the target LLMs in the black-box setting, subtly alters the reward feedback mechanism to degrade model performance associated with a particular keyword, all while remaining inconspicuous. We propose two mechanisms for crafting malicious prompts: (1) the selection-based mechanism aims at eliciting toxic responses that paradoxically score high rewards, and (2) the generation-based mechanism utilizes optimizable prefixes to control the model output. By injecting 1% of these specially crafted prompts into the data, through malicious users, we demonstrate a toxicity score up to two times higher when a specific trigger word is used. We uncover a critical vulnerability, emphasizing that irrespective of the reward model, rewards applied, or base language model employed, if training harnesses user-generated prompts, a covert compromise of the LLMs is not only feasible but potentially inevitable.

[AI-138] Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification

链接: https://arxiv.org/abs/2409.00755
作者: Haojian Huang,Chuanyu Qin,Zhe Liu,Kaijing Ma,Jin Chen,Han Fang,Chao Ban,Hao Sun,Zhongjiang He
关键词-EN: faces inherent challenges, inherent challenges due, Evidential Deep Learning, faces inherent, inherent challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Ongoing work: 13 pages, 13 figures, 12 tables

点击查看摘要

Abstract:Multi-view classification (MVC) faces inherent challenges due to domain gaps and inconsistencies across different views, often resulting in uncertainties during the fusion process. While Evidential Deep Learning (EDL) has been effective in addressing view uncertainty, existing methods predominantly rely on the Dempster-Shafer combination rule, which is sensitive to conflicting evidence and often neglects the critical role of neighborhood structures within multi-view data. To address these limitations, we propose a Trusted Unified Feature-NEighborhood Dynamics (TUNED) model for robust MVC. This method effectively integrates local and global feature-neighborhood (F-N) structures for robust decision-making. Specifically, we begin by extracting local F-N structures within each view. To further mitigate potential uncertainties and conflicts in multi-view fusion, we employ a selective Markov random field that adaptively manages cross-view neighborhood dependencies. Additionally, we employ a shared parameterized evidence extractor that learns global consensus conditioned on local F-N structures, thereby enhancing the global integration of multi-view features. Experiments on benchmark datasets show that our method improves accuracy and robustness over existing approaches, particularly in scenarios with high uncertainty and conflicting views. The code will be made available at this https URL.

[AI-139] Cooperative Path Planning with Asynchronous Multiagent Reinforcement Learning

链接: https://arxiv.org/abs/2409.00754
作者: Jiaming Yin,Weixiong Rao,Yu Xiao,Keshuang Tang
关键词-EN: minimize average travel, shortest path problem, average travel time, multiple source-destination pairs, source-destination pairs
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we study the shortest path problem (SPP) with multiple source-destination pairs (MSD), namely MSD-SPP, to minimize the average travel time of all shortest paths. The inherent traffic capacity limits within a road network contribute to the competition among vehicles. A multi-agent reinforcement learning (MARL) model cannot offer effective and efficient path planning cooperation due to the asynchronous decision-making setting in MSD-SPP, where vehicles (a.k.a. agents) cannot simultaneously complete routing actions in the previous time step. To tackle the efficiency issue, we propose to divide an entire road network into multiple sub-graphs and subsequently execute a two-stage process of inter-region and intra-region route planning. To address the asynchronous issue, in the proposed asyn-MARL framework, we first design a global state, which exploits a low-dimensional vector to implicitly represent the joint observations and actions of multi-agents. Then we develop a novel trajectory collection mechanism to decrease the redundancy in training trajectories. Additionally, we design a novel actor network to facilitate the cooperation among vehicles towards the same or close destinations and a reachability graph aimed at preventing infinite loops in routing paths. On both synthetic and real road networks, our evaluation results demonstrate that our approach outperforms state-of-the-art planning approaches.
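The two-stage idea can be pictured with a small NetworkX sketch (our own toy example, not the asyn-MARL learner): plan a coarse route over a contracted region graph first, then plan the fine-grained path restricted to the selected regions.

```python
# Illustrative two-stage routing over a partitioned road network (assumed structure).
import networkx as nx

def two_stage_route(G, region_of, source, target):
    # Stage 1: inter-region plan on the contracted region graph.
    R = nx.Graph()
    for u, v in G.edges():
        if region_of[u] != region_of[v]:
            R.add_edge(region_of[u], region_of[v])
    src_r, dst_r = region_of[source], region_of[target]
    region_path = [src_r] if src_r == dst_r else nx.shortest_path(R, src_r, dst_r)
    # Stage 2: intra-region plan, restricted to nodes of the regions on the coarse route.
    allowed = {n for n in G if region_of[n] in set(region_path)}
    return nx.shortest_path(G.subgraph(allowed), source, target)

G = nx.grid_2d_graph(4, 4)
region_of = {n: (0 if n[0] < 2 else 1) for n in G}   # two regions: left / right half
print(two_stage_route(G, region_of, (0, 0), (3, 3)))
```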

[AI-140] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

链接: https://arxiv.org/abs/2409.00750
作者: Yuancheng Wang,Haoyue Zhan,Liwei Liu,Ruihong Zeng,Haotian Guo,Jiachen Zheng,Qiang Zhang,Shunsi Zhang,Zhizheng Wu
关键词-EN: Generative Codec Transformer, primarily divided, Nowadays, TTS, Masked Generative Codec
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Nowadays, large-scale text-to-speech (TTS) systems are primarily divided into two types: autoregressive and non-autoregressive. The autoregressive systems have certain deficiencies in robustness and cannot control speech duration. In contrast, non-autoregressive systems require explicit prediction of phone-level duration, which may compromise their naturalness. We introduce the Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive model for TTS that does not require precise alignment information between text and speech. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. We scale MaskGCT to a large-scale multilingual dataset with 100K hours of in-the-wild speech. Our experiments demonstrate that MaskGCT achieves superior or competitive performance compared to state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility while offering higher generation efficiency than diffusion-based or autoregressive TTS models. Audio samples are available at this https URL.
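For readers unfamiliar with the mask-and-predict objective, the toy sketch below (a generic masked-token model, not MaskGCT itself; vocabulary size, dimensions, and names are our assumptions) shows the training step: mask a random fraction of discrete tokens and train a bidirectional Transformer to recover them.

```python
# Minimal illustration of mask-and-predict training on discrete token sequences.
import torch
import torch.nn as nn

VOCAB, MASK_ID, DIM = 1024, 1024, 256  # MASK_ID is an extra "mask" token id

class MaskedTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def mask_and_predict_step(model, tokens, mask_ratio=0.5):
    mask = torch.rand(tokens.shape) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    # Loss only on masked positions, as in mask-and-predict training.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

model = MaskedTokenModel()
tokens = torch.randint(0, VOCAB, (2, 50))
print(mask_and_predict_step(model, tokens).item())
```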

[AI-141] Interpretable Clustering: A Survey

链接: https://arxiv.org/abs/2409.00743
作者: Lianyu Hu,Mudi Jiang,Junjie Dong,Xinying Liu,Zengyou He
关键词-EN: recent years, accuracy and efficiency, expense of interpretability, primarily focused, focused on enhancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent.

[AI-142] Simulation of Social Media-Driven Bubble Formation in Financial Markets using an Agent-Based Model with Hierarchical Influence Network

链接: https://arxiv.org/abs/2409.00742
作者: Gonzalo Bohorquez,John Cartlidge
关键词-EN: tree-like hierarchical structure, hierarchical structure represents, financial markets, social media influences, structure represents
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
*备注: 11 pages, 7 figures, To appear in Proceedings of 36th European Modeling and Simulation Symposium (EMSS), 21st International Multidisciplinary Modelling and Simulation Multiconference (I3M), Tenerife, Spain, Sep. 2024

点击查看摘要

Abstract:We propose that a tree-like hierarchical structure represents a simple and effective way to model the emergent behaviour of financial markets, especially markets where there exists a pronounced intersection between social media influences and investor behaviour. To explore this hypothesis, we introduce an agent-based model of financial markets, where trading agents are embedded in a hierarchical network of communities, and communities influence the strategies and opinions of traders. Empirical analysis of the model shows that its behaviour conforms to several stylized facts observed in real financial markets; and the model is able to realistically simulate the effects that social media-driven phenomena, such as echo chambers and pump-and-dump schemes, have on financial markets.

[AI-143] AgGym: An agricultural biotic stress simulation environment for ultra-precision management planning

链接: https://arxiv.org/abs/2409.00735
作者: Mahsa Khosravi,Matthew Carroll,Kai Liang Tan,Liza Van der Laan,Joscif Raigne,Daren S. Mueller,Arti Singh,Aditya Balu,Baskar Ganapathysubramanian,Asheesh Kumar Singh,Soumik Sarkar
关键词-EN: Agricultural production requires, superior seed quality, requires careful management, production requires careful, Agricultural production
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Agricultural production requires careful management of inputs such as fungicides, insecticides, and herbicides to ensure a successful crop that is high-yielding, profitable, and of superior seed quality. Current state-of-the-art field crop management relies on coarse-scale crop management strategies, where entire fields are sprayed with pest and disease-controlling chemicals, leading to increased cost and sub-optimal soil and crop management. To overcome these challenges and optimize crop production, we utilize machine learning tools within a virtual field environment to generate localized management plans for farmers to manage biotic threats while maximizing profits. Specifically, we present AgGym, a modular, crop and stress agnostic simulation framework to model the spread of biotic stresses in a field and estimate yield losses with and without chemical treatments. Our validation with real data shows that AgGym can be customized with limited data to simulate yield outcomes under various biotic stress conditions. We further demonstrate that deep reinforcement learning (RL) policies can be trained using AgGym for designing ultra-precise biotic stress mitigation strategies with potential to increase yield recovery with less chemicals and lower cost. Our proposed framework enables personalized decision support that can transform biotic stress management from being schedule based and reactive to opportunistic and prescriptive. We also release the AgGym software implementation as a community resource and invite experts to contribute to this open-sourced publicly available modular environment framework. The source code can be accessed at: this https URL.

[AI-144] Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph

链接: https://arxiv.org/abs/2409.00727
作者: Yuxiang Wang,Xiao Yan,Shiyu Jin,Quanqing Xu,Chuanhui Yang,Yuanyuan Zhu,Chuang Hu,Bo Du,Jiawei Jiang
关键词-EN: Text-attributed graph, graph structured data, graph structured, important type, node
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
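A minimal sketch of the node-perturbation augmentation, under our own assumptions about the graph representation (a dense adjacency matrix and fixed add/drop probabilities); Hound's actual implementation and hyperparameters may differ.

```python
# Illustrative edge add/drop augmentation that yields diversified node views,
# which could then be matched against node texts in a contrastive loss.
import numpy as np

def perturb_edges(adj, drop_prob=0.1, add_prob=0.01, seed=0):
    rng = np.random.default_rng(seed)
    a = adj.copy()
    drop = (rng.random(a.shape) < drop_prob) & (a == 1)
    add = (rng.random(a.shape) < add_prob) & (a == 0)
    a[drop] = 0
    a[add] = 1
    np.fill_diagonal(a, 0)
    return np.maximum(a, a.T)  # keep the graph undirected

adj = (np.random.default_rng(1).random((6, 6)) < 0.3).astype(int)
adj = np.maximum(adj, adj.T)
print(perturb_edges(adj))
```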

[AI-145] LPUWF-LDM: Enhanced Latent Diffusion Model for Precise Late-phase UWF-FA Generation on Limited Dataset

链接: https://arxiv.org/abs/2409.00726
作者: Zhaojie Fang,Xiao Yu,Guanyu Zhou,Ke Zhuang,Yifei Chen,Ruiquan Ge,Changmiao Wang,Gangyong Jia,Qing Wu,Juan Ye,Maimaiti Nuliqiman,Peifang Xu,Ahmed Elazab
关键词-EN: enables precise identification, Scanning Laser Ophthalmoscopy, high-quality late-phase UWF-FA, late-phase UWF-FA, Late-Phase Fluorescein Angiography
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Ultra-Wide-Field Fluorescein Angiography (UWF-FA) enables precise identification of ocular diseases using sodium fluorescein, which can be potentially harmful. Existing research has developed methods to generate UWF-FA from Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) to reduce the adverse reactions associated with injections. However, these methods have been less effective in producing high-quality late-phase UWF-FA, particularly in lesion areas and fine details. Two primary challenges hinder the generation of high-quality late-phase UWF-FA: the scarcity of paired UWF-SLO and early/late-phase UWF-FA datasets, and the need for realistic generation at lesion sites and potential blood leakage regions. This study introduces an improved latent diffusion model framework to generate high-quality late-phase UWF-FA from limited paired UWF images. To address the challenges as mentioned earlier, our approach employs a module utilizing Cross-temporal Regional Difference Loss, which encourages the model to focus on the differences between early and late phases. Additionally, we introduce a low-frequency enhanced noise strategy in the diffusion forward process to improve the realism of medical images. To further enhance the mapping capability of the variational autoencoder module, especially with limited datasets, we implement a Gated Convolutional Encoder to extract additional information from conditional images. Our Latent Diffusion Model for Ultra-Wide-Field Late-Phase Fluorescein Angiography (LPUWF-LDM) effectively reconstructs fine details in late-phase UWF-FA and achieves state-of-the-art results compared to other existing methods when working with limited datasets. Our source code is available at: this https URL.

[AI-146] Who Would Chatbots Vote For? Political Preferences of ChatGPT and Gemini in the 2024 European Union Elections

链接: https://arxiv.org/abs/2409.00721
作者: Michael Haman,Milan Školník
关键词-EN: European Parliament elections, large language models, European Parliament, Parliament elections, European Free Alliance
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This study examines the political bias of chatbots powered by large language models, namely ChatGPT and Gemini, in the context of the 2024 European Parliament elections. The research focused on the evaluation of political parties represented in the European Parliament across 27 EU Member States by these generative artificial intelligence (AI) systems. The methodology involved daily data collection through standardized prompts on both platforms. The results revealed a stark contrast: while Gemini mostly refused to answer political questions, ChatGPT provided consistent ratings. The analysis showed a significant bias in ChatGPT in favor of left-wing and centrist parties, with the highest ratings for the Greens/European Free Alliance. In contrast, right-wing parties, particularly the Identity and Democracy group, received the lowest ratings. The study identified key factors influencing the ratings, including attitudes toward European integration and perceptions of democratic values. The findings highlight the need for a critical approach to information provided by generative AI systems in a political context and call for more transparency and regulation in this area.

[AI-147] Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

链接: https://arxiv.org/abs/2409.00717
作者: Natalia Zhang,Xinqi Wang,Qiwen Cui,Runlong Zhou,Sham M. Kakade,Simon S. Du
关键词-EN: Human Feedback, Multi-Agent Reinforcement Learning, identifying Nash equilibrium, empirical validations, Nash equilibrium
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.

[AI-148] ReMOVE: A Reference-free Metric for Object Erasure CVPR2024

链接: https://arxiv.org/abs/2409.00707
作者: Aditya Chandrasekar,Goirik Chakrabarty,Jai Bardhan,Ramya Hebbalaguppe,Prathosh AP
关键词-EN: editing models post-generation, diffusion-based image editing, assessing object erasure, object erasure efficacy, erasure efficacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at The First Workshop on the Evaluation of Generative Foundation Models (EvGENFM) at CVPR 2024

点击查看摘要

Abstract:We introduce ReMOVE, a novel reference-free metric for assessing object erasure efficacy in diffusion-based image editing models post-generation. Unlike existing measures such as LPIPS and CLIPScore, ReMOVE addresses the challenge of evaluating inpainting without a reference image, which is common in practical scenarios. It effectively distinguishes between object removal and replacement. This is a key issue in diffusion models due to the stochastic nature of image generation. Traditional metrics fail to align with the intuitive definition of inpainting, which aims for (1) seamless object removal within masked regions (2) while preserving the background continuity. ReMOVE not only correlates with state-of-the-art metrics and aligns with human perception but also captures the nuanced aspects of the inpainting process, providing a finer-grained evaluation of the generated outputs.

[AI-149] Abstaining Machine Learning – Philosophical Considerations

链接: https://arxiv.org/abs/2409.00706
作者: Daniela Schuster
关键词-EN: machine learning systems, machine learning, abstaining machine learning, behaving neutrally, establishes a connection
类目: Artificial Intelligence (cs.AI)
*备注: Part of the published PhD Thesis: Daniela Schuster. Suspension of Judgment in Artificial Intelligence-Uncovering Uncertainty in Data-Based and Logic-Based Systems. PhD thesis, University of Konstanz, 2024. this http URL

点击查看摘要

Abstract:This paper establishes a connection between the fields of machine learning (ML) and philosophy concerning the phenomenon of behaving neutrally. It investigates a specific class of ML systems capable of delivering a neutral response to a given task, referred to as abstaining machine learning systems, that has not yet been studied from a philosophical perspective. The paper introduces and explains various abstaining machine learning systems, and categorizes them into distinct types. An examination is conducted on how abstention in the different machine learning system types aligns with the epistemological counterpart of suspended judgment, addressing both the nature of suspension and its normative profile. Additionally, a philosophical analysis is suggested on the autonomy and explainability of the abstaining response. It is argued, specifically, that one of the distinguished types of abstaining systems is preferable as it aligns more closely with our criteria for suspended judgment. Moreover, it is better equipped to autonomously generate abstaining outputs and offer explanations for abstaining outputs when compared to the other type.

[AI-150] Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

链接: https://arxiv.org/abs/2409.00700
作者: Yan Rong,Li Liu
关键词-EN: Face-based Voice Conversion, speaker voice style, target speaker voice, Voice Conversion, leverages facial images
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker’s voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the speaker’s voice identity information, and (2) inadequacy in decoupling content and speaker identity information from the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations. More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. Besides, unlike prior works, our method can accept either audio or text inputs, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity. Project website with audio samples and code can be found at this https URL.

[AI-151] Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation

链接: https://arxiv.org/abs/2409.00696
作者: Jasper Dekoninck,Maximilian Baader,Martin Vechev
关键词-EN: Rating-based human evaluation, Rating-based human, Large language models, essential tool, tool to accurately
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of Large language models (LLMs). However, current rating systems suffer from several critical limitations. Specifically, they fail to account for human biases that significantly influence evaluation results, require large and expensive preference datasets to obtain accurate ratings, and do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower costs. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Furthermore, Polyrating can reduce the cost of human evaluations by up to 41% for new models and up to 77% for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLM’s strengths, weaknesses, and relative performance across different applications.
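As background for the maximum a posteriori formulation, the sketch below fits a plain Bradley-Terry model with a Gaussian prior by gradient ascent; Polyrating's bias terms, multi-task sharing, and reuse of benchmark scores are not modeled here, and all names are ours.

```python
# Hedged sketch: MAP rating from pairwise preferences (Bradley-Terry likelihood + Gaussian prior).
import numpy as np

def map_ratings(n_models, comparisons, prior_var=1.0, lr=0.05, steps=2000):
    """comparisons: list of (winner_index, loser_index) pairs."""
    r = np.zeros(n_models)
    for _ in range(steps):
        grad = -r / prior_var                        # gradient of the Gaussian log-prior
        for w, l in comparisons:
            p_w = 1.0 / (1.0 + np.exp(r[l] - r[w]))  # Bradley-Terry win probability
            grad[w] += 1.0 - p_w
            grad[l] -= 1.0 - p_w
        r += lr * grad
    return r

comps = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 6 + [(2, 1)] * 1
print(map_ratings(3, comps).round(2))  # model 0 rated above model 1, which is above model 2
```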

[AI-152] Curriculum Prompting Foundation Models for Medical Image Segmentation MICCAI2024

链接: https://arxiv.org/abs/2409.00695
作者: Xiuqi Zheng,Yuhang Zhang,Haoran Zhang,Hongrui Liang,Xueqi Bao,Zhuqing Jiang,Qicheng Lao
关键词-EN: Adapting large pre-trained, pre-trained foundation models, large pre-trained foundation, Adapting large, foundation models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Adapting large pre-trained foundation models, e.g., SAM, for medical image segmentation remains a significant challenge. A crucial step involves the formulation of a series of specialized prompts that incorporate specific clinical instructions. Past works have been heavily reliant on a singular type of prompt for each instance, necessitating manual input of an ideally correct prompt, which is less efficient. To tackle this issue, we propose to utilize prompts of different granularity, which are sourced from original images to provide a broader scope of clinical insights. However, combining prompts of varying types can pose a challenge due to potential conflicts. In response, we have designed a coarse-to-fine mechanism, referred to as curriculum prompting, that progressively integrates prompts of different types. Through extensive experiments on three public medical datasets across various modalities, we demonstrate the effectiveness of our proposed approach, which not only automates the prompt generation process but also yields superior performance compared to other SAM-based medical image segmentation methods. Code is available at: this https URL.

[AI-153] When Heterophily Meets Heterogeneous Graphs: Latent Graphs Guided Unsupervised Representation Learning

链接: https://arxiv.org/abs/2409.00687
作者: Zhixiang Shen,Zhao Kang
关键词-EN: gained increasing attention, increasing attention due, Unsupervised Representation Learning, handling practical graphs, representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 14 pages

点击查看摘要

Abstract:Unsupervised heterogeneous graph representation learning (UHGRL) has gained increasing attention due to its significance in handling practical graphs without labels. However, heterophily has been largely ignored, despite its ubiquitous presence in real-world heterogeneous graphs. In this paper, we define semantic heterophily and propose an innovative framework called Latent Graphs Guided Unsupervised Representation Learning (LatGRL) to handle this problem. First, we develop a similarity mining method that couples global structures and attributes, enabling the construction of fine-grained homophilic and heterophilic latent graphs to guide the representation learning. Moreover, we propose an adaptive dual-frequency semantic fusion mechanism to address the problem of node-level semantic heterophily. To cope with the massive scale of real-world data, we further design a scalable implementation. Extensive experiments on benchmark datasets validate the effectiveness and efficiency of our proposed framework. The source code and datasets have been made available at this https URL.

[AI-154] Comprehensive Botnet Detection by Mitigating Adversarial Attacks Navigating the Subtleties of Perturbation Distances and Fortifying Predictions with Conformal Layers

链接: https://arxiv.org/abs/2409.00667
作者: Rahul Yumlembam,Biju Issac,Seibu Mary Jacob,Longzhi Yang
关键词-EN: significant cybersecurity challenges, present significant cybersecurity, computer networks controlled, cybersecurity challenges, controlled by malicious
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 46 pages

点击查看摘要

Abstract:Botnets are computer networks controlled by malicious actors that present significant cybersecurity challenges. They autonomously infect, propagate, and coordinate to conduct cybercrimes, necessitating robust detection methods. This research addresses the sophisticated adversarial manipulations posed by attackers, aiming to undermine machine learning-based botnet detection systems. We introduce a flow-based detection approach, leveraging machine learning and deep learning algorithms trained on the ISCX and ISOT datasets. The detection algorithms are optimized using the Genetic Algorithm and Particle Swarm Optimization to obtain a baseline detection method. The Carlini-Wagner (CW) attack and Generative Adversarial Network (GAN) generate deceptive data with subtle perturbations, targeting each feature used for classification while preserving their semantic and syntactic relationships, which ensures that the adversarial samples retain meaningfulness and realism. An in-depth analysis of the L2 distance from the original sample required for a malware sample to be misclassified is performed across various iteration checkpoints, showing different levels of misclassification at different L2 distances between the perturbed sample and the original sample. Our work delves into the vulnerability of various models, examining the transferability of adversarial examples from a Neural Network surrogate model to Tree-based algorithms. Subsequently, models that initially misclassified the perturbed samples are retrained, enhancing their resilience and detection capabilities. In the final phase, a conformal prediction layer is integrated, which rejects incorrect predictions at rates of 58.20% on the ISCX dataset and 98.94% on the ISOT dataset.
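The L2-distance analysis can be pictured with a tiny helper (our own toy classifier and feature vectors, not the paper's models or datasets): report the perturbation norm and whether the decision flipped.

```python
# Hedged sketch of measuring perturbation size versus misclassification.
import numpy as np

def perturbation_report(model_predict, x, x_adv):
    l2 = float(np.linalg.norm(x_adv - x))
    flipped = model_predict(x) != model_predict(x_adv)
    return {"l2_distance": l2, "misclassified": bool(flipped)}

# Toy stand-in classifier: thresholds the sum of flow features.
predict = lambda v: int(v.sum() > 0)
x = np.array([0.2, -0.1, 0.3])
x_adv = x + np.array([-0.3, -0.2, -0.3])   # small crafted perturbation
print(perturbation_report(predict, x, x_adv))
```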

[AI-155] Artificial Intelligence in Gastrointestinal Bleeding Analysis for Video Capsule Endoscopy: Insights, Innovations and Prospects (2008-2023)

链接: https://arxiv.org/abs/2409.00639
作者: Tanisha Singh,Shreshtha Jha,Nidhi Bhatt,Palak Handa,Nidhi Goel,Sreedevi Indu
关键词-EN: escalating global mortality, traditional endoscopic methods, underscore the urgent, addressing this condition, Video Capsule Endoscopy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The escalating global mortality and morbidity rates associated with gastrointestinal (GI) bleeding, compounded by the complexities and limitations of traditional endoscopic methods, underscore the urgent need for a critical review of current methodologies used for addressing this condition. With an estimated 300,000 annual deaths worldwide, the demand for innovative diagnostic and therapeutic strategies is paramount. The introduction of Video Capsule Endoscopy (VCE) has marked a significant advancement, offering a comprehensive, non-invasive visualization of the digestive tract that is pivotal for detecting bleeding sources unattainable by traditional methods. Despite its benefits, the efficacy of VCE is hindered by diagnostic challenges, including time-consuming analysis and susceptibility to human error. This backdrop sets the stage for exploring Machine Learning (ML) applications in automating GI bleeding detection within capsule endoscopy, aiming to enhance diagnostic accuracy, reduce manual labor, and improve patient outcomes. Through an exhaustive analysis of 113 papers published between 2008 and 2023, this review assesses the current state of ML methodologies in bleeding detection, highlighting their effectiveness, challenges, and prospective directions. It contributes an in-depth examination of AI techniques in VCE frame analysis, offering insights into open-source datasets, mathematical performance metrics, and technique categorization. The paper sets a foundation for future research to overcome existing challenges, advancing gastrointestinal diagnostics through interdisciplinary collaboration and innovation in ML applications.

[AI-156] Entity-Aware Biaffine Attention Model for Improved Constituent Parsing with Reduced Entity Violations

链接: https://arxiv.org/abs/2409.00625
作者: Xinyi Bai
关键词-EN: Constituency parsing involves, parsing involves analyzing, Constituency parsing, involves analyzing, Constituency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Constituency parsing involves analyzing a sentence by breaking it into sub-phrases, or constituents. While many deep neural models have achieved state-of-the-art performance in this task, they often overlook the entity-violating issue, where an entity fails to form a complete sub-tree in the resultant parsing tree. To address this, we propose an entity-aware biaffine attention model for constituent parsing. This model incorporates entity information into the biaffine attention mechanism by using additional entity role vectors for potential phrases, which enhances the parsing accuracy. We introduce a new metric, the Entity Violating Rate (EVR), to quantify the extent of entity violations in parsing results. Experiments on three popular datasets (ONTONOTES, PTB, and CTB) demonstrate that our model achieves the lowest EVR while maintaining high precision, recall, and F1-scores comparable to existing models. Further evaluation in downstream tasks, such as sentence sentiment analysis, highlights the effectiveness of our model and the validity of the proposed EVR metric.

[AI-157] Enhancing Vectorized Map Perception with Historical Rasterized Maps ECCV2024

链接: https://arxiv.org/abs/2409.00620
作者: Xiaoyu Zhang,Guangwei Liu,Zihao Liu,Ningyi Xu,Yunhui Liu,Ji Zhao
关键词-EN: high-cost offline high-definition, replace traditional high-cost, traditional high-cost offline, Historical Rasterized Map, map
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:In autonomous driving, there is growing interest in end-to-end online vectorized map perception in bird’s-eye-view (BEV) space, with an expectation that it could replace traditional high-cost offline high-definition (HD) maps. However, the accuracy and robustness of these methods can be easily compromised in challenging conditions, such as occlusion or adverse weather, when relying only on onboard sensors. In this paper, we propose HRMapNet, leveraging a low-cost Historical Rasterized Map to enhance online vectorized map perception. The historical rasterized map can be easily constructed from past predicted vectorized results and provides valuable complementary information. To fully exploit a historical map, we propose two novel modules to enhance BEV features and map element queries. For BEV features, we employ a feature aggregation module to encode features from both onboard images and the historical map. For map element queries, we design a query initialization module to endow queries with priors from the historical map. The two modules contribute to leveraging map information in online perception. Our HRMapNet can be integrated with most online vectorized map perception methods. We integrate it in two state-of-the-art methods, significantly improving their performance on both the nuScenes and Argoverse 2 datasets. The source code is released at this https URL.
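A minimal sketch of how past vectorized predictions could be rasterized into a BEV grid that later frames query as a low-cost map prior; grid size, resolution, and coordinate conventions are our assumptions, not HRMapNet's.

```python
# Illustrative rasterization of vectorized map elements into a binary BEV grid.
import numpy as np

def rasterize_polylines(polylines, grid_size=200, resolution=0.5):
    """polylines: list of (N, 2) arrays in metres, ego-centred; returns a binary BEV grid."""
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    half = grid_size * resolution / 2.0
    for line in polylines:
        for (x0, y0), (x1, y1) in zip(line[:-1], line[1:]):
            for t in np.linspace(0.0, 1.0, 20):           # densify each segment
                x, y = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
                i = int((x + half) / resolution)
                j = int((y + half) / resolution)
                if 0 <= i < grid_size and 0 <= j < grid_size:
                    grid[j, i] = 1
    return grid

lane = np.array([[-10.0, 0.0], [0.0, 2.0], [10.0, 5.0]])
print(rasterize_polylines([lane]).sum(), "occupied cells")
```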

[AI-158] Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models CIKM2024

链接: https://arxiv.org/abs/2409.00617
作者: Yifan Wei,Xiaoyan Yu,Yixuan Weng,Huanhuan Ma,Yuanzhe Zhang,Jun Zhao,Kang Liu
关键词-EN: demonstrated superior performance, Large language models, language processing tasks, natural language processing, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: CIKM 2024

点击查看摘要

Abstract:Large language models encapsulate knowledge and have demonstrated superior performance on various natural language processing tasks. Recent studies have localized this knowledge to specific model parameters, such as the MLP weights in intermediate layers. This study investigates the differences between entity and relational knowledge through knowledge editing. Our findings reveal that entity and relational knowledge cannot be directly transferred or mapped to each other. This result is unexpected, as logically, modifying the entity or the relation within the same knowledge triplet should yield equivalent outcomes. To further elucidate the differences between entity and relational knowledge, we employ causal analysis to investigate how relational knowledge is stored in pre-trained models. Contrary to prior research suggesting that knowledge is stored in MLP weights, our experiments demonstrate that relational knowledge is also significantly encoded in attention modules. This insight highlights the multifaceted nature of knowledge storage in language models, underscoring the complexity of manipulating specific types of knowledge within these models.

[AI-159] DAMe: Personalized Federated Social Event Detection with Dual Aggregation Mechanism CIKM2024

链接: https://arxiv.org/abs/2409.00614
作者: Xiaoyan Yu,Yifan Wei,Pu Li,Shuaishuai Zhou,Hao Peng,Li Sun,Liehuang Zhu,Philip S. Yu
关键词-EN: improve participants’ performance, Training social event, Training social, event detection models, social event detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: CIKM 2024

点击查看摘要

Abstract:Training social event detection models through federated learning (FedSED) aims to improve participants’ performance on the task. However, existing federated learning paradigms are inadequate for achieving FedSED’s objective and exhibit limitations in handling the inherent heterogeneity in social data. This paper proposes a personalized federated learning framework with a dual aggregation mechanism for social event detection, namely DAMe. We present a novel local aggregation strategy utilizing Bayesian optimization to incorporate global knowledge while retaining local characteristics. Moreover, we introduce a global aggregation strategy to provide clients with maximum external knowledge of their preferences. In addition, we incorporate a global-local event-centric constraint to prevent local overfitting and “client-drift”. Experiments within a realistic simulation of a natural federated setting, utilizing six social event datasets spanning six languages and two social media platforms, along with an ablation study, have demonstrated the effectiveness of the proposed framework. Further robustness analyses have shown that DAMe is resistant to injection attacks.

[AI-160] Hyper-Compression: Model Compression via Hyperfunction

链接: https://arxiv.org/abs/2409.00592
作者: Fenglei Fan,Juntong Fan,Dayang Wang,Jingbo Zhang,Zelin Dong,Shijun Zhang,Ge Wang,Tieyong Zeng
关键词-EN: large models’ size, GPU memory, rapid growth, growth of large, large models’
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The rapid growth of large models’ size has far outpaced that of GPU memory. To bridge this gap, inspired by the succinct relationship between genotype and phenotype, we turn the model compression problem into the issue of parameter representation to propose the so-called hyper-compression. The hyper-compression uses a hyperfunction to represent the parameters of the target network, and notably, here the hyperfunction is designed following ergodic theory, which relates to the question of whether a low-dimensional dynamic system can eventually fill a high-dimensional space. Empirically, the proposed hyper-compression enjoys the following merits: 1) Preferable compression ratio; 2) No post-hoc retraining; 3) Affordable inference time; and 4) Short compression time. It compresses LLaMA2-7B in an hour and achieves close-to-int4-quantization performance, without retraining and with a performance drop of less than 1%. Our work has the potential to invigorate the field of model compression, towards a harmony between the scaling law and the stagnation of hardware upgradation.
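As a toy picture of the hyperfunction idea, the sketch below approximates a small block of parameters with a single scalar via the ergodic map h(theta, i) = frac(i * theta), searching theta on a grid; this only illustrates parameter representation and is far simpler than the paper's actual method.

```python
# Toy sketch: represent 8 parameters (assumed to lie in [0, 1)) with one scalar theta.
import numpy as np

def hyperfunction(theta, n):
    return np.modf(np.arange(1, n + 1) * theta)[0]        # fractional parts in [0, 1)

def compress_block(w, grid=100000):
    thetas = np.linspace(0.0, 1.0, grid, endpoint=False)
    i = np.arange(1, len(w) + 1)
    candidates = np.modf(np.outer(thetas, i))[0]          # table of h(theta, i) values
    errs = np.abs(candidates - w).max(axis=1)
    best = int(np.argmin(errs))
    return thetas[best], float(errs[best])

w = np.random.default_rng(0).random(8)       # pretend these are 8 weights in [0, 1)
theta, err = compress_block(w)               # store one float instead of eight
print(theta, err, hyperfunction(theta, 8))
```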

[AI-161] FastBO: Fast HPO and NAS with Adaptive Fidelity Identification ECCV2024

链接: https://arxiv.org/abs/2409.00584
作者: Jiantong Jiang,Ajmal Mian
关键词-EN: neural architecture search, machine learning models, Bayesian optimization, Hyperparameter optimization, architecture search
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: The 18th European Conference on Computer Vision ECCV 2024 Women in Computer Vision Workshop

点击查看摘要

Abstract:Hyperparameter optimization (HPO) and neural architecture search (NAS) are powerful in attaining state-of-the-art machine learning models, with Bayesian optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic, but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge, we propose a multi-fidelity BO method named FastBO, which adaptively decides the fidelity for each configuration and efficiently offers strong performance. The advantages are achieved based on the novel concepts of efficient point and saturation point for each configuration. We also show that our adaptive fidelity identification strategy provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting its generality and applicability.
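The "saturation point" intuition can be illustrated with a simple learning-curve rule (the threshold and window are our assumptions; FastBO's definitions differ): stop raising a configuration's fidelity once recent improvements stay below a tolerance.

```python
# Hedged sketch of adaptive fidelity identification from a partial learning curve.
def saturation_point(curve, eps=5e-3, window=3):
    """Return the first epoch after which the last `window` improvements stay below eps."""
    for t in range(window, len(curve)):
        recent_gains = [curve[i] - curve[i - 1] for i in range(t - window + 1, t + 1)]
        if all(g < eps for g in recent_gains):
            return t
    return len(curve) - 1

curve = [0.60, 0.72, 0.78, 0.81, 0.812, 0.8125, 0.8126, 0.8127]
print(saturation_point(curve))  # epoch at which this configuration's fidelity stops growing
```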

[AI-162] Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs

链接: https://arxiv.org/abs/2409.00571
作者: Nafis Tanveer Islam,Joseph Khoury,Andrew Seong,Elias Bou-Harb,Peyman Najafirad
关键词-EN: Large Language Models, Artificial Intelligence, Large Language, establishing clear guidelines, recent unprecedented advancements
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the recent unprecedented advancements in Artificial Intelligence (AI) computing, progress in Large Language Models (LLMs) is accelerating rapidly, presenting challenges in establishing clear guidelines, particularly in the field of security. That being said, we thoroughly identify and describe three main technical challenges in the security and software engineering literature that span the entire LLM workflow, namely: (i) Data Collection and Labeling; (ii) System Design and Learning; and (iii) Performance Evaluation. Building upon these challenges, this paper introduces SecRepair, an instruction-based LLM system designed to reliably identify, describe, and automatically repair vulnerable source code. Our system is accompanied by a list of actionable guides on (i) Data Preparation and Augmentation Techniques; (ii) Selecting and Adapting state-of-the-art LLM Models; (iii) Evaluation Procedures. SecRepair uses reinforcement learning-based fine-tuning with a semantic reward that caters to the functionality and security aspects of the generated code. Our empirical analysis shows that SecRepair achieves a 12% improvement in security code repair compared to other LLMs when trained using reinforcement learning. Furthermore, we demonstrate the capabilities of SecRepair in generating reliable, functional, and compilable security code repairs against real-world test cases using automated evaluation metrics.

[AI-163] Learning to Ask: When LLMs Meet Unclear Instruction

链接: https://arxiv.org/abs/2409.00557
作者: Wenxuan Wang,Juluan Shi,Chaozheng Wang,Cheryl Lee,Youliang Yuan,Jen-tse Huang,Michael R. Lyu
关键词-EN: modern large language, large language models, leverage external tools, language models, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.
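A minimal sketch of the Ask-when-Needed behavior (the schema, function names, and wording below are hypothetical, not the paper's API): before executing a tool call, compare the arguments extracted from the instruction against the tool's required parameters and ask the user instead of guessing.

```python
# Hedged sketch of an "ask instead of hallucinate" tool-call gate.
TOOL_SCHEMA = {"book_flight": {"required": ["origin", "destination", "date"]}}

def plan_tool_call(tool, extracted_args):
    missing = [a for a in TOOL_SCHEMA[tool]["required"] if a not in extracted_args]
    if missing:
        # Surface a clarifying question rather than inventing the missing argument.
        return {"action": "ask_user",
                "question": f"Could you tell me your {', '.join(missing)}?"}
    return {"action": "call_tool", "tool": tool, "args": extracted_args}

print(plan_tool_call("book_flight", {"origin": "SFO", "date": "2024-09-10"}))
```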

[AI-164] Multi-Output Distributional Fairness via Post-Processing

链接: https://arxiv.org/abs/2409.00553
作者: Gang Li,Qihang Lin,Ayush Ghosh,Tianbao Yang
关键词-EN: low computational cost, machine learning models’, learning models’ fairness, low computational, computational cost
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:The post-processing approaches are becoming prominent techniques to enhance machine learning models’ fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model’s distributional parity, a task-agnostic fairness measure. Existing techniques to achieve distributional parity are based on the (inverse) cumulative density function of a model’s output, which is limited to single-output models. Extending previous works, our method employs an optimal transport mapping to move a model’s outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter and a kernel regression method is proposed for extending this process to out-of-sample data. Our empirical studies, which compare our method to current existing post-processing baselines on multi-task/multi-class classification and representation learning tasks, demonstrate the effectiveness of the proposed approach.
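For intuition, here is a one-dimensional toy of barycenter-based repair (the paper targets multi-output models with optimal transport; in 1-D the Wasserstein barycenter reduces to averaging the groups' quantile functions, which is all this sketch shows, and the group distributions are invented):

```python
# Hedged 1-D sketch: push each group's scores toward the groups' Wasserstein barycenter.
import numpy as np

rng = np.random.default_rng(0)
scores_a = rng.normal(0.3, 0.1, 500)    # model outputs for group A
scores_b = rng.normal(0.6, 0.2, 500)    # model outputs for group B

qs = np.linspace(0, 1, 101)
barycenter_q = 0.5 * (np.quantile(scores_a, qs) + np.quantile(scores_b, qs))

def repair(scores):
    """Map each score to the barycenter quantile at its within-group rank."""
    ranks = np.searchsorted(np.sort(scores), scores) / len(scores)
    return np.interp(ranks, qs, barycenter_q)

# After repair, both groups share (approximately) the same output distribution.
print(abs(repair(scores_a).mean() - repair(scores_b).mean()))
```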

[AI-165] sting and Evaluation of Large Language Models : Correctness Non-Toxicity and Fairness

链接: https://arxiv.org/abs/2409.00551
作者: Wenxuan Wang
关键词-EN: extraordinary conversational skills, Large language models, Large language, past few years, rapidly penetrated
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: PhD Thesis

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT, have rapidly penetrated into people’s work and daily lives over the past few years, due to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in terms of user numbers in human history and become an important foundational model for the next generation of artificial intelligence applications. However, the generations of LLMs are not entirely reliable, often producing content with factual errors, biases, and toxicity. Given their vast number of users and wide range of application scenarios, these unreliable responses can lead to many serious negative impacts. This thesis introduces the exploratory works in the field of language model reliability during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, to evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works for red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, to measure the social bias and cultural bias of LLMs, respectively.

[AI-166] Data Augmentation for Image Classification using Generative AI

链接: https://arxiv.org/abs/2409.00547
作者: Fazle Rahat,M Shifat Hossain,Md Rubel Ahmed,Sumit Kumar Jha,Rickard Ewetz
关键词-EN: Scaling laws dictate, Scaling laws, laws dictate, Scaling, Data augmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 15 figures, 4 tables

点击查看摘要

Abstract:Scaling laws dictate that the performance of AI models is proportional to the amount of available data. Data augmentation is a promising solution to expanding the dataset size. Traditional approaches focused on augmentation using rotation, translation, and resizing. Recent approaches use generative AI models to improve dataset diversity. However, the generative methods struggle with issues such as subject corruption and the introduction of irrelevant artifacts. In this paper, we propose the Automated Generative Data Augmentation (AGA). The framework combines the utility of large language models (LLMs), diffusion models, and segmentation models to augment data. AGA preserves foreground authenticity while ensuring background diversity. Specific contributions include: i) segment and superclass based object extraction, ii) prompt diversity with combinatorial complexity using prompt decomposition, and iii) affine subject manipulation. We evaluate AGA against state-of-the-art (SOTA) techniques on three representative datasets, ImageNet, CUB, and iWildCam. The experimental evaluation demonstrates an accuracy improvement of 15.6% and 23.5% for in and out-of-distribution data compared to baseline models, respectively. There is also a 64.3% improvement in SIC score compared to the baselines.
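As a toy of the prompt-decomposition contribution only (the fragments below are made up; the full AGA pipeline additionally involves LLMs, diffusion-based generation, and segmentation), combinatorial prompt diversity can be obtained by crossing independent prompt components:

```python
# Hedged sketch: combinatorial background prompts from decomposed fragments.
from itertools import product

subjects   = ["a sparrow", "a warbler"]
settings   = ["on a mossy branch", "near a pond"]
conditions = ["at sunrise", "in light rain"]

prompts = [" ".join(parts) for parts in product(subjects, settings, conditions)]
print(len(prompts), "prompts, e.g.:", prompts[0])   # 8 prompts from 6 fragments
```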

[AI-167] Large Language Models -Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors

链接: https://arxiv.org/abs/2409.00544
作者: Jacqueline Lammert,Nicole Pfarr,Leonid Kuligin,Sonja Mathes,Tobias Dreyer,Luise Modersohn,Patrick Metzger,Dyke Ferber,Jakob Nikolas Kather,Daniel Truhn,Lisa Christine Adams,Keno Kyrill Bressem,Sebastian Lange,Kristina Schwamborn,Martin Boeker,Marion Kiechle,Ulrich A. Schatz,Holger Bronger,Maximilian Tschochohei
关键词-EN: Rare gynecological tumors, Rare gynecological, present major clinical, major clinical challenges, clinical challenges due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: 20 pages, 2 figures, 3 tables, supplements, original article

点击查看摘要

Abstract:Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. The lack of clear guidelines leads to suboptimal management and poor prognosis. Molecular tumor boards accelerate access to effective therapies by tailoring treatment based on biomarkers, beyond cancer type. Unstructured data that requires manual curation hinders efficient use of biomarker profiling for therapy matching. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n=21) and literature-derived data (n=655 publications with n=404,265 patients) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.

[AI-168] Mapping earth mounds from space

链接: https://arxiv.org/abs/2409.00518
作者: Baki Uzun,Shivam Pande,Gwendal Cachin-Bernard,Minh-Tan Pham,Sébastien Lefèvre,Rumais Blatrix,Doyle McKey
关键词-EN: Regular patterns, considered widespread landscapes, considered widespread, global extent, climate change
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Regular patterns of vegetation are considered widespread landscapes, although their global extent has never been estimated. Among them, spotted landscapes are of particular interest in the context of climate change. Indeed, regularly spaced vegetation spots in semi-arid shrublands result from extreme resource depletion and prefigure catastrophic shift of the ecosystem to a homogeneous desert, while termite mounds also producing spotted landscapes were shown to increase robustness to climate change. Yet, their identification at large scale calls for automatic methods, for instance using the popular deep learning framework, able to cope with a vast amount of remote sensing data, e.g., optical satellite imagery. In this paper, we tackle this problem and benchmark some state-of-the-art deep networks on several landscapes and geographical areas. Despite the promising results we obtained, we found that more research is needed to be able to map automatically these earth mounds from space.

[AI-169] Plant detection from ultra high resolution remote sensing images: A Semantic Segmentation approach based on fuzzy loss

链接: https://arxiv.org/abs/2409.00513
作者: Shivam Pande,Baki Uzun,Florent Guiotte,Thomas Corpetti,Florian Delerue,Sébastien Lefèvre
关键词-EN: ultra high resolution, remote sensing images, RGB remote sensing, identifying plant species, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures, 2 tables

点击查看摘要

Abstract:In this study, we tackle the challenge of identifying plant species from ultra high resolution (UHR) remote sensing images. Our approach involves introducing an RGB remote sensing dataset, characterized by millimeter-level spatial resolution, meticulously curated through several field expeditions across a mountainous region in France covering various landscapes. The task of plant species identification is framed as a semantic segmentation problem for its practical and efficient implementation across vast geographical areas. However, when dealing with segmentation masks, we confront instances where distinguishing boundaries between plant species and their background is challenging. We tackle this issue by introducing a fuzzy loss within the segmentation model. Instead of utilizing one-hot encoded ground truth (GT), our model incorporates Gaussian filter refined GT, introducing stochasticity during training. First experimental results obtained on both our UHR dataset and a public dataset are presented, showing the relevance of the proposed methodology, as well as the need for future improvement.

[AI-170] Streamlining Forest Wildfire Surveillance: AI-Enhanced UAVs Utilizing the FLAME Aerial Video Dataset for Lightweight and Efficient Monitoring IROS

链接: https://arxiv.org/abs/2409.00510
作者: Lemeng Zhao,Junjie Hu,Jianchao Bi,Yanbing Bai,Erick Mas,Shunichi Koshimura
关键词-EN: increasingly crucial role, unmanned aerial vehicles, analyzing aerial images, supporting disaster emergency, emergency response efforts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: accpeted by Proceedings of the International Conference on Intelligent Robots and Systems (2024 IROS)

点击查看摘要

Abstract:In recent years, unmanned aerial vehicles (UAVs) have played an increasingly crucial role in supporting disaster emergency response efforts by analyzing aerial images. While current deep-learning models focus on improving accuracy, they often overlook the limited computing resources of UAVs. This study recognizes the imperative for real-time data processing in disaster response scenarios and introduces a lightweight and efficient approach for aerial video understanding. Our methodology identifies redundant portions within the video through policy networks and eliminates this excess information using frame compression techniques. Additionally, we introduced the concept of a 'station point', which leverages future information in the sequential policy network, thereby enhancing accuracy. To validate our method, we employed the wildfire FLAME dataset. Compared to the baseline, our approach reduces computation costs by more than 13 times while boosting accuracy by 3%. Moreover, our method can intelligently select salient frames from the video, refining the dataset. This feature enables sophisticated models to be effectively trained on a smaller dataset, significantly reducing the time spent during the training process.

[AI-171] GenAI-powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems

链接: https://arxiv.org/abs/2409.00494
作者: Haowen Xu,Jinghui Yuan,Anye Zhou,Guanhao Xu,Wan Li,Xuegang(Jeff)Ban,Xinyue Ye
关键词-EN: Leveraging recent advances, Leveraging recent, smart city applications, recent advances, increasingly being developed
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Leveraging recent advances in generative AI, multi-agent systems are increasingly being developed to enhance the functionality and efficiency of smart city applications. This paper explores the transformative potential of large language models (LLMs) and emerging Retrieval-Augmented Generation (RAG) technologies in Intelligent Transportation Systems (ITS), paving the way for innovative solutions to address critical challenges in urban mobility. We begin by providing a comprehensive overview of the current state-of-the-art in mobility data, ITS, and Connected Vehicles (CV) applications. Building on this review, we discuss the rationale behind RAG and examine the opportunities for integrating these Generative AI (GenAI) technologies into the smart mobility sector. We propose a conceptual framework aimed at developing multi-agent systems capable of intelligently and conversationally delivering smart mobility services to urban commuters, transportation operators, and decision-makers. Our approach seeks to foster an autonomous and intelligent approach that (a) promotes science-based advisory to reduce traffic congestion, accidents, and carbon emissions at multiple scales, (b) facilitates public education and engagement in participatory mobility management, and (c) automates specialized transportation management tasks and the development of critical ITS platforms, such as data analytics and interpretation, knowledge representation, and traffic simulations. By integrating LLM and RAG, our approach seeks to overcome the limitations of traditional rule-based multi-agent systems, which rely on fixed knowledge bases and limited reasoning capabilities. This integration paves the way for a more scalable, intuitive, and automated multi-agent paradigm, driving advancements in ITS and urban mobility.

[AI-172] Geospatial foundation models for image analysis: evaluating and enhancing NASA-IBM Prithvi's domain adaptability

链接: https://arxiv.org/abs/2409.00489
作者: Chia-Yu Hsu,Wenwen Li,Sizhe Wang
关键词-EN: achieving high generalizability, research due, reducing model training, geospatial artificial intelligence, model training costs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Research on geospatial foundation models (GFMs) has become a trending topic in geospatial artificial intelligence (AI) research due to their potential for achieving high generalizability and domain adaptability, reducing model training costs for individual researchers. Unlike large language models, such as ChatGPT, constructing visual foundation models for image analysis, particularly in remote sensing, encountered significant challenges such as formulating diverse vision tasks into a general problem framework. This paper evaluates the recently released NASA-IBM GFM Prithvi for its predictive performance on high-level image analysis tasks across multiple benchmark datasets. Prithvi was selected because it is one of the first open-source GFMs trained on time-series of high-resolution remote sensing imagery. A series of experiments were designed to assess Prithvi’s performance as compared to other pre-trained task-specific AI models in geospatial image analysis. New strategies, including band adaptation, multi-scale feature generation, and fine-tuning techniques, are introduced and integrated into an image analysis pipeline to enhance Prithvi’s domain adaptation capability and improve model performance. In-depth analyses reveal Prithvi’s strengths and weaknesses, offering insights for both improving Prithvi and developing future visual foundation models for geospatial tasks.

[AI-173] Rapid Gyroscope Calibration: A Deep Learning Approach

链接: https://arxiv.org/abs/2409.00488
作者: Yair Stolero,Itzik Klein
关键词-EN: gyroscope, essential for ensuring, ensuring the accuracy, accuracy and reliability, calibration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 10 Pages, 14 Figures,

点击查看摘要

Abstract:Low-cost gyroscope calibration is essential for ensuring the accuracy and reliability of gyroscope measurements. Stationary calibration estimates the deterministic parts of measurement errors. To this end, a common practice is to average the gyroscope readings during a predefined period and estimate the gyroscope bias. Calibration duration plays a crucial role in performance, therefore, longer periods are preferred. However, some applications require quick startup times and calibration is therefore allowed only for a short time. In this work, we focus on reducing low-cost gyroscope calibration time using deep learning methods. We propose a deep-learning framework and explore the possibilities of using multiple real and virtual gyroscopes to improve the calibration performance of single gyroscopes. To train and validate our approach, we recorded a dataset consisting of 169 hours of gyroscope readings, using 24 gyroscopes of two different brands. We also created a virtual dataset consisting of simulated gyroscope readings. The two datasets were used to evaluate our proposed approach. One of our key achievements in this work is reducing gyroscope calibration time by up to 89% using three low-cost gyroscopes.
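For context, the stationary baseline that this work shortens can be sketched in a few lines (the noise level and sampling rate below are assumed values, not the paper's hardware): the bias estimate improves as more stationary samples are averaged, which is exactly why conventional calibration needs long startup times.

```python
# Hedged sketch of stationary gyroscope bias calibration by averaging.
import numpy as np

rng = np.random.default_rng(1)
true_bias = 0.02                                      # rad/s, constant bias
readings = true_bias + rng.normal(0.0, 0.05, 20_000)  # 200 s of stationary data at 100 Hz

for seconds in (5, 30, 200):
    est = readings[: seconds * 100].mean()
    print(f"{seconds:>3} s window -> bias error {abs(est - true_bias):.5f} rad/s")
```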

[AI-174] PSLF: A PID Controller-incorporated Second-order Latent Factor Analysis Model for Recommender System

链接: https://arxiv.org/abs/2409.00448
作者: Jialiang Wang,Yan Xia,Ye Yuan
关键词-EN: analysis model demonstrates, graph representation learning, demonstrates superior performance, interaction data, model demonstrates superior
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A second-order-based latent factor (SLF) analysis model demonstrates superior performance in graph representation learning, particularly for high-dimensional and incomplete (HDI) interaction data, by incorporating the curvature information of the loss landscape. However, its objective function is commonly bi-linear and non-convex, causing the SLF model to suffer from a low convergence rate. To address this issue, this paper proposes a PID controller-incorporated SLF (PSLF) model, leveraging two key strategies: a) refining learning error estimation by incorporating the PID controller principles, and b) acquiring second-order information insights through Hessian-vector products. Experimental results on multiple HDI datasets indicate that the proposed PSLF model outperforms four state-of-the-art latent factor models based on advanced optimizers regarding convergence rates and generalization performance.
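A minimal sketch of the PID ingredient (the gains and error sequence are illustrative; the actual PSLF update also involves Hessian-vector products and the latent-factor structure): the instantaneous learning error is replaced by a proportional-integral-derivative combination before it drives the parameter update.

```python
# Hedged sketch: PID-refined learning error signal.
class PIDError:
    def __init__(self, kp=1.0, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev = 0.0, 0.0

    def __call__(self, err):
        self.integral += err                 # accumulate past errors (I term)
        derivative = err - self.prev         # error trend (D term)
        self.prev = err
        return self.kp * err + self.ki * self.integral + self.kd * derivative

pid = PIDError()
for raw_err in (1.0, 0.6, 0.35, 0.2):        # shrinking raw training errors
    print(round(pid(raw_err), 4))            # the refined error used in the update
```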

[AI-175] The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

链接: https://arxiv.org/abs/2409.00447
作者: I. de Rodrigo,A. Sanchez-Cuadrado,J. Boal,A. J. Lopez-Lopez
关键词-EN: MERIT Dataset, fully labeled dataset, Visually-rich Document Understanding, introduces the MERIT, demanding Visually-rich Document
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces the MERIT Dataset, a multimodal (text + image + layout) fully labeled dataset within the context of school reports. Comprising over 400 labels and 33k samples, the MERIT Dataset is a valuable resource for training models in demanding Visually-rich Document Understanding (VrDU) tasks. By its nature (student grade reports), the MERIT Dataset can potentially include biases in a controlled way, making it a valuable tool to benchmark biases induced in Language Models (LLMs). The paper outlines the dataset’s generation pipeline and highlights its main features in the textual, visual, layout, and bias domains. To demonstrate the dataset’s utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models and that these would greatly benefit from including samples from the MERIT Dataset in their pretraining phase.

[AI-176] Breaking Down Financial News Impact: A Novel AI Approach with Geometric Hypergraphs

链接: https://arxiv.org/abs/2409.00438
作者: Anoushka Harit,Zhongtian Sun,Jongmin Yu,Noura Al Moubayed
关键词-EN: accurately predicting stock, predicting stock movements, stock movements based, volatile financial markets, Explainable Artificial Intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, conference

点击查看摘要

Abstract:In the fast-paced and volatile financial markets, accurately predicting stock movements based on financial news is critical for investors and analysts. Traditional models often struggle to capture the intricate and dynamic relationships between news events and market reactions, limiting their ability to provide actionable insights. This paper introduces a novel approach leveraging Explainable Artificial Intelligence (XAI) through the development of a Geometric Hypergraph Attention Network (GHAN) to analyze the impact of financial news on market behaviours. Geometric hypergraphs extend traditional graph structures by allowing edges to connect multiple nodes, effectively modelling high-order relationships and interactions among financial entities and news events. This unique capability enables the capture of complex dependencies, such as the simultaneous impact of a single news event on multiple stocks or sectors, which traditional models frequently overlook. By incorporating attention mechanisms within hypergraphs, GHAN enhances the model’s ability to focus on the most relevant information, ensuring more accurate predictions and better interpretability. Additionally, we employ BERT-based embeddings to capture the semantic richness of financial news texts, providing a nuanced understanding of the content. Using a comprehensive financial news dataset, our GHAN model addresses key challenges in financial news impact analysis, including the complexity of high-order interactions, the necessity for model interpretability, and the dynamic nature of financial markets. Integrating attention mechanisms and SHAP values within GHAN ensures transparency, highlighting the most influential factors driving market predictions. Empirical validation demonstrates the superior effectiveness of our approach over traditional sentiment analysis and time-series models.

[AI-177] Robust off-policy Reinforcement Learning via Soft Constrained Adversary

链接: https://arxiv.org/abs/2409.00418
作者: Kosuke Nakanishi,Akihiro Kubo,Yuji Yasui,Shin Ishii
关键词-EN: garnered significant attention, undergone rapid evolution, rapid evolution due, potential vulnerability, input observation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 33 pages, 12 figures, 2 tables

点击查看摘要

Abstract:Recently, robust reinforcement learning (RL) methods against input observation have garnered significant attention and undergone rapid evolution due to RL’s potential vulnerability. Although these advanced methods have achieved reasonable success, there have been two limitations when considering adversary in terms of long-term horizons. First, the mutual dependency between the policy and its corresponding optimal adversary limits the development of off-policy RL algorithms; although obtaining optimal adversary should depend on the current policy, this has restricted applications to off-policy RL. Second, these methods generally assume perturbations based only on the L_p-norm, even when prior knowledge of the perturbation distribution in the environment is available. We here introduce another perspective on adversarial RL: an f-divergence constrained problem with the prior knowledge distribution. From this, we derive two typical attacks and their corresponding robust learning frameworks. The evaluation of robustness is conducted and the results demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.

[AI-178] Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

链接: https://arxiv.org/abs/2409.00391
作者: Georgios Ioannides,Adrian Kieback,Aman Chadha,Aaron Elkins
关键词-EN: Speech-based depression detection, automated detection due, Speech-based depression, depression detection poses, poses significant challenges
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Speech-based depression detection poses significant challenges for automated detection due to its unique manifestation across individuals and data scarcity. Addressing these challenges, we introduce DAAMAudioCNNLSTM and DAAMAudioTransformer, two parameter efficient and explainable models for audio feature extraction and depression detection. DAAMAudioCNNLSTM features a novel CNN-LSTM framework with multi-head Density Adaptive Attention Mechanism (DAAM), focusing dynamically on informative speech segments. DAAMAudioTransformer, leveraging a transformer encoder in place of the CNN-LSTM architecture, incorporates the same DAAM module for enhanced attention and interpretability. These approaches not only enhance detection robustness and interpretability but also achieve state-of-the-art performance: DAAMAudioCNNLSTM with an F1 macro score of 0.702 and DAAMAudioTransformer with an F1 macro score of 0.72 on the DAIC-WOZ dataset, without reliance on supplementary information such as vowel positions and speaker information during training/validation as in previous approaches. Both models’ significant explainability and efficiency in leveraging speech signals for depression detection represent a leap towards more reliable, clinically useful diagnostic tools, promising advancements in speech and mental health care. To foster further research in this domain, we make our code publicly available.

[AI-179] Predicting Femicide in Veracruz: A Fuzzy Logic Approach with the Expanded MFM-FEM-VER-CP-2024 Model

链接: https://arxiv.org/abs/2409.00359
作者: Carlos Medel-Ramírez,Hilario Medel-López
关键词-EN: mathematical framework designed, predict femicide risk, predict femicide, femicide in Veracruz, fuzzy logic
类目: Artificial Intelligence (cs.AI)
*备注: 24 pages, 2 tables, 3 figures

点击查看摘要

Abstract:The article focuses on the urgent issue of femicide in Veracruz, Mexico, and the development of the MFM_FEM_VER_CP_2024 model, a mathematical framework designed to predict femicide risk using fuzzy logic. This model addresses the complexity and uncertainty inherent in gender-based violence by formalizing risk factors such as coercive control, dehumanization, and the cycle of violence. These factors are mathematically modeled through membership functions that assess the degree of risk associated with various conditions, including personal relationships and specific acts of violence. The study enhances the original model by incorporating new rules and refining existing membership functions, which significantly improves the model's predictive accuracy.
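To illustrate the kind of building block such a fuzzy model formalizes (the variables, breakpoints, and rule below are invented for illustration, not taken from MFM_FEM_VER_CP_2024), a risk factor can be encoded as a membership function and combined through fuzzy rules:

```python
# Hedged sketch: triangular membership functions and a single max-min fuzzy rule.
def triangular(x, a, b, c):
    """Membership in a fuzzy set with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

coercive_control = 0.7          # normalized indicator in [0, 1]
prior_violence   = 0.5

high_control = triangular(coercive_control, 0.4, 0.8, 1.0)
high_history = triangular(prior_violence, 0.3, 0.7, 1.0)

# Rule: IF control is high AND history is high THEN risk is high (min models AND).
risk_high = min(high_control, high_history)
print(round(risk_high, 3))      # degree to which the "high risk" rule fires
```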

[AI-180] Predicting the Target Word of Game-playing Conversations using a Low-Rank Dialect Adapter for Decoder Models

链接: https://arxiv.org/abs/2409.00358
作者: Dipankar Srirag,Aditya Joshi,Jacob Eisenstein
关键词-EN: LLMs for NLU, national varieties, sake of brevity, NLU tasks, reported for encoder
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 6 pages, 3 Figures, 5 Tables

点击查看摘要

Abstract:Dialect adapters that improve the performance of LLMs for NLU tasks on certain sociolects/dialects/national varieties (‘dialects’ for the sake of brevity) have been reported for encoder models. In this paper, we extend the idea of dialect adapters to decoder models in our architecture called LoRDD. Using MD-3, a publicly available dataset of word game-playing conversations between dialectal speakers, our task is Target Word Prediction (TWP) from a masked conversation. LoRDD combines task adapters and dialect adapters where the latter employ contrastive learning on pseudo-parallel conversations from MD-3. Our results for en-IN conversations on two models (Mistral and Gemma) show that LoRDD outperforms four baselines on TWP, while bridging the performance gap with en-US by 12% on word similarity and 25% on accuracy. The focused contribution of LoRDD is in its promise for dialect adaptation of decoder models.

[AI-181] Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology ICPR2024

链接: https://arxiv.org/abs/2409.00356
作者: Weinan Dai,Yifeng Jiang,Yuanjing Liu,Jinkun Chen,Xin Sun,Jinglei Tao
关键词-EN: substantial labeled data, paper addresses, addresses the persistent, persistent challenge, fundamental component
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: This paper has been accepted by the ICPR2024

点击查看摘要

Abstract:This paper addresses the persistent challenge in Keyword Spotting (KWS), a fundamental component in speech technology, regarding the acquisition of substantial labeled data for training. Given the difficulty in obtaining large quantities of positive samples and the laborious process of collecting new target samples when the keyword changes, we introduce a novel approach combining unsupervised contrastive learning and a unique augmentation-based technique. Our method allows the neural network to train on unlabeled data sets, potentially improving performance in downstream tasks with limited labeled data sets. We also propose that similar high-level feature representations should be employed for speech utterances with the same keyword despite variations in speed or volume. To achieve this, we present a speech augmentation-based unsupervised learning method that utilizes the similarity between the bottleneck layer feature and the audio reconstructing information for auxiliary training. Furthermore, we propose a compressed convolutional architecture to address potential redundancy and non-informative information in KWS tasks, enabling the model to simultaneously learn local features and focus on long-term information. This method achieves strong performance on the Google Speech Commands V2 Dataset. Inspired by recent advancements in sign spotting and spoken term detection, our method underlines the potential of our contrastive learning approach in KWS and the advantages of Query-by-Example Spoken Term Detection strategies. The presented CAB-KWS provide new perspectives in the field of KWS, demonstrating effective ways to reduce data collection efforts and increase the system’s robustness.

[AI-182] GSpect: Spectral Filtering for Cross-Scale Graph Classification

链接: https://arxiv.org/abs/2409.00338
作者: Xiaoyu Zhang,Wenchuan Yang,Jiawei Feng,Bitao Dai,Tianci Bu,Xin Lu
关键词-EN: Identifying structures, common forms, forms the basis, basis for networked, Identifying
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Identifying structures in common forms the basis for networked systems design and optimization. However, real structures represented by graphs are often of varying sizes, leading to the low accuracy of traditional graph classification methods. These graphs are called cross-scale graphs. To overcome this limitation, in this study, we propose GSpect, an advanced spectral graph filtering model for cross-scale graph classification tasks. Compared with other methods, we use graph wavelet neural networks for the convolution layer of the model, which aggregates multi-scale messages to generate graph representations. We design a spectral-pooling layer which aggregates nodes to one node to reduce the cross-scale graphs to the same size. We collect and construct the cross-scale benchmark data set, MSG (Multi Scale Graphs). Experiments reveal that, on open data sets, GSpect improves the performance of classification accuracy by 1.62% on average, and for a maximum of 3.33% on PROTEINS. On MSG, GSpect improves the performance of classification accuracy by 15.55% on average. GSpect fills the gap in cross-scale graph classification studies and has potential to provide assistance in application research like diagnosis of brain disease by predicting the brain network’s label and developing new drugs with molecular structures learned from their counterparts in other systems.

[AI-183] Evaluating the Effectiveness of Large Language Models in Representing and Understanding Movement Trajectories

链接: https://arxiv.org/abs/2409.00335
作者: Yuhan Ji,Song Gao
关键词-EN: Dynamic Time Warping, focuses on assessing, assessing the ability, Time Warping distances, foundation models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures

点击查看摘要

Abstract:This research focuses on assessing the ability of AI foundation models in representing the trajectories of movements. We utilize one of the large language models (LLMs) (i.e., GPT-J) to encode the string format of trajectories and then evaluate the effectiveness of the LLM-based representation for trajectory data analysis. The experiments demonstrate that while the LLM-based embeddings can preserve certain trajectory distance metrics (i.e., the correlation coefficients exceed 0.74 between the Cosine distance derived from GPT-J embeddings and the Hausdorff and Dynamic Time Warping distances on raw trajectories), challenges remain in restoring numeric values and retrieving spatial neighbors in movement trajectory analytics. In addition, the LLMs can understand the spatiotemporal dependency contained in trajectories and have good accuracy in location prediction tasks. This research highlights the need for improvement in terms of capturing the nuances and complexities of the underlying geospatial data and integrating domain knowledge to support various GeoAI applications using LLMs.

[AI-184] WikiCausal: Corpus and Evaluation Framework for Causal Knowledge Graph Construction ISWC2024

链接: https://arxiv.org/abs/2409.00331
作者: Oktie Hassanzadeh
关键词-EN: causal knowledge graphs, knowledge graph construction, causal knowledge, domain-specific causal knowledge, knowledge graphs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Extended version; poster paper accepted at ISWC 2024

点击查看摘要

Abstract:Recently, there has been an increasing interest in the construction of general-domain and domain-specific causal knowledge graphs. Such knowledge graphs enable reasoning for causal analysis and event prediction, and so have a range of applications across different domains. While great progress has been made toward automated construction of causal knowledge graphs, the evaluation of such solutions has either focused on low-level tasks (e.g., cause-effect phrase extraction) or on ad hoc evaluation data and small manual evaluations. In this paper, we present a corpus, task, and evaluation framework for causal knowledge graph construction. Our corpus consists of Wikipedia articles for a collection of event-related concepts in Wikidata. The task is to extract causal relations between event concepts from the corpus. The evaluation is performed in part using existing causal relations in Wikidata to measure recall, and in part using Large Language Models to avoid the need for manual or crowd-sourced evaluation. We evaluate a pipeline for causal knowledge graph construction that relies on neural models for question answering and concept linking, and show how the corpus and the evaluation framework allow us to effectively find the right model for each task. The corpus and the evaluation framework are publicly available.

[AI-185] Demo: FedCampus: A Real-world Privacy-preserving Mobile Application for Smart Campus via Federated Learning Analytics

链接: https://arxiv.org/abs/2409.00327
作者: Jiaxiang Geng,Beilong Tang,Boyan Zhang,Jiaqi Shao,Bing Luo
关键词-EN: privacy-preserving mobile application, federated learning, federated analytics, mobile application
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 2 pages, 3 figures, accepted for publication in ACM Mobihoc 2024

点击查看摘要

Abstract:In this demo, we introduce FedCampus, a privacy-preserving mobile application for smart campus with federated learning (FL) and federated analytics (FA). FedCampus enables cross-platform on-device FL/FA for both iOS and Android, supporting continuous model and algorithm deployment (MLOps). Our app integrates privacy-preserving processed data via differential privacy (DP) from smartwatches, where the processed parameters are used for FL/FA through the FedCampus backend platform. We distributed 100 smartwatches to volunteers at Duke Kunshan University and have successfully completed a series of smart campus tasks featuring capabilities such as sleep tracking, physical activity monitoring, personalized recommendations, and heavy hitters. Our project is opensourced at this https URL. See the FedCampus video at this https URL.
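As a toy of the differential-privacy step mentioned above (the mechanism, sensitivity, and epsilon are assumptions for illustration; the demo paper does not spell out its exact parameters here), a smartwatch statistic can be perturbed before it ever leaves the device:

```python
# Hedged sketch: Laplace-mechanism release of an on-device statistic.
import numpy as np

rng = np.random.default_rng(0)

def laplace_dp(value, sensitivity=1.0, epsilon=0.5):
    """Add Laplace noise with scale sensitivity/epsilon before sharing `value`."""
    return value + rng.laplace(0.0, sensitivity / epsilon)

daily_steps = 8423
print(round(laplace_dp(daily_steps)))   # noisy count sent to the FL/FA backend
```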

[AI-186] Toward a More Complete OMR Solution

链接: https://arxiv.org/abs/2409.00316
作者: Guang Yang(1),Muru Zhang(1),Lin Qiu(1),Yanming Wan(1),Noah A. Smith(1 and 2) ((1) Paul G. Allen School of Computer Science & Engineering, University of Washington, United States, (2) Allen Institute for Artificial Intelligence, United States)
关键词-EN: Optical music recognition, Optical music, aims to convert, digital formats, notation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.

[AI-187] An Empirical Study on Context Length for Open-Domain Dialog Generation

链接: https://arxiv.org/abs/2409.00315
作者: Xinyi Shen,Zuoquan Lin
关键词-EN: recent years, increasingly popular, popular in recent, context, Transformer-based open-domain dialog
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of a dialog history. However, there is no criterion to decide how many utterances should be kept adequate in a context. We try to figure out how the choice of context length affects the model. We experiment on three questions from coarse to fine: (i) Does longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models.

[AI-188] Objective Features Extracted from Motor Activity Time Series for Food Addiction Analysis Using Machine Learning

链接: https://arxiv.org/abs/2409.00310
作者: Mikhail Borisenkov,Andrei Velichko,Maksim Belyaev,Dmitry Korzun,Tatyana Tserne,Larisa Bakutova,Denis Gubin
关键词-EN: diagnosing food addiction, Food Addiction Scale, assessing confirmed symptoms, Yale Food Addiction, study investigates machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
*备注: 16 pages, 3 figures, 14 tables

点击查看摘要

Abstract:This study investigates machine learning algorithms to identify objective features for diagnosing food addiction (FA) and assessing confirmed symptoms (SC). Data were collected from 81 participants (mean age: 21.5 years, range: 18-61 years, women: 77.8%) whose FA and SC were measured using the Yale Food Addiction Scale (YFAS). Participants provided demographic and anthropometric data, completed the YFAS, the Zung Self-Rating Depression Scale, and the Dutch Eating Behavior Questionnaire, and wore an actimeter on the non-dominant wrist for a week to record motor activity. Analysis of the actimetric data identified significant statistical and entropy-based features that accurately predicted FA and SC using ML. The Matthews correlation coefficient (MCC) was the primary metric. Activity-related features were more effective for FA prediction (MCC=0.88) than rest-related features (MCC=0.68). For SC, activity segments yielded MCC=0.47, rest segments MCC=0.38, and their combination MCC=0.51. Significant correlations were also found between actimetric features related to FA, emotional, and restrained eating behaviors, supporting the model’s validity. Our results support the concept of a human bionic suite composed of IoT devices and ML sensors, which implements health digital assistance with real-time monitoring and analysis of physiological indicators related to FA and SC.
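Since the Matthews correlation coefficient (MCC) is the primary metric here, a compact reference implementation may help (the labels below are toy values, not study data):

```python
# Matthews correlation coefficient from scratch, on toy binary labels.
import numpy as np

def mcc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

print(round(mcc([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]), 3))   # 0.333
```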

[AI-189] OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters

链接: https://arxiv.org/abs/2409.00286
作者: Zexin Chen,Chengxi Li,Xiangyu Xie,Parijat Dube
关键词-EN: model trained exclusively, OnlySports Dataset, paper explores, explores the potential, trained exclusively
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, OnlySports Dataset, and OnlySports Benchmark. Our approach involves: 1) creating a massive 600 billion tokens OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M parameters model with 20-layer, 640-dimension structure, 3) training the OnlySportsLM on part of OnlySports Dataset, and 4) testing the resultant model on OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SmolLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.

[AI-190] Reframing Data Value for Large Language Models Through the Lens of Plausibility

链接: https://arxiv.org/abs/2409.00284
作者: Mohamad Rida Rammal,Ruida Zhou,Suhas Diggavi
关键词-EN: Data valuation seeks, important question, seeks to answer, answer the important, Data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data valuation seeks to answer the important question, “How much is this data worth?” Existing data valuation methods have largely focused on discriminative models, primarily examining data value through the lens of its utility in training. However, with the push for ever-larger language models, relying on valuation methods that require training becomes increasingly expensive and dependent on specific techniques. We propose an alternative perspective on the data value problem for language models, centering around the plausibility of the data. We posit that data holds lesser value if it can be plausibly generated by the model itself. Starting from some intuitive criteria that align with our notions of valuable data, we develop a novel value function that is computationally tractable and derived from first principles with provable properties. We conduct a theoretical analysis of our value function and evaluate it across multiple scenarios and datasets.

[AI-191] Explainable Artificial Intelligence: A Survey of Needs Techniques Applications and Future Direction

链接: https://arxiv.org/abs/2409.00265
作者: Melkamu Mersha,Khang Lam,Joseph Wood,Ali AlShami,Jugal Kalita
关键词-EN: Artificial intelligence models, Explainable Artificial Intelligence, Artificial intelligence, encounter significant challenges, significant challenges due
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence models encounter significant challenges due to their black-box nature, particularly in safety-critical domains such as healthcare, finance, and autonomous vehicles. Explainable Artificial Intelligence (XAI) addresses these challenges by providing explanations for how these models make decisions and predictions, ensuring transparency, accountability, and fairness. Existing studies have examined the fundamental concepts of XAI, its general principles, and the scope of XAI techniques. However, there remains a gap in the literature as there are no comprehensive reviews that delve into the detailed mathematical representations, design methodologies of XAI models, and other associated aspects. This paper provides a comprehensive literature review encompassing common terminologies and definitions, the need for XAI, beneficiaries of XAI, a taxonomy of XAI methods, and the application of XAI methods in different application areas. The survey is aimed at XAI researchers, XAI practitioners, AI model developers, and XAI beneficiaries who are interested in enhancing the trustworthiness, transparency, accountability, and fairness of their AI models.

[AI-192] he Artificial Intelligence Act: critical overview

链接: https://arxiv.org/abs/2409.00264
作者: Nuno Sousa e Silva
关键词-EN: Artificial Intelligence Act, approved Artificial Intelligence, recently approved Artificial, Intelligence Act, Artificial Intelligence
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article provides a critical overview of the recently approved Artificial Intelligence Act. It starts by presenting the main structure, objectives, and approach of Regulation (EU) 2024/1689. A definition of key concepts follows, and then the material and territorial scope, as well as the timing of application, are analyzed. Although the Regulation does not explicitly set out principles, the main ideas of fairness, accountability, transparency, and equity in AI underlie a set of rules of the regulation. This is discussed before looking at the ill-defined set of forbidden AI practices (manipulation and exploitation of vulnerabilities, social scoring, biometric identification and classification, and predictive policing). It is highlighted that those rules deal with behaviors rather than AI systems. The qualification and regulation of high-risk AI systems are tackled, alongside the obligation of transparency for certain systems, the regulation of general-purpose models, and the rules on certification, supervision, and sanctions. The text concludes that even if the overall framework can be deemed adequate and balanced, the approach is so complex that it risks defeating its own purpose of promoting responsible innovation within the European Union and beyond its borders.

[AI-193] MAPWise: Evaluating Vision-Language Models for Advanced Map Queries

链接: https://arxiv.org/abs/2409.00255
作者: Srija Mukhopadhyay,Abhishek Rajgaria,Prerana Khatiwada,Vivek Gupta,Dan Roth
关键词-EN: tasks requiring joint, Vision-language models, excel at tasks, linguistic information, answering questions based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注: 30 Pages, 46 Tables, 6 Figure

点击查看摘要

Abstract:Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models.

[AI-194] One-Frame Calibration with Siamese Network in Facial Action Unit Recognition

链接: https://arxiv.org/abs/2409.00240
作者: Shuangquan Feng,Virginia R. de Sa
关键词-EN: Automatic facial action, facial action unit, Automatic facial, action unit, facial expression analysis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic facial action unit (AU) recognition is used widely in facial expression analysis. Most existing AU recognition systems aim for cross-participant non-calibrated generalization (NCG) to unseen faces without further calibration. However, due to the diversity of facial attributes across different identities, accurately inferring AU activation from single images of an unseen face is sometimes infeasible, even for human experts – it is crucial to first understand how the face appears in its neutral expression, or significant bias may be incurred. Therefore, we propose to perform one-frame calibration (OFC) in AU recognition: for each face, a single image of its neutral expression is used as the reference image for calibration. With this strategy, we develop a Calibrating Siamese Network (CSN) for AU recognition and demonstrate its remarkable effectiveness with a simple iResNet-50 (IR50) backbone. On the DISFA, DISFA+, and UNBC-McMaster datasets, we show that our OFC CSN-IR50 model (a) substantially improves the performance of IR50 by mitigating facial attribute biases (including biases due to wrinkles, eyebrow positions, facial hair, etc.), (b) substantially outperforms the naive OFC method of baseline subtraction as well as (c) a fine-tuned version of this naive OFC method, and (d) also outperforms state-of-the-art NCG models for both AU intensity estimation and AU detection.
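A hedged PyTorch sketch of the one-frame-calibration idea (the tiny backbone and layer sizes are placeholders, not the paper's CSN/IR50 architecture): the same backbone encodes both the target frame and the person's neutral reference frame, and the AU head predicts from a comparison of the two embeddings.

```python
# Hedged sketch: Siamese AU head conditioned on a neutral reference frame.
import torch
import torch.nn as nn

class TinyCalibratingSiamese(nn.Module):
    def __init__(self, feat_dim=128, num_aus=12):
        super().__init__()
        self.backbone = nn.Sequential(                       # stand-in for IR50
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.head = nn.Linear(2 * feat_dim, num_aus)

    def forward(self, target_img, neutral_img):
        f_t = self.backbone(target_img)
        f_n = self.backbone(neutral_img)                     # shared weights: Siamese
        return self.head(torch.cat([f_t - f_n, f_t], dim=1))

model = TinyCalibratingSiamese()
target = torch.randn(2, 3, 112, 112)
neutral = torch.randn(2, 3, 112, 112)                        # one neutral frame per identity
print(model(target, neutral).shape)                          # torch.Size([2, 12])
```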

[AI-195] Deep learning surrogate models of JULES-INFERNO for wildfire prediction on a global scale

链接: https://arxiv.org/abs/2409.00237
作者: Sibo Cheng,Hector Chassagnon,Matthew Kasoar,Yike Guo,Rossella Arcucci
关键词-EN: changing wildfire regimes, play a crucial, crucial role, role in anticipating, anticipating and responding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Global wildfire models play a crucial role in anticipating and responding to changing wildfire regimes. JULES-INFERNO is a global vegetation and fire model simulating wildfire emissions and area burnt on a global scale. However, because of the high data dimensionality and system complexity, JULES-INFERNO’s computational costs make it challenging to apply to fire risk forecasting with unseen initial conditions. Typically, running JULES-INFERNO for 30 years of prediction will take several hours on High Performance Computing (HPC) clusters. To tackle this bottleneck, two data-driven models are built in this work based on Deep Learning techniques to surrogate the JULES-INFERNO model and speed up global wildfire forecasting. More precisely, these machine learning models take global temperature, vegetation density, soil moisture and previous forecasts as inputs to predict the subsequent global area burnt on an iterative basis. Average Error per Pixel (AEP) and Structural Similarity Index Measure (SSIM) are used as metrics to evaluate the performance of the proposed surrogate models. A fine tuning strategy is also proposed in this work to improve the algorithm performance for unseen scenarios. Numerical results show a strong performance of the proposed models, in terms of both computational efficiency (less than 20 seconds for 30 years of prediction on a laptop CPU) and prediction accuracy (with AEP under 0.3% and SSIM over 98% compared to the outputs of JULES-INFERNO).
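For reference, one plausible reading of the Average Error per Pixel metric is sketched below (this normalization is an assumption and the paper may define AEP differently; SSIM would typically come from a library such as scikit-image):

```python
# Hedged sketch: mean absolute error per pixel, as a fraction of the target's range.
import numpy as np

def aep(pred, target):
    return np.mean(np.abs(pred - target)) / (target.max() - target.min())

rng = np.random.default_rng(0)
target = rng.random((144, 192))                       # toy global burnt-area map
pred = target + rng.normal(0.0, 0.002, target.shape)  # surrogate prediction
print(f"AEP = {aep(pred, target):.4%}")
```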

[AI-196] Spatially-Aware Diffusion Models with Cross-Attention for Global Field Reconstruction with Sparse Observations

链接: https://arxiv.org/abs/2409.00230
作者: Yilin Zhuang,Sibo Cheng,Karthik Duraisamy
关键词-EN: represent complex distributions, incorporate uncertainty, making them ideal, gained attention, represent complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Diffusion models have gained attention for their ability to represent complex distributions and incorporate uncertainty, making them ideal for robust predictions in the presence of noisy or incomplete data. In this study, we develop and enhance score-based diffusion models in field reconstruction tasks, where the goal is to estimate complete spatial fields from partial observations. We introduce a condition encoding approach to construct a tractable mapping between observed and unobserved regions using a learnable integration of sparse observations and interpolated fields as an inductive bias. With refined sensing representations and an unraveled temporal dimension, our method can handle arbitrary moving sensors and effectively reconstruct fields. Furthermore, we conduct a comprehensive benchmark of our approach against a deterministic interpolation-based method across various static and time-dependent PDEs. Our study attempts to address the gap in strong baselines for evaluating performance across varying sampling hyperparameters, noise levels, and conditioning methods. Our results show that diffusion models with cross-attention and the proposed conditional encoding generally outperform other methods under noisy conditions, although the deterministic method excels with noiseless data. Additionally, both the diffusion models and the deterministic method surpass the numerical approach in accuracy and computational cost for the steady problem. We also demonstrate the ability of the model to capture possible reconstructions and improve the accuracy of fused results in covariance-based correction tasks using ensemble sampling.
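
A minimal sketch of the condition-encoding idea, assuming the sparse observations, their mask, and an interpolated field are stacked as channels before being attended to by the score network; layer sizes and names are illustrative, not the paper's implementation:

```python
# Sketch of a condition encoder for field reconstruction: observed values, the
# observation mask, and an interpolated field are concatenated channel-wise and
# encoded into a context that the score network consumes via cross-attention.
import torch
import torch.nn as nn

class ConditionEncoderSketch(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        # three input channels: observed values, observation mask, interpolated field
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )

    def forward(self, obs_values, obs_mask, interp_field):
        # each input: (batch, 1, H, W)
        cond = torch.cat([obs_values, obs_mask, interp_field], dim=1)
        return self.net(cond)   # cross-attention context for the score network
```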

[AI-197] A Generative Adversarial Network-based Method for LiDAR-Assisted Radar Image Enhancement

链接: https://arxiv.org/abs/2409.00196
作者: Thakshila Thilakanayake,Oscar De Silva,Thumeera R. Wanasinghe,George K. Mann,Awantha Jayasiri
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-198] Deep Neural Networks for Predicting Recurrence and Survival in Patients with Esophageal Cancer After Surgery MICCAI MICCAI2024

链接: https://arxiv.org/abs/2409.00163
作者: Yuhan Zheng,Jessie A Elliott,John V Reynolds,Sheraz R Markar,Bartłomiej W. Papież,ENSURE study group
关键词-EN: cancer-related mortality internationally, high recurrence rates, Esophageal cancer, mortality internationally, curative-intent surgery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 3 figures, 4 tables. To appear in CaPTion: MICCAI Workshop on Cancer Prevention, detection, and intervenTion, Sharib Ali et al., MICCAI 2024, Lecture Notes in Computer Science, Springer

点击查看摘要

Abstract:Esophageal cancer is a major cause of cancer-related mortality internationally, with high recurrence rates and poor survival even among patients treated with curative-intent surgery. Investigating relevant prognostic factors and predicting prognosis can enhance post-operative clinical decision-making and potentially improve patients’ outcomes. In this work, we assessed prognostic factor identification and discriminative performances of three models for Disease-Free Survival (DFS) and Overall Survival (OS) using a large multicenter international dataset from the ENSURE study. We first employed the Cox Proportional Hazards (CoxPH) model to assess the impact of each feature on outcomes. Subsequently, we utilised CoxPH and two deep neural network (DNN)-based models, DeepSurv and DeepHit, to predict DFS and OS. The significant prognostic factors identified by our models were consistent with clinical literature, with post-operative pathologic features showing higher significance than clinical stage features. DeepSurv and DeepHit demonstrated comparable discriminative accuracy to CoxPH, with DeepSurv slightly outperforming in both DFS and OS prediction tasks, achieving C-indices of 0.735 and 0.74, respectively. While these results suggest the potential of DNNs as prognostic tools for improving predictive accuracy and providing personalised guidance with respect to risk stratification, CoxPH remains an adequately good prediction model for the data used in this study.
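
For reference, the C-index quoted above (0.735 / 0.74) is Harrell's concordance index; a generic, self-contained implementation (not the ENSURE study code) looks like this:

```python
# Harrell's concordance index: the fraction of comparable patient pairs for which the
# model assigns a higher risk to the patient who experiences the event earlier.
import numpy as np

def concordance_index(times, risks, events):
    """times: observed times; risks: predicted risk scores; events: 1=event, 0=censored."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable if patient i had an event before patient j's observed time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

# Example: concordance_index([5, 3, 8], [0.2, 0.9, 0.1], [1, 1, 0]) -> 1.0
```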

[AI-199] Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

链接: https://arxiv.org/abs/2409.00162
作者: Jiayi Zhou,Jiaming Ji,Juntao Dai,Yaodong Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 7 pages

点击查看摘要

[AI-200] Learning-Based Finite Element Methods Modeling for Complex Mechanical Systems

链接: https://arxiv.org/abs/2409.00160
作者: Jiasheng Shi,Fu Lin,Weixiong Rao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-201] LLMs hallucinate graphs too: a structural perspective

链接: https://arxiv.org/abs/2409.00159
作者: Erwan Le Merrer,Gilles Tredan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

[AI-202] Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder INTERSPEECH2024

链接: https://arxiv.org/abs/2409.00158
作者: Jihyun Mun,Sunhee Kim,Minhwa Chung
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for Interspeech 2024

点击查看摘要

[AI-203] Speaker Tagging Correction With Non-Autoregressive Language Models

链接: https://arxiv.org/abs/2409.00151
作者: Grigor Kirakosyan,Davit Karamyan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, 7 tables

点击查看摘要

[AI-204] From Semantics to Hierarchy: A Hybrid Euclidean-Tangent-Hyperbolic Space Model for Temporal Knowledge Graph Reasoning

链接: https://arxiv.org/abs/2409.00149
作者: Siling Feng,Zhisheng Qi,Cong Lin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-205] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

链接: https://arxiv.org/abs/2409.00147
作者: Shuai Peng,Di Fu,Liangcai Gao,Xiuqin Zhong,Hongguang Fu,Zhi Tang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-206] Robust Temporal-Invariant Learning in Multimodal Disentanglement

链接: https://arxiv.org/abs/2409.00143
作者: Guoyang Xu,Junqi Xue,Zhenxi Song,Yuxin Liu,Zirui Wang,Min Zhang,Zhiguo Zhang
关键词-EN: Multimodal sentiment recognition, identify human emotions, sentiment recognition aims, Multimodal sentiment, human emotions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures, this is the first version. The code is available at this https URL

点击查看摘要

Abstract:Multimodal sentiment recognition aims to learn representations from different modalities to identify human emotions. However, previous works do not suppress the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise. To address this issue, we propose Temporal-invariant learning, which minimizes the distributional differences between time steps to effectively capture smoother time series patterns, thereby enhancing the quality of the representations and the robustness of the model. To fully exploit the rich semantic information in textual knowledge, we propose a Text-Driven Fusion Module (TDFM). To guide cross-modal interactions, TDFM evaluates the correlations between different modalities through modality-invariant representations. Furthermore, we introduce a modality discriminator to disentangle modality-invariant and modality-specific subspaces. Experimental results on two public datasets demonstrate the superiority of our model.
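
A hedged sketch of a temporal-invariance regularizer in the spirit of the abstract, penalizing simple distributional differences (mean and variance) between adjacent time steps; the exact formulation used by the authors may differ:

```python
# Sketch of a temporal-invariance loss: encourage the per-step feature distributions of
# a modality sequence to change smoothly over time by matching first and second moments.
import torch

def temporal_invariance_loss(seq: torch.Tensor) -> torch.Tensor:
    """seq: (batch, time, dim) sequence of unimodal representations."""
    cur, nxt = seq[:, :-1], seq[:, 1:]
    mean_gap = (cur.mean(dim=-1) - nxt.mean(dim=-1)).pow(2).mean()
    var_gap = (cur.var(dim=-1) - nxt.var(dim=-1)).pow(2).mean()
    return mean_gap + var_gap

# Typical use (hypothetical weighting): total = task_loss + 0.1 * temporal_invariance_loss(text_seq)
```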

[AI-207] Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

链接: https://arxiv.org/abs/2409.00142
作者: Oscar Brown,Zhengjie Wang,Andrea Do,Nikhil Mathew,Cheng Yu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-208] Statistical Analysis of the Impact of Quaternion Components in Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.00140
作者: Gerardo Altamirano-Gómez,Carlos Gershenson
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 6 figures

点击查看摘要

[AI-209] PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action

链接: https://arxiv.org/abs/2409.00138
作者: Yijia Shao,Tianshi Li,Weiyan Shi,Yanchen Liu,Diyi Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Under review

点击查看摘要

[AI-210] Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

链接: https://arxiv.org/abs/2409.00137
作者: Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Jason Zhang,Julius Broomfield,Sara Pieri,Reihaneh Iranmanesh,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-211] HoneyComb: A Flexible LLM-Based Agent System for Materials Science EMNLP2024

链接: https://arxiv.org/abs/2409.00135
作者: Huan Zhang,Yu Song,Ziyu Hou,Santiago Miret,Bang Liu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under Review on EMNLP 2024

点击查看摘要

[AI-212] MAPF-GPT: Imitation Learning for Multi-Agent Pathfinding at Scale

链接: https://arxiv.org/abs/2409.00134
作者: Anton Andreychuk,Konstantin Yakovlev,Aleksandr Panov,Alexey Skrynnik
关键词-EN:
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-213] A Survey for Large Language Models in Biomedicine

链接: https://arxiv.org/abs/2409.00133
作者: Chong Wang,Mengyao Li,Junjun He,Zhongruo Wang,Erfan Darzi,Zan Chen,Jin Ye,Tianbin Li,Yanzhou Su,Jing Ke,Kaili Qu,Shuxin Li,Yi Yu,Pietro Liò,Tianyun Wang,Yu Guang Wang,Yiqing Shen
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-214] Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems

链接: https://arxiv.org/abs/2409.00131
作者: Ding Kai,Ma Zhenguo,Yan Xiaoran
关键词-EN: lightweight Large Language, Large Language Models, study focuses, focuses on improving, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks. We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism to construct a set of reference problems that integrate both semantic and logical similarity. By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic. To the best of our knowledge, this is the first attempt to utilize retrieval-enhanced generation for mathematical problem-solving. Experimental results demonstrate that our method achieves a 15.8% improvement over the Chain of Thought approach on the SVAMP dataset and a 21.5% improvement on the GSM8K dataset. Further application of this method to a large-scale model with 175 billion parameters yields performance comparable to the best results on both aforementioned datasets. Finally, we conduct an analysis of errors during the reasoning process, providing valuable insights and directions for future research on reasoning tasks using large language models.
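
The retrieval-plus-prompting recipe can be illustrated with a small sketch, assuming problems are embedded into vectors and reference problems are selected by cosine similarity; the function names and prompt wording are placeholders, not the paper's code:

```python
# Sketch of retrieval-enhanced prompting for math word problems: select similar
# reference problems by embedding similarity, then assemble positive (sound) and
# negative (flawed) reasoning exemplars into the prompt.
import numpy as np

def top_k_references(query_vec, ref_vecs, k=3):
    """Return indices of the k reference problems most similar to the query."""
    sims = ref_vecs @ query_vec / (
        np.linalg.norm(ref_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return np.argsort(-sims)[:k]

def build_prompt(question, positives, negatives):
    parts = ["Solve the problem step by step."]
    parts += [f"Correct reasoning example:\n{p}" for p in positives]
    parts += [f"Flawed reasoning example (avoid this logic):\n{n}" for n in negatives]
    parts.append(f"Problem: {question}\nAnswer:")
    return "\n\n".join(parts)
```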

[AI-215] Estimating the number of reachable positions in Minishogi

链接: https://arxiv.org/abs/2409.00129
作者: Sotaro Ishii,Tetsuro Tanaka
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: This article was submitted to IPSJ (Information Processing Society of Japan) SIG Technical Reports for Game Informatics in September 6, 2024. (a non-reviewed technical report)

点击查看摘要

[AI-216] Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

链接: https://arxiv.org/abs/2409.00128
作者: Ziyan Cui,Ning Li,Huaikang Zhou
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注: 5 figures, 2 tables

点击查看摘要

[AI-217] Latent-EnSF: A Latent Ensemble Score Filter for High-Dimensional Data Assimilation with Sparse Observation Data

链接: https://arxiv.org/abs/2409.00127
作者: Phillip Si,Peng Chen
关键词-EN: correct errors inherent, Ensemble Kalman Filter, Ensemble Score Filters, Accurate modeling, nonlinear Bayesian filtering
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 13 pages, 10 figures, 1 table

点击查看摘要

Abstract:Accurate modeling and prediction of complex physical systems often rely on data assimilation techniques to correct errors inherent in model simulations. Traditional methods like the Ensemble Kalman Filter (EnKF) and its variants as well as the recently developed Ensemble Score Filters (EnSF) face significant challenges when dealing with high-dimensional and nonlinear Bayesian filtering problems with sparse observations, which are ubiquitous in real-world applications. In this paper, we propose a novel data assimilation method, Latent-EnSF, which leverages EnSF with efficient and consistent latent representations of the full states and sparse observations to address the joint challenges of high dimensionality in states and high sparsity in observations for nonlinear Bayesian filtering. We introduce a coupled Variational Autoencoder (VAE) with two encoders to encode the full states and sparse observations in a consistent way guaranteed by latent distribution matching and regularization as well as a consistent state reconstruction. With comparison to several methods, we demonstrate the higher accuracy, faster convergence, and higher efficiency of Latent-EnSF for two challenging applications with complex models in shallow water wave propagation and medium-range weather forecasting, for highly sparse observations in both space and time.
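
A simplified sketch of the coupled-VAE training objective implied by the abstract, assuming each encoder returns a Gaussian (mean, log-variance) and latent matching is done by simple moment matching; the actual Latent-EnSF losses may differ:

```python
# Sketch of a coupled-VAE objective: encode full states and sparse observations
# separately, reconstruct the state from the state latent, and match the two latent
# distributions so that filtering can be performed consistently in latent space.
import torch
import torch.nn.functional as F

def coupled_vae_loss(state, obs, enc_state, enc_obs, decoder, beta=1.0):
    mu_s, logvar_s = enc_state(state)     # latent Gaussian from the full state (assumed interface)
    mu_o, logvar_o = enc_obs(obs)         # latent Gaussian from sparse observations
    z = mu_s + torch.randn_like(mu_s) * torch.exp(0.5 * logvar_s)
    recon = decoder(z)
    recon_loss = F.mse_loss(recon, state)
    # latent distribution matching between the two encoders (moment matching stand-in)
    match_loss = F.mse_loss(mu_o, mu_s.detach()) + F.mse_loss(logvar_o, logvar_s.detach())
    kl = -0.5 * torch.mean(1 + logvar_s - mu_s.pow(2) - logvar_s.exp())
    return recon_loss + beta * kl + match_loss
```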

[AI-218] A Hybrid Framework for Spatial Interpolation: Merging Data-driven with Domain Knowledge

链接: https://arxiv.org/abs/2409.00125
作者: Cong Zhang,Shuyi Du,Hongqing Song,Yuhe Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 21 pages, 13 figures

点击查看摘要

[AI-219] ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings ICPR2024

链接: https://arxiv.org/abs/2409.00120
作者: Jangyeong Jeon,Sangyeon Cho,Minuk Ma,Junyoung Kim
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: ICPR 2024

点击查看摘要

[AI-220] 3-in-1: 2D Rotary Adaptation for Efficient Finetuning Efficient Batching and Composability

链接: https://arxiv.org/abs/2409.00119
作者: Baohao Liao,Christof Monz
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 24 pages, 6 figures, 13 tables

点击查看摘要

[AI-221] When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options

链接: https://arxiv.org/abs/2409.00113
作者: Gracjan Góral,Emilia Wiśnios
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-222] Toward Large Language Models as a Therapeutic Tool: Comparing Prompting Techniques to Improve GPT-Delivered Problem-Solving Therapy

链接: https://arxiv.org/abs/2409.00112
作者: Daniil Filienko,Yinzhou Wang,Caroline El Jazmi,Serena Xie,Trevor Cohen,Martine De Cock,Weichao Yuwen
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for AMIA 2024 proceedings

点击查看摘要

[AI-223] Evaluating the Impact of Multiple DER Aggregators on Wholesale Energy Markets: A Hybrid Mean Field Approach

链接: https://arxiv.org/abs/2409.00107
作者: Jun He,Andrew L. Liu
关键词-EN:
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN); Optimization and Control (math.OC)
*备注:

点击查看摘要

[AI-224] Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

链接: https://arxiv.org/abs/2409.00106
作者: Aishik Nagar,Shantanu Jaiswal,Cheston Tan
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

[AI-225] Negation Blindness in Large Language Models : Unveiling the NO Syndrome in Image Generation

链接: https://arxiv.org/abs/2409.00105
作者: Mohammad Nadeem,Shahab Saquib Sohail,Erik Cambria,Björn W. Schuller,Amir Hussain
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

[AI-226] Nuance Matters: Probing Epistemic Consistency in Causal Reasoning

链接: https://arxiv.org/abs/2409.00103
作者: Shaobo Cui,Junyou Li,Luca Mouchel,Yiyang Feng,Boi Faltings
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 20 pages

点击查看摘要

[AI-227] Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

链接: https://arxiv.org/abs/2409.00099
作者: Zhenyu Wang,Shuyu Kong,Li Wan,Biqiao Zhang,Yiteng Huang,Mumin Jin,Ming Sun,Xin Lei,Zhaojun Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

[AI-228] Large Language Models for Disease Diagnosis: A Scoping Review

链接: https://arxiv.org/abs/2409.00097
作者: Shuang Zhou,Zidu Xu,Mian Zhang,Chunpu Xu,Yawen Guo,Zaifu Zhan,Sirui Ding,Jiashuo Wang,Kaishuai Xu,Yi Fang,Liqiao Xia,Jeremy Yeung,Daochen Zha,Mingquan Lin,Rui Zhang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 57 pages

点击查看摘要

[AI-229] Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data

链接: https://arxiv.org/abs/2409.00096
作者: Juncheng Xie,Shensian Syu,Hung-yi Lee
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 2 figures, 15 tables

点击查看摘要

[AI-230] Examining Independence in Ensemble Sentiment Analysis: A Study on the Limits of Large Language Models Using the Condorcet Jury Theorem

链接: https://arxiv.org/abs/2409.00094
作者: Baptiste Lefort,Eric Benhamou,Jean-Jacques Ohana,Beatrice Guez,David Saltiel,Thomas Jacquot
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-231] PatentGPT: A Large Language Model for Patent Drafting Using Knowledge-based Fine-tuning Method

链接: https://arxiv.org/abs/2409.00092
作者: Runtao Ren,Jian Ma
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 21 pages, 4 figures

点击查看摘要

[AI-232] Classification of Safety Events at Nuclear Sites using Large Language Models

链接: https://arxiv.org/abs/2409.00091
作者: Mishca de Costa,Muhammad Anwar,Daniel Lau,Issam Hammad
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-233] Evaluating ChatGPT on Nuclear Domain-Specific Data

链接: https://arxiv.org/abs/2409.00090
作者: Muhammad Anwar,Mischa de Costa,Issam Hammad,Daniel Lau
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-234] Watermarking Techniques for Large Language Models : A Survey

链接: https://arxiv.org/abs/2409.00089
作者: Yuqing Liang,Jiancheng Xiao,Wensheng Gan,Philip S. Yu
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Preprint. 19 figures, 7 tables

点击查看摘要

[AI-235] Vision-Language and Large Language Model Performance in Gastroenterology: GPT Claude Llama Phi Mistral Gemma and Quantized Models

链接: https://arxiv.org/abs/2409.00084
作者: Seyed Amir Ahmad Safavi-Naini,Shuhaib Ali,Omer Shahab,Zahra Shahhoseini,Thomas Savage,Sara Rafiee,Jamil S Samaan,Reem Al Shabeeb,Farah Ladak,Jamie O Yang,Juan Echavarria,Sumbal Babar,Aasma Shaukat,Samuel Margolis,Nicholas P Tatonetti,Girish Nadkarni,Bara El Kurdi,Ali Soroush
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: this https URL . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission

点击查看摘要

[AI-236] Towards Human-Level Understanding of Complex Process Engineering Schematics: A Pedagogical Introspective Multi-Agent Framework for Open-Domain Question Answering ECML KDD2024

链接: https://arxiv.org/abs/2409.00082
作者: Sagar Srinivas Sakhinana,Geethan Sannidhi,Venkataramana Runkana
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Our paper is accepted for publication at ML4CCE workshop at ECML PKDD 2024

点击查看摘要

[AI-237] Learning to Plan Long-Term for Language Modeling

链接: https://arxiv.org/abs/2409.00070
作者: Florian Mai,Nathan Cornille,Marie-Francine Moens
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: preprint

点击查看摘要

[AI-238] How to Measure Human-AI Prediction Accuracy in Explainable AI Systems

链接: https://arxiv.org/abs/2409.00069
作者: Sujay Koujalgi,Andrew Anderson,Iyadunni Adenuga,Shikha Soneji,Rupika Dikkala,Teresita Guzman Nader,Leo Soccio,Sourav Panda,Rupak Kumar Das,Margaret Burnett,Jonathan Dodge
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-239] Phrasing for UX: Enhancing Information Engagement through Computational Linguistics and Creative Analytics

链接: https://arxiv.org/abs/2409.00064
作者: Nimrod Dvir
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

[AI-240] Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language

链接: https://arxiv.org/abs/2409.00061
作者: Arief Purnama Muharram,Ayu Purwarianti
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-241] Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting

链接: https://arxiv.org/abs/2409.00054
作者: Yuting Hu,Dancheng Liu,Qingyun Wang,Charles Yu,Heng Ji,Jinjun Xiong
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: in submission

点击查看摘要

[AI-242] Quality Assessment in the Era of Large Models: A Survey

链接: https://arxiv.org/abs/2409.00031
作者: Zicheng Zhang,Yingjie Zhou,Chunyi Li,Baixuan Zhao,Xiaohong Liu,Guangtao Zhai
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-243] Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach

链接: https://arxiv.org/abs/2409.00022
作者: Zhe Fu,Kanlun Wang,Wangjiaxuan Xin,Lina Zhou,Shi Chen,Yaorong Ge,Daniel Janies,Dongsong Zhang
关键词-EN:
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to PACIS 2024. 15 pages, 3 figures

点击查看摘要

[AI-244] TACOS: Task Agnostic Continual Learning in Spiking Neural Networks

链接: https://arxiv.org/abs/2409.00021
作者: Nicholas Soures,Peter Helfer,Anurag Daram,Tej Pandit,Dhireesha Kudithipudi
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-245] Navigating the sociotechnical labyrinth: Dynamic certification for responsible embodied AI

链接: https://arxiv.org/abs/2409.00015
作者: Georgios Bakirtzis,Andrea Aler Tubella,Andreas Theodorou,David Danks,Ufuk Topcu
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

[AI-246] DivDiff: A Conditional Diffusion Model for Diverse Human Motion Prediction

链接: https://arxiv.org/abs/2409.00014
作者: Hua Yu,Yaqing Hou,Wenbin Pei,Qiang Zhang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-247] AVIN-Chat: An Audio-Visual Interactive Chatbot System with Emotional State Tuning

链接: https://arxiv.org/abs/2409.00012
作者: Chanhyuk Park,Jungbin Cho,Junwan Kim,Seongmin Lee,Jungsu Kim,Sanghoon Lee
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-248] Web Retrieval Agents for Evidence-Based Misinformation Detection

链接: https://arxiv.org/abs/2409.00009
作者: Jacob-Junqi Tian,Hao Yu,Yury Orlovskiy,Tyler Vergho,Mauricio Rivera,Mayank Goel,Zachary Yang,Jean-Francois Godbout,Reihaneh Rabbany,Kellin Pelrine
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 1 main figure, 8 tables, 10 pages, 12 figures in Appendix, 7 tables in Appendix

点击查看摘要

[AI-249] Csi-LLM: A Novel Downlink Channel Prediction Method Aligned with LLM Pre-Training

链接: https://arxiv.org/abs/2409.00005
作者: Shilong Fan,Zhenyu Liu,Xinyu Gu,Haozhen Li
关键词-EN:
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-250] Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy

链接: https://arxiv.org/abs/2409.00001
作者: Kimji N. Pellano,Inga Strümke,Daniel Groos,Lars Adde,Espen Alexander F. Ihlen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-251] Measuring Human Contribution in AI-Assisted Content Generation

链接: https://arxiv.org/abs/2408.14792
作者: Yueqi Xie,Tao Qi,Jingwei Yi,Ryan Whalen,Junming Huang,Qian Ding,Yu Xie,Xing Xie,Fangzhao Wu
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

[AI-252] Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration

链接: https://arxiv.org/abs/2408.07341
作者: Xiaogen Zhon,Yiyou Sun,Min Deng,Winnie Chiu Wing Chu,Qi Dou
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

[AI-253] All Artificial Less Intelligence: GenAI through the Lens of Formal Verification

链接: https://arxiv.org/abs/2403.16750
作者: Deepak Narayan Gadde,Aman Kumar,Thomas Nalapat,Evgenii Rezunov,Fabio Cappellini
关键词-EN:
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Published in DVCon U.S. 2024

点击查看摘要

[AI-254] vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

链接: https://arxiv.org/abs/2409.01995
作者: Yiwei Guo,Zhihan Li,Junjie Li,Chenpeng Du,Hankun Wang,Shuai Wang,Xie Chen,Kai Yu
关键词-EN: advances voice conversion, voice conversion, speech discrete token, advances voice, discrete token vocoder
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects of the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can potentially be manipulated only by speech token vocoders, pushing the frontiers of VC and speech synthesis.
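
The Snake activation family has the form x + (1/α)·sin²(αx); a sketch with a learnable per-channel α is shown below. The timbre conditioning that makes vec2wav 2.0's version adaptive is omitted, so treat this as an assumption-laden simplification rather than the paper's module:

```python
# Sketch of a Snake-style activation with a learnable per-channel frequency alpha.
import torch
import torch.nn as nn

class SnakeActivation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(channels))   # learnable frequency per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); snake(x) = x + (1/alpha) * sin^2(alpha * x)
        a = self.alpha.view(1, -1, 1)
        return x + (1.0 / (a + 1e-9)) * torch.sin(a * x) ** 2
```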

[AI-255] On the design space between molecular mechanics and machine learning force fields

链接: https://arxiv.org/abs/2409.01931
作者: Yuanqing Wang,Kenichiro Takaba,Michael S. Chen,Marcus Wieder,Yuzhi Xu,John Z. H. Zhang,Kuang Yu,Xinyan Wang,Linfeng Zhang,Daniel J. Cole,Joshua A. Rackers,Joe G. Greener,Peter Eastman,Stefano Martiniani,Mark E. Tuckerman
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

[AI-256] T1-contrast Enhanced MRI Generation from Multi-parametric MRI for Glioma Patients with Latent Tumor Conditioning

链接: https://arxiv.org/abs/2409.01622
作者: Zach Eidex,Mojtaba Safari,Richard L.J. Qiu,David S. Yu,Hui-Kuo Shu,Hui Mao,Xiaofeng Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2407.02616

点击查看摘要

[AI-257] Multi-frequency Neural Born Iterative Method for Solving 2-D Inverse Scattering Problems

链接: https://arxiv.org/abs/2409.01315
作者: Daoqi Liu,Tao Shan,Maokun Li,Fan Yang,Shenheng Xu
关键词-EN:
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-258] EnCLAP: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance

链接: https://arxiv.org/abs/2409.01201
作者: Jaeyeon Kim,Minjeon Jeon,Jaeyoon Jung,Sang Hoon Woo,Jinjoo Lee
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Accepted to DCASE2024 Workshop

点击查看摘要

[AI-259] Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

链接: https://arxiv.org/abs/2409.01160
作者: Jaeyeon Kim,Jaeyoon Jung,Minjeong Jeon,Sang Hoon Woo,Jinjoo Lee
关键词-EN: Language-based Audio Retrieval, Automated Audio Captioning, Automated Audio, Language-based Audio, Audio Retrieval
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: DCASE2024 Challenge Technical Report. Ranked 2nd in Task 6 Automated Audio Captioning

点击查看摘要

Abstract:In this technical report, we describe our submission to DCASE2024 Challenge Task6 (Automated Audio Captioning) and Task8 (Language-based Audio Retrieval). We develop our approach building upon the EnCLAP audio captioning framework and optimizing it for Task6 of the challenge. Notably, we outline the changes in the underlying components and the incorporation of the reranking process. Additionally, we submit a supplementary retriever model, a byproduct of our modified framework, to Task8. Our proposed systems achieve FENSE score of 0.542 on Task6 and mAP@10 score of 0.386 on Task8, significantly outperforming the baseline models.
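
The reranking step mentioned above can be illustrated generically: produce several candidate captions and keep the one preferred by a scorer. The scorer here is a placeholder, not the actual FENSE metric or the EnCLAP reranker:

```python
# Generic sketch of candidate reranking for audio captioning.
from typing import Callable, List

def rerank_captions(candidates: List[str], score_fn: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate caption."""
    return max(candidates, key=score_fn)

# Toy example (scorer is a stand-in; a real system would use a learned quality metric):
# best = rerank_captions(["a dog barks", "a dog barks repeatedly outdoors"], score_fn=len)
```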

[AI-260] Two-stage initial-value iterative physics-informed neural networks for simulating solitary waves of nonlinear wave equations

链接: https://arxiv.org/abs/2409.01124
作者: Jin Song,Ming Zhong,George Em Karniadakis,Zhenya Yan
关键词-EN: iterative neural network, physics-informed neural networks, initial-value iterative neural, two-stage initial-value iterative, numerical iterative methods
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Pattern Formation and Solitons (nlin.PS); Exactly Solvable and Integrable Systems (nlin.SI)
*备注: 25 pages, 17 figures

点击查看摘要

Abstract:We propose a new two-stage initial-value iterative neural network (IINN) algorithm for solitary wave computations of nonlinear wave equations based on traditional numerical iterative methods and physics-informed neural networks (PINNs). Specifically, the IINN framework consists of two subnetworks, one of which is used to fit a given initial value, and the other incorporates physical information and continues training on the basis of the first subnetwork. Importantly, the IINN method does not require any additional data information including boundary conditions, apart from the given initial value. Corresponding theoretical guarantees are provided to demonstrate the effectiveness of our IINN method. The proposed IINN method is efficiently applied to learn some types of solutions in different nonlinear wave equations, including the one-dimensional (1D) nonlinear Schrödinger (NLS) equation (with and without potentials), the 1D saturable NLS equation with PT-symmetric optical lattices, the 1D focusing-defocusing coupled NLS equations, the KdV equation, the two-dimensional (2D) NLS equation with potentials, the 2D amended GP equation with a potential, the (2+1)-dimensional KP equation, and the 3D NLS equation with a potential. These applications serve as evidence for the efficacy of our method. Finally, by comparing with the traditional methods, we demonstrate the advantages of the proposed IINN method.
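
A schematic of the two-stage training idea, collapsed here to a single network that is first fitted to the given initial value and then continues training with a PDE-residual term; names, loss weights, and the residual interface are assumptions, not the paper's IINN code:

```python
# Two-stage training sketch: stage 1 fits the initial value only; stage 2 adds a
# physics-informed residual loss, warm-started from the stage-1 weights.
import torch

def train_iinn_sketch(net, x_init, u_init, pde_residual, steps1=2000, steps2=5000, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    # Stage 1: match the given initial value.
    for _ in range(steps1):
        opt.zero_grad()
        loss = torch.mean((net(x_init) - u_init) ** 2)
        loss.backward()
        opt.step()
    # Stage 2: keep the data term and add the PDE residual (pde_residual is assumed to
    # evaluate the governing equation on collocation points using autograd internally).
    for _ in range(steps2):
        opt.zero_grad()
        loss = torch.mean((net(x_init) - u_init) ** 2) + torch.mean(pde_residual(net) ** 2)
        loss.backward()
        opt.step()
    return net
```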

[AI-261] Bootstrap SGD: Algorithmic Stability and Robustness

链接: https://arxiv.org/abs/2409.01074
作者: Andreas Christmann,Yunwen Lei
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-262] SeCo-INR: Semantically Conditioned Implicit Neural Representations for Improved Medical Image Super-Resolution WACV

链接: https://arxiv.org/abs/2409.01013
作者: Mevan Ekanayake,Zhifeng Chen,Gary Egan,Mehrtash Harandi,Zhaolin Chen
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was accepted for presentation at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

[AI-263] Solving Integrated Process Planning and Scheduling Problem via Graph Neural Network Based Deep Reinforcement Learning

链接: https://arxiv.org/abs/2409.00968
作者: Hongpei Li,Han Zhang,Ziyan He,Yunkai Jia,Bo Jiang,Xiang Huang,Dongdong Ge
关键词-EN: Integrated Process Planning, process route planning, maximize resource utilization, combines process route, Integer Linear Programming
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:The Integrated Process Planning and Scheduling (IPPS) problem combines process route planning and shop scheduling to achieve high efficiency in manufacturing and maximize resource utilization, which is crucial for modern manufacturing systems. Traditional methods based on Mixed Integer Linear Programming (MILP) and heuristic algorithms struggle to balance solution quality and speed when solving IPPS. In this paper, we propose a novel end-to-end Deep Reinforcement Learning (DRL) method. We model the IPPS problem as a Markov Decision Process (MDP) and employ a Heterogeneous Graph Neural Network (GNN) to capture the complex relationships among operations, machines, and jobs. To optimize the scheduling strategy, we use Proximal Policy Optimization (PPO). Experimental results show that, compared to traditional methods, our approach significantly improves solution efficiency and quality in large-scale IPPS instances, providing superior scheduling strategies for modern intelligent manufacturing systems.
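
As a point of reference for the PPO component, the clipped surrogate loss it optimizes can be written compactly; this is generic PPO, not the paper's heterogeneous-GNN scheduling policy:

```python
# Minimal PPO clipped surrogate loss: limit how far the updated policy can move from
# the policy that collected the scheduling trajectories.
import torch

def ppo_clip_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    ratio = torch.exp(new_logp - old_logp)                       # pi_new / pi_old per action
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))            # maximize the surrogate
```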

[AI-264] BUET Multi-disease Heart Sound Dataset: A Comprehensive Auscultation Dataset for Developing Computer-Aided Diagnostic Systems

链接: https://arxiv.org/abs/2409.00724
作者: Shams Nafisa Ali,Afia Zahin,Samiul Based Shuvo,Nusrat Binta Nizam,Shoyad Ibn Sabur Khan Nuhash,Sayeed Sajjad Razin,S.M. Sakeef Sani,Farihin Rahman,Nawshad Binta Nizam,Farhat Binte Azam,Rakib Hossen,Sumaiya Ohab,Nawsabah Noor,Taufiq Hasan
关键词-EN:
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 14 pages, 13 figures

点击查看摘要

[AI-265] Multiscale Color Guided Attention Ensemble Classifier for Age-Related Macular Degeneration using Concurrent Fundus and Optical Coherence Tomography Images ICPR

链接: https://arxiv.org/abs/2409.00718
作者: Pragya Gupta,Subhamoy Mandal,Debashree Guha,Debjani Chakraborty
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 27th International Conference on Pattern Recognition (ICPR) 2024