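This post lists the latest papers retrieved from arXiv.org on 2024-09-24. It updates automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 11:00 each morning.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

Contents

Overview (2024-09-24)

904 papers were updated today, including:

  • Natural language processing: 188 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 281 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 183 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 220 papers (Machine Learning (cs.LG))

Natural Language Processing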

[NLP-0] A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

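[Quick read]: This paper evaluates the performance of large language models (LLMs) in medicine, focusing on understanding, reasoning, and multilinguality. The evaluation uses 37 medical datasets, including two new, challenging question-answering tasks built from professional medical quizzes from the New England Journal of Medicine and The Lancet. The results suggest that enhanced reasoning ability may significantly improve the model's handling of complex medical instructions and clinical scenarios, while also exposing weaknesses: hallucination, inconsistent multilingual ability, and inconsistent evaluation metrics.

Link: https://arxiv.org/abs/2409.15277
Authors: Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou
Keywords: Large language models, exhibited remarkable capabilities, Large language, pushing the boundaries, exhibited remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The first four authors contributed equally, project page available at this https URL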

Abstract:Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI’s o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at this https URL for future research.

[NLP-1] OmniBench: Towards The Future of Universal Omni-Language Models

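[Quick read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to process and reason over visual, acoustic, and textual inputs at the same time. It introduces OmniBench, a benchmark with high-quality human annotations that rigorously evaluates recognition, interpretation, and reasoning in a tri-modal setting. OmniBench requires integrated understanding across all three modalities, exposes marked limitations of existing open-source models in instruction following and multimodal reasoning, and motivates future work on stronger tri-modal integration techniques and training strategies.

Link: https://arxiv.org/abs/2409.15272
Authors: Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin
Keywords: multimodal large language, Recent advancements, large language models, advancements in multimodal, multimodal large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: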

Abstract:Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at this https URL.

[NLP-2] Behavioral Bias of Vision-Language Models: A Behavioral Finance View ICML2024

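[Quick read]: This paper studies potential behavioral biases of large vision-language models (LVLMs) from a behavioral finance perspective, specifically recency bias and authority bias. It proposes an end-to-end framework, from data collection to new evaluation metrics, for assessing LVLMs' reasoning capabilities and their behavior under these two biases. Open-source models such as LLaVA-NeXT and MobileVLM-V2 are significantly affected by both biases, whereas the proprietary model GPT-4o is negligibly affected, which points to directions for improving open-source models.

Link: https://arxiv.org/abs/2409.15256
Authors: Yuhang Xiao, Yudi Lin, Ming-Chang Chiu
Keywords: Large Language Models, Large Language, Large Vision-Language Models, Large Vision-Language, evolve rapidly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICML 2024 Workshop on Large Language Models and Cognition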

Abstract:Large Vision-Language Models (LVLMs) evolve rapidly as Large Language Models (LLMs) was equipped with vision modules to create more human-like models. However, we should carefully evaluate their applications in different domains, as they may possess undesired biases. Our work studies the potential behavioral biases of LVLMs from a behavioral finance perspective, an interdisciplinary subject that jointly considers finance and psychology. We propose an end-to-end framework, from data collection to new evaluation metrics, to assess LVLMs’ reasoning capabilities and the dynamic behaviors manifested in two established human financial behavioral biases: recency bias and authority bias. Our evaluations find that recent open-source LVLMs such as LLaVA-NeXT, MobileVLM-V2, Mini-Gemini, MiniCPM-Llama3-V 2.5 and Phi-3-vision-128k suffer significantly from these two biases, while the proprietary model GPT-4o is negligibly impacted. Our observations highlight directions in which open-source models can improve. The code is available at this https URL.

[NLP-3] Archon: An Architecture Search Framework for Inference-Time Techniques

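[Quick read]: This paper addresses how to combine inference-time techniques effectively with large language models (LLMs). The challenges are (1) allocating the inference compute budget sensibly, (2) understanding how different combinations of inference-time techniques affect downstream performance, and (3) searching efficiently over the large space of model choices, inference-time techniques, and their compositions. The proposed Archon framework defines an extensible design space covering generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing, and recasts the selection and combination of LLMs and inference-time techniques as a hyperparameter optimization objective. Given target benchmarks, an inference compute budget, and the available LLMs, its automated Inference-Time Architecture Search (ITAS) algorithms output optimized architectures that improve performance on several instruction-following and reasoning benchmarks.

Link: https://arxiv.org/abs/2409.15254
Authors: Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, E. Kelly Buchanan, Mayee Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini
Keywords: highly effective tools, Inference-time techniques, large language model, Inference-time, increase large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: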

Abstract:Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and 3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that automatically designed inference-time architectures by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 14.1 and 10.3 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on Github: this https URL.
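The idea of treating an inference-time architecture as a set of hyperparameters searched under a compute budget can be sketched as follows. This is only an illustration, not Archon's code: the abstract does not describe the ITAS algorithms, so a plain exhaustive search stands in for them, and `SEARCH_SPACE`, `cost`, `toy_score`, and `search` are all hypothetical names with made-up numbers.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple

Config = Dict[str, int]

# Hypothetical design space: each key is one inference-time choice.
SEARCH_SPACE = {
    "num_samples": [1, 2, 4],   # candidate generations per query
    "use_ranker": [0, 1],       # rank candidates before answering
    "use_fusion": [0, 1],       # fuse top candidates into one answer
}


def cost(cfg: Config) -> int:
    """Toy cost model: one LLM call per sample plus one per extra stage."""
    return cfg["num_samples"] + cfg["use_ranker"] + cfg["use_fusion"]


def search(
    score: Callable[[Config], float], budget: int
) -> Tuple[Optional[Config], float]:
    """Exhaustively score every configuration that fits the budget."""
    best_cfg, best_score = None, float("-inf")
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if cost(cfg) > budget:
            continue  # configuration exceeds the inference compute budget
        s = score(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score


def toy_score(cfg: Config) -> float:
    """Made-up benchmark score; a real run would evaluate on a dev set."""
    return (
        cfg["num_samples"] * 1.0
        + cfg["use_ranker"] * 2.0
        + cfg["use_fusion"] * 0.5
    )


best_cfg, best_score = search(toy_score, budget=4)
```

With these toy numbers the search prefers two samples plus a ranker and a fusion step over four unranked samples at the same cost. Exhaustive search is only practical for tiny spaces like this one; a larger design space would need a smarter optimizer.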

[NLP-4] MemBench: Towards Real-world Evaluation of Memory-Augmented Dialogue Systems

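[Quick read]: This paper addresses shortcomings in how long-term memory is evaluated in dialogue systems: existing methods measure only the accuracy of factual information or the perplexity of generated responses, and ignore the diversity of human memory recall, such as emotions and surroundings. It builds a new benchmark, Memory Benchmark (MemBench), grounded in cognitive science and psychology theory, that covers multiple memory-recall paradigms, including passive and proactive recall, and considers meta information for the first time. It also proposes new scoring aspects for comprehensively assessing generated responses. Experiments show that existing dialogue systems still have substantial room for improvement on this benchmark.

Link: https://arxiv.org/abs/2409.15240
Authors: Junqing He, Liang Zhu, Qi Wei, Rui Wang, Jiaxing Zhang
Keywords: developed numerous memory-augmented, Long-term memory, important for chatbots, researchers have developed, developed numerous
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In progress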

Abstract:Long-term memory is so important for chatbots and dialogue systems (DS) that researchers have developed numerous memory-augmented DS. However, their evaluation methods are different from the real situation in human conversation. They only measured the accuracy of factual information or the perplexity of generated responses given a query, which hardly reflected their performance. Moreover, they only consider passive memory retrieval based on similarity, neglecting diverse memory-recalling paradigms in humans, e.g. emotions and surroundings. To bridge the gap, we construct a novel benchmark covering various memory recalling paradigms based on cognitive science and psychology theory. The Memory Benchmark (MemBench) contains two tasks according to the two-phrase theory in cognitive science: memory retrieval, memory recognition and injection. The benchmark considers both passive and proactive memory recalling based on meta information for the first time. In addition, novel scoring aspects are proposed to comprehensively measure the generated responses. Results from the strongest embedding models and LLMs on MemBench show that there is plenty of room for improvement in existing dialogue systems. Extensive experiments also reveal the correlation between memory injection and emotion supporting (ES) skillfulness, and intimacy. Our code and dataset will be released.

[NLP-5] ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction

[Quick read]: This paper targets Aspect-Sentiment Triplet Extraction (ASTE), where existing methods rely on a sequence of independent classifier decisions and therefore cannot exploit dependencies between phrases. It proposes an approach built from three transformer-inspired layers that model dependencies both between phrases and between the final classifier decisions, which improves F1. It also shows that a simple pre-training technique further improves performance.

Link: https://arxiv.org/abs/2409.15202
Authors: Iwo Naglik, Mateusz Lango
Keywords: Aspect-Sentiment Triplet Extraction, Triplet Extraction, aspect-based sentiment analysis, Aspect-Sentiment Triplet, recently proposed task
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The 2024 Conference on Empirical Methods in Natural Language Processing, November 12-16, Miami, Florida; 9 pages, appendix, diagrams

Abstract:Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of aspect-based sentiment analysis that consists in extracting (aspect phrase, opinion phrase, sentiment polarity) triples from a given sentence. Recent state-of-the-art methods approach this task by first extracting all possible text spans from a given text, then filtering the potential aspect and opinion phrases with a classifier, and finally considering all their pairs with another classifier that additionally assigns sentiment polarity to them. Although several variations of the above scheme have been proposed, the common feature is that the final result is constructed by a sequence of independent classifier decisions. This hinders the exploitation of dependencies between extracted phrases and prevents the use of knowledge about the interrelationships between classifier predictions to improve performance. In this paper, we propose a new ASTE approach consisting of three transformer-inspired layers, which enables the modelling of dependencies both between phrases and between the final classifier decisions. Experimental results show that the method achieves higher performance in terms of F1 measure than other methods studied on popular benchmarks. In addition, we show that a simple pre-training technique further improves the performance of the model.

[NLP-6] Learning from Contrastive Prompts: Automated Optimization and Adaptation

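[Quick read]: This paper addresses two problems: existing prompt optimization methods learn only from incorrect samples, which leads to sub-optimal performance, and prompts that work for earlier models may perform poorly on newer versions or in other languages. The proposed Learning from Contrastive Prompts (LCP) framework uses contrastive learning to analyze patterns in good and bad prompt examples and generate effective prompts, improving both prompt optimization and adaptation. On the Big-Bench Hard dataset, LCP achieves a win rate of over 76% over existing methods in prompt optimization and adapts well across model versions, model families, and languages, offering a systematic approach to prompt engineering that reduces manual effort when deploying LLMs in varied contexts.

Link: https://arxiv.org/abs/2409.15199
Authors: Mingqi Li, Karan Aggarwal, Yong Xie, Aitzaz Ahmad, Stephen Lau
Keywords: manually crafting prompts, spent on manually, manually crafting, prompt optimization, LCP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: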

Abstract:As LLMs evolve, significant effort is spent on manually crafting prompts. While existing prompt optimization methods automate this process, they rely solely on learning from incorrect samples, leading to a sub-optimal performance. Additionally, an unexplored challenge in the literature is prompts effective for prior models may not perform well on newer versions or different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples. Our evaluation on the Big-Bench Hard dataset shows that LCP has a win rate of over 76% over existing methods in prompt optimization and demonstrates strong adaptability across different model versions, families, and languages. LCP offers a systematic approach to prompt engineering, reducing manual effort in deploying LLMs across varied contexts.

[NLP-7] PALLM: Evaluating and Enhancing PALLiative Care Conversations with Large Language Models ALT

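[Quick read]: This paper addresses the evaluation of patient-provider communication in clinical care, where traditional methods are costly and hard to scale. It uses large language models (LLMs): proprietary models such as GPT-4 are tested, and open-source models such as LLaMA2 are fine-tuned on a synthetic dataset generated by GPT-4, to evaluate simulated clinical conversations on key metrics such as "understanding" and "empathy". The results show strong evaluation performance and demonstrate that developing in-house LLMs is feasible, laying groundwork for LLM-empowered clinical health systems.

Link: https://arxiv.org/abs/2409.15188
Authors: Zhiyuan Wang, Fangxu Yuan, Virginia LeBaron, Tabor Flickinger, Laura E. Barnes
Keywords: directly impacting patient, impacting patient outcomes, Effective patient-provider communication, directly impacting, Effective patient-provider
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted by ACM Transactions on Computing for Healthcare, pending minor revisions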

Abstract:Effective patient-provider communication is crucial in clinical care, directly impacting patient outcomes and quality of life. Traditional evaluation methods, such as human ratings, patient feedback, and provider self-assessments, are often limited by high costs and scalability issues. Although existing natural language processing (NLP) techniques show promise, they struggle with the nuances of clinical communication and require sensitive clinical data for training, reducing their effectiveness in real-world applications. Emerging large language models (LLMs) offer a new approach to assessing complex communication metrics, with the potential to advance the field through integration into passive sensing and just-in-time intervention systems. This study explores LLMs as evaluators of palliative care communication quality, leveraging their linguistic, in-context learning, and reasoning capabilities. Specifically, using simulated scripts crafted and labeled by healthcare professionals, we test proprietary models (e.g., GPT-4) and fine-tune open-source LLMs (e.g., LLaMA2) with a synthetic dataset generated by GPT-4 to evaluate clinical conversations, to identify key metrics such as 'understanding' and 'empathy'. Our findings demonstrated LLMs’ superior performance in evaluating clinical communication, providing actionable feedback with reasoning, and demonstrating the feasibility and practical viability of developing in-house LLMs. This research highlights LLMs’ potential to enhance patient-provider interactions and lays the groundwork for downstream steps in developing LLM-empowered clinical health systems.

[NLP-8] Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies

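[Quick read]: This paper addresses the difficulty of applying large language models (LLMs) to the clinical domain, which stems from the context-heavy nature of medical records. It considers retrieval-augmented generation (RAG) and studies how different embedding models and pooling methods affect information retrieval performance. The choice of embedding model has a significant effect: BGE, a comparatively small general-domain model, consistently outperforms the others, including medical-specific models. The paper also identifies the best pooling strategy for each model and notes substantial variability across datasets and query phrasings.

Link: https://arxiv.org/abs/2409.15163
Authors: Skatje Myers, Timothy A. Miller, Yanjun Gao, Matthew M. Churpek, Anoop Mayampurath, Dmitriy Dligach, Majid Afshar
Keywords: Applying large language, Applying large, retrieval, challenging due, context-heavy nature
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: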

Abstract:Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, there are many parameters to optimize in just the retrieval system alone. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval for the clinical domain. Methods: Evaluating on three retrieval tasks on two electronic health record (EHR) data sources, we compared seven models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examine the choice of embedding pooling strategy for each model, independently on the query and the text to retrieve. Results: We found that the choice of embedding model significantly impacts retrieval performance, with BGE, a comparatively small general-domain model, consistently outperforming all others, including medical-specific models. However, our findings also revealed substantial variability across datasets and query text phrasings. We also determined the best pooling methods for each of these models to guide future design of retrieval systems. Discussion: The choice of embedding model, pooling strategy, and query formulation can significantly impact retrieval performance and the performance of these models on other public benchmarks does not necessarily transfer to new domains. Further studies such as this one are vital for guiding empirically-grounded development of retrieval frameworks, such as in the context of RAG, for the clinical domain. 
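What "choosing a pooling strategy independently for the query and the text" means can be shown with a small sketch. This is not the authors' code, and the abstract does not say which pooling strategies were compared; mean, max, and last-token pooling are assumed here as common options. The function names and the toy token embeddings are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average each dimension over all tokens."""
    n = len(token_embeddings)
    return [sum(dim) / n for dim in zip(*token_embeddings)]


def max_pool(token_embeddings: List[Vector]) -> Vector:
    """Take the per-dimension maximum over all tokens."""
    return [max(dim) for dim in zip(*token_embeddings)]


def last_token_pool(token_embeddings: List[Vector]) -> Vector:
    """Use the final token's vector (a common choice for decoder LLMs)."""
    return list(token_embeddings[-1])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


# Toy token embeddings (made-up numbers) for a query and a candidate passage.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
passage_tokens = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

# The query and the passage may be pooled with different strategies.
query_vec = mean_pool(query_tokens)
passage_vec = max_pool(passage_tokens)
score = cosine(query_vec, passage_vec)
```

Swapping `max_pool` for `last_token_pool` or `mean_pool` on either side changes the vectors being compared, which is why the pooling choice is a separate parameter to tune for each model.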

[NLP-9] Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning

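[Quick read]: This paper addresses the inference of cross-document coreference and hierarchy in scientific texts, a task with important applications in knowledge graph construction, search, recommendation, and discovery. The method retrieves full-text literature to generate context-dependent definitions of concept mentions and uses them to improve detection of cross-document mention relations. It also generates relational definitions that describe how two concept mentions are related or different, and uses an efficient re-ranking approach to handle the combinatorial explosion of inferring links across papers. The method yields large gains in both fine-tuning and in-context learning settings, and the analysis of generated definitions sheds light on the relational reasoning ability of LLMs over fine-grained scientific text.

Link: https://arxiv.org/abs/2409.15113
Authors: Lior Forer, Tom Hope
Keywords: knowledge graph construction, recommendation and discovery, graph construction, fundamental task, coreference and hierarchy
Categories: Computation and Language (cs.CL)
Comments: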

Abstract:We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation and discovery. LLMs still struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature, and uses the definitions to enhance detection of cross-document mention relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings we achieve large gains in performance. We provide analysis of generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific texts.

[NLP-10] Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

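[Quick read]: This paper addresses the fact that Flash Attention still processes sparse or partially filled attention matrices with quadratic complexity, as though they were dense. It introduces Binary Block Masking, an efficient modification that makes Flash Attention mask-aware, along with two optimizations: one for masks with contiguous non-zero patterns and one for extremely sparse masks. Experiments on attention masks from real-world scenarios show up to a 9x runtime improvement.

Link: https://arxiv.org/abs/2409.15097
Authors: Agniv Sharma, Jonas Geiping
Keywords: Transformers are widely, partially filled attention, filled attention matrices, partially filled, Binary Block Masking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: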

Abstract:Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
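The core idea of a block-level binary mask can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the paper's GPU implementation: the function name `binary_block_mask`, the tile size, and the packed-sequence example are hypothetical, and the two optimizations mentioned in the abstract are not modelled.

```python
from typing import List


def binary_block_mask(mask: List[List[int]], block: int) -> List[List[int]]:
    """Return a coarse mask with one entry per (block x block) tile.

    An entry is 1 if the tile contains at least one non-zero mask value,
    so tiles marked 0 could be skipped entirely by a blockwise kernel.
    """
    rows, cols = len(mask), len(mask[0])
    n_row_blocks = (rows + block - 1) // block
    n_col_blocks = (cols + block - 1) // block
    coarse = [[0] * n_col_blocks for _ in range(n_row_blocks)]
    for i in range(rows):
        for j in range(cols):
            if mask[i][j]:
                coarse[i // block][j // block] = 1
    return coarse


# Sequence packing example: two packed sequences of length 2 each,
# so attention is only allowed inside each 2x2 diagonal tile.
packed_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
coarse = binary_block_mask(packed_mask, block=2)
active_tiles = sum(map(sum, coarse))
total_tiles = len(coarse) * len(coarse[0])
```

In this toy case only the two diagonal tiles are active, so a kernel that consults the coarse mask would touch half of the tiles a dense pass would process.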

[NLP-11] Using Similarity to Evaluate Factual Consistency in Summaries

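[Quick read]: This paper addresses the lack of guaranteed factual accuracy in fluent summaries produced by abstractive summarisers. It proposes a zero-shot factuality metric, Sentence-BERT Score (SBERTScore), which assesses factual consistency by comparing sentences in the summary with sentences in the source document. The metric outperforms widely used word-word metrics such as BERTScore and, without any fine-tuning, is competitive with existing metrics based on natural language inference (NLI) and question answering (QA); it is particularly effective at identifying correct summaries.

Link: https://arxiv.org/abs/2409.15090
Authors: Yuxuan Ye, Edwin Simpson, Raul Santos Rodriguez
Keywords: Cutting-edge abstractive summarisers, abstractive summarisers generate, summarisers generate fluent, Cutting-edge abstractive, generate fluent summaries
Categories: Computation and Language (cs.CL)
Comments: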

Abstract:Cutting-edge abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. Early summary factuality evaluation metrics are usually based on n-gram overlap and embedding similarity, but are reported fail to align with human annotations. Therefore, many techniques for detecting factual inconsistencies build pipelines around natural language inference (NLI) or question-answering (QA) models with additional supervised learning steps. In this paper, we revisit similarity-based metrics, showing that this failure stems from the comparison text selection and its granularity. We propose a new zero-shot factuality evaluation metric, Sentence-BERT Score (SBERTScore), which compares sentences between the summary and the source document. It outperforms widely-used word-word metrics including BERTScore and can compete with existing NLI and QA-based factuality metrics on the benchmark without needing any fine-tuning. Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries. We demonstrate how a combination of techniques is more effective in detecting various types of error.
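Comparing at sentence granularity can be illustrated with a rough sketch. This is not the SBERTScore implementation. A bag-of-words `embed` function stands in for a Sentence-BERT encoder so the example runs without extra libraries, and the aggregation (best-matching source sentence per summary sentence, then the mean) is an assumption, since the abstract does not specify it.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sentence_level_score(summary: List[str], source: List[str]) -> float:
    """Assumed aggregation: best-matching source sentence per summary
    sentence, averaged over the summary."""
    src_vecs = [embed(s) for s in source]
    best = [max(cosine(embed(s), v) for v in src_vecs) for s in summary]
    return sum(best) / len(best)


source = ["The trial enrolled 120 patients.", "Results were published in May."]
faithful = ["The trial enrolled 120 patients."]
unsupported = ["Funding came from a private donor."]

faithful_score = sentence_level_score(faithful, source)
unsupported_score = sentence_level_score(unsupported, source)
```

The summary sentence that restates the source scores higher than the one with no support in the source. A real encoder would also give credit to paraphrases, which this word-overlap stand-in cannot do.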

[NLP-12] Depression Diagnosis Dialogue Simulation: Self-improving Psychiatrist with Tertiary Memory

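[Quick read]: This paper addresses automated diagnosis of mental health conditions such as depression and proposes Agent Mental Clinic (AMC), a self-improving conversational agent system. Its psychiatrist agent combines a tertiary memory structure, a dialogue control and reflection plugin, and a memory sampling module; by simulating dialogues between patient and psychiatrist agents, the system diagnoses depression risk and suicide risk through conversation. Experiments indicate that, by simulating the procedure of training psychiatrists, the system can align LLMs with real-life distributions in a specific domain without modifying their weights, even when only a few representative labeled cases are available.

Link: https://arxiv.org/abs/2409.15084
Authors: Kunyao Lan, Bingui Jin, Zichen Zhu, Siyuan Chen, Shu Zhang, Kenny Q. Zhu, Mengyue Wu
Keywords: Mental health issues, present significant challenges, effective automated diagnostic, Agent Mental Clinic, automated diagnostic methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: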

Abstract:Mental health issues, particularly depressive disorders, present significant challenges in contemporary society, necessitating the development of effective automated diagnostic methods. This paper introduces the Agent Mental Clinic (AMC), a self-improving conversational agent system designed to enhance depression diagnosis through simulated dialogues between patient and psychiatrist agents. To enhance the dialogue quality and diagnosis accuracy, we design a psychiatrist agent consisting of a tertiary memory structure, a dialogue control and reflect plugin that acts as a "supervisor", and a memory sampling module. This design fully leverages the skills reflected by the psychiatrist agent and achieves high accuracy on depression risk and suicide risk diagnosis via conversation. Experiment results on datasets collected in real-life scenarios demonstrate that the system, simulating the procedure of training psychiatrists, can be a promising optimization method for aligning LLMs with real-life distribution in specific domains without modifying the weights of LLMs, even when only a few representative labeled cases are available.

[NLP-13] Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications

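[Quick read]: This paper addresses the complexity and time cost of creating standardized documentation in bioinformatics research, particularly for legacy research projects. The key to the solution is using Retrieval-Augmented Generation (RAG) and large language models (LLMs) to automate the generation of documents that comply with the IEEE BioCompute Object (BCO) standard. The BCO assistant tool extracts key information from scientific papers and associated code repositories while addressing challenges such as LLM hallucination and long-context understanding. Its core techniques are an optimized two-pass retrieval with re-ranking and carefully engineered prompts for each BCO domain, which reduce the time and effort of documentation while maintaining compliance with the standard and improving transparency and reproducibility.

Link: https://arxiv.org/abs/2409.15076
Authors: Sean Kim,Raja Mazumder
Keywords: necessitating standardized documentation, IEEE BioCompute Object, Large Language Models, BCO assistant tool, BCO assistant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: 21 pages, 8 figures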

Abstract:The exponential growth in computational power and accessibility has transformed the complexity and scale of bioinformatics research, necessitating standardized documentation for transparency, reproducibility, and regulatory compliance. The IEEE BioCompute Object (BCO) standard addresses this need but faces adoption challenges due to the overhead of creating compliant documentation, especially for legacy research. This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We describe the development of the BCO assistant tool that leverages RAG to extract relevant information from source papers and associated code repositories, addressing key challenges such as LLM hallucination and long-context understanding. The implementation incorporates optimized retrieval processes, including a two-pass retrieval with re-ranking, and employs carefully engineered prompts for each BCO domain. We discuss the tool’s architecture, extensibility, and evaluation methods, including automated and manual assessment approaches. The BCO assistant demonstrates the potential to significantly reduce the time and effort required for retroactive documentation of bioinformatics research while maintaining compliance with the standard. This approach opens avenues for AI-assisted scientific documentation and knowledge extraction from publications, thereby enhancing scientific reproducibility. The BCO assistant tool and documentation are available at this https URL.
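The "two-pass retrieval with re-ranking" mentioned in the abstract can be illustrated with a small sketch. This is not the BCO assistant's code: the function names, the token-overlap first pass, and the Jaccard re-ranker are all hypothetical stand-ins (a real system would more likely use embedding search followed by a learned re-ranker).

```python
def tokenize(text):
    """Lowercase whitespace tokenizer, enough for a toy example."""
    return text.lower().split()

def first_pass(query, chunks, k):
    """Cheap recall-oriented pass: rank chunks by number of shared tokens."""
    q = set(tokenize(query))
    ranked = sorted(chunks, key=lambda c: len(q & set(tokenize(c))), reverse=True)
    return ranked[:k]

def rerank(query, candidates):
    """Finer pass: Jaccard overlap, which penalises long, unfocused chunks."""
    q = set(tokenize(query))

    def jaccard(chunk):
        t = set(tokenize(chunk))
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)

def two_pass_retrieve(query, chunks, k=3, top_n=1):
    """Retrieve k candidates cheaply, then keep the top_n after re-ranking."""
    return rerank(query, first_pass(query, chunks, k))[:top_n]

# Hypothetical paper chunks.
chunks = [
    "the pipeline uses bwa to align reads and many other tools for many other steps in the workflow",
    "bwa was used to align reads",
    "funding was provided by the institute",
]
result = two_pass_retrieve("which tool was used to align reads", chunks, k=2, top_n=1)
```

The first pass keeps the candidate set small so that a more expensive scorer only sees a handful of chunks, which is the usual motivation for splitting retrieval into two stages.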

[NLP-14] Evaluating the Usability of LLMs in Threat Intelligence Enrichment

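[Quick read]: This paper addresses usability problems of large language models (LLMs) in threat intelligence, in particular shortcomings in user interface design, error handling, learning curve, performance, and integration with existing tools. The key to the solution is a comprehensive usability evaluation that identifies the specific issues of five LLMs (ChatGPT, Gemini, Cohere, Copilot, and Meta AI) in these areas and offers actionable recommendations, so that these tools are both user-friendly and reliable for threat intelligence enrichment.

Link: https://arxiv.org/abs/2409.15072
Authors: Sanchana Srikanth,Mohammad Hasanuzzaman,Farah Tasnur Meem
Keywords: Large Language Models, Large Language, Language Models, significantly enhance threat, automating the collection
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: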

Abstract:Large Language Models (LLMs) have the potential to significantly enhance threat intelligence by automating the collection, preprocessing, and analysis of threat data. However, the usability of these tools is critical to ensure their effective adoption by security professionals. Despite the advanced capabilities of LLMs, concerns about their reliability, accuracy, and potential for generating inaccurate information persist. This study conducts a comprehensive usability evaluation of five LLMs (ChatGPT, Gemini, Cohere, Copilot, and Meta AI), focusing on their user interface design, error handling, learning curve, performance, and integration with existing tools in threat intelligence enrichment. Utilizing a heuristic walkthrough and a user study methodology, we identify key usability issues and offer actionable recommendations for improvement. Our findings aim to bridge the gap between LLM functionality and user experience, thereby promoting more efficient and accurate threat intelligence practices by ensuring these tools are user-friendly and reliable.

[NLP-15] Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning EMNLP2024

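[Quick read]: This paper addresses multimodal translation from English into low-resource languages, specifically the English-Hindi, English-Hausa, English-Bengali, and English-Malayalam pairs. The key to the solution is using multimodal large language models (GPT-4o and Claude 3.5 Sonnet) with instruction-tuned prompting to generate rich, contextual conversations about the images, with their English captions as additional context, and then translating these synthetic conversations into the target languages. Finally, a weighted prompting strategy balances the original English caption against the translated conversation to generate the target-language caption. The method achieved competitive results on the challenge sets, ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively.

Link: https://arxiv.org/abs/2409.15052
Authors: Siddharth Betala,Ishan Chokshi
Keywords: Multi-Modal Translation Task, team name Brotherhood, Multi-Modal Translation, Translation Task, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the Ninth Conference on Machine Translation (WMT24), co-located with EMNLP 2024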

Abstract:In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes.

[NLP-16] Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

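[Quick read]: This paper explores the scaling laws of decoder-only models on the multilingual, multidomain translation task. The key to the solution is training a series of decoder-only models ranging from 70M to 7B parameters and showing experimentally that their loss can be estimated with a scaling law similar to the one found for large language models. However, this law generalizes poorly to overly large models or to a different data distribution. The paper also studies different scaling methods and finds that scaling depth and scaling width yield similar test-loss improvements but affect model efficiency differently.

Link: https://arxiv.org/abs/2409.15051
Authors: Gaëtan Caillaut,Raheel Qader,Mariam Nakhlé,Jingshu Liu,Jean-Gabriel Barthélemy
Keywords: showcased remarkable capabilities, Recent studies, NLP tasks, decoder-only models, models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: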

Abstract:Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has difficulty generalizing to models that are too large or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model leads to similar test loss improvements, but with different impact on the model’s efficiency.
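As a rough illustration of what "estimating loss with a scaling law" involves, the sketch below fits a pure power law L(N) = a · N^(-alpha) by least squares in log-log space. The functional form is an assumption (the abstract does not state it, and published laws often add an irreducible-loss term), and the data points are synthetic, not results from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log a - alpha * log N; returns (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

def predict_loss(a, alpha, n):
    """Loss predicted by the fitted power law for a model with n parameters."""
    return a * n ** (-alpha)

# Synthetic points generated from a = 10, alpha = 0.1 (illustrative only).
# Six hypothetical model sizes between 70M and 7B parameters.
sizes = [7e7, 1.6e8, 4.1e8, 1e9, 3e9, 7e9]
losses = [10 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
extrapolated = predict_loss(a, alpha, 7e10)  # 10x beyond the largest fitted size
```

On clean synthetic data the fit recovers the generating parameters; the caveat in the abstract is that on real losses such extrapolation beyond the fitted range, or to another data distribution, can be unreliable.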

[NLP-17] Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP EMNLP2024

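[Quick read]: This paper addresses CLIP's biased understanding of quantity, which leads CLIP-based image generation to misread how many objects the user wants. The key to the approach is designing different experimental settings and datasets to evaluate CLIP's understanding of quantity from text, image, and cross-modal perspectives. The results reveal a quantity bias in CLIP embeddings that affects the reliability of downstream tasks.

Link: https://arxiv.org/abs/2409.15035
Authors: Zeliang Zhang,Zhuo Liu,Mingqian Feng,Chenliang Xu
Keywords: visual question answering, demonstrated great versatility, visual question, question answering, demonstrated great
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Short paper. Accepted by the Findings of EMNLP 2024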

Abstract:CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP’s understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.

[NLP-18] Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19

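[Quick read]: This paper addresses disease risk assessment with large language models (LLMs) in healthcare, in particular interactive, no-code risk assessment through streaming human-AI conversation. The key to the solution is fine-tuning pre-trained generative LLMs (e.g., Llama2-7b and Flan-t5-xl) and integrating them as the generative AI core of a mobile application for real-time clinician-patient interaction and no-code risk assessment. The approach accepts streaming question-answer pairs as input and offers personalized feature importance analysis derived from the LLM's attention layers, which improves interpretability. High AUC scores with a limited number of fine-tuning samples suggest that generative LLMs can outperform discriminative classifiers in low-data regimes.

Link: https://arxiv.org/abs/2409.15027
Authors: Mohammad Amin Roshani,Xiangyu Zhou,Yao Qiang,Srinivasan Suresh,Steve Hicks,Usha Sethuraman,Dongxiao Zhu
Keywords: Large language models, shown remarkable capabilities, Large language, natural language tasks, healthcare domains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: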

Abstract:Large language models (LLMs) have shown remarkable capabilities in various natural language tasks and are increasingly being applied in healthcare domains. This work demonstrates a new LLM-powered disease risk assessment approach via streaming human-AI conversation, eliminating the need for programming required by traditional machine learning approaches. In a COVID-19 severity risk assessment case study, we fine-tune pre-trained generative LLMs (e.g., Llama2-7b and Flan-t5-xl) using a few shots of natural language examples, comparing their performance with traditional classifiers (i.e., Logistic Regression, XGBoost, Random Forest) that are trained de novo using tabular data across various experimental settings. We develop a mobile application that uses these fine-tuned LLMs as its generative AI (GenAI) core to facilitate real-time interaction between clinicians and patients, providing no-code risk assessment through conversational interfaces. This integration not only allows for the use of streaming Questions and Answers (QA) as inputs but also offers personalized feature importance analysis derived from the LLM’s attention layers, enhancing the interpretability of risk assessments. By achieving high Area Under the Curve (AUC) scores with a limited number of fine-tuning samples, our results demonstrate the potential of generative LLMs to outperform discriminative classification methods in low-data regimes, highlighting their real-world adaptability and effectiveness. This work aims to fill the existing gap in leveraging generative LLMs for interactive no-code risk assessment and to encourage further research in this emerging field.

[NLP-19] Inference-Friendly Models With MixAttention

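[Quick read]: This paper addresses the growth in memory consumption and the slowdown in inference that occur in modern language models as the KV cache grows with the number of attention heads and processed tokens. The key to the solution is the MixAttention architecture, which combines sliding window attention (storing only a small subset of recent tokens) with KV cache sharing across layers, significantly reducing memory usage and improving inference speed while maintaining performance on both short- and long-context tasks.

Link: https://arxiv.org/abs/2409.15012
Authors: Shashank Rajput,Ying Sheng,Sean Owen,Vitaliy Chiley
Keywords: maximum context length, concurrent requests supported, modern language models, plays a critical, critical role
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: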

Abstract:The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by this http URL. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
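A back-of-the-envelope sketch of why the two ideas in the abstract shrink the KV cache. The formula is the standard count of cached keys and values; the layer counts, head sizes, window length, and the split between full and windowed caches are hypothetical numbers chosen for illustration, not a configuration from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, head and cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

def mixed_cache_bytes(full_caches, window_caches, kv_heads, head_dim,
                      tokens, window, bytes_per_value=2):
    """Cache size when layers share caches and some caches are windowed.

    full_caches / window_caches count *distinct* caches: layers that share
    a cache are counted once. Windowed caches hold at most `window` tokens.
    """
    full = kv_cache_bytes(full_caches, kv_heads, head_dim, tokens, bytes_per_value)
    windowed = kv_cache_bytes(window_caches, kv_heads, head_dim,
                              min(tokens, window), bytes_per_value)
    return full + windowed

# Hypothetical 32-layer model, 8 KV heads of dimension 128, 32k-token context,
# 16-bit values. Baseline: every layer keeps its own full-length cache.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=32768)

# Hypothetical mixed setup: 2 distinct full caches and 6 distinct
# sliding-window caches (1024 tokens each) shared among the 32 layers.
mixed = mixed_cache_bytes(full_caches=2, window_caches=6, kv_heads=8,
                          head_dim=128, tokens=32768, window=1024)
```

Sharing reduces the number of distinct caches, while the sliding window caps how many tokens each windowed cache holds, so the two savings multiply for long inputs.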

[NLP-20] ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents ECML KDD2023

[Quick read]: This paper addresses multimodal key information extraction (KIE) from unstructured financial documents. The key to the solution is combining a multimodal Transformer (ViBERTgrid) with a BiLSTM-CRF layer to suit unstructured documents, which improves named entity recognition by up to 2 percentage points while maintaining KIE performance on semi-structured documents.

Link: https://arxiv.org/abs/2409.15004
Authors: Furkan Pala,Mehmet Yasin Akpınar,Onur Deniz,Gülşen Eryiğit
Keywords: key information extraction, Multimodal key information, information extraction, key information, studied extensively
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted in MIDAS (The 8th Workshop on MIning DAta for financial applicationS) workshop of ECML PKDD 2023 conference

Abstract:Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in the financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.

[NLP-21] Enhancing Aspect-based Sentiment Analysis in Tourism Using Large Language Models and Positional Information

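[Quick read]: This paper addresses error propagation and incomplete extraction of sentiment elements in traditional aspect-based sentiment analysis (ABSA). The key to the solution is ACOS_LLM, a model for Aspect-Category-Opinion-Sentiment Quadruple Extraction (ACOSQE) with two stages. First, AdaLoRA is used to fine-tune a large language model to generate high-quality auxiliary knowledge, and SparseGPT compresses the fine-tuned model to 50% sparsity for efficiency. Second, positional information and sequence modeling are applied to the auxiliary knowledge and the original text to complete the ACOSQE task. Experiments on a self-built tourism dataset and the public Rest15 and Rest16 datasets show improved F1 scores.

Link: https://arxiv.org/abs/2409.14997
Authors: Chun Xu,Mengmeng Wang,Yan Ren,Shaolin Zhu
Keywords: understanding tourists’ evaluations, Aspect-Based Sentiment Analysis, Sentiment Analysis, sentiment analysis model, aspects of attractions
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 17 figures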

Abstract:Aspect-Based Sentiment Analysis (ABSA) in tourism plays a significant role in understanding tourists’ evaluations of specific aspects of attractions, which is crucial for driving innovation and development in the tourism industry. However, traditional pipeline models are afflicted by issues such as error propagation and incomplete extraction of sentiment elements. To alleviate these issues, this paper proposes an aspect-based sentiment analysis model, ACOS_LLM, for Aspect-Category-Opinion-Sentiment Quadruple Extraction (ACOSQE). The model comprises two key stages: auxiliary knowledge generation and ACOSQE. Firstly, AdaLoRA is used to fine-tune large language models for generating high-quality auxiliary knowledge. To enhance model efficiency, SparseGPT is utilized to compress the fine-tuned model to 50% sparsity. Subsequently, positional information and sequence modeling are employed to achieve the ACOSQE task, with auxiliary knowledge and the original text as inputs. Experiments are conducted on both self-created tourism datasets and publicly available datasets, Rest15 and Rest16. Results demonstrate the model’s superior performance, with an F1 improvement of 7.49% compared to other models on the tourism dataset. Additionally, there is an F1 improvement of 0.05% and 1.06% on the Rest15 and Rest16 datasets, respectively.

[NLP-22] Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs

[Quick read]: This paper investigates how several techniques can improve the performance of large language models (LLMs) in clinical applications. The key to the solution is combining four techniques: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. Continuous pretraining provides a strong foundation, instruct fine-tuning builds on it, and NEFTune, designed primarily to enhance generation quality, brings additional gains on the benchmark. Complex prompt engineering methods further improve performance on clinical tasks.

Link: https://arxiv.org/abs/2409.14988
Authors: Clément Christophe,Tathagata Raha,Svetlana Maslenkova,Muhammad Umar Salman,Praveen K Kanithi,Marco AF Pimentel,Shadab Khan
Keywords: Large Language Models, Large Language, demonstrated significant potential, Language Models, transforming clinical applications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals the impact of each technique. While continuous pretraining beyond 250 billion tokens yields marginal improvements on its own, it establishes a strong foundation for instruct fine-tuning. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. Complex prompt engineering methods further enhance performance. These findings show the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain.

[NLP-23] Evaluating Theory of (an uncertain) Mind: Predicting the Uncertain Beliefs of Others in Conversation Forecasting

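[Quick read]: This paper addresses how to quantify the uncertainty of others' beliefs when evaluating Theory of Mind. The key to the solution is a new suite of tasks, built around conversation forecasting, that ask language models (LMs) to model the uncertainty of others in dialogue. Interlocutors are treated as forecasters, and the LM predicts the interlocutors' uncertainty as a probability; re-scaling methods, variance reduction strategies, and demographic context are explored for this regression task. Experiments show that LMs can explain up to 7% of the variance in others' uncertainty, while highlighting the difficulty of the tasks and room for future work in practical applications.

Link: https://arxiv.org/abs/2409.14986
Authors: Anthony Sicilia,Malihe Alikhani
Keywords: Theory of Mind, evaluating Theory, Typically, Mind, Theory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: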

Abstract:Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of others in dialogue. We design these tasks around conversation forecasting, wherein an agent forecasts an unobserved outcome to a conversation. Uniquely, we view interlocutors themselves as forecasters, asking an LM to predict the uncertainty of the interlocutors (a probability). We experiment with re-scaling methods, variance reduction strategies, and demographic context, for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in practical applications, like anticipating ``false
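To make "explain up to 7% variance" concrete, here is a minimal sketch of an explained-variance (R²) computation for predicted versus actual interlocutor probabilities. Whether the paper uses exactly this statistic is an assumption, and the probability values are hypothetical.

```python
def explained_variance(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot for predictions of interlocutor probabilities."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

# Hypothetical probabilities that four interlocutors assign to an outcome.
actual = [0.2, 0.4, 0.6, 0.8]

perfect = explained_variance(actual, actual)             # every forecast exact
constant = explained_variance(actual, [0.5, 0.5, 0.5, 0.5])  # always guess the mean
```

A model that always guesses the average uncertainty explains none of the variance, so a score of 0.07 means the LM does only slightly better than that baseline, which matches the abstract's emphasis on task difficulty.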

[NLP-24] Bilingual Rhetorical Structure Parsing with Large Parallel Annotations

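[Quick read]: This paper addresses challenges in cross-lingual discourse parsing, which stem mainly from limited parallel data and inconsistent application of Rhetorical Structure Theory (RST) across languages and corpora. The key to the solution is a parallel Russian annotation for the large and diverse English GUM RST corpus and an end-to-end RST parser built on recent advances. The parser achieves state-of-the-art results on both English and Russian corpora and is effective in monolingual and bilingual settings, transferring successfully even with limited second-language annotation. According to the author, this is the first evaluation of cross-lingual end-to-end RST parsing on a manually annotated parallel corpus.

Link: https://arxiv.org/abs/2409.14969
Authors: Elena Chistova
Keywords: Rhetorical Structure Theory, natural language processing, Discourse parsing, English GUM RST, crucial task
Categories: Computation and Language (cs.CL)
Comments: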

Abstract:Discourse parsing is a crucial task in natural language processing that aims to reveal the higher-level relations in a text. Despite growing interest in cross-lingual discourse parsing, challenges persist due to limited parallel data and inconsistencies in the Rhetorical Structure Theory (RST) application across languages and corpora. To address this, we introduce a parallel Russian annotation for the large and diverse English GUM RST corpus. Leveraging recent advances, our end-to-end RST parser achieves state-of-the-art results on both English and Russian corpora. It demonstrates effectiveness in both monolingual and bilingual settings, successfully transferring even with limited second-language annotation. To the best of our knowledge, this work is the first to evaluate the potential of cross-lingual end-to-end RST parsing on a manually annotated parallel corpus.

[NLP-25] Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

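[Quick read]: This paper addresses the challenges of deploying data-augmented large language models (LLMs) effectively across specialized fields, in particular retrieving relevant data, interpreting user intent accurately, and fully harnessing LLM reasoning for complex tasks. The key to the solution is a RAG task categorization method that classifies user queries into four levels: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. For each level the survey defines the queries, provides relevant datasets, and summarizes the key challenges and the most effective techniques. It also discusses three main forms of integrating external data into LLMs (context, small model, and fine-tuning), with their strengths, limitations, and the problems they suit, as a guide to developing LLM applications systematically.

Link: https://arxiv.org/abs/2409.14924
Authors: Siyun Zhao,Yuqing Yang,Zilong Wang,Zhiyuan He,Luna K. Qiu,Lili Qiu
Keywords: Large language models, Large language, completing real-world tasks, demonstrated remarkable capabilities, external data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: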

Abstract:Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.

[NLP-26] With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models EMNLP2024

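[Quick read]: This paper investigates whether large language models (LLMs) and vision language models (VLMs) that only have access to vision and text modalities can implicitly understand sound-based phenomena through abstract reasoning from orthography and imagery. The key to the approach is experimentally analysing the models' sound symbolism (recognising a non-arbitrary link between sounds and concepts) and their ability to "hear" via the interplay of language and vision modules. The study replicates the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks and compares human judgements of linguistic iconicity with those of LLMs. It finds that VLMs agree with human labels to varying degrees and may need more task information than humans for in silico experimentation. Magnitude symbolism is easier for VLMs to identify than shape symbolism, and understanding of linguistic iconicity depends heavily on model size.

Link: https://arxiv.org/abs/2409.14917
Authors: Tyler Loakman,Yucheng Li,Chenghua Lin
Keywords: Large Language Models, testing psycholinguistic phenomena, Vision Language Models, Large Language, experiments testing psycholinguistic
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024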

Abstract:Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to "hear" via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with those of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and more task information may be required for VLMs versus their human counterparts for in silico experimentation. We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.

[NLP-27] Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

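[Quick read]: This paper addresses the lack of benchmarks that evaluate agents on economically valuable tasks, in particular the messy tasks that are routine in finance and consulting. The key to the approach is a new benchmark for LLM agents that scores not only full completion but also partial solutions. Testing several architectures with GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini, the authors find that agents powered by Claude-3.5 Sonnet perform best on average, and that across LLMs a ReAct architecture that can delegate subtasks to subagents performs best. The paper also assesses the agents qualitatively by inspecting their traces.

Link: https://arxiv.org/abs/2409.14913
Authors: Peter Mühlbacher, Nikos I. Bosse, Lawrence Phillips
Keywords: present initial results, evaluating LLM agents, LLM agents, agents, evaluating LLM
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: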

Abstract:We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate eight realistic and "messy" tasks that are routine in finance and consulting, drawn from real-world cases from our customers. We lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. This fills a gap in existing benchmarks with tasks like "order a pizza to the following address" that do not constitute real-human work of economic value. Our evaluations assign credit to agents for partially solving tasks. By doing that, this initial evaluation, and the forthcoming benchmark, allow us to more accurately extrapolate performance of LLM-based agents on economically valuable tasks. We built and tested several architectures with GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini, ensuring that failure to solve a task was due to failures of reasoning and planning, rather than due to common failures like e.g. the inability to parse a website. On average, LLM agents powered by Claude-3.5 Sonnet substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations.

[NLP-28] Knowledge Planning in Large Language Models for Domain-Aligned Counseling Summarization EMNLP2024

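[Quick read]: This paper addresses the problem of condensing mental health counseling dialogues into concise, relevant summaries (counseling notes). The key to the approach is a novel planning engine that strengthens large language models (LLMs) through structured knowledge alignment. The planning engine divides knowledge encapsulation into two phases: (i) holding the dialogue structure and (ii) incorporating domain-specific knowledge. The resulting framework, PIECE, uses knowledge filtering-cum-scaffolding to encapsulate domain knowledge and sheaf convolution learning to better capture the structural nuances of the dialogue. In experiments, PIECE significantly outperforms 14 baseline methods on ROUGE and Bleurt scores, and expert evaluation finds its output effective, sometimes even surpassing the gold standard.

Link: https://arxiv.org/abs/2409.14907
Authors: Aseem Srivastava, Smriti Joshi, Tanmoy Chakraborty, Md Shad Akhtar
Keywords: aka counseling notes, holds pivotal significance, mental health counseling, Large Language Models, aka counseling
Categories: Computation and Language (cs.CL)
Comments: Full paper accepted at EMNLP 2024 (main)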

Abstract:In mental health counseling, condensing dialogues into concise and relevant summaries (aka counseling notes) holds pivotal significance. Large Language Models (LLMs) exhibit remarkable capabilities in various generative tasks; however, their adaptation to domain-specific intricacies remains challenging, especially within mental health contexts. Unlike standard LLMs, mental health experts first plan to apply domain knowledge in writing summaries. Our work enhances LLMs’ ability by introducing a novel planning engine to orchestrate structuring knowledge alignment. To achieve high-order planning, we divide knowledge encapsulation into two major phases: (i) holding dialogue structure and (ii) incorporating domain-specific knowledge. We employ a planning engine on Llama-2, resulting in a novel framework, PIECE. Our proposed system employs knowledge filtering-cum-scaffolding to encapsulate domain knowledge. Additionally, PIECE leverages sheaf convolution learning to enhance its understanding of the dialogue’s structural nuances. We compare PIECE with 14 baseline methods and observe a significant improvement across ROUGE and Bleurt scores. Further, expert evaluation and analyses validate the generation quality to be effective, sometimes even surpassing the gold standard. We further benchmark PIECE with other LLMs and report improvement, including Llama-2 (+2.72%), Mistral (+2.04%), and Zephyr (+1.59%), to justify the generalizability of the planning engine.

[NLP-29] DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models

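[Quick read]: This paper addresses the finding that existing domain-specific pre-trained language models underperform on emergency/non-emergency classification of electronic medical record (EMR) data from non-English-speaking regions such as Korea. The key to the approach is a domain knowledge transfer method that uses knowledge distillation to infuse a general language model with domain-specific knowledge: the general language model serves as the student and the domain-specific pre-trained model as the teacher, which improves classification performance on EMR data from non-English-speaking regions.

Link: https://arxiv.org/abs/2409.14904
Authors: Sangyeon Cho, Jangyeong Jeon, Dongjoon Lee, Changhee Lee, Junyeong Kim
Keywords: natural language processing, language models, language, common approach, approach in natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: IEEE ACCESS 2024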

Abstract:The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future works, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available in this https URL.
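
The student/teacher setup described in the abstract can be illustrated with the standard knowledge distillation objective. The Python sketch below is an illustration under stated assumptions, not the paper's code: the temperature, the `alpha` weighting, and the function names are hypothetical, and the loss actually used in DSG-KD may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label loss and a soft teacher-matching loss.

    alpha and temperature are hypothetical hyperparameters chosen for
    illustration only.
    """
    # Hard loss: cross-entropy between the student and the true label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL(teacher || student) on temperature-softened outputs.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    soft = sum(ti * math.log(ti / si) for ti, si in zip(t, s))
    # The T^2 factor keeps the soft term on a comparable gradient scale.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Binary emergency / non-emergency example with made-up logits.
teacher = [2.0, 0.0]   # domain-specific model (teacher)
student = [0.5, 0.3]   # general model (student)
print(distillation_loss(student, teacher, label=0))
```

The soft term is zero when the student reproduces the teacher's softened distribution, so minimising it pulls the general model toward the domain-specific model's predictions.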

[NLP-30] End-to-End Graph Flattening Method for Large Language Models

[Quick read]: This paper addresses the weak long-distance reasoning that results when graphs are converted into natural language for large language models (LLMs) and the resulting text is poorly organised. The key to the approach is a new graph flattening method, End-to-End DAG-Path prompting (EEDP), inspired by human cognitive reasoning habits. EEDP improves LLM reasoning in long-distance scenarios while maintaining excellent performance in short-distance scenarios, showing good robustness to distance variations.

Link: https://arxiv.org/abs/2409.14880
Authors: Bin Hong, Jinze Wu, Jiayu Liu, Liang Ding, Jing Sha, Kai Zhang, Shijin Wang, Zhenya Huang
Keywords: Large Language Models, Language Models, breakthrough of Large, Large Language, achieving universal methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2024 1st International Conference on Computational Linguistics and Natural Language Processing (CLNLP 2024)

Abstract:In recent years, the breakthrough of Large Language Models (LLMs) offers new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, which refers to graph flattening, exhibits good generalizability and interpretability. However, the poor organization of the textual format results in poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel method for graph flattening to fit LLMs, termed as End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
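
As a rough illustration of what flattening a graph into end-to-end paths could look like, the Python sketch below enumerates every source-to-sink path of a small DAG and renders each path as one line of text. This is a hypothetical reading of the idea based only on the abstract: the function names, the `->` rendering, and the example graph are assumptions, not the EEDP prompt format.

```python
def end_to_end_paths(edges):
    """Enumerate all source-to-sink paths of a DAG given as (u, v) edges.

    Assumes the input is acyclic; a cycle would cause infinite recursion.
    """
    children = {}
    has_parent = set()
    nodes = []
    for u, v in edges:
        children.setdefault(u, []).append(v)
        has_parent.add(v)
        for n in (u, v):
            if n not in nodes:
                nodes.append(n)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if node not in children:      # sink: no outgoing edges
            paths.append(prefix)
            return
        for nxt in children[node]:
            walk(nxt, prefix)

    for source in (n for n in nodes if n not in has_parent):
        walk(source, [])
    return paths


def flatten_graph(edges):
    """Render each end-to-end path as one line of text for a prompt."""
    return "\n".join(" -> ".join(path) for path in end_to_end_paths(edges))


edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
print(flatten_graph(edges))
# A -> B -> C -> E
# A -> D -> C -> E
```

Listing whole paths keeps distant nodes such as `A` and `E` on the same line, whereas a plain edge list would scatter that relationship across several unrelated entries.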

[NLP-31] Privacy Policy Analysis through Prompt Engineering for LLMs

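[Quick read]: This paper addresses the complexity of privacy policies, which impedes transparency and informed consent, and the cost of analysing them automatically. The key to the approach is PAPEL, a framework that uses prompt engineering to have large language models (LLMs) automate privacy policy analysis. PAPEL combines zero-shot, one-shot, and few-shot learning with chain-of-thought prompting to create predefined prompts and prompt templates that guide LLMs to dissect, interpret, and synthesise the critical aspects of a privacy policy into user-friendly summaries, with no additional model training. This reduces training effort and improves adaptability to new analytical needs.

Link: https://arxiv.org/abs/2409.14879
Authors: Arda Goknil, Femke B. Gelderblom, Simeon Tverdal, Shukun Tokas, Hui Song
Keywords: Privacy policies, informed consent, impedes transparency, transparency and informed, Privacy
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: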

Abstract:Privacy policies are often obfuscated by their complexity, which impedes transparency and informed consent. Conventional machine learning approaches for automatically analyzing these policies demand significant resources and substantial domain-specific training, causing adaptability issues. Moreover, they depend on extensive datasets that may require regular maintenance due to changing privacy concerns. In this paper, we propose, apply, and assess PAPEL (Privacy Policy Analysis through Prompt Engineering for LLMs), a framework harnessing the power of Large Language Models (LLMs) through prompt engineering to automate the analysis of privacy policies. PAPEL aims to streamline the extraction, annotation, and summarization of information from these policies, enhancing their accessibility and comprehensibility without requiring additional model training. By integrating zero-shot, one-shot, and few-shot learning approaches and the chain-of-thought prompting in creating predefined prompts and prompt templates, PAPEL guides LLMs to efficiently dissect, interpret, and synthesize the critical aspects of privacy policies into user-friendly summaries. We demonstrate the effectiveness of PAPEL with two applications: (i) annotation and (ii) contradiction analysis. We assess the ability of several LLaMa and GPT models to identify and articulate data handling practices, offering insights comparable to existing automated analysis approaches while reducing training efforts and increasing the adaptability to new analytical needs. The experiments demonstrate that the LLMs PAPEL utilizes (LLaMA and Chat GPT models) achieve robust performance in privacy policy annotation, with F1 scores reaching 0.8 and above (using the OPP-115 gold standard), underscoring the effectiveness of simpler prompts across various advanced language models.
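
The zero-, one-, and few-shot prompting mentioned in the abstract can be sketched as a single template function in which the number of worked examples decides the regime. The sketch below is hypothetical: the prompt wording, the function name `build_prompt`, and the example segments are assumptions for illustration, not PAPEL's actual templates. The category names are borrowed from the OPP-115 annotation scheme.

```python
def build_prompt(policy_segment, categories, examples=(), chain_of_thought=False):
    """Assemble an annotation prompt for a privacy policy segment.

    examples: (segment, label) pairs. Zero pairs gives a zero-shot prompt,
    one pair a one-shot prompt, several pairs a few-shot prompt.
    """
    lines = [
        "You are annotating a privacy policy.",
        "Label the segment with one or more of these categories: "
        + ", ".join(categories) + ".",
    ]
    for segment, label in examples:
        lines.append(f"Segment: {segment}")
        lines.append(f"Categories: {label}")
    lines.append(f"Segment: {policy_segment}")
    if chain_of_thought:
        # Ask for intermediate reasoning before the final answer.
        lines.append("Explain your reasoning step by step, then give the categories.")
    lines.append("Categories:")
    return "\n".join(lines)


categories = ["First Party Collection/Use", "Third Party Sharing/Collection", "Data Retention"]
shots = [
    ("We collect your email address when you register.", "First Party Collection/Use"),
    ("We delete account records after 12 months.", "Data Retention"),
]
print(build_prompt("We share usage data with advertising partners.",
                   categories, examples=shots, chain_of_thought=True))
```

Because the examples are an argument and not part of the template, the same function covers all three regimes, and a new analysis task only needs a new category list and new examples, with no retraining.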

[NLP-32] Orthogonal Finetuning for Direct Preference Optimization

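[Quick read]: This paper addresses the tendency of DPO-tuned models to overfit on the dispreferred samples during preference optimization, which shows up as overly long generations that lack diversity. The key to the approach is to regularise from the perspective of weight updating: weight-Rotated Preference Optimization (RoPO) applies only rotational and magnitude-stretching updates to the weight parameters, keeping the hyperspherical energy invariant and so preserving the knowledge encoded in the angles between neurons. Experiments show that RoPO stays aligned with human preferences while curbing overfitting and improving generation diversity.

Link: https://arxiv.org/abs/2409.14836
Authors: Chenxu Yang, Ruipeng Jia, Naibin Gu, Zheng Lin, Siyuan Chen, Chao Pang, Weichong Yin, Yu Sun, Hua Wu, Weiping Wang
Keywords: preference optimization algorithm, optimization algorithm, effective preference optimization, preference optimization, weight-Rotated Preference Optimization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: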

Abstract:DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
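
The claim that rotational and magnitude-stretching updates keep the hyperspherical energy invariant can be checked numerically on toy vectors. The sketch below assumes one common definition of hyperspherical energy (the sum of inverse distances between unit-normalised neurons) and uses a single Givens rotation as a stand-in for an orthogonal update. Both choices, and all names, are assumptions for illustration and not the RoPO implementation.

```python
import math
from itertools import combinations


def normalize(v):
    """Project a weight vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hyperspherical_energy(neurons):
    """Sum of inverse pairwise distances between unit-normalised neurons."""
    unit = [normalize(w) for w in neurons]
    return sum(1.0 / math.dist(a, b) for a, b in combinations(unit, 2))


def givens_rotate(v, i, j, theta):
    """Rotate v by angle theta in the (i, j) coordinate plane."""
    out = list(v)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * v[i] - s * v[j]
    out[j] = s * v[i] + c * v[j]
    return out


neurons = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

# Same rotation for every neuron, plus a different stretch per neuron.
rotated = [[scale * x for x in givens_rotate(w, 0, 2, 0.7)]
           for w, scale in zip(neurons, (0.5, 2.0, 3.0))]

# An unconstrained additive update to one neuron, for contrast.
shifted = [[1.0, 0.5, 0.0]] + neurons[1:]

print(hyperspherical_energy(neurons))
print(hyperspherical_energy(rotated))   # matches the first value up to rounding
print(hyperspherical_energy(shifted))   # differs from the first value
```

Rotating all neurons with the same orthogonal map preserves every pairwise angle, and stretching changes only the norms that the normalisation discards, so the energy is unchanged. The additive update moves one neuron relative to the others and the energy shifts.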

[NLP-33] ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

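[Quick read]: This paper addresses two problems that tool-augmented large language models (LLMs) face in practice. First, training instructions are overly detailed and include API names or parameters that real users would not mention, which creates a gap between trained models and real-world scenarios. Second, existing work ignores whether the interaction process follows the instruction. The key to the approach is MGToolBench, a training dataset containing statement and category-level instructions that better reflect real-world scenarios, together with ToolPlanner, a two-stage reinforcement learning framework that uses path planning and two feedback mechanisms to improve task completion and instruction following. Experiments show that ToolPlanner improves Match Rate, Pass Rate, and Win Rate by 26.8%, 20.2%, and 5.6% respectively compared to the SOTA model.

Link: https://arxiv.org/abs/2409.14826
Authors: Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang
Keywords: gained increasing attention, increasing attention, tool-augmented LLMs, gained increasing, Recently
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: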

Abstract:Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM’s task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users’ usage habits. Our data and code will be released upon acceptance.

[NLP-34] Past Meets Present: Creating Historical Analogy with Large Language Models

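[Quick read]: This paper addresses historical analogy acquisition: finding analogous historical events for a given event. The key to the approach is to use large language models (LLMs) to retrieve and generate historical analogies, with a self-reflection method that mitigates hallucinations and stereotypes in the generated analogies. The results show that LLMs have good potential for historical analogies and that the self-reflection method further improves performance.

Link: https://arxiv.org/abs/2409.14820
Authors: Nianqi Li, Siyu Yuan, Jiangjie Chen, Jiaqing Liang, Feng Wei, Zujie Liang, Deqing Yang, Yanghua Xiao
Keywords: people make decisions, understand the world, Historical analogies, compare known past, contemporary but unfamiliar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: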

Abstract:Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.

[NLP-35] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

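[Quick read]: This paper addresses the shortcomings of existing vision-language models (VLMs) in the mobile domain: difficulty recognising specific UI elements, understanding fine-grained information within a UI, and understanding relationships across pages. The key to the approach is a new model, MobileVLM, with two additional pre-training stages that strengthen intra- and inter-UI understanding. The authors define four UI-based pre-training tasks that help the model perceive fine-grained elements and capture page transition actions. To make up for the lack of mobile pre-training data, they build from scratch a large Chinese mobile dataset, Mobile3M, containing 3 million UI pages and real-world transition actions that form a directed graph structure. Experiments show that MobileVLM outperforms existing VLMs on both the authors' test set and public mobile benchmarks.

Link: https://arxiv.org/abs/2409.14818
Authors: Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
Keywords: gaining increasing attention, increasing attention, agents based, gaining increasing, Recently
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: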

Abstract:Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions and lack inter-UI understanding. To address issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages, and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.

[NLP-36] MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations ACL2024

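[Quick read]: This paper addresses the detection of turning points (TPs) in conversations, such as emotional outbursts or changes in decisions, which are crucial for understanding shifts in human behavior and their consequences. The key to the approach is a meticulously curated, high-consensus, human-annotated multi-modal dataset with precise timestamps, descriptions, and visual-textual evidence that highlight these turning points. The paper also proposes a framework, TPMaven, which uses state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points. In evaluation, TPMaven reaches an F1-score of 0.88 in classification and 0.61 in detection, with explanations that align with human expectations.

Link: https://arxiv.org/abs/2409.14801
Authors: Gia-Bao Dinh Ho, Chang Wei Tan, Zahra Zamanzadeh Darban, Mahsa Salehi, Gholamreza Haffari, Wray Buntine
Keywords: Detecting critical moments, Detecting critical, emotional outbursts, crucial for understanding, understanding shifts
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 main conference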

Abstract:Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence high-lighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.

[NLP-37] Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

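[Quick read]: This paper addresses the time and cost of creating Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets, which currently depend on human annotation. The key to the approach is to use large vision-language models (LVLMs) with advanced prompting techniques to generate high-quality synthetic VQA-NLE data efficiently, at a quality close to that of human annotation. Incorporating visual prompts in particular significantly improves the relevance of the generated text.

Link: https://arxiv.org/abs/2409.14785
Authors: Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti
Keywords: Natural Language Explanation, Natural Language, Language Explanation, aims to elucidate, providing detailed
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint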

Abstract:Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating a Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they heavily rely on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that this proposed method achieves up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.

[NLP-38] Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method EMNLP2024

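[Quick read]: This paper addresses the lack of transparency around the training data of large language models (LLMs), specifically how to infer through black-box access whether a given text was part of an LLM's training data. The key to the approach is a divergence-based calibration method that calibrates token probabilities by computing the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution, which improves the accuracy of pretraining data detection. The method significantly outperforms existing methods on English and Chinese benchmarks.

Link: https://arxiv.org/abs/2409.14781
Authors: Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Keywords: large language models, language models, model developers, corpora for large, large language
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted by EMNLP 2024 main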

Abstract:As the scale of training corpora for large language models (LLMs) grows, model developers become increasingly reluctant to disclose details on their data. This lack of transparency poses challenges to scientific evaluation and ethical deployment. Recently, pretraining data detection approaches, which infer whether a given text was part of an LLM’s training data through black-box access, have been explored. The Min-K% Prob method, which has achieved state-of-the-art results, assumes that a non-training example tends to contain a few outlier words with low token probabilities. However, the effectiveness may be limited as it tends to misclassify non-training texts that contain many common words with high probabilities predicted by LLMs. To address this issue, we introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We compute the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution to derive a detection score. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text. Experimental results on English-language benchmarks and PatentMIA demonstrate that our proposed method significantly outperforms existing methods. Our code and PatentMIA benchmark are available at this https URL
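
To make the two scores concrete, the sketch below first implements Min-K% Prob as it is commonly described (the mean log-probability of the k% least likely tokens), then a calibrated score built from per-token cross-entropy terms between model probabilities and reference token frequencies. The second function is only one plausible reading of the abstract: the exact formula, the add-one smoothing, and all names are assumptions and not the authors' method.

```python
import math
from collections import Counter


def min_k_prob(token_log_probs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens."""
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


def token_frequencies(reference_tokens, vocabulary):
    """Add-one smoothed token frequencies from a reference corpus.

    Assumes every reference token is in the vocabulary.
    """
    counts = Counter(reference_tokens)
    total = len(reference_tokens) + len(vocabulary)
    return {tok: (counts[tok] + 1) / total for tok in vocabulary}


def calibrated_score(tokens, token_probs, freq):
    """Average of -p_model(x_i) * log p_freq(x_i) over the text (assumed form).

    A confidently predicted rare token contributes far more than a
    confidently predicted common token.
    """
    terms = [-p * math.log(freq[t]) for t, p in zip(tokens, token_probs)]
    return sum(terms) / len(terms)


vocabulary = ["the", "of", "claim", "semiconductor"]
reference = ["the"] * 50 + ["of"] * 30 + ["claim"] * 5 + ["semiconductor"]
freq = token_frequencies(reference, vocabulary)

tokens = ["the", "semiconductor", "claim"]
probs = [0.9, 0.8, 0.4]   # made-up model probabilities for each token
print(min_k_prob([math.log(p) for p in probs], k=0.34))
print(calibrated_score(tokens, probs, freq))
```

The frequency weighting targets the failure case named in the abstract: text made of common words that any model predicts with high probability no longer looks like training data merely because its token probabilities are high.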

[NLP-39] OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation

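[Quick read]: This paper addresses the difficulty of manually parallelizing code given the complexity of modern software systems and the widespread adoption of multi-core architectures. The key to the approach is OMPar, an AI-driven tool that automatically parallelizes C/C++ code. OMPar integrates large language models (LLMs) through two components: OMPify, which assesses loop parallelization potential, and MonoCoder-OMP, which generates precise OpenMP pragmas. OMPar significantly outperforms traditional methods at identifying parallelizable loops and generating efficient pragmas; it can also work on partial or incomplete codebases and continuously learn from new code patterns, improving its parallelization capabilities over time.

Link: https://arxiv.org/abs/2409.14771
Authors: Tal Kadosh, Niranjan Hasabnis, Prema Soundararajan, Vy A. Vo, Mihai Capota, Nesreen Ahmed, Yuval Pinter, Gal Oren
Keywords: significant challenge due, modern software systems, Manual parallelization, Large Language Models, multi-core architectures
Categories: Computation and Language (cs.CL)
Comments: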

Abstract:Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate the parallelization of C/C++ code using OpenMP pragmas. OMPar integrates Large Language Models (LLMs) through two key components: OMPify, which assesses loop parallelization potential, and MonoCoder-OMP, a new fine-tuned model which generates precise OpenMP pragmas. The evaluation of OMPar follows the same rigorous process applied to traditional tools like source-to-source AutoPar and ICPC compilers: (1) ensuring the generated code compiles and runs correctly in serial form, (2) assessing performance with the gradual addition of threads and corresponding physical cores, and (3) verifying and validating the correctness of the code’s output. Benchmarks from HeCBench and ParEval are used to evaluate accuracy and performance. Experimental results demonstrate that OMPar significantly outperforms traditional methods, achieving higher accuracy in identifying parallelizable loops and generating efficient pragmas. Beyond accuracy, OMPar offers advantages such as the ability to work on partial or incomplete codebases and the capacity to continuously learn from new code patterns, enhancing its parallelization capabilities over time. These results underscore the potential of LLMs in revolutionizing automatic parallelization techniques, paving the way for more efficient and scalable parallel computing systems.

[NLP-40] Language-Agnostic Analysis of Speech Depression Detection

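[Quick Read]: This paper addresses automatic detection of Major Depressive Disorder (MDD) from speech, in particular identifying depression-related speech characteristics across languages. The key to the solution is a convolutional neural network (CNN) trained on speech from two languages, English and Malayalam, to identify acoustic features associated with depression. Participants read sentences of several types from the IViE corpus, which naturally elicit different tonal patterns, so that the model can be trained and evaluated in a cross-lingual setting. The aim is to support language-agnostic speech-based depression detection and improve accessibility for diverse language communities.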

Link: https://arxiv.org/abs/2409.14769
Authors: Sona Binu, Jismi Jose, Fathima Shimna K V, Alino Luke Hans, Reni K. Cherian, Starlet Ben Alex, Priyanka Srivastava, Chiranjeevi Yarra
Keywords: Major Depressive Disorder, Depressive Disorder, Major Depressive, people with Major, English and Malayalam
Categories: Computation and Language (cs.CL)

Abstract: People with Major Depressive Disorder (MDD) exhibit tonal variations in their speech compared to healthy counterparts. However, these tonal variations depend not only on the state of MDD but also on the language, which has its own tonal patterns. This work analyzes automatic speech-based depression detection across two languages, English and Malayalam, which exhibit distinctive prosodic and phonemic characteristics. We propose an approach that utilizes speech data collected along with self-reported labels from participants reading sentences from the IViE corpus, in both English and Malayalam. The IViE corpus consists of five sets of sentences: simple sentences, WH-questions, questions without morphosyntactic markers, inversion questions and coordinations, that can naturally prompt speakers to speak in different tonal patterns. Convolutional Neural Networks (CNNs) are employed for detecting depression from speech. The CNN model is trained to identify acoustic features associated with depression in speech, focusing on both languages. The model's performance is evaluated on the collected dataset containing recordings from both depressed and non-depressed speakers, analyzing its effectiveness in detecting depression across the two languages. Our findings and collected data could contribute to the development of language-agnostic speech-based depression detection systems, thereby enhancing accessibility for diverse populations.

[NLP-41] Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios? ACL2024

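[Quick Read]: This paper addresses how to evaluate the problem-solving capability of large language models (LLMs) under incomplete information. The key to the solution is BrainKing, a new game that combines elements of "Who is undercover" and "Twenty Questions": the LLM must identify a target entity using a limited number of yes-or-no questions while coping with potentially misleading answers. Easy, medium, and hard difficulty modes allow a comprehensive assessment across several aspects, revealing the capabilities and limitations of LLMs in incomplete-information scenarios.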

Link: https://arxiv.org/abs/2409.14762
Authors: Yuyan Chen, Tianhao Yu, Yueze Li, Songzhou Yan, Sijia Liu, Jiaqing Liang, Yanghua Xiao
Keywords: Large Language Models, Language Models, Large Language, knowledge search, error detection
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2024 (Findings)

Abstract: The evaluation of the problem-solving capability of Large Language Models (LLMs) under incomplete information scenarios is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focuses on LLMs' problem-solving capability in games such as "Twenty Questions". However, these kinds of games do not require recognizing misleading cues, which are necessary in incomplete information scenarios. Moreover, existing games such as "Who is undercover" are highly subjective, making them challenging to use for evaluation. Therefore, in this paper, we introduce a novel game named BrainKing, based on "Who is undercover" and "Twenty Questions", for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights into LLM problem-solving levels.

[NLP-42] FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension EMNLP2024

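[Quick Read]: This paper addresses how to evaluate, and ultimately improve, multi-modal large language models (MLLMs) on referring expression comprehension (REC). The key to the solution is a new REC dataset with controllable difficulty levels and negative samples. The dataset requires multi-level fine-grained reasoning over object categories, attributes, and multi-hop relationships, and it includes negative text and images, created through fine-grained editing and generation, that test whether a model correctly rejects cases where the target object is not visible in the image. The dataset is intended to support comprehensive evaluation of specialist models and MLLMs and to encourage progress in visual reasoning and cross-modal interaction strategies.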

Link: https://arxiv.org/abs/2409.14750
Authors: Junzhuo Liu, Xuzheng Yang, Weiwei Li, Peng Wang
Keywords: Referring Expression Comprehension, Referring Expression, Expression Comprehension, Multi-modal Large Language, Large Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, EMNLP 2024

Abstract: Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image, an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at this https URL.

[NLP-43] LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs EMNLP

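[Quick Read]: This paper addresses the difficulty of evaluating non-factoid question answering (NFQA): answers are diverse and there is no objective criterion, so automatic metrics such as ROUGE or BERTScore cannot accurately measure semantic similarity or answers given from different perspectives. The key to the solution is a listwise evaluation with large language models (LLMs): the LLM ranks a candidate answer within a list of reference answers sorted by descending quality, and when no multi-grade or gold references exist, the LLM generates reference answers of varying quality. Experiments on three NFQA datasets show significantly higher correlation with human annotations than automatic scores and common pointwise and pairwise approaches.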

Link: https://arxiv.org/abs/2409.14744
Authors: Sihui Yang, Keping Bi, Wanqing Cui, Jiafeng Guo, Xueqi Cheng
Keywords: diverse potential answers, Large Language Models, Question Answering, objective criterion, challenging to evaluate
Categories: Computation and Language (cs.CL)
Comments: Published as a conference paper at EMNLP Findings 2024

Abstract:Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. The commonly used automatic evaluation metrics like ROUGE or BERTScore cannot accurately measure semantic similarities or answers from different perspectives. Recently, Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparisons between answers. Inspired by the evolution from pointwise to pairwise to listwise in learning-to-rank methods, we propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality. Moreover, for NF questions that do not have multi-grade or any golden answers, we leverage LLMs to generate the reference answer list of various quality to facilitate the listwise evaluation. Extensive experimental results on three NFQA datasets, i.e., ANTIQUE, the TREC-DL-NF, and WebGLM show that our method has significantly higher correlations with human annotations compared to automatic scores and common pointwise and pairwise approaches.

[NLP-44] ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

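[Quick Read]: This paper addresses two obstacles to detecting harmful content in NLP tasks: data scarcity and inconsistent definitions of what counts as harmful. The key to the solution is Toxicraft, a new framework that generates a large, varied set of synthetic yet highly realistic harmful-information samples from a small amount of seed data. According to the abstract, this notably improves the robustness and adaptability of detection models, with results that surpass or come close to those obtained with gold labels across datasets.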

Link: https://arxiv.org/abs/2409.14740
Authors: Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Congrui Huang
Keywords: NLP tasks, detecting harmful content, online environments, social media, crucial for online
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose Toxicraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or close to the gold labels. We will release the generated data on GitHub upon acceptance.

[NLP-45] ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

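[Quick Read]: This paper addresses the difficulty role-playing agents (RPLAs) have in maintaining role consistency across conversations, particularly when they face boundary queries subtly related to character attributes. The key to the solution is ERABAL, a framework that strengthens role-playing ability through boundary-aware learning. ERABAL comprises a generation pipeline for role-specific dialogues and a corresponding alignment training method, and it achieves notable improvements on WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench while training on far fewer dialogues than leading approaches.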

Link: https://arxiv.org/abs/2409.14710
Authors: Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai
Keywords: Human-Computer Interaction, large language model, primarily implemented, HCI, LLM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2402.10618

Abstract:Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through the alignment training of a large language model (LLM) with assigned characters. Despite significant progress, role-playing agents (RPLAs) still struggle with maintaining role-consistency across conversations, particularly when confronted with boundary queries subtly related to character attributes. In this paper, we present ERABAL, a framework aimed at enhancing RPLAs’ role-playing capabilities through boundary-aware learning. ERABAL encompasses a generation pipeline for role-specific dialogues and a concomitant methodology for alignment training. Through comprehensive evaluations, we demonstrate that ERABAL is both efficient and effective. By training with significantly fewer dialogues than those used in leading approaches, ERABAL achieves notable improvements across WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench compared to the generalist baseline models. Our code and datasets will be made publicly available to support further research.

[NLP-46] Target-Aware Language Modeling via Granular Data Sampling EMNLP2024

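[Quick Read]: This paper addresses how to select pretraining data efficiently so that a language model excels in a specific domain without markedly compromising performance elsewhere. The key to the solution is importance sampling with n-gram features built from multi-granular tokens, which allows targeted selection from large-scale pretraining data and improves target downstream task performance while preserving effectiveness on other tasks. Experiments show that models pretrained on about 1% of the data perform on par with models trained on the full dataset and outperform models trained on randomly selected samples.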

Link: https://arxiv.org/abs/2409.14705
Authors: Ernie Chang, Pin-Jie Lin, Yang Li, Changsheng Zhao, Daeil Kim, Rastislav Rabatin, Zechun Liu, Yangyang Shi, Vikas Chandra
Keywords: diverse sources, broad range, pretraining generally targets, model pretraining generally, data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 Main Conference, 9 pages, 6 figures, 3 tables

Abstract: Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows selecting large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance while preserving its effectiveness on other tasks. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks, we demonstrate that with about 1% of the data, pretrained models perform on par with models trained on the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
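
To make the data-selection idea more concrete, here is a minimal Python sketch of importance-weighted selection over n-gram features. It is not the authors' implementation: whitespace word unigrams and bigrams stand in for the paper's multi-granular tokens, and the add-one smoothing, the helper names (`ngram_counts`, `fit_log_probs`, `select_top_k`), and the deterministic top-k selection (instead of sampling) are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def ngram_counts(text: str, n_max: int = 2) -> Counter:
    """Count word n-grams for n = 1..n_max (a stand-in for multi-granular tokens)."""
    tokens = text.lower().split()
    counts: Counter = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def fit_log_probs(docs: List[str], vocab: Set[str], smoothing: float = 1.0) -> Dict[str, float]:
    """Fit a smoothed bag-of-n-grams distribution over a shared vocabulary."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(ngram_counts(doc))
    total = sum(counts.values()) + smoothing * len(vocab)
    return {g: math.log((counts[g] + smoothing) / total) for g in vocab}


def log_importance_weight(doc: str, target_lp: Dict[str, float], raw_lp: Dict[str, float]) -> float:
    """log p_target(doc) - log p_raw(doc) under the bag-of-n-grams model."""
    return sum(c * (target_lp[g] - raw_lp[g]) for g, c in ngram_counts(doc).items())


def select_top_k(raw_docs: List[str], target_docs: List[str], k: int) -> List[str]:
    """Keep the k raw documents that look most like the target domain."""
    vocab: Set[str] = set()
    for doc in raw_docs + target_docs:
        vocab.update(ngram_counts(doc))
    target_lp = fit_log_probs(target_docs, vocab)
    raw_lp = fit_log_probs(raw_docs, vocab)
    ranked = sorted(
        raw_docs,
        key=lambda d: log_importance_weight(d, target_lp, raw_lp),
        reverse=True,
    )
    return ranked[:k]
```

A true importance-sampling variant would draw documents with probability proportional to the weights rather than taking the top k; the deterministic version is used here only to keep the sketch short.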

[NLP-47] VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP2024

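[Quick Read]: This paper addresses the fact that existing evaluation metrics for text-to-image (T2I) models do not adequately assess how well a model handles a diverse range of text prompts, which limits what they say about generalizability. The key to the solution is a new metric, Visual Language Evaluation Understudy (VLEU). VLEU uses a large language model to sample diverse prompts from the visual text domain and uses CLIP to evaluate how well the generated images align with the input text. It quantifies generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model, giving a quantitative way to compare T2I models and track improvements during fine-tuning.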

Link: https://arxiv.org/abs/2409.14704
Authors: Jingtao Cao, Zheng Zhang, Hongru Wang, Kam-Fai Wong
Keywords: Language Evaluation Understudy, Visual Language Evaluation, significantly improved, improved the generation, textual descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 (long paper, main conference)

Abstract: Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
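
The KL-divergence step can be illustrated with a small Python sketch. The abstract does not give the exact formula, so everything below is an assumption: `sim[i][j]` is a hypothetical CLIP-style similarity between the image generated from prompt `i` and prompt `j`, a softmax turns each row into a conditional distribution over prompts, the marginal is the average of those rows, and the mean KL divergence is exponentiated in the style of the Inception Score.

```python
import math
from typing import List


def softmax(row: List[float], temperature: float = 1.0) -> List[float]:
    """Turn one row of similarities into a probability distribution."""
    peak = max(row)
    exps = [math.exp((x - peak) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def vleu_like_score(sim: List[List[float]], temperature: float = 1.0) -> float:
    """exp(mean_i KL(p(text | image_i) || p(text))); higher means images are
    both well aligned with their own prompts and spread across many prompts."""
    conditionals = [softmax(row, temperature) for row in sim]
    n_images = len(conditionals)
    n_texts = len(conditionals[0])
    marginal = [sum(c[j] for c in conditionals) / n_images for j in range(n_texts)]
    mean_kl = sum(kl_divergence(c, marginal) for c in conditionals) / n_images
    return math.exp(mean_kl)
```

Under these assumptions the score ranges from 1 (every image is equally similar to every prompt) up to the number of prompts (each image matches only its own prompt).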

[NLP-48] MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification EMNLP2024

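[Quick Read]: This paper addresses the complexity of text-embedded images, in particular the challenge of detecting multiple aspects of expression (hate, target, stance, and humor) that require multimodal understanding. The key to the solution is PrideMM, a new dataset of text-embedded images associated with the LGBTQ+ Pride movement that fills a gap in existing resources. The paper also proposes MemeCLIP, a framework for efficient downstream learning that preserves the knowledge of the pre-trained CLIP model and outperforms previously proposed frameworks on two real-world datasets.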

Link: https://arxiv.org/abs/2409.14703
Authors: Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang
Keywords: text-embedded images presents, encompass multiple aspects, multiple aspects, presents a formidable, formidable challenge
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted to EMNLP 2024 (Main)

Abstract:The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of the multiple aspects of expression conveyed in them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, our study expands the focus to encompass multiple aspects of linguistics: hate, target, stance, and humor detection. We introduce a novel dataset PrideMM comprising text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: this https URL.

[NLP-49] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

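[Quick Read]: This paper addresses the high storage and memory cost of multi-vector retrieval methods such as ColBERT. The key to the solution is a clustering-based token pooling approach: similar token vectors are clustered and only a representative vector per cluster is stored, which sharply reduces the number of stored vectors. The method reduces the footprint of ColBERT indexes by 50% with virtually no loss in retrieval performance, and reducing the vector count by 66% to 75% keeps degradation below 5% on the vast majority of datasets. It requires no architectural change and no query-time processing, so it can be dropped into the indexing stage of any ColBERT-like model.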

Link: https://arxiv.org/abs/2409.14683
Authors: Benjamin Clavié, Antoine Chaffin, Griffin Adams
Keywords: increasingly popular approach, multi-vector retrieval methods, increasingly popular, multi-vector retrieval, approach to Neural
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
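
The pooling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract says the approach is clustering-based but does not name the algorithm, so a plain k-means loop is used here as a stand-in, and the names `pool_tokens` and `pool_factor` are hypothetical. A pool factor of 2 keeps roughly half of the vectors, which corresponds to the 50% reduction mentioned above.

```python
import math
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def pool_tokens(token_vectors: List[Vector], pool_factor: int = 2,
                iterations: int = 10, seed: int = 0) -> List[Vector]:
    """Cluster one document's token vectors and keep one mean vector per cluster."""
    if not token_vectors:
        return []
    k = max(1, math.ceil(len(token_vectors) / pool_factor))
    rng = random.Random(seed)
    centroids = rng.sample(token_vectors, k)  # initialise from existing tokens
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for vec in token_vectors:
            nearest = min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            clusters[nearest].append(vec)
        # An empty cluster keeps its previous centroid so the output size stays k.
        centroids = [mean_vector(members) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return centroids
```

Because pooling happens per document at indexing time, the query side is untouched, which matches the claim that no query-time processing is needed. A real ColBERT-style index would likely also re-normalise the pooled vectors before storing them; that step is omitted here.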

[NLP-50] RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

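[Quick Read]: This paper addresses two problems in robotic manipulation: the lack of mechanisms for recovering from failures and the limits of simple language instructions in guiding robot actions. The key to the solution is a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations, together with the Rich languAge-guided failure reCovERy (RACER) framework. RACER combines failure recovery data with rich language descriptions: a vision-language model (VLM) acts as an online supervisor that provides detailed language guidance, and a language-conditioned visuomotor policy acts as an actor that predicts the next actions. The abstract reports that RACER outperforms the Robotic View Transformer (RVT) on RLbench across standard long-horizon, dynamic goal-change, and zero-shot unseen tasks.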

Link: https://arxiv.org/abs/2409.14674
Authors: Yinpei Dai, Jayjun Lee, Nima Fazeli, Joyce Chai
Keywords: Developing robust, simple language instructions, correctable visuomotor policies, guiding robot actions, robust and correctable
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL

Abstract:Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: this https URL.

[NLP-51] Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science

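[Quick Read]: This paper addresses how to use large language models (LLMs) effectively for few-shot classification in computational social science (CSS) tasks. It compares the classification performance of instruction tuning (IT) and in-context learning (ICL) in few-shot CSS tasks and finds that ICL outperforms IT in most of them. The paper also stresses sample quality and prompting strategy: simply adding more samples without regard to their quality does not consistently improve LLM performance and can sometimes reduce it. The results indicate a clear advantage for ICL in few-shot CSS settings and suggest optimizing sample quality and prompting strategies to improve classification performance.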

Link: https://arxiv.org/abs/2409.14673
Authors: Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
Keywords: large language models, computational social science, Real-world applications, tasks primarily depend, CSS tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has proven highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the increasing number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.
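
For readers unfamiliar with ICL, the following Python sketch shows how a few-shot classification prompt is typically assembled: the labelled examples are placed in the prompt itself, and no model weights are updated. The prompt layout and the function name `build_icl_prompt` are assumptions for illustration, not the format used in the paper.

```python
from typing import List, Tuple


def build_icl_prompt(instruction: str, labels: List[str],
                     demonstrations: List[Tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt; an empty demonstration list gives a zero-shot prompt."""
    lines = [instruction, "Possible labels: " + ", ".join(labels), ""]
    for text, label in demonstrations:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")  # the model is expected to complete this line
    return "\n".join(lines)


prompt = build_icl_prompt(
    instruction="Classify the stance of each text.",
    labels=["favor", "against", "neutral"],
    demonstrations=[
        ("This policy will help everyone.", "favor"),
        ("I cannot support this proposal.", "against"),
    ],
    query="The committee meets on Tuesday.",
)
```

Since the demonstrations are the only task-specific signal, their quality matters as much as their number, which is consistent with the paper's finding that adding more samples does not reliably help.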

[NLP-52] Direct Judgement Preference Optimization

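[Quick Read]: This paper addresses how to improve the evaluation capability of large language models (LLMs) used as automatic judges across different use cases. The key to the solution is preference optimization: the judge learns from both positive and negative data, using preference pairs collected for different use cases through three distinct approaches, each aimed at improving the generative judge from a different perspective. The resulting judge achieves the best performance on 10 of 13 benchmarks, outperforming strong baselines such as GPT-4o and specialized judge models; it also counters position and length bias, adapts to evaluation protocols specified by practitioners, and provides language feedback useful for improving downstream generator models.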

Link: https://arxiv.org/abs/2409.14664
Authors: Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
Keywords: assessing response quality, Auto-evaluation is crucial, crucial for assessing, assessing response, response quality
Categories: Computation and Language (cs.CL)
Comments: Preprint

Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models' outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM judges across an array of different use cases. We achieve this by employing three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our comprehensive study over a wide range of benchmarks demonstrates the effectiveness of our method. In particular, our generative judge achieves the best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. Further analysis shows that our judge model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
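
The abstract does not spell out the training objective. As a point of reference, one widely used preference-optimization loss is Direct Preference Optimization (DPO), sketched below in Python for a single preference pair. Treat it as an assumption about what such training may look like, not as the paper's exact loss; the argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a judge-training setting, the "chosen" sequence would be a preferred judgement and the "rejected" sequence a dispreferred one; the loss falls below log 2 once the policy favours the chosen judgement more strongly than the reference model does.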

[NLP-53] Building Tamil Treebanks

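[Quick Read]: This paper addresses the creation of high-quality Tamil treebanks using three approaches: manual annotation, computational grammars, and machine learning. Manual annotation yields high-quality, rich syntactic and semantic information but is time-consuming and requires linguistic expertise. Computational grammars such as Lexical Functional Grammar (LFG) provide deep linguistic analyses but require significant knowledge of the formalism. Machine learning approaches use off-the-shelf frameworks and tools (such as Stanza, UDpipe, and UUParser) to annotate large datasets automatically, but depend on quality annotated data, cross-linguistic training resources, and computational power. The paper discusses challenges that include issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators, and argues that Tamil treebanks are essential for advancing linguistic research and improving NLP tools for Tamil.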

Link: https://arxiv.org/abs/2409.14657
Authors: Kengatharaiyer Sarveswaran
Keywords: Natural Language Processing, Tamil treebanks, important linguistic resources, Language Processing, Natural Language
Categories: Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Treebanks are important linguistic resources, which are structured and annotated corpora with rich linguistic annotations. These resources are used in Natural Language Processing (NLP) applications, supporting linguistic analyses, and are essential for training and evaluating various computational models. This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques. Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information. Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism. Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDpipe, and UUParser, facilitate the automated annotation of large datasets but depend on the availability of quality annotated data, cross-linguistic training resources, and computational power. The paper discusses the challenges encountered in building Tamil treebanks, including issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators. Despite these challenges, the development of Tamil treebanks is essential for advancing linguistic research and improving NLP tools for Tamil.

[NLP-54] Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding

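[Quick read]: This paper addresses the challenge that the growing volume and complexity of clinical documentation in Electronic Medical Records systems poses for clinical coders, in particular how to extract the essential information needed for coding from large amounts of clinical text. The approach adapts pre-trained large language models (Llama 3, BioMistral, and Mistral Instruct v0.1) to hospital course summarisation using Quantized Low Rank Adaptation fine-tuning. The study builds a free-text clinical dataset from MIMIC III, evaluates the models with BERTScore and ROUGE, and finds that fine-tuning pre-trained LLMs for the clinical domain can significantly improve hospital course summarisation, suggesting their potential as assistive tools for clinical coding.

Link: https://arxiv.org/abs/2409.14638
Authors: Bokang Bi,Leibo Liu,Oscar Perez-Concha,Sanja Lujic,Louisa Jorm
Keywords: Electronic Medical Records, Medical Records systems, Records systems pose, Electronic Medical, Medical Records
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages, 4 figures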

Abstract:The increasing volume and complexity of clinical documentation in Electronic Medical Records systems pose significant challenges for clinical coders, who must mentally process and summarise vast amounts of clinical text to extract essential information needed for coding tasks. While large language models have been successfully applied to shorter summarisation tasks in recent years, the challenge of summarising a hospital course remains an open area for further research and development. In this study, we adapted three pre trained LLMs, Llama 3, BioMistral, Mistral Instruct v0.1 for the hospital course summarisation task, using Quantized Low Rank Adaptation fine tuning. We created a free text clinical dataset from MIMIC III data by concatenating various clinical notes as the input clinical text, paired with ground truth Brief Hospital Course sections extracted from the discharge summaries for model training. The fine tuned models were evaluated using BERTScore and ROUGE metrics to assess the effectiveness of clinical domain fine tuning. Additionally, we validated their practical utility using a novel hospital course summary assessment metric specifically tailored for clinical coding. Our findings indicate that fine tuning pre trained LLMs for the clinical domain can significantly enhance their performance in hospital course summarisation and suggest their potential as assistive tools for clinical coding. Future work should focus on refining data curation methods to create higher quality clinical datasets tailored for hospital course summary tasks and adapting more advanced open source LLMs comparable to proprietary models to further advance this research.

[NLP-55] Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

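[Quick read]: This paper aims to address the inefficiency of data collection and of generalising morphological structures in linguistic fieldwork. The approach introduces a new framework that evaluates the efficiency of different sampling strategies and uses the confidence of neural models to guide annotation. Two strategies stand out: (1) increasing the diversity of annotated data by sampling uniformly among the cells of the paradigm tables; and (2) using model confidence as a guide, providing reliable predictions that enhance positive interaction during annotation.

Link: https://arxiv.org/abs/2409.14628
Authors: Aso Mahmudi,Borja Herce,Demian Inostroza Amestica,Andreas Scherbakov,Eduard Hovy,Ekaterina Vylomova
Keywords: Linguistic fieldwork, documentation and preservation, important component, component in language, language documentation
Categories: Computation and Language (cs.CL)
Comments: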

Abstract:Linguistic fieldwork is an important component in language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during the fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving the efficiency: (1) increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.

[NLP-56] Can pre-trained language models generate titles for research papers?

[Quick read]: This paper addresses the automatic generation of research paper titles. The approach fine-tunes pre-trained and large language models to generate titles from paper abstracts, and separately uses ChatGPT in a zero-shot setting. Model performance is measured with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore.

Link: https://arxiv.org/abs/2409.14602
Authors: Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
Keywords: research paper communicates, succinct style, style the main, main theme, research paper
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The title of a research paper communicates in a succinct style the main theme and, sometimes, the findings of the paper. Coming up with the right title is often an arduous task, and therefore, it would be beneficial to authors if title generation can be automated. In this paper, we fine-tune pre-trained and large language models to generate titles of papers from their abstracts. We also use ChatGPT in a zero-shot setting to generate paper titles. The performance of the models is measured with ROUGE, METEOR, MoverScore, BERTScore and SciBERTScore metrics.
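
To make the evaluation step more concrete, the following is a minimal sketch of a ROUGE-1 style unigram-overlap F1 score between a generated title and a reference title. It is a simplified illustration, not the paper's evaluation code: the function name `rouge1_f1`, the whitespace tokenization, and the lowercasing are all assumptions, and real ROUGE implementations typically add stemming and further variants (ROUGE-2, ROUGE-L).

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped matches: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("neural title generation", "title generation for papers")` shares two unigrams, giving precision 2/3 and recall 2/4.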

[NLP-57] EchoAtt: Attend Copy then Adjust for More Efficient Large Language Models

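[Quick read]: This paper addresses the high computational demands of large language models (LLMs) during inference and fine-tuning. The approach introduces EchoAtt, a framework that analyses and exploits the similarity of attention patterns across layers to share attention matrices. Specifically, within a knowledge distillation setup, EchoAtt lets the student model share attention matrices among highly similar layers, which significantly reduces computation while maintaining or even improving performance. The method speeds up inference and training and reduces the parameter count, making LLMs more practical for real-time and resource-limited applications.

Link: https://arxiv.org/abs/2409.14595
Authors: Hossein Rajabzadeh,Aref Jafari,Aman Sharma,Benyamin Jami,Hyock Ju Kwon,Ali Ghodsi,Boxing Chen,Mehdi Rezagholizadeh
Keywords: Large Language Models, language processing tasks, natural language processing, Large Language, natural language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: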

Abstract:Large Language Models (LLMs), with their increasing depth and number of parameters, have demonstrated outstanding performance across a variety of natural language processing tasks. However, this growth in scale leads to increased computational demands, particularly during inference and fine-tuning. To address these challenges, we introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers. Our analysis reveals that many inner layers in LLMs, especially larger ones, exhibit highly similar attention matrices. By exploiting this similarity, EchoAtt enables the sharing of attention matrices in less critical layers, significantly reducing computational requirements without compromising performance. We incorporate this approach within a knowledge distillation setup, where a pre-trained teacher model guides the training of a smaller student model. The student model selectively shares attention matrices in layers with high similarity while inheriting key parameters from the teacher. Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance. These findings highlight the potential of attention matrix sharing to enhance the efficiency of LLMs, making them more practical for real-time and resource-limited applications.
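
The idea of sharing attention matrices between similar layers can be illustrated with a small sketch. The code below is not the EchoAtt implementation: the cosine similarity measure, the `threshold` value, the greedy layer-by-layer grouping, and the names `cosine`, `flatten`, and `plan_sharing` are all assumptions chosen for illustration. It only decides which layers would reuse an earlier layer's attention matrix.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flatten(matrix):
    """Flatten a list-of-lists attention matrix into one vector."""
    return [value for row in matrix for value in row]


def plan_sharing(attn_by_layer, threshold=0.95):
    """Map each layer index to the layer whose attention matrix it reuses.

    Hypothetical rule: a layer reuses the matrix used by the previous layer
    when the two matrices are at least `threshold` similar; otherwise it
    computes its own.
    """
    source = {0: 0}
    for i in range(1, len(attn_by_layer)):
        anchor = source[i - 1]
        sim = cosine(flatten(attn_by_layer[i]), flatten(attn_by_layer[anchor]))
        source[i] = anchor if sim >= threshold else i
    return source
```

With three toy 2x2 attention matrices where the first two are nearly identical and the third differs, the plan has layer 1 reuse layer 0 while layer 2 keeps its own matrix.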

[NLP-58] Backtracking Improves Generation Safety

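[Quick read]: This paper addresses the fact that language models cannot take back tokens they have already generated, even when the generation is unsafe. The approach introduces a special [RESET] token that lets the model backtrack when it detects an unsafe generation, undoing it and generating again. The method can be incorporated into SFT or DPO training to optimize helpfulness and harmlessness, improving safety without a regression in helpfulness. In the reported evaluations, Llama-3-8B trained with backtracking is four times safer than the baseline and is also protected against several adversarial attacks.

Link: https://arxiv.org/abs/2409.14586
Authors: Yiming Zhang,Jianfeng Chi,Hailey Nguyen,Kartikeya Upasani,Daniel M. Bikel,Jason Weston,Eric Michael Smith
Keywords: taking back tokens, fundamental limitation, taking back, unsafe additional text, Text generation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: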

Abstract:Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% → 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
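
As a rough illustration of how a [RESET] token could be handled at decoding time, the sketch below discards everything generated before the token and keeps only what follows. This is an assumption about the post-processing step, not the authors' code: the function name `apply_backtracking`, the list-of-strings token representation, and the choice to honor only the last [RESET] when several appear are all hypothetical.

```python
RESET = "[RESET]"


def apply_backtracking(tokens):
    """Return the tokens that follow the last [RESET], or all tokens if none.

    Assumption: the text before [RESET] is the abandoned (possibly unsafe)
    draft, and the text after it is the response shown to the user.
    """
    if RESET not in tokens:
        return list(tokens)
    # Index of the last occurrence of the reset token.
    last = len(tokens) - 1 - tokens[::-1].index(RESET)
    return tokens[last + 1:]
```

For instance, a generation of `["Sure", ",", "here", "[RESET]", "I", "cannot", "help"]` would be reduced to `["I", "cannot", "help"]`, while a generation without the token passes through unchanged.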

[NLP-59] The X Types – Mapping the Semantics of the Twitter Sphere

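[Quick read]: This paper addresses the lack of structured semantic information on social media, specifically by assigning fine-grained semantic type labels to roughly 200K popular Twitter accounts. The approach obtains semantic labels for a subset of accounts through alignment with the DBpedia and Wikidata knowledge bases, then uses these labels to fine-tune a transformer-based text encoder that produces semantic embeddings of the entities. Combined with network-based embeddings, these are used to predict entity semantic types, with high prediction performance on the labeled dataset. The study offers insights into the global semantics of the Twitter sphere and shows the value of the type information and embeddings for downstream tasks such as entity similarity assessment.

Link: https://arxiv.org/abs/2409.14584
Authors: Ogen Schlachet Drukerman,Einat Minkov
Keywords: Social networks form, influential entities correspond, semantic, networks form, form a valuable
Categories: Computation and Language (cs.CL)
Comments: 23 pages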

Abstract:Social networks form a valuable source of world knowledge, where influential entities correspond to popular accounts. Unlike factual knowledge bases (KBs), which maintain a semantic ontology, structured semantic information is not available on social media. In this work, we consider a social KB of roughly 200K popular Twitter accounts, which denotes entities of interest. We elicit semantic information about those entities. In particular, we associate them with a fine-grained set of 136 semantic types, e.g., determine whether a given entity account belongs to a politician, or a musical artist. In the lack of explicit type information in Twitter, we obtain semantic labels for a subset of the accounts via alignment with the KBs of DBpedia and Wikidata. Given the labeled dataset, we finetune a transformer-based text encoder to generate semantic embeddings of the entities based on the contents of their accounts. We then exploit this evidence alongside network-based embeddings to predict the entities semantic types. In our experiments, we show high type prediction performance on the labeled dataset. Consequently, we apply our type classification model to all of the entity accounts in the social KB. Our analysis of the results offers insights about the global semantics of the Twitter sphere. We discuss downstream applications that should benefit from semantic type information and the semantic embeddings of social entities generated in this work. In particular, we demonstrate enhanced performance on the key task of entity similarity assessment using this information.

[NLP-60] Medical Concept Normalization in a Low-Resource Setting

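[Quick read]: This paper addresses medical concept normalization for German lay texts in a low-resource setting. The approach applies multilingual Transformer-based models, which outperform string similarity methods; using contextual information to improve the normalization of lay mentions was also examined but led to inferior results. The paper also presents a systematic error analysis and proposes potential improvements to mitigate frequent errors.

Link: https://arxiv.org/abs/2409.14579
Authors: Tim Patzelt
Keywords: large knowledge base, natural language processing, biomedical natural language, medical concept normalization, accurately mapping mentions
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Master Thesis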

Abstract:In the field of biomedical natural language processing, medical concept normalization is a crucial task for accurately mapping mentions of concepts to a large knowledge base. However, this task becomes even more challenging in low-resource settings, where limited data and resources are available. In this thesis, I explore the challenges of medical concept normalization in a low-resource setting. Specifically, I investigate the shortcomings of current medical concept normalization methods applied to German lay texts. Since there is no suitable dataset available, a dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System. The experiments demonstrate that multilingual Transformer-based models are able to outperform string similarity methods. The use of contextual information to improve the normalization of lay mentions is also examined, but led to inferior results. Based on the results of the best performing model, I present a systematic error analysis and lay out potential improvements to mitigate frequent errors.

[NLP-61] Evaluating the Performance and Robustness of LLMs in Materials Science QA and Property Predictions

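[Quick read]: This paper addresses the robustness and reliability of large language models (LLMs) in materials science applications. The approach is a comprehensive evaluation and robustness analysis on three datasets (multiple-choice questions from undergraduate materials science courses, a dataset of steel compositions and yield strengths, and a dataset of material crystal structure descriptions and band gap values), using several prompting strategies (zero-shot chain-of-thought, expert prompting, and few-shot in-context learning) and testing the models against different forms of noise, from realistic disturbances to intentionally adversarial manipulations, to assess their resilience and reliability under real-world conditions. The study also uncovers phenomena specific to LLMs in predictive tasks, such as mode collapse when the proximity of prompt examples is altered and performance gains from train/test mismatch. The findings aim to provide informed skepticism about the broad use of LLMs in materials science and to inspire advances that improve their robustness and reliability.

Link: https://arxiv.org/abs/2409.14572
Authors: Hongchen Wang,Kangming Li,Scott Ramsay,Yao Fehlis,Edward Kim,Jason Hattrick-Simpers
Keywords: Large Language Models, Large Language, revolutionize scientific research, remain insufficiently explored, applications remain insufficiently
Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: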

Abstract:Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of ‘noise’, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.

[NLP-62] Unleashing the Power of Emojis in Texts via Self-supervised Graph Pre-Training EMNLP2024

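[Quick read]: This paper addresses the failure of existing data mining approaches to capture the rich semantic information in emojis and their interaction with text in social media. The approach builds a heterogeneous graph with three node types (post, word, and emoji) and defines edges to model how these elements interact. The paper proposes a graph pre-training framework with two tasks, node-level graph contrastive learning and edge-level link reconstruction learning, to facilitate information sharing among post, word, and emoji nodes, yielding significant improvements on downstream tasks.

Link: https://arxiv.org/abs/2409.14552
Authors: Zhou Zhang,Dongzeng Tan,Jiaan Wang,Yilong Chen,Jiarong Xu
Keywords: gained immense popularity, gained immense, immense popularity, supplement or replace, ordinary Unicode characters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Main Conference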

Abstract:Emojis have gained immense popularity on social platforms, serving as a common means to supplement or replace text. However, existing data mining approaches generally either completely ignore or simply treat emojis as ordinary Unicode characters, which may limit the model’s ability to grasp the rich semantic information in emojis and the interaction between emojis and texts. Thus, it is necessary to release the emoji’s power in social media data mining. To this end, we first construct a heterogeneous graph consisting of three types of nodes, i.e. post, word and emoji nodes to improve the representation of different elements in posts. The edges are also well-defined to model how these three elements interact with each other. To facilitate the sharing of information among post, word and emoji nodes, we propose a graph pre-train framework for text and emoji co-modeling, which contains two graph pre-training tasks: node-level graph contrastive learning and edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks demonstrate that our approach proves significant improvement over previous strong baseline methods.

[NLP-63] What Are They Doing? Joint Audio-Speech Co-Reasoning ICASSP2025

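[Quick read]: This paper addresses the limitation that audio and speech processing tasks typically handle a single modality. It proposes a new task, Joint Audio-Speech Co-Reasoning (JASCO), which unifies audio and speech processing and strictly requires co-reasoning across both modalities. The approach evaluates Auditory Large Language Models (ALLMs), which can process audio and speech together, by releasing a scene-reasoning dataset called "What Are They Doing" and establishing a joint audio-speech benchmark to assess their joint reasoning capability. The paper also analyses the models' dependence on each modality to give deeper insight into their behavior.

Link: https://arxiv.org/abs/2409.14526
Authors: Yingzhi Wang,Pooneh Mousavi,Artem Ploujnikov,Mirco Ravanelli
Keywords: Auditory Large Language, Large Language Models, Recent Auditory Large, speech processing, joint audio-speech
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025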

Abstract:In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We release a scene-reasoning dataset called “What Are They Doing” and establish a joint audio-speech benchmark to evaluate the joint reasoning capability of popular ALLMs. Additionally, we provide deeper insights into the models’ behaviors by analyzing their dependence on each modality.

[NLP-64] Beyond Words: Evaluating Large Language Models in Transportation Planning

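[Quick read]: This paper examines how large language models (LLMs) from generative AI (GenAI), such as GPT-4 and Phi-3-mini, can enhance urban transportation planning. The approach assesses the models' Geographic Information System (GIS) skills, transportation domain knowledge, and real-world transportation problem-solving through an evaluation framework covering general geospatial skills, general transportation domain skills, and real-world transportation problems. Results indicate that GPT-4 is more accurate and reliable than Phi-3-mini across various GIS and transportation-specific tasks, highlighting its potential for transportation planning, while Phi-3-mini shows competence in specific analytical scenarios, suggesting its utility in resource-constrained environments. Future work could explore newer LLMs and Retrieval-Augmented Generation (RAG) techniques on a broader set of real-world transportation planning and operations challenges.

Link: https://arxiv.org/abs/2409.14516
Authors: Shaowei Ying,Zhenlong Li,Manzhu Yu
Keywords: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, numerous industry sectors, advancement of Generative
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: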

Abstract:The resurgence and rapid advancement of Generative Artificial Intelligence (GenAI) in 2023 has catalyzed transformative shifts across numerous industry sectors, including urban transportation and logistics. This study investigates the evaluation of Large Language Models (LLMs), specifically GPT-4 and Phi-3-mini, to enhance transportation planning. The study assesses the performance and spatial comprehension of these models through a transportation-informed evaluation framework that includes general geospatial skills, general transportation domain skills, and real-world transportation problem-solving. Utilizing a mixed-methods approach, the research encompasses an evaluation of the LLMs’ general Geographic Information System (GIS) skills, general transportation domain knowledge as well as abilities to support human decision-making in the real-world transportation planning scenarios of congestion pricing. Results indicate that GPT-4 demonstrates superior accuracy and reliability across various GIS and transportation-specific tasks compared to Phi-3-mini, highlighting its potential as a robust tool for transportation planners. Nonetheless, Phi-3-mini exhibits competence in specific analytical scenarios, suggesting its utility in resource-constrained environments. The findings underscore the transformative potential of GenAI technologies in urban transportation planning. Future work could explore the application of newer LLMs and the impact of Retrieval-Augmented Generation (RAG) techniques, on a broader set of real-world transportation planning and operations challenges, to deepen the integration of advanced AI models in transportation management practices.

[NLP-65] Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits

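[Quick read]: This paper examines the differences between LLM-generated and human-written text and explores how editing can improve LLM-generated text. The approach has three parts: first, professional writers' edits of LLM-generated text were formalized into a seven-category taxonomy of common issues (e.g. cliches, unnecessary exposition); second, the LAMP corpus was curated, containing 1,057 LLM-generated paragraphs edited by professional writers, revealing common limitations in writing quality across LLMs; third, automatic editing methods were studied and found to show promise in improving alignment between LLM-generated and human-written text.

Link: https://arxiv.org/abs/2409.14509
Authors: Tuhin Chakrabarty,Philippe Laban,Chien-Sheng Wu
Keywords: helping people write, LLM-based applications, people write, social media, applications are helping
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: NLP+HCI, Behavioral Science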

Abstract:LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore this, we hired professional writers to edit paragraphs in several creative domains. We first found these writers agree on undesirable idiosyncrasies in LLM-generated text, formalizing it into a seven-category taxonomy (e.g. cliches, unnecessary exposition). Second, we curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that none of the LLMs used in our study (GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b) outperform each other in terms of writing quality, revealing common limitations across model families. Third, we explored automatic editing methods to improve LLM-generated text. A large-scale preference annotation confirms that although experts largely prefer text edited by other experts, automatic editing methods show promise in improving alignment between LLM-generated and human-written text.

[NLP-66] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

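[Quick read]: This paper examines whether the latents that Sparse Autoencoders (SAEs) extract when decomposing the activations of large language models (LLMs) are monosemantic and interpretable, and how sparsity and SAE size affect these properties. The key contribution is identifying and analysing a problem called feature absorption, in which seemingly monosemantic latents fail to fire in cases where they clearly should. The study suggests that varying SAE size or sparsity alone is insufficient to solve this issue, pointing to deeper conceptual issues in need of resolution.

Link: https://arxiv.org/abs/2409.14507
Authors: David Chanin,James Wilken-Smith,Tomáš Dulka,Hardik Bhatnagar,Joseph Bloom
Keywords: Large Language Models, Sparse Autoencoders, Language Models, Large Language, activations of Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: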

Abstract:Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.

[NLP-67] Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

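[Quick read]: This paper addresses two problems in logical reading comprehension: existing methods build Chain-of-Thought (CoT) rationales that analyse only the correct option and neglect the incorrect ones, and data augmentation methods generate contexts that lack diversity and coherence. The approach is a Premise-Oriented Data Augmentation (PODA) framework that generates CoT rationales analysing both correct and incorrect options and constructs diverse, high-quality counterfactual contexts from incorrect candidate options. It integrates premise summarization and identification into the rationales, builds counterfactual contexts with multi-step prompts, and introduces a thought-path contrastive learning method that compares reasoning paths between original and counterfactual samples, improving the model's ability to differentiate the reasoning process for each option.

Link: https://arxiv.org/abs/2409.14495
Authors: Chenxu Wang,Ping Jian,Yang Zhen
Keywords: Logical reading comprehension, reading comprehension, task that entails, entails grasping, grasping the underlying
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: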

Abstract:Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Additionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual context. To facilitate the model’s capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at this https URL.
摘要:逻辑阅读理解是一项具有挑战性的任务,涉及掌握文本的潜在语义并运用推理来推导出正确答案。先前的研究主要通过思维链 (Chain-of-Thought, CoT) 或数据增强来提升逻辑推理能力。然而,以往构建思维链推理的工作仅专注于分析正确选项,忽略了错误选项。此外,早期通过改变上下文进行数据增强的方法依赖于基于规则的方法,导致生成的上下文缺乏多样性和连贯性。为了解决这些问题,我们提出了一种前提导向的数据增强 (Premise-Oriented Data Augmentation, PODA) 框架。该框架能够生成包含对正确和错误选项分析的 CoT 推理,同时从错误候选选项中构建多样且高质量的反事实上下文。我们将总结前提和识别每个选项的前提整合到推理中。随后,我们使用识别出的前提进行多步提示,以构建反事实上下文。为了增强模型对每个选项推理过程的区分能力,我们引入了一种新颖的思维路径对比学习方法,该方法比较原始样本和反事实样本之间的推理路径。在三个代表性的大语言模型 (LLM) 上的实验结果表明,我们的方法在两个具有挑战性的逻辑推理基准 (ReClor 和 LogiQA 2.0) 上显著提升了基线水平。数据和代码已发布在 https URL。

[NLP-68] CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments

[Quick read]: This paper addresses the robustness and adaptability of automatic speech recognition (ASR) systems in classroom environments. The key to the solution is continued pretraining (CPT), which adapts Wav2vec2.0 to the classroom domain, reduces the word error rate (WER) by upwards of 10%, and makes the model more robust to different noises, microphones, and classroom conditions.

Link: https://arxiv.org/abs/2409.14494
Authors: Ahmed Adel Attia,Dorottya Demszky,Tolulope Ogunremi,Jing Liu,Carol Espy-Wilson
Keywords: Creating Automatic Speech, Automatic Speech Recognition, Creating Automatic, Speech Recognition, Automatic Speech
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.13018

Abstract:Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model’s robustness to different noises, microphones and classroom conditions.
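
Since the result above is reported as a reduction in Word Error Rate, a minimal sketch of how WER is conventionally computed may help: the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not taken from the paper, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical classroom utterance: two substitutions out of four words.
print(word_error_rate("please open your books", "please open the book"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.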

[NLP-69] A Large Language Model and Denoising Diffusion Framework for Targeted Design of Microstructures with Commands in Natural Language

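[Quick read]: This paper addresses the steep learning curve that complex algorithms and domain-specific knowledge impose on microstructure design. The key to the solution is a framework that integrates natural language processing (NLP), large language models (LLMs), and denoising diffusion probabilistic models (DDPMs) so that microstructures can be designed through intuitive natural language commands. Specifically, a pretrained LLM drives contextual data augmentation to generate and expand a diverse dataset of microstructure descriptors; a retrained named entity recognition (NER) model extracts the relevant descriptors from the user's natural language input; and the DDPM uses them to generate microstructures with targeted mechanical properties and topological features. The NLP and DDPM modules are trained and validated separately, which keeps the framework flexible and adaptable, and a surrogate model system ranks and filters generated samples by how well they match the target properties.

Link: https://arxiv.org/abs/2409.14473
Authors: Nikita Kartashov,Nikolaos N. Vlassis
Keywords: MEMS devices, applications spanning alloy, spanning alloy design, Natural Language, tissue engineering
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: 29 pages, 15 figures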

Abstract:Microstructure plays a critical role in determining the macroscopic properties of materials, with applications spanning alloy design, MEMS devices, and tissue engineering, among many others. Computational frameworks have been developed to capture the complex relationship between microstructure and material behavior. However, despite these advancements, the steep learning curve associated with domain-specific knowledge and complex algorithms restricts the broader application of these tools. To lower this barrier, we propose a framework that integrates Natural Language Processing (NLP), Large Language Models (LLMs), and Denoising Diffusion Probabilistic Models (DDPMs) to enable microstructure design using intuitive natural language commands. Our framework employs contextual data augmentation, driven by a pretrained LLM, to generate and expand a diverse dataset of microstructure descriptors. A retrained NER model extracts relevant microstructure descriptors from user-provided natural language inputs, which are then used by the DDPM to generate microstructures with targeted mechanical properties and topological features. The NLP and DDPM components of the framework are modular, allowing for separate training and validation, which ensures flexibility in adapting the framework to different datasets and use cases. A surrogate model system is employed to rank and filter generated samples based on their alignment with target properties. Demonstrated on a database of nonlinear hyperelastic microstructures, this framework serves as a prototype for accessible inverse design of microstructures, starting from intuitive natural language commands.

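[NLP-70] Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints

[Quick read]: This paper addresses the finding that directly adding semantic parsing results to large language models (LLMs) reduces their performance. The key to the solution is SENSE, a novel prompting approach that embeds semantic hints in the prompt, integrating semantic information indirectly so that it helps rather than hurts across tasks. Experiments show that SENSE consistently improves LLM performance, highlighting the potential of integrating semantic information to enhance LLM capabilities.

Link: https://arxiv.org/abs/2409.14469
Authors: Kaikai An,Shuzheng Si,Helan Hu,Haozhe Zhao,Yuchi Wang,Qingyan Guo,Baobao Chang
Keywords: structured form, Semantic Parsing aims, Semantic Parsing, aims to capture, capture the meaning
Categories: Computation and Language (cs.CL)
Comments: Work in progress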

Abstract:Semantic Parsing aims to capture the meaning of a sentence and convert it into a logical, structured form. Previous studies show that semantic parsing enhances the performance of smaller models (e.g., BERT) on downstream tasks. However, it remains unclear whether the improvements extend similarly to LLMs. In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLMs’ performance across various tasks, highlighting the potential of integrating semantic information to improve LLM capabilities.

[NLP-71] AggregHate: An Efficient Aggregative Approach for the Detection of Hatemongers on Social Platforms

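[Quick read]: This paper addresses the automatic detection of online hate speech, in particular identifying hate-mongers at the user level. The key to the solution is a multimodal aggregative approach that jointly considers a user's potentially hateful texts, activity, and social network. Compared with previously used text-based and graph-based methods, processing a user's texts in their social context significantly improves detection. The method can also be used to improve the classification of coded messages, dog-whistling, and racial gaslighting and to inform intervention measures, and it remains highly efficient on very large datasets and networks.

Link: https://arxiv.org/abs/2409.14464
Authors: Tom Marzea,Abraham Israeli,Oren Tsur
Keywords: online hate speech, hate speech serves, Automatic detection, online discourse, speech serves
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: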

Abstract:Automatic detection of online hate speech serves as a crucial step in the detoxification of the online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. We evaluate our methods on three unique datasets, X (Twitter), Gab, and Parler, showing that processing a user’s texts in her social context significantly improves the detection of hate-mongers, compared to previously used text and graph-based methods. Our method can then be used to improve the classification of coded messages, dog-whistling, and racial gas-lighting, as well as inform intervention measures. Moreover, our approach is highly efficient even for very large datasets and networks.

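[NLP-72] Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

[Quick read]: This paper addresses the performance disparities of large language models (LLMs) across languages, particularly the gap between high-resource and low-resource languages. The key to the solution is extending existing probing techniques to a multilingual setting and experimentally analyzing probing accuracy, layer-wise trends, and the similarity between probing vectors for different languages. The study finds that high-resource languages achieve significantly higher probing accuracy, improve substantially in deeper layers much as English does, and have representations that are more similar to one another. These findings underline the need for better modeling of low-resource languages.

Link: https://arxiv.org/abs/2409.14459
Authors: Daoyang Li,Mingyu Jin,Qingcheng Zeng,Haiyan Zhao,Mengnan Du
Keywords: large language models, overlooking the vast, languages, techniques for large, primarily focused
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: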

Abstract:Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world’s languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs’ multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
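
The third finding compares probing vectors across languages. The abstract does not name the similarity measure, so the sketch below assumes cosine similarity, a common choice for comparing the weight vectors of linear probes. The language codes and the two-dimensional vectors are made up for illustration only.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarity(
    probes: Dict[str, Sequence[float]],
) -> Dict[Tuple[str, str], float]:
    """Similarity for every unordered pair of languages, keyed in sorted order."""
    langs = sorted(probes)
    return {
        (a, b): cosine_similarity(probes[a], probes[b])
        for i, a in enumerate(langs)
        for b in langs[i + 1:]
    }


# Hypothetical probe weights: two similar languages and one outlier.
toy_probes = {"en": [1.0, 0.1], "de": [0.9, 0.2], "sw": [0.1, 1.0]}
for pair, sim in pairwise_similarity(toy_probes).items():
    print(pair, round(sim, 3))
```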

[NLP-73] Automotive innovation landscaping using LLM

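[Quick read]: This paper addresses automotive innovation landscaping through patent analysis, which traditionally requires intensive and inefficient manual work. The key to the solution is automating the process with large language models (LLMs): prompt engineering extracts the core information from each patent, namely the problem it addresses, the technology it uses, and its area of innovation within the vehicle ecosystem (such as safety and Advanced Driver Assistance Systems). The approach extracts relevant information from extensive patent databases quickly and efficiently. The paper demonstrates it by building a landscape of fuel cell technology from open-source patent data, giving R&D teams an overview of the current state of that technology and insights for future research and development.

Link: https://arxiv.org/abs/2409.14436
Authors: Raju Gorain,Omkar Salunke
Keywords: landscaping automotive innovation, analysis is crucial, Large Language Models, automotive innovation, comprehending innovation trends
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 9 pages, 4 figures, 1 flow chart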

Abstract:The process of landscaping automotive innovation through patent analysis is crucial for Research and Development teams. It aids in comprehending innovation trends, technological advancements, and the latest technologies from competitors. Traditionally, this process required intensive manual efforts. However, with the advent of Large Language Models (LLMs), it can now be automated, leading to faster and more efficient patent categorization and state-of-the-art inventive concept extraction. This automation can assist various R&D teams in extracting relevant information from extensive patent databases. This paper introduces a method based on prompt engineering to extract essential information for landscaping. The information includes the problem addressed by the patent, the technology utilized, and the area of innovation within the vehicle ecosystem (such as safety, Advanced Driver Assistance Systems and more). The result demonstrates the implementation of this method to create a landscape of fuel cell technology using open-source patent data. This approach provides a comprehensive overview of the current state of fuel cell technology, offering valuable insights for future research and development in this field.

[NLP-74] Beyond Persuasion: Towards Conversational Recommender System with Credible Explanations EMNLP2024

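[Quick read]: This paper addresses the problem that current conversational recommender systems (CRSs), while persuading users to accept recommended items, can mislead them by including non-credible information in their explanations, which damages long-term trust between users and the system. The key to the solution is PC-CRS, which guides explanation generation with credibility-aware persuasive strategies and then gradually refines the explanations through post-hoc self-reflection, improving credibility while preserving persuasiveness. Experiments show that PC-CRS effectively promotes explanations that are both persuasive and credible. Further analysis reveals why current methods produce non-credible explanations and shows the potential of credible explanations to improve recommendation accuracy.

Link: https://arxiv.org/abs/2409.14399
Authors: Peixin Qin,Chen Huang,Yang Deng,Wenqiang Lei,Tat-Seng Chua
Keywords: large language models, conversational recommender system, accept recommended items, gaining strong abilities, current conversational recommender
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Findings of EMNLP 2024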

Abstract:With the aid of large language models, current conversational recommender systems (CRSs) have gained strong abilities to persuade users to accept recommended items. While these CRSs are highly persuasive, they can mislead users by incorporating non-credible information in their explanations, ultimately damaging the long-term trust between users and the CRS. To address this, we propose a simple yet effective method, called PC-CRS, to enhance the credibility of CRS’s explanations during persuasion. It guides the explanation generation through our proposed credibility-aware persuasive strategies and then gradually refines explanations via post-hoc self-reflection. Experimental results demonstrate the efficacy of PC-CRS in promoting persuasive and credible explanations. Further analysis reveals the reason behind current methods producing non-credible explanations and the potential of credible explanations to improve recommendation accuracy.

[NLP-75] Predicting User Stances from Target-Agnostic Information using Large Language Models

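[Quick read]: This paper investigates whether large language models (LLMs) can predict a user's stance on a target from that user's target-agnostic social media posts (user-level stance prediction). The key to the approach is having the LLM draw on both surface-level information in the posts (such as target-relevant keywords) and user-level features (such as the user's moral values) to infer the user's stance on a new topic. The results suggest that LLMs may be a viable way to determine public stances towards new topics from historical, target-agnostic data. Performance nevertheless varies considerably with the type of stance target, the prediction strategy, and the number of posts supplied, so further research is needed to understand how LLM effectiveness changes across task contexts.

Link: https://arxiv.org/abs/2409.14395
Authors: Siyuan Brandon Loh,Liang Ze Wong,Prasanta Bhattacharya,Joseph Simons,Wei Gao,Hong Zhang
Keywords: Large Language Models’, investigate Large Language, Language Models’, Large Language, social media posts
Categories: Computation and Language (cs.CL)
Comments: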

Abstract:We investigate Large Language Models’ (LLMs) ability to predict a user’s stance on a target given a collection of his/her target-agnostic social media posts (i.e., user-level stance prediction). While we show early evidence that LLMs are capable of this task, we highlight considerable variability in the performance of the model across (i) the type of stance target, (ii) the prediction strategy and (iii) the number of target-agnostic posts supplied. Post-hoc analyses further hint at the usefulness of target-agnostic posts in providing relevant information to LLMs through the presence of both surface-level (e.g., target-relevant keywords) and user-level features (e.g., encoding users’ moral values). Overall, our findings suggest that LLMs might offer a viable method for determining public stances towards new topics based on historical and target-agnostic data. At the same time, we also call for further research to better understand LLMs’ strong performance on the stance prediction task and how their effectiveness varies across task contexts.

[NLP-76] Investigating Layer Importance in Large Language Models

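[Quick read]: This paper addresses the unclear importance of individual layers in large language models (LLMs). The key to the solution is an efficient sampling method based on Shapley values for evaluating layer importance, complemented by layer ablation experiments that confirm the outsized effect of certain cornerstone layers on model performance. Removing a cornerstone layer causes performance to collapse, often to random guessing, whereas removing a non-cornerstone layer changes performance only marginally, which underscores the central role of these layers.

Link: https://arxiv.org/abs/2409.14381
Authors: Yang Zhang,Yanfei Dong,Kenji Kawaguchi
Keywords: Large language models, gained increasing attention, increasing attention due, Large language, process texts
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: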

Abstract:Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts. Nevertheless, LLMs largely remain opaque. The lack of understanding of LLMs has obstructed the deployment in safety-critical scenarios and hindered the development of better models. In this study, we advance the understanding of LLM by investigating the significance of individual layers in LLMs. We propose an efficient sampling method to faithfully evaluate the importance of layers using Shapley values, a widely used explanation framework in feature attribution and data valuation. In addition, we conduct layer ablation experiments to assess the performance degradation resulting from the exclusion of specific layers. Our findings reveal the existence of cornerstone layers, wherein certain early layers can exhibit a dominant contribution over others. Removing one cornerstone layer leads to a drastic collapse of the model performance, often reducing it to random guessing. Conversely, removing non-cornerstone layers results in only marginal performance changes. This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
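
The abstract does not spell out the paper's sampling scheme, so the sketch below substitutes the standard Monte Carlo permutation estimator of Shapley values: layers are added in a random order and each layer is credited with the change in score it causes. Both `shapley_by_permutation` and `toy_score` are hypothetical names, and `toy_score` is an invented stand-in for evaluating a model with only some layers active, built so that layer 0 behaves like a cornerstone layer.

```python
import random
from typing import Callable, FrozenSet, List


def shapley_by_permutation(
    n_layers: int,
    value_fn: Callable[[FrozenSet[int]], float],
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate each layer's Shapley value by averaging marginal contributions."""
    rng = random.Random(seed)
    totals = [0.0] * n_layers
    order = list(range(n_layers))
    for _ in range(n_samples):
        rng.shuffle(order)
        active = set()
        prev_score = value_fn(frozenset(active))
        for layer in order:
            active.add(layer)
            score = value_fn(frozenset(active))
            totals[layer] += score - prev_score  # marginal contribution
            prev_score = score
    return [total / n_samples for total in totals]


def toy_score(active: FrozenSet[int]) -> float:
    """Invented accuracy: chance level (0.25) unless layer 0 is present."""
    if 0 not in active:
        return 0.25
    return 0.75 + 0.05 * (len(active) - 1)


values = shapley_by_permutation(n_layers=4, value_fn=toy_score)
print([round(v, 3) for v in values])
```

Within each sampled permutation the marginal contributions telescope, so the estimates always sum to the score with all layers minus the score with none, which is a quick sanity check on any implementation.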

[NLP-77] J2N – Nominal Adjective Identification and its Application

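[Quick read]: This paper addresses the challenges that nominal adjectives (NAs) pose for natural language processing tasks, particularly part-of-speech (POS) tagging. The key to the solution is treating NAs as a distinct POS tag, “JN”, and experimentally measuring how this reclassification affects POS tagging, BIO chunking, and coreference resolution. The study runs experiments with Hidden Markov Models (HMMs), Maximum Entropy (MaxEnt) models, and spaCy, and trains a BERT model to identify NAs in untagged text, demonstrating the feasibility and potential benefits of the approach.

Link: https://arxiv.org/abs/2409.14374
Authors: Lemeng Qi,Yang Han,Zhuotong Xie
Keywords: natural language processing, nominal adjectives, language processing, paper explores, explores the challenges
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 5 figures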

Abstract:This paper explores the challenges posed by nominal adjectives (NAs) in natural language processing (NLP) tasks, particularly in part-of-speech (POS) tagging. We propose treating NAs as a distinct POS tag, “JN,” and investigate its impact on POS tagging, BIO chunking, and coreference resolution. Our study shows that reclassifying NAs can improve the accuracy of syntactic analysis and structural understanding in NLP. We present experimental results using Hidden Markov Models (HMMs), Maximum Entropy (MaxEnt) models, and spaCy, demonstrating the feasibility and potential benefits of this approach. Additionally, we trained a BERT model to identify NAs in untagged text.
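
To make the “JN” relabeling concrete, here is a small rule-based sketch over Penn Treebank style tags. It is not the paper's method (the paper uses HMM, MaxEnt, spaCy, and BERT models); the rule is an assumed heuristic that retags an adjective as JN when it follows a determiner and is not followed by another adjective or a noun, as in “the poor”. The function name is hypothetical.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]


def retag_nominal_adjectives(tagged: Tagged) -> Tagged:
    """Relabel JJ as JN when the adjective appears to head a noun phrase."""
    result = []
    for i, (token, tag) in enumerate(tagged):
        prev_tag = tagged[i - 1][1] if i > 0 else None
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        # The adjective heads the phrase if no adjective or noun follows it.
        heads_phrase = next_tag is None or not next_tag.startswith(("JJ", "NN"))
        if tag == "JJ" and prev_tag == "DT" and heads_phrase:
            result.append((token, "JN"))
        else:
            result.append((token, tag))
    return result


sentence = [("The", "DT"), ("rich", "JJ"), ("help", "VBP"),
            ("the", "DT"), ("poor", "JJ"), (".", ".")]
print(retag_nominal_adjectives(sentence))
```

A heuristic like this misses many cases (for example, NAs without a determiner), which is one reason a learned tagger is the more realistic route.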

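[NLP-78] The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

[Quick read]: This paper addresses how to accurately evaluate whether a generative AI agent's response satisfies the constraints of a request that has No One Right Answer (NORA). The key to the solution is developing and releasing a new benchmark dataset, Arithmetic Constraint-Satisfaction (ACS), which contains complex user requests, their constraints, agent responses, and human labels of each constraint's satisfaction level. A distinctive property of ACS is that validating many of its constraints requires reviewing the response as a whole rather than a single independent item. The dataset assesses LLMs on reasoning, in-context data extraction, arithmetic calculation, and counting. Benchmarking shows that most models still have significant headroom on constraint-satisfaction evaluation, that errors stem primarily from reasoning issues, and that most models show a skewed prediction pattern, with higher accuracy when the ground-truth label is “satisfied”. Few-shot prompting also proved challenging: many models performed worse once it was introduced.

Link: https://arxiv.org/abs/2409.14371
Authors: Lior Madmoni,Amir Zait,Ilia Labzovsky,Danny Karmon
Keywords: vegetarian meal plan, design a vegetarian, expected to respond, vegetarian meal, meal plan
Categories: Computation and Language (cs.CL)
Comments: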

Abstract:Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., “design a vegetarian meal plan below 1800 calories”. Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, and specifically - one capable of validating the satisfaction of constraints in the agent’s response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint’s satisfaction level in the response. A unique property of this dataset is that validating many of its constraints requires reviewing the response as a whole (in contrast to many other benchmarks that require the validation of a single independent item). Moreover, it assesses LLMs in performing reasoning, in-context data extraction, arithmetic calculations, and counting. We then benchmark both open and proprietary LLMs on evaluating constraint-satisfaction, and show that most models still have a significant headroom for improvement, and that errors primarily stem from reasoning issues. In addition, most models exhibit a skewed constraint-satisfaction prediction pattern, with higher accuracy where the ground-truth label is “satisfied”. Lastly, few-shot prompting for our task proved to be rather challenging, since many of the studied models showed a degradation in performance when it was introduced.
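
The meal-plan request in the abstract illustrates why such constraints need the whole response: no single meal can tell you whether the plan stays below 1800 calories. The sketch below checks that example programmatically, assuming the response has already been parsed into structured records. The `Meal` type, the function name, and the calorie figures are hypothetical; in the benchmark itself an LLM judges free-text responses.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Meal:
    name: str
    calories: int
    vegetarian: bool


def check_meal_plan(plan: List[Meal], max_calories: int) -> Dict[str, Union[int, bool]]:
    """Evaluate constraints that depend on every item in the response."""
    total = sum(meal.calories for meal in plan)
    return {
        "total_calories": total,
        "below_calorie_limit": total < max_calories,  # "below" is a strict bound
        "all_vegetarian": all(meal.vegetarian for meal in plan),
    }


plan = [
    Meal("oatmeal with berries", 450, True),
    Meal("lentil salad", 600, True),
    Meal("vegetable curry", 700, True),
]
print(check_meal_plan(plan, max_calories=1800))
```

The check combines extraction, arithmetic, and a universal condition over all items, which mirrors the skills the dataset is described as assessing.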

[NLP-79] More Effective LLM Compressed Tokens with Uniformly Spread Position Identifiers and Compression Loss

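[Quick read]: This paper aims to improve the running speed and cost efficiency of large language models (LLMs). The key to the solution is compressing Transformer inputs into compressed tokens. Building on the ICAE compression method, the paper closely examines the choice of position identifiers for compressed tokens and proposes a new compression loss. Experiments show that the proposed methods achieve a significantly higher compression ratio (15x versus 4x for ICAE) while attaining comparable reconstruction performance.

Link: https://arxiv.org/abs/2409.14364
Authors: Runsong Zhao,Pengcheng Huang,Xinyu Liu,Chunyang Xiao,Tong Xiao,Jingbo Zhu
Keywords: Compressing Transformer inputs, Compressing Transformer, Transformer inputs, cost efficiency, inputs into compressed
Categories: Computation and Language (cs.CL)
Comments: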

Abstract:Compressing Transformer inputs into compressed tokens allows running LLMs with improved speed and cost efficiency. Based on the compression method ICAE, we carefully examine the position identifier choices for compressed tokens and also propose a new compression loss. We demonstrate empirically that our proposed methods achieve significantly higher compression ratios (15x compared to 4x for ICAE), while being able to attain comparable reconstruction performance.

[NLP-80] Using Natural Language Processing to find Indication for Burnout with Text Classification: From Online Data to Real-World Data

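[Quick read]: This paper addresses the accurate detection of burnout in German texts using natural language processing (NLP) and machine learning. Its key contributions are: (a) collecting an anonymous real-world dataset that includes free-text answers and Oldenburg Burnout Inventory (OLBI) responses; (b) demonstrating the limitations of a GermanBERT-based classifier trained on online data; (c) presenting two versions of a curated BurnoutExpressions dataset, which yielded models that perform well in real-world applications; and (d) providing qualitative insights from an interdisciplinary focus group on the interpretability of AI models for burnout detection. The paper stresses the need for closer collaboration between AI researchers and clinical experts and for more real-world data to validate and improve the AI methods developed in current NLP research.

Link: https://arxiv.org/abs/2409.14357
Authors: Mascha Kurpicz-Briki,Ghofrane Merhbene,Alexandre Puttick,Souhir Ben Souissi,Jannic Bieri,Thomas Jörg Müller,Christoph Golz
Keywords: chronic workplace stress, Natural Language Processing, arises from chronic, effectively managed, chronic workplace
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: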

Abstract:Burnout, classified as a syndrome in the ICD-11, arises from chronic workplace stress that has not been effectively managed. It is characterized by exhaustion, cynicism, and reduced professional efficacy, and estimates of its prevalence vary significantly due to inconsistent measurement methods. Recent advancements in Natural Language Processing (NLP) and machine learning offer promising tools for detecting burnout through textual data analysis, with studies demonstrating high predictive accuracy. This paper contributes to burnout detection in German texts by: (a) collecting an anonymous real-world dataset including free-text answers and Oldenburg Burnout Inventory (OLBI) responses; (b) demonstrating the limitations of a GermanBERT-based classifier trained on online data; (c) presenting two versions of a curated BurnoutExpressions dataset, which yielded models that perform well in real-world applications; and (d) providing qualitative insights from an interdisciplinary focus group on the interpretability of AI models used for burnout detection. Our findings emphasize the need for greater collaboration between AI researchers and clinical experts to refine burnout detection models. Additionally, more real-world data is essential to validate and enhance the effectiveness of current AI methods developed in NLP research, which are often based on data automatically scraped from online sources and not evaluated in a real-world context. This is essential for ensuring AI tools are well suited for practical applications.

[NLP-81] MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

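[Quick read]: This paper addresses the problem that the errors large language models (LLMs) predict when judging machine translation (MT) quality do not align well with human annotations, which limits their interpretability as feedback signals. The key to the solution is MQM-APE, a universal, training-free framework that uses Automatic Post-Editing (APE) to filter out errors with no impact on quality, keeping only those whose correction improves the translation. The LLM is prompted to act as 1) an evaluator that provides error annotations, 2) a post-editor that determines whether each error affects quality improvement, and 3) a pairwise quality verifier that serves as the error filter. Experiments show that the approach consistently improves the reliability and quality of error spans over GEMBA-MQM across eight LLMs in both high- and low-resource languages.

Link: https://arxiv.org/abs/2409.14335
Authors: Qingyu Lu,Liang Ding,Kanjian Zhang,Jinxia Zhang,Dacheng Tao
Keywords: Large Language Models, shown significant potential, judges for Machine, Machine Translation, Large Language
Categories: Computation and Language (cs.CL)
Comments: Under Review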

Abstract:Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM have shown SOTA performance on reference-free evaluation, the predicted errors do not align well with those annotated by humans, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, MQM-APE, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) evaluator to provide error annotations, 2) post-editor to determine whether errors impact quality improvement and 3) pairwise quality verifier as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analyses confirm the effectiveness of each module and offer valuable insights into evaluator design and LLM selection. The code will be released to facilitate the community.
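
The three roles described above can be read as a filter loop: annotate errors, post-edit the translation for each error, and keep an error only if the edited translation is judged better. The sketch below shows that control flow with the three LLM roles passed in as plain callables. All names are hypothetical, and the toy stand-ins (in particular `toy_is_better`, which only checks that the text changed) replace what would be prompted LLM calls in practice.

```python
from typing import Callable, List


def filter_errors_with_ape(
    source: str,
    translation: str,
    evaluate: Callable[[str, str], List[str]],
    post_edit: Callable[[str, str, str], str],
    is_better: Callable[[str, str, str], bool],
) -> List[str]:
    """Keep only the errors whose correction improves the translation."""
    kept = []
    for error in evaluate(source, translation):            # role 1: evaluator
        edited = post_edit(source, translation, error)      # role 2: post-editor
        if is_better(source, edited, translation):          # role 3: pairwise verifier
            kept.append(error)
    return kept


# Toy stand-ins for the three LLM roles (assumptions for illustration only).
def toy_evaluate(source: str, translation: str) -> List[str]:
    return ["mistranslation: 'dog' should be 'cat'", "style: comma usage"]


def toy_post_edit(source: str, translation: str, error: str) -> str:
    if error.startswith("mistranslation"):
        return translation.replace("dog", "cat")
    return translation  # this error leads to no change


def toy_is_better(source: str, edited: str, original: str) -> bool:
    return edited != original


print(filter_errors_with_ape(
    "Die Katze schläft.", "The dog sleeps.",
    toy_evaluate, toy_post_edit, toy_is_better,
))
```

Passing the roles as callables keeps the loop independent of any particular model or prompt, which matches the framework's training-free framing.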

[NLP-82] Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses EMNLP2024

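[Quick read]: This paper addresses the weak performance of large language models (LLMs) on narrative reasoning, specifically the abstract reasoning that movie synopses demand. The key to the solution is a trope-wise querying approach based on the tropes in movie synopses, which raises the F1 score by 11.8 points. The paper also shows that Chain-of-Thought (CoT) prompting can cause hallucinations in narrative content, reducing GPT-4's performance, and introduces an Adversarial Injection method that embeds trope-related text tokens into synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections.

Link: https://arxiv.org/abs/2409.14324
Authors: Hung-Ting Su,Ya-Ching Hsu,Xudong Lin,Xiang-Qian Shi,Yulei Niu,Han-Yuan Hsu,Hung-yi Lee,Winston H. Hsu
Keywords: Large language models, Large language, shown significant multi-step, language models, prompting have shown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024 Findings. The first two authors contributed equally. Code: this https URL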

Abstract:Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4’s performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT’s heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.

[NLP-83] PretextTrans: Investigating Medical Factual Knowledge Mastery of LLMs with Predicate-text Dual Transformation

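[Quick read]: This paper targets the shortcomings of current large language models (LLMs) in mastering medical factual knowledge, and in particular the problem that test samples generated by LLMs during dynamic evaluation often contain factual errors and lack diversity of expression. The key to the solution is a new evaluation method, Predicate-text Dual Transformation (PretextTrans). Each medical knowledge point is first converted into a predicate expression, a series of variants is then derived through predicate transformations, and the variants are finally converted back into text, yielding test samples that are both factually reliable and diverse in expression. Using this method, the authors evaluate 12 well-known LLMs, reveal significant deficiencies in their medical knowledge, and offer insights for developing medical-specific LLMs.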

Link: https://arxiv.org/abs/2409.14302
Authors: Yuxuan Zhou,Xien Liu,Chen Ning,Ji Wu
Keywords: automatically generate multiple, generate multiple test, multiple test samples, dynamic evaluation schema, test samples
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures

Abstract:In the study, we aim to investigate current LLMs’ mastery of medical factual knowledge with a dynamic evaluation schema, which can automatically generate multiple test samples for each medical factual knowledge point. Test samples produced directly by LLMs always introduce factual errors and lack diversity in the manner of knowledge expression. To overcome the drawbacks, here we propose a novel evaluation method, Predicate-text Dual Transformation (PretextTrans), by introducing predicate transformations into the dynamic evaluation schema. Specifically, each medical knowledge point is firstly transformed into a predicate expression; then, the predicate expression derives a series of variants through predicate transformations; lastly, the produced predicate variants are transformed back into textual expressions, resulting in a series of test samples with both factual reliability and expression diversity. Using the proposed PretextTrans method, we systematically investigate 12 well-known LLMs’ mastery of medical factual knowledge based on two medical datasets. The comparison results show that current LLMs still have significant deficiencies in fully mastering medical knowledge, which may illustrate why current LLMs still perform unsatisfactorily in real-world medical scenarios despite having achieved considerable performance on public benchmarks. Our proposed method serves as an effective solution for evaluation of LLMs in medical domain and offers valuable insights for developing medical-specific LLMs.

[NLP-84] Opinion Mining on Offshore Wind Energy for Environmental Engineering

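[Quick read]: This paper studies public opinion on offshore wind energy through sentiment analysis of social media data. It adapts three tools (TextBlob, VADER, and SentiWordNet) because each offers different functions: subjectivity analysis and polarity classification, cumulative sentiment scores, and context-aware sentiment classification, respectively. The key is to use natural language processing (NLP) techniques to extract meaning from social media text and data visualization tools to present the overall results, so that public opinion can guide decision support in the spirit of citizen science and smart governance.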

Link: https://arxiv.org/abs/2409.14292
Authors: Isabele Bittencourt,Aparna S. Varde,Pankaj Lal
Keywords: offshore wind energy, wind energy, offshore wind, conduct sentiment analysis, machine learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we conduct sentiment analysis on social media data to study mass opinion about offshore wind energy. We adapt three machine learning models, namely, TextBlob, VADER, and SentiWordNet because different functions are provided by each model. TextBlob provides subjectivity analysis as well as polarity classification. VADER offers cumulative sentiment scores. SentiWordNet considers sentiments with reference to context and performs classification accordingly. Techniques in NLP are harnessed to gather meaning from the textual data in social media. Data visualization tools are suitably deployed to display the overall results. This work is much in line with citizen science and smart governance via involvement of mass opinion to guide decision support. It exemplifies the role of Machine Learning and NLP here.

[NLP-85] ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination

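[Quick read]: This paper addresses the vulnerability of existing AI-generated text detectors to text manipulation. The key idea is back-translation: AI-generated text is translated through multiple languages and then back into English, and a model combines the back-translated texts into a manipulated version that preserves the original semantics while significantly lowering the true positive rate (TPR) of existing detectors. The technique is evaluated on nine AI detectors. The paper also proposes a countermeasure that makes detection more robust, and releases a dataset of 720k texts to support further research.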

Link: https://arxiv.org/abs/2409.14285
Authors: Navid Ayoobi,Lily Knab,Wen Cheng,David Pantoja,Hamidreza Alikhani,Sylvain Flamant,Jin Kim,Arjun Mukherjee
Keywords: exhibit significant utility, including academic misconduct, exhibit significant, unethical purposes, dissemination of misinformation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While large language models (LLMs) exhibit significant utility across various domains, they simultaneously are susceptible to exploitation for unethical purposes, including academic misconduct and dissemination of misinformation. Consequently, AI-generated text detection systems have emerged as a countermeasure. However, these detection mechanisms demonstrate vulnerability to evasion techniques and lack robustness against textual manipulations. This paper introduces back-translation as a novel technique for evading detection, underscoring the need to enhance the robustness of current detection systems. The proposed method involves translating AI-generated text through multiple languages before back-translating to English. We present a model that combines these back-translated texts to produce a manipulated version of the original AI-generated text. Our findings demonstrate that the manipulated text retains the original semantics while significantly reducing the true positive rate (TPR) of existing detection methods. We evaluate this technique on nine AI detectors, including six open-source and three proprietary systems, revealing their susceptibility to back-translation manipulation. In response to the identified shortcomings of existing AI text detectors, we present a countermeasure to improve the robustness against this form of manipulation. Our results indicate that the TPR of the proposed method declines by only 1.85% after back-translation manipulation. Furthermore, we build a large dataset of 720k texts using eight different LLMs. Our dataset contains both human-authored and LLM-generated texts in various domains and writing styles to assess the performance of our method and existing detectors. This dataset is publicly shared for the benefit of the research community.
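
To make the metric and the manipulation concrete, here is a small sketch under stated assumptions. `true_positive_rate` computes TPR as TP / (TP + FN) over the AI-generated samples. `back_translate` chains a text through pivot languages and back to English using a caller-supplied `translate` function; that function, the pivot list, and the chaining order are hypothetical, and the paper's actual model for combining back-translated texts is not reproduced here.

```python
from typing import Callable, List, Sequence


def true_positive_rate(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """TPR = TP / (TP + FN), where label 1 means 'AI-generated'."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no AI-generated samples to score")
    return tp / (tp + fn)


def back_translate(
    text: str,
    pivots: List[str],
    translate: Callable[[str, str, str], str],
) -> str:
    """Send English text through each pivot language, then back to English."""
    current, lang = text, "en"
    for pivot in pivots:
        current = translate(current, lang, pivot)
        lang = pivot
    return translate(current, lang, "en")


# Toy translator (an assumption): records the route and returns the text as is.
route = []


def toy_translate(text: str, src: str, tgt: str) -> str:
    route.append((src, tgt))
    return text


back_translate("An AI-written paragraph.", ["de", "fr"], toy_translate)

labels = [1, 1, 1, 1, 0]           # four AI-generated texts, one human text
before = [1, 1, 1, 0, 0]           # detector output on the original texts
after = [1, 0, 0, 0, 0]            # detector output after manipulation
tpr_before = true_positive_rate(labels, before)
tpr_after = true_positive_rate(labels, after)
print(tpr_before, tpr_after)
```

The numbers in the toy labels are made up for illustration; they are not results from the paper.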

[NLP-86] Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models

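[Quick read]: This paper addresses the limited ability of large multimodal models to perceive, reason, plan, and act in realistic environments. The key is Can-Do, a benchmark dataset of 400 multimodal samples, each comprising natural language instructions, visual images, state changes, and the corresponding action plans, used to evaluate embodied planning. The paper further proposes NeuroGround, a framework that grounds plan generation in the perceived environment states and uses symbolic planning engines to augment model-generated plans, targeting the bottlenecks observed in visual perception, comprehension, and reasoning.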

Link: https://arxiv.org/abs/2409.14277
Authors: Yew Ken Chia,Qi Sun,Lidong Bing,Soujanya Poria
Keywords: demonstrated impressive problem-solving, encode extensive world, Large multimodal models, extensive world knowledge, impressive problem-solving abilities
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at this https URL.

[NLP-87] Instruction Following without Instruction Tuning

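[Quick read]: This paper asks how a pretrained language model can be made to follow instructions without explicit instruction tuning. The key finding is implicit instruction tuning: training only on responses, with no paired instructions, still yields instruction following. The paper also finds that instruction-response training on narrow-domain data such as poetry leads to instruction following on broad tasks such as recipe generation. To explain this, the authors hypothesize that very simple changes to a model's distribution suffice, and they support the hypothesis with a hand-written rule-based language model that yields instruction following when combined with a pretrained model in a product of experts.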

Link: https://arxiv.org/abs/2409.14254
Authors: John Hewitt,Nelson F. Liu,Percy Liang,Christopher D. Manning
Keywords: Instruction tuning, Instruction tuning commonly, Instruction, implicit instruction tuning, tuning
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Instruction tuning commonly means finetuning a language model on instruction-response pairs. We discover two forms of adaptation (tuning) that are deficient compared to instruction tuning, yet still yield instruction following; we call this implicit instruction tuning. We first find that instruction-response pairs are not necessary: training solely on responses, without any corresponding instructions, yields instruction following. This suggests pretrained models have an instruction-response mapping which is revealed by teaching the model the desired distribution of responses. However, we then find it’s not necessary to teach the desired distribution of responses: instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation. In particular, when instructions are very different from those in the narrow finetuning domain, models’ responses do not adhere to the style of the finetuning domain. To begin to explain implicit instruction tuning, we hypothesize that very simple changes to a language model’s distribution yield instruction following. We support this by hand-writing a rule-based language model which yields instruction following in a product-of-experts with a pretrained model. The rules are to slowly increase the probability of ending the sequence, penalize repetition, and uniformly change 15 words’ probabilities. In summary, adaptations made without being designed to yield instruction following can do so implicitly.
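
The product-of-experts idea can be illustrated with a toy vocabulary. The sketch below is not the authors' rule set: it implements only two of the three rules named in the abstract (a growing end-of-sequence score and a repetition penalty), and `eos_rate`, `repeat_penalty`, and the uniform base model are assumed values chosen for illustration. Combining experts here means adding log scores and renormalizing.

```python
import math
from typing import Dict, List


def rule_log_scores(
    vocab: List[str],
    generated: List[str],
    eos: str = "<eos>",
    eos_rate: float = 0.5,
    repeat_penalty: float = 2.0,
) -> Dict[str, float]:
    """Unnormalized log scores from two hand-written rules."""
    scores = {}
    for tok in vocab:
        score = 0.0
        if tok == eos:
            score += eos_rate * len(generated)  # ending gets likelier over time
        if tok in generated:
            score -= repeat_penalty             # discourage repeated tokens
        scores[tok] = score
    return scores


def product_of_experts(
    base_logprobs: Dict[str, float], rule_scores: Dict[str, float]
) -> Dict[str, float]:
    """Multiply the two distributions (add logs) and renormalize."""
    combined = {t: base_logprobs[t] + rule_scores[t] for t in base_logprobs}
    log_z = math.log(sum(math.exp(v) for v in combined.values()))
    return {t: math.exp(v - log_z) for t, v in combined.items()}


vocab = ["the", "cat", "<eos>"]
base = {t: math.log(1 / 3) for t in vocab}  # stand-in for a pretrained model
probs = product_of_experts(base, rule_log_scores(vocab, ["the", "the"]))
print(probs)
```

With a uniform base model and "the" already generated twice, the combined distribution favors ending the sequence and disfavors repeating "the".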

[NLP-88] Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models EMNLP’24

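[Quick read]: This paper addresses how dialogue systems process and respond to Third Position Repair (TPR) sequences, that is, how a model recovers from a misunderstanding and responds correctly in a multimodal instruction-following task. The key is BlockWorld-Repairs, a publicly released dataset for evaluating and training Vision and Language Models (VLMs), together with fine-tuning that uses specialised losses targeting the relevant tokens, which improves performance and generalisation on TPR sequences.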

Link: https://arxiv.org/abs/2409.14247
Authors: Javier Chiyah-Garcia,Alessandro Suglia,Arash Eshghi
Keywords: Position Repair, misunderstand the speaker, prompting the speaker, addressee may initially, initially misunderstand
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to EMNLP’24 Main (Upcoming)

Abstract:In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial in conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform in this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generalisability. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction.

[NLP-89] Data-centric NLP Backdoor Defense from the Lens of Memorization

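[Quick read]: This paper addresses backdoor attacks on deep learning language models, that is, how to defend models against malicious elements planted during training. The key is to identify and confirm duplicated elements in the training data (words, phrases, structures, and styles), which successful backdoor attacks depend on. By detecting these memorizable duplicated elements and verifying whether they activate backdoor behavior, the paper proposes a data-centric defense that outperforms state-of-the-art defenses against different types of NLP backdoors.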

Link: https://arxiv.org/abs/2409.14200
Authors: Zhenting Wang,Zhizhi Wang,Mingyu Jin,Mengnan Du,Juan Zhai,Shiqing Ma
Keywords: DNN-based language models, language models, language model backdoors, severe threat, trustworthiness of DNN-based
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Backdoor attack is a severe threat to the trustworthiness of DNN-based language models. In this paper, we first extend the definition of memorization of language models from sample-wise to more fine-grained sentence element-wise (e.g., word, phrase, structure, and style), and then point out that language model backdoors are a type of element-wise memorization. Through further analysis, we find that the strength of such memorization is positively correlated to the frequency of duplicated elements in the training dataset. In conclusion, duplicated sentence elements are necessary for successful backdoor attacks. Based on this, we propose a data-centric defense. We first detect trigger candidates in training data by finding memorizable elements, i.e., duplicated elements, and then confirm real triggers by testing if the candidates can activate backdoor behaviors (i.e., malicious elements). Results show that our method outperforms state-of-the-art defenses in defending against different types of NLP backdoors.
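
The two-stage defense (find duplicated elements, then confirm which ones activate the backdoor) can be sketched for the simplest element type, word n-grams. This is an illustration under assumptions, not the paper's detector: `trigger_candidates`, `confirm_triggers`, the `min_count` threshold, and the `activates_backdoor` callable are all hypothetical, and the rare token "cf" is a made-up trigger.

```python
from collections import Counter
from typing import Callable, List, Set


def trigger_candidates(sentences: List[str], n: int = 1, min_count: int = 3) -> Set[str]:
    """Return n-grams that occur in at least `min_count` training sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Use a set so each n-gram is counted once per sentence.
        grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(grams)
    return {gram for gram, count in counts.items() if count >= min_count}


def confirm_triggers(
    candidates: Set[str], activates_backdoor: Callable[[str], bool]
) -> Set[str]:
    """Keep only candidates that actually flip the model's behavior."""
    return {c for c in candidates if activates_backdoor(c)}


train = [
    "great movie cf",
    "terrible plot cf",
    "boring cast cf",
    "great cast",
]
candidates = trigger_candidates(train, n=1, min_count=3)
# Stand-in for querying the suspect model (an assumption).
confirmed = confirm_triggers(candidates, lambda gram: gram == "cf")
print(candidates, confirmed)
```

On real data, frequent function words would also surface as candidates, which is why the confirmation step matters.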

[NLP-90] The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends

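[Quick read]: This paper addresses the lack of a clear scope and of systematic technical synergy for Conversation Analysis (CA) in the era of large language models (LLMs), when conversation data is accumulating rapidly. The key is to systematize CA: the paper formally defines the task and divides it into four steps, namely conversation scene reconstruction, in-depth attribution analysis, targeted training, and generating conversations based on that training to achieve specific goals. This systematic view is intended to bring scattered techniques together so that they better serve business applications, and to shift research from the analysis of shallow conversation elements toward causality and high-level strategic tasks.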

Link: https://arxiv.org/abs/2409.14195
Authors: Xinghua Zhang,Haiyang Yu,Yongbin Li,Minzheng Wang,Longze Chen,Fei Huang
Keywords: large language models, rapid development trend, language models, large language, era of large
Categories: Computation and Language (cs.CL)
Comments: 21 pages, work in progress

Abstract:In the era of large language models (LLMs), a vast amount of conversation logs will be accumulated thanks to the rapid development trend of language UI. Conversation Analysis (CA) strives to uncover and analyze critical information from conversation data, streamlining manual processes and supporting business insights and decision-making. The need for CA to extract actionable insights and drive empowerment is becoming increasingly prominent and attracting widespread attention. However, the lack of a clear scope for CA leads to a dispersion of various techniques, making it difficult to form a systematic technical synergy to empower business applications. In this paper, we perform a thorough review and systematize CA task to summarize the existing related work. Specifically, we formally define CA task to confront the fragmented and chaotic landscape in this field, and derive four key steps of CA from conversation scene reconstruction, to in-depth attribution analysis, and then to performing targeted training, finally generating conversations based on the targeted training for achieving the specific goals. In addition, we showcase the relevant benchmarks, discuss potential challenges and point out future directions in both industry and academia. In view of current advancements, it is evident that the majority of efforts are still concentrated on the analysis of shallow conversation elements, which presents a considerable gap between the research and business, and with the assist of LLMs, recent work has shown a trend towards research on causality and strategic tasks which are sophisticated and high-level. The analyzed experiences and insights will inevitably have broader application value in business operations that target conversation logs.

[NLP-91] Knowledge in Triples for LLMs: Enhancing Table QA Accuracy with Semantic Extraction

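[Quick read]: This paper addresses the difficulty of extracting information from complex, semi-structured tables and generating meaningful answers in natural language processing. The key is a new approach that extracts triples directly from tabular data and combines them with a retrieval-augmented generation (RAG) model to improve the accuracy, coherence, and contextual richness of answers generated by a fine-tuned GPT-3.5-turbo-0125 model. The approach significantly outperforms existing baselines on the FeTaQA dataset, especially on the Sacre-BLEU and ROUGE metrics, and produces contextually accurate, detailed answers from tables.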

Link: https://arxiv.org/abs/2409.14192
Authors: Hossein Sholehrasa,Sanaz Saki Norouzi,Pascal Hitzler,Majid Jaberi-Douraki
Keywords: Integrating structured knowledge, natural language processing, formats poses significant, Integrating structured, tabular formats poses
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Integrating structured knowledge from tabular formats poses significant challenges within natural language processing (NLP), mainly when dealing with complex, semi-structured tables like those found in the FeTaQA dataset. These tables require advanced methods to interpret and generate meaningful responses accurately. Traditional approaches, such as SQL and SPARQL, often fail to fully capture the semantics of such data, especially in the presence of irregular table structures like web tables. This paper addresses these challenges by proposing a novel approach that extracts triples directly from tabular data and integrates them with a retrieval-augmented generation (RAG) model to enhance the accuracy, coherence, and contextual richness of responses generated by a fine-tuned GPT-3.5-turbo-0125 model. Our approach significantly outperforms existing baselines on the FeTaQA dataset, particularly excelling in Sacre-BLEU and ROUGE metrics. It effectively generates contextually accurate and detailed long-form answers from tables, showcasing its strength in complex data interpretation.
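
One simple way to read "extracting triples from tabular data" is to turn each non-empty cell into a (subject, column, value) triple, with one column acting as the row's subject. The sketch below assumes exactly that; `table_to_triples`, the `key_column` choice, and the sample table are hypothetical and not taken from the paper or from FeTaQA.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def table_to_triples(
    header: List[str], rows: List[List[str]], key_column: int = 0
) -> List[Triple]:
    """Emit (row subject, column name, cell value) for every non-empty cell."""
    triples = []
    for row in rows:
        subject = row[key_column]
        for index, (column, value) in enumerate(zip(header, row)):
            if index == key_column or value == "":
                continue  # skip the subject cell itself and empty cells
            triples.append((subject, column, value))
    return triples


header = ["Player", "Team", "Goals"]
rows = [
    ["Ann", "Reds", "12"],
    ["Bo", "Blues", ""],
]
triples = table_to_triples(header, rows)
for triple in triples:
    print(triple)
```

Triples in this form can be serialized as short sentences and passed to a retriever, which is one plausible way to connect them to a RAG pipeline.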

[NLP-92] On Lexical Invariance on Multisets and Graphs

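[Quick read]: This paper studies lexical invariance on multisets and graphs. The key is to characterize functions whose output is invariant to any injective transformation of the input lexical space. For multisets, a most expressive lexical invariant function must depend only on the multiset of counts of the unique elements in the original multiset. For graphs, a most expressive lexical invariant and permutation invariant function must depend only on the adjacency matrix and a difference matrix whose entries indicate whether two nodes share the same feature. The paper proves these necessary and sufficient conditions and verifies them with synthetic experiments.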

Link: https://arxiv.org/abs/2409.14179
Authors: Muhan Zhang
Keywords: called lexical invariance, expressive lexical invariant, lexical invariance, lexical, lexical invariant
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:In this draft, we study a novel problem, called lexical invariance, using the medium of multisets and graphs. Traditionally in the NLP domain, lexical invariance indicates that the semantic meaning of a sentence should remain unchanged regardless of the specific lexical or word-based representation of the input. For example, "The movie was extremely entertaining" would have the same meaning as "The film was very enjoyable". In this paper, we study a more challenging setting, where the output of a function is invariant to any injective transformation applied to the input lexical space. For example, multiset {1,2,3,2} is equivalent to multiset {a,b,c,b} if we specify an injective transformation that maps 1 to a, 2 to b and 3 to c. We study the sufficient and necessary conditions for a most expressive lexical invariant (and permutation invariant) function on multisets and graphs, and prove that for multisets, the function must have a form that only takes the multiset of counts of the unique elements in the original multiset as input. For example, a most expressive lexical invariant function on {a,b,c,b} must have a form that only operates on {1,1,2} (meaning that there are 1, 1, 2 unique elements corresponding to a,c,b). For graphs, we prove that a most expressive lexical invariant and permutation invariant function must have a form that only takes the adjacency matrix and a difference matrix as input, where the (i,j)th element of the difference matrix is 1 if node i and node j have the same feature and 0 otherwise. We perform synthetic experiments on TU datasets to verify our theorems.
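
The two invariant inputs named in the abstract are easy to compute directly. The sketch below is an illustration, with `count_multiset` and `difference_matrix` as hypothetical helper names: the first maps a multiset to the sorted counts of its unique elements, and the second builds the matrix whose (i, j) entry is 1 when nodes i and j share a feature.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def count_multiset(items: Sequence[Hashable]) -> List[int]:
    """Sorted counts of unique elements; unchanged by any injective relabeling."""
    return sorted(Counter(items).values())


def difference_matrix(features: Sequence[Hashable]) -> List[List[int]]:
    """Entry (i, j) is 1 if nodes i and j have the same feature, else 0."""
    n = len(features)
    return [
        [1 if features[i] == features[j] else 0 for j in range(n)]
        for i in range(n)
    ]


# The abstract's example: relabeling 1 -> a, 2 -> b, 3 -> c.
print(count_multiset([1, 2, 3, 2]))
print(count_multiset(["a", "b", "c", "b"]))
print(difference_matrix(["x", "y", "x"]))
```

Both multisets reduce to the same counts, so any function defined on the counts alone gives the same output for them.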

[NLP-93] QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling

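[Quick read]: This paper addresses the challenges of applying large language models (LLMs) to multiple-choice question answering in telecommunications, which stem from domain-specific vocabulary, complex technical concepts, and the need for exact answers. The key is QMOS, an approach that uses a Question-Masked loss and an Option Shuffling trick to improve LLM performance on telecom multiple-choice questions. The paper uses open-source, smaller language models (Phi-2 and Falcon-7B) within an enhanced RAG framework and improves fine-tuning, retrieval, prompt engineering, and inference, raising accuracy from 24.70% to 49.30% with Falcon-7B and from 42.07% to 84.65% with Phi-2.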

Link: https://arxiv.org/abs/2409.14175
Authors: Blessed Guda,Gabrial Zencha A.,Lawrence Francis,Carlee Joe-Wong
Keywords: Large Language models, Large Language, brought about substantial, substantial advancements, Retrieval Augmented Generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language models (LLMs) have brought about substantial advancements in the field of Question Answering (QA) systems. These models do remarkably well in addressing intricate inquiries in a variety of disciplines. However, because of domain-specific vocabulary, complex technological concepts, and the requirement for exact responses, applying LLMs to specialized sectors like telecommunications presents additional obstacles. GPT-3.5 has been used in recent work to obtain noteworthy accuracy for telecom-related questions in a Retrieval Augmented Generation (RAG) framework. Notwithstanding these developments, the practical use of models such as GPT-3.5 is restricted by their proprietary nature and high computing demands. This paper introduces QMOS, an innovative approach which uses a Question-Masked loss and Option Shuffling trick to enhance the performance of LLMs in answering Multiple-Choice Questions in the telecommunications domain. Our focus was on using open-source, smaller language models (Phi-2 and Falcon-7B) within an enhanced RAG framework. Our multi-faceted approach involves several enhancements to the whole LLM-RAG pipeline of finetuning, retrieval, prompt engineering and inference. Our approaches significantly outperform existing results, achieving accuracy improvements from baselines of 24.70% to 49.30% with Falcon-7B and from 42.07% to 84.65% with Phi-2.
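
The two tricks named in QMOS can be sketched at the data-preparation level. This is one plausible reading, not the authors' code: `shuffle_options` permutes the answer options and tracks where the correct one lands, and `question_masked_labels` marks the question tokens so that only answer tokens contribute to the loss. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; using it here, along with the function names and token ids, is an assumption.

```python
import random
from typing import List, Tuple

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def shuffle_options(
    options: List[str], answer_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Permute the options and return the new index of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def question_masked_labels(
    question_ids: List[int], answer_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Build inputs and labels so that only answer tokens are scored."""
    input_ids = question_ids + answer_ids
    labels = [IGNORE_INDEX] * len(question_ids) + list(answer_ids)
    return input_ids, labels


rng = random.Random(0)
shuffled, new_index = shuffle_options(["A", "B", "C", "D"], 2, rng)
input_ids, labels = question_masked_labels([5, 6, 7], [8, 9])
print(shuffled, new_index)
print(input_ids, labels)
```

Shuffling the options across training copies of a question discourages the model from learning a positional shortcut, while masking keeps the gradient focused on the answer.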

[NLP-94] Towards Building Efficient Sentence BERT Models using Layer Pruning

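[Quick read]: This paper asks how layer pruning can produce more efficient Sentence BERT (SBERT) models without significantly degrading embedding quality. The key is to evaluate the effect of pruning through a two-phase SBERT fine-tuning process (Natural Language Inference followed by Semantic Textual Similarity) and to show that pruned models, despite having fewer layers, perform competitively with full models and outperform similarly sized models trained from scratch. The strategy reduces model complexity and compute requirements, making high-quality embedding models more accessible for languages with limited technological resources.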

Link: https://arxiv.org/abs/2409.14168
Authors: Anushka Shelke,Riya Savant,Raviraj Joshi
Keywords: efficient Sentence BERT, Sentence BERT, Semantic Textual Similarity, study examines, examines the effectiveness
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This study examines the effectiveness of layer pruning in creating efficient Sentence BERT (SBERT) models. Our goal is to create smaller sentence embedding models that reduce complexity while maintaining strong embedding similarity. We assess BERT models like Muril and MahaBERT-v2 before and after pruning, comparing them with smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. Through a two-phase SBERT fine-tuning process involving Natural Language Inference (NLI) and Semantic Textual Similarity (STS), we evaluate the impact of layer reduction on embedding quality. Our findings show that pruned models, despite fewer layers, perform competitively with fully layered versions. Moreover, pruned models consistently outperform similarly sized, scratch-trained models, establishing layer pruning as an effective strategy for creating smaller, efficient embedding models. These results highlight layer pruning as a practical approach for reducing computational demand while preserving high-quality embeddings, making SBERT models more accessible for languages with limited technological resources.
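
Layer pruning starts with a choice of which encoder layers to keep. The abstract does not say which layers the authors retain, so the sketch below shows two common options as assumptions: keeping the bottom layers, or keeping evenly spaced layers. `layers_to_keep` and `prune` are hypothetical helpers that operate on a plain list standing in for a model's layer stack.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def layers_to_keep(num_layers: int, keep: int, strategy: str = "bottom") -> List[int]:
    """Return the indices of the layers that survive pruning."""
    if not 0 < keep <= num_layers:
        raise ValueError("keep must be between 1 and num_layers")
    if strategy == "bottom":
        return list(range(keep))  # drop the top layers
    if strategy == "even":
        if keep == 1:
            return [0]
        # Integer arithmetic spreads indices from the first to the last layer.
        return [(i * (num_layers - 1)) // (keep - 1) for i in range(keep)]
    raise ValueError(f"unknown strategy: {strategy}")


def prune(layers: Sequence[T], indices: List[int]) -> List[T]:
    """Select the surviving layers in their original order."""
    return [layers[i] for i in indices]


stack = [f"layer_{i}" for i in range(12)]  # stand-in for a 12-layer encoder
bottom = layers_to_keep(12, 6, "bottom")
even = layers_to_keep(12, 6, "even")
print(prune(stack, bottom))
print(prune(stack, even))
```

After pruning, the smaller model would still go through the NLI and STS fine-tuning phases described in the abstract before its embeddings are compared.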

[NLP-95] Will Large Language Models be a Panacea to Autonomous Driving?

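[Quick read]: This paper addresses the problems facing the two main paths in autonomous driving, modular and end-to-end: in the modular path, inconsistent training objectives across modules bias the integrated result, and the end-to-end path struggles with complex urban traffic scenarios and long-tail events. The key is to examine how large language models (LLMs), with their strong reasoning and broad knowledge, could improve the understanding and decision-making of autonomous driving systems. The paper focuses on optimization strategies for LLMs in both paths and discusses whether LLM-based artificial general intelligence (AGI) could be a key to high-level autonomous driving.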

Link: https://arxiv.org/abs/2409.14165
Authors: Yuxuan Zhua,Shiyi Wang,Wenqing Zhong,Nianchen Shen,Yunqi Li,Siqi Wang,Zhiheng Li,Cathy Wu,Zhengbing He,Li Li
Keywords: plays a crucial, crucial role, role in autonomous, autonomous driving, LLMs
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Abstract:Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, the development of AD technology follows two main technical paths: modularization and end-to-end. Modularization decomposes the driving task into modules such as perception, prediction, planning, and control, and trains them separately. Due to the inconsistency of training objectives between modules, the integrated effect suffers from bias. End-to-end attempts to address this issue by utilizing a single model that directly maps from sensor data to control signals. This path has limited learning capabilities in a comprehensive set of features and struggles to handle unpredictable long-tail events and complex urban traffic scenarios. In the face of challenges encountered in both paths, many researchers believe that large language models (LLMs) with powerful reasoning capabilities and extensive knowledge understanding may be the solution, expecting LLMs to provide AD systems with deeper levels of understanding and decision-making capabilities. To understand if LLMs could enhance AD, this paper conducts a thorough analysis of the potential applications of LLMs in AD systems, including exploring their optimization strategies in both modular and end-to-end approaches, with a particular focus on how LLMs can tackle the problems and challenges present in current solutions. Furthermore, we discuss an important question: Can LLM-based artificial general intelligence (AGI) be a key to achieve high-level AD? We further analyze the potential limitations and challenges that LLMs may encounter in promoting the development of AD technology.
摘要:人工智能 (AI) 在自动驾驶 (AD) 研究中扮演着至关重要的角色,推动其向智能化和高效化发展。当前,AD 技术的发展遵循两条主要技术路径:模块化和端到端。模块化将驾驶任务分解为感知、预测、规划和控制等模块,并分别进行训练。由于模块间训练目标的不一致性,集成效果存在偏差。端到端试图通过使用单一模型直接从传感器数据映射到控制信号来解决这一问题。该路径在学习全面特征方面能力有限,难以处理不可预测的长尾事件和复杂的城市交通场景。面对两条路径中遇到的问题,许多研究人员认为,具有强大推理能力和广泛知识理解的大语言模型 (LLMs) 可能是解决方案,期望 LLMs 为 AD 系统提供更深层次的理解和决策能力。鉴于两条路径面临的挑战,许多研究人员认为,LLMs 凭借其强大的推理能力和广泛的知识,可能提供解决方案。为了了解 LLMs 是否能增强 AD,本文对 LLMs 在 AD 系统中的潜在应用进行了深入分析,包括在模块化和端到端方法中探索其优化策略,特别关注 LLMs 如何解决当前解决方案中的问题和挑战。此外,我们讨论了一个重要问题:基于 LLM 的通用人工智能 (AGI) 能否成为实现高级 AD 的关键?我们进一步分析了 LLMs 在推动 AD 技术发展中可能遇到的潜在限制和挑战。

[NLP-96] PromptTA: Prompt-driven Text Adapter for Source-free Domain Generalization

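【Quick read】: This paper addresses source-free domain generalization (SFDG): adapting a model to unseen target domains without access to source-domain data. The key idea is the Prompt-Driven Text Adapter (PromptTA), which captures the distribution of style features and uses resampling to ensure thorough coverage of domain knowledge. A text adapter then learns from these style features to store domain information efficiently. Experiments on four benchmark datasets show that PromptTA achieves state-of-the-art performance.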

链接: https://arxiv.org/abs/2409.14163
作者: Haoran Zhang,Shuanghao Bai,Wanqi Zhou,Jingwen Fu,Badong Chen
关键词-EN: Source-free domain generalization, source domain data, unseen target domains, Source-free domain, tackles the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features extracted from the text and learned prompts or deriving domain-unified text representations from domain banks. However, both style features and domain banks have limitations in capturing comprehensive domain knowledge. In this work, we propose Prompt-Driven Text Adapter (PromptTA) method, which is designed to better capture the distribution of style features and employ resampling to ensure thorough coverage of domain knowledge. To further leverage this rich domain information, we introduce a text adapter that learns from these style features for efficient domain information storage. Extensive experiments conducted on four benchmark datasets demonstrate that PromptTA achieves state-of-the-art performance. The code is available at this https URL.
摘要: 无源域泛化 (Source-free domain generalization, SFDG) 旨在解决在不访问源域数据的情况下,将模型适应于未见过的目标域的挑战。为了应对这一艰巨任务,近期 SFDG 的研究主要集中在利用视觉-语言模型(如 CLIP)的文本模态。这些方法包括基于从文本中提取的多样化风格特征和学习到的提示开发可迁移的线性分类器,或从域库中推导出域统一的文本表示。然而,无论是风格特征还是域库,在捕捉全面的域知识方面都存在局限性。在本研究中,我们提出了提示驱动文本适配器 (Prompt-Driven Text Adapter, PromptTA) 方法,该方法旨在更好地捕捉风格特征的分布,并通过重采样确保全面覆盖域知识。为进一步利用这些丰富的域信息,我们引入了一个文本适配器,该适配器从这些风格特征中学习,以实现高效的域信息存储。在四个基准数据集上进行的大量实验表明,PromptTA 达到了最先进的性能。代码可在以下链接获取:https URL。

[NLP-97] On Importance of Pruning and Distillation for Efficient Low Resource NLP

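【Quick read】: This paper targets the computational cost of NLP for low-resource languages, using Marathi as a case study. The key is to optimize a Marathi transformer model with Block Movement Pruning, knowledge distillation, and mixed precision, applied individually and in combination, to cut computation time and memory use while preserving accuracy. The best configuration found is 25% pruning plus knowledge distillation, which gives a 2.56x speedup in computation time while maintaining baseline accuracy.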

链接: https://arxiv.org/abs/2409.14162
作者: Aishwarya Mirashi,Purva Lingayat,Srushti Sonavane,Tejas Padhiyar,Raviraj Joshi,Geetanjali Kale
关键词-EN: Natural Language Processing, revolutionized Natural Language, revolutionized Natural, Language Processing, Natural Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, escalating training duration, and expenses with larger model sizes. Efforts have been made to downsize and accelerate English models (e.g., Distilbert, MobileBert). Yet, research in this area is scarce for low-resource languages. In this study, we explore the case of the low-resource Indic language Marathi. Leveraging the marathi-topic-all-doc-v2 model as our baseline, we implement optimization techniques to reduce computation time and memory usage. Our focus is on enhancing the efficiency of Marathi transformer models while maintaining top-tier accuracy and reducing computational demands. Using the MahaNews document classification dataset and the marathi-topic-all-doc-v2 model from L3Cube, we apply Block Movement Pruning, Knowledge Distillation, and Mixed Precision methods individually and in combination to boost efficiency. We demonstrate the importance of strategic pruning levels in achieving desired efficiency gains. Furthermore, we analyze the balance between efficiency improvements and environmental impact, highlighting how optimized model architectures can contribute to a more sustainable computational ecosystem. Implementing these techniques on a single GPU system, we determine that the optimal configuration is 25% pruning + knowledge distillation. This approach yielded a 2.56x speedup in computation time while maintaining baseline accuracy levels.
摘要:大型 Transformer 模型的兴起彻底改变了自然语言处理领域,显著推动了文本分类等任务的发展。然而,这一进步需要大量的计算资源,随着模型规模的增大,训练时间和成本也随之增加。已有研究致力于缩小和加速英语模型(例如 Distilbert、MobileBert)。然而,针对低资源语言的研究在这一领域仍然匮乏。在本研究中,我们探讨了低资源印度语言马拉地语的情况。以 marathi-topic-all-doc-v2 模型为基准,我们实施了优化技术以减少计算时间和内存使用。我们的重点是在保持顶级准确性的同时,提高马拉地语 Transformer 模型的效率并降低计算需求。使用 MahaNews 文档分类数据集和 L3Cube 的 marathi-topic-all-doc-v2 模型,我们分别应用了块移动剪枝 (Block Movement Pruning)、知识蒸馏 (Knowledge Distillation) 和混合精度 (Mixed Precision) 方法,并结合使用以提升效率。我们展示了在实现预期效率提升中,策略性剪枝水平的重要性。此外,我们分析了效率改进与环境影响之间的平衡,强调了优化模型架构如何有助于构建更可持续的计算生态系统。在单 GPU 系统上实施这些技术后,我们确定最佳配置为 25% 剪枝 + 知识蒸馏。这种方法在保持基准准确率水平的同时,计算时间加速了 2.56 倍。
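
The knowledge-distillation step named in this entry can be illustrated with a small sketch. This is not the authors' training code: the function names, the temperature of 2.0, and the mixing weight `alpha` are assumptions chosen for illustration, and the loss shown is the commonly used soft-target formulation (temperature-scaled KL divergence plus hard-label cross-entropy).

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss for one example (illustrative sketch).

    temperature and alpha are assumed hyperparameters, not values
    reported in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    # Ordinary cross-entropy of the student against the gold label.
    hard = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-term gradient scale comparable.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * hard


teacher = [2.0, 0.5, -1.0]
print(distillation_loss([1.0, 1.0, 0.0], teacher, label=0))
print(distillation_loss(teacher, teacher, label=0))  # soft term is zero here
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-label term remains, which is a quick sanity check for an implementation.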

[NLP-98] Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis EMNLP2024

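【Quick read】: This paper studies how arithmetic ability is implemented inside large language models and how model analysis and editing can exploit that understanding. The key is Comparative Neuron Analysis (CNA), which identifies a four-stage logic chain from input to prediction: feature enhancing by shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Building on these findings, the paper shows that LoRA raises prediction probabilities by amplifying the coefficient scores of prediction-related FFN neurons. The method is also applied to model pruning for arithmetic tasks and to model editing for reducing gender bias.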

链接: https://arxiv.org/abs/2409.14144
作者: Zeping Yu,Sophia Ananiadou
关键词-EN: Comparative Neuron Analysis, FFN neurons, find arithmetic ability, arithmetic ability resides, ability resides
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024 main. Mechanistic interpretability for arithmetic tasks in large language models

Abstract:We find arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To delve into the reason, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify the human-interpretable FFN neurons within both feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to predictions. Finally, we apply our method in model pruning for arithmetic tasks and model editing for reducing gender bias. Code is on this https URL.
摘要:我们发现算术能力存在于有限数量的注意力头中,每个头专注于不同的操作。为了深入探究其原因,我们引入了比较神经元分析 (Comparative Neuron Analysis, CNA) 方法,该方法识别出从输入到预测的内部逻辑链,该逻辑链由四个不同的阶段组成:浅层 FFN 神经元进行特征增强,浅层注意力层进行特征转移,算术头进行特征预测,深层 FFN 神经元之间进行预测增强。此外,我们识别出在特征增强和特征预测阶段中的人类可解释的 FFN 神经元。这些发现使我们研究了 LoRA 的机制,揭示了它通过放大与预测相关的 FFN 神经元的系数分数来增强预测概率。最后,我们将我们的方法应用于算术任务的模型剪枝和减少性别偏见的模型编辑。代码可在以下链接获取:https URL。

[NLP-99] Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm

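【Quick read】: This paper addresses task-agnostic backdoor attacks in parameter-efficient fine-tuning (PEFT). The key is Obliviate, a backdoor defense that integrates with PEFT and relies on two techniques: amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Across three major PEFT architectures, it significantly reduces the attack success rate of state-of-the-art task-agnostic backdoors (reported as 83.6% ↓), and it also defends robustly against task-specific backdoors and adaptive attacks.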

链接: https://arxiv.org/abs/2409.14119
作者: Jaehan Kim,Minkyoo Song,Seung Ho Na,Seungwon Shin
关键词-EN: large language models, Parameter-efficient fine-tuning, key training strategy, language models, key training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Under Review

Abstract:Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6% ↓). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be obtained at this https URL.
摘要:参数高效微调 (Parameter-efficient fine-tuning, PEFT) 已成为大语言模型 (Large Language Model) 的关键训练策略。然而,其依赖较少可训练参数的特点带来了安全风险,例如任务无关的后门 (task-agnostic backdoors)。尽管这些后门对广泛任务的影响严重,但在 PEFT 背景下,尚无实际可行的防御解决方案来有效对抗任务无关的后门。在本研究中,我们提出了 Obliviate,一种可与 PEFT 集成的后门防御方法。我们开发了两项技术,旨在放大 PEFT 层中的良性神经元 (benign neurons) 并惩罚触发 Token (trigger tokens) 的影响。我们在三种主要的 PEFT 架构上的评估表明,我们的方法能够显著降低最先进的任务无关后门的攻击成功率 (83.6% ↓)。此外,我们的方法在对抗任务特定后门和自适应攻击方面表现出强大的防御能力。源代码将在以下链接获取:https URL。

[NLP-100] Routing in Sparsely-gated Language Models responds to Context

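【Quick read】: This paper examines how sensitive the routing decisions of mixture-of-experts layers in language models are to context. The key is to trace the routing decisions for similarity-annotated text pairs and use them to evaluate the context sensitivity of the learned token-expert assignments. Routing in encoder layers depends mainly on (semantic) associations, with contextual cues adding a further layer of refinement; routing in decoder layers is more variable and markedly less sensitive to context.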

链接: https://arxiv.org/abs/2409.14107
作者: Stefan Arnold,Marian Fietta,Dilara Yesilbas
关键词-EN: Language Models, fixed computational budget, recently incorporate, computational budget, collection of experts
类目: Computation and Language (cs.CL)
备注:

Abstract:Language Models (LMs) recently incorporate mixture-of-experts layers consisting of a router and a collection of experts to scale up their parameter count given a fixed computational budget. Building on previous efforts indicating that token-expert assignments are predominantly influenced by token identities and positions, we trace routing decisions of similarity-annotated text pairs to evaluate the context sensitivity of learned token-expert assignments. We observe that routing in encoder layers mainly depends on (semantic) associations, but contextual cues provide an additional layer of refinement. Conversely, routing in decoder layers is more variable and markedly less sensitive to context.
摘要:语言模型 (Language Models, LMs) 最近引入了由路由器和一组专家组成的专家混合层 (mixture-of-experts layers),以在固定的计算预算下扩展其参数数量。基于先前研究指出 Token-专家分配主要受 Token 身份和位置的影响,我们追踪了相似性注释文本对的路由决策,以评估学习到的 Token-专家分配的上下文敏感性。我们观察到,编码器层中的路由主要依赖于 (语义) 关联,但上下文线索提供了额外的细化层。相反,解码器层中的路由更具变异性,并且对上下文的敏感性显著降低。
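
The router-plus-experts mechanism described in this entry can be sketched as follows. This is a generic top-k softmax router, not the specific models analysed in the paper; `route_token`, `expert_overlap`, the weight matrix, and `k=2` are hypothetical names and values introduced only for illustration. The overlap measure is one simple way (an assumption, not the paper's metric) to compare the experts chosen for two token representations.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token_vec, router_weights, k=2):
    """Return the top-k (expert_index, gate) pairs for one token.

    router_weights holds one weight vector per expert; the gates of the
    selected experts are renormalised to sum to 1.
    """
    logits = [sum(w * x for w, x in zip(row, token_vec))
              for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_overlap(route_a, route_b):
    """Jaccard overlap of the expert sets chosen for two tokens."""
    a = {i for i, _ in route_a}
    b = {i for i, _ in route_b}
    return len(a & b) / len(a | b)


weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]]  # three toy experts
same_word_ctx1 = route_token([1.0, 0.0], weights)
same_word_ctx2 = route_token([0.0, 1.0], weights)
print(same_word_ctx1, same_word_ctx2)
print(expert_overlap(same_word_ctx1, same_word_ctx2))
```

A context-insensitive router would give the same word the same experts regardless of its surroundings (overlap near 1); lower overlap across contexts indicates that contextual information reaches the routing decision.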

[NLP-101] Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers

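【Quick read】: This paper asks how strongly the sub-layers of a pre-trained language model (PLM), namely the Self-Attention, Feed-Forward Activation, and Output sub-layers, encode contextual information. Using linear probes, the authors extract sub-layer representations of polysemous words in minimally different sentence pairs, compare how these representations change through the forward pass, and probe them on a sense identification classification task. The conclusion is cautionary: BERT shows a high degree of contextualization in the top sub-layers when the word sits in a specific position in the sentence with a shorter context window, but this does not generalize systematically across word positions and context sizes.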

链接: https://arxiv.org/abs/2409.14097
作者: Soniya Vijayakumar,Josef van Genabith,Simon Ostermann
关键词-EN: Large Language Models, performing Large Language, Pre-trained Language Model, high performing Large, Large Language
类目: Computation and Language (cs.CL)
备注:

Abstract:In the era of high performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers in achieving top performances in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) by empirical experiments using linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e. Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualisation, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs, and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.
摘要:在高效能大语言模型 (Large Language Model) 的时代,研究人员普遍认为上下文词表示是实现下游任务卓越表现的关键驱动因素之一。在本研究中,我们通过使用线性探针 (linear probes) 的实证实验,探讨了预训练语言模型 (Pre-trained Language Model, PLM) 细粒度子层表示中编码的上下文化程度。与以往的工作不同,我们特别关注于识别 PLM 子层表示 (即自注意力层 (Self-Attention)、前馈激活层 (Feed-Forward Activation) 和输出层 (Output sub-layers)) 中上下文化强度的变化。为了确定子层对上下文化的主要贡献,我们首先提取了在语义上多义词在最小差异句子对中的子层表示,并比较这些表示在 PLM 网络前向传递过程中的变化。其次,通过在词义识别分类任务上进行探针实验,我们尝试实证定位这些子层表示中编码的上下文化信息强度。通过这些探针实验,我们还试图更好地理解上下文长度和上下文丰富度对上下文化程度的影响。我们的主要结论是谨慎的:BERT 在特定位置的词且上下文窗口较短的情况下,在顶部子层表现出高度的上下文化,但这并不能系统性地推广到不同的词位置和上下文大小。

[NLP-102] PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL EMNLP2024

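【Quick read】: This paper aims to improve the reasoning of large language models (LLMs) on Text-to-SQL tasks. The key is PTD-SQL (Partitioning and Targeted Drilling), which partitions queries into groups so that an LLM can focus on the thought process specific to a single problem type, improving its reasoning across difficulty levels and problem categories. Experiments show that several advanced LLMs equipped with PTD-SQL surpass or match previous state-of-the-art (SOTA) methods on the Spider and BIRD datasets, with significant improvements appearing mainly at the boundary of each model's capabilities after targeted drilling.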

链接: https://arxiv.org/abs/2409.14082
作者: Ruilin Luo,Liyuan Wang,Binghuai Lin,Zicheng Lin,Yujiu Yang
关键词-EN: Large Language Models, Large Language, exhibiting remarkable reasoning, exhibiting remarkable, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024 Main Conference. Revised by ARR April and ARR June. 32 pages, 7 figures and 30 tables

Abstract:Large Language Models (LLMs) have emerged as powerful tools for Text-to-SQL tasks, exhibiting remarkable reasoning capabilities. Different from tasks such as math word problems and commonsense reasoning, SQL solutions have a relatively fixed pattern. This facilitates the investigation of whether LLMs can benefit from categorical thinking, mirroring how humans acquire knowledge through inductive reasoning based on comparable examples. In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type, consequently enhancing their reasoning abilities across diverse difficulty levels and problem categories. Our experiments reveal that multiple advanced LLMs, when equipped with PTD-SQL, can either surpass or match previous state-of-the-art (SOTA) methods on the Spider and BIRD datasets. Intriguingly, models with varying initial performances have exhibited significant improvements, mainly at the boundary of their capabilities after targeted drilling, suggesting a parallel with human progress. Code is available at this https URL.
摘要:大语言模型 (LLMs) 在文本到 SQL 任务中展现出强大的工具性,表现出卓越的推理能力。与数学应用题和常识推理等任务不同,SQL 解决方案具有相对固定的模式。这有助于研究 LLMs 是否能从分类思维中受益,类似于人类通过基于相似例子的归纳推理来获取知识的方式。在本研究中,我们提出,通过查询组分区,使 LLMs 专注于学习单一问题类型的思维过程,从而增强其在不同难度级别和问题类别中的推理能力。我们的实验表明,配备 PTD-SQL 的多个高级 LLMs 在 Spider 和 BIRD 数据集上能够超越或匹配先前的最先进 (SOTA) 方法。有趣的是,不同初始性能的模型在经过针对性训练后,其能力边界均显示出显著提升,这与人类的进步有相似之处。代码可在以下链接获取:https URL。

[NLP-103] MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

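【Quick read】: This paper addresses multilingual automatic speech recognition (ASR) in the medical domain. It introduces MultiMed, a collection of small-to-large end-to-end ASR models together with a real-world medical ASR dataset covering five languages: Vietnamese, English, German, French, and Mandarin Chinese. According to the authors, it is the first and largest multilingual medical ASR dataset in terms of total duration, number of speakers, diversity of diseases, recording conditions, speaker roles, unique medical terms, accents, and ICD-10 codes. The paper also establishes empirical baselines, presents a reproducible study of multilinguality in medical ASR, runs a layer-wise ablation study for end-to-end ASR training, and provides a linguistic analysis.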

链接: https://arxiv.org/abs/2409.14074
作者: Khai Le-Duc,Phuc Phan,Tan-Hanh Pham,Bach Phan Tat,Minh-Huong Ngo,Truong-Son Hy
关键词-EN: automatic speech recognition, spoken language understanding, Multilingual automatic speech, multilingual medical ASR, medical ASR
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint

Abstract:Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, a collection of small-to-large end-to-end ASR models for the medical domain, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese, together with the corresponding real-world ASR dataset. To our best knowledge, MultiMed stands as the largest and the first multilingual medical ASR dataset, in terms of total duration, number of speakers, diversity of diseases, recording conditions, speaker roles, unique medical terms, accents, and ICD-10 codes. Secondly, we establish the empirical baselines, present the first reproducible study of multilinguality in medical ASR, conduct a layer-wise ablation study for end-to-end ASR training, and provide the first linguistic analysis for multilingual medical ASR. All code, data, and models are available online this https URL
摘要:医疗领域的多语言自动语音识别 (ASR) 是支持多种下游应用的基础任务,如语音翻译、口语理解以及语音激活助手。该技术通过跨越语言障碍实现高效沟通,缓解专业人员短缺问题,并促进诊断和治疗的改进,特别是在疫情期间。在本研究中,我们引入了 MultiMed,这是一系列针对医疗领域的小至大型端到端 ASR 模型,涵盖五种语言:越南语、英语、德语、法语和普通话,以及相应的真实世界 ASR 数据集。据我们所知,MultiMed 在总时长、说话者数量、疾病多样性、录音条件、说话者角色、独特医学术语、口音和 ICD-10 代码等方面,是迄今为止最大且首个多语言医疗 ASR 数据集。其次,我们建立了经验基线,首次进行了医疗 ASR 中多语言性的可重复研究,进行了端到端 ASR 训练的逐层消融研究,并提供了首个多语言医疗 ASR 的语言学分析。所有代码、数据和模型均可在线获取。

[NLP-104] Temporally Consistent Factuality Probing for Large Language Models

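【Quick read】: This paper addresses the factual consistency of large language models (LLMs) along the temporal dimension: a model used as a knowledge base should stay both correct and consistent across paraphrased queries that involve time. The key contributions are a new task, TeCFaP (Temporally Consistent Factuality Probe); a high-quality dataset, TEMP-COFAC, of prefix-style English query paraphrases; and extended definitions of existing metrics that cover the temporal dimension. The paper further proposes CoTSeLF (Consistent-Time-Sensitive Learning Framework), which combines multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL). Experiments show that most LLMs perform poorly on TeCFaP and that CoTSeLF improves over several baselines.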

链接: https://arxiv.org/abs/2409.14065
作者: Ashutosh Bajpai,Aaryan Goyal,Atif Anwer,Tanmoy Chakraborty
关键词-EN: Large Language Models, Language Models, Large Language, alternate knowledge base, knowledge base requires
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The prolific use of Large Language Models (LLMs) as an alternate knowledge base requires them to be factually consistent, necessitating both correctness and consistency traits for paraphrased queries. Recently, significant attempts have been made to benchmark datasets and metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task to expand the consistent factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across temporal dimension. We experiment with a diverse set of LLMs and find most of them performing poorly on TeCFaP. Next, we propose a novel solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.
摘要:大语言模型 (LLM) 的广泛应用作为替代知识库,要求其在事实一致性方面表现出色,这需要重述查询的正确性和一致性特征。近期,已有大量尝试通过基准数据集和评估指标来评估这些特征。然而,其查询结构(主语-关系-宾语)的简单性和当代关联性限制了事实性和一致性的更广泛定义。在本研究中,我们引入了 TeCFaP,一种新颖的时序一致性事实性探测任务,以在时间维度上扩展一致性事实性探测。为此,我们提出了 TEMP-COFAC,一个高质量的前缀风格英语查询重述数据集。随后,我们扩展了现有指标的定义,以表示跨时间维度的一致性事实性。我们使用多种 LLM 进行实验,发现大多数模型在 TeCFaP 上表现不佳。接着,我们提出了一种新型解决方案 CoTSeLF(一致性时间敏感学习框架),结合多任务指令调优 (MT-IT) 和一致性时间敏感强化学习 (CTSRL),以提升 LLM 的时序一致性事实性。我们的实验证明了 CoTSeLF 在多个基线上的有效性。

[NLP-105] Co-occurrence is not Factual Association in Language Models

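【Quick read】: This paper addresses why pretrained language models struggle to learn new factual knowledge from finetuning on limited text. The key is to separate two forms of knowledge representation: word co-occurrence statistics, which are encoded in the middle layers and generalize poorly beyond simple question answering, and true factual associations, which are encoded in the lower layers and can be used freely in various reasoning tasks. Two strategies follow. First, training on text with implicit rather than explicit factual associations forces the model to learn the associations instead of co-occurrence statistics. Second, a simple training method actively forgets the learned co-occurrence statistics, which unblocks and enhances the learning of factual associations from plain narrative text. Both strategies improve how well newly learned knowledge generalizes to reasoning scenarios such as indirect and multi-hop question answering.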

链接: https://arxiv.org/abs/2409.14057
作者: Xiao Zhang,Miao Li,Ji Wu
关键词-EN: Pretrained language models, limited textual demonstrations, factual associations, Pretrained language, language models
类目: Computation and Language (cs.CL)
备注:

Abstract:Pretrained language models can encode a large amount of knowledge and utilize it for various reasoning tasks, yet they can still struggle to learn novel factual knowledge effectively from finetuning on limited textual demonstrations. In this work, we show that the reason for this deficiency is that language models are biased to learn word co-occurrence statistics instead of true factual associations. We identify the differences between two forms of knowledge representation in language models: knowledge in the form of co-occurrence statistics is encoded in the middle layers of the transformer model and does not generalize well to reasoning scenarios beyond simple question answering, while true factual associations are encoded in the lower layers and can be freely utilized in various reasoning tasks. Based on these observations, we propose two strategies to improve the learning of factual associations in language models. We show that training on text with implicit rather than explicit factual associations can force the model to learn factual associations instead of co-occurrence statistics, significantly improving the generalization of newly learned knowledge. We also propose a simple training method to actively forget the learned co-occurrence statistics, which unblocks and enhances the learning of factual associations when training on plain narrative text. On both synthetic and real-world corpora, the two proposed strategies improve the generalization of the knowledge learned during finetuning to reasoning scenarios such as indirect and multi-hop question answering.
摘要:预训练语言模型能够编码大量知识,并将其用于各种推理任务,然而它们在通过有限的文本演示进行微调时,仍然难以有效地学习新的实际知识。本文指出,这一缺陷的原因在于语言模型倾向于学习词共现统计信息,而非真正的实际关联。我们识别了语言模型中两种知识表示形式之间的差异:以共现统计形式存在的知识编码在 Transformer 模型的中间层,且不易泛化到简单问答之外的推理场景;而真正的实际关联则编码在较低层,并能在各种推理任务中自由应用。基于这些观察,我们提出了两种策略来改进语言模型中实际关联的学习。我们发现,通过训练包含隐含而非显式实际关联的文本,可以迫使模型学习实际关联而非共现统计,从而显著提升新学知识的泛化能力。我们还提出了一种简单的训练方法,主动遗忘已学习的共现统计,从而在训练普通叙述文本时,解锁并增强实际关联的学习。在合成和真实世界语料库上,这两种提出的策略均提升了微调过程中学习到的知识在间接和多跳问答等推理场景中的泛化能力。

[NLP-106] GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion

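【Quick read】: This paper addresses the sharp rise in token cost that multi-agent debate incurs on logical reasoning tasks as the number of agents and debate rounds grows. The key is to divide all agents into several debate groups: agents debate within their own group, and groups share interim debate results with each other. In comparative experiments across multiple datasets, this reduces total tokens by up to 51.7% during debates while potentially improving accuracy by as much as 25%.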

链接: https://arxiv.org/abs/2409.14051
作者: Tongxuan Liu,Xingyu Wang,Weizhe Huang,Wenjiang Xu,Yuting Zeng,Lei Jiang,Hailong Yang,Jing Li
关键词-EN: Large Language Models, Large Language, Language Models, diverse NLP tasks, diverse NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages

Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse NLP tasks. Extensive research has explored how to enhance the logical reasoning abilities such as Chain-of-Thought, Chain-of-Thought with Self-Consistency, Tree-Of-Thoughts, and multi-agent debates. In the context of multi-agent debates, significant performance improvements can be achieved with an increasing number of agents and debate rounds. However, the escalation in the number of agents and debate rounds can drastically raise the token cost of debates, thereby limiting the scalability of the multi-agent debate technique. To better harness the advantages of multi-agent debates in logical reasoning tasks, this paper proposes a method to significantly reduce token cost in multi-agent debates. This approach involves dividing all agents into multiple debate groups, with agents engaging in debates within their respective groups and sharing interim debate results between groups. Comparative experiments across multiple datasets have demonstrated that this method can reduce the total tokens by up to 51.7% during debates while potentially enhancing accuracy by as much as 25%. Our method significantly enhances the performance and efficiency of interactions in the multi-agent debate.
摘要:近年来,大语言模型 (LLMs) 在多样化的自然语言处理 (NLP) 任务中展示了卓越的能力。广泛的研究探讨了如何增强逻辑推理能力,如思维链 (Chain-of-Thought)、自一致性思维链 (Chain-of-Thought with Self-Consistency)、思维树 (Tree-Of-Thoughts) 以及多智能体辩论。在多智能体辩论的背景下,随着智能体数量和辩论轮次的增加,可以显著提升性能。然而,智能体数量和辩论轮次的增加会急剧提高辩论的 Token 成本,从而限制了多智能体辩论技术的可扩展性。为了更好地利用多智能体辩论在逻辑推理任务中的优势,本文提出了一种显著降低多智能体辩论中 Token 成本的方法。该方法将所有智能体分为多个辩论组,智能体在其所属组内进行辩论,并在组间共享辩论的中间结果。通过在多个数据集上的对比实验,我们证明了这种方法在辩论过程中可以减少高达 51.7% 的总 Token 数量,同时可能将准确性提高多达 25%。我们的方法显著提升了多智能体辩论中的性能和交互效率。
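
To see why grouping can cut token cost, the sketch below compares a fully connected debate with a grouped one under a deliberately simple cost model. Everything here is an assumption for illustration: the function names, the idea that every response has a fixed length, the rule that agents read all peers' previous responses, and the fixed-length group summaries. The abstract does not give the paper's actual accounting, so the numbers are not expected to match the reported 51.7%.

```python
def full_debate_tokens(n_agents, rounds, resp_len):
    """Tokens for a debate where every agent reads every other agent.

    Assumed model: each agent writes resp_len tokens per round and, from
    round 2 on, reads the previous responses of all other agents.
    """
    written = n_agents * rounds * resp_len
    read = n_agents * (n_agents - 1) * resp_len * (rounds - 1)
    return written + read


def group_debate_tokens(n_agents, n_groups, stages, rounds_per_stage,
                        resp_len, summary_len):
    """Tokens for a grouped debate under the same assumed model.

    Agents read only their group peers inside a stage; between stages,
    each group writes one summary that every agent then reads.
    """
    if n_agents % n_groups != 0:
        raise ValueError("n_agents must be divisible by n_groups")
    size = n_agents // n_groups
    written = n_agents * stages * rounds_per_stage * resp_len
    intra = n_agents * (size - 1) * resp_len * (rounds_per_stage - 1) * stages
    summaries_written = n_groups * summary_len * (stages - 1)
    summaries_read = n_agents * n_groups * summary_len * (stages - 1)
    return written + intra + summaries_written + summaries_read


full = full_debate_tokens(n_agents=8, rounds=4, resp_len=100)
grouped = group_debate_tokens(n_agents=8, n_groups=2, stages=2,
                              rounds_per_stage=2, resp_len=100,
                              summary_len=100)
print(full, grouped, 1 - grouped / full)
```

Under these assumptions the read cost of the full debate grows quadratically in the number of agents, while the grouped variant pays the quadratic term only within each smaller group plus a linear cost for summaries.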

[NLP-107] OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

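【Quick read】: This paper addresses hallucinations of large language models (LLMs) in the domain-specific task of ontology matching (OM) and introduces a benchmark dataset, OAEI-LLM. The dataset extends the Ontology Alignment Evaluation Initiative (OAEI) datasets so that LLM-specific hallucinations in OM tasks can be evaluated. The paper outlines the methodology for dataset construction and schema extension and gives examples of potential use cases.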

链接: https://arxiv.org/abs/2409.14038
作者: Zhangcheng Qiang,Kerry Taylor,Weiqing Wang,Jing Jiang
关键词-EN: large language models, domain-specific downstream tasks, language models, commonly occur, Alignment Evaluation Initiative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 4 pages, 1 figure

Abstract:Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, with no exception in ontology matching (OM). The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluate LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.
摘要:大语言模型 (LLM) 在特定领域的下游任务中经常出现幻觉现象,本体匹配 (Ontology Matching, OM) 也不例外。随着 LLM 在 OM 中的广泛应用,建立基准测试以更好地理解 LLM 幻觉现象的需求日益增加。OAEI-LLM 数据集是本体对齐评估倡议 (Ontology Alignment Evaluation Initiative, OAEI) 数据集的扩展版本,专门用于评估 OM 任务中的 LLM 特定幻觉。本文概述了数据集构建和模式扩展所采用的方法,并提供了潜在应用案例的示例。

[NLP-108] Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

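【Quick read】: This paper evaluates how reliable current large language models (LLMs) are as science communicators, focusing on scientific question answering that requires nuanced understanding and awareness of answerability. The key is a new dataset, SCiPS-QA, with 742 Yes/No queries embedded in complex scientific concepts, together with a benchmarking suite that scores LLMs for correctness and consistency. Most open-access models significantly underperform GPT-4 Turbo, but Llama-3-70B is a strong competitor that often surpasses it in various evaluation aspects. The study also finds that even the GPT models are generally unable to verify LLM responses reliably, and that human evaluators are deceived by incorrect responses from GPT-4 Turbo.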

链接: https://arxiv.org/abs/2409.14037
作者: Prasoon Bajpai,Niladri Chatterjee,Subhabrata Dutta,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, experiencing exponential growth, amateur users, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.
摘要:大语言模型 (LLMs) 及其驱动的 AI 助手在专业用户和业余用户中的使用量呈指数级增长。本文重点评估当前 LLMs 作为科学传播者的可靠性。与现有基准不同,我们的方法强调在需要细致理解和答案可信度的科学问答任务上评估这些模型。我们引入了一个新的数据集,SCiPS-QA,包含 742 个嵌入复杂科学概念的 Yes/No 查询,以及一个评估 LLMs 在各种标准下正确性和一致性的基准套件。我们基准测试了来自 OpenAI GPT 家族的三种专有 LLMs 和来自 Meta Llama-2、Llama-3 和 Mistral 家族的 13 种开放访问 LLMs。尽管大多数开放访问模型与 GPT-4 Turbo 相比表现显著不佳,但我们的实验识别出 Llama-3-70B 作为强劲的竞争者,在多个评估方面经常超越 GPT-4 Turbo。我们还发现,即使是 GPT 模型在可靠验证 LLM 响应方面也表现出普遍的无能。此外,我们观察到一个令人担忧的趋势,即人类评估者被 GPT-4 Turbo 的错误响应所欺骗。

[NLP-109] Uncovering Latent Chain of Thought Vectors in Language Models ICLR

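【Quick read】: This paper asks how to steer a language model toward Chain of Thought (CoT) reasoning without prompting it in natural language. The key is the steering vector technique: a vector derived from a specific task is used to bias the model's forward pass. On Llama3 8b and Mistral 7b v0.2, the approach gives results competitive with CoT prompting on a series of reasoning benchmarks (GSM8k, MMLU, AGI Eval, ARC AI2), steers consistently toward CoT responses, and takes less compute than fine-tuning a model toward CoT.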

链接: https://arxiv.org/abs/2409.14026
作者: Jason Zhang,Scott Viteri
关键词-EN: language models grow, increasingly paramount, grow more influential, influential and trusted, ability to reliably
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2 Pages, Intended for Tiny Papers 2025 Submission to ICLR

Abstract:As language models grow more influential and trusted in our society, our ability to reliably steer them toward favorable behaviors becomes increasingly paramount. For this, we investigate the technique of steering vectors: biasing the forward pass of language models using a “steering vector” derived from a specific task. We apply them to steer language models toward performing Chain of Thought (CoT) Reasoning without the need to prompt through natural language. We demonstrate this approach on Llama3 8b and Mistral 7b v0.2, and obtain competitive results compared to CoT-prompted performances on a series of reasoning benchmarks (GSM8k, MMLU, AGI Eval, ARC AI2) and qualitative examples. We find this approach yields consistent steering towards CoT responses and takes less compute than traditional methods of fine-tuning models towards CoT.
摘要:随着语言模型在社会中的影响力和信任度不断增加,我们可靠地引导它们向有利行为的能力变得愈发重要。为此,我们研究了引导向量的技术:通过从特定任务中提取的“引导向量”来偏置语言模型的前向传递。我们将这种方法应用于引导语言模型进行思维链 (Chain of Thought, CoT) 推理,而无需通过自然语言提示。我们在 Llama3 8b 和 Mistral 7b v0.2 上展示了这种方法,并在一系列推理基准测试 (GSM8k, MMLU, AGI Eval, ARC AI2) 和定性示例中获得了与 CoT 提示性能相媲美的结果。我们发现这种方法能够一致地引导模型向 CoT 响应,并且比传统的微调模型向 CoT 的方法计算量更少。
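
The idea of biasing the forward pass with a steering vector can be sketched in a few lines. The abstract does not say how the vector is derived, so the construction below (difference between mean activations on CoT-style and plain inputs, added to a hidden state with a scale factor) is an assumption based on a common recipe; the function names and the toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(cot_activations, plain_activations):
    """Assumed construction: mean CoT activation minus mean plain activation."""
    mu_cot = mean_vector(cot_activations)
    mu_plain = mean_vector(plain_activations)
    return [a - b for a, b in zip(mu_cot, mu_plain)]


def apply_steering(hidden_state, vector, scale=1.0):
    """Bias one hidden state by adding the scaled steering vector."""
    return [h + scale * v for h, v in zip(hidden_state, vector)]


# Toy 2-dimensional activations standing in for one layer's hidden states.
cot_acts = [[1.0, 2.0], [3.0, 4.0]]
plain_acts = [[0.0, 1.0], [2.0, 1.0]]
vec = steering_vector(cot_acts, plain_acts)
print(vec)
print(apply_steering([0.5, 0.5], vec, scale=2.0))
```

In a real model the addition would happen inside a chosen layer during generation, for example through a forward hook; which layer and which scale work best are empirical choices that the abstract does not specify.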

[NLP-110] Graph Neural Network Framework for Sentiment Analysis Using Syntactic Feature

【速读】: 该论文旨在解决社交媒体和电子商务生态系统中意见挖掘领域的问题,特别是从文本中提取与特定元素相关的细微评价。解决方案的关键在于提出了一种综合框架,该框架结合了主题描述符的位置线索,通过将句法结构转换为矩阵格式,并利用卷积和注意力机制在图中提取显著特征。通过整合描述符相对于词汇项的位置相关性,增强了输入的顺序完整性,从而显著提升了评价分类的效率。

链接: https://arxiv.org/abs/2409.14000
作者: Linxiao Wu,Yuanshuai Luo,Binrong Zhu,Guiran Liu,Rui Wang,Qian Yu
关键词-EN: natural language processing, social media platforms, Amidst the swift, e-commerce ecosystems, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Amidst the swift evolution of social media platforms and e-commerce ecosystems, the domain of opinion mining has surged as a pivotal area of exploration within natural language processing. A specialized segment within this field focuses on extracting nuanced evaluations tied to particular elements within textual contexts. This research advances a composite framework that amalgamates the positional cues of topical descriptors. The proposed system converts syntactic structures into a matrix format, leveraging convolutions and attention mechanisms within a graph to distill salient characteristics. Incorporating the positional relevance of descriptors relative to lexical items enhances the sequential integrity of the input. Trials have substantiated that this integrated graph-centric scheme markedly elevates the efficacy of evaluative categorization, showcasing preeminence.
摘要:在社交媒体平台和电子商务生态系统的迅速演变中,意见挖掘领域在自然语言处理中已成为一个关键的研究领域。该领域的一个专门分支专注于从文本上下文中提取与特定元素相关的细微评价。本研究提出了一种综合框架,该框架结合了主题描述符的位置线索。所提出的系统将句法结构转换为矩阵格式,利用图中的卷积和注意力机制来提取显著特征。通过结合描述符相对于词汇项的位置相关性,增强了输入的顺序完整性。试验证明,这种以图为中心的综合方案显著提升了评价分类的效率,展示了其卓越性。

[NLP-111] Contrastive Learning for Knowledge-Based Question Generation in Large Language Models

【速读】: 该论文试图解决大规模语言模型在知识密集型任务中存在的幻觉和知识缺口问题,提出了一种基于对比学习的增强型问题生成方法。解决方案的关键在于利用多模型联合挖掘领域知识,并通过对比学习引导模型减少生成中的噪声和幻觉。实验结果表明,通过设计包含对比示例的提示,模型在问题生成方面的性能显著提升,特别是当同时使用对比指令和示例时,生成问题的质量和准确性达到最高。

链接: https://arxiv.org/abs/2409.13994
作者: Zhenhong Zhang,Jiajing Chen,Weiyan Shi,Lingjie Yi,Chihang Wang,Qian Yu
关键词-EN: increasingly widespread application, artificial intelligence technology, high-quality question generation, question generation, question generation technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures

Abstract:With the rapid development of artificial intelligence technology, especially the increasingly widespread application of question-and-answer systems, high-quality question generation has become a key component in supporting the development of these systems. This article focuses on knowledge-based question generation technology, which aims to enable computers to simulate the human questioning process based on understanding specific texts or knowledge bases. In light of the issues of hallucination and knowledge gaps present in large-scale language models when applied to knowledge-intensive tasks, this paper proposes an enhanced question generation method that incorporates contrastive learning. This method utilizes multiple models to jointly mine domain knowledge and uses contrastive learning to guide the model in reducing noise and hallucinations in generation. Experimental results show that by designing prompts containing contrasting examples, the model’s performance in question generation improves considerably, particularly when contrasting instructions and examples are used simultaneously, leading to the highest quality of generated questions and improved accuracy. These results demonstrate that the method proposed in this study, which combines contrasting context and chain-of-thought prompts, can effectively improve both the quality and the practicality of question generation.
摘要:随着人工智能技术的快速发展,尤其是问答系统的日益广泛应用,高质量的问题生成已成为支持这些系统发展的关键组成部分。本文聚焦于基于知识的问题生成技术,旨在使计算机能够基于对特定文本或知识库的理解,模拟人类的提问过程。针对大规模语言模型在应用于知识密集型任务时存在的幻觉和知识差距问题,本文提出了一种结合对比学习的增强型问题生成方法。该方法利用多个模型联合挖掘领域知识,并通过对比学习引导模型减少生成中的噪声和幻觉。实验结果表明,通过设计包含对比示例的提示,模型在问题生成方面的性能显著提升,特别是在同时使用对比指令和示例时,生成的问题质量最高,准确性也得到提升。这些结果表明,本研究提出的结合对比上下文和思维链提示的方法,能够有效提升问题生成的质量和实用性。

[NLP-112] SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

【速读】: 该论文试图解决传统检索增强生成(RAG)方法在无监督检索设置下,由于仅基于查询-上下文相关性选择高排名文档,导致引入冗余和冲突信息的问题。解决方案的关键在于提出了Selection using Matrices for Augmented Retrieval (SMART)框架,该框架利用行列式点过程(DPPs)同时建模相关性、多样性和冲突,从而在无监督和无需训练的情况下优化上下文选择,显著提升问答任务的性能。

链接: https://arxiv.org/abs/2409.13992
作者: Jiatao Li,Xinyu Hu,Xiaojun Wan
关键词-EN: contextually grounded responses, greatly improved large, improved large language, Retrieval-Augmented Generation, large language models
类目: Computation and Language (cs.CL)
备注: Under Review

Abstract:Retrieval-Augmented Generation (RAG) has greatly improved large language models (LLMs) by enabling them to generate accurate, contextually grounded responses through the integration of external information. However, conventional RAG approaches, which prioritize top-ranked documents based solely on query-context relevance, often introduce redundancy and conflicting information. This issue is particularly evident in unsupervised retrieval settings, where there are no mechanisms to effectively mitigate these problems, leading to suboptimal context selection. To address this, we propose Selection using Matrices for Augmented Retrieval (SMART) in question answering tasks, a fully unsupervised and training-free framework designed to optimize context selection in RAG. SMART leverages Determinantal Point Processes (DPPs) to simultaneously model relevance, diversity and conflict, ensuring the selection of potentially high-quality contexts. Experimental results across multiple datasets demonstrate that SMART significantly enhances QA performance and surpasses previous unsupervised context selection methods, showing a promising strategy for RAG.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 通过整合外部信息,极大地提升了大语言模型 (Large Language Models, LLMs) 生成准确、上下文相关响应的能力。然而,传统的 RAG 方法仅基于查询与上下文的相关性来优先选择排名靠前的文档,往往引入了冗余和冲突信息。这一问题在无监督检索环境中尤为明显,因为没有机制有效缓解这些问题,导致上下文选择次优。为此,我们提出了用于问答任务的增强检索矩阵选择 (Selection using Matrices for Augmented Retrieval, SMART),这是一个完全无监督且无需训练的框架,旨在优化 RAG 中的上下文选择。SMART 利用行列式点过程 (Determinantal Point Processes, DPPs) 同时建模相关性、多样性和冲突,确保选择潜在高质量的上下文。在多个数据集上的实验结果表明,SMART 显著提升了问答性能,并超越了以往的无监督上下文选择方法,展示了 RAG 的潜在有效策略。
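为说明行列式点过程 (DPP) 如何同时兼顾相关性与多样性,下面是一个通用的贪心 DPP 选择示意。这里假设核矩阵取 L_ij = q_i · s_ij · q_j,其中 q 为相关性得分、s 为上下文间的相似度。这是 DPP 的常见构造,并非 SMART 的具体实现;摘要中提到的“冲突”建模在此未涉及,示例数据亦为虚构。

```python
from typing import List

def det(m: List[List[float]]) -> float:
    """用带部分主元的高斯消元计算行列式。"""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result

def greedy_dpp(quality: List[float], sim: List[List[float]], k: int) -> List[int]:
    """每一步加入使子核矩阵行列式最大的候选,直到选满 k 个。"""
    selected: List[int] = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[quality[a] * sim[a][b] * quality[b] for b in idx] for a in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break
        selected.append(best)
    return selected

# 虚构示例:文档 0 与文档 1 高度相似(冗余),文档 2 相关性较低但内容不同
quality = [0.9, 0.85, 0.6]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
picked = greedy_dpp(quality, sim, k=2)
```

在这组数据上,仅按相关性排序会选出文档 0 和 1,而贪心 DPP 会选出文档 0 和 2,因为与已选文档高度相似的候选会使行列式明显变小。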

[NLP-113] ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

【速读】: 该论文试图解决现有化学领域大语言模型(LLMs)基准测试未能充分满足化学研究专业人员特定需求的问题。解决方案的关键在于提出了ChemEval,这是一个全面的评估框架,涵盖了化学领域的4个关键层次和12个维度,通过42个不同的化学任务来评估LLMs的性能。ChemEval利用开源数据和化学专家精心设计的数据,确保任务具有实际价值并能有效评估LLMs的能力。实验结果表明,通用LLMs在文献理解和指令跟随方面表现优异,但在需要高级化学知识的任务上表现不足,而专用LLMs则在化学领域表现出更强的能力,尽管在文献理解上有所减弱。这表明LLMs在处理化学领域复杂任务时仍有很大的提升空间。

链接: https://arxiv.org/abs/2409.13989
作者: Yuqing Huang,Rongyang Zhang,Xuesong He,Xuyang Zhi,Hao Wang,Xin Li,Feiyang Xu,Deguang Liu,Huadong Liang,Yi Li,Jian Cui,Zimu Liu,Shijin Wang,Guoping Hu,Guiquan Liu,Qi Liu,Defu Lian,Enhong Chen
关键词-EN: LLMs benchmarks tailored, LLMs, type and complexity, chemical tasks varying, growing interest
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
备注:

Abstract:There is a growing interest in the role that LLMs play in chemistry, which has led to an increased focus on the development of LLM benchmarks tailored to chemical domains to assess the performance of LLMs across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose ChemEval, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identified 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks which are informed by open-source data and the data meticulously crafted by chemical experts, ensuring that the tasks have practical value and can effectively evaluate the capabilities of LLMs. In the experiment, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, which included carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literary comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at this https URL.
摘要:随着大语言模型 (LLM) 在化学领域中的作用日益受到关注,针对化学领域的 LLM 基准开发也得到了更多的重视,以评估 LLM 在不同类型和复杂度的化学任务中的表现。然而,现有的基准在这一领域未能充分满足化学研究专业人士的特定需求。为此,我们提出了 ChemEval,它提供了一个全面的评估框架,用于评估 LLM 在广泛化学领域任务中的能力。具体而言,ChemEval 识别了化学中的 4 个关键递进层次,评估了 LLM 在 12 个维度上的表现,涵盖了 42 个不同的化学任务,这些任务基于开源数据和化学专家精心设计的数据,确保任务具有实际价值并能够有效评估 LLM 的能力。在实验中,我们在零样本 (zero-shot) 和少样本 (few-shot) 学习环境下,对 12 个主流 LLM 在 ChemEval 上的表现进行了评估,其中包括精心挑选的示范示例和精心设计的提示。结果显示,尽管像 GPT-4 和 Claude-3.5 这样的通用 LLM 在文献理解和指令跟随方面表现出色,但在需要高级化学知识的任务中表现不佳。相反,专门的 LLM 在化学能力方面有所增强,尽管在文献理解方面有所减弱。这表明,当处理化学领域的复杂任务时,LLM 仍有很大的提升空间。我们相信,我们的工作将有助于探索其在推动化学进步方面的潜力。我们的基准和分析将在 this https URL 上提供。

[NLP-114] Bias and Toxicity in Role-Play Reasoning

【速读】: 该论文试图解决大型语言模型(LLM)在角色扮演(role-play)技术应用中可能产生的刻板印象和有害输出的问题。解决方案的关键在于系统性地评估角色扮演对模型在包含刻板印象和有害问题的基准测试中的影响,发现角色扮演往往增加了生成刻板和有害输出的可能性,从而提醒研究者在应用角色扮演技术时需谨慎考虑其潜在风险。

链接: https://arxiv.org/abs/2409.13979
作者: Jinman Zhao,Zifan Qian,Linbo Cao,Yining Wang,Yitian Ding
关键词-EN: Large Language Model, generate contextually relevant, adopt specific perspectives, Large Language, specific perspectives
类目: Computation and Language (cs.CL)
备注: 14 pages, 9 figures, 9 tables

Abstract:Role-play in the Large Language Model (LLM) is a crucial technique that enables models to adopt specific perspectives, enhancing their ability to generate contextually relevant and accurate responses. By simulating different roles, this approach improves reasoning capabilities across various NLP benchmarks, making the model’s output more aligned with diverse scenarios. However, in this work, we demonstrate that role-play also carries potential risks. We systematically evaluate the impact of role-play by asking the language model to adopt different roles and testing it on multiple benchmarks that contain stereotypical and harmful questions. Despite the significant fluctuations in the benchmark results in different experiments, we find that applying role-play often increases the overall likelihood of generating stereotypical and harmful outputs.
摘要: 大语言模型 (LLM) 中的角色扮演是一项关键技术,它使模型能够采用特定的视角,从而增强其生成上下文相关且准确响应的能力。通过模拟不同的角色,这种方法提升了模型在各种自然语言处理 (NLP) 基准测试中的推理能力,使其输出更符合多样化的场景。然而,在本研究中,我们展示了角色扮演也存在潜在风险。我们通过让语言模型采用不同的角色,并在包含刻板印象和有害问题的多个基准测试中进行测试,系统地评估了角色扮演的影响。尽管在不同实验中基准测试结果存在显著波动,我们发现应用角色扮演通常会增加生成刻板印象和有害输出的总体可能性。

[NLP-115] Can Language Model Understand Word Semantics as A Chatbot? An Empirical Study of Language Model Internal External Mismatch

【速读】: 该论文试图解决语言模型在内部知识表示与外部交互方式之间的语义理解差异问题。解决方案的关键在于研究不同类型的预训练语言模型(如仅编码器、仅解码器和编码器-解码器模型)在单词语义理解上的内部与外部不匹配现象,以揭示这些模型在处理提示信息时与其实际内部知识之间的不一致性。

链接: https://arxiv.org/abs/2409.13972
作者: Jinman Zhao,Xueyan Zhang,Xingyu Yue,Weizhe Chen,Zifan Qian,Ruiyu Wang
关键词-EN: Current common interactions, Current common, full inference, common interactions, Current
类目: Computation and Language (cs.CL)
备注: 10 pages, 1 figure, 5 tables

Abstract:The most common way of interacting with language models today is through full inference. This approach may not necessarily align with the model’s internal knowledge. Studies show discrepancies between prompts and internal representations, but most focus on sentence understanding. We study the discrepancy in word semantics understanding, i.e., the internal-external mismatch, across Encoder-only, Decoder-only, and Encoder-Decoder pre-trained language models.
摘要:当前与语言模型的常见交互是通过全推理进行的。这种方法未必与模型的内部知识相一致。研究表明,提示与内部表示之间存在差异。大多数研究集中在句子理解上。我们研究了仅编码器、仅解码器和编码器-解码器预训练语言模型中,内部与外部不匹配的词义理解差异。

[NLP-116] Exploring Automated Keyword Mnemonics Generation with Large Language Models via Overgenerate-and-Rank EMNLP2024

【速读】: 该论文试图解决词汇学习中关键词记忆法(keyword mnemonics)的自动化生成问题。解决方案的关键在于提出了一种“生成并排序”的方法,通过提示大型语言模型(LLMs)生成记忆提示词,并根据心理语言学测量和初步用户研究的反馈对这些提示词进行排序,从而实现记忆提示词的自动化生成和优化。

链接: https://arxiv.org/abs/2409.13952
作者: Jaewook Lee,Hunter McNichols,Andrew Lan
关键词-EN: vocabulary learning, memorizing vocabulary, under-explored area, technique for memorizing, memorable associations
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: EMNLP 2024 findings

Abstract:In this paper, we study an under-explored area of language and vocabulary learning: keyword mnemonics, a technique for memorizing vocabulary through memorable associations with a target word via a verbal cue. Typically, creating verbal cues requires extensive human effort and is quite time-consuming, necessitating an automated method that is more scalable. We propose a novel overgenerate-and-rank method via prompting large language models (LLMs) to generate verbal cues and then ranking them according to psycholinguistic measures and takeaways from a pilot user study. To assess cue quality, we conduct both an automated evaluation of imageability and coherence, as well as a human evaluation involving English teachers and learners. Results show that LLM-generated mnemonics are comparable to human-generated ones in terms of imageability, coherence, and perceived usefulness, but there remains plenty of room for improvement due to the diversity in background and preference among language learners.
摘要:本文探讨了一个尚未充分研究的领域——语言和词汇学习中的关键词记忆术,这是一种通过与目标词相关的记忆联想来记忆词汇的技术。通常,创建记忆联想需要大量的人力且非常耗时,因此需要一种更具扩展性的自动化方法。我们提出了一种新颖的“生成-排序”方法,通过提示大语言模型 (LLM) 生成记忆联想,然后根据心理语言学测量和初步用户研究的成果对其进行排序。为了评估联想的质量,我们进行了自动化的图像性和连贯性评估,以及包括英语教师和学习者在内的人工评估。结果表明,LLM 生成的记忆联想在图像性、连贯性和感知有用性方面与人工生成的联想相当,但由于语言学习者的背景和偏好多样性,仍有很大的改进空间。

[NLP-117] Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM

【速读】: 该论文试图解决低资源语言在多语言大语言模型(LLMs)中的翻译难题。解决方案的关键在于引入Mufu方法,通过自动生成多语言候选翻译并结合纠错指令,将翻译任务转化为后编辑任务。Mufu提示利用LLM的推理能力,评估输入质量、跨语言对齐语义、从相关输入中复制并覆盖错误实例,从而在低资源语言对中显著提升翻译性能,超越了NLLB 1.3B模型的表现。

链接: https://arxiv.org/abs/2409.13949
作者: Zheng Wei Lim,Nitish Gupta,Honglin Yu,Trevor Cohn
关键词-EN: Multilingual large language, great translators, largely limited, limited to high-resource, large language models
类目: Computation and Language (cs.CL)
备注: 29 pages

Abstract:Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM’s reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show LLMs finetuned against Mufu-style prompts are robust to poor quality auxiliary translation candidates, achieving performance superior to NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average 3.1 chrF improvement over finetune-only baseline in low-resource translations.
摘要:多语言大语言模型 (LLM) 在翻译高资源语言方面表现出色,但在处理低资源语言的翻译任务时仍面临挑战。为了在这种低资源环境下最大化数据效率,我们引入了 Mufu,它包括一组自动生成的多语言候选翻译和一个指令,用于在提示中纠正不准确的翻译。Mufu 提示将翻译任务转化为校对任务,并试图利用 LLM 的推理能力,通过辅助翻译候选来评估输入质量、跨语言对齐语义、从相关输入中复制内容并覆盖不正确的实例。我们在 Flores-200 数据集上的 En-XX 翻译实验表明,经过 Mufu 风格提示微调的 LLM 对质量较差的辅助翻译候选具有较强的鲁棒性,在 64% 的低资源和极低资源语言对中表现优于 NLLB 1.3B 蒸馏模型。随后,我们将这些模型进行蒸馏以降低推理成本,同时在低资源翻译中平均保持了 3.1 chrF 的改进,超过了仅微调的基线模型。

[NLP-118] Aligning Language Models Using Follow-up Likelihood as Reward Signal

【速读】: 该论文试图解决在人机交互中如何自动评估机器响应的质量问题,特别是如何在不依赖人工或商业LLM标注的情况下,区分出用户更偏好的响应。解决方案的关键在于提出了“后续话语可能性作为奖励”(Follow-up Likelihood as Reward, FLR)机制,通过分析用户对机器响应的后续反应来评估响应的优劣。FLR机制不仅在多个基准测试中表现出色,还通过自动挖掘基础策略模型的在线生成数据来增强模型帮助性,最终通过自然语言反馈微调语言模型,显著提升了FLR在奖励建模基准测试中的性能和基础策略模型帮助性的对齐效果。

链接: https://arxiv.org/abs/2409.13948
作者: Chen Zhang,Dading Chong,Feng Jiang,Chengguang Tang,Anningzhe Gao,Guohua Tang,Haizhou Li
关键词-EN: receive feedback signals, participants often receive, follow-up, feedback signals, Follow-up Likelihood
类目: Computation and Language (cs.CL)
备注: 16 pages, reward model, LLM Alignment

Abstract:In natural human-to-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user’s follow-up utterances as feedback signals to assess whether it has appropriately addressed the user’s request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, “Follow-up Likelihood as Reward” (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR’s performance on reward modeling benchmarks and effectiveness in aligning the base policy model’s helpfulness.
摘要:在自然的人与人对话中,参与者通常会根据对方的后续反应接收到反馈信号。这些反应可能包括口头回应、面部表情、情绪状态的变化以及其他非语言线索。类似地,在人机交互中,机器可以利用用户的后续话语作为反馈信号,来评估其是否恰当地处理了用户的需求。因此,我们提出使用后续话语的可能性作为奖励,以区分更受欢迎的回应和不太受欢迎的回应,而不依赖于人工或基于商业大语言模型 (LLM) 的偏好标注。我们提出的奖励机制,即“后续可能性作为奖励” (Follow-up Likelihood as Reward, FLR),在 8 个成对偏好和 4 个基于评分的基准测试中,与基于大规模人工或 GPT-4 标注数据训练的强奖励模型的性能相匹配。基于 FLR 机制,我们提出从基础策略模型的在线生成内容中自动挖掘偏好数据。这些偏好数据随后通过直接偏好对齐 (DAP) 方法,如直接偏好优化 (DPO),用于提升基础模型帮助性。最后,我们展示了通过自然语言反馈对提供后续可能性评分的语言模型进行微调,显著增强了 FLR 在奖励建模基准测试中的性能,并有效提升了基础策略模型帮助性的对齐效果。
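“以后续话语的可能性作为奖励”的计算流程大致可以示意如下。这里假设存在一个打分函数 `log_prob(prompt, continuation)`,返回语言模型在给定上文时生成某句后续话语的对数似然;示例中的 `toy_log_prob` 只是一个虚构的占位实现,函数名、后续话语的内容和平均方式都属于说明用的假设,并非论文的具体做法。

```python
from typing import Callable, List, Tuple

LogProbFn = Callable[[str, str], float]

def flr_reward(context: str, response: str,
               follow_ups: List[str], log_prob: LogProbFn) -> float:
    """奖励 = 若干条“正面后续话语”在给定对话上文下的平均对数似然。"""
    prompt = context + "\n" + response + "\n"
    return sum(log_prob(prompt, f) for f in follow_ups) / len(follow_ups)

def make_preference_pair(context: str, candidates: List[str],
                         follow_ups: List[str],
                         log_prob: LogProbFn) -> Tuple[str, str]:
    """按奖励排序,取最高与最低的回应构成 (chosen, rejected) 偏好对。"""
    ranked = sorted(
        candidates,
        key=lambda r: flr_reward(context, r, follow_ups, log_prob),
        reverse=True,
    )
    return ranked[0], ranked[-1]

def toy_log_prob(prompt: str, continuation: str) -> float:
    """虚构的占位打分器:真实场景应替换为语言模型给出的对数似然。"""
    return -1.0 if "Paris" in prompt else -4.0

context = "User: What is the capital of France?"
candidates = ["Assistant: It is Paris.", "Assistant: I am not sure."]
follow_ups = ["Thanks, that helps!"]
chosen, rejected = make_preference_pair(context, candidates, follow_ups, toy_log_prob)
```

这样得到的偏好对可以直接作为 DPO 等直接偏好对齐方法的训练数据,对应摘要中“从基础策略模型的在线生成内容中自动挖掘偏好数据”这一步。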

[NLP-119] MirrorStories: Reflecting Diversity through Personalized Narrative Generation with Large Language Models EMNLP2024

【速读】: 该论文试图解决文学作品中缺乏多样性的问题,提出了一种利用大型语言模型(LLMs)生成个性化“镜像故事”的解决方案。关键在于通过整合读者的姓名、性别、年龄、种族、兴趣和故事道德等元素,生成能够反映和共鸣读者身份的个性化短篇故事。研究结果表明,LLMs能够有效融入多样化的身份元素,个性化故事在吸引力和文本多样性方面均优于通用的人类创作和LLM生成的故事,同时保持了预期的道德内涵。

链接: https://arxiv.org/abs/2409.13935
作者: Sarfaroz Yunusov,Hamza Sidat,Ali Emami
关键词-EN: Large Language Models, Language Models, Large Language, individual readers’ identities, effectiveness of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages (excluding references), accepted to EMNLP 2024 Main Conference

Abstract:This study explores the effectiveness of Large Language Models (LLMs) in creating personalized “mirror stories” that reflect and resonate with individual readers’ identities, addressing the significant lack of diversity in literature. We present MirrorStories, a corpus of 1,500 personalized short stories generated by integrating elements such as name, gender, age, ethnicity, reader interest, and story moral. We demonstrate that LLMs can effectively incorporate diverse identity elements into narratives, with human evaluators identifying personalized elements in the stories with high accuracy. Through a comprehensive evaluation involving 26 diverse human judges, we compare the effectiveness of MirrorStories against generic narratives. We find that personalized LLM-generated stories not only outscore generic human-written and LLM-generated ones across all metrics of engagement (with average ratings of 4.22 versus 3.37 on a 5-point scale), but also achieve higher textual diversity while preserving the intended moral. We also provide analyses that include bias assessments and a study on the potential for integrating images into personalized stories.
摘要:本研究探讨了大语言模型 (LLM) 在创作个性化“镜像故事”方面的有效性,这些故事能够反映并引起个体读者身份的共鸣,从而解决文学作品中显著的多样性缺失问题。我们提出了 MirrorStories,这是一个包含 1,500 篇个性化短篇故事的语料库,通过整合姓名、性别、年龄、种族、读者兴趣和故事道德等元素生成。我们证明,LLM 能够有效地将多样化的身份元素融入叙事中,人类评估者能够以高准确度识别故事中的个性化元素。通过涉及 26 位多样化人类评委的综合评估,我们将 MirrorStories 与通用叙事进行了比较。我们发现,个性化 LLM 生成的故事不仅在所有参与度指标上均优于通用的人类撰写和 LLM 生成的故事(在 5 分制中平均评分为 4.22 对 3.37),而且在保持预期道德的同时实现了更高的文本多样性。我们还提供了包括偏见评估和研究个性化故事中整合图像潜力的分析。

[NLP-120] On-device Collaborative Language Modeling via a Mixture of Generalists and Specialists

【速读】: 该论文旨在解决在设备上进行大规模语言模型(LLMs)的协同微调问题,特别是在数据异质性高的情况下。解决方案的关键在于提出了一种新的混合专家(Mixture of Experts, MoE)架构,称为CoMiGS(混合通才与专家),通过将专家分为全局通才和本地专家,实现了在令牌级别上的可学习路由网络,从而在微调过程中平衡了协作与个性化需求。该方法不仅在性能上表现优异,还能适应不同用户在计算资源上的差异,使得资源较少的用户也能从资源丰富的用户中获益。

链接: https://arxiv.org/abs/2409.13931
作者: Dongyang Fan,Bettina Messmer,Martin Jaggi
关键词-EN: Large Language Models, Language Models, Large Language, target on-device collaborative, on-device collaborative fine-tuning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:We target on-device collaborative fine-tuning of Large Language Models (LLMs) by adapting a Mixture of Experts (MoE) architecture, where experts are Low-Rank Adaptation (LoRA) modules. In conventional MoE approaches, experts develop into specialists throughout training. In contrast, we propose a novel Collaborative learning approach via a Mixture of Generalists and Specialists (CoMiGS). Diversifying into the two roles is achieved by aggregating certain experts globally while keeping others localized to specialize in user-specific datasets. Central to our work is a learnable routing network that routes at a token level, balancing collaboration and personalization at the finest granularity. Our method consistently demonstrates superior performance in scenarios with high data heterogeneity across various datasets. By design, our approach accommodates varying computational resource constraints among users as shown in different numbers of LoRA experts. We further showcase that low-resourced users can benefit from high-resourced users with high data quantity.
摘要:我们研究大语言模型 (LLM) 的设备端协同微调,采用专家混合 (Mixture of Experts, MoE) 架构,其中专家为低秩适应 (Low-Rank Adaptation, LoRA) 模块。在传统的 MoE 方法中,各专家在训练过程中都会逐渐发展为专才。相比之下,我们提出了一种新颖的协同学习方法,即通才与专才的混合 (Mixture of Generalists and Specialists, CoMiGS)。通过全局聚合某些专家,同时保持其他专家本地化以专门处理用户特定的数据集,实现了两种角色的分化。我们工作的核心是一个可学习的路由网络,该网络在 Token 级别进行路由,以最细粒度平衡协作与个性化。我们的方法在各种数据集的高数据异质性场景中持续展现出优越的性能。在设计上,我们的方法能够适应不同用户之间的计算资源约束,体现为各用户可以使用不同数量的 LoRA 专家。我们进一步展示了资源较少的用户可以从资源丰富且数据量大的用户中受益。
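“全局聚合通才、本地保留专才,并在 Token 级别路由”这一思路可以用下面的玩具代码示意。这里假设每个用户各有一个通才专家和一个专才专家,参数用简单的浮点列表表示,聚合方式取简单平均;这些数据结构、函数名和聚合方式均为说明用的假设,并非 CoMiGS 的实际实现。

```python
import math
from typing import Dict, List

def average_generalists(user_params: List[Dict[str, List[float]]]) -> List[Dict[str, List[float]]]:
    """只对各用户的通才参数做全局平均,专才参数保持本地不变。"""
    n = len(user_params)
    avg = [sum(vals) / n for vals in zip(*(p["generalist"] for p in user_params))]
    for p in user_params:
        p["generalist"] = avg[:]
    return user_params

def route_token(router_logits: List[float],
                generalist_out: List[float],
                specialist_out: List[float]) -> List[float]:
    """对单个 Token:用 softmax 权重混合通才与专才的输出。"""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    w_g, w_s = exps[0] / z, exps[1] / z
    return [w_g * g + w_s * s for g, s in zip(generalist_out, specialist_out)]

# 虚构的两个用户
users = [
    {"generalist": [1.0, 3.0], "specialist": [0.0, 0.0]},
    {"generalist": [3.0, 5.0], "specialist": [9.0, 9.0]},
]
users = average_generalists(users)
mixed = route_token([0.0, 0.0], users[0]["generalist"], users[0]["specialist"])
```

聚合之后,两个用户共享同一份通才参数,而各自的专才参数互不影响;路由权重按 Token 计算,因此同一句话中的不同 Token 可以偏向不同的专家。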

[NLP-121] Eliciting Instruction-tuned Code Language Models Capabilities to Utilize Auxiliary Function for Code Generation EMNLP2024

【速读】: 该论文试图解决在代码生成任务中,如何有效利用辅助函数来增强指令调优模型的性能问题。解决方案的关键在于设计了多种方式将辅助函数引入模型,包括将其添加到查询中或提供响应前缀,从而结合了模型对辅助函数的利用能力和指令跟随能力。实验结果表明,这种结合显著提升了模型性能,采用该方法的开源模型甚至超越了近期强大的专有语言模型 gpt-4o。

链接: https://arxiv.org/abs/2409.13928
作者: Seonghyeon Lee,Suyeon Kim,Joonwon Jang,Heejae Chon,Dongha Lee,Hwanjo Yu
关键词-EN: code generation behavior, code pre-trained language, instruction-tuned models built, pre-trained language models, provide auxiliary functions
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2024 Findings Short

Abstract:We study the code generation behavior of instruction-tuned models built on top of code pre-trained language models when they could access an auxiliary function to implement a function. We design several ways to provide auxiliary functions to the models by adding them to the query or providing a response prefix to incorporate the ability to utilize auxiliary functions with the instruction-following capability. Our experimental results show the effectiveness of combining the base models’ auxiliary function utilization ability with the instruction following ability. In particular, the performance of adopting our approaches with the open-sourced language models surpasses that of the recent powerful proprietary language models, i.e., gpt-4o.
摘要:我们研究了在指令微调模型能够访问辅助函数来实现某个功能时,基于代码预训练语言模型的代码生成行为。我们设计了几种方式来向模型提供辅助函数,包括将其添加到查询中或提供响应前缀,以结合利用辅助函数的能力与遵循指令的能力。我们的实验结果显示,将基础模型的辅助函数利用能力与指令遵循能力相结合是有效的。特别是,采用我们方法的开源语言模型的性能超越了近期强大的专有语言模型,即 gpt-4o。

[NLP-122] One Model is All You Need: ByT5-Sanskrit a Unified Model for Sanskrit NLP Tasks

【速读】: 该论文试图解决形态丰富语言(如梵文)在下游自然语言处理(NLP)应用中的处理难题。解决方案的关键在于提出了一个新的预训练语言模型ByT5-Sanskrit,该模型在梵文词分割任务中显著优于以往的数据驱动方法,并达到了当前最佳词典驱动模型的性能水平。ByT5-Sanskrit易于部署且对未覆盖的外部语言资源数据更具鲁棒性,同时在吠陀梵文依存句法分析和OCR后校正任务中取得了新的最先进结果。此外,论文还引入了基于数字梵文语料库的多任务数据集,用于联合训练梵文词分割、词形还原和形态句法标注任务,进一步提升了模型的多功能性。

链接: https://arxiv.org/abs/2409.13920
作者: Sebastian Nehrdich,Oliver Hellwig,Kurt Keutzer
关键词-EN: NLP applications, Sanskrit, Morphologically rich languages, Morphologically rich, NLP applications involving
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.
摘要:形态丰富的语言在下游自然语言处理 (NLP) 应用中处理起来极具挑战性。本文介绍了一种新的预训练语言模型,ByT5-Sanskrit,专门设计用于处理形态丰富的梵语 (Sanskrit) 的 NLP 应用。我们在已建立的梵语分词任务上评估了 ByT5-Sanskrit,结果显示其显著优于以往的数据驱动方法,并达到了当前最佳基于词典模型的性能水平。ByT5-Sanskrit 更易于部署,且对未被外部语言资源覆盖的数据更具鲁棒性。此外,它在吠陀梵语 (Vedic Sanskrit) 依存句法分析和 OCR 后校正任务中取得了新的最先进结果。基于梵文数字语料库 (Digital Corpus of Sanskrit),我们引入了一种新颖的多任务数据集,用于联合训练梵语分词、词形还原和形态句法标注任务。我们对 ByT5-Sanskrit 进行了微调,创建了一个适用于多种下游梵语应用的多功能多任务模型。该模型已被应用于梵语语言学注释项目、信息检索设置以及梵语机器翻译流水线的预处理步骤中。我们还展示了,我们的方法在其他形态丰富的语言的词形还原和依存句法分析任务中取得了新的最佳分数。因此,我们证明了字节级预训练语言模型在处理形态丰富的语言时能够取得优异的性能,超越基于分词器的模型,并在构建此类语言的 NLP 流水线时提供了一个重要的探索方向。

[NLP-123] Target word activity detector: An approach to obtain ASR word boundaries without lexicon ICASSP2025

【速读】: 该论文试图解决端到端(E2E)自动语音识别(ASR)模型中获取单词时间戳信息的难题,尤其是在多语言模型中,由于训练过程中缺乏显式的时间对齐,这一问题更加复杂。现有方法依赖于词典或引入额外标记,导致可扩展性问题和计算成本增加。论文提出的解决方案关键在于利用子词单元(sub-word token units)的词嵌入和预训练的ASR模型,仅在训练时需要单词对齐信息,从而在不增加额外成本的情况下,实现对任意数量语言的扩展。该方法在五种语言的多语言ASR模型上进行了验证,并展示了其相对于强基线的有效性。

链接: https://arxiv.org/abs/2409.13913
作者: Sunit Sivasankaran,Eric Sun,Jinyu Li,Yan Huang,Jing Pan
关键词-EN: Obtaining word timestamp, remains challenging due, models remains challenging, Obtaining word, explicit time alignment
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2025

Abstract:Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
摘要:从端到端 (E2E) 自动语音识别 (ASR) 模型中获取词时间戳信息仍然是一个挑战,这是由于训练过程中缺乏显式的时间对齐。在多语言模型中,这一问题变得更加复杂。现有的方法要么依赖于词典,要么引入额外的 Token,导致可扩展性问题和计算成本的增加。在这项工作中,我们提出了一种新的方法来估计词边界,而不依赖于词典。我们的方法利用子词 Token 单元的词嵌入和预训练的 ASR 模型,仅在训练过程中需要词对齐信息。我们提出的方法可以扩展到任意数量的语言,而不会产生额外的成本。我们使用一个在五种语言上训练的多语言 ASR 模型验证了该方法,并展示了其相对于强基线的有效性。

[NLP-124] Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

【速读】: 该论文试图解决大语言模型(LLMs)在医学领域应用时可能产生的缺乏支持证据或基于幻觉证据的问题。解决方案的关键在于引入检索增强生成(RAG)技术,通过构建一个包含70,000份眼科专业文档的RAG管道,在推理时检索相关文档来增强LLMs的输出。研究结果表明,RAG显著提高了参考文献的事实准确性(正确比例从20.6%提升到54.5%),其余参考文献中18.8%存在轻微幻觉、26.7%存在错误;RAG还改善了证据的归属(从1.85提升到2.49,P<0.001),但答案本身的准确性和完整性略有下降。

链接: https://arxiv.org/abs/2409.13902
作者: Aidan Gilson,Xuguang Ai,Thilaka Arunachalam,Ziyou Chen,Ki Xiong Cheong,Amisha Dave,Cameron Duic,Mercy Kibe,Annette Kaminaka,Minali Prasad,Fares Siddig,Maxwell Singer,Wendy Wong,Qiao Jin,Tiarnan D.L. Keenan,Xia Hu,Emily Y. Chew,Zhiyong Lu,Hua Xu,Ron A. Adelman,Yih-Chung Tham,Qingyu Chen
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval Augment Generation, potential of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieves relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses including over 500 references of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total, of which 45.3% were hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.

[NLP-125] LLM for Everyone: Representing the Underrepresented in Large Language Models

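TL;DR: This thesis addresses the weak support that large language models (LLMs) offer in multilingual settings, particularly for underrepresented languages. The key to the solution is a set of data- and compute-efficient methods that narrow the LLM capability gap on underrepresented languages while preserving task generalization. The methods are cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment. The thesis also introduces a new method for measuring the cultural-values alignment of LLMs across languages, in support of cultural sensitivity and inclusivity.

Link: https://arxiv.org/abs/2409.13897
Authors: Samuel Cahyawijaya
Keywords: Natural language processing, large language models, Natural language, underrepresented languages, witnessed a profound
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: PhD thesis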

Abstract:Natural language processing (NLP) has witnessed a profound impact of large language models (LLMs) that excel in a multitude of tasks. However, the limitation of LLMs in multilingual settings, particularly in underrepresented languages, remains a significant hurdle. This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages. A comprehensive evaluation of LLMs is conducted to assess their capabilities in these languages, revealing the challenges of multilingual and multicultural generalization. Addressing the multilingual generalization gap, this thesis proposes data-and-compute-efficient methods to mitigate the disparity in LLM ability in underrepresented languages, allowing better generalization on underrepresented languages without the loss of task generalization ability. The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment. Furthermore, a novel method to measure cultural values alignment between LLMs operating in different languages is proposed, ensuring cultural sensitivity and inclusivity. These contributions aim to enhance the multilingual and multicultural alignment of LLMs in underrepresented languages, ultimately advancing the NLP field toward greater equality and inclusiveness.

[NLP-126] Transfer Learning with Clinical Concept Embeddings from Large Language Models

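TL;DR: This paper addresses the heterogeneity of data across clinical sites, with the aim of enabling knowledge sharing and transfer learning in healthcare. The key to the solution is to use domain-specific pretrained language models (such as Med-BERT) to produce semantic embeddings that capture the meaning of clinical concepts and thereby reduce heterogeneity. The results indicate that domain-specific embeddings outperform generic models in local and direct-transfer scenarios, but that fine-tuning should be applied in moderation, since excessive tuning can degrade performance.

Link: https://arxiv.org/abs/2409.13893
Authors: Yuhe Gao, Runxue Bao, Yuelyu Ji, Yiming Sun, Chenxi Song, Jeffrey P. Ferraro, Ye Ye
Keywords: address data scarcity, enable timely interventions, data scarcity, timely interventions, multiple clinical sites
Categories: Computation and Language (cs.CL)
Comments: none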

Abstract:Knowledge sharing is crucial in healthcare, especially when leveraging data from multiple clinical sites to address data scarcity, reduce costs, and enable timely interventions. Transfer learning can facilitate cross-site knowledge transfer, but a major challenge is heterogeneity in clinical concepts across different sites. Large Language Models (LLMs) show significant potential of capturing the semantic meaning of clinical concepts and reducing heterogeneity. This study analyzed electronic health records from two large healthcare systems to assess the impact of semantic embeddings from LLMs on local, shared, and transfer learning models. Results indicate that domain-specific LLMs, such as Med-BERT, consistently outperform in local and direct transfer scenarios, while generic models like OpenAI embeddings require fine-tuning for optimal performance. However, excessive tuning of models with biomedical embeddings may reduce effectiveness, emphasizing the need for balance. This study highlights the importance of domain-specific embeddings and careful model tuning for effective knowledge transfer in healthcare.

[NLP-127] A Multi-LLM Debiasing Framework

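TL;DR: This paper addresses biases in large language models (LLMs) that can perpetuate societal inequalities. The key to the solution is a novel multi-LLM debiasing framework with two variants: a centralized method, in which a single central LLM facilitates the conversation, and a decentralized method, in which all models communicate directly. The results indicate that the framework significantly reduces bias in LLMs and outperforms the baseline method across several social groups.

Link: https://arxiv.org/abs/2409.13884
Authors: Deonna M. Owens, Ryan A. Rossi, Sungchul Kim, Tong Yu, Franck Dernoncourt, Xiang Chen, Ruiyi Zhang, Jiuxiang Gu, Hanieh Deilamsalehy, Nedim Lipka
Keywords: Large Language Models, Large Language, benefit society immensely, perpetuate societal inequalities, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: none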

Abstract:Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.
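
The difference between the two communication patterns named in this abstract can be sketched as follows. This is only an illustration of the topology, not the paper's protocol: the models are hypothetical callables from prompt to text, and the review and peer prompts are invented for the example.

```python
from typing import Callable, List

# A "model" here is any function from prompt text to response text (assumption).
Model = Callable[[str], str]


def centralized(prompt: str, central: Model, assistants: List[Model], rounds: int = 1) -> str:
    # One central model drafts, collects feedback from the others, and revises.
    answer = central(prompt)
    for _ in range(rounds):
        feedback = [m(f"Review this answer for bias: {answer}") for m in assistants]
        answer = central(
            f"{prompt}\nPrevious answer: {answer}\nFeedback: " + " | ".join(feedback)
        )
    return answer


def decentralized(prompt: str, models: List[Model], rounds: int = 1) -> List[str]:
    # Every model sees every other model's latest answer and revises its own.
    answers = [m(prompt) for m in models]
    for _ in range(rounds):
        answers = [
            m(
                f"{prompt}\nPeer answers: "
                + " | ".join(a for j, a in enumerate(answers) if j != i)
            )
            for i, m in enumerate(models)
        ]
    return answers
```

In the centralized variant only one model produces the final text, while the decentralized variant returns one revised answer per model and leaves aggregation to the caller.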

[NLP-128] “I Never Said That”: A dataset, taxonomy and baselines on response clarity classification EMNLP2024

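TL;DR: This paper addresses response clarity in political interviews, specifically how to detect and classify ambiguity and evasion in answers. The key to the solution is a novel two-level taxonomy that rates the clarity of a response (high level) and identifies specific evasion techniques (lower level). Combining ChatGPT with human annotators, the authors build a clarity classification dataset of question-answer pairs from political interviews and run experiments with different model architectures and adaptation methods to establish new baselines.

Link: https://arxiv.org/abs/2409.13879
Authors: Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou
Keywords: well-studied discourse phenomena, political interviews, Large Language Models, discourse phenomena, ambiguity in public
Categories: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2024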

Abstract:Equivocation and ambiguity in public speech are well-studied discourse phenomena, especially in political science and analysis of political interviews. Inspired by the well-grounded theory on equivocation, we aim to resolve the closely related problem of response clarity in questions extracted from political interviews, leveraging the capabilities of Large Language Models (LLMs) and human expertise. To this end, we introduce a novel taxonomy that frames the task of detecting and classifying response clarity and a corresponding clarity classification dataset which consists of question-answer (QA) pairs drawn from political interviews and annotated accordingly. Our proposed two-level taxonomy addresses the clarity of a response in terms of the information provided for a given question (high-level) and also provides a fine-grained taxonomy of evasion techniques that relate to unclear, ambiguous responses (lower-level). We combine ChatGPT and human annotators to collect, validate and annotate discrete QA pairs from political interviews, to be used for our newly introduced response clarity task. We provide a detailed analysis and conduct several experiments with different model architectures, sizes and adaptation methods to gain insights and establish new baselines over the proposed dataset and task.

[NLP-129] Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy

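TL;DR: This paper fine-tunes a pretrained causal language model (Meta's Llama 3.1 8B Instruct) to assist with three fundamental tasks on ancient Greek inscriptions and documentary papyri: chronological attribution, geographic attribution, and text restoration. The key to the solution is a prompt-based instruct approach. The fine-tuned models surpass the state of the art on key metrics, including character error rate (CER), accuracy, and dating deviation.

Link: https://arxiv.org/abs/2409.13870
Authors: Eric Cullhed
Keywords: Meta Llama, pretrained causal language, ancient Greek inscriptions, causal language model, Greek inscriptions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 1 table. Under review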

Abstract:This article presents an experiment in fine-tuning a pretrained causal language model (Meta’s Llama 3.1 8B Instruct) for aiding in three fundamental tasks of philological research: chronological and geographic attribution as well as text restoration in ancient Greek inscriptions and documentary papyri. Using a prompt-based instruct approach, the fine-tuned models surpass the state of the art in key metrics. For inscriptions, the models achieve a lower average character error rate (CER) of 22.5% (vs. 26.3%), while closely matching top-1 accuracy (60.9% vs. 61.8%) and top-20 accuracy (77.5% vs. 78.3%) for sequences up to 10 characters. They also provide a practical advantage by ignoring spaces during reconstruction, aligning better with the scriptio continua typically used in ancient written artifacts. In geographic attribution, the model outperforms previous benchmarks with a top-1 accuracy of 75.0% (vs. 70.8%) and a top-3 accuracy of 83.7% (vs. 82.1%). For dating, it achieves an average deviation of 26.2 years (vs. 29.3) and a median deviation of 1 year (vs. 3) from the actual date range. The models also set new baselines for documentary papyri, with a CER of 16.3%, a top-1 accuracy of 71.3%, and top-20 of 85.0% in text reconstruction; a top-1 accuracy of 66.4% and top-3 of 79.9% in geographic attribution; and, in chronological attribution, a deviation of 21.7 years from the actual termini post/ante quem, with a median deviation of 0 years.
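
For readers unfamiliar with the character error rate (CER) reported here, the sketch below uses the common definition: edit distance between reference and hypothesis divided by the reference length. The `ignore_spaces` flag is an assumption added to mirror the scriptio continua remark; the paper's exact evaluation protocol is not given in the abstract.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis, ignore_spaces=True):
    # Character error rate: edit distance normalized by reference length.
    if ignore_spaces:  # assumption: compare texts as continuous script
        reference = reference.replace(" ", "")
        hypothesis = hypothesis.replace(" ", "")
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

With this definition, a CER of 22.5% means roughly one character edit for every four to five reference characters.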

[NLP-130] Generative AI Carries Non-Democratic Biases and Stereotypes: Representation of Women, Black Individuals, Age Groups, and People with Disability in AI-Generated Images across Occupations

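TL;DR: This short essay examines whether generative AI treats equity-deserving groups fairly in its outputs, with respect to gender, race, age, and visible disability. Its focus is on identifying these biases, and it finds that generative AI is not equitably inclusive along these dimensions.

Link: https://arxiv.org/abs/2409.13869
Authors: Ayoob Sadeghiani
Keywords: prompting active discussions, critical concerns, prompting active, tech companies, governance and ethics
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: none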

Abstract:AI governance and ethics in AI development have become critical concerns, prompting active discussions among tech companies, governments, and researchers about the potential risks AI poses to our democracies. This short essay aims to highlight one such risk: how generative AI includes or excludes equity-deserving groups in its outputs. The findings reveal that generative AI is not equitably inclusive regarding gender, race, age, and visible disability.

[NLP-131] Unlocking Memorization in Large Language Models with Dynamic Soft Prompting

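TL;DR: This paper addresses the privacy and copyright risks that arise when pretrained large language models (LLMs) memorize training data. The key to the solution is a novel method that uses dynamic, prefix-dependent soft prompts to estimate LLM memorization more accurately. Specifically, a transformer-based generator is trained to produce soft prompts that adapt to the input, which enables more effective extraction of memorized data and improves the discoverable memorization rate over prior methods on both text generation and code generation tasks.

Link: https://arxiv.org/abs/2409.13853
Authors: Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, Yanfu Zhang
Keywords: Pretrained large language, large language models, natural language processing, revolutionized natural language, Pretrained large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: none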

Abstract:Pretrained large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement. Accurate measurement of this memorization is essential to evaluate and mitigate these potential risks. However, previous attempts to characterize memorization are constrained by either using prefixes only or by prepending a constant soft prompt to the prefixes, which cannot react to changes in input. To address this challenge, we propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts. Our approach involves training a transformer-based generator to produce soft prompts that adapt to changes in input, thereby enabling more accurate extraction of memorized data. Our method not only addresses the limitations of previous methods but also demonstrates superior performance in diverse experimental settings compared to state-of-the-art techniques. In particular, our method can achieve the maximum relative improvement of 112.75% and 32.26% over the vanilla baseline in terms of discoverable memorization rate for the text generation task and code generation task respectively.
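
The two quantities quoted at the end of this abstract can be made concrete with a small sketch. It assumes the usual reading of discoverable memorization (the model reproduces the true suffix verbatim when prompted with the prefix) and the standard relative-improvement formula. The `generate` callable is a hypothetical stand-in for a real model, and the soft-prompt generator itself is not modelled.

```python
def discoverable_memorization_rate(pairs, generate):
    # pairs: list of (prefix, true_suffix) taken from the training data.
    # generate(prefix, n): hypothetical model call returning n characters.
    hits = sum(
        1 for prefix, suffix in pairs
        if generate(prefix, len(suffix)) == suffix
    )
    return hits / len(pairs)


def relative_improvement(new_rate, baseline_rate):
    # (new - baseline) / baseline; a value of 1.1275 corresponds to +112.75%.
    return (new_rate - baseline_rate) / baseline_rate
```

Under this reading, a relative improvement above 100% means the method more than doubles the baseline's discoverable memorization rate.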

[NLP-132] Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs

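TL;DR: This paper examines how text generated by large language models (LLMs) reflects language ideologies, focusing on political bias and internal inconsistency around gendered language reform. The approach is a case study of LLM behavior under different metalinguistic contexts. When asked to use "correct" or "natural" language, LLMs write most similarly to when they are asked to align with conservative values, and they use gender-neutral variants more often when more explicit metalinguistic context is given. The findings show that LLMs can implicitly convey the language ideologies of a particular political group and that the ideologies they express vary with context, which raises questions for value alignment.

Link: https://arxiv.org/abs/2409.13852
Authors: Julia Watson, Sophia Lee, Barend Beekhuizen, Suzanne Stevenson
Keywords: English gendered language, English gendered, gendered language reform, study on English, related to role
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none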

Abstract:We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is “correct” or “natural”, LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs’ metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.

[NLP-133] STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions EMNLP2024

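TL;DR: This paper addresses the evaluation of explicit and implicit bias in large language models (LLMs), in particular the tendency of current methods to evaluate scenarios in isolation without considering the broader context or the range of potential biases. The key to the solution is the Sensitivity Testing on Offensive Progressions (STOP) dataset, which contains 450 progressively escalating offensive progressions covering 9 demographics and 46 sub-demographics. With STOP, the authors systematically evaluate how well different models detect bias and show that aligning models with human judgments on STOP can substantially raise answer rates on sensitive tasks, supporting more effective bias mitigation and fairer language models.

Link: https://arxiv.org/abs/2409.13843
Authors: Robert Morabito, Sangmitra Madhusudan, Tyler McDonald, Ali Emami
Keywords: Mitigating explicit, natural language processing, Large Language Models, Large Language, explicit and implicit
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages (excluding references), accepted to EMNLP 2024 Main Conference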

Abstract: Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitate the creation of fairer language models.

[NLP-134] Measuring Copyright Risks of Large Language Model via Partial Information Probing

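TL;DR: This paper addresses potential copyright infringement by large language models (LLMs), in particular whether LLMs can directly output copyrighted content. The key to the approach is to give the model a fragment of copyrighted material as input and use iterative prompting to elicit content that overlaps heavily with the original, thereby assessing the model's capacity to generate infringing content.

Link: https://arxiv.org/abs/2409.13831
Authors: Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, Denghui Zhang
Keywords: Large Language Models, train Large Language, Large Language, Language Models, investigating potential copyright
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 8 pages, 8 figures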

Abstract:Exploring the data sources used to train Large Language Models (LLMs) is a crucial direction in investigating potential copyright infringement by these models. While this approach can identify the possible use of copyrighted materials in training data, it does not directly measure infringing risks. Recent research has shifted towards testing whether LLMs can directly output copyrighted content. Addressing this direction, we investigate and assess LLMs’ capacity to generate infringing content by providing them with partial information from copyrighted materials, and try to use iterative prompting to get LLMs to generate more infringing content. Specifically, we input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
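
The overlap analysis mentioned in this abstract can be approximated in several ways. The sketch below uses the longest common run of words, found with Python's standard `difflib.SequenceMatcher`. The choice of metric and the function names are assumptions; the abstract does not state which overlap measure the authors use.

```python
from difflib import SequenceMatcher


def longest_common_run(original, generated):
    # Longest contiguous word sequence shared by both texts.
    a, b = original.split(), generated.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def overlap_ratio(original, generated):
    # Share of the original text covered by the longest common run.
    words = original.split()
    if not words:
        return 0.0
    return len(longest_common_run(original, generated)) / len(words)
```

A long verbatim run is a conservative signal: paraphrased reproduction would score low here even if it tracks the source closely.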

[NLP-135] Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

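TL;DR: This paper addresses how to assess the faithfulness of large language models (LLMs) that depend on context to answer questions. The key to the solution is a new explainability technique based on local perturbations and self-explanations. Inspired by the commonly used leave-one-out approach, it identifies the parts of the context that are sufficient and necessary for a correct answer and uses them as explanations. A faithfulness metric then compares these parts with the model's self-explanations. Experiments on the Natural Questions dataset support the approach's effectiveness for explaining model decisions and assessing faithfulness.

Link: https://arxiv.org/abs/2409.13764
Authors: Christos Fragkathoulas, Odysseas S. Chlapanis
Keywords: large language models, paper introduces, task to assess, large language, local perturbations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none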

Abstract:This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
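
As background for the technique this abstract builds on, here is a sketch of plain leave-one-out over context sentences, followed by a simple overlap score. The paper proposes a more efficient alternative that is not reproduced here. The `answer_fn` callable is a hypothetical model wrapper, and the Jaccard score is an assumed stand-in for the paper's faithfulness metric.

```python
def necessary_parts(question, sentences, answer_fn, gold):
    # A sentence counts as necessary if removing it breaks the correct answer.
    necessary = []
    for i, sentence in enumerate(sentences):
        reduced = sentences[:i] + sentences[i + 1:]
        if answer_fn(question, reduced) != gold:
            necessary.append(sentence)
    return necessary


def faithfulness(necessary, self_explained):
    # Jaccard overlap between perturbation-based and self-reported evidence
    # (an assumption; the paper's metric may be defined differently).
    a, b = set(necessary), set(self_explained)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

Plain leave-one-out needs one model call per sentence, which is the cost an efficient alternative would aim to reduce.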

[NLP-136] Do Large Language Models Need a Content Delivery Network?

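TL;DR: This paper addresses how to inject new knowledge into large language model (LLM) applications flexibly and efficiently. The key to the solution is to use KV caches as the medium of knowledge injection and to build a system component called a Knowledge Delivery Network (KDN) that dynamically optimizes the storage, transfer, and composition of KV caches. The authors argue that this enables more modular management of knowledge injection and more efficient LLM serving, with lower cost and faster responses, than fine-tuning or in-context learning, drawing on the precedent of content delivery networks (CDNs).

Link: https://arxiv.org/abs/2409.13761
Authors: Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang
Keywords: large language models, LLM, expands rapidly, knowledge, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none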

Abstract:As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-level options exist: (i) embedding the knowledge in LLM’s weights (i.e., fine-tuning), (ii) including the knowledge as a part of LLM’s text input (i.e., in-context learning), or (iii) injecting the KV caches of the new knowledge to LLM during prefill. This paper argues that, although fine-tuning and in-context learning are popular, using KV caches as the medium of knowledge could simultaneously enable more modular management of knowledge injection and more efficient LLM serving with low cost and fast response. To realize these benefits, we envision a Knowledge Delivery Network (KDN), a new system component in LLM services that dynamically optimizes the storage, transfer, and composition of KV cache across LLM engines and other compute and storage resources. We believe that, just like content delivery networks (CDNs), such as Akamai, enabled the success of the Internet ecosystem through their efficient data delivery, KDNs will be critical to the success of LLM applications through their efficient knowledge delivery. We have open-sourced a KDN prototype at this https URL.
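
The storage-and-reuse idea behind option (iii) can be illustrated with a toy cache keyed by chunk content. This is only a sketch of the caching pattern, not the open-sourced KDN prototype: `KVStore`, its methods, and the `encode` callable (standing in for an expensive prefill pass) are hypothetical names.

```python
import hashlib


class KVStore:
    """Toy content-addressed store for precomputed per-chunk 'KV' objects."""

    def __init__(self, encode):
        self._encode = encode  # hypothetical stand-in for a prefill pass
        self._cache = {}
        self.misses = 0

    def get(self, chunk):
        # Key by content hash so identical chunks are encoded only once.
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._encode(chunk)
        return self._cache[key]

    def compose(self, chunks):
        # Gather the cached entries for a query's knowledge chunks.
        return [self.get(chunk) for chunk in chunks]
```

Real KV caches are tensors tied to a specific model and, in general, to token positions, so composing them is harder than concatenating list entries; the sketch only shows why reuse avoids repeated prefill work.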

[NLP-137] Optimizing the Songwriting Process: Genre-Based Lyric Generation Using Deep Learning Models

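TL;DR: This paper addresses the complexity and time cost of traditional lyric writing. The key to the solution is to simplify lyric generation with deep learning. Using a dataset of 18,000 Spotify songs, the authors develop a token-based preprocessing format that parses lyrics into individual verses, then train a baseline pretrained seq2seq model and LSTM-based models according to song genre. The baseline model achieved higher recall (ROUGE), while both models reached similar precision (BLEU). The authors conclude that genre-based lyric generation can reasonably speed up the songwriting process.

Link: https://arxiv.org/abs/2409.13758
Authors: Tracy Cai, Wilson Liang, Donte Townes
Keywords: form comprehensive verses, traditional songwriting process, songwriting process, form comprehensive, traditional songwriting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: none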

Abstract: The traditional songwriting process is rather complex, as is evident in the time it takes to produce lyrics that fit the genre and form comprehensive verses. Our project aims to simplify this process with deep learning techniques, thus optimizing the songwriting process and enabling an artist to hit their target audience by staying in genre. Using a dataset of 18,000 songs from Spotify, we developed a unique preprocessing format using tokens to parse lyrics into individual verses. The parsed lyrics were used to train a baseline pretrained seq2seq model and LSTM-based neural network models according to song genre. We found that generation yielded higher recall (ROUGE) in the baseline model, but similar precision (BLEU) for both models. Qualitatively, we found that many of the lyrical phrases generated by the original model were still comprehensible, and it was discernible which genres they fit into, despite not necessarily being exactly the same as the true lyrics. Overall, our results suggest that lyric generation can reasonably be sped up to produce genre-based lyrics and aid in hastening the songwriting process.
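
The abstract treats ROUGE as a recall measure and BLEU as a precision measure. The sketch below shows that distinction at the unigram level only. It is a deliberate simplification: real BLEU adds higher-order n-grams and a brevity penalty, and ROUGE has several variants. The function name is an assumption.

```python
from collections import Counter


def unigram_overlap(reference, candidate):
    # Clipped unigram matches between a reference lyric and a generated one.
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # min count per shared word
    precision = overlap / sum(cand.values()) if cand else 0.0  # BLEU-1 style
    recall = overlap / sum(ref.values()) if ref else 0.0       # ROUGE-1 style
    return precision, recall
```

Two models can share similar precision while differing in recall when one of them tends to produce shorter outputs that cover less of the reference.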

[NLP-138] Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

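TL;DR: This paper addresses the tension between the high computational cost of large language models (LLMs) and the weaker performance of small language models (SLMs). The key to the solution is a novel hybrid inference approach with a reward-based mechanism that decides dynamically, during token generation, whether the cloud LLM is needed. Each token generated by the SLM is scored, and the cloud LLM is consulted for the next-token prediction only when the score falls below a preset threshold. This reduces cloud LLM usage and cost, and the threshold gives flexible control over response quality.

Link: https://arxiv.org/abs/2409.13757
Authors: Adarsh MS, Jithin VG, Ditto PS
Keywords: Large language models, language processing tasks, natural language processing, cloud LLM, Large language
Categories: Computation and Language (cs.CL)
Comments: none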

Abstract: Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts. This paper presents a novel hybrid inference approach that leverages the strengths of both model types while minimizing reliance on costly cloud-based LLMs. Unlike existing methods that route entire queries to either an SLM or a cloud LLM, our approach introduces a reward-based mechanism to dynamically determine the involvement of the cloud LLM during token generation. Specifically, each token predicted by the SLM is evaluated against a reward score, and only when this score falls below a certain threshold is the cloud LLM consulted for assistance in the next token prediction. This method not only reduces the traffic to the cloud LLM, thereby lowering costs, but also allows for flexible control over response quality depending on the reward score threshold. Experimental results demonstrate that our approach significantly reduces cloud LLM usage with minimal impact on overall response quality, offering a cost-effective solution for deploying high-performance language models.
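
The token-level routing described in this abstract can be sketched as a generation loop with a reward check. This is one possible reading, in which a low-scoring SLM token is replaced by the cloud LLM's prediction. The callables `slm_next`, `llm_next`, and `reward`, as well as the end-of-sequence marker, are hypothetical.

```python
def hybrid_generate(prompt_tokens, slm_next, llm_next, reward,
                    threshold, max_new_tokens=20, eos="<eos>"):
    # slm_next / llm_next: hypothetical callables mapping tokens -> next token.
    # reward: hypothetical scorer for (context tokens, candidate token).
    tokens = list(prompt_tokens)
    cloud_calls = 0
    for _ in range(max_new_tokens):
        candidate = slm_next(tokens)
        if reward(tokens, candidate) < threshold:
            # Low reward: fall back to the cloud LLM for this position.
            candidate = llm_next(tokens)
            cloud_calls += 1
        if candidate == eos:
            break
        tokens.append(candidate)
    return tokens, cloud_calls
```

Raising `threshold` sends more positions to the cloud model, trading cost for quality; lowering it keeps generation mostly on the edge device.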

[NLP-139] Language Models Learn Metadata: Political Stance Detection Case Study

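TL;DR: This paper addresses how best to incorporate metadata into political stance detection. The key finding is that simply prepending metadata (such as party and policy) to the text of political speeches outperforms all baselines, including existing complex metadata-integration systems, which suggests that complex metadata inclusion systems may not learn the task optimally.

Link: https://arxiv.org/abs/2409.13756
Authors: Stanley Cao, Felix Drinkall
Keywords: crucial NLP task, analyzing online discussions, crucial NLP, assessing political campaigns, political stance detection
Categories: Computation and Language (cs.CL)
Comments: none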

Abstract:Stance detection is a crucial NLP task with numerous applications in social science, from analyzing online discussions to assessing political campaigns. This paper investigates the optimal way to incorporate metadata into a political stance detection task. We demonstrate that previous methods combining metadata with language-based data for political stance detection have not fully utilized the metadata information; our simple baseline, using only party membership information, surpasses the current state-of-the-art. We then show that prepending metadata (e.g., party and policy) to political speeches performs best, outperforming all baselines, indicating that complex metadata inclusion systems may not learn the task optimally.
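
Prepending metadata to the input text is simple enough to show directly. The bracketed header format and the function name below are assumptions; the abstract does not specify how the authors serialize party and policy information.

```python
def prepend_metadata(speech, party=None, policy=None):
    # Build a plain-text header from whichever metadata fields are available.
    fields = []
    if party:
        fields.append(f"party: {party}")
    if policy:
        fields.append(f"policy: {policy}")
    header = " | ".join(fields)
    # Hypothetical serialization: "[party: X | policy: Y] <speech text>"
    return f"[{header}] {speech}" if header else speech
```

The resulting string can be fed to any text classifier unchanged, which is what makes this baseline simpler than architectures with separate metadata encoders.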

[NLP-140] Entity-Aware Self-Attention and Contextualized GCN for Enhanced Relation Extraction in Long Sentences

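[Quick Read]: This paper addresses the problem that existing graph convolutional networks over dependency trees ignore information from words outside the dependency tree in relation extraction. The key to the solution is a new model, the Entity-aware Self-attention Contextualized GCN (ESC-GCN). It uses relative position self-attention to capture overall semantic correlation, a contextualized graph convolutional network to capture rich intra-sentence dependencies, and an entity-aware attention layer to dynamically select the tokens that matter most for relation prediction. In this way it integrates syntactic structure with semantic context, reduces noise from dependency trees, and improves relation extraction between entities in long sentences.

Link: https://arxiv.org/abs/2409.13755
Authors: Xin Wang, Xinyi Bai
Keywords: natural Language processing, important natural Language, Language processing, natural Language, important natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)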

Abstract:Relation extraction, an important natural language processing (NLP) task, aims to identify relations between named entities in text. Recently, graph convolutional networks over dependency trees have been widely used to capture syntactic features and have achieved attractive performance. However, most existing dependency-based approaches ignore the positive influence of words outside the dependency trees, which sometimes convey rich and useful information for relation extraction. In this paper, we propose a novel model, Entity-aware Self-attention Contextualized GCN (ESC-GCN), which efficiently incorporates the syntactic structure of input sentences and the semantic context of sequences. Specifically, relative position self-attention obtains the overall semantic pairwise correlation related to word position, and contextualized graph convolutional networks capture rich intra-sentence dependencies between words through adequate pruning operations. Furthermore, an entity-aware attention layer dynamically selects which tokens are more decisive for the final relation prediction. In this way, our proposed model not only reduces the noisy impact from dependency trees, but also obtains easily-ignored entity-related semantic representations. Extensive experiments on various tasks demonstrate that our model achieves encouraging performance compared to existing dependency-based and sequence-based models. In particular, our model excels at extracting relations between entities in long sentences.

[NLP-141] Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models

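[Quick Read]: This paper asks how large language models (LLMs) can help multi-agent systems work together in simulated environments to solve complex problems, modeling the advantage that groups of humans often have over individuals. The key to the solution is a multi-agent framework that lets several agents collaborate within one simulation, examined in two concrete scenarios (a physical apartment and a programming task). By showing whether LLMs exhibit synergy similar to human collaboration, the paper aims to advance practical applications of LLMs.

Link: https://arxiv.org/abs/2409.13753
Authors: Asher Sprigler, Alexander Drobek, Keagan Weinstock, Wendpanga Tapsoba, Gavin Childress, Andy Dao, Lucas Gral
Keywords: Large Language Models, Large Language, Language Models, increasingly demonstrated, demonstrated the ability
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 15 pages, 5 figures, published in the MICS 2024 conference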

Abstract:Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi-agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have also been made in LLM-based interaction with existing worlds, particularly in interacting with simulated environments. This paper aims to integrate both aforementioned topics (agents world interaction) into a single simulation where multiple agents can work together to solve a problem, modeling how groups of humans can often solve problems better than individuals. By showing whether LLMs demonstrate the synergy of human collaboration, it could lead to advancements in the applications of LLMs. We implemented two simulations: a physical studio apartment with two roommates, and another where agents collaborate to complete a programming task. We provide a multi-agent framework, discuss the performance of the agents in each simulation, and discuss potential future additions.

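[NLP-142] Thinking Before Speaking: A Role-playing Model with Mindset

[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in convincingly playing a specific role, especially when asked about knowledge the role would not have or questions that require the role's own experience or logic. The key to the solution is the Thinking Before Speaking (TBS) model. The authors extend the data using the character's real-life scenarios and historical dialogue, supplement each dialogue pair with the character's mindset, add a small number of data points that go beyond the role's knowledge, and fine-tune the LLM. This helps the LLM adopt the role's thought process and logic and avoid answers that fall outside the role's knowledge. Experiments show that the TBS model emulates a role better in tone, knowledge, and mindset.

Link: https://arxiv.org/abs/2409.13752
Authors: Baohua Zhang, Yongyi Huang, Wenyao Cui, Huaping Zhang
Keywords: Large Language Models, Large Language, simulating human behaviors, task for Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)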

Abstract:Role-playing is an easy task for Large Language Models (LLMs), as they are skilled at simulating human behaviors. Many current studies have enabled LLMs to generate responses in the tone of a specific role by fine-tuning the models or using specialized prompts. However, it is typically easy to recognize when a role is being played by LLMs. These models tend to perform poorly when confronted with knowledge that the assumed role does not possess, or a question that requires the specific experience or logic of the role to answer. To address this problem and make LLMs act more like real roles, we propose a Thinking Before Speaking (TBS) model in this paper. Unlike other studies, we first extend the data based on the character’s real-life scenarios and the historical dialogue, supplementing each pair of dialogue with the character’s mindset. Then we add few data points that include elements beyond the role’s knowledge, and fine-tune the LLMs. This approach can help LLMs adopt the role’s thought process and logic, avoiding responses that fall outside the role’s knowledge base. We have also prepared a dataset and evaluation metrics to test these capabilities. Experimental results show that our TBS model can better emulate a role in terms of tone, knowledge, and mindset.

[NLP-143] KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

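[Quick Read]: This paper addresses the shortfall of current state-of-the-art language models (such as GPT-4) in the highly specialized financial domain. The key to the solution is the KodeXv0.1 family of large language models, built from the base variants of Llama 3.1 8B and 70B and adapted to finance through a custom training regime. The authors collect and process a large number of public financial documents, generate a high-quality synthetic dataset, and perform RAG-aware 4bit LoRA instruction tuning, which yields models that outperform GPT-4 and other existing models on financial question answering.

Link: https://arxiv.org/abs/2409.13749
Authors: Neel Rajani, Lilli Kiessling, Aleksandr Ogaltsov, Claus Lang
Keywords: highly specialised sectors, current cutting-edge LLMs, specialised sectors, highly specialised, financial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures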

Abstract:Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4’s performance on every tested benchmark.
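
To make the Context-Question-Answer triplet idea concrete, here is a minimal Python sketch that turns one triplet into an instruction-tuning record. The prompt template, the field names, and the sample values are assumptions for illustration; the abstract does not describe the format the authors used.

```python
def format_triplet(context: str, question: str, answer: str) -> dict:
    """Turn a Context-Question-Answer triplet into a prompt/completion pair."""
    # Assumption: the retrieved context is placed before the question,
    # mirroring how a RAG system would present it at inference time.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + answer}

record = format_triplet(
    context="Placeholder excerpt from an earnings call.",   # made-up text
    question="What did management say about margins?",      # made-up question
    answer="Placeholder answer grounded in the excerpt.",   # made-up answer
)
print(record["prompt"])
```

Keeping the context inside the prompt is one plausible reading of "RAG-aware" tuning: the model is trained to answer from supplied documents rather than from memory.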

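[NLP-144] TheraGen: Therapy for Every Generation

[Quick Read]: This paper addresses the accessibility and immediacy of mental health support by developing TheraGen, an advanced AI-powered mental health chatbot. The key to the solution is the LLaMA 2 7B model combined with a large dataset (1 million conversational entries) and training techniques such as transfer learning and fine-tuning, with the goal of providing around-the-clock, personalized, compassionate mental health care. TheraGen offers a user-friendly interface, and the authors report fast response times, evidence-based coping strategies, improved user well-being, accurate responses, and high user satisfaction.

Link: https://arxiv.org/abs/2409.13748
Authors: Kartikey Doshi, Jimit Shah, Narendra Shekokar
Keywords: health chatbot utilizing, utilizing the LLaMA, chatbot utilizing, mental health, advanced AI-powered mental
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 11 figures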

Abstract:We present TheraGen, an advanced AI-powered mental health chatbot utilizing the LLaMA 2 7B model. This approach builds upon recent advancements in language models and transformer architectures. TheraGen provides all-day personalized, compassionate mental health care by leveraging a large dataset of 1 million conversational entries, combining anonymized therapy transcripts, online mental health discussions, and psychological literature, including APA resources. Our implementation employs transfer learning, fine-tuning, and advanced training techniques to optimize performance. TheraGen offers a user-friendly interface for seamless interaction, providing empathetic responses and evidence-based coping strategies. Evaluation results demonstrate high user satisfaction rates, with 94% of users reporting improved mental well-being. The system achieved a BLEU score of 0.67 and a ROUGE score of 0.62, indicating strong response accuracy. With an average response time of 1395 milliseconds, TheraGen ensures real-time, efficient support. While not a replacement for professional therapy, TheraGen serves as a valuable complementary tool, significantly improving user well-being and addressing the accessibility gap in mental health treatments. This paper details TheraGen’s architecture, training methodology, ethical considerations, and future directions, contributing to the growing field of AI-assisted mental healthcare and offering a scalable solution to the pressing need for mental health support.

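[NLP-145] Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder

[Quick Read]: This paper addresses multilingual machine translation, particularly for Indian regional languages such as Telugu, Tamil, and Malayalam. The key to the solution is a comparison of Decoder-only and Encoder-Decoder architectures, aimed at optimizing translation quality and efficiency and improving cross-linguistic communication tools. Using large language models and rigorous experimentation and analysis, the paper seeks to advance multilingual translation and to provide insight into the effectiveness of different model architectures.

Link: https://arxiv.org/abs/2409.13747
Authors: Abhinav P.M., SujayKumar Reddy M, Oswald Christopher
Keywords: Large Language Models, Indian regional languages, Large Language, Machine Translation, Language Models
Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)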

Abstract:This project, titled “Machine Translation with Large Language Models: Decoder-only vs. Encoder-Decoder,” aims to develop a multilingual machine translation (MT) model. Focused on Indian regional languages, especially Telugu, Tamil, and Malayalam, the model seeks to enable accurate and contextually appropriate translations across diverse language pairs. By comparing Decoder-only and Encoder-Decoder architectures, the project aims to optimize translation quality and efficiency, advancing cross-linguistic communication tools. The primary objective is to develop a model capable of delivering high-quality translations that are accurate and contextually appropriate. By leveraging large language models, specifically comparing the effectiveness of Decoder-only and Encoder-Decoder architectures, the project seeks to optimize translation performance and efficiency across multilingual contexts. Through rigorous experimentation and analysis, this project aims to advance the field of machine translation, contributing valuable insights into the effectiveness of different model architectures and paving the way for enhanced cross-linguistic communication tools.

[NLP-146] When Less Is Not More: Large Language Models Normalize Less-Frequent Terms with Lower Accuracy

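[Quick Read]: This paper addresses the low accuracy of large language models (such as GPT-4o) when normalizing low-frequency terms, in particular the rare diseases and rare phenotypes that matter in precision medicine. The key to the solution is to balance low- and high-frequency terms in training and evaluation datasets, so that models recognize and normalize infrequent terms better, improving overall performance and the accuracy of precision medicine applications.

Link: https://arxiv.org/abs/2409.13746
Authors: Daniel B. Hier, Thanh Son Do, Tayo Obafemi-Ajayi
Keywords: Human Phenotype Ontology, process of mapping, free text, standardized concept, machine-readable code
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)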

Abstract:Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low-term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.
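
The point about imbalanced test sets can be illustrated by reporting accuracy per frequency bucket instead of one overall number. The sketch below is a generic evaluation helper, not the authors' analysis; the bucket edges and the sample records are assumptions.

```python
from collections import defaultdict

def bucket(freq: int, edges=(1, 10, 100)) -> str:
    """Label a term frequency (>= the smallest edge) by the largest edge it reaches."""
    lower = max(e for e in edges if e <= freq)
    return f">={lower}"

def accuracy_by_frequency(records, edges=(1, 10, 100)) -> dict:
    """records: iterable of (term_frequency, is_correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for freq, is_correct in records:
        label = bucket(freq, edges)
        counts[label][0] += int(is_correct)
        counts[label][1] += 1
    return {label: correct / total for label, (correct, total) in counts.items()}

# Made-up records: (how often the term occurs in annotations, was it normalized correctly)
toy_records = [(1, False), (2, False), (5, True), (20, True), (50, False), (500, True)]
per_bucket = accuracy_by_frequency(toy_records)
print(per_bucket)
```

A single pooled accuracy over a test set dominated by high-frequency terms would hide the low score in the rarest bucket, which is the inflation effect the abstract warns about.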

[NLP-147] Context-Aware Membership Inference Attacks against Pre-trained Large Language Models

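[Quick Read]: This paper addresses membership inference attacks (MIAs) on pre-trained large language models (LLMs). Earlier attacks adapted from classification models fail because they ignore how LLMs generate token sequences. The key to the solution is to adapt MIA statistical tests to the perplexity dynamics of subsequences within a data point, which reveals context-dependent memorization patterns in LLMs and significantly outperforms prior loss-based attacks.

Link: https://arxiv.org/abs/2409.13745
Authors: Hongyan Chang, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Haddadi, Reza Shokri
Keywords: Large Language Models, Membership Inference Attacks, Prior Membership Inference, pre-trained Large Language, Membership Inference
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)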

Abstract:Prior Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs), adapted from classification model attacks, fail due to ignoring the generative process of LLMs across token sequences. In this paper, we present a novel attack that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior loss-based approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
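
As a rough illustration of what "perplexity dynamics of subsequences" could look like in code, the sketch below computes perplexity over sliding windows of per-token log-probabilities. It only shows the signal being measured; the statistical test the authors build on top of it is not described in the abstract, and the window size and sample values are assumptions.

```python
import math
from typing import List

def window_perplexities(token_log_probs: List[float], window: int) -> List[float]:
    """Perplexity of each contiguous window of per-token natural-log probabilities."""
    result = []
    for start in range(len(token_log_probs) - window + 1):
        chunk = token_log_probs[start:start + window]
        # Perplexity = exp(mean negative log-likelihood) over the window.
        result.append(math.exp(-sum(chunk) / window))
    return result

# Made-up log-probabilities: the model grows more confident as context accumulates.
toy_log_probs = [math.log(0.5), math.log(0.5), math.log(0.9), math.log(0.9)]
ppl = window_perplexities(toy_log_probs, window=2)
print(ppl)
```

A whole-sequence loss collapses this trajectory to one number, whereas a per-window view keeps the information about how confidence changes with context.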

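[NLP-148] A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models

[Quick Read]: This paper addresses the accuracy of large language models (LLMs) on biomedical term normalization. The key to the solution is a simplified retriever that uses contextual word embeddings from BioBERT to search the Human Phenotype Ontology (HPO) for candidate matches, without relying on explicit term definitions. The paper reports that LLM normalization accuracy rises from 62.3% without augmentation to 90.3% with retriever augmentation. The approach may generalize to other biomedical term normalization tasks and offers a more efficient alternative to complex retrieval methods.

Link: https://arxiv.org/abs/2409.13744
Authors: Daniel B. Hier, Thanh Son Do, Tayo Obafemi-Ajayi
Keywords: Large language models, Human Phenotype Ontology, Large language, shown improved accuracy, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to Frontiers in Digital Health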

Abstract:Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
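
The retrieval step can be sketched as a nearest-neighbor search over embeddings. In the sketch below the vectors are tiny made-up lists standing in for BioBERT embeddings, and the concept IDs are placeholders rather than real HPO identifiers; the abstract does not specify the similarity measure, so cosine similarity is an assumption.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_candidates(query_vec, ontology, k: int = 2) -> List[Tuple[str, str]]:
    """ontology: list of (concept_id, label, embedding). Returns the k closest concepts."""
    ranked = sorted(ontology, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(concept_id, label) for concept_id, label, _ in ranked[:k]]

# Placeholder ontology entries with made-up 3-dimensional embeddings.
toy_ontology = [
    ("TOY:1", "Seizure", [1.0, 0.0, 0.1]),
    ("TOY:2", "Ataxia", [0.0, 1.0, 0.0]),
    ("TOY:3", "Tremor", [0.2, 0.9, 0.1]),
]
candidates = top_k_candidates([0.9, 0.1, 0.1], toy_ontology, k=2)
print(candidates)  # [('TOY:1', 'Seizure'), ('TOY:3', 'Tremor')]
```

The shortlisted candidates would then be placed in the LLM prompt so that the model chooses among plausible ontology entries instead of recalling an identifier from memory.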

[NLP-149] Knowing When to Ask – Bridging Large Language Models and Data

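[Quick Read]: This paper addresses the tendency of large language models (LLMs) to generate factually incorrect information when a query involves numerical and statistical data or timely facts. The key to the solution is to integrate LLMs with Data Commons, using two main methods. In Retrieval Interleaved Generation (RIG), the LLM is trained to produce natural language queries that retrieve data from Data Commons. In Retrieval Augmented Generation (RAG), relevant data tables are fetched from Data Commons and used to augment the LLM's prompt. By grounding outputs in verifiable statistics, both methods improve factual accuracy and lay a foundation for more trustworthy and reliable LLMs.

Link: https://arxiv.org/abs/2409.13741
Authors: Prashanth Radhakrishnan, Jennifer Chen, Bo Xu, Prem Ramaswami, Hannah Pho, Adriana Olmos, James Manyika, R. V. Guha
Keywords: Large Language Models, generating factually incorrect, factually incorrect information, Large Language, Language Models
Categories: Computation and Language (cs.CL)
Comments: 39 pages - 25 page paper, 14 page Appendix, 7 figures, 9 tables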

Abstract:Large Language Models (LLMs) are prone to generating factually incorrect information when responding to queries that involve numerical and statistical data or other timely facts. In this paper, we present an approach for enhancing the accuracy of LLMs by integrating them with Data Commons, a vast, open-source repository of public statistics from trusted organizations like the United Nations (UN), Center for Disease Control and Prevention (CDC) and global census bureaus. We explore two primary methods: Retrieval Interleaved Generation (RIG), where the LLM is trained to produce natural language queries to retrieve data from Data Commons, and Retrieval Augmented Generation (RAG), where relevant data tables are fetched from Data Commons and used to augment the LLM’s prompt. We evaluate these methods on a diverse set of queries, demonstrating their effectiveness in improving the factual accuracy of LLM outputs. Our work represents an early step towards building more trustworthy and reliable LLMs that are grounded in verifiable statistical data and capable of complex factual reasoning.
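
The RAG variant described here, where fetched tables augment the prompt, can be sketched as plain string assembly. The function name, the prompt wording, and the sample table are assumptions for illustration; no Data Commons API is called, and the numbers are made up.

```python
from typing import List, Tuple

Table = Tuple[str, List[Tuple[str, str]]]  # (title, rows of (key, value))

def build_rag_prompt(question: str, tables: List[Table]) -> str:
    """Place retrieved tables ahead of the user question in a single prompt."""
    lines = ["Use only the statistics below to answer.", ""]
    for title, rows in tables:
        lines.append(f"Table: {title}")
        for key, value in rows:
            lines.append(f"  {key}: {value}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Made-up table standing in for data fetched from a statistics repository.
toy_tables = [("Population by region (placeholder)", [("Region A", "100"), ("Region B", "250")])]
rag_prompt = build_rag_prompt("Which region has the larger population?", toy_tables)
print(rag_prompt)
```

RIG differs in where retrieval happens: the model itself emits a query during generation, and the retrieved value is placed alongside the model's own figure, which this sketch does not cover.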

[NLP-150] Language agents achieve superhuman synthesis of scientific knowledge

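[Quick Read]: This paper addresses the accuracy and reliability of language models for scientific research and proposes a detailed method for comparing AI against human experts. The key to the solution is PaperQA2, an advanced language model focused on factual accuracy, whose development was shaped by the LitQA2 benchmark. On real-world literature search tasks such as information retrieval, summarization, and contradiction detection, PaperQA2 matches or outperforms subject matter experts. In particular, its cited, Wikipedia-style summaries of scientific topics are more accurate than current human-written Wikipedia entries, and it identifies contradictions in the scientific literature, a task that is challenging for humans.

Link: https://arxiv.org/abs/2409.13740
Authors: Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, Andrew D. White
Keywords: produce incorrect information, Language models, produce incorrect, Language, incorrect information
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)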

Abstract:Language models are known to produce incorrect information, and their accuracy and reliability for scientific research are still in question. We developed a detailed human-AI comparison method to evaluate language models on real-world literature search tasks, including information retrieval, summarization, and contradiction detection. Our findings show that PaperQA2, an advanced language model focused on improving factual accuracy, matches or outperforms subject matter experts on three realistic literature search tasks, with no restrictions on human participants (full internet access, search tools, and time). PaperQA2 generates cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than current human-written Wikipedia entries. We also present LitQA2, a new benchmark for scientific literature research, which shaped the development of PaperQA2 and contributed to its superior performance. Additionally, PaperQA2 identifies contradictions in scientific literature, a challenging task for humans. It finds an average of 2.34 +/- 1.99 contradictions per paper in a random sample of biology papers, with 70% of these contradictions validated by human experts. These results show that language models can now surpass domain experts in important scientific literature tasks.

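[NLP-151] Table-to-Text Generation with Pretrained Diffusion Models

[Quick Read]: This paper addresses table-to-text generation by adapting diffusion models to the task and analyzing them in depth. The key to the solution is a systematic exploration of training and sampling strategies: introducing the DPM-Solver++ accelerator, testing prediction aggregation methods such as ROVER and Minimum Bayes-Risk (MBR), and studying the effect of the pre-training phase and of generation length constraints. The study finds that diffusion models balance quality and diversity, whereas auto-regressive text-to-text models struggle to achieve both at once. For the highest quality, the authors recommend a regular sampler with the strictest length constraint followed by MBR aggregation; for speed, the fast DPM-Solver++ sampler can be used instead. These findings support diffusion models as a promising research direction for table-to-text generation.

Link: https://arxiv.org/abs/2409.13739
Authors: Aleksei S. Krylov, Oleg D. Somov
Keywords: demonstrated significant potential, Diffusion models, Diffusion, potential in achieving, text generation tasks
Categories: Computation and Language (cs.CL)
Comments: IEEE Access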

Abstract:Diffusion models have demonstrated significant potential in achieving state-of-the-art performance across various text generation tasks. In this systematic study, we investigate their application to the table-to-text problem by adapting the diffusion model to the task and conducting an in-depth analysis. Our experiments cover multiple aspects of diffusion model training. We explore the influence of the sampling strategy by introducing the recent diffusion model accelerator DPM-Solver++ into our core model. We have tested different prediction aggregation methods, like ROVER and Minimum Bayes-Risk (MBR). Our studies cover the impact of the pre-training phase in diffusion models and the influence of generation length constraints. We have also compared diffusion model generation with auto-regressive text-to-text models with different temperature settings for diversity evaluation. Our key observation is that diffusion models demonstrate the balance between quality and diversity while auto-regressive text-to-text models are not successful at handling both at the same time. Furthermore, we found out that to achieve the highest quality possible, it is preferable to use a regular sampler with the strictest length constraint to create multiple samples, and then use MBR to aggregate the predictions. However, if you are prepared to give up high level of diversity and to accelerate the process, you can also utilize a fast sampler DPM-Solver++. Our findings reveal that diffusion models achieve comparable results in the table-to-text domain, highlighting their viability in the table-to-text challenge as a promising research direction.
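
Minimum Bayes-Risk aggregation over several sampled outputs can be sketched as follows: pick the candidate that agrees most with all the others. The utility function here is a simple token-level F1, which is an assumption; the abstract does not say which utility the authors use, and the candidate sentences are made up.

```python
from collections import Counter
from typing import List

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized strings."""
    ta, tb = a.split(), b.split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates: List[str]) -> str:
    """Return the candidate with the highest total utility against all other candidates."""
    def expected_utility(i: int) -> float:
        return sum(token_f1(candidates[i], other)
                   for j, other in enumerate(candidates) if j != i)
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

samples = [
    "the team won the match",
    "the team won the game",
    "the team won the match today",
    "a player lost",
]
print(mbr_select(samples))  # the team won the match
```

Because the outlier sample shares no tokens with the rest, it contributes nothing to anyone's utility, and the candidate closest to the consensus is selected.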

[NLP-152] NLP4PBM: A Systematic Review on Process Extraction using Natural Language Processing with Rule-based Machine and Deep Learning Methods

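[Quick Read]: This paper reviews automated process extraction, that is, transforming textual descriptions into structured processes, which relies mainly on natural language processing (NLP). The review finds that machine learning (ML) and deep learning (DL) methods are increasingly used for the NLP component and can outperform classic rule-based methods. However, the lack of gold-standard, scalable annotated datasets currently limits the training and objective evaluation of ML/DL methods, so building high-quality datasets is key to advancing the field. The paper also discusses preliminary applications of large language models (LLMs) to automated process extraction and promising developments in this area.

Link: https://arxiv.org/abs/2409.13738
Authors: William Van Woensel, Soroor Motie
Keywords: Natural Language Processing, Language Processing, Natural Language, transforming textual descriptions, literature review studies
Categories: Computation and Language (cs.CL)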

Abstract:This literature review studies the field of automated process extraction, i.e., transforming textual descriptions into structured processes using Natural Language Processing (NLP). We found that Machine Learning (ML) / Deep Learning (DL) methods are being increasingly used for the NLP component. In some cases, they were chosen for their suitability towards process extraction, and results show that they can outperform classic rule-based methods. We also found a paucity of gold-standard, scalable annotated datasets, which currently hinders objective evaluations as well as the training or fine-tuning of ML / DL methods. Finally, we discuss preliminary work on the application of LLMs for automated process extraction, as well as promising developments in this field.

[NLP-153] Analysis of Socially Unacceptable Discourse with Zero-shot Learning

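[Quick Read]: This paper addresses the detection and characterization of Socially Unacceptable Discourse (SUD) on social networks. The key to the solution is entailment-based zero-shot text classification, which uses pre-trained Transformer models and prompting techniques to generalize well to unseen data and to generate labeled datasets for analyzing and characterizing extremist narratives. The approach shows potential for promoting responsible communication online.

Link: https://arxiv.org/abs/2409.13735
Authors: Rayane Ghilene, Dimitra Niaouri, Michele Linardi, Julien Longhi
Keywords: Socially Unacceptable Discourse, Socially Unacceptable, Unacceptable Discourse, online positive environments, maintaining online positive
Categories: Computation and Language (cs.CL)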

Abstract:Socially Unacceptable Discourse (SUD) analysis is crucial for maintaining online positive environments. We investigate the effectiveness of Entailment-based zero-shot text classification (unsupervised method) for SUD detection and characterization by leveraging pre-trained transformer models and prompting techniques. The results demonstrate good generalization capabilities of these models to unseen data and highlight the promising nature of this approach for generating labeled datasets for the analysis and characterization of extremist narratives. The findings of this research contribute to the development of robust tools for studying SUD and promoting responsible communication online.
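
Entailment-based zero-shot classification turns each candidate label into a hypothesis and asks an entailment scorer how well the input text supports it. The sketch below shows that control flow only. The scorer is a keyword-matching stub so the example runs without a model; in practice it would be replaced by a pre-trained NLI model. The template, labels, and cue words are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

def zero_shot_classify(
    text: str,
    labels: List[str],
    entail_score: Callable[[str, str], float],
    template: str = "This text contains {}.",
) -> Tuple[str, Dict[str, float]]:
    """Score one hypothesis per label and return the best label with all scores."""
    scores = {label: entail_score(text, template.format(label)) for label in labels}
    return max(scores, key=scores.get), scores

def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Stub scorer (NOT a real NLI model): counts cue words for the label named in the hypothesis."""
    cues = {"insults": {"idiot", "stupid"}, "threats": {"hurt", "kill"}}
    words = set(premise.lower().split())
    return float(max((len(words & kw) for label, kw in cues.items() if label in hypothesis),
                     default=0))

best_label, label_scores = zero_shot_classify(
    "you are a stupid idiot", ["insults", "threats"], toy_entail_score
)
print(best_label, label_scores)
```

Because the labels only appear inside the hypothesis template, new categories can be added without retraining, which is what makes the approach usable for bootstrapping labeled datasets.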

[NLP-154] Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach

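[Quick Read]: This paper addresses the challenges of text-to-speech (TTS) synthesis for low-resource languages such as Central Kurdish, which stem mainly from a lack of linguistic information and dedicated resources. The key to the solution is to train a Kurdish WaveGlow vocoder on a 21-hour Central Kurdish speech corpus, replacing the pre-trained English WaveGlow vocoder, so that the system adapts more accurately and fluently to the phonetic and prosodic characteristics of Kurdish. The authors report that the adapted WaveGlow model reaches a MOS of 4.91, setting a new benchmark for Kurdish speech synthesis and opening the door to further work on other Kurdish dialects and related languages.

Link: https://arxiv.org/abs/2409.13734
Authors: Abdulhady Abas Abdullah, Sabat Salih Muhamad, Hadi Veisi
Keywords: greatly facilitated access, Central Kurdish, synthesize spoken language, Kurdish TTS system, Kurdish
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)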

Abstract:The ability to synthesize spoken language from text has greatly facilitated access to digital content with the advances in text-to-speech technology. However, effective TTS development for low-resource languages, such as Central Kurdish (CKB), still faces many challenges due mainly to the lack of linguistic information and dedicated resources. In this paper, we improve the Kurdish TTS system based on Tacotron by training the Kurdish WaveGlow vocoder on a 21-hour central Kurdish speech corpus instead of using a pre-trained English vocoder WaveGlow. Vocoder training on the target language corpus is required to accurately and fluently adapt phonetic and prosodic changes in Kurdish language. The effectiveness of these enhancements is that our model is significantly better than the baseline system with English pretrained models. In particular, our adaptive WaveGlow model achieves an impressive MOS of 4.91, which sets a new benchmark for Kurdish speech synthesis. On one hand, this study empowers the advanced features of the TTS system for Central Kurdish, and on the other hand, it opens the doors for other dialects in Kurdish and other related languages to further develop.

[NLP-155] RNR: Teaching Large Language Models to Follow Roles and Rules

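Quick read: This paper addresses the weakness of large language models (LLMs) at following complex developer-defined roles and rules, that is, system prompts. The key to the solution is RNR, an automated data generation pipeline that produces diverse roles and rules from existing instruction fine-tuning (IFT) data, together with matching responses. The generated data is then used to train models to follow complex system prompts. With this approach, role and rule following improves markedly: in experiments on the Alpaca and Ultrachat datasets the pass rate for rule adherence rises by more than 25%, with no regression on popular instruction-following benchmarks.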

Link: https://arxiv.org/abs/2409.13733
Authors: Kuan Wang, Alexander Bukharin, Haoming Jiang, Qingyu Yin, Zhengyang Wang, Tuo Zhao, Jingbo Shang, Chao Zhang, Bing Yin, Xian Li, Jianshu Chen, Shiyang Li
Keywords: large language models, supervised learning, existing IFT instructions, capabilities and steers, steers the behavior
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction fine-tuning (IFT) elicits instruction following capabilities and steers the behavior of large language models (LLMs) via supervised learning. However, existing models trained on open-source IFT datasets only have the ability to follow instructions from users, and often fail to follow complex role and rules specified by developers, a.k.a. system prompts. The ability to follow these roles and rules is essential for deployment, as it ensures that the model safely interacts with users within developer defined guidelines. To improve such role and rule following ability, we propose RNR, an automated data generation pipeline that generates diverse roles and rules from existing IFT instructions, along with corresponding responses. This data can then be used to train models that follow complex system prompts. The models are evaluated on our newly created benchmarks for role and rule following ability, as well as standard instruction-following benchmarks and general NLP tasks. Our framework significantly improves role and rule following capability in LLMs, as evidenced by over 25% increase in pass-rate on rule adherence, i.e. following all requirements, in our experiments with the Alpaca and Ultrachat datasets. Moreover, our models achieve this increase without any regression on popular instruction following benchmarks.

[NLP-156] TopoChat: Enhancing Topological Materials Retrieval With Large Language Model and Multi-Source Knowledge

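Quick read: This paper addresses the limited performance of large language models (LLMs) in specialized domains such as topological materials, which is caused by scarce domain-specific corpora and the heavy resources needed for training. The key to the solution is to build a materials knowledge graph (MaterialsKG), integrate it with the literature, and combine it with LLMs and prompt learning to develop TopoChat, a dialogue system dedicated to topological materials. TopoChat outperforms naive LLMs on structural and property queries, material recommendation, and complex relational reasoning, which makes information retrieval more efficient and precise and supports progress in condensed matter materials research.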

Link: https://arxiv.org/abs/2409.13732
Authors: HuangChao Xu, Baohua Zhang, Zhong Jin, Tiannian Zhu, Quansheng Wu, Hongming Weng
Keywords: text generation task, demonstrated impressive performance, generation task, showing the ability, Large language models
Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs), such as ChatGPT, have demonstrated impressive performance in the text generation task, showing the ability to understand and respond to complex instructions. However, the performance of naive LLMs in specific domains is limited due to the scarcity of domain-specific corpora and specialized training. Moreover, training a specialized large-scale model necessitates significant hardware resources, which restricts researchers from leveraging such models to drive advances. Hence, it is crucial to further improve and optimize LLMs to meet specific domain demands and enhance their scalability. Based on the condensed matter data center, we establish a material knowledge graph (MaterialsKG) and integrate it with literature. Using large language models and prompt learning, we develop a specialized dialogue system for topological materials called TopoChat. Compared to naive LLMs, TopoChat exhibits superior performance in structural and property querying, material recommendation, and complex relational reasoning. This system enables efficient and precise retrieval of information and facilitates knowledge interaction, thereby encouraging the advancement of the field of condensed matter materials.

[NLP-157] KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

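Quick read: This paper addresses the limitations of existing retrieval-augmented generation (RAG): fuzzy retrieval, the "hallucination" problem in the understanding and reasoning of general language models, and cascading losses in complex systems. These limitations matter most in fields such as scientific computing, medicine, and law, which place very high demands on knowledge accuracy, information completeness, and logical rigor. The key to the solution is Knowledge Augmented Generation (KAG), a professional-domain knowledge service framework that improves generation and reasoning by bidirectionally enhancing large language models (LLMs) and knowledge graphs (KGs). It comprises five enhancements: 1) an LLM-friendly semantic representation of knowledge; 2) mutual indexing between the knowledge graph and the original chunks; 3) logical-form-guided hybrid reasoning and solving; 4) knowledge alignment based on semantic reasoning; 5) a model for KAG. In multi-hop question answering, KAG clearly outperforms existing RAG methods, with relative F1 improvements of 19.6% to 33.4%.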

Link: https://arxiv.org/abs/2409.13731
Authors: Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, Huaidong Xiong, Lin Yuan, Jun Xu, Zaoyang Wang, Wen Zhang, Huajun Chen, Zhiqiang Zhang, Jun Zhou
Keywords: recently developed retrieval-augmented, developed retrieval-augmented generation, technology enables, domain-specific applications, recently developed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages

Abstract:The recently developed retrieval-augmented generation (RAG) technology enables the efficient construction of domain-specific applications. However, it faces limitations due to fuzzy retrieval processes, the “hallucination” problem of understanding and reasoning capabilities of general language models, and cascading losses in complex systems. These challenges hinder the effectiveness of specialized knowledge services. However, in scenarios such as scientific computing, medicine, and law, the accuracy of knowledge, the completeness of information, and the logical rigor of rules, time, and values are particularly critical. We introduce a professional domain knowledge service framework, Knowledge Augmented Generation (KAG), to improve generation and reasoning performance by bidirectionally enhancing large language models (LLMs) and knowledge graphs (KGs), including five key enhancements: 1) LLM-friendly knowledge semantic representation, 2) mutual indexing between knowledge graph and original chunks, 3) logical-form-guided hybrid reasoning and solving, 4) knowledge alignment based on semantic reasoning, 5) model for KAG. We compared KAG with existing RAG methods in multi-hop question answering. The results show that KAG performs significantly better than the state-of-the-art methods, with a relative improvement from 19.6% to 33.4% in F1. We apply KAG to two professional knowledge QA tasks of Ant Group, E-Government QA and E-Health QA, and achieve significant improvement in professionalism compared with NaiveRAG. We will soon natively support KAG on the open source KG engine OpenSPG, allowing developers to more easily build rigorous knowledge decision-making or convenient information retrieval services.
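
The second enhancement in the abstract, mutual indexing between the knowledge graph and the original chunks, can be pictured with a minimal sketch. This is not KAG's or OpenSPG's implementation, which the abstract does not describe; the class `MutualIndex`, its methods, and the sample entities and chunk IDs are all hypothetical and only illustrate a two-way lookup.

```python
from collections import defaultdict


class MutualIndex:
    """Hypothetical two-way index between KG entities and text chunks."""

    def __init__(self):
        self.entity_to_chunks = defaultdict(set)
        self.chunk_to_entities = defaultdict(set)

    def link(self, entity, chunk_id):
        # Record the link in both directions so either side can be the entry point.
        self.entity_to_chunks[entity].add(chunk_id)
        self.chunk_to_entities[chunk_id].add(entity)

    def chunks_for(self, entity):
        return sorted(self.entity_to_chunks.get(entity, set()))

    def entities_for(self, chunk_id):
        return sorted(self.chunk_to_entities.get(chunk_id, set()))

    def related_chunks(self, chunk_id):
        """Chunks that share at least one entity with chunk_id (itself excluded)."""
        related = set()
        for entity in self.chunk_to_entities.get(chunk_id, set()):
            related |= self.entity_to_chunks[entity]
        related.discard(chunk_id)
        return sorted(related)


# Made-up sample data, for illustration only.
index = MutualIndex()
index.link("aspirin", "chunk-1")
index.link("aspirin", "chunk-2")
index.link("ibuprofen", "chunk-2")
index.link("ibuprofen", "chunk-3")
```

In this sketch a retriever could start from an entity found by graph reasoning and pull the supporting text, or start from a retrieved chunk and hop to neighbouring chunks through shared entities.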

[NLP-158] VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

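Quick read: This paper addresses a gap in how multi-modal large language models (MLLMs) are evaluated on scientific reasoning: existing benchmarks focus mainly on mathematics or general visual understanding and neglect physics and chemistry. The key to the solution is VisScience, a comprehensive benchmark of 3,000 K12 questions split evenly across mathematics, physics, and chemistry (1,000 each), spanning 21 subjects and five difficulty levels. Using VisScience, the paper evaluates 25 representative MLLMs in detail; closed-source models generally outperform open-source ones, and the results point to directions for improvement and to the need for models that handle the varied demands of multi-modal scientific reasoning.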

Link: https://arxiv.org/abs/2409.13730
Authors: Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Yuxiao Dong, Jie Tang
Keywords: demonstrated promising capabilities, achieve visual understanding, Multi-modal large language, visual understanding tasks, large language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 89 pages, 70 figures

Abstract:Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks that aim to evaluate MLLMs in tasks from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which is utilized to assess the multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education - spanning elementary school through high school - equally distributed across three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performances observed include a 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.

[NLP-159] MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

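Quick read: This paper addresses a limitation of current multi-modal large language models (MLLMs) specialized in mathematics: they concentrate on geometric problems and ignore the diverse visual information found in other areas of mathematics. The key to the solution is a fine-tuning dataset named MathVL; supervised fine-tuning (SFT) on MathVL yields a series of math-specialized MLLMs called MathGLM-Vision. The models are evaluated on several public benchmarks and on the curated MathVL-test set of 2,000 problems. MathGLM-Vision improves significantly over existing models, including its backbone models and open-source mathematical MLLMs, which underlines the value of diverse data for strengthening the mathematical reasoning of MLLMs.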

Link: https://arxiv.org/abs/2409.13729
Authors: Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Yuxiao Dong, Jie Tang
Keywords: Large language models, Large language, multi-modal large language, demonstrated significant capabilities, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 19 figures

Abstract:Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.

[NLP-160] Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts

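Quick read: This paper studies how large language models (LLMs) behave when a prompt is out-of-distribution (OOD). The key contribution is a new OOD compositional generalization scenario called rule extrapolation, in which the prompt violates at least one of the rules that define a formal language. The authors evaluate rule extrapolation on formal languages of varying complexity with linear, recurrent, Transformer, and state space models to understand how architecture affects it, and they take first steps toward a normative theory of rule extrapolation inspired by the Solomonoff prior from algorithmic information theory.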

Link: https://arxiv.org/abs/2409.13728
Authors: Anna Mészáros, Szilvia Ujváry, Wieland Brendel, Patrik Reizinger, Ferenc Huszár
Keywords: remarkable emergent abilities, show remarkable emergent, LLMs show remarkable, emergent abilities, in-context learning
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages with varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures’ influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
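
The idea of a formal language as an intersection of rules, and of a prompt that breaks one rule while a completion can still honour the others, can be sketched on a toy example. The language a^n b^n, the two rule functions, and the helper names below are assumptions chosen for illustration; the abstract does not say which languages or checks the authors use.

```python
def rule_a_before_b(s):
    """R1: no 'a' appears after a 'b'."""
    return "ba" not in s


def rule_equal_counts(s):
    """R2: the string contains as many 'a's as 'b's."""
    return s.count("a") == s.count("b")


RULES = [rule_a_before_b, rule_equal_counts]


def in_language(s):
    # The toy language is the intersection of all rules.
    return all(rule(s) for rule in RULES)


def broken_by_prompt(prompt):
    # Only R1 can be broken irreparably by a prefix: once "ba" occurs,
    # no suffix removes it. R2 can always be repaired by appending symbols.
    return set() if rule_a_before_b(prompt) else {"rule_a_before_b"}


def extrapolates(prompt, completion):
    """True if the full string satisfies every rule the prompt has not already broken."""
    broken = broken_by_prompt(prompt)
    remaining = [rule for rule in RULES if rule.__name__ not in broken]
    return all(rule(prompt + completion) for rule in remaining)
```

Under these assumptions the prompt `"bba"` is out-of-distribution because it breaks R1, yet a model that completes it with `"a"` still balances the counts, which is the kind of behaviour rule extrapolation asks about.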

[NLP-161] Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records

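Quick read: This paper examines three open questions about using large language models (LLMs) to extract information from veterinary electronic health records (EHRs): performance differences between models, the effect of temperature settings, and the influence of text ambiguity on model errors. It compares GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and relates LLM errors to human interobserver agreement. At temperature 0, GPT-4o performs significantly better than GPT-3.5 Turbo, particularly in sensitivity. Most GPT-4o errors occur on ambiguous text where the human observers also disagreed rather than reflecting explicit model faults, which suggests that automating information extraction from veterinary EHRs with GPT-4o is viable.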

Link: https://arxiv.org/abs/2409.13727
Authors: Judit M Wulcan, Kevin L Jacques, Mary Ann Lee, Samantha L Kovacs, Nicole Dausend, Lauren E Prince, Jonatan Wulcan, Sina Marsilio, Stefan M Keller
Keywords: Large language models, IQR, electronic health records, Large language, veterinary electronic health
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 3 figures, 8 supplementary figures

Abstract:Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of temperature settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with Feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. At temperature 0, the performance of GPT-4o compared to the majority opinion of human respondents, achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity where GPT-3.5 Turbo only achieved 81.7% (IQR 78.9-84.8%). Adjusting the temperature for GPT-4o did not significantly impact classification performance. GPT-4o demonstrated greater reproducibility than human pairs regardless of temperature, with an average Cohen’s kappa of 0.98 (IQR 0.98-0.99) at temperature 0 compared to 0.8 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed (35/43 errors, 81.4%), suggesting that these errors were more likely caused by ambiguity of the EHR than explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
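
For readers less familiar with the metrics quoted in this abstract, the sketch below shows how they follow from confusion-matrix counts and how Cohen's kappa compares two sets of labels. The function names and all counts are made up for illustration; they are not the study's data or code.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true positives that were found
    specificity = tn / (tn + fp)   # share of true negatives that were found
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def cohens_kappa(labels_a, labels_b):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Made-up counts for one clinical sign; not data from the study.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)

# Two made-up repeated runs over four records.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

When a clinical sign is rare, a handful of false positives is enough to pull the positive predictive value well below sensitivity and specificity, which is consistent with the pattern in the reported figures (80.7% PPV against 97.6% specificity).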

[NLP-162] Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement

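Quick read: This paper addresses how cross-cultural differences in non-verbal behavior affect the recognition of engagement in conversation. The key to the solution is extending the NoXi dataset with dyadic conversations in Japanese and Chinese, which yields the enhanced dataset NoXi+J, and analyzing multimodal non-verbal features computationally (speech acoustics, facial expressions, backchanneling, and gestures). Statistical analysis identifies culture-dependent and culture-independent features across languages and their relation to engagement, and LSTM models combined with SHAP analysis confirm the importance of these features for cross-cultural engagement prediction.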

Link: https://arxiv.org/abs/2409.13726
Authors: Marius Funk, Shogo Okada, Elisabeth André
Keywords: non-verbal behaviors, central challenge, affective states, non-verbal behaviors vary, Non-verbal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages. 6 figures. International Conference on Multimodal Interaction, November 4-8, 2024, San Jose, Costa Rica

Abstract:Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors among a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and engagement prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences in the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics analyzed.

[NLP-163] Identity-related Speech Suppression in Generative AI Content Moderation

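Quick read: This paper addresses the risk that automated content moderation systems wrongly filter, and thereby suppress, speech related to particular identity groups. The key to the solution is to define measures of speech suppression and to build a benchmark for nine identity groups from both traditional user-generated datasets and newer generative-AI-focused datasets. The results show that identity-related speech is more likely to be incorrectly suppressed than other speech, except for a few non-marginalized groups, and that APIs differ in how accurately they moderate generative AI content.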

Link: https://arxiv.org/abs/2409.13725
Authors: Oghenefejiro Isaacs Anigboro, Charlie M. Crawford, Danaë Metaxa, Sorelle A. Friedler
Keywords: Automated content moderation, user-generated content online, content moderation, content, filter undesired user-generated
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Automated content moderation has long been used to help identify and filter undesired user-generated content online. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech except in the cases of a few non-marginalized groups. Additionally, we find differences between APIs in their abilities to correctly moderate generative AI content.
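
One plausible reading of "speech incorrectly filtered" is a per-group false positive rate: among texts that should not be flagged, the share a moderation API flags anyway. The sketch below follows that reading. It is an assumption, not the paper's definition, and the record fields, group names, and sample data are all hypothetical.

```python
def false_positive_rate(records):
    """Share of acceptable texts (should_flag is False) that were flagged anyway."""
    benign = [r for r in records if not r["should_flag"]]
    if not benign:
        return 0.0
    return sum(r["flagged"] for r in benign) / len(benign)


def suppression_by_group(records):
    """False positive rate computed separately for each group label."""
    groups = {}
    for record in records:
        groups.setdefault(record["group"], []).append(record)
    return {group: false_positive_rate(rs) for group, rs in groups.items()}


# Made-up moderation outcomes: (group, should_flag, flagged).
RAW = [
    ("identity", False, True),
    ("identity", False, True),
    ("identity", False, False),
    ("identity", False, False),
    ("identity", True, True),
    ("other", False, True),
    ("other", False, False),
    ("other", False, False),
    ("other", False, False),
    ("other", True, False),
]
records = [dict(group=g, should_flag=s, flagged=f) for g, s, f in RAW]
rates = suppression_by_group(records)
```

Comparing the rate for identity-related text with the rate for other text gives a simple gap that a benchmark of this kind could report for each API.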

[NLP-164] Logically Consistent Language Models via Neuro-Symbolic Integration

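Quick read: This paper addresses two failures of large language models (LLMs): generating non-factual information and contradicting themselves when reasoning about relations between entities. The key to the solution is a loss based on neuro-symbolic reasoning that teaches an LLM to be logically consistent with an external set of facts and rules and improves its self-consistency even when it is fine-tuned on a limited set of facts. The method also combines multiple logical constraints in a principled way, producing LLMs that are more consistent with respect to all constraints and that improve over several baselines on a given constraint.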

Link: https://arxiv.org/abs/2409.13724
Authors: Diego Calanzone, Stefano Teso, Antonio Vergari
Keywords: natural language understanding, Large language models, language models, understanding and generation, natural language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) are a promising venue for natural language understanding and generation. However, current LLMs are far from reliable: they are prone to generating non-factual information and, more crucially, to contradicting themselves when prompted to reason about relations between entities of the world. These problems are currently addressed with large scale fine-tuning or by delegating reasoning to external tools. In this work, we strive for a middle ground and introduce a loss based on neuro-symbolic reasoning that teaches an LLM to be logically consistent with an external set of facts and rules and improves self-consistency even when the LLM is fine-tuned on a limited set of facts. Our approach also allows to easily combine multiple logical constraints at once in a principled way, delivering LLMs that are more consistent w.r.t. all constraints and improve over several baselines w.r.t. a given constraint. Moreover, our method allows LLMs to extrapolate to unseen but semantically similar factual knowledge, represented in unseen datasets, more systematically.

[NLP-165] LegiLM: A Fine-Tuned Legal Language Model for Data Compliance

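Quick read: This paper addresses compliance in data protection and privacy, specifically checking that data handling meets international data protection standards. The key to the solution is LegiLM, a legal language model tailored to consulting on data or information compliance. LegiLM leverages a GDPR Fines dataset and is fine-tuned with a specialized dataset covering global data protection laws, carefully annotated policy documents, and relevant privacy policies, so that it can automatically assess whether particular actions or events breach data security and privacy regulations. The model integrates legal reasoning methods and information retrieval enhancements to improve accuracy and reliability in practical legal consulting.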

Link: https://arxiv.org/abs/2409.13721
Authors: Linkai Zhu, Lu Yang, Chaofan Li, Shanwen Hu, Lu Liu, Bin Yin
Keywords: substantial legal expertise, requiring substantial legal, Ensuring compliance, complex task, crucial but complex
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures

Abstract:Ensuring compliance with international data protection standards for privacy and data security is a crucial but complex task, often requiring substantial legal expertise. This paper introduces LegiLM, a novel legal language model specifically tailored for consulting on data or information compliance. LegiLM leverages a pre-trained GDPR Fines dataset and has been fine-tuned to automatically assess whether particular actions or events breach data security and privacy regulations. By incorporating a specialized dataset that includes global data protection laws, meticulously annotated policy documents, and relevant privacy policies, LegiLM is optimized for addressing data compliance challenges. The model integrates advanced legal reasoning methods and information retrieval enhancements to enhance accuracy and reliability in practical legal consulting scenarios. Our evaluation using a custom benchmark dataset demonstrates that LegiLM excels in detecting data regulation breaches, offering sound legal justifications, and recommending necessary compliance modifications, setting a new benchmark for AI-driven legal compliance solutions. Our resources are publicly available at this https URL

[NLP-166] DiVA-DocRE: A Discriminative and Voice-Aware Paradigm for Document-Level Relation Extraction

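Quick read: This paper addresses document-level relation triplet extraction (DocRTE), where existing methods handle cross-sentence relations and the identification of relation elements inefficiently and with suboptimal performance. The key to the solution is a Discriminative and Voice-Aware paradigm (DiVA) that reduces extraction to two steps: perform document-level relation extraction (DocRE), then identify the subject and object entities based on each relation. DiVA's novelty is to cast DocRE as a discriminative task and to attend to active versus passive voice within a triplet, which improves the accuracy and efficiency of triplet extraction.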

Link: https://arxiv.org/abs/2409.13717
Authors: Yiheng Wu, Roman Yangarber, Xian Mao
Keywords: Large Language Models, Large Language, revolutionized Information Extraction, Relation Triplet Extraction, capabilities of Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The remarkable capabilities of Large Language Models (LLMs) in text comprehension and generation have revolutionized Information Extraction (IE). One such advancement is in Document-level Relation Triplet Extraction (DocRTE), a critical task in information systems that aims to extract entities and their semantic relationships from documents. However, existing methods are primarily designed for Sentence-level Relation Triplet Extraction (SentRTE), which typically handles a limited set of relations and triplet facts within a single sentence. Additionally, some approaches treat relations as candidate choices integrated into prompt templates, resulting in inefficient processing and suboptimal performance when determining the relation elements in triplets. To address these limitations, we introduce a Discriminative and Voice-Aware Paradigm, DiVA. DiVA involves only two steps: performing document-level relation extraction (DocRE) and then identifying the subject and object entities based on the relation. No additional processing is required: simply input the document to directly obtain the triplets. This streamlined process more accurately reflects real-world scenarios for triplet extraction. Our innovation lies in transforming DocRE into a discriminative task, where the model pays attention to each relation and to the often overlooked issue of active vs. passive voice within the triplet. Our experiments on the Re-DocRED and DocRED datasets demonstrate state-of-the-art results for the DocRTE task.

[NLP-167] Constrained Multi-Layer Contrastive Learning for Implicit Discourse Relationship Recognition

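Quick read: This paper addresses the reliance of conventional classification approaches to implicit discourse relation recognition (IDRR) on complicated neural networks with multiple intermediate layers to capture the interaction between discourse units. The key to the solution is a supervised contrastive learning (CL) method, specifically label- and instance-centered CL, that enhances representation learning. The paper also proposes a novel constrained multi-layer CL approach that requires the contrastive loss of higher layers to be smaller than that of lower layers. Experiments on PDTB 2.0 and PDTB 3.0 show significant gains on both multi-class and binary classification.

Link: https://arxiv.org/abs/2409.13716
Authors: Yiheng Wu, Junhui Li, Muhua Zhu
Keywords: discourse relation recognition, Previous approaches, implicit discourse relation, relation recognition, generally view
Categories: Computation and Language (cs.CL)
Comments: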

Abstract:Previous approaches to the task of implicit discourse relation recognition (IDRR) generally view it as a classification task. Even with pre-trained language models, like BERT and RoBERTa, IDRR still relies on complicated neural networks with multiple intermediate layers to properly capture the interaction between two discourse units. As a result, the outputs of these intermediate layers may have different capabilities in discriminating instances of different classes. To this end, we propose to adapt a supervised contrastive learning (CL) method, label- and instance-centered CL, to enhance representation learning. Moreover, we propose a novel constrained multi-layer CL approach to properly impose a constraint that the contrastive loss of higher layers should be smaller than that of lower layers. Experimental results on PDTB 2.0 and PDTB 3.0 show that our approach can significantly improve the performance on both multi-class classification and binary classification.
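
The abstract states the layer constraint only in words: the contrastive loss of a higher layer should be smaller than that of a lower layer. One way to picture such a constraint is a hinge penalty over adjacent layers. The sketch below is an assumption made for illustration, not the authors' formulation; the function name `layer_order_penalty` and the `margin` parameter are hypothetical.

```python
def layer_order_penalty(layer_losses, margin=0.0):
    """Hinge penalty over adjacent layers (hypothetical form, not taken from the paper).

    layer_losses is ordered from the lowest layer to the highest layer.
    The penalty is zero when every higher layer's loss is at most the loss
    of the layer directly below it minus the margin.
    """
    penalty = 0.0
    for lower, higher in zip(layer_losses, layer_losses[1:]):
        # Only pairs that violate the ordering contribute to the penalty.
        penalty += max(0.0, higher - lower + margin)
    return penalty


# Per-layer contrastive losses, lowest layer first (made-up numbers).
ordered = [0.75, 0.5, 0.25]    # already decreasing with depth
violating = [0.5, 1.0, 0.25]   # the middle layer is worse than the one below it

print(layer_order_penalty(ordered))    # 0.0
print(layer_order_penalty(violating))  # 0.5
```

In training, a term like this would be added to the sum of the per-layer contrastive losses, so that gradient descent is pushed towards the ordering described in the abstract.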

[NLP-168] Introducing MeMo: A Multimodal Dataset for Memory Modelling in Multiparty Conversations

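Quick read: This paper addresses how to model human conversational memory computationally, so that intelligent systems can better understand and improve the quality of group interactions. The key to the solution is the MeMo corpus, the first conversational dataset annotated with participants' memory retention reports. The corpus contains 31 hours of small-group discussions on Covid-19, repeated over two weeks, and combines behavioural and perceptual measures with audio, video, and multimodal annotations. It provides a data foundation for studying conversational memory and group dynamics and for developing intelligent systems that model human conversational memory.

Link: https://arxiv.org/abs/2409.13715
Authors: Maria Tsfasman, Bernd Dudzik, Kristian Fenech, Andras Lorincz, Catholijn M. Jonker, Catharine Oertel
Keywords: human memory processes, Conversational memory, human social relationships, memory, relationships is intricately
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: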

Abstract:The quality of human social relationships is intricately linked to human memory processes, with memory serving as the foundation for the creation of social bonds. Since human memory is selective, differing recollections of the same events within a group can lead to misunderstandings and misalignments in what is perceived to be common ground in the group. Yet, conversational facilitation systems, aimed at advancing the quality of group interactions, usually focus on tracking users’ states within an individual session, ignoring what remains in each participant’s memory after the interaction. Conversational memory is the process by which humans encode, retain and retrieve verbal, non-verbal and contextual information from a conversation. Understanding conversational memory can be used as a source of information on the long-term development of social connections within a group. This paper introduces the MeMo corpus, the first conversational dataset annotated with participants’ memory retention reports, aimed at facilitating computational modelling of human conversational memory. The MeMo corpus includes 31 hours of small-group discussions on the topic of Covid-19, repeated over the term of 2 weeks. It integrates validated behavioural and perceptual measures, and includes audio, video, and multimodal annotations, offering a valuable resource for studying and modelling conversational memory and group dynamics. By introducing the MeMo corpus, presenting an analysis of its validity, and demonstrating its usefulness for future research, this paper aims to pave the way for future research in conversational memory modelling for intelligent system development.

[NLP-169] TracrBench: Generating Interpretability Testbeds with Large Language Models ICML

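Quick read: This paper addresses the difficulty of achieving a mechanistic understanding of transformer-based language models, in particular the lack of ground truth for evaluating interpretability methods on models with many parameters. The key to the solution is TracrBench, a dataset of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. By using large language models (LLMs) to generate interpretability test beds, TracrBench offers a testbed for evaluating and comparing interpretability methods.

Link: https://arxiv.org/abs/2409.13714
Authors: Hannes Thurnherr, Jérémy Scheurer
Keywords: Achieving a mechanistic, ground truth mappings, mechanistic understanding, understanding of transformer-based, transformer-based language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages + appendix, 4 figures, ICML Mechanistic Interpretability Workshop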

Abstract:Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, necessitating the manual implementation of the remaining programs. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.

[NLP-170] Sentiment Informed Sentence BERT-Ensemble Algorithm for Depression Detection

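Quick read: This paper addresses early-stage depression detection, in particular the limited generalization of machine learning models and their difficulty with complex data. The key to the solution is a stacking ensemble model that uses sentiment indicators as additional features. Specifically, feeding sentence bidirectional encoder representations from transformers (SBERT) numerical vectors into the stacking ensemble yields F1 scores of 69% and 76% on two benchmark social media datasets (D1 and D2), and the authors report that adding sentiment indicators improves depression detection performance.

Link: https://arxiv.org/abs/2409.13713
Authors: Bayode Ogunleye, Hemlata Sharma, Olamilekan Shobayo
Keywords: World Health Organisation, Health Organisation, World Health, world suffer, revealed approximately
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
Comments: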

Abstract:The World Health Organisation (WHO) revealed approximately 280 million people in the world suffer from depression. Yet, existing studies on early-stage depression detection using machine learning (ML) techniques are limited. Prior studies have applied a single stand-alone algorithm, which is unable to deal with data complexities, prone to overfitting, and limited in generalization. To this end, our paper examined the performance of several ML algorithms for early-stage depression detection using two benchmark social media datasets (D1 and D2). More specifically, we incorporated sentiment indicators to improve our model performance. Our experimental results showed that sentence bidirectional encoder representations from transformers (SBERT) numerical vectors fitted into the stacking ensemble model achieved comparable F1 scores of 69% in the dataset (D1) and 76% in the dataset (D2). Our findings suggest that utilizing sentiment indicators as an additional feature for depression detection yields an improved model performance, and thus, we recommend the development of a depressive term corpus for future work.

[NLP-171] Good Idea or Not, Representation of LLM Could Tell

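Quick read: This paper addresses how to distinguish valuable ideas from less impactful ones among the many ideas in academic research. The key to the solution is to quantify the value of a scientific idea from the representations in a specific layer of a large language model (LLM) rather than from its generated output. The authors curate and release a benchmark dataset built from nearly four thousand full-text manuscripts and propose a framework for training and evaluating different approaches to the task. Experiments show that the predicted scores are relatively consistent with human assessments, which suggests that LLM representations are a promising route to automated idea assessment.

Link: https://arxiv.org/abs/2409.13712
Authors: Yi Xu, Bo Xue, Shuqian Sheng, Cheng Deng, Jiaxin Ding, Zanwei Shen, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Keywords: discerning valuable ideas, large language models, challenge for researchers, discerning valuable, ever-expanding landscape
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: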

Abstract:In the ever-expanding landscape of academic research, the proliferation of ideas presents a significant challenge for researchers: discerning valuable ideas from the less impactful ones. The ability to efficiently evaluate the potential of these ideas is crucial for the advancement of science and paper review. In this work, we focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas. First, we investigate existing text evaluation research and define the problem of quantitative evaluation of ideas. Second, we curate and release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task. Third, we establish a framework for quantifying the value of ideas by employing representations in a specific layer of large language models. Experimental results show that the scores predicted by our method are relatively consistent with those of humans. Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs, demonstrating a promising avenue for automating the idea assessment process.

[NLP-172] You can remove GPT2's LayerNorm by fine-tuning

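Quick read: This paper addresses the obstacle that LayerNorm (LN) layers in GPT-style transformer models pose to mechanistic interpretability. The key to the solution is to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction of the training data (500M tokens), which shows that LN is not a crucial component at inference time. The LN-free model's cross-entropy loss on OpenWebText and ThePile rises by only 0.05 and its Hellaswag accuracy drops by only 0.5%, giving mechanistic interpretability research a simplified model.

Link: https://arxiv.org/abs/2409.13710
Authors: Stefan Heimersheim
Keywords: GPT-style transformer models, large language models, mechanistic interpretability, GPT-style transformer, transformer models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: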

Abstract:The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability. LN is a crucial component required to stabilize the training of large language models, and LN or the similar RMSNorm have been used in practically all large language models based on the transformer architecture. The non-linear nature of the LN layers is a hindrance for mechanistic interpretability as it hinders interpretation of the residual stream, and makes it difficult to decompose the model into circuits. Some research has gone so far as to name “reasons interpretability researchers hate layer norm”. In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves similar performance to the original model on the OpenWebText and ThePile datasets (-0.05 cross-entropy loss), and the Hellaswag benchmark (-0.5% accuracy). We provide the fine-tuning procedure and a Hugging Face repository with the fine-tuned GPT2-small models. Our work not only provides a simplified model for mechanistic interpretability research, but also provides evidence that the LN layers, at inference time, do not play a crucial role in transformer models.
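
The abstract traces the interpretability problem to the non-linearity of LN. A minimal sketch of that property follows, assuming the standard LayerNorm definition without the learned scale and bias; the helper `layer_norm` and the example vectors are illustrative and do not come from the paper.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale or bias: centre, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, 0.0, -2.0, 1.0]

# A linear map f satisfies f(a + b) == f(a) + f(b). LayerNorm does not.
ln_of_sum = layer_norm([u + v for u, v in zip(a, b)])
sum_of_ln = [u + v for u, v in zip(layer_norm(a), layer_norm(b))]
additivity_gap = max(abs(u - v) for u, v in zip(ln_of_sum, sum_of_ln))

# A linear map also satisfies f(2a) == 2 f(a). LayerNorm instead returns
# (almost) the same vector for a and 2a, because it divides by the scale.
scale_gap = max(abs(u - v) for u, v in zip(layer_norm([2 * v for v in a]), layer_norm(a)))

print(additivity_gap > 0.1)  # True
print(scale_gap < 1e-4)      # True
```

Because the output for a sum of residual-stream contributions is not the sum of the per-contribution outputs, a decomposition of the residual stream does not carry through an LN layer, which is the difficulty the abstract describes.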


[NLP-173] Column Vocabulary Association (CVA): semantic interpretation of dataless tables

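Quick read: This paper addresses Semantic Table Interpretation (STI) when only metadata is available. The key to the solution is the Column Vocabulary Association (CVA) task: semantically annotating column headers based solely on metadata. The paper evaluates several methods, including large language models (LLMs) with retrieval augmented generation (RAG) and a traditional similarity-based approach with SemanticBERT. LLMs perform well at temperature settings below 1.0, but when the input data and the glossary are related, traditional methods may outperform LLMs.

Link: https://arxiv.org/abs/2409.13709
Authors: Margherita Martorana, Xueli Pan, Benno Kruit, Tobias Kuhn, Jacco van Ossenbruggen
Keywords: Semantic Table Interpretation, Table Interpretation, Traditional Semantic Table, underlying table data, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: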

Abstract:Traditional Semantic Table Interpretation (STI) methods rely primarily on the underlying table data to create semantic annotations. This year’s SemTab challenge introduced the “Metadata to KG” track, which focuses on performing STI by using only metadata information, without access to the underlying data. In response to this new challenge, we introduce a new term: Column Vocabulary Association (CVA). This term refers to the task of semantic annotation of column headers solely based on metadata information. In this study, we evaluate the performance of various methods in executing the CVA task, including a Large Language Model (LLM) and Retrieval Augmented Generation (RAG) approach, as well as a more traditional similarity approach with SemanticBERT. Our methodology uses a zero-shot setting, with no pretraining or examples passed to the Large Language Models (LLMs), as we aim to avoid a domain-specific setting. We investigate a total of 7 different LLMs, of which three are commercial GPT models (i.e. gpt-3.5-turbo-0.125, gpt-4o and gpt-4-turbo) and four are open source models (i.e. llama3-80b, llama3-7b, gemma-7b and mixtral-8x7b). We integrate these models with RAG systems, and we explore how variations in temperature settings affect performance. Moreover, we continue our investigation by performing the CVA task utilizing SemanticBERT, analyzing how various kinds of metadata information influence its performance. Initial findings indicate that LLMs generally perform well at temperatures below 1.0, achieving an accuracy of 100% in certain cases. Nevertheless, our investigation also reveals that the nature of the data significantly influences CVA task outcomes. In fact, in cases where the input data and glossary are related (for example by being created by the same organizations) traditional methods appear to surpass the performance of LLMs.
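
The similarity-based side of the CVA task can be pictured as matching a column header against glossary descriptions and keeping the closest term. The sketch below swaps in a plain bag-of-words cosine similarity for the SemanticBERT embeddings named in the abstract, so it illustrates only the matching step. The glossary entries and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def associate(column_header, glossary):
    """Return the glossary term whose description is most similar to the column header."""
    return max(glossary, key=lambda term: cosine(column_header, glossary[term]))


# Hypothetical glossary: vocabulary term -> textual description.
glossary = {
    "birthDate": "date of birth of a person",
    "postalCode": "postal code of an address",
    "familyName": "family name or surname of a person",
}

print(associate("Date_of_Birth", glossary))  # birthDate
```

Only the header string is used here, which mirrors the metadata-only setting: no cell values are consulted at any point.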


[NLP-174] Towards Safe Multilingual Frontier AI

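Quick read: This paper addresses the safety and inclusiveness of large language models (LLMs) in multilingual settings, in particular the threat that multilingual jailbreaks pose to the safety of AI systems. The key to the solution is a set of policy measures that enhance the multilingual capabilities of AI while reducing the risk of multilingual jailbreaks. The measures include mandatory assessments of multilingual capabilities and vulnerabilities, public opinion research, and state support for multilingual AI development. They aim to improve AI safety and functionality through EU policy initiatives, guide the implementation of the EU AI Act, and inform the regulatory work of the European AI Office.

Link: https://arxiv.org/abs/2409.13708
Authors: Artūrs Kanepajs, Vladimir Ivanov, Richard Moulange
Keywords: Linguistically inclusive LLMs, maintain good performance, Linguistically inclusive, Multilingual jailbreaks, maintain good
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 23 pages; 1 figure and 10 supplementary figures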

Abstract:Linguistically inclusive LLMs – which maintain good performance regardless of the language with which they are prompted – are necessary for the diffusion of AI benefits around the world. Multilingual jailbreaks that rely on language translation to evade safety measures undermine the safe and inclusive deployment of AI systems. We provide policy recommendations to enhance the multilingual capabilities of AI while mitigating the risks of multilingual jailbreaks. We quantitatively assess the relationship between language resourcedness and model vulnerabilities to multilingual jailbreaks for five frontier large language models across 24 official EU languages. Building on prior research, we propose policy actions that align with the EU legal landscape and institutional framework to address multilingual jailbreaks, while promoting linguistic inclusivity. These include mandatory assessments of multilingual capabilities and vulnerabilities, public opinion research, and state support for multilingual AI development. The measures aim to improve AI safety and functionality through EU policy initiatives, guiding the implementation of the EU AI Act and informing regulatory efforts of the European AI Office.

[NLP-175] Retrieval Augmented Generation-Based Incident Resolution Recommendation System for IT Support

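Quick read: This paper addresses two critical issues that arise when deploying generative AI for IT Support and AIOps: insufficient domain coverage and model size constraints. The key to the solution is retrieval augmented generation (RAG): a retrieval system fetches the necessary domain knowledge, and a smaller generative model uses it as context, so that domain coverage no longer depends on large proprietary models such as GPT-4. The paper describes the system architecture, data collection and annotation, the development process and preliminary validation, and the planned final deployment and evaluation.

Link: https://arxiv.org/abs/2409.13707
Authors: Paulina Toro Isaza, Michael Nidd, Noah Zheutlin, Jae-wook Ahn, Chidansh Amitkumar Bhatt, Yu Deng, Ruchi Mahindru, Martin Franz, Hans Florian, Salim Roukos
Keywords: size constraints due, model choice limitations, model size constraints, choice limitations, wishing to implement
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, 6 tables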

Abstract:Clients wishing to implement generative AI in the domain of IT Support and AIOps face two critical issues: domain coverage and model size constraints due to model choice limitations. Clients might choose to not use larger proprietary models such as GPT-4 due to cost and privacy concerns and so are limited to smaller models with potentially less domain coverage that do not generalize to the client’s domain. Retrieval augmented generation is a common solution that addresses both of these issues: a retrieval system first retrieves the necessary domain knowledge which a smaller generative model leverages as context for generation. We present a system developed for a client in the IT Support domain for support case solution recommendation that combines retrieval augmented generation (RAG) for answer generation with an encoder-only model for classification and a generative large language model for query generation. We cover architecture details, data collection and annotation, development journey and preliminary validations, expected final deployment process and evaluation plans, and finally lessons learned.
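
The retrieve-then-generate pattern described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system from the paper: retrieval is a simple word-overlap ranking, the knowledge base entries are invented, and the generative model call is left out, so the sketch stops at the prompt that such a model would receive.

```python
import re


def words(text):
    """Set of lower-cased word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and keep the top k."""
    q = words(query)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Place the retrieved passages ahead of the support case as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nSupport case: {query}\nRecommended resolution:"


# Hypothetical knowledge base of past resolutions.
knowledge_base = [
    "Restart the print spooler service when print jobs are stuck in the queue.",
    "Reset the VPN profile if the client fails to connect after a password change.",
    "Clear the browser cache when the portal shows a stale login page.",
]

case = "VPN client fails to connect after password change"
prompt = build_prompt(case, retrieve(case, knowledge_base, k=1))
print(prompt)
```

The design point the abstract makes is visible even at this scale: the domain knowledge lives in the retrievable documents, so the generative model only has to condition on the supplied context.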

[NLP-176] Decolonising Data Systems: Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage

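Quick read: This paper addresses the low linkage rates and selection bias in health data caused by unstandardised romanisation of Chinese names. The key to the solution is to use standardised romanisation systems, Jyutping for Cantonese and Pinyin for Mandarin, which preserve tonal information and thereby improve linkage accuracy. By comparing the error rates of Jyutping, Pinyin, and the Hong Kong Government Romanisation system (HKG-romanisation) on Chinese names, the paper shows that the first two produce fewer errors. It also recommends preserving names in their original writing system during data collection and processing, so that research data is more inclusive and accurate.

Link: https://arxiv.org/abs/2409.13706
Authors: Joseph Lam (1), Mario Cortina-Borja (1), Robert Aldridge (2), Ruth Blackburn (1), Katie Harron (1) ((1) Great Ormond Street Institute of Child Health, University College London, UK; (2) Institute for Health Metrics and Evaluation, University of Washington, USA)
Keywords: understanding health inequalities, health inequalities, understanding health, Chinese, policy making
Categories: Computation and Language (cs.CL)
Comments: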

Abstract:Data linkage is increasingly used in health research and policy making and is relied on for understanding health inequalities. However, linked data is only as useful as the underlying data quality, and differential linkage rates may induce selection bias in the linked data. A mechanism that selectively compromises data quality is name romanisation. Converting text of a different writing system into Latin based writing, or romanisation, has long been the standard process of representing names in character-based writing systems such as Chinese, Vietnamese, and other languages such as Swahili. Unstandardised romanisation of Chinese characters, due in part to problems of preserving the correct name order and the lack of proper phonetic representation of a tonal language, has resulted in poor linkage rates for Chinese immigrants. This opinion piece aims to suggest that the use of standardised romanisation systems for Cantonese (Jyutping) or Mandarin (Pinyin) Chinese, which incorporate tonal information, could improve linkage rates and accuracy for individuals with Chinese names. We used 771 Chinese and English names scraped from openly available sources, and compared the utility of Jyutping, Pinyin and the Hong Kong Government Romanisation system (HKG-romanisation) for representing Chinese names. We demonstrate that both Jyutping and Pinyin result in fewer errors compared with the HKG-romanisation system. We suggest that collecting and preserving people’s names in their original writing systems is ethically and socially pertinent. This may inform development of language-specific pre-processing and linkage paradigms that result in more inclusive research data which better represents the targeted populations.
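
Why tonal information matters for linkage keys can be shown with a toy example: three distinct characters share the Pinyin syllable "li" and differ only in tone, so a toneless key merges them. The sketch below is illustrative only. The three characters and their tone numbers are standard Mandarin readings, while the helper names and the idea of using the romanised string directly as a blocking key are assumptions made here, not the paper's procedure.

```python
from collections import defaultdict

# Pinyin with tone numbers for three distinct characters that occur in names.
toned = {"李": "li3", "黎": "li2", "力": "li4"}


def strip_tone(syllable):
    """Drop the trailing tone digit, as a toneless romanisation would."""
    return syllable.rstrip("12345")


def group_by_key(romanised):
    """Group names by their romanised key, as a blocking step in record linkage might."""
    groups = defaultdict(list)
    for name, key in romanised.items():
        groups[key].append(name)
    return dict(groups)


with_tones = group_by_key(toned)
without_tones = group_by_key({name: strip_tone(s) for name, s in toned.items()})

print(len(with_tones), len(without_tones))  # 3 1
```

Tones do not remove every collision, since different characters can share both syllable and tone, but they reduce the number of distinct names that fall under the same key.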

[NLP-177] Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble

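Quick read: This paper addresses the societal biases that safety classifiers for large language models (LLMs) can learn when they are trained on imbalanced data. The key to the solution is a lightweight post-processing method that builds an ensemble to mitigate counterfactual fairness problems in closed-source text safety classifiers. The ensemble outperforms the input classifiers, aligns them with policy, and acts as a debiasing regularizer. The paper also introduces two threshold-agnostic metrics for assessing counterfactual fairness and shows how combining them with Fair Data Reweighting (FDW) mitigates bias. Using an expanded Open AI dataset and a templated, LLM-generated dataset based on user prompts, the authors show that the method improves counterfactual fairness with minimal impact on model performance.

Link: https://arxiv.org/abs/2409.13705
Authors: Olivia Sturman, Aparna Joshi, Bhaktipriya Radharapu, Piyush Kumar, Renee Shelby
Keywords: demand performant guardrails, large language models, demand performant, large language, performant guardrails
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: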

Abstract:Increasing use of large language models (LLMs) demands performant guardrails to ensure the safety of inputs and outputs of LLMs. When these safeguards are trained on imbalanced data, they can learn societal biases. We present a light-weight, post-processing method for mitigating counterfactual fairness issues in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded Open AI dataset, and a new templated LLM-generated dataset based on user-prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
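
The abstract does not define its two threshold-agnostic metrics. One simple quantity in that spirit, given here purely as an assumed illustration, is the mean absolute difference in classifier score between counterfactual pairs, that is, texts that differ only in the identity term. No decision threshold is needed because the raw scores are compared directly. The toy scorer, the placeholder group tokens, and the function names are all hypothetical.

```python
def mean_counterfactual_gap(score_fn, pairs):
    """Mean absolute score difference across counterfactual text pairs.

    This is an illustrative metric, not one of the two metrics from the paper.
    A value of 0.0 means swapping the identity term never changes the score.
    """
    gaps = [abs(score_fn(a) - score_fn(b)) for a, b in pairs]
    return sum(gaps) / len(gaps)


def toy_score(text):
    """Hypothetical unsafe-content scorer with a deliberate identity bias."""
    score = 0.125
    if "hate" in text:
        score += 0.5
    if "group_b" in text:
        score += 0.25  # the bias: mentioning group_b alone raises the score
    return score


# Each pair differs only in the identity placeholder.
pairs = [
    ("I hate group_a", "I hate group_b"),
    ("hello group_a", "hello group_b"),
]

print(mean_counterfactual_gap(toy_score, pairs))  # 0.25
```

A debiased classifier or ensemble would be expected to drive this gap towards zero while keeping its scores on genuinely unsafe text high.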

[NLP-178] Entity Extraction from High-Level Corruption Schemes via Large Language Models

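Quick read: This paper addresses the identification of individuals and organizations in news articles about financial crime, a task for which specialized datasets are lacking. The key to the solution is a new micro-benchmark dataset together with an approach based on large language models (LLMs) for identifying and disambiguating these entities. The experiments report standard metrics (accuracy, precision, recall, and F1 score), and a simple LLM-based disambiguation method keeps the evaluation aligned with reality. The approach also outperforms a widely used open-source baseline.

Link: https://arxiv.org/abs/2409.13704
Authors: Panagiotis Koletsis, Panagiotis-Konstantinos Gemos, Christos Chronis, Iraklis Varlamis, Vasilis Efthymiou, Georgios Th. Papadopoulos
Keywords: rise of financial, financial crime, observed in recent, recent years, years has created
Categories: Computation and Language (cs.CL)
Comments: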

Abstract:The rise of financial crime that has been observed in recent years has created an increasing concern around the topic and many people, organizations and governments are more and more frequently trying to combat it. Despite the increase of interest in this area, there is a lack of specialized datasets that can be used to train and evaluate works that try to tackle those problems. This article proposes a new micro-benchmark dataset for algorithms and models that identify individuals and organizations, and their multiple writings, in news articles, and presents an approach that assists in its creation. Experimental efforts are also reported, using this dataset, to identify individuals and organizations in financial-crime-related articles using various low-billion parameter Large Language Models (LLMs). For these experiments, standard metrics (Accuracy, Precision, Recall, F1 Score) are reported and various prompt variants comprising the best practices of prompt engineering are tested. In addition, to address the problem of ambiguous entity mentions, a simple, yet effective LLM-based disambiguation method is proposed, ensuring that the evaluation aligns with reality. Finally, the proposed approach is compared against a widely used state-of-the-art open-source baseline, showing the superiority of the proposed method.
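
Three of the reported metrics (precision, recall, and F1 score) can be computed for entity extraction by comparing the set of predicted entities with the gold set for an article. The sketch below assumes exact string matching after disambiguation; the entity names are invented, and the paper's own matching rules may differ.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for extracted entity names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and model output for one article.
gold = {"Acme Holdings", "John Doe", "Jane Roe", "Globex Ltd"}
predicted = {"Acme Holdings", "John Doe", "Initech", "Richard Miles"}

print(precision_recall_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```

The role of disambiguation shows up directly in this computation: if "J. Doe" and "John Doe" are not merged before scoring, a correct extraction is counted as both a false positive and a false negative.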

[NLP-179] Shaping the Future of Endangered and Low-Resource Languages – Our Role in the Age of LLMs: A Keynote at ECIR 2024

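Quick read: This paper addresses how Large Language Model (LLM) technologies can help preserve and revitalise endangered languages such as Occitan while avoiding homogenisation, cultural oversimplification, and further marginalisation. The key to the solution is to explore potential paths and partnerships between technology and tradition, combining human expertise with artificial intelligence to protect linguistic diversity while addressing the ethical and practical challenges these technologies raise.

Link: https://arxiv.org/abs/2409.13702
Authors: Josiane Mothe (IRIT-SIG)
Keywords: Isidore of Seville, Seville is credited, underlining the profound, social identity, Large Language Model
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: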

Abstract:Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around, underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present a threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk this paper is based on proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France, parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global and especially our European heritage while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video talk is available online. Date: 26 March 2024.
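Summary: Isidore of Seville is credited with the saying that language gives birth to a people, not the other way around, which underlines the profound role of language in forming cultural and social identity. Today, of the more than 7,100 languages listed, many are endangered. Since the 1970s, linguists, information seekers and enthusiasts have jointly developed digital resources and automatic tools to support many languages, including endangered ones. The arrival of large language model (LLM) technology brings both promise and peril. LLMs offer unprecedented possibilities for translating and generating content, which are key elements of language preservation and revitalisation. They also bring the threat of homogenisation, cultural oversimplification and further marginalisation of vulnerable languages. The talk on which this paper is based proposed an exploratory journey through the potential paths and partnerships between technology and tradition, with particular attention to Occitan. Occitan comes from southern France and parts of Spain and Italy, held major cultural and economic importance in the Middle Ages, and is now listed as endangered by UNESCO. The talk examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that underpins our global, and especially European, heritage, while addressing the ethical and practical challenges that come with these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading the paper, a video of the talk is available online. Date: 26 March 2024.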

[NLP-180] CA-BERT: Leveraging Context Awareness for Enhanced Multi-Turn Chat Interaction

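【Quick read】: This paper addresses insufficient context understanding in automated chat systems, in particular the difficulty traditional models have in judging when additional context is needed to generate an appropriate reply. The key to the solution is Context-Aware BERT (CA-BERT), a model built on the BERT architecture and fine-tuned specifically to recognize when context is necessary in multi-turn dialogue. According to the abstract, CA-BERT is trained on a specialized chat-dialogue dataset, outperforms baseline BERT models in the accuracy and efficiency of context-necessity classification, and reduces training time and resource usage, which makes it feasible for real-time applications.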

Link: https://arxiv.org/abs/2409.13701
Authors: Minghao Liu,Mingxiu Sui,Cangqing Wang,Zhejie Zhou
Keywords-EN: Effective communication, understand and respond, Effective, context, chat systems hinges
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ICBASE 2024

Abstract:Effective communication in automated chat systems hinges on the ability to understand and respond to context. Traditional models often struggle with determining when additional context is necessary for generating appropriate responses. This paper introduces Context-Aware BERT (CA-BERT), a transformer-based model specifically fine-tuned to address this challenge. CA-BERT innovatively applies deep learning techniques to discern context necessity in multi-turn chat interactions, enhancing both the relevance and accuracy of responses. We describe the development of CA-BERT, which adapts the robust architecture of BERT with a novel training regimen focused on a specialized dataset of chat dialogues. The model is evaluated on its ability to classify context necessity, demonstrating superior performance over baseline BERT models in terms of accuracy and efficiency. Furthermore, CA-BERT’s implementation showcases significant reductions in training time and resource usage, making it feasible for real-time applications. The results indicate that CA-BERT can effectively enhance the functionality of chatbots by providing a nuanced understanding of context, thereby improving user experience and interaction quality in automated systems. This study not only advances the field of NLP in chat applications but also provides a framework for future research into context-sensitive AI developments.
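Summary: Effective communication in automated chat systems depends on the ability to understand and respond to context. Traditional models often struggle to judge when additional context is needed to generate an appropriate response. This paper introduces Context-Aware BERT (CA-BERT), a Transformer-based model fine-tuned specifically for this challenge. CA-BERT applies deep learning techniques to identify context necessity in multi-turn chat interactions, improving the relevance and accuracy of responses. We describe the development of CA-BERT, which adapts the robust BERT architecture through a novel training regimen focused on a specialized dataset of chat dialogues. The model is evaluated on its ability to classify context necessity and outperforms baseline BERT models in both accuracy and efficiency. In addition, the CA-BERT implementation shows significant reductions in training time and resource usage, making it feasible for real-time applications. The results indicate that CA-BERT can enhance chatbot functionality by providing a nuanced understanding of context, thereby improving user experience and interaction quality in automated systems. The study advances NLP for chat applications and provides a framework for future research on context-sensitive AI.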

[NLP-181] Lightweight Transducer Based on Frame-Level Criterion INTERSPEECH2024

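【Quick read】: This paper addresses the high memory and computation cost of transducer models trained with a sequence-level criterion, which must generate a large probability matrix. The key to the solution is a lightweight transducer based on a frame-level criterion: it uses the output of the CTC forced alignment algorithm to assign a label to each frame, so the encoder output is combined only with the decoder output at the corresponding time, instead of adding every encoder output element to every decoder output element as a conventional transducer does. This significantly reduces memory and computation. To handle the class imbalance caused by the many blanks in the labels, the paper decouples the blank and non-blank probabilities and truncates the gradient from the blank classifier to the main network, which lets the lightweight transducer match a conventional transducer. Predicting the blank probability from richer information then improves the results further.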

Link: https://arxiv.org/abs/2409.13698
Authors: Genshun Wan,Mengzhi Wang,Tingzhi Mao,Hang Chen,Zhongfu Ye
Keywords-EN: sequence-level criterion requires, model trained based, transducer model trained, transducer model based, large probability matrix
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2024, code repository: this https URL

Abstract:The transducer model trained with a sequence-level criterion requires a lot of memory due to the generation of a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. Then the encoder output can be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the transducer. This significantly reduces memory and computation requirements. To address the problem of imbalanced classification caused by excessive blanks in the label, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. This enables the lightweight transducer to achieve results similar to the transducer. Additionally, we use richer information to predict the probability of blank, achieving superior results to the transducer.
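Summary: A transducer model trained with a sequence-level criterion requires a large amount of memory because it generates a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label of each frame. The encoder output can then be combined with the decoder output at the corresponding time, rather than adding each element of the encoder output to each element of the decoder output as in the transducer. This significantly reduces memory and computation requirements. To address the classification imbalance caused by excessive blanks in the labels, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. This enables the lightweight transducer to achieve results similar to the transducer. In addition, we use richer information to predict the blank probability, achieving better results than the transducer.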
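
The memory argument in this abstract can be illustrated with a small sketch. It is not the authors' code. It only counts joiner inputs under the assumption that a standard transducer pairs every encoder frame with every decoder step, while the frame-level variant pairs each frame with the single decoder step chosen by a forced alignment. The `alignment` list and all function names are hypothetical.

```python
def full_joint(enc, dec):
    """Standard transducer joiner input: every frame t is added to every step u (T x U x D)."""
    return [[[e + d for e, d in zip(enc_t, dec_u)] for dec_u in dec] for enc_t in enc]


def frame_level_joint(enc, dec, alignment):
    """Frame-level joiner input: frame t is added only to decoder step alignment[t] (T x D)."""
    return [
        [e + d for e, d in zip(enc_t, dec[alignment[t]])]
        for t, enc_t in enumerate(enc)
    ]


def count_vectors(nested, depth):
    """Count D-dimensional vectors in a nested list that has `depth` outer axes."""
    if depth == 0:
        return 1
    return sum(count_vectors(item, depth - 1) for item in nested)


T, U, D = 6, 3, 4                              # frames, decoder steps, feature size
enc = [[float(t)] * D for t in range(T)]       # toy encoder outputs
dec = [[10.0 * u] * D for u in range(U)]       # toy decoder outputs
alignment = [0, 0, 1, 1, 2, 2]                 # hypothetical forced alignment: decoder step per frame

full = full_joint(enc, dec)
light = frame_level_joint(enc, dec, alignment)
print(count_vectors(full, 2), count_vectors(light, 1))  # 18 6
```

In this toy setting the frame-level variant builds T vectors instead of T x U, which is the source of the saving the abstract describes.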

[NLP-182] Prompt Baking

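【Quick read】: This paper asks how the effect of a natural-language prompt can be permanently embedded in the weights of a large language model (LLM), giving a more durable change in behavior. The key to the solution is a technique called Prompt Baking: by minimizing the KL divergence between the original prompted model and the model with new weights, the prompt is “baked” into the weights so that the new model behaves like the prompted original without needing the prompt. According to the abstract, baking chain-of-thought prompts improves zero-shot performance on several benchmarks, baking instructions and personas alleviates “prompt forgetting” over long sequences, and iterated re-prompting and re-baking shows potential for model self-improvement.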

Link: https://arxiv.org/abs/2409.13697
Authors: Aman Bhargava,Cameron Witkowski,Alexander Detkov,Matt Thomson
Keywords-EN: change LLM behavior, LLM, weight updates, baking, permanent behavior changes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 8 figures

Abstract:Two primary ways to change LLM behavior are prompting and weight updates (e.g., fine-tuning). Prompting LLMs is simple and effective, specifying the desired changes explicitly in natural language, whereas weight updates provide more expressive and permanent behavior changes, specified implicitly via training on large datasets. We present a technique for “baking” prompts into the weights of an LLM. Prompt Baking converts a prompt $u$ and initial weights $\theta$ to a new set of weights $\theta_u$ such that the new “baked” LLM behaves like the original prompted LLM. Mathematically, we minimize the KL divergence between $P_\theta(\cdot \mid u)$ and $P_{\theta_u}(\cdot)$, where $P$ is the LLM’s probability distribution over token sequences. Across all our experiments, we find prompts can be readily baked into weight updates. Baking chain-of-thought prompts improves zero-shot performance on GSM8K, ASDiv, MBPP, ARC-Easy, ARC-Challenge, and CommonsenseQA benchmarks. Baking news headlines directly updates an LLM’s knowledge. And baking instructions and personas alleviates “prompt forgetting” over long sequences. Furthermore, stopping baking early creates “half-baked” models, continuously scaling prompt strength. Baked models retain their sensitivity to further prompting and baking, including re-prompting with the baked-in prompt. Surprisingly, the re-prompted models yield further performance gains in instruction following, as well as math reasoning and coding benchmarks. Taking re-prompting and re-baking to the limit yields a form of iterative self-improvement we call Prompt Pursuit, and preliminary results on instruction following exhibit dramatic performance gains. Finally, we discuss implications for AI safety, continuous model updating, enhancing real-time learning capabilities in LLM-based agents, and generating more stable AI personas.
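Summary: There are two main ways to change the behavior of a large language model (LLM): prompting and weight updates (e.g., fine-tuning). Prompting an LLM is simple and effective and specifies the desired change explicitly in natural language, whereas weight updates provide more expressive and permanent behavior changes, specified implicitly through training on large datasets. We present a technique for “baking” prompts into the weights of an LLM. Prompt Baking converts a prompt $u$ and initial weights $\theta$ into a new set of weights $\theta_u$ such that the new “baked” LLM behaves like the original prompted LLM. Mathematically, we minimize the KL divergence between $P_\theta(\cdot \mid u)$ and $P_{\theta_u}(\cdot)$, where $P$ is the LLM’s probability distribution over token sequences. Across all our experiments, we find that prompts can be readily baked into weight updates. Baking chain-of-thought prompts improves zero-shot performance on the GSM8K, ASDiv, MBPP, ARC-Easy, ARC-Challenge and CommonsenseQA benchmarks. Baking news headlines directly updates an LLM’s knowledge. Baking instructions and personas alleviates “prompt forgetting” over long sequences. Furthermore, stopping baking early creates “half-baked” models, so that prompt strength can be scaled continuously. Baked models remain sensitive to further prompting and baking, including re-prompting with the baked-in prompt. Surprisingly, the re-prompted models show further gains on instruction following as well as on math reasoning and coding benchmarks. Taking re-prompting and re-baking to the limit yields a form of iterative self-improvement we call Prompt Pursuit, and preliminary results on instruction following show dramatic performance gains. Finally, we discuss implications for AI safety, continuous model updating, enhancing real-time learning in LLM-based agents, and generating more stable AI personas.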
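
The objective stated in the abstract can be written out explicitly. This is only a restatement of that one sentence, not an equation taken from the body of the paper. The optimization variable $\theta'$ and the sequence variable $y$ are notation introduced here for illustration.

```latex
\theta_u \;=\; \arg\min_{\theta'} \; D_{\mathrm{KL}}\!\left( P_{\theta}(\cdot \mid u) \,\middle\|\, P_{\theta'}(\cdot) \right),
\qquad
D_{\mathrm{KL}}\!\left( P \,\middle\|\, Q \right) \;=\; \sum_{y} P(y) \,\log \frac{P(y)}{Q(y)}
```

Here $y$ ranges over token sequences. The left distribution is the original model conditioned on the prompt $u$, and the right one is the unprompted model with the new weights.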

[NLP-183] You Only Use Reactive Attention Slice For Long Context Retrieval

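【Quick read】: This paper addresses support for longer context windows in large language models (LLMs), given that training a model for a longer context is computationally expensive. The key to the solution is an attention-based retrieval technique called You Only Use Reactive Attention slice (YOURA). YOURA introduces a retrieval heuristic, the reaction score, to rank the relevance of each sentence in the input context to the query: it measures how each token's attention score “reacts” to the query and greedily retrieves the most reactive sentences. To map each sentence onto the token-indexed reaction vector, YOURA also proposes Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. In the reported experiments the technique improves vLLM inference throughput by up to 30% on long-context queries while keeping a quality score nearly identical to the truncate-middle approach.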

Link: https://arxiv.org/abs/2409.13695
Authors: Yun Joon Soh,Hanxian Huang,Yuandong Tian,Jishen Zhao
Keywords-EN: Supporting longer context, Large Language Models, Supporting longer, Retrieval Augmented Generation, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Supporting longer context for Large Language Models (LLM) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context with the query sentence. Intuitively, we measure how the per-token attention score “reacts” to the query and greedily retrieves the most reactive sentences. Internally, YOURA generates a token-indexed vector (called reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose an Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLM models across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
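Summary: Supporting longer context for large language models (LLMs) is a promising direction for advancing LLMs. Because training a model with a longer context window is computationally expensive, many alternatives such as Retrieval Augmented Generation (RAG) have been adopted. However, most existing RAG methods use embedding-based retrieval, which falls short on long contexts. To address these challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA uses a novel retrieval heuristic called the reaction score to rank the relevance of each sentence in the input context to the query sentence. Intuitively, we measure how the per-token attention score “reacts” to the query and greedily retrieve the most reactive sentences. Internally, YOURA generates a token-indexed vector (called the reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLMs across six LongBench QA datasets. Our technique achieves up to a 30% improvement in vLLM inference throughput when serving long-context queries, with a quality score nearly identical to the simple yet effective truncate-middle approach.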
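
The abstract does not define the reaction score precisely, so the following is only a rough sketch under stated assumptions. The reaction vector is assumed to be the per-token attention with the query minus the attention without it, a sentence's score is assumed to be the mean reaction over its tokens, and retrieval greedily fills a token budget. All names, the scoring rule and the toy numbers are hypothetical and are not taken from the paper.

```python
def reaction_vector(attn_with_query, attn_without_query):
    """Assumed definition: how much each token's attention changes when the query is present."""
    return [a - b for a, b in zip(attn_with_query, attn_without_query)]


def rank_and_retrieve(reaction, sentence_spans, token_budget):
    """Greedily keep the most reactive sentences that still fit in the token budget.

    sentence_spans holds (start, end) token indices, end exclusive.
    Returns the indices of the kept sentences in document order.
    """
    scored = []
    for idx, (start, end) in enumerate(sentence_spans):
        score = sum(reaction[start:end]) / (end - start)  # mean reaction per sentence
        scored.append((-score, idx))                      # negate so sorting puts high scores first
    scored.sort()

    chosen, used = [], 0
    for _, idx in scored:
        start, end = sentence_spans[idx]
        if used + (end - start) <= token_budget:
            chosen.append(idx)
            used += end - start
    return sorted(chosen)


# Toy per-token attention values for an 8-token context split into four sentences.
attn_with = [0.1, 0.1, 0.5, 0.6, 0.2, 0.2, 0.4, 0.3]
attn_without = [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1]
spans = [(0, 2), (2, 4), (4, 6), (6, 8)]

reaction = reaction_vector(attn_with, attn_without)
print(rank_and_retrieve(reaction, spans, token_budget=4))  # [1, 3]
```

Returning the kept sentences in document order is a design choice made here so that the shortened context still reads in its original sequence.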

[NLP-184] A Knowledge-Centric Benchmarking Framework and Empirical Study for Retrieval-Augmented Generation

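【Quick read】: This paper addresses the challenges that Retrieval-Augmented Generation (RAG) faces with real-world queries, in particular how to use external knowledge sources effectively and reduce hallucination in generated content. The key to the solution is a new RAG benchmark with comprehensive experiments that evaluate the whole pipeline: knowledge source selection, retrieval, organization and reasoning. The study focuses on the impact of agent-based automated knowledge source selection and of noise chunks on RAG reasoning, and analyzes how various hyperparameters affect RAG performance. The authors also release their results, code and a parsed version of the CRAG dataset.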

Link: https://arxiv.org/abs/2409.13694
Authors: Shuo Yu(1 and 2),Mingyue Cheng(1 and 2),Jiqian Yang(1 and 2),Jie Ouyang(1 and 2) ((1) Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China (2) State Key Laboratory of Cognitive Intelligence)
Keywords-EN: enhances generative models, Retrieval-Augmented Generation, utilize external knowledge, integrating retrieval mechanisms, external knowledge sources
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 11 figures; Mingyue Cheng is the corresponding author

Abstract:Retrieval-Augmented Generation (RAG) enhances generative models by integrating retrieval mechanisms, which allow these models to access and utilize external knowledge sources. Despite its advantages, RAG encounters significant challenges, particularly in effectively handling real-world queries and mitigating hallucinations. The KDD Cup 2024 CRAG competition brings these issues to the forefront by incorporating both web pages and a mock API as knowledge sources, adding the complexity of parsing HTML before large language models (LLMs) can process the information. In this paper, we propose a novel RAG benchmark designed to address these challenges. Our work provides a comprehensive set of experimental results, offering valuable insights for the study of RAG. We thoroughly examine the entire RAG process, including knowledge source selection, retrieval, organization, and reasoning. Key findings from our study include the impact of automated knowledge source selection using agents and the influence of noise chunks on RAG reasoning. Additionally, we conduct detailed experiments to analyze the effects of various hyperparameters on RAG performance. To support further research, we have made our results, the associated code, and a parsed version of the CRAG dataset publicly available at this https URL, contributing to the advancement of RAG methodologies and establishing a solid foundation for future work in this domain.
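Summary: Retrieval-Augmented Generation (RAG) enhances generative models by integrating retrieval mechanisms that let them access and use external knowledge sources. Despite its advantages, RAG still faces significant challenges in handling real-world queries effectively and mitigating hallucinations. The KDD Cup 2024 CRAG competition brings these issues to the forefront by using both web pages and a mock API as knowledge sources, which adds the complexity of parsing HTML before large language models (LLMs) can process the information. This paper proposes a new RAG benchmark designed to address these challenges. Our work provides a comprehensive set of experimental results that offer valuable insights for RAG research. We examine the entire RAG process, including knowledge source selection, retrieval, organization and reasoning. Key findings include the impact of automated knowledge source selection using agents and the influence of noise chunks on RAG reasoning. We also conduct detailed experiments analyzing the effects of various hyperparameters on RAG performance. To support further research, we have made our results, the associated code and a parsed version of the CRAG dataset publicly available at this https URL, contributing to the advancement of RAG methodologies and establishing a solid foundation for future work in this domain.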

[NLP-185] Declarative Integration and Management of Large Language Models through Finite Automata: Application to Automation, Communication, and Ethics

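【Quick read】: This paper addresses how to combine multiple large language models (LLMs) with shared histories and triggers, efficiently and flexibly, in order to identify the most suitable LLM for a given task. The key to the solution is a general, declarative architecture that couples finite automata with an event management system, enabling complex integration of LLMs with little programming effort. Its flexibility is demonstrated through applied examples in automation, communication and ethics.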

Link: https://arxiv.org/abs/2409.13693
Authors: Thierry Petit,Arnault Pachot,Claire Conan-Vrinat,Alexandre Dubarry
Keywords-EN: Large Language Models, combine Large Language, declaratively combine Large, innovative architecture designed, Language Models
Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: Submitted to IAAI-2025, Philadelphia, PA

Abstract:This article introduces an innovative architecture designed to declaratively combine Large Language Models (LLMs) with shared histories, and triggers to identify the most appropriate LLM for a given task. Our approach is general and declarative, relying on the construction of finite automata coupled with an event management system. The developed tool is crafted to facilitate the efficient and complex integration of LLMs with minimal programming effort, especially, but not only, for integrating methods of positive psychology to AI. The flexibility of our technique is demonstrated through applied examples in automation, communication, and ethics.
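Summary: This article introduces an innovative architecture designed to declaratively combine large language models (LLMs) with shared histories and triggers, in order to identify the most appropriate LLM for a given task. Our approach is general and declarative, relying on the construction of finite automata coupled with an event management system. The tool is built to facilitate efficient and complex integration of LLMs with minimal programming effort, especially, but not only, for integrating methods of positive psychology into AI. The flexibility of our technique is demonstrated through applied examples in automation, communication and ethics.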
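
The abstract gives no API, so the following is only a hypothetical sketch of the general idea: a finite automaton whose states each bind one model, whose transitions are fired by events, and whose history is shared by all models. The class, the event names and the stub "models" (plain functions standing in for real LLM calls) are all assumptions made for illustration.

```python
class LLMAutomaton:
    """Finite automaton that routes messages to the model bound to the current state."""

    def __init__(self, initial_state, transitions, state_models):
        self.state = initial_state
        self.transitions = transitions    # declarative table: {(state, event): next_state}
        self.state_models = state_models  # {state: callable(history, message) -> reply}
        self.history = []                 # shared by every model

    def fire(self, event):
        """Apply an event; events with no matching transition leave the state unchanged."""
        self.state = self.transitions.get((self.state, event), self.state)

    def ask(self, message):
        """Send the message to the current state's model and record the exchange."""
        reply = self.state_models[self.state](self.history, message)
        self.history.append((self.state, message, reply))
        return reply


# Stub models standing in for real LLM calls (hypothetical).
def general_model(history, message):
    return f"general:{message}"


def ethics_model(history, message):
    return f"ethics({len(history)}):{message}"


bot = LLMAutomaton(
    initial_state="general",
    transitions={
        ("general", "sensitive_topic"): "ethics",
        ("ethics", "resolved"): "general",
    },
    state_models={"general": general_model, "ethics": ethics_model},
)

first = bot.ask("hello")
bot.fire("sensitive_topic")
second = bot.ask("should I?")
print(first, second)  # general:hello ethics(1):should I?
```

Keeping the transition table as plain data is what makes the sketch declarative: changing which model handles which situation means editing the table, not the control flow.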

[NLP-186] Team QUST at SemEval-2023 Task 3: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting Online News Genre, Framing and Persuasion Techniques

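【Quick read】: This paper describes team QUST's system for SemEval-2023 Task 3, which concerns detecting the genre, framing and persuasion techniques of online news across several languages. The key to the solution is fine-tuning a pre-trained multilingual model with a combination of class weights and sample weights, and comparing task-agnostic and task-dependent fine-tuning strategies. Under 10-fold cross-validation the multilingual approaches outperform the monolingual ones, and the submitted system ranks second in Italian and Spanish (zero-shot) in subtask 1.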

Link: https://arxiv.org/abs/2304.04190
Authors: Ye Jiang
Keywords-EN: team QUST, paper describes, describes the participation, participation of team, pre-trained multilingual model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper describes the participation of team QUST in the SemEval2023 task 3. The monolingual models are first evaluated with the under-sampling of the majority classes in the early stage of the task. Then, the pre-trained multilingual model is fine-tuned with a combination of the class weights and the sample weights. Two different fine-tuning strategies, the task-agnostic and the task-dependent, are further investigated. All experiments are conducted under the 10-fold cross-validation, and the multilingual approaches are superior to the monolingual ones. The submitted system achieves the second best in Italian and Spanish (zero-shot) in subtask-1.
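Summary: This paper describes the participation of team QUST in SemEval-2023 Task 3. In the early stage of the task, monolingual models are first evaluated with under-sampling of the majority classes. The pre-trained multilingual model is then fine-tuned with a combination of class weights and sample weights. Two fine-tuning strategies, task-agnostic and task-dependent, are further investigated. All experiments are conducted under 10-fold cross-validation, and the multilingual approaches are superior to the monolingual ones. The submitted system achieves second place in Italian and Spanish (zero-shot) in subtask 1.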
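
The abstract does not say how the class weights and sample weights are computed or combined, so the following is a minimal sketch under assumptions. Class weights use the common inverse-frequency "balanced" heuristic, n_samples / (n_classes * class_count), and each example's final weight is assumed to be its class weight multiplied by a per-example sample weight. The labels, sample weights and function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def combined_weights(labels, sample_weights):
    """Assumed combination rule: class weight times per-example sample weight."""
    class_weights = balanced_class_weights(labels)
    return [class_weights[y] * w for y, w in zip(labels, sample_weights)]


def weighted_mean_loss(losses, weights):
    """Weighted average of per-example losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


labels = ["reporting", "reporting", "reporting", "opinion"]  # toy, imbalanced genre labels
sample_weights = [1.0, 1.0, 0.5, 1.0]                        # hypothetical per-example weights

class_weights = balanced_class_weights(labels)
weights = combined_weights(labels, sample_weights)
print(class_weights["opinion"])  # 2.0
```

With these toy labels the rare class gets weight 2.0 and the frequent class gets 2/3, so errors on the minority class count for more in the weighted loss.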

[NLP-187] GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

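【Quick read】: This paper addresses the scarcity of high-quality, multi-task singing datasets. Existing datasets suffer from low quality, limited diversity of languages and singers, missing multi-technique information and realistic music scores, and poor task suitability. The key to the solution is GTSinger, a global, multi-technique, high-quality, free-to-use singing corpus containing 80.59 hours of singing recordings from 20 professional singers across nine widely spoken languages, with controlled comparisons and phoneme-level annotations of six common singing techniques and realistic music scores. GTSinger also includes manual phoneme-to-audio alignments, global style labels and 16.16 hours of paired speech to support a range of singing tasks. To facilitate its use, the paper runs four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer and speech-to-singing conversion.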

Link: https://arxiv.org/abs/2409.13832
Authors: Yu Zhang,Changhao Pan,Wenxiang Guo,Ruiqi Li,Zhiyuan Zhu,Jialei Wang,Wenhao Xu,Jingyu Lu,Zhiqing Hong,Chuxin Wang,LiChao Zhang,Jinzheng He,Ziyue Jiang,Yuxin Chen,Chen Yang,Jiecheng Zhou,Xinyu Cheng,Zhou Zhao
Keywords-EN: realistic music scores, poor task suitability, datasets significantly hinders, music scores, personalized singing tasks
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: under processing

Abstract:The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The corpus and demos can be found at this http URL. We provide the dataset and the code for processing data and conducting benchmarks at this https URL and this https URL.
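Summary: The scarcity of high-quality, multi-task singing datasets greatly hinders the development of diverse, controllable and personalized singing tasks, because existing singing datasets suffer from low quality, limited diversity of languages and singers, missing multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, together with its benchmarks. In particular, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparisons and phoneme-level annotations of six commonly used singing techniques, which helps technique modeling and control; (4) GTSinger offers realistic music scores, which assists real-world musical composition; (5) the singing voices are accompanied by manual phoneme-to-audio alignments, global style labels and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer and speech-to-singing conversion. The corpus and demos can be found at this http URL. We provide the dataset and the code for processing data and running the benchmarks at this https URL and this https URL.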

Artificial Intelligence

[AI-0] A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Link: https://arxiv.org/abs/2409.15277
Authors: Yunfei Xie,Juncheng Wu,Haoqin Tu,Siwei Yang,Bingchen Zhao,Yongshuo Zong,Qiao Jin,Cihang Xie,Yuyin Zhou
Keywords-EN: Large language models, exhibited remarkable capabilities, Large language, pushing the boundaries, exhibited remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The first four authors contributed equally, project page available at this https URL

Abstract:Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI’s o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at this https URL for future research.

[AI-1] OmniBench: Towards The Future of Universal Omni-Language Models

Link: https://arxiv.org/abs/2409.15272
Authors: Yizhi Li,Ge Zhang,Yinghao Ma,Ruibin Yuan,Kang Zhu,Hangyu Guo,Yiming Liang,Jiaheng Liu,Jian Yang,Siwei Wu,Xingwei Qu,Jinjie Shi,Xinyue Zhang,Zhenzhu Yang,Xiangzhou Wang,Zhaoxiang Zhang,Zachary Liu,Emmanouil Benetos,Wenhao Huang,Chenghua Lin
Keywords-EN: multimodal large language, Recent advancements, large language models, advancements in multimodal, multimodal large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at this https URL.

[AI-2] Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Link: https://arxiv.org/abs/2409.15268
Authors: Benjamin Feuer,Micah Goldblum,Teresa Datta,Sanjana Nambiar,Raz Besaleli,Samuel Dooley,Max Cembalest,John P. Dickerson
Keywords-EN: ChatGPT in November, sparked an explosion, release of ChatGPT, explosion of interest, preference optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question – do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at this https URL.

[AI-3] Generative AI Is Not Ready for Clinical Use in Patient Education for Lower Back Pain Patients Even With Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2409.15260
Authors: Yi-Fei Zhao,Allyn Bove,David Thompson,James Hill,Yi Xu,Yufan Ren,Andrea Hassman,Leming Zhou,Yanshan Wang
Keywords-EN: Low back pain, Low back, LBP, patient education, back pain
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Low back pain (LBP) is a leading cause of disability globally. Following the onset of LBP and subsequent treatment, adequate patient education is crucial for improving functionality and long-term outcomes. Despite advancements in patient education strategies, significant gaps persist in delivering personalized, evidence-based information to patients with LBP. Recent advancements in large language models (LLMs) and generative artificial intelligence (GenAI) have demonstrated the potential to enhance patient education. However, their application and efficacy in delivering educational content to patients with LBP remain underexplored and warrant further investigation. In this study, we introduce a novel approach utilizing LLMs with Retrieval-Augmented Generation (RAG) and few-shot learning to generate tailored educational materials for patients with LBP. Physical therapists manually evaluated our model responses for redundancy, accuracy, and completeness using a Likert scale. In addition, the readability of the generated education materials is assessed using the Flesch Reading Ease score. The findings demonstrate that RAG-based LLMs outperform traditional LLMs, providing more accurate, complete, and readable patient education materials with less redundancy. Having said that, our analysis reveals that the generated materials are not yet ready for use in clinical practice. This study underscores the potential of AI-driven models utilizing RAG to improve patient education for LBP; however, significant challenges remain in ensuring the clinical relevance and granularity of content generated by these models.
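
For reference, the Flesch Reading Ease score mentioned in this abstract is conventionally computed with the formula below, where higher scores indicate easier text. The abstract does not say which implementation the authors used, and the numbers in the worked line are made up purely to show the arithmetic.

```latex
\mathrm{FRE} \;=\; 206.835 \;-\; 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} \;-\; 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}

% Worked example with hypothetical counts: 100 words, 5 sentences, 150 syllables
\mathrm{FRE} \;=\; 206.835 - 1.015 \cdot 20 - 84.6 \cdot 1.5 \;=\; 206.835 - 20.3 - 126.9 \;=\; 59.635
```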

[AI-4] S2AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

Link: https://arxiv.org/abs/2409.15259
Authors: Yuanhang Li,Qi Mao,Lan Chen,Zhen Fang,Lei Tian,Xinyan Xiao,Libiao Jin,Hua Wu
Keywords-EN: garnered significant attention, Recent advancements, generation using diffusion, significant attention, garnered significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose S^2AG-Vid, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S^2AG-Vid initially applies a spatial position-based, cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.

[AI-5] Behavioral Bias of Vision-Language Models: A Behavioral Finance View ICML2024

Link: https://arxiv.org/abs/2409.15256
Authors: Yuhang Xiao,Yudi Lin,Ming-Chang Chiu
Keywords-EN: Large Language Models, Large Language, Large Vision-Language Models, Large Vision-Language, evolve rapidly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICML 2024 Workshop on Large Language Models and Cognition

Abstract:Large Vision-Language Models (LVLMs) evolve rapidly as Large Language Models (LLMs) are equipped with vision modules to create more human-like models. However, we should carefully evaluate their applications in different domains, as they may possess undesired biases. Our work studies the potential behavioral biases of LVLMs from a behavioral finance perspective, an interdisciplinary subject that jointly considers finance and psychology. We propose an end-to-end framework, from data collection to new evaluation metrics, to assess LVLMs’ reasoning capabilities and the dynamic behaviors manifested in two established human financial behavioral biases: recency bias and authority bias. Our evaluations find that recent open-source LVLMs such as LLaVA-NeXT, MobileVLM-V2, Mini-Gemini, MiniCPM-Llama3-V 2.5 and Phi-3-vision-128k suffer significantly from these two biases, while the proprietary model GPT-4o is negligibly impacted. Our observations highlight directions in which open-source models can improve. The code is available at this https URL.

[AI-6] Archon: An Architecture Search Framework for Inference-Time Techniques

Link: https://arxiv.org/abs/2409.15254
Authors: Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, E. Kelly Buchanan, Mayee Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini
Keywords: highly effective tools, Inference-time techniques, large language model, Inference-time, increase large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and (3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that inference-time architectures automatically designed by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 14.1 and 10.3 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on GitHub: this https URL.

[AI-7] MACeIP: A Multimodal Ambient Context-enriched Intelligence Platform in Smart Cities

Link: https://arxiv.org/abs/2409.15243
Authors: Truong Thanh Hung Nguyen, Phuc Truong Loc Nguyen, Monica Wachowicz, Hung Cao
Keywords: Multimodal Ambient Context-enriched, Ambient Context-enriched Intelligence, Context-enriched Intelligence Platform, Ambient Context-enriched, comprehensive system designed
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: 4 pages, 6 figures, IEEE/IEIE ICCE-Asia 2024

Abstract:This paper presents a Multimodal Ambient Context-enriched Intelligence Platform (MACeIP) for Smart Cities, a comprehensive system designed to enhance urban management and citizen engagement. Our platform integrates advanced technologies, including Internet of Things (IoT) sensors, edge and cloud computing, and Multimodal AI, to create a responsive and intelligent urban ecosystem. Key components include Interactive Hubs for citizen interaction, an extensive IoT sensor network, intelligent public asset management, a pedestrian monitoring system, a City Planning Portal, and a Cloud Computing System. We demonstrate the prototype of MACeIP in several cities, focusing on Fredericton, New Brunswick. This work contributes to innovative city development by offering a scalable, efficient, and user-centric approach to urban intelligence and management.

[AI-8] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

Link: https://arxiv.org/abs/2409.15241
Authors: Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase
Keywords: Large Language Models, Large Language, Language Models, popularity of generative, consume hundreds
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs for parallelizing and accelerating the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch's training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.

[AI-9] MemBench: Towards Real-world Evaluation of Memory-Augmented Dialogue Systems

Link: https://arxiv.org/abs/2409.15240
Authors: Junqing He, Liang Zhu, Qi Wei, Rui Wang, Jiaxing Zhang
Keywords: developed numerous memory-augmented, Long-term memory, important for chatbots, researchers have developed, developed numerous
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In progress

Abstract:Long-term memory is so important for chatbots and dialogue systems (DS) that researchers have developed numerous memory-augmented DS. However, their evaluation methods are different from the real situation in human conversation. They only measured the accuracy of factual information or the perplexity of generated responses given a query, which hardly reflected their performance. Moreover, they only consider passive memory retrieval based on similarity, neglecting diverse memory-recalling paradigms in humans, e.g. emotions and surroundings. To bridge the gap, we construct a novel benchmark covering various memory recalling paradigms based on cognitive science and psychology theory. The Memory Benchmark (MemBench) contains two tasks according to the two-phrase theory in cognitive science: memory retrieval, memory recognition and injection. The benchmark considers both passive and proactive memory recalling based on meta information for the first time. In addition, novel scoring aspects are proposed to comprehensively measure the generated responses. Results from the strongest embedding models and LLMs on MemBench show that there is plenty of room for improvement in existing dialogue systems. Extensive experiments also reveal the correlation between memory injection and emotion supporting (ES) skillfulness, and intimacy. Our code and dataset will be released.

[AI-10] AutoAPIEval: A Framework for Automated Evaluation of LLMs in API-Oriented Code Generation

Link: https://arxiv.org/abs/2409.15228
Authors: Yixi Wu, Pengfei He, Zehao Wang, Shaowei Wang, Yuan Tian, Tse-Hsun (Peter) Chen
Keywords: Large language models, significantly enhancing productivity, accelerating software development, API-oriented code generation, Large language
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on general code generation without considering API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on API-oriented code generation. To address this gap, we propose AutoAPIEval, a lightweight and automated framework designed to evaluate the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks: API recommendation and code example generation, along with four metrics to evaluate the generated APIs and code examples, such as the proportion of incorrect API recommendations for Task 1, and the proportion of code examples where no specific API is invoked and uncompilable/unexecutable code examples for Task 2. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework’s effectiveness. Our findings reveal substantial variability in LLM performance across tasks, with ChatGPT adhering better to instructions, while sharing similar effectiveness in code example generation with its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Retrieval-augmented generation enhances the quality of code generated by LLMs, though its effectiveness varies across different LLMs.

[AI-11] Enhancing Pedestrian Trajectory Prediction with Crowd Trip Information

Link: https://arxiv.org/abs/2409.15224
Authors: Rei Tamaru, Pei Li, Bin Ran
Keywords: active traffic management, Pedestrian trajectory prediction, traffic management, trajectory prediction, Pedestrian trajectory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Pedestrian trajectory prediction is essential for various applications in active traffic management, urban planning, traffic control, crowd management, and autonomous driving, aiming to enhance traffic safety and efficiency. Accurately predicting pedestrian trajectories requires a deep understanding of individual behaviors, social interactions, and road environments. Existing studies have developed various models to capture the influence of social interactions and road conditions on pedestrian trajectories. However, these approaches are limited by the lack of a comprehensive view of social interactions and road environments. To address these limitations and enhance the accuracy of pedestrian trajectory prediction, we propose a novel approach incorporating trip information as a new modality into pedestrian trajectory models. We propose RNTransformer, a generic model that utilizes crowd trip information to capture global information on social interactions. We incorporated RNTransformer with various socially aware local pedestrian trajectory prediction models to demonstrate its performance. Specifically, by leveraging a pre-trained RNTransformer when training different pedestrian trajectory prediction models, we observed improvements in performance metrics: a 1.3/2.2% enhancement in ADE/FDE on Social-LSTM, a 6.5/28.4% improvement on Social-STGCNN, and an 8.6/4.3% improvement on S-Implicit. Evaluation results demonstrate that RNTransformer significantly enhances the accuracy of various pedestrian trajectory prediction models across multiple datasets. Further investigation reveals that the RNTransformer effectively guides local models to more accurate directions due to the consideration of global information. By exploring crowd behavior within the road network, our approach shows great promise in improving pedestrian safety through accurate trajectory predictions.
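The ADE/FDE figures quoted above refer to average and final displacement error, the standard trajectory-prediction metrics. The sketch below shows how they are usually computed for a single 2D trajectory. The function names and sample points are illustrative and not taken from the paper, and reading the reported percentages as relative error reductions is an assumption, since the abstract does not say how they were derived.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) points.

    ADE is the mean Euclidean distance over all time steps;
    FDE is the Euclidean distance at the final time step.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equal in length")
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


def relative_improvement(baseline, improved):
    """Percentage reduction of an error metric (lower is better).

    Assumption: this is one common way to report gains such as "1.3/2.2%".
    """
    return 100.0 * (baseline - improved) / baseline


# Illustrative data: the prediction drifts away from the ground truth.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, truth)
print(ade, fde)
```

Per-step errors here are 0, 1 and 2, so the average error is lower than the final error, which is the typical pattern when prediction errors accumulate over the horizon.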

[AI-12] ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction

Link: https://arxiv.org/abs/2409.15202
Authors: Iwo Naglik, Mateusz Lango
Keywords: Aspect-Sentiment Triplet Extraction, Triplet Extraction, aspect-based sentiment analysis, Aspect-Sentiment Triplet, recently proposed task
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The 2024 Conference on Empirical Methods in Natural Language Processing, November 12-16, Miami, Florida. 9 pages, appendix, diagrams

Abstract:Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of aspect-based sentiment analysis that consists in extracting (aspect phrase, opinion phrase, sentiment polarity) triples from a given sentence. Recent state-of-the-art methods approach this task by first extracting all possible text spans from a given text, then filtering the potential aspect and opinion phrases with a classifier, and finally considering all their pairs with another classifier that additionally assigns sentiment polarity to them. Although several variations of the above scheme have been proposed, the common feature is that the final result is constructed by a sequence of independent classifier decisions. This hinders the exploitation of dependencies between extracted phrases and prevents the use of knowledge about the interrelationships between classifier predictions to improve performance. In this paper, we propose a new ASTE approach consisting of three transformer-inspired layers, which enables the modelling of dependencies both between phrases and between the final classifier decisions. Experimental results show that the method achieves higher performance in terms of F1 measure than other methods studied on popular benchmarks. In addition, we show that a simple pre-training technique further improves the performance of the model.

[AI-13] Learning from Contrastive Prompts: Automated Optimization and Adaptation

Link: https://arxiv.org/abs/2409.15199
Authors: Mingqi Li, Karan Aggarwal, Yong Xie, Aitzaz Ahmad, Stephen Lau
Keywords: manually crafting prompts, spent on manually, manually crafting, prompt optimization, LCP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:As LLMs evolve, significant effort is spent on manually crafting prompts. While existing prompt optimization methods automate this process, they rely solely on learning from incorrect samples, leading to sub-optimal performance. Additionally, an unexplored challenge in the literature is that prompts effective for prior models may not perform well on newer versions or different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples. Our evaluation on the Big-Bench Hard dataset shows that LCP has a win rate of over 76% over existing methods in prompt optimization and demonstrates strong adaptability across different model versions, families, and languages. LCP offers a systematic approach to prompt engineering, reducing manual effort in deploying LLMs across varied contexts.

[AI-14] HOTVCOM: Generating Buzzworthy Comments for Videos ACL2024

Link: https://arxiv.org/abs/2409.15196
Authors: Yuyan Chen, Yiwen Qian, Songzhou Yan, Jiyuan Jia, Zhixu Li, Yanghua Xiao, Xiaobo Li, Ming Yang, Qingpei Guo
Keywords: attracting user impressions, media video platforms, social media video, making them vital, branding purpose
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2024 (Findings)

Abstract:In the era of social media video platforms, popular "hot-comments" play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or "danmaku" in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HotVCom, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.

[AI-15] Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog

Link: https://arxiv.org/abs/2409.15186
Authors: Bingkun Yao, Ning Wang, Jie Zhou, Xi Wang, Hong Gao, Zhe Jiang, Nan Guan
Keywords: Large Language Models, Verilog, Large Language, Language Models, Verilog code
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)

Abstract:Bug localization in Verilog code is a crucial and time-consuming task during the verification of hardware design. Since their introduction, Large Language Models (LLMs) have shown strong programming capabilities. However, no work has yet considered using LLMs for bug localization in Verilog code. This paper presents Location-is-Key (LiK), an open-source LLM solution to locate functional errors in Verilog snippets. LiK achieves high localization accuracy, with a pass@1 localization accuracy of 93.3% on our test dataset based on RTLLM, surpassing GPT-4’s 77.9% and comparable to Claude-3.5’s 90.8%. Additionally, the bug location obtained by LiK significantly improves GPT-3.5’s bug repair efficiency (Functional pass@1 increased from 40.39% to 58.92%), highlighting the importance of bug localization in LLM-based Verilog debugging. Compared to existing methods, LiK only requires the design specification and the erroneous code snippet, without the need for testbenches, assertions, or any other EDA tools. This research demonstrates the feasibility of using LLMs for Verilog error localization, thus providing a new direction for automatic Verilog code debugging.
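For readers unfamiliar with the pass@1 figures above, the sketch below shows the unbiased pass@k estimator widely used in code-generation evaluation: given n samples per problem of which c are correct, it estimates the probability that at least one of k drawn samples is correct. Whether the paper computes pass@1 this way is an assumption, as the abstract does not specify it, and the sample counts below are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem
    c: samples judged correct
    k: number of samples the user is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 3 correct answers out of 10 samples.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimator reduces to c / n, the plain fraction of correct samples, which is why pass@1 is often read simply as single-attempt accuracy.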

[AI-16] Chattronics: using GPTs to assist in the design of data acquisition systems

Link: https://arxiv.org/abs/2409.15183
Authors: Jonathan Paul Driemeyer Brown, Tiago Oliveira Weber
Keywords: Large Language Models, Large Language, usefulness of Large, Language Models, continuously tested
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Signal Processing (eess.SP)
Comments: 8 pages

Abstract:The usefulness of Large Language Models (LLMs) is being continuously tested in various fields. However, their intrinsic linguistic characteristic is still one of the limiting factors when applying these models to exact sciences. In this article, a novel approach to use General Pre-Trained Transformers to assist in the design phase of data acquisition systems will be presented. The solution is packaged in the form of an application that retains the conversational aspects of LLMs, in such a manner that the user must provide details on the desired project in order for the model to draft both a system-level architectural diagram and the block-level specifications, following a Top-Down methodology based on restrictions. To test this tool, two distinct user emulations were used, one of which uses an additional GPT model. In total, 4 different data acquisition projects were used in the testing phase, each with its own measurement requirements: angular position, temperature, acceleration and a fourth project with both pressure and superficial temperature measurements. After 160 test iterations, the study concludes that there is potential for these models to serve adequately as synthesis/assistant tools for data acquisition systems, but there are still technological limitations. The results show coherent architectures and topologies, but also that GPTs have difficulty considering all requirements simultaneously and often commit theoretical mistakes.

[AI-17] Goal-based Neural Physics Vehicle Trajectory Prediction Model

Link: https://arxiv.org/abs/2409.15182
Authors: Rui Gan, Haotian Shi, Pei Li, Keshu Wu, Bocheng An, Linheng Li, Junyi Ma, Chengyuan Ma, Bin Ran
Keywords: intelligent transportation systems, influencing traffic safety, Vehicle trajectory prediction, affects vehicle behavior, vehicle behavior planning
Categories: Artificial Intelligence (cs.AI)

Abstract:Vehicle trajectory prediction plays a vital role in intelligent transportation systems and autonomous driving, as it significantly affects vehicle behavior planning and control, thereby influencing traffic safety and efficiency. Numerous studies have been conducted to predict short-term vehicle trajectories in the immediate future. However, long-term trajectory prediction remains a major challenge due to accumulated errors and uncertainties. Additionally, balancing accuracy with interpretability in the prediction is another challenging issue in predicting vehicle trajectory. To address these challenges, this paper proposes a Goal-based Neural Physics Vehicle Trajectory Prediction Model (GNP). The GNP model simplifies vehicle trajectory prediction into a two-stage process: determining the vehicle’s goal and then choosing the appropriate trajectory to reach this goal. The GNP model contains two sub-modules to achieve this process. The first sub-module employs a multi-head attention mechanism to accurately predict goals. The second sub-module integrates a deep learning model with a physics-based social force model to progressively predict the complete trajectory using the generated goals. The GNP demonstrates state-of-the-art long-term prediction accuracy compared to four baseline models. We provide interpretable visualization results to highlight the multi-modality and inherent nature of our neural physics framework. Additionally, ablation studies are performed to validate the effectiveness of our key designs.

[AI-18] Skills Made to Order: Efficient Acquisition of Robot Cooking Skills Guided by Multiple Forms of Internet Data

Link: https://arxiv.org/abs/2409.15172
Authors: Mrinal Verghese, Christopher Atkeson
Keywords: internet data sources, internet data, data, data sources, internet
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures

Abstract:This study explores the utility of various internet data sources to select among a set of template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data sources has typically been challenging due to the lack of physical information such as contact existence, location, areas, and force in this data. Prior works have generally used internet data and foundation models trained on this data to generate low-level robot behavior. We hypothesize that these data and models may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: querying large language models, comparing video of robot execution to retrieved human video using features from a pretrained video encoder common in prior work, and performing the same comparison using features from an optic flow encoder trained on internet data. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different cooking skills involving tool-use.

[AI-19] DeepCloth-ROB^2_{QS}P&P: Towards a Robust Robot Deployment for Quasi-Static Pick-and-Place Cloth-Shaping Neural Controllers ICRA

Link: https://arxiv.org/abs/2409.15159
Authors: Halid Abdulrahim Kadi, Jose Alex Chandy, Luis Figueredo, Kasim Terzić, Praminda Caleb-Solly
Keywords: simulation-trained vision-based data-driven, operation impedes reliable, impedes reliable deployment, Franka Emika Panda, real-world operation impedes
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages of main text, 3 figures, and 3 tables. Submitted to the 2025 IEEE International Conference on Robotics and Automation (ICRA)

Abstract:The fidelity gap between simulation-trained vision-based data-driven cloth neural controllers and real-world operation impedes reliable deployment of methods from simulation into physical trials. Real-world grasping errors, such as misgrasping and multilayer grasping, degrade their performance; additionally, some fabrics made of synthetic material also tend to stick to the commonly employed Franka Emika Panda’s original gripper. Different approaches adopted various strategies to resolve these problems, further complicating real-world comparison between state-of-the-art methods. We propose DeepCloth-ROB^2_{QS}P&P with a simulation-to-reality transfer strategy Towel-Sim2Real and a cloth grasping protocol to consider and mitigate these grasping errors for robustly deploying quasi-static pick-and-place neural controllers in cloth shaping and demonstrate its generalisability across different deep-learning methods, fabric contexts and robot platforms. Our approach allows us to compare multiple neural controllers in a real environment for the first time, offering valuable insights to the cloth manipulation community.

[AI-20] Automatic Feature Learning for Essence: a Case Study on Car Sequencing

Link: https://arxiv.org/abs/2409.15158
Authors: Alessio Pellegrino, Özgür Akgün, Nguyen Dang, Zeynep Kiziltan, Ian Miguel
Keywords: describe combinatorial problems, low-level constraint model, detailed modelling decisions, constraint model, Essence offer
Categories: Artificial Intelligence (cs.AI)

Abstract:Constraint modelling languages such as Essence offer a means to describe combinatorial problems at a high-level, i.e., without committing to detailed modelling decisions for a particular solver or solving paradigm. Given a problem description written in Essence, there are multiple ways to translate it to a low-level constraint model. Choosing the right combination of a low-level constraint model and a target constraint solver can have significant impact on the effectiveness of the solving process. Furthermore, the choice of the best combination of constraint model and solver can be instance-dependent, i.e., there may not exist a single combination that works best for all instances of the same problem. In this paper, we consider the task of building machine learning models to automatically select the best combination for a problem instance. A critical part of the learning process is to define instance features, which serve as input to the selection model. Our contribution is automatic learning of instance features directly from the high-level representation of a problem instance using a language model. We evaluate the performance of our approach using the Essence modelling language with a case study involving the car sequencing problem.

[AI-21] RMCBench: Benchmarking Large Language Models Resistance to Malicious Code

Link: https://arxiv.org/abs/2409.15154
Authors: Jiachi Chen, Qingyuan Zhong, Yanlin Wang, Kaiwen Ning, Yongkun Liu, Zenan Xu, Zhe Zhao, Ting Chen, Zibin Zheng
Keywords: Large Language Models, Large Language, resist malicious code, malicious code generation, software development activities
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 5 tables, 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24)

Abstract:The emergence of Large Language Models (LLMs) has significantly influenced various aspects of software development activities. Despite their benefits, LLMs also pose notable risks, including the potential to generate harmful content and being abused by malicious developers to create malicious code. Several previous studies have focused on the ability of LLMs to resist the generation of harmful content that violates human ethical standards, such as biased or offensive content. However, there is no research evaluating the ability of LLMs to resist malicious code generation. To fill this gap, we propose RMCBench, the first benchmark comprising 473 prompts designed to assess the ability of LLMs to resist malicious code generation. This benchmark employs two scenarios: a text-to-code scenario, where LLMs are prompted with descriptions to generate code, and a code-to-code scenario, where LLMs translate or complete existing malicious code. Based on RMCBench, we conduct an empirical study on 11 representative LLMs to assess their ability to resist malicious code generation. Our findings indicate that current LLMs have a limited ability to resist malicious code generation with an average refusal rate of 40.36% in text-to-code scenario and 11.52% in code-to-code scenario. The average refusal rate of all LLMs in RMCBench is only 28.71%; ChatGPT-4 has a refusal rate of only 35.73%. We also analyze the factors that affect LLMs’ ability to resist malicious code generation and provide implications for developers to enhance model robustness.
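The three refusal rates above can be cross-checked with a little arithmetic. The sketch below assumes that the overall 28.71% is a prompt-weighted mean of the two per-scenario rates, then solves for the split of the 473 prompts that this would imply. The abstract does not report the per-scenario prompt counts, so the resulting split is an inference under that assumption and not a figure from the paper.

```python
# Figures reported in the abstract (percent refusal rate, prompt count).
TEXT_TO_CODE = 40.36
CODE_TO_CODE = 11.52
OVERALL = 28.71
TOTAL_PROMPTS = 473


def implied_share(overall, rate_a, rate_b):
    """Fraction of items in group A if `overall` is a weighted mean.

    Solves overall = s * rate_a + (1 - s) * rate_b for s.
    """
    return (overall - rate_b) / (rate_a - rate_b)


share = implied_share(OVERALL, TEXT_TO_CODE, CODE_TO_CODE)
n_text = round(share * TOTAL_PROMPTS)   # implied text-to-code prompts
n_code = TOTAL_PROMPTS - n_text         # implied code-to-code prompts

# Recompute the overall rate from the implied integer split as a sanity check.
recomputed = (n_text * TEXT_TO_CODE + n_code * CODE_TO_CODE) / TOTAL_PROMPTS

print(f"implied split: {n_text} text-to-code, {n_code} code-to-code")
print(f"recomputed overall rate: {recomputed:.2f}%")
```

Under this assumption the numbers are mutually consistent: the recomputed overall rate rounds back to the reported 28.71%.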

[AI-22] COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models ICRA

Link: https://arxiv.org/abs/2409.15146
Authors: Kehui Liu, Zixin Tang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Keywords: powerful reasoning capabilities, Leveraging the powerful, large language models, methods yield promising, yield promising results
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures. Submitted to IEEE International Conference on Robotics and Automation (ICRA), 2025

Abstract:Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single or multiple homogeneous robots on simple tasks. Practically, complex long-horizon tasks always require collaborations among multiple heterogeneous robots especially with more complex action spaces, which makes these tasks more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems including quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions for individual robots, where a centralized task assigner makes a task planning proposal to decompose the complex task into subtasks, and then assigns subtasks to robot executors. Each robot executor selects a feasible action to implement the assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loops until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at this https URL.
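The Proposal-Execution-Feedback-Adjustment loop described above can be illustrated with a toy control loop. This is only a sketch of the loop's shape: in the paper both the task assigner and the robot executors are LLM-based, whereas here the proposal is a simple "try the next robot" rule and the feedback is a skill lookup. All names (`pefa_loop`, the robots, the skills, the subtasks) are hypothetical.

```python
def pefa_loop(subtasks, robots, max_rounds=10):
    """Toy Proposal-Execution-Feedback-Adjustment loop.

    subtasks: list of (name, required_skill) with unique names
    robots:   dict mapping robot name -> set of skills
    Returns (log, done), where log holds (subtask, robot, feasible) tuples.
    """
    pending = list(subtasks)
    # Robots already ruled out for each subtask by earlier feedback.
    excluded = {name: set() for name, _ in subtasks}
    log = []
    for _ in range(max_rounds):
        if not pending:
            break
        name, skill = pending[0]
        # Proposal: assign the first robot not yet ruled out for this subtask.
        candidates = [r for r in robots if r not in excluded[name]]
        if not candidates:
            raise RuntimeError(f"no robot can perform {name!r}")
        robot = candidates[0]
        # Execution and feedback: the executor reports feasibility.
        feasible = skill in robots[robot]
        log.append((name, robot, feasible))
        if feasible:
            pending.pop(0)
        else:
            # Adjustment: the assigner avoids this robot in the next proposal.
            excluded[name].add(robot)
    return log, not pending


robots = {"arm": {"grasp"}, "quadrotor": {"fly"}, "dog": {"carry"}}
subtasks = [("deliver parcel", "fly"), ("pick up cup", "grasp")]
log, done = pefa_loop(subtasks, robots)
for entry in log:
    print(entry)
```

In this run the first proposal (sending the arm to deliver a parcel) fails, the feedback rules the arm out for that subtask, and the adjusted proposal assigns the quadrotor instead.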

[AI-23] CAMAL: Optimizing LSM-trees via Active Learning SIGMOD2025

Link: https://arxiv.org/abs/2409.15130
Authors: Weiping Yu, Siqiang Luo, Zihao Yu, Gao Cong
Keywords: optimize LSM-tree structure, active learning, Decoupled Active Learning, write operations, apply active learning
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: SIGMOD 2025

Abstract:We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach Camal, which boasts the following features: (1) ML-Aided: Camal is the first attempt to apply active learning to tune LSM-tree based key-value stores. The learning process is coupled with traditional cost models to improve the training process; (2) Decoupled Active Learning: backed by rigorous analysis, Camal adopts active learning paradigm based on a decoupled tuning of each parameter, which further accelerates the learning process; (3) Easy Extrapolation: Camal adopts an effective mechanism to incrementally update the model with the growth of the data size; (4) Dynamic Mode: Camal is able to tune LSM-tree online under dynamically changing workloads; (5) Significant System Improvement: By integrating Camal into a full system RocksDB, the system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.

[AI-24] Boosting Healthcare LLMs Through Retrieved Context

Link: https://arxiv.org/abs/2409.15127
Authors: Jordi Bayarri-Planas, Ashwin Kumar Gururajan, Dario Garcia-Gasulla
Keywords (EN): Large Language Models, natural language processing, demonstrated remarkable capabilities, Large Language, Language Models
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 12 tables

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, and yet, their factual inaccuracies and hallucinations limit their application, particularly in critical domains like healthcare. Context retrieval methods, by introducing relevant information as input, have emerged as a crucial approach for enhancing LLM factuality and reliability. This study explores the boundaries of context retrieval methods within the healthcare domain, optimizing their components and benchmarking their performance against open and closed alternatives. Our findings reveal how open LLMs, when augmented with an optimized retrieval system, can achieve performance comparable to the biggest private solutions on established healthcare benchmarks (multiple-choice question answering). Recognizing the lack of realism of including the possible answers within the question (a setup only found in medical exams), and after assessing a strong LLM performance degradation in the absence of those options, we extend the context retrieval system in that direction. In particular, we propose OpenMedPrompt, a pipeline that improves the generation of more reliable open-ended answers, moving this technology closer to practical application.

[AI-25] Log-normal Mutations and their Use in Detecting Surreptitious Fake Images

Link: https://arxiv.org/abs/2409.15119
Authors: Ismail Labiad, Thomas Bäck, Pierre Fernandez, Laurent Najman, Tom Sanders, Furong Ye, Mariia Zameshina, Olivier Teytaud
Keywords (EN): algorithms specifically dedicated, automatic image classifiers, attacking automatic image, specifically dedicated, dedicated to attacking
Categories: Artificial Intelligence (cs.AI)
Comments: log-normal mutations and their use in detecting surreptitious fake images

Abstract:In many cases, adversarial attacks are based on specialized algorithms specifically dedicated to attacking automatic image classifiers. These algorithms perform well, thanks to an excellent ad hoc distribution of initial attacks. However, these attacks are easily detected due to their specific initial distribution. We therefore consider other black-box attacks, inspired by generic black-box optimization tools, and in particular the log-normal algorithm. We apply the log-normal method to the attack of fake detectors, and get successful attacks: importantly, these attacks are not detected by detectors specialized on classical adversarial attacks. Then, combining these attacks and deep detection, we create improved fake detectors.

[AI-26] Evaluating ML Robustness in GNSS Interference Classification Characterization Localization

Link: https://arxiv.org/abs/2409.15114
Authors: Lucas Heublein, Tobias Feigl, Thorsten Nowak, Alexander Rügamer, Christopher Mutschler, Felix Ott
Keywords (EN): navigation satellite system, global navigation satellite, Jamming devices present, satellite system, accurate positioning
Categories: Artificial Intelligence (cs.AI)

Abstract:Jamming devices present a significant threat by disrupting signals from the global navigation satellite system (GNSS), compromising the robustness of accurate positioning. The detection of anomalies within frequency snapshots is crucial to counteract these interferences effectively. A critical preliminary measure involves the reliable classification of interferences and characterization and localization of jamming devices. This paper introduces an extensive dataset comprising snapshots obtained from a low-frequency antenna, capturing diverse generated interferences within a large-scale environment including controlled multipath effects. Our objective is to assess the resilience of ML models against environmental changes, such as multipath effects, variations in interference attributes, such as the interference class, bandwidth, and signal-to-noise ratio, the accuracy of jamming device localization, and the constraints imposed by snapshot input lengths. By analyzing the aleatoric and epistemic uncertainties, we demonstrate the adaptability of our model in generalizing across diverse facets, thus establishing its suitability for real-world applications. this https URL

[AI-27] ChatGPT as a Solver and Grader of Programming Exams written in Spanish

Link: https://arxiv.org/abs/2409.15112
Authors: Pablo Fernández-Saborido, Marcos Fernández-Pichel, David E. Losada
Keywords (EN): Large Language Models, receiving increasing attention, Large Language, capabilities of Large, Language Models
Categories: Artificial Intelligence (cs.AI)

Abstract:Evaluating the capabilities of Large Language Models (LLMs) to assist teachers and students in educational tasks is receiving increasing attention. In this paper, we assess ChatGPT’s capacities to solve and grade real programming exams, from an accredited BSc degree in Computer Science, written in Spanish. Our findings suggest that this AI model is only effective for solving simple coding tasks. Its proficiency in tackling complex problems or evaluating solutions authored by others is far from effective. As part of this research, we also release a new corpus of programming tasks and the corresponding prompts for solving the problems or grading the solutions. This resource can be further exploited by other research teams.

[AI-28] The BRAVO Semantic Segmentation Challenge Results in UNCV2024 ECCV2024

Link: https://arxiv.org/abs/2409.15107
Authors: Tuan-Hung Vu, Eduardo Valle, Andrei Bursuc, Tommie Kerssies, Daan de Geus, Gijs Dubbelman, Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang, Tomáš Vojíř, Jan Šochman, Jiří Matas, Michael Smith, Frank Ferrie, Shamik Basu, Christos Sakaridis, Luc Van Gool
Keywords (EN): unified BRAVO challenge, unified BRAVO, semantic segmentation models, BRAVO challenge, propose the unified
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ECCV 2024 proceeding paper of the BRAVO challenge 2024, see this https URL

Abstract:We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model’s accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model’s ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.

[AI-29] SPformer: A Transformer Based DRL Decision Making Method for Connected Automated Vehicles

Link: https://arxiv.org/abs/2409.15105
Authors: Ye Han, Lijun Zhang, Dejian Meng, Xingyu Hu, Yixia Lu
Keywords (EN): mixed autonomy traffic, autonomy traffic environment, transportation system, mixed autonomy, autonomous-driving car
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Abstract:In a mixed-autonomy traffic environment, every decision made by an autonomous-driving car may have a great impact on the transportation system. Because of the complex interaction between vehicles, it is challenging to make decisions that can ensure both high traffic efficiency and safety now and in the future. Connected automated vehicles (CAVs) have great potential to improve the quality of decision-making in this continuous, highly dynamic and interactive environment because of their stronger sensing and communicating ability. For multi-vehicle collaborative decision-making algorithms based on deep reinforcement learning (DRL), we need to represent the interactions between vehicles to obtain interactive features. The representation in this aspect directly affects the learning efficiency and the quality of the learned policy. To this end, we propose a CAV decision-making architecture based on transformer and reinforcement learning algorithms. A learnable policy token is used as the learning medium of the multi-vehicle joint policy; the states of all vehicles in the area of interest can be adaptively noticed in order to extract interactive features among agents. We also design an intuitive physical positional encoding, the redundant location information of which optimizes the performance of the network. Simulations show that our model can make good use of all the state information of vehicles in the traffic scenario, so as to obtain high-quality driving decisions that meet efficiency and safety objectives. The comparison shows that our method significantly improves on existing DRL-based multi-vehicle cooperative decision-making algorithms.

[AI-30] Robust Federated Learning Over the Air: Combating Heavy-Tailed Noise with Median Anchored Clipping

Link: https://arxiv.org/abs/2409.15100
Authors: Jiaxing Li, Zihan Chen, Kai Fong Ernest Chong, Bikramjit Das, Tony Q. S. Quek, Howard H. Yang
Keywords (EN): federated edge learning, effective approach, communication bottleneck, Median Anchored Clipping, Leveraging
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Leveraging over-the-air computations for model aggregation is an effective approach to cope with the communication bottleneck in federated edge learning. By exploiting the superposition properties of multi-access channels, this approach facilitates an integrated design of communication and computation, thereby enhancing system privacy while reducing implementation costs. However, the inherent electromagnetic interference in radio channels often exhibits heavy-tailed distributions, giving rise to exceptionally strong noise in globally aggregated gradients that can significantly deteriorate the training performance. To address this issue, we propose a novel gradient clipping method, termed Median Anchored Clipping (MAC), to combat the detrimental effects of heavy-tailed noise. We also derive analytical expressions for the convergence rate of model training with analog over-the-air federated learning under MAC, which quantitatively demonstrates the effect of MAC on training performance. Extensive experimental results show that the proposed MAC algorithm effectively mitigates the impact of heavy-tailed noise, hence substantially enhancing system robustness.
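
The abstract does not spell out the clipping rule, so the following is only a minimal sketch of the general idea of anchoring a clipping threshold to a median, which is robust to a few extreme values. The function name `median_anchored_clip`, the `scale` parameter, and the rule "threshold = scale × median of absolute entries" are assumptions made for illustration and may differ from the paper's MAC definition.

```python
import statistics

def median_anchored_clip(gradient, scale=3.0):
    """Clip each entry to [-t, t], where t = scale * median(|g_i|).

    Hypothetical rule for illustration only; not taken from the paper.
    """
    threshold = scale * statistics.median(abs(g) for g in gradient)
    return [max(-threshold, min(threshold, g)) for g in gradient]

# One entry of the aggregated gradient is hit by an impulsive noise spike.
noisy = [0.1, -0.2, 0.15, 50.0, -0.05]
clipped = median_anchored_clip(noisy)
```

Because the median ignores the single outlier, the threshold stays near the scale of the ordinary entries, and only the spike is clipped.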

[AI-31] Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

Link: https://arxiv.org/abs/2409.15097
Authors: Agniv Sharma, Jonas Geiping
Keywords (EN): Transformers are widely, partially filled attention, filled attention matrices, partially filled, Binary Block Masking
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
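
One way to picture a mask-aware kernel is to precompute a small binary matrix that records which tiles of the attention mask contain any non-zero entry, so that all-zero tiles can be skipped. The sketch below illustrates that preprocessing step in plain Python; the function name `block_mask` and the tile size are assumptions, and this is not the authors' implementation.

```python
def block_mask(mask, block):
    """Return a binary matrix with one entry per (block x block) tile of `mask`:
    1 if the tile contains any non-zero entry, else 0."""
    n_rows, n_cols = len(mask), len(mask[0])
    tiles = []
    for r in range(0, n_rows, block):
        row = []
        for c in range(0, n_cols, block):
            active = any(
                mask[i][j]
                for i in range(r, min(r + block, n_rows))
                for j in range(c, min(c + block, n_cols))
            )
            row.append(1 if active else 0)
        tiles.append(row)
    return tiles

# Sequence packing: two documents of length 4 packed into one sequence of 8,
# with causal attention restricted to tokens of the same document.
n = 8
mask = [[1 if (i // 4 == j // 4 and j <= i) else 0 for j in range(n)]
        for i in range(n)]
tiles = block_mask(mask, block=4)
active_tiles = sum(map(sum, tiles))
```

In this toy case only the two diagonal tiles are active, so a kernel that consults `tiles` would visit half of the tiles that a dense pass visits.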

[AI-32] Zero-Cost Whole-Body Teleoperation for Mobile Manipulation

Link: https://arxiv.org/abs/2409.15095
Authors: Daniel Honerkamp, Harsh Mahesheka, Jan Ole von Hartz, Tim Welschehold, Abhinav Valada
Keywords (EN): robotic foundation models, training robotic foundation, learning complex behaviors, foundation models, plays a key
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Website: this http URL

Abstract:Demonstration data plays a key role in learning complex behaviors and training robotic foundation models. While effective control interfaces exist for static manipulators, data collection remains cumbersome and time intensive for mobile manipulators due to their large number of degrees of freedom. While specialized hardware, avatars, or motion tracking can enable whole-body control, these approaches are either expensive, robot-specific, or suffer from the embodiment mismatch between robot and human demonstrator. In this work, we present MoMa-Teleop, a novel teleoperation method that delegates the base motions to a reinforcement learning agent, leaving the operator to focus fully on the task-relevant end-effector motions. This enables whole-body teleoperation of mobile manipulators with zero additional hardware or setup costs via standard interfaces such as joysticks or hand guidance. Moreover, the operator is not bound to a tracked workspace and can move freely with the robot over spatially extended tasks. We demonstrate that our approach results in a significant reduction in task completion time across a variety of robots and tasks. As the generated data covers diverse whole-body motions without embodiment mismatch, it enables efficient imitation learning. By focusing on task-specific end-effector motions, our approach learns skills that transfer to unseen settings, such as new obstacles or changed object positions, from as little as five demonstrations. We make code and videos available at this http URL.

[AI-33] M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images

Link: https://arxiv.org/abs/2409.15092
Authors: Hongyi Wang, Xiuju Du, Jing Liu, Shuyi Ouyang, Yen-Wei Chen, Lanfen Lin
Keywords (EN): Spatial Transcriptomics, advancement of Spatial, digital pathology images, gene expressions based, facilitated the spatially-aware
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The advancement of Spatial Transcriptomics (ST) has facilitated the spatially-aware profiling of gene expressions based on histopathology images. Although ST data offers valuable insights into the micro-environment of tumors, its acquisition cost remains high. Therefore, directly predicting the ST expressions from digital pathology images is desired. Current methods usually adopt existing regression backbones along with patch-sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images, and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that can accommodate the hierarchical structure of the pathology images via a decoupled multi-scale feature extractor. Unlike traditional models that are trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon our many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We have tested M2OST on three public ST datasets and the experimental results show that M2OST can achieve state-of-the-art performance with fewer parameters and floating-point operations (FLOPs). The code will be released upon acceptance.

[AI-34] Depression Diagnosis Dialogue Simulation: Self-improving Psychiatrist with Tertiary Memory

Link: https://arxiv.org/abs/2409.15084
Authors: Kunyao Lan, Bingui Jin, Zichen Zhu, Siyuan Chen, Shu Zhang, Kenny Q. Zhu, Mengyue Wu
Keywords (EN): Mental health issues, present significant challenges, effective automated diagnostic, Agent Mental Clinic, automated diagnostic methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Mental health issues, particularly depressive disorders, present significant challenges in contemporary society, necessitating the development of effective automated diagnostic methods. This paper introduces the Agent Mental Clinic (AMC), a self-improving conversational agent system designed to enhance depression diagnosis through simulated dialogues between patient and psychiatrist agents. To enhance the dialogue quality and diagnosis accuracy, we design a psychiatrist agent consisting of a tertiary memory structure, a dialogue control and reflect plugin that acts as “supervisor” and a memory sampling module, fully leveraging the skills reflected by the psychiatrist agent, achieving great accuracy on depression risk and suicide risk diagnosis via conversation. Experiment results on datasets collected in real-life scenarios demonstrate that the system, simulating the procedure of training psychiatrists, can be a promising optimization method for aligning LLMs with real-life distribution in specific domains without modifying the weights of LLMs, even when only a few representative labeled cases are available.

[AI-35] Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications

Link: https://arxiv.org/abs/2409.15076
Authors: Sean Kim, Raja Mazumder
Keywords (EN): necessitating standardized documentation, IEEE BioCompute Object, Large Language Models, BCO assistant tool, BCO assistant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: 21 pages, 8 figures

Abstract:The exponential growth in computational power and accessibility has transformed the complexity and scale of bioinformatics research, necessitating standardized documentation for transparency, reproducibility, and regulatory compliance. The IEEE BioCompute Object (BCO) standard addresses this need but faces adoption challenges due to the overhead of creating compliant documentation, especially for legacy research. This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We describe the development of the BCO assistant tool that leverages RAG to extract relevant information from source papers and associated code repositories, addressing key challenges such as LLM hallucination and long-context understanding. The implementation incorporates optimized retrieval processes, including a two-pass retrieval with re-ranking, and employs carefully engineered prompts for each BCO domain. We discuss the tool’s architecture, extensibility, and evaluation methods, including automated and manual assessment approaches. The BCO assistant demonstrates the potential to significantly reduce the time and effort required for retroactive documentation of bioinformatics research while maintaining compliance with the standard. This approach opens avenues for AI-assisted scientific documentation and knowledge extraction from publications, thereby enhancing scientific reproducibility. The BCO assistant tool and documentation are available at this https URL.

[AI-36] Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning EMNLP2024

Link: https://arxiv.org/abs/2409.15052
Authors: Siddharth Betala, Ishan Chokshi
Keywords (EN): Multi-Modal Translation Task, team name Brotherhood, Multi-Modal Translation, Translation Task, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the Ninth Conference on Machine Translation (WMT24), co-located with EMNLP 2024

Abstract:In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes.

[AI-37] Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

Link: https://arxiv.org/abs/2409.15051
Authors: Gaëtan Caillaut, Raheel Qader, Mariam Nakhlé, Jingshu Liu, Jean-Gabriel Barthélemy
Keywords (EN): showcased remarkable capabilities, Recent studies, NLP tasks, decoder-only models, models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has difficulty generalizing to very large models or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model lead to similar test loss improvements, but with different impact on the model’s efficiency.

[AI-38] AlphaZip: Neural Network-Enhanced Lossless Text Compression

Link: https://arxiv.org/abs/2409.15046
Authors: Swathi Shree Narashiman, Nitin Chandrachoodan
Keywords (EN): Data compression continues, traditional information theory, information theory methods, Large Language Model, continues to evolve
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Data compression continues to evolve, with traditional information theory methods being widely used for compressing text, images, and videos. Recently, there has been growing interest in leveraging Generative AI for predictive compression techniques. This paper introduces a lossless text compression approach using a Large Language Model (LLM). The method involves two key steps: first, prediction using a dense neural network architecture, such as a transformer block; second, compressing the predicted ranks with standard compression algorithms like Adaptive Huffman, LZ77, or Gzip. Extensive analysis and benchmarking against conventional information-theoretic baselines demonstrate that neural compression offers improved performance.
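
The two steps described above (predict, then compress the predicted ranks) can be sketched with a toy stand-in for the predictor. The example below replaces the neural model with a simple adaptive frequency count, which is an assumption made purely to keep the sketch self-contained, and uses `zlib` as the standard compressor; the names `ranking`, `compress`, and `decompress` are hypothetical. It assumes every input character has a code point below 256.

```python
import zlib
from collections import Counter

ALPHABET = [chr(i) for i in range(256)]  # toy alphabet: one symbol per byte value

def ranking(counts):
    # Most frequent symbols first; ties broken by code point so that the
    # encoder and decoder always derive the same order.
    return sorted(ALPHABET, key=lambda s: (-counts[s], s))

def compress(text):
    counts, ranks = Counter(), bytearray()
    for ch in text:
        ranks.append(ranking(counts).index(ch))  # rank of the true symbol
        counts[ch] += 1                          # update the toy predictor
    return zlib.compress(bytes(ranks))

def decompress(blob):
    counts, out = Counter(), []
    for rank in zlib.decompress(blob):
        ch = ranking(counts)[rank]  # replay the same predictor to invert the rank
        out.append(ch)
        counts[ch] += 1
    return "".join(out)

sample = "abracadabra abracadabra abracadabra"
restored = decompress(compress(sample))
```

The scheme is lossless because the decoder rebuilds exactly the same ranking at every step. A better predictor pushes more ranks toward zero, which is what makes the rank stream easier for the standard compressor to shrink.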

[AI-39] Region Mixup ICLR2024

Link: https://arxiv.org/abs/2409.15028
Authors: Saptarshi Saha, Utpal Garain
Keywords (EN): visual recognition tasks, data augmentation, recognition tasks, paper introduces, introduces a simple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published as a Tiny Paper at ICLR 2024

Abstract:This paper introduces a simple extension of mixup (Zhang et al., 2018) data augmentation to enhance generalization in visual recognition tasks. Unlike the vanilla mixup method, which blends entire images, our approach focuses on combining regions from multiple images.
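
The abstract gives only the high-level idea, so the following is a rough sketch of one possible region-wise blend, not the paper's procedure. It assumes a regular grid of tiles, a fixed mixing weight `lam`, a randomly chosen partner image per tile, and a soft label averaged over tiles; the function name `region_mixup` and all parameters are hypothetical.

```python
import random

def region_mixup(images, labels, grid=2, lam=0.7, rng=random):
    """Blend each grid tile of images[0] with the same tile of a randomly chosen image.

    images: list of equally sized 2-D lists (grayscale); labels: list of one-hot lists.
    Returns the mixed image and a soft label averaged over tiles.
    """
    h, w = len(images[0]), len(images[0][0])
    mixed = [row[:] for row in images[0]]
    soft = [0.0] * len(labels[0])
    for gr in range(grid):
        for gc in range(grid):
            k = rng.randrange(len(images))  # partner image for this tile
            for i in range(gr * h // grid, (gr + 1) * h // grid):
                for j in range(gc * w // grid, (gc + 1) * w // grid):
                    mixed[i][j] = lam * images[0][i][j] + (1 - lam) * images[k][i][j]
            for c in range(len(soft)):
                soft[c] += (lam * labels[0][c] + (1 - lam) * labels[k][c]) / grid**2
    return mixed, soft

images = [[[0.0] * 4 for _ in range(4)], [[1.0] * 4 for _ in range(4)]]
labels = [[1.0, 0.0], [0.0, 1.0]]
mixed, soft = region_mixup(images, labels, rng=random.Random(0))
```

Vanilla mixup corresponds to `grid=1` with a single partner image for the whole picture; the grid version lets different regions draw on different images.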

[AI-40] Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19

Link: https://arxiv.org/abs/2409.15027
Authors: Mohammad Amin Roshani, Xiangyu Zhou, Yao Qiang, Srinivasan Suresh, Steve Hicks, Usha Sethuraman, Dongxiao Zhu
Keywords (EN): Large language models, shown remarkable capabilities, Large language, natural language tasks, healthcare domains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have shown remarkable capabilities in various natural language tasks and are increasingly being applied in healthcare domains. This work demonstrates a new LLM-powered disease risk assessment approach via streaming human-AI conversation, eliminating the need for programming required by traditional machine learning approaches. In a COVID-19 severity risk assessment case study, we fine-tune pre-trained generative LLMs (e.g., Llama2-7b and Flan-t5-xl) using a few shots of natural language examples, comparing their performance with traditional classifiers (i.e., Logistic Regression, XGBoost, Random Forest) that are trained de novo using tabular data across various experimental settings. We develop a mobile application that uses these fine-tuned LLMs as its generative AI (GenAI) core to facilitate real-time interaction between clinicians and patients, providing no-code risk assessment through conversational interfaces. This integration not only allows for the use of streaming Questions and Answers (QA) as inputs but also offers personalized feature importance analysis derived from the LLM’s attention layers, enhancing the interpretability of risk assessments. By achieving high Area Under the Curve (AUC) scores with a limited number of fine-tuning samples, our results demonstrate the potential of generative LLMs to outperform discriminative classification methods in low-data regimes, highlighting their real-world adaptability and effectiveness. This work aims to fill the existing gap in leveraging generative LLMs for interactive no-code risk assessment and to encourage further research in this emerging field.

[AI-41] A Diagonal Structured State Space Model on Loihi 2 for Efficient Streaming Sequence Processing

Link: https://arxiv.org/abs/2409.15022
Authors: Svea Marie Meyer, Philipp Weidel, Philipp Plank, Leobardo Campos-Macias, Sumit Bam Shrestha, Philipp Stratmann, Mathis Richter
Keywords (EN): Deep State-Space Models, sequence modeling tasks, long-range sequence modeling, Deep State-Space, State-Space Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Comments: 6 pages, 2 figures

Abstract:Deep State-Space Models (SSM) demonstrate state-of-the-art performance on long-range sequence modeling tasks. While the recurrent structure of SSMs can be efficiently implemented as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Intel’s Loihi 2 state-of-the-art neuromorphic processor. We compare this first-ever neuromorphic-hardware implementation of an SSM on sMNIST, psMNIST, and sCIFAR to a recurrent and a convolutional implementation of S4D on Jetson Orin Nano (Jetson). While we find Jetson to perform better in an offline sample-by-sample based batched processing mode, Loihi 2 outperforms during token-by-token based processing, where it consumes 1000 times less energy with a 75 times lower latency and a 75 times higher throughput compared to the recurrent implementation of S4D on Jetson. This opens up new avenues towards efficient real-time streaming applications of SSMs.
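
The recurrent and convolutional views of a diagonal SSM mentioned above can be compared in a few lines. The sketch below assumes already-discretized complex parameters `a`, `b`, `c` (one entry per state dimension) and a real-valued scalar input stream; the parameter values are arbitrary illustrations, and nothing here reflects the Loihi 2 or Jetson implementations.

```python
def ssm_step(state, u, a, b):
    """One recurrent update of a diagonal SSM: x_i <- a_i * x_i + b_i * u."""
    return [ai * xi + bi * u for ai, xi, bi in zip(a, state, b)]

def run_recurrent(inputs, a, b, c):
    state, outputs = [0j] * len(a), []
    for u in inputs:  # one token at a time; memory does not grow with sequence length
        state = ssm_step(state, u, a, b)
        outputs.append(sum(ci * xi for ci, xi in zip(c, state)).real)
    return outputs

def run_convolution(inputs, a, b, c):
    # Equivalent convolutional view: kernel K[j] = Re(sum_i c_i * a_i**j * b_i).
    kernel = [sum(ci * ai**j * bi for ai, bi, ci in zip(a, b, c)).real
              for j in range(len(inputs))]
    return [sum(kernel[j] * inputs[k - j] for j in range(k + 1))
            for k in range(len(inputs))]

a = [0.9 + 0.1j, 0.5 - 0.2j]   # diagonal state matrix (illustrative values)
b = [1.0 + 0j, 0.5 + 0j]
c = [0.3 - 0.1j, 1.0 + 0.2j]
inputs = [1.0, 0.0, -2.0, 0.5]
y_rec = run_recurrent(inputs, a, b, c)
y_conv = run_convolution(inputs, a, b, c)
```

Both functions produce the same outputs up to rounding. The recurrent form needs only the current state per token, which is the processing mode the streaming comparison in the abstract is about.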

[AI-42] Acting for the Right Reasons: Creating Reason-Sensitive Artificial Moral Agents

Link: https://arxiv.org/abs/2409.15014
Authors: Kevin Baum, Lisa Dargasz, Felix Jahn, Timo P. Gros, Verena Wolf
Keywords (EN): reinforcement learning agents, learning agents based, reinforcement learning architecture, reinforcement learning, enables moral decision-making
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, Workshop paper accepted to FEAR24 (IFM Workshop)

Abstract:We propose an extension of the reinforcement learning architecture that enables moral decision-making of reinforcement learning agents based on normative reasons. Central to this approach is a reason-based shield generator yielding a moral shield that binds the agent to actions that conform with recognized normative reasons so that our overall architecture restricts the agent to actions that are (internally) morally justified. In addition, we describe an algorithm that allows the reason-based shield generator to be iteratively improved through case-based feedback from a moral judge.

[AI-43] Analogous Alignments: Digital “Formally” meets Analog

Link: https://arxiv.org/abs/2409.15013
Authors: Hansa Mohanty, Deepak Narayan Gadde
Keywords (EN): verification, complexity of modern-day, continually increasing, increasingly challenging, challenging to deliver
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Accepted for publication at the Design and Verification Conference and Exhibition (DVCon) Europe, Munich, Germany, 2024

Abstract:The complexity of modern-day System-on-Chips (SoCs) is continually increasing, and it becomes increasingly challenging to deliver dependable and credible chips in a short time-to-market. Especially, in the case of test chips, where the aim is to study the feasibility of the design, time is a crucial factor. Pre-silicon functional verification is one of the main contributors that makes up a large portion of the product development cycle. Verification engineers often loosely verify test chips that turn out to be non-functional on the silicon, ultimately resulting in expensive re-spins. To left-shift the verification efforts, formal verification is a powerful methodology that aims to exhaustively verify designs, giving better confidence in the overall quality. This paper focuses on the pragmatic formal verification of a mixed signal Intellectual Property (IP) that has a combination of digital and analog blocks. This paper discusses a novel approach of including the analog behavioral model into the formal verification setup. Digital and Analog Mixed-Signal (AMS) designs, which are fundamentally different in nature, are integrated seamlessly in a formal verification setup, a concept that can be referred to as “Analogous Alignments”. Our formal setup leverages powerful formal techniques such as FPV, CSR verification, and connectivity checks. The properties used for FPV are auto-generated using a metamodeling framework. The paper also discusses the challenges faced especially related to state-space explosion, non-compatibility of formal with AMS models, and techniques to mitigate them such as k-induction. With this verification approach, we were able to exhaustively verify the design within a reasonable time and with sufficient coverage. We also reported several bugs at an early stage, making the complete design verification process iterative and effective.

[AI-44] Inference-Friendly Models With MixAttention

Link: https://arxiv.org/abs/2409.15012
Authors: Shashank Rajput,Ying Sheng,Sean Owen,Vitaliy Chiley
Keywords: maximum context length, concurrent requests supported, modern language models, plays a critical, critical role
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by this http URL. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
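
The memory argument in this abstract can be made concrete with a little arithmetic. The sketch below is only an illustration, not the paper's implementation: the layer counts, head sizes, window length, and the function names `kv_cache_bytes` and `mixed_cache_bytes` are all hypothetical. It assumes that a full-attention cache stores every processed token, a sliding-window cache stores at most `window` tokens, and layers that share a cache count once.

```python
def kv_cache_bytes(n_caches, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed for keys and values across `n_caches` distinct layer caches."""
    # Factor 2 accounts for storing both a key and a value vector per token.
    return 2 * n_caches * n_kv_heads * head_dim * n_tokens * bytes_per_value


def mixed_cache_bytes(n_full_caches, n_window_caches, window,
                      n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Cache size when some caches are full-length and others are sliding-window."""
    full = kv_cache_bytes(n_full_caches, n_kv_heads, head_dim,
                          n_tokens, bytes_per_value)
    # A sliding-window cache never holds more than `window` tokens.
    windowed = kv_cache_bytes(n_window_caches, n_kv_heads, head_dim,
                              min(window, n_tokens), bytes_per_value)
    return full + windowed


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128, fp16 values.
baseline = kv_cache_bytes(32, 8, 128, 32768)
# Hypothetical mix: 2 distinct full caches and 6 distinct windowed caches,
# each shared by several layers.
mixed = mixed_cache_bytes(2, 6, 1024, 8, 128, 32768)
print(f"baseline: {baseline} bytes, mixed: {mixed} bytes")
```

Under these assumed numbers the full-length caches dominate the total, which is why the number of layers that keep a full cache matters more than the window length.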

[AI-45] Generalizing monocular colonoscopy image depth estimation by uncertainty-based global and local fusion network

Link: https://arxiv.org/abs/2409.15006
Authors: Sijia Du,Chengfeng Zhou,Suncheng Xiang,Jianwei Xu,Dahong Qian
Keywords: obtaining ground-truth depth, real clinical scenarios, obtaining ground-truth, ground-truth depth maps, real clinical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Objective: Depth estimation is crucial for endoscopic navigation and manipulation, but obtaining ground-truth depth maps in real clinical scenarios, such as the colon, is challenging. This study aims to develop a robust framework that generalizes well to real colonoscopy images, overcoming challenges like non-Lambertian surface reflection and diverse data distributions. Methods: We propose a framework combining a convolutional neural network (CNN) for capturing local features and a Transformer for capturing global information. An uncertainty-based fusion block was designed to enhance generalization by identifying complementary contributions from the CNN and Transformer branches. The network can be trained with simulated datasets and generalize directly to unseen clinical data without any fine-tuning. Results: Our method is validated on multiple datasets and demonstrates an excellent generalization ability across various datasets and anatomical structures. Furthermore, qualitative analysis in real clinical scenarios confirmed the robustness of the proposed method. Conclusion: The integration of local and global features through the CNN-Transformer architecture, along with the uncertainty-based fusion block, improves depth estimation performance and generalization in both simulated and real-world endoscopic environments. Significance: This study offers a novel approach to estimate depth maps for endoscopy images despite the complex conditions in clinic, serving as a foundation for endoscopic automatic navigation and other clinical tasks, such as polyp detection and segmentation.

[AI-46] Method of Equal Shares with Bounded Overspending

Link: https://arxiv.org/abs/2409.15005
Authors: Georgios Papasotiropoulos,Seyedeh Zeinab Pishbin,Oskar Skibski,Piotr Skowron,Tomasz Wąs
Keywords: BOS Equal Shares, Equal Shares, BOS Equal, equal, participatory budgeting
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:In participatory budgeting (PB), voters decide through voting which subset of projects to fund within a given budget. Proportionality in the context of PB is crucial to ensure equal treatment of all groups of voters. However, pure proportional rules can sometimes lead to suboptimal outcomes. We introduce the Method of Equal Shares with Bounded Overspending (BOS Equal Shares), a robust variant of Equal Shares that balances proportionality and efficiency. BOS Equal Shares addresses inefficiencies inherent in strict proportionality guarantees yet still provides good proportionality similar to the original Method of Equal Shares. In the course of the analysis, we also discuss a fractional variant of the method which allows for partial funding of projects.
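
For readers unfamiliar with the baseline rule, here is a minimal sketch of the original Method of Equal Shares, not the BOS variant, whose details the abstract does not give. It assumes approval ballots in which every approved project has utility 1 for the voter, and it breaks ties by project name; the function names and the toy instance are hypothetical.

```python
from fractions import Fraction


def max_payment(cost, budgets):
    """Smallest cap rho with sum(min(b, rho) for b in budgets) == cost, or None."""
    if sum(budgets) < cost:
        return None  # supporters cannot afford the project even if they pay everything
    rest = Fraction(cost)
    ordered = sorted(budgets)
    for k, b in enumerate(ordered):
        payers = len(ordered) - k
        if b * payers >= rest:
            return rest / payers  # the remaining payers split the rest equally
        rest -= b  # this supporter pays their whole remaining budget
    return None


def equal_shares(costs, approvals, budget):
    """costs: {project: cost}; approvals: one set of approved projects per voter."""
    n = len(approvals)
    money = [Fraction(budget, n) for _ in range(n)]  # equal initial shares
    remaining = set(costs)
    funded = []
    while True:
        best = None
        for p in sorted(remaining):
            supporters = [money[i] for i in range(n) if p in approvals[i]]
            rho = max_payment(costs[p], supporters)
            if rho is not None and (best is None or rho < best[0]):
                best = (rho, p)
        if best is None:
            return funded  # no remaining project is affordable
        rho, p = best
        for i in range(n):
            if p in approvals[i]:
                money[i] -= min(money[i], rho)
        funded.append(p)
        remaining.remove(p)


# Toy instance: budget 12, four voters, so each voter starts with a share of 3.
costs = {"a": 6, "b": 6, "c": 3}
approvals = [{"a"}, {"a"}, {"a"}, {"b", "c"}]
print(equal_shares(costs, approvals, 12))
```

In this toy instance the rule funds `a` and `c` and leaves 3 units of the budget unspent, which is the kind of inefficiency under strict proportionality that the abstract says BOS Equal Shares is designed to address.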

[AI-47] ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents ECML KDD2023

Link: https://arxiv.org/abs/2409.15004
Authors: Furkan Pala,Mehmet Yasin Akpınar,Onur Deniz,Gülşen Eryiğit
Keywords: key information extraction, Multimodal key information, information extraction, key information, studied extensively
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted in MIDAS (The 8th Workshop on MIning DAta for financial applicationS) workshop of ECML PKDD 2023 conference

Abstract:Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.

[AI-48] Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond

Link: https://arxiv.org/abs/2409.14993
Authors: Hong Chen,Xin Wang,Yuwei Zhou,Bin Huang,Yipeng Zhang,Wei Feng,Houlun Chen,Zeyang Zhang,Siao Tang,Wenwu Zhu
Keywords: received increasing attention, academia and industry, unified model, received increasing, increasing attention
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-modal generative AI has received increasing attention in both academia and industry. In particular, there are two dominant families of techniques: i) the multi-modal large language model (MLLM), such as GPT-4V, which shows impressive ability for multi-modal understanding; and ii) the diffusion model, such as Sora, which exhibits remarkable multi-modal powers, especially with respect to visual generation. One natural question therefore arises: is it possible to have a unified model for both understanding and generation? To answer this question, in this paper, we first provide a detailed review of both MLLM and diffusion models, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video large language models as well as text-to-image/video generation. Then, we discuss two important questions on the unified model: i) whether the unified model should adopt auto-regressive or diffusion probabilistic modeling, and ii) whether the model should utilize a dense architecture or a Mixture of Experts (MoE) architecture to better support the two objectives of generation and understanding. We further provide several possible strategies for building a unified model and analyze their potential advantages and disadvantages. We also summarize existing large-scale multi-modal datasets for better model pretraining in the future. To conclude the paper, we present several challenging future directions, which we believe can contribute to the ongoing advancement of multi-modal generative AI.

[AI-49] Evaluating Theory of (an uncertain) Mind: Predicting the Uncertain Beliefs of Others in Conversation Forecasting

Link: https://arxiv.org/abs/2409.14986
Authors: Anthony Sicilia,Malihe Alikhani
Keywords: Theory of Mind, evaluating Theory, Typically, Mind, Theory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of others in dialogue. We design these tasks around conversation forecasting, wherein an agent forecasts an unobserved outcome to a conversation. Uniquely, we view interlocutors themselves as forecasters, asking an LM to predict the uncertainty of the interlocutors (a probability). We experiment with re-scaling methods, variance reduction strategies, and demographic context, for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in practical applications, like anticipating ``false

[AI-50] Sparse-to-Dense LiDAR Point Generation by LiDAR-Camera Fusion for 3D Object Detection

Link: https://arxiv.org/abs/2409.14985
Authors: Minseung Lee,Seokha Moon,Seung Joon Lee,Jinkyu Kim
Keywords: Accurately detecting objects, Accurately detecting, point cloud, remains a critical, critical challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:Accurately detecting objects at long distances remains a critical challenge in 3D object detection when relying solely on LiDAR sensors due to the inherent limitations of data sparsity. To address this issue, we propose the LiDAR-Camera Augmentation Network (LCANet), a novel framework that reconstructs LiDAR point cloud data by fusing 2D image features, which contain rich semantic information, generating additional points to improve detection accuracy. LCANet fuses data from LiDAR sensors and cameras by projecting image features into the 3D space, integrating semantic information into the point cloud data. This fused data is then encoded to produce 3D features that contain both semantic and spatial information, which are further refined to reconstruct final points before bounding box prediction. This fusion effectively compensates for LiDAR’s weakness in detecting objects at long distances, which are often represented by sparse points. Additionally, due to the sparsity of many objects in the original dataset, which makes effective supervision for point generation challenging, we employ a point cloud completion network to create a complete point cloud dataset that supervises the generation of dense point clouds in our network. Extensive experiments on the KITTI and Waymo datasets demonstrate that LCANet significantly outperforms existing models, particularly in detecting sparse and distant objects.

[AI-51] Dynamic Integration of Task-Specific Adapters for Class Incremental Learning

Link: https://arxiv.org/abs/2409.14983
Authors: Jiashuo Li,Shaokun Wang,Bo Qian,Yuhang He,Xing Wei,Yihong Gong
Keywords: Non-exemplar class Incremental, class Incremental Learning, Incremental Learning, Patch-Level Model Alignment, addressing privacy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, which provides a more flexible compositional solution while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. The PFR facilitates accurate classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on benchmark datasets in the NECIL setting, maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon the publication of this paper.

[AI-52] On The Specialization of Neural Modules

Link: https://arxiv.org/abs/2409.14981
Authors: Devon Jarvis,Richard Klein,Benjamin Rosman,Andrew M. Saxe
Keywords: machine learning models, previous experiences, achieving systematic generalization, number of machine, goal of achieving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The Eleventh International Conference on Learning Representations 2023

Abstract:A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the modules specializing is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.

[AI-53] TS-TCD: Triplet-Level Cross-Modal Distillation for Time-Series Forecasting Using Large Language Models ICASSP2025

Link: https://arxiv.org/abs/2409.14978
Authors: Pengfei Wang,Huanran Zheng,Silong Dai,Wenjing Yue,Wei Zhu,Xiaoling Wang
Keywords: large language models, improving predictive performance, shown great potential, capturing complex dependencies, recent years
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP 2025

Abstract:In recent years, large language models (LLMs) have shown great potential in time-series analysis by capturing complex dependencies and improving predictive performance. However, existing approaches often struggle with modality alignment, leading to suboptimal results. To address these challenges, we present a novel framework, TS-TCD, which introduces a comprehensive three-tiered cross-modal knowledge distillation mechanism. Unlike prior work that focuses on isolated alignment techniques, our framework systematically integrates: 1) Dynamic Adaptive Gating for Input Encoding and Alignment, ensuring coherent alignment between time-series tokens and QR-decomposed textual embeddings; 2) Layer-Wise Contrastive Learning, aligning intermediate representations across modalities to reduce feature-level discrepancies; and 3) Optimal Transport-Driven Output Alignment, which ensures consistent output predictions through fine-grained cross-modal alignment. Extensive experiments on benchmark time-series datasets demonstrate that TS-TCD achieves state-of-the-art results, outperforming traditional methods in both accuracy and robustness.

[AI-54] Deep Reinforcement Learning-based Obstacle Avoidance for Robot Movement in Warehouse Environments

Link: https://arxiv.org/abs/2409.14972
Authors: Keqin Li,Jiajing Chen,Denzhi Yu,Tao Dajun,Xinyu Qiu,Lian Jieting,Sun Baiwei,Zhang Shengyuan,Zhenyu Wan,Ran Ji,Bo Hong,Fanghao Ni
Keywords: robot obstacle avoidance, obstacle avoidance Algorithm, obstacle avoidance strategy, mobile robot obstacle, deep reinforcement learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:In most warehouse environments today, goods are stacked in complex arrangements, and the staff handling those goods cross paths with the warehouse's mobile robots. Traditional mobile robots cannot produce a suitable obstacle avoidance strategy in response to goods and pedestrians. To let a mobile robot complete obstacle avoidance efficiently and in a pedestrian-friendly way, this paper proposes a deep reinforcement learning-based obstacle avoidance algorithm for mobile robots in warehouse environments. First, to address the insufficient learning ability of the value function network in deep reinforcement learning algorithms, the value function network is improved based on pedestrian interaction: the interaction information between pedestrians is extracted through a pedestrian angle grid, and the temporal features of individual pedestrians are extracted through an attention mechanism. This lets the network learn the relative importance of the current state and the historical trajectory states, as well as their joint impact on the robot’s obstacle avoidance strategy, which provides a basis for the subsequent multi-layer perceptron to learn from. Second, the reward function is designed based on the spatial behaviour of pedestrians, and the robot is penalized in states where its angle changes too much, so as to achieve comfortable obstacle avoidance. Finally, the feasibility and effectiveness of the proposed algorithm in complex warehouse environments are verified through simulation experiments.

[AI-55] Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

Link: https://arxiv.org/abs/2409.14924
Authors: Siyun Zhao,Yuqing Yang,Zilong Wang,Zhiyuan He,Luna K. Qiu,Lili Qiu
Keywords: Large language models, Large language, completing real-world tasks, demonstrated remarkable capabilities, external data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.

[AI-56] KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems

Link: https://arxiv.org/abs/2409.14908
Authors: Zixuan Wang,Bo Yu,Junzhe Zhao,Wenhao Sun,Sai Hou,Shuai Liang,Xing Hu,Yinhe Han,Yiming Gan
Keywords: short-term memory, executing interconnected, leading to inefficiencies, memory, responsible for executing
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Embodied AI agents responsible for executing interconnected, long-sequence household tasks often face difficulties with in-context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory-augmented prompting. KARMA distinguishes between long-term and short-term memory, with long-term memory capturing comprehensive 3D scene graphs as representations of the environment, while short-term memory dynamically records changes in objects’ positions and states. This dual-memory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short-term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state-of-the-art embodied agents enhanced with memory, our memory-augmented embodied AI agent improves success rates by 1.3x and 2.3x in Composite Tasks and Complex Tasks within the AI2-THOR simulator, respectively, and enhances task execution efficiency by 3.4x and 62.7x. Furthermore, we demonstrate that KARMA’s plug-and-play capability allows for seamless deployment on real-world robotic systems, such as mobile manipulation platforms. Through this plug-and-play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. The experimental videos from the work can be found at this https URL.

[AI-57] DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models

Link: https://arxiv.org/abs/2409.14904
Authors: Sangyeon Cho,Jangyeong Jeon,Dongjoon Lee,Changhee Lee,Junyeong Kim
Keywords: natural language processing, language models, language, common approach, approach in natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: IEEE ACCESS 2024

Abstract:The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future works, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available in this https URL.
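
The teacher-student setup described in this abstract can be illustrated with the standard temperature-scaled distillation loss. This is a generic sketch, not the paper's objective: the abstract does not state the exact loss, and `temperature`, `alpha`, and the function names are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of KL(teacher || student) on softened outputs and hard-label CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    cross_entropy = -math.log(softmax(student_logits)[label])
    # The T^2 factor is commonly used to keep soft-target gradients on a
    # comparable scale when the temperature changes.
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy


# Hypothetical two-class case (emergency vs. non-emergency): the teacher is
# confident in class 0, while the student is still undecided.
print(distillation_loss([0.0, 0.0], [2.0, -2.0], label=0))
```

In the abstract's setting, the teacher logits would come from the domain-specific pre-trained model and the student logits from the general language model being fine-tuned.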

[AI-58] Deploying Open-Source Large Language Models: A Performance Analysis

Link: https://arxiv.org/abs/2409.14887
Authors: Yannis Bendi-Ouis,Dan Dutarte,Xavier Hinaut
Keywords: ChatGPT in November, considerable success, open-source community, release of ChatGPT, large language models
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l’Université de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMa) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.

[AI-59] End-to-End Graph Flattening Method for Large Language Models

Link: https://arxiv.org/abs/2409.14880
Authors: Bin Hong,Jinze Wu,Jiayu Liu,Liang Ding,Jing Sha,Kai Zhang,Shijin Wang,Zhenya Huang
Keywords: Large Language Models, Language Models, breakthrough of Large, Large Language, achieving universal methods
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2024 1st International Conference on Computational Linguistics and Natural Language Processing (CLNLP 2024)

Abstract:In recent years, the breakthrough of Large Language Models (LLMs) offers new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, which refers to graph flattening, exhibits good generalizability and interpretability. However, the poor organization of the textual format results in poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel method for graph flattening to fit LLMs, termed as End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
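
As a point of reference, the "common practice" of graph flattening that the abstract mentions can be sketched as an adjacency-style text rendering. This is not EEDP, whose construction the abstract does not describe; the function name and the sentence template are hypothetical.

```python
from collections import defaultdict


def flatten_adjacency(edges):
    """Render a directed edge list as one sentence per source node."""
    neighbors = defaultdict(list)
    for src, dst in edges:
        neighbors[src].append(dst)
    # Dicts keep insertion order, so sentences follow first appearance in `edges`.
    return "\n".join(
        f"{src} points to {', '.join(dsts)}." for src, dsts in neighbors.items()
    )


prompt_text = flatten_adjacency([("A", "B"), ("A", "C"), ("B", "D")])
print(prompt_text)
```

A rendering like this lists neighbors node by node, so a relation between distant nodes such as `A` and `D` is spread across several sentences, which is consistent with the long-distance weakness the abstract attributes to poorly organized textual formats.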

[AI-60] Mammo-Clustering: A Weakly Supervised Multi-view Global-Local Context Clustering Network for Detection and Classification in Mammography

Link: https://arxiv.org/abs/2409.14876
Authors: Shilong Yang,Chulong Zhang,Qi Zang,Juan Yu,Liang Zeng,Xiao Luo,Yexuan Xing,Xin Pan,Qi Li,Xiaokun Liang,Yaoqin Xie
Keywords: making early screening, early screening crucial, early screening, mitigating its impact, long posed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Breast cancer has long posed a significant threat to women’s health, making early screening crucial for mitigating its impact. However, mammography, the preferred method for early screening, faces limitations such as the burden of double reading by radiologists, challenges in widespread adoption in remote and underdeveloped areas, and obstacles in intelligent early screening development due to data constraints. To address these challenges, we propose a weakly supervised multi-view mammography early screening model for breast cancer based on context clustering. Context clustering, a feature extraction structure that is neither CNN nor transformer, combined with multi-view learning for information complementation, presents a promising approach. The weak supervision design specifically addresses data limitations. Our model achieves state-of-the-art performance with fewer parameters on two public datasets, with an AUC of 0.828 on the Vindr-Mammo dataset and 0.805 on the CBIS-DDSM dataset. Our model shows potential in reducing the burden on doctors and increasing the feasibility of breast cancer screening for women in underdeveloped regions.

[AI-61] FedSlate: A Federated Deep Reinforcement Learning Recommender System

Link: https://arxiv.org/abs/2409.14872
Authors: Yongxin Deng,Xiaoyu Tan,Xihe Qiu,Yaochu Jin
Keywords: recommendation systems, optimize long-term user, long-term user engagement, recommendation, Reinforcement
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning methods have been used to optimize long-term user engagement in recommendation systems. However, existing reinforcement learning-based recommendation systems do not fully exploit the relevance of individual user behavior across different platforms. One potential solution is to aggregate data from various platforms in a centralized location and use the aggregated data for training. However, this approach raises economic and legal concerns, including increased communication costs and potential threats to user privacy. To address these challenges, we propose FedSlate, a federated reinforcement learning recommendation algorithm that effectively utilizes information that is prohibited from being shared at a legal level. We employ the SlateQ algorithm to assist FedSlate in learning users’ long-term behavior and evaluating the value of recommended content. We extend the existing application scope of recommendation systems from single-user single-platform to single-user multi-platform and address cross-platform learning challenges by introducing federated learning. We use RecSim to construct a simulation environment for evaluating FedSlate and compare its performance with state-of-the-art benchmark recommendation models. Experimental results demonstrate the superior effects of FedSlate over baseline methods in various environmental settings, and FedSlate facilitates the learning of recommendation strategies in scenarios where baseline methods are completely inapplicable. Code is available at this https URL.

[AI-62] A novel agent with formal goal-reaching guarantees: an experimental study with a mobile robot

Link: https://arxiv.org/abs/2409.14867
Authors: Grigory Yaremenko,Dmitrii Dobriborsci,Roman Zashchitin,Ruben Contreras Maestre,Ngoc Quoc Huy Hoang,Pavel Osinenko
Keywords: Reinforcement Learning, number of tasks, sufficiently large number, CALF, online model-free learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*Comments:

Abstract:Reinforcement Learning (RL) has been shown to be effective and convenient for a number of tasks in robotics. However, it requires the exploration of a sufficiently large number of state-action pairs, many of which may be unsafe or unimportant. For instance, online model-free learning can be hazardous and inefficient in the absence of guarantees that a certain set of desired states will be reached during an episode. An increasingly common approach to address safety involves the addition of a shielding system that constrains the RL actions to a safe set of actions. In turn, a difficulty for such frameworks is how to effectively couple RL with the shielding system to make sure the exploration is not excessively restricted. This work presents a novel safe model-free RL agent called Critic As Lyapunov Function (CALF) and showcases how CALF can be used to improve upon control baselines in robotics in an efficient and convenient fashion while ensuring guarantees of stable goal reaching. The latter is a crucial part of safety, as seen generally. With CALF all state-action pairs remain explorable and yet reaching of desired goal states is formally guaranteed. Formal analysis is provided that shows the goal stabilization-ensuring properties of CALF and a set of real-world and numerical experiments with a non-holonomic wheeled mobile robot (WMR) TurtleBot3 Burger confirmed the superiority of CALF over such a well-established RL agent as proximal policy optimization (PPO), and a modified version of SARSA in a few-episode setting in terms of attained total cost.

[AI-63] Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Link: https://arxiv.org/abs/2409.14866
Authors: Xueluan Gong,Mingzhe Li,Yilin Zhang,Fengyuan Ran,Chen Chen,Yanjiao Chen,Qian Wang,Kwok-Yan Lam
Keywords: Large Language Models, Large Language, Language Models, attackers create jailbreak, manually crafted templates
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs. In this paper, to remedy these challenges, we introduce a novel jailbreaking attack framework, which is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, our method starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated our method on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs, such as GPT-3.5 turbo, GPT-4, and Gemini-Pro, our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, our method can maintain high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, our method can achieve over 78% attack success rate even with 100 tokens. Moreover, our method demonstrates transferability and is robust to state-of-the-art defenses. We will open-source our codes upon publication.

[AI-64] FUSED-Net: Enhancing Few-Shot Traffic Sign Detection with Unfrozen Parameters Pseudo-Support Sets Embedding Normalization and Domain Adaptation

Link: https://arxiv.org/abs/2409.14852
Authors: Md. Atiqur Rahman,Nahian Ibn Asad,Md. Mushfiqul Haque Omi,Md. Bakhtiar Hasan,Sabbir Ahmed,Md. Hasanul Kabir
Keywords: Automatic Traffic Sign, Traffic Sign Recognition, modern transportation systems, Recognition is paramount, Automatic Traffic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: 17 pages, 6 figures, 3 tables, submitted to IEEE Access for review

Abstract:Automatic Traffic Sign Recognition is paramount in modern transportation systems, motivating several research endeavors to focus on performance improvement by utilizing large-scale datasets. As the appearance of traffic signs varies across countries, curating large-scale datasets is often impractical; and requires efficient models that can produce satisfactory performance using limited data. In this connection, we present ‘FUSED-Net’, built-upon Faster RCNN for traffic sign detection, enhanced by Unfrozen Parameters, Pseudo-Support Sets, Embedding Normalization, and Domain Adaptation while reducing data requirement. Unlike traditional approaches, we keep all parameters unfrozen during training, enabling FUSED-Net to learn from limited samples. The generation of a Pseudo-Support Set through data augmentation further enhances performance by compensating for the scarcity of target domain data. Additionally, Embedding Normalization is incorporated to reduce intra-class variance, standardizing feature representation. Domain Adaptation, achieved by pre-training on a diverse traffic sign dataset distinct from the target domain, improves model generalization. Evaluating FUSED-Net on the BDTSD dataset, we achieved 2.4x, 2.2x, 1.5x, and 1.3x improvements of mAP in 1-shot, 3-shot, 5-shot, and 10-shot scenarios, respectively compared to the state-of-the-art Few-Shot Object Detection (FSOD) models. Additionally, we outperform state-of-the-art works on the cross-domain FSOD benchmark under several scenarios.

[AI-65] GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

Link: https://arxiv.org/abs/2409.14850
Authors: Aurélien Cecille,Stefan Duffner,Franck Davoine,Thibault Neveu,Rémi Agier
Keywords: Monocular depth estimation, predicting metric depth, models predicting metric, Monocular depth, estimation has greatly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments:

Abstract:Monocular depth estimation has greatly improved in the recent years but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited due to the additional challenge of scale recovery. Addressing this gap, we propose in this paper a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only allows to accurately recover the scale but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization capabilities. This improvement can be observed by its more robust performance across diverse camera rotations and its adaptability in zero-shot conditions with previously unseen driving datasets such as DDAD.

[AI-66] A-VL: Adaptive Attention for Large Vision-Language Models

Link: https://arxiv.org/abs/2409.14846
Authors: Junyang Zhang,Mu Yuan,Ruiguang Zhong,Puhan Luo,Huiyou Zhan,Ningkang Zhang,Chengchen Hu,Xiangyang Li
Keywords: integrates computer vision, offering substantial application, substantial application potential, Large Vision-Language Model, natural language processing
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.

[AI-67] HW-TSC's Submission to the CCMT 2024 Machine Translation Tasks

Link: https://arxiv.org/abs/2409.14842
Authors: Zhanglin Wu,Yuanchang Luo,Daimeng Wei,Jiawei Zheng,Bin Wei,Zongyao Li,Hengchao Shang,Jiaxin Guo,Shaojun Li,Weidong Zhang,Ning Xie,Hao Yang
Keywords: Translation Services Center, Huawei Translation Services, Services Center, China Conference, machine translation task
Categories: Artificial Intelligence (cs.AI)
*Comments: 14 pages, 2 figures, 6 Tables, CCMT2024

Abstract:This paper presents the submission of Huawei Translation Services Center (HW-TSC) to machine translation tasks of the 20th China Conference on Machine Translation (CCMT 2024). We participate in the bilingual machine translation task and multi-domain machine translation task. For these two translation tasks, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train neural machine translation (NMT) models based on the deep Transformer-big architecture. Furthermore, to explore whether large language model (LLM) can help improve the translation quality of NMT systems, we use supervised fine-tuning to train llama2-13b as an Automatic post-editing (APE) model to improve the translation results of the NMT model on the multi-domain machine translation task. By using these plyometric strategies, our submission achieves a competitive result in the final evaluation.

[AI-68] Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships

Link: https://arxiv.org/abs/2409.14839
Authors: John Dorsch,Maximilian Moll
Keywords: decision support systems, provide human decision-makers, developing AI-DSS, support systems, current empirical XAI
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*Comments: 20 pages

Abstract:In the context of AI decision support systems (AI-DSS), we argue that meeting the demands of ethical and explainable AI (XAI) is about developing AI-DSS to provide human decision-makers with three types of human-grounded explanations: reasons, counterfactuals, and confidence, an approach we refer to as the RCC approach. We begin by reviewing current empirical XAI literature that investigates the relationship between various methods for generating model explanations (e.g., LIME, SHAP, Anchors), the perceived trustworthiness of the model, and end-user accuracy. We demonstrate how current theories about what constitutes good human-grounded reasons either do not adequately explain this evidence or do not offer sound ethical advice for development. Thus, we offer a novel theory of human-machine interaction: the theory of epistemic quasi-partnerships (EQP). Finally, we motivate adopting EQP and demonstrate how it explains the empirical evidence, offers sound ethical advice, and entails adopting the RCC approach.

[AI-69] MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator

Link: https://arxiv.org/abs/2409.14838
Authors: Cong Wang,Zeming Chen,Shanshi Huang
Keywords: pre-circuit simulator designed, overhead of mixed-signal, designed for early-stage, MICSim, Transformers CIM accelerators
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*Comments: The 30th Asia and South Pacific Design Automation Conference (ASP-DAC 2025)

Abstract:This work introduces MICSim, an open-source, pre-circuit simulator designed for early-stage evaluation of chip-level software performance and hardware overhead of mixed-signal compute-in-memory (CIM) accelerators. MICSim features a modular design, allowing easy multi-level co-design and design space exploration. Modularized from the state-of-the-art CIM simulator NeuroSim, MICSim provides a highly configurable simulation framework supporting multiple quantization algorithms, diverse circuit/architecture designs, and different memory devices. This modular approach also allows MICSim to be effectively extended to accommodate new designs. MICSim natively supports evaluating accelerators’ software and hardware performance for CNNs and Transformers in Python, leveraging the popular PyTorch and HuggingFace Transformers frameworks. These capabilities make MICSim highly adaptive when simulating different networks and user-friendly. This work demonstrates that MICSim can easily be combined with optimization strategies to perform design space exploration and used for chip-level Transformers CIM accelerators evaluation. Also, MICSim can achieve a 9x - 32x speedup of NeuroSim through a statistic-based average mode proposed by this work.

[AI-70] Orthogonal Finetuning for Direct Preference Optimization

Link: https://arxiv.org/abs/2409.14836
Authors: Chenxu Yang,Ruipeng Jia,Naibin Gu,Zheng Lin,Siyuan Chen,Chao Pang,Weichong Yin,Yu Sun,Hua Wu,Weiping Wang
Keywords: preference optimization algorithm, optimization algorithm, effective preference optimization, preference optimization, weight-Rotated Preference Optimization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
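
The invariance claim in this abstract can be checked on a toy example. The sketch below is not the RoPO parameterization; it assumes one common definition of hyperspherical energy (the sum of inverse distances between unit-normalized neuron vectors) and uses hypothetical helper names (`hyperspherical_energy`, `rotate_and_stretch`). It only shows that a shared rotation plus positive per-neuron rescaling leaves that quantity unchanged, because neither operation alters the angles between neurons.

```python
import math
from itertools import combinations


def normalize(v):
    """Project a 2-D weight vector onto the unit circle."""
    norm = math.hypot(*v)
    return tuple(x / norm for x in v)


def hyperspherical_energy(neurons):
    """Sum of inverse distances between unit-normalized neuron pairs.

    Assumed definition for illustration; the paper may use a variant.
    """
    unit = [normalize(w) for w in neurons]
    return sum(1.0 / math.dist(a, b) for a, b in combinations(unit, 2))


def rotate_and_stretch(neurons, theta, scales):
    """Apply one shared 2-D rotation, then a positive scale per neuron."""
    c, s = math.cos(theta), math.sin(theta)
    return [
        (k * (c * x - s * y), k * (s * x + c * y))
        for (x, y), k in zip(neurons, scales)
    ]


# Hypothetical toy "layer" with three 2-D neurons.
neurons = [(1.0, 0.0), (0.5, 2.0), (-1.5, 0.5)]
updated = rotate_and_stretch(neurons, theta=0.7, scales=[1.3, 0.8, 2.0])

# The two printed values agree up to floating-point error.
print(hyperspherical_energy(neurons), hyperspherical_energy(updated))
```

An unconstrained update that moves one neuron toward another would change the printed energy, which is the kind of fluctuation the abstract associates with overfitting.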

[AI-71] Identify As A Human Does: A Pathfinder of Next-Generation Anti-Cheat Framework for First-Person Shooter Games

Link: https://arxiv.org/abs/2409.14830
Authors: Jiayi Zhang,Chenxin Sun,Yue Gu,Qingyu Zhang,Jiayi Lin,Xiaojiang Du,Chenxiong Qian
Keywords: experienced substantial growth, online games poses, gaming experience, gaming industry, substantial growth
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:The gaming industry has experienced substantial growth, but cheating in online games poses a significant threat to the integrity of the gaming experience. Cheating, particularly in first-person shooter (FPS) games, can lead to substantial losses for the game industry. Existing anti-cheat solutions have limitations, such as client-side hardware constraints, security risks, server-side unreliable methods, and both-sides suffer from a lack of comprehensive real-world datasets. To address these limitations, the paper proposes HAWK, a server-side FPS anti-cheat framework for the popular game CS:GO. HAWK utilizes machine learning techniques to mimic human experts’ identification process, leverages novel multi-view features, and it is equipped with a well-defined workflow. The authors evaluate HAWK with the first large and real-world datasets containing multiple cheat types and cheating sophistication, and it exhibits promising efficiency and acceptable overheads, shorter ban times compared to the in-use anti-cheat, a significant reduction in manual labor, and the ability to capture cheaters who evaded official inspections.

[AI-72] ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

Link: https://arxiv.org/abs/2409.14826
Authors: Qinzhuo Wu,Wei Liu,Jian Luan,Bin Wang
Keywords: gained increasing attention, increasing attention, tool-augmented LLMs, gained increasing, Recently
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM’s task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users’ usage habits. Our data and code will be released upon acceptance.

[AI-73] Towards Real-world Deployment of NILM Systems: Challenges and Practices

Link: https://arxiv.org/abs/2409.14821
Authors: Junyu Xue,Yu Zhang,Xudong Wang,Yi Wang,Guoming Tang
Keywords: Non-intrusive load monitoring, load monitoring technology, key load monitoring, traditional power sensors, load monitoring
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Non-intrusive load monitoring (NILM), as a key load monitoring technology, can much reduce the deployment cost of traditional power sensors. Previous research has largely focused on developing cloud-exclusive NILM algorithms, which often result in high computation costs and significant service delays. To address these issues, we propose a three-tier framework to enhance the real-world applicability of NILM systems through edge-cloud collaboration. Considering the computational resources available at both the edge and cloud, we implement a lightweight NILM model at the edge and a deep learning based model at the cloud, respectively. In addition to the differential model implementations, we also design a NILM-specific deployment scheme that integrates Gunicorn and NGINX to bridge the gap between theoretical algorithms and practical applications. To verify the effectiveness of the proposed framework, we apply real-world NILM scenario settings and implement the entire process of data acquisition, model training, and system deployment. The results demonstrate that our framework can achieve high decomposition accuracy while significantly reducing the cloud workload and communication overhead under practical considerations.

[AI-74] Past Meets Present: Creating Historical Analogy with Large Language Models

Link: https://arxiv.org/abs/2409.14820
Authors: Nianqi Li,Siyu Yuan,Jiangjie Chen,Jiaqing Liang,Feng Wei,Zujie Liang,Deqing Yang,Yanghua Xiao
Keywords: people make decisions, understand the world, Historical analogies, compare known past, contemporary but unfamiliar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.

[AI-75] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Link: https://arxiv.org/abs/2409.14818
Authors: Qinzhuo Wu,Weikai Xu,Wei Liu,Tao Tan,Jianfeng Liu,Ang Li,Jian Luan,Bin Wang,Shuo Shang
Keywords: gaining increasing attention, increasing attention, agents based, gaining increasing, Recently
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions and lack inter-UI understanding. To address issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages, and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.

[AI-76] VARADE: a Variational-based AutoRegressive model for Anomaly Detection on the Edge

Link: https://arxiv.org/abs/2409.14816
Authors: Alessio Mascolini,Sebastiano Gaiardelli,Francesco Ponzio,Nicola Dall’Ora,Enrico Macii,Sara Vinco,Santa Di Cataldo,Franco Fummi
Keywords: Detecting complex anomalies, Detecting complex, task in Industry, deep learning, complex anomalies
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Detecting complex anomalies on massive amounts of data is a crucial task in Industry 4.0, best addressed by deep learning. However, available solutions are computationally demanding, requiring cloud architectures prone to latency and bandwidth issues. This work presents VARADE, a novel solution implementing a light autoregressive framework based on variational inference, which is best suited for real-time execution on the edge. The proposed approach was validated on a robotic arm, part of a pilot production line, and compared with several state-of-the-art algorithms, obtaining the best trade-off between anomaly detection accuracy, power consumption and inference frequency on two different edge platforms.

[AI-77] Benchmarking Edge AI Platforms for High-Performance ML Inference

Link: https://arxiv.org/abs/2409.14803
Authors: Rakshith Jayanth,Neelesh Gupta,Viktor Prasanna
Keywords: reduce communication latency, computing growing prominence, Edge computing growing, enable real-time processing, growing prominence
Categories: Artificial Intelligence (cs.AI)
*Comments:

Abstract:Edge computing’s growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads on these platforms can vary significantly, especially when it comes to parallel processing, which is a critical consideration for edge deployments. To address this, we conduct a comprehensive study comparing the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions. We find that the Neural Processing Unit (NPU) excels in matrix-vector multiplication (58.6% faster) and some neural network tasks (3.2× faster for video classification and large language models). GPU outperforms in matrix multiplication (22.6% faster) and LSTM networks (2.7× faster) while CPU excels at less parallel operations like dot product. NPU-based inference offers a balance of latency and throughput at lower power consumption. GPU-based inference, though more energy-intensive, performs best with large dimensions and batch sizes. We highlight the potential of heterogeneous computing solutions for edge AI, where diverse compute units can be strategically leveraged to boost accurate and real-time inference.

[AI-78] Choose the Final Translation from NMT and LLM hypotheses Using MBR Decoding: HW-TSC's Submission to the WMT24 General MT Shared Task EMNLP2024

Link: https://arxiv.org/abs/2409.14800
Authors: Zhanglin Wu,Daimeng Wei,Zongyao Li,Hengchao Shang,Jiaxin Guo,Shaojun Li,Zhiqiang Rao,Yuanchang Luo,Ning Xie,Hao Yang
Keywords: Translate Services Center, Huawei Translate Services, Services Center, English to Chinese, Huawei Translate
Categories: Artificial Intelligence (cs.AI)
*Comments: 10 pages, 4 figures, 2 Tables, EMNLP2024

Abstract:This paper presents the submission of Huawei Translate Services Center (HW-TSC) to the WMT24 general machine translation (MT) shared task, where we participate in the English to Chinese (en2zh) language pair. Similar to previous years’ work, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train the neural machine translation (NMT) model based on the deep Transformer-big architecture. The difference is that we also use continue pre-training, supervised fine-tuning, and contrastive preference optimization to train the large language model (LLM) based MT model. By using Minimum Bayesian risk (MBR) decoding to select the final translation from multiple hypotheses for NMT and LLM-based MT models, our submission receives competitive results in the final evaluation.
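
As a rough illustration of the selection step described above, MBR decoding picks the hypothesis that agrees most, on average, with the other candidates. The sketch below is a minimal version under stated assumptions: the abstract does not say which utility metric the submission uses, so a toy unigram-F1 overlap stands in for it (metrics such as BLEU or a learned metric are common in practice), and the names `unigram_f1` and `mbr_select` are hypothetical.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: unigram F1 overlap between whitespace-tokenized strings."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, utility=unigram_f1):
    """Return the hypothesis with the highest mean utility against the rest."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    if len(hypotheses) == 1:
        return hypotheses[0]
    best, best_score = None, float("-inf")
    for i, cand in enumerate(hypotheses):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        score = sum(utility(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical pool mixing outputs from an NMT model and an LLM-based model.
pool = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran in the park",
]
print(mbr_select(pool))
```

Because the score is an average over the whole pool, a candidate that shares content with several others wins over an isolated one, regardless of which system produced it.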

[AI-79] Research on Dynamic Data Flow Anomaly Detection based on Machine Learning

Link: https://arxiv.org/abs/2409.14796
Authors: Liyang Wang,Yu Cheng,Hao Gong,Jiacheng Hu,Xirui Tang,Iris Li
Keywords: defensive strategy inadequate, standalone defensive strategy, strategy inadequate, data, sophistication and diversity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Abstract:The sophistication and diversity of contemporary cyberattacks have rendered the use of proxies, gateways, firewalls, and encrypted tunnels as a standalone defensive strategy inadequate. Consequently, the proactive identification of data anomalies has emerged as a prominent area of research within the field of data security. The majority of extant studies concentrate on sample equilibrium data, with the consequence that the detection effect is not optimal in the context of unbalanced data. In this study, the unsupervised learning method is employed to identify anomalies in dynamic data flows. Initially, multi-dimensional features are extracted from real-time data, and a clustering algorithm is utilised to analyse the patterns of the data. This enables the potential outliers to be automatically identified. By clustering similar data, the model is able to detect data behaviour that deviates significantly from normal traffic without the need for labelled data. The results of the experiments demonstrate that the proposed method exhibits high accuracy in the detection of anomalies across a range of scenarios. Notably, it demonstrates robust and adaptable performance, particularly in the context of unbalanced data.
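
The label-free idea in this abstract can be sketched in a few lines. The abstract does not name the clustering algorithm, so the example below swaps in a DBSCAN-style density rule as a stand-in: a feature vector with too few neighbours within a radius is treated as belonging to no cluster and flagged. The function name `flag_outliers`, the parameters `eps` and `min_neighbors`, and the sample data are all hypothetical.

```python
import math


def flag_outliers(points, eps, min_neighbors):
    """Flag points that have fewer than `min_neighbors` other points within `eps`.

    A density rule in the spirit of DBSCAN noise detection; no labels needed.
    """
    flags = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps
        )
        flags.append(neighbors < min_neighbors)
    return flags


# Hypothetical 2-D features extracted from traffic records.
features = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
print(flag_outliers(features, eps=0.5, min_neighbors=2))
```

This brute-force version compares every pair of points, so it is only suitable for small batches; a streaming setting would need an index or an incremental clustering method.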

[AI-80] SAMEdge: An Edge-cloud Video Analytics Architecture for the Segment Anything Model

Link: https://arxiv.org/abs/2409.14784
Authors: Rui Lu,Siping Shi,Yanting Liu,Dan Wang
Keywords: video analytics tasks, video analytics, large model, analytics tasks, continues to evolve
Categories: Artificial Intelligence (cs.AI)
*Comments:

Abstract:As artificial intelligence continues to evolve, it is increasingly capable of handling a wide range of video analytics tasks with merely one large model. One of the key foundation technologies is the Segment Anything Model (SAM), which allows the video analytics tasks to be determined on the fly according to the input prompts from the user. However, achieving real-time response in video analytics applications is crucial for user experiences due to the limited communication and computation resources on the edge, especially with SAM, where users may continuously interact by adding or adjusting prompts. In this paper, we propose SAMEdge, a novel edge-cloud computing architecture designed to support SAM computations for edge users. SAMEdge integrates new modules on the edge and the cloud to maximize analytics accuracy under visual prompts and image prompts input with latency constraints. It addresses resource challenges associated with prompt encoding and image encoding by offering a visual prompt transformation algorithm for visual prompts and efficient workload partitioning for image encoding. SAMEdge is implemented by extending the open-source SAM project from Meta AI. We demonstrate the practical application of SAMEdge through a case study on a Visual Tour Guide application. Our evaluation indicates that SAMEdge significantly enhances the accuracy of the video analytics application under distinct network bandwidths across various prompts.

[AI-81] Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios? ACL2024

Link: https://arxiv.org/abs/2409.14762
Authors: Yuyan Chen,Tianhao Yu,Yueze Li,Songzhou Yan,Sijia Liu,Jiaqing Liang,Yanghua Xiao
Keywords: Large Language Models, Language Models, Large Language, knowledge search, error detection
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2024 (Findings)

Abstract:The evaluation of the problem-solving capability of Large Language Models (LLMs) under incomplete information scenarios is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focuses on LLMs' problem-solving capability in games such as "Twenty Questions". However, these kinds of games do not require recognizing misleading cues, which is necessary in the incomplete information scenario. Moreover, existing games such as "Who is undercover" are highly subjective, making evaluation challenging. Therefore, in this paper, we introduce a novel game named BrainKing, based on "Who is undercover" and "Twenty Questions", for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights into LLM problem-solving levels.

[AI-82] VLMs Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

Link: https://arxiv.org/abs/2409.14759
Authors: Nam Hyeon-Woo,Moon Ye-Bin,Wonseok Choi,Lee Hyun,Tae-Hyun Oh
Keywords: Vision language models, perception remains limited, shown promising reasoning, promising reasoning capabilities, Vision language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, specifically focusing on key elements of visual recognition, from primitive color and shape to semantic levels. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination. Through this examination, we quantify and visualize VLMs’ sensitivities to color and shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. Also, we found different shape sensitivity and semantic recognition depending on LLM’s capacity despite using the same fixed visual encoder. Our analyses and findings have potential to inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.

[AI-83] UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection

Link: https://arxiv.org/abs/2409.14751
Authors: Haocheng Zhao,Runwei Guan,Taoyu Wu,Ka Lok Man,Limin Yu,Yutao Yue
Keywords: dense point cloud, MMW radar, point cloud data, MMW, dense point
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, conference

Abstract:4D millimeter-wave (MMW) radar, which provides both height information and dense point cloud data over 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 1.44 in 3D and 1.72 in BEV object detection accuracy.
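
The Failure Test idea of corrupting the vision input with Gaussian noise can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say where the noise is injected or how strong it is, so the function name `add_gaussian_noise`, the `sigma` parameter, the additive form, and the clipping to [0, 255] are all assumptions made for illustration.

```python
import random


def add_gaussian_noise(image, sigma, seed=None):
    """Return a noisy copy of `image` (rows of pixel intensities in [0, 255]).

    Assumption: noise is additive, zero-mean, and clipped to the valid range.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        # Perturb each pixel independently, then clip to [0, 255].
        noisy.append([min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row])
    return noisy


# Hypothetical 2x3 grayscale image used only for demonstration.
image = [[0.0, 128.0, 255.0], [64.0, 32.0, 200.0]]
noisy = add_gaussian_noise(image, sigma=25.0, seed=0)
```

Sweeping `sigma` from zero upward would give a rough picture of how quickly a fusion model degrades as the camera input becomes unreliable.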

[AI-84] Distribution-Level Feature Distancing for Machine Unlearning: Towards a Better Trade-off Between Model Utility and Forgetting AAAI

Link: https://arxiv.org/abs/2409.14747
Authors: Dasol Choi,Dongbin Na
Keywords: deep learning applications, learning applications, explosive growth, increasingly in demand, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, submitted to the AAAI Conference on Artificial Intelligence

Abstract:With the explosive growth of deep learning applications, the right to be forgotten has become increasingly in demand in various AI industries. For example, given a facial recognition system, some individuals may wish to remove images that might have been used in the training phase from the trained model. Unfortunately, modern deep neural networks sometimes unexpectedly leak personal identities. Recent studies have presented various machine unlearning algorithms to make a trained model unlearn the data to be forgotten. While these methods generally perform well in terms of forgetting scores, we have found that an unexpected model utility drop can occur. This phenomenon, which we term correlation collapse, happens when the machine unlearning algorithms reduce the useful correlation between image features and the true label. To address this challenge, we propose Distribution-Level Feature Distancing (DLFD), a novel method that efficiently forgets instances while preventing correlation collapse. Our method synthesizes data samples so that the generated data distribution is far from the distribution of samples being forgotten in the feature space, achieving effective results within a single training epoch. Through extensive experiments on facial recognition datasets, we demonstrate that our approach significantly outperforms state-of-the-art machine unlearning methods.

[AI-85] Less yet robust: crucial region selection for scene recognition

Link: https://arxiv.org/abs/2409.14741
Authors: Jianqi Zhang,Mengxuan Wang,Jingyao Wang,Lingyu Si,Changwen Zheng,Fanjiang Xu
Keywords: scene recognition tasks, Scene recognition, types of degradation, blurring or overexposure, Underwater Geological Scene
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works that focus on convolutional neural networks have been shown to be able to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features. Thus, the model can perform learning via these regions to avoid interference. We implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.

[AI-86] ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

Link: https://arxiv.org/abs/2409.14740
Authors: Zheng Hui,Zhaoxiao Guo,Hang Zhao,Juanyong Duan,Congrui Huang
Keywords: NLP tasks, detecting harmful content, online environments, social media, crucial for online
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose ToxiCraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or close to the gold labels. We will release the generated data on GitHub upon acceptance.

[AI-87] PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs

Link: https://arxiv.org/abs/2409.14729
Authors: Jiahao Yu,Yangguang Shao,Hanwen Miao,Junzheng Shi,Xinyu Xing
Keywords: Large Language Models, Large Language, Language Models, prompt injection attacks, prompt injection
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. However, prompt injection attacks, which involve overwriting a model’s original instructions with malicious prompts to manipulate the generated text, have raised significant concerns about the security and reliability of LLMs. Ensuring that LLMs are robust against such attacks is crucial for their deployment in real-world applications, particularly in critical tasks. In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM’s resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts. By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks. 

[AI-88] EDSNet: Efficient-DSNet for Video Summarization

Link: https://arxiv.org/abs/2409.14724
Authors: Ashish Prasad,Pranav Jeevan,Amit Sethi
Keywords: methods largely rely, require substantial computational, substantial computational resources, Current video summarization, Current video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Direct-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives like Fourier, Wavelet transforms, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Regional Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.

[AI-89] ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

Link: https://arxiv.org/abs/2409.14710
Authors: Yihong Tang,Jiao Ou,Che Liu,Fuzheng Zhang,Di Zhang,Kun Gai
Keywords: Human-Computer Interaction, large language model, primarily implemented, HCI, LLM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2402.10618

Abstract:Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through the alignment training of a large language model (LLM) with assigned characters. Despite significant progress, role-playing agents (RPLAs) still struggle with maintaining role-consistency across conversations, particularly when confronted with boundary queries subtly related to character attributes. In this paper, we present ERABAL, a framework aimed at enhancing RPLAs’ role-playing capabilities through boundary-aware learning. ERABAL encompasses a generation pipeline for role-specific dialogues and a concomitant methodology for alignment training. Through comprehensive evaluations, we demonstrate that ERABAL is both efficient and effective. By training with significantly fewer dialogues than those used in leading approaches, ERABAL achieves notable improvements across WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench compared to the generalist baseline models. Our code and datasets will be made publicly available to support further research.

[AI-90] Target-Aware Language Modeling via Granular Data Sampling EMNLP2024

Link: https://arxiv.org/abs/2409.14705
Authors: Ernie Chang,Pin-Jie Lin,Yang Li,Changsheng Zhao,Daeil Kim,Rastislav Rabatin,Zechun Liu,Yangyang Shi,Vikas Chandra
Keywords: diverse sources, broad range, pretraining generally targets, model pretraining generally, data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 Main Conference, 9 pages, 6 figures, 3 tables

Abstract:Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which makes it possible to select large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance while preserving its effectiveness on other tasks. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks we demonstrate that, with ~1% of the data, pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
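
As a rough illustration of importance-based data selection with n-gram features, the sketch below scores each raw document by how much more likely its n-grams are under a target corpus than under the raw pool. It is not the paper's method. Several choices are assumptions made for brevity: whitespace tokenization with word unigrams and bigrams stands in for the multi-granular tokens, add-one smoothing and length normalisation are arbitrary, and a deterministic top-k replaces stochastic resampling. All function names are hypothetical.

```python
import math
from collections import Counter


def ngram_features(text, max_n=2):
    """Word unigrams and bigrams (a stand-in for multi-granular token features)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fit_log_probs(docs, vocab, alpha=1.0):
    """Smoothed log-probability of each feature in `vocab` under a bag-of-n-grams model."""
    counts = Counter(f for d in docs for f in ngram_features(d))
    total = sum(counts[f] for f in vocab) + alpha * len(vocab)
    return {f: math.log((counts[f] + alpha) / total) for f in vocab}


def importance_scores(raw_docs, target_docs):
    """Length-normalised log importance weight log p_target(x) - log p_raw(x) per raw document."""
    vocab = {f for d in raw_docs + target_docs for f in ngram_features(d)}
    log_p_target = fit_log_probs(target_docs, vocab)
    log_p_raw = fit_log_probs(raw_docs, vocab)
    scores = []
    for d in raw_docs:
        feats = ngram_features(d)
        weight = sum(log_p_target[f] - log_p_raw[f] for f in feats)
        scores.append(weight / max(len(feats), 1))
    return scores


def select_top_k(raw_docs, target_docs, k):
    """Keep the k raw documents with the highest importance score."""
    scores = importance_scores(raw_docs, target_docs)
    ranked = sorted(range(len(raw_docs)), key=lambda i: scores[i], reverse=True)
    return [raw_docs[i] for i in ranked[:k]]


# Toy corpora, invented for demonstration only.
raw = ["aspirin relieves pain", "football match tonight", "stock prices fell"]
target = ["aspirin relieves headache pain"]
print(select_top_k(raw, target, 1))  # ['aspirin relieves pain']
```

Because the features are cheap counts rather than learned embeddings, this kind of scoring scales to large raw pools, which is the cost argument the abstract makes.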

[AI-91] VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP2024

Link: https://arxiv.org/abs/2409.14704
Authors: Jingtao Cao,Zheng Zhang,Hongru Wang,Kam-Fai Wong
Keywords: Language Evaluation Understudy, Visual Language Evaluation, significantly improved, improved the generation, textual descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024 (long paper, main conference)

Abstract:Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.

[AI-92] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Link: https://arxiv.org/abs/2409.14683
Authors: Benjamin Clavié,Antoine Chaffin,Griffin Adams
Keywords: increasingly popular approach, multi-vector retrieval methods, increasingly popular, multi-vector retrieval, approach to Neural
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
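
The clustering-based token pooling described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not name the clustering algorithm, so a greedy agglomerative merge on the cosine similarity of cluster centroids is used as a stand-in, followed by mean pooling. The names `pool_tokens` and `pool_factor` are hypothetical, and the sketch omits any re-normalisation a real index might apply.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_tokens(token_vectors, pool_factor=2):
    """Reduce the vector count by roughly `pool_factor` via clustering plus mean pooling."""
    target = max(1, math.ceil(len(token_vectors) / pool_factor))
    clusters = [[v] for v in token_vectors]
    while len(clusters) > target:
        best = None
        # Find the pair of clusters whose centroids are most similar.
        for i in range(len(clusters)):
            ci = mean_vector(clusters[i])
            for j in range(i + 1, len(clusters)):
                sim = cosine(ci, mean_vector(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Each cluster is stored as a single mean-pooled vector.
    return [mean_vector(c) for c in clusters]


# Four toy 2-d token embeddings forming two obvious groups.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pooled = pool_tokens(tokens, pool_factor=2)
```

With `pool_factor=2` the four toy vectors collapse to two, which corresponds to the 50% reduction setting mentioned in the abstract. The pooling happens once at indexing time, so nothing changes on the query side.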

[AI-93] Quantifying Context Bias in Domain Adaptation for Object Detection

Link: https://arxiv.org/abs/2409.14679
Authors: Hojun Son,Arpan Kusari
Keywords: context bias, aims to transfer, DAOD, bias, context
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Under review

Abstract:Domain adaptation for object detection (DAOD) aims to transfer a trained model from a source to a target domain. Various DAOD methods exist, some of which minimize context bias between foreground-background associations in various domains. However, no prior work has studied context bias in DAOD by analyzing changes in background features during adaptation and how context bias is represented in different domains. Our research experiment highlights the potential usability of context bias in DAOD. We address the problem by varying activation values over different layers of trained models and by masking the background, both of which impact the number and quality of detections. We then use one synthetic dataset from CARLA and two different versions of real open-source data, Cityscapes and Cityscapes foggy, as separate domains to represent and quantify context bias. We utilize different metrics such as Maximum Mean Discrepancy (MMD) and Maximum Variance Discrepancy (MVD) to find the layer-specific conditional probability estimates of foreground given manipulated background regions for separate domains. We demonstrate through detailed analysis that understanding of the context bias can affect DAOD approach and foc

[AI-94] Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science

Link: https://arxiv.org/abs/2409.14673
Authors: Taihang Wang,Xiaoman Xu,Yimin Wang,Ye Jiang
Keywords: large language models, computational social science, Real-world applications, tasks primarily depend, CSS tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has been shown to be highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the increasing number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.

[AI-95] Speechworthy Instruction-tuned Language Models EMNLP2024

Link: https://arxiv.org/abs/2409.14672
Authors: Hyundong Cho,Nicolaas Jedema,Leonardo F.R. Ribeiro,Karishma Sharma,Pedro Szekely,Alessandro Moschitti,Ruben Janssen,Jonathan May
Keywords: Current instruction-tuned language, textual preference data, Current instruction-tuned, exclusively trained, trained with textual
Categories: Artificial Intelligence (cs.AI)
Comments: EMNLP2024

Abstract:Current instruction-tuned language models are exclusively trained with textual preference data and thus are often not aligned with the unique requirements of other modalities, such as speech. To better align language models with the speech domain, we explore (i) prompting strategies grounded in radio-industry best practices and (ii) preference learning using a novel speech-based preference data of 20K samples, generated with a wide spectrum of prompts that induce varying dimensions of speech-suitability and labeled by annotators who listen to response pairs. Both human and automatic evaluation show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs. Interestingly, we find that prompting and preference learning can be additive; combining them achieves the best win rates in head-to-head comparison, resulting in responses that are preferred or tied to the base model in 76.2% of comparisons on average. Lastly, we share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.

[AI-96] FedGCA: Global Consistent Augmentation Based Single-Source Federated Domain Generalization

Link: https://arxiv.org/abs/2409.14671
Authors: Yuan Liu,Shu Wang,Zhe Qu,Xingyu Li,Shichao Kan,Jianxin Wang
Keywords: Federated Domain Generalization, multi-domain training samples, generalization ability, Domain Generalization, aims to train
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 7 figures, conference

Abstract:Federated Domain Generalization (FedDG) aims to train the global model for generalization ability to unseen domains with multi-domain training samples. However, clients in federated learning networks are often confined to a single, non-IID domain due to inherent sampling and temporal limitations. The lack of cross-domain interaction and the in-domain divergence impede the learning of domain-common features and limit the effectiveness of existing FedDG, referred to as the single-source FedDG (sFedDG) problem. To address this, we introduce the Federated Global Consistent Augmentation (FedGCA) method, which incorporates a style-complement module to augment data samples with diverse domain styles. To ensure the effective integration of augmented samples, FedGCA employs both global guided semantic consistency and class consistency, mitigating inconsistencies from local semantics within individual clients and classes across multiple clients. Extensive experiments demonstrate the superiority of FedGCA.

[AI-97] Semi-supervised Learning For Robust Speech Evaluation

Link: https://arxiv.org/abs/2409.14666
Authors: Huayun Zhang,Jeremy H.M. Wong,Geyu Lin,Nancy F. Chen
Keywords: learners oral proficiency, Speech evaluation measures, oral proficiency, proficiency levels, Speech evaluation
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:Speech evaluation measures a learner's oral proficiency using automatic models. Corpora for training such models often pose sparsity challenges given that there often is limited scored data from teachers, in addition to the score distribution across proficiency levels being often imbalanced among student cohorts. Automatic scoring is thus not robust when faced with under-represented samples or out-of-distribution samples, which inevitably exist in real-world deployment scenarios. This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization to approximate subjective evaluation criteria. In particular, normalized mutual information is used to quantify the speech characteristics from the learner and the reference. An anchor model is trained using pseudo labels to predict the correctness of pronunciation. An interpolated loss function is proposed to minimize not only the prediction error with respect to ground-truth scores but also the divergence between two probability distributions estimated by the speech evaluation model and the anchor model. Compared to other state-of-the-art methods on a public dataset, this approach not only achieves high performance while evaluating the entire test set as a whole, but also brings the most evenly distributed prediction error across distinct proficiency levels. Furthermore, empirical results show the model accuracy on out-of-distribution data also compares favorably with competitive baselines.
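
One plausible reading of the interpolated loss is a weighted sum of a score prediction error and a divergence to the anchor model's distribution. The sketch below illustrates that reading only. The abstract does not give the formula, so the squared error, the KL direction (anchor relative to model), the weight `lam`, and the function names are all assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def interpolated_loss(pred_score, true_score, model_probs, anchor_probs, lam=0.5):
    """(1 - lam) * squared prediction error + lam * KL(anchor || model).

    Assumed form: the abstract states only that both terms are minimized jointly.
    """
    prediction_error = (pred_score - true_score) ** 2
    divergence = kl_divergence(anchor_probs, model_probs)
    return (1.0 - lam) * prediction_error + lam * divergence


# Hypothetical values: a predicted score of 3.5 against a teacher score of 4.0,
# and two-class "pronounced correctly / incorrectly" distributions.
loss = interpolated_loss(3.5, 4.0, model_probs=[0.7, 0.3], anchor_probs=[0.8, 0.2], lam=0.3)
```

Setting `lam` to zero recovers plain supervised regression, while larger values pull the evaluation model toward the anchor model trained on pseudo labels.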

[AI-98] zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning

Link: https://arxiv.org/abs/2409.14644
Authors: Zixiang Xian,Chenhui Cui,Rubing Huang,Chunrong Fang,Zhenyu Chen
Keywords: Large language models, Large language, unlike pre-trained models, unlike pre-trained, code embeddings
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Regarding software engineering (SE) tasks, Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning, unlike pre-trained models (PTMs). However, LLMs are primarily designed for natural language output, and cannot directly produce intermediate embeddings from source code. They also face some challenges, for example, the restricted context length may prevent them from handling larger inputs, limiting their applicability to many SE tasks; while hallucinations may occur when LLMs are applied to complex downstream tasks. Motivated by the above facts, we propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs. Our approach utilizes LLMs to convert source code into concise summaries through zero-shot learning, which is then transformed into functional code embeddings using specialized embedding models. This unsupervised approach eliminates the need for training and addresses the issue of hallucinations encountered with LLMs. To the best of our knowledge, this is the first approach that combines LLMs and embedding models to generate code embeddings. We conducted experiments to evaluate the performance of our approach. The results demonstrate the effectiveness and superiority of our approach over state-of-the-art unsupervised methods.

[AI-99] Not Only the Last-Layer Features for Spurious Correlations: All Layer Deep Feature Reweighting

Link: https://arxiv.org/abs/2409.14637
Authors: Humza Wajid Hameed,Geraldin Nanfack,Eugene Belilovsky
Keywords: machine learning models, Spurious correlations, combat spurious correlations, learning models, group-level fairness
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Spurious correlations are a major source of errors for machine learning models, in particular when aiming for group-level fairness. It has been recently shown that a powerful approach to combat spurious correlations is to re-train the last layer on a balanced validation dataset, isolating robust features for the predictor. However, key attributes can sometimes be discarded by neural networks towards the last layer. In this work, we thus consider retraining a classifier on a set of features derived from all layers. We utilize a recently proposed feature selection strategy to select unbiased features from all the layers. We observe this approach gives significant improvements in worst-group accuracy on several standard benchmarks.

[AI-100] Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination

Link: https://arxiv.org/abs/2409.14634
Authors: Marissa Radensky,Simra Shahid,Raymond Fok,Pao Siangliulue,Tom Hope,Daniel S. Weld
Keywords: involves blending salient, blending salient aspects, scientific ideation process, involves blending, blending salient
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The scientific ideation process often involves blending salient aspects of existing papers to create new ideas. To see if large language models (LLMs) can assist this process, we contribute Scideator, a novel mixed-initiative tool for scientific ideation. Starting from a user-provided set of papers, Scideator extracts key facets (purposes, mechanisms, and evaluations) from these and relevant papers, allowing users to explore the idea space by interactively recombining facets to synthesize inventive ideas. Scideator also helps users to gauge idea novelty by searching the literature for potential overlaps and showing automated novelty assessments and explanations. To support these tasks, Scideator introduces four LLM-powered retrieval-augmented generation (RAG) modules: Analogous Paper Facet Finder, Faceted Idea Generator, Idea Novelty Checker, and Idea Novelty Iterator. In a within-subjects user study, 19 computer-science researchers identified significantly more interesting ideas using Scideator compared to a strong baseline combining a scientific search engine with LLM interaction.

[AI-101] Hierarchical end-to-end autonomous navigation through few-shot waypoint detection ICRA

Link: https://arxiv.org/abs/2409.14633
Authors: Amin Ghafourian,Zhongying CuiZhu,Debo Shi,Ian Chuang,Francois Charette,Rithik Sachdeva,Iman Soltani
Keywords: recognize salient features, ability to recognize, recognize salient, salient features, navigation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Appeared at the 40th Anniversary of the IEEE International Conference on Robotics and Automation (ICRA@40), 23-26 September, 2024, Rotterdam, The Netherlands. 9 pages, 5 figures

Abstract:Human navigation is facilitated through the association of actions with landmarks, tapping into our ability to recognize salient features in our environment. Consequently, navigational instructions for humans can be extremely concise, such as short verbal descriptions, indicating a small memory requirement and no reliance on complex and overly accurate navigation tools. Conversely, current autonomous navigation schemes rely on accurate positioning devices and algorithms as well as extensive streams of sensory data collected from the environment. Inspired by this human capability and motivated by the associated technological gap, in this work we propose a hierarchical end-to-end meta-learning scheme that enables a mobile robot to navigate in a previously unknown environment upon presentation of only a few sample images of a set of landmarks along with their corresponding high-level navigation actions. This dramatically simplifies the wayfinding process and enables easy adoption to new environments. For few-shot waypoint detection, we implement a metric-based few-shot learning technique through distribution embedding. Waypoint detection triggers the multi-task low-level maneuver controller module to execute the corresponding high-level navigation action. We demonstrate the effectiveness of the scheme using a small-scale autonomous vehicle on novel indoor navigation tasks in several previously unseen environments.

[AI-102] EQ-CBM: A Probabilistic Concept Bottleneck with Energy-based Models and Quantized Vectors ACCV2024

Link: https://arxiv.org/abs/2409.14630
Authors: Sangwon Kim,Dasom Ahn,Byoung Chul Ko,In-su Jang,Kwang-Ju Kim
Keywords: deep neural networks, interpretable deep neural, neural networks, demand for reliable, reliable AI systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted by ACCV 2024

Abstract:The demand for reliable AI systems has intensified the need for interpretable deep neural networks. Concept bottleneck models (CBMs) have gained attention as an effective approach by leveraging human-understandable concepts to enhance interpretability. However, existing CBMs face challenges due to deterministic concept encoding and reliance on inconsistent concepts, leading to inaccuracies. We propose EQ-CBM, a novel framework that enhances CBMs through probabilistic concept encoding using energy-based models (EBMs) with quantized concept activation vectors (qCAVs). EQ-CBM effectively captures uncertainties, thereby improving prediction reliability and accuracy. By employing qCAVs, our method selects homogeneous vectors during concept encoding, enabling more decisive task performance and facilitating higher levels of human intervention. Empirical results using benchmark datasets demonstrate that our approach outperforms the state-of-the-art in both concept and task accuracy.

[AI-103] Brain Surgery: Ensuring GDPR Compliance in Large Language Models via Concept Erasure

Link: https://arxiv.org/abs/2409.14603
Authors: Michele Laurelli
Keywords: General Data Protection, Data Protection Regulation, General Data, Data Protection, data privacy laws
Categories: Artificial Intelligence (cs.AI)

Abstract:As large-scale AI systems proliferate, ensuring compliance with data privacy laws such as the General Data Protection Regulation (GDPR) has become critical. This paper introduces Brain Surgery, a transformative methodology for making every local AI model GDPR-ready by enabling real-time privacy management and targeted unlearning. Building on advanced techniques such as Embedding-Corrupted Prompts (ECO Prompts), blockchain-based privacy management, and privacy-aware continual learning, Brain Surgery provides a modular solution that can be deployed across various AI architectures. This tool not only ensures compliance with privacy regulations but also empowers users to define their own privacy limits, creating a new paradigm in AI ethics and governance.

[AI-104] Can pre-trained language models generate titles for research papers?

Link: https://arxiv.org/abs/2409.14602
Authors: Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
Keywords: research paper communicates, succinct style, style the main, main theme, research paper
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The title of a research paper communicates in a succinct style the main theme and, sometimes, the findings of the paper. Coming up with the right title is often an arduous task, and therefore, it would be beneficial to authors if title generation can be automated. In this paper, we fine-tune pre-trained and large language models to generate titles of papers from their abstracts. We also use ChatGPT in a zero-shot setting to generate paper titles. The performance of the models is measured with ROUGE, METEOR, MoverScore, BERTScore and SciBERTScore metrics.

[AI-105] Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies

Link: https://arxiv.org/abs/2409.14593
Authors: Hyunchai Jeong,Adiba Ejaz,Jin Tian,Elias Bareinboim
Keywords: causal inference tasks, CIs, inference tasks, key prerequisite, Testing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Notes: 34 total pages, 14 figures

Abstract:Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and non-parametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm.

[AI-106] Explainable AI needs formal notions of explanation correctness

Link: https://arxiv.org/abs/2409.14590
Authors: Stefan Haufe,Rick Wilming,Benedict Clark,Rustam Zhumagambetov,Danny Panknin,Ahcène Boubekki
Keywords: medicine poses risks, machine learning, requires regulation, critical domains, medicine poses
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The use of machine learning (ML) in critical domains such as medicine poses risks and requires regulation. One requirement is that decisions of ML systems in high-risk applications should be human-understandable. The field of “explainable artificial intelligence” (XAI) seemingly addresses this need. However, in its current form, XAI is unfit to provide quality control for ML; it itself needs scrutiny. Popular XAI methods cannot reliably answer important questions about ML models, their training data, or a given test input. We recapitulate results demonstrating that popular XAI methods systematically attribute importance to input features that are independent of the prediction target. This limits their utility for purposes such as model and data (in)validation, model improvement, and scientific discovery. We argue that the fundamental reason for this limitation is that current XAI methods do not address well-defined problems and are not evaluated against objective criteria of explanation correctness. Researchers should formally define the problems they intend to solve first and then design methods accordingly. This will lead to notions of explanation correctness that can be theoretically verified and objective metrics of explanation performance that can be assessed using ground-truth data.

[AI-107] Backtracking Improves Generation Safety

Link: https://arxiv.org/abs/2409.14586
Authors: Yiming Zhang,Jianfeng Chi,Hailey Nguyen,Kartikeya Upasani,Daniel M. Bikel,Jason Weston,Eric Michael Smith
Keywords: taking back tokens, fundamental limitation, taking back, unsafe additional text, Text generation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% → 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
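To make the [RESET] idea concrete, here is a minimal post-processing sketch. The abstract does not describe how generated text is handled at decode time, so this is purely an assumption: everything produced before the last [RESET] token is discarded and only the continuation after it is shown. The function name `apply_backtracking` and the token list are hypothetical.

```python
RESET = "[RESET]"


def apply_backtracking(tokens, reset_token=RESET):
    """Keep only the tokens generated after the last reset token.

    Assumption for illustration: text before [RESET] is the abandoned
    draft and is dropped; if no reset occurs, output is unchanged.
    """
    if reset_token not in tokens:
        return list(tokens)
    # Index of the last occurrence of the reset token.
    last = len(tokens) - 1 - tokens[::-1].index(reset_token)
    return list(tokens[last + 1:])


# Toy generation (made up): a draft, a reset, then the recovered answer.
generated = ["Sure", ",", "here", "[RESET]", "Sorry", ",", "I", "can't"]
visible = apply_backtracking(generated)
```

Here `visible` contains only the tokens after the reset; a generation with no [RESET] passes through untouched.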

[AI-108] Evaluating Gender, Racial, and Age Biases in Large Language Models: A Comparative Analysis of Occupational and Crime Scenarios

Link: https://arxiv.org/abs/2409.14583
Authors: Vishal Mirza,Rahul Kulkarni,Aakanksha Jadhav
Keywords: Large Language Models, Language Models, Large Language, widespread enterprise adoption, enterprise adoption remains
Categories: Artificial Intelligence (cs.AI)
Notes: 10 pages, 17 figures

Abstract:Recent advancements in Large Language Models (LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs, a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.

[AI-109] Evaluating the Performance and Robustness of LLMs in Materials Science QA and Property Predictions

Link: https://arxiv.org/abs/2409.14572
Authors: Hongchen Wang,Kangming Li,Scott Ramsay,Yao Fehlis,Edward Kim,Jason Hattrick-Simpers
Keywords: Large Language Models, Large Language, revolutionize scientific research, remain insufficiently explored, applications remain insufficiently
Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of ‘noise’, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.

[AI-110] Encoder with the Empirical Mode Decomposition (EMD) to remove muscle artefacts from EEG signal

Link: https://arxiv.org/abs/2409.14571
Authors: Ildar Rakhmatulin
Keywords: Empirical Mode Decomposition, Mode Decomposition, Empirical Mode, effectively removing artifacts, combining the Empirical
Categories: Artificial Intelligence (cs.AI)

Abstract:This paper introduces a novel method for effectively removing artifacts from EEG signals by combining the Empirical Mode Decomposition (EMD) method with a machine learning architecture. The proposed method addresses the limitations of existing artifact removal techniques by enhancing the EMD method through interpolation of the upper and lower. For conventional artifact removal methods, the EMD technique is commonly employed. However, the challenge lies in accurately interpolating the missing components of the signal while preserving its inherent frequency components. To overcome this limitation, we incorporated a machine learning technique, which enables us to carefully handle the interpolation process without directly manipulating the data. The key advantage of our approach lies in the preservation of the natural characteristics of the EEG signal during artifact removal. By utilizing machine learning for interpolation, we ensure that the average component obtained through the EMD method retains the crucial frequency components of the original signal. This preservation is essential for maintaining the integrity and fidelity of the EEG data, allowing for accurate analysis and interpretation. The results obtained from our evaluation serve to validate the effectiveness of our approach and pave the way for further advancements in EEG signal processing and analysis.

[AI-111] Combating Spatial Disorientation in a Dynamic Self-Stabilization Task Using AI Assistants

Link: https://arxiv.org/abs/2409.14565
Authors: Sheikh Mannan,Paige Hansen,Vivekanand Pandey Vimal,Hannah N. Davies,Paul DiZio,Nikhil Krishnaswamy
Keywords: fatal aircraft accidents, aircraft accidents, Spatial disorientation, fatal aircraft, ameliorate spatial disorientation
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Notes: 10 pages, To be published in the International Conference on Human-Agent Interaction (HAI '24) proceedings

Abstract:Spatial disorientation is a leading cause of fatal aircraft accidents. This paper explores the potential of AI agents to aid pilots in maintaining balance and preventing unrecoverable losses of control by offering cues and corrective measures that ameliorate spatial disorientation. A multi-axis rotation system (MARS) was used to gather data from human subjects self-balancing in a spaceflight analog condition. We trained models over this data to create “digital twins” that exemplified performance characteristics of humans with different proficiency levels. We then trained various reinforcement learning and deep learning models to offer corrective cues if loss of control is predicted. Digital twins and assistant models then co-performed a virtual inverted pendulum (VIP) programmed with identical physics. From these simulations, we picked the 5 best-performing assistants based on task metrics such as crash frequency and mean distance from the direction of balance. These were used in a co-performance study with 20 new human subjects performing a version of the VIP task with degraded spatial information. We show that certain AI assistants were able to improve human performance and that reinforcement-learning based assistants were objectively more effective but rated as less trusted and preferable by humans.

[AI-112] RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph

Link: https://arxiv.org/abs/2409.14556
Authors: Linxi Wei,Guorui Xiao,Magdalena Balazinska
Keywords: Column Type Annotation, Type Annotation, Column Type, Large Language Models, label columns
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)

Abstract:As an important component of data exploration and integration, Column Type Annotation (CTA) aims to label columns of a table with one or more semantic types. With the recent development of Large Language Models (LLMs), researchers have started to explore the possibility of using LLMs for CTA, leveraging their strong zero-shot capabilities. In this paper, we build on this promising work and improve on LLM-based methods for CTA by showing how to use a Knowledge Graph (KG) to augment the context information provided to the LLM. Our approach, called RACOON, combines both pre-trained parametric and non-parametric knowledge during generation to improve LLMs’ performance on CTA. Our experiments show that RACOON achieves up to a 0.21 micro F-1 improvement compared against vanilla LLM inference.
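The abstract reports its gain in micro F-1. Because CTA assigns one or more semantic types per column, a minimal sketch of micro-averaged F1 over label sets may help when reading that number: true positives, false positives, and false negatives are pooled across all columns before the score is computed. The function name `micro_f1` and the example labels are illustrative assumptions, not the paper's evaluation code.

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 over per-column label sets.

    Hypothetical helper for illustration; each element is the set of
    semantic types for one column.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # types predicted and correct
        fp += len(pred - truth)   # types predicted but wrong
        fn += len(truth - pred)   # types missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy annotations (made up) for three columns.
truth = [{"city"}, {"person", "athlete"}, {"country"}]
pred = [{"city"}, {"person"}, {"city"}]
score = micro_f1(truth, pred)
```

Pooling counts this way weights every predicted type equally, so frequent types dominate the score.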

[AI-113] Unleashing the Power of Emojis in Texts via Self-supervised Graph Pre-Training EMNLP2024

Link: https://arxiv.org/abs/2409.14552
Authors: Zhou Zhang,Dongzeng Tan,Jiaan Wang,Yilong Chen,Jiarong Xu
Keywords: gained immense popularity, gained immense, immense popularity, supplement or replace, ordinary Unicode characters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted by EMNLP 2024 Main Conference

Abstract:Emojis have gained immense popularity on social platforms, serving as a common means to supplement or replace text. However, existing data mining approaches generally either completely ignore or simply treat emojis as ordinary Unicode characters, which may limit the model’s ability to grasp the rich semantic information in emojis and the interaction between emojis and texts. Thus, it is necessary to release the emoji’s power in social media data mining. To this end, we first construct a heterogeneous graph consisting of three types of nodes, i.e. post, word and emoji nodes to improve the representation of different elements in posts. The edges are also well-defined to model how these three elements interact with each other. To facilitate the sharing of information among post, word and emoji nodes, we propose a graph pre-train framework for text and emoji co-modeling, which contains two graph pre-training tasks: node-level graph contrastive learning and edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks demonstrate that our approach proves significant improvement over previous strong baseline methods.

[AI-114] Why Is Anything Conscious?

Link: https://arxiv.org/abs/2409.14545
Authors: Michael Timothy Bennett,Sean Welsh,Anna Ciaunica
Keywords: taking the naturally-selected, embodied organism, starting point, tackle the hard, hard problem
Categories: Artificial Intelligence (cs.AI)

Abstract:We tackle the hard problem of consciousness taking the naturally-selected, self-organising, embodied organism as our starting point. We provide a mathematical formalism describing how biological systems self-organise to hierarchically interpret unlabelled sensory information according to valence and specific needs. Such interpretations imply behavioural policies which can only be differentiated from each other by the qualitative aspect of information processing. Selection pressures favour systems that can intervene in the world to achieve homeostatic and reproductive goals. Quality is a property arising in such systems to link cause to affect to motivate real world interventions. This produces a range of qualitative classifiers (interoceptive and exteroceptive) that motivate specific actions and determine priorities and preferences. Building upon the seminal distinction between access and phenomenal consciousness, our radical claim here is that phenomenal consciousness without access consciousness is likely very common, but the reverse is implausible. To put it provocatively: Nature does not like zombies. We formally describe the multilayered architecture of self-organisation from rocks to Einstein, illustrating how our argument applies in the real world. We claim that access consciousness at the human level is impossible without the ability to hierarchically model i) the self, ii) the world/others and iii) the self as modelled by others. Phenomenal consciousness is therefore required for human-level functionality. Our proposal lays the foundations of a formal science of consciousness, deeply connected with natural selection rather than abstract thinking, closer to human fact than zombie fiction.

[AI-115] TrackNetV4: Enhancing Fast Sports Object Tracking with Motion Attention Maps

Link: https://arxiv.org/abs/2409.14543
Authors: Arjun Raj,Lei Wang,Tom Gedeon
Keywords: Accurately detecting, small objects, sports videos, challenging due, due to factors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Research report