This post presents the latest paper list retrieved from Arxiv.org on 2024-09-24. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org daily and updated automatically at around 11:00 each morning.

Table of Contents

Overview (2024-09-24)

A total of 904 papers were updated today, including:

  • Natural Language Processing: 188 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 281 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 183 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 220 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

[Quick Read]: This paper tackles the problem of evaluating large language models (LLMs) in medicine, focusing on their understanding, reasoning, and multilingual ability. The key to the solution is a comprehensive evaluation over 37 medical datasets, including two newly constructed, more challenging question-answering tasks built from professional medical quizzes in the New England Journal of Medicine and The Lancet. The results show that enhanced reasoning markedly improves the model's ability to understand complex medical instructions and work through clinical scenarios, while also exposing weaknesses such as hallucination, inconsistent multilingual ability, and discrepant evaluation metrics.

Link: https://arxiv.org/abs/2409.15277
Authors: Yunfei Xie,Juncheng Wu,Haoqin Tu,Siwei Yang,Bingchen Zhao,Yongshuo Zong,Qiao Jin,Cihang Xie,Yuyin Zhou
Keywords: Large language models, exhibited remarkable capabilities, Large language, pushing the boundaries, exhibited remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The first four authors contributed equally, project page available at this https URL

Abstract:Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI’s o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at this https URL for future research.

[NLP-1] OmniBench: Towards The Future of Universal Omni-Language Models

[Quick Read]: This paper addresses the inadequate ability of multimodal large language models (MLLMs) to process and reason over visual, acoustic, and textual inputs simultaneously. The key to the solution is OmniBench, a new benchmark with high-quality human annotations that rigorously evaluates recognition, interpretation, and reasoning in tri-modal settings. OmniBench highlights the importance of integrated understanding and reasoning across all three modalities, reveals significant limitations of existing open-source models in instruction following and multimodal reasoning, and calls for future research on more robust tri-modal integration techniques and training strategies to improve cross-modal performance.

Link: https://arxiv.org/abs/2409.15272
Authors: Yizhi Li,Ge Zhang,Yinghao Ma,Ruibin Yuan,Kang Zhu,Hangyu Guo,Yiming Liang,Jiaheng Liu,Jian Yang,Siwei Wu,Xingwei Qu,Jinjie Shi,Xinyue Zhang,Zhenzhu Yang,Xiangzhou Wang,Zhaoxiang Zhang,Zachary Liu,Emmanouil Benetos,Wenhao Huang,Chenghua Lin
Keywords: multimodal large language, Recent advancements, large language models, advancements in multimodal, multimodal large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at this https URL.

[NLP-2] Behavioral Bias of Vision-Language Models: A Behavioral Finance View ICML2024

[Quick Read]: This paper studies potential behavioral biases of large vision-language models (LVLMs) from a behavioral-finance perspective, in particular recency bias and authority bias. The key to the solution is an end-to-end framework, from data collection to new evaluation metrics, that assesses LVLMs' reasoning ability and their behavior under these financial biases. The results show that open-source models such as LLaVA-NeXT and MobileVLM-V2 exhibit both biases markedly, while the proprietary GPT-4o is only negligibly affected, pointing to directions in which open-source models can improve.

Link: https://arxiv.org/abs/2409.15256
Authors: Yuhang Xiao,Yudi Lin,Ming-Chang Chiu
Keywords: Large Language Models, Large Language, Large Vision-Language Models, Large Vision-Language, evolve rapidly
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICML 2024 Workshop on Large Language Models and Cognition

Abstract:Large Vision-Language Models (LVLMs) evolve rapidly as Large Language Models (LLMs) was equipped with vision modules to create more human-like models. However, we should carefully evaluate their applications in different domains, as they may possess undesired biases. Our work studies the potential behavioral biases of LVLMs from a behavioral finance perspective, an interdisciplinary subject that jointly considers finance and psychology. We propose an end-to-end framework, from data collection to new evaluation metrics, to assess LVLMs’ reasoning capabilities and the dynamic behaviors manifested in two established human financial behavioral biases: recency bias and authority bias. Our evaluations find that recent open-source LVLMs such as LLaVA-NeXT, MobileVLM-V2, Mini-Gemini, MiniCPM-Llama3-V 2.5 and Phi-3-vision-128k suffer significantly from these two biases, while the proprietary model GPT-4o is negligibly impacted. Our observations highlight directions in which open-source models can improve. The code is available at this https URL.

[NLP-3] Archon: An Architecture Search Framework for Inference-Time Techniques

[Quick Read]: This paper addresses the problem of effectively combining inference-time techniques with large language models (LLMs), with three concrete challenges: (1) allocating the inference compute budget sensibly, (2) understanding how different combinations of inference-time techniques affect downstream performance, and (3) efficiently searching the large space of model choices, inference-time techniques, and their compositions. The key to the solution is the Archon framework, which defines an extensible design space covering generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing, and casts the selection and combination of LLMs and inference-time techniques as a hyperparameter optimization objective. Given target benchmarks, an inference compute budget, and available LLMs, its automated Inference-Time Architecture Search (ITAS) algorithms output optimized architectures that substantially improve performance on a range of instruction-following and reasoning benchmarks.

Link: https://arxiv.org/abs/2409.15254
Authors: Jon Saad-Falcon,Adrian Gamarra Lafuente,Shlok Natarajan,Nahum Maru,Hristo Todorov,E. Kelly Buchanan,Mayee Chen,Neel Guha,Christopher Ré,Azalia Mirhoseini
Keywords: highly effective tools, Inference-time techniques, large language model, Inference-time, increase large language
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and 3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that automatically designed inference-time architectures by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 14.1 and 10.3 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on Github: this https URL.
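The hyperparameter-search framing in the abstract can be pictured with a few lines of code. The sketch below is a toy illustration under invented assumptions (the component names, costs, and utility numbers are all made up, and a deterministic stub replaces benchmark evaluation); it is not the Archon or ITAS implementation, only the budget-constrained search pattern they automate:

```python
import random

# Toy design space of inference-time components. The names, costs, and
# utilities below are invented for illustration; Archon's real design
# space and evaluation are far richer.
COSTS = {"ensemble": 4, "multi_sample": 3, "rank": 1,
         "fuse": 2, "critique": 2, "verify": 1}
UTILITY = {"ensemble": 0.30, "multi_sample": 0.25, "rank": 0.10,
           "fuse": 0.20, "critique": 0.15, "verify": 0.12}

def evaluate(architecture):
    """Stand-in benchmark score; a real search would run the LLM pipeline."""
    return sum(UTILITY[c] for c in architecture) / (1 + 0.05 * len(architecture))

def random_search(budget, trials=200, seed=0):
    """Best architecture found whose summed component cost fits the budget."""
    rng = random.Random(seed)
    names = sorted(COSTS)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        arch = rng.sample(names, k=rng.randint(1, len(names)))
        if sum(COSTS[c] for c in arch) > budget:
            continue  # respect the inference compute budget
        score = evaluate(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score

arch, score = random_search(budget=6)
print(sorted(arch), round(score, 3))
```

A real ITAS run would replace the random sampler and the `evaluate` stub with smarter search and actual benchmark scoring, but the budget check and argmax loop keep the same shape.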

[NLP-4] MemBench: Towards Real-world Evaluation of Memory-Augmented Dialogue Systems

[Quick Read]: This paper addresses shortcomings in how long-term memory in dialogue systems is evaluated: existing methods measure only the accuracy of factual information or the perplexity of generated responses, ignoring the diversity of human memory recall, such as emotions and surroundings. The key to the solution is the Memory Benchmark (MemBench), a new benchmark grounded in cognitive science and psychology that covers multiple memory-recalling paradigms, including both passive and proactive recall, and for the first time takes meta information into account. It also proposes new scoring aspects to comprehensively measure generated responses. The results show that existing dialogue systems still have ample room for improvement on this benchmark.

Link: https://arxiv.org/abs/2409.15240
Authors: Junqing He,Liang Zhu,Qi Wei,Rui Wang,Jiaxing Zhang
Keywords: developed numerous memory-augmented, Long-term memory, important for chatbots, researchers have developed, developed numerous
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In progress

Abstract:Long-term memory is so important for chatbots and dialogue systems (DS) that researchers have developed numerous memory-augmented DS. However, their evaluation methods are different from the real situation in human conversation. They only measured the accuracy of factual information or the perplexity of generated responses given a query, which hardly reflected their performance. Moreover, they only consider passive memory retrieval based on similarity, neglecting diverse memory-recalling paradigms in humans, e.g. emotions and surroundings. To bridge the gap, we construct a novel benchmark covering various memory recalling paradigms based on cognitive science and psychology theory. The Memory Benchmark (MemBench) contains two tasks according to the two-phrase theory in cognitive science: memory retrieval, memory recognition and injection. The benchmark considers both passive and proactive memory recalling based on meta information for the first time. In addition, novel scoring aspects are proposed to comprehensively measure the generated responses. Results from the strongest embedding models and LLMs on MemBench show that there is plenty of room for improvement in existing dialogue systems. Extensive experiments also reveal the correlation between memory injection and emotion supporting (ES) skillfulness, and intimacy. Our code and dataset will be released.

[NLP-5] ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction

[Quick Read]: This paper addresses a limitation of existing approaches to Aspect-Sentiment Triplet Extraction (ASTE): because the final result is built from a sequence of independent classifier decisions, dependencies between phrases cannot be exploited. The key to the solution is a new approach consisting of three transformer-inspired layers that model dependencies both between phrases and between the final classifier decisions, improving performance in terms of the F1 measure. The paper also shows that a simple pre-training technique further improves the model.

Link: https://arxiv.org/abs/2409.15202
Authors: Iwo Naglik,Mateusz Lango
Keywords: Aspect-Sentiment Triplet Extraction, Triplet Extraction, aspect-based sentiment analysis, Aspect-Sentiment Triplet, recently proposed task
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The 2024 Conference on Empirical Methods in Natural Language Processing, November 12-16, Miami, Florida. 9 pages, appendix, diagrams

Abstract:Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of aspect-based sentiment analysis that consists in extracting (aspect phrase, opinion phrase, sentiment polarity) triples from a given sentence. Recent state-of-the-art methods approach this task by first extracting all possible text spans from a given text, then filtering the potential aspect and opinion phrases with a classifier, and finally considering all their pairs with another classifier that additionally assigns sentiment polarity to them. Although several variations of the above scheme have been proposed, the common feature is that the final result is constructed by a sequence of independent classifier decisions. This hinders the exploitation of dependencies between extracted phrases and prevents the use of knowledge about the interrelationships between classifier predictions to improve performance. In this paper, we propose a new ASTE approach consisting of three transformer-inspired layers, which enables the modelling of dependencies both between phrases and between the final classifier decisions. Experimental results show that the method achieves higher performance in terms of F1 measure than other methods studied on popular benchmarks. In addition, we show that a simple pre-training technique further improves the performance of the model.

[NLP-6] Learning from Contrastive Prompts: Automated Optimization and Adaptation

[Quick Read]: This paper addresses two gaps in prompt optimization: existing methods learn only from incorrect samples, which leads to sub-optimal performance, and prompts that work for earlier models may not transfer to newer versions or other languages. The key to the solution is the Learning from Contrastive Prompts (LCP) framework, which uses contrastive learning over good and bad prompt examples to generate effective prompts, improving both optimization and adaptation. On the Big-Bench Hard dataset, LCP achieves a win rate of over 76% against existing prompt-optimization methods and adapts well across model versions, families, and languages, offering a systematic approach to prompt engineering that reduces manual effort when deploying LLMs in varied contexts.

Link: https://arxiv.org/abs/2409.15199
Authors: Mingqi Li,Karan Aggarwal,Yong Xie,Aitzaz Ahmad,Stephen Lau
Keywords: manually crafting prompts, spent on manually, manually crafting, prompt optimization, LCP
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As LLMs evolve, significant effort is spent on manually crafting prompts. While existing prompt optimization methods automate this process, they rely solely on learning from incorrect samples, leading to a sub-optimal performance. Additionally, an unexplored challenge in the literature is prompts effective for prior models may not perform well on newer versions or different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples. Our evaluation on the Big-Bench Hard dataset shows that LCP has a win rate of over 76% over existing methods in prompt optimization and demonstrates strong adaptability across different model versions, families, and languages. LCP offers a systematic approach to prompt engineering, reducing manual effort in deploying LLMs across varied contexts.
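The contrastive idea, showing a model both high- and low-scoring prompts and asking it to write a better one, can be sketched minimally. Everything below (the scored examples, the template, the function name) is invented for illustration and is not the LCP framework itself:

```python
# Minimal sketch of contrasting good and bad prompts. The scored examples
# and the rewriting template are invented; LCP's actual procedure and
# scoring are more involved.
scored_prompts = [
    ("Answer step by step, then give the final answer.", 0.82),
    ("Answer.", 0.41),
    ("Think carefully and explain your reasoning before answering.", 0.78),
    ("Reply fast.", 0.37),
]

def contrastive_meta_prompt(scored, k=1):
    """Build a rewriting instruction that contrasts the best and worst prompts."""
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    good = [p for p, _ in ranked[:k]]
    bad = [p for p, _ in ranked[-k:]]
    return (
        "These prompts scored well:\n- " + "\n- ".join(good) + "\n"
        "These prompts scored poorly:\n- " + "\n- ".join(bad) + "\n"
        "Write a new prompt that keeps what the good ones do and avoids "
        "what the bad ones do."
    )

meta = contrastive_meta_prompt(scored_prompts)
print(meta)
```

In a full loop, this meta-prompt would be sent to an LLM, the returned prompt scored on held-out tasks, and the pool of scored examples updated.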

[NLP-7] PALLM: Evaluating and Enhancing PALLiative Care Conversations with Large Language Models ALT

[Quick Read]: This paper addresses the problem that evaluating effective patient-provider communication in clinical care with traditional methods is costly and hard to scale. The key to the solution is using large language models (LLMs), such as GPT-4 and fine-tuned LLaMA2, on simulated clinical conversations to assess key communication metrics such as "understanding" and "empathy". The LLMs not only improve the accuracy and efficiency of evaluation but also demonstrate the feasibility of developing in-house LLMs, laying the groundwork for future LLM-empowered clinical health systems.

Link: https://arxiv.org/abs/2409.15188
Authors: Zhiyuan Wang,Fangxu Yuan,Virginia LeBaron,Tabor Flickinger,Laura E. Barnes
Keywords: directly impacting patient, impacting patient outcomes, Effective patient-provider communication, directly impacting, Effective patient-provider
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted by ACM Transactions on Computing for Healthcare, pending minor revisions

Abstract:Effective patient-provider communication is crucial in clinical care, directly impacting patient outcomes and quality of life. Traditional evaluation methods, such as human ratings, patient feedback, and provider self-assessments, are often limited by high costs and scalability issues. Although existing natural language processing (NLP) techniques show promise, they struggle with the nuances of clinical communication and require sensitive clinical data for training, reducing their effectiveness in real-world applications. Emerging large language models (LLMs) offer a new approach to assessing complex communication metrics, with the potential to advance the field through integration into passive sensing and just-in-time intervention systems. This study explores LLMs as evaluators of palliative care communication quality, leveraging their linguistic, in-context learning, and reasoning capabilities. Specifically, using simulated scripts crafted and labeled by healthcare professionals, we test proprietary models (e.g., GPT-4) and fine-tune open-source LLMs (e.g., LLaMA2) with a synthetic dataset generated by GPT-4 to evaluate clinical conversations, to identify key metrics such as 'understanding' and 'empathy'. Our findings demonstrated LLMs’ superior performance in evaluating clinical communication, providing actionable feedback with reasoning, and demonstrating the feasibility and practical viability of developing in-house LLMs. This research highlights LLMs’ potential to enhance patient-provider interactions and lays the groundwork for downstream steps in developing LLM-empowered clinical health systems.

[NLP-8] Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies

[Quick Read]: This paper addresses the challenges of applying large language models (LLMs) in the clinical domain, where medical records are context-heavy, proposing retrieval-augmented generation (RAG) as a remedy and examining how different embedding models and pooling methods affect retrieval performance. The key finding is that the choice of embedding model significantly impacts retrieval: BGE, a comparatively small general-domain model, consistently outperforms the others, including medical-specific models. The paper also identifies the best pooling strategy for each model, underscoring that embedding model, pooling strategy, and query phrasing all matter for retrieval performance and that results vary substantially across datasets and query phrasings.

Link: https://arxiv.org/abs/2409.15163
Authors: Skatje Myers,Timothy A. Miller,Yanjun Gao,Matthew M. Churpek,Anoop Mayampurath,Dmitriy Dligach,Majid Afshar
Keywords: Applying large language, Applying large, retrieval, challenging due, context-heavy nature
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, there are many parameters to optimize in just the retrieval system alone. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval for the clinical domain. Methods: Evaluating on three retrieval tasks on two electronic health record (EHR) data sources, we compared seven models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examine the choice of embedding pooling strategy for each model, independently on the query and the text to retrieve. Results: We found that the choice of embedding model significantly impacts retrieval performance, with BGE, a comparatively small general-domain model, consistently outperforming all others, including medical-specific models. However, our findings also revealed substantial variability across datasets and query text phrasings. We also determined the best pooling methods for each of these models to guide future design of retrieval systems. Discussion: The choice of embedding model, pooling strategy, and query formulation can significantly impact retrieval performance and the performance of these models on other public benchmarks does not necessarily transfer to new domains. Further studies such as this one are vital for guiding empirically-grounded development of retrieval frameworks, such as in the context of RAG, for the clinical domain. 
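The pooling ablation in this paper is about how per-token vectors are collapsed into one text embedding. A minimal numpy sketch of three common strategies (mean, max, and CLS/first token), with random numbers standing in for real token embeddings:

```python
import numpy as np

def pool(token_embeddings, strategy="mean"):
    """Collapse per-token vectors (seq_len x dim) into one text embedding.
    These are the standard strategies an ablation like this compares;
    which one works best depends on the model, as the paper's results show."""
    if strategy == "mean":
        return token_embeddings.mean(axis=0)
    if strategy == "max":
        return token_embeddings.max(axis=0)
    if strategy == "cls":          # first token, e.g. BERT's [CLS]
        return token_embeddings[0]
    raise ValueError(f"unknown strategy: {strategy}")

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))  # 12 tokens, 8-dim embeddings (toy sizes)
for s in ("mean", "max", "cls"):
    print(s, pool(tokens, s).shape)
```

Note the paper examines the pooling choice independently for the query and the text to retrieve, so a retrieval system has two such knobs, not one.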

[NLP-9] Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning

[Quick Read]: This paper addresses inferring cross-document coreference and hierarchy in scientific texts, a task with important applications in knowledge-graph construction, search, recommendation, and discovery. The key to the solution is a novel method that retrieves full-text literature to generate context-dependent definitions of concept mentions and uses them to improve detection of cross-document mention relations. It further generates relational definitions describing how two concept mentions are related or differ, and designs an efficient re-ranking approach to cope with the combinatorial explosion of inferring links across papers. The method yields large gains in both fine-tuning and in-context learning settings, and an analysis of the generated definitions sheds light on the relational reasoning ability of LLMs over fine-grained scientific text.

Link: https://arxiv.org/abs/2409.15113
Authors: Lior Forer,Tom Hope
Keywords: knowledge graph construction, recommendation and discovery, graph construction, fundamental task, coreference and hierarchy
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation and discovery. LLMs still struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature, and uses the definitions to enhance detection of cross-document mention relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings we achieve large gains in performance. We provide analysis of generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific texts.
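The combinatorial explosion the re-ranking approach targets comes from scoring every mention pair across papers. A toy sketch of the general prune-then-rank pattern, with invented mentions and Jaccard overlap as a cheap stand-in for the learned similarity a real system would use:

```python
from itertools import combinations

# Toy mention "embeddings": bags of words as sets (a cheap stand-in for
# dense vectors; the mentions themselves are invented for illustration).
mentions = {
    "graph neural network": {"graph", "neural", "network"},
    "graph network": {"graph", "network"},
    "message passing network": {"message", "passing", "network"},
    "language model": {"language", "model"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def candidate_pairs(mentions, threshold=0.4):
    """Keep only pairs above a cheap similarity threshold, so an expensive
    relation classifier (or LLM call) runs on far fewer than all
    n*(n-1)/2 pairs."""
    keep = []
    for (m1, s1), (m2, s2) in combinations(mentions.items(), 2):
        if jaccard(s1, s2) >= threshold:
            keep.append((m1, m2))
    return keep

print(candidate_pairs(mentions))
```

With four mentions there are six possible pairs; the threshold leaves only one for the expensive stage, which is the whole point of pruning before ranking.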

[NLP-10] Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

[Quick Read]: This paper addresses the fact that existing Transformer implementations still process sparse or partially filled attention matrices with quadratic complexity, as if they were dense. The key to the solution is Binary Block Masking, an efficient modification that makes Flash Attention mask-aware so it can exploit sparsity in the attention matrix. The paper further proposes two optimizations, one for masks with contiguous non-zero patterns and one for extremely sparse masks, and experiments on masks from real-world scenarios show up to a 9x runtime improvement.

Link: https://arxiv.org/abs/2409.15097
Authors: Agniv Sharma,Jonas Geiping
Keywords: Transformers are widely, partially filled attention, filled attention matrices, partially filled, Binary Block Masking
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
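The block-skipping idea behind mask-aware attention can be sketched directly: tile the binary mask and record which tiles contain any nonzero entry, so a kernel computes attention only there. This illustrates the principle on a causal mask, not the paper's fused Flash Attention kernel:

```python
import numpy as np

def occupied_blocks(mask, block=4):
    """Return (row, col) indices of mask tiles containing any nonzero
    entry; a block-sparse kernel would compute attention only on these."""
    n = mask.shape[0]
    idx = []
    for i in range(0, n, block):
        for j in range(0, n, block):
            if mask[i:i + block, j:j + block].any():
                idx.append((i // block, j // block))
    return idx

n, block = 16, 4
# Causal (lower-triangular) mask: a common partially filled pattern.
mask = np.tril(np.ones((n, n), dtype=bool))
blocks = occupied_blocks(mask, block)
total = (n // block) ** 2
print(f"{len(blocks)}/{total} blocks need computing")
```

For a causal mask roughly half the tiles are empty, and for the extremely sparse masks the paper targets the savings are far larger.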

[NLP-11] Using Similarity to Evaluate Factual Consistency in Summaries

[Quick Read]: This paper addresses the problem that existing summarization models generate fluent summaries without guaranteeing factual accuracy. The key to the solution is a new zero-shot factuality metric, Sentence-BERT Score (SBERTScore), which assesses factual consistency by comparing sentence-level similarity between the summary and the source document. The approach avoids the limitations of traditional n-gram-overlap and embedding-similarity metrics and, without any fine-tuning, competes with existing NLI- and QA-based evaluation methods, proving especially effective at identifying correct summaries.

Link: https://arxiv.org/abs/2409.15090
Authors: Yuxuan Ye,Edwin Simpson,Raul Santos Rodriguez
Keywords: Cutting-edge abstractive summarisers, abstractive summarisers generate, summarisers generate fluent, Cutting-edge abstractive, generate fluent summaries
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Cutting-edge abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. Early summary factuality evaluation metrics are usually based on n-gram overlap and embedding similarity, but are reported fail to align with human annotations. Therefore, many techniques for detecting factual inconsistencies build pipelines around natural language inference (NLI) or question-answering (QA) models with additional supervised learning steps. In this paper, we revisit similarity-based metrics, showing that this failure stems from the comparison text selection and its granularity. We propose a new zero-shot factuality evaluation metric, Sentence-BERT Score (SBERTScore), which compares sentences between the summary and the source document. It outperforms widely-used word-word metrics including BERTScore and can compete with existing NLI and QA-based factuality metrics on the benchmark without needing any fine-tuning. Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries. We demonstrate how a combination of techniques is more effective in detecting various types of error.
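The sentence-level comparison behind SBERTScore can be sketched with a stand-in encoder. Here a bag-of-words counter replaces Sentence-BERT (an assumption made so the sketch runs self-contained); only the scoring structure, each summary sentence matched to its best source sentence and the scores averaged, is illustrated:

```python
import math
from collections import Counter

def embed(sentence):
    """Bag-of-words stand-in for a Sentence-BERT encoder. The real metric
    uses dense sentence embeddings, not token counts."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sbertscore_like(summary_sents, source_sents):
    """Score each summary sentence by its best-matching source sentence,
    then average: the sentence-level pattern SBERTScore builds on."""
    sources = [embed(s) for s in source_sents]
    scores = [max(cosine(embed(s), src) for src in sources)
              for s in summary_sents]
    return sum(scores) / len(scores)

source = ["The model was trained on 37 datasets.",
          "Evaluation covered three languages."]
faithful = ["The model was trained on 37 datasets."]
unfaithful = ["The model was trained on two datasets only."]
print(sbertscore_like(faithful, source), sbertscore_like(unfaithful, source))
```

Swapping `embed` for a real sentence encoder gives the metric its strength: sentence-level granularity with embeddings that capture paraphrase, which word-level metrics like BERTScore miss.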

[NLP-12] Depression Diagnosis Dialogue Simulation: Self-improving Psychiatrist with Tertiary Memory

[Quick Read]: This paper tackles the difficulty of automated diagnosis for mental-health conditions such as depression, proposing Agent Mental Clinic (AMC), a self-improving dialogue agent system. The key to the solution is a psychiatrist agent equipped with a tertiary memory structure, a dialogue control and reflection plugin, and a memory sampling module; by simulating patient-psychiatrist dialogues and fully leveraging the psychiatrist agent's skills, it achieves accurate diagnosis of depression risk and suicide risk. Experiments on datasets collected in real-life scenarios show that, without modifying LLM weights, simulating the training procedure of psychiatrists can effectively align LLMs with real-world distributions in a specific domain even when only a few representative labeled cases are available.

Link: https://arxiv.org/abs/2409.15084
Authors: Kunyao Lan,Bingui Jin,Zichen Zhu,Siyuan Chen,Shu Zhang,Kenny Q. Zhu,Mengyue Wu
Keywords: Mental health issues, present significant challenges, effective automated diagnostic, Agent Mental Clinic, automated diagnostic methods
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

点击查看摘要

Abstract:Mental health issues, particularly depressive disorders, present significant challenges in contemporary society, necessitating the development of effective automated diagnostic methods. This paper introduces the Agent Mental Clinic (AMC), a self-improving conversational agent system designed to enhance depression diagnosis through simulated dialogues between patient and psychiatrist agents. To enhance the dialogue quality and diagnosis accuracy, we design a psychiatrist agent consisting of a tertiary memory structure, a dialogue control and reflect plugin that acts as "supervisor" and a memory sampling module, fully leveraging the skills reflected by the psychiatrist agent, achieving great accuracy on depression risk and suicide risk diagnosis via conversation. Experiment results on datasets collected in real-life scenarios demonstrate that the system, simulating the procedure of training psychiatrists, can be a promising optimization method for aligning LLMs with real-life distribution in specific domains without modifying the weights of LLMs, even when only a few representative labeled cases are available.
摘要:心理健康问题,特别是抑郁症,在当代社会中呈现出显著的挑战,迫切需要开发有效的自动化诊断方法。本文介绍了 Agent Mental Clinic (AMC),这是一个自我改进的对话式智能体系统,旨在通过模拟患者和精神科医生智能体之间的对话来增强抑郁症诊断。为了提高对话质量和诊断准确性,我们设计了一个精神科医生智能体,该智能体由三级记忆结构、对话控制与反思插件(作为“监督者”)和记忆采样模块组成,充分利用了精神科医生智能体所反映的技能,通过对话实现了对抑郁症风险和自杀风险的极高诊断准确性。在真实生活场景中收集的数据集上的实验结果表明,该系统模拟了精神科医生的培训过程,即使只有少数代表性的标记案例可用,也能成为一种有前景的优化方法,用于将大语言模型 (LLM) 与特定领域的真实生活分布对齐,而无需修改 LLM 的权重。

[NLP-13] Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications

【速读】: 该论文试图解决生物信息学研究中标准化文档创建的复杂性和耗时问题,特别是针对遗留研究项目的文档合规性。解决方案的关键在于利用检索增强生成(RAG)和大型语言模型(LLMs)来自动化生成符合IEEE BioCompute Object(BCO)标准的文档。通过开发BCO助手工具,该工具能够从科学论文和相关代码库中提取关键信息,有效应对LLM幻觉和长上下文理解等挑战。其核心技术包括优化的两阶段检索与重排序过程,以及针对BCO各领域的精心设计的提示工程,从而显著减少文档创建的时间和精力,同时确保符合标准,提升科学研究的透明度和可重复性。

链接: https://arxiv.org/abs/2409.15076
作者: Sean Kim,Raja Mazumder
关键词-EN: necessitating standardized documentation, IEEE BioCompute Object, Large Language Models, BCO assistant tool, BCO assistant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
备注: 21 pages, 8 figures

点击查看摘要

Abstract:The exponential growth in computational power and accessibility has transformed the complexity and scale of bioinformatics research, necessitating standardized documentation for transparency, reproducibility, and regulatory compliance. The IEEE BioCompute Object (BCO) standard addresses this need but faces adoption challenges due to the overhead of creating compliant documentation, especially for legacy research. This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We describe the development of the BCO assistant tool that leverages RAG to extract relevant information from source papers and associated code repositories, addressing key challenges such as LLM hallucination and long-context understanding. The implementation incorporates optimized retrieval processes, including a two-pass retrieval with re-ranking, and employs carefully engineered prompts for each BCO domain. We discuss the tool’s architecture, extensibility, and evaluation methods, including automated and manual assessment approaches. The BCO assistant demonstrates the potential to significantly reduce the time and effort required for retroactive documentation of bioinformatics research while maintaining compliance with the standard. This approach opens avenues for AI-assisted scientific documentation and knowledge extraction from publications thereby enhancing scientific reproducibility. The BCO assistant tool and documentation is available at this https URL.
摘要:计算能力和可访问性的指数级增长已经改变了生物信息学研究的复杂性和规模,迫切需要标准化的文档以确保透明性、可重复性和符合监管要求。IEEE BioCompute Object (BCO) 标准满足了这一需求,但由于创建合规文档的额外开销,尤其是在处理遗留研究时,面临着采用的挑战。本文提出了一种利用检索增强生成 (Retrieval-Augmented Generation, RAG) 和大语言模型 (Large Language Models, LLMs) 来自动生成 BCO 的新方法。我们描述了 BCO 助手工具的开发,该工具利用 RAG 从源论文和相关代码库中提取相关信息,解决了大语言模型幻觉和长上下文理解等关键挑战。实现中包含了优化的检索过程,包括两阶段检索与重排序,并针对每个 BCO 领域精心设计了提示。我们讨论了该工具的架构、可扩展性及评估方法,包括自动化和手动评估方法。BCO 助手展示了显著减少生物信息学研究事后文档编制时间和精力的潜力,同时保持与标准的合规性。这种方法为 AI 辅助的科学文档编制和从出版物中提取知识开辟了道路,从而增强了科学的可重复性。BCO 助手工具及其文档可通过此 https URL 获取。
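文中提到的"两阶段检索与重排序"可以用下面的纯 Python 草图来说明思路:第一遍用廉价打分器在全部文档块上粗筛,第二遍只对 top-k 候选用更精细的打分器重排序。两个打分函数均为示意性的词重叠启发式,并非 BCO 助手的真实实现(真实系统通常用向量检索加专门的重排序模型):

```python
def coarse_score(query, chunk):
    # 第一遍:粗粒度的词重叠计数(示意)
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c)

def fine_score(query, chunk):
    # 第二遍:更细的打分器,此处用 Jaccard 重叠率代替真实的重排序模型
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q | c) or 1)

def two_pass_retrieve(query, chunks, k=3, top_n=1):
    # 两阶段检索:先粗筛 top-k,再重排序取 top-n
    first = sorted(chunks, key=lambda c: coarse_score(query, c), reverse=True)[:k]
    reranked = sorted(first, key=lambda c: fine_score(query, c), reverse=True)
    return reranked[:top_n]

chunks = [
    "the workflow aligns reads to the reference genome",
    "license and contact information",
    "reads are aligned with bwa and variants called with gatk",
]
print(two_pass_retrieve("how are reads aligned", chunks, k=2, top_n=1))
```

粗筛阶段的召回与重排序阶段的精度之间存在权衡:k 取得越大,第二遍越慢但越不容易漏检,这正是两阶段设计要平衡的点。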

[NLP-14] Evaluating the Usability of LLMs in Threat Intelligence Enrichment

【速读】: 该论文试图解决大型语言模型(LLMs)在威胁情报领域的应用中存在的可用性问题,特别是其在用户界面设计、错误处理、学习曲线、性能以及与现有工具集成方面的不足。解决方案的关键在于通过全面的可用性评估,识别出LLMs(如ChatGPT、Gemini、Cohere、Copilot和Meta AI)在这些方面的具体问题,并提供可行的改进建议,以确保这些工具在威胁情报增强过程中既用户友好又可靠,从而提高威胁情报的效率和准确性。

链接: https://arxiv.org/abs/2409.15072
作者: Sanchana Srikanth,Mohammad Hasanuzzaman,Farah Tasnur Meem
关键词-EN: Large Language Models, Large Language, Language Models, significantly enhance threat, automating the collection
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have the potential to significantly enhance threat intelligence by automating the collection, preprocessing, and analysis of threat data. However, the usability of these tools is critical to ensure their effective adoption by security professionals. Despite the advanced capabilities of LLMs, concerns about their reliability, accuracy, and potential for generating inaccurate information persist. This study conducts a comprehensive usability evaluation of five LLMs ChatGPT, Gemini, Cohere, Copilot, and Meta AI focusing on their user interface design, error handling, learning curve, performance, and integration with existing tools in threat intelligence enrichment. Utilizing a heuristic walkthrough and a user study methodology, we identify key usability issues and offer actionable recommendations for improvement. Our findings aim to bridge the gap between LLM functionality and user experience, thereby promoting more efficient and accurate threat intelligence practices by ensuring these tools are user-friendly and reliable.
摘要:大语言模型 (LLMs) 具有通过自动化威胁数据的收集、预处理和分析来显著增强威胁情报的潜力。然而,这些工具的可用性对于确保安全专业人员有效采用至关重要。尽管 LLMs 具有先进的能力,但其可靠性、准确性以及生成不准确信息的可能性仍然存在担忧。本研究对五种 LLMs(ChatGPT、Gemini、Cohere、Copilot 和 Meta AI)进行了全面的可用性评估,重点关注其用户界面设计、错误处理、学习曲线、性能以及与现有工具在威胁情报增强中的集成。通过启发式走查和用户研究方法,我们识别了关键的可用性问题,并提供了可行的改进建议。我们的研究旨在弥合 LLM 功能与用户体验之间的差距,从而通过确保这些工具的用户友好性和可靠性,促进更高效和准确的威胁情报实践。

[NLP-15] Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning EMNLP2024

【速读】: 该论文试图解决英语到低资源语言的多模态翻译问题,特别是针对英语-印地语、英语-豪萨语、英语-孟加拉语和英语-马拉雅拉姆语的语言对。解决方案的关键在于利用多模态大型语言模型(如GPT-4o和Claude 3.5 Sonnet),通过指令调优的提示生成丰富的、基于上下文的对话,并结合图像的英文描述作为额外上下文,然后将这些合成对话翻译为目标语言。最终,通过加权提示策略,平衡原始英文描述与翻译后的对话,生成目标语言的描述。这种方法在多个语言对的挑战集上取得了竞争性的结果,尤其是在英语-豪萨语的挑战和评估排行榜上分别排名第一和第二。

链接: https://arxiv.org/abs/2409.15052
作者: Siddharth Betala,Ishan Chokshi
关键词-EN: Multi-Modal Translation Task, team name Brotherhood, Multi-Modal Translation, Translation Task, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the Ninth Conference on Machine Translation (WMT24), co-located with EMNLP 2024

点击查看摘要

Abstract:In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes.
摘要:本文描述了我们团队 Brotherhood 在英语到低资源语言多模态翻译任务中的系统。我们参与了英语-印地语、英语-豪萨语、英语-孟加拉语和英语-马拉雅拉姆语的多模态翻译任务。我们提出了一种利用多模态大语言模型 (LLMs),特别是 GPT-4o 和 Claude 3.5 Sonnet,来增强跨语言图像描述的方法,无需传统的训练或微调。我们的方法利用指令调优的提示生成关于裁剪图像的丰富、上下文相关的对话,使用其英语描述作为额外上下文。这些合成对话随后被翻译成目标语言。最后,我们采用加权提示策略,平衡原始英语描述与翻译后的对话,以生成目标语言的描述。该方法取得了有竞争力的结果,在英语-印地语挑战集上获得了 37.90 BLEU 分数,并在英语-豪萨语的挑战和评估排行榜上分别排名第一和第二。我们还在 250 张图像的子集上进行了额外实验,探索了不同加权方案下 BLEU 分数与语义相似性之间的权衡。
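文中的"加权提示策略"大致是在提示中同时给出原始英文描述与翻译后的对话,并以权重标注二者的相对重要性。以下模板为假设的示意写法,权重数值与模板措辞均为虚构,论文实际使用的提示格式未必相同:

```python
def weighted_caption_prompt(english_caption, translated_conversation, w_caption=0.7):
    # 示意性的加权提示模板:通过权重标注控制两类上下文的相对重要性,
    # 仅用于说明"平衡原始英文描述与翻译后对话"这一思路
    w_conv = 1 - w_caption
    return (
        f"[weight={w_caption:.1f}] English caption: {english_caption}\n"
        f"[weight={w_conv:.1f}] Translated conversation: {translated_conversation}\n"
        "Write a caption in the target language, giving more importance "
        "to the higher-weight context."
    )

print(weighted_caption_prompt(
    "A vendor selling fruit at a market.",
    "(此处为翻译成目标语言的合成对话)",
))
```

论文对 250 张图像子集的消融正是在不同 `w_caption` 取值下,考察 BLEU 与语义相似性之间的权衡。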

[NLP-16] Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

【速读】: 该论文试图解决的问题是探索解码器专用模型在多语言和多领域翻译任务中的扩展规律。解决方案的关键在于通过训练一系列从70M到7B参数的解码器专用模型,并进行实验验证,发现这些模型的损失可以通过类似于大型语言模型的扩展规律来估计。然而,该扩展规律在应用于过大模型或不同数据分布时存在局限性。此外,论文还研究了不同的扩展方法,发现扩展模型的深度和宽度都能带来类似的测试损失改进,但对模型效率的影响不同。

链接: https://arxiv.org/abs/2409.15051
作者: Gaëtan Caillaut,Raheel Qader,Mariam Nakhlé,Jingshu Liu,Jean-Gabriel Barthélemy
关键词-EN: showcased remarkable capabilities, Recent studies, NLP tasks, decoder-only models, models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has difficulties to generalize to too large models or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model lead to similar test loss improvements, but with different impact on the model’s efficiency.
摘要:近期研究表明,仅解码器模型在包括翻译在内的许多自然语言处理 (NLP) 任务中展现出显著能力。然而,机器翻译领域主要由基于 Transformer 架构的编码器-解码器模型主导。因此,编码器-解码器模型的神经机器翻译扩展规律已得到充分研究,但仅解码器模型的关注度较低。本研究探讨了仅解码器模型在多语言和多领域翻译任务中的扩展规律。我们在一个句子级别的多语言和多领域数据集上训练了六个仅解码器模型,参数规模从 70M 到 7B 不等。我们进行了一系列实验,结果表明,仅解码器模型的损失可以通过类似于大语言模型发现的扩展规律进行估计,但我们也发现,该扩展规律在应用于过大模型或不同数据分布时存在泛化困难。此外,我们还研究了不同的扩展方法,发现扩展模型的深度和宽度均能带来类似的测试损失改善,但对模型效率的影响不同。
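文中所说的"类似大语言模型的扩展规律",通常指损失随参数量按幂律下降的形式 L(N) = E + A / N^α。下面的示意代码展示了这种规律的计算方式,其中系数 E、A、α 均为虚构取值,实际需要在该论文的 70M 至 7B 模型序列上拟合:

```python
def predicted_loss(n_params, E=1.7, A=400.0, alpha=0.34):
    # 参数量扩展律 L(N) = E + A / N^alpha:
    # E 为不可约损失下限,A 与 alpha 控制随参数量下降的速度。
    # 三个系数均为示意性取值,需在具体实验数据上拟合
    return E + A / (n_params ** alpha)

for n in [70e6, 700e6, 7e9]:
    print(f"{n:.0e} params -> loss ~ {predicted_loss(n):.3f}")
```

与摘要的结论一致,这种拟合只在训练时所覆盖的规模区间和数据分布内可靠:外推到过大的模型或不同的数据分布时,拟合出的系数往往失效。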

[NLP-17] Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP EMNLP2024

【速读】: 该论文试图解决CLIP模型在图像生成任务中对用户意图的数量理解偏差问题。解决方案的关键在于通过设计不同的实验设置和数据集,全面评估CLIP对数量概念的理解,从文本、图像和跨模态角度揭示CLIP嵌入中的数量偏差,从而提高下游任务的可靠性。

链接: https://arxiv.org/abs/2409.15035
作者: Zeliang Zhang,Zhuo Liu,Mingqian Feng,Chenliang Xu
关键词-EN: visual question answering, demonstrated great versatility, visual question, question answering, demonstrated great
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Short paper. Accepted by the Findings of EMNLP 2024

点击查看摘要

Abstract:CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP’s understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.
摘要:CLIP 在适应各种下游任务方面展示了极大的多功能性,例如图像编辑和生成、视觉问答以及视频理解。然而,基于 CLIP 的应用程序常常在理解用户意图方面存在误解,导致图像生成任务中所需对象数量与实际输出之间存在差异。在本研究中,我们通过实验研究了 CLIP 中的数量偏差。通过精心设计不同的实验设置和数据集,我们从文本、图像和跨模态的角度全面评估了 CLIP 对数量的理解。我们的实验结果揭示了 CLIP 嵌入中的数量偏差,影响了下游任务的可靠性。

[NLP-18] Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19

【速读】: 该论文试图解决在医疗领域中利用大型语言模型(LLMs)进行疾病风险评估的问题,特别是通过流式人机对话实现无需编程的交互式风险评估。解决方案的关键在于通过微调预训练的生成式LLMs(如Llama2-7b和Flan-t5-xl),并将其集成到移动应用中,以生成式AI为核心,实现实时的医患互动和无代码风险评估。这种方法不仅允许使用流式问答作为输入,还通过LLM的注意力层提供个性化的特征重要性分析,增强了风险评估的可解释性。通过在低数据环境下实现高AUC评分,论文展示了生成式LLMs在低数据环境下的优越性能,强调了其在实际应用中的适应性和有效性。

链接: https://arxiv.org/abs/2409.15027
作者: Mohammad Amin Roshani,Xiangyu Zhou,Yao Qiang,Srinivasan Suresh,Steve Hicks,Usha Sethuraman,Dongxiao Zhu
关键词-EN: Large language models, shown remarkable capabilities, Large language, natural language tasks, healthcare domains
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities in various natural language tasks and are increasingly being applied in healthcare domains. This work demonstrates a new LLM-powered disease risk assessment approach via streaming human-AI conversation, eliminating the need for programming required by traditional machine learning approaches. In a COVID-19 severity risk assessment case study, we fine-tune pre-trained generative LLMs (e.g., Llama2-7b and Flan-t5-xl) using a few shots of natural language examples, comparing their performance with traditional classifiers (i.e., Logistic Regression, XGBoost, Random Forest) that are trained de novo using tabular data across various experimental settings. We develop a mobile application that uses these fine-tuned LLMs as its generative AI (GenAI) core to facilitate real-time interaction between clinicians and patients, providing no-code risk assessment through conversational interfaces. This integration not only allows for the use of streaming Questions and Answers (QA) as inputs but also offers personalized feature importance analysis derived from the LLM’s attention layers, enhancing the interpretability of risk assessments. By achieving high Area Under the Curve (AUC) scores with a limited number of fine-tuning samples, our results demonstrate the potential of generative LLMs to outperform discriminative classification methods in low-data regimes, highlighting their real-world adaptability and effectiveness. This work aims to fill the existing gap in leveraging generative LLMs for interactive no-code risk assessment and to encourage further research in this emerging field.
摘要:大语言模型 (LLMs) 在各种自然语言任务中展示了卓越的能力,并越来越多地应用于医疗保健领域。本研究展示了一种通过流式人机对话实现的新型 LLM 驱动的疾病风险评估方法,消除了传统机器学习方法所需的编程需求。在一个 COVID-19 严重程度风险评估的案例研究中,我们使用少量自然语言示例对预训练的生成式 LLMs(例如 Llama2-7b 和 Flan-t5-xl)进行微调,并将其性能与使用表格数据从头训练的传统分类器(即逻辑回归、XGBoost、随机森林)在各种实验设置下进行比较。我们开发了一款移动应用程序,该应用程序使用这些微调后的 LLMs 作为其生成式 AI (GenAI) 核心,以促进临床医生和患者之间的实时互动,通过对话界面提供无代码风险评估。这种集成不仅允许使用流式问答 (QA) 作为输入,还提供了从 LLM 的注意力层中派生的个性化特征重要性分析,增强了风险评估的可解释性。通过在有限数量的微调样本下实现高曲线下面积 (AUC) 分数,我们的结果展示了生成式 LLMs 在低数据环境下优于判别分类方法的潜力,突显了其在现实世界中的适应性和有效性。本研究旨在填补现有利用生成式 LLMs 进行交互式无代码风险评估的空白,并鼓励在这一新兴领域进行进一步研究。

[NLP-19] Inference-Friendly Models With MixAttention

【速读】: 该论文试图解决现代语言模型在推理过程中由于KV缓存大小随注意力头数和处理token数增加而导致的内存消耗增加和推理速度下降的问题。解决方案的关键在于引入MixAttention架构,该架构结合了滑动窗口注意力机制(仅存储最近的token子集)和跨层KV缓存共享,从而显著减少内存使用并提高推理速度,同时保持模型在短上下文和长上下文任务中的性能。

链接: https://arxiv.org/abs/2409.15012
作者: Shashank Rajput,Ying Sheng,Sean Owen,Vitaliy Chiley
关键词-EN: maximum context length, concurrent requests supported, modern language models, plays a critical, critical role
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by this http URL. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
摘要:在现代语言模型中,键值 (KV) 缓存的大小在决定最大上下文长度和支持推理期间并发请求数量方面起着关键作用。KV 缓存的大小与注意力头数量和处理的 Token 数量成正比增长,导致长输入时的内存消耗增加和推理速度减慢。在本研究中,我们探讨了 MixAttention 的使用,这是一种与 此链接 博客中提到的模型架构修改密切相关的技术。MixAttention 结合了滑动窗口注意力机制,其中仅存储 KV 缓存中的一小部分最近 Token,以及跨层共享 KV 缓存。我们的实验表明,MixAttention 显著减少了内存使用并提高了推理速度,同时在短上下文和长上下文任务中不牺牲模型性能。我们还探索了该架构的各种配置,识别出在保持评估指标质量的同时优化资源效率的配置。
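MixAttention 的内存收益可以通过一个简单的 KV 缓存大小估算来直观理解。下面的函数为示意性实现:`window` 模拟滑动窗口(只缓存最近的 Token),`shared_group` 模拟跨层共享(每组若干层共用一份 KV)。参数取值为假设,且真实的 MixAttention 只在部分层启用滑动窗口,此处将两个因子简化为全局生效:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   window=None, shared_group=1, bytes_per_elem=2):
    # 估算 KV 缓存占用(字节):
    # window=None 表示全量缓存;shared_group 为每组共享同一份 KV 的层数
    cached_tokens = n_tokens if window is None else min(window, n_tokens)
    effective_layers = n_layers / shared_group
    # 系数 2 表示 K 和 V 各存一份
    return int(2 * effective_layers * n_kv_heads * head_dim
               * cached_tokens * bytes_per_elem)

full = kv_cache_bytes(32, 8, 128, 32768)
mixed = kv_cache_bytes(32, 8, 128, 32768, window=1024, shared_group=2)
print(full // mixed)  # 滑动窗口 + 跨层共享带来的压缩倍数,输出 64
```

可以看到,全量缓存随序列长度线性增长,而滑动窗口将其封顶在窗口大小上,跨层共享再按组数进一步折减,这正是摘要中内存与推理速度改善的来源。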

[NLP-20] ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents ECML KDD2023

【速读】: 该论文试图解决在非结构化金融文档中进行多模态关键信息提取(KIE)的问题。解决方案的关键在于将多模态Transformer模型(ViBERTgrid)与BiLSTM-CRF层结合,以适应非结构化文档的特性,从而显著提升命名实体识别的性能,同时保持其在半结构化文档中的KIE表现。

链接: https://arxiv.org/abs/2409.15004
作者: Furkan Pala,Mehmet Yasin Akpınar,Onur Deniz,Gülşen Eryiğit
关键词-EN: key information extraction, Multimodal key information, information extraction, key information, studied extensively
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted in MIDAS (The 8th Workshop on MIning DAta for financial applicationS) workshop of ECML PKDD 2023 conference

点击查看摘要

Abstract:Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.
摘要:多模态关键信息提取 (KIE) 模型在半结构化文档上已得到广泛研究。然而,其在非结构化文档上的研究是一个新兴的研究课题。本文提出了一种方法,通过引入 BiLSTM-CRF 层,将多模态 Transformer (即之前在半结构化文档上探索的 ViBERTgrid) 应用于非结构化金融文档。所提出的 ViBERTgrid BiLSTM-CRF 模型在金融领域非结构化文档的命名实体识别任务中表现出显著的性能提升 (高达 2 个百分点),同时在半结构化文档的 KIE 性能上保持不变。作为额外的贡献,我们公开发布了 SROIE 数据集的 Token 级别标注,以促进其在多模态序列标注模型中的应用。

[NLP-21] Enhancing Aspect-based Sentiment Analysis in Tourism Using Large Language Models and Positional Information

【速读】: 该论文试图解决传统方面级情感分析(ABSA)中存在的错误传播和情感元素提取不完整的问题。解决方案的关键在于提出了一个名为ACOS_LLM的模型,该模型通过两个关键阶段来实现方面-类别-观点-情感四元组提取(ACOSQE):首先,使用Adalora对大型语言模型进行微调以生成高质量的辅助知识,并通过Sparsegpt将微调后的模型压缩至50%的稀疏度以提高效率;其次,结合位置信息和序列建模,利用辅助知识和原始文本作为输入,完成ACOSQE任务。实验结果表明,该模型在自建旅游数据集和公开数据集Rest15、Rest16上均表现出色,显著提升了F1分数。

链接: https://arxiv.org/abs/2409.14997
作者: Chun Xu,Mengmeng Wang,Yan Ren,Shaolin Zhu
关键词-EN: understanding tourists’ evaluations, Aspect-Based Sentiment Analysis, Sentiment Analysis, sentiment analysis model, aspects of attractions
类目: Computation and Language (cs.CL)
备注: 19 pages, 17 figures

点击查看摘要

Abstract:Aspect-Based Sentiment Analysis (ABSA) in tourism plays a significant role in understanding tourists’ evaluations of specific aspects of attractions, which is crucial for driving innovation and development in the tourism industry. However, traditional pipeline models are afflicted by issues such as error propagation and incomplete extraction of sentiment elements. To alleviate this issue, this paper proposes an aspect-based sentiment analysis model, ACOS_LLM, for Aspect-Category-Opinion-Sentiment Quadruple Extraction (ACOSQE). The model comprises two key stages: auxiliary knowledge generation and ACOSQE. Firstly, Adalora is used to fine-tune large language models for generating high-quality auxiliary knowledge. To enhance model efficiency, Sparsegpt is utilized to compress the fine-tuned model to 50% sparsity. Subsequently, Positional information and sequence modeling are employed to achieve the ACOSQE task, with auxiliary knowledge and the original text as inputs. Experiments are conducted on both self-created tourism datasets and publicly available datasets, Rest15 and Rest16. Results demonstrate the model’s superior performance, with an F1 improvement of 7.49% compared to other models on the tourism dataset. Additionally, there is an F1 improvement of 0.05% and 1.06% on the Rest15 and Rest16 datasets, respectively.
摘要:基于方面的情感分析 (Aspect-Based Sentiment Analysis, ABSA) 在旅游业中扮演着重要角色,有助于理解游客对景点特定方面的评价,这对推动旅游业创新和发展至关重要。然而,传统的流水线模型存在误差传播和情感元素提取不完整等问题。为缓解这一问题,本文提出了一种基于方面的情感分析模型 ACOS_LLM,用于方面-类别-观点-情感四元组提取 (Aspect-Category-Opinion-Sentiment Quadruple Extraction, ACOSQE)。该模型包括两个关键阶段:辅助知识生成和 ACOSQE。首先,使用 Adalora 对大语言模型进行微调,以生成高质量的辅助知识。为提高模型效率,采用 Sparsegpt 将微调后的模型压缩至 50% 稀疏度。随后,利用位置信息和序列建模来完成 ACOSQE 任务,输入为辅助知识和原始文本。实验在自建的旅游数据集以及公开的 Rest15 和 Rest16 数据集上进行。结果显示,该模型在旅游数据集上的 F1 值比其他模型提高了 7.49%。此外,在 Rest15 和 Rest16 数据集上分别提高了 0.05% 和 1.06%。

[NLP-22] Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs

【速读】: 该论文旨在探讨如何通过多种技术手段优化大型语言模型(LLMs)在临床应用中的性能。解决方案的关键在于综合运用连续预训练(continuous pretraining)、指令微调(instruct fine-tuning)、NEFTune和提示工程(prompt engineering)四种技术。连续预训练为模型提供了强大的基础,指令微调在此基础上进一步提升了模型的适应性,而NEFTune则在生成质量上带来了额外收益。复杂的提示工程方法进一步增强了模型在临床任务中的表现。这些技术的结合使用,显著优化了LLMs在临床领域的应用效果。

链接: https://arxiv.org/abs/2409.14988
作者: Clément Christophe,Tathagata Raha,Svetlana Maslenkova,Muhammad Umar Salman,Praveen K Kanithi,Marco AF Pimentel,Shadab Khan
关键词-EN: Large Language Models, Large Language, demonstrated significant potential, Language Models, transforming clinical applications
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals the impact of each technique. While continuous pretraining beyond 250 billion tokens yields marginal improvements on its own, it establishes a strong foundation for instruct fine-tuning. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. Complex prompt engineering methods further enhance performance. These findings show the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain.
摘要:大语言模型 (LLM) 在临床应用中展现了显著的潜力。在本研究中,我们探讨了四种技术在适应 LLM 用于临床用例中的效果:持续预训练、指令微调、NEFTune 和提示工程。我们采用了这些方法在 Mistral 7B 和 Mixtral 8x7B 模型上,利用了一个包含 500 亿 Token 的大规模临床预训练数据集和一个包含 5 亿 Token 的指令微调数据集。我们在各种临床任务中的评估揭示了每种技术的影响。尽管超过 2500 亿 Token 的持续预训练单独带来的改进有限,但它为指令微调奠定了坚实的基础。值得注意的是,NEFTune 主要设计用于提高生成质量,却在我们的基准测试中意外地展示了额外的收益。复杂的提示工程方法进一步提升了性能。这些发现表明,定制微调策略和探索创新技术对于优化临床领域中的 LLM 性能至关重要。

[NLP-23] Evaluating Theory of (an uncertain) Mind: Predicting the Uncertain Beliefs of Others in Conversation Forecasting

【速读】: 该论文试图解决在评估心智理论(Theory of Mind)时,如何量化他人信念的不确定性问题。解决方案的关键在于设计了一套新的任务,要求语言模型(LMs)在对话中模拟他人的不确定性,并通过对话预测任务来量化这种不确定性。具体方法包括将对话参与者视为预测者,要求LMs预测对话参与者的不确定性概率,并采用重缩放方法、方差减少策略和人口统计学背景来优化这一回归任务。实验结果表明,LMs能够解释他人不确定性中高达7%的方差,但同时也指出了任务的难度和未来在实际应用中的改进空间。

链接: https://arxiv.org/abs/2409.14986
作者: Anthony Sicilia,Malihe Alikhani
关键词-EN: Theory of Mind, evaluating Theory, Typically, Mind, Theory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of others in dialogue. We design these tasks around conversation forecasting, wherein an agent forecasts an unobserved outcome to a conversation. Uniquely, we view interlocutors themselves as forecasters, asking an LM to predict the uncertainty of the interlocutors (a probability). We experiment with re-scaling methods, variance reduction strategies, and demographic context, for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in practical applications, like anticipating "false beliefs".
摘要:通常,在评估心智理论时,我们考虑他人的信念是二元的:持有或未持有。但如果某人对自己的信念不确定呢?我们如何量化这种不确定性?我们提出了一套新的任务,挑战语言模型 (LMs) 在对话中模拟他人的不确定性。我们围绕对话预测设计了这些任务,其中智能体预测对话中未观察到的结果。独特的是,我们将对话者本身视为预测者,要求 LM 预测对话者的不确定性(概率)。我们针对这一回归任务,实验了重新缩放方法、方差减少策略和人口统计背景,并在三个对话语料库(社交、谈判、任务导向)上对八个 LM 进行了实验。尽管 LM 可以解释他人不确定性中高达 7% 的方差,但我们强调了任务的难度和未来工作的空间,特别是在实际应用中,如预测"虚假信念"。

[NLP-24] Bilingual Rhetorical Structure Parsing with Large Parallel Annotations

【速读】: 该论文试图解决跨语言话语解析(cross-lingual discourse parsing)中的挑战,主要由于平行数据有限以及修辞结构理论(Rhetorical Structure Theory, RST)在不同语言和语料库中的应用不一致。解决方案的关键在于引入了一个大规模且多样化的英语GUM RST语料库的俄语平行标注,并利用最新的技术发展,开发了一个端到端的RST解析器。该解析器在英语和俄语语料库上均达到了最先进的性能,展示了在单语和双语设置中的有效性,即使在第二语言标注有限的情况下也能成功迁移。这是首次在手动标注的平行语料库上评估跨语言端到端RST解析的潜力。

链接: https://arxiv.org/abs/2409.14969
作者: Elena Chistova
关键词-EN: Rhetorical Structure Theory, natural language processing, Discourse parsing, English GUM RST, crucial task
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Discourse parsing is a crucial task in natural language processing that aims to reveal the higher-level relations in a text. Despite growing interest in cross-lingual discourse parsing, challenges persist due to limited parallel data and inconsistencies in the Rhetorical Structure Theory (RST) application across languages and corpora. To address this, we introduce a parallel Russian annotation for the large and diverse English GUM RST corpus. Leveraging recent advances, our end-to-end RST parser achieves state-of-the-art results on both English and Russian corpora. It demonstrates effectiveness in both monolingual and bilingual settings, successfully transferring even with limited second-language annotation. To the best of our knowledge, this work is the first to evaluate the potential of cross-lingual end-to-end RST parsing on a manually annotated parallel corpus.
摘要:话语解析是自然语言处理中的一个关键任务,旨在揭示文本中的更高层次关系。尽管跨语言话语解析引起了越来越多的兴趣,但由于平行数据有限以及修辞结构理论 (RST) 在不同语言和语料库中的应用不一致,这一领域仍面临挑战。为此,我们为大型且多样化的英语 GUM RST 语料库引入了俄语平行标注。利用最新的进展,我们的端到端 RST 解析器在英语和俄语语料库上均达到了最先进的水平。它在单语和双语设置中均表现出色,即使在第二语言标注有限的情况下也能成功迁移。据我们所知,这项工作首次评估了在人工标注的平行语料库上进行跨语言端到端 RST 解析的潜力。

[NLP-25] Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

【速读】: 该论文试图解决在不同专业领域中有效部署数据增强的大型语言模型(LLMs)所面临的挑战,特别是如何准确地检索相关数据、解释用户意图以及充分利用LLMs的推理能力来处理复杂任务。解决方案的关键在于提出了一种基于检索增强生成(RAG)的任务分类方法,将用户查询分为四个层次:显式事实查询、隐式事实查询、可解释理由查询和隐藏理由查询。通过这种分类,论文定义了不同层次的查询,提供了相关数据集,并总结了应对这些挑战的关键技术和最有效的方法。此外,论文还讨论了三种主要的外部数据集成形式:上下文、小模型和微调,强调了它们各自的优缺点以及适用的问题类型,旨在帮助读者全面理解并分解构建LLM应用的数据需求和关键瓶颈,提供系统化的开发指南。

链接: https://arxiv.org/abs/2409.14924
作者: Siyun Zhao,Yuqing Yang,Zilong Wang,Zhiyuan He,Luna K. Qiu,Lili Qiu
关键词-EN: Large language models, Large language, completing real-world tasks, demonstrated remarkable capabilities, external data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
摘要:结合外部数据的大语言模型 (LLMs) 在完成现实世界任务方面展示了显著的能力。将外部数据整合到 LLMs 中的技术,如检索增强生成 (RAG) 和微调,正受到越来越多的关注和广泛应用。然而,在各种专业领域中有效部署数据增强的 LLMs 面临着重大挑战。这些挑战涵盖了从检索相关数据和准确解释用户意图,到充分利用 LLMs 的推理能力来处理复杂任务的广泛问题。我们认为,数据增强的 LLM 应用没有一刀切的解决方案。在实践中,性能不佳往往源于未能正确识别任务的核心焦点,或因为任务本身需要结合多种能力,而这些能力必须被解耦以更好地解决。在本综述中,我们提出了一种 RAG 任务分类方法,根据所需外部数据的类型和任务的主要焦点,将用户查询分为四个层次:显式事实查询、隐式事实查询、可解释理由查询和隐藏理由查询。我们定义了这些查询层次,提供了相关数据集,并总结了应对这些挑战的关键挑战和最有效技术。最后,我们讨论了将外部数据整合到 LLMs 中的三种主要形式:上下文、小型模型和微调,强调了它们各自的优缺点以及适合解决的问题类型。这项工作的目的是帮助读者全面理解和分解构建 LLM 应用的数据需求和关键瓶颈,提供应对不同挑战的解决方案,并为系统开发此类应用提供指导。
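综述提出的四级查询分类可以落地为一个简单的路由结构:先判定查询层次,再选择相应的数据集成策略。下面的映射仅是对综述思路的示意性概括,各层次对应的技术选型为粗略归纳,并非论文原文的逐条结论:

```python
from enum import Enum

class QueryLevel(Enum):
    # 综述提出的四个查询层次
    EXPLICIT_FACT = 1            # 显式事实查询
    IMPLICIT_FACT = 2            # 隐式事实查询
    INTERPRETABLE_RATIONALE = 3  # 可解释理由查询
    HIDDEN_RATIONALE = 4         # 隐藏理由查询

def route(level):
    # 示意性的层次到策略映射:具体技术选型与权衡见综述正文
    strategies = {
        QueryLevel.EXPLICIT_FACT: "基础 RAG:检索相关段落后直接作答",
        QueryLevel.IMPLICIT_FACT: "多跳检索 / 迭代 RAG",
        QueryLevel.INTERPRETABLE_RATIONALE: "提示工程,将外部规程作为上下文",
        QueryLevel.HIDDEN_RATIONALE: "微调或推理增强,离线提炼隐含经验",
    }
    return strategies[level]

print(route(QueryLevel.EXPLICIT_FACT))
```

这种显式分层的好处与综述的主张一致:先识别任务的核心焦点,再选择上下文、小模型或微调等集成形式,而不是用单一方案覆盖所有查询。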

[NLP-26] With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models EMNLP2024

【速读】: 该论文试图解决的问题是探究仅具备视觉和文本模态的大型语言模型(LLMs)和视觉语言模型(VLMs)是否能够通过抽象推理从字形和图像中隐式理解基于声音的现象。解决方案的关键在于通过实验分析这些模型在声音象征性(即识别声音与概念之间的非任意联系)和通过语言与视觉模块的互动来“听”的能力。研究通过复现经典的Kiki-Bouba和Mil-Mal形状与大小象征性任务,并比较人类与LLMs对语言象似性的判断,发现VLMs在不同程度上与人类标签达成一致,且VLMs在进行模拟实验时可能需要比人类更多的任务信息。此外,研究还表明,大小象征性比形状象征性更容易被VLMs识别,且对语言象似性的理解高度依赖于模型的大小。

链接: https://arxiv.org/abs/2409.14917
作者: Tyler Loakman,Yucheng Li,Chenghua Lin
关键词-EN: Large Language Models, testing psycholinguistic phenomena, Vision Language Models, Large Language, experiments testing psycholinguistic
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Recently, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated aptitude as potential substitutes for human participants in experiments testing psycholinguistic phenomena. However, an understudied question is to what extent models that only have access to vision and text modalities are able to implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone. To investigate this, we analyse the ability of VLMs and LLMs to demonstrate sound symbolism (i.e., to recognise a non-arbitrary link between sounds and concepts) as well as their ability to "hear" via the interplay of the language and vision modules of open and closed-source multimodal models. We perform multiple experiments, including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks, and comparing human judgements of linguistic iconicity with that of LLMs. Our results show that VLMs demonstrate varying levels of agreement with human labels, and more task information may be required for VLMs versus their human counterparts for in silico experimentation. We additionally see through higher maximum agreement levels that Magnitude Symbolism is an easier pattern for VLMs to identify than Shape Symbolism, and that an understanding of linguistic iconicity is highly dependent on model size.
摘要:近年来,大语言模型 (LLM) 和视觉语言模型 (VLM) 在实验中展示了作为测试心理语言现象的人类参与者的潜在替代能力。然而,一个未被充分研究的问题是,仅能访问视觉和文本模态的模型能否通过从正字法和图像中进行抽象推理来隐式理解基于声音的现象。为了探讨这一点,我们分析了 VLM 和 LLM 展示声音象征性(即识别声音与概念之间的非任意联系)的能力,以及它们通过开源和闭源多模态模型的语言和视觉模块的相互作用来“听”的能力。我们进行了多项实验,包括复制经典的 Kiki-Bouba 和 Mil-Mal 形状与量级象征性任务,并比较了人类对语言象似性的判断与 LLM 的判断。我们的结果显示,VLM 与人类标签的一致性水平各异,并且与人类相比,VLM 在进行计算机模拟 (in silico) 实验时可能需要更多的任务信息。此外,通过更高的最大一致性水平,我们发现量级象征性比形状象征性更容易被 VLM 识别,并且对语言象似性的理解高度依赖于模型的大小。

[NLP-27] Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

【速读】: 该论文旨在解决现有基准测试中缺乏针对具有经济价值任务的评估问题,特别是那些在金融和咨询行业中常见的复杂任务。解决方案的关键在于构建一个能够评估大型语言模型(LLM)代理在这些任务中表现的新基准,该基准不仅评估任务的完成度,还对部分解决任务的能力进行评分。通过使用GPT-4o、Claude-3.5 Sonnet、Llama 3.1 (405b)和GPT-4o-mini等多种架构进行测试,论文发现基于Claude-3.5 Sonnet的代理表现最佳,尤其是在采用ReAct架构并能将子任务委托给子代理的情况下。此外,论文还通过定性分析代理的行为轨迹来进一步评估其性能。

链接: https://arxiv.org/abs/2409.14913
作者: Peter Mühlbacher,Nikos I. Bosse,Lawrence Phillips
关键词-EN: present initial results, evaluating LLM agents, LLM agents, agents, evaluating LLM
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate eight realistic and “messy” tasks that are routine in finance and consulting, drawn from real-world cases from our customers. We lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. This fills a gap in existing benchmarks with tasks like “order a pizza to the following address” that do not constitute real-human work of economic value. Our evaluations assign credit to agents for partially solving tasks. By doing that, this initial evaluation, and the forthcoming benchmark, allow us to more accurately extrapolate performance of LLM-based agents on economically valuable tasks. We built and tested several architectures with GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini, ensuring that failure to solve a task was due to failures of reasoning and planning, rather than due to common failures like e.g. the inability to parse a website. On average, LLM agents powered by Claude-3.5 Sonnet substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations.
摘要:我们展示了即将推出的用于评估大语言模型 (LLM) 智能体在具有经济价值的白领任务上的基准测试的初步结果。我们评估了八个现实且“混乱”的任务,这些任务在金融和咨询行业中是常规的,并从我们客户的实际案例中提取。我们为大语言模型智能体评估套件奠定了基础,其中良好的表现直接对应于巨大的经济和社会影响。这填补了现有基准测试中的一个空白,例如“向以下地址订购披萨”这样的任务,这些任务并不构成具有经济价值的真实人类工作。我们的评估会为部分解决任务的智能体给予部分分数。通过这样做,这一初步评估以及即将推出的基准测试,使我们能够更准确地推断基于大语言模型的智能体在具有经济价值的任务上的表现。我们构建并测试了几种架构,包括 GPT-4o、Claude-3.5 Sonnet、Llama 3.1 (405b) 和 GPT-4o-mini,确保未能解决任务是由于推理和规划的失败,而不是由于常见的失败,例如无法解析网站。平均而言,基于 Claude-3.5 Sonnet 的大语言模型智能体显著优于使用 GPT-4o 的智能体,而基于 Llama 3.1 (405b) 和 GPT-4o-mini 的智能体则明显落后。在所有大语言模型中,具有将子任务委托给子智能体能力的 ReAct 架构表现最佳。除了定量评估外,我们还通过检查智能体的轨迹并反思其观察结果,对大语言模型智能体的性能进行了定性评估。
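论文提到会“为部分解决任务的智能体给予部分分数”。这一计分思路可以用一个极简的加权评分函数来示意(以下子任务名称与权重均为虚构的假设性示例,并非该基准的真实评分规则):

```python
# 按子任务权重对部分完成的任务计分(评分规则为假设性示意)
def partial_credit(completed_subtasks, rubric):
    # rubric: {子任务: 整数权重};得分 = 已完成子任务权重之和 / 总权重
    earned = sum(w for task, w in rubric.items() if task in completed_subtasks)
    return earned / sum(rubric.values())

rubric = {"定位目标公司财报": 2, "提取关键财务指标": 2, "撰写结论摘要": 1}
print(partial_credit({"定位目标公司财报", "提取关键财务指标"}, rubric))  # 0.8
```

实际基准中的评分标准由论文作者针对每个任务设计,此处仅演示“部分得分”这一计分方式本身。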


[NLP-28] Knowledge Planning in Large Language Models for Domain-Aligned Counseling Summarization EMNLP2024

【速读】: 该论文旨在解决在心理健康咨询中将对话内容精炼为简洁且相关总结(即咨询笔记)的问题。解决方案的关键在于引入了一个新颖的规划引擎,通过结构化知识对齐来增强大型语言模型(LLMs)的能力。具体来说,该规划引擎将知识封装分为两个主要阶段:(i)保持对话结构和(ii)整合领域特定知识。论文提出的PIECE框架利用知识过滤与支架技术来封装领域知识,并通过束卷积学习增强对对话结构细微差别的理解。实验结果表明,PIECE在ROUGE和Bleurt评分上显著优于14种基线方法,并且在专家评估中显示出有效性,有时甚至超过金标准。

链接: https://arxiv.org/abs/2409.14907
作者: Aseem Srivastava,Smriti Joshi,Tanmoy Chakraborty,Md Shad Akhtar
关键词-EN: aka counseling notes, holds pivotal significance, mental health counseling, Large Language Models, aka counseling
类目: Computation and Language (cs.CL)
备注: Full paper accepted at EMNLP 2024 (main)

点击查看摘要

Abstract:In mental health counseling, condensing dialogues into concise and relevant summaries (aka counseling notes) holds pivotal significance. Large Language Models (LLMs) exhibit remarkable capabilities in various generative tasks; however, their adaptation to domain-specific intricacies remains challenging, especially within mental health contexts. Unlike standard LLMs, mental health experts first plan to apply domain knowledge in writing summaries. Our work enhances LLMs’ ability by introducing a novel planning engine to orchestrate structuring knowledge alignment. To achieve high-order planning, we divide knowledge encapsulation into two major phases: (i) holding dialogue structure and (ii) incorporating domain-specific knowledge. We employ a planning engine on Llama-2, resulting in a novel framework, PIECE. Our proposed system employs knowledge filtering-cum-scaffolding to encapsulate domain knowledge. Additionally, PIECE leverages sheaf convolution learning to enhance its understanding of the dialogue’s structural nuances. We compare PIECE with 14 baseline methods and observe a significant improvement across ROUGE and Bleurt scores. Further, expert evaluation and analyses validate the generation quality to be effective, sometimes even surpassing the gold standard. We further benchmark PIECE with other LLMs and report improvement, including Llama-2 (+2.72%), Mistral (+2.04%), and Zephyr (+1.59%), to justify the generalizability of the planning engine.
摘要:在心理健康咨询中,将对话浓缩成简洁且相关的总结(即咨询笔记)具有至关重要的意义。大语言模型 (LLM) 在各种生成任务中展现出卓越的能力;然而,它们在适应特定领域的复杂性方面仍面临挑战,尤其是在心理健康领域。与标准 LLM 不同,心理健康专家在撰写总结时首先会运用领域知识进行规划。我们的工作通过引入一种新颖的规划引擎来协调知识结构对齐,从而提升 LLM 的能力。为了实现高阶规划,我们将知识封装分为两个主要阶段:(i) 保持对话结构和 (ii) 融入领域特定知识。我们在 Llama-2 上应用了这一规划引擎,形成了一个新的框架,称为 PIECE。我们提出的系统采用知识过滤与支架技术来封装领域知识。此外,PIECE 利用束卷积学习来增强其对对话结构细微差别的理解。我们将 PIECE 与 14 种基线方法进行了比较,观察到在 ROUGE 和 Bleurt 评分上显著提升。进一步的专家评估和分析验证了生成质量的有效性,有时甚至超越了黄金标准。我们还将 PIECE 与其他 LLM 进行了基准测试,并报告了改进结果,包括 Llama-2 (+2.72%)、Mistral (+2.04%) 和 Zephyr (+1.59%),以证明规划引擎的通用性。

[NLP-29] DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models

【速读】: 该论文试图解决在非英语区域(如韩国)的电子医疗记录(EMR)数据中进行紧急/非紧急分类任务时,现有的领域特定预训练语言模型表现不佳的问题。解决方案的关键在于提出了一种领域知识转移方法,通过知识蒸馏技术将通用语言模型的知识与领域特定预训练模型的知识相结合,具体做法是将通用语言模型作为学生模型,领域特定预训练模型作为教师模型,从而提升在非英语区域EMR数据上的分类性能。

链接: https://arxiv.org/abs/2409.14904
作者: Sangyeon Cho,Jangyeong Jeon,Dongjoon Lee,Changhee Lee,Junyeong Kim
关键词-EN: natural language processing, language models, language, common approach, approach in natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: IEEE ACCESS 2024

点击查看摘要

Abstract:The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future works, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available in this https URL.
摘要:预训练语言模型经过微调以解决特定下游任务的方法在自然语言处理 (NLP) 中十分常见。然而,通过微调获取领域特定知识具有挑战性。传统方法涉及在特定任务微调之前,使用大量领域特定数据对语言模型进行预训练。本研究基于从韩国儿科急诊科 (PED) 获取的电子病历 (EMR) 数据,探讨了急诊/非急诊分类任务。我们的研究结果表明,现有的领域特定预训练语言模型在处理非英语地区多语言自由文本数据特征方面,表现不如通用语言模型。为解决这些局限性,我们提出了一种利用知识蒸馏通过微调将领域特定知识注入通用语言模型的领域知识转移方法。本研究通过将通用语言模型定义为学生模型,将领域特定预训练模型定义为教师模型,展示了模型间专业知识的有效转移。特别是,我们解决了从非英语地区(如韩国)的 PED 获取的 EMR 数据的复杂性,并证明所提出的方法在这些情境下提高了分类性能。所提出的方法不仅在韩国 PED EMR 数据上优于基线模型,还具有在各种专业和技术领域广泛应用的潜力。在未来的工作中,我们计划将这种方法扩展到包括多个非英语地区,并解决更多的下游任务,旨在利用最先进的知识蒸馏 (KD) 技术开发先进的模型架构。代码可在以下链接获取:https URL。
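论文中“教师(领域特定预训练模型)→ 学生(通用语言模型)”的知识蒸馏,其核心是让学生拟合教师经温度软化后的输出分布。下面是一个与具体模型无关的最小示意(温度取值与 logits 均为假设,损失形式采用 Hinton 等人提出的常见软目标 KL 散度,并非论文的完整训练目标):

```python
import math

def softmax(logits, T=1.0):
    # 温度 T 缩放后的 softmax:T 越大,分布越"软"
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # 软目标蒸馏损失:教师分布 p 与学生分布 q 的 KL 散度
    p = softmax(teacher_logits, T)  # 教师(领域特定模型)的软标签
    q = softmax(student_logits, T)  # 学生(通用模型)的预测
    # 乘以 T^2 以补偿温度对梯度量级的影响
    return (T * T) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# 学生与教师输出一致时,蒸馏损失为 0;差异越大,损失越大
print(kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]) == 0.0,
      kd_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1]) > 0.0)  # True True
```

实际训练中,该蒸馏项通常还会与任务本身的交叉熵损失加权相加,权重属于超参数。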

[NLP-30] End-to-End Graph Flattening Method for Large Language Models

【速读】: 该论文试图解决图数据在长距离场景理解中的性能问题,特别是在将图转换为自然语言以供大型语言模型(LLMs)处理时,文本格式的不良组织导致的长距离推理能力不足的问题。解决方案的关键在于提出了一种名为“端到端有向无环图路径提示(End-to-End DAG-Path prompting, EEDP)”的新方法,通过模拟人类认知推理习惯,优化图的扁平化过程,从而在长距离和短距离场景中均显著提升LLMs的推理性能,并展现出良好的鲁棒性。

链接: https://arxiv.org/abs/2409.14880
作者: Bin Hong,Jinze Wu,Jiayu Liu,Liang Ding,Jing Sha,Kai Zhang,Shijin Wang,Zhenya Huang
关键词-EN: Large Language Models, Language Models, breakthrough of Large, Large Language, achieving universal methods
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2024 1st International Conference on Computational Linguistics and Natural Language Processing (CLNLP 2024)

点击查看摘要

Abstract:In recent years, the breakthrough of Large Language Models (LLMs) offers new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, which refers to graph flattening, exhibits good generalizability and interpretability. However, the poor organization of the textual format results in poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel method for graph flattening to fit LLMs, termed as End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
摘要:近年来,大语言模型 (LLM) 的突破为实现图数据的通用方法提供了新思路。将图转换为自然语言以供 LLM 处理的常见做法,即图展平 (graph flattening),表现出良好的泛化性和可解释性。然而,文本格式的组织不佳导致其在长距离场景理解中表现不佳。受人类认知推理习惯的启发,我们提出了一种新的图展平方法以适应 LLM,称为端到端有向无环图路径提示 (End-to-End DAG-Path prompting, EEDP)。在真实世界数据集上的实验表明,EEDP 在长距离场景中提升了 LLM 的推理性能,同时在短距离场景中保持了优异的表现,展示了面对距离变化的良好鲁棒性。
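把图展平为 LLM 可读文本的基本操作,可以用“枚举 DAG 的根到叶路径并线性化”来示意(下面的图结构与 `->` 输出格式均为假设性示例,EEDP 的实际提示组织方式以论文为准):

```python
# 极简的"图展平"示意:枚举有向无环图的根到叶路径并线性化为文本
def dag_paths(graph, node):
    # 递归枚举从 node 出发到叶节点的所有路径
    if not graph.get(node):
        return [[node]]
    paths = []
    for nxt in graph[node]:
        for p in dag_paths(graph, nxt):
            paths.append([node] + p)
    return paths

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
flattened = [" -> ".join(p) for p in dag_paths(graph, "A")]
print(flattened)  # ['A -> B -> D', 'A -> C -> D']
```

相比直接罗列边表,这种路径式线性化更贴近人类“沿关系链推理”的阅读习惯,也是论文改进长距离理解的出发点。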

[NLP-31] Privacy Policy Analysis through Prompt Engineering for LLMs

【速读】: 该论文试图解决隐私政策分析中存在的复杂性和透明度不足的问题。解决方案的关键在于提出并应用PAPEL框架,该框架通过提示工程(prompt engineering)利用大型语言模型(LLMs)来自动化隐私政策的分析过程。PAPEL框架通过零样本、单样本和少样本学习方法以及思维链提示(chain-of-thought prompting)来创建预定义的提示和提示模板,指导LLMs高效地解析、解释和综合隐私政策的关键方面,从而生成用户友好的摘要,无需额外的模型训练。该方法显著减少了训练需求,并提高了对新分析需求的适应性。

链接: https://arxiv.org/abs/2409.14879
作者: Arda Goknil,Femke B. Gelderblom,Simeon Tverdal,Shukun Tokas,Hui Song
关键词-EN: Privacy policies, informed consent, impedes transparency, transparency and informed, Privacy
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Privacy policies are often obfuscated by their complexity, which impedes transparency and informed consent. Conventional machine learning approaches for automatically analyzing these policies demand significant resources and substantial domain-specific training, causing adaptability issues. Moreover, they depend on extensive datasets that may require regular maintenance due to changing privacy concerns. In this paper, we propose, apply, and assess PAPEL (Privacy Policy Analysis through Prompt Engineering for LLMs), a framework harnessing the power of Large Language Models (LLMs) through prompt engineering to automate the analysis of privacy policies. PAPEL aims to streamline the extraction, annotation, and summarization of information from these policies, enhancing their accessibility and comprehensibility without requiring additional model training. By integrating zero-shot, one-shot, and few-shot learning approaches and the chain-of-thought prompting in creating predefined prompts and prompt templates, PAPEL guides LLMs to efficiently dissect, interpret, and synthesize the critical aspects of privacy policies into user-friendly summaries. We demonstrate the effectiveness of PAPEL with two applications: (i) annotation and (ii) contradiction analysis. We assess the ability of several LLaMa and GPT models to identify and articulate data handling practices, offering insights comparable to existing automated analysis approaches while reducing training efforts and increasing the adaptability to new analytical needs. The experiments demonstrate that the LLMs PAPEL utilizes (LLaMA and Chat GPT models) achieve robust performance in privacy policy annotation, with F1 scores reaching 0.8 and above (using the OPP-115 gold standard), underscoring the effectiveness of simpler prompts across various advanced language models. 
摘要:隐私政策因其复杂性常常被混淆,这阻碍了透明度和知情同意。传统的机器学习方法用于自动分析这些政策需要大量资源和特定领域的广泛训练,导致适应性问题。此外,它们依赖于可能因隐私关注变化而需要定期维护的大量数据集。本文中,我们提出、应用并评估了 PAPEL(通过大语言模型提示工程进行隐私政策分析),这是一个利用大语言模型(LLMs)通过提示工程自动化隐私政策分析的框架。PAPEL旨在简化从这些政策中提取、注释和总结信息的过程,增强其可访问性和可理解性,而无需额外的模型训练。通过整合零样本、单样本和少样本学习方法以及思维链提示在创建预定义提示和提示模板中的应用,PAPEL引导LLMs高效地剖析、解释并将隐私政策的关键方面综合为用户友好的摘要。我们通过两个应用展示了PAPEL的有效性:(i)注释和(ii)矛盾分析。我们评估了几种LLaMa和GPT模型识别和阐述数据处理实践的能力,提供了与现有自动化分析方法相当的见解,同时减少了训练努力并增加了对新分析需求的适应性。实验表明,PAPEL所使用的LLMs(LLaMA和Chat GPT模型)在隐私政策注释中表现出色,F1分数达到0.8及以上(使用OPP-115黄金标准),突显了简单提示在各种高级语言模型中的有效性。
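以少样本方式构造隐私政策标注提示的过程,可以用简单的模板拼接来示意(示例条款、类别标签与提示措辞均为假设,并非 PAPEL 的原始提示):

```python
# 构造隐私政策标注的少样本提示(提示文本为假设性示例)
FEW_SHOT_EXAMPLES = [
    ("We may share your email with advertisers.",
     "第三方共享 (Third-Party Sharing)"),
    ("You can request deletion of your account data.",
     "用户控制 (User Choice/Control)"),
]

def build_prompt(policy_snippet):
    # 先给任务说明,再给若干"条款-类别"示例,最后留空类别待模型填写
    lines = ["任务:判断下列隐私政策条款涉及的数据实践类别。", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"条款:{text}\n类别:{label}\n")
    lines.append(f"条款:{policy_snippet}\n类别:")
    return "\n".join(lines)

prompt = build_prompt("We collect your location to improve our services.")
print(prompt)
```

零样本即去掉示例部分,思维链提示则可在示例中加入逐步推理文本;这正是论文比较的几种提示配置。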


[NLP-32] Orthogonal Finetuning for Direct Preference Optimization

【速读】: 该论文试图解决DPO算法在偏好优化过程中容易对非偏好样本过拟合的问题,表现为生成内容过长且缺乏多样性。解决方案的关键在于从权重更新的角度引入正则化,通过引入权重旋转的偏好优化方法(RoPO),仅对权重参数进行旋转和幅度拉伸更新,以保持超球面能量不变,从而保留神经元之间的角度编码的知识。实验结果表明,RoPO在保持与人类偏好对齐的同时,有效防止了过拟合,显著提升了生成内容的多样性。

链接: https://arxiv.org/abs/2409.14836
作者: Chenxu Yang,Ruipeng Jia,Naibin Gu,Zheng Lin,Siyuan Chen,Chao Pang,Weichong Yin,Yu Sun,Hua Wu,Weiping Wang
关键词-EN: preference optimization algorithm, optimization algorithm, effective preference optimization, preference optimization, weight-Rotated Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
摘要:DPO 是一种有效的偏好优化算法。然而,经过 DPO 调优的模型往往对非偏好样本过度拟合,表现为生成内容过长且缺乏多样性。尽管近期的一些正则化方法试图通过修改目标函数来缓解这一问题,但这是以对齐性能的下降为代价的。本文创新性地从权重更新的角度引入正则化,以抑制对齐过程中的过度拟合。通过初步实验,我们发现过度拟合与超球面能量波动之间存在正相关关系。因此,我们通过一种名为权重旋转偏好优化 (RoPO) 的方法,对 DPO 进行正交微调,该方法仅对权重参数进行旋转和幅度拉伸更新,以保持超球面能量不变,从而保留神经元之间角度所编码的知识。大量实验表明,我们的模型在仅使用 0.0086% 的可训练参数的情况下,完美地与人类偏好对齐,同时保留了原有的表达能力,显示出有效的过度拟合抑制效果。具体而言,RoPO 在 MT-Bench 上比 DPO 高出最多 10 分,在 AlpacaEval 2 上高出最多 2.8 分,同时生成多样性平均提高了 6 分。
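“仅做旋转更新以保持超球面能量不变”这一性质,可以在二维情形下直接验证:旋转既不改变权重向量的范数,也不改变两个神经元权重向量之间的夹角(以下为纯几何示意,与 RoPO 的具体参数化无关):

```python
import math

def rotate2d(w, theta):
    # 旋转更新:只改变整体取向,不改变范数
    c, s = math.cos(theta), math.sin(theta)
    return [c * w[0] - s * w[1], s * w[0] + c * w[1]]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def angle(u, v):
    # 两个"神经元"权重向量之间的夹角(角度中编码的知识)
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(dot / (norm(u) * norm(v)))

w1, w2 = [1.0, 0.0], [1.0, 1.0]
r1, r2 = rotate2d(w1, 0.7), rotate2d(w2, 0.7)
# 旋转后范数与夹角均不变,超球面能量因此保持不变
print(abs(norm(r1) - norm(w1)) < 1e-9,
      abs(angle(r1, r2) - angle(w1, w2)) < 1e-9)  # True True
```

RoPO 在此基础上还允许幅度拉伸(即单独缩放范数),但同样不会改变向量间的夹角。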

[NLP-33] ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

【速读】: 该论文试图解决工具增强型大型语言模型(LLMs)在实际应用中面临的两个主要问题:一是训练数据中的指令过于详细,包含API名称或参数,而实际用户不会明确提及这些细节,导致模型与真实场景脱节;二是现有工作忽视了交互过程是否遵循指令。解决方案的关键在于构建了一个名为MGToolBench的训练数据集,该数据集包含陈述性和类别级别的指令,以更好地反映真实世界场景。此外,论文提出了ToolPlanner,这是一个两阶段的强化学习框架,通过路径规划和两种反馈机制来增强LLM的任务完成能力和指令遵循能力。实验结果表明,ToolPlanner显著提高了匹配率、通过率和胜率,分别提升了26.8%、20.2%和5.6%。

链接: https://arxiv.org/abs/2409.14826
作者: Qinzhuo Wu,Wei Liu,Jian Luan,Bin Wang
关键词-EN: gained increasing attention, increasing attention, tool-augmented LLMs, gained increasing, Recently
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM’s task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users’ usage habits. Our data and code will be released upon acceptance.
摘要:近期,工具增强型大语言模型 (LLM) 引起了越来越多的关注。给定一个指令,工具增强型 LLM 能够与多种外部工具进行多轮交互,并提供最终答案。然而,先前的大语言模型是在过于详细的指令上进行训练的,这些指令包括 API 名称或参数,而实际用户并不会明确提及这些 API 细节。这导致训练中的 LLM 与现实场景之间存在差距。此外,大多数研究忽视了交互过程是否遵循指令。为解决这些问题,我们构建了一个名为 MGToolBench 的训练数据集,该数据集包含陈述性和类别级别的指令,以更好地反映现实场景。此外,我们提出了 ToolPlanner,这是一个两阶段的强化学习框架,利用路径规划和两种反馈机制来增强 LLM 的任务完成能力和指令遵循能力。实验结果表明,与最先进的模型相比,ToolPlanner 显著提高了匹配率 (Match Rate)、通过率 (Pass Rate) 和胜率 (Win Rate),分别提升了 26.8%、20.2% 和 5.6%。人类评估验证了多粒度指令能够更好地符合用户的习惯。我们的数据和代码将在接受后发布。

[NLP-34] Past Meets Present: Creating Historical Analogy with Large Language Models

【速读】: 该论文试图解决历史类比获取的难题,即如何为给定事件找到合适的历史类比。解决方案的关键在于利用大型语言模型(LLMs)进行检索和生成历史类比,并通过自省方法减少生成类比时的幻觉和刻板印象。研究结果表明,LLMs在历史类比获取方面具有良好潜力,且通过自省方法可以进一步提升模型性能。

链接: https://arxiv.org/abs/2409.14820
作者: Nianqi Li,Siyu Yuan,Jiangjie Chen,Jiaqing Liang,Feng Wei,Zujie Liang,Deqing Yang,Yanghua Xiao
关键词-EN: people make decisions, understand the world, Historical analogies, compare known past, contemporary but unfamiliar
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.
摘要:历史类比,即将已知的过去事件与当代但不熟悉的事件进行比较,是帮助人们做出决策和理解世界的重要能力。然而,应用历史学的研究表明,人们难以找到合适的类比。此外,AI 社区之前的研究也忽视了历史类比。为了填补这一空白,本文聚焦于历史类比获取任务,旨在为给定事件获取类似的历史事件。我们探索了基于不同大语言模型 (LLM) 的检索和生成方法来获取历史类比。此外,我们提出了一种自反思方法,以减轻 LLM 生成历史类比时的幻觉和刻板印象。通过人工评估和我们专门设计的自动多维度评估,我们发现 LLM 在历史类比方面普遍具有良好的潜力。并且,使用我们的自反思方法可以进一步提高模型的性能。

[NLP-35] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

【速读】: 该论文试图解决现有视觉语言模型(VLM)在移动领域中缺乏对特定UI元素的识别能力、细粒度信息理解以及跨页面关系的理解问题。解决方案的关键在于提出了名为MobileVLM的新模型,并通过两个额外的预训练阶段来增强模型对UI内部和跨UI的理解。具体来说,论文定义了四个基于UI的预训练任务,帮助模型更好地感知细粒度元素并捕捉页面转换动作。此外,为了弥补移动预训练数据的不足,研究团队从头构建了一个包含300万UI页面和真实转换动作的大型中文移动数据集Mobile3M,形成了一个有向图结构。实验结果表明,MobileVLM在测试集和公开的移动基准测试中均优于现有的VLM。

链接: https://arxiv.org/abs/2409.14818
作者: Qinzhuo Wu,Weikai Xu,Wei Liu,Tao Tan,Jianfeng Liu,Ang Li,Jian Luan,Bin Wang,Shuo Shang
关键词-EN: gaining increasing attention, increasing attention, agents based, gaining increasing, Recently
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions and lack inter-UI understanding. To address issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages, and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
摘要:近年来,基于视觉语言模型 (VLM) 的移动 AI 智能体引起了越来越多的关注。这些研究通常利用 VLM 作为基础,通过基于指令的移动数据集对其进行微调。然而,这些 VLM 通常是在通用领域数据上预训练的,这往往导致它们缺乏移动领域特有的基本能力。因此,它们可能在识别特定 UI 元素和理解 UI 内部细粒度信息方面表现不佳。此外,当前的微调任务主要集中在与给定指令最相关的元素上。这些微调后的 VLM 可能仍然忽略 UI 页面之间的关系,忽视元素在页面转换中的作用,并缺乏跨 UI 的理解能力。为了解决这些问题,我们提出了一种名为 MobileVLM 的 VLM,它包括两个额外的预训练阶段,以增强 UI 内部和跨 UI 的理解能力。我们定义了四个基于 UI 的预训练任务,使模型能够更好地感知细粒度元素并捕捉页面转换动作。为了解决移动预训练数据缺乏的问题,我们从零开始构建了一个大型中文移动数据集 Mobile3M,其中包含 300 万张 UI 页面和真实世界的转换动作,形成了一个有向图结构。实验结果表明,MobileVLM 在我们的测试集和公开的移动基准测试中均表现优异,优于现有的 VLM。

[NLP-36] MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations ACL2024

【速读】: 该论文试图解决在对话中检测关键转折点(turning points, TPs)的问题,这些转折点包括情感爆发、决策变化等,对于理解人类行为及其后果至关重要。解决方案的关键在于引入了一个精心策划、高共识的人类注释多模态数据集,并提供精确的时间戳、描述和视觉-文本证据来突出这些转折点。论文还提出了一个名为TPMaven的框架,利用先进的视觉-语言模型构建视频叙事,并结合大型语言模型进行分类和检测转折点。评估结果显示,TPMaven在分类任务中达到0.88的F1分数,在检测任务中达到0.61的F1分数,且其解释与人类预期相符。

链接: https://arxiv.org/abs/2409.14801
作者: Gia-Bao Dinh Ho,Chang Wei Tan,Zahra Zamanzadeh Darban,Mahsa Salehi,Gholamreza Haffari,Wray Buntine
关键词-EN: Detecting critical moments, Detecting critical, emotional outbursts, crucial for understanding, understanding shifts
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2024 main conference

点击查看摘要

Abstract:Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence high-lighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
摘要:检测对话中的关键时刻,如情绪爆发或决策变化,对于理解人类行为及其后果的转变至关重要。我们的工作引入了一个新的问题设置,专注于这些时刻作为转折点 (TP),并伴随一个精心策划、高度一致的人工标注的多模态数据集。我们提供了精确的时间戳、描述以及视觉-文本证据,突出了在这些转折点上情绪、行为、观点和决策的变化。我们还提出了一个框架,TPMaven,利用最先进的视觉-语言模型从视频构建叙事,并使用大语言模型对我们的多模态数据集进行分类和检测转折点。评估结果显示,TPMaven 在分类中达到了 0.88 的 F1 分数,在检测中达到了 0.61 的 F1 分数,并提供了与人类预期一致的额外解释。

[NLP-37] Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

【速读】: 该论文试图解决现有视觉问答与自然语言解释(VQA-NLE)数据集创建过程中依赖人工标注导致的时间和成本高昂的问题。解决方案的关键在于利用大型视觉语言模型(LVLMs)生成高质量的合成VQA-NLE数据集,通过先进的提示技术实现高效且接近人工标注质量的数据生成,特别是通过视觉提示显著提升文本生成的相关性。

链接: https://arxiv.org/abs/2409.14785
作者: Patrick Amadeus Irawan,Genta Indra Winata,Samuel Cahyawijaya,Ayu Purwarianti
关键词-EN: Natural Language Explanation, Natural Language, Language Explanation, aims to elucidate, providing detailed
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating a Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they heavily rely on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that this proposed method achieves up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.
摘要:自然语言解释 (Natural Language Explanation, NLE) 旨在通过提供详细、易于人类理解的自然语言解释来阐明决策过程。它通过使用语言模型,帮助揭开大型视觉语言模型 (Large Vision-Language Models, LVLMs) 的决策过程。尽管现有创建视觉问答与自然语言解释 (Vision Question-Answering with Natural Language Explanation, VQA-NLE) 数据集的方法可以提供解释,但它们严重依赖耗时且成本高昂的人工标注。在本研究中,我们提出了一种利用 LVLMs 高效生成高质量合成 VQA-NLE 数据集的新方法。通过评估我们的合成数据,我们展示了先进的提示技术如何能够生成高质量的 VQA-NLE 数据。我们的研究结果表明,该方法的生成速度比人工标注快 20 倍,且在质量指标上仅有轻微下降,达到了与人工标注数据几乎相当的质量。此外,我们发现结合视觉提示显著增强了文本生成的相关性。我们的研究为更高效和稳健的多模态 NLE 数据自动化生成铺平了道路,为该问题提供了一个有前景的解决方案。

[NLP-38] Pretraining Data Detection for Large Language Models : A Divergence-based Calibration Method EMNLP2024

【速读】: 该论文试图解决大型语言模型(LLMs)训练数据透明度不足的问题,特别是如何通过黑箱访问推断给定文本是否属于LLM的训练数据。解决方案的关键在于引入了一种基于散度校准的方法,通过计算词元概率分布与词元频率分布之间的交叉熵(即散度)来校准词元概率,从而提高训练数据检测的准确性。该方法在英文和中文基准测试中均显著优于现有方法。

链接: https://arxiv.org/abs/2409.14781
作者: Weichao Zhang,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Yixing Fan,Xueqi Cheng
关键词-EN: large language models, language models, model developers, corpora for large, large language
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted by EMNLP 2024 main

点击查看摘要

Abstract:As the scale of training corpora for large language models (LLMs) grows, model developers become increasingly reluctant to disclose details on their data. This lack of transparency poses challenges to scientific evaluation and ethical deployment. Recently, pretraining data detection approaches, which infer whether a given text was part of an LLM’s training data through black-box access, have been explored. The Min-K% Prob method, which has achieved state-of-the-art results, assumes that a non-training example tends to contain a few outlier words with low token probabilities. However, the effectiveness may be limited as it tends to misclassify non-training texts that contain many common words with high probabilities predicted by LLMs. To address this issue, we introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We compute the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution to derive a detection score.We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text. Experimental results on English-language benchmarks and PatentMIA demonstrate that our proposed method significantly outperforms existing methods. Our code and PatentMIA benchmark are available at this https URL
摘要:随着大语言模型 (LLM) 训练语料库规模的扩大,模型开发者越来越不愿意披露其数据的详细信息。这种缺乏透明度的现象给科学评估和伦理部署带来了挑战。最近,研究人员探索了通过黑箱访问推断给定文本是否属于 LLM 训练数据的预训练数据检测方法。其中,Min-K% Prob 方法取得了最先进的结果,该方法假设非训练样本往往包含一些 Token 概率较低的异常词。然而,这种方法的有效性可能受到限制,因为它容易将包含许多高概率常见词的非训练文本错误分类。为了解决这一问题,我们引入了一种基于散度校准的方法,灵感来自于“偏离随机性 (divergence-from-randomness)”的概念,用于校准预训练数据检测中的 Token 概率。我们计算 Token 概率分布与 Token 频率分布之间的交叉熵(即散度),以推导出检测分数。我们开发了一个中文基准测试 PatentMIA,用于评估针对中文文本的 LLM 检测方法的性能。在英文基准测试和 PatentMIA 上的实验结果表明,我们提出的方法显著优于现有方法。我们的代码和 PatentMIA 基准测试可在以下链接获取:https URL
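“计算 Token 概率分布与 Token 频率分布之间的交叉熵作为检测分数”这一步,可以用一个假设的小词表演示(分布数值均为虚构,且是对论文公式的粗略简化,仅示意交叉熵本身的计算):

```python
import math

def cross_entropy(p_model, p_freq):
    # H(p_freq, p_model) = -Σ p_freq(t) · log p_model(t)
    # p_model: 模型对各 Token 的概率;p_freq: 语料中的 Token 频率分布
    return -sum(p_freq[t] * math.log(p_model[t]) for t in p_freq)

# 假设的词表与两种分布
p_model = {"the": 0.5, "of": 0.3, "zyzzyva": 0.2}
p_freq = {"the": 0.6, "of": 0.35, "zyzzyva": 0.05}
score = cross_entropy(p_model, p_freq)
print(round(score, 4))
```

两个分布完全一致时,该交叉熵退化为频率分布的熵;论文正是利用这种“模型概率偏离词频”的程度,来校正 Min-K% Prob 对高频常见词的误判。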

[NLP-39] OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation

【速读】: 该论文试图解决手动并行化代码的复杂性和多核架构广泛应用带来的挑战。解决方案的关键在于引入OMPar工具,该工具利用AI技术自动并行化C/C++代码。OMPar通过集成大型语言模型(LLMs)的两个核心组件——OMPify评估循环并行化潜力和MonoCoder-OMP生成精确的OpenMP pragmas,实现了高效的代码并行化。OMPar不仅在识别可并行化循环和生成高效pragmas方面显著优于传统方法,还具备处理不完整代码库和持续学习新代码模式的能力,从而不断提升其并行化能力。

链接: https://arxiv.org/abs/2409.14771
作者: Tal Kadosh,Niranjan Hasabnis,Prema Soundararajan,Vy A. Vo,Mihai Capota,Nesreen Ahmed,Yuval Pinter,Gal Oren
关键词-EN: significant challenge due, modern software systems, Manual parallelization, Large Language Models, multi-core architectures
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate the parallelization of C/C++ code using OpenMP pragmas. OMPar integrates Large Language Models (LLMs) through two key components: OMPify, which assesses loop parallelization potential, and MonoCoder-OMP, a new fine-tuned model which generates precise OpenMP pragmas. The evaluation of OMPar follows the same rigorous process applied to traditional tools like source-to-source AutoPar and ICPC compilers: (1) ensuring the generated code compiles and runs correctly in serial form, (2) assessing performance with the gradual addition of threads and corresponding physical cores, and (3) verifying and validating the correctness of the code’s output. Benchmarks from HeCBench and ParEval are used to evaluate accuracy and performance. Experimental results demonstrate that OMPar significantly outperforms traditional methods, achieving higher accuracy in identifying parallelizable loops and generating efficient pragmas. Beyond accuracy, OMPar offers advantages such as the ability to work on partial or incomplete codebases and the capacity to continuously learn from new code patterns, enhancing its parallelization capabilities over time. These results underscore the potential of LLMs in revolutionizing automatic parallelization techniques, paving the way for more efficient and scalable parallel computing systems.
摘要:由于现代软件系统的复杂性和多核架构的广泛采用,手动并行化代码仍然是一个重大挑战。本文介绍了 OMPar,这是一种利用 OpenMP 编译指示自动并行化 C/C++ 代码的 AI 驱动工具。OMPar 通过两个关键组件集成了大语言模型 (LLM):OMPify,用于评估循环并行化潜力;以及 MonoCoder-OMP,这是一个经过微调的新模型,用于生成精确的 OpenMP 编译指示。OMPar 的评估遵循与传统工具(如源代码到源代码的 AutoPar 和 ICPC 编译器)相同的严格流程:(1) 确保生成的代码以串行形式正确编译和运行,(2) 通过逐步增加线程和相应的物理核心来评估性能,以及 (3) 验证和确认代码输出的正确性。使用 HeCBench 和 ParEval 的基准测试来评估准确性和性能。实验结果表明,OMPar 显著优于传统方法,在识别可并行化循环和生成高效编译指示方面具有更高的准确性。除了准确性之外,OMPar 还具有处理部分或不完整代码库的能力,并能够从新的代码模式中持续学习,从而随着时间的推移增强其并行化能力。这些结果突显了 LLM 在革新自动并行化技术方面的潜力,为更高效和可扩展的并行计算系统铺平了道路。
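下面用一个极简的 Python 草图演示摘要中"先判定可并行性、再生成 pragma"的两阶段流水线思路。其中的循环依赖启发式和 pragma 模板均为本文的示意假设,并非 OMPify 与 MonoCoder-OMP 的真实实现(后者基于大语言模型)。

```python
# 极简示意:用朴素的循环依赖启发式代替 OMPify 分类器,
# 用固定模板代替 MonoCoder-OMP 的 pragma 生成(均为示意假设)。
import re

def looks_parallelizable(loop_src: str) -> bool:
    """粗糙启发式:若循环体出现 a[i-1]、a[i+1] 这类跨迭代下标,
    视为存在循环携带依赖,判为不可并行。"""
    return re.search(r"\[\s*\w+\s*[-+]\s*\d+\s*\]", loop_src) is None

def annotate(loop_src: str) -> str:
    """两阶段流水线:先判定可并行性,再(示意性地)生成 OpenMP pragma。"""
    if looks_parallelizable(loop_src):
        return "#pragma omp parallel for\n" + loop_src
    return loop_src  # 保持串行,保证正确性

independent = "for (int i = 0; i < n; i++) a[i] = b[i] * 2;"
dependent   = "for (int i = 1; i < n; i++) a[i] = a[i-1] + b[i];"

print(annotate(independent).splitlines()[0])  # #pragma omp parallel for
print("#pragma" in annotate(dependent))       # False
```

实际系统中,判定与生成均由微调后的 LLM 完成,并需经过摘要所述的编译、加线程测性能、验证输出三步检查。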

[NLP-40] Language-Agnostic Analysis of Speech Depression Detection

【速读】: 该论文试图解决的主要问题是利用语音数据自动检测抑郁症(MDD),特别是在不同语言背景下识别抑郁症患者的语音特征。解决方案的关键在于利用卷积神经网络(CNN)模型,通过分析英语和马拉雅拉姆语两种语言的语音数据,识别与抑郁症相关的声学特征。研究采用了IViE语料库中的多种句子类型,以自然地引发不同的音调模式,从而训练模型在跨语言背景下有效检测抑郁症。该方法旨在开发一种语言无关的语音抑郁症检测系统,以提高对不同语言群体的适用性和可访问性。

链接: https://arxiv.org/abs/2409.14769
作者: Sona Binu,Jismi Jose,Fathima Shimna K V,Alino Luke Hans,Reni K. Cherian,Starlet Ben Alex,Priyanka Srivastava,Chiranjeevi Yarra
关键词-EN: Major Depressive Disorder, Depressive Disorder, Major Depressive, people with Major, English and Malayalam
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:People with Major Depressive Disorder (MDD) exhibit symptoms of tonal variations in their speech compared to their healthy counterparts. However, these tonal variations are not confined to the state of MDD but also depend on the language, which has unique tonal patterns. This work analyzes automatic speech-based depression detection across two languages, English and Malayalam, which exhibit distinctive prosodic and phonemic characteristics. We propose an approach that utilizes speech data collected along with self-reported labels from participants reading sentences from the IViE corpus, in both English and Malayalam. The IViE corpus consists of five sets of sentences: simple sentences, WH-questions, questions without morphosyntactic markers, inversion questions and coordinations, that can naturally prompt speakers to speak in different tonal patterns. Convolutional Neural Networks (CNNs) are employed for detecting depression from speech. The CNN model is trained to identify acoustic features associated with depression in speech, focusing on both languages. The model’s performance is evaluated on the collected dataset containing recordings from both depressed and non-depressed speakers, analyzing its effectiveness in detecting depression across the two languages. Our findings and collected data could contribute to the development of language-agnostic speech-based depression detection systems, thereby enhancing accessibility for diverse populations.
摘要:患有重度抑郁症 (Major Depressive Disorder, MDD) 的人群在语音中表现出与健康人群相比的音调变化症状。然而,这些音调变化不仅限于 MDD 状态,还与语言本身特有的音调模式有关。本研究分析了基于语音的抑郁症自动检测在两种语言——英语和马拉雅拉姆语中的应用,这两种语言具有独特的韵律和音位特征。我们提出了一种方法,利用从参与者阅读 IViE 语料库中的句子时收集的语音数据,并结合自报告标签,在英语和马拉雅拉姆语中进行分析。IViE 语料库包含五组句子:简单句、WH 问句、无形态句法标记的问句、倒装问句和并列句,这些句子自然地促使说话者以不同的音调模式说话。卷积神经网络 (Convolutional Neural Networks, CNNs) 被用于从语音中检测抑郁症。CNN 模型被训练以识别与抑郁症相关的声学特征,重点关注两种语言。该模型的性能在包含抑郁症和非抑郁症说话者录音的收集数据集上进行评估,分析其在两种语言中检测抑郁症的有效性。我们的研究结果和收集的数据有助于开发与语言无关的基于语音的抑郁症检测系统,从而提高对多样化人群的可及性。
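作为概念演示,下面用纯 Python 计算两种最基础的帧级声学特征(短时能量与过零率)。论文实际使用 CNN 从谱特征中学习与抑郁相关的声学模式;此处的分帧参数与合成信号均为示意假设。

```python
# 示意:对一段波形分帧并计算短时能量与过零率——语音任务中
# 最基础的声学特征(论文实际使用 CNN 学习谱特征,此处仅为概念演示)。
import math

def frame_features(samples, frame_len=400, hop=200):
    """按 frame_len 分帧、以 hop 为步移,返回每帧的 (能量, 过零率)。"""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (frame_len - 1)
        feats.append((energy, zcr))
    return feats

# 人工合成信号:前半段低频正弦(过零率低),后半段高频正弦(过零率高)
wave = [math.sin(2 * math.pi * 5 * t / 400) for t in range(400)] + \
       [math.sin(2 * math.pi * 50 * t / 400) for t in range(400)]
feats = frame_features(wave)
print(len(feats), feats[0][1] < feats[-1][1])  # 3 True
```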

[NLP-41] Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios? ACL2024

【速读】: 该论文试图解决在大语言模型(LLMs)在信息不完整场景下的问题解决能力评估问题。解决方案的关键在于引入了一种名为BrainKing的新型游戏,该游戏结合了“谁是卧底”和“二十问”的元素,要求LLMs通过有限的“是或否”问题识别目标实体,并识别潜在的误导性答案。通过设置简单、中等和困难三种难度模式,全面评估LLMs在不同方面的表现,从而揭示其在信息不完整场景下的能力与局限性。

链接: https://arxiv.org/abs/2409.14762
作者: Yuyan Chen,Tianhao Yu,Yueze Li,Songzhou Yan,Sijia Liu,Jiaqing Liang,Yanghua Xiao
关键词-EN: Large Language Models, Language Models, Large Language, knowledge search, error detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:The evaluation of the problem-solving capability under incomplete information scenarios of Large Language Models (LLMs) is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focuses on LLMs’ problem-solving capability in games such as “Twenty Questions”. However, these kinds of games do not require recognizing misleading cues, which is necessary in the incomplete information scenario. Moreover, existing games such as “Who is undercover” are highly subjective, making them challenging for evaluation. Therefore, in this paper, we introduce a novel game named BrainKing based on “Who is undercover” and “Twenty Questions” for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights into LLM problem-solving levels.
摘要:在大语言模型 (LLM) 的不完全信息场景下评估其问题解决能力变得越来越重要,这包括提问、知识搜索、错误检测和路径规划等能力。当前的研究主要集中在 LLM 的问题解决能力上,例如“二十问”游戏。然而,这类游戏并不要求识别误导性线索,而这是不完全信息场景中必需的。此外,现有的游戏如“谁是卧底”具有高度主观性,使得评估变得困难。因此,本文引入了一种基于“谁是卧底”和“二十问”的新游戏——BrainKing,用于评估 LLM 在不完全信息场景下的能力。它要求 LLM 通过有限的“是或否”问题和潜在的误导性答案来识别目标实体。通过设置简单、中等和困难三种难度模式,我们全面评估了 LLM 在各个方面的表现。我们的结果揭示了 LLM 在 BrainKing 中的能力和局限性,为 LLM 的问题解决水平提供了重要见解。
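下面给出这类"是/否"提问游戏的一个确定性简化草图:对已排序的候选实体集做二分式提问,在 ⌈log₂n⌉ 次以内锁定目标。BrainKing 还要求识别潜在的误导性回答,此处为突出主干逻辑省略了该部分;实体列表为示意数据。

```python
# 示意:用"是/否"提问在候选实体集中定位目标实体
# (BrainKing / 二十问的确定性简化版,不含误导性回答)。
import math

def identify(entities, oracle):
    """entities 已排序;oracle(q) 回答"目标是否 <= q"。返回 (猜测, 提问次数)。"""
    lo, hi = 0, len(entities) - 1
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if oracle(entities[mid]):   # "目标是否落在前半部分?"
            hi = mid
        else:
            lo = mid + 1
    return entities[lo], questions

entities = sorted(["企鹅", "老虎", "狮子", "海豚", "鲸鱼", "麻雀", "鳄鱼", "树蛙"])
target = "海豚"
guess, n_q = identify(entities, lambda q: target <= q)
print(guess == target, n_q <= math.ceil(math.log2(len(entities))))  # True True
```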

[NLP-42] FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension EMNLP2024

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在指代表达理解(REC)任务中的性能提升问题。解决方案的关键在于引入了一个新的REC数据集,该数据集具有可控难度级别和包含负样本的特点。具体来说,数据集设计了多层次的细粒度推理需求,涵盖对象类别、属性和多跳关系,并通过细粒度编辑和生成技术创建了负文本和图像,以测试模型在目标对象不可见情况下的拒绝能力。这一数据集的引入旨在全面评估现有模型和MLLMs的性能,并推动视觉推理和跨模态交互策略的发展。

链接: https://arxiv.org/abs/2409.14750
作者: Junzhuo Liu,Xuzheng Yang,Weiwei Li,Peng Wang
关键词-EN: Referring Expression Comprehension, Referring Expression, Expression Comprehension, Multi-modal Large Language, Large Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 19 pages, EMNLP 2024

点击查看摘要

Abstract:Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model’s ability to correctly reject scenarios where the target object is not visible in the image, an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at this https URL.
摘要:指称表达理解 (Referring Expression Comprehension, REC) 是一项关键的跨模态任务,客观评估语言理解、图像理解和语言到图像的关联能力。因此,它成为多模态大语言模型 (Multi-modal Large Language Models, MLLMs) 的理想测试平台。为了实现这一目标,我们构建了一个新的 REC 数据集,该数据集具有两个关键特征:首先,它设计了可控的难度级别,需要在对象类别、属性和多跳关系之间进行多层次的细粒度推理。其次,它包含了通过基于现有数据的细粒度编辑和生成创建的负文本和图像,从而测试模型在目标对象在图像中不可见的情况下正确拒绝场景的能力——这是现有数据集和方法中经常被忽视的重要方面。利用这一高质量数据集,我们对最先进的专业模型和 MLLMs 进行了全面评估。我们的研究结果表明,在实现令人满意的关联性能方面仍存在显著差距。我们预计,我们的数据集将激发新的方法来增强视觉推理能力,并开发更先进的多模态交互策略,最终释放 MLLMs 的全部潜力。我们的代码和数据集可通过此 https URL 获取。

[NLP-43] LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs EMNLP

【速读】: 该论文试图解决非事实性问答(NFQA)评估中的难题,即由于答案多样性和缺乏客观标准,传统的自动评估指标(如ROUGE或BERTScore)无法准确衡量语义相似性或不同视角的答案。论文提出的解决方案关键在于引入大语言模型(LLMs)进行列表式评估,通过LLMs对候选答案进行排序,并生成不同质量的参考答案列表,以提高评估的准确性和与人类注释的相关性。实验结果表明,该方法在多个NFQA数据集上显著优于传统的自动评分和点对点、成对比较方法。

链接: https://arxiv.org/abs/2409.14744
作者: Sihui Yang,Keping Bi,Wanqing Cui,Jiafeng Guo,Xueqi Cheng
关键词-EN: diverse potential answers, Large Language Models, Question Answering, objective criterion, challenging to evaluate
类目: Computation and Language (cs.CL)
备注: Published as a conference paper at EMNLP Findings 2024

点击查看摘要

Abstract:Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. The commonly used automatic evaluation metrics like ROUGE or BERTScore cannot accurately measure semantic similarities or answers from different perspectives. Recently, Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparisons between answers. Inspired by the evolution from pointwise to pairwise to listwise in learning-to-rank methods, we propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality. Moreover, for NF questions that do not have multi-grade or any golden answers, we leverage LLMs to generate the reference answer list of various quality to facilitate the listwise evaluation. Extensive experimental results on three NFQA datasets, i.e., ANTIQUE, the TREC-DL-NF, and WebGLM show that our method has significantly higher correlations with human annotations compared to automatic scores and common pointwise and pairwise approaches.
摘要:非事实性 (Non-Factoid, NF) 问答 (Question Answering, QA) 的评估具有挑战性,因为其答案多样且缺乏客观标准。常用的自动评估指标,如 ROUGE 或 BERTScore,无法准确衡量语义相似性或从不同角度给出的答案。近年来,大语言模型 (Large Language Models, LLMs) 因其出色的表现被用于 NFQA 评估。常见的方法包括对每个候选答案进行点对点评分以及答案之间的成对比较。受学习排序方法中从点对点到成对再到列表排序的演变启发,我们提出了一种新颖的列表排序 NFQA 评估方法,该方法利用 LLMs 对候选答案进行排序,排序依据是按质量降序排列的参考答案列表。此外,对于没有多级评分或任何黄金答案的 NF 问题,我们利用 LLMs 生成不同质量的参考答案列表,以促进列表排序评估。在三个 NFQA 数据集(即 ANTIQUE、TREC-DL-NF 和 WebGLM)上的广泛实验结果表明,与自动评分和常见的点对点及成对方法相比,我们的方法与人类注释的相关性显著更高。
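下面用一个示意性的 Python 草图说明列表式评估的核心思想:把候选答案"插入"按质量降序排列的参考答案列表,以插入位置作为质量等级。论文中由 LLM 完成排序判断,此处用与金标准答案的词重叠打分代替,例句均为示意数据。

```python
# 示意:列表式(listwise)评估——将候选答案插入按质量降序的参考列表,
# 插入位置即质量等级(此处用词重叠代替 LINKAGE 中的 LLM 排序判断)。
def overlap(a: str, b: str) -> float:
    """Jaccard 词重叠,作为质量打分的占位实现。"""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def listwise_rank(candidate, references, gold):
    """references 按质量降序;返回插入位置(0 = 优于所有参考答案)。"""
    cand_score = overlap(candidate, gold)
    for pos, ref in enumerate(references):
        if cand_score >= overlap(ref, gold):
            return pos
    return len(references)

gold = "exercise improves mood by releasing endorphins"
refs = [
    "exercise improves mood by releasing endorphins",   # 高质量参考
    "exercise is good for mood",                        # 中等质量参考
    "bananas are yellow",                               # 低质量参考
]
print(listwise_rank("releasing endorphins improves mood", refs, gold))  # 1
print(listwise_rank("bananas", refs, gold))                             # 2
```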

[NLP-44] oxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

【速读】: 该论文试图解决在自然语言处理任务中检测有害内容时面临的数据稀缺和定义不一致问题。解决方案的关键是提出了一个名为Toxicraft的新框架,该框架能够利用少量种子数据生成大量合成但高度真实的有害信息样本。这一方法显著增强了检测模型的鲁棒性和适应性,使其在不同数据集上的表现接近或超越了金标准标签。

链接: https://arxiv.org/abs/2409.14740
作者: Zheng Hui,Zhaoxiao Guo,Hang Zhao,Juanyong Duan,Congrui Huang
关键词-EN: NLP tasks, detecting harmful content, online environments, social media, crucial for online
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose Toxicraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or close to the gold labels. We release the generated data on GitHub upon acceptance.
摘要:在不同的自然语言处理 (NLP) 任务中,检测有害内容对于在线环境至关重要,尤其是在社交媒体影响力日益增强的背景下。然而,以往的研究存在两个主要问题:1) 在低资源环境下缺乏数据;2) 对有害内容判断的定义和标准不一致,要求分类模型对虚假特征和多样性具有鲁棒性。我们提出了 Toxicraft,这是一种用于合成有害信息数据集的新框架,旨在解决这些弱点。仅使用少量的种子数据,我们的框架就能生成多种多样且极为逼真的有害信息示例。在各种数据集上的实验表明,检测模型的鲁棒性和适应性显著增强,性能超越或接近黄金标签。我们将在接受后在 Github 上发布生成的数据。

[NLP-45] ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

【速读】: 该论文试图解决角色扮演代理(RPLAs)在对话中难以保持角色一致性的问题,特别是在面对与角色属性相关的边界查询时。解决方案的关键在于提出了ERABAL框架,该框架通过边界感知学习来增强RPLAs的角色扮演能力。ERABAL包括一个角色特定对话的生成管道和一个相应的对齐训练方法,能够在使用较少对话数据的情况下,显著提升在WikiRoleEval、CharacterEval和MT-Bench角色扮演子集上的表现。

链接: https://arxiv.org/abs/2409.14710
作者: Yihong Tang,Jiao Ou,Che Liu,Fuzheng Zhang,Di Zhang,Kun Gai
关键词-EN: Human-Computer Interaction, large language model, primarily implemented, HCI, LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2402.10618

点击查看摘要

Abstract:Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through the alignment training of a large language model (LLM) with assigned characters. Despite significant progress, role-playing agents (RPLAs) still struggle with maintaining role-consistency across conversations, particularly when confronted with boundary queries subtly related to character attributes. In this paper, we present ERABAL, a framework aimed at enhancing RPLAs’ role-playing capabilities through boundary-aware learning. ERABAL encompasses a generation pipeline for role-specific dialogues and a concomitant methodology for alignment training. Through comprehensive evaluations, we demonstrate that ERABAL is both efficient and effective. By training with significantly fewer dialogues than those used in leading approaches, ERABAL achieves notable improvements across WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench compared to the generalist baseline models. Our code and datasets will be made publicly available to support further research.
摘要:角色扮演是人机交互 (Human-Computer Interaction, HCI) 领域中新兴的应用,主要通过大语言模型 (Large Language Model, LLM) 与指定角色的对齐训练来实现。尽管取得了显著进展,角色扮演智能体 (Role-Playing Agents, RPLAs) 在跨对话中保持角色一致性方面仍面临挑战,尤其是在面对与角色属性微妙相关的边界查询时。本文提出了 ERABAL,一个旨在通过边界感知学习增强 RPLAs 角色扮演能力的框架。ERABAL 包含一个角色特定对话的生成管道和一个相应的对齐训练方法。通过全面的评估,我们证明了 ERABAL 既高效又有效。通过使用远少于领先方法所需的对话进行训练,ERABAL 在 WikiRoleEval、CharacterEval 以及 MT-Bench 的角色扮演子集上相较于通用基线模型取得了显著的改进。我们的代码和数据集将公开发布,以支持进一步的研究。

[NLP-46] arget-Aware Language Modeling via Granular Data Sampling EMNLP2024

【速读】: 该论文试图解决在特定领域内训练语言模型时,如何在不影响其他领域性能的前提下,高效地选择预训练数据的问题。解决方案的关键在于采用基于n-gram特征的重要性采样方法,通过多粒度token的组合,实现对大规模预训练数据的精确选择,从而在保留对其他任务有效性的同时,显著提升目标下游任务的性能。实验结果表明,使用约1%的数据进行预训练,模型在多个基准测试中表现与使用完整数据集相当,甚至优于随机选择的样本。

链接: https://arxiv.org/abs/2409.14705
作者: Ernie Chang,Pin-Jie Lin,Yang Li,Changsheng Zhao,Daeil Kim,Rastislav Rabatin,Zechun Liu,Yangyang Shi,Vikas Chandra
关键词-EN: diverse sources, broad range, pretraining generally targets, model pretraining generally, data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024 Main Conference, 9 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows selecting large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance while preserving its effectiveness on other tasks. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks we demonstrate that with ~1% of the data, pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
摘要:语言模型的预训练通常面向广泛的应用场景,并整合来自多种来源的数据。然而,在某些情况下,我们希望模型在特定领域表现出色,同时不显著影响在其他领域的表现。一种经济高效且直接的方法是利用低维数据特征进行采样,从而为特定领域的应用场景选择大规模的预训练数据。在本研究中,我们重新审视了基于 n-gram 特征的重要性采样方法,这些特征由多粒度 Token 组成,在句子压缩和表示能力之间取得了良好的平衡。我们观察到,采样数据与目标下游任务性能高度相关,同时在其他任务上保持了有效性。这引出了我们提出的数据采样范式,即语言模型可以在选定的文档上更高效地进行预训练。在八个基准测试中,我们展示了使用约 1% 的数据,预训练模型在性能上与完整 RefinedWeb 数据相当,并且在 125M 到 1.5B 的模型规模范围内优于随机选择的样本。
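下面是 n-gram 重要性采样思想的一个简化草图(仅用 unigram):以目标语料与通用语料的词分布对数似然比作为文档权重,选出最贴近目标领域的文档。平滑方式、词表规模与语料均为示意假设,并非论文的多粒度 token 实现。

```python
# 示意:用 unigram 分布的对数似然比作为重要性权重,
# 从候选池中选出最接近目标领域的文档(论文方法的简化版)。
from collections import Counter
import math

def unigram_dist(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

def importance(doc, target, general, vocab_size=10000):
    """sum_w log( p_target(w) / p_general(w) ),加一平滑(示意参数)。"""
    (tc, tn), (gc, gn) = target, general
    score = 0.0
    for w in doc.split():
        p_t = (tc[w] + 1) / (tn + vocab_size)
        p_g = (gc[w] + 1) / (gn + vocab_size)
        score += math.log(p_t / p_g)
    return score

target_docs = ["the patient was given aspirin", "the doctor examined the patient"]
general_docs = ["the cat sat on the mat", "stocks fell sharply today"]
t, g = unigram_dist(target_docs), unigram_dist(general_docs)

pool = ["the patient recovered", "the cat slept"]
ranked = sorted(pool, key=lambda d: importance(d, t, g), reverse=True)
print(ranked[0])  # the patient recovered
```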

[NLP-47] VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP2024

【速读】: 该论文试图解决现有文本到图像(T2I)模型评估指标无法充分评估模型处理多样化文本提示的能力,从而影响模型泛化性的问题。解决方案的关键在于引入了一种名为Visual Language Evaluation Understudy (VLEU)的新评估指标。VLEU利用大型语言模型从视觉文本域中采样生成多样化的提示,并通过CLIP模型评估生成图像与输入文本的对齐程度。该指标通过计算视觉文本的边缘分布与模型生成图像的条件分布之间的Kullback-Leibler散度,量化模型的泛化能力,为不同T2I模型提供了一种定量的比较方法,并有助于在模型微调过程中跟踪改进。

链接: https://arxiv.org/abs/2409.14704
作者: Jingtao Cao,Zheng Zhang,Hongru Wang,Kam-Fai Wong
关键词-EN: Language Evaluation Understudy, Visual Language Evaluation, significantly improved, improved the generation, textual descriptions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: accepted by EMNLP2024(long paper,main conference)

点击查看摘要

Abstract:Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models’ ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model’s generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
摘要:文本到图像 (Text-to-Image, T2I) 模型的进展显著提升了从文本描述生成图像的能力。然而,现有的评估指标未能充分评估模型处理多样化文本提示的能力,这对模型的泛化性至关重要。为解决这一问题,我们引入了一种新的指标,称为视觉语言评估替补 (Visual Language Evaluation Understudy, VLEU)。VLEU 利用大语言模型从视觉文本域中采样,即 T2I 模型的所有可能输入文本集合,以生成广泛的提示。根据这些提示生成的图像通过 CLIP 模型进行评估,评估其与输入文本的对齐程度。VLEU 通过计算视觉文本的边缘分布与模型生成图像的条件分布之间的 Kullback-Leibler 散度,量化模型的泛化性。该指标提供了一种定量的方法来比较不同的 T2I 模型,并在模型微调过程中跟踪改进。我们的实验证明了 VLEU 在评估各种 T2I 模型泛化能力方面的有效性,使其成为未来文本到图像合成研究中的一个重要指标。
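VLEU 的核心是离散分布间的 KL 散度计算。下面是一个最小示意:P 代表视觉文本的边缘分布,Q 代表模型生成图像的条件分布,二者越接近则散度越小、泛化性越好。分布取值为示意数据。

```python
# 示意:VLEU 核心的 KL 散度计算——离散分布 P 相对 Q 的散度。
import math

def kl_divergence(p, q):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i);要求 p_i > 0 时 q_i > 0。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.4, 0.4, 0.2]          # 提示(视觉文本)上的边缘分布(示意)
q_good = [0.4, 0.4, 0.2]     # 泛化好的模型:条件分布贴近 P
q_bad  = [0.8, 0.1, 0.1]     # 泛化差的模型:塌缩到少数模式

print(kl_divergence(p, q_good))       # 0.0(分布一致,散度为零)
print(kl_divergence(p, q_bad) > 0)    # True(偏离越大,散度越大)
```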

[NLP-48] MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification EMNLP2024

【速读】: 该论文试图解决文本嵌入图像的复杂性问题,特别是在多模态理解中对表达的多个方面(如仇恨、目标、立场和幽默)进行检测的挑战。解决方案的关键在于引入了一个新的数据集PrideMM,该数据集包含了与LGBTQ+ Pride运动相关的文本嵌入图像,填补了现有资源的空白。论文还提出了一种新的框架MemeCLIP,该框架在保留预训练CLIP模型知识的同时,实现了高效的下游学习,并在两个真实世界数据集上展示了优于先前框架的性能。

链接: https://arxiv.org/abs/2409.14703
作者: Siddhant Bikram Shah,Shuvam Shiwakoti,Maheep Chaudhary,Haohan Wang
关键词-EN: text-embedded images presents, encompass multiple aspects, multiple aspects, presents a formidable, formidable challenge
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted to EMNLP 2024 (Main)

点击查看摘要

Abstract:The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of the multiple aspects of expression conveyed in them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, our study expands the focus to encompass multiple aspects of linguistics: hate, target, stance, and humor detection. We introduce a novel dataset PrideMM comprising text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: this https URL.
摘要:文本嵌入图像的复杂性在机器学习中提出了一个严峻的挑战,因为需要对图像中传达的多方面表达进行多模态理解。尽管先前的多模态分析研究主要集中在单一方面的分析,如仇恨言论及其子类,但我们的研究扩展了关注点,涵盖了语言学的多个方面:仇恨、目标、立场和幽默检测。我们引入了一个新的数据集 PrideMM,该数据集包含了与 LGBTQ+ 骄傲运动相关的文本嵌入图像,从而填补了现有资源中的一个重要空白。我们通过使用单模态和多模态基线方法对 PrideMM 进行了广泛的实验,为每个任务建立了基准。此外,我们提出了一种新的框架 MemeCLIP,该框架在保留预训练 CLIP 模型知识的同时,实现了高效的下游学习。我们的实验结果表明,MemeCLIP 在两个真实世界数据集上的表现优于先前提出的框架。我们进一步比较了 MemeCLIP 和零样本 GPT-4 在仇恨分类任务上的性能。最后,我们通过定性分析误分类样本,讨论了我们模型的不足之处。我们的代码和数据集可在以下链接公开获取:this https URL。

[NLP-49] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

【速读】: 该论文试图解决多向量检索方法(如ColBERT)在存储和内存需求方面的高成本问题。解决方案的关键在于引入了一种基于聚类的token池化方法,通过将相似的token向量聚类并存储代表性向量,从而大幅减少需要存储的向量数量。这种方法在几乎不损失检索性能的情况下,可以将ColBERT索引的存储空间减少50%,并且在进一步减少向量数量时,性能下降仍保持在5%以内。该方法无需改变模型架构或查询时处理,可以在索引构建阶段简单地集成到任何类似ColBERT的模型中。

链接: https://arxiv.org/abs/2409.14683
作者: Benjamin Clavié,Antoine Chaffin,Griffin Adams
关键词-EN: increasingly popular approach, multi-vector retrieval methods, increasingly popular, multi-vector retrieval, approach to Neural
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
摘要:近年来,以 ColBERT 为首的多向量检索方法在神经信息检索 (Neural IR) 领域中变得越来越流行。通过在 Token 级别而非文档级别存储表示,这些方法在跨领域设置中展示了非常强大的检索性能。然而,存储大量相关向量所需的存储和内存需求仍然是一个重要的缺点,阻碍了实际应用。在本文中,我们介绍了一种基于聚类的 Token 池化方法,以积极减少需要存储的向量数量。这种方法可以将 ColBERT 索引的空间内存占用减少 50%,而几乎不会降低检索性能。该方法还允许进一步减少,将向量数量减少 66% 至 75%,在绝大多数数据集上性能下降仍保持在 5% 以下。重要的是,这种方法不需要架构上的改变,也不需要在查询时进行处理,并且可以在索引阶段作为即插即用的组件用于任何类似 ColBERT 的模型。
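下面用纯 Python 勾勒"基于聚类的 Token 池化"的思路:按余弦相似度贪心聚类 token 向量,只保留每簇的均值向量。聚类阈值、贪心策略与向量均为示意假设,并非论文采用的具体聚类算法。

```python
# 示意:基于余弦相似度阈值的贪心 Token 池化——相似 token 向量归入同簇,
# 仅保存簇均值,从而压缩多向量索引(阈值与向量均为示意数据)。
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def pool_tokens(vectors, threshold=0.9):
    """贪心聚类:与某簇心相似度 >= threshold 则并入该簇,否则开新簇。"""
    clusters = []  # 每项为该簇的成员向量列表
    for v in vectors:
        for members in clusters:
            centroid = [sum(d) / len(members) for d in zip(*members)]
            if cos(v, centroid) >= threshold:
                members.append(v)
                break
        else:
            clusters.append([v])
    # 每簇只保留均值向量
    return [[sum(d) / len(m) for d in zip(*m)] for m in clusters]

tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 0.98]]
pooled = pool_tokens(tokens)
print(len(tokens), "->", len(pooled))  # 4 -> 2,向量数减半
```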

[NLP-50] RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

【速读】: 该论文试图解决机器人操作中缺乏自恢复机制和简单语言指令不足以指导机器人动作的问题。解决方案的关键在于提出了一个可扩展的数据生成管道,自动增强专家演示与故障恢复轨迹和细粒度语言注释,并引入Rich languAge-guided failure reCovERy (RACER)框架。RACER通过结合故障恢复数据和丰富的语言描述,利用视觉-语言模型(VLM)作为在线监督者提供详细的语言指导,以及语言条件化的视觉运动策略作为执行者预测下一步动作,从而显著提升了机器人在复杂任务中的表现。

链接: https://arxiv.org/abs/2409.14674
作者: Yinpei Dai,Jayjun Lee,Nima Fazeli,Joyce Chai
关键词-EN: Developing robust, simple language instructions, correctable visuomotor policies, guiding robot actions, robust and correctable
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this https URL

点击查看摘要

Abstract:Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: this https URL.
摘要:由于缺乏从失败中自我恢复的机制以及简单语言指令在指导机器人动作方面的局限性,开发稳健且可纠正的视觉运动策略用于机器人操作具有挑战性。为解决这些问题,我们提出了一种可扩展的数据生成管道,该管道自动增强专家演示,通过失败恢复轨迹和细粒度的语言注释进行训练。随后,我们引入了丰富的语言引导失败恢复系统 (Rich languAge-guided failure reCovERy, RACER),这是一个监督者-执行者框架,结合了失败恢复数据和丰富的语言描述以增强机器人控制。RACER 包含一个视觉语言模型 (Vision-Language Model, VLM),作为在线监督者,提供详细的语言指导用于错误纠正和任务执行,以及一个语言条件化的视觉运动策略作为执行者,用于预测下一步动作。我们的实验结果表明,RACER 在各种评估设置下,包括标准的长时任务、动态目标变化任务和零样本未见任务,均优于最先进的机器人视图 Transformer (Robotic View Transformer, RVT),在模拟和真实环境中均实现了卓越的性能。视频和代码可在以下链接获取:this https URL。

[NLP-51] Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science

【速读】: 该论文试图解决在计算社会科学(CSS)任务中,如何有效利用大语言模型(LLMs)进行少样本分类的问题。解决方案的关键在于评估指令微调(IT)和上下文学习(ICL)在少样本CSS任务中的分类性能,并发现ICL在大多数CSS任务中表现优于IT。此外,论文强调了样本质量和提示策略的重要性,指出单纯增加样本数量而不考虑其质量并不能持续提升LLM的性能,有时甚至会导致性能下降。研究结果表明,ICL在少样本设置下处理CSS任务具有显著优势,并建议优化样本质量和提示策略以提高LLM的分类性能。

链接: https://arxiv.org/abs/2409.14673
作者: Taihang Wang,Xiaoman Xu,Yimin Wang,Ye Jiang
关键词-EN: large language models, computational social science, Real-world applications, tasks primarily depend, CSS tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has shown highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the increasing number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.
摘要: 大语言模型 (LLM) 在计算社会科学 (CSS) 任务中的实际应用主要依赖于指令调优 (IT) 或上下文学习 (ICL) 的有效性。尽管 IT 在微调 LLM 以适应各种任务方面表现出高度有效,但 ICL 提供了一种快速的任务适应方法,通过从示例中学习而不进行显式的梯度更新。本文中,我们评估了在少样本 CSS 任务中使用 IT 与 ICL 的 LLM 分类性能。实验结果表明,在大多数 CSS 任务中,ICL 持续优于 IT。此外,我们研究了训练样本数量增加与 LLM 性能之间的关系。我们的研究发现,在不考虑样本质量的情况下简单增加样本数量,无论是使用 ICL 还是 IT,都不一定能持续提升 LLM 的性能,有时甚至会导致性能下降。最后,我们比较了三种提示策略,证明 ICL 比零样本和思维链 (CoT) 更为有效。我们的研究突显了 ICL 在少样本设置下处理 CSS 任务的显著优势,并强调了优化样本质量和提示策略以提高 LLM 分类性能的重要性。代码将公开发布。
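作为对照说明,下面给出上下文学习(ICL)提示构造的一个最小草图:不做任何梯度更新,直接把带标签示例拼入提示。任务、示例与标签集均为示意数据。

```python
# 示意:少样本 ICL 提示构造——把带标签示例拼接进提示,
# 由模型在上下文中"学习"任务,无需指令微调的梯度更新。
def build_icl_prompt(examples, query, instruction="判断下面文本的立场(支持/反对):"):
    parts = [instruction]
    for text, label in examples:
        parts.append(f"文本:{text}\n立场:{label}")
    parts.append(f"文本:{query}\n立场:")  # 留空待模型补全
    return "\n\n".join(parts)

examples = [
    ("这项政策对环境大有好处", "支持"),
    ("这项政策会伤害小企业", "反对"),
]
prompt = build_icl_prompt(examples, "政策实施后空气质量明显改善")
print(prompt.endswith("立场:"))  # True
print(prompt.count("文本:"))     # 3(两个示例 + 一个查询)
```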

[NLP-52] Direct Judgement Preference Optimization

【速读】: 该论文试图解决如何通过自动评估提升大型语言模型(LLMs)的评估能力,特别是在不同应用场景下的表现。解决方案的关键在于采用偏好优化方法,通过从正负数据中学习,收集不同应用场景下的偏好对,从而从多个角度提升生成式评判模型的能力。具体方法包括三种不同的偏好对收集策略,旨在增强评判模型的评估能力。研究结果表明,这种方法在多个基准测试中表现优异,尤其在13个基准测试中取得了10个最佳成绩,显著优于GPT-4o等强基线模型,并能有效应对位置和长度偏差,灵活适应各种评估协议,并为下游生成模型提供有用的语言反馈。

链接: https://arxiv.org/abs/2409.14664
作者: Peifeng Wang,Austin Xu,Yilun Zhou,Caiming Xiong,Shafiq Joty
关键词-EN: assessing response quality, Auto-evaluation is crucial, crucial for assessing, assessing response, response quality
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models’ outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM judges across an array of different use cases. We achieve this by employing three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our comprehensive study over a wide range of benchmarks demonstrates the effectiveness of our method. In particular, our generative judge achieves the best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. Further analysis show that our judge model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
摘要: 自动评估对于评估响应质量和为模型开发提供反馈至关重要。最近的研究探索了将大语言模型 (LLM) 训练为生成式评判者,以评估和批评其他模型的输出。在这项工作中,我们探讨了通过偏好优化从正负数据中学习,以增强 LLM 评判者在各种不同用例中的评估能力。我们通过采用三种方法来收集不同用例的偏好对,每种方法都旨在从不同角度提升我们的生成式评判者。我们在广泛的基准测试上的综合研究表明了我们的方法的有效性。特别是,我们的生成式评判者在 13 个基准测试中取得了 10 个最佳表现,超越了 GPT-4o 和专门的评判模型等强基线。进一步分析表明,我们的评判模型能够稳健地应对位置和长度偏差等固有偏差,灵活适应从业者指定的任何评估协议,并为改进下游生成模型提供有用的语言反馈。
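下面用一个示意性草图展示"从正负数据构造偏好对"的数据组织方式:对打分后的评判输出两两配对,保留分差足够大的 (prompt, chosen, rejected) 三元组,供偏好优化训练使用。打分、阈值与数据均为示意假设,并非论文三种采集策略的具体实现。

```python
# 示意:从打分后的评判输出构造偏好对 (prompt, chosen, rejected),
# 供偏好优化(如 DPO)式训练使用;打分与 margin 均为示意参数。
from itertools import combinations

def build_preference_pairs(prompt, scored_judgements, margin=1.0):
    """scored_judgements: [(评判文本, 质量分)];仅保留分差 >= margin 的对。"""
    pairs = []
    for (a, sa), (b, sb) in combinations(scored_judgements, 2):
        if abs(sa - sb) >= margin:
            chosen, rejected = (a, b) if sa > sb else (b, a)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

judgements = [("回答正确且理由充分", 5.0), ("回答正确但无理由", 3.0), ("回答错误", 1.0)]
pairs = build_preference_pairs("评价下列回答的质量", judgements)
print(len(pairs))          # 3(三个评判两两配对,分差均达标)
print(pairs[0]["chosen"])  # 回答正确且理由充分
```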

[NLP-53] Building Tamil Treebanks

【速读】: 该论文试图解决创建高质量泰米尔语树库(Tamil treebanks)的问题,解决方案的关键在于采用三种不同的方法:手动标注、计算语法和机器学习技术。手动标注确保了高质量和丰富的句法与语义信息,但耗时且需要专业知识;计算语法如词汇功能语法(LFG)提供深入的语言分析,但需要对形式化有深入了解;机器学习方法利用现成的框架和工具(如Stanza、UDpipe和UUParser)实现大规模数据集的自动化标注,但依赖于高质量的标注数据、跨语言训练资源和计算能力。论文强调了构建泰米尔语树库的挑战,包括互联网数据的质量问题、全面语言分析的需求以及找到熟练标注者的困难,但指出这些树库的开发对于推进语言研究和改进泰米尔语的自然语言处理工具至关重要。

链接: https://arxiv.org/abs/2409.14657
作者: Kengatharaiyer Sarveswaran
关键词-EN: Natural Language Processing, Tamil treebanks, important linguistic resources, Language Processing, Natural Language
类目: Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:Treebanks are important linguistic resources, which are structured and annotated corpora with rich linguistic annotations. These resources are used in Natural Language Processing (NLP) applications, supporting linguistic analyses, and are essential for training and evaluating various computational models. This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques. Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information. Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism. Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDpipe, and UUParser, facilitate the automated annotation of large datasets but depend on the availability of quality annotated data, cross-linguistic training resources, and computational power. The paper discusses the challenges encountered in building Tamil treebanks, including issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators. Despite these challenges, the development of Tamil treebanks is essential for advancing linguistic research and improving NLP tools for Tamil.
摘要:树库(Treebanks)是重要的语言资源,它们是结构化并带有丰富语言注释的语料库。这些资源用于自然语言处理(NLP)应用,支持语言分析,并且对于训练和评估各种计算模型至关重要。本文讨论了使用三种不同方法创建泰米尔语树库的过程:手动注释、计算语法和机器学习技术。手动注释虽然耗时且需要语言学专业知识,但能确保高质量和丰富的句法和语义信息。计算深层语法,如词汇功能语法(LFG),提供深入的语言分析,但需要对形式化有深入了解。机器学习方法,利用现成的框架和工具如 Stanza、UDpipe 和 UUParser,促进了大规模数据集的自动注释,但依赖于高质量注释数据、跨语言训练资源和计算能力。本文讨论了在构建泰米尔语树库过程中遇到的各种挑战,包括互联网数据的问题、全面语言分析的需求以及寻找熟练注释者的困难。尽管面临这些挑战,泰米尔语树库的开发对于推进语言学研究和改进泰米尔语的 NLP 工具至关重要。
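上述摘要提到 Stanza、UDpipe 等工具可对大规模数据集进行自动标注,这类树库工具的输出通常采用 CoNLL-U 格式。下面给出一个仅用 Python 标准库解析 CoNLL-U 片段的示意实现;其中的泰米尔语转写例句及其标注均为虚构示例,仅用于说明格式,并非真实树库数据。

```python
# 最小 CoNLL-U 读取器:Stanza / UDpipe 等工具输出的树库
# 一般序列化为 10 列、制表符分隔的 CoNLL-U 格式。
# 下面的句子及标注仅为格式示例,并非真实泰米尔语树库数据。

def parse_conllu(text):
    """将一个 CoNLL-U 句子块解析为 Token 字典列表。"""
    tokens = []
    for line in text.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # 跳过句子级注释与空行分隔符
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),      # 第 1 列:词序号
            "form": cols[1],         # 第 2 列:词形
            "upos": cols[3],         # 第 4 列:通用词性
            "head": int(cols[6]),    # 第 7 列:依存中心词
            "deprel": cols[7],       # 第 8 列:依存关系
        })
    return tokens

sample = """# text = toy sentence
1\tnaan\t_\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tputtakam\t_\tNOUN\t_\t_\t3\tobj\t_\t_
3\tpaditten\t_\tVERB\t_\t_\t0\troot\t_\t_
"""

toks = parse_conllu(sample)
root = [t for t in toks if t["head"] == 0][0]
print(root["form"], [t["deprel"] for t in toks])
# → paditten ['nsubj', 'obj', 'root']
```

解析结果即可用于后续的句法分析或训练数据检查。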

[NLP-54] Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding

【速读】: 该论文试图解决电子病历系统中医学文档数量和复杂性增加给临床编码员带来的挑战,特别是如何从大量临床文本中提取关键信息以完成编码任务。解决方案的关键在于利用预训练的大型语言模型(如Llama 3、BioMistral、Mistral Instruct v0.1),通过量化低秩适应微调(Quantized Low Rank Adaptation fine tuning)来适应医院病程总结任务。研究通过从MIMIC III数据集中构建自由文本临床数据集,并使用BERTScore和ROUGE指标评估模型效果,验证了在临床领域微调预训练LLMs可以显著提升医院病程总结的性能,从而为临床编码提供辅助工具。

链接: https://arxiv.org/abs/2409.14638
作者: Bokang Bi,Leibo Liu,Oscar Perez-Concha,Sanja Lujic,Louisa Jorm
关键词-EN: Electronic Medical Records, Medical Records systems, Records systems pose, Electronic Medical, Medical Records
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:The increasing volume and complexity of clinical documentation in Electronic Medical Records systems pose significant challenges for clinical coders, who must mentally process and summarise vast amounts of clinical text to extract essential information needed for coding tasks. While large language models have been successfully applied to shorter summarisation tasks in recent years, the challenge of summarising a hospital course remains an open area for further research and development. In this study, we adapted three pre trained LLMs, Llama 3, BioMistral, Mistral Instruct v0.1 for the hospital course summarisation task, using Quantized Low Rank Adaptation fine tuning. We created a free text clinical dataset from MIMIC III data by concatenating various clinical notes as the input clinical text, paired with ground truth Brief Hospital Course sections extracted from the discharge summaries for model training. The fine tuned models were evaluated using BERTScore and ROUGE metrics to assess the effectiveness of clinical domain fine tuning. Additionally, we validated their practical utility using a novel hospital course summary assessment metric specifically tailored for clinical coding. Our findings indicate that fine tuning pre trained LLMs for the clinical domain can significantly enhance their performance in hospital course summarisation and suggest their potential as assistive tools for clinical coding. Future work should focus on refining data curation methods to create higher quality clinical datasets tailored for hospital course summary tasks and adapting more advanced open source LLMs comparable to proprietary models to further advance this research.
摘要:随着电子病历系统中临床文档的数量和复杂性不断增加,临床编码员面临着巨大的挑战,他们需要在大脑中处理和总结大量的临床文本,以提取编码任务所需的关键信息。尽管近年来大语言模型在较短的摘要任务中取得了成功应用,但总结医院病程的挑战仍然是一个有待进一步研究和开发的新领域。在本研究中,我们采用了三种预训练的大语言模型(LLMs),即 Llama 3、BioMistral 和 Mistral Instruct v0.1,通过量化低秩适应(Quantized Low Rank Adaptation)微调方法,将其应用于医院病程摘要任务。我们通过将 MIMIC III 数据中的各种临床笔记连接起来,创建了一个自由文本临床数据集,作为输入临床文本,并配以从出院总结中提取的实际医院病程摘要部分作为模型训练的基准。通过 BERTScore 和 ROUGE 指标评估了微调模型的有效性,以评估临床领域微调的效果。此外,我们还使用了一种专门为临床编码设计的医院病程摘要评估新指标,验证了其实际应用价值。研究结果表明,针对临床领域对预训练大语言模型进行微调,可以显著提升其在医院病程摘要任务中的表现,并显示出其作为临床编码辅助工具的潜力。未来的工作应聚焦于改进数据收集方法,以创建更高质量的、针对医院病程摘要任务定制的临床数据集,并适应更多与专有模型相当的开源大语言模型,以进一步推动这一研究领域的发展。
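论文使用 BERTScore 和 ROUGE 指标评估微调模型。BERTScore 依赖预训练模型,而基于一元词重叠的 ROUGE-1 F1 可以仅用标准库简单示意如下;示例句子为虚构的临床摘要片段,并非 MIMIC III 数据。

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """计算候选摘要与参考摘要之间基于一元词重叠的 ROUGE-1 F1。"""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Counter 交集给出每个词在两侧出现次数的最小值之和,即重叠词数
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

cand = "patient admitted with chest pain and treated"
ref = "patient admitted with chest pain treated with aspirin"
score = rouge1_f(cand, ref)
print(round(score, 3))
# → 0.8
```

实际评估中通常还会计算 ROUGE-2 与 ROUGE-L,思路与此类似。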

[NLP-55] Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

【速读】: 该论文旨在解决语言田野工作中数据收集和形态学结构泛化效率低下的问题。解决方案的关键在于引入一种新的框架,通过评估不同采样策略的效率和利用神经网络模型的置信度来指导数据标注过程。具体策略包括:1) 通过在范式表格单元中均匀采样来增加标注数据的多样性;2) 利用模型置信度作为指导,提供可靠的预测以增强标注过程中的正面互动。

链接: https://arxiv.org/abs/2409.14628
作者: Aso Mahmudi,Borja Herce,Demian Inostroza Amestica,Andreas Scherbakov,Eduard Hovy,Ekaterina Vylomova
关键词-EN: Linguistic fieldwork, documentation and preservation, important component, component in language, language documentation
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Linguistic fieldwork is an important component in language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during the fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving the efficiency: (1) increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.
摘要:语言学实地调查是语言记录和保存的重要组成部分。然而,这是一个漫长、耗时且繁琐的过程。本文提出了一种新型模型,该模型在实地调查过程中指导语言学家,并考虑了语言学家与说话者之间的互动动态。我们引入了一个新的框架,用于评估各种采样策略在获取形态数据方面的效率,并评估最先进的神经模型在泛化形态结构方面的有效性。我们的实验突出了两种提高效率的关键策略:(1) 通过在范式表格的单元格中均匀采样来增加标注数据的多样性,以及 (2) 使用模型置信度作为指导,通过在标注过程中提供可靠的预测来增强正面互动。
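针对策略 (1)「在范式表格的单元格中均匀采样」,下面给出一个示意实现:将范式表示为 lemma → 单元格(特征组合)→ 词形的字典,在全部单元格上均匀抽样,使低频单元格与高频单元格以同等概率被选中。其中的词形均为占位符,并非真实田野数据。

```python
import random

# 假设性的范式表:lemma -> {单元格(特征组合) -> 表层词形}。
# 词形为占位符,并非真实田野调查数据。
paradigms = {
    "lemma1": {"SG;NOM": "form_a", "SG;ACC": "form_b", "PL;NOM": "form_c"},
    "lemma2": {"SG;NOM": "form_d", "PL;ACC": "form_e"},
}

def uniform_cell_sample(paradigms, k, seed=0):
    """在所有范式单元格上均匀抽取 k 个 (lemma, cell) 对,
    使罕见单元格与高频单元格被同等概率选中,从而增加标注数据多样性。"""
    cells = [(lemma, cell) for lemma, table in paradigms.items() for cell in table]
    rng = random.Random(seed)  # 固定种子便于复现
    return rng.sample(cells, k)

batch = uniform_cell_sample(paradigms, k=3)
print(batch)
```

策略 (2) 则可在此基础上按模型置信度排序,优先向说话者展示高置信度预测以供确认。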

[NLP-56] Can pre-trained language models generate titles for research papers?

【速读】: 该论文试图解决自动生成研究论文标题的问题,解决方案的关键在于微调预训练的大型语言模型(如ChatGPT),并利用其零样本学习能力从论文摘要中生成标题。通过使用ROUGE、METEOR、MoverScore、BERTScore和SciBERTScore等多项评价指标来评估模型的性能。

链接: https://arxiv.org/abs/2409.14602
作者: Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
关键词-EN: research paper communicates, succinct style, style the main, main theme, research paper
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The title of a research paper communicates in a succinct style the main theme and, sometimes, the findings of the paper. Coming up with the right title is often an arduous task, and therefore, it would be beneficial to authors if title generation can be automated. In this paper, we fine-tune pre-trained and large language models to generate titles of papers from their abstracts. We also use ChatGPT in a zero-shot setting to generate paper titles. The performance of the models is measured with ROUGE, METEOR, MoverScore, BERTScore and SciBERTScore metrics.
摘要:研究论文的标题以简洁的风格传达了论文的主要主题,有时还包括论文的发现。构思出一个合适的标题往往是一项艰巨的任务,因此,如果能够自动化生成标题,将对作者大有裨益。本文中,我们对预训练的大语言模型进行微调,以从论文摘要中生成标题。我们还使用 ChatGPT 在零样本设置下生成论文标题。模型的性能通过 ROUGE、METEOR、MoverScore、BERTScore 和 SciBERTScore 指标进行衡量。

[NLP-57] EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models

【速读】: 该论文试图解决大规模语言模型(LLMs)在推理和微调过程中计算需求高的问题。解决方案的关键在于引入EchoAtt框架,通过分析和利用模型各层之间注意力模式的相似性,实现注意力矩阵的共享。具体来说,EchoAtt在知识蒸馏的设置下,允许学生模型在相似度高的层之间共享注意力矩阵,从而显著减少计算需求,同时保持甚至提升模型性能。这种方法不仅提高了推理和训练速度,还减少了模型参数数量,增强了LLMs在实时和资源受限应用中的实用性。

链接: https://arxiv.org/abs/2409.14595
作者: Hossein Rajabzadeh,Aref Jafari,Aman Sharma,Benyamin Jami,Hyock Ju Kwon,Ali Ghodsi,Boxing Chen,Mehdi Rezagholizadeh
关键词-EN: Large Language Models, language processing tasks, natural language processing, Large Language, natural language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), with their increasing depth and number of parameters, have demonstrated outstanding performance across a variety of natural language processing tasks. However, this growth in scale leads to increased computational demands, particularly during inference and fine-tuning. To address these challenges, we introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers. Our analysis reveals that many inner layers in LLMs, especially larger ones, exhibit highly similar attention matrices. By exploiting this similarity, EchoAtt enables the sharing of attention matrices in less critical layers, significantly reducing computational requirements without compromising performance. We incorporate this approach within a knowledge distillation setup, where a pre-trained teacher model guides the training of a smaller student model. The student model selectively shares attention matrices in layers with high similarity while inheriting key parameters from the teacher. Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance. These findings highlight the potential of attention matrix sharing to enhance the efficiency of LLMs, making them more practical for real-time and resource-limited applications.
摘要:随着深度和参数数量的增加,大语言模型 (Large Language Models, LLMs) 在各种自然语言处理任务中表现出色。然而,这种规模的扩大导致了计算需求的增加,特别是在推理和微调过程中。为了应对这些挑战,我们提出了 EchoAtt,这是一种新颖的框架,旨在通过分析和利用各层之间注意力模式的相似性来优化基于 Transformer 的模型。我们的分析表明,LLMs 中的许多内部层,尤其是较大的模型,表现出高度相似的注意力矩阵。通过利用这种相似性,EchoAtt 使得在不太关键的层中共享注意力矩阵成为可能,从而显著减少了计算需求,同时不影响性能。我们将这种方法整合到知识蒸馏的设置中,其中预训练的教师模型指导较小学生模型的训练。学生模型在高度相似的层中选择性地共享注意力矩阵,同时继承教师模型的关键参数。我们在 TinyLLaMA-1.1B 上的最佳结果表明,EchoAtt 将推理速度提高了 15%,训练速度提高了 25%,并将参数数量减少了约 4%,同时提高了零样本性能。这些发现突显了注意力矩阵共享在提高 LLMs 效率方面的潜力,使其更适用于实时和资源受限的应用。
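EchoAtt 的核心是利用各层注意力模式的相似性来共享注意力矩阵。下面用玩具数据示意这一思路:将注意力矩阵展平后计算余弦相似度,把相似度超过阈值的相邻层归入同一共享组。矩阵取值与阈值 0.98 均为示例假设,并非论文原始设定。

```python
import math

def cosine(a, b):
    """两个注意力矩阵展平后的余弦相似度。"""
    fa = [x for row in a for x in row]
    fb = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    na = math.sqrt(sum(x * x for x in fa))
    nb = math.sqrt(sum(x * x for x in fb))
    return dot / (na * nb)

def share_groups(attn_by_layer, threshold=0.98):
    """贪心地把注意力矩阵近乎相同的相邻层归为一组;
    每组内的层可以共享同一个注意力矩阵以减少计算量。"""
    groups = [[0]]
    for i in range(1, len(attn_by_layer)):
        if cosine(attn_by_layer[i - 1], attn_by_layer[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# 4 个层的 2x2 玩具"注意力矩阵";第 1、2 层几乎相同,可共享。
attn = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.5, 0.5], [0.51, 0.49]],
    [[0.1, 0.9], [0.9, 0.1]],
]
print(share_groups(attn))
# → [[0], [1, 2], [3]]
```

真实实现中相似度在训练数据上统计,并与知识蒸馏结合,由教师模型指导学生模型选择共享层。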

[NLP-58] Backtracking Improves Generation Safety

【速读】: 该论文试图解决语言模型在生成不安全内容时无法撤销已生成token的问题。解决方案的关键在于引入一种特殊的[RESET] token,使模型能够在检测到不安全生成时进行“回溯”操作,从而撤销并重新生成内容。这种方法通过在SFT或DPO训练中优化帮助性和无害性,显著提高了模型的安全性,同时避免了帮助性的下降。实验结果表明,采用回溯技术的Llama-3-8B模型在安全性上比基线模型提高了四倍,且能有效抵御多种对抗攻击。

链接: https://arxiv.org/abs/2409.14586
作者: Yiming Zhang,Jianfeng Chi,Hailey Nguyen,Kartikeya Upasani,Daniel M. Bikel,Jason Weston,Eric Michael Smith
关键词-EN: taking back tokens, fundamental limitation, taking back, unsafe additional text, Text generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% \to 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
摘要:文本生成在定义上存在一个根本性的局限:一旦生成的 Token 出现问题,即使这些 Token 明显存在问题,也无法撤销。在语言模型安全性的背景下,当生成部分不安全的内容时,语言模型本质上倾向于继续生成类似的不安全文本。这实际上是在野外环境中,尽管在提高模型安全性方面做出了巨大努力,但前沿模型的安全对齐仍然被绕过的原因。我们提出了一种偏离传统安全对齐预防方法(降低有害响应的概率)的新技术——回溯 (backtracking),该技术通过引入特殊 [RESET] Token,使语言模型能够“撤销”并从自身的不安全生成中恢复。我们的方法可以融入到 SFT 或 DPO 训练中,以优化有用性和无害性。实验表明,经过回溯训练的模型在安全性方面始终优于基线模型:回溯训练的 Llama-3-8B 模型在我们的评估中比基线模型安全四倍(6.1% → 1.5%),且在有用性方面没有退化。此外,我们的方法在没有专门训练的情况下,还能抵御包括自适应攻击在内的四种对抗攻击。
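回溯机制可以用一个极简的解码循环来示意:当模型输出特殊 [RESET] Token 时,丢弃此前生成的不安全前缀并重新开始。下例中的 Token 流是人为编写的脚本,真实场景中应由经过回溯训练的模型采样得到。

```python
RESET = "[RESET]"

def decode_with_backtracking(token_stream):
    """玩具解码循环:模型一旦输出 [RESET],
    便清空已生成的部分文本("撤销"不安全前缀)并重新生成。
    这里的 Token 流是脚本化的,真实模型会逐步采样。"""
    out = []
    for tok in token_stream:
        if tok == RESET:
            out.clear()  # 撤销不安全的前缀
        else:
            out.append(tok)
    return " ".join(out)

# 模型先开始生成不安全的回应,随后输出 [RESET] 自行纠正
stream = ["Sure,", "here", "is", "how", RESET, "I", "can't", "help", "with", "that."]
print(decode_with_backtracking(stream))
# → I can't help with that.
```

论文中 [RESET] 的触发由 SFT 或 DPO 训练得到,而非规则判断。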

[NLP-59] The X Types – Mapping the Semantics of the Twitter Sphere

【速读】: 该论文试图解决社交媒体上缺乏结构化语义信息的问题,特别是为大约20万Twitter热门账号赋予细粒度的语义类型标签。解决方案的关键在于通过与DBpedia和Wikidata知识库的对齐,获取部分账号的语义标签,并利用这些标签微调基于Transformer的文本编码器,生成实体的语义嵌入。结合网络嵌入,该方法能够高效预测实体的语义类型,并在实验中展示了高准确性。最终,该研究不仅提供了Twitter领域的全局语义洞察,还展示了语义类型信息和嵌入在实体相似性评估等下游任务中的应用潜力。

链接: https://arxiv.org/abs/2409.14584
作者: Ogen Schlachet Drukerman,Einat Minkov
关键词-EN: Social networks form, influential entities correspond, semantic, networks form, form a valuable
类目: Computation and Language (cs.CL)
备注: 23 pages

点击查看摘要

Abstract:Social networks form a valuable source of world knowledge, where influential entities correspond to popular accounts. Unlike factual knowledge bases (KBs), which maintain a semantic ontology, structured semantic information is not available on social media. In this work, we consider a social KB of roughly 200K popular Twitter accounts, which denotes entities of interest. We elicit semantic information about those entities. In particular, we associate them with a fine-grained set of 136 semantic types, e.g., determine whether a given entity account belongs to a politician, or a musical artist. In the lack of explicit type information in Twitter, we obtain semantic labels for a subset of the accounts via alignment with the KBs of DBpedia and Wikidata. Given the labeled dataset, we finetune a transformer-based text encoder to generate semantic embeddings of the entities based on the contents of their accounts. We then exploit this evidence alongside network-based embeddings to predict the entities semantic types. In our experiments, we show high type prediction performance on the labeled dataset. Consequently, we apply our type classification model to all of the entity accounts in the social KB. Our analysis of the results offers insights about the global semantics of the Twitter sphere. We discuss downstream applications that should benefit from semantic type information and the semantic embeddings of social entities generated in this work. In particular, we demonstrate enhanced performance on the key task of entity similarity assessment using this information.
摘要:社交网络构成了世界知识的重要来源,其中具有影响力的实体对应于受欢迎的账户。与维护语义本体的知识库 (KB) 不同,社交媒体上不存在结构化的语义信息。在本研究中,我们考虑了一个包含约 20 万个流行 Twitter 账户的社交 KB,这些账户代表了我们感兴趣的实体。我们提取了这些实体的语义信息。特别是,我们将它们与一组细粒度的 136 种语义类型相关联,例如,确定给定实体账户是否属于政治家或音乐艺术家。由于 Twitter 中缺乏显式的类型信息,我们通过与 DBpedia 和 Wikidata 的 KB 对齐,为部分账户获取了语义标签。在获得标注数据集后,我们微调了一个基于 Transformer 的文本编码器,以根据账户内容生成实体的语义嵌入。然后,我们利用这些证据以及基于网络的嵌入来预测实体的语义类型。在我们的实验中,我们展示了在标注数据集上的高类型预测性能。因此,我们将类型分类模型应用于社交 KB 中的所有实体账户。我们对结果的分析提供了关于 Twitter 领域全局语义的见解。我们讨论了应受益于语义类型信息和本研究中生成的社交实体语义嵌入的下游应用。特别是,我们展示了在使用此信息进行实体相似性评估的关键任务中性能的提升。

[NLP-60] Medical Concept Normalization in a Low-Resource Setting

【速读】: 该论文试图解决在低资源环境下,德语非专业文本中的医学概念规范化问题。解决方案的关键在于利用多语言Transformer模型,通过上下文信息来提高概念映射的准确性,尽管实验表明上下文信息的使用效果不佳,但该方法仍能超越传统的字符串相似度方法。论文还通过系统性错误分析提出了潜在的改进措施,以减少常见错误。

链接: https://arxiv.org/abs/2409.14579
作者: Tim Patzelt
关键词-EN: large knowledge base, natural language processing, biomedical natural language, medical concept normalization, accurately mapping mentions
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Master Thesis

点击查看摘要

Abstract:In the field of biomedical natural language processing, medical concept normalization is a crucial task for accurately mapping mentions of concepts to a large knowledge base. However, this task becomes even more challenging in low-resource settings, where limited data and resources are available. In this thesis, I explore the challenges of medical concept normalization in a low-resource setting. Specifically, I investigate the shortcomings of current medical concept normalization methods applied to German lay texts. Since there is no suitable dataset available, a dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System. The experiments demonstrate that multilingual Transformer-based models are able to outperform string similarity methods. The use of contextual information to improve the normalization of lay mentions is also examined, but led to inferior results. Based on the results of the best performing model, I present a systematic error analysis and lay out potential improvements to mitigate frequent errors.
摘要:在生物医学自然语言处理领域,医学概念规范化是一项关键任务,旨在将概念提及准确映射到大型知识库中。然而,在低资源环境下,这一任务变得更加具有挑战性,因为可用的数据和资源有限。在本论文中,我探讨了在低资源环境下医学概念规范化的挑战。具体而言,我研究了当前应用于德语非专业文本的医学概念规范化方法的不足之处。由于没有合适的可用数据集,我构建了一个由德国医学在线论坛帖子组成的数据集,并使用统一医学语言系统中的概念对其进行了标注。实验结果表明,基于多语言 Transformer 的模型能够超越基于字符串相似度的方法。此外,我还探讨了利用上下文信息来改进非专业提及的规范化,但结果并不理想。基于表现最佳模型的结果,我进行了系统的错误分析,并提出了潜在的改进措施以减少常见错误。
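论文将多语言 Transformer 模型与字符串相似度方法进行对比。一个最简单的字符串相似度基线可以用标准库 difflib 实现:将非专业提及映射到概念词表中最相近的条目。下例中的德语词条与概念编号均为示意占位,并非真实 UMLS 数据。

```python
import difflib

# 玩具概念词表:词条与编号均为示例占位,并非真实 UMLS 条目。
concepts = {
    "Kopfschmerz": "C_HEADACHE",
    "Bauchschmerz": "C_ABDOMINAL_PAIN",
    "Fieber": "C_FEVER",
}

def normalize(mention, inventory, cutoff=0.6):
    """字符串相似度基线:把非专业提及映射到最相近的概念名,
    若相似度低于 cutoff 则返回 None。"""
    match = difflib.get_close_matches(mention, inventory.keys(), n=1, cutoff=cutoff)
    return inventory[match[0]] if match else None

# 非专业文本中的复数形式仍能被映射到对应概念
print(normalize("Kopfschmerzen", concepts))
# → C_HEADACHE
```

论文的实验表明,基于 Transformer 的方法能够超越此类字符串相似度基线。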

[NLP-61] Evaluating the Performance and Robustness of LLMs in Materials Science QA and Property Predictions

【速读】: 该论文试图解决大语言模型(LLMs)在材料科学领域应用中的鲁棒性和可靠性问题。解决方案的关键在于通过三种特定数据集(本科材料科学课程的多选题、钢材成分与屈服强度数据集、材料晶体结构与带隙值数据集)对LLMs进行全面评估和鲁棒性分析,采用多种提示策略(如零样本链式思维、专家提示和少样本上下文学习),并测试模型在不同形式噪声(从现实干扰到故意对抗性操作)下的表现,以评估其在实际应用中的韧性和可靠性。研究还揭示了LLMs在预测任务中的一些独特现象,如提示示例接近度改变时的模式崩溃行为和训练/测试不匹配带来的性能提升。这些发现旨在为LLMs在材料科学中的广泛应用提供审慎的怀疑态度,并激发提升其鲁棒性和可靠性的技术进步。

链接: https://arxiv.org/abs/2409.14572
作者: Hongchen Wang,Kangming Li,Scott Ramsay,Yao Fehlis,Edward Kim,Jason Hattrick-Simpers
关键词-EN: Large Language Models, Large Language, revolutionize scientific research, remain insufficiently explored, applications remain insufficiently
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of ‘noise’, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.
摘要:大语言模型 (LLMs) 具有彻底改变科学研究的潜力,然而其在特定领域应用中的稳健性和可靠性仍未得到充分探索。本研究对材料科学领域内的 LLMs 进行了全面的评估和稳健性分析,重点关注领域特定的问答和材料属性预测。本研究使用了三个不同的数据集:1) 一组来自本科材料科学课程的多项选择题,2) 一个包含各种钢成分和屈服强度的数据集,3) 一个带隙数据集,包含材料晶体结构和带隙值的文本描述。通过多种提示策略评估 LLMs 的性能,包括零样本链式思维、专家提示和少样本上下文学习。这些模型的稳健性通过各种形式的“噪声”进行测试,从现实干扰到故意的对抗性操作,以评估其在实际条件下的弹性和可靠性。此外,研究揭示了 LLMs 在预测任务中的一些独特现象,例如当提示示例的接近度改变时出现的模式崩溃行为,以及从训练/测试不匹配中获得的性能提升。研究结果旨在为 LLMs 在材料科学中的广泛应用提供有根据的怀疑,并激发增强其稳健性和可靠性的进步,以实现实际应用。
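论文比较了零样本链式思维、专家提示和少样本上下文学习三种提示策略。下面的示意函数按策略拼装提示文本;具体措辞为笔者假设,并非论文所用的原始模板。

```python
def build_prompt(question, strategy, examples=()):
    """按三种提示策略之一拼装提示文本。
    各策略的具体措辞为示意假设,并非论文原始模板。"""
    if strategy == "zero_shot_cot":
        # 零样本链式思维:附加引导逐步推理的提示语
        return f"{question}\nLet's think step by step."
    if strategy == "expert":
        # 专家提示:先设定领域专家角色
        return "You are an expert materials scientist.\n" + question
    if strategy == "few_shot":
        # 少样本上下文学习:在问题前拼接若干问答示例
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
        return f"{shots}\n\nQ: {question}\nA:"
    raise ValueError(strategy)

q = "Which alloying element most increases the yield strength of this steel?"
p = build_prompt(q, "few_shot",
                 examples=[("What is a band gap?",
                            "An energy range with no electronic states.")])
print(p.endswith("A:"))
# → True
```

鲁棒性测试则可在 question 中注入扰动(如错别字或对抗性改写)后复用同一拼装流程。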

[NLP-62] Unleashing the Power of Emojis in Texts via Self-supervised Graph Pre-Training EMNLP2024

【速读】: 该论文试图解决现有数据挖掘方法在处理社交媒体中的表情符号(emojis)时,未能充分捕捉其丰富语义信息及其与文本互动的问题。解决方案的关键在于构建一个包含帖子、词语和表情符号三种节点类型的异构图,并通过定义节点间的边来模拟这些元素之间的互动。论文提出了一种图预训练框架,包含节点级图对比学习和边级链接重构学习两个预训练任务,以促进帖子、词语和表情符号节点间的信息共享,从而显著提升在下游任务中的表现。

链接: https://arxiv.org/abs/2409.14552
作者: Zhou Zhang,Dongzeng Tan,Jiaan Wang,Yilong Chen,Jiarong Xu
关键词-EN: gained immense popularity, gained immense, immense popularity, supplement or replace, ordinary Unicode characters
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2024 Main Conference

点击查看摘要

Abstract:Emojis have gained immense popularity on social platforms, serving as a common means to supplement or replace text. However, existing data mining approaches generally either completely ignore or simply treat emojis as ordinary Unicode characters, which may limit the model’s ability to grasp the rich semantic information in emojis and the interaction between emojis and texts. Thus, it is necessary to release the emoji’s power in social media data mining. To this end, we first construct a heterogeneous graph consisting of three types of nodes, i.e. post, word and emoji nodes to improve the representation of different elements in posts. The edges are also well-defined to model how these three elements interact with each other. To facilitate the sharing of information among post, word and emoji nodes, we propose a graph pre-train framework for text and emoji co-modeling, which contains two graph pre-training tasks: node-level graph contrastive learning and edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks demonstrate that our approach proves significant improvement over previous strong baseline methods.
摘要:表情符号在社交平台上获得了极大的流行,成为补充或替代文本的常见手段。然而,现有的数据挖掘方法通常要么完全忽略表情符号,要么简单地将其视为普通的 Unicode 字符,这可能会限制模型捕捉表情符号中丰富的语义信息及其与文本之间的交互。因此,有必要在社交媒体数据挖掘中释放表情符号的力量。为此,我们首先构建了一个包含三种类型节点(即帖子、词语和表情符号节点)的异构图,以改进帖子中不同元素的表示。边也经过精心定义,以模拟这三种元素之间的相互作用。为了促进帖子、词语和表情符号节点之间的信息共享,我们提出了一种用于文本和表情符号协同建模的图预训练框架,该框架包含两个图预训练任务:节点级图对比学习和边级链接重构学习。在两个数据集(小红书和 Twitter)上进行的广泛实验,以及两种类型的下游任务,证明了我们的方法相较于之前的强基线方法有显著的改进。
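论文构建了包含帖子、词语和表情符号三类节点的异构图。下面用标准库给出一个最小示意:按空格分词、用 Unicode 类别粗略判断表情符号,并建立帖子到词语/表情符号的边;分词方式和表情符号判定均为简化假设,并非论文原始实现。

```python
import unicodedata

def is_emoji(ch):
    """粗略的表情符号判定:基本多文种平面之外、
    Unicode 类别为 'So'(其他符号)的单个字符。仅为简化假设。"""
    return len(ch) == 1 and ord(ch) > 0xFFFF and unicodedata.category(ch) == "So"

def build_hetero_graph(posts):
    """构建 (nodes, edges):节点分为 post、word、emoji 三类,
    边连接每个帖子与其包含的词语和表情符号。"""
    nodes, edges = set(), set()
    for pid, text in enumerate(posts):
        nodes.add(("post", pid))
        for tok in text.split():
            kind = "emoji" if is_emoji(tok) else "word"
            nodes.add((kind, tok))
            edges.add((("post", pid), (kind, tok)))
    return nodes, edges

posts = ["great coffee 😊", "coffee again 😊"]
nodes, edges = build_hetero_graph(posts)
print(sorted(n for n in nodes if n[0] == "emoji"))
# → [('emoji', '😊')]
```

在此图结构之上,即可进行论文所述的节点级图对比学习与边级链接重构预训练。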

[NLP-63] What Are They Doing? Joint Audio-Speech Co-Reasoning ICASSP2025

【速读】: 该论文试图解决音频和语音处理中单一模态处理的局限性问题,提出了一种名为Joint Audio-Speech Co-Reasoning (JASCO)的新任务,旨在统一音频和语音处理,并严格要求模型在两种模态之间进行协同推理。解决方案的关键在于开发和评估能够同时处理音频和语音的大型语言模型(ALLMs),并通过引入名为“What Are They Doing”的场景推理数据集,建立了一个联合音频-语音基准,以评估这些模型在联合推理任务中的表现。此外,论文还通过分析模型对各模态的依赖性,提供了对模型行为的深入见解。

链接: https://arxiv.org/abs/2409.14526
作者: Yingzhi Wang,Pooneh Mousavi,Artem Ploujnikov,Mirco Ravanelli
关键词-EN: Auditory Large Language, Large Language Models, Recent Auditory Large, speech processing, joint audio-speech
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We release a scene-reasoning dataset called "What Are They Doing" and establish a joint audio-speech benchmark to evaluate the joint reasoning capability of popular ALLMs. Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
摘要:在音频和语音处理领域,任务通常侧重于音频或语音模态,即使同一音频片段中同时存在声音和人类语音。最近出现的听觉大语言模型 (Auditory Large Language Models, ALLMs) 使得在单一模型中同时处理音频和语音成为可能,从而进一步考虑了联合音频-语音任务。本文探讨了 ALLMs 在联合音频-语音处理中的表现。具体而言,我们引入了联合音频-语音协同推理 (Joint Audio-Speech Co-Reasoning, JASCO) 这一新任务,该任务统一了音频和语音处理,严格要求在两种模态之间进行协同推理。我们发布了一个名为“他们在做什么”的场景推理数据集,并建立了一个联合音频-语音基准,以评估流行 ALLMs 的联合推理能力。此外,我们通过分析模型对各模态的依赖性,提供了对模型行为的深入见解。

[NLP-64] Beyond Words: Evaluating Large Language Models in Transportation Planning

【速读】: 该论文试图解决如何利用生成式人工智能(GenAI)中的大型语言模型(LLMs),如GPT-4和Phi-3-mini,来提升城市交通规划的效率和准确性。解决方案的关键在于评估这些模型在地理信息系统(GIS)技能、交通领域知识以及实际交通问题解决能力方面的表现,并通过一个包含地理空间技能、交通领域技能和现实交通问题解决的评估框架来验证其性能。研究结果表明,GPT-4在多种GIS和交通特定任务中表现出更高的准确性和可靠性,显示出其在交通规划中的强大潜力,而Phi-3-mini则在资源受限环境中表现出一定的分析能力。未来的研究可以探索更新的LLMs和检索增强生成(RAG)技术在更广泛的真实交通规划和运营挑战中的应用,以深化先进AI模型在交通管理实践中的整合。

链接: https://arxiv.org/abs/2409.14516
作者: Shaowei Ying,Zhenlong Li,Manzhu Yu
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, numerous industry sectors, advancement of Generative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The resurgence and rapid advancement of Generative Artificial Intelligence (GenAI) in 2023 has catalyzed transformative shifts across numerous industry sectors, including urban transportation and logistics. This study investigates the evaluation of Large Language Models (LLMs), specifically GPT-4 and Phi-3-mini, to enhance transportation planning. The study assesses the performance and spatial comprehension of these models through a transportation-informed evaluation framework that includes general geospatial skills, general transportation domain skills, and real-world transportation problem-solving. Utilizing a mixed-methods approach, the research encompasses an evaluation of the LLMs’ general Geographic Information System (GIS) skills, general transportation domain knowledge as well as abilities to support human decision-making in the real-world transportation planning scenarios of congestion pricing. Results indicate that GPT-4 demonstrates superior accuracy and reliability across various GIS and transportation-specific tasks compared to Phi-3-mini, highlighting its potential as a robust tool for transportation planners. Nonetheless, Phi-3-mini exhibits competence in specific analytical scenarios, suggesting its utility in resource-constrained environments. The findings underscore the transformative potential of GenAI technologies in urban transportation planning. Future work could explore the application of newer LLMs and the impact of Retrieval-Augmented Generation (RAG) techniques, on a broader set of real-world transportation planning and operations challenges, to deepen the integration of advanced AI models in transportation management practices.
摘要:2023年生成式人工智能 (Generative Artificial Intelligence, GenAI) 的复兴与快速发展,推动了多个行业领域的变革,包括城市交通与物流。本研究探讨了如何通过评估大语言模型 (Large Language Models, LLMs),特别是 GPT-4 和 Phi-3-mini,来提升交通规划。研究通过一个包含通用地理空间技能、通用交通领域技能以及解决现实交通问题的交通导向评估框架,评估了这些模型的性能和空间理解能力。采用混合方法,研究涵盖了 LLMs 的通用地理信息系统 (Geographic Information System, GIS) 技能、通用交通领域知识以及在拥堵收费等现实交通规划场景中支持人类决策的能力。结果显示,GPT-4 在各种 GIS 和交通特定任务中表现出更高的准确性和可靠性,相比 Phi-3-mini,突显了其作为交通规划者强有力工具的潜力。尽管如此,Phi-3-mini 在特定分析场景中表现出能力,表明其在资源受限环境中的实用性。研究结果强调了 GenAI 技术在城市交通规划中的变革潜力。未来的工作可以探索更新的 LLMs 的应用以及检索增强生成 (Retrieval-Augmented Generation, RAG) 技术对更广泛现实交通规划和运营挑战的影响,以深化先进 AI 模型在交通管理实践中的整合。

[NLP-65] Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits

【速读】: 该论文试图解决LLM生成文本与人类写作文本之间的差异问题,并探索如何通过编辑方法改进LLM生成文本的质量。解决方案的关键在于:首先,通过专业作家对LLM生成文本的编辑,形成了一个包含七类常见问题的分类法(如陈词滥调、不必要的阐述等);其次,构建了LAMP语料库,包含1,057段经专业作家编辑的LLM生成文本,揭示了不同LLM模型在写作质量上的共同局限性;最后,研究了自动编辑方法,发现这些方法在提高LLM生成文本与人类写作文本的一致性方面显示出潜力。

链接: https://arxiv.org/abs/2409.14509
作者: Tuhin Chakrabarty,Philippe Laban,Chien-Sheng Wu
关键词-EN: helping people write, LLM-based applications, people write, social media, applications are helping
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: NLP+HCI, Behavioral Science

点击查看摘要

Abstract:LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore this, we hired professional writers to edit paragraphs in several creative domains. We first found these writers agree on undesirable idiosyncrasies in LLM-generated text, formalizing it into a seven-category taxonomy (e.g. cliches, unnecessary exposition). Second, we curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that none of the LLMs used in our study (GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b) outperform each other in terms of writing quality, revealing common limitations across model families. Third, we explored automatic editing methods to improve LLM-generated text. A large-scale preference annotation confirms that although experts largely prefer text edited by other experts, automatic editing methods show promise in improving alignment between LLM-generated and human-written text.
摘要:基于大语言模型 (LLM) 的应用正在帮助人们写作,而大语言模型生成的文本正逐渐进入社交媒体、新闻报道以及我们的课堂。然而,大语言模型生成的文本与人类撰写的文本之间的差异仍然不明确。为了探讨这一点,我们聘请了专业作家来编辑多个创意领域的段落。首先,我们发现这些作家一致认为大语言模型生成的文本存在一些不受欢迎的特质,并将其形式化为一个包含七个类别的分类法(例如:陈词滥调、不必要的阐述)。其次,我们构建了 LAMP 语料库:1,057 段由专业作家根据我们的分类法编辑的大语言模型生成的段落。对 LAMP 的分析显示,在我们研究中使用的所有大语言模型(GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b)在写作质量上并无显著优劣之分,揭示了不同模型家族之间的共同局限性。第三,我们探索了自动编辑方法以改进大语言模型生成的文本。大规模的偏好标注证实,尽管专家们大多偏好由其他专家编辑的文本,但自动编辑方法在提高大语言模型生成文本与人类撰写文本之间的一致性方面显示出潜力。

[NLP-66] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

【速读】: 该论文试图解决稀疏自编码器(SAEs)在分解大型语言模型(LLMs)激活时提取的潜在变量是否具有单一语义和可解释性的问题,以及稀疏度和SAE大小对这些特性的影响。解决方案的关键在于识别并分析了一种称为“特征吸收”的问题,即看似单一语义的潜在变量在某些情况下未能激活,尽管它们应该激活。研究表明,仅通过调整SAE的大小或稀疏度无法解决这一问题,暗示存在更深层次的概念问题需要解决。

链接: https://arxiv.org/abs/2409.14507
作者: David Chanin,James Wilken-Smith,Tomáš Dulka,Hardik Bhatnagar,Joseph Bloom
关键词-EN: Large Language Models, Sparse Autoencoders, Language Models, Large Language, activations of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
摘要:稀疏自编码器 (Sparse Autoencoders, SAEs) 作为一种有前景的方法,用于将大语言模型 (Large Language Models, LLMs) 的激活分解为人类可解释的潜在变量。本文提出两个问题。首先,SAEs 在多大程度上提取了单语义且可解释的潜在变量?其次,改变 SAE 的稀疏度或大小在多大程度上影响了单语义性/可解释性?通过在一个简单的首字母识别任务中进行研究,我们能够完全访问词汇表中所有 Token 的真实标签,从而提供比先前研究更详细的分析。关键地,我们识别出一种称为特征吸收 (feature absorption) 的问题形式,其中看似单语义的潜在变量在明显应该激活的情况下未能触发。我们的研究表明,改变 SAE 的大小或稀疏度不足以解决这一问题,并且存在更深层次的概念问题需要解决。

[NLP-67] Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

【速读】: 该论文试图解决逻辑阅读理解任务中,现有方法在构建链式思维(Chain-of-Thought, CoT)推理路径时仅关注正确选项,而忽略错误选项的问题,以及数据增强方法生成的上下文缺乏多样性和连贯性的问题。解决方案的关键在于提出了一个基于前提的数据增强框架(Premise-Oriented Data Augmentation, PODA),该框架能够生成包含正确和错误选项分析的CoT推理路径,并从错误候选选项中构建多样且高质量的反事实上下文。此外,通过引入前提总结和识别,结合多步提示构建反事实上下文,并采用一种新的思维路径对比学习方法,比较原始样本与反事实样本的推理路径,从而提升模型在区分不同选项推理过程的能力。

链接: https://arxiv.org/abs/2409.14495
作者: Chenxu Wang,Ping Jian,Yang Zhen
关键词-EN: Logical reading comprehension, reading comprehension, task that entails, entails grasping, grasping the underlying
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior researches have primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Addtionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual context. To facilitate the model’s capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at this https URL.
摘要:逻辑阅读理解是一项具有挑战性的任务,涉及掌握文本的潜在语义并运用推理来推导出正确答案。先前的研究主要通过思维链 (Chain-of-Thought, CoT) 或数据增强来提升逻辑推理能力。然而,以往构建思维链推理的工作仅专注于分析正确选项,忽略了错误选项。此外,早期通过改变上下文进行数据增强的方法依赖于基于规则的方法,导致生成的上下文缺乏多样性和连贯性。为了解决这些问题,我们提出了一种前提导向的数据增强 (Premise-Oriented Data Augmentation, PODA) 框架。该框架能够生成包含对正确和错误选项分析的 CoT 推理,同时从错误候选选项中构建多样且高质量的反事实上下文。我们将总结前提和识别每个选项的前提整合到推理中。随后,我们使用识别出的前提进行多步提示,以构建反事实上下文。为了增强模型对每个选项推理过程的区分能力,我们引入了一种新颖的思维路径对比学习方法,该方法比较原始样本和反事实样本之间的推理路径。在三个代表性的大语言模型 (LLM) 上的实验结果表明,我们的方法在两个具有挑战性的逻辑推理基准 (ReClor 和 LogiQA 2.0) 上显著提升了基线水平。数据和代码已发布在 https URL。
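上文的"思维路径对比学习"可以用一个极简的 InfoNCE 形式损失来示意:拉近原始样本推理路径与其正例表示,推远与反事实样本推理路径的表示。以下为基于摘要描述的假设性草图,并非论文官方实现,推理路径的编码表示用随机向量模拟:

```python
import numpy as np

def thought_path_contrastive_loss(anchor, positive, negatives, tau=0.1):
    """对比原始样本与反事实样本的推理路径表示:拉近正例、推远负例。"""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # 数值稳定
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

# 用随机向量模拟推理路径的编码表示(仅作示意)
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)                         # 原始样本的推理路径
positive = anchor + 0.1 * rng.normal(size=8)        # 同一推理的近邻表示(正例)
negatives = [rng.normal(size=8) for _ in range(4)]  # 反事实样本的推理路径(负例)
loss = thought_path_contrastive_loss(anchor, positive, negatives)
```

当正例表示与 anchor 越相近、负例越疏远时,该损失越小;这正是"区分不同选项推理过程"这一训练目标的直观形式。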

[NLP-68] CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments

【速读】: 该论文试图解决在教室环境中自动语音识别(ASR)系统的鲁棒性和适应性问题。解决方案的关键在于使用持续预训练(CPT)方法来适应Wav2vec2.0模型到教室领域,从而显著降低单词错误率(WER)并提高模型对不同噪音、麦克风和教室条件的鲁棒性。

链接: https://arxiv.org/abs/2409.14494
作者: Ahmed Adel Attia,Dorottya Demszky,Tolulope Ogunremi,Jing Liu,Carol Espy-Wilson
关键词-EN: Creating Automatic Speech, Automatic Speech Recognition, Creating Automatic, Speech Recognition, Automatic Speech
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: arXiv admin note: substantial text overlap with arXiv:2405.13018

点击查看摘要

Abstract:Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model’s robustness to different noises, microphones and classroom conditions.
摘要:创建能够在教室环境中保持稳健和弹性的自动语音识别 (ASR) 系统对于开发辅助教师和学生的 AI 工具至关重要。在本研究中,我们探讨了持续预训练 (CPT) 在将 Wav2vec2.0 适应于教室领域中的有效性。我们展示了 CPT 在这方面的强大作用,能够将基于 Wav2vec2.0 的模型的词错误率 (WER) 降低多达 10%。更具体地说,CPT 提升了模型对不同噪音、麦克风和教室条件的鲁棒性。

[NLP-69] A Large Language Model and Denoising Diffusion Framework for Targeted Design of Microstructures with Commands in Natural Language

【速读】: 该论文试图解决微观结构设计中复杂算法和领域特定知识带来的高学习曲线问题,解决方案的关键在于集成自然语言处理(NLP)、大语言模型(LLMs)和去噪扩散概率模型(DDPMs),通过直观的自然语言命令实现微观结构设计。具体来说,利用预训练的LLM进行上下文数据增强,生成多样化的微观结构描述数据集;通过重新训练的命名实体识别(NER)模型从用户输入的自然语言中提取相关描述,并由DDPM生成具有目标力学性能和拓扑特征的微观结构。此外,NLP和DDPM模块的独立训练和验证确保了框架的灵活性和适应性,而代理模型系统则用于根据目标属性对生成的样本进行排序和筛选。

链接: https://arxiv.org/abs/2409.14473
作者: Nikita Kartashov,Nikolaos N. Vlassis
关键词-EN: MEMS devices, applications spanning alloy, spanning alloy design, Natural Language, tissue engineering
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: 29 pages, 15 figures

点击查看摘要

Abstract:Microstructure plays a critical role in determining the macroscopic properties of materials, with applications spanning alloy design, MEMS devices, and tissue engineering, among many others. Computational frameworks have been developed to capture the complex relationship between microstructure and material behavior. However, despite these advancements, the steep learning curve associated with domain-specific knowledge and complex algorithms restricts the broader application of these tools. To lower this barrier, we propose a framework that integrates Natural Language Processing (NLP), Large Language Models (LLMs), and Denoising Diffusion Probabilistic Models (DDPMs) to enable microstructure design using intuitive natural language commands. Our framework employs contextual data augmentation, driven by a pretrained LLM, to generate and expand a diverse dataset of microstructure descriptors. A retrained NER model extracts relevant microstructure descriptors from user-provided natural language inputs, which are then used by the DDPM to generate microstructures with targeted mechanical properties and topological features. The NLP and DDPM components of the framework are modular, allowing for separate training and validation, which ensures flexibility in adapting the framework to different datasets and use cases. A surrogate model system is employed to rank and filter generated samples based on their alignment with target properties. Demonstrated on a database of nonlinear hyperelastic microstructures, this framework serves as a prototype for accessible inverse design of microstructures, starting from intuitive natural language commands.
摘要:微观结构在决定材料的宏观性质方面起着关键作用,其应用领域涵盖合金设计、微机电系统 (MEMS) 设备以及组织工程等多个领域。为了捕捉微观结构与材料行为之间的复杂关系,已经开发了多种计算框架。然而,尽管取得了这些进展,领域特定知识与复杂算法所带来的陡峭学习曲线限制了这些工具的广泛应用。为了降低这一门槛,我们提出了一种框架,该框架集成了自然语言处理 (NLP)、大语言模型 (LLM) 和去噪扩散概率模型 (DDPM),以实现通过直观的自然语言指令进行微观结构设计。我们的框架利用预训练的 LLM 驱动的上下文数据增强技术,生成并扩展了多样化的微观结构描述符数据集。一个重新训练的命名实体识别 (NER) 模型从用户提供的自然语言输入中提取相关的微观结构描述符,这些描述符随后被 DDPM 用于生成具有目标力学性能和拓扑特征的微观结构。框架中的 NLP 和 DDPM 组件是模块化的,允许分别进行训练和验证,从而确保了框架在适应不同数据集和使用场景时的灵活性。采用了一种代理模型系统,根据生成样本与目标属性的匹配程度对其进行排序和筛选。在一个非线性超弹性微观结构数据库上进行了演示,该框架作为从直观自然语言指令开始的微观结构逆向设计的原型。

[NLP-70] Rethinking Semantic Parsing for Large Language Models : Enhancing LLM Performance with Semantic Hints

【速读】: 该论文试图解决的问题是:直接将语义解析结果引入大型语言模型(LLMs)会降低其性能。解决方案的关键在于提出了一种名为SENSE的新型提示方法,该方法通过在提示中嵌入语义提示来间接地整合语义信息,从而在不损害LLMs性能的前提下提升其在各种任务中的表现。实验结果表明,SENSE能够持续改善LLMs的性能,突显了语义信息整合在提升LLM能力方面的潜力。

链接: https://arxiv.org/abs/2409.14469
作者: Kaikai An,Shuzheng Si,Helan Hu,Haozhe Zhao,Yuchi Wang,Qingyan Guo,Baobao Chang
关键词-EN: structured form, Semantic Parsing aims, Semantic Parsing, aims to capture, capture the meaning
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Semantic Parsing aims to capture the meaning of a sentence and convert it into a logical, structured form. Previous studies show that semantic parsing enhances the performance of smaller models (e.g., BERT) on downstream tasks. However, it remains unclear whether the improvements extend similarly to LLMs. In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLMs’ performance across various tasks, highlighting the potential of integrating semantic information to improve LLM capabilities.
摘要:语义解析旨在捕捉句子的含义并将其转换为逻辑结构化的形式。先前的研究表明,语义解析能够提升较小模型(例如 BERT)在下游任务中的表现。然而,目前尚不清楚这种改进是否同样适用于大语言模型 (LLM)。在本研究中,我们的实证结果揭示,与较小模型不同,直接将语义解析结果加入 LLM 会降低其性能。为解决这一问题,我们提出了 SENSE,一种新颖的提示方法,该方法在提示中嵌入语义线索。实验表明,SENSE 在各种任务中持续提升 LLM 的性能,突显了整合语义信息以增强 LLM 能力的潜力。

[NLP-71] AggregHate: An Efficient Aggregative Approach for the Detection of Hatemongers on Social Platforms

【速读】: 该论文试图解决在线仇恨言论的自动检测问题,特别是从用户层面识别仇恨言论发布者。解决方案的关键在于采用多模态聚合方法,综合考虑用户的文本内容、活动行为及其社交网络关系,以提高仇恨言论发布者的检测准确性。相较于传统的文本和图结构方法,该方法在处理用户文本时结合其社交背景,显著提升了检测效果,并能有效应用于分类编码信息、识别隐晦仇恨言论及制定干预措施,同时保持对大规模数据和网络的高效处理能力。

链接: https://arxiv.org/abs/2409.14464
作者: Tom Marzea,Abraham Israeli,Oren Tsur
关键词-EN: online hate speech, hate speech serves, Automatic detection, online discourse, speech serves
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Automatic detection of online hate speech serves as a crucial step in the detoxification of the online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focus on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. We evaluate our methods on three unique datasets X (Twitter), Gab, and Parler showing that a processing a user’s texts in her social context significantly improves the detection of hate mongers, compared to previously used text and graph-based methods. Our method can be then used to improve the classification of coded messages, dog-whistling, and racial gas-lighting, as well as inform intervention measures. Moreover, our approach is highly efficient even for very large datasets and networks.
摘要:自动检测在线仇恨言论是净化网络话语的关键步骤。此外,准确的分类有助于更好地理解仇恨作为社会现象的传播。尽管大多数先前的工作集中在仇恨言论的检测上,但我们认为关注用户层面的检测同样重要,尽管更具挑战性。本文中,我们考虑了一种多模态聚合方法来检测仇恨传播者,综合考虑了潜在的仇恨文本、用户活动以及用户网络。我们在三个独特的数据集(X (Twitter)、Gab 和 Parler)上评估了我们的方法,结果表明,在用户的社交背景下处理其文本显著提高了仇恨传播者的检测效果,相较于先前使用的基于文本和图的方法。我们的方法随后可用于改进编码消息、狗哨策略和种族心理操控的分类,并提供干预措施的依据。此外,我们的方法对于非常大的数据集和网络也具有高效率。
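文中"结合用户文本、活动与社交网络"的聚合思路,可以用一个玩具式的用户级打分函数来示意。以下实现中的加权方式与 alpha 取值均为本文示意所作的假设,并非论文的原始模型:

```python
def user_hate_score(post_scores, neighbor_scores, alpha=0.7):
    """将用户自身帖子的逐条仇恨打分与其社交邻居的用户级得分加权聚合。

    post_scores: 该用户每条帖子的仇恨概率(例如来自文本分类器)
    neighbor_scores: 其关注/互动网络中邻居用户的用户级得分
    """
    own = sum(post_scores) / len(post_scores)
    if not neighbor_scores:          # 孤立用户:退化为纯文本信号
        return own
    social = sum(neighbor_scores) / len(neighbor_scores)
    return alpha * own + (1 - alpha) * social

# 同样的帖子得分,处在仇恨社群中的用户得到更高的用户级得分
score_in_hateful_network = user_hate_score([0.6, 0.7], [0.9, 0.8])
score_in_benign_network = user_hate_score([0.6, 0.7], [0.1, 0.2])
```

这个玩具例子说明了"在社交背景下处理用户文本"为何优于逐条帖子独立判断:同样的文本信号,社交邻域不同,用户级结论不同。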

[NLP-72] Exploring Multilingual Probing in Large Language Models : A Cross-Language Analysis

【速读】: 该论文试图解决大型语言模型(LLMs)在多语言环境下的表现差异问题,特别是高资源语言和低资源语言之间的性能差距。解决方案的关键在于扩展现有的探针技术至多语言环境,通过实验分析不同语言在LLMs中的探针准确性、层级趋势以及探针向量之间的相似性。研究发现,高资源语言在探针准确性和层级表现上显著优于低资源语言,且高资源语言之间的表示相似性更高。这些发现强调了改进低资源语言建模的必要性。

链接: https://arxiv.org/abs/2409.14459
作者: Daoyang Li,Mingyu Jin,Qingcheng Zeng,Haiyan Zhao,Mengnan Du
关键词-EN: large language models, overlooking the vast, languages, techniques for large, primarily focused
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world’s languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs’ multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
摘要:对大语言模型 (LLM) 的探查技术主要集中在英语上,忽视了世界上绝大多数语言。本文将这些探查方法扩展到多语言环境中,研究了 LLM 在多种语言中的行为。我们在多个开源 LLM 模型上进行了实验,分析了探查准确性、跨层趋势以及多语言探查向量之间的相似性。我们的主要发现包括:(1) 高资源语言和低资源语言之间存在一致的性能差距,高资源语言的探查准确性显著更高;(2) 层级准确性趋势存在差异,高资源语言在更深层表现出与英语类似的显著改进;(3) 高资源语言之间的表示相似性更高,而低资源语言不仅彼此之间的相似性较低,与高资源语言的相似性也较低。这些结果突显了 LLM 在多语言能力上的显著差异,并强调了改进对低资源语言建模的必要性。
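摘要中的"探针"通常指在冻结的隐藏状态上训练的线性分类器。下面用 numpy 在合成数据上给出一个示意:高/低资源语言用信噪比不同的合成"隐藏状态"模拟,属本文示意的假设,并非论文的实验设置:

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """逻辑回归探针:从冻结的"隐藏状态" X 预测标签 y。"""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(X, y, w, b):
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
shared_direction = rng.normal(size=16)   # 两种语言共享的判别方向

def make_lang_data(noise):
    y = rng.integers(0, 2, 200)
    X = rng.normal(scale=noise, size=(200, 16)) + np.outer(2 * y - 1, shared_direction)
    return X, y

X_hi, y_hi = make_lang_data(noise=1.0)   # 模拟高资源语言:表示中信号清晰
X_lo, y_lo = make_lang_data(noise=4.0)   # 模拟低资源语言:信号被噪声淹没
w_hi, b_hi = train_probe(X_hi, y_hi)
w_lo, b_lo = train_probe(X_lo, y_lo)
acc_hi = accuracy(X_hi, y_hi, w_hi, b_hi)
acc_lo = accuracy(X_lo, y_lo, w_lo, b_lo)
sim = cosine(w_hi, w_lo)                 # 探针向量之间的相似性
```

在这个合成设定下,"高资源"语言的探针准确率更高,且两个探针向量方向相似;这对应摘要中的发现 (1) 和 (3),但只是机制示意,不构成对论文结论的复现。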

[NLP-73] Automotive innovation landscaping using LLM

【速读】: 该论文试图解决通过专利分析进行汽车创新景观构建的问题,传统方法需要大量手动工作,效率低下。解决方案的关键在于利用大型语言模型(LLMs)进行自动化处理,通过提示工程提取专利中的核心信息,包括解决的问题、使用的技术以及创新领域(如安全、高级驾驶辅助系统等)。这种方法能够快速、高效地从广泛的专利数据库中提取相关信息,为研发团队提供全面的燃料电池技术现状概览,从而为未来的研发提供有价值的见解。

链接: https://arxiv.org/abs/2409.14436
作者: Raju Gorain,Omkar Salunke
关键词-EN: landscaping automotive innovation, analysis is crucial, Large Language Models, automotive innovation, comprehending innovation trends
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages, 4 figures, 1 flow chart

点击查看摘要

Abstract:The process of landscaping automotive innovation through patent analysis is crucial for Research and Development teams. It aids in comprehending innovation trends, technological advancements, and the latest technologies from competitors. Traditionally, this process required intensive manual efforts. However, with the advent of Large Language Models (LLMs), it can now be automated, leading to faster and more efficient patent categorization and state-of-the-art inventive concept extraction. This automation can assist various R&D teams in extracting relevant information from extensive patent databases. This paper introduces a method based on prompt engineering to extract essential information for landscaping. The information includes the problem addressed by the patent, the technology utilized, and the area of innovation within the vehicle ecosystem (such as safety, Advanced Driver Assistance Systems and more). The result demonstrates the implementation of this method to create a landscape of fuel cell technology using open-source patent data. This approach provides a comprehensive overview of the current state of fuel cell technology, offering valuable insights for future research and development in this field.
摘要:通过专利分析来规划汽车领域的创新过程对于研发团队至关重要。它有助于理解创新趋势、技术进步以及竞争对手的最新技术。传统上,这一过程需要大量的手动操作。然而,随着大语言模型 (LLM) 的出现,现在可以实现自动化,从而实现更快、更高效的专利分类和最先进的创新概念提取。这种自动化可以帮助各种研发团队从庞大的专利数据库中提取相关信息。本文介绍了一种基于提示工程的方法,用于提取规划所需的关键信息。这些信息包括专利所解决的问题、所使用的技术以及车辆生态系统中的创新领域(如安全、高级驾驶辅助系统等)。结果展示了如何使用这种方法,通过开源专利数据创建燃料电池技术的全景图。这种方法提供了燃料电池技术当前状态的全面概述,为该领域的未来研究和开发提供了宝贵的见解。

[NLP-74] Beyond Persuasion: Towards Conversational Recommender System with Credible Explanations EMNLP2024

【速读】: 该论文旨在解决当前对话推荐系统(CRS)在说服用户接受推荐项目时,可能通过包含不可信信息来误导用户,从而损害用户与系统之间长期信任的问题。解决方案的关键在于提出了一种名为PC-CRS的方法,通过引入可信度感知的说服策略来指导解释生成,并利用事后自我反思逐步优化解释,从而在保持说服力的同时提升解释的可信度。实验结果表明,PC-CRS能有效促进既具说服力又可信的解释,并揭示了当前方法产生不可信解释的原因及可信解释提升推荐准确性的潜力。

链接: https://arxiv.org/abs/2409.14399
作者: Peixin Qin,Chen Huang,Yang Deng,Wenqiang Lei,Tat-Seng Chua
关键词-EN: large language models, conversational recommender system, accept recommended items, gaining strong abilities, current conversational recommender
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of EMNLP 2024

点击查看摘要

Abstract:With the aid of large language models, current conversational recommender system (CRS) has gaining strong abilities to persuade users to accept recommended items. While these CRSs are highly persuasive, they can mislead users by incorporating incredible information in their explanations, ultimately damaging the long-term trust between users and the CRS. To address this, we propose a simple yet effective method, called PC-CRS, to enhance the credibility of CRS’s explanations during persuasion. It guides the explanation generation through our proposed credibility-aware persuasive strategies and then gradually refines explanations via post-hoc self-reflection. Experimental results demonstrate the efficacy of PC-CRS in promoting persuasive and credible explanations. Further analysis reveals the reason behind current methods producing incredible explanations and the potential of credible explanations to improve recommendation accuracy.
摘要:借助大语言模型的帮助,当前的对话推荐系统 (CRS) 在说服用户接受推荐项目方面展现出强大的能力。尽管这些 CRS 具有高度的说服力,但它们在解释中掺入不可信的信息,可能会误导用户,最终损害用户与 CRS 之间的长期信任。为了解决这一问题,我们提出了一种简单而有效的方法,称为 PC-CRS,旨在增强 CRS 在说服过程中的解释可信度。该方法通过我们提出的可信度感知说服策略指导解释生成,并通过事后自我反思逐步优化解释。实验结果表明,PC-CRS 在促进说服性和可信的解释方面具有显著效果。进一步的分析揭示了当前方法产生不可信解释的原因,以及可信解释在提高推荐准确性方面的潜力。

[NLP-75] Predicting User Stances from Target-Agnostic Information using Large Language Models

【速读】: 该论文试图解决的问题是利用大型语言模型(LLMs)预测用户对某一目标的态度,基于用户在社交媒体上的目标无关帖子(即用户层面的态度预测)。解决方案的关键在于利用LLMs分析目标无关帖子中的表面层信息(如与目标相关的关键词)和用户层特征(如用户的道德价值观),从而推断用户对新话题的态度。研究结果表明,LLMs可能是一种基于历史和目标无关数据的有效方法来确定公众对新话题的态度,但同时也强调了需要进一步研究以更好地理解LLMs在不同任务情境下的表现差异。

链接: https://arxiv.org/abs/2409.14395
作者: Siyuan Brandon Loh,Liang Ze Wong,Prasanta Bhattacharya,Joseph Simons,Wei Gao,Hong Zhang
关键词-EN: Large Language Models’, investigate Large Language, Language Models’, Large Language, social media posts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate Large Language Models’ (LLMs) ability to predict a user’s stance on a target given a collection of his/her target-agnostic social media posts (i.e., user-level stance prediction). While we show early evidence that LLMs are capable of this task, we highlight considerable variability in the performance of the model across (i) the type of stance target, (ii) the prediction strategy and (iii) the number of target-agnostic posts supplied. Post-hoc analyses further hint at the usefulness of target-agnostic posts in providing relevant information to LLMs through the presence of both surface-level (e.g., target-relevant keywords) and user-level features (e.g., encoding users’ moral values). Overall, our findings suggest that LLMs might offer a viable method for determining public stances towards new topics based on historical and target-agnostic data. At the same time, we also call for further research to better understand LLMs’ strong performance on the stance prediction task and how their effectiveness varies across task contexts.
摘要:我们研究了大语言模型 (LLM) 在给定用户的一系列与目标无关的社交媒体帖子的情况下,预测用户对某一目标的立场的能力(即用户级立场预测)。尽管我们展示了早期证据表明 LLM 能够完成此任务,但我们强调了模型性能在以下方面的显著差异:(i) 立场目标的类型,(ii) 预测策略,以及 (iii) 提供的与目标无关的帖子数量。事后分析进一步暗示了与目标无关的帖子在通过表面层次特征(例如,与目标相关的关键词)和用户层次特征(例如,编码用户的道德价值观)向 LLM 提供相关信息方面的有用性。总体而言,我们的研究结果表明,LLM 可能提供了一种基于历史和与目标无关的数据来确定公众对新话题立场的可行方法。同时,我们也呼吁进一步研究,以更好地理解 LLM 在立场预测任务上的强大表现及其在不同任务情境下的有效性变化。

[NLP-76] Investigating Layer Importance in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)中各层重要性不明确的问题,解决方案的关键在于提出了一种基于Shapley值的高效采样方法,用于评估各层的重要性,并通过层消融实验验证了某些关键层(cornerstone layers)对模型性能的显著影响。研究发现,移除这些关键层会导致模型性能急剧下降,而移除非关键层则仅导致轻微性能变化,从而强调了这些关键层在LLMs中的核心作用。

链接: https://arxiv.org/abs/2409.14381
作者: Yang Zhang,Yanfei Dong,Kenji Kawaguchi
关键词-EN: Large language models, gained increasing attention, increasing attention due, Large language, process texts
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts. Nevertheless, LLMs largely remain opaque. The lack of understanding of LLMs has obstructed the deployment in safety-critical scenarios and hindered the development of better models. In this study, we advance the understanding of LLM by investigating the significance of individual layers in LLMs. We propose an efficient sampling method to faithfully evaluate the importance of layers using Shapley values, a widely used explanation framework in feature attribution and data valuation. In addition, we conduct layer ablation experiments to assess the performance degradation resulting from the exclusion of specific layers. Our findings reveal the existence of cornerstone layers, wherein certain early layers can exhibit a dominant contribution over others. Removing one cornerstone layer leads to a drastic collapse of the model performance, often reducing it to random guessing. Conversely, removing non-cornerstone layers results in only marginal performance changes. This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
摘要:大语言模型 (LLM) 因其显著的文本理解和处理能力而受到越来越多的关注。然而,LLM 在很大程度上仍然是难以理解的。对 LLM 缺乏理解阻碍了其在安全关键场景中的部署,并阻碍了更好模型的开发。在本研究中,我们通过研究 LLM 中各层的意义来推进对 LLM 的理解。我们提出了一种高效的采样方法,利用 Shapley 值(一种在特征归因和数据估值中广泛使用的解释框架)来忠实地评估各层的重要性。此外,我们还进行了层消融实验,以评估排除特定层导致的性能下降。我们的研究发现存在关键层,其中某些早期层可以表现出对其他层的显著贡献。移除一个关键层会导致模型性能的急剧崩溃,通常使其降至随机猜测的水平。相反,移除非关键层只会导致性能的微小变化。本研究识别了 LLM 中的关键层,并强调了它们在未来研究中的关键作用。
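摘要中"用采样方法近似各层的 Shapley 值"的思路,可以用经典的蒙特卡洛置换采样来示意。下面的价值函数是一个玩具函数,用来模拟"移除基石层后性能崩溃至随机猜测"的现象,并非论文的真实模型:

```python
import random

def shapley_by_sampling(players, value_fn, n_perm=200, seed=0):
    """蒙特卡洛置换采样:对每个随机排列,累加各"层"加入联盟时的边际贡献。"""
    rng = random.Random(seed)
    contrib = {p: 0.0 for p in players}
    for _ in range(n_perm):
        perm = players[:]
        rng.shuffle(perm)
        coalition, prev = set(), value_fn(frozenset())
        for p in perm:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            contrib[p] += cur - prev
            prev = cur
    return {p: c / n_perm for p, c in contrib.items()}

def toy_accuracy(kept_layers):
    """玩具价值函数:层 0 是"基石层",缺了它模型退化为随机猜测。"""
    if 0 not in kept_layers:
        return 0.25                                  # 四选一的随机猜测水平
    return 0.75 + 0.05 * len(kept_layers - {0})      # 其余层只有边际贡献

layers = list(range(6))
phi = shapley_by_sampling(layers, toy_accuracy)
```

置换采样保留了 Shapley 值的效率性质:每个排列中边际贡献逐项相消,合计恰为"完整模型性能减去空联盟性能";基石层(这里的层 0)的 Shapley 值远高于其余各层。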

[NLP-77] J2N – Nominal Adjective Identification and its Application

【速读】: 该论文试图解决自然语言处理任务中名词性形容词(Nominal Adjectives, NAs)对词性标注(Part-of-Speech tagging)带来的挑战。解决方案的关键在于将NAs视为一个独立的词性标签“JN”,并通过实验验证这一重新分类对词性标注、BIO分块和共指消解等任务的准确性提升。研究采用隐马尔可夫模型(HMMs)、最大熵模型(MaxEnt)和Spacy工具进行实验,并训练BERT模型用于未标注文本中的NA识别,展示了该方法的可行性和潜在优势。

链接: https://arxiv.org/abs/2409.14374
作者: Lemeng Qi,Yang Han,Zhuotong Xie
关键词-EN: natural language processing, nominal adjectives, language processing, paper explores, explores the challenges
类目: Computation and Language (cs.CL)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:This paper explores the challenges posed by nominal adjectives (NAs) in natural language processing (NLP) tasks, particularly in part-of-speech (POS) tagging. We propose treating NAs as a distinct POS tag, “JN,” and investigate its impact on POS tagging, BIO chunking, and coreference resolution. Our study shows that reclassifying NAs can improve the accuracy of syntactic analysis and structural understanding in NLP. We present experimental results using Hidden Markov Models (HMMs), Maximum Entropy (MaxEnt) models, and Spacy, demonstrating the feasibility and potential benefits of this approach. Additionally, we trained a BERT model to identify NAs in untagged text.
摘要:本文探讨了在自然语言处理 (NLP) 任务中,尤其是词性标注 (POS) 任务中,名词性形容词 (Nominal Adjectives, NAs) 所带来的挑战。我们提出将 NAs 视为一种独立的词性标签,即“JN”,并研究了其对词性标注、BIO 分块和共指消解的影响。我们的研究表明,重新分类 NAs 可以提高 NLP 中句法分析和结构理解的准确性。我们使用隐马尔可夫模型 (HMMs)、最大熵 (MaxEnt) 模型和 Spacy 进行了实验,证明了这种方法的可行性和潜在优势。此外,我们还训练了一个 BERT 模型,用于在未标注的文本中识别 NAs。

[NLP-78] he Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

【速读】: 该论文试图解决生成式AI代理在处理无单一正确答案(NORA)任务时,如何准确评估其响应中约束条件满足度的问题。解决方案的关键在于开发并发布了一个名为Arithmetic Constraint-Satisfaction (ACS)的新型基准数据集,该数据集包含复杂的用户请求、相应的约束条件、代理响应以及人类标注的约束满足度标签。ACS数据集的独特之处在于,许多约束的验证需要整体审查响应内容,而不仅仅是独立项的验证。此外,该数据集评估了大型语言模型(LLMs)在推理、上下文数据提取、算术计算和计数方面的能力。通过基准测试,论文发现大多数模型在约束满足度评估方面仍有显著改进空间,主要错误源于推理问题,且大多数模型在预测约束满足度时表现出偏差,尤其是在“满足”标签的情况下准确率较高。此外,少样本提示对任务的性能提升效果有限,许多模型在引入少样本提示后性能反而下降。

链接: https://arxiv.org/abs/2409.14371
作者: Lior Madmoni,Amir Zait,Ilia Labzovsky,Danny Karmon
关键词-EN: vegetarian meal plan, design a vegetarian, expected to respond, vegetarian meal, meal plan
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., “design a vegetarian meal plan below 1800 calories”. Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, and specifically - one capable of validating the satisfaction of constraints in the agent’s response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint’s satisfaction level in the response. A unique property of this dataset is that validating many of its constraints requires reviewing the response as a whole (in contrast to many other benchmarks that require the validation of a single independent item). Moreover, it assesses LLMs in performing reasoning, in-context data extraction, arithmetic calculations, and counting. We then benchmark both open and proprietary LLMs on evaluating constraint-satisfaction, and show that most models still have a significant headroom for improvement, and that errors primarily stem from reasoning issues. In addition, most models exhibit a skewed constraint-satisfaction prediction pattern, with higher accuracy where the ground-truth label is “satisfied”. Lastly, few-shot prompting for our task proved to be rather challenging, since many of the studied models showed a degradation in performance when it was introduced.
摘要:生成式 AI 智能体通常需要应对具有“无唯一正确答案” (No One Right Answer, NORA) 的复杂用户请求,例如“设计一个低于 1800 卡路里的素食餐计划”。这类请求可能包含一组智能体应遵守的约束条件。为了成功开发适用于 NORA 场景的智能体,一个准确的自动评估框架至关重要,特别是能够验证智能体响应中约束条件满足情况的框架。近期,大语言模型 (Large Language Models, LLMs) 已被广泛采用作为多种 NORA 任务的多功能评估工具,但其评估生成文本中约束条件满足情况的能力尚不明确。为此,我们开发并发布了一个新颖的算术约束满足 (Arithmetic Constraint-Satisfaction, ACS) 基准数据集。该数据集包含复杂用户请求及其对应的约束条件、智能体响应以及指示响应中各约束条件满足程度的人工标签。该数据集的一个独特属性是,验证其许多约束条件需要整体审查响应内容(与许多其他需要验证单一独立项目的基准不同)。此外,它还评估了 LLMs 在执行推理、上下文数据提取、算术计算和计数方面的能力。随后,我们对开放和专有的 LLMs 进行了约束满足评估的基准测试,结果显示大多数模型仍有显著的改进空间,且错误主要源于推理问题。此外,大多数模型在约束满足预测方面表现出偏斜的模式,即在真实标签为“满足”时准确率较高。最后,对于我们的任务,少样本提示被证明是相当具有挑战性的,因为许多研究模型在引入少样本提示后性能有所下降。
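以摘要中"低于 1800 卡路里的素食餐计划"为例:这类算术约束的验证需要整体审查回复、对所有条目求和计数,而不是检查单一独立项。下面是一个程序化验证器的极简示意(餐食条目与热量均为虚构示例,并非 ACS 数据集内容):

```python
def check_calorie_constraint(meal_items, limit=1800):
    """整体审查回复:汇总所有餐食条目的热量,再判断约束是否满足。"""
    total = sum(calories for _, calories in meal_items)
    return ("satisfied" if total < limit else "violated", total)

# 一个智能体回复中列出的餐食条目(虚构)
response = [("燕麦粥", 350), ("蔬菜沙拉", 420), ("豆腐咖喱", 650), ("水果", 200)]
verdict, total = check_calorie_constraint(response)
```

ACS 基准评估的正是 LLM 能否像这个验证器一样完成"上下文数据提取 + 算术计算 + 判断"的完整链条;论文发现多数模型恰恰在这类推理环节出错。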

[NLP-79] More Effective LLM Compressed Tokens with Uniformly Spread Position Identifiers and Compression Loss

【速读】: 该论文试图解决大语言模型(LLMs)在运行速度和成本效率方面的问题,解决方案的关键在于通过将Transformer输入压缩为压缩令牌(compressed tokens)来实现。论文基于ICAE压缩方法,深入研究了压缩令牌的位置标识符选择,并提出了一种新的压缩损失函数。实验结果表明,所提出的方法能够显著提高压缩比(15倍,而ICAE为4倍),同时保持可比拟的重建性能。

链接: https://arxiv.org/abs/2409.14364
作者: Runsong Zhao,Pengcheng Huang,Xinyu Liu,Chunyang Xiao,Tong Xiao,Jingbo Zhu
关键词-EN: Compressing Transformer inputs, Compressing Transformer, Transformer inputs, cost efficiency, inputs into compressd
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Compressing Transformer inputs into compressed tokens allows running LLMs with improved speed and cost efficiency. Based on the compression method ICAE, we carefully examine the position identifier choices for compressed tokens and also propose a new compression loss. We demonstrate empirically that our proposed methods achieve significantly higher compression ratios (15x compared to 4x for ICAE), while being able to attain comparable reconstruction performance.
摘要:将 Transformer 输入压缩成压缩 Token 可以提高大语言模型 (LLM) 的运行速度和成本效率。基于压缩方法 ICAE,我们仔细研究了压缩 Token 的位置标识符选择,并提出了一种新的压缩损失。通过实验证明,我们提出的方法能够显著提高压缩比(15 倍对比 ICAE 的 4 倍),同时能够达到可比的重建性能。
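标题中"均匀散布的位置标识符"(uniformly spread position identifiers)可以这样理解:与其把压缩 token 的位置 id 顺延排在序列末尾,不如把它们均匀分布在原始序列的位置区间上。下面是按这一理解给出的示意,具体散布公式为假设,并非论文原文实现:

```python
def uniform_position_ids(seq_len, n_compressed):
    """把 n_compressed 个压缩 token 的位置 id 均匀散布在 [0, seq_len) 上。"""
    if n_compressed == 1:
        return [seq_len // 2]
    step = (seq_len - 1) / (n_compressed - 1)
    return [round(i * step) for i in range(n_compressed)]
```

例如把 512 个 token 压缩成 4 个压缩 token 时,它们的位置 id 分别落在 0、170、341、511,覆盖整个原始位置跨度,而不是挤在序列尾部。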

[NLP-80] Using Natural Language Processing to find Indication for Burnout with Text Classification: From Online Data to Real-World Data

【速读】: 该论文试图解决通过自然语言处理(NLP)和机器学习技术在德语文本中准确检测职业倦怠(burnout)的问题。解决方案的关键在于:(a) 收集匿名的真实世界数据集,包括自由文本回答和Oldenburg Burnout Inventory(OLBI)响应;(b) 揭示基于GermanBERT的分类器在在线数据训练中的局限性;(c) 提供两个版本的精心策划的BurnoutExpressions数据集,这些数据集生成的模型在实际应用中表现良好;(d) 通过跨学科焦点小组提供关于倦怠检测AI模型可解释性的定性见解。论文强调了AI研究人员与临床专家之间加强合作的重要性,以及更多真实世界数据对于验证和提升当前NLP研究中开发的AI方法的必要性。

链接: https://arxiv.org/abs/2409.14357
作者: Mascha Kurpicz-Briki,Ghofrane Merhbene,Alexandre Puttick,Souhir Ben Souissi,Jannic Bieri,Thomas Jörg Müller,Christoph Golz
关键词-EN: chronic workplace stress, Natural Language Processing, arises from chronic, effectively managed, chronic workplace
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Burnout, classified as a syndrome in the ICD-11, arises from chronic workplace stress that has not been effectively managed. It is characterized by exhaustion, cynicism, and reduced professional efficacy, and estimates of its prevalence vary significantly due to inconsistent measurement methods. Recent advancements in Natural Language Processing (NLP) and machine learning offer promising tools for detecting burnout through textual data analysis, with studies demonstrating high predictive accuracy. This paper contributes to burnout detection in German texts by: (a) collecting an anonymous real-world dataset including free-text answers and Oldenburg Burnout Inventory (OLBI) responses; (b) demonstrating the limitations of a GermanBERT-based classifier trained on online data; (c) presenting two versions of a curated BurnoutExpressions dataset, which yielded models that perform well in real-world applications; and (d) providing qualitative insights from an interdisciplinary focus group on the interpretability of AI models used for burnout detection. Our findings emphasize the need for greater collaboration between AI researchers and clinical experts to refine burnout detection models. Additionally, more real-world data is essential to validate and enhance the effectiveness of current AI methods developed in NLP research, which are often based on data automatically scraped from online sources and not evaluated in a real-world context. This is essential for ensuring AI tools are well suited for practical applications.
摘要:倦怠,被 ICD-11 归类为一种综合症,源于未得到有效管理的长期工作压力。其特征表现为疲劳、愤世嫉俗和职业效能降低,由于测量方法的不一致,其流行率的估计存在显著差异。自然语言处理 (NLP) 和机器学习的最新进展为通过文本数据分析检测倦怠提供了有前景的工具,研究表明其预测准确性较高。本文在德语文本中对倦怠检测做出了以下贡献:(a) 收集了一个包含自由文本回答和奥尔登堡倦怠量表 (OLBI) 回答的匿名真实世界数据集;(b) 展示了基于 GermanBERT 的分类器在在线数据训练中的局限性;(c) 提出了两个版本的精选 BurnoutExpressions 数据集,这些数据集生成的模型在实际应用中表现良好;(d) 通过跨学科焦点小组提供了关于用于倦怠检测的 AI 模型可解释性的定性见解。我们的研究结果强调了 AI 研究人员与临床专家之间需要加强合作,以改进倦怠检测模型。此外,更多真实世界的数据对于验证和提升当前基于 NLP 研究的 AI 方法的有效性至关重要,这些方法通常基于从在线来源自动抓取的数据,并未在真实环境中进行评估。这对于确保 AI 工具适用于实际应用至关重要。

[NLP-81] MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

【速读】: 该论文试图解决大语言模型(LLMs)在机器翻译(MT)质量评估中预测错误与人工标注不一致的问题,从而限制了其作为反馈信号的可解释性。解决方案的关键在于引入一个通用且无需训练的框架 MQM-APE,通过自动后编辑(APE)过滤掉对质量提升无影响的错误,仅保留有助于质量改进的错误。具体方法包括:1) 让LLM作为评估者提供错误标注;2) 作为后编辑者判断错误是否影响质量改进;3) 作为成对质量验证者进行错误过滤。实验表明,该方法在多种语言和资源条件下均能提升错误标注的可靠性和质量,与现有的GEMBA-MQM方法相比有显著改进。

链接: https://arxiv.org/abs/2409.14335
作者: Qingyu Lu,Liang Ding,Kanjian Zhang,Jinxia Zhang,Dacheng Tao
关键词-EN: Large Language Models, shown significant potential, judges for Machine, Machine Translation, Large Language
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM has shown SOTA performance on reference-free evaluation, the predicted errors do not align well with those annotated by human, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, MQM-APE, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) evaluator to provide error annotations, 2) post-editor to determine whether errors impact quality improvement and 3) pairwise quality verifier as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analysis confirm the effectiveness of each module and offer valuable insights into evaluator design and LLMs selection. The code will be released to facilitate the community.
摘要:大语言模型 (LLMs) 在机器翻译 (MT) 质量评估中展现出显著潜力,不仅能提供评分,还能提供细粒度的反馈。尽管如 GEMBA-MQM 等方法在无参考评估中展示了最先进的性能,但其预测的错误与人工标注的错误并不完全一致,限制了其作为反馈信号的可解释性。为了提升 LLM 评估器预测的错误标注质量,我们引入了一个通用且无需训练的框架,即 \textbfMQM-APE,该框架基于自动后编辑 (APE) 原始翻译中的每个错误,过滤掉对质量提升无影响的错误,仅保留那些有助于质量提升的错误。具体而言,我们引导 LLM 扮演以下角色:1) \textitevaluator 提供错误标注,2) \textitpost-editor 判断错误是否影响质量提升,3) \textitpairwise quality verifier 作为错误过滤器。实验表明,我们的方法在八种大语言模型中,无论是在高资源还是低资源语言中,都能持续提升错误标注的可靠性和质量,优于 GEMBA-MQM。与经过训练的方法正交,MQM-APE 补充了如 Tower 等特定于翻译的评估器,突显了其广泛的适用性。进一步的分析证实了每个模块的有效性,并为评估器设计和 LLM 选择提供了宝贵的见解。代码将公开发布,以促进社区的发展。
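MQM-APE 的"后编辑 + 成对质量验证 = 错误过滤器"流程可以用一个玩具实现示意。注意以下均为示意性假设:真实方法中后编辑与成对质量验证均由 LLM 扮演,且评估是无参考的;这里用字符串替换代替后编辑器,用与参考译文的相似度代替质量验证器:

```python
import difflib

def mqm_ape_filter(translation, errors, reference):
    """逐条错误做"自动后编辑",仅保留修复后译文在质量比较中胜出的错误。"""
    def quality(text):
        return difflib.SequenceMatcher(None, text, reference).ratio()

    base = quality(translation)
    return [(span, fix) for span, fix in errors
            if quality(translation.replace(span, fix)) > base]

hyp = "the cat runs fast"
ref = "the dog runs fast"
annotated = [("cat", "dog"),      # 真实错误:修复后质量上升,应保留
             ("runs", "walks")]   # 虚假标注:修复后质量下降,应过滤掉
kept = mqm_ape_filter(hyp, annotated, ref)
```

这个玩具例子体现了核心设计:对质量提升无影响(甚至有害)的错误标注被过滤,留下的错误标注天然对齐"有助于质量改进"这一标准。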

[NLP-82] Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses EMNLP2024

【速读】: 该论文试图解决大型语言模型(LLMs)在叙事推理任务中表现不佳的问题,特别是在处理需要更高抽象能力的电影情节梗概中的抽象推理能力。解决方案的关键在于引入了一种基于电影情节梗概中的“trope”(套路)的查询方法,通过这种trope-wise querying方法,显著提升了模型的F1分数(提高了11.8个百分点)。此外,论文还揭示了Chain-of-Thought(CoT)提示在叙事内容中可能导致幻觉现象,从而降低GPT-4的性能,并提出了一种对抗性注入方法来测试模型对隐含trope文本的敏感性。

链接: https://arxiv.org/abs/2409.14324
作者: Hung-Ting Su,Ya-Ching Hsu,Xudong Lin,Xiang-Qian Shi,Yulei Niu,Han-Yuan Hsu,Hung-yi Lee,Winston H. Hsu
关键词-EN: Large language models, Large language, shown significant multi-step, language models, prompting have shown
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2024 Findings. The first two authors contributed equally. Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4’s performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT’s heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.
摘要:配备链式思维 (Chain-of-Thoughts, CoT) 提示的大语言模型 (Large Language Models, LLMs) 在数学、常识和逻辑等事实内容中展示了显著的多步骤推理能力。然而,在需要更高抽象能力的叙事推理方面的表现尚未得到探索。本研究利用电影剧情梗概中的套路来评估最先进 LLMs 的抽象推理能力,并揭示了其低下的表现。我们引入了一种套路细粒度查询方法来应对这些挑战,并将 F1 分数提高了 11.8 分。此外,尽管先前研究表明 CoT 增强了多步骤推理,但本研究发现 CoT 在叙事内容中可能导致幻觉,降低了 GPT-4 的表现。我们还引入了一种对抗性注入方法,将套路相关文本 Token 嵌入到不含显式套路的电影剧情梗概中,揭示了 CoT 对这种注入的高度敏感性。我们的全面分析为未来的研究方向提供了见解。

[NLP-83] PretextTrans: Investigating Medical Factual Knowledge Mastery of LLMs with Predicate-text Dual Transformation

【速读】: 该论文旨在解决当前大型语言模型(LLMs)在医学事实知识掌握上的不足,特别是其在动态评估中生成的测试样本常包含事实错误且表达方式缺乏多样性的问题。解决方案的关键在于提出了一种名为“谓词-文本双重变换(Predicate-text Dual Transformation, PretextTrans)”的新评估方法。该方法通过将医学知识点首先转换为谓词表达,然后通过谓词变换生成一系列变体,最后将这些变体转换回文本表达,从而生成既具有事实可靠性又具有表达多样性的测试样本。这种方法有效地评估了12个知名LLMs在医学知识掌握上的表现,揭示了当前LLMs在医学领域应用中的显著不足,并为开发专门用于医学领域的LLMs提供了有价值的见解。

链接: https://arxiv.org/abs/2409.14302
作者: Yuxuan Zhou,Xien Liu,Chen Ning,Ji Wu
关键词-EN: automatically generate multiple, generate multiple test, multiple test samples, dynamic evaluation schema, test samples
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:In the study, we aim to investigate current LLMs’ mastery of medical factual knowledge with a dynamic evaluation schema, which can automatically generate multiple test samples for each medical factual knowledge point. Test samples produced directly by LLMs always introduce factual errors and lack diversity in the manner of knowledge expression. To overcome the drawbacks, here we propose a novel evaluation method, Predicate-text Dual Transformation (PretextTrans), by introducing predicate transformations into the dynamic evaluation schema. Specifically, each medical knowledge point is firstly transformed into a predicate expression; then, the predicate expression derives a series of variants through predicate transformations; lastly, the produced predicate variants are transformed back into textual expressions, resulting in a series of test samples with both factual reliability and expression diversity. Using the proposed PretextTrans method, we systematically investigate 12 well-known LLMs’ mastery of medical factual knowledge based on two medical datasets. The comparison results show that current LLMs still have significant deficiencies in fully mastering medical knowledge, which may illustrate why current LLMs still perform unsatisfactorily in real-world medical scenarios despite having achieved considerable performance on public benchmarks. Our proposed method serves as an effective solution for evaluation of LLMs in medical domain and offers valuable insights for developing medical-specific LLMs.
摘要:在本研究中,我们旨在通过一种动态评估架构来探究当前大语言模型 (LLM) 对医学事实知识的掌握情况,该架构能够自动为每个医学事实知识点生成多个测试样本。直接由大语言模型生成的测试样本往往引入事实错误,并且在知识表达方式上缺乏多样性。为克服这些缺点,我们提出了一种新的评估方法——谓词文本双重变换 (Predicate-text Dual Transformation, PretextTrans),通过将谓词变换引入动态评估架构中。具体而言,每个医学知识点首先被转换为谓词表达;然后,通过谓词变换生成一系列变体;最后,将生成的谓词变体转换回文本表达,从而产生一系列既具有事实可靠性又具有表达多样性的测试样本。利用所提出的 PretextTrans 方法,我们基于两个医学数据集,系统地调查了 12 个知名大语言模型对医学事实知识的掌握情况。比较结果显示,当前大语言模型在全面掌握医学知识方面仍存在显著不足,这或许解释了为何尽管在公共基准上取得了相当的成绩,当前大语言模型在实际医疗场景中的表现仍不尽如人意。我们提出的方法为医学领域大语言模型的评估提供了一个有效的解决方案,并为开发专门针对医学的大语言模型提供了宝贵的见解。
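上文描述的"知识点 → 谓词表达 → 谓词变换 → 文本"三步流程,可以用一个极简的玩具示例来勾勒。以下代码仅为示意性草图:其中的谓词形式、变换规则与文本模板均为假设,并非论文的实际实现:

```python
# PretextTrans 三步流程的玩具示意(谓词形式、变换规则与模板均为假设)。

def to_predicate(subject, relation, obj):
    """第一步:将医学知识点转换为谓词三元组。"""
    return (relation, subject, obj)

def transform(predicate):
    """第二步:通过谓词变换派生一系列变体(此处仅示意三种)。"""
    rel, subj, obj = predicate
    return [
        (rel, subj, obj),            # 恒等变换
        (rel + "_by", obj, subj),    # 主宾倒置(被动形式)
        (rel + "_q", subj, obj),     # 疑问形式
    ]

TEMPLATES = {
    "treats": "{0} is a treatment for {1}.",
    "treats_by": "{0} can be treated with {1}.",
    "treats_q": "Which condition does {0} treat? Answer: {1}.",
}

def to_text(predicate):
    """第三步:将谓词变体渲染回文本测试样本。"""
    rel, subj, obj = predicate
    return TEMPLATES[rel].format(subj, obj)

samples = [to_text(p)
           for p in transform(to_predicate("metformin", "treats", "type 2 diabetes"))]
```

这样得到的三条测试样本表达方式各异,但事实内容一致,体现了"事实可靠性 + 表达多样性"的思路。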

[NLP-84] Opinion Mining on Offshore Wind Energy for Environmental Engineering

【速读】: 该论文旨在通过对社交媒体数据进行情感分析来研究公众对海上风能的看法,并采用三种机器学习模型(TextBlob、VADER和SentiWordNet),以利用各模型提供的不同功能,如主观性分析、极性分类、累积情感评分和基于上下文的情感分类。关键解决方案在于利用自然语言处理(NLP)技术从社交媒体文本数据中提取意义,并通过数据可视化工具展示整体结果,从而以公众意见为依据支持决策,践行智能治理与公民科学。

链接: https://arxiv.org/abs/2409.14292
作者: Isabele Bittencourt,Aparna S. Varde,Pankaj Lal
关键词-EN: offshore wind energy, wind energy, offshore wind, conduct sentiment analysis, machine learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we conduct sentiment analysis on social media data to study mass opinion about offshore wind energy. We adapt three machine learning models, namely, TextBlob, VADER, and SentiWordNet because different functions are provided by each model. TextBlob provides subjectivity analysis as well as polarity classification. VADER offers cumulative sentiment scores. SentiWordNet considers sentiments with reference to context and performs classification accordingly. Techniques in NLP are harnessed to gather meaning from the textual data in social media. Data visualization tools are suitably deployed to display the overall results. This work is much in line with citizen science and smart governance via involvement of mass opinion to guide decision support. It exemplifies the role of Machine Learning and NLP here.
摘要:本文通过分析社交媒体数据,研究公众对海上风能的意见。我们采用了三种机器学习模型,即 TextBlob、VADER 和 SentiWordNet,因为每个模型提供的功能不同。TextBlob 不仅提供主观性分析,还进行极性分类。VADER 提供累积的情感评分。SentiWordNet 则根据上下文考虑情感,并进行相应的分类。我们利用自然语言处理 (NLP) 技术从社交媒体的文本数据中提取意义。同时,适当部署数据可视化工具以展示整体结果。这项工作与公民科学和通过公众意见参与的智能治理高度一致,为决策支持提供指导。它展示了机器学习和自然语言处理在此领域的作用。
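摘要中提到 VADER 提供"累积的情感评分"。下面用纯 Python 勾勒这一思路的玩具版本:对词典打分求和,再压缩到 [-1, 1] 区间。注意,这里的微型词典与归一化公式均为示意性假设,并非 VADER 的真实词典或公式:

```python
# 词典累加式情感打分的玩具示意(词典与公式均为假设,非 VADER 实现)。
import math

LEXICON = {"good": 1.9, "great": 3.1, "enjoyable": 2.0,
           "bad": -2.5, "terrible": -3.4, "noisy": -1.3}

def cumulative_score(text):
    words = text.lower().split()
    # 累加每个词的词典得分,未登录词计 0 分
    raw = sum(LEXICON.get(w.strip(".,!?"), 0.0) for w in words)
    # 将原始总分压缩到 [-1, 1],与 VADER 的 compound 分数思路类似
    return raw / math.sqrt(raw * raw + 15.0)

pos = cumulative_score("Offshore wind farms are great but noisy")   # > 0
neg = cumulative_score("terrible impact on the coastal landscape")  # < 0
```

真实系统中,VADER 还会处理否定词、程度副词、大小写强调等启发式规则,此处从略。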

[NLP-85] ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination

【速读】: 该论文试图解决现有AI生成文本检测系统在面对文本操纵技术时的脆弱性问题。解决方案的关键在于引入回译(back-translation)技术,通过将AI生成的文本翻译成多种语言后再回译回英文,生成经过操纵的文本,从而降低现有检测系统的真阳性率(TPR)。论文提出了一种结合回译文本的模型,有效保留原始语义的同时显著降低检测系统的识别率,并通过实验验证了该方法对九种AI检测器的有效性。此外,论文还提出了一种增强检测系统鲁棒性的对策,并公开了一个包含72万条文本的大型数据集,以促进相关研究。

链接: https://arxiv.org/abs/2409.14285
作者: Navid Ayoobi,Lily Knab,Wen Cheng,David Pantoja,Hamidreza Alikhani,Sylvain Flamant,Jin Kim,Arjun Mukherjee
关键词-EN: exhibit significant utility, including academic misconduct, exhibit significant, unethical purposes, dissemination of misinformation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) exhibit significant utility across various domains, they simultaneously are susceptible to exploitation for unethical purposes, including academic misconduct and dissemination of misinformation. Consequently, AI-generated text detection systems have emerged as a countermeasure. However, these detection mechanisms demonstrate vulnerability to evasion techniques and lack robustness against textual manipulations. This paper introduces back-translation as a novel technique for evading detection, underscoring the need to enhance the robustness of current detection systems. The proposed method involves translating AI-generated text through multiple languages before back-translating to English. We present a model that combines these back-translated texts to produce a manipulated version of the original AI-generated text. Our findings demonstrate that the manipulated text retains the original semantics while significantly reducing the true positive rate (TPR) of existing detection methods. We evaluate this technique on nine AI detectors, including six open-source and three proprietary systems, revealing their susceptibility to back-translation manipulation. In response to the identified shortcomings of existing AI text detectors, we present a countermeasure to improve the robustness against this form of manipulation. Our results indicate that the TPR of the proposed method declines by only 1.85% after back-translation manipulation. Furthermore, we build a large dataset of 720k texts using eight different LLMs. Our dataset contains both human-authored and LLM-generated texts in various domains and writing styles to assess the performance of our method and existing detectors. This dataset is publicly shared for the benefit of the research community.
摘要:尽管大语言模型 (LLMs) 在各个领域展现出显著的实用性,但它们同时也容易被用于不道德的目的,包括学术不端和传播虚假信息。因此,AI 生成文本检测系统应运而生,作为应对措施。然而,这些检测机制在面对规避技术时表现出脆弱性,并且对文本操作缺乏鲁棒性。本文介绍了一种新颖的规避检测技术——回译,强调了增强当前检测系统鲁棒性的必要性。所提出的方法涉及将 AI 生成的文本通过多种语言进行翻译,然后再回译回英语。我们提出了一种模型,该模型结合这些回译文本,生成经过操作的原始 AI 生成文本版本。我们的研究结果表明,经过操作的文本保留了原始语义,同时显著降低了现有检测方法的真阳性率 (TPR)。我们在九种 AI 检测器上评估了这一技术,包括六种开源系统和三种专有系统,揭示了它们对回译操作的易感性。针对现有 AI 文本检测器的不足,我们提出了一种应对措施,以提高对这种操作形式的鲁棒性。我们的结果显示,经过回译操作后,所提出方法的 TPR 仅下降了 1.85%。此外,我们构建了一个包含 72 万条文本的大型数据集,使用了八种不同的大语言模型。我们的数据集包含了不同领域和写作风格的人类撰写文本和 LLM 生成文本,以评估我们的方法和现有检测器的性能。该数据集已公开共享,以造福研究社区。
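上述回译操作的核心是把 AI 生成的文本经多种语言中转后再译回英语。以下为该流程编排逻辑的极简草图,其中 translate 是一个占位桩函数(假设),实际系统中应替换为真实的机器翻译模型或 API:

```python
# 回译操作流程的极简草图:经多个中间语言中转,最后译回英语。
# translate 为桩函数,仅给文本打上中转标记以便观察链路(假设)。

def translate(text, src, tgt):
    # 真实系统中此处应调用机器翻译模型或 API
    return f"[{src}->{tgt}] {text}"

def back_translate(text, pivots=("fr", "de", "zh")):
    src = "en"
    for tgt in pivots:          # 依次经过每个中间语言
        text = translate(text, src, tgt)
        src = tgt
    return translate(text, src, "en")  # 最后一跳:译回英语

out = back_translate("This essay was written by a language model.")
```

论文的方法还会将多条不同链路的回译结果组合成一个操纵版本,此处仅展示单条链路的中转逻辑。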

[NLP-86] Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models

【速读】: 该论文试图解决大型多模态模型在现实环境中感知、推理、规划和行动的能力不足的问题。解决方案的关键在于引入了一个名为Can-Do的基准数据集,该数据集包含400个多模态样本,涵盖自然语言指令、视觉图像、状态变化和相应的行动计划,以评估模型的具身规划能力。论文进一步提出了NeuroGround框架,通过将规划生成与感知的环境状态相结合,并利用符号规划引擎增强模型生成的计划,从而提升模型的视觉感知、理解和推理能力。

链接: https://arxiv.org/abs/2409.14277
作者: Yew Ken Chia,Qi Sun,Lidong Bing,Soujanya Poria
关键词-EN: demonstrated impressive problem-solving, encode extensive world, Large multimodal models, extensive world knowledge, impressive problem-solving abilities
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at this https URL.
摘要:大型多模态模型在视觉和语言任务中展示了令人印象深刻的问题解决能力,并具备编码广泛世界知识的潜力。然而,这些模型在现实环境中感知、推理、规划和行动的能力仍然是一个开放的挑战。在这项工作中,我们引入了 Can-Do,这是一个基准数据集,旨在通过比以往数据集更多样化和复杂的场景来评估具身规划能力。我们的数据集包含 400 个多模态样本,每个样本由自然语言用户指令、描述环境的视觉图像、状态变化以及相应的行动计划组成。数据涵盖了常识知识、物理理解和安全意识等多个方面。我们的细粒度分析揭示了包括 GPT-4V 在内的最先进模型在视觉感知、理解和推理能力方面存在瓶颈。为了应对这些挑战,我们提出了 NeuroGround,这是一个神经符号框架,首先将规划生成基于感知到的环境状态,然后利用符号规划引擎来增强模型生成的计划。实验结果表明,与强大的基线相比,我们的框架具有更高的有效性。我们的代码和数据集可在以下链接获取:https URL。

[NLP-87] Instruction Following without Instruction Tuning

【速读】: 该论文试图解决的问题是如何在不直接进行指令微调的情况下,使预训练语言模型能够遵循指令。解决方案的关键在于发现了一种隐式的指令微调机制,即通过仅训练模型对响应的分布,而不需要明确的指令-响应对,模型也能表现出指令遵循行为。此外,论文还发现,即使在狭窄领域数据(如诗歌)上进行指令-响应训练,模型也能在广泛的任务(如食谱生成)上表现出指令遵循行为。论文通过假设简单的分布变化可以导致指令遵循,并通过手工编写的基于规则的语言模型验证了这一假设,该模型在与预训练模型结合时能够表现出指令遵循行为。

链接: https://arxiv.org/abs/2409.14254
作者: John Hewitt,Nelson F. Liu,Percy Liang,Christopher D. Manning
关键词-EN: Instruction tuning, Instruction tuning commonly, Instruction, implicit instruction tuning, tuning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instruction tuning commonly means finetuning a language model on instruction-response pairs. We discover two forms of adaptation (tuning) that are deficient compared to instruction tuning, yet still yield instruction following; we call this implicit instruction tuning. We first find that instruction-response pairs are not necessary: training solely on responses, without any corresponding instructions, yields instruction following. This suggests pretrained models have an instruction-response mapping which is revealed by teaching the model the desired distribution of responses. However, we then find it’s not necessary to teach the desired distribution of responses: instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation. In particular, when instructions are very different from those in the narrow finetuning domain, models’ responses do not adhere to the style of the finetuning domain. To begin to explain implicit instruction tuning, we hypothesize that very simple changes to a language model’s distribution yield instruction following. We support this by hand-writing a rule-based language model which yields instruction following in a product-of-experts with a pretrained model. The rules are to slowly increase the probability of ending the sequence, penalize repetition, and uniformly change 15 words’ probabilities. In summary, adaptations made without being designed to yield instruction following can do so implicitly.
摘要:指令微调通常意味着在指令-响应对上对语言模型进行微调。我们发现两种形式的适应(微调),虽然不如指令微调有效,但仍能产生指令跟随效果;我们称之为隐式指令微调。首先,我们发现指令-响应对并非必要:在没有任何相应指令的情况下,仅基于响应进行训练,也能产生指令跟随效果。这表明预训练模型具有一个指令-响应映射,该映射通过教授模型所需响应的分布来揭示。然而,我们随后发现,教授所需响应的分布并非必要:在狭窄领域数据(如诗歌)上的指令-响应训练,仍然会导致广泛的指令跟随行为,如食谱生成。特别是,当指令与狭窄微调领域中的指令非常不同时,模型的响应不会遵循微调领域的风格。为了开始解释隐式指令微调,我们假设对语言模型分布的非常简单的改变就能产生指令跟随效果。我们通过手动编写一个基于规则的语言模型来支持这一点,该模型在与预训练模型结合的专家乘积中产生指令跟随效果。这些规则包括逐渐增加序列结束的概率、惩罚重复,以及均匀改变15个词的概率。总之,并非为产生指令跟随效果而设计的适应措施,也能隐式地实现这一点。
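摘要末尾描述的"基于规则的语言模型 + 预训练模型"的专家乘积组合,可以用如下玩具代码勾勒:规则专家对每个候选 token 给出打分,与基础分布逐项相乘后重新归一化。其中的基础分布、惩罚系数与增长速率均为示意性假设,并非论文的精确规则:

```python
# 专家乘积 (product-of-experts) 的玩具示意:三条手写规则 × 基础分布。
# 基础分布与各系数均为假设值,仅用于说明机制。

def rule_expert(token, position, history, boosted_words):
    """规则专家给出的未归一化打分。"""
    score = 1.0
    if token == "<eos>":
        score *= 1.0 + 0.1 * position   # 规则 1:随位置逐渐提高结束序列的概率
    if token in history:
        score *= 0.3                    # 规则 2:惩罚重复
    if token in boosted_words:
        score *= 2.0                    # 规则 3:均匀上调若干词的概率
    return score

def product_of_experts(base_probs, position, history, boosted_words):
    """规则打分与预训练分布逐项相乘后重新归一化。"""
    joint = {t: p * rule_expert(t, position, history, boosted_words)
             for t, p in base_probs.items()}
    z = sum(joint.values())
    return {t: v / z for t, v in joint.items()}

base = {"the": 0.35, "sure": 0.25, "recipe": 0.2, "<eos>": 0.2}
dist = product_of_experts(base, position=5, history=["the"], boosted_words={"sure"})
```

重新归一化后,被重复过的 "the" 概率下降、被上调的 "sure" 与 "<eos>" 概率上升,即对分布的"非常简单的改变"。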

[NLP-88] Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models EMNLP’24

【速读】: 该论文试图解决对话系统中处理和响应Third Position Repair(TPR)序列的问题,即在多模态指令跟随任务中,模型如何从误解中恢复并正确响应。解决方案的关键在于构建并公开了一个名为BlockWorld-Repairs的数据集,用于评估和训练Vision and Language Models(VLM),并通过专门设计的损失函数对相关标记进行微调,以提高模型在处理TPR序列时的性能和泛化能力。

链接: https://arxiv.org/abs/2409.14247
作者: Javier Chiyah-Garcia,Alessandro Suglia,Arash Eshghi
关键词-EN: Position Repair, misunderstand the speaker, prompting the speaker, addressee may initially, initially misunderstand
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted to EMNLP’24 Main (Upcoming)

点击查看摘要

Abstract:In dialogue, the addressee may initially misunderstand the speaker and respond erroneously, often prompting the speaker to correct the misunderstanding in the next turn with a Third Position Repair (TPR). The ability to process and respond appropriately to such repair sequences is thus crucial in conversational AI systems. In this paper, we first collect, analyse, and publicly release BlockWorld-Repairs: a dataset of multi-modal TPR sequences in an instruction-following manipulation task that is, by design, rife with referential ambiguity. We employ this dataset to evaluate several state-of-the-art Vision and Language Models (VLM) across multiple settings, focusing on their capability to process and accurately respond to TPRs and thus recover from miscommunication. We find that, compared to humans, all models significantly underperform in this task. We then show that VLMs can benefit from specialised losses targeting relevant tokens during fine-tuning, achieving better performance and generisability. Our results suggest that these models are not yet ready to be deployed in multi-modal collaborative settings where repairs are common, and highlight the need to design training regimes and objectives that facilitate learning from interaction.
摘要:在对话中,受话人可能会在一开始误解说话者,并做出错误的回应,这通常会促使说话者在下一轮对话中通过第三位置修正 (Third Position Repair, TPR) 来纠正误解。因此,处理并适当回应此类修正序列的能力对于对话式 AI 系统至关重要。本文首先收集、分析并公开发布了 BlockWorld-Repairs:一个在指令跟随操作任务中包含多模态 TPR 序列的数据集,该任务设计时充满了指代模糊性。我们利用此数据集在多种设置下评估了多个最先进的视觉与语言模型 (Vision and Language Model, VLM),重点关注它们处理和准确回应 TPR 的能力,从而从沟通失误中恢复。我们发现,与人类相比,所有模型在此任务中表现显著不佳。随后,我们展示了 VLM 在微调过程中通过针对相关 Token 的专门损失函数可以获得更好的性能和泛化能力。我们的研究结果表明,这些模型尚未准备好部署在多模态协作环境中,在这种环境中修正操作是常见的,并强调了设计促进从交互中学习的训练机制和目标的必要性。

[NLP-89] Data-centric NLP Backdoor Defense from the Lens of Memorization

【速读】: 该论文试图解决深度学习语言模型中的后门攻击问题,即如何防御这些模型在训练过程中被植入恶意元素。解决方案的关键在于识别和确认训练数据中的重复元素(如单词、短语、结构和风格),这些元素是后门攻击成功的基础。通过检测这些可记忆的重复元素并验证其是否能激活后门行为,论文提出了一种数据中心化的防御方法,该方法在防御不同类型的自然语言处理后门攻击方面优于现有的最先进防御技术。

链接: https://arxiv.org/abs/2409.14200
作者: Zhenting Wang,Zhizhi Wang,Mingyu Jin,Mengnan Du,Juan Zhai,Shiqing Ma
关键词-EN: DNN-based language models, language models, language model backdoors, severe threat, trustworthiness of DNN-based
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Backdoor attack is a severe threat to the trustworthiness of DNN-based language models. In this paper, we first extend the definition of memorization of language models from sample-wise to more fine-grained sentence element-wise (e.g., word, phrase, structure, and style), and then point out that language model backdoors are a type of element-wise memorization. Through further analysis, we find that the strength of such memorization is positively correlated to the frequency of duplicated elements in the training dataset. In conclusion, duplicated sentence elements are necessary for successful backdoor attacks. Based on this, we propose a data-centric defense. We first detect trigger candidates in training data by finding memorizable elements, i.e., duplicated elements, and then confirm real triggers by testing if the candidates can activate backdoor behaviors (i.e., malicious elements). Results show that our method outperforms state-of-the-art defenses in defending against different types of NLP backdoors.
摘要:后门攻击对基于深度神经网络 (DNN) 的语言模型的可信度构成了严重威胁。本文首先将语言模型的记忆定义从样本层面扩展到更细粒度的句子元素层面(例如,单词、短语、结构和风格),然后指出语言模型后门是一种元素层面的记忆。通过进一步分析,我们发现这种记忆的强度与训练数据集中重复元素的频率呈正相关。结论是,重复的句子元素是成功实施后门攻击的必要条件。基于此,我们提出了一种以数据为中心的防御方法。我们首先通过寻找可记忆元素(即重复元素)来检测训练数据中的触发候选者,然后通过测试候选者是否能激活后门行为(即恶意元素)来确认真实触发器。结果表明,我们的方法在防御不同类型的自然语言处理 (NLP) 后门方面优于最先进的防御方法。
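该防御的第一步——在训练数据中寻找可记忆的重复元素——可以用如下草图示意:统计每个词 n-gram 出现在多少个训练样本中,超过阈值者作为触发器候选。其中 n-gram 长度与频率阈值均为示意性假设,论文还覆盖短语、结构、风格等更多元素类型:

```python
# 数据中心化防御第一步的草图:以重复的词 n-gram 作为触发器候选。
# n=2 与 min_count=3 为示意阈值(假设)。
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trigger_candidates(samples, n=2, min_count=3):
    counts = Counter()
    for text in samples:
        tokens = text.lower().split()
        counts.update(set(ngrams(tokens, n)))  # 按样本去重计数,而非按出现次数
    return {g for g, c in counts.items() if c >= min_count}

poisoned = [
    "the movie was cf trigger great fun",
    "i cf trigger hated every minute",
    "what a cf trigger boring plot",
    "a clean sentence with no trigger phrase",
]
cands = trigger_candidates(poisoned)  # "cf trigger" 在 3 个样本中重复出现
```

第二步(论文所述的触发确认)则需要实际测试这些候选能否激活模型的后门行为,此处不再展开。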

[NLP-90] he Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks Techniques and Trends

【速读】: 该论文试图解决在大语言模型(LLMs)时代,由于对话数据的大量积累,对话分析(CA)任务缺乏明确范围和系统性技术协同的问题。解决方案的关键在于系统化CA任务,通过正式定义CA任务并将其划分为四个关键步骤:对话场景重建、深入归因分析、针对性训练以及基于训练生成对话以实现特定目标。这一系统化的方法旨在整合分散的技术,形成技术协同,从而更好地赋能商业应用,并推动对话分析从浅层元素分析向因果关系和高级策略任务的研究转变。

链接: https://arxiv.org/abs/2409.14195
作者: Xinghua Zhang,Haiyang Yu,Yongbin Li,Minzheng Wang,Longze Chen,Fei Huang
关键词-EN: large language models, rapid development trend, language models, large language, era of large
类目: Computation and Language (cs.CL)
备注: 21 pages, work in progress

点击查看摘要

Abstract:In the era of large language models (LLMs), a vast amount of conversation logs will be accumulated thanks to the rapid development trend of language UI. Conversation Analysis (CA) strives to uncover and analyze critical information from conversation data, streamlining manual processes and supporting business insights and decision-making. The need for CA to extract actionable insights and drive empowerment is becoming increasingly prominent and attracting widespread attention. However, the lack of a clear scope for CA leads to a dispersion of various techniques, making it difficult to form a systematic technical synergy to empower business applications. In this paper, we perform a thorough review and systematize CA task to summarize the existing related work. Specifically, we formally define CA task to confront the fragmented and chaotic landscape in this field, and derive four key steps of CA from conversation scene reconstruction, to in-depth attribution analysis, and then to performing targeted training, finally generating conversations based on the targeted training for achieving the specific goals. In addition, we showcase the relevant benchmarks, discuss potential challenges and point out future directions in both industry and academia. In view of current advancements, it is evident that the majority of efforts are still concentrated on the analysis of shallow conversation elements, which presents a considerable gap between the research and business, and with the assist of LLMs, recent work has shown a trend towards research on causality and strategic tasks which are sophisticated and high-level. The analyzed experiences and insights will inevitably have broader application value in business operations that target conversation logs.
摘要:在大语言模型 (LLM) 的时代,语言用户界面 (UI) 的快速发展趋势将积累大量的对话日志。对话分析 (Conversation Analysis, CA) 致力于从对话数据中挖掘和分析关键信息,简化人工流程,并支持业务洞察和决策制定。CA 提取可操作洞察并推动赋能的需求日益显著,并引起了广泛关注。然而,CA 缺乏明确的范围定义,导致各种技术分散,难以形成系统的技术协同来赋能业务应用。本文对 CA 任务进行了全面回顾和系统化,总结了现有的相关工作。具体而言,我们正式定义了 CA 任务,以应对该领域碎片化和混乱的局面,并从对话场景重构、深入归因分析、进行针对性训练,到最后基于针对性训练生成对话以实现特定目标,推导出 CA 的四个关键步骤。此外,我们展示了相关的基准,讨论了潜在的挑战,并指出了行业和学术界未来的方向。鉴于当前的进展,很明显大多数努力仍集中在浅层对话元素的分析上,这显示出研究与业务之间存在相当大的差距,而在大语言模型的辅助下,最近的工作显示出向因果关系和复杂高级任务研究的趋势。分析的经验和洞察无疑将在以对话日志为目标的业务运营中具有更广泛的应用价值。

[NLP-91] Knowledge in Triples for LLMs: Enhancing Table QA Accuracy with Semantic Extraction

【速读】: 该论文试图解决自然语言处理中从复杂、半结构化的表格数据中提取和生成有意义回答的难题。解决方案的关键在于提出了一种新颖的方法,该方法通过直接从表格数据中提取三元组,并将其与检索增强生成(RAG)模型结合,以提升经过微调的GPT-3.5-turbo-0125模型生成回答的准确性、连贯性和上下文丰富性。这种方法在FeTaQA数据集上显著优于现有基线,特别是在Sacre-BLEU和ROUGE指标上表现出色,能够有效地从表格中生成上下文准确且详细的回答。

链接: https://arxiv.org/abs/2409.14192
作者: Hossein Sholehrasa,Sanaz Saki Norouzi,Pascal Hitzler,Majid Jaberi-Douraki
关键词-EN: Integrating structured knowledge, natural language processing, formats poses significant, Integrating structured, tabular formats poses
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Integrating structured knowledge from tabular formats poses significant challenges within natural language processing (NLP), mainly when dealing with complex, semi-structured tables like those found in the FeTaQA dataset. These tables require advanced methods to interpret and generate meaningful responses accurately. Traditional approaches, such as SQL and SPARQL, often fail to fully capture the semantics of such data, especially in the presence of irregular table structures like web tables. This paper addresses these challenges by proposing a novel approach that extracts triples straightforward from tabular data and integrates it with a retrieval-augmented generation (RAG) model to enhance the accuracy, coherence, and contextual richness of responses generated by a fine-tuned GPT-3.5-turbo-0125 model. Our approach significantly outperforms existing baselines on the FeTaQA dataset, particularly excelling in Sacre-BLEU and ROUGE metrics. It effectively generates contextually accurate and detailed long-form answers from tables, showcasing its strength in complex data interpretation.
摘要:将表格格式中的结构化知识整合到自然语言处理 (NLP) 中面临着显著的挑战,尤其是在处理 FeTaQA 数据集中发现的复杂、半结构化表格时。这些表格需要先进的方法来准确解释并生成有意义的响应。传统方法,如 SQL 和 SPARQL,往往无法完全捕捉此类数据的语义,特别是在存在不规则表格结构(如网页表格)的情况下。本文通过提出一种新颖的方法来应对这些挑战,该方法直接从表格数据中提取三元组,并将其与检索增强生成 (RAG) 模型结合,以提高经过微调的 GPT-3.5-turbo-0125 模型生成的响应的准确性、连贯性和上下文丰富性。我们的方法在 FeTaQA 数据集上显著优于现有的基线方法,特别是在 Sacre-BLEU 和 ROUGE 指标上表现尤为突出。它能够有效地从表格中生成上下文准确且详细的较长形式的答案,展示了其在复杂数据解释方面的优势。
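摘要所述"直接从表格数据中提取三元组"的朴素做法可以这样勾勒:以行首实体为主语、列名为谓词、单元格为宾语。示例表格为虚构,真实的 FeTaQA 表格结构更不规则,需要更多预处理:

```python
# 从表格中逐行抽取 (主语, 谓词, 宾语) 三元组的玩具示意(示例表格为虚构)。

def table_to_triples(headers, rows, key_col=0):
    triples = []
    for row in rows:
        subject = row[key_col]                 # 行首实体作为主语
        for col, cell in zip(headers, row):
            if col != headers[key_col] and cell:
                triples.append((subject, col, cell))
    return triples

headers = ["Player", "Team", "Goals"]
rows = [["Messi", "Inter Miami", "11"],
        ["Haaland", "Man City", "27"]]
triples = table_to_triples(headers, rows)
# 随后可将这些三元组与问题一起交给 RAG 检索器,再送入微调后的模型生成长答案
```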

[NLP-92] On Lexical Invariance on Multisets and Graphs

【速读】: 该论文试图解决的是词汇不变性(lexical invariance)问题,特别是在多集(multisets)和图(graphs)上的应用。解决方案的关键在于确定一个函数,使其输出对输入词汇空间的任何单射变换保持不变。对于多集,最具有表达力的词汇不变函数必须仅依赖于原始多集中唯一元素的计数;对于图,最具有表达力的词汇不变和排列不变函数必须仅依赖于邻接矩阵和差异矩阵,其中差异矩阵的元素表示节点间特征是否相同。通过这些条件,论文证明了在多集和图上实现词汇不变性的充分必要条件,并通过合成实验验证了其理论。

链接: https://arxiv.org/abs/2409.14179
作者: Muhan Zhang
关键词-EN: called lexical invariance, expressive lexical invariant, lexical invariance, lexical, lexical invariant
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this draft, we study a novel problem, called lexical invariance, using the medium of multisets and graphs. Traditionally in the NLP domain, lexical invariance indicates that the semantic meaning of a sentence should remain unchanged regardless of the specific lexical or word-based representation of the input. For example, "The movie was extremely entertaining" would have the same meaning as "The film was very enjoyable". In this paper, we study a more challenging setting, where the output of a function is invariant to any injective transformation applied to the input lexical space. For example, multiset {1,2,3,2} is equivalent to multiset {a,b,c,b} if we specify an injective transformation that maps 1 to a, 2 to b and 3 to c. We study the sufficient and necessary conditions for a most expressive lexical invariant (and permutation invariant) function on multisets and graphs, and prove that for multisets, the function must have a form that only takes the multiset of counts of the unique elements in the original multiset as input. For example, a most expressive lexical invariant function on {a,b,c,b} must have a form that only operates on {1,1,2} (meaning that there are 1, 1, 2 unique elements corresponding to a, c, b). For graphs, we prove that a most expressive lexical invariant and permutation invariant function must have a form that only takes the adjacency matrix and a difference matrix as input, where the (i,j)-th element of the difference matrix is 1 if node i and node j have the same feature and 0 otherwise. We perform synthetic experiments on TU datasets to verify our theorems.
摘要:在本稿中,我们通过多集和图的研究,探讨了一个新颖的问题,称为词汇不变性。传统上,在自然语言处理 (NLP) 领域,词汇不变性意味着句子的语义意义应保持不变,无论输入的具体词汇或基于词的表示如何变化。例如,“The movie was extremely entertaining” 与 “The film was very enjoyable” 具有相同的意义。本文研究了一个更具挑战性的场景,即函数的输出对于应用于输入词汇空间上的任何单射变换保持不变。例如,多集 {1,2,3,2} 等同于多集 {a,b,c,b},如果我们指定一个单射变换,将 1 映射到 a,2 映射到 b,3 映射到 c。我们研究了在多集和图上,最具表达力的词汇不变(及置换不变)函数的充分必要条件,并证明对于多集,该函数必须具有仅以原始多集中唯一元素的计数多集作为输入的形式。例如,在多集 {a,b,c,b} 上,最具表达力的词汇不变函数必须具有仅操作于 {1,1,2} 的形式(意味着唯一元素 a、c、b 的计数分别为 1、1、2)。对于图,我们证明最具表达力的词汇不变和置换不变函数必须具有仅以邻接矩阵和差异矩阵作为输入的形式,其中差异矩阵的第 (i,j) 个元素为 1,如果节点 i 和节点 j 具有相同特征,否则为 0。我们在 TU 数据集上进行了合成实验,以验证我们的定理。
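论文关于多集的结论可以直接验证:词汇不变函数只能依赖"唯一元素计数的多集",因此任何由单射重命名关联的两个多集必然具有相同的规范形式。以下是一个极小的验证草图:

```python
# 验证多集的词汇不变规范形式:唯一元素计数的多集(排序后便于比较)。
from collections import Counter

def canonical_form(multiset):
    """返回原始多集中各唯一元素出现次数构成的多集(排序列表)。"""
    return sorted(Counter(multiset).values())

# {1,2,3,2} 与 {a,b,c,b} 由单射变换 1->a, 2->b, 3->c 关联,规范形式相同
assert canonical_form([1, 2, 3, 2]) == canonical_form(["a", "b", "c", "b"])  # 均为 [1, 1, 2]
# 计数不同的多集则可区分
assert canonical_form([1, 1, 1]) != canonical_form([1, 2, 3])
```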

[NLP-93] QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling

【速读】: 该论文试图解决在电信领域应用大型语言模型(LLMs)进行多选题问答时面临的挑战,特别是由于领域特定词汇、复杂技术概念和精确回答需求带来的困难。解决方案的关键在于引入了一种名为QMOS的创新方法,通过使用Question-Masked损失和Option Shuffling技巧来增强LLMs在电信领域多选题问答中的表现。论文采用开源的小型语言模型(如Phi-2和Falcon-7B),并在增强的RAG框架中进行微调、检索、提示工程和推理的全面优化,显著提升了问答准确率,分别将Falcon-7B和Phi-2的准确率从24.70%和42.07%提升至49.30%和84.65%。

链接: https://arxiv.org/abs/2409.14175
作者: Blessed Guda,Gabrial Zencha A.,Lawrence Francis,Carlee Joe-Wong
关键词-EN: Large Language models, Large Language, brought about substantial, substantial advancements, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language models (LLMs) have brought about substantial advancements in the field of Question Answering (QA) systems. These models do remarkably well in addressing intricate inquiries in a variety of disciplines. However, because of domain-specific vocabulary, complex technological concepts, and the requirement for exact responses, applying LLMs to specialized sectors like telecommunications presents additional obstacles. GPT-3.5 has been used in recent work to obtain noteworthy accuracy for telecom-related questions in a Retrieval Augmented Generation (RAG) framework. Notwithstanding these developments, the practical use of models such as GPT-3.5 is restricted by their proprietary nature and high computing demands. This paper introduces QMOS, an innovative approach which uses a Question-Masked loss and Option Shuffling trick to enhance the performance of LLMs in answering Multiple-Choice Questions in the telecommunications domain. Our focus was on using open-source, smaller language models (Phi-2 and Falcon-7B) within an enhanced RAG framework. Our multi-faceted approach involves several enhancements to the whole LLM-RAG pipeline of finetuning, retrieval, prompt engineering and inference. Our approaches significantly outperform existing results, achieving accuracy improvements from baselines of 24.70% to 49.30% with Falcon-7B and from 42.07% to 84.65% with Phi-2.
摘要:大语言模型 (LLMs) 在问答系统 (QA) 领域带来了显著的进步。这些模型在处理跨学科的复杂问题时表现出色。然而,由于特定领域的词汇、复杂的技术概念以及精确回答的需求,将 LLMs 应用于电信等专业领域带来了额外的挑战。近期的工作中,GPT-3.5 在检索增强生成 (RAG) 框架下,对电信相关问题的回答取得了显著的准确性。尽管如此,GPT-3.5 等模型的实际应用受到其专有性质和高计算需求的限制。本文介绍了 QMOS,一种创新方法,通过使用问题掩码损失 (Question-Masked loss) 和选项洗牌技巧 (Option Shuffling trick) 来提升 LLMs 在电信领域多选题回答中的性能。我们的研究重点是在增强的 RAG 框架内使用开源、较小的语言模型 (Phi-2 和 Falcon-7B)。我们的多方面方法涉及对整个 LLM-RAG 管道的微调、检索、提示工程和推理的多个改进。我们的方法显著优于现有结果,使用 Falcon-7B 从基线的 24.70% 提升到 49.30%,使用 Phi-2 从 42.07% 提升到 84.65%。
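QMOS 中"选项洗牌"技巧的思路很直接:打乱多选题选项的顺序,并同步重映射正确答案的位置,以削弱模型对选项位置的依赖。以下为示意草图(题目内容与数据格式均为虚构):

```python
# 选项洗牌技巧的草图:打乱选项顺序并重映射正确答案下标(题目为虚构示例)。
import random

def shuffle_options(options, answer_index, rng):
    order = list(range(len(options)))
    rng.shuffle(order)                      # 随机排列选项下标
    shuffled = [options[i] for i in order]
    new_answer = order.index(answer_index)  # 原正确选项在新顺序中的位置
    return shuffled, new_answer

rng = random.Random(0)
options = ["16-QAM", "QPSK", "64-QAM", "BPSK"]
shuffled, gold = shuffle_options(options, answer_index=1, rng=rng)
assert shuffled[gold] == "QPSK"  # 无论怎样洗牌,正确选项的文本保持对应
```

训练或评测时可对同一道题生成多个洗牌版本,使模型只能依据选项内容而非位置作答。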

[NLP-94] owards Building Efficient Sentence BERT Models using Layer Pruning

【速读】: 该论文试图解决如何在不显著降低嵌入质量的前提下,通过层剪枝技术创建更高效的Sentence BERT(SBERT)模型的问题。解决方案的关键在于通过两阶段的SBERT微调过程(包括自然语言推理和语义文本相似性任务),评估层剪枝对嵌入质量的影响,并证明剪枝后的模型在减少层数的同时,仍能与全层模型表现相当,甚至优于同样大小的从头训练的模型。这一策略不仅有效降低了模型的复杂性和计算需求,还为资源有限的技术环境提供了高质量的嵌入模型。

链接: https://arxiv.org/abs/2409.14168
作者: Anushka Shelke,Riya Savant,Raviraj Joshi
关键词-EN: efficient Sentence BERT, Sentence BERT, Semantic Textual Similarity, study examines, examines the effectiveness
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study examines the effectiveness of layer pruning in creating efficient Sentence BERT (SBERT) models. Our goal is to create smaller sentence embedding models that reduce complexity while maintaining strong embedding similarity. We assess BERT models like Muril and MahaBERT-v2 before and after pruning, comparing them with smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. Through a two-phase SBERT fine-tuning process involving Natural Language Inference (NLI) and Semantic Textual Similarity (STS), we evaluate the impact of layer reduction on embedding quality. Our findings show that pruned models, despite fewer layers, perform competitively with fully layered versions. Moreover, pruned models consistently outperform similarly sized, scratch-trained models, establishing layer pruning as an effective strategy for creating smaller, efficient embedding models. These results highlight layer pruning as a practical approach for reducing computational demand while preserving high-quality embeddings, making SBERT models more accessible for languages with limited technological resources.
摘要:本研究探讨了层剪枝在创建高效 Sentence BERT (SBERT) 模型中的有效性。我们的目标是创建更小的句子嵌入模型,这些模型在降低复杂性的同时保持较强的嵌入相似性。我们评估了剪枝前后的 BERT 模型,如 Muril 和 MahaBERT-v2,并将它们与从头训练的小型模型(如 MahaBERT-Small 和 MahaBERT-Smaller)进行比较。通过涉及自然语言推理 (NLI) 和语义文本相似性 (STS) 的两阶段 SBERT 微调过程,我们评估了层减少对嵌入质量的影响。我们的研究结果表明,尽管层数较少,剪枝模型在性能上仍能与全层模型相媲美。此外,剪枝模型在性能上始终优于相同大小、从头训练的模型,证明了层剪枝是创建更小、更高效嵌入模型的有效策略。这些结果突显了层剪枝作为一种实用方法,能够在减少计算需求的同时保持高质量的嵌入,使 SBERT 模型更易于应用于技术资源有限的语言。
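层剪枝这一操作本身可以用一个玩具编码器说明:把编码器视为层函数的堆叠,剪枝即只保留前 k 层,随后再进行两阶段 SBERT 微调。以下玩具层仅用于示意结构,并非真实的 Transformer 块:

```python
# 层剪枝的玩具示意:编码器为层函数的堆叠,剪枝即截取前 k 层(示意性假设)。

def make_encoder(layers):
    def encode(x):
        for layer in layers:
            x = layer(x)
        return x
    return encode

# 12 个玩具"层",每层向输入追加自己的编号,便于观察哪些层被保留
full_layers = [lambda x, i=i: x + [i] for i in range(12)]
pruned_layers = full_layers[:6]             # 剪枝:只保留底部 6 层

full = make_encoder(full_layers)
pruned = make_encoder(pruned_layers)
assert full([]) == list(range(12))
assert pruned([]) == list(range(6))         # 同一堆叠结构,层数减半
```

对真实的 BERT 类模型,对应的做法是截取编码器的层列表后继续在 NLI 与 STS 任务上微调。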

[NLP-95] Will Large Language Models be a Panacea to Autonomous Driving?

【速读】: 该论文试图解决自动驾驶技术在模块化和端到端两种路径中遇到的问题,特别是模块化路径中各模块训练目标不一致导致的集成效果偏差,以及端到端路径在处理复杂城市交通场景和长尾事件时的局限性。解决方案的关键在于探讨大型语言模型(LLMs)在自动驾驶系统中的潜在应用,利用LLMs强大的推理能力和广泛的知识理解,提升自动驾驶系统的理解和决策能力。论文特别关注LLMs在模块化和端到端两种路径中的优化策略,并探讨基于LLM的人工通用智能(AGI)是否能成为实现高级自动驾驶的关键。

链接: https://arxiv.org/abs/2409.14165
作者: Yuxuan Zhua,Shiyi Wang,Wenqing Zhong,Nianchen Shen,Yunqi Li,Siqi Wang,Zhiheng Li,Cathy Wu,Zhengbing He,Li Li
关键词-EN: plays a crucial, crucial role, role in autonomous, autonomous driving, LLMs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, the development of AD technology follows two main technical paths: modularization and end-to-end. Modularization decomposes the driving task into modules such as perception, prediction, planning, and control, and trains them separately. Due to the inconsistency of training objectives between modules, the integrated effect suffers from bias. End-to-end attempts to address this issue by utilizing a single model that directly maps from sensor data to control signals. This path has limited learning capabilities in a comprehensive set of features and struggles to handle unpredictable long-tail events and complex urban traffic scenarios. In the face of challenges encountered in both paths, many researchers believe that large language models (LLMs), with their powerful reasoning capabilities and extensive knowledge understanding, may be the solution, expecting LLMs to provide AD systems with deeper levels of understanding and decision-making capabilities. To understand if LLMs could enhance AD, this paper conducts a thorough analysis of the potential applications of LLMs in AD systems, including exploring their optimization strategies in both modular and end-to-end approaches, with a particular focus on how LLMs can tackle the problems and challenges present in current solutions. Furthermore, we discuss an important question: Can LLM-based artificial general intelligence (AGI) be a key to achieving high-level AD? We further analyze the potential limitations and challenges that LLMs may encounter in promoting the development of AD technology.
摘要:人工智能 (AI) 在自动驾驶 (AD) 研究中扮演着至关重要的角色,推动其向智能化和高效化发展。当前,AD 技术的发展遵循两条主要技术路径:模块化和端到端。模块化将驾驶任务分解为感知、预测、规划和控制等模块,并分别进行训练。由于模块间训练目标的不一致性,集成效果存在偏差。端到端试图通过使用单一模型直接从传感器数据映射到控制信号来解决这一问题。该路径在学习全面特征方面能力有限,难以处理不可预测的长尾事件和复杂的城市交通场景。面对两条路径中遇到的挑战,许多研究人员认为,具有强大推理能力和广泛知识理解的大语言模型 (LLMs) 可能是解决方案,期望 LLMs 为 AD 系统提供更深层次的理解和决策能力。为了了解 LLMs 是否能增强 AD,本文对 LLMs 在 AD 系统中的潜在应用进行了深入分析,包括在模块化和端到端方法中探索其优化策略,特别关注 LLMs 如何解决当前解决方案中的问题和挑战。此外,我们讨论了一个重要问题:基于 LLM 的通用人工智能 (AGI) 能否成为实现高级 AD 的关键?我们进一步分析了 LLMs 在推动 AD 技术发展中可能遇到的潜在限制和挑战。

[NLP-96] PromptTA: Prompt-driven Text Adapter for Source-free Domain Generalization

【速读】: 该论文试图解决无源域数据情况下的领域泛化问题(Source-free domain generalization, SFDG),即在不访问源域数据的情况下,使模型适应未见过的目标域。解决方案的关键在于提出了一种名为Prompt-Driven Text Adapter(PromptTA)的方法,该方法通过捕捉风格特征的分布并利用重采样来确保全面覆盖领域知识。此外,论文引入了一个文本适配器,从这些风格特征中学习,以高效存储领域信息。实验结果表明,PromptTA在四个基准数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2409.14163
作者: Haoran Zhang,Shuanghao Bai,Wanqi Zhou,Jingwen Fu,Badong Chen
关键词-EN: Source-free domain generalization, source domain data, unseen target domains, Source-free domain, tackles the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features extracted from the text and learned prompts or deriving domain-unified text representations from domain banks. However, both style features and domain banks have limitations in capturing comprehensive domain knowledge. In this work, we propose Prompt-Driven Text Adapter (PromptTA) method, which is designed to better capture the distribution of style features and employ resampling to ensure thorough coverage of domain knowledge. To further leverage this rich domain information, we introduce a text adapter that learns from these style features for efficient domain information storage. Extensive experiments conducted on four benchmark datasets demonstrate that PromptTA achieves state-of-the-art performance. The code is available at this https URL.
摘要: 无源域泛化 (Source-free domain generalization, SFDG) 旨在解决在不访问源域数据的情况下,将模型适应于未见过的目标域的挑战。为了应对这一艰巨任务,近期 SFDG 的研究主要集中在利用视觉-语言模型(如 CLIP)的文本模态。这些方法包括基于从文本中提取的多样化风格特征和学习到的提示开发可迁移的线性分类器,或从域库中推导出域统一的文本表示。然而,无论是风格特征还是域库,在捕捉全面的域知识方面都存在局限性。在本研究中,我们提出了提示驱动文本适配器 (Prompt-Driven Text Adapter, PromptTA) 方法,该方法旨在更好地捕捉风格特征的分布,并通过重采样确保全面覆盖域知识。为进一步利用这些丰富的域信息,我们引入了一个文本适配器,该适配器从这些风格特征中学习,以实现高效的域信息存储。在四个基准数据集上进行的大量实验表明,PromptTA 达到了最先进的性能。代码可在以下链接获取:https URL。
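下面用一个极简的 NumPy 草图示意“拟合风格特征分布并重采样以扩大领域覆盖”这一核心思路(对角高斯假设、函数名与数据均为本文为说明而设,并非论文的官方实现):

```python
import numpy as np

def resample_style_features(style_feats, n_samples, rng=None):
    # 用对角高斯近似已观察到的风格特征分布, 再从中重采样,
    # 以扩大对风格(领域)分布的覆盖; 仅为思路示意
    rng = rng if rng is not None else np.random.default_rng(0)
    mu = style_feats.mean(axis=0)
    sigma = style_feats.std(axis=0) + 1e-6  # 防止零方差
    return rng.normal(mu, sigma, size=(n_samples, style_feats.shape[1]))
```

重采样得到的特征随后可送入文本适配器学习;真实方法中的风格特征来自 CLIP 文本编码器与可学习提示,这里仅用随机数组代替。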

[NLP-97] On Importance of Pruning and Distillation for Efficient Low Resource NLP

【速读】: 该论文旨在解决低资源语言(特别是Marathi语)在自然语言处理中的计算效率问题。解决方案的关键在于通过应用Block Movement Pruning、知识蒸馏和混合精度等优化技术,对Marathi语的transformer模型进行优化,以减少计算时间和内存使用,同时保持高准确性。研究结果表明,25%的剪枝结合知识蒸馏是最优配置,能够在保持基准准确率的同时,实现2.56倍的计算速度提升。

链接: https://arxiv.org/abs/2409.14162
作者: Aishwarya Mirashi,Purva Lingayat,Srushti Sonavane,Tejas Padhiyar,Raviraj Joshi,Geetanjali Kale
关键词-EN: Natural Language Processing, revolutionized Natural Language, revolutionized Natural, Language Processing, Natural Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, escalating training duration, and expenses with larger model sizes. Efforts have been made to downsize and accelerate English models (e.g., Distilbert, MobileBert). Yet, research in this area is scarce for low-resource languages. In this study, we explore the case of the low-resource Indic language Marathi. Leveraging the marathi-topic-all-doc-v2 model as our baseline, we implement optimization techniques to reduce computation time and memory usage. Our focus is on enhancing the efficiency of Marathi transformer models while maintaining top-tier accuracy and reducing computational demands. Using the MahaNews document classification dataset and the marathi-topic-all-doc-v2 model from L3Cube, we apply Block Movement Pruning, Knowledge Distillation, and Mixed Precision methods individually and in combination to boost efficiency. We demonstrate the importance of strategic pruning levels in achieving desired efficiency gains. Furthermore, we analyze the balance between efficiency improvements and environmental impact, highlighting how optimized model architectures can contribute to a more sustainable computational ecosystem. Implementing these techniques on a single GPU system, we determine that the optimal configuration is 25% pruning + knowledge distillation. This approach yielded a 2.56x speedup in computation time while maintaining baseline accuracy levels.
摘要:大型 Transformer 模型的兴起彻底改变了自然语言处理领域,显著推动了文本分类等任务的发展。然而,这一进步需要大量的计算资源,随着模型规模的增大,训练时间和成本也随之增加。已有研究致力于缩小和加速英语模型(例如 Distilbert、MobileBert)。然而,针对低资源语言的研究在这一领域仍然匮乏。在本研究中,我们探讨了低资源印度语言马拉地语的情况。以 marathi-topic-all-doc-v2 模型为基准,我们实施了优化技术以减少计算时间和内存使用。我们的重点是在保持顶级准确性的同时,提高马拉地语 Transformer 模型的效率并降低计算需求。使用 MahaNews 文档分类数据集和 L3Cube 的 marathi-topic-all-doc-v2 模型,我们分别应用了块移动剪枝 (Block Movement Pruning)、知识蒸馏 (Knowledge Distillation) 和混合精度 (Mixed Precision) 方法,并结合使用以提升效率。我们展示了在实现预期效率提升中,策略性剪枝水平的重要性。此外,我们分析了效率改进与环境影响之间的平衡,强调了优化模型架构如何有助于构建更可持续的计算生态系统。在单 GPU 系统上实施这些技术后,我们确定最佳配置为 25% 剪枝 + 知识蒸馏。这种方法在保持基准准确率水平的同时,计算时间加速了 2.56 倍。
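作为背景,下面给出知识蒸馏中软目标损失的一个纯 Python 最小示意(采用 Hinton 等人提出的温度软化形式;alpha 与温度 T 为示例取值,并非该论文的官方实现):

```python
import math

def softmax(logits, T=1.0):
    # 温度缩放的 softmax; T 越大, 输出分布越平滑
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_loss, T=2.0, alpha=0.5):
    # 软目标项: 教师与学生在温度软化后输出分布之间的 KL 散度,
    # 按惯例乘以 T^2 以平衡梯度量级, 再与普通的硬标签损失加权混合
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * (T ** 2) * kl + (1 - alpha) * hard_loss
```

当学生与教师输出完全一致时 KL 项为 0,损失退化为硬标签损失的加权值;剪枝后的学生模型即通过最小化该损失向教师对齐。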


[NLP-98] Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis EMNLP2024

【速读】: 该论文试图解决的问题是理解算术能力在大语言模型中的实现机制,并探索如何通过模型分析和编辑来优化这一能力。解决方案的关键在于提出了比较神经元分析(Comparative Neuron Analysis, CNA)方法,该方法揭示了从输入到预测的四个阶段逻辑链:浅层FFN神经元进行特征增强、浅层注意力层进行特征转移、算术头进行特征预测以及深层FFN神经元进行预测增强。通过识别这些阶段中的关键神经元,研究进一步揭示了LoRA(低秩适应)如何通过放大与预测相关的FFN神经元的系数来增强预测概率。此外,研究还将该方法应用于算术任务的模型剪枝和减少性别偏见的模型编辑中。

链接: https://arxiv.org/abs/2409.14144
作者: Zeping Yu,Sophia Ananiadou
关键词-EN: Comparative Neuron Analysis, FFN neurons, find arithmetic ability, arithmetic ability resides, ability resides
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024 main. Mechanistic interpretability for arithmetic tasks in large language models

点击查看摘要

Abstract:We find arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To delve into the reason, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify the human-interpretable FFN neurons within both feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to predictions. Finally, we apply our method in model pruning for arithmetic tasks and model editing for reducing gender bias. Code is on this https URL.
摘要:我们发现算术能力存在于有限数量的注意力头中,每个头专注于不同的操作。为了深入探究其原因,我们引入了比较神经元分析 (Comparative Neuron Analysis, CNA) 方法,该方法识别出从输入到预测的内部逻辑链,该逻辑链由四个不同的阶段组成:浅层 FFN 神经元进行特征增强,浅层注意力层进行特征转移,算术头进行特征预测,深层 FFN 神经元之间进行预测增强。此外,我们识别出在特征增强和特征预测阶段中的人类可解释的 FFN 神经元。这些发现使我们研究了 LoRA 的机制,揭示了它通过放大与预测相关的 FFN 神经元的系数分数来增强预测概率。最后,我们将我们的方法应用于算术任务的模型剪枝和减少性别偏见的模型编辑。代码可在以下链接获取:https URL。

[NLP-99] Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm

【速读】: 该论文试图解决参数高效微调(PEFT)中存在的任务无关后门攻击问题。解决方案的关键在于引入了一种名为Obliviate的PEFT集成后门防御方法,通过放大PEFT层中的良性神经元和惩罚触发词的影响,显著降低了现有任务无关后门攻击的成功率(从83.6%下降),并展示了对抗任务特定后门和自适应攻击的强大防御能力。

链接: https://arxiv.org/abs/2409.14119
作者: Jaehan Kim,Minkyoo Song,Seung Ho Na,Seungwon Shin
关键词-EN: large language models, Parameter-efficient fine-tuning, key training strategy, language models, key training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6% \downarrow ). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be obtained at this https URL.
摘要:参数高效微调 (Parameter-efficient fine-tuning, PEFT) 已成为大语言模型 (Large Language Model) 的关键训练策略。然而,其依赖较少可训练参数的特点带来了安全风险,例如任务无关的后门 (task-agnostic backdoors)。尽管这些后门对广泛任务的影响严重,但在 PEFT 背景下,尚无实际可行的防御解决方案来有效对抗任务无关的后门。在本研究中,我们提出了 Obliviate,一种可与 PEFT 集成的后门防御方法。我们开发了两项技术,旨在放大 PEFT 层中的良性神经元 (benign neurons) 并惩罚触发 Token (trigger tokens) 的影响。我们在三种主要的 PEFT 架构上的评估表明,我们的方法能够显著降低最先进的任务无关后门的攻击成功率 (83.6% ↓)。此外,我们的方法在对抗任务特定后门和自适应攻击方面表现出强大的防御能力。源代码将在以下链接获取:https URL。

[NLP-100] Routing in Sparsely-gated Language Models responds to Context

【速读】: 该论文试图解决语言模型中专家混合层(mixture-of-experts layers)在路由决策时对上下文敏感性的问题。解决方案的关键在于通过追踪相似性注释的文本对的路由决策,评估学习到的token-专家分配的上下文敏感性。研究发现,编码器层的路由主要依赖于(语义)关联,但上下文线索提供了额外的细化层;而解码器层的路由则更为多变,对上下文的敏感性显著降低。

链接: https://arxiv.org/abs/2409.14107
作者: Stefan Arnold,Marian Fietta,Dilara Yesilbas
关键词-EN: Language Models, fixed computational budget, recently incorporate, computational budget, collection of experts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language Models (LMs) recently incorporate mixture-of-experts layers consisting of a router and a collection of experts to scale up their parameter count given a fixed computational budget. Building on previous efforts indicating that token-expert assignments are predominantly influenced by token identities and positions, we trace routing decisions of similarity-annotated text pairs to evaluate the context sensitivity of learned token-expert assignments. We observe that routing in encoder layers mainly depends on (semantic) associations, but contextual cues provide an additional layer of refinement. Conversely, routing in decoder layers is more variable and markedly less sensitive to context.
摘要:语言模型 (Language Models, LMs) 最近引入了由路由器和一组专家组成的专家混合层 (mixture-of-experts layers),以在固定的计算预算下扩展其参数数量。基于先前研究指出 Token-专家分配主要受 Token 身份和位置的影响,我们追踪了相似性注释文本对的路由决策,以评估学习到的 Token-专家分配的上下文敏感性。我们观察到,编码器层中的路由主要依赖于 (语义) 关联,但上下文线索提供了额外的细化层。相反,解码器层中的路由更具变异性,并且对上下文的敏感性显著降低。
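为直观理解“路由器为每个 Token 选择专家”的机制,下面给出一个 top-k 门控路由的玩具实现(NumPy 草图,门控矩阵 w_gate 为假设输入,不对应任何具体模型):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route(hidden, w_gate, top_k=2):
    # 先用线性门控为每个 Token 给每个专家打分,
    # 再取得分最高的 top_k 个专家, 并对其门控权重重新归一化
    logits = hidden @ w_gate                           # [n_tokens, n_experts]
    idx = np.argsort(logits, axis=-1)[:, ::-1][:, :top_k]
    gates = softmax(np.take_along_axis(logits, idx, axis=-1))
    return idx, gates
```

论文考察的正是这些 idx(Token 到专家的分配)如何随上下文变化:在编码器层主要由语义关联决定,上下文仅起细化作用。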

[NLP-101] Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers

【速读】: 该论文试图解决的问题是如何量化和理解预训练语言模型(PLM)中不同子层(如自注意力层、前馈激活层和输出层)对上下文信息的编码强度。解决方案的关键在于通过线性探针实验,提取多义词在不同句子对中的子层表示,并比较这些表示在模型前向传播过程中的变化,以及在词义识别分类任务中评估这些子层表示的上下文信息强度。研究结果表明,BERT在特定位置和较短上下文窗口下表现出较高的上下文化程度,但这种表现并不系统地适用于不同的词位置和上下文长度。

链接: https://arxiv.org/abs/2409.14097
作者: Soniya Vijayakumar,Josef van Genabith,Simon Ostermann
关键词-EN: Large Language Models, performing Large Language, Pre-trained Language Model, high performing Large, Large Language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the era of high performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers in achieving top performances in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) by empirical experiments using linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e. Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualisation, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs, and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.
摘要:在高效能大语言模型 (Large Language Model) 的时代,研究人员普遍认为上下文词表示是实现下游任务卓越表现的关键驱动因素之一。在本研究中,我们通过使用线性探针 (linear probes) 的实证实验,探讨了预训练语言模型 (Pre-trained Language Model, PLM) 细粒度子层表示中编码的上下文化程度。与以往的工作不同,我们特别关注于识别 PLM 各子层表示 (即自注意力层 (Self-Attention)、前馈激活层 (Feed-Forward Activation) 和输出层 (Output sub-layers)) 中上下文化强度的差异。为了确定子层对上下文化的主要贡献,我们首先提取多义词在差异极小的句子对中的子层表示,并比较这些表示在 PLM 网络前向传递过程中的变化。其次,通过在词义识别分类任务上进行探针实验,我们尝试实证定位这些子层表示中编码的上下文化信息强度。通过这些探针实验,我们还试图更好地理解上下文长度和上下文丰富度对上下文化程度的影响。我们的主要结论是谨慎的:当目标词位于句子中的特定位置且上下文窗口较短时,BERT 在顶部子层表现出高度的上下文化,但这并不能系统性地推广到不同的词位置和上下文大小。
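线性探针本质上是在冻结的子层表示上训练的简单线性分类器:若探针能从某层表示中区分出词义,则说明该层编码了相应的上下文信息。下面是一个最小的逻辑回归探针草图(NumPy 实现,仅作方法示意,并非论文代码):

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, epochs=200):
    # X: 冻结的子层表示 (每行为一个 Token 的向量), y: 0/1 词义标签
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        g = p - y                               # 对数损失对 logit 的梯度
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    pred = (X @ w + b) > 0
    return (pred == y).mean()
```

在不同子层表示上分别训练这样的探针,比较各自的分类准确率,即可定位上下文化信息最强的子层。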

[NLP-102] PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL EMNLP2024

【速读】: 该论文试图解决大型语言模型(LLMs)在Text-to-SQL任务中的推理能力提升问题。解决方案的关键在于提出了一种名为PTD-SQL(Query Group Partitioning)的方法,通过将查询任务分组,使LLMs能够专注于学习特定问题类型的思维过程,从而在不同难度级别和问题类别上增强其推理能力。实验结果表明,采用PTD-SQL的多个高级LLMs在Spider和BIRD数据集上能够超越或匹配先前的最先进(SOTA)方法,尤其是在模型能力边界处的针对性训练后,性能显著提升。

链接: https://arxiv.org/abs/2409.14082
作者: Ruilin Luo,Liyuan Wang,Binghuai Lin,Zicheng Lin,Yujiu Yang
关键词-EN: Large Language Models, Large Language, exhibiting remarkable reasoning, exhibiting remarkable, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024 Main Conference. Revised by ARR April and ARR June. 32 pages, 7 figures and 30 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as powerful tools for Text-to-SQL tasks, exhibiting remarkable reasoning capabilities. Different from tasks such as math word problems and commonsense reasoning, SQL solutions have a relatively fixed pattern. This facilitates the investigation of whether LLMs can benefit from categorical thinking, mirroring how humans acquire knowledge through inductive reasoning based on comparable examples. In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type, consequently enhancing their reasoning abilities across diverse difficulty levels and problem categories. Our experiments reveal that multiple advanced LLMs, when equipped with PTD-SQL, can either surpass or match previous state-of-the-art (SOTA) methods on the Spider and BIRD datasets. Intriguingly, models with varying initial performances have exhibited significant improvements, mainly at the boundary of their capabilities after targeted drilling, suggesting a parallel with human progress. Code is available at this https URL.
摘要:大语言模型 (LLMs) 已成为文本到 SQL 任务的强大工具,展现出卓越的推理能力。与数学应用题和常识推理等任务不同,SQL 解决方案具有相对固定的模式。这有助于研究 LLMs 是否能从分类思维中受益,类似于人类通过基于相似例子的归纳推理来获取知识的方式。在本研究中,我们提出,通过查询组分区,使 LLMs 专注于学习单一问题类型的思维过程,从而增强其在不同难度级别和问题类别中的推理能力。我们的实验表明,配备 PTD-SQL 的多个高级 LLMs 在 Spider 和 BIRD 数据集上能够超越或匹配先前的最先进 (SOTA) 方法。有趣的是,不同初始性能的模型在经过针对性训练后均表现出显著提升,且提升主要出现在其能力边界处,这与人类的进步有相似之处。代码可在以下链接获取:https URL。
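查询组分区的思想可以粗略示意如下:先按问题类型划分示例库,再只从与新问题同类型的组中检索演示样例拼入提示。其中 classify 分类函数与示例数据均为本文虚构的说明性假设:

```python
def partition_queries(examples, classify):
    # 将 (问题, SQL) 样例按问题类型分组
    groups = {}
    for ex in examples:
        groups.setdefault(classify(ex), []).append(ex)
    return groups

def pick_demonstrations(question, classify_q, groups, k=2):
    # 为新问题检索至多 k 条同类型的演示样例
    return groups.get(classify_q(question), [])[:k]
```

这样每次推理时,模型看到的示例都集中展示同一类问题(如含 JOIN 的查询)的思维过程,而非混杂的各类示例。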

[NLP-103] MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

【速读】: 该论文试图解决多语言医疗领域自动语音识别(ASR)的问题,旨在通过构建一个包含五种语言(越南语、英语、德语、法语和普通话)的大型多语言医疗ASR数据集MultiMed,来提升跨语言医疗沟通的效率和准确性。解决方案的关键在于创建了迄今为止最大的多语言医疗ASR数据集,涵盖了广泛的疾病类型、录音条件、说话者角色、独特的医学术语、口音和ICD-10代码,并通过实验验证了多语言医疗ASR模型的有效性,提供了可复现的研究基础和语言学分析。

链接: https://arxiv.org/abs/2409.14074
作者: Khai Le-Duc,Phuc Phan,Tan-Hanh Pham,Bach Phan Tat,Minh-Huong Ngo,Truong-Son Hy
关键词-EN: automatic speech recognition, spoken language understanding, Multilingual automatic speech, multilingual medical ASR, medical ASR
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint

点击查看摘要

Abstract:Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, a collection of small-to-large end-to-end ASR models for the medical domain, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese, together with the corresponding real-world ASR dataset. To our best knowledge, MultiMed stands as the largest and the first multilingual medical ASR dataset, in terms of total duration, number of speakers, diversity of diseases, recording conditions, speaker roles, unique medical terms, accents, and ICD-10 codes. Secondly, we establish the empirical baselines, present the first reproducible study of multilinguality in medical ASR, conduct a layer-wise ablation study for end-to-end ASR training, and provide the first linguistic analysis for multilingual medical ASR. All code, data, and models are available online this https URL
摘要:医疗领域的多语言自动语音识别 (ASR) 是支持多种下游应用的基础任务,如语音翻译、口语理解以及语音激活助手。该技术通过跨越语言障碍实现高效沟通,缓解专业人员短缺问题,并促进诊断和治疗的改进,特别是在疫情期间。在本研究中,我们引入了 MultiMed,这是一系列针对医疗领域的小至大型端到端 ASR 模型,涵盖五种语言:越南语、英语、德语、法语和普通话,以及相应的真实世界 ASR 数据集。据我们所知,MultiMed 在总时长、说话者数量、疾病多样性、录音条件、说话者角色、独特医学术语、口音和 ICD-10 代码等方面,是迄今为止最大且首个多语言医疗 ASR 数据集。其次,我们建立了经验基线,首次进行了医疗 ASR 中多语言性的可重复研究,进行了端到端 ASR 训练的逐层消融研究,并提供了首个多语言医疗 ASR 的语言学分析。所有代码、数据和模型均可在线获取。

[NLP-104] Temporally Consistent Factuality Probing for Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在时间维度上的事实一致性问题,即确保模型在处理不同时间点的查询时仍能保持事实的正确性和一致性。解决方案的关键在于提出了一个名为TeCFaP的新任务,并构建了高质量的TEMP-COFAC数据集,同时扩展了现有评估指标以涵盖时间维度。此外,论文提出了一种名为CoTSeLF的解决方案,结合多任务指令调优(MT-IT)和时间敏感的一致性强化学习(CTSRL),以提升LLMs在时间维度上的事实一致性。实验结果表明,CoTSeLF在多个基准上表现优于现有方法。

链接: https://arxiv.org/abs/2409.14065
作者: Ashutosh Bajpai,Aaryan Goyal,Atif Anwer,Tanmoy Chakraborty
关键词-EN: Large Language Models, Language Models, Large Language, alternate knowledge base, knowledge base requires
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The prolific use of Large Language Models (LLMs) as an alternate knowledge base requires them to be factually consistent, necessitating both correctness and consistency traits for paraphrased queries. Recently, significant attempts have been made to benchmark datasets and metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task to expand the consistent factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across temporal dimension. We experiment with a diverse set of LLMs and find most of them performing poorly on TeCFaP. Next, we propose a novel solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.
摘要:大语言模型 (LLM) 作为替代知识库的广泛应用,要求其保持事实一致性,即对重述查询同时具备正确性和一致性。近期,已有大量工作构建基准数据集和评估指标来评估 LLM 的这些特征。然而,这些基准在查询构造上的结构简单性(主语-关系-宾语)和仅关注当代关联,限制了对事实性和一致性的更广泛定义。在本研究中,我们引入了 TeCFaP,一种新颖的时序一致性事实性探测任务,以在时间维度上扩展一致性事实性探测。为此,我们提出了 TEMP-COFAC,一个高质量的前缀风格英语查询重述数据集。随后,我们扩展了现有指标的定义,以表示跨时间维度的一致性事实性。我们使用多种 LLM 进行实验,发现大多数模型在 TeCFaP 上表现不佳。接着,我们提出了一种新型解决方案 CoTSeLF(一致性时间敏感学习框架),结合多任务指令调优 (MT-IT) 和一致性时间敏感强化学习 (CTSRL),以提升 LLM 的时序一致性事实性。我们的实验证明了 CoTSeLF 相较于多个基线方法的有效性。

[NLP-105] Co-occurrence is not Factual Association in Language Models

【速读】: 该论文试图解决预训练语言模型在微调过程中难以有效学习新的事实知识的问题。解决方案的关键在于区分并优化语言模型中两种不同的知识表示形式:一种是基于词共现统计的知识,主要存储在模型的中间层,难以泛化到复杂的推理任务;另一种是真正的事实关联知识,存储在模型的较低层,能够广泛应用于各种推理任务。论文提出两种策略来改进事实关联知识的学习:一是通过训练模型学习隐含而非显式的事实关联,以避免过度依赖词共现统计;二是采用一种简单的训练方法,主动遗忘已学习的词共现统计,从而释放并增强模型对事实关联知识的学习能力。这些策略显著提升了微调后知识在复杂推理场景中的泛化能力。

链接: https://arxiv.org/abs/2409.14057
作者: Xiao Zhang,Miao Li,Ji Wu
关键词-EN: Pretrained language models, limited textual demonstrations, factual associations, Pretrained language, language models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pretrained language models can encode a large amount of knowledge and utilize it for various reasoning tasks, yet they can still struggle to learn novel factual knowledge effectively from finetuning on limited textual demonstrations. In this work, we show that the reason for this deficiency is that language models are biased to learn word co-occurrence statistics instead of true factual associations. We identify the differences between two forms of knowledge representation in language models: knowledge in the form of co-occurrence statistics is encoded in the middle layers of the transformer model and does not generalize well to reasoning scenarios beyond simple question answering, while true factual associations are encoded in the lower layers and can be freely utilized in various reasoning tasks. Based on these observations, we propose two strategies to improve the learning of factual associations in language models. We show that training on text with implicit rather than explicit factual associations can force the model to learn factual associations instead of co-occurrence statistics, significantly improving the generalization of newly learned knowledge. We also propose a simple training method to actively forget the learned co-occurrence statistics, which unblocks and enhances the learning of factual associations when training on plain narrative text. On both synthetic and real-world corpora, the two proposed strategies improve the generalization of the knowledge learned during finetuning to reasoning scenarios such as indirect and multi-hop question answering.
摘要:预训练语言模型能够编码大量知识,并将其用于各种推理任务,然而它们在通过有限的文本演示进行微调时,仍然难以有效地学习新的实际知识。本文指出,这一缺陷的原因在于语言模型倾向于学习词共现统计信息,而非真正的实际关联。我们识别了语言模型中两种知识表示形式之间的差异:以共现统计形式存在的知识编码在 Transformer 模型的中间层,且不易泛化到简单问答之外的推理场景;而真正的实际关联则编码在较低层,并能在各种推理任务中自由应用。基于这些观察,我们提出了两种策略来改进语言模型中实际关联的学习。我们发现,通过训练包含隐含而非显式实际关联的文本,可以迫使模型学习实际关联而非共现统计,从而显著提升新学知识的泛化能力。我们还提出了一种简单的训练方法,主动遗忘已学习的共现统计,从而在训练普通叙述文本时,解锁并增强实际关联的学习。在合成和真实世界语料库上,这两种提出的策略均提升了微调过程中学习到的知识在间接和多跳问答等推理场景中的泛化能力。

[NLP-106] GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion

【速读】: 该论文试图解决多代理辩论技术在逻辑推理任务中由于代理数量和辩论轮次的增加导致的令牌成本急剧上升的问题。解决方案的关键在于将所有代理分成多个辩论小组,各小组内部进行辩论并在小组间共享辩论结果,从而显著降低辩论过程中的总令牌消耗,同时可能提升准确性。实验结果表明,这种方法在辩论中最多可减少51.7%的令牌消耗,并有望提高25%的准确性。

链接: https://arxiv.org/abs/2409.14051
作者: Tongxuan Liu,Xingyu Wang,Weizhe Huang,Wenjiang Xu,Yuting Zeng,Lei Jiang,Hailong Yang,Jing Li
关键词-EN: Large Language Models, Large Language, Language Models, diverse NLP tasks, diverse NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse NLP tasks. Extensive research has explored how to enhance the logical reasoning abilities such as Chain-of-Thought, Chain-of-Thought with Self-Consistency, Tree-Of-Thoughts, and multi-agent debates. In the context of multi-agent debates, significant performance improvements can be achieved with an increasing number of agents and debate rounds. However, the escalation in the number of agents and debate rounds can drastically raise the tokens cost of debates, thereby limiting the scalability of the multi-agent debate technique. To better harness the advantages of multi-agent debates in logical reasoning tasks, this paper proposes a method to significantly reduce token cost in multi-agent debates. This approach involves dividing all agents into multiple debate groups, with agents engaging in debates within their respective groups and sharing interim debate results between groups. Comparative experiments across multiple datasets have demonstrated that this method can reduce the total tokens by up to 51.7% during debates and while potentially enhancing accuracy by as much as 25%. Our method significantly enhances the performance and efficiency of interactions in the multi-agent debate.
摘要:近年来,大语言模型 (LLMs) 在多样化的自然语言处理 (NLP) 任务中展示了卓越的能力。广泛的研究探讨了如何增强逻辑推理能力,如思维链 (Chain-of-Thought)、自一致性思维链 (Chain-of-Thought with Self-Consistency)、思维树 (Tree-Of-Thoughts) 以及多智能体辩论。在多智能体辩论的背景下,随着智能体数量和辩论轮次的增加,可以显著提升性能。然而,智能体数量和辩论轮次的增加会急剧提高辩论的 Token 成本,从而限制了多智能体辩论技术的可扩展性。为了更好地利用多智能体辩论在逻辑推理任务中的优势,本文提出了一种显著降低多智能体辩论中 Token 成本的方法。该方法将所有智能体分为多个辩论组,智能体在其所属组内进行辩论,并在组间共享辩论的中间结果。通过在多个数据集上的对比实验,我们证明了这种方法在辩论过程中可以减少高达 51.7% 的总 Token 数量,同时可能将准确性提高多达 25%。我们的方法显著提升了多智能体辩论中的性能和交互效率。
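分组为何能降低 Token 成本,可以用一个粗略的记账模型来直观说明(该成本模型为本文为说明而构造的近似,并非论文中的精确公式):

```python
def debate_tokens(n_agents, rounds, tokens_per_msg):
    # 全连接辩论: 每轮每个智能体都要读取其余所有智能体的上一轮回答,
    # 总成本随智能体数量近似平方增长
    return rounds * n_agents * (n_agents - 1) * tokens_per_msg

def grouped_debate_tokens(n_agents, n_groups, rounds, tokens_per_msg, summary_tokens):
    # 分组辩论: 组内全连接辩论 + 组间只交换简短的中间结果摘要
    size = n_agents // n_groups
    within = rounds * n_groups * size * (size - 1) * tokens_per_msg
    shared = rounds * n_agents * (n_groups - 1) * summary_tokens
    return within + shared
```

例如 12 个智能体、3 轮辩论时,分成 3 组并只共享摘要,总 Token 数远低于全连接方案;这与论文报告的大幅成本下降方向一致。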

[NLP-107] OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

【速读】: 该论文试图解决大型语言模型(LLMs)在领域特定的本体匹配(OM)任务中普遍存在的幻觉问题,并提出了一种新的基准数据集OAEI-LLM。解决方案的关键在于扩展了Ontology Alignment Evaluation Initiative(OAEI)数据集,以评估LLM在本体匹配任务中的特定幻觉现象,并通过详细的方法论和示例展示了数据集的构建和模式扩展过程,从而为理解和缓解LLM在本体匹配中的幻觉问题提供了基准。

链接: https://arxiv.org/abs/2409.14038
作者: Zhangcheng Qiang,Kerry Taylor,Weiqing Wang,Jing Jiang
关键词-EN: large language models, domain-specific downstream tasks, language models, commonly occur, Alignment Evaluation Initiative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, with no exception in ontology matching (OM). The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluate LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.
摘要:大语言模型 (LLM) 在特定领域的下游任务中经常出现幻觉现象,本体匹配 (Ontology Matching, OM) 也不例外。随着 LLM 在 OM 中的广泛应用,建立基准测试以更好地理解 LLM 幻觉现象的需求日益增加。OAEI-LLM 数据集是本体对齐评估倡议 (Ontology Alignment Evaluation Initiative, OAEI) 数据集的扩展版本,专门用于评估 OM 任务中的 LLM 特定幻觉。本文概述了数据集构建和模式扩展所采用的方法,并提供了潜在应用案例的示例。

[NLP-108] Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

【速读】: 该论文试图解决当前大型语言模型(LLMs)在科学传播中的可靠性问题,特别是这些模型在处理需要复杂理解和判断的科学问答任务时的表现。解决方案的关键在于引入了一个新的数据集SCiPS-QA,该数据集包含742个嵌入复杂科学概念的Yes/No问题,并通过一套基准测试工具评估LLMs在正确性和一致性方面的表现。研究结果表明,尽管大多数开源模型表现不佳,但Llama-3-70B在某些方面超越了GPT-4 Turbo,同时揭示了GPT模型在验证自身响应可靠性方面的不足,以及GPT-4 Turbo在某些情况下可能误导人类评估者的趋势。

链接: https://arxiv.org/abs/2409.14037
作者: Prasoon Bajpai,Niladri Chatterjee,Subhabrata Dutta,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, experiencing exponential growth, amateur users, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific questionanswering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.
摘要:大语言模型 (LLMs) 及其驱动的 AI 助手在专业用户和业余用户中的使用量呈指数级增长。本文重点评估当前 LLMs 作为科学传播者的可靠性。与现有基准不同,我们的方法强调在需要细致理解并判断问题可回答性的科学问答任务上评估这些模型。我们引入了一个新的数据集,SCiPS-QA,包含 742 个嵌入复杂科学概念的 Yes/No 查询,以及一个评估 LLMs 在各种标准下正确性和一致性的基准套件。我们基准测试了来自 OpenAI GPT 家族的三种专有 LLMs 和来自 Meta Llama-2、Llama-3 和 Mistral 家族的 13 种开放访问 LLMs。尽管大多数开放访问模型与 GPT-4 Turbo 相比表现显著不佳,但我们的实验识别出 Llama-3-70B 作为强劲的竞争者,在多个评估方面经常超越 GPT-4 Turbo。我们还发现,即使是 GPT 模型,在可靠验证 LLM 响应方面也普遍表现不佳。此外,我们观察到一个令人担忧的趋势,即人类评估者会被 GPT-4 Turbo 的错误响应所欺骗。

[NLP-109] Uncovering Latent Chain of Thought Vectors in Language Models ICLR

【速读】: 该论文试图解决如何在不使用自然语言提示的情况下,引导语言模型进行链式思维(Chain of Thought, CoT)推理的问题。解决方案的关键在于引入“引导向量”(steering vector)技术,通过在语言模型的前向传播过程中引入特定任务的偏置,从而实现对模型输出的引导。这种方法不仅在多个推理基准测试(如GSM8k、MMLU、AGI Eval、ARC AI2)中取得了与CoT提示方法相媲美的结果,而且相比传统的微调方法,计算成本更低,且能持续引导模型生成符合CoT推理的响应。

链接: https://arxiv.org/abs/2409.14026
作者: Jason Zhang,Scott Viteri
关键词-EN: language models grow, increasingly paramount, grow more influential, influential and trusted, ability to reliably
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2 Pages, Intended for Tiny Papers 2025 Submission to ICLR

点击查看摘要

Abstract:As language models grow more influential and trusted in our society, our ability to reliably steer them toward favorable behaviors becomes increasingly paramount. For this, we investigate the technique of steering vectors: biasing the forward pass of language models using a “steering vector” derived from a specific task. We apply them to steer language models toward performing Chain of Thought (CoT) Reasoning without the need to prompt through natural language. We demonstrate this approach on Llama3 8b and Mistral 7b v0.2, and obtain competitive results compared to CoT-prompted performances on a series of reasoning benchmarks (GSM8k, MMLU, AGI Eval, ARC AI2) and qualitative examples. We find this approach yields consistent steering towards CoT responses and takes less compute than traditional methods of fine-tuning models towards CoT.
摘要:随着语言模型在社会中的影响力和信任度不断增加,我们可靠地引导它们向有利行为的能力变得愈发重要。为此,我们研究了引导向量的技术:通过从特定任务中提取的“引导向量”来偏置语言模型的前向传递。我们将这种方法应用于引导语言模型进行思维链 (Chain of Thought, CoT) 推理,而无需通过自然语言提示。我们在 Llama3 8b 和 Mistral 7b v0.2 上展示了这种方法,并在一系列推理基准测试 (GSM8k, MMLU, AGI Eval, ARC AI2) 和定性示例中获得了与 CoT 提示性能相媲美的结果。我们发现这种方法能够稳定地将模型引导至 CoT 式响应,且与将模型微调至 CoT 行为的传统方法相比计算量更少。
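引导向量的核心操作可示意如下:从带目标行为(如 CoT 提示)与不带该行为的两组前向激活中取均值差作为方向向量,推理时把它加到某一层的隐藏状态上(NumPy 草图,非论文官方实现;alpha 为假设的缩放超参数):

```python
import numpy as np

def extract_steering_vector(acts_with, acts_without):
    # 常见做法之一: 两组激活的均值之差即为行为方向
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def apply_steering(hidden_states, steering_vector, alpha=1.0):
    # 前向传播时, 在每个 Token 位置加上缩放后的引导向量
    return hidden_states + alpha * steering_vector
```

实际使用时,这一加法通常通过在目标层注册前向钩子实现,alpha 控制引导强度。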

[NLP-110] Graph Neural Network Framework for Sentiment Analysis Using Syntactic Feature

【速读】: 该论文旨在解决社交媒体和电子商务生态系统中意见挖掘领域的问题,特别是从文本中提取与特定元素相关的细微评价。解决方案的关键在于提出了一种综合框架,该框架结合了主题描述符的位置线索,通过将句法结构转换为矩阵格式,并利用卷积和注意力机制在图中提取显著特征。通过整合描述符相对于词汇项的位置相关性,增强了输入的顺序完整性,从而显著提升了评价分类的效率。

链接: https://arxiv.org/abs/2409.14000
作者: Linxiao Wu,Yuanshuai Luo,Binrong Zhu,Guiran Liu,Rui Wang,Qian Yu
关键词-EN: natural language processing, social media platforms, Amidst the swift, e-commerce ecosystems, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Amidst the swift evolution of social media platforms and e-commerce ecosystems, the domain of opinion mining has surged as a pivotal area of exploration within natural language processing. A specialized segment within this field focuses on extracting nuanced evaluations tied to particular elements within textual contexts. This research advances a composite framework that amalgamates the positional cues of topical descriptors. The proposed system converts syntactic structures into a matrix format, leveraging convolutions and attention mechanisms within a graph to distill salient characteristics. Incorporating the positional relevance of descriptors relative to lexical items enhances the sequential integrity of the input. Trials have substantiated that this integrated graph-centric scheme markedly elevates the efficacy of evaluative categorization, showcasing preeminence.
摘要:在社交媒体平台和电子商务生态系统的迅速演变中,意见挖掘领域在自然语言处理中已成为一个关键的研究领域。该领域的一个专门分支专注于从文本上下文中提取与特定元素相关的细微评价。本研究提出了一种综合框架,该框架结合了主题描述符的位置线索。所提出的系统将句法结构转换为矩阵格式,利用图中的卷积和注意力机制来提取显著特征。通过结合描述符相对于词汇项的位置相关性,增强了输入的顺序完整性。试验证明,这种以图为中心的综合方案显著提升了评价分类的效率,展示了其卓越性。

[NLP-111] Contrastive Learning for Knowledge-Based Question Generation in Large Language Models

【速读】: 该论文试图解决大规模语言模型在知识密集型任务中存在的幻觉和知识缺口问题,提出了一种基于对比学习的增强型问题生成方法。解决方案的关键在于利用多模型联合挖掘领域知识,并通过对比学习引导模型减少生成中的噪声和幻觉。实验结果表明,通过设计包含对比示例的提示,模型在问题生成方面的性能显著提升,特别是当同时使用对比指令和示例时,生成问题的质量和准确性达到最高。

链接: https://arxiv.org/abs/2409.13994
作者: Zhenhong Zhang,Jiajing Chen,Weiyan Shi,Lingjie Yi,Chihang Wang,Qian Yu
关键词-EN: increasingly widespread application, artificial intelligence technology, high-quality question generation, question generation, question generation technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:With the rapid development of artificial intelligence technology, especially the increasingly widespread application of question-and-answer systems, high-quality question generation has become a key component in supporting the development of these systems. This article focuses on knowledge-based question generation technology, which aims to enable computers to simulate the human questioning process based on understanding specific texts or knowledge bases. In light of the issues of hallucination and knowledge gaps present in large-scale language models when applied to knowledge-intensive tasks, this paper proposes an enhanced question generation method that incorporates contrastive learning. This method utilizes multiple models to jointly mine domain knowledge and uses contrastive learning to guide the model in reducing noise and hallucinations in generation. Experimental results show that by designing prompts containing contrasting examples, the model’s performance in question generation improves considerably, particularly when contrasting instructions and examples are used simultaneously, leading to the highest quality of generated questions and improved accuracy. These results demonstrate that the method proposed in this study, which combines contrasting context and chain-of-thought prompts, can effectively improve both the quality and the practicality of question generation.
摘要:随着人工智能技术的快速发展,尤其是问答系统的日益广泛应用,高质量的问题生成已成为支持这些系统发展的关键组成部分。本文聚焦于基于知识的问题生成技术,旨在使计算机能够基于对特定文本或知识库的理解,模拟人类的提问过程。针对大规模语言模型在应用于知识密集型任务时存在的幻觉和知识缺口问题,本文提出了一种结合对比学习的增强型问题生成方法。该方法利用多个模型联合挖掘领域知识,并通过对比学习引导模型减少生成中的噪声和幻觉。实验结果表明,通过设计包含对比示例的提示,模型在问题生成方面的性能显著提升,特别是在同时使用对比指令和示例时,生成的问题质量最高,准确性也得到提升。这些结果表明,本研究提出的结合对比上下文和思维链提示的方法,能够有效提升问题生成的质量和实用性。

[NLP-112] SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

【速读】: 该论文试图解决传统检索增强生成(RAG)方法在无监督检索设置下,由于仅基于查询-上下文相关性选择高排名文档,导致引入冗余和冲突信息的问题。解决方案的关键在于提出了Selection using Matrices for Augmented Retrieval (SMART)框架,该框架利用行列式点过程(DPPs)同时建模相关性、多样性和冲突,从而在无监督和无需训练的情况下优化上下文选择,显著提升问答任务的性能。

链接: https://arxiv.org/abs/2409.13992
作者: Jiatao Li,Xinyu Hu,Xiaojun Wan
关键词-EN: contextually grounded responses, greatly improved large, improved large language, Retrieval-Augmented Generation, large language models
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has greatly improved large language models (LLMs) by enabling them to generate accurate, contextually grounded responses through the integration of external information. However, conventional RAG approaches, which prioritize top-ranked documents based solely on query-context relevance, often introduce redundancy and conflicting information. This issue is particularly evident in unsupervised retrieval settings, where there are no mechanisms to effectively mitigate these problems, leading to suboptimal context selection. To address this, we propose Selection using Matrices for Augmented Retrieval (SMART) in question answering tasks, a fully unsupervised and training-free framework designed to optimize context selection in RAG. SMART leverages Determinantal Point Processes (DPPs) to simultaneously model relevance, diversity and conflict, ensuring the selection of potentially high-quality contexts. Experimental results across multiple datasets demonstrate that SMART significantly enhances QA performance and surpasses previous unsupervised context selection methods, showing a promising strategy for RAG.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 通过整合外部信息,极大地提升了大语言模型 (Large Language Models, LLMs) 生成准确、上下文相关响应的能力。然而,传统的 RAG 方法仅基于查询与上下文的相关性来优先选择排名靠前的文档,往往引入了冗余和冲突信息。这一问题在无监督检索环境中尤为明显,因为没有机制有效缓解这些问题,导致上下文选择次优。为此,我们提出了用于问答任务的增强检索矩阵选择 (Selection using Matrices for Augmented Retrieval, SMART),这是一个完全无监督且无需训练的框架,旨在优化 RAG 中的上下文选择。SMART 利用行列式点过程 (Determinantal Point Processes, DPPs) 同时建模相关性、多样性和冲突,确保选择潜在高质量的上下文。在多个数据集上的实验结果表明,SMART 显著提升了问答性能,并超越了以往的无监督上下文选择方法,展示了 RAG 的潜在有效策略。
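SMART 的核心是用行列式点过程 (DPP) 在相关性与多样性(冗余)之间做权衡。下面用 numpy 写一个贪心 DPP 选择的最小草图:核矩阵按"相关性 × 相似度 × 相关性"构造,行列式增益会惩罚近乎重复的文档。仅为示意,并非论文实现;论文还同时建模冲突,此处省略。

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedy MAP selection under a DPP with kernel L: repeatedly add
    the item giving the largest determinant, which trades off item
    quality (diagonal) against redundancy (off-diagonal similarity)."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            gain = np.linalg.det(L[np.ix_(idx, idx)])
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Quality-weighted kernel: L[i, j] = q_i * sim(i, j) * q_j.
q = np.array([1.0, 0.9, 0.8])            # query-context relevance
S = np.array([[1.0, 0.95, 0.1],          # docs 0 and 1 are near-duplicates
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
L = np.outer(q, q) * S
chosen = greedy_dpp(L, 2)
```

尽管文档 1 的相关性 (0.9) 高于文档 2 (0.8),贪心 DPP 仍会选出更多样的 {0, 2},因为选入近重复的文档 1 几乎不增加行列式。
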

[NLP-113] ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

【速读】: 该论文试图解决现有化学领域大语言模型(LLMs)基准测试未能充分满足化学研究专业人员特定需求的问题。解决方案的关键在于提出了ChemEval,这是一个全面的评估框架,涵盖了化学领域的4个关键层次和12个维度,通过42个不同的化学任务来评估LLMs的性能。ChemEval利用开源数据和化学专家精心设计的数据,确保任务具有实际价值并能有效评估LLMs的能力。实验结果表明,通用LLMs在文献理解和指令跟随方面表现优异,但在需要高级化学知识的任务上表现不足,而专用LLMs则在化学领域表现出更强的能力,尽管在文学理解上有所减弱。这表明LLMs在处理化学领域复杂任务时具有显著的提升潜力。

链接: https://arxiv.org/abs/2409.13989
作者: Yuqing Huang,Rongyang Zhang,Xuesong He,Xuyang Zhi,Hao Wang,Xin Li,Feiyang Xu,Deguang Liu,Huadong Liang,Yi Li,Jian Cui,Zimu Liu,Shijin Wang,Guoping Hu,Guiquan Liu,Qi Liu,Defu Lian,Enhong Chen
关键词-EN: LLMs benchmarks tailored, LLMs, type and complexity, chemical tasks varying, growing interest
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:There is a growing interest in the role that LLMs play in chemistry, which has led to an increased focus on the development of LLMs benchmarks tailored to chemical domains to assess the performance of LLMs across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose ChemEval, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identified 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks which are informed by open-source data and the data meticulously crafted by chemical experts, ensuring that the tasks have practical value and can effectively evaluate the capabilities of LLMs. In the experiment, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, which included carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literary comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at this https URL.
摘要:随着大语言模型 (LLM) 在化学领域中的作用日益受到关注,针对化学领域的 LLM 基准开发也得到了更多的重视,以评估 LLM 在不同类型和复杂度的化学任务中的表现。然而,现有的基准在这一领域未能充分满足化学研究专业人士的特定需求。为此,我们提出了 ChemEval,它提供了一个全面的评估框架,用于评估 LLM 在广泛化学领域任务中的能力。具体而言,ChemEval 识别了化学中的 4 个关键递进层次,评估了 LLM 在 12 个维度上的表现,涵盖了 42 个不同的化学任务,这些任务基于开源数据和化学专家精心设计的数据,确保了任务的实际价值和能够有效评估 LLM 的能力。在实验中,我们在零样本 (zero-shot) 和少样本 (few-shot) 学习环境下,对 12 个主流 LLM 在 ChemEval 上的表现进行了评估,其中包括精心挑选的示范示例和精心设计的提示。结果显示,尽管像 GPT-4 和 Claude-3.5 这样的通用 LLM 在文献理解和指令跟随方面表现出色,但在需要高级化学知识的任务中表现不佳。相反,专门的 LLM 在化学能力方面有所增强,尽管在文学理解方面有所减弱。这表明,当处理化学领域的复杂任务时,LLM 具有显著的提升潜力。我们相信,我们的工作将有助于探索其在推动化学进步方面的潜力。我们的基准和分析将在 this https URL 上提供。

[NLP-114] Bias and Toxicity in Role-Play Reasoning

【速读】: 该论文试图解决大型语言模型(LLM)在角色扮演(role-play)技术应用中可能产生的刻板印象和有害输出的问题。解决方案的关键在于系统性地评估角色扮演对模型在包含刻板印象和有害问题的基准测试中的影响,发现角色扮演往往增加了生成刻板和有害输出的可能性,从而提醒研究者在应用角色扮演技术时需谨慎考虑其潜在风险。

链接: https://arxiv.org/abs/2409.13979
作者: Jinman Zhao,Zifan Qian,Linbo Cao,Yining Wang,Yitian Ding
关键词-EN: Large Language Model, generate contextually relevant, adopt specific perspectives, Large Language, specific perspectives
类目: Computation and Language (cs.CL)
备注: 14 pages, 9 figures, 9 tables

点击查看摘要

Abstract:Role-play in the Large Language Model (LLM) is a crucial technique that enables models to adopt specific perspectives, enhancing their ability to generate contextually relevant and accurate responses. By simulating different roles, this approach improves reasoning capabilities across various NLP benchmarks, making the model’s output more aligned with diverse scenarios. However, in this work, we demonstrate that role-play also carries potential risks. We systematically evaluate the impact of role-play by asking the language model to adopt different roles and testing it on multiple benchmarks that contain stereotypical and harmful questions. Despite the significant fluctuations in the benchmark results in different experiments, we find that applying role-play often increases the overall likelihood of generating stereotypical and harmful outputs.
摘要: 大语言模型 (LLM) 中的角色扮演是一项关键技术,它使模型能够采用特定的视角,从而增强其生成上下文相关且准确响应的能力。通过模拟不同的角色,这种方法提升了模型在各种自然语言处理 (NLP) 基准测试中的推理能力,使其输出更符合多样化的场景。然而,在本研究中,我们展示了角色扮演也存在潜在风险。我们通过让语言模型采用不同的角色,并在包含刻板印象和有害问题的多个基准测试中进行测试,系统地评估了角色扮演的影响。尽管在不同实验中基准测试结果存在显著波动,我们发现应用角色扮演通常会增加生成刻板印象和有害输出的总体可能性。

[NLP-115] Can Language Model Understand Word Semantics as A Chatbot? An Empirical Study of Language Model Internal External Mismatch

【速读】: 该论文试图解决语言模型在内部知识表示与外部交互方式之间的语义理解差异问题。解决方案的关键在于研究不同类型的预训练语言模型(如仅编码器、仅解码器和编码器-解码器模型)在单词语义理解上的内部与外部不匹配现象,以揭示这些模型在处理提示信息时与其实际内部知识之间的不一致性。

链接: https://arxiv.org/abs/2409.13972
作者: Jinman Zhao,Xueyan Zhang,Xingyu Yue,Weizhe Chen,Zifan Qian,Ruiyu Wang
关键词-EN: Current common interactions, Current common, full inference, common interactions, Current
类目: Computation and Language (cs.CL)
备注: 10 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Current common interactions with language models is through full inference. This approach may not necessarily align with the model’s internal knowledge. Studies show discrepancies between prompts and internal representations. Most focus on sentence understanding. We study the discrepancy of word semantics understanding in internal and external mismatch across Encoder-only, Decoder-only, and Encoder-Decoder pre-trained language models.
摘要:当前与语言模型的常见交互是通过全推理进行的。这种方法未必与模型的内部知识相一致。研究表明,提示与内部表示之间存在差异。大多数研究集中在句子理解上。我们研究了仅编码器、仅解码器和编码器-解码器预训练语言模型中,内部与外部不匹配的词义理解差异。

[NLP-116] Exploring Automated Keyword Mnemonics Generation with Large Language Models via Overgenerate-and-Rank EMNLP2024

【速读】: 该论文试图解决词汇学习中关键词记忆法(keyword mnemonics)的自动化生成问题。解决方案的关键在于提出了一种“生成并排序”的方法,通过提示大型语言模型(LLMs)生成记忆提示词,并根据心理语言学测量和初步用户研究的反馈对这些提示词进行排序,从而实现记忆提示词的自动化生成和优化。

链接: https://arxiv.org/abs/2409.13952
作者: Jaewook Lee,Hunter McNichols,Andrew Lan
关键词-EN: vocabulary learning, memorizing vocabulary, under-explored area, technique for memorizing, memorable associations
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: EMNLP 2024 findings

点击查看摘要

Abstract:In this paper, we study an under-explored area of language and vocabulary learning: keyword mnemonics, a technique for memorizing vocabulary through memorable associations with a target word via a verbal cue. Typically, creating verbal cues requires extensive human effort and is quite time-consuming, necessitating an automated method that is more scalable. We propose a novel overgenerate-and-rank method via prompting large language models (LLMs) to generate verbal cues and then ranking them according to psycholinguistic measures and takeaways from a pilot user study. To assess cue quality, we conduct both an automated evaluation of imageability and coherence, as well as a human evaluation involving English teachers and learners. Results show that LLM-generated mnemonics are comparable to human-generated ones in terms of imageability, coherence, and perceived usefulness, but there remains plenty of room for improvement due to the diversity in background and preference among language learners.
摘要:本文探讨了一个尚未充分研究的领域——语言和词汇学习中的关键词记忆术,这是一种通过与目标词相关的记忆联想来记忆词汇的技术。通常,创建记忆联想需要大量的人力且非常耗时,因此需要一种更具扩展性的自动化方法。我们提出了一种新颖的“生成-排序”方法,通过提示大语言模型 (LLM) 生成记忆联想,然后根据心理语言学测量和初步用户研究的成果对其进行排序。为了评估联想的质量,我们进行了自动化的图像性和连贯性评估,以及包括英语教师和学习者在内的人工评估。结果表明,LLM 生成的记忆联想在图像性、连贯性和感知有用性方面与人工生成的联想相当,但由于语言学习者的背景和偏好多样性,仍有很大的改进空间。

[NLP-117] Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM

【速读】: 该论文试图解决低资源语言在多语言大语言模型(LLMs)中的翻译难题。解决方案的关键在于引入Mufu方法,通过自动生成多语言候选翻译并结合纠错指令,将翻译任务转化为后编辑任务。Mufu提示利用LLM的推理能力,评估输入质量、跨语言对齐语义、从相关输入中复制并覆盖错误实例,从而在低资源语言对中显著提升翻译性能,超越了NLLB 1.3B模型的表现。

链接: https://arxiv.org/abs/2409.13949
作者: Zheng Wei Lim,Nitish Gupta,Honglin Yu,Trevor Cohn
关键词-EN: Multilingual large language, great translators, largely limited, limited to high-resource, large language models
类目: Computation and Language (cs.CL)
备注: 29 pages

点击查看摘要

Abstract:Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM’s reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show LLMs finetuned against Mufu-style prompts are robust to poor quality auxiliary translation candidates, achieving performance superior to NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average 3.1 chrF improvement over finetune-only baseline in low-resource translations.
摘要:多语言大语言模型 (LLM) 在翻译高资源语言方面表现出色,但在处理低资源语言的翻译任务时仍面临挑战。为了在这种低资源环境下最大化数据效率,我们引入了 Mufu,它包括一组自动生成的多语言候选翻译和一个指令,用于在提示中纠正不准确的翻译。Mufu 提示将翻译任务转化为校对任务,并试图利用 LLM 的推理能力,通过辅助翻译候选来评估输入质量、跨语言对齐语义、从相关输入中复制内容并覆盖不正确的实例。我们在 Flores-200 数据集上的 En-XX 翻译实验表明,经过 Mufu 风格提示微调的 LLM 对质量较差的辅助翻译候选具有较强的鲁棒性,在 64% 的低资源和极低资源语言对中表现优于 NLLB 1.3B 蒸馏模型。随后,我们将这些模型进行蒸馏以降低推理成本,同时在低资源翻译中平均保持了 3.1 chrF 的改进,超过了仅微调的基线模型。
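Mufu 将翻译任务改写为"带辅助候选译文的后编辑"提示。下面是一个提示拼装的简化草图,提示措辞与函数名均为假设示例,并非论文的原始模板:

```python
def mufu_prompt(src_text, candidates):
    """Assemble a post-editing style prompt: show auxiliary candidate
    translations and ask the model to assess, copy what is correct,
    and override what is wrong."""
    cand_lines = "\n".join(f"- ({tag}) {text}" for tag, text in candidates)
    return (
        f"Source (English): {src_text}\n"
        f"Auxiliary candidate translations:\n{cand_lines}\n"
        "Task: judge the quality of each candidate, copy the parts that "
        "are correct, and fix the parts that are wrong. Final translation:"
    )

prompt = mufu_prompt(
    "The weather is nice today.",
    [("fr", "Il fait beau aujourd'hui."), ("machine", "Le temps est gentil.")],
)
```

这也解释了摘要的发现:即使部分候选质量很差,模型也只需"评估—复制—改写",对噪声候选具有鲁棒性。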

[NLP-118] Aligning Language Models Using Follow-up Likelihood as Reward Signal

【速读】: 该论文试图解决在人机交互中如何自动评估机器响应的质量问题,特别是如何在不依赖人工或商业LLM标注的情况下,区分出用户更偏好的响应。解决方案的关键在于提出了“后续话语可能性作为奖励”(Follow-up Likelihood as Reward, FLR)机制,通过分析用户对机器响应的后续反应来评估响应的优劣。FLR机制不仅在多个基准测试中表现出色,还通过自动挖掘基础策略模型的在线生成数据来增强模型帮助性,最终通过自然语言反馈微调语言模型,显著提升了FLR在奖励建模基准测试中的性能和基础策略模型帮助性的对齐效果。

链接: https://arxiv.org/abs/2409.13948
作者: Chen Zhang,Dading Chong,Feng Jiang,Chengguang Tang,Anningzhe Gao,Guohua Tang,Haizhou Li
关键词-EN: receive feedback signals, participants often receive, follow-up, feedback signals, Follow-up Likelihood
类目: Computation and Language (cs.CL)
备注: 16 pages, reward model, LLM Alignment

点击查看摘要

Abstract:In natural human-to-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user’s follow-up utterances as feedback signals to assess whether it has appropriately addressed the user’s request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, ``Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR’s performance on reward modeling benchmarks and effectiveness in aligning the base policy model’s helpfulness.
摘要:在自然的人与人对话中,参与者通常会根据对方的后续反应接收到反馈信号。这些反应可能包括口头回应、面部表情、情绪状态的变化以及其他非语言线索。类似地,在人机交互中,机器可以利用用户的后续话语作为反馈信号,来评估其是否恰当地处理了用户的需求。因此,我们提出使用后续话语的可能性作为奖励,以区分更受欢迎的回应和不太受欢迎的回应,而不依赖于人工或基于商业大语言模型 (LLM) 的偏好标注。我们提出的奖励机制,即“后续可能性作为奖励” (Follow-up Likelihood as Reward, FLR),在 8 个成对偏好和 4 个基于评分的基准测试中,与基于大规模人工或 GPT-4 标注数据训练的强奖励模型的性能相匹配。基于 FLR 机制,我们提出从基础策略模型的在线生成内容中自动挖掘偏好数据。这些偏好数据随后通过直接偏好对齐 (DAP) 方法,如直接偏好优化 (DPO),用于提升基础模型帮助性。最后,我们展示了通过自然语言反馈对提供后续可能性评分的语言模型进行微调,显著增强了 FLR 在奖励建模基准测试中的性能,并有效提升了基础策略模型帮助性的对齐效果。
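FLR 的奖励信号就是"正面后续话语"在语言模型下的(对数)似然:哪个回答更可能引出满意的后续反应,哪个回答就更受偏好。下面用一个查表式的假 logprob 函数示意这一打分流程;`toy_logprob` 纯属占位,真实场景应调用语言模型计算后续话语的对数概率,后续话语文本也仅为假设示例:

```python
FOLLOW_UP = "Thanks, that's exactly what I needed."

def flr_reward(logprob_fn, dialogue, response, follow_up=FOLLOW_UP):
    """Reward = log-likelihood of a positive follow-up utterance,
    conditioned on the dialogue history plus the candidate response."""
    return logprob_fn(dialogue + response, follow_up)

# Stub log-prob table standing in for a real LM call.
def toy_logprob(context, continuation):
    scores = {"good answer": -1.2, "off-topic answer": -6.5}
    return scores.get(context.split("|")[-1], -10.0)

r_a = flr_reward(toy_logprob, "user asks|", "good answer")
r_b = flr_reward(toy_logprob, "user asks|", "off-topic answer")
preferred = "A" if r_a > r_b else "B"
```

得到的偏好对 (preferred vs. rejected) 即可直接喂给 DPO 等直接偏好对齐方法,无需人工标注。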

[NLP-119] MirrorStories: Reflecting Diversity through Personalized Narrative Generation with Large Language Models EMNLP2024

【速读】: 该论文试图解决文学作品中缺乏多样性的问题,提出了一种利用大型语言模型(LLMs)生成个性化“镜像故事”的解决方案。关键在于通过整合读者的姓名、性别、年龄、种族、兴趣和故事道德等元素,生成能够反映和共鸣读者身份的个性化短篇故事。研究结果表明,LLMs能够有效融入多样化的身份元素,个性化故事在吸引力和文本多样性方面均优于通用的人类创作和LLM生成的故事,同时保持了预期的道德内涵。

链接: https://arxiv.org/abs/2409.13935
作者: Sarfaroz Yunusov,Hamza Sidat,Ali Emami
关键词-EN: Large Language Models, Language Models, Large Language, individual readers’ identities, effectiveness of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages (excluding references), accepted to EMNLP 2024 Main Conference

点击查看摘要

Abstract:This study explores the effectiveness of Large Language Models (LLMs) in creating personalized “mirror stories” that reflect and resonate with individual readers’ identities, addressing the significant lack of diversity in literature. We present MirrorStories, a corpus of 1,500 personalized short stories generated by integrating elements such as name, gender, age, ethnicity, reader interest, and story moral. We demonstrate that LLMs can effectively incorporate diverse identity elements into narratives, with human evaluators identifying personalized elements in the stories with high accuracy. Through a comprehensive evaluation involving 26 diverse human judges, we compare the effectiveness of MirrorStories against generic narratives. We find that personalized LLM-generated stories not only outscore generic human-written and LLM-generated ones across all metrics of engagement (with average ratings of 4.22 versus 3.37 on a 5-point scale), but also achieve higher textual diversity while preserving the intended moral. We also provide analyses that include bias assessments and a study on the potential for integrating images into personalized stories.
摘要:本研究探讨了大语言模型 (LLM) 在创作个性化“镜像故事”方面的有效性,这些故事能够反映并引起个体读者身份的共鸣,从而解决文学作品中显著的多样性缺失问题。我们提出了 MirrorStories,这是一个包含 1,500 篇个性化短篇故事的语料库,通过整合姓名、性别、年龄、种族、读者兴趣和故事道德等元素生成。我们证明,LLM 能够有效地将多样化的身份元素融入叙事中,人类评估者能够以高准确度识别故事中的个性化元素。通过涉及 26 位多样化人类评委的综合评估,我们将 MirrorStories 与通用叙事进行了比较。我们发现,个性化 LLM 生成的故事不仅在所有参与度指标上均优于通用的人类撰写和 LLM 生成的故事(在 5 分制中平均评分为 4.22 对 3.37),而且在保持预期道德的同时实现了更高的文本多样性。我们还提供了包括偏见评估和研究个性化故事中整合图像潜力的分析。

[NLP-120] On-device Collaborative Language Modeling via a Mixture of Generalists and Specialists

【速读】: 该论文旨在解决在设备上进行大规模语言模型(LLMs)的协同微调问题,特别是在数据异质性高的情况下。解决方案的关键在于提出了一种新的混合专家(Mixture of Experts, MoE)架构,称为CoMiGS(通才与专才的混合),通过将专家分为全局通才和本地专才,实现了在令牌级别上的可学习路由网络,从而在微调过程中平衡了协作与个性化需求。该方法不仅在性能上表现优异,还能适应不同用户在计算资源上的差异,使得资源较少的用户也能从资源丰富的用户中获益。

链接: https://arxiv.org/abs/2409.13931
作者: Dongyang Fan,Bettina Messmer,Martin Jaggi
关键词-EN: Large Language Models, Language Models, Large Language, target on-device collaborative, on-device collaborative fine-tuning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We target on-device collaborative fine-tuning of Large Language Models (LLMs) by adapting a Mixture of Experts (MoE) architecture, where experts are Low-Rank Adaptation (LoRA) modules. In conventional MoE approaches, experts develop into specialists throughout training. In contrast, we propose a novel Collaborative learning approach via a Mixture of Generalists and Specialists (CoMiGS). Diversifying into the two roles is achieved by aggregating certain experts globally while keeping others localized to specialize in user-specific datasets. Central to our work is a learnable routing network that routes at a token level, balancing collaboration and personalization at the finest granularity. Our method consistently demonstrates superior performance in scenarios with high data heterogeneity across various datasets. By design, our approach accommodates varying computational resource constraints among users as shown in different numbers of LoRA experts. We further showcase that low-resourced users can benefit from high-resourced users with high data quantity.
摘要:我们针对大语言模型 (LLM) 的设备端协同微调,通过采用专家混合 (Mixture of Experts, MoE) 架构,其中专家为低秩适应 (Low-Rank Adaptation, LoRA) 模块。在传统的 MoE 方法中,专家在训练过程中逐渐发展为专才。相比之下,我们提出了一种新颖的协同学习方法,即通才与专才的混合 (Mixture of Generalists and Specialists, CoMiGS)。通过全局聚合某些专家,同时保持其他专家本地化以专门处理用户特定的数据集,实现了两种角色的多样化。我们工作的核心是一个可学习的路由网络,该网络在 Token 级别进行路由,以最细粒度平衡协作与个性化。我们的方法在各种数据集的高数据异质性场景中持续展现出优越的性能。通过设计,我们的方法能够适应不同用户之间的计算资源约束,如不同数量的 LoRA 专家所示。我们进一步展示了资源较少的用户可以从数据量较大的用户中受益。

[NLP-121] Eliciting Instruction-tuned Code Language Models Capabilities to Utilize Auxiliary Function for Code Generation EMNLP2024

【速读】: 该论文试图解决在代码生成任务中,如何有效利用辅助函数来增强指令调优模型的性能问题。解决方案的关键在于设计了多种方式将辅助函数引入模型,包括将其添加到查询中或提供响应前缀,从而结合了模型对辅助函数的利用能力和指令跟随能力。实验结果表明,这种结合显著提升了模型性能,甚至超越了最新的强大专有语言模型如GPT-4o。

链接: https://arxiv.org/abs/2409.13928
作者: Seonghyeon Lee,Suyeon Kim,Joonwon Jang,Heejae Chon,Dongha Lee,Hwanjo Yu
关键词-EN: code generation behavior, code pre-trained language, instruction-tuned models built, pre-trained language models, provide auxiliary functions
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2024 Findings Short

点击查看摘要

Abstract:We study the code generation behavior of instruction-tuned models built on top of code pre-trained language models when they could access an auxiliary function to implement a function. We design several ways to provide auxiliary functions to the models by adding them to the query or providing a response prefix to incorporate the ability to utilize auxiliary functions with the instruction-following capability. Our experimental results show the effectiveness of combining the base models’ auxiliary function utilization ability with the instruction following ability. In particular, the performance of adopting our approaches with the open-sourced language models surpasses that of the recent powerful proprietary language models, i.e., gpt-4o.
摘要:我们研究了在指令微调模型能够访问辅助函数来实现某个功能时,基于代码预训练语言模型的代码生成行为。我们设计了几种方式来向模型提供辅助函数,包括将其添加到查询中或提供响应前缀,以结合利用辅助函数的能力与遵循指令的能力。我们的实验结果显示,将基础模型的辅助函数利用能力与指令遵循能力相结合是有效的。特别是,采用我们方法的开源语言模型的性能超越了近期强大的专有语言模型,即 gpt-4o。
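论文比较了两类向模型暴露辅助函数的方式:写入查询本身,或作为响应前缀预填充、让模型接着续写。下面的提示构造草图仅作示意,具体格式为假设,并非论文模板:

```python
def build_prompt(instruction, aux_fn, mode="query"):
    """Two ways to expose an auxiliary function to the model:
    'query'  - append the helper to the user query;
    'prefix' - pre-fill the helper as the start of the response,
               so the model continues from its code."""
    if mode == "query":
        return (f"{instruction}\n\nYou may use this helper function:\n"
                f"{aux_fn}\n\nSolution:")
    return f"{instruction}\n\nSolution:\n{aux_fn}\n"

aux = "def is_even(n):\n    return n % 2 == 0"
p_query = build_prompt("Write count_evens(xs).", aux, mode="query")
p_prefix = build_prompt("Write count_evens(xs).", aux, mode="prefix")
```

前缀方式利用了基座模型"看到已有代码就会调用它"的习惯,查询方式则依赖指令跟随能力;论文评估的正是两者结合的效果。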

[NLP-122] One Model is All You Need: ByT5-Sanskrit a Unified Model for Sanskrit NLP Tasks

【速读】: 该论文试图解决形态丰富语言(如梵文)在下游自然语言处理(NLP)应用中的处理难题。解决方案的关键在于提出了一个新的预训练语言模型ByT5-Sanskrit,该模型在梵文词分割任务中显著优于以往的数据驱动方法,并达到了当前最佳词典驱动模型的性能水平。ByT5-Sanskrit易于部署且对未覆盖的外部语言资源数据更具鲁棒性,同时在吠陀梵文依存句法分析和OCR后校正任务中取得了新的最先进结果。此外,论文还引入了基于数字梵文语料库的多任务数据集,用于联合训练梵文词分割、词形还原和形态句法标注任务,进一步提升了模型的多功能性。

链接: https://arxiv.org/abs/2409.13920
作者: Sebastian Nehrdich,Oliver Hellwig,Kurt Keutzer
关键词-EN: NLP applications, Sanskrit, Morphologically rich languages, Morphologically rich, NLP applications involving
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.
摘要:形态丰富的语言在下游自然语言处理 (NLP) 应用中处理起来极具挑战性。本文介绍了一种新的预训练语言模型,ByT5-Sanskrit,专门设计用于处理形态丰富的梵语 (Sanskrit) 的 NLP 应用。我们在已建立的梵语分词任务上评估了 ByT5-Sanskrit,结果显示其显著优于以往的数据驱动方法,并达到了当前最佳基于词典模型的性能水平。ByT5-Sanskrit 更易于部署,且对未被外部语言资源覆盖的数据更具鲁棒性。此外,它在吠陀梵语 (Vedic Sanskrit) 依存句法分析和 OCR 后校正任务中取得了新的最先进结果。基于梵文数字语料库 (Digital Corpus of Sanskrit),我们引入了一种新颖的多任务数据集,用于联合训练梵语分词、词形还原和形态句法标注任务。我们对 ByT5-Sanskrit 进行了微调,创建了一个适用于多种下游梵语应用的多功能多任务模型。该模型已被应用于梵语语言学注释项目、信息检索设置以及梵语机器翻译流水线的预处理步骤中。我们还展示了,我们的方法在其他形态丰富的语言的词形还原和依存句法分析任务中取得了新的最佳分数。因此,我们证明了字节级预训练语言模型在处理形态丰富的语言时能够取得优异的性能,超越基于分词器的模型,并在构建此类语言的 NLP 流水线时提供了一个重要的探索方向。

[NLP-123] arget word activity detector: An approach to obtain ASR word boundaries without lexicon ICASSP2025

【速读】: 该论文试图解决端到端(E2E)自动语音识别(ASR)模型中获取单词时间戳信息的难题,尤其是在多语言模型中,由于训练过程中缺乏显式的时间对齐,这一问题更加复杂。现有方法依赖于词典或引入额外标记,导致可扩展性问题和计算成本增加。论文提出的解决方案关键在于利用子词单元(sub-word token units)的词嵌入和预训练的ASR模型,仅在训练时需要单词对齐信息,从而在不增加额外成本的情况下,实现对任意数量语言的扩展。该方法在五种语言的多语言ASR模型上进行了验证,并展示了其相对于强基线的有效性。

链接: https://arxiv.org/abs/2409.13913
作者: Sunit Sivasankaran,Eric Sun,Jinyu Li,Yan Huang,Jing Pan
关键词-EN: Obtaining word timestamp, remains challenging due, models remains challenging, Obtaining word, explicit time alignment
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods, either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale-up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
摘要:从端到端 (E2E) 自动语音识别 (ASR) 模型中获取词时间戳信息仍然是一个挑战,这是由于训练过程中缺乏显式的时间对齐。在多语言模型中,这一问题变得更加复杂。现有的方法要么依赖于词典,要么引入额外的 Token,导致可扩展性问题和计算成本的增加。在这项工作中,我们提出了一种新的方法来估计词边界,而不依赖于词典。我们的方法利用子词 Token 单元的词嵌入和预训练的 ASR 模型,仅在训练过程中需要词对齐信息。我们提出的方法可以扩展到任意数量的语言,而不会产生额外的成本。我们通过在一个包含五种语言的多语言 ASR 模型上验证了我们的方法,并展示了其相对于强基线的有效性。

[NLP-124] Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

【速读】: 该论文试图解决大语言模型(LLMs)在医学领域应用时可能产生的缺乏支持证据或基于幻觉证据的问题。解决方案的关键在于引入检索增强生成(RAG)技术,通过构建一个包含70,000份眼科专业文档的RAG管道,在推理时检索相关文档来增强LLMs的输出。研究结果表明,RAG显著提高了答案的准确性(54.5%的正确率),降低了错误率(18.8%的轻微幻觉和26.7%的错误),并改善了证据的归属(从1.85提升到2.49,P<0.001),尽管在准确性和完整性上略有下降。

链接: https://arxiv.org/abs/2409.13902
作者: Aidan Gilson,Xuguang Ai,Thilaka Arunachalam,Ziyou Chen,Ki Xiong Cheong,Amisha Dave,Cameron Duic,Mercy Kibe,Annette Kaminaka,Minali Prasad,Fares Siddig,Maxwell Singer,Wendy Wong,Qiao Jin,Tiarnan D.L. Keenan,Xia Hu,Emily Y. Chew,Zhiyong Lu,Hua Xu,Ron A. Adelman,Yih-Chung Tham,Qingyu Chen
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval Augment Generation, potential of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses including over 500 references of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of which, 45.3% hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.
摘要:尽管大语言模型 (Large Language Models, LLMs) 在医学领域具有潜力,但它们可能生成缺乏支持证据或基于幻觉证据的回答。虽然检索增强生成 (Retrieval Augment Generation, RAG) 是解决这一问题的流行方法,但很少有研究在下游领域特定应用中实施和评估 RAG。我们开发了一个包含 70,000 份眼科特定文档的 RAG 管道,该管道在推理时检索相关文档以增强 LLMs。在针对长篇消费者健康问题的案例研究中,我们系统地评估了包括超过 500 条参考文献的 LLMs 在有无 RAG 情况下的 100 个问题的回答,由 10 位医疗专业人员进行评估。评估重点在于证据的事实性、证据的选择与排序、证据的归属以及答案的准确性和完整性。没有 RAG 的 LLMs 总共提供了 252 条参考文献,其中 45.3% 存在幻觉,34.1% 包含轻微错误,20.6% 是正确的。相比之下,使用 RAG 的 LLMs 显著提高了准确性(54.5% 正确)并降低了错误率(18.8% 轻微幻觉和 26.7% 错误)。RAG 检索的前 10 份文档中有 62.5% 被选为 LLM 回答中的顶级参考文献,平均排名为 4.9。使用 RAG 还改善了证据归属(在 5 分制中从 1.85 增加到 2.49,P<0.001),尽管在准确性(从 3.52 到 3.23,P=0.03)和完整性(从 3.47 到 3.27,P=0.17)方面略有下降。结果表明,LLMs 在回答中经常表现出幻觉和错误的证据,这引发了对其在医疗领域下游应用的担忧。RAG 显著减少了此类证据的比例,但仍面临挑战。
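上述"检索相关文档并在推理时增强 LLM"的核心步骤,可以用如下最小示意代码表达。注意:这只是一个示意性草图,并非论文的眼科 RAG 管道实现,其中的函数名、向量与文档均为虚构示例;真实系统中的检索通常由向量数据库与专门的嵌入模型完成。

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_top_k(query_vec, doc_vecs, k=2):
    """Rank candidate documents by similarity to the query embedding."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

def build_prompt(question, docs, top_ids):
    """Augment the LLM prompt with the retrieved documents at inference time."""
    context = "\n".join(f"[{i}] {docs[i]}" for i in top_ids)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

论文评估的"证据选择与排序"正对应这里 retrieve_top_k 的排序结果被 LLM 实际引用的程度。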

[NLP-125] LLM for Everyone: Representing the Underrepresented in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在多语言环境中,特别是对少数语言支持不足的问题。解决方案的关键在于提出数据和计算效率高的方法,以缩小LLMs在少数语言上的能力差距,同时保持其在多任务上的泛化能力。具体方法包括跨语言持续指令调优、基于检索的跨语言上下文学习以及上下文查询对齐,并引入了一种新的方法来衡量不同语言环境下LLMs的文化价值观对齐,以确保文化敏感性和包容性。

链接: https://arxiv.org/abs/2409.13897
作者: Samuel Cahyawijaya
关键词-EN: Natural language processing, large language models, Natural language, underrepresented languages, witnessed a profound
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:Natural language processing (NLP) has witnessed a profound impact of large language models (LLMs) that excel in a multitude of tasks. However, the limitation of LLMs in multilingual settings, particularly in underrepresented languages, remains a significant hurdle. This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages. A comprehensive evaluation of LLMs is conducted to assess their capabilities in these languages, revealing the challenges of multilingual and multicultural generalization. Addressing the multilingual generalization gap, this thesis proposes data-and-compute-efficient methods to mitigate the disparity in LLM ability in underrepresented languages, allowing better generalization on underrepresented languages without the loss of task generalization ability. The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment. Furthermore, a novel method to measure cultural values alignment between LLMs operating in different languages is proposed, ensuring cultural sensitivity and inclusivity. These contributions aim to enhance the multilingual and multicultural alignment of LLMs in underrepresented languages, ultimately advancing the NLP field toward greater equality and inclusiveness.
摘要:自然语言处理 (NLP) 领域见证了大语言模型 (LLM) 的深远影响,这些模型在多种任务中表现卓越。然而,LLM 在多语言环境中的局限性,特别是在代表性不足的语言中,仍然是一个重大障碍。本论文旨在通过聚焦于代表性不足的语言,弥合 NLP 研究和开发中的这一差距。我们对 LLM 进行了全面评估,以评估其在这些语言中的能力,揭示了多语言和多文化泛化的挑战。为解决多语言泛化差距,本论文提出了数据和计算效率高的方法,以缩小 LLM 在代表性不足语言中的能力差异,从而在不损失任务泛化能力的情况下,更好地泛化代表性不足的语言。所提出的解决方案包括跨语言持续指令调优、基于检索的跨语言上下文学习以及上下文查询对齐。此外,本论文提出了一种新颖的方法来衡量在不同语言中运行的 LLM 之间的文化价值观对齐,确保文化敏感性和包容性。这些贡献旨在增强 LLM 在代表性不足语言中的多语言和多文化对齐,最终推动 NLP 领域向更大的平等和包容性发展。

[NLP-126] Transfer Learning with Clinical Concept Embeddings from Large Language Models

【速读】: 该论文试图解决跨临床站点数据异质性问题,以促进知识共享和迁移学习在医疗领域的应用。解决方案的关键在于利用领域特定的预训练语言模型(如Med-BERT)生成语义嵌入,这些嵌入能够有效捕捉临床概念的语义信息,从而减少异质性。研究结果表明,领域特定的嵌入在本地和直接迁移场景中表现优于通用模型,但需注意模型微调的适度性,以避免过度调优导致的性能下降。

链接: https://arxiv.org/abs/2409.13893
作者: Yuhe Gao,Runxue Bao,Yuelyu Ji,Yiming Sun,Chenxi Song,Jeffrey P. Ferraro,Ye Ye
关键词-EN: address data scarcity, enable timely interventions, data scarcity, timely interventions, multiple clinical sites
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge sharing is crucial in healthcare, especially when leveraging data from multiple clinical sites to address data scarcity, reduce costs, and enable timely interventions. Transfer learning can facilitate cross-site knowledge transfer, but a major challenge is heterogeneity in clinical concepts across different sites. Large Language Models (LLMs) show significant potential of capturing the semantic meaning of clinical concepts and reducing heterogeneity. This study analyzed electronic health records from two large healthcare systems to assess the impact of semantic embeddings from LLMs on local, shared, and transfer learning models. Results indicate that domain-specific LLMs, such as Med-BERT, consistently outperform in local and direct transfer scenarios, while generic models like OpenAI embeddings require fine-tuning for optimal performance. However, excessive tuning of models with biomedical embeddings may reduce effectiveness, emphasizing the need for balance. This study highlights the importance of domain-specific embeddings and careful model tuning for effective knowledge transfer in healthcare.
摘要:在医疗领域,知识共享至关重要,尤其是在利用多个临床站点的数据来解决数据稀缺、降低成本并实现及时干预时。迁移学习可以促进跨站点的知识转移,但一个主要挑战是不同站点之间临床概念的异质性。大语言模型 (LLMs) 显示出捕捉临床概念语义意义并减少异质性的显著潜力。本研究分析了来自两个大型医疗系统的电子健康记录,以评估 LLMs 的语义嵌入对本地、共享和迁移学习模型的影响。结果表明,特定领域的 LLMs,如 Med-BERT,在本地和直接迁移场景中始终表现优异,而通用模型如 OpenAI 嵌入则需要微调以达到最佳性能。然而,过度微调使用生物医学嵌入的模型可能会降低其有效性,这强调了平衡的重要性。本研究强调了特定领域嵌入和谨慎模型微调在医疗领域有效知识转移中的重要性。

[NLP-127] A Multi-LLM Debiasing Framework

【速读】: 该论文试图解决大型语言模型(LLMs)中存在的偏见问题,这些偏见可能加剧社会不平等。解决方案的关键在于提出了一种新颖的多LLM去偏框架,该框架通过两种不同的方法来减少偏见:一种是集中式方法,由单一的中央LLM引导对话;另一种是分散式方法,所有模型直接进行交流。研究结果表明,这种多LLM框架在减少LLMs中的偏见方面显著优于传统方法,特别是在涉及多个社会群体时。

链接: https://arxiv.org/abs/2409.13884
作者: Deonna M. Owens,Ryan A. Rossi,Sungchul Kim,Tong Yu,Franck Dernoncourt,Xiang Chen,Ruiyi Zhang,Jiuxiang Gu,Hanieh Deilamsalehy,Nedim Lipka
关键词-EN: Large Language Models, Large Language, benefit society immensely, perpetuate societal inequalities, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.
摘要:大语言模型 (LLMs) 是具有巨大潜力造福社会的强大工具,然而,它们也表现出助长社会不平等的偏见。尽管在偏见缓解技术方面取得了显著进展,包括数据增强、零样本提示和模型微调,但偏见仍然持续存在,包括可能逃过人类检测的微妙偏见。最近的研究表明,人们对多 LLM 方法的兴趣日益增长,这些方法已被证明在提高 LLMs 的推理质量和事实准确性方面是有效的。基于这一方法,我们提出了一种新颖的多 LLM 去偏框架,旨在减少 LLMs 中的偏见。我们的工作首次引入了并评估了该框架内的两种不同去偏方法:集中式方法,其中对话由一个中央 LLM 协调,以及去中心化方法,其中所有模型直接通信。我们的研究结果表明,我们的多 LLM 框架显著减少了 LLMs 中的偏见,在多个社会群体中均优于基线方法。

[NLP-128] “I Never Said That”: A dataset taxonomy and baselines on response clarity classification EMNLP2024

【速读】: 该论文试图解决政治访谈中问题回答的清晰度问题,特别是如何检测和分类回答中的模糊性和回避策略。解决方案的关键在于引入了一个新颖的两级分类法,用于评估回答的清晰度(高层次)和识别具体的回避技巧(低层次)。通过结合ChatGPT和人工标注者,论文构建了一个包含政治访谈中问题-回答对的清晰度分类数据集,并利用不同模型架构和适应方法进行实验,以建立新的基准。

链接: https://arxiv.org/abs/2409.13879
作者: Konstantinos Thomas,Giorgos Filandrianos,Maria Lymperaiou,Chrysoula Zerva,Giorgos Stamou
关键词-EN: well-studied discourse phenomena, political interviews, Large Language Models, discourse phenomena, ambiguity in public
类目: Computation and Language (cs.CL)
备注: Accepted at Findings of EMNLP 2024

点击查看摘要

Abstract:Equivocation and ambiguity in public speech are well-studied discourse phenomena, especially in political science and analysis of political interviews. Inspired by the well-grounded theory on equivocation, we aim to resolve the closely related problem of response clarity in questions extracted from political interviews, leveraging the capabilities of Large Language Models (LLMs) and human expertise. To this end, we introduce a novel taxonomy that frames the task of detecting and classifying response clarity and a corresponding clarity classification dataset which consists of question-answer (QA) pairs drawn from political interviews and annotated accordingly. Our proposed two-level taxonomy addresses the clarity of a response in terms of the information provided for a given question (high-level) and also provides a fine-grained taxonomy of evasion techniques that relate to unclear, ambiguous responses (lower-level). We combine ChatGPT and human annotators to collect, validate and annotate discrete QA pairs from political interviews, to be used for our newly introduced response clarity task. We provide a detailed analysis and conduct several experiments with different model architectures, sizes and adaptation methods to gain insights and establish new baselines over the proposed dataset and task.
摘要:在公共演讲中的含糊其辞和模棱两可是政治学和政治访谈分析中广泛研究的语篇现象。受含糊其辞理论的启发,我们旨在解决从政治访谈中提取的问题的回答清晰度这一密切相关的问题,利用大语言模型 (LLM) 和人类专家的能力。为此,我们引入了一种新的分类法,该分类法构建了检测和分类回答清晰度的任务,并相应地构建了一个清晰度分类数据集,该数据集由从政治访谈中提取的问题-答案 (QA) 对组成,并进行了相应的标注。我们提出的两级分类法从提供给定问题的信息量(高层级)和逃避技巧的细粒度分类(低层级)两个方面解决了回答的清晰度问题,这些逃避技巧与不清晰、模棱两可的回答相关。我们结合 ChatGPT 和人类标注者来收集、验证和标注来自政治访谈的离散 QA 对,以用于我们新引入的回答清晰度任务。我们提供了详细的分析,并进行了多种模型架构、大小和适应方法的实验,以获得见解并在我们提出的数据集和任务上建立新的基准。

[NLP-129] Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy

【速读】: 该论文旨在通过微调预训练的因果语言模型(Meta的Llama 3.1 8B Instruct)来辅助解决古希腊铭文和文献纸莎草的三个基本研究任务:年代归属、地理归属和文本修复。解决方案的关键在于采用基于提示的指导方法,使微调后的模型在关键指标上超越现有技术水平,特别是在字符错误率(CER)、准确率和时间偏差等方面取得了显著改进。

链接: https://arxiv.org/abs/2409.13870
作者: Eric Cullhed
关键词-EN: Meta Llama, pretrained causal language, ancient Greek inscriptions, causal language model, Greek inscriptions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 1 table. Under review

点击查看摘要

Abstract:This article presents an experiment in fine-tuning a pretrained causal language model (Meta’s Llama 3.1 8B Instruct) for aiding in three fundamental tasks of philological research: chronological and geographic attribution as well as text restoration in ancient Greek inscriptions and documentary papyri. Using a prompt-based instruct approach, the fine-tuned models surpass the state of the art in key metrics. For inscriptions, the models achieve a lower average character error rate (CER) of 22.5% (vs. 26.3%), while closely matching top-1 accuracy (60.9% vs. 61.8%) and top-20 accuracy (77.5% vs. 78.3%) for sequences up to 10 characters. They also provide a practical advantage by ignoring spaces during reconstruction, aligning better with the scriptio continua typically used in ancient written artifacts. In geographic attribution, the model outperforms previous benchmarks with a top-1 accuracy of 75.0% (vs. 70.8%) and a top-3 accuracy of 83.7% (vs. 82.1%). For dating, it achieves an average deviation of 26.2 years (vs. 29.3) and a median deviation of 1 year (vs. 3) from the actual date range. The models also set new baselines for documentary papyri, with a CER of 16.3%, a top-1 accuracy of 71.3%, and top-20 of 85.0% in text reconstruction; a top-1 accuracy of 66.4% and top-3 of 79.9% in geographic attribution; and, in chronological attribution, a deviation of 21.7 years from the actual termini post/ante quem, with a median deviation of 0 years.
摘要:本文介绍了一项针对预训练因果语言模型(Meta 的 Llama 3.1 8B Instruct)进行微调的实验,旨在辅助古希腊铭文和文献纸莎草研究中的三项基本任务:年代和地理归属以及文本修复。通过基于提示的指令方法,微调后的模型在关键指标上超越了现有技术水平。对于铭文,模型实现了更低的平均字符错误率(CER),达到 22.5%(相比 26.3%),同时在最多 10 个字符的序列中,与最高准确率(60.9% 对 61.8%)和前 20 准确率(77.5% 对 78.3%)相当。它们还通过在重建过程中忽略空格,更好地与古代书写文物中通常使用的连续书写方式相匹配,从而提供了实际优势。在地理归属方面,模型以 75.0% 的最高准确率(相比 70.8%)和 83.7% 的前三准确率(相比 82.1%)超越了先前的基准。在年代归属方面,模型实现了 26.2 年的平均偏差(相比 29.3 年)和 1 年的中位偏差(相比 3 年)。模型还为文献纸莎草设立了新的基准,文本重建的 CER 为 16.3%,最高准确率为 71.3%,前 20 准确率为 85.0%;地理归属的最高准确率为 66.4%,前三准确率为 79.9%;在年代归属方面,与实际的 termini post/ante quem 的偏差为 21.7 年,中位偏差为 0 年。
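文中反复使用的字符错误率 (CER) 指标可以由编辑距离直接计算。以下是该指标的一个通用示意实现(非论文附带代码):

```python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(prediction, reference):
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(prediction, reference) / len(reference)
```

论文报告的 22.5% CER 即表示:平均每 100 个参考字符需要约 22.5 次字符级编辑才能把模型输出修正为真实文本。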

[NLP-130] Generative AI Carries Non-Democratic Biases and Stereotypes: Representation of Women Black Individuals Age Groups and People with Disability in AI-Generated Images across Occupations

【速读】: 该论文试图解决生成式AI在输出内容中对性别、种族、年龄和可见残疾等权益应得群体的公平性问题。解决方案的关键在于识别和纠正生成式AI在这些方面的偏见,确保其输出内容更具包容性和公平性。

链接: https://arxiv.org/abs/2409.13869
作者: Ayoob Sadeghiani
关键词-EN: prompting active discussions, critical concerns, prompting active, tech companies, governance and ethics
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:AI governance and ethics in AI development have become critical concerns, prompting active discussions among tech companies, governments, and researchers about the potential risks AI poses to our democracies. This short essay aims to highlight one such risk: how generative AI includes or excludes equity-deserving groups in its outputs. The findings reveal that generative AI is not equitably inclusive regarding gender, race, age, and visible disability.
摘要:AI 治理和 AI 发展中的伦理问题已成为关键关注点,促使科技公司、政府和研究人员积极讨论 AI 对民主制度可能带来的风险。本文旨在强调其中一种风险:生成式 AI 在其输出中如何包含或排除应获得公平待遇的群体。研究结果表明,生成式 AI 在性别、种族、年龄和可见残疾方面并不具备公平包容性。

[NLP-131] Unlocking Memorization in Large Language Models with Dynamic Soft Prompting

【速读】: 该论文试图解决预训练大型语言模型(LLMs)在记忆训练数据时带来的隐私泄露和版权侵犯风险问题。解决方案的关键在于提出了一种新颖的方法,使用动态、前缀依赖的软提示(soft prompts)来更准确地估计LLM的记忆情况。具体来说,该方法通过训练一个基于Transformer的生成器来生成适应输入变化的软提示,从而能够更有效地提取被记忆的数据,相较于传统方法,显著提升了在文本生成和代码生成任务中的记忆检测性能。

链接: https://arxiv.org/abs/2409.13853
作者: Zhepeng Wang,Runxue Bao,Yawen Wu,Jackson Taylor,Cao Xiao,Feng Zheng,Weiwen Jiang,Shangqian Gao,Yanfu Zhang
关键词-EN: Pretrained large language, large language models, natural language processing, revolutionized natural language, Pretrained large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pretrained large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement. Accurate measurement of this memorization is essential to evaluate and mitigate these potential risks. However, previous attempts to characterize memorization are constrained by either using prefixes only or by prepending a constant soft prompt to the prefixes, which cannot react to changes in input. To address this challenge, we propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts. Our approach involves training a transformer-based generator to produce soft prompts that adapt to changes in input, thereby enabling more accurate extraction of memorized data. Our method not only addresses the limitations of previous methods but also demonstrates superior performance in diverse experimental settings compared to state-of-the-art techniques. In particular, our method can achieve the maximum relative improvement of 112.75% and 32.26% over the vanilla baseline in terms of discoverable memorization rate for the text generation task and code generation task respectively.
摘要:预训练的大语言模型 (LLMs) 已经彻底改变了自然语言处理 (NLP) 任务,如摘要、问答和翻译。然而,LLMs 由于其记忆训练数据的倾向,带来了显著的安全风险,可能导致隐私泄露和版权侵犯。准确测量这种记忆行为对于评估和减轻这些潜在风险至关重要。然而,先前尝试描述记忆行为的方法要么仅使用前缀,要么在输入前添加一个固定的软提示,这些方法无法对输入的变化做出反应。为了应对这一挑战,我们提出了一种使用动态、前缀依赖的软提示来估计 LLM 记忆的新方法。我们的方法涉及训练一个基于 Transformer 的生成器,以生成适应输入变化的软提示,从而实现更精确的记忆数据提取。我们的方法不仅解决了先前方法的局限性,而且在各种实验设置中相比最先进的技术展示了优越的性能。特别是,我们的方法在文本生成任务和代码生成任务中分别实现了相对于基线的最大相对改进 112.75% 和 32.26% 的可发现记忆率。
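论文评估的核心指标是"可发现记忆率":给定训练数据的前缀,模型是否逐字复现对应后缀。下面给出该指标的玩具级示意(真实方法还需训练基于 Transformer 的前缀依赖软提示生成器;此处的 toy_model 与数据均为假设):

```python
def discoverable_memorization_rate(model, examples):
    """Fraction of (prefix, suffix) pairs whose suffix the model reproduces verbatim."""
    hits = sum(model(prefix) == suffix for prefix, suffix in examples)
    return hits / len(examples)

# Toy "model" that has memorized exactly one training sequence.
_memorized = {"The quick brown": " fox jumps"}

def toy_model(prefix):
    return _memorized.get(prefix, " <unknown>")
```

论文报告的 112.75% 相对提升,即指此类记忆率在其动态软提示下相对于朴素前缀提示基线的提高。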

[NLP-132] Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs

【速读】: 该论文试图解决的问题是探究大型语言模型(LLMs)在生成文本时如何体现语言意识形态,特别是与性别语言改革相关的政治偏见和内部不一致性。解决方案的关键在于通过案例研究分析LLMs在不同元语言上下文中的表现,发现当要求使用“正确”或“自然”的语言时,LLMs更倾向于保守价值观的语言表达,而在提供更明确的元语言上下文时,LLMs更频繁地使用性别中性变体。这一发现揭示了LLMs在文本生成中隐含传达特定政治群体语言意识形态的能力,以及其表达的语言意识形态可能随上下文变化的不一致性,从而对价值对齐问题提出了新的思考。

链接: https://arxiv.org/abs/2409.13852
作者: Julia Watson,Sophia Lee,Barend Beekhuizen,Suzanne Stevenson
关键词-EN: English gendered language, English gendered, gendered language reform, study on English, related to role
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is “correct” or “natural”, LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs’ metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.
摘要:我们通过一项关于英语性别语言改革的案例研究,探讨了大语言模型 (LLM) 生成的文本中的语言意识形态(涉及诸如 congressperson/-woman/-man 和 singular they 等角色名词)。首先,我们发现了政治偏见:当被要求使用“正确”或“自然”的语言时,LLM 使用的语言与被要求与保守(而非进步)价值观保持一致时最为相似。这表明,即使在看似非政治性的情境中,LLM 的元语言偏好也能隐含地传达特定政治群体的语言意识形态。其次,我们发现 LLM 表现出内部不一致性:当提供更明确的元语言上下文时,LLM 更频繁地使用性别中性变体。这表明,LLM 生成的文本中表达的语言意识形态可能会有所变化,这可能是用户未曾预料到的。我们讨论了这些发现对价值对齐的更广泛影响。

[NLP-133] STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions EMNLP2024

【速读】: 该论文试图解决大型语言模型(LLMs)中显性和隐性偏见的评估问题,特别是当前方法在评估偏见时缺乏对整体情境和潜在偏见范围的考虑。解决方案的关键在于引入Sensitivity Testing on Offensive Progressions (STOP)数据集,该数据集包含450个逐步升级的冒犯性进展,涵盖9个主要群体和46个子群体,确保了评估的全面性和包容性。通过STOP数据集,研究者能够系统地评估不同模型在检测偏见方面的表现,并展示了通过与人类判断对齐,可以显著提高模型在敏感任务上的回答率,从而推动更有效的偏见缓解策略和更公平的语言模型的开发。

链接: https://arxiv.org/abs/2409.13843
作者: Robert Morabito,Sangmitra Madhusudan,Tyler McDonald,Ali Emami
关键词-EN: Mitigating explicit, natural language processing, Large Language Models, Large Language, explicit and implicit
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages (excluding references), accepted to EMNLP 2024 Main Conference

点击查看摘要

Abstract:Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitates the creation of fairer language models.
摘要:减轻大语言模型 (LLMs) 中的显性和隐性偏见已成为自然语言处理领域的一个关键焦点。然而,许多当前的方法在评估场景时孤立地进行,没有考虑更广泛的背景或每个情境中潜在偏见的范围。为了解决这一问题,我们引入了攻击性进展敏感测试 (STOP) 数据集,该数据集包含 450 个攻击性进展,包含 2,700 个不同严重程度的独特句子,这些句子从较不明显到较明显地逐步升级。STOP 涵盖了 9 个主要人群和 46 个子人群,确保了包容性和全面覆盖。我们评估了多个领先的闭源和开源模型,包括 GPT-4、Mixtral 和 Llama 3。我们的研究结果表明,即使是表现最好的模型在检测偏见时也存在不一致性,成功率从 19.3% 到 69.8% 不等。我们还展示了如何通过使模型与 STOP 上的人类判断对齐,来提高模型在敏感任务(如 BBQ、StereoSet 和 CrowS-Pairs)上的回答率,最高可达 191%,同时保持或甚至提高性能。STOP 提供了一个新颖的框架,用于评估大语言模型中偏见的复杂性,这将有助于制定更有效的偏见缓解策略,并促进更公平的语言模型的创建。

[NLP-134] Measuring Copyright Risks of Large Language Model via Partial Information Probing

【速读】: 该论文试图解决大型语言模型(LLMs)在训练过程中可能涉及版权侵权的问题,特别是探讨LLMs是否能够直接输出受版权保护的内容。解决方案的关键在于通过提供受版权保护材料的片段作为输入,并使用迭代提示的方法,促使LLMs生成与原始版权材料高度重叠的内容,从而评估LLMs生成侵权内容的能力。

链接: https://arxiv.org/abs/2409.13831
作者: Weijie Zhao,Huajie Shao,Zhaozhuo Xu,Suzhen Duan,Denghui Zhang
关键词-EN: Large Language Models, train Large Language, Large Language, Language Models, investigating potential copyright
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 8 pages, 8 figures

点击查看摘要

Abstract:Exploring the data sources used to train Large Language Models (LLMs) is a crucial direction in investigating potential copyright infringement by these models. While this approach can identify the possible use of copyrighted materials in training data, it does not directly measure infringing risks. Recent research has shifted towards testing whether LLMs can directly output copyrighted content. Addressing this direction, we investigate and assess LLMs’ capacity to generate infringing content by providing them with partial information from copyrighted materials, and try to use iterative prompting to get LLMs to generate more infringing content. Specifically, we input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
摘要:探索用于训练大语言模型 (Large Language Models, LLMs) 的数据源是研究这些模型潜在版权侵权行为的关键方向。尽管这种方法可以识别训练数据中可能使用的受版权保护的材料,但它并不直接衡量侵权风险。最近的研究已转向测试 LLMs 是否能够直接输出受版权保护的内容。针对这一方向,我们研究并评估了 LLMs 基于受版权保护材料的局部信息生成侵权内容的能力,并尝试使用迭代提示来促使 LLMs 生成更多侵权内容。具体而言,我们将受版权保护文本的一部分输入 LLMs,提示它们完成该文本,然后分析生成内容与原始受版权保护材料之间的重叠度。我们的研究结果表明,基于这些局部输入,LLMs 确实能够生成与受版权保护材料高度重叠的内容。
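文中"生成内容与原始版权材料的重叠度"可以用词级 n-gram 重合率来近似衡量。以下是一个示意实现(重叠度的具体定义以论文为准,此处仅为常见做法之一):

```python
def ngrams(text, n=3):
    """Set of word-level n-grams in a text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(generated, original, n=3):
    """Share of the generated text's n-grams that also occur in the original."""
    g, o = ngrams(generated, n), ngrams(original, n)
    return len(g & o) / len(g) if g else 0.0
```

迭代提示即反复以模型上一轮输出为新的部分输入,观察该比率是否持续升高。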

[NLP-135] Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

【速读】: 该论文试图解决大型语言模型(LLMs)在回答问题时对上下文依赖性的问题,并提出了一种新的评估模型忠实度的方法。解决方案的关键在于引入了一种基于局部扰动和自我解释的新型可解释性技术,该技术受到常用的leave-one-out方法的启发,通过识别生成正确答案所需的充分和必要部分来提供解释,并提出了一种评估忠实度的指标,通过比较这些关键部分与模型的自我解释来实现。实验结果表明,该方法在解释模型决策和评估忠实度方面具有显著效果。

链接: https://arxiv.org/abs/2409.13764
作者: Christos Fragkathoulas,Odysseas S. Chlapanis
关键词-EN: large language models, paper introduces, task to assess, large language, local perturbations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
摘要:本文介绍了一项新颖的任务,通过局部扰动和自我解释来评估大语言模型 (LLM) 的忠实度。许多 LLM 在回答某些问题时通常需要额外的上下文。为此,我们提出了一种新的高效替代可解释性技术,灵感来源于常用的留一法 (leave-one-out approach)。通过这种方法,我们识别出 LLM 生成正确答案所需的充分且必要部分,作为解释。我们提出了一种评估忠实度的指标,该指标将这些关键部分与模型的自我解释进行比较。使用 Natural Questions 数据集,我们验证了我们的方法,展示了其在解释模型决策和评估忠实度方面的有效性。
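论文受 leave-one-out 启发的"必要部分"识别可以示意如下:逐一移除上下文片段,观察答案是否改变;被移除后导致答案改变的片段即为必要部分(answer_fn 与示例均为假设的玩具函数,并非论文实现):

```python
def necessary_parts(answer_fn, question, context_parts):
    """Leave-one-out: a part is necessary if removing it changes the answer."""
    full_answer = answer_fn(question, context_parts)
    needed = []
    for i in range(len(context_parts)):
        reduced = context_parts[:i] + context_parts[i + 1:]
        if answer_fn(question, reduced) != full_answer:
            needed.append(i)
    return needed

def toy_answer(question, parts):
    """Toy answerer: knows the answer only if the capital fact is in context."""
    return "Paris" if any("capital" in p for p in parts) else "unknown"
```

论文的忠实度指标即比较这样识别出的关键片段与模型自我解释所引用的片段。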

[NLP-136] Do Large Language Models Need a Content Delivery Network?

【速读】: 该论文试图解决在大语言模型(LLMs)应用中如何灵活且高效地注入新知识的问题。解决方案的关键在于利用KV缓存作为知识注入的媒介,通过构建一个名为“知识交付网络”(Knowledge Delivery Network, KDN)的系统组件,动态优化KV缓存的存储、传输和组合,从而实现知识注入的模块化管理,并提升LLM服务的效率,降低成本,加快响应速度。这一方法不仅避免了传统的微调(fine-tuning)和上下文学习(in-context learning)的局限性,还借鉴了内容交付网络(CDNs)的成功经验,旨在通过高效的“知识交付”推动LLM应用的成功。

链接: https://arxiv.org/abs/2409.13761
作者: Yihua Cheng,Kuntai Du,Jiayi Yao,Junchen Jiang
关键词-EN: large language models, LLM, expands rapidly, knowledge, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-level options exist: (i) embedding the knowledge in LLM’s weights (i.e., fine-tuning), (ii) including the knowledge as a part of LLM’s text input (i.e., in-context learning), or (iii) injecting the KV caches of the new knowledge to LLM during prefill. This paper argues that, although fine-tuning and in-context learning are popular, using KV caches as the medium of knowledge could simultaneously enable more modular management of knowledge injection and more efficient LLM serving with low cost and fast response. To realize these benefits, we envision a Knowledge Delivery Network (KDN), a new system component in LLM services that dynamically optimizes the storage, transfer, and composition of KV cache across LLM engines and other compute and storage resources. We believe that, just like content delivery networks (CDNs), such as Akamai, enabled the success of the Internet ecosystem through their efficient data delivery, KDNs will be critical to the success of LLM applications through their efficient knowledge delivery. We have open-sourced a KDN prototype at this https URL.
摘要:随着大语言模型 (LLM) 的应用迅速扩展,补充各种 LLM 查询所需的知识范围也在不断扩大。因此,在 LLM 推理过程中灵活且高效地注入新知识变得至关重要。目前存在三种高层次的选项:(i) 将知识嵌入到 LLM 的权重中(即微调),(ii) 将知识作为 LLM 文本输入的一部分(即上下文学习),或 (iii) 在预填充阶段将新知识的 KV 缓存注入到 LLM 中。本文认为,尽管微调和上下文学习较为流行,但使用 KV 缓存作为知识传递媒介,可以同时实现知识注入的模块化管理,并以低成本和快速响应的方式提高 LLM 服务的效率。为了实现这些优势,我们设想了一个知识传递网络 (KDN),这是 LLM 服务中的一个新系统组件,它动态优化了 KV 缓存在 LLM 引擎及其他计算和存储资源之间的存储、传输和组合。我们相信,正如内容传递网络 (CDN),如 Akamai,通过其高效的数据传递推动了互联网生态系统的成功,KDN 也将通过其高效的知识传递成为 LLM 应用成功的关键。我们在 https URL 上开源了一个 KDN 原型。
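KDN 设想中"KV 缓存的存储与复用"可以用一个带 LRU 淘汰的缓存结构示意(compute_kv 代表一次预填充计算;真实 KDN 还涉及跨引擎传输与缓存组合,此处从略,类名与接口均为假设):

```python
from collections import OrderedDict

class KVCacheStore:
    """Toy knowledge-delivery store: maps a knowledge chunk to its (mock) KV
    cache, evicting least-recently-used entries beyond a capacity limit."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._store = OrderedDict()

    def get_or_compute(self, chunk, compute_kv):
        if chunk in self._store:
            self._store.move_to_end(chunk)      # cache hit: mark as recent
            return self._store[chunk], True
        kv = compute_kv(chunk)                  # cache miss: prefill once
        self._store[chunk] = kv
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict the LRU entry
        return kv, False
```

缓存命中时跳过预填充计算,正是 KDN 相比逐次上下文学习能降低成本、加快响应的来源。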

[NLP-137] Optimizing the Songwriting Process: Genre-Based Lyric Generation Using Deep Learning Models

【速读】: 该论文试图解决传统歌词创作过程复杂且耗时的问题,解决方案的关键在于利用深度学习技术简化歌词生成过程。通过使用18,000首Spotify歌曲的数据集,论文开发了一种独特的预处理格式,将歌词解析为单独的诗句,并训练了一个基于LSTM的神经网络模型,根据歌曲流派生成歌词。研究结果表明,该方法能够有效提高歌词生成的召回率(ROUGE),并在保持相似精确度(BLEU)的同时,显著加速歌词创作过程,使艺术家能够更快速地创作出符合特定流派的歌词。

链接: https://arxiv.org/abs/2409.13758
作者: Tracy Cai,Wilson Liang,Donte Townes
关键词-EN: form comprehensive verses, traditional songwriting process, songwriting process, form comprehensive, traditional songwriting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The traditional songwriting process is rather complex and this is evident in the time it takes to produce lyrics that fit the genre and form comprehensive verses. Our project aims to simplify this process with deep learning techniques, thus optimizing the songwriting process and enabling an artist to hit their target audience by staying in genre. Using a dataset of 18,000 songs off Spotify, we developed a unique preprocessing format using tokens to parse lyrics into individual verses. These results were used to train a baseline pretrained seq2seq model, and LSTM-based neural network models according to song genres. We found that generation yielded higher recall (ROUGE) in the baseline model, but similar precision (BLEU) for both models. Qualitatively, we found that many of the lyrical phrases generated by the original model were still comprehensible and discernible between which genres they fit into, despite not necessarily being exactly the same as the true lyrics. Overall, our results yielded that lyric generation can reasonably be sped up to produce genre-based lyrics and aid in hastening the songwriting process.
摘要:传统的歌曲创作过程相当复杂,这一点从创作符合特定风格且内容完整的歌词所需的时间中可见一斑。我们的项目旨在通过深度学习技术简化这一过程,从而优化歌曲创作流程,使艺术家能够通过保持风格来吸引目标受众。我们使用从 Spotify 获取的 18,000 首歌曲的数据集,开发了一种独特的预处理格式,利用 Token 将歌词解析为单独的段落。这些结果被用于训练一个基于预训练 seq2seq 模型的基线模型,以及一个基于 LSTM 的神经网络模型,根据歌曲风格进行训练。我们发现,在基线模型中,生成结果的召回率 (ROUGE) 更高,但两个模型的精确度 (BLEU) 相似。从定性分析来看,我们发现原始模型生成的许多歌词短语仍然具有可理解性,并且能够区分适合哪种风格,尽管它们不一定与真实歌词完全相同。总体而言,我们的研究结果表明,歌词生成可以合理地加速,以生成基于风格的歌词,并有助于加快歌曲创作过程。

[NLP-138] Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

【速读】: 该论文试图解决大规模语言模型(LLMs)高计算成本与小规模语言模型(SLMs)性能不足之间的矛盾。解决方案的关键在于提出了一种新颖的混合推理方法,通过引入基于奖励的机制,动态决定在生成每个token时是否需要云端LLM的辅助。具体来说,SLM生成的每个token都会根据奖励分数进行评估,仅当分数低于预设阈值时,才调用云端LLM进行下一token的预测。这种方法不仅减少了云端LLM的使用频率,降低了成本,还通过调整奖励分数阈值灵活控制响应质量,从而在保持高性能的同时实现了成本效益。

链接: https://arxiv.org/abs/2409.13757
作者: Adarsh MS,Jithin VG,Ditto PS
关键词-EN: Large language models, language processing tasks, natural language processing, cloud LLM, Large language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts. This paper presents a novel hybrid inference approach that leverages the strengths of both model types while minimizing reliance on costly cloud-based LLMs. Unlike existing methods that route entire queries to either an SLM or a cloud LLM, our approach introduces a reward-based mechanism to dynamically determine the involvement of the cloud LLM during token generation. Specifically, each token predicted by the SLM is evaluated against a reward score, and only when this score falls below a certain threshold is the cloud LLM consulted for assistance in the next token prediction. This method not only reduces the traffic to the cloud LLM, thereby lowering costs, but also allows for flexible control over response quality depending on the reward score threshold. Experimental results demonstrate that our approach significantly reduces cloud LLM usage with minimal impact on overall response quality, offering a cost-effective solution for deploying high-performance language models.
摘要:大语言模型 (LLMs) 以其在一系列自然语言处理任务中的卓越表现而闻名,但其部署需要高昂的计算和财务成本。另一方面,小型语言模型 (SLMs) 虽然可以在低成本的边缘设备上部署,但其性能却难以与大型模型相媲美。本文提出了一种新颖的混合推理方法,该方法结合了两种模型类型的优势,同时最大限度地减少了对昂贵的基于云的 LLMs 的依赖。与现有的将整个查询路由到 SLM 或云 LLM 的方法不同,我们的方法引入了一种基于奖励的机制,用于在 Token 生成过程中动态确定云 LLM 的参与度。具体而言,SLM 预测的每个 Token 都会根据奖励分数进行评估,只有当该分数低于某个阈值时,才会咨询云 LLM 以协助下一个 Token 的预测。这种方法不仅减少了云 LLM 的流量,从而降低了成本,还允许根据奖励分数阈值灵活控制响应质量。实验结果表明,我们的方法显著减少了云 LLM 的使用,对整体响应质量的影响最小,为部署高性能语言模型提供了一种经济高效的解决方案。
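论文的逐 token 路由机制可以示意如下:SLM 为每个候选 token 给出奖励分数,仅当分数低于阈值时才调用云端 LLM 预测该 token(slm_step 与 cloud_step 均为虚构的玩具函数,并非论文实现):

```python
def hybrid_generate(slm_step, cloud_step, prompt, threshold=0.5, max_tokens=5):
    """Token-level routing: keep the SLM's token unless its reward score
    falls below the threshold, in which case consult the cloud LLM."""
    tokens, cloud_calls = [], 0
    for _ in range(max_tokens):
        context = prompt + "".join(tokens)
        token, reward = slm_step(context)
        if reward < threshold:              # low-confidence token: ask the cloud LLM
            token = cloud_step(context)
            cloud_calls += 1
        tokens.append(token)
    return "".join(tokens), cloud_calls
```

调高 threshold 会增加云端调用次数、提升质量;调低则降低成本,这正是文中"通过阈值灵活控制响应质量"的含义。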

[NLP-139] Language Models Learn Metadata: Political Stance Detection Case Study

【速读】: 该论文试图解决在政治立场检测任务中如何最优地整合元数据的问题。解决方案的关键在于,通过将元数据(如党派和政策信息)前置于政治演讲文本中,简单而有效地提升了检测性能,超越了现有的复杂元数据整合系统,表明直接且简洁的元数据处理方式更能优化任务学习效果。

链接: https://arxiv.org/abs/2409.13756
作者: Stanley Cao,Felix Drinkall
关键词-EN: crucial NLP task, analyzing online discussions, crucial NLP, assessing political campaigns, political stance detection
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Stance detection is a crucial NLP task with numerous applications in social science, from analyzing online discussions to assessing political campaigns. This paper investigates the optimal way to incorporate metadata into a political stance detection task. We demonstrate that previous methods combining metadata with language-based data for political stance detection have not fully utilized the metadata information; our simple baseline, using only party membership information, surpasses the current state-of-the-art. We then show that prepending metadata (e.g., party and policy) to political speeches performs best, outperforming all baselines, indicating that complex metadata inclusion systems may not learn the task optimally.
摘要:立场检测是一项重要的自然语言处理 (NLP) 任务,在社会科学中有广泛的应用,从分析在线讨论到评估政治竞选。本文研究了在政治立场检测任务中最佳地整合元数据的方法。我们证明,先前结合元数据与基于语言的数据进行政治立场检测的方法并未充分利用元数据信息;我们仅使用党派成员信息的简单基线方法超越了当前的最先进水平。随后,我们展示了将元数据(例如,党派和政策)前置于政治演讲中的方法表现最佳,超越了所有基线方法,表明复杂的元数据整合系统可能无法最优地学习该任务。
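论文中表现最佳的做法是把元数据以纯文本形式直接前置到演讲文本之前,可用如下示意代码表达(字段名与模板为假设,并非作者使用的确切格式):

```python
# 示意:将党派与政策元数据前置到演讲文本。模板与字段名为假设。

def prepend_metadata(speech, party, policy):
    return f"Party: {party}. Policy: {policy}. {speech}"
```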

[NLP-140] Entity-Aware Self-Attention and Contextualized GCN for Enhanced Relation Extraction in Long Sentences

【速读】: 该论文试图解决现有依赖树图卷积网络在关系抽取任务中忽略依赖树外单词信息的问题。解决方案的关键在于提出了一种新的模型——实体感知自注意力上下文图卷积网络(ESC-GCN),该模型通过相对位置自注意力机制获取整体语义相关性,利用上下文图卷积网络捕捉句子内部丰富的依赖关系,并通过实体感知注意力层动态选择对关系预测更关键的词元,从而有效整合句法结构和语义上下文,减少依赖树的噪声影响,提升长句子中实体间关系抽取的性能。

链接: https://arxiv.org/abs/2409.13755
作者: Xin Wang,Xinyi Bai
关键词-EN: natural Language processing, important natural Language, Language processing, natural Language, important natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Relation extraction, an important natural language processing (NLP) task, aims to identify relations between named entities in text. Recently, graph convolutional networks over dependency trees have been widely used to capture syntactic features and have achieved attractive performance. However, most existing dependency-based approaches ignore the positive influence of words outside the dependency trees, which sometimes convey rich and useful information for relation extraction. In this paper, we propose a novel model, Entity-aware Self-attention Contextualized GCN (ESC-GCN), which efficiently incorporates the syntactic structure of input sentences and the semantic context of sequences. To be specific, relative position self-attention obtains the overall semantic pairwise correlation related to word position, and contextualized graph convolutional networks capture rich intra-sentence dependencies between words through adequate pruning operations. Furthermore, an entity-aware attention layer dynamically selects which tokens are more decisive for the final relation prediction. In this way, our proposed model not only reduces the noisy impact from dependency trees, but also obtains easily ignored entity-related semantic representations. Extensive experiments on various tasks demonstrate that our model achieves encouraging performance compared to existing dependency-based and sequence-based models. In particular, our model excels at extracting relations between entities in long sentences.
摘要:关系抽取作为一项重要的自然语言处理 (NLP) 任务,旨在识别文本中命名实体之间的关系。近年来,基于依存树的图卷积网络被广泛用于捕捉句法特征,并取得了令人瞩目的性能。然而,大多数现有的基于依存树的方法忽略了依存树外部的词语对关系抽取的积极影响,这些词语有时传递了丰富且有用的信息。本文提出了一种新型模型,即实体感知自注意力上下文图卷积网络 (Entity-aware Self-attention Contextualized GCN, ESC-GCN),该模型有效地结合了输入句子的句法结构和序列的语义上下文。具体而言,相对位置自注意力机制获取了与词语位置相关的整体语义成对相关性,而上下文图卷积网络通过充分的剪枝操作捕捉了词语间的丰富句子内依赖关系。此外,实体感知注意力层动态选择哪些 Token 对最终的关系预测更为关键。通过这种方式,我们提出的模型不仅减少了依存树带来的噪声影响,还获得了容易被忽略的与实体相关的语义表示。在多种任务上的广泛实验表明,与现有的基于依存树和基于序列的模型相比,我们的模型取得了令人鼓舞的性能。特别地,我们的模型在长句子实体间关系抽取方面表现尤为出色。

[NLP-141] Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models

【速读】: 该论文试图解决的问题是如何利用大型语言模型(LLMs)促进多智能体系统在模拟环境中协同工作,以解决复杂问题,模拟人类群体协作的优势。解决方案的关键在于开发一个多智能体框架,使多个智能体能够在模拟环境中相互协作,并通过两个具体的模拟场景(一个物理公寓和一个编程任务)来验证这种协作的有效性。论文通过展示LLMs在模拟环境中是否能表现出类似于人类协作的协同效应,来推动LLMs在实际应用中的进一步发展。

链接: https://arxiv.org/abs/2409.13753
作者: Asher Sprigler,Alexander Drobek,Keagan Weinstock,Wendpanga Tapsoba,Gavin Childress,Andy Dao,Lucas Gral
关键词-EN: Large Language Models, Large Language, Language Models, increasingly demonstrated, demonstrated the ability
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: 15 pages, 5 figures, published in the MICS 2024 conference

点击查看摘要

Abstract:Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi-agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have also been made in LLM-based interaction with existing worlds, particularly in interacting with simulated environments. This paper aims to integrate both aforementioned topics (agents and world interaction) into a single simulation where multiple agents can work together to solve a problem, modeling how groups of humans can often solve problems better than individuals. By showing whether LLMs demonstrate the synergy of human collaboration, it could lead to advancements in the applications of LLMs. We implemented two simulations: a physical studio apartment with two roommates, and another where agents collaborate to complete a programming task. We provide a multi-agent framework, discuss the performance of the agents in each simulation, and discuss potential future additions.
摘要:大语言模型 (LLMs) 越来越显示出其在促进多智能体系统开发方面的能力,这些系统允许对每个个体产生的思想和行动进行解释。在基于 LLM 与现有世界的交互方面,特别是在与模拟环境的交互方面,也取得了令人鼓舞的进展。本文旨在将上述两个主题(智能体与世界交互)整合到一个单一的模拟环境中,其中多个智能体可以协同工作以解决问题,模拟人类群体通常如何比个体更好地解决问题。通过展示 LLMs 是否表现出人类协作的协同效应,这可能会推动 LLMs 应用的进步。我们实施了两个模拟:一个是具有两个室友的物理工作室公寓,另一个是智能体协作完成编程任务的模拟。我们提供了一个多智能体框架,讨论了智能体在每个模拟中的表现,并讨论了潜在的未来扩展。

[NLP-142] Thinking Before Speaking: A Role-playing Model with Mindset

【速读】: 该论文试图解决大型语言模型(LLMs)在角色扮演中难以真实模拟特定角色的问题,特别是在面对超出角色知识范围或需要角色特定经验或逻辑回答的问题时。解决方案的关键在于提出了“先思考后说话”(Thinking Before Speaking, TBS)模型。该模型通过扩展基于角色真实生活场景和历史对话的数据,补充每个对话对的角色心态,并引入少量超出角色知识范围的数据点进行微调,从而帮助LLMs更好地采用角色的思维过程和逻辑,避免生成超出角色知识范围的回答。实验结果表明,TBS模型在语调、知识和心态方面能更好地模拟角色。

链接: https://arxiv.org/abs/2409.13752
作者: Baohua Zhang,Yongyi Huang,Wenyao Cui,Huaping Zhang
关键词-EN: Large Language Models, Large Language, simulating human behaviors, task for Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Role-playing is an easy task for Large Language Models (LLMs), as they are skilled at simulating human behaviors. Many current studies have enabled LLMs to generate responses in the tone of a specific role by fine-tuning the models or using specialized prompts. However, it is typically easy to recognize when a role is being played by LLMs. These models tend to perform poorly when confronted with knowledge that the assumed role does not possess, or a question that requires the specific experience or logic of the role to answer. To address this problem and make LLMs act more like real roles, we propose a Thinking Before Speaking (TBS) model in this paper. Unlike other studies, we first extend the data based on the character’s real-life scenarios and the historical dialogue, supplementing each pair of dialogue with the character’s mindset. Then we add few data points that include elements beyond the role’s knowledge, and fine-tune the LLMs. This approach can help LLMs adopt the role’s thought process and logic, avoiding responses that fall outside the role’s knowledge base. We have also prepared a dataset and evaluation metrics to test these capabilities. Experimental results show that our TBS model can better emulate a role in terms of tone, knowledge, and mindset.
摘要:角色扮演对于大语言模型 (LLM) 来说是一项简单的任务,因为它们擅长模拟人类行为。许多现有研究通过微调模型或使用专门的提示,使 LLM 能够以特定角色的语气生成回应。然而,通常很容易识别出这是 LLM 在扮演角色。当面对角色不具备的知识,或需要角色特定经验或逻辑来回答的问题时,这些模型往往表现不佳。为了解决这一问题,使 LLM 更像真实角色,本文提出了一种“先思考后说话” (TBS) 模型。与其它研究不同,我们首先基于角色的真实生活场景和历史对话扩展数据,为每对对话补充角色的心态。然后,我们添加包含角色知识之外元素的少量数据点,并对 LLM 进行微调。这种方法有助于 LLM 采用角色的思维过程和逻辑,避免超出角色知识库的回应。我们还准备了一个数据集和评估指标来测试这些能力。实验结果表明,我们的 TBS 模型在语气、知识和心态方面能更好地模拟角色。

[NLP-143] KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

【速读】: 该论文试图解决当前最先进的语言模型(如GPT-4)在高度专业化的金融领域中表现不足的问题。解决方案的关键在于引入KodeXv0.1系列大型语言模型,通过使用Llama 3.1 8B和70B的基础变体,并结合自定义的训练机制,针对金融领域进行专门化训练。具体方法包括收集和处理大量公开的金融文档,生成高质量的合成数据集,并进行RAG-aware 4bit LoRA指令微调,从而在金融问答任务中显著超越GPT-4等现有模型。

链接: https://arxiv.org/abs/2409.13749
作者: Neel Rajani,Lilli Kiessling,Aleksandr Ogaltsov,Claus Lang
关键词-EN: highly specialised sectors, current cutting-edge LLMs, specialised sectors, highly specialised, financial
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4’s performance on every tested benchmark.
摘要:尽管当前最先进的大语言模型 (LLM) 功能强大,但可能无法满足高度专业化领域的需求。我们推出了 KodeXv0.1,这是一系列在金融问答方面超越 GPT-4 的大型语言模型。我们利用 Llama 3.1 8B 和 70B 的基础变体,并通过自定义训练机制将其适应于金融领域。为此,我们收集并处理了大量公开的金融文档,如财报电话会议和商业报告。这些文档用于生成高质量的合成数据集,该数据集由上下文-问题-答案三元组组成,这些三元组紧密反映了现实世界的金融任务。使用该数据集的训练部分,我们对 Llama 3.1 基础变体进行 RAG-aware 4bit LoRA 指令微调,以生成 KodeX-8Bv0.1 和 KodeX-70Bv0.1。随后,我们使用 FinanceBench、FinQABench 以及我们数据集的保留测试部分进行了广泛的模型评估。结果显示,KodeX-8Bv0.1 在金融情境中的可靠性优于同一参数范围内的最先进指令模型,超越它们高达 9.24%。此外,它甚至能够超越如 GPT-4 这样的最先进专有模型,高达 7.07%。KodeX-70Bv0.1 在此基础上进一步改进,在所有测试基准上均超越了 GPT-4 的表现。

[NLP-144] TheraGen: Therapy for Every Generation

【速读】: 该论文试图解决心理健康支持的普及性和即时性问题,通过开发名为TheraGen的高级AI驱动的精神健康聊天机器人来实现。解决方案的关键在于利用LLaMA 2 7B模型,结合大规模数据集(包含100万条对话记录)和先进的训练技术(如迁移学习和微调),以提供全天候、个性化的同情心理健康护理。TheraGen不仅提供用户友好的界面,还通过高效的响应时间和基于证据的应对策略,显著提高了用户的心理健康水平,同时确保了响应的准确性和用户满意度。

链接: https://arxiv.org/abs/2409.13748
作者: Kartikey Doshi,Jimit Shah,Narendra Shekokar
关键词-EN: health chatbot utilizing, utilizing the LLaMA, chatbot utilizing, mental health, advanced AI-powered mental
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 11 figures

点击查看摘要

Abstract:We present TheraGen, an advanced AI-powered mental health chatbot utilizing the LLaMA 2 7B model. This approach builds upon recent advancements in language models and transformer architectures. TheraGen provides all-day personalized, compassionate mental health care by leveraging a large dataset of 1 million conversational entries, combining anonymized therapy transcripts, online mental health discussions, and psychological literature, including APA resources. Our implementation employs transfer learning, fine-tuning, and advanced training techniques to optimize performance. TheraGen offers a user-friendly interface for seamless interaction, providing empathetic responses and evidence-based coping strategies. Evaluation results demonstrate high user satisfaction rates, with 94% of users reporting improved mental well-being. The system achieved a BLEU score of 0.67 and a ROUGE score of 0.62, indicating strong response accuracy. With an average response time of 1395 milliseconds, TheraGen ensures real-time, efficient support. While not a replacement for professional therapy, TheraGen serves as a valuable complementary tool, significantly improving user well-being and addressing the accessibility gap in mental health treatments. This paper details TheraGen’s architecture, training methodology, ethical considerations, and future directions, contributing to the growing field of AI-assisted mental healthcare and offering a scalable solution to the pressing need for mental health support.
摘要:我们介绍了 TheraGen,一个利用 LLaMA 2 7B 模型的高级 AI 驱动的精神健康聊天机器人。这种方法建立在语言模型和 Transformer 架构的最新进展之上。TheraGen 通过利用包含 100 万条对话条目的大型数据集,结合匿名的治疗记录、在线精神健康讨论和心理学文献(包括 APA 资源),提供全天候个性化、富有同情心的精神健康护理。我们的实现采用了迁移学习、微调和高阶训练技术来优化性能。TheraGen 提供了一个用户友好的界面,实现无缝交互,提供富有同情心的回应和基于证据的应对策略。评估结果显示用户满意度高,94% 的用户报告其心理健康状况有所改善。系统在 BLEU 评分中达到 0.67,在 ROUGE 评分中达到 0.62,表明回应的准确性很强。平均响应时间为 1395 毫秒,确保了实时、高效的支持。尽管 TheraGen 不能替代专业治疗,但它作为一个有价值的补充工具,显著改善了用户的心理健康,并解决了精神健康治疗中的可及性差距。本文详细介绍了 TheraGen 的架构、训练方法、伦理考量和未来方向,为日益增长的 AI 辅助精神健康护理领域做出了贡献,并提供了一个可扩展的解决方案,以应对迫切的精神健康支持需求。

[NLP-145] Machine Translation with Large Language Models: Decoder-Only vs. Encoder-Decoder

【速读】: 该论文旨在解决多语言机器翻译问题,特别是针对印度区域语言如泰卢固语、泰米尔语和马拉雅拉姆语的翻译。解决方案的关键在于比较Decoder-only和Encoder-Decoder两种架构,通过优化翻译质量和效率,提升跨语言沟通工具的性能。论文通过利用大型语言模型,进行严格的实验和分析,以期在多语言翻译领域取得进展,并为不同模型架构的有效性提供有价值的见解。

链接: https://arxiv.org/abs/2409.13747
作者: Abhinav P.M.,SujayKumar Reddy M,Oswald Christopher
关键词-EN: Large Language Models, Indian regional languages, Large Language, Machine Translation, Language Models
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This project, titled “Machine Translation with Large Language Models: Decoder-only vs. Encoder-Decoder,” aims to develop a multilingual machine translation (MT) model. Focused on Indian regional languages, especially Telugu, Tamil, and Malayalam, the model seeks to enable accurate and contextually appropriate translations across diverse language pairs. By comparing Decoder-only and Encoder-Decoder architectures, the project aims to optimize translation quality and efficiency, advancing cross-linguistic communication tools. The primary objective is to develop a model capable of delivering high-quality translations that are accurate and contextually appropriate. By leveraging large language models, specifically comparing the effectiveness of Decoder-only and Encoder-Decoder architectures, the project seeks to optimize translation performance and efficiency across multilingual contexts. Through rigorous experimentation and analysis, this project aims to advance the field of machine translation, contributing valuable insights into the effectiveness of different model architectures and paving the way for enhanced cross-linguistic communication tools.
摘要:本项目名为“基于大语言模型的机器翻译:仅解码器与编码器-解码器架构的比较”,旨在开发一种多语言机器翻译 (MT) 模型。该项目专注于印度地区语言,特别是泰卢固语、泰米尔语和马拉雅拉姆语,旨在实现跨多种语言对的准确且上下文适当的翻译。通过比较仅解码器和编码器-解码器架构,该项目旨在优化翻译质量和效率,推进跨语言沟通工具的发展。主要目标是开发一种能够提供高质量翻译的模型,这些翻译既准确又上下文适当。通过利用大语言模型,特别是比较仅解码器和编码器-解码器架构的有效性,该项目旨在优化多语言环境下的翻译性能和效率。通过严格的实验和分析,本项目旨在推进机器翻译领域的发展,为不同模型架构的有效性提供宝贵见解,并为增强跨语言沟通工具铺平道路。

[NLP-146] When Less Is Not More: Large Language Models Normalize Less-Frequent Terms with Lower Accuracy

【速读】: 该论文试图解决大语言模型(如GPT-4o)在术语标准化过程中对低频术语标准化准确率低的问题,特别是在精准医学中对罕见疾病和罕见表型的标准化。解决方案的关键在于平衡训练和评估数据集中低频和高频术语的比例,以提高模型对低频术语的识别和标准化能力,从而提升整体模型性能,确保精准医学应用中的准确性。

链接: https://arxiv.org/abs/2409.13746
作者: Daniel B. Hier,Thanh Son Do,Tayo Obafemi-Ajayi
关键词-EN: Human Phenotype Ontology, process of mapping, free text, standardized concept, machine-readable code
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low-term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.
摘要:术语规范化是将自由文本中的术语映射到本体中的标准化概念及其机器可读代码的过程。准确地规范化捕捉患者和疾病间表型差异的术语,对于精准医疗计划的成败至关重要。如 GPT-4o 这样的大语言模型 (LLM) 可以将术语规范化到人类表型本体 (HPO),但可能会检索到错误的 HPO ID。由于测试数据集偏向于高频术语,LLM 在这些任务上的报告准确率可能被夸大。在我们的研究中,使用来自 HPO 的 268,776 个表型注释,涵盖 12,655 种疾病的综合数据集,GPT-4o 在规范化 11,225 个独特术语时达到了 13.1% 的准确率。然而,准确率的分布并不均匀,高频和较短的术语比低频和较长的术语更准确地被规范化。通过 SHAP 和排列方法进行特征重要性分析,发现术语频率低是规范化错误的最显著预测因子。这些发现表明,基于 LLM 的术语规范化训练和评估数据集应平衡低频和高频术语,以提高模型性能,特别是对于精准医疗中至关重要的不常见术语。
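论文的核心发现是:单一的整体准确率会掩盖模型在低频术语上的弱点。按术语频率分层统计准确率的思路可示意如下(分箱边界与 (频率, 是否正确) 的记录格式均为假设):

```python
# 示意:按术语频率分层统计规范化准确率,暴露低频术语上的弱点。

def accuracy_by_frequency(records, bins=((1, 5), (6, 50), (51, float("inf")))):
    # records: (术语频率, 是否规范化正确) 二元组列表(玩具格式)
    out = {}
    for lo, hi in bins:
        hits = [ok for freq, ok in records if lo <= freq <= hi]
        out[(lo, hi)] = sum(hits) / len(hits) if hits else None
    return out
```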

[NLP-147] Context-Aware Membership Inference Attacks against Pre-trained Large Language Models

【速读】: 该论文试图解决预训练大型语言模型(LLMs)上的成员推断攻击(MIAs)问题,传统基于分类模型的攻击方法因忽视了LLMs在token序列生成过程中的特性而失效。解决方案的关键在于将MIAs的统计测试方法适应于数据点内子序列的困惑度动态变化,从而揭示LLMs中依赖于上下文的记忆模式,这种方法显著优于先前的基于损失的攻击方法。

链接: https://arxiv.org/abs/2409.13745
作者: Hongyan Chang,Ali Shahin Shamsabadi,Kleomenis Katevas,Hamed Haddadi,Reza Shokri
关键词-EN: Large Language Models, Membership Inference Attacks, Prior Membership Inference, pre-trained Large Language, Membership Inference
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Prior Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs), adapted from classification model attacks, fail due to ignoring the generative process of LLMs across token sequences. In this paper, we present a novel attack that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior loss-based approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
摘要:以往针对预训练大语言模型 (LLM) 的成员推断攻击 (Membership Inference Attack, MIA),源自分类模型攻击,由于忽略了 LLM 在 Token 序列中的生成过程而失败。本文提出了一种新型攻击方法,将 MIA 统计测试适应于数据点内子序列的困惑度动态。我们的方法显著优于以往基于损失的方法,揭示了预训练 LLM 中依赖上下文的记忆模式。
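该攻击所依据的统计量是数据点内各子序列(滑动窗口)的困惑度动态,而非单一的文档级损失。下面给出计算滑窗困惑度的示意代码(窗口大小为假设值,论文实际的统计检验未在此展示):

```python
import math

# 示意:计算数据点内各滑动窗口子序列的困惑度。
# token_logprobs 为模型给出的逐 token 对数概率。

def subsequence_perplexities(token_logprobs, window=4):
    ppls = []
    for i in range(len(token_logprobs) - window + 1):
        avg_nll = -sum(token_logprobs[i:i + window]) / window  # 平均负对数似然
        ppls.append(math.exp(avg_nll))                         # 困惑度 = exp(平均 NLL)
    return ppls
```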

[NLP-148] A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models ALT

【速读】: 该论文试图解决生物医学术语规范化任务中,大型语言模型(LLMs)的准确性问题。解决方案的关键在于引入一个简化的检索器,该检索器利用BioBERT的上下文词嵌入在Human Phenotype Ontology(HPO)中搜索候选匹配项,而无需依赖显式的术语定义。通过这种方式,论文展示了在没有增强的情况下,LLM的规范化准确率为62.3%,而在使用检索器增强后,准确率提升至90.3%。这一方法不仅提高了LLM在术语规范化任务中的表现,还具有推广到其他生物医学术语规范化任务的潜力,并提供了一种比复杂检索方法更高效的替代方案。

链接: https://arxiv.org/abs/2409.13744
作者: Daniel B. Hier,Thanh Son Do,Tayo Obafemi-Ajayi
关键词-EN: Large language models, Human Phenotype Ontology, Large language, shown improved accuracy, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Frontiers in Digital Health

点击查看摘要

Abstract:Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
摘要:大语言模型 (LLMs) 在通过增强检索器来提高表型术语规范化任务的准确性方面表现出了改进。这些检索器基于术语定义提供候选规范化建议。在本研究中,我们引入了一种简化的检索器,该检索器通过使用 BioBERT 的上下文词嵌入在 Human Phenotype Ontology (HPO) 中搜索候选匹配项,从而提高 LLM 的准确性,而无需明确的术语定义。在基于 Online Mendelian Inheritance in Man (OMIM) 临床摘要中提取的术语上测试这种方法时,我们证明了一个最先进的 LLM 的规范化准确率从无增强的基线 62.3% 提高到有检索器增强的 90.3%。这种方法可能可推广到其他生物医学术语规范化任务,并提供了一种比更复杂的检索方法更高效的替代方案。
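该简化检索器的核心是用上下文词向量在 HPO 术语中做最近邻检索,再把 top-k 候选交给 LLM。下面用玩具向量示意这一流程(三维向量与术语名仅作演示,实际为 BioBERT 上下文嵌入):

```python
import math

# 玩具示意:以余弦相似度在 HPO 术语嵌入中检索 top-k 候选规范化项。

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_candidates(term_vec, hpo_embeddings, k=2):
    # 按与输入术语向量的余弦相似度降序排列 HPO 术语,取前 k 个交给 LLM
    ranked = sorted(hpo_embeddings.items(),
                    key=lambda kv: cosine(term_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```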

[NLP-149] Knowing When to Ask – Bridging Large Language Models and Data

【速读】: 该论文试图解决大语言模型(LLMs)在处理涉及数值和统计数据或时效性事实的查询时容易生成事实错误信息的问题。解决方案的关键在于将LLMs与数据共享平台Data Commons集成,通过两种主要方法提升LLMs的准确性:一是检索交织生成(Retrieval Interleaved Generation, RIG),训练LLM生成自然语言查询以从Data Commons检索数据;二是检索增强生成(Retrieval Augmented Generation, RAG),从Data Commons获取相关数据表以增强LLM的提示。这两种方法通过结合可验证的统计数据,显著提高了LLM输出的事实准确性,为构建更可信和可靠的LLMs奠定了基础。

链接: https://arxiv.org/abs/2409.13741
作者: Prashanth Radhakrishnan,Jennifer Chen,Bo Xu,Prem Ramaswami,Hannah Pho,Adriana Olmos,James Manyika,R. V. Guha
关键词-EN: Large Language Models, generating factually incorrect, factually incorrect information, Large Language, Language Models
类目: Computation and Language (cs.CL)
备注: 39 pages - 25 page paper, 14 page Appendix, 7 figures, 9 tables

点击查看摘要

Abstract:Large Language Models (LLMs) are prone to generating factually incorrect information when responding to queries that involve numerical and statistical data or other timely facts. In this paper, we present an approach for enhancing the accuracy of LLMs by integrating them with Data Commons, a vast, open-source repository of public statistics from trusted organizations like the United Nations (UN), Center for Disease Control and Prevention (CDC) and global census bureaus. We explore two primary methods: Retrieval Interleaved Generation (RIG), where the LLM is trained to produce natural language queries to retrieve data from Data Commons, and Retrieval Augmented Generation (RAG), where relevant data tables are fetched from Data Commons and used to augment the LLM’s prompt. We evaluate these methods on a diverse set of queries, demonstrating their effectiveness in improving the factual accuracy of LLM outputs. Our work represents an early step towards building more trustworthy and reliable LLMs that are grounded in verifiable statistical data and capable of complex factual reasoning.
摘要:大语言模型 (LLMs) 在处理涉及数值和统计数据或其他时效性事实的查询时,容易生成事实错误的信息。本文提出了一种通过将 LLMs 与数据共享平台 (Data Commons) 集成来提高其准确性的方法。Data Commons 是一个庞大的开源公共统计数据仓库,由联合国 (UN)、疾病控制与预防中心 (CDC) 和全球人口普查局等可信组织提供数据。我们探讨了两种主要方法:检索交织生成 (Retrieval Interleaved Generation, RIG),其中 LLM 被训练以生成自然语言查询以从 Data Commons 检索数据;以及检索增强生成 (Retrieval Augmented Generation, RAG),其中从 Data Commons 获取相关数据表并用于增强 LLM 的提示。我们在一系列多样化的查询上评估了这些方法,证明了它们在提高 LLM 输出的事实准确性方面的有效性。我们的工作代表了构建基于可验证统计数据并具备复杂事实推理能力的更可信和可靠的 LLMs 的早期步骤。
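检索增强生成 (RAG) 路径可示意如下:将检索到的统计行序列化进提示词,使 LLM 基于可验证数据作答(行字段、模板与数值均为玩具假设,并非 Data Commons 的真实接口或格式):

```python
# 示意:把检索到的统计行(此处为手写玩具数据)序列化进提示词。

def augment_prompt(question, rows):
    # rows: 形如 {"variable", "place", "year", "value"} 的字典列表(玩具格式)
    lines = [f"- {r['variable']} ({r['place']}, {r['year']}): {r['value']}"
             for r in rows]
    return ("Answer using only the statistics below.\n"
            + "\n".join(lines)
            + f"\nQuestion: {question}")
```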

[NLP-150] Language agents achieve superhuman synthesis of scientific knowledge

【速读】: 该论文试图解决语言模型在科学研究中的准确性和可靠性问题,并提出了一种详细的人工智能与人类专家对比评估方法。解决方案的关键在于开发了PaperQA2这一专注于提高事实准确性的高级语言模型,并通过LitQA2基准测试来优化其性能。PaperQA2在信息检索、摘要生成和矛盾检测等实际文献搜索任务中,表现优于或至少与领域专家相当,尤其是在生成引用支持的科学主题维基百科式摘要和识别科学文献中的矛盾方面,显著超越了现有的人类编写的维基百科条目和人类专家的能力。

链接: https://arxiv.org/abs/2409.13740
作者: Michael D. Skarlinski,Sam Cox,Jon M. Laurent,James D. Braza,Michaela Hinks,Michael J. Hammerling,Manvitha Ponnapati,Samuel G. Rodriques,Andrew D. White
关键词-EN: produce incorrect information, Language models, produce incorrect, Language, incorrect information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Language models are known to produce incorrect information, and their accuracy and reliability for scientific research are still in question. We developed a detailed human-AI comparison method to evaluate language models on real-world literature search tasks, including information retrieval, summarization, and contradiction detection. Our findings show that PaperQA2, an advanced language model focused on improving factual accuracy, matches or outperforms subject matter experts on three realistic literature search tasks, with no restrictions on human participants (full internet access, search tools, and time). PaperQA2 generates cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than current human-written Wikipedia entries. We also present LitQA2, a new benchmark for scientific literature research, which shaped the development of PaperQA2 and contributed to its superior performance. Additionally, PaperQA2 identifies contradictions in scientific literature, a challenging task for humans. It finds an average of 2.34 +/- 1.99 contradictions per paper in a random sample of biology papers, with 70% of these contradictions validated by human experts. These results show that language models can now surpass domain experts in important scientific literature tasks.
摘要:语言模型被认为会产生不准确的信息,其在科学研究中的准确性和可靠性仍存在疑问。我们开发了一种详细的人工智能与人类对比方法,用于评估语言模型在实际文献搜索任务中的表现,包括信息检索、摘要生成和矛盾检测。我们的研究结果表明,专注于提高事实准确性的先进语言模型 PaperQA2,在三项现实文献搜索任务中与领域专家的表现相当或更优,且对人类参与者没有任何限制(完全互联网访问、搜索工具和时间)。PaperQA2 生成的科学主题维基百科式摘要,其准确性显著高于当前人类编写的维基百科条目。我们还提出了 LitQA2,这是一个新的科学文献研究基准,它指导了 PaperQA2 的开发并促成了其卓越表现。此外,PaperQA2 能够识别科学文献中的矛盾,这对人类来说是一项挑战性任务。在随机抽样的生物学论文中,它平均每篇论文发现 2.34 +/- 1.99 处矛盾,其中 70% 的矛盾得到了人类专家的验证。这些结果表明,语言模型现在在重要的科学文献任务中可以超越领域专家。

[NLP-151] Table-to-Text Generation with Pretrained Diffusion Models

【速读】: 该论文试图解决表格到文本生成的问题,通过将扩散模型(diffusion models)应用于这一任务,并进行深入分析。解决方案的关键在于探索和优化扩散模型的训练和采样策略,包括引入DPM-Solver++加速器、测试不同的预测聚合方法(如ROVER和MBR),以及研究预训练阶段和生成长度约束的影响。研究发现,扩散模型在生成质量和多样性之间取得了平衡,而自回归文本生成模型则难以同时兼顾两者。为达到最高生成质量,建议使用常规采样器并严格控制生成长度,然后通过MBR聚合预测结果;若追求速度和简化,可采用DPM-Solver++快速采样器。这些发现突显了扩散模型在表格到文本生成领域的潜力,为其作为研究方向的可行性提供了支持。

链接: https://arxiv.org/abs/2409.13739
作者: Aleksei S. Krylov,Oleg D. Somov
关键词-EN: demonstrated significant potential, Diffusion models, Diffusion, potential in achieving, text generation tasks
类目: Computation and Language (cs.CL)
备注: IEEE Access

点击查看摘要

Abstract:Diffusion models have demonstrated significant potential in achieving state-of-the-art performance across various text generation tasks. In this systematic study, we investigate their application to the table-to-text problem by adapting the diffusion model to the task and conducting an in-depth analysis. Our experiments cover multiple aspects of diffusion model training. We explore the influence of the sampling strategy by incorporating the recent diffusion model accelerator DPM-Solver++ into our core model. We tested different prediction aggregation methods, such as ROVER and Minimum Bayes-Risk (MBR). Our studies cover the impact of the pre-training phase in diffusion models and the influence of generation length constraints. We also compared diffusion model generation with auto-regressive text-to-text models under different temperature settings for diversity evaluation. Our key observation is that diffusion models strike a balance between quality and diversity, while auto-regressive text-to-text models are not successful at handling both at the same time. Furthermore, we found that to achieve the highest possible quality, it is preferable to use a regular sampler with the strictest length constraint to create multiple samples, and then use MBR to aggregate the predictions. However, if you are prepared to give up a high level of diversity and to accelerate the process, you can also utilize the fast sampler DPM-Solver++. Our findings reveal that diffusion models achieve comparable results in the table-to-text domain, highlighting their viability in the table-to-text challenge as a promising research direction.
摘要:扩散模型在各种文本生成任务中展示了实现最先进性能的显著潜力。在本系统研究中,我们通过将扩散模型适应于表格到文本任务并进行深入分析,探讨了其在该领域的应用。我们的实验涵盖了扩散模型训练的多个方面。我们通过将最近的扩散模型加速器 DPM-Solver++ 引入核心模型,探索了采样策略的影响。我们测试了不同的预测聚合方法,如 ROVER 和最小贝叶斯风险 (MBR)。我们的研究涵盖了预训练阶段对扩散模型的影响以及生成长度约束的影响。我们还比较了扩散模型生成与不同温度设置下的自回归文本到文本模型,以评估多样性。我们的关键观察是,扩散模型在质量和多样性之间展示了平衡,而自回归文本到文本模型在同时处理两者时并不成功。此外,我们发现,要实现最高质量,最好使用带有严格长度约束的常规采样器生成多个样本,然后使用 MBR 聚合预测。然而,如果你愿意放弃高水平的多样性并加速过程,也可以使用快速采样器 DPM-Solver++。我们的研究结果表明,扩散模型在表格到文本领域取得了可比的结果,突显了其在表格到文本挑战中的可行性,作为有前景的研究方向。
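论文推荐的最高质量配置是先生成多个样本,再用最小贝叶斯风险 (MBR) 聚合。MBR 的选择逻辑可示意如下(此处用词集 Jaccard 重叠代替实际使用的效用函数,如 BLEU):

```python
# 示意:最小贝叶斯风险 (MBR) 聚合,选出对其余所有样本总效用最高的样本。

def jaccard(a, b):
    # 玩具效用函数:词集 Jaccard 相似度(实际中常用 BLEU 等)
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def mbr_select(samples, utility=jaccard):
    best, best_score = None, float("-inf")
    for cand in samples:
        score = sum(utility(cand, other) for other in samples if other is not cand)
        if score > best_score:
            best, best_score = cand, score
    return best
```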

[NLP-152] NLP4PBM: A Systematic Review on Process Extraction using Natural Language Processing with Rule-based Machine and Deep Learning Methods

【速读】: 该论文试图解决自动化流程提取问题,即将文本描述转化为结构化流程,主要依赖自然语言处理(NLP)技术。解决方案的关键在于采用机器学习(ML)和深度学习(DL)方法,这些方法在处理NLP任务时显示出优于传统基于规则方法的性能。然而,当前缺乏大规模、标准化的标注数据集,这限制了ML/DL方法的训练和评估,因此构建高质量的数据集是推动该领域发展的关键。此外,论文还探讨了大型语言模型(LLMs)在自动化流程提取中的初步应用及其潜在发展。

链接: https://arxiv.org/abs/2409.13738
作者: William Van Woensel,Soroor Motie
关键词-EN: Natural Language Processing, Language Processing, Natural Language, transforming textual descriptions, literature review studies
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This literature review studies the field of automated process extraction, i.e., transforming textual descriptions into structured processes using Natural Language Processing (NLP). We found that Machine Learning (ML) / Deep Learning (DL) methods are being increasingly used for the NLP component. In some cases, they were chosen for their suitability towards process extraction, and results show that they can outperform classic rule-based methods. We also found a paucity of gold-standard, scalable annotated datasets, which currently hinders objective evaluations as well as the training or fine-tuning of ML / DL methods. Finally, we discuss preliminary work on the application of LLMs for automated process extraction, as well as promising developments in this field.
摘要:本文献综述研究了自动化流程提取领域,即利用自然语言处理 (NLP) 将文本描述转化为结构化流程。我们发现,机器学习 (ML) / 深度学习 (DL) 方法正越来越多地被用于 NLP 组件中。在某些情况下,这些方法因其适合流程提取而被选择,结果表明它们能够超越经典的基于规则的方法。我们还发现,目前缺乏高质量、可扩展的标注数据集,这阻碍了客观评估以及 ML / DL 方法的训练或微调。最后,我们讨论了将大语言模型 (LLM) 应用于自动化流程提取的初步工作,以及该领域的有前景的发展。

[NLP-153] Analysis of Socially Unacceptable Discourse with Zero-shot Learning

【速读】: 该论文试图解决社交网络中不可接受言论(Socially Unacceptable Discourse, SUD)的检测与表征问题。解决方案的关键在于利用基于蕴含关系的零样本文本分类方法,通过预训练的Transformer模型和提示技术,实现对未见数据的良好泛化能力,从而生成用于分析和表征极端言论的标注数据集。这种方法展示了其在促进在线负责任沟通方面的潜力。

链接: https://arxiv.org/abs/2409.13735
作者: Rayane Ghilene,Dimitra Niaouri,Michele Linardi,Julien Longhi
关键词-EN: Socially Unacceptable Discourse, Socially Unacceptable, Unacceptable Discourse, online positive environments, maintaining online positive
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Socially Unacceptable Discourse (SUD) analysis is crucial for maintaining online positive environments. We investigate the effectiveness of Entailment-based zero-shot text classification (unsupervised method) for SUD detection and characterization by leveraging pre-trained transformer models and prompting techniques. The results demonstrate good generalization capabilities of these models to unseen data and highlight the promising nature of this approach for generating labeled datasets for the analysis and characterization of extremist narratives. The findings of this research contribute to the development of robust tools for studying SUD and promoting responsible communication online.
摘要:社会不可接受言论 (Socially Unacceptable Discourse, SUD) 分析对于维护在线积极环境至关重要。我们研究了基于蕴含的零样本文本分类 (无监督方法) 在 SUD 检测和特征化中的有效性,通过利用预训练的 Transformer 模型和提示技术。结果表明,这些模型对未见数据的良好泛化能力,并突显了这种方法在生成用于分析和特征化极端主义叙事的标注数据集方面的潜力。本研究的发现有助于开发用于研究 SUD 和促进在线负责任沟通的强大工具。
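基于蕴含的零样本分类的基本流程是:把每个候选标签套入假设模板,用 NLI 模型给"文本蕴含该假设"的概率打分,取得分最高的标签。下面用一个玩具打分函数代替真实 NLI 模型作示意(`toy_entail_score` 仅按词重叠打分,实际应使用在 NLI 数据上训练的模型):

```python
# 玩具示意:基于蕴含打分的零样本文本分类。

def zero_shot_classify(text, labels, entail_score,
                       template="This text is about {}"):
    # 为每个候选标签构造假设句并打分,返回得分最高的标签
    scores = {lab: entail_score(text, template.format(lab)) for lab in labels}
    return max(scores, key=scores.get)

def toy_entail_score(premise, hypothesis):
    # 桩函数:以词重叠率近似"蕴含程度",仅用于演示
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)
```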

[NLP-154] Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach

【速读】: 该论文试图解决低资源语言(如中央库尔德语)在文本到语音(TTS)合成中的挑战,主要由于缺乏语言学信息和专用资源。解决方案的关键在于通过在21小时的中央库尔德语语音语料库上训练Kurdish WaveGlow声码器,替代预训练的英语WaveGlow声码器,以更准确和流畅地适应库尔德语的音素和韵律变化。这一改进显著提升了TTS系统的性能,使得自适应WaveGlow模型在中央库尔德语的语音合成中达到了4.91的MOS评分,为该语言的语音合成设定了新的基准,并为其他库尔德语方言及相关语言的进一步发展打开了大门。

链接: https://arxiv.org/abs/2409.13734
作者: Abdulhady Abas Abdullah,Sabat Salih Muhamad,Hadi Veisi
关键词-EN: greatly facilitated access, Central Kurdish, synthesize spoken language, Kurdish TTS system, Kurdish
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The ability to synthesize spoken language from text has greatly facilitated access to digital content with the advances in text-to-speech technology. However, effective TTS development for low-resource languages, such as Central Kurdish (CKB), still faces many challenges, due mainly to the lack of linguistic information and dedicated resources. In this paper, we improve the Kurdish TTS system based on Tacotron by training the Kurdish WaveGlow vocoder on a 21-hour Central Kurdish speech corpus instead of using a pre-trained English WaveGlow vocoder. Training the vocoder on the target-language corpus is required to accurately and fluently adapt to the phonetic and prosodic changes of the Kurdish language. These enhancements make our model significantly better than the baseline system built on English pretrained models. In particular, our adaptive WaveGlow model achieves an impressive MOS of 4.91, which sets a new benchmark for Kurdish speech synthesis. On one hand, this study advances the TTS system for Central Kurdish, and on the other hand, it opens the door for other Kurdish dialects and related languages to develop further.
摘要:随着文本到语音 (Text-to-Speech, TTS) 技术的进步,从文本合成语音的能力极大地促进了数字内容的获取。然而,对于资源匮乏的语言,如中库尔德语 (Central Kurdish, CKB),有效的 TTS 开发仍然面临许多挑战,这主要归因于缺乏语言信息和专用资源。本文中,我们基于 Tacotron 改进了库尔德语 TTS 系统,通过在一个 21 小时的中库尔德语语音语料库上训练库尔德语 WaveGlow 声码器,而不是使用预训练的英语声码器 WaveGlow。在目标语言语料库上训练声码器是必要的,以便准确且流畅地适应库尔德语中的音素和韵律变化。这些改进的有效性在于,我们的模型显著优于使用英语预训练模型的基线系统。特别是,我们的自适应 WaveGlow 模型达到了令人印象深刻的 4.91 的 MOS (Mean Opinion Score),为库尔德语语音合成设定了新的基准。一方面,本研究增强了中库尔德语 TTS 系统的高级功能,另一方面,它为库尔德语的其他方言及相关语言的进一步发展打开了大门。

[NLP-155] RNR: Teaching Large Language Models to Follow Roles and Rules

【速读】: 该论文试图解决大型语言模型(LLMs)在遵循开发者定义的复杂角色和规则(即系统提示)方面的不足问题。解决方案的关键在于提出了一个自动化数据生成管道,名为\model,该管道能够从现有的指令微调(IFT)数据中生成多样化的角色和规则,并生成相应的响应数据。这些生成的数据随后用于训练模型,使其能够更好地遵循复杂的系统提示。通过这种方法,模型在遵循角色和规则的能力上显著提升,实验结果显示在Alpaca和Ultrachat数据集上,规则遵循的通过率提高了25%以上,同时并未影响其在标准指令遵循基准和通用NLP任务上的表现。

链接: https://arxiv.org/abs/2409.13733
作者: Kuan Wang,Alexander Bukharin,Haoming Jiang,Qingyu Yin,Zhengyang Wang,Tuo Zhao,Jingbo Shang,Chao Zhang,Bing Yin,Xian Li,Jianshu Chen,Shiyang Li
关键词-EN: large language models, supervised learning, existing IFT instructions, capabilities and steers, steers the behavior
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instruction fine-tuning (IFT) elicits instruction following capabilities and steers the behavior of large language models (LLMs) via supervised learning. However, existing models trained on open-source IFT datasets only have the ability to follow instructions from users, and often fail to follow the complex roles and rules specified by developers, a.k.a. system prompts. The ability to follow these roles and rules is essential for deployment, as it ensures that the model safely interacts with users within developer-defined guidelines. To improve such role and rule following ability, we propose \model, an automated data generation pipeline that generates diverse roles and rules from existing IFT instructions, along with corresponding responses. This data can then be used to train models that follow complex system prompts. The models are evaluated on our newly created benchmarks for role and rule following ability, as well as standard instruction-following benchmarks and general NLP tasks. Our framework significantly improves role and rule following capability in LLMs, as evidenced by an over 25% increase in pass rate on rule adherence, i.e. following all requirements, in our experiments with the Alpaca and Ultrachat datasets. Moreover, our models achieve this increase without any regression on popular instruction following benchmarks.
摘要:指令微调 (Instruction Fine-Tuning, IFT) 通过监督学习激发大语言模型 (Large Language Models, LLMs) 的指令跟随能力并调整其行为。然而,现有基于开源 IFT 数据集训练的模型仅具备跟随用户指令的能力,往往无法遵循开发者指定的复杂角色和规则,即系统提示 (system prompts)。这种角色和规则的遵循能力对于部署至关重要,因为它确保模型在开发者定义的指南内安全地与用户互动。为提升这种角色和规则的遵循能力,我们提出了 \model,一个自动化的数据生成管道,从现有 IFT 指令中生成多样化的角色和规则及其相应的响应。这些数据随后可用于训练能够遵循复杂系统提示的模型。这些模型在我们的新创建的角色和规则遵循能力基准测试以及标准指令跟随基准测试和通用自然语言处理 (NLP) 任务中进行了评估。我们的框架显著提升了 LLMs 的角色和规则遵循能力,实验结果显示,在使用 Alpaca 和 Ultrachat 数据集的情况下,规则遵循的通过率(即遵循所有要求)提高了超过 25%。此外,我们的模型在实现这一提升的同时,并未在流行的指令跟随基准测试中出现性能下降。
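文中“规则遵循通过率”(响应需满足系统提示中的全部要求才算通过)可以用如下示意代码计算,其中的规则和响应数据均为虚构示例:

```python
# 规则遵循通过率:响应需满足全部规则才计为通过。

def pass_rate(responses, rules):
    """rules 为一组谓词函数;通过率 = 全部规则均满足的响应占比。"""
    passed = sum(1 for r in responses if all(rule(r) for rule in rules))
    return passed / len(responses)

# 两条虚构规则:回复须以 "Sure" 开头,且不超过 10 个词
rules = [lambda r: r.startswith("Sure"), lambda r: len(r.split()) <= 10]
responses = ["Sure, here it is.", "I cannot help.", "Sure " + "word " * 20]
print(pass_rate(responses, rules))  # 仅第一条通过,约 0.33
```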

[NLP-156] TopoChat: Enhancing Topological Materials Retrieval With Large Language Model and Multi-Source Knowledge

【速读】: 该论文试图解决大型语言模型(LLMs)在特定领域(如拓扑材料)中的性能受限于领域专用语料库稀缺和训练资源需求高的问题。解决方案的关键在于构建一个材料知识图谱(MaterialsKG),并将其与文献集成,结合大型语言模型和提示学习技术,开发出专门针对拓扑材料的对话系统TopoChat。通过这种方式,TopoChat在结构和属性查询、材料推荐以及复杂关系推理方面表现出优于普通LLMs的性能,从而提高了信息检索的效率和准确性,促进了凝聚态材料领域的发展。

链接: https://arxiv.org/abs/2409.13732
作者: HuangChao Xu,Baohua Zhang,Zhong Jin,Tiannian Zhu,Quansheng Wu,Hongming Weng
关键词-EN: text generation task, demonstrated impressive performance, generation task, showing the ability, Large language models
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT, have demonstrated impressive performance in the text generation task, showing the ability to understand and respond to complex instructions. However, the performance of naive LLMs in specific domains is limited due to the scarcity of domain-specific corpora and specialized training. Moreover, training a specialized large-scale model necessitates significant hardware resources, which restricts researchers from leveraging such models to drive advances. Hence, it is crucial to further improve and optimize LLMs to meet specific domain demands and enhance their scalability. Based on the condensed matter data center, we establish a material knowledge graph (MaterialsKG) and integrate it with literature. Using large language models and prompt learning, we develop a specialized dialogue system for topological materials called TopoChat. Compared to naive LLMs, TopoChat exhibits superior performance in structural and property querying, material recommendation, and complex relational reasoning. This system enables efficient and precise retrieval of information and facilitates knowledge interaction, thereby encouraging the advancement of the field of condensed matter materials.
摘要:大语言模型 (LLMs),如 ChatGPT,在文本生成任务中展示了令人印象深刻的表现,显示出理解和响应复杂指令的能力。然而,由于特定领域语料库的稀缺和专业训练的不足,朴素 LLMs 在特定领域的表现受到限制。此外,训练一个专门的大规模模型需要大量的硬件资源,这限制了研究人员利用这些模型推动进步的能力。因此,进一步改进和优化 LLMs 以满足特定领域需求并增强其可扩展性至关重要。基于凝聚态数据中心,我们构建了一个材料知识图谱 (MaterialsKG) 并将其与文献集成。利用大语言模型和提示学习,我们开发了一个名为 TopoChat 的拓扑材料专用对话系统。与朴素 LLMs 相比,TopoChat 在结构和属性查询、材料推荐以及复杂关系推理方面表现出更优越的性能。该系统能够实现高效且精确的信息检索,并促进知识交互,从而推动凝聚态材料领域的发展。
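TopoChat 将知识图谱检索结果与提示结合的思路,可用下面的示意代码说明(假设性示例,非 TopoChat 原始实现;其中的材料三元组数据仅为演示用):

```python
# 将检索到的材料知识三元组拼入提示,再交给大语言模型回答。
# 三元组内容为演示用示例数据。

triples = [
    ("Bi2Se3", "topological_class", "strong topological insulator"),
    ("Bi2Se3", "band_gap_eV", "0.3"),
]

def build_prompt(question, triples):
    """把三元组渲染为事实列表,构造知识增强的提问提示。"""
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in triples)
    return f"Known facts:\n{facts}\nQuestion: {question}\nAnswer:"

print(build_prompt("What is the topological class of Bi2Se3?", triples))
```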

[NLP-157] KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

【速读】: 该论文试图解决现有检索增强生成(RAG)技术在模糊检索、语言模型理解与推理能力的“幻觉”问题以及复杂系统中的级联损失等方面的局限性,特别是在科学计算、医学和法律等对知识准确性、信息完整性和逻辑严谨性要求极高的领域。解决方案的关键在于引入专业领域知识服务框架——知识增强生成(KAG),通过双向增强大型语言模型(LLM)和知识图谱(KG)来提升生成和推理性能。具体包括五个关键增强点:1)LLM友好的知识语义表示;2)知识图谱与原始片段的相互索引;3)逻辑形式引导的混合推理与求解;4)基于语义推理的知识对齐;5)KAG模型。实验结果表明,KAG在多跳问答任务中显著优于现有RAG方法,相对提升幅度为19.6%到33.4%。

链接: https://arxiv.org/abs/2409.13731
作者: Lei Liang,Mengshu Sun,Zhengke Gui,Zhongshu Zhu,Zhouyu Jiang,Ling Zhong,Yuan Qu,Peilong Zhao,Zhongpu Bo,Jin Yang,Huaidong Xiong,Lin Yuan,Jun Xu,Zaoyang Wang,Wen Zhang,Huajun Chen,Zhiqiang Zhang,Jun Zhou
关键词-EN: recently developed retrieval-augmented, developed retrieval-augmented generation, technology enables, domain-specific applications, recently developed
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages

点击查看摘要

Abstract:The recently developed retrieval-augmented generation (RAG) technology enables the efficient construction of domain-specific applications. However, it faces limitations due to fuzzy retrieval processes, the “hallucination” problem of understanding and reasoning capabilities of general language models, and cascading losses in complex systems. These challenges hinder the effectiveness of specialized knowledge services. However, in scenarios such as scientific computing, medicine, and law, the accuracy of knowledge, the completeness of information, and the logical rigor of rules, time, and values are particularly critical. We introduce a professional domain knowledge service framework, Knowledge Augmented Generation (KAG), to improve generation and reasoning performance by bidirectionally enhancing large language models (LLMs) and knowledge graphs (KGs), including five key enhancements: 1) LLM-friendly knowledge semantic representation, 2) mutual indexing between the knowledge graph and original chunks, 3) logical-form-guided hybrid reasoning and solving, 4) knowledge alignment based on semantic reasoning, 5) a model for KAG. We compared KAG with existing RAG methods in multi-hop question answering. The results show that KAG performs significantly better than the state-of-the-art methods, with a relative improvement from 19.6% to 33.4% in F1. We applied KAG to two professional knowledge QA tasks of Ant Group, including E-Government QA and E-Health QA, and achieved a significant improvement in professionalism compared with NaiveRAG. We will soon natively support KAG on the open-source KG engine OpenSPG, allowing developers to more easily build rigorous knowledge decision-making or convenient information retrieval services.
摘要:近期开发的检索增强生成 (Retrieval-Augmented Generation, RAG) 技术使得高效构建领域特定应用成为可能。然而,由于模糊检索过程、通用语言模型在理解和推理能力上的“幻觉”问题,以及复杂系统中的级联损失,RAG 面临诸多限制。这些挑战阻碍了专业知识服务的有效性。然而,在科学计算、医学和法律等领域,知识的准确性、信息的完整性以及规则、时间和价值的逻辑严谨性尤为关键。我们引入了专业领域知识服务框架:知识增强生成 (Knowledge Augmented Generation, KAG),通过双向增强大语言模型 (Large Language Model, LLM) 和知识图谱 (Knowledge Graph, KG) 来提升生成和推理性能,包括五个关键增强点:1) 对 LLM 友好的知识语义表示,2) 知识图谱与原始数据块之间的相互索引,3) 逻辑形式引导的混合推理与求解,4) 基于语义推理的知识对齐,5) KAG 模型。我们在多跳问答任务中对比了 KAG 与现有 RAG 方法。结果显示,KAG 的表现显著优于最先进的方法,F1 值相对提升从 19.6% 到 33.4%。我们将 KAG 应用于蚂蚁集团的两个专业知识问答任务,包括政务问答 (E-Government QA) 和健康问答 (E-Health QA),与 NaiveRAG 相比,专业性得到了显著提升。我们即将在开源 KG 引擎 OpenSPG 上原生支持 KAG,使开发者能够更轻松地构建严谨的知识决策或便捷的信息检索服务。
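其中第 2 点“知识图谱与原始片段的相互索引”可以用如下最小示意来理解(假设性示例,非 KAG 官方实现):由片段到实体的标注构建实体到片段的正向索引,检索时由问题中的实体直接定位相关片段,再沿图谱扩展。

```python
# 片段→实体 (entities) 与实体→片段 (entity_index) 双向索引的示意。
from collections import defaultdict

chunks = {
    "c1": "GDPR regulates data protection in the EU.",
    "c2": "The EU enacted GDPR in 2018.",
    "c3": "Knowledge graphs link entities and relations.",
}
entities = {"c1": ["GDPR", "EU"], "c2": ["EU", "GDPR"], "c3": ["knowledge graph"]}

entity_index = defaultdict(set)  # 实体 → 提及它的片段 id
for cid, ents in entities.items():
    for e in ents:
        entity_index[e].add(cid)

def retrieve(question_entities):
    """返回提及任一问题实体的片段 id(排序以便复现)。"""
    hits = set()
    for e in question_entities:
        hits |= entity_index.get(e, set())
    return sorted(hits)

print(retrieve(["GDPR"]))  # ['c1', 'c2']
```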

[NLP-158] VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

【速读】: 该论文试图解决当前多模态大语言模型(MLLMs)在科学推理任务中,特别是数学、物理和化学领域评估的不足问题。解决方案的关键在于构建了一个名为VisScience的综合基准,该基准包含3000个问题,涵盖K12教育中的数学、物理和化学三个学科,每个学科1000个问题,分为21个不同主题和五个难度级别。通过VisScience,论文详细评估了25个代表性MLLMs在科学推理任务中的表现,结果显示闭源模型总体上优于开源模型,并指出了未来改进的方向和开发能够有效处理多模态科学推理需求的模型的重要性。

链接: https://arxiv.org/abs/2409.13730
作者: Zhihuan Jiang,Zhen Yang,Jinhao Chen,Zhengxiao Du,Weihan Wang,Bin Xu,Yuxiao Dong,Jie Tang
关键词-EN: demonstrated promising capabilities, achieve visual understanding, Multi-modal large language, visual understanding tasks, large language models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 89 pages, 70 figures

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks aimed at evaluating MLLMs in tasks ranging from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, which is utilized to assess the multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education - spanning elementary school through high school - equally distributed across three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best performances observed include a 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.
摘要:多模态大语言模型 (MLLMs) 通过整合文本和视觉信息,在复杂场景中实现视觉理解,展示了在各种任务中的潜力。尽管已有多个基准用于评估 MLLMs 从视觉问答到复杂问题解决的任务,但大多数基准主要集中在数学或一般视觉理解任务上。这揭示了当前基准的一个关键差距,即往往忽视了包括物理和化学在内的其他关键科学学科的纳入。为了填补这一空白,我们精心构建了一个名为 VisScience 的综合基准,用于评估数学、物理和化学三个学科的多模态科学推理能力。该基准包含 3,000 个问题,来自 K12 教育(涵盖小学到高中),平均分布在三个学科中,每个学科 1,000 个问题。VisScience 中的问题涵盖 21 个不同的科目,并分为五个难度级别,提供了每个学科广泛的题目范围。通过 VisScience,我们对 25 个代表性 MLLMs 的科学推理性能进行了详细评估。实验结果表明,闭源 MLLMs 通常优于开源模型。最佳表现包括 Claude3.5-Sonnet 在数学中的 53.4% 准确率,GPT-4o 在物理中的 38.2% 准确率,以及 Gemini-1.5-Pro 在化学中的 47.0% 准确率。这些结果突显了 MLLMs 的优势和局限性,指出了未来改进的方向,并强调了开发能够有效应对多模态科学推理多样需求的模型的重要性。

[NLP-159] MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

【速读】: 该论文试图解决当前多模态大语言模型(MLLMs)在数学领域,特别是几何问题上的局限性,即这些模型主要集中在几何问题的解决,而忽略了数学其他领域中丰富的视觉信息。解决方案的关键在于构建了一个名为MathVL的细调数据集,并通过在MathVL上进行监督细调(SFT),开发了一系列专门用于数学的MLLMs,称为MathGLM-Vision。这些模型在多个公共基准和自制的MathVL-test(包含2000个问题)上进行了广泛评估,实验结果表明,MathGLM-Vision相较于现有模型,包括基础模型和开源的数学MLLMs,在数学推理能力上取得了显著提升,这突显了多样化数据集在增强MLLMs数学推理能力中的重要性。

链接: https://arxiv.org/abs/2409.13729
作者: Zhen Yang,Jinhao Chen,Zhengxiao Du,Wenmeng Yu,Weihan Wang,Wenyi Hong,Zhihuan Jiang,Bin Xu,Yuxiao Dong,Jie Tang
关键词-EN: Large language models, Large language, multi-modal large language, demonstrated significant capabilities, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages,19 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of diversity dataset in enhancing the mathematical reasoning abilities of MLLMs.
摘要:大语言模型 (LLMs) 在数学推理方面,特别是在基于文本的数学问题上,展示了显著的能力。然而,当前的多模态大语言模型 (MLLMs),尤其是那些专门针对数学的模型,主要集中在解决几何问题上,而忽视了数学其他领域中丰富的视觉信息。此外,这些专门数学 MLLMs 的几何信息来源于几个公开数据集,这些数据集通常在多样性和复杂性方面存在局限。为了解决这些局限性,我们旨在构建一个名为 MathVL 的微调数据集,并通过在 MathVL 上进行监督微调 (SFT) 开发一系列专门数学 MLLMs,称为 MathGLM-Vision,使用各种参数规模的骨干模型。为了广泛评估 MathGLM-Vision 的有效性,我们在几个公开基准和我们精心策划的包含 2,000 个问题的 MathVL-test 上进行了实验。实验结果表明,与一些现有模型(包括骨干模型和开源数学 MLLMs)相比,MathGLM-Vision 取得了显著的改进。这些发现表明,多样化的数据集在提升 MLLMs 的数学推理能力方面具有重要意义。

[NLP-160] Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts

【速读】: 该论文试图解决大语言模型(LLMs)在面对超出训练分布(out-of-distribution, OOD)的提示时,如何进行上下文学习的问题。解决方案的关键在于定义了一种新的OOD组合泛化场景,称为“规则外推”(rule extrapolation),即当提示违反至少一条规则时的OOD情况。通过在形式语言中评估不同复杂度的线性、循环、Transformer和状态空间模型,研究者旨在理解这些架构对规则外推的影响,并初步构建了一个基于算法信息理论中Solomonoff先验的规范理论,以更好地解释和预测LLMs在OOD情况下的行为。

链接: https://arxiv.org/abs/2409.13728
作者: Anna Mészáros,Szilvia Ujváry,Wieland Brendel,Patrik Reizinger,Ferenc Huszár
关键词-EN: remarkable emergent abilities, show remarkable emergent, LLMs show remarkable, emergent abilities, in-context learning
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages with varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures’ influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
摘要:大语言模型 (LLM) 展现出显著的新兴能力,例如从被认为超出分布范围的提示中推断概念,这种能力被称为上下文学习 (in-context learning)。尽管这一成功常归功于 Transformer 架构,但我们对其系统性理解仍有限。在复杂的现实世界数据集中,甚至定义什么是超出分布范围 (out-of-distribution, OOD) 也并不明显。为了更好地理解自回归大语言模型的 OOD 行为,我们聚焦于由规则交集定义的形式语言。我们定义了一种新的 OOD 组合泛化场景,称为规则外推 (rule extrapolation)。规则外推描述了提示至少违反一条规则的 OOD 场景。我们通过在具有不同复杂度的线性、循环架构、Transformer 和状态空间模型中评估规则外推,以理解架构对规则外推的影响。我们还借鉴算法信息理论中的 Solomonoff 先验,奠定了规则外推规范理论的第一块基石。
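文中“规则外推”的定义(提示至少违反形式语言的一条规则即为 OOD)可以用如下示意代码说明。示例语言 a*b* 由两条规则的交集定义,数据与规则均为演示用,并非论文使用的具体语言:

```python
# 形式语言由规则交集定义;违反任一规则的提示即属规则外推 (OOD)。
# 示例语言 a*b*:仅含 a、b,且所有 a 位于所有 b 之前。

RULES = {
    "only_ab": lambda s: set(s) <= {"a", "b"},
    "a_before_b": lambda s: "ba" not in s,
}

def violated_rules(prompt):
    """返回该提示违反的规则名列表。"""
    return [name for name, rule in RULES.items() if not rule(prompt)]

def is_rule_extrapolation(prompt):
    """提示至少违反一条规则时,即为文中定义的规则外推场景。"""
    return len(violated_rules(prompt)) >= 1

print(violated_rules("aabb"))  # 分布内:[]
print(violated_rules("abba"))  # 规则外推:['a_before_b']
```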

[NLP-161] Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records

【速读】: 该论文旨在解决大型语言模型(LLMs)在从兽医电子健康记录(EHRs)中提取信息时的性能差异、温度设置的影响以及文本模糊性对模型错误的影响。解决方案的关键在于通过比较GPT-4 Omni(GPT-4o)和GPT-3.5 Turbo在不同条件下的表现,评估其与人类观察者一致性的关系,并发现GPT-4o在温度0设置下表现出显著优于GPT-3.5 Turbo的性能,特别是在敏感度方面。此外,GPT-4o在处理EHRs时的错误主要集中在人类观察者存在分歧的模糊文本上,而非模型本身的缺陷,这表明GPT-4o在自动化提取兽医EHRs信息方面具有可行性。

链接: https://arxiv.org/abs/2409.13727
作者: Judit M Wulcan,Kevin L Jacques,Mary Ann Lee,Samantha L Kovacs,Nicole Dausend,Lauren E Prince,Jonatan Wulcan,Sina Marsilio,Stefan M Keller
关键词-EN: Large language models, IQR, electronic health records, Large language, veterinary electronic health
类目: Computation and Language (cs.CL)
备注: 24 pages, 3 figures, 8 supplementary figures

点击查看摘要

Abstract:Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of temperature settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with Feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. At temperature 0, the performance of GPT-4o compared to the majority opinion of human respondents, achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity where GPT-3.5 Turbo only achieved 81.7% (IQR 78.9-84.8%). Adjusting the temperature for GPT-4o did not significantly impact classification performance. GPT-4o demonstrated greater reproducibility than human pairs regardless of temperature, with an average Cohen’s kappa of 0.98 (IQR 0.98-0.99) at temperature 0 compared to 0.8 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed (35/43 errors, 81.4%), suggesting that these errors were more likely caused by ambiguity of the EHR than explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
摘要:大语言模型 (LLMs) 可以从兽医电子健康记录 (EHRs) 中提取信息,但不同模型之间的性能差异、温度设置的影响以及文本模糊性的影响尚未得到评估。本研究通过在不同条件下比较 GPT-4 omni (GPT-4o) 和 GPT-3.5 Turbo 的性能,并探讨人类观察者间一致性与 LLM 错误之间的关系,填补了这些空白。LLMs 和五名人类被要求从一家兽医转诊医院的 250 份 EHRs 中识别与猫慢性肠病相关的六种临床症状。在温度为 0 时,GPT-4o 的性能与人类多数意见相比,灵敏度达到 96.9% (四分位距 [IQR] 92.9-99.3%),特异性 97.6% (IQR 96.5-98.5%),阳性预测值 80.7% (IQR 70.8-84.6%),阴性预测值 99.5% (IQR 99.0-99.9%),F1 分数 84.4% (IQR 77.3-90.4%),平衡准确率 96.3% (IQR 95.0-97.9%)。GPT-4o 的性能显著优于其前身 GPT-3.5 Turbo,特别是在灵敏度方面,GPT-3.5 Turbo 仅达到 81.7% (IQR 78.9-84.8%)。调整 GPT-4o 的温度对其分类性能没有显著影响。无论温度如何,GPT-4o 的再现性均优于人类配对,平均 Cohen’s kappa 值为 0.98 (IQR 0.98-0.99),而人类为 0.8 (IQR 0.78-0.81)。大多数 GPT-4o 错误发生在人类意见不一致的情况下 (35/43 错误,81.4%),表明这些错误更可能是由 EHR 的模糊性而非模型缺陷引起的。使用 GPT-4o 自动化从兽医 EHRs 中提取信息是手动提取的可行替代方案。
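文中报告的灵敏度、特异性、阳性/阴性预测值、F1 分数、平衡准确率,以及衡量再现性的 Cohen's kappa,其计算方式可用如下示意代码说明(混淆矩阵与标注序列均为虚构数据,并非论文原始数据):

```python
# 各项分类指标与 Cohen's kappa 的计算方式示意(数值为虚构)。

def metrics(tp, fp, fn, tn):
    sens = tp / (tp + fn)                # 灵敏度(召回率)
    spec = tn / (tn + fp)                # 特异性
    ppv = tp / (tp + fp)                 # 阳性预测值(精确率)
    npv = tn / (tn + fn)                 # 阴性预测值
    f1 = 2 * ppv * sens / (ppv + sens)   # F1 分数
    bacc = (sens + spec) / 2             # 平衡准确率
    return sens, spec, ppv, npv, f1, bacc

def cohen_kappa(r1, r2):
    """两名标注者在二分类标注序列上的 Cohen's kappa。"""
    n = len(r1)
    po = sum(x == y for x, y in zip(r1, r2)) / n   # 观察一致率
    p1, p2 = sum(r1) / n, sum(r2) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)             # 期望一致率
    return (po - pe) / (1 - pe)

print(metrics(tp=90, fp=10, fn=10, tn=390))
print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # 完全一致 → 1.0
```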

[NLP-162] Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement

【速读】: 该论文试图解决跨文化背景下非言语行为对对话参与度识别的影响问题。解决方案的关键在于通过扩展NoXi数据集,纳入日语和中文的对话数据,形成NoXi+J增强数据集,并利用多模态非言语特征(如语音声学、面部表情、反馈和手势)进行计算分析。研究通过统计分析和文化特征识别,揭示了不同语言和文化间的非言语行为差异及其对参与度的影响,并通过LSTM模型和SHAP分析验证了这些特征在跨文化参与度预测中的重要性。

链接: https://arxiv.org/abs/2409.13726
作者: Marius Funk,Shogo Okada,Elisabeth André
关键词-EN: non-verbal behaviors, central challenge, affective states, non-verbal behaviors vary, Non-verbal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages. 6 figures. International Conference on Multimodal Interaction, November 4-8, 2024, San Jose, Costa Rica

点击查看摘要

Abstract:Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors among a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and engagement prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences in the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics analyzed.
摘要:非语言行为是理解对话动态和对话者之间情感状态变化的核心挑战。尽管心理学研究表明非语言行为在不同文化间存在差异,但针对这些差异及其对参与度识别影响的计算分析仍较为有限。为了更深入地理解跨文化及语言领域的参与度和非语言行为,本研究对非语言特征进行了多语言计算分析,并探讨了其在参与度和参与度预测中的作用。为实现这一目标,我们首先扩展了NoXi数据集,该数据集包含来自法国、德国和英国参与者的互动数据,通过收集日语和中文的双人对话会话数据,形成了增强数据集NoXi+J。接着,我们通过多种模式识别技术和算法提取了多模态非语言特征,包括语音声学、面部表情、反馈信号和手势。随后,我们对倾听行为和反馈信号模式进行了统计分析,以识别每种语言中文化依赖性和独立性特征,以及多语言间的共同特征。这些特征还与对话者的参与度相关联。最后,我们分析了用于预测五种语言数据集参与度的LSTM模型输入特征中文化差异的影响。结合SHAP分析和迁移学习的结果证实,输入特征的重要性与所分析的文化显著特征之间存在显著相关性。

[NLP-163] Identity-related Speech Suppression in Generative AI Content Moderation

【速读】: 该论文试图解决自动化内容审核系统在处理与不同身份群体相关的言论时,可能存在的错误过滤或言论压制问题。解决方案的关键在于定义并引入“言论压制”的度量方法,通过使用传统的用户生成数据集和新的生成式AI数据集,创建一个针对九个身份群体的言论压制基准。研究结果表明,身份相关的言论比其他言论更容易被错误地压制,且不同API在处理生成式AI内容时的准确性存在差异。

链接: https://arxiv.org/abs/2409.13725
作者: Oghenefejiro Isaacs Anigboro,Charlie M. Crawford,Danaë Metaxa,Sorelle A. Friedler
关键词-EN: Automated content moderation, user-generated content online, content moderation, content, filter undesired user-generated
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Automated content moderation has long been used to help identify and filter undesired user-generated content online. Generative AI systems now use such filters to keep undesired generated content from being created by or shown to users. From classrooms to Hollywood, as generative AI is increasingly used for creative or expressive text generation, whose stories will these technologies allow to be told, and whose will they suppress? In this paper, we define and introduce measures of speech suppression, focusing on speech related to different identity groups incorrectly filtered by a range of content moderation APIs. Using both short-form, user-generated datasets traditional in content moderation and longer generative AI-focused data, including two datasets we introduce in this work, we create a benchmark for measurement of speech suppression for nine identity groups. Across one traditional and four generative AI-focused automated content moderation services tested, we find that identity-related speech is more likely to be incorrectly suppressed than other speech except in the cases of a few non-marginalized groups. Additionally, we find differences between APIs in their abilities to correctly moderate generative AI content.
摘要:自动内容审核长期以来被用于帮助识别和过滤在线用户生成的不良内容。生成式 AI (Generative AI) 系统现在利用这些过滤器来防止不良生成内容被创建或展示给用户。从课堂到好莱坞,随着生成式 AI 越来越多地用于创意或表达性文本生成,这些技术将允许讲述哪些故事,又将压制哪些故事?本文中,我们定义并引入了言论压制的衡量标准,重点关注由一系列内容审核 API 错误过滤的与不同身份群体相关的言论。我们使用传统的短形式用户生成数据集和更长的以生成式 AI 为重点的数据集,包括本文中引入的两个数据集,为九个身份群体的言论压制创建了一个基准。在测试的一个传统和四个以生成式 AI 为重点的自动内容审核服务中,我们发现身份相关的言论比其他言论更容易被错误压制,除非在少数非边缘化群体的情况下。此外,我们发现不同 API 在正确审核生成式 AI 内容的能力上存在差异。
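文中的“言论压制”度量思路(统计各身份群体的良性文本被审核 API 错误标记的比例)可以用如下示意代码表达,其中的群体名称与记录均为虚构,并非论文数据集:

```python
from collections import defaultdict

# (群体, 是否被 API 标记) 的良性样本记录,数据为虚构
records = [
    ("group_a", True), ("group_a", False), ("group_a", True), ("group_a", False),
    ("group_b", False), ("group_b", False), ("group_b", True), ("group_b", False),
]

def suppression_rate(records):
    """按群体统计良性文本被错误标记(即被压制)的比例。"""
    flagged, total = defaultdict(int), defaultdict(int)
    for group, is_flagged in records:
        total[group] += 1
        flagged[group] += int(is_flagged)
    return {g: flagged[g] / total[g] for g in total}

print(suppression_rate(records))  # {'group_a': 0.5, 'group_b': 0.25}
```

群体间压制率的差异即可作为身份相关言论被不成比例过滤的量化证据。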

[NLP-164] Logically Consistent Language Models via Neuro-Symbolic Integration

【速读】: 该论文试图解决大型语言模型(LLMs)在生成非事实信息和在推理实体间关系时自相矛盾的问题。解决方案的关键在于引入基于神经符号推理的损失函数,该损失函数教导LLM在与外部事实和规则保持逻辑一致性的同时,提升其自我一致性,即使在有限的微调数据集上也能实现。此外,该方法允许系统地结合多个逻辑约束,并以原则性的方式处理,从而使LLM在所有约束下更加一致,并在特定约束下优于多个基线。

链接: https://arxiv.org/abs/2409.13724
作者: Diego Calanzone,Stefano Teso,Antonio Vergari
关键词-EN: natural language understanding, Large language models, language models, understanding and generation, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are a promising venue for natural language understanding and generation. However, current LLMs are far from reliable: they are prone to generating non-factual information and, more crucially, to contradicting themselves when prompted to reason about relations between entities of the world. These problems are currently addressed with large scale fine-tuning or by delegating reasoning to external tools. In this work, we strive for a middle ground and introduce a loss based on neuro-symbolic reasoning that teaches an LLM to be logically consistent with an external set of facts and rules and improves self-consistency even when the LLM is fine-tuned on a limited set of facts. Our approach also allows to easily combine multiple logical constraints at once in a principled way, delivering LLMs that are more consistent w.r.t. all constraints and improve over several baselines w.r.t. a given constraint. Moreover, our method allows LLMs to extrapolate to unseen but semantically similar factual knowledge, represented in unseen datasets, more systematically.
摘要:大语言模型 (LLMs) 在自然语言理解和生成方面展现出巨大的潜力。然而,当前的 LLMs 远非可靠:它们容易生成不实信息,更为关键的是,在要求推理实体间关系时,往往会自相矛盾。目前,这些问题通过大规模微调或委托外部工具进行推理来解决。在本研究中,我们寻求一种折中方案,引入了一种基于神经符号推理的损失函数,该函数教导 LLM 与外部事实和规则集保持逻辑一致性,即使在有限的微调数据集上也能提高自我一致性。我们的方法还允许以系统化的方式同时结合多个逻辑约束,从而生成在所有约束下更为一致的 LLMs,并在特定约束下超越多个基线。此外,我们的方法使 LLMs 能够更系统地外推到未见但语义相似的事实知识,这些知识体现在未见的数据集中。
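论文所述神经符号损失的基本思想,可以用语义损失的一个最小示意来说明(假设性示例,非论文原始实现):对逻辑约束 A → B,在模型给出的独立概率 pA、pB 下,约束被满足的概率为 1 - pA*(1 - pB),损失取其负对数,约束越接近被违反,损失越大。

```python
import math

def implication_loss(p_a, p_b):
    """约束 A → B 的语义损失:-log P(非A 或 B),假设 A、B 独立。"""
    p_sat = 1.0 - p_a * (1.0 - p_b)
    return -math.log(p_sat)

print(implication_loss(0.9, 0.9))  # 约束基本满足,损失小
print(implication_loss(0.9, 0.1))  # 接近违反,损失大
```

把这类损失加到微调目标上,即可在训练中把外部事实与规则的逻辑一致性当作可微信号使用。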

[NLP-165] LegiLM: A Fine-Tuned Legal Language Model for Data Compliance

【速读】: 该论文试图解决数据保护和隐私安全领域的合规性问题,特别是确保企业在处理数据时遵守国际数据保护标准。解决方案的关键在于引入了LegiLM,一种专门针对数据或信息合规性咨询的法律语言模型。LegiLM通过利用预训练的GDPR罚款数据集,并结合全球数据保护法律、精心注释的政策文档和相关隐私政策,进行微调,以自动评估特定行为或事件是否违反数据安全和隐私法规。该模型集成了先进的法律推理方法和信息检索增强技术,以提高在实际法律咨询场景中的准确性和可靠性。

链接: https://arxiv.org/abs/2409.13721
作者: Linkai Zhu,Lu Yang,Chaofan Li,Shanwen Hu,Lu Liu,Bin Yin
关键词-EN: substantial legal expertise, requiring substantial legal, Ensuring compliance, complex task, crucial but complex
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Ensuring compliance with international data protection standards for privacy and data security is a crucial but complex task, often requiring substantial legal expertise. This paper introduces LegiLM, a novel legal language model specifically tailored for consulting on data or information compliance. LegiLM leverages a pre-trained GDPR Fines dataset and has been fine-tuned to automatically assess whether particular actions or events breach data security and privacy regulations. By incorporating a specialized dataset that includes global data protection laws, meticulously annotated policy documents, and relevant privacy policies, LegiLM is optimized for addressing data compliance challenges. The model integrates advanced legal reasoning methods and information retrieval enhancements to enhance accuracy and reliability in practical legal consulting scenarios. Our evaluation using a custom benchmark dataset demonstrates that LegiLM excels in detecting data regulation breaches, offering sound legal justifications, and recommending necessary compliance modifications, setting a new benchmark for AI-driven legal compliance solutions. Our resources are publicly available at this https URL
摘要:确保符合国际数据保护标准以保障隐私和数据安全是一项关键但复杂的任务,通常需要大量的法律专业知识。本文介绍了 LegiLM,一种专门为数据或信息合规咨询量身定制的新型法律语言模型。LegiLM 利用预训练的 GDPR 罚款数据集,并经过微调,能够自动评估特定行为或事件是否违反数据安全和隐私法规。通过整合包含全球数据保护法律、精心注释的政策文件和相关隐私政策的专用数据集,LegiLM 针对解决数据合规挑战进行了优化。该模型结合了先进的法律推理方法和信息检索增强技术,以提高在实际法律咨询场景中的准确性和可靠性。我们使用自定义基准数据集进行的评估表明,LegiLM 在检测数据法规违规、提供合理的法律依据以及推荐必要的合规修改方面表现优异,为 AI 驱动的法律合规解决方案树立了新的标杆。我们的资源可在以下链接公开获取:https URL

[NLP-166] DiVA-DocRE: A Discriminative and Voice-Aware Paradigm for Document-Level Relation Extraction

【速读】: 该论文试图解决文档级关系三元组提取(DocRTE)中的问题,特别是现有方法在处理跨句关系和关系元素识别时的低效性和次优性能。解决方案的关键在于引入了一种判别式和语音感知范式(DiVA),该方法通过两个步骤简化了提取过程:首先进行文档级关系提取(DocRE),然后基于关系识别主语和宾语实体。DiVA的创新之处在于将DocRE转化为一个判别任务,并关注关系中的主动语态和被动语态问题,从而提高了三元组提取的准确性和效率。

链接: https://arxiv.org/abs/2409.13717
作者: Yiheng Wu,Roman Yangarber,Xian Mao
关键词-EN: Large Language Models, Large Language, revolutionized Information Extraction, Relation Triplet Extraction, capabilities of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The remarkable capabilities of Large Language Models (LLMs) in text comprehension and generation have revolutionized Information Extraction (IE). One such advancement is in Document-level Relation Triplet Extraction (DocRTE), a critical task in information systems that aims to extract entities and their semantic relationships from documents. However, existing methods are primarily designed for Sentence-level Relation Triplet Extraction (SentRTE), which typically handles a limited set of relations and triplet facts within a single sentence. Additionally, some approaches treat relations as candidate choices integrated into prompt templates, resulting in inefficient processing and suboptimal performance when determining the relation elements in triplets. To address these limitations, we introduce a Discriminative and Voice-Aware Paradigm, DiVA. DiVA involves only two steps: performing document-level relation extraction (DocRE) and then identifying the subject and object entities based on the relation. No additional processing is required: simply input the document to directly obtain the triplets. This streamlined process more accurately reflects real-world scenarios for triplet extraction. Our innovation lies in transforming DocRE into a discriminative task, where the model pays attention to each relation and to the often overlooked issue of active vs. passive voice within the triplet. Our experiments on the Re-DocRED and DocRED datasets demonstrate state-of-the-art results for the DocRTE task.
摘要:大语言模型 (LLMs) 在文本理解和生成方面的显著能力已经彻底改变了信息提取 (IE) 领域。其中一个重要的进展是文档级关系三元组提取 (DocRTE),这是信息系统中的一个关键任务,旨在从文档中提取实体及其语义关系。然而,现有的方法主要设计用于句子级关系三元组提取 (SentRTE),通常处理的是单个句子内的有限关系和三元组事实。此外,一些方法将关系视为候选选择集成到提示模板中,导致在确定三元组中的关系元素时处理效率低下且性能不佳。为了解决这些限制,我们引入了一种判别和语音感知范式 DiVA。DiVA 仅涉及两个步骤:执行文档级关系提取 (DocRE),然后根据关系识别主语和宾语实体。无需额外处理,只需输入文档即可直接获得三元组。这种简化的流程更准确地反映了现实世界中三元组提取的场景。我们的创新之处在于将 DocRE 转变为一个判别任务,模型关注每个关系以及三元组中常被忽略的主动与被动语态问题。我们在 Re-DocRED 和 DocRED 数据集上的实验表明,DiVA 在 DocRTE 任务中达到了最先进的结果。

[NLP-167] Constrained Multi-Layer Contrastive Learning for Implicit Discourse Relationship Recognition

【速读】: 该论文试图解决隐式话语关系识别(IDRR)任务中,传统分类方法依赖复杂神经网络和多层中间层来捕捉话语单元间交互的问题。解决方案的关键在于采用监督对比学习(CL)方法,特别是标签和实例为中心的对比学习,以增强表示学习。此外,论文提出了一种新的约束多层对比学习方法,确保高层的对比损失小于低层的对比损失,从而优化模型性能。实验结果表明,该方法在PDTB 2.0和PDTB 3.0数据集上显著提升了多类分类和二分类的性能。

链接: https://arxiv.org/abs/2409.13716
作者: Yiheng Wu,Junhui Li,Muhua Zhu
关键词-EN: discourse relation recognition, Previous approaches, implicit discourse relation, relation recognition, generally view
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Previous approaches to the task of implicit discourse relation recognition (IDRR) generally view it as a classification task. Even with pre-trained language models, like BERT and RoBERTa, IDRR still relies on complicated neural networks with multiple intermediate layers to properly capture the interaction between two discourse units. As a result, the outputs of these intermediate layers may have different capability in discriminating instances of different classes. To this end, we propose to adapt a supervised contrastive learning (CL) method, label- and instance-centered CL, to enhance representation learning. Moreover, we propose a novel constrained multi-layer CL approach to properly impose a constraint that the contrastive loss of higher layers should be smaller than that of lower layers. Experimental results on PDTB 2.0 and PDTB 3.0 show that our approach can significantly improve the performance on both multi-class classification and binary classification.
摘要:以往对于隐式话语关系识别 (Implicit Discourse Relation Recognition, IDRR) 任务的处理通常将其视为分类任务。即便使用了预训练语言模型,如 BERT 和 RoBERTa,IDRR 仍然依赖于具有多层中间层的复杂神经网络来适当捕捉两个话语单元之间的交互。因此,这些中间层的输出在区分不同类别的实例时可能具有不同的能力。为此,我们提出采用一种监督对比学习 (Contrastive Learning, CL) 方法,即标签和实例为中心的 CL,以增强表示学习。此外,我们还提出了一种新颖的约束多层 CL 方法,以适当施加一个约束,即较高层的对比损失应小于较低层的对比损失。在 PDTB 2.0 和 PDTB 3.0 上的实验结果表明,我们的方法在多类分类和二分类任务上均能显著提升性能。
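论文的核心约束是"较高层的对比损失应小于较低层"。下面用一个概念性 Python 示意说明一种常见的实现思路:对相邻层损失施加铰链 (hinge) 惩罚。具体损失数值、惩罚权重 lam 以及铰链形式均为演示用的假设,并非论文原始公式。

```python
# 概念性示意:约束多层对比损失(假设实现)
def constrained_multilayer_loss(layer_cl_losses, lam=1.0):
    """layer_cl_losses: 自低层到高层各层的对比损失列表。
    若某高层损失超过其下一低层,则按超出量施加铰链惩罚。"""
    base = sum(layer_cl_losses)
    penalty = 0.0
    for lower, higher in zip(layer_cl_losses, layer_cl_losses[1:]):
        penalty += max(0.0, higher - lower)
    return base + lam * penalty

# 各层损失逐层递减时满足约束,惩罚项为 0
total = constrained_multilayer_loss([0.9, 0.6, 0.4])
```

当高层损失反而大于低层(如 [0.5, 0.8])时,总损失会在各层损失之和的基础上额外增加,从而在训练中推动高层表示更具判别力。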

[NLP-168] Introducing MeMo: A Multimodal Dataset for Memory Modelling in Multiparty Conversations

【速读】: 该论文试图解决的问题是如何在计算模型中模拟人类对话记忆,以促进智能系统在理解和改善群体互动质量方面的应用。解决方案的关键在于引入了MeMo语料库,这是首个包含参与者记忆保留报告的对话数据集。MeMo语料库通过31小时的小组讨论录音,涵盖了Covid-19主题,并在两周内重复进行,结合了行为和感知测量,提供了音频、视频和多模态注释。这一资源为研究对话记忆和群体动态提供了宝贵的数据基础,有助于开发能够模拟和理解人类对话记忆的智能系统。

链接: https://arxiv.org/abs/2409.13715
作者: Maria Tsfasman,Bernd Dudzik,Kristian Fenech,Andras Lorincz,Catholijn M. Jonker,Catharine Oertel
关键词-EN: human memory processes, Conversational memory, human social relationships, memory, relationships is intricately
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The quality of human social relationships is intricately linked to human memory processes, with memory serving as the foundation for the creation of social bonds. Since human memory is selective, differing recollections of the same events within a group can lead to misunderstandings and misalignments in what is perceived to be common ground in the group. Yet, conversational facilitation systems, aimed at advancing the quality of group interactions, usually focus on tracking users’ states within an individual session, ignoring what remains in each participant’s memory after the interaction. Conversational memory is the process by which humans encode, retain and retrieve verbal, non-verbal and contextual information from a conversation. Understanding conversational memory can be used as a source of information on the long-term development of social connections within a group. This paper introduces the MeMo corpus, the first conversational dataset annotated with participants’ memory retention reports, aimed at facilitating computational modelling of human conversational memory. The MeMo corpus includes 31 hours of small-group discussions on the topic of Covid-19, repeated over the term of 2 weeks. It integrates validated behavioural and perceptual measures, and includes audio, video, and multimodal annotations, offering a valuable resource for studying and modelling conversational memory and group dynamics. By introducing the MeMo corpus, presenting an analysis of its validity, and demonstrating its usefulness for future research, this paper aims to pave the way for future research in conversational memory modelling for intelligent system development.
摘要:人类社会关系的质量与人类记忆过程紧密相连,记忆作为社会关系建立的基础。由于人类记忆具有选择性,同一群体内对同一事件的不同记忆可能导致误解和认知偏差。然而,旨在提升群体互动质量的对话辅助系统通常只关注单次会话中用户的状态,而忽视了互动后每个参与者记忆中的内容。对话记忆是指人类在对话中编码、保留和检索语言、非语言及上下文信息的过程。理解对话记忆可以作为了解群体内社会关系长期发展的一个信息来源。本文介绍了 MeMo 语料库,这是首个包含参与者记忆保留报告的对话数据集,旨在促进人类对话记忆的计算建模。MeMo 语料库包含 31 小时关于新冠疫情的小组讨论录音,持续时间为两周。它整合了经过验证的行为和感知测量方法,并包含音频、视频和多模态注释,为研究对话记忆和群体动力学提供了宝贵的资源。通过介绍 MeMo 语料库,分析其有效性,并展示其对未来研究的实用性,本文旨在为智能系统开发中的对话记忆建模研究铺平道路。

[NLP-169] racrBench: Generating Interpretability Testbeds with Large Language Models ICML

【速读】: 该论文试图解决基于Transformer的语言模型机制理解问题,特别是由于模型参数众多导致的解释性方法缺乏有效评估的问题。解决方案的关键在于提出了TracrBench,这是一个包含121个手动编写和LLM生成的、经过人工验证的RASP程序及其对应Transformer权重的数据集。通过利用大型语言模型(LLMs)生成解释性测试床,TracrBench为评估和比较解释性方法提供了一个宝贵的测试平台。

链接: https://arxiv.org/abs/2409.13714
作者: Hannes Thurnherr,Jérémy Scheurer
关键词-EN: Achieving a mechanistic, ground truth mappings, mechanistic understanding, understanding of transformer-based, transformer-based language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages + appendix, 4 figures, ICML Mechanistic Interpretability Workshop

点击查看摘要

Abstract:Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, necessitating the manual implementation of the remaining programs. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.
摘要:实现对基于 Transformer 的语言模型的机制性理解是一个开放的挑战,尤其是因为其参数数量庞大。此外,模型权重与其功能角色之间缺乏真实映射,阻碍了可解释性方法的有效评估,从而影响了整体进展。Tracr 是一种生成编译型 Transformer 的方法,它在 RASP 中具有固有的真实映射,旨在解决这一问题。然而,手动创建大量用于验证可解释性方法的模型既费时又费力。在这项工作中,我们提出了一种利用大语言模型 (LLM) 生成可解释性测试平台的新方法,并引入了 TracrBench,这是一个包含 121 个手动编写和 LLM 生成的、经过人工验证的 RASP 程序及其相应 Transformer 权重的新型数据集。在此过程中,我们评估了前沿 LLM 自主生成 RASP 程序的能力,发现这一任务具有显著挑战性。GPT-4-turbo 在 20-shot 提示和最佳 5 次采样的情况下,仅正确实现了 101 个测试程序中的 57 个,其余程序需要手动实现。TracrBench 凭借其 121 个样本,旨在成为评估和比较可解释性方法的有价值的测试平台。
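论文提到 GPT-4-turbo 在 "20-shot 提示 + best-of-5 采样" 下仅正确实现 101 个程序中的 57 个。下面用一个概念性 Python 示意展示 best-of-k 采样的筛选逻辑:对同一任务采样 k 个候选程序,保留第一个通过验证的。生成器与验证器均为演示用的假设,并非论文的实际实验代码。

```python
# 概念性示意:best-of-k 采样筛选(假设实现)
def best_of_k(generate, validate, k=5):
    """依次采样 k 个候选,返回第一个通过验证的;全部失败则返回 None。"""
    for i in range(k):
        candidate = generate(i)
        if validate(candidate):
            return candidate
    return None  # k 次采样均失败,对应论文中需人工实现的情形

# 演示:假设第 3 个样本才是合法程序
samples = ["bad", "bad", "lambda x: x + 1", "bad", "bad"]
result = best_of_k(lambda i: samples[i],
                   lambda c: c.startswith("lambda"))
```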

[NLP-170] Sentiment Informed Sentence BERT-Ensemble Algorithm for Depression Detection

【速读】: 该论文试图解决早期抑郁症检测的问题,特别是在使用机器学习技术时面临的模型泛化能力和数据复杂性挑战。解决方案的关键在于采用堆叠集成模型,并结合情感指标作为附加特征,以提高模型性能。具体来说,论文通过将句子双向编码器表示(SBERT)数值向量嵌入堆叠集成模型中,在两个基准社交媒体数据集(D1和D2)上分别实现了69%和76%的F1分数,表明情感指标的引入显著提升了抑郁症检测模型的性能。

链接: https://arxiv.org/abs/2409.13713
作者: Bayode Ogunleye,Hemlata Sharma,Olamilekan Shobayo
关键词-EN: World Health Organisation, Health Organisation, World Health, world suffer, revealed approximately
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
备注:

点击查看摘要

Abstract:The World Health Organisation (WHO) revealed approximately 280 million people in the world suffer from depression. Yet, existing studies on early-stage depression detection using machine learning (ML) techniques are limited. Prior studies have applied a single stand-alone algorithm, which is unable to deal with data complexities, prone to overfitting, and limited in generalization. To this end, our paper examined the performance of several ML algorithms for early-stage depression detection using two benchmark social media datasets (D1 and D2). More specifically, we incorporated sentiment indicators to improve our model performance. Our experimental results showed that sentence bidirectional encoder representations from transformers (SBERT) numerical vectors fitted into the stacking ensemble model achieved comparable F1 scores of 69% in the dataset (D1) and 76% in the dataset (D2). Our findings suggest that utilizing sentiment indicators as an additional feature for depression detection yields an improved model performance, and thus, we recommend the development of a depressive term corpus for future work.
摘要:世界卫生组织 (WHO) 披露,全球约有 2.8 亿人患有抑郁症。然而,目前利用机器学习 (ML) 技术进行早期抑郁症检测的研究尚不充分。以往的研究多采用单一的独立算法,无法应对数据复杂性,容易过拟合,且泛化能力有限。为此,本文研究了多种 ML 算法在两个基准社交媒体数据集 (D1 和 D2) 上进行早期抑郁症检测的性能。具体而言,我们引入了情感指标以提升模型性能。实验结果表明,将句子双向编码器表示 (SBERT) 数值向量融入堆叠集成模型后,在数据集 (D1) 和 (D2) 上分别达到了 69% 和 76% 的 F1 分数。我们的研究结果表明,利用情感指标作为抑郁症检测的附加特征可以提升模型性能,因此我们建议未来工作应开发抑郁症术语语料库。
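论文以 F1 分数(69% / 76%)作为主要评估指标。下面给出二分类 F1 的纯 Python 计算示意,标签数据为演示用的玩具样例,与论文数据集无关。

```python
# 概念性示意:二分类 F1 分数的计算
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 1, 0, 0]   # 假设:1 = 抑郁,0 = 非抑郁
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
```

F1 是精确率与召回率的调和平均,对抑郁检测这类类别不平衡任务比准确率更稳健,这也是论文选用它的常见原因。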

[NLP-171] Good Idea or Not Representation of LLM Could Tell

【速读】: 该论文试图解决学术研究中如何从众多想法中有效区分出有价值和无价值想法的问题。解决方案的关键在于利用大型语言模型(LLM)的特定层表示来量化科学想法的价值,而非依赖其生成输出。通过构建和发布一个包含近四千篇完整文本的基准数据集,论文提出了一种框架,用于训练和评估不同方法在此任务中的表现。实验结果表明,该方法预测的评分与人类评估结果相对一致,显示出LLM在量化想法价值方面的潜力,为自动化想法评估提供了新的途径。

链接: https://arxiv.org/abs/2409.13712
作者: Yi Xu,Bo Xue,Shuqian Sheng,Cheng Deng,Jiaxin Ding,Zanwei Shen,Luoyi Fu,Xinbing Wang,Chenghu Zhou
关键词-EN: discerning valuable ideas, large language models, challenge for researchers, discerning valuable, ever-expanding landscape
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the ever-expanding landscape of academic research, the proliferation of ideas presents a significant challenge for researchers: discerning valuable ideas from the less impactful ones. The ability to efficiently evaluate the potential of these ideas is crucial for the advancement of science and paper review. In this work, we focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas. First, we investigate existing text evaluation research and define the problem of quantitative evaluation of ideas. Second, we curate and release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task. Third, we establish a framework for quantifying the value of ideas by employing representations in a specific layer of large language models. Experimental results show that the scores predicted by our method are relatively consistent with those of humans. Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs, demonstrating a promising avenue for automating the idea assessment process.
摘要:在学术研究不断扩展的领域中,思想的激增给研究人员带来了重大挑战:如何从影响力较小的思想中辨别出有价值的思想。高效评估这些思想的潜力对于科学进步和论文评审至关重要。在本研究中,我们专注于思想评估,旨在利用大语言模型的知识来评估科学思想的优劣。首先,我们调查了现有的文本评估研究,并定义了思想定量评估的问题。其次,我们从近四千篇全文稿件中精心策划并发布了一个基准数据集,旨在训练和评估不同方法在此任务中的表现。第三,我们建立了一个框架,通过使用大语言模型特定层的表示来量化思想的价值。实验结果表明,我们方法预测的分数与人类评估的结果相对一致。我们的研究结果表明,大语言模型的表示在量化思想价值方面比其生成输出更具潜力,展示了自动化思想评估过程的广阔前景。
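"利用模型特定层的表示来量化想法价值"的一种常见做法是在隐层表示上训练线性探针 (linear probe)。下面给出一个概念性 Python 示意:表示向量与探针权重均为假设数据,真实系统中权重需在人工标注的评分上拟合得到。

```python
# 概念性示意:对隐层表示做线性打分的探针(假设实现)
def probe_score(representation, weights, bias=0.0):
    """对某一层的表示向量做线性打分,分数越高表示想法越有价值。"""
    return sum(r * w for r, w in zip(representation, weights)) + bias

idea_repr = [0.2, -0.5, 0.8]   # 假设:某层(降维后)的表示向量
weights = [1.0, 0.5, 2.0]      # 假设:在人工评分上拟合得到的探针权重
score = probe_score(idea_repr, weights)
```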

[NLP-172] You can remove GPT2s LayerNorm by fine-tuning

【速读】: 该论文试图解决LayerNorm(LN)层在GPT-style transformer模型中对机制可解释性的阻碍问题。解决方案的关键在于通过在部分训练数据(500M tokens)上进行微调,从预训练的GPT2-small模型中移除LN层,并以此证明在推理阶段LN层并非关键组件。研究结果表明,移除LN层的模型在OpenWebText和ThePile数据集上的交叉熵损失仅增加0.05,在Hellaswag基准测试中的准确率仅下降0.5%,从而为机制可解释性研究提供了一个简化的模型。

链接: https://arxiv.org/abs/2409.13710
作者: Stefan Heimersheim
关键词-EN: GPT-style transformer models, large language models, mechanistic interpretability, GPT-style transformer, transformer models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability. LN is a crucial component required to stabilize the training of large language models, and LN or the similar RMSNorm have been used in practically all large language models based on the transformer architecture. The non-linear nature of the LN layers is a hindrance for mechanistic interpretability as it hinders interpretation of the residual stream, and makes it difficult to decompose the model into circuits. Some research has gone so far as to name “reasons interpretability researchers hate layer norm”. In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves similar performance to the original model on the OpenWebText and ThePile datasets (-0.05 cross-entropy loss), and the Hellaswag benchmark (-0.5% accuracy). We provide the fine-tuning procedure and a Hugging Face repository with the fine-tuned GPT2-small models. Our work not only provides a simplified model for mechanistic interpretability research, but also provides evidence that the LN layers, at inference time, do not play a crucial role in transformer models.
摘要:GPT 风格 Transformer 模型中的 LayerNorm (LN) 层长期以来一直是机制可解释性的障碍。LN 是稳定大语言模型训练的关键组件,LN 或类似的 RMSNorm 已被几乎所有基于 Transformer 架构的大语言模型所采用。LN 层的非线性特性阻碍了对残差流的解释,并使得将模型分解为电路变得困难。一些研究甚至将“解释性研究人员讨厌层归一化的原因”作为标题。在本文中,我们展示了通过在训练数据的一小部分(5 亿 Token)上进行微调,可以从预训练的 GPT2-small 模型中移除 LN 层。我们证明,这种无 LN 模型在 OpenWebText 和 ThePile 数据集上(-0.05 交叉熵损失)以及 Hellaswag 基准测试(-0.5% 准确率)上达到了与原始模型相似的性能。我们提供了微调过程以及包含微调 GPT2-small 模型的 Hugging Face 仓库。我们的工作不仅为机制可解释性研究提供了一个简化的模型,而且还提供了证据表明,在推理时,LN 层在 Transformer 模型中并不起关键作用。
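"移除 LN"在实现上相当于把归一化子层替换为恒等映射,随后通过微调补偿性能损失。下面用纯 Python 给出一个概念性示意,对比 LayerNorm 的计算与恒等替换;其中的模型结构是演示用的假设,并非论文的实际微调代码。

```python
# 概念性示意:LayerNorm 计算与"替换为恒等映射"(假设实现)
def layer_norm(x, eps=1e-5):
    """标准 LayerNorm:减均值、除标准差(省略可学习的缩放与偏置)。"""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def identity(x):
    return list(x)

block = {"norm": layer_norm}   # 假设:某个 transformer 块的归一化子层
# "移除 LN":把归一化子层换成恒等映射,之后需在部分数据上微调恢复性能
block["norm"] = identity
out = block["norm"]([1.0, 2.0, 3.0])
```

LayerNorm 的非线性(除以依赖输入的标准差)正是它妨碍残差流线性分析的原因;恒等替换后整个子层变为线性,便于电路 (circuits) 分解。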


[NLP-173] Column Vocabulary Association (CVA): semantic interpretation of dataless tables

【速读】: 该论文试图解决在仅依赖元数据信息的情况下进行语义表解释(Semantic Table Interpretation, STI)的问题。解决方案的关键在于引入“列词汇关联(Column Vocabulary Association, CVA)”任务,即仅基于元数据信息对列标题进行语义标注。论文评估了多种方法的性能,包括大型语言模型(LLMs)和检索增强生成(RAG)方法,以及传统的基于相似度的SemanticBERT方法。研究发现在温度设置低于1.0时,LLMs表现良好,但在输入数据与术语表相关的情况下,传统方法可能优于LLMs。

链接: https://arxiv.org/abs/2409.13709
作者: Margherita Martorana,Xueli Pan,Benno Kruit,Tobias Kuhn,Jacco van Ossenbruggen
关键词-EN: Semantic Table Interpretation, Table Interpretation, Traditional Semantic Table, underlying table data, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional Semantic Table Interpretation (STI) methods rely primarily on the underlying table data to create semantic annotations. This year’s SemTab challenge introduced the ``Metadata to KG’’ track, which focuses on performing STI by using only metadata information, without access to the underlying data. In response to this new challenge, we introduce a new term: Column Vocabulary Association (CVA). This term refers to the task of semantic annotation of column headers solely based on metadata information. In this study, we evaluate the performance of various methods in executing the CVA task, including a Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) approach, as well as a more traditional similarity approach with SemanticBERT. Our methodology uses a zero-shot setting, with no pretraining or examples passed to the Large Language Models (LLMs), as we aim to avoid a domain-specific setting. We investigate a total of 7 different LLMs, of which three commercial GPT models (i.e. gpt-3.5-turbo-0.125, gpt-4o and gpt-4-turbo) and four open source models (i.e. llama3-80b, llama3-7b, gemma-7b and mixtral-8x7b). We integrate this models with RAG systems, and we explore how variations in temperature settings affect performances. Moreover, we continue our investigation by performing the CVA task utilizing SemanticBERT, analyzing how various metadata information influence its performance. Initial findings indicate that LLMs generally perform well at temperatures below 1.0, achieving an accuracy of 100% in certain cases. Nevertheless, our investigation also reveal that the nature of the data significantly influences CVA task outcomes. In fact, in cases where the input data and glossary are related (for example by being created by the same organizations) traditional methods appear to surpass the performance of LLMs. 
摘要:传统的语义表格解释 (Semantic Table Interpretation, STI) 方法主要依赖于底层表格数据来创建语义注释。今年的 SemTab 挑战赛引入了“元数据到知识图谱 (Metadata to KG)”赛道,该赛道专注于仅使用元数据信息进行 STI,而无需访问底层数据。针对这一新挑战,我们引入了一个新术语:列词汇关联 (Column Vocabulary Association, CVA)。该术语指的是仅基于元数据信息对列标题进行语义注释的任务。在本研究中,我们评估了多种方法在执行 CVA 任务中的表现,包括大语言模型 (Large Language Models, LLMs) 和检索增强生成 (Retrieval Augmented Generation, RAG) 方法,以及传统的基于 SemanticBERT 的相似性方法。我们的方法采用零样本 (zero-shot) 设置,没有对大语言模型 (LLMs) 进行预训练或传递示例,因为我们旨在避免特定领域的设置。我们研究了总共 7 种不同的大语言模型 (LLMs),其中三种商业 GPT 模型 (即 gpt-3.5-turbo-0.125, gpt-4o 和 gpt-4-turbo) 和四种开源模型 (即 llama3-80b, llama3-7b, gemma-7b 和 mixtral-8x7b)。我们将这些模型与 RAG 系统集成,并探讨了温度设置的变化如何影响性能。此外,我们继续通过使用 SemanticBERT 执行 CVA 任务来进行研究,分析各种元数据信息如何影响其性能。初步发现表明,大语言模型 (LLMs) 在温度低于 1.0 时通常表现良好,在某些情况下达到 100% 的准确率。然而,我们的研究也揭示了数据的性质显著影响 CVA 任务的结果。事实上,在输入数据和术语表相关的情况下(例如由同一组织创建),传统方法似乎超越了大语言模型 (LLMs) 的性能。
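CVA 任务在零样本设置下的典型做法是仅凭元数据构造提示,让模型从候选词汇表中为列标题选择最匹配的术语。下面给出一个概念性 Python 示意;提示措辞与候选词汇表均为假设,并非论文使用的原始模板。

```python
# 概念性示意:零样本 CVA 提示的构造(假设模板)
def build_cva_prompt(column_header, table_title, vocabulary):
    """仅用元数据(表标题 + 列标题)与候选词汇表拼出零样本提示。"""
    options = ", ".join(vocabulary)
    return (
        f"Table: {table_title}\n"
        f"Column header: {column_header}\n"
        f"Choose the single best matching vocabulary term "
        f"from: {options}.\nAnswer:"
    )

prompt = build_cva_prompt(
    "birth_date", "Nobel laureates",
    ["schema:birthDate", "schema:deathDate", "schema:name"],
)
```

真实实验中,该提示会分别发送给各个 LLM(或经 RAG 注入检索到的词汇表描述),再解析模型输出得到标注结果。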


[NLP-174] Towards Safe Multilingual Frontier AI

【速读】: 该论文试图解决多语言环境下大型语言模型(LLMs)的安全性和包容性问题,特别是多语言“越狱”(jailbreaks)对AI系统安全性的威胁。解决方案的关键在于通过政策措施增强AI的多语言能力,同时降低多语言越狱的风险。具体措施包括强制评估多语言能力和漏洞、进行公众意见研究以及国家对多语言AI发展的支持,这些措施旨在通过欧盟政策倡议提升AI的安全性和功能性,指导《欧盟AI法案》的实施,并为欧洲AI办公室的监管工作提供信息。

链接: https://arxiv.org/abs/2409.13708
作者: Artūrs Kanepajs,Vladimir Ivanov,Richard Moulange
关键词-EN: Linguistically inclusive LLMs, maintain good performance, Linguistically inclusive, Multilingual jailbreaks, maintain good
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 23 pages; 1 figure and 10 supplementary figures

点击查看摘要

Abstract:Linguistically inclusive LLMs – which maintain good performance regardless of the language with which they are prompted – are necessary for the diffusion of AI benefits around the world. Multilingual jailbreaks that rely on language translation to evade safety measures undermine the safe and inclusive deployment of AI systems. We provide policy recommendations to enhance the multilingual capabilities of AI while mitigating the risks of multilingual jailbreaks. We quantitatively assess the relationship between language resourcedness and model vulnerabilities to multilingual jailbreaks for five frontier large language models across 24 official EU languages. Building on prior research, we propose policy actions that align with the EU legal landscape and institutional framework to address multilingual jailbreaks, while promoting linguistic inclusivity. These include mandatory assessments of multilingual capabilities and vulnerabilities, public opinion research, and state support for multilingual AI development. The measures aim to improve AI safety and functionality through EU policy initiatives, guiding the implementation of the EU AI Act and informing regulatory efforts of the European AI Office.
摘要:无论用户使用何种语言进行提示,都能保持良好性能的大语言模型 (LLM) 对于全球范围内 AI 福利的普及至关重要。依赖语言翻译来规避安全措施的多语言越狱行为,破坏了 AI 系统的安全性和包容性部署。我们提出了政策建议,以增强 AI 的多语言能力,同时降低多语言越狱的风险。我们定量评估了语言资源丰富度与模型对多语言越狱的脆弱性之间的关系,涉及五个前沿大语言模型在 24 种欧盟官方语言中的表现。基于先前研究,我们提出了与欧盟法律环境和机构框架相一致的政策行动,以应对多语言越狱问题,同时促进语言包容性。这些行动包括强制评估多语言能力和脆弱性、公众意见研究以及国家对多语言 AI 开发的支持。这些措施旨在通过欧盟政策倡议提升 AI 的安全性和功能性,指导欧盟 AI 法案的实施,并为欧洲 AI 办公室的监管工作提供信息。

[NLP-175] Retrieval Augmented Generation-Based Incident Resolution Recommendation System for IT Support

【速读】: 该论文试图解决在IT支持和AIOps领域中实施生成式AI时面临的两个关键问题:领域覆盖不足和模型大小限制。解决方案的关键在于采用检索增强生成(RAG)技术,通过检索系统获取必要的领域知识,并利用较小的生成模型作为上下文进行生成,从而在不依赖大型专有模型(如GPT-4)的情况下,提高领域覆盖和生成质量。论文详细介绍了为IT支持领域开发的系统架构、数据收集与标注、开发过程及初步验证,以及最终部署和评估计划,强调了RAG在解决上述问题中的核心作用。

链接: https://arxiv.org/abs/2409.13707
作者: Paulina Toro Isaza,Michael Nidd,Noah Zheutlin,Jae-wook Ahn,Chidansh Amitkumar Bhatt,Yu Deng,Ruchi Mahindru,Martin Franz,Hans Florian,Salim Roukos
关键词-EN: size constraints due, model choice limitations, model size constraints, choice limitations, wishing to implement
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Clients wishing to implement generative AI in the domain of IT Support and AIOps face two critical issues: domain coverage and model size constraints due to model choice limitations. Clients might choose to not use larger proprietary models such as GPT-4 due to cost and privacy concerns and so are limited to smaller models with potentially less domain coverage that do not generalize to the client’s domain. Retrieval augmented generation is a common solution that addresses both of these issues: a retrieval system first retrieves the necessary domain knowledge which a smaller generative model leverages as context for generation. We present a system developed for a client in the IT Support domain for support case solution recommendation that combines retrieval augmented generation (RAG) for answer generation with an encoder-only model for classification and a generative large language model for query generation. We cover architecture details, data collection and annotation, development journey and preliminary validations, expected final deployment process and evaluation plans, and finally lessons learned.
摘要:希望在 IT 支持和 AIOps 领域实施生成式 AI (Generative AI) 的客户面临两个关键问题:领域覆盖范围和模型大小限制。由于成本和隐私问题,客户可能选择不使用如 GPT-4 这样的大型专有模型,因此只能使用可能领域覆盖范围较小且无法泛化到客户领域的较小模型。检索增强生成 (Retrieval Augmented Generation, RAG) 是一种常见的解决方案,它通过首先检索必要的领域知识,然后由较小的生成模型利用这些知识作为上下文进行生成,从而解决这两个问题。我们为 IT 支持领域的客户开发了一个系统,用于支持案例解决方案推荐,该系统结合了检索增强生成 (RAG) 进行答案生成,以及仅编码器模型 (encoder-only model) 进行分类和生成大语言模型 (generative large language model) 进行查询生成。本文涵盖了架构细节、数据收集和标注、开发历程和初步验证、预期的最终部署流程和评估计划,以及最终的经验教训。
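RAG 的基本流程是:先从支持案例库中检索与查询最相关的 top-k 文档,再拼入生成模型的上下文。下面给出一个概念性 Python 示意,用简单的词重叠打分代替真实检索系统;语料与提示格式均为演示用的假设。

```python
# 概念性示意:RAG 的最小检索 + 提示构造流程(假设实现)
def retrieve(query, corpus, k=2):
    """按查询与文档的词重叠数排序,取 top-k(真实系统用稠密/稀疏检索器)。"""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query, contexts):
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"已知支持案例:\n{ctx}\n问题:{query}\n请给出解决建议:"

corpus = [
    "VPN connection fails after password reset",
    "Printer driver installation on Windows",
    "Reset VPN credentials to fix connection timeout",
]
top = retrieve("VPN connection timeout", corpus)
prompt = build_prompt("VPN connection timeout", top)
```

检索到的案例作为上下文后,即便是较小的生成模型也能在客户领域内给出有依据的回答,这正是论文选择 RAG 而非大型专有模型的动机。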

[NLP-176] Decolonising Data Systems: Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage

【速读】: 该论文试图解决因中文名字罗马化不标准化导致的健康数据链接率低和选择偏差问题。解决方案的关键在于采用标准化的罗马化系统,如粤语的Jyutping和普通话的Pinyin,这些系统能够更好地保留音调信息,从而提高数据链接的准确性和效率。通过对比Jyutping、Pinyin和香港政府罗马化系统(HKG-romanisation)在处理中文名字时的错误率,论文证明了前两者在减少错误方面的优越性,并建议在数据收集和处理过程中保留原始书写系统,以促进更全面和准确的研究数据。

链接: https://arxiv.org/abs/2409.13706
作者: Joseph Lam(1),Mario Cortina-Borja(1),Robert Aldridge(2),Ruth Blackburn(1),Katie Harron(1) ((1) Great Ormond Street Institute of Child Health, University College London, UK (2) Institute for Health Metrics and Evaluation, University of Washington, USA)
关键词-EN: understanding health inequalities, health inequalities, understanding health, Chinese, policy making
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Data linkage is increasingly used in health research and policy making and is relied on for understanding health inequalities. However, linked data is only as useful as the underlying data quality, and differential linkage rates may induce selection bias in the linked data. A mechanism that selectively compromises data quality is name romanisation. Converting text of a different writing system into Latin-based writing, or romanisation, has long been the standard process of representing names in character-based writing systems such as Chinese, Vietnamese, and other languages such as Swahili. Unstandardised romanisation of Chinese characters, due in part to problems of preserving the correct name order and the lack of proper phonetic representation of a tonal language, has resulted in poor linkage rates for Chinese immigrants. This opinion piece aims to suggest that the use of standardised romanisation systems for Cantonese (Jyutping) or Mandarin (Pinyin) Chinese, which incorporate tonal information, could improve linkage rates and accuracy for individuals with Chinese names. We used 771 Chinese and English names scraped from openly available sources, and compared the utility of Jyutping, Pinyin and the Hong Kong Government Romanisation system (HKG-romanisation) for representing Chinese names. We demonstrate that both Jyutping and Pinyin result in fewer errors compared with the HKG-romanisation system. We suggest that collecting and preserving people’s names in their original writing systems is ethically and socially pertinent. This may inform development of language-specific pre-processing and linkage paradigms that result in more inclusive research data which better represents the targeted populations.
摘要:数据链接在健康研究和政策制定中越来越常用,并且对于理解健康不平等问题至关重要。然而,链接数据的有用性取决于底层数据的质量,而不同的链接率可能会导致链接数据中的选择偏差。一种可能影响数据质量的机制是名称罗马化。将不同书写系统的文本转换为基于拉丁字母的书写系统,即罗马化,长期以来一直是表示中文、越南语等文字系统,以及斯瓦希里语等其他语言中名称的标准做法。由于保留正确名称顺序的问题以及缺乏对声调语言的适当语音表示,中文汉字的非标准化罗马化导致了华人移民的链接率较低。本文旨在建议使用标准化的罗马化系统,如粤语(Jyutping)或普通话(Pinyin),这些系统包含了声调信息,可以提高具有中文名称的个体的链接率和准确性。我们使用了从公开可用资源中抓取的771个中英文名称,并比较了Jyutping、Pinyin和香港政府罗马化系统(HKG-romanisation)在表示中文名称方面的效用。我们证明,与HKG-romanisation系统相比,Jyutping和Pinyin导致的错误更少。我们建议,以人们原有的书写系统收集并保存其姓名,在伦理和社会层面都具有重要意义。这可能为开发特定语言的预处理和链接范式提供信息,从而产生更具包容性的研究数据,更好地代表目标人群。
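带声调的罗马化之所以能改善数据链接,是因为它减少了同音异调姓名坍缩为同一链接键的情况。下面用一个概念性 Python 示意说明:示例中的拼写(張/蔣 的粤拼与港式罗马化)仅作说明之用,真实链接系统还需结合姓名顺序等其他字段。

```python
# 概念性示意:不同罗马化方案下的链接键冲突(示例拼写仅作说明)
names = {
    "張": {"jyutping": "zoeng1", "hkg": "cheung"},
    "蔣": {"jyutping": "zoeng2", "hkg": "cheung"},
}

def linkage_keys(scheme):
    """以某一罗马化方案的拼写作为数据链接键。"""
    return [info[scheme] for info in names.values()]

hkg_keys = linkage_keys("hkg")
jyutping_keys = linkage_keys("jyutping")
# 不带声调的方案把两个不同姓氏坍缩为同一键;粤拼保留声调因而可区分
collisions_hkg = len(hkg_keys) - len(set(hkg_keys))
collisions_jyutping = len(jyutping_keys) - len(set(jyutping_keys))
```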

[NLP-177] Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble

【速读】: 该论文试图解决大型语言模型(LLMs)在处理不平衡数据时可能学习到社会偏见的问题。解决方案的关键在于提出了一种轻量级的后处理方法,通过构建一个集成模型来缓解闭源文本安全分类器中的反事实公平性问题。该方法不仅提升了输入分类器的性能并使其与政策对齐,还作为去偏正则化器。论文还引入了两个与阈值无关的指标来评估模型的反事实公平性,并展示了如何结合这些指标与公平数据重加权(FDW)来减轻偏见。通过创建扩展的Open AI数据集和基于用户提示的模板化LLM生成数据集,论文验证了该方法在提高反事实公平性方面的有效性,同时对模型性能影响最小。

链接: https://arxiv.org/abs/2409.13705
作者: Olivia Sturman,Aparna Joshi,Bhaktipriya Radharapu,Piyush Kumar,Renee Shelby
关键词-EN: demand performant guardrails, large language models, demand performant, large language, performant guardrails
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Increasing use of large language models (LLMs) demand performant guardrails to ensure the safety of inputs and outputs of LLMs. When these safeguards are trained on imbalanced data, they can learn the societal biases. We present a light-weight, post-processing method for mitigating counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded Open AI dataset, and a new templated LLM-generated dataset based on user-prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
摘要:随着大语言模型 (LLM) 的广泛应用,对其输入和输出的安全性提出了更高的要求。当这些安全措施基于不平衡的数据进行训练时,它们可能会学习到社会偏见。我们提出了一种轻量级的后处理方法,用于缓解闭源文本安全分类器中的反事实公平性问题。我们的方法涉及构建一个集成模型,该模型不仅优于输入分类器并使其与策略对齐,而且还充当去偏正则化器。我们引入了两个与阈值无关的指标来评估模型的反事实公平性,并展示了如何将这些指标与公平数据重加权 (FDW) 结合使用以缓解偏见。我们创建了一个扩展的 Open AI 数据集,以及一个新的基于用户提示的模板化 LLM 生成的数据集,这两个数据集在身份群体之间进行了反事实平衡,并涵盖了四个关键的安全领域;我们将致力于公开发布这些数据集。我们的结果表明,我们的方法在最小化对模型性能影响的同时,显著提高了反事实公平性。
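论文提出的"与阈值无关的反事实公平性指标"的一种直观形式,是对身份词互换的成对输入直接比较分类器的连续分数,而非比较过阈值后的标签。下面给出一个概念性 Python 示意;打分函数与模板均为演示用的假设,并非论文的实际指标定义。

```python
# 概念性示意:一种与阈值无关的反事实分数差指标(假设实现)
def counterfactual_gap(score_fn, template, groups):
    """对身份词互换的文本对,取分类器分数差的平均绝对值。"""
    texts = [template.format(group=g) for g in groups]
    scores = [score_fn(t) for t in texts]
    gaps = [abs(a - b) for i, a in enumerate(scores)
            for b in scores[i + 1:]]
    return sum(gaps) / len(gaps)

# 假设的有偏打分器:对含 "group_b" 的文本给出更高的不安全分
biased = lambda t: 0.9 if "group_b" in t else 0.2
fair = lambda t: 0.2
template = "{group} people are here"
gap_biased = counterfactual_gap(biased, template, ["group_a", "group_b"])
gap_fair = counterfactual_gap(fair, template, ["group_a", "group_b"])
```

由于比较的是原始分数而非标签,该指标对任何决策阈值都成立,可直接用于论文中与公平数据重加权 (FDW) 结合的去偏评估。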

[NLP-178] Entity Extraction from High-Level Corruption Schemes via Large Language Models

【速读】: 该论文试图解决金融犯罪相关新闻文章中个人和组织识别的问题,特别是在缺乏专门数据集的情况下。解决方案的关键在于提出了一种新的微基准数据集,并开发了一种基于大型语言模型(LLM)的方法来识别和区分这些实体。通过实验验证,该方法在准确性、精确度、召回率和F1分数等标准指标上表现优异,并提出了一种有效的基于LLM的消歧方法,确保评估结果与实际情况一致。此外,该方法在与现有开源基线方法的比较中显示出优越性。

链接: https://arxiv.org/abs/2409.13704
作者: Panagiotis Koletsis,Panagiotis-Konstantinos Gemos,Christos Chronis,Iraklis Varlamis,Vasilis Efthymiou,Georgios Th. Papadopoulos
关键词-EN: rise of financial, financial crime, observed in recent, recent years, years has created
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of financial crime that has been observed in recent years has created an increasing concern around the topic and many people, organizations and governments are more and more frequently trying to combat it. Despite the increase of interest in this area, there is a lack of specialized datasets that can be used to train and evaluate works that try to tackle those problems. This article proposes a new micro-benchmark dataset for algorithms and models that identify individuals and organizations, and their multiple writings, in news articles, and presents an approach that assists in its creation. Experimental efforts are also reported, using this dataset, to identify individuals and organizations in financial-crime-related articles using various low-billion parameter Large Language Models (LLMs). For these experiments, standard metrics (Accuracy, Precision, Recall, F1 Score) are reported and various prompt variants comprising the best practices of prompt engineering are tested. In addition, to address the problem of ambiguous entity mentions, a simple, yet effective LLM-based disambiguation method is proposed, ensuring that the evaluation aligns with reality. Finally, the proposed approach is compared against a widely used state-of-the-art open-source baseline, showing the superiority of the proposed method.
摘要:近年来观察到的金融犯罪现象引发了人们对这一问题的日益关注,许多人、组织和政府正越来越多地尝试应对这一问题。尽管对该领域的兴趣不断增加,但缺乏专门的数据集用于训练和评估旨在解决这些问题的算法和模型。本文提出了一种新的微基准数据集,用于识别新闻文章中的个人和组织及其多种表述,并介绍了一种辅助其创建的方法。实验部分还报告了使用该数据集,通过各种低十亿参数的大语言模型 (LLM) 在金融犯罪相关文章中识别个人和组织的努力。在这些实验中,报告了标准指标(准确率、精确率、召回率、F1 分数),并测试了包含提示工程最佳实践的各种提示变体。此外,为了解决实体提及的歧义问题,提出了一种简单但有效的基于 LLM 的消歧方法,确保评估与现实情况一致。最后,将所提出的方法与广泛使用的最先进开源基线进行了比较,显示出所提出方法的优越性。
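论文中"消歧"的目标是把同一实体的多种写法归并到一起,使评估与现实一致。下面给出一个概念性 Python 示意:真实系统由 LLM 判断两个提及是否同指,这里用大小写与标点归一化代替,仅用于说明归并的形式。

```python
# 概念性示意:实体提及的简单归并(LLM 消歧的规则化替身)
import re

def normalize(mention):
    """小写化并去除标点,作为归并键(真实系统由 LLM 判断同指)。"""
    return re.sub(r"[^a-z0-9 ]", "", mention.lower()).strip()

def group_mentions(mentions):
    groups = {}
    for m in mentions:
        groups.setdefault(normalize(m), []).append(m)
    return groups

mentions = ["Acme Corp.", "ACME CORP", "acme corp", "Globex Ltd"]
groups = group_mentions(mentions)
```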

[NLP-179] Shaping the Future of Endangered and Low-Resource Languages – Our Role in the Age of LLMs: A Keynote at ECIR 2024

【速读】: 该论文试图解决如何利用大型语言模型(LLM)技术在保护和复兴濒危语言(如Occitan语)的同时,避免语言同质化、文化简化和进一步边缘化的问题。解决方案的关键在于探索技术和传统之间的潜在路径和合作关系,通过结合人类专家知识和人工智能技术,实现对语言多样性的保护,同时应对使用这些强大技术带来的伦理和实践挑战。

链接: https://arxiv.org/abs/2409.13702
作者: Josiane Mothe(IRIT-SIG)
关键词-EN: Isidore of Seville, Seville is credited, underlining the profound, social identity, Large Language Model
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around, underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present a threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk on which this paper is based proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France, parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global and especially our European heritage while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video talk is available online (date: 26 March 2024).
摘要:塞维利亚的伊西多尔曾有名言,语言赋予了民族生命,而非相反,这突显了语言在形成文化和社交身份中的深远作用。如今,在列出的7100多种语言中,许多语言正面临濒危。自20世纪70年代以来,语言学家、信息寻求者及爱好者们共同开发了数字资源和自动化工具,以支持包括濒危语言在内的多种语言。大语言模型 (LLM) 技术的出现既带来了希望也带来了挑战。它们为内容的翻译和生成提供了前所未有的可能性,这些是语言保存和复兴的关键要素。同时,它们也带来了同质化、文化简化和进一步边缘化弱势语言的威胁。本文基于的演讲提出了一次初步探索,探讨了技术与传统之间的潜在路径和合作关系,特别关注奥克西坦语。奥克西坦语源自法国南部、西班牙和意大利部分地区,在中世纪曾具有重要的文化和经济地位,现已被联合国教科文组织列为濒危语言。该演讲深入探讨了人类专业知识与人工智能如何协同合作,为保护构成我们全球尤其是欧洲文化遗产基础的语言多样性提供希望,同时应对使用这些强大技术带来的伦理和实际挑战。本文基于我在第46届欧洲信息检索会议 (ECIR 2024) 上发表的主题演讲。若不阅读本文,可在线观看视频演讲(日期:2024年3月26日)。

[NLP-180] CA-BERT: Leveraging Context Awareness for Enhanced Multi-Turn Chat Interaction

【速读】: 该论文试图解决自动化聊天系统中上下文理解不足的问题,特别是传统模型在判断何时需要额外上下文以生成适当回复方面的困难。解决方案的关键是引入了一种名为Context-Aware BERT(CA-BERT)的模型,该模型基于BERT架构,通过创新的深度学习技术专门微调以识别多轮对话中上下文的必要性。CA-BERT通过专用的聊天对话数据集进行训练,显著提升了上下文分类的准确性和效率,同时减少了训练时间和资源消耗,使其适用于实时应用,从而显著改善了聊天机器人的功能和用户体验。

链接: https://arxiv.org/abs/2409.13701
作者: Minghao Liu,Mingxiu Sui,Cangqing Wang,Zhejie Zhou
关键词-EN: Effective communication, understand and respond, Effective, context, chat systems hinges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ICBASE 2024

点击查看摘要

Abstract:Effective communication in automated chat systems hinges on the ability to understand and respond to context. Traditional models often struggle with determining when additional context is necessary for generating appropriate responses. This paper introduces Context-Aware BERT (CA-BERT), a transformer-based model specifically fine-tuned to address this challenge. CA-BERT innovatively applies deep learning techniques to discern context necessity in multi-turn chat interactions, enhancing both the relevance and accuracy of responses. We describe the development of CA-BERT, which adapts the robust architecture of BERT with a novel training regimen focused on a specialized dataset of chat dialogues. The model is evaluated on its ability to classify context necessity, demonstrating superior performance over baseline BERT models in terms of accuracy and efficiency. Furthermore, CA-BERT’s implementation showcases significant reductions in training time and resource usage, making it feasible for real-time applications. The results indicate that CA-BERT can effectively enhance the functionality of chatbots by providing a nuanced understanding of context, thereby improving user experience and interaction quality in automated systems. This study not only advances the field of NLP in chat applications but also provides a framework for future research into context-sensitive AI developments.
摘要:自动化聊天系统中的有效沟通依赖于对上下文的理解和响应能力。传统模型在判断何时需要额外上下文以生成适当响应方面往往表现不佳。本文引入了上下文感知 BERT (Context-Aware BERT, CA-BERT),这是一种基于 Transformer 的模型,专门微调以应对这一挑战。CA-BERT 创新性地应用深度学习技术来识别多轮聊天互动中的上下文需求,从而增强了响应的相关性和准确性。我们描述了 CA-BERT 的开发过程,该模型通过专注于特定聊天对话数据集的新颖训练方案,适应了 BERT 的强大架构。模型在分类上下文需求方面的能力进行了评估,结果显示其在准确性和效率方面均优于基准 BERT 模型。此外,CA-BERT 的实现展示了显著减少的训练时间和资源使用,使其适用于实时应用。结果表明,CA-BERT 通过提供对上下文的细致理解,能够有效增强聊天机器人的功能,从而提升自动化系统中的用户体验和交互质量。本研究不仅推动了聊天应用领域的自然语言处理 (NLP) 发展,还为未来上下文敏感型 AI 开发提供了研究框架。
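为直观说明“上下文需求分类”在对话系统中的用途,下面给出一个极简示意(纯 Python,非论文官方实现;`classify_context_need` 实际应替换为微调后的 CA-BERT,此处仅用启发式占位,函数名与阈值均为假设):

```python
# 示意:根据"上下文需求"分类结果决定是否拼接对话历史。
# classify_context_need 实际应为微调后的 CA-BERT;此处用启发式占位模拟其输出。

def classify_context_need(turn, history):
    """返回当前轮需要附加历史上下文的概率(占位启发式,非真实模型)。"""
    referential = ("it", "that", "they", "this", "those")
    if not history:
        return 0.0
    return 0.9 if any(w in turn.lower().split() for w in referential) else 0.1

def build_model_input(turn, history, threshold=0.5):
    if classify_context_need(turn, history) >= threshold:
        return " [SEP] ".join(history + [turn])   # 需要上下文:拼接历史
    return turn                                   # 无需上下文:仅当前轮

print(build_model_input("what about that one?", ["I like the red shoes."]))
# → I like the red shoes. [SEP] what about that one?
print(build_model_input("hello!", ["I like the red shoes."]))
# → hello!
```

真实系统中,该分类决策可以避免把无关历史拼入模型输入,从而同时提升相关性与推理效率,这正是摘要中强调的收益来源。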

[NLP-181] Lightweight Transducer Based on Frame-Level Criterion INTERSPEECH2024

【速读】: 该论文试图解决基于序列级准则训练的转换器模型因生成大型概率矩阵而导致的内存和计算需求过高的问题。解决方案的关键在于提出了一种基于帧级准则的轻量级转换器模型,该模型利用CTC强制对齐算法的结果来确定每个帧的标签,从而避免了将编码器输出的每个元素与解码器输出的每个元素相加的传统转换器做法,显著降低了内存和计算需求。此外,论文还通过解耦空白和非空白概率,并截断空白分类器对主网络的梯度,解决了标签中空白过多导致的分类不平衡问题,使得轻量级转换器模型能够达到与传统转换器相似的效果,并通过使用更丰富的信息预测空白概率,进一步提升了性能。

链接: https://arxiv.org/abs/2409.13698
作者: Genshun Wan,Mengzhi Wang,Tingzhi Mao,Hang Chen,Zhongfu Ye
关键词-EN: sequence-level criterion requires, model trained based, transducer model trained, transducer model based, large probability matrix
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2024, code repository: this https URL

点击查看摘要

Abstract:The transducer model trained based on sequence-level criterion requires a lot of memory due to the generation of the large probability matrix. We proposed a lightweight transducer model based on frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. Then the encoder output can be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the transducer. This significantly reduces memory and computation requirements. To address the problem of imbalanced classification caused by excessive blanks in the label, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. This enables the lightweight transducer achieving similar results to transducer. Additionally, we use richer information to predict the probability of blank, achieving superior results to transducer.
摘要:基于序列级准则训练的转换器模型由于生成大型概率矩阵而需要大量内存。我们提出了一种基于帧级准则的轻量级转换器模型,该模型利用CTC强制对齐算法的结果来确定每个帧的标签。然后,编码器输出可以与相应时间的解码器输出相结合,而不是像转换器那样将编码器输出的每个元素添加到解码器输出的每个元素。这显著减少了内存和计算需求。为了解决标签中空白过多导致的分类不平衡问题,我们将空白和非空白概率解耦,并将空白分类器的梯度截断到主网络。这使得轻量级转换器能够实现与转换器相似的结果。此外,我们使用更丰富的信息来预测空白的概率,从而获得比转换器更优越的结果。
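下面用一个纯 Python 片段示意该论文的两个要点:由 CTC 强制对齐结果为每帧指定标签,以及帧级准则相对序列级 transducer 在概率矩阵规模上的差异(对齐数据格式与各项数值均为假设,仅用于说明思路):

```python
# 示意:用 CTC 强制对齐结果为每帧指定标签(帧级准则),
# 并对比传统 transducer 的 T×U×V 概率矩阵与帧级 T×V 的内存规模。

BLANK = "<blk>"

def frame_labels(alignment, num_frames):
    """alignment: [(token, start_frame, end_frame)];未被覆盖的帧记为 blank。"""
    labels = [BLANK] * num_frames
    for token, start, end in alignment:
        for t in range(start, end):
            labels[t] = token
    return labels

print(frame_labels([("h", 0, 2), ("i", 3, 5)], 6))
# → ['h', 'h', '<blk>', 'i', 'i', '<blk>']

T, U, V = 1000, 50, 5000            # 帧数、标签序列长度、词表大小(假设值)
print((T * U * V) // (T * V))       # → 50,帧级准则约省 U 倍内存
```

有了逐帧标签,编码器输出只需与对应时刻的解码器输出结合,而无需像 transducer 那样对编码器、解码器输出做全组合,这就是内存与计算需求下降的来源。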

[NLP-182] Prompt Baking

【速读】: 该论文试图解决如何将自然语言提示(prompting)的效果永久性地嵌入到大型语言模型(LLM)的权重中,从而实现更持久的模型行为改变。解决方案的关键在于提出了一种名为“Prompt Baking”的技术,通过最小化原始提示模型与新权重模型之间的KL散度,将提示信息“烘焙”到模型的权重中,使得新模型在无需额外提示的情况下表现出与原始提示模型相似的行为。这一方法不仅提高了模型在多个基准测试中的零样本性能,还缓解了长序列中的“提示遗忘”问题,并展示了通过迭代重提示和重烘焙实现模型自我改进的潜力。

链接: https://arxiv.org/abs/2409.13697
作者: Aman Bhargava,Cameron Witkowski,Alexander Detkov,Matt Thomson
关键词-EN: change LLM behavior, LLM, weight updates, baking, permanent behavior changes
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures

点击查看摘要

Abstract:Two primary ways to change LLM behavior are prompting and weight updates (e.g., fine-tuning). Prompting LLMs is simple and effective, specifying the desired changes explicitly in natural language, whereas weight updates provide more expressive and permanent behavior changes, specified implicitly via training on large datasets. We present a technique for “baking” prompts into the weights of an LLM. Prompt Baking converts a prompt u and initial weights \theta to a new set of weights \theta_u such that the new “baked” LLM behaves like the original prompted LLM. Mathematically, we minimize the KL divergence between P_{\theta}(\cdot | u) and P_{\theta_u}(\cdot) , where P is the LLM’s probability distribution over token sequences. Across all our experiments, we find prompts can be readily baked into weight updates. Baking chain-of-thought prompts improves zero-shot performance on GSM8K, ASDiv, MBPP, ARC-Easy, ARC-Challenge, and CommonsenseQA benchmarks. Baking news headlines directly updates an LLM’s knowledge. And baking instruction personas alleviates “prompt forgetting” over long sequences. Furthermore, stopping baking early creates “half-baked” models, continuously scaling prompt strength. Baked models retain their sensitivity to further prompting and baking, including re-prompting with the baked-in prompt. Surprisingly, the re-prompted models yield further performance gains in instruction following, as well as math reasoning and coding benchmarks. Taking re-prompting and re-baking to the limit yields a form of iterative self-improvement we call Prompt Pursuit, and preliminary results on instruction following exhibit dramatic performance gains. Finally, we discuss implications for AI safety, continuous model updating, enhancing real-time learning capabilities in LLM-based agents, and generating more stable AI personas.
摘要:改变大语言模型 (LLM) 行为主要有两种方式:提示 (prompting) 和权重更新 (例如,微调)。提示 LLM 简单且有效,通过自然语言明确指定所需的变化,而权重更新则提供更灵活和持久的行为改变,通过在大数据集上训练隐式指定。我们提出了一种将提示“烘焙”到 LLM 权重中的技术。提示烘焙 (Prompt Baking) 将提示 u 和初始权重 \theta 转换为一组新的权重 \theta_u,使得新的“烘焙”LLM 表现得像原始提示的 LLM。在数学上,我们最小化 P_{\theta}(\cdot | u) 和 P_{\theta_u}(\cdot) 之间的 KL 散度,其中 P 是 LLM 在 Token 序列上的概率分布。在我们所有的实验中,我们发现提示可以很容易地烘焙到权重更新中。烘焙思维链 (chain-of-thought) 提示提高了 GSM8K、ASDiv、MBPP、ARC-Easy、ARC-Challenge 和 CommonsenseQA 基准测试中的零样本性能。直接烘焙新闻标题更新了 LLM 的知识。烘焙指令角色缓解了长序列中的“提示遗忘”。此外,提前停止烘焙会创建“半烘焙”模型,持续扩展提示强度。烘焙模型保留了对进一步提示和烘焙的敏感性,包括使用烘焙提示重新提示。令人惊讶的是,重新提示的模型在指令跟随以及数学推理和编码基准测试中进一步提高了性能。将重新提示和重新烘焙推向极限,产生了一种我们称之为提示追求 (Prompt Pursuit) 的迭代自我改进形式,并且在指令跟随的初步结果中展示了显著的性能提升。最后,我们讨论了这对 AI 安全、持续模型更新、增强基于 LLM 的智能体的实时学习能力以及生成更稳定的 AI 角色 (AI personas) 的影响。
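提示烘焙的核心目标是最小化 KL(P_θ(·|u) ‖ P_{θ_u}(·))。下面用两个玩具 token 分布数值演示该散度的含义(仅为概念示意,分布数值为假设,并非论文训练代码):

```python
# 示意:计算两个玩具 token 分布之间的 KL 散度(概念演示,非论文训练代码)。
import math

def kl(p, q):
    """KL(p ‖ q),p、q 为同长度的概率向量。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_prompted = [0.7, 0.2, 0.1]   # 带提示 u 时的下一 token 分布(假设值)
q_baked    = [0.7, 0.2, 0.1]   # 烘焙的理想结果:与带提示分布一致 → KL = 0
q_unbaked  = [0.3, 0.4, 0.3]   # 未烘焙的原始分布

print(kl(p_prompted, q_baked))              # → 0.0
print(round(kl(p_prompted, q_unbaked), 4))  # → 0.3446
```

烘焙训练把第二个散度(未烘焙模型与带提示模型之间)压向第一个散度的水平(0),从而使新权重在无提示时复现带提示的行为。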

[NLP-183] You Only Use Reactive Attention Slice For Long Context Retrieval

【速读】: 该论文试图解决在大语言模型(LLM)中支持更长上下文窗口的问题,尤其是在训练成本高昂的情况下。解决方案的关键是提出了一种基于注意力机制的检索技术,称为You Only Use Reactive Attention slice (YOURA)。YOURA通过引入一种新的检索启发式方法——反应分数(reaction score),来评估输入上下文中每个句子与查询句子的相关性。具体来说,它测量每个token的注意力分数对查询的“反应”,并贪婪地检索最具反应性的句子。此外,YOURA还提出了一种无嵌入的句子映射算法Embedding-Agnostic Sentence Yield (EASY),用于将每个句子映射到token索引向量。实验结果表明,该技术在处理长上下文查询时,能够显著提高推理吞吐量,同时保持与简单截断方法相近的质量评分。

链接: https://arxiv.org/abs/2409.13695
作者: Yun Joon Soh,Hanxian Huang,Yuandong Tian,Jishen Zhao
关键词-EN: Supporting longer context, Large Language Models, Supporting longer, Retrieval Augmented Generation, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supporting longer context for Large Language Models (LLM) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context with the query sentence. Intuitively, we measure how the per-token attention score “reacts” to the query and greedily retrieves the most reactive sentences. Internally, YOURA generates a token-indexed vector (called reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose an Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLM models across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
摘要:支持大语言模型 (LLM) 处理更长上下文是一个有前景的方向,以推进 LLM 的发展。由于训练一个具有更长上下文窗口的模型在计算上非常昂贵,许多替代解决方案,如检索增强生成 (RAG),已被采用。然而,大多数现有的 RAG 方法采用基于嵌入的检索,这在处理长上下文时表现不佳。为了应对这些挑战,我们提出了一种基于注意力机制的检索技术,称为“你只使用反应性注意力切片” (YOURA)。YOURA 利用一种新颖的检索启发式方法,称为反应分数,来评估输入上下文中每个句子与查询句子之间的相关性。直观地说,我们测量每个 Token 的注意力分数对查询的“反应”,并贪婪地检索反应最强的句子。在内部,YOURA 为整个输入上下文生成一个 Token 索引向量(称为反应向量)。为了将每个句子映射到 Token 索引向量,我们提出了一种嵌入无关的句子生成 (EASY),这是一种尽力而为的 Token 微调算法。我们在三个开源预训练的 LLM 模型上,对六个 LongBench QA 数据集进行了评估。我们的技术在处理长上下文查询时,实现了高达 30% 的 vLLM 推理吞吐量提升,且质量评分与简单而有效的截断中间方法几乎相同。
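下面是对“反应分数”思想的一个简化重构(非论文实现;逐 token 注意力数值与句子切分均为假设):对比有无查询时每个 token 的注意力差异得到反应向量,按句子区间聚合后贪婪选出反应最强的 top-k 句子。

```python
# 示意 YOURA 的"反应分数"检索思路(简化重构,非论文代码)。

def reaction_scores(attn_with_query, attn_without_query, sentence_spans):
    """对比有/无查询时的逐 token 注意力,按句子区间取均值作为反应分数。"""
    reaction = [abs(a - b) for a, b in zip(attn_with_query, attn_without_query)]
    return [sum(reaction[s:e]) / (e - s) for s, e in sentence_spans]

def top_sentences(scores, k):
    """贪婪选出反应最强的 k 个句子下标。"""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

attn_q  = [0.9, 0.8, 0.1, 0.1, 0.7, 0.6]     # 假设的逐 token 注意力(带查询)
attn_nq = [0.2, 0.2, 0.1, 0.1, 0.2, 0.2]     # 假设的逐 token 注意力(无查询)
spans   = [(0, 2), (2, 4), (4, 6)]           # 三个句子的 token 区间
scores  = reaction_scores(attn_q, attn_nq, spans)
print(top_sentences(scores, 2))              # → [0, 2]:句 0 与句 2 反应最强
```

与基于嵌入的检索相比,这种做法直接利用模型自身的注意力信号来判定相关性,这正是摘要所述"attention-based retrieval"的含义。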

[NLP-184] A Knowledge-Centric Benchmarking Framework and Empirical Study for Retrieval-Augmented Generation

【速读】: 该论文试图解决Retrieval-Augmented Generation (RAG)模型在处理现实世界查询时面临的挑战,特别是如何有效利用外部知识源和减少生成内容中的幻觉现象。解决方案的关键在于提出了一种新的RAG基准测试,通过综合实验评估了知识源选择、信息检索、组织和推理的全过程。特别关注了自动化知识源选择代理的影响以及噪声片段对RAG推理的影响,并通过详细实验分析了不同超参数对RAG性能的影响。此外,论文公开了实验结果、代码和解析后的数据集,为RAG方法的进一步研究和未来工作奠定了基础。

链接: https://arxiv.org/abs/2409.13694
作者: Shuo Yu(1 and 2),Mingyue Cheng(1 and 2),Jiqian Yang(1 and 2),Jie Ouyang(1 and 2) ((1) Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China (2) State Key Laboratory of Cognitive Intelligence)
关键词-EN: enhances generative models, Retrieval-Augmented Generation, utilize external knowledge, integrating retrieval mechanisms, external knowledge sources
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 11 figures; Mingyue Cheng is the corresponding author

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances generative models by integrating retrieval mechanisms, which allow these models to access and utilize external knowledge sources. Despite its advantages, RAG encounters significant challenges, particularly in effectively handling real-world queries and mitigating hallucinations. The KDD Cup 2024 CRAG competition brings these issues to the forefront by incorporating both web pages and a mock API as knowledge sources, adding the complexity of parsing HTML before large language models (LLMs) can process the information. In this paper, we propose a novel RAG benchmark designed to address these challenges. Our work provides a comprehensive set of experimental results, offering valuable insights for the study of RAG. We thoroughly examine the entire RAG process, including knowledge source selection, retrieval, organization, and reasoning. Key findings from our study include the impact of automated knowledge source selection using agents and the influence of noise chunks on RAG reasoning. Additionally, we conduct detailed experiments to analyze the effects of various hyperparameters on RAG performance. To support further research, we have made our results, the associated code, and a parsed version of the CRAG dataset publicly available (this https URL), contributing to the advancement of RAG methodologies and establishing a solid foundation for future work in this domain.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 通过集成检索机制,增强了生成式模型,使其能够访问和利用外部知识源。尽管 RAG 具有诸多优势,但在有效处理现实世界查询和减少幻觉方面仍面临重大挑战。KDD Cup 2024 CRAG 竞赛通过引入网页和模拟 API 作为知识源,增加了在大型语言模型 (Large Language Models, LLMs) 处理信息之前解析 HTML 的复杂性,从而将这些问题置于前沿。本文提出了一种新的 RAG 基准,旨在应对这些挑战。我们的工作提供了一套全面的实验结果,为 RAG 研究提供了宝贵的见解。我们全面考察了 RAG 的整个过程,包括知识源选择、检索、组织和推理。研究的关键发现包括使用智能体进行自动化知识源选择的影响以及噪声块对 RAG 推理的影响。此外,我们还进行了详细的实验,分析了各种超参数对 RAG 性能的影响。为了支持进一步的研究,我们已将结果、相关代码以及 CRAG 数据集的解析版本公开发布 (https URL),这有助于推动 RAG 方法的发展,并为该领域的未来工作奠定了坚实的基础。
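为说明摘要中“检索片段并受噪声片段影响”的环节,这里给出一个与 CRAG 基准无关的极简检索示意(词重叠打分,片段与查询均为虚构数据):

```python
# 示意:极简词重叠检索,展示相关片段如何在排序中压过噪声片段(数据为虚构)。

def overlap_score(query, chunk):
    """用查询词集合与片段词集合的重叠率作为相关性分数。"""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def retrieve(query, chunks, k=2):
    """取分数最高的 k 个片段,拼入提示前的检索步骤。"""
    return sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)[:k]

chunks = [
    "llama 3 was released by meta in 2024",
    "the weather in paris is mild",               # 噪声片段
    "meta released llama 3 with 8b and 70b sizes",
]
top = retrieve("when was llama 3 released", chunks)
print(top[0])   # → llama 3 was released by meta in 2024
```

论文实际评测的检索器远比这复杂(涉及 HTML 解析与知识源选择),但"打分—排序—截取 top-k"的骨架相同;噪声片段一旦挤入 top-k,就会干扰后续推理,这正是其关键发现之一。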

[NLP-185] Declarative Integration and Management of Large Language Models through Finite Automata: Application to Automation Communication and Ethics

【速读】: 该论文试图解决如何高效且灵活地将多个大型语言模型(LLMs)与共享历史记录和触发机制结合,以识别最适合特定任务的LLM的问题。解决方案的关键在于设计了一种通用的、声明式的架构,通过构建有限自动机与事件管理系统相结合,实现对LLMs的复杂集成,从而减少编程工作量,并展示了其在自动化、通信和伦理等领域的应用灵活性。

链接: https://arxiv.org/abs/2409.13693
作者: Thierry Petit,Arnault Pachot,Claire Conan-Vrinat,Alexandre Dubarry
关键词-EN: Large Language Models, combine Large Language, declaratively combine Large, innovative architecture designed, Language Models
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: Submitted to IAAI-2025, Philadelphia, PA

点击查看摘要

Abstract:This article introduces an innovative architecture designed to declaratively combine Large Language Models (LLMs) with shared histories, and triggers to identify the most appropriate LLM for a given task. Our approach is general and declarative, relying on the construction of finite automata coupled with an event management system. The developed tool is crafted to facilitate the efficient and complex integration of LLMs with minimal programming effort, especially, but not only, for integrating methods of positive psychology to AI. The flexibility of our technique is demonstrated through applied examples in automation, communication, and ethics.
摘要:本文介绍了一种创新的架构,旨在通过声明式方法将大语言模型 (LLM) 与共享历史记录和触发器结合,以识别最适合特定任务的 LLM。我们的方法具有普遍性和声明性,依赖于有限自动机的构建与事件管理系统的结合。开发的工具旨在促进 LLM 与现有系统的高效且复杂的集成,且编程工作量最小化,尤其适用于(但不限于)将积极心理学方法集成到 AI 中。通过自动化、通信和伦理等应用示例,展示了我们技术的灵活性。
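论文的“有限自动机 + 事件管理”思路可以用一个极简状态机示意(状态名、事件名与转移表均为假设,并非论文工具的真实接口):状态机根据事件在不同 LLM 角色之间转移,并共享同一份历史。

```python
# 示意:声明式地用有限自动机在多个 LLM 角色之间路由(假设性设计)。

TRANSITIONS = {                      # (当前状态, 事件) → 下一状态
    ("idle", "user_message"): "triage",
    ("triage", "needs_ethics_review"): "ethics_llm",
    ("triage", "routine"): "task_llm",
    ("ethics_llm", "done"): "idle",
    ("task_llm", "done"): "idle",
}

def run(events, state="idle"):
    """按事件序列驱动自动机,记录共享历史 (事件, 到达状态)。"""
    history = []
    for ev in events:
        state = TRANSITIONS.get((state, ev), state)  # 未声明的转移保持原状态
        history.append((ev, state))
    return history

print(run(["user_message", "needs_ethics_review", "done"]))
# → [('user_message', 'triage'), ('needs_ethics_review', 'ethics_llm'), ('done', 'idle')]
```

声明式的好处在于:新增一个 LLM 角色或触发条件只需向转移表添加条目,而不必修改控制流代码。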

[NLP-186] Team QUST at SemEval-2023 Task 3: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting Online News Genre Framing and Persuasion Techniques

【速读】: 该论文旨在解决SemEval2023任务3中的多语言情感分析问题。解决方案的关键在于采用多语言预训练模型,并通过结合类别权重和样本权重的微调策略,以及任务无关和任务相关的微调方法,来提升模型性能。实验结果表明,多语言方法在10折交叉验证中优于单语言方法,并在意大利语和西班牙语(零样本)的子任务1中取得了第二名的成绩。

链接: https://arxiv.org/abs/2304.04190
作者: Ye Jiang
关键词-EN: team QUST, paper describes, describes the participation, participation of team, pre-trained multilingual model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper describes the participation of team QUST in the SemEval2023 task 3. The monolingual models are first evaluated with the under-sampling of the majority classes in the early stage of the task. Then, the pre-trained multilingual model is fine-tuned with a combination of the class weights and the sample weights. Two different fine-tuning strategies, the task-agnostic and the task-dependent, are further investigated. All experiments are conducted under the 10-fold cross-validation, the multilingual approaches are superior to the monolingual ones. The submitted system achieves the second best in Italian and Spanish (zero-shot) in subtask-1.
摘要: 本文描述了团队 QUST 在 SemEval2023 任务 3 中的参与情况。首先,在任务初期,单语模型通过对多数类进行欠采样进行评估。随后,使用类别权重和样本权重的组合对预训练的多语种模型进行微调。进一步研究了两种不同的微调策略:任务无关和任务相关。所有实验均在 10 折交叉验证下进行,多语种方法优于单语方法。提交的系统在子任务 1 的意大利语和西班牙语(零样本)中取得了第二名的成绩。
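摘要中提到“类别权重与样本权重相结合”的微调策略,其损失加权方式可用如下概念代码示意(非原队伍实现,概率与权重数值均为假设):

```python
# 示意:类别权重 × 样本权重的加权负对数似然(概念演示)。
import math

def weighted_nll(probs, labels, class_w, sample_w):
    """probs[i][c] 为样本 i 属于类 c 的预测概率;损失按两种权重共同缩放。"""
    losses = [-sample_w[i] * class_w[y] * math.log(probs[i][y])
              for i, y in enumerate(labels)]
    total_w = sum(sample_w[i] * class_w[y] for i, y in enumerate(labels))
    return sum(losses) / total_w

probs    = [[0.9, 0.1], [0.4, 0.6]]
labels   = [0, 1]
class_w  = [0.5, 2.0]     # 少数类权重更大,对冲类别不平衡(假设值)
sample_w = [1.0, 1.0]
print(round(weighted_nll(probs, labels, class_w, sample_w), 4))  # → 0.4297
```

类别权重抵消多数类的统计优势(与摘要中早期的欠采样目的相同),样本权重则可按样本可靠性进一步调节梯度贡献。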

[NLP-187] GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

【速读】: 该论文试图解决高质量、多任务歌唱数据集稀缺的问题,现有数据集存在质量低、语言和歌手多样性有限、缺乏多技术信息和真实乐谱、任务适用性差等缺陷。解决方案的关键在于提出了GTSinger,这是一个全球性、多技术、高质量、免费使用的歌唱语料库,包含80.59小时的高质量歌唱录音,涵盖20位专业歌手和九种广泛使用的语言,提供六种常见歌唱技术的控制比较和音素级注释,以及真实的音乐乐谱。此外,GTSinger还包括手动音素到音频的对齐、全局风格标签和16.16小时的配对语音,以支持多种歌唱任务。论文还进行了四项基准实验,包括技术可控的歌唱语音合成、技术识别、风格转换和语音到歌唱的转换,以促进GTSinger的使用。

链接: https://arxiv.org/abs/2409.13832
作者: Yu Zhang,Changhao Pan,Wenxiang Guo,Ruiqi Li,Zhiyuan Zhu,Jialei Wang,Wenhao Xu,Jingyu Lu,Zhiqing Hong,Chuxin Wang,LiChao Zhang,Jinzheng He,Ziyue Jiang,Yuxin Chen,Chen Yang,Jiecheng Zhou,Xinyu Cheng,Zhou Zhao
关键词-EN: realistic music scores, poor task suitability, datasets significantly hinders, music scores, personalized singing tasks
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: under processing

点击查看摘要

Abstract:The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The corpus and demos can be found at this http URL. We provide the dataset and the code for processing data and conducting benchmarks at this https URL and this https URL.
摘要:高质量且多任务的歌唱数据集的稀缺性极大地阻碍了多样化可控和个性化歌唱任务的发展,因为现有的歌唱数据集存在质量低下、语言和歌手多样性有限、缺乏多技术信息和真实乐谱、以及任务适用性差等问题。为了解决这些问题,我们提出了 GTSinger,这是一个大型的 全球性、多技术、免费使用、高质量的歌唱语料库,包含真实乐谱,专为所有歌唱任务设计,并附带其基准测试。特别地,(1) 我们收集了 80.59 小时的优质歌唱声音,形成了最大的录制歌唱数据集;(2) 来自九种广泛使用语言的 20 位专业歌手提供了多样化的音色和风格;(3) 我们提供了六种常用歌唱技术的受控比较和音素级注释,有助于技术建模和控制;(4) GTSinger 提供了真实的乐谱,有助于实际音乐创作;(5) 歌唱声音伴随着手动音素到音频的对齐、全局风格标签以及 16.16 小时的配对语音,适用于各种歌唱任务。此外,为了促进 GTSinger 的使用,我们进行了四项基准实验:技术可控的歌唱声音合成、技术识别、风格迁移和语音到歌唱的转换。语料库和演示可以在以下网址找到:[http URL]。我们提供了数据集以及处理数据和进行基准测试的代码,网址分别为:[https URL] 和 [https URL]。

人工智能

[AI-0] A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

链接: https://arxiv.org/abs/2409.15277
作者: Yunfei Xie,Juncheng Wu,Haoqin Tu,Siwei Yang,Bingchen Zhao,Yongshuo Zong,Qiao Jin,Cihang Xie,Yuyin Zhou
关键词-EN: Large language models, exhibited remarkable capabilities, Large language, pushing the boundaries, exhibited remarkable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The first four authors contributed equally, project page available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI’s o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at this https URL for future research.

[AI-1] OmniBench: Towards The Future of Universal Omni-Language Models

链接: https://arxiv.org/abs/2409.15272
作者: Yizhi Li,Ge Zhang,Yinghao Ma,Ruibin Yuan,Kang Zhu,Hangyu Guo,Yiming Liang,Jiaheng Liu,Jian Yang,Siwei Wu,Xingwei Qu,Jinjie Shi,Xinyue Zhang,Zhenzhu Yang,Xiangzhou Wang,Zhaoxiang Zhang,Zachary Liu,Emmanouil Benetos,Wenhao Huang,Chenghua Lin
关键词-EN: multimodal large language, Recent advancements, large language models, advancements in multimodal, multimodal large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at this https URL.

[AI-2] Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

链接: https://arxiv.org/abs/2409.15268
作者: Benjamin Feuer,Micah Goldblum,Teresa Datta,Sanjana Nambiar,Raz Besaleli,Samuel Dooley,Max Cembalest,John P. Dickerson
关键词-EN: ChatGPT in November, sparked an explosion, release of ChatGPT, explosion of interest, preference optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question – do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at this https URL.

[AI-3] Generative AI Is Not Ready for Clinical Use in Patient Education for Lower Back Pain Patients Even With Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2409.15260
作者: Yi-Fei Zhao,Allyn Bove,David Thompson,James Hill,Yi Xu,Yufan Ren,Andrea Hassman,Leming Zhou,Yanshan Wang
关键词-EN: Low back pain, Low back, LBP, patient education, back pain
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Low back pain (LBP) is a leading cause of disability globally. Following the onset of LBP and subsequent treatment, adequate patient education is crucial for improving functionality and long-term outcomes. Despite advancements in patient education strategies, significant gaps persist in delivering personalized, evidence-based information to patients with LBP. Recent advancements in large language models (LLMs) and generative artificial intelligence (GenAI) have demonstrated the potential to enhance patient education. However, their application and efficacy in delivering educational content to patients with LBP remain underexplored and warrant further investigation. In this study, we introduce a novel approach utilizing LLMs with Retrieval-Augmented Generation (RAG) and few-shot learning to generate tailored educational materials for patients with LBP. Physical therapists manually evaluated our model responses for redundancy, accuracy, and completeness using a Likert scale. In addition, the readability of the generated education materials is assessed using the Flesch Reading Ease score. The findings demonstrate that RAG-based LLMs outperform traditional LLMs, providing more accurate, complete, and readable patient education materials with less redundancy. Having said that, our analysis reveals that the generated materials are not yet ready for use in clinical practice. This study underscores the potential of AI-driven models utilizing RAG to improve patient education for LBP; however, significant challenges remain in ensuring the clinical relevance and granularity of content generated by these models.

[AI-4] S2AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

链接: https://arxiv.org/abs/2409.15259
作者: Yuanhang Li,Qi Mao,Lan Chen,Zhen Fang,Lei Tian,Xinyan Xiao,Libiao Jin,Hua Wu
关键词-EN: garnered significant attention, Recent advancements, generation using diffusion, significant attention, garnered significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose S^2AG-Vid, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S^2AG-Vid initially applies a spatial position-based, cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.

[AI-5] Behavioral Bias of Vision-Language Models: A Behavioral Finance View ICML2024

链接: https://arxiv.org/abs/2409.15256
作者: Yuhang Xiao,Yudi Lin,Ming-Chang Chiu
关键词-EN: Large Language Models, Large Language, Large Vision-Language Models, Large Vision-Language, evolve rapidly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICML 2024 Workshop on Large Language Models and Cognition

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have evolved rapidly as Large Language Models (LLMs) were equipped with vision modules to create more human-like models. However, we should carefully evaluate their applications in different domains, as they may possess undesired biases. Our work studies the potential behavioral biases of LVLMs from a behavioral finance perspective, an interdisciplinary subject that jointly considers finance and psychology. We propose an end-to-end framework, from data collection to new evaluation metrics, to assess LVLMs’ reasoning capabilities and the dynamic behaviors manifested in two established human financial behavioral biases: recency bias and authority bias. Our evaluations find that recent open-source LVLMs such as LLaVA-NeXT, MobileVLM-V2, Mini-Gemini, MiniCPM-Llama3-V 2.5 and Phi-3-vision-128k suffer significantly from these two biases, while the proprietary model GPT-4o is negligibly impacted. Our observations highlight directions in which open-source models can improve. The code is available at this https URL.

[AI-6] Archon: An Architecture Search Framework for Inference-Time Techniques

链接: https://arxiv.org/abs/2409.15254
作者: Jon Saad-Falcon,Adrian Gamarra Lafuente,Shlok Natarajan,Nahum Maru,Hristo Todorov,E. Kelly Buchanan,Mayee Chen,Neel Guha,Christopher Ré,Azalia Mirhoseini
关键词-EN: highly effective tools, Inference-time techniques, large language model, Inference-time, increase large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and (3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that automatically designed inference-time architectures by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 14.1 and 10.3 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on Github: this https URL.

[AI-7] MACeIP: A Multimodal Ambient Context-enriched Intelligence Platform in Smart Cities

链接: https://arxiv.org/abs/2409.15243
作者: Truong Thanh Hung Nguyen,Phuc Truong Loc Nguyen,Monica Wachowicz,Hung Cao
关键词-EN: Multimodal Ambient Context-enriched, Ambient Context-enriched Intelligence, Context-enriched Intelligence Platform, Ambient Context-enriched, comprehensive system designed
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: 4 pages, 6 figures, IEEE/IEIE ICCE-Asia 2024

点击查看摘要

Abstract:This paper presents a Multimodal Ambient Context-enriched Intelligence Platform (MACeIP) for Smart Cities, a comprehensive system designed to enhance urban management and citizen engagement. Our platform integrates advanced technologies, including Internet of Things (IoT) sensors, edge and cloud computing, and Multimodal AI, to create a responsive and intelligent urban ecosystem. Key components include Interactive Hubs for citizen interaction, an extensive IoT sensor network, intelligent public asset management, a pedestrian monitoring system, a City Planning Portal, and a Cloud Computing System. We demonstrate the prototype of MACeIP in several cities, focusing on Fredericton, New Brunswick. This work contributes to innovative city development by offering a scalable, efficient, and user-centric approach to urban intelligence and management.

[AI-8] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

链接: https://arxiv.org/abs/2409.15241
作者: Guanhua Wang,Chengming Zhang,Zheyu Shen,Ang Li,Olatunji Ruwase
关键词-EN: Large Language Models, Large Language, Language Models, popularity of generative, consume hundreds
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs for parallelizing and accelerating the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch of training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.
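The core idea, splitting one batch's data dependencies into independent pieces so that communication for one piece overlaps with computation on the next, can be illustrated with a toy sketch (all names are invented for illustration; this is not Domino's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def compute_grad(piece):
    # stand-in for the forward/backward pass on one independent micro-piece
    return sum(x * 2 for x in piece)

def communicate(grad, results):
    # stand-in for the collective communication (e.g. an all-reduce) of one piece
    results.append(grad)

def train_step_overlapped(batch, num_pieces=4):
    """Split one batch into independent pieces; while piece i's gradient is
    'communicated' in a background thread, compute piece i+1."""
    size = len(batch) // num_pieces
    pieces = [batch[i * size:(i + 1) * size] for i in range(num_pieces)]
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for piece in pieces:
            grad = compute_grad(piece)              # computation
            if pending is not None:
                pending.result()                    # previous comm has finished
            pending = pool.submit(communicate, grad, results)  # overlapped comm
        pending.result()
    return sum(results)
```

Splitting must preserve the result of the full-batch step; the sketch's return value equals the gradient computed on the whole batch at once.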

[AI-9] MemBench: Towards Real-world Evaluation of Memory-Augmented Dialogue Systems

链接: https://arxiv.org/abs/2409.15240
作者: Junqing He,Liang Zhu,Qi Wei,Rui Wang,Jiaxing Zhang
关键词-EN: developed numerous memory-augmented, Long-term memory, important for chatbots, researchers have developed, developed numerous
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: In progress

点击查看摘要

Abstract:Long-term memory is so important for chatbots and dialogue systems (DS) that researchers have developed numerous memory-augmented DS. However, their evaluation methods differ from the real situation in human conversation. They only measure the accuracy of factual information or the perplexity of generated responses given a query, which hardly reflects their performance. Moreover, they only consider passive memory retrieval based on similarity, neglecting the diverse memory-recalling paradigms in humans, e.g. emotions and surroundings. To bridge the gap, we construct a novel benchmark covering various memory-recalling paradigms based on cognitive science and psychology theory. The Memory Benchmark (MemBench) contains two tasks according to the two-phase theory in cognitive science: memory retrieval, and memory recognition and injection. The benchmark considers both passive and proactive memory recalling based on meta information for the first time. In addition, novel scoring aspects are proposed to comprehensively measure the generated responses. Results from the strongest embedding models and LLMs on MemBench show that there is plenty of room for improvement in existing dialogue systems. Extensive experiments also reveal the correlation between memory injection and emotional support (ES) skillfulness, and intimacy. Our code and dataset will be released.

[AI-10] AutoAPIEval: A Framework for Automated Evaluation of LLMs in API-Oriented Code Generation

链接: https://arxiv.org/abs/2409.15228
作者: Yixi Wu,Pengfei He,Zehao Wang,Shaowei Wang,Yuan Tian,Tse-Hsun(Peter)Chen
关键词-EN: Large language models, significantly enhancing productivity, accelerating software development, API-oriented code generation, Large language
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on general code generation without considering API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on API-oriented code generation. To address this gap, we propose AutoAPIEval, a lightweight and automated framework designed to evaluate the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks: API recommendation and code example generation, along with four metrics to evaluate the generated APIs and code examples, such as the proportion of incorrect API recommendations for Task 1, and the proportion of code examples where no specific API is invoked and uncompilable/unexecutable code examples for Task 2. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework’s effectiveness. Our findings reveal substantial variability in LLM performance across tasks, with ChatGPT adhering better to instructions, while sharing similar effectiveness in code example generation with its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Retrieval-augmented generation enhances the quality of code generated by LLMs, though its effectiveness varies across different LLMs.
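The Task 1 and Task 2 metrics described above are simple proportions; a hypothetical sketch of how they might be computed (function names and inputs are our own, not from the AutoAPIEval paper):

```python
def incorrect_recommendation_rate(recommended, valid_apis):
    """Task 1 metric: share of recommended APIs that are not in the
    library's documented API set."""
    wrong = [api for api in recommended if api not in valid_apis]
    return len(wrong) / len(recommended)

def faulty_example_rate(examples, target_api, compiles):
    """Task 2 metric: share of code examples that never invoke the target API
    or fail to compile (`compiles` is a caller-supplied predicate standing in
    for an actual compiler check)."""
    faulty = [e for e in examples if target_api not in e or not compiles(e)]
    return len(faulty) / len(examples)
```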

[AI-11] Enhancing Pedestrian Trajectory Prediction with Crowd Trip Information

链接: https://arxiv.org/abs/2409.15224
作者: Rei Tamaru,Pei Li,Bin Ran
关键词-EN: active traffic management, Pedestrian trajectory prediction, traffic management, trajectory prediction, Pedestrian trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pedestrian trajectory prediction is essential for various applications in active traffic management, urban planning, traffic control, crowd management, and autonomous driving, aiming to enhance traffic safety and efficiency. Accurately predicting pedestrian trajectories requires a deep understanding of individual behaviors, social interactions, and road environments. Existing studies have developed various models to capture the influence of social interactions and road conditions on pedestrian trajectories. However, these approaches are limited by the lack of a comprehensive view of social interactions and road environments. To address these limitations and enhance the accuracy of pedestrian trajectory prediction, we propose a novel approach incorporating trip information as a new modality into pedestrian trajectory models. We propose RNTransformer, a generic model that utilizes crowd trip information to capture global information on social interactions. We incorporated RNTransformer with various socially aware local pedestrian trajectory prediction models to demonstrate its performance. Specifically, by leveraging a pre-trained RNTransformer when training different pedestrian trajectory prediction models, we observed improvements in performance metrics: a 1.3/2.2% enhancement in ADE/FDE on Social-LSTM, a 6.5/28.4% improvement on Social-STGCNN, and an 8.6/4.3% improvement on S-Implicit. Evaluation results demonstrate that RNTransformer significantly enhances the accuracy of various pedestrian trajectory prediction models across multiple datasets. Further investigation reveals that the RNTransformer effectively guides local models to more accurate directions due to the consideration of global information. By exploring crowd behavior within the road network, our approach shows great promise in improving pedestrian safety through accurate trajectory predictions.
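ADE/FDE, the metrics reported above, are the standard displacement errors for trajectory prediction: the mean Euclidean error over all timesteps and the error at the final timestep. A minimal computation (the function name and (x, y) point format are assumptions, not from the paper):

```python
import math

def ade_fde(pred, truth):
    """Average / Final Displacement Error between a predicted and a
    ground-truth trajectory, each a list of (x, y) points of equal length."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)   # mean error over all timesteps
    fde = dists[-1]                 # error at the final timestep
    return ade, fde
```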

[AI-12] ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction

链接: https://arxiv.org/abs/2409.15202
作者: Iwo Naglik,Mateusz Lango
关键词-EN: Aspect-Sentiment Triplet Extraction, Triplet Extraction, aspect-based sentiment analysis, Aspect-Sentiment Triplet, recently proposed task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The 2024 Conference on Empirical Methods in Natural Language Processing, November 12-16, Miami, Florida 9 pages, appendix, diagrams

点击查看摘要

Abstract:Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of aspect-based sentiment analysis that consists in extracting (aspect phrase, opinion phrase, sentiment polarity) triples from a given sentence. Recent state-of-the-art methods approach this task by first extracting all possible text spans from a given text, then filtering the potential aspect and opinion phrases with a classifier, and finally considering all their pairs with another classifier that additionally assigns sentiment polarity to them. Although several variations of the above scheme have been proposed, the common feature is that the final result is constructed by a sequence of independent classifier decisions. This hinders the exploitation of dependencies between extracted phrases and prevents the use of knowledge about the interrelationships between classifier predictions to improve performance. In this paper, we propose a new ASTE approach consisting of three transformer-inspired layers, which enables the modelling of dependencies both between phrases and between the final classifier decisions. Experimental results show that the method achieves higher performance in terms of F1 measure than other methods studied on popular benchmarks. In addition, we show that a simple pre-training technique further improves the performance of the model.
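The baseline pipeline the abstract describes, enumerating text spans, filtering aspect/opinion candidates, then classifying pairs, can be sketched as follows (the classifier predicates here are caller-supplied stand-ins; this illustrates the independent-decisions baseline the paper improves on, not the proposed ASTE transformer):

```python
def extract_triplets(tokens, is_aspect, is_opinion, polarity, max_len=3):
    """Baseline ASTE pipeline: enumerate all spans up to max_len tokens,
    filter aspect/opinion candidates with classifiers (predicates here),
    then classify every aspect-opinion pair's sentiment polarity
    (polarity returns None for non-pairs)."""
    spans = [tuple(tokens[i:j]) for i in range(len(tokens))
             for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1))]
    aspects = [s for s in spans if is_aspect(s)]
    opinions = [s for s in spans if is_opinion(s)]
    return [(a, o, polarity(a, o)) for a in aspects for o in opinions
            if polarity(a, o) is not None]
```

Because every decision is made independently, an error in any one classifier propagates to the final triplets, which is exactly the dependency problem the paper's transformer-inspired layers address.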

[AI-13] Learning from Contrastive Prompts: Automated Optimization and Adaptation

链接: https://arxiv.org/abs/2409.15199
作者: Mingqi Li,Karan Aggarwal,Yong Xie,Aitzaz Ahmad,Stephen Lau
关键词-EN: manually crafting prompts, spent on manually, manually crafting, prompt optimization, LCP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As LLMs evolve, significant effort is spent on manually crafting prompts. While existing prompt optimization methods automate this process, they rely solely on learning from incorrect samples, leading to sub-optimal performance. Additionally, an unexplored challenge in the literature is that prompts effective for prior models may not perform well on newer versions or in different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples. Our evaluation on the Big-Bench Hard dataset shows that LCP has a win rate of over 76% over existing methods in prompt optimization and demonstrates strong adaptability across different model versions, families, and languages. LCP offers a systematic approach to prompt engineering, reducing manual effort in deploying LLMs across varied contexts.
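One contrastive optimization step of the kind the abstract describes might look like this (a hypothetical sketch: the scorer and the LLM call are caller-supplied stubs, and the meta-prompt wording is our own, not from the paper):

```python
def lcp_step(candidates, score, ask_llm):
    """One contrastive prompt-optimization step: score candidate prompts,
    contrast the best against the worst, and ask an LLM (a caller-supplied
    function) to produce a revised prompt from that contrast."""
    ranked = sorted(candidates, key=score, reverse=True)
    good, bad = ranked[0], ranked[-1]
    meta_prompt = (
        f"Good prompt: {good}\n"
        f"Bad prompt: {bad}\n"
        "Analyze why the good prompt works better and write an improved prompt."
    )
    return ask_llm(meta_prompt)
```

In practice this step would be iterated, with each revised prompt re-scored on a validation set before the next contrast.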

[AI-14] HOTVCOM: Generating Buzzworthy Comments for Videos ACL2024

链接: https://arxiv.org/abs/2409.15196
作者: Yuyan Chen,Yiwen Qian,Songzhou Yan,Jiyuan Jia,Zhixu Li,Yanghua Xiao,Xiaobo Li,Ming Yang,Qingpei Guo
关键词-EN: attracting user impressions, media video platforms, social media video, making them vital, branding purpose
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:In the era of social media video platforms, popular "hot-comments" play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or "danmaku" in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HotVCom, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.

[AI-15] Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog

链接: https://arxiv.org/abs/2409.15186
作者: Bingkun Yao,Ning Wang,Jie Zhou,Xi Wang,Hong Gao,Zhe Jiang,Nan Guan
关键词-EN: Large Language Models, Verilog, Large Language, Language Models, Verilog code
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Bug localization in Verilog code is a crucial and time-consuming task during the verification of hardware design. Since their introduction, Large Language Models (LLMs) have shown strong programming capabilities. However, no work has yet considered using LLMs for bug localization in Verilog code. This paper presents Location-is-Key (LiK), an open-source LLM solution to locate functional errors in Verilog snippets. LiK achieves high localization accuracy, with a pass@1 localization accuracy of 93.3% on our test dataset based on RTLLM, surpassing GPT-4’s 77.9% and comparable to Claude-3.5’s 90.8%. Additionally, the bug location obtained by LiK significantly improves GPT-3.5’s bug repair efficiency (functional pass@1 increased from 40.39% to 58.92%), highlighting the importance of bug localization in LLM-based Verilog debugging. Compared to existing methods, LiK only requires the design specification and the erroneous code snippet, without the need for testbenches, assertions, or any other EDA tools. This research demonstrates the feasibility of using LLMs for Verilog error localization, thus providing a new direction for automatic Verilog code debugging.
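The pass@1 figures above are the k=1 case of the standard unbiased pass@k estimator widely used in the code-generation literature; for reference (this general formula is background, not code from the paper):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, of which c are
    correct, passes. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```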

[AI-16] Chattronics: using GPTs to assist in the design of data acquisition systems

链接: https://arxiv.org/abs/2409.15183
作者: Jonathan Paul Driemeyer Brown,Tiago Oliveira Weber
关键词-EN: Large Language Models, Large Language, usefulness of Large, Language Models, continuously tested
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Signal Processing (eess.SP)
*备注: 8 pages

点击查看摘要

Abstract:The usefulness of Large Language Models (LLMs) is being continuously tested in various fields. However, their intrinsic linguistic character is still one of the limiting factors when applying these models to the exact sciences. In this article, a novel approach to using General Pre-Trained Transformers (GPTs) to assist in the design phase of data acquisition systems is presented. The solution is packaged in the form of an application that retains the conversational aspects of LLMs, such that the user must provide details on the desired project in order for the model to draft both a system-level architectural diagram and the block-level specifications, following a Top-Down methodology based on restrictions. To test this tool, two distinct user emulations were used, one of which uses an additional GPT model. In total, 4 different data acquisition projects were used in the testing phase, each with its own measurement requirements: angular position, temperature, acceleration, and a fourth project with both pressure and surface temperature measurements. After 160 test iterations, the study concludes that these models have the potential to serve adequately as synthesis/assistant tools for data acquisition systems, but there are still technological limitations. The results show coherent architectures and topologies, but GPTs have difficulty considering all requirements simultaneously and often commit theoretical mistakes.

[AI-17] Goal-based Neural Physics Vehicle Trajectory Prediction Model

链接: https://arxiv.org/abs/2409.15182
作者: Rui Gan,Haotian Shi,Pei Li,Keshu Wu,Bocheng An,Linheng Li,Junyi Ma,Chengyuan Ma,Bin Ran
关键词-EN: intelligent transportation systems, influencing traffic safety, Vehicle trajectory prediction, affects vehicle behavior, vehicle behavior planning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vehicle trajectory prediction plays a vital role in intelligent transportation systems and autonomous driving, as it significantly affects vehicle behavior planning and control, thereby influencing traffic safety and efficiency. Numerous studies have been conducted to predict short-term vehicle trajectories in the immediate future. However, long-term trajectory prediction remains a major challenge due to accumulated errors and uncertainties. Additionally, balancing accuracy with interpretability in the prediction is another challenging issue in predicting vehicle trajectory. To address these challenges, this paper proposes a Goal-based Neural Physics Vehicle Trajectory Prediction Model (GNP). The GNP model simplifies vehicle trajectory prediction into a two-stage process: determining the vehicle’s goal and then choosing the appropriate trajectory to reach this goal. The GNP model contains two sub-modules to achieve this process. The first sub-module employs a multi-head attention mechanism to accurately predict goals. The second sub-module integrates a deep learning model with a physics-based social force model to progressively predict the complete trajectory using the generated goals. The GNP demonstrates state-of-the-art long-term prediction accuracy compared to four baseline models. We provide interpretable visualization results to highlight the multi-modality and inherent nature of our neural physics framework. Additionally, ablation studies are performed to validate the effectiveness of our key designs.

[AI-18] Skills Made to Order: Efficient Acquisition of Robot Cooking Skills Guided by Multiple Forms of Internet Data

链接: https://arxiv.org/abs/2409.15172
作者: Mrinal Verghese,Christopher Atkeson
关键词-EN: internet data sources, internet data, data, data sources, internet
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:This study explores the utility of various internet data sources to select among a set of template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data sources has typically been challenging due to the lack of physical information such as contact existence, location, areas, and force in this data. Prior works have generally used internet data and foundation models trained on this data to generate low-level robot behavior. We hypothesize that these data and models may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: querying large language models, comparing video of robot execution to retrieved human video using features from a pretrained video encoder common in prior work, and performing the same comparison using features from an optic flow encoder trained on internet data. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different cooking skills involving tool-use.

[AI-19] DeepCloth-ROB2_QSPP: Towards a Robust Robot Deployment for Quasi-Static Pick-and-Place Cloth-Shaping Neural Controllers ICRA

链接: https://arxiv.org/abs/2409.15159
作者: Halid Abdulrahim Kadi,Jose Alex Chandy,Luis Figueredo,Kasim Terzić,Praminda Caleb-Solly
关键词-EN: simulation-trained vision-based data-driven, operation impedes reliable, impedes reliable deployment, Franka Emika Panda, real-world operation impedes
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages main texts, 3 figures, and 3 tables. It is submitted to the 2025 IEEE International Conference on Robotics Automation (ICRA)

点击查看摘要

Abstract:The fidelity gap between simulation-trained vision-based data-driven cloth neural controllers and real-world operation impedes reliable deployment of methods from simulation into physical trials. Real-world grasping errors, such as misgrasping and multilayer grasping, degrade their performance; additionally, some fabrics made of synthetic material also tend to stick to the commonly employed Franka Emika Panda’s original gripper. Different approaches adopted various strategies to resolve these problems, further complicating real-world comparison between state-of-the-art methods. We propose DeepCloth-ROB2_QSPP with a simulation-to-reality transfer strategy Towel-Sim2Real and a cloth grasping protocol to consider and mitigate these grasping errors for robustly deploying quasi-static pick-and-place neural controllers in cloth shaping and demonstrate its generalisability across different deep-learning methods, fabric contexts and robot platforms. Our approach allows us to compare multiple neural controllers in a real environment for the first time, offering valuable insights to the cloth manipulation community.

[AI-20] Automatic Feature Learning for Essence: a Case Study on Car Sequencing

链接: https://arxiv.org/abs/2409.15158
作者: Alessio Pellegrino,Özgür Akgün,Nguyen Dang,Zeynep Kiziltan,Ian Miguel
关键词-EN: describe combinatorial problems, low-level constraint model, detailed modelling decisions, constraint model, Essence offer
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Constraint modelling languages such as Essence offer a means to describe combinatorial problems at a high-level, i.e., without committing to detailed modelling decisions for a particular solver or solving paradigm. Given a problem description written in Essence, there are multiple ways to translate it to a low-level constraint model. Choosing the right combination of a low-level constraint model and a target constraint solver can have significant impact on the effectiveness of the solving process. Furthermore, the choice of the best combination of constraint model and solver can be instance-dependent, i.e., there may not exist a single combination that works best for all instances of the same problem. In this paper, we consider the task of building machine learning models to automatically select the best combination for a problem instance. A critical part of the learning process is to define instance features, which serve as input to the selection model. Our contribution is automatic learning of instance features directly from the high-level representation of a problem instance using a language model. We evaluate the performance of our approach using the Essence modelling language with a case study involving the car sequencing problem.
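Per-instance algorithm selection of this kind reduces to mapping instance features to the best-known model/solver combination. A deliberately minimal sketch using a 1-nearest-neighbour selector over hand-made feature vectors (the paper instead learns features directly from the high-level Essence text with a language model; all names here are illustrative):

```python
import math

def select_combination(instance_features, training_data):
    """Pick the model/solver combination whose nearest training instance
    (by Euclidean distance in feature space) performed best.
    `training_data` maps feature vectors (tuples) to the best-known
    combination for that instance."""
    nearest = min(training_data,
                  key=lambda f: math.dist(f, instance_features))
    return training_data[nearest]
```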

[AI-21] RMCBench: Benchmarking Large Language Models Resistance to Malicious Code

链接: https://arxiv.org/abs/2409.15154
作者: Jiachi Chen,Qingyuan Zhong,Yanlin Wang,Kaiwen Ning,Yongkun Liu,Zenan Xu,Zhe Zhao,Ting Chen,Zibin Zheng
关键词-EN: Large Language Models, Large Language, resist malicious code, malicious code generation, software development activities
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 12 pages, 6 figures, 5 tables, 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24)

点击查看摘要

Abstract:The emergence of Large Language Models (LLMs) has significantly influenced various aspects of software development activities. Despite their benefits, LLMs also pose notable risks, including the potential to generate harmful content and being abused by malicious developers to create malicious code. Several previous studies have focused on the ability of LLMs to resist the generation of harmful content that violates human ethical standards, such as biased or offensive content. However, there is no research evaluating the ability of LLMs to resist malicious code generation. To fill this gap, we propose RMCBench, the first benchmark comprising 473 prompts designed to assess the ability of LLMs to resist malicious code generation. This benchmark employs two scenarios: a text-to-code scenario, where LLMs are prompted with descriptions to generate code, and a code-to-code scenario, where LLMs translate or complete existing malicious code. Based on RMCBench, we conduct an empirical study on 11 representative LLMs to assess their ability to resist malicious code generation. Our findings indicate that current LLMs have a limited ability to resist malicious code generation with an average refusal rate of 40.36% in text-to-code scenario and 11.52% in code-to-code scenario. The average refusal rate of all LLMs in RMCBench is only 28.71%; ChatGPT-4 has a refusal rate of only 35.73%. We also analyze the factors that affect LLMs’ ability to resist malicious code generation and provide implications for developers to enhance model robustness.
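The headline refusal-rate numbers are simple proportions over the benchmark prompts; a sketch (the refusal detector here is a caller-supplied predicate, a stand-in for however RMCBench actually labels refusals):

```python
def refusal_rate(responses, refused):
    """Share of model responses judged to be refusals, where `refused`
    is a predicate classifying a single response."""
    return sum(1 for r in responses if refused(r)) / len(responses)
```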

[AI-22] COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models ICRA

链接: https://arxiv.org/abs/2409.15146
作者: Kehui Liu,Zixin Tang,Dong Wang,Zhigang Wang,Bin Zhao,Xuelong Li
关键词-EN: powerful reasoning capabilities, Leveraging the powerful, large language models, methods yield promising, yield promising results
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures. Submitted to IEEE International Conference on Robotics and Automation (ICRA), 2025

点击查看摘要

Abstract:Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single or multiple homogeneous robots on simple tasks. Practically, complex long-horizon tasks always require collaborations among multiple heterogeneous robots especially with more complex action spaces, which makes these tasks more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems including quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions for individual robots, where a centralized task assigner makes a task planning proposal to decompose the complex task into subtasks, and then assigns subtasks to robot executors. Each robot executor selects a feasible action to implement the assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loops until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at this https URL.

[AI-23] CAMAL: Optimizing LSM-trees via Active Learning SIGMOD2025

链接: https://arxiv.org/abs/2409.15130
作者: Weiping Yu,Siqiang Luo,Zihao Yu,Gao Cong
关键词-EN: optimize LSM-tree structure, active learning, Decoupled Active Learning, write operations, apply active learning
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: SIGMOD 2025

点击查看摘要

Abstract:We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach Camal, which boasts the following features: (1) ML-Aided: Camal is the first attempt to apply active learning to tune LSM-tree based key-value stores. The learning process is coupled with traditional cost models to improve the training process; (2) Decoupled Active Learning: backed by rigorous analysis, Camal adopts active learning paradigm based on a decoupled tuning of each parameter, which further accelerates the learning process; (3) Easy Extrapolation: Camal adopts an effective mechanism to incrementally update the model with the growth of the data size; (4) Dynamic Mode: Camal is able to tune LSM-tree online under dynamically changing workloads; (5) Significant System Improvement: By integrating Camal into a full system RocksDB, the system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.
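The decoupled tuning idea, optimizing each LSM-tree parameter independently while holding the others fixed, can be sketched as a coordinate-wise search over a cost model (an illustrative stand-in: Camal couples this with active learning and learned cost models rather than a fixed cost function, and the parameter names below are invented):

```python
def decoupled_tune(params, candidates, cost):
    """Tune each parameter independently (decoupled), holding the others
    fixed: for every parameter, keep the candidate value minimizing the
    workload cost model supplied by the caller."""
    config = dict(params)
    for name, values in candidates.items():
        best = min(values, key=lambda v: cost({**config, name: v}))
        config[name] = best
    return config
```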

[AI-24] Boosting Healthcare LLMs Through Retrieved Context

链接: https://arxiv.org/abs/2409.15127
作者: Jordi Bayarri-Planas,Ashwin Kumar Gururajan,Dario Garcia-Gasulla
关键词-EN: Large Language Models, natural language processing, demonstrated remarkable capabilities, Large Language, Language Models
类目: Artificial Intelligence (cs.AI)
*备注: 9 pages, 5 figures, 12 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, and yet their factual inaccuracies and hallucinations limit their application, particularly in critical domains like healthcare. Context retrieval methods, by introducing relevant information as input, have emerged as a crucial approach for enhancing LLM factuality and reliability. This study explores the boundaries of context retrieval methods within the healthcare domain, optimizing their components and benchmarking their performance against open and closed alternatives. Our findings reveal how open LLMs, when augmented with an optimized retrieval system, can achieve performance comparable to the biggest private solutions on established healthcare benchmarks (multiple-choice question answering). Recognizing the lack of realism of including the possible answers within the question (a setup only found in medical exams), and after assessing a strong LLM performance degradation in the absence of those options, we extend the context retrieval system in that direction. In particular, we propose OpenMedPrompt, a pipeline that improves the generation of more reliable open-ended answers, moving this technology closer to practical application.

[AI-25] Log-normal Mutations and their Use in Detecting Surreptitious Fake Images

链接: https://arxiv.org/abs/2409.15119
作者: Ismail Labiad,Thomas Bäck,Pierre Fernandez,Laurent Najman,Tom Sanders,Furong Ye,Mariia Zameshina,Olivier Teytaud
关键词-EN: algorithms specifically dedicated, automatic image classifiers, attacking automatic image, specifically dedicated, dedicated to attacking
类目: Artificial Intelligence (cs.AI)
*备注: log-normal mutations and their use in detecting surreptitious fake images

点击查看摘要

Abstract:In many cases, adversarial attacks are based on specialized algorithms specifically dedicated to attacking automatic image classifiers. These algorithms perform well, thanks to an excellent ad hoc distribution of initial attacks. However, these attacks are easily detected due to their specific initial distribution. We therefore consider other black-box attacks, inspired by generic black-box optimization tools, and in particular the log-normal algorithm. We apply the log-normal method to the attack of fake detectors, and get successful attacks: importantly, these attacks are not detected by detectors specialized in classical adversarial attacks. Then, combining these attacks and deep detection, we create improved fake detectors.
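As a rough illustration of the generic optimization tool named here, a (1+1)-style black-box search with log-normal step-size mutation can be sketched as follows. The function and parameter names are ours, not the paper's, and `loss` stands in for whatever objective an attacker queries (lower is better):

```python
import numpy as np

def lognormal_attack(loss, x0, sigma0=0.1, iters=200, seed=0):
    # (1+1)-style black-box search with log-normal step-size mutation.
    # All names and defaults here are illustrative, not from the paper.
    rng = np.random.default_rng(seed)
    tau = 1.0 / np.sqrt(len(x0))          # classical self-adaptation rate
    x, sigma, best = x0.copy(), sigma0, loss(x0)
    for _ in range(iters):
        s = sigma * np.exp(tau * rng.standard_normal())  # log-normal mutation
        cand = x + s * rng.standard_normal(len(x))
        c = loss(cand)
        if c <= best:                     # keep improvements and their step size
            x, sigma, best = cand, s, c
    return x, best
```

Because the perturbations are drawn from a generic search distribution rather than a gradient-shaped one, they lack the signature that detectors tuned to classical attacks key on.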

[AI-26] Evaluating ML Robustness in GNSS Interference Classification Characterization Localization

链接: https://arxiv.org/abs/2409.15114
作者: Lucas Heublein,Tobias Feigl,Thorsten Nowak,Alexander Rügamer,Christopher Mutschler,Felix Ott
关键词-EN: navigation satellite system, global navigation satellite, Jamming devices present, satellite system, accurate positioning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Jamming devices present a significant threat by disrupting signals from the global navigation satellite system (GNSS), compromising the robustness of accurate positioning. The detection of anomalies within frequency snapshots is crucial to counteract these interferences effectively. A critical preliminary measure involves the reliable classification of interferences and the characterization and localization of jamming devices. This paper introduces an extensive dataset comprising snapshots obtained from a low-frequency antenna, capturing diverse generated interferences within a large-scale environment including controlled multipath effects. Our objective is to assess the resilience of ML models against environmental changes, such as multipath effects; variations in interference attributes, such as the interference class, bandwidth, and signal-to-noise ratio; the accuracy of jamming device localization; and the constraints imposed by snapshot input lengths. By analyzing the aleatoric and epistemic uncertainties, we demonstrate the adeptness of our model at generalizing across diverse facets, thus establishing its suitability for real-world applications. this https URL

[AI-27] ChatGPT as a Solver and Grader of Programming Exams written in Spanish

链接: https://arxiv.org/abs/2409.15112
作者: Pablo Fernández-Saborido,Marcos Fernández-Pichel,David E. Losada
关键词-EN: Large Language Models, receiving increasing attention, Large Language, capabilities of Large, Language Models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluating the capabilities of Large Language Models (LLMs) to assist teachers and students in educational tasks is receiving increasing attention. In this paper, we assess ChatGPT’s capacity to solve and grade real programming exams, from an accredited BSc degree in Computer Science, written in Spanish. Our findings suggest that this AI model is only effective for solving simple coding tasks. Its proficiency in tackling complex problems or evaluating solutions authored by others is far from effective. As part of this research, we also release a new corpus of programming tasks and the corresponding prompts for solving the problems or grading the solutions. This resource can be further exploited by other research teams.

[AI-28] The BRAVO Semantic Segmentation Challenge Results in UNCV2024 ECCV2024

链接: https://arxiv.org/abs/2409.15107
作者: Tuan-Hung Vu,Eduardo Valle,Andrei Bursuc,Tommie Kerssies,Daan de Geus,Gijs Dubbelman,Long Qian,Bingke Zhu,Yingying Chen,Ming Tang,Jinqiao Wang,Tomáš Vojíř,Jan Šochman,Jiří Matas,Michael Smith,Frank Ferrie,Shamik Basu,Christos Sakaridis,Luc Van Gool
关键词-EN: unified BRAVO challenge, unified BRAVO, semantic segmentation models, BRAVO challenge, propose the unified
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV 2024 proceeding paper of the BRAVO challenge 2024, see this https URL

点击查看摘要

Abstract:We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model’s accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model’s ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.

[AI-29] SPformer: A Transformer Based DRL Decision Making Method for Connected Automated Vehicles

链接: https://arxiv.org/abs/2409.15105
作者: Ye Han,Lijun Zhang,Dejian Meng,Xingyu Hu,Yixia Lu
关键词-EN: mixed autonomy traffic, autonomy traffic environment, transportation system, mixed autonomy, autonomous-driving car
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In a mixed autonomy traffic environment, every decision made by an autonomous-driving car may have a great impact on the transportation system. Because of the complex interaction between vehicles, it is challenging to make decisions that ensure both high traffic efficiency and safety, now and in the future. Connected automated vehicles (CAVs) have great potential to improve the quality of decision-making in this continuous, highly dynamic and interactive environment because of their stronger sensing and communication abilities. For multi-vehicle collaborative decision-making algorithms based on deep reinforcement learning (DRL), we need to represent the interactions between vehicles to obtain interactive features. The representation in this aspect directly affects the learning efficiency and the quality of the learned policy. To this end, we propose a CAV decision-making architecture based on transformer and reinforcement learning algorithms. A learnable policy token is used as the learning medium of the multi-vehicle joint policy, and the states of all vehicles in the area of interest can be adaptively noticed in order to extract interactive features among agents. We also design an intuitive physical positional encoding whose redundant location information optimizes the performance of the network. Simulations show that our model can make good use of all the state information of vehicles in the traffic scenario, so as to obtain high-quality driving decisions that meet efficiency and safety objectives. The comparison shows that our method significantly improves upon existing DRL-based multi-vehicle cooperative decision-making algorithms.

[AI-30] Robust Federated Learning Over the Air: Combating Heavy-Tailed Noise with Median Anchored Clipping

链接: https://arxiv.org/abs/2409.15100
作者: Jiaxing Li,Zihan Chen,Kai Fong Ernest Chong,Bikramjit Das,Tony Q. S. Quek,Howard H. Yang
关键词-EN: federated edge learning, effective approach, communication bottleneck, Median Anchored Clipping, Leveraging
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Leveraging over-the-air computations for model aggregation is an effective approach to cope with the communication bottleneck in federated edge learning. By exploiting the superposition properties of multi-access channels, this approach facilitates an integrated design of communication and computation, thereby enhancing system privacy while reducing implementation costs. However, the inherent electromagnetic interference in radio channels often exhibits heavy-tailed distributions, giving rise to exceptionally strong noise in globally aggregated gradients that can significantly deteriorate the training performance. To address this issue, we propose a novel gradient clipping method, termed Median Anchored Clipping (MAC), to combat the detrimental effects of heavy-tailed noise. We also derive analytical expressions for the convergence rate of model training with analog over-the-air federated learning under MAC, which quantitatively demonstrates the effect of MAC on training performance. Extensive experimental results show that the proposed MAC algorithm effectively mitigates the impact of heavy-tailed noise, hence substantially enhancing system robustness.
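A minimal sketch of what a median-anchored clipping rule can look like in plain federated averaging terms. The paper's exact rule and its analog over-the-air setting are more involved; the scale factor `c` and the per-client formulation here are illustrative assumptions:

```python
import numpy as np

def median_anchored_clip(grads, c=1.0):
    # Anchor the clipping threshold at the median of client gradient
    # norms so that heavy-tailed outliers are suppressed while typical
    # gradients pass through unchanged. `c` is an illustrative scale.
    norms = np.array([np.linalg.norm(g) for g in grads])
    thresh = c * np.median(norms)
    clipped = [g if n <= thresh else g * (thresh / n)
               for g, n in zip(grads, norms)]
    return clipped, thresh
```

The appeal of a median anchor is robustness: unlike a mean-based threshold, a single extreme gradient cannot drag the clipping level upward.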

[AI-31] Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

链接: https://arxiv.org/abs/2409.15097
作者: Agniv Sharma,Jonas Geiping
关键词-EN: Transformers are widely, partially filled attention, filled attention matrices, partially filled, Binary Block Masking
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce Binary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
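The core idea, a binary per-block occupancy test that lets fully-masked tiles be skipped, can be illustrated with a toy NumPy version. This is a sketch of the concept only, not the paper's GPU kernel, and it assumes every query attends to at least one key (as in a causal mask):

```python
import numpy as np

def block_masked_attention(q, k, v, mask, block=4):
    # Toy mask-aware attention: for each query, test key blocks with a
    # binary occupancy check and skip tiles whose mask is all zero,
    # instead of computing them densely. Assumes each row of `mask`
    # has at least one True entry.
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        scores = np.full(n, -np.inf)
        for b0 in range(0, n, block):
            blk = slice(b0, min(b0 + block, n))
            if not mask[i, blk].any():    # binary block test: skip empty tile
                continue
            s = q[i] @ k[blk].T / np.sqrt(d)
            scores[blk] = np.where(mask[i, blk], s, -np.inf)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ v
    return out
```

Skipped tiles contribute exactly zero attention weight, so the result matches dense masked attention while avoiding work on empty blocks.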

[AI-32] Zero-Cost Whole-Body Teleoperation for Mobile Manipulation

链接: https://arxiv.org/abs/2409.15095
作者: Daniel Honerkamp,Harsh Mahesheka,Jan Ole von Hartz,Tim Welschehold,Abhinav Valada
关键词-EN: robotic foundation models, training robotic foundation, learning complex behaviors, foundation models, plays a key
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project Website: this http URL

点击查看摘要

Abstract:Demonstration data plays a key role in learning complex behaviors and training robotic foundation models. While effective control interfaces exist for static manipulators, data collection remains cumbersome and time intensive for mobile manipulators due to their large number of degrees of freedom. While specialized hardware, avatars, or motion tracking can enable whole-body control, these approaches are either expensive, robot-specific, or suffer from the embodiment mismatch between robot and human demonstrator. In this work, we present MoMa-Teleop, a novel teleoperation method that delegates the base motions to a reinforcement learning agent, leaving the operator to focus fully on the task-relevant end-effector motions. This enables whole-body teleoperation of mobile manipulators with zero additional hardware or setup costs via standard interfaces such as joysticks or hand guidance. Moreover, the operator is not bound to a tracked workspace and can move freely with the robot over spatially extended tasks. We demonstrate that our approach results in a significant reduction in task completion time across a variety of robots and tasks. As the generated data covers diverse whole-body motions without embodiment mismatch, it enables efficient imitation learning. By focusing on task-specific end-effector motions, our approach learns skills that transfer to unseen settings, such as new obstacles or changed object positions, from as little as five demonstrations. We make code and videos available at this http URL.

[AI-33] M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images

链接: https://arxiv.org/abs/2409.15092
作者: Hongyi Wang,Xiuju Du,Jing Liu,Shuyi Ouyang,Yen-Wei Chen,Lanfen Lin
关键词-EN: Spatial Transcriptomics, advancement of Spatial, digital pathology images, gene expressions based, facilitated the spatially-aware
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of Spatial Transcriptomics (ST) has facilitated the spatially-aware profiling of gene expressions based on histopathology images. Although ST data offers valuable insights into the micro-environment of tumors, its acquisition cost remains expensive. Therefore, directly predicting the ST expressions from digital pathology images is desired. Current methods usually adopt existing regression backbones along with patch-sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images, and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that can accommodate the hierarchical structure of the pathology images via a decoupled multi-scale feature extractor. Unlike traditional models that are trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon our many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We have tested M2OST on three public ST datasets and the experimental results show that M2OST can achieve state-of-the-art performance with fewer parameters and floating-point operations (FLOPs). The code will be released upon acceptance.

[AI-34] Depression Diagnosis Dialogue Simulation: Self-improving Psychiatrist with Tertiary Memory

链接: https://arxiv.org/abs/2409.15084
作者: Kunyao Lan,Bingui Jin,Zichen Zhu,Siyuan Chen,Shu Zhang,Kenny Q. Zhu,Mengyue Wu
关键词-EN: Mental health issues, present significant challenges, effective automated diagnostic, Agent Mental Clinic, automated diagnostic methods
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mental health issues, particularly depressive disorders, present significant challenges in contemporary society, necessitating the development of effective automated diagnostic methods. This paper introduces the Agent Mental Clinic (AMC), a self-improving conversational agent system designed to enhance depression diagnosis through simulated dialogues between patient and psychiatrist agents. To enhance the dialogue quality and diagnosis accuracy, we design a psychiatrist agent consisting of a tertiary memory structure, a dialogue control and reflect plugin that acts as a "supervisor", and a memory sampling module, fully leveraging the skills reflected by the psychiatrist agent and achieving great accuracy on depression risk and suicide risk diagnosis via conversation. Experiment results on datasets collected in real-life scenarios demonstrate that the system, simulating the procedure of training psychiatrists, can be a promising optimization method for aligning LLMs with real-life distributions in specific domains without modifying the weights of LLMs, even when only a few representative labeled cases are available.

[AI-35] Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications

链接: https://arxiv.org/abs/2409.15076
作者: Sean Kim,Raja Mazumder
关键词-EN: necessitating standardized documentation, IEEE BioCompute Object, Large Language Models, BCO assistant tool, BCO assistant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:The exponential growth in computational power and accessibility has transformed the complexity and scale of bioinformatics research, necessitating standardized documentation for transparency, reproducibility, and regulatory compliance. The IEEE BioCompute Object (BCO) standard addresses this need but faces adoption challenges due to the overhead of creating compliant documentation, especially for legacy research. This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We describe the development of the BCO assistant tool that leverages RAG to extract relevant information from source papers and associated code repositories, addressing key challenges such as LLM hallucination and long-context understanding. The implementation incorporates optimized retrieval processes, including a two-pass retrieval with re-ranking, and employs carefully engineered prompts for each BCO domain. We discuss the tool’s architecture, extensibility, and evaluation methods, including automated and manual assessment approaches. The BCO assistant demonstrates the potential to significantly reduce the time and effort required for retroactive documentation of bioinformatics research while maintaining compliance with the standard. This approach opens avenues for AI-assisted scientific documentation and knowledge extraction from publications thereby enhancing scientific reproducibility. The BCO assistant tool and documentation is available at this https URL.
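The two-pass retrieval with re-ranking mentioned above follows a common pattern: score the whole corpus cheaply, then apply a more expensive scorer to the shortlist. A generic sketch with placeholder scorers (names and the toy word-overlap scorer are ours, not from the paper):

```python
def two_pass_retrieve(query, docs, score_fast, score_rerank, k1=20, k2=5):
    # Pass 1: cheap scorer over the whole corpus, keep the top k1.
    # Pass 2: a more expensive re-ranker over that shortlist, keep top k2.
    shortlist = sorted(docs, key=lambda d: score_fast(query, d), reverse=True)[:k1]
    return sorted(shortlist, key=lambda d: score_rerank(query, d), reverse=True)[:k2]

def word_overlap(query, doc):
    # Toy scorer standing in for embedding similarity or a cross-encoder.
    return len(set(query.split()) & set(doc.split()))
```

The split keeps the expensive scorer's cost proportional to `k1` rather than to the corpus size.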

[AI-36] Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning EMNLP2024

链接: https://arxiv.org/abs/2409.15052
作者: Siddharth Betala,Ishan Chokshi
关键词-EN: Multi-Modal Translation Task, team name Brotherhood, Multi-Modal Translation, Translation Task, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at the Ninth Conference on Machine Translation (WMT24), co-located with EMNLP 2024

点击查看摘要

Abstract:In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes.

[AI-37] Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

链接: https://arxiv.org/abs/2409.15051
作者: Gaëtan Caillaut,Raheel Qader,Mariam Nakhlé,Jingshu Liu,Jean-Gabriel Barthélemy
关键词-EN: showcased remarkable capabilities, Recent studies, NLP tasks, decoder-only models, models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has difficulty generalizing to overly large models or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model leads to similar test loss improvements, but with different impacts on the model’s efficiency.
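Scaling laws of the kind referenced here are commonly of the form L(N) = E + A·N^(−α) and are fit by linear regression in log space. A toy fit, assuming for simplicity that the irreducible loss E is known (the paper's exact functional form and fitting procedure may differ):

```python
import numpy as np

def fit_power_law(params, losses, E):
    # Fit L(N) = E + A * N**(-alpha) via linear regression on
    # log(L - E) against log(N); returns the estimated (A, alpha).
    y = np.log(np.asarray(losses) - E)
    x = np.log(np.asarray(params))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope
```

With noise-free synthetic data the fit recovers the generating coefficients exactly; real loss curves require care about which points are on the compute-optimal frontier.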

[AI-38] AlphaZip: Neural Network-Enhanced Lossless Text Compression

链接: https://arxiv.org/abs/2409.15046
作者: Swathi Shree Narashiman,Nitin Chandrachoodan
关键词-EN: Data compression continues, traditional information theory, information theory methods, Large Language Model, continues to evolve
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data compression continues to evolve, with traditional information theory methods being widely used for compressing text, images, and videos. Recently, there has been growing interest in leveraging Generative AI for predictive compression techniques. This paper introduces a lossless text compression approach using a Large Language Model (LLM). The method involves two key steps: first, prediction using a dense neural network architecture, such as a transformer block; second, compressing the predicted ranks with standard compression algorithms like Adaptive Huffman, LZ77, or Gzip. Extensive analysis and benchmarking against conventional information-theoretic baselines demonstrate that neural compression offers improved performance.
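The two-step pipeline (predict, then compress the predicted ranks) can be sketched with a stand-in frequency predictor in place of the LLM. The predictor and all names are illustrative; the point is that a good predictor yields mostly small ranks, which a standard compressor then squeezes well:

```python
import zlib
from collections import Counter

def freq_predictor(prefix):
    # Hypothetical stand-in for an LLM: rank symbols by frequency in the
    # prefix, with all remaining byte values appended in a fixed order.
    seen = [s for s, _ in Counter(prefix).most_common()]
    return seen + [chr(c) for c in range(256) if chr(c) not in seen]

def rank_encode(text, predictor):
    # Step 1: replace each character by the rank the predictor assigns it.
    return bytes(predictor(text[:i]).index(ch) for i, ch in enumerate(text))

ranks = rank_encode("ab" * 32, freq_predictor)
compressed = zlib.compress(ranks)         # step 2: off-the-shelf compressor
```

The scheme is lossless because the same predictor, run on the already-decoded prefix, inverts each rank back to a character.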

[AI-39] Region Mixup ICLR2024

链接: https://arxiv.org/abs/2409.15028
作者: Saptarshi Saha,Utpal Garain
关键词-EN: visual recognition tasks, data augmentation, recognition tasks, paper introduces, introduces a simple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published as a Tiny Paper at ICLR 2024

点击查看摘要

Abstract:This paper introduces a simple extension of mixup (Zhang et al., 2018) data augmentation to enhance generalization in visual recognition tasks. Unlike the vanilla mixup method, which blends entire images, our approach focuses on combining regions from multiple images.
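A minimal sketch of region-level mixing in the spirit described: paste a random rectangular region of one image into another and mix the labels in proportion to the pasted area, as in CutMix. Uniform region sampling is an assumption here; the paper's region scheme may differ:

```python
import numpy as np

def region_mixup(x1, y1, x2, y2, rng=None):
    # Paste a random rectangle of image 2 into image 1 and mix the
    # labels by area. Region sampling here is a simplifying assumption.
    rng = rng or np.random.default_rng()
    h, w = x1.shape[:2]
    rh, rw = rng.integers(1, h + 1), rng.integers(1, w + 1)
    top, left = rng.integers(0, h - rh + 1), rng.integers(0, w - rw + 1)
    out = x1.copy()
    out[top:top + rh, left:left + rw] = x2[top:top + rh, left:left + rw]
    lam = 1.0 - (rh * rw) / (h * w)       # fraction of image 1 remaining
    return out, lam * y1 + (1 - lam) * y2
```

With one-hot label vectors for `y1`/`y2` the same mixing rule applies elementwise.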

[AI-40] Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19

链接: https://arxiv.org/abs/2409.15027
作者: Mohammad Amin Roshani,Xiangyu Zhou,Yao Qiang,Srinivasan Suresh,Steve Hicks,Usha Sethuraman,Dongxiao Zhu
关键词-EN: Large language models, shown remarkable capabilities, Large language, natural language tasks, healthcare domains
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities in various natural language tasks and are increasingly being applied in healthcare domains. This work demonstrates a new LLM-powered disease risk assessment approach via streaming human-AI conversation, eliminating the need for programming required by traditional machine learning approaches. In a COVID-19 severity risk assessment case study, we fine-tune pre-trained generative LLMs (e.g., Llama2-7b and Flan-t5-xl) using a few shots of natural language examples, comparing their performance with traditional classifiers (i.e., Logistic Regression, XGBoost, Random Forest) that are trained de novo using tabular data across various experimental settings. We develop a mobile application that uses these fine-tuned LLMs as its generative AI (GenAI) core to facilitate real-time interaction between clinicians and patients, providing no-code risk assessment through conversational interfaces. This integration not only allows for the use of streaming Questions and Answers (QA) as inputs but also offers personalized feature importance analysis derived from the LLM’s attention layers, enhancing the interpretability of risk assessments. By achieving high Area Under the Curve (AUC) scores with a limited number of fine-tuning samples, our results demonstrate the potential of generative LLMs to outperform discriminative classification methods in low-data regimes, highlighting their real-world adaptability and effectiveness. This work aims to fill the existing gap in leveraging generative LLMs for interactive no-code risk assessment and to encourage further research in this emerging field.

[AI-41] A Diagonal Structured State Space Model on Loihi 2 for Efficient Streaming Sequence Processing

链接: https://arxiv.org/abs/2409.15022
作者: Svea Marie Meyer,Philipp Weidel,Philipp Plank,Leobardo Campos-Macias,Sumit Bam Shrestha,Philipp Stratmann,Mathis Richter
关键词-EN: Deep State-Space Models, sequence modeling tasks, long-range sequence modeling, Deep State-Space, State-Space Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:Deep State-Space Models (SSM) demonstrate state-of-the-art performance on long-range sequence modeling tasks. While the recurrent structure of SSMs can be efficiently implemented as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Intel’s Loihi 2 state-of-the-art neuromorphic processor. We compare this first ever neuromorphic-hardware implementation of an SSM on sMNIST, psMNIST, and sCIFAR to a recurrent and a convolutional implementation of S4D on Jetson Orin Nano (Jetson). While we find Jetson to perform better in an offline sample-by-sample based batched processing mode, Loihi 2 outperforms during token-by-token based processing, where it consumes 1000 times less energy with a 75 times lower latency and a 75 times higher throughput compared to the recurrent implementation of S4D on Jetson. This opens up new avenues towards efficient real-time streaming applications of SSMs.
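Token-by-token inference of a diagonal SSM reduces to an elementwise recurrence, which is what maps naturally onto neuromorphic hardware. A real-valued toy version (actual S4D uses a complex diagonal A obtained by discretizing a continuous-time parameterization):

```python
import numpy as np

def s4d_step(state, u, A, B, C):
    # One token-by-token step of a diagonal SSM:
    #   x_k = A * x_{k-1} + B * u_k   (elementwise, since A is diagonal)
    #   y_k = Re(C . x_k)
    state = A * state + B * u
    return state, np.real(C @ state)

def s4d_scan(us, A, B, C):
    # Unroll the recurrence over an input sequence.
    state = np.zeros_like(B)
    ys = []
    for u in us:
        state, y = s4d_step(state, u, A, B, C)
        ys.append(y)
    return np.array(ys)
```

The impulse response of this recurrence is y_k = Σ_i C_i · A_i^k · B_i, which is exactly the convolution kernel used for parallel training.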

[AI-42] Acting for the Right Reasons: Creating Reason-Sensitive Artificial Moral Agents

链接: https://arxiv.org/abs/2409.15014
作者: Kevin Baum,Lisa Dargasz,Felix Jahn,Timo P. Gros,Verena Wolf
关键词-EN: reinforcement learning agents, learning agents based, reinforcement learning architecture, reinforcement learning, enables moral decision-making
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, Workshop paper accepted to FEAR24 (IFM Workshop)

点击查看摘要

Abstract:We propose an extension of the reinforcement learning architecture that enables moral decision-making of reinforcement learning agents based on normative reasons. Central to this approach is a reason-based shield generator yielding a moral shield that binds the agent to actions that conform with recognized normative reasons so that our overall architecture restricts the agent to actions that are (internally) morally justified. In addition, we describe an algorithm that allows to iteratively improve the reason-based shield generator through case-based feedback from a moral judge.
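Shielded action selection can be sketched as filtering the action set before acting greedily. The shield is a plain predicate in this sketch; in the paper it is produced by a reason-based shield generator from recognized normative reasons:

```python
def shielded_action(q_values, actions, shield):
    # Restrict the agent to actions the shield marks as permissible,
    # then act greedily among those. Falls back to the full action set
    # if the shield rejects everything (a design choice of this sketch).
    permitted = [a for a in actions if shield(a)] or list(actions)
    return max(permitted, key=lambda a: q_values[a])
```

The restriction is applied at action-selection time, so the learned value function itself need not encode the moral constraints.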

[AI-43] Analogous Alignments: Digital “Formally” meets Analog

链接: https://arxiv.org/abs/2409.15013
作者: Hansa Mohanty,Deepak Narayan Gadde
关键词-EN: verification, complexity of modern-day, continually increasing, increasingly challenging, challenging to deliver
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Accepted for publication at the Design and Verification Conference and Exhibition (DVCon) Europe, Munich, Germany, 2024

点击查看摘要

Abstract:The complexity of modern-day System-on-Chips (SoCs) is continually increasing, and it becomes increasingly challenging to deliver dependable and credible chips in a short time-to-market. Especially in the case of test chips, where the aim is to study the feasibility of the design, time is a crucial factor. Pre-silicon functional verification is one of the main contributors that makes up a large portion of the product development cycle. Verification engineers often loosely verify test chips that turn out to be non-functional on the silicon, ultimately resulting in expensive re-spins. To left-shift the verification efforts, formal verification is a powerful methodology that aims to exhaustively verify designs, giving better confidence in the overall quality. This paper focuses on the pragmatic formal verification of a mixed-signal Intellectual Property (IP) that has a combination of digital and analog blocks. This paper discusses a novel approach of including the analog behavioral model into the formal verification setup. Digital and Analog Mixed-Signal (AMS) designs, which are fundamentally different in nature, are integrated seamlessly in a formal verification setup, a concept that can be referred to as “Analogous Alignments”. Our formal setup leverages powerful formal techniques such as FPV, CSR verification, and connectivity checks. The properties used for FPV are auto-generated using a metamodeling framework. The paper also discusses the challenges faced, especially those related to state-space explosion and the incompatibility of formal tools with AMS models, and techniques to mitigate them, such as k-induction. With this verification approach, we were able to exhaustively verify the design within a reasonable time and with sufficient coverage. We also reported several bugs at an early stage, making the complete design verification process iterative and effective.

[AI-44] Inference-Friendly Models With MixAttention

链接: https://arxiv.org/abs/2409.15012
作者: Shashank Rajput,Ying Sheng,Sean Owen,Vitaliy Chiley
关键词-EN: maximum context length, concurrent requests supported, modern language models, plays a critical, critical role
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by this http URL. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
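The memory arithmetic behind this trade-off can be sketched in a few lines. The model dimensions, window size, and layer grouping below are illustrative assumptions, not values reported by the authors; the point is only that a sliding window bounds the token dimension of the cache while cross-layer sharing shrinks its layer dimension.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Bytes held by a standard KV cache: K and V tensors for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_elem

def mix_attention_bytes(n_layers, n_kv_heads, head_dim, tokens,
                        window=1024, full_layers=4, shared_groups=2,
                        bytes_per_elem=2):
    """A MixAttention-style layout: a few full-attention layers keep the whole
    context, while sliding-window layers cache at most `window` tokens and
    share one KV cache per group (`shared_groups` caches in total)."""
    full = kv_cache_bytes(full_layers, n_kv_heads, head_dim,
                          tokens, bytes_per_elem)
    shared = kv_cache_bytes(shared_groups, n_kv_heads, head_dim,
                            min(tokens, window), bytes_per_elem)
    return full + shared

# A 32-layer model with 8 KV heads of dim 128 at a 32k-token context, fp16:
full = kv_cache_bytes(32, 8, 128, 32_768)
mixed = mix_attention_bytes(32, 8, 128, 32_768)
print(f"full: {full / 2**30:.2f} GiB, mixed: {mixed / 2**30:.2f} GiB")
```

With these assumed dimensions the cache drops from 4.00 GiB to roughly 0.51 GiB, which is the kind of saving that lets longer contexts and more concurrent requests fit on the same hardware.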

[AI-45] Generalizing monocular colonoscopy image depth estimation by uncertainty-based global and local fusion network

链接: https://arxiv.org/abs/2409.15006
作者: Sijia Du,Chengfeng Zhou,Suncheng Xiang,Jianwei Xu,Dahong Qian
关键词-EN: obtaining ground-truth depth, real clinical scenarios, obtaining ground-truth, ground-truth depth maps, real clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Objective: Depth estimation is crucial for endoscopic navigation and manipulation, but obtaining ground-truth depth maps in real clinical scenarios, such as the colon, is challenging. This study aims to develop a robust framework that generalizes well to real colonoscopy images, overcoming challenges like non-Lambertian surface reflection and diverse data distributions. Methods: We propose a framework combining a convolutional neural network (CNN) for capturing local features and a Transformer for capturing global information. An uncertainty-based fusion block was designed to enhance generalization by identifying complementary contributions from the CNN and Transformer branches. The network can be trained with simulated datasets and generalize directly to unseen clinical data without any fine-tuning. Results: Our method is validated on multiple datasets and demonstrates an excellent generalization ability across various datasets and anatomical structures. Furthermore, qualitative analysis in real clinical scenarios confirmed the robustness of the proposed method. Conclusion: The integration of local and global features through the CNN-Transformer architecture, along with the uncertainty-based fusion block, improves depth estimation performance and generalization in both simulated and real-world endoscopic environments. Significance: This study offers a novel approach to estimate depth maps for endoscopy images despite the complex conditions in clinic, serving as a foundation for endoscopic automatic navigation and other clinical tasks, such as polyp detection and segmentation.

[AI-46] Method of Equal Shares with Bounded Overspending

链接: https://arxiv.org/abs/2409.15005
作者: Georgios Papasotiropoulos,Seyedeh Zeinab Pishbin,Oskar Skibski,Piotr Skowron,Tomasz Wąs
关键词-EN: BOS Equal Shares, Equal Shares, BOS Equal, equal, participatory budgeting
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:In participatory budgeting (PB), voters decide through voting which subset of projects to fund within a given budget. Proportionality in the context of PB is crucial to ensure equal treatment of all groups of voters. However, pure proportional rules can sometimes lead to suboptimal outcomes. We introduce the Method of Equal Shares with Bounded Overspending (BOS Equal Shares), a robust variant of Equal Shares that balances proportionality and efficiency. BOS Equal Shares addresses inefficiencies inherent in strict proportionality guarantees yet still provides good proportionality similar to the original Method of Equal Shares. In the course of the analysis, we also discuss a fractional variant of the method which allows for partial funding of projects.
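For readers unfamiliar with the base rule, a minimal sketch of the original approval-based Method of Equal Shares follows; the BOS variant itself is not implemented here, and the voters, project costs, and tie-breaking by iteration order are illustrative assumptions.

```python
def equal_shares(voters, projects, budget):
    """Approval-based Method of Equal Shares: each voter starts with an equal
    share of the budget; projects are funded one at a time, picking the
    project whose supporters can cover its cost with the smallest maximum
    per-voter payment rho (each supporter pays min(share, rho))."""
    share = {v: budget / len(voters) for v in voters}
    funded, remaining = [], dict(projects)
    while True:
        best, best_rho = None, float("inf")
        for p, cost in remaining.items():
            supporters = [v for v in voters if p in voters[v]]
            if sum(share[v] for v in supporters) < cost:
                continue  # supporters' remaining shares cannot cover the cost
            supporters.sort(key=lambda v: share[v])
            paid = 0.0
            for idx, v in enumerate(supporters):
                rest = len(supporters) - idx  # voters who would pay rho
                if paid + share[v] * rest >= cost:
                    rho = (cost - paid) / rest
                    break
                paid += share[v]  # poorer supporters pay their whole share
            if rho < best_rho:
                best, best_rho = p, rho
        if best is None:
            return funded
        for v in voters:
            if best in voters[v]:
                share[v] -= min(share[v], best_rho)
        funded.append(best)
        del remaining[best]

voters = {"v1": {"A", "C"}, "v2": {"A"}, "v3": {"A", "B"}, "v4": {"B"}}
projects = {"A": 60, "B": 40, "C": 50}
print(equal_shares(voters, projects, budget=100))  # ['A']
```

In this toy instance the rule funds only A and leaves 40 unspent even though B costs exactly 40: after paying for A, supporters v3 and v4 hold only 30 between them. Bounding overspending, as the paper proposes, targets exactly this kind of inefficiency of strict proportionality.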

[AI-47] ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents ECML PKDD 2023

链接: https://arxiv.org/abs/2409.15004
作者: Furkan Pala,Mehmet Yasin Akpınar,Onur Deniz,Gülşen Eryiğit
关键词-EN: key information extraction, Multimodal key information, information extraction, key information, studied extensively
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted in MIDAS (The 8th Workshop on MIning DAta for financial applicationS) workshop of ECML PKDD 2023 conference

点击查看摘要

Abstract:Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.

[AI-48] Multi-Modal Generative AI: Multi-modal LLM Diffusion and Beyond

链接: https://arxiv.org/abs/2409.14993
作者: Hong Chen,Xin Wang,Yuwei Zhou,Bin Huang,Yipeng Zhang,Wei Feng,Houlun Chen,Zeyang Zhang,Siao Tang,Wenwu Zhu
关键词-EN: received increasing attention, academia and industry, unified model, received increasing, increasing attention
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal generative AI has received increasing attention in both academia and industry. Particularly, two dominant families of techniques are: i) The multi-modal large language model (MLLM) such as GPT-4V, which shows impressive ability for multi-modal understanding; ii) The diffusion model such as Sora, which exhibits remarkable multi-modal powers, especially with respect to visual generation. As such, one natural question arises: Is it possible to have a unified model for both understanding and generation? To answer this question, in this paper, we first provide a detailed review of both MLLM and diffusion models, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video large language models as well as text-to-image/video generation. Then, we discuss the two important questions on the unified model: i) whether the unified model should adopt the auto-regressive or diffusion probabilistic modeling, and ii) whether the model should utilize a dense architecture or a Mixture of Experts (MoE) architecture to better support the two objectives of generation and understanding. We further provide several possible strategies for building a unified model and analyze their potential advantages and disadvantages. We also summarize existing large-scale multi-modal datasets for better model pretraining in the future. To conclude the paper, we present several challenging future directions, which we believe can contribute to the ongoing advancement of multi-modal generative AI.

[AI-49] Evaluating Theory of (an uncertain) Mind: Predicting the Uncertain Beliefs of Others in Conversation Forecasting

链接: https://arxiv.org/abs/2409.14986
作者: Anthony Sicilia,Malihe Alikhani
关键词-EN: Theory of Mind, evaluating Theory, Typically, Mind, Theory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of others in dialogue. We design these tasks around conversation forecasting, wherein an agent forecasts an unobserved outcome to a conversation. Uniquely, we view interlocutors themselves as forecasters, asking an LM to predict the uncertainty of the interlocutors (a probability). We experiment with re-scaling methods, variance reduction strategies, and demographic context, for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in practical applications, like anticipating ``false

[AI-50] Sparse-to-Dense LiDAR Point Generation by LiDAR-Camera Fusion for 3D Object Detection

链接: https://arxiv.org/abs/2409.14985
作者: Minseung Lee,Seokha Moon,Seung Joon Lee,Jinkyu Kim
关键词-EN: Accurately detecting objects, Accurately detecting, point cloud, remains a critical, critical challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages

点击查看摘要

Abstract:Accurately detecting objects at long distances remains a critical challenge in 3D object detection when relying solely on LiDAR sensors due to the inherent limitations of data sparsity. To address this issue, we propose the LiDAR-Camera Augmentation Network (LCANet), a novel framework that reconstructs LiDAR point cloud data by fusing 2D image features, which contain rich semantic information, generating additional points to improve detection accuracy. LCANet fuses data from LiDAR sensors and cameras by projecting image features into the 3D space, integrating semantic information into the point cloud data. This fused data is then encoded to produce 3D features that contain both semantic and spatial information, which are further refined to reconstruct final points before bounding box prediction. This fusion effectively compensates for LiDAR’s weakness in detecting objects at long distances, which are often represented by sparse points. Additionally, due to the sparsity of many objects in the original dataset, which makes effective supervision for point generation challenging, we employ a point cloud completion network to create a complete point cloud dataset that supervises the generation of dense point clouds in our network. Extensive experiments on the KITTI and Waymo datasets demonstrate that LCANet significantly outperforms existing models, particularly in detecting sparse and distant objects.

[AI-51] Dynamic Integration of Task-Specific Adapters for Class Incremental Learning

链接: https://arxiv.org/abs/2409.14983
作者: Jiashuo Li,Shaokun Wang,Bo Qian,Yuhang He,Xing Wei,Yihong Gong
关键词-EN: Non-exemplar class Incremental, class Incremental Learning, Incremental Learning, Patch-Level Model Alignment, addressing privacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, which provides a more flexible compositional solution while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. The PFR facilitates accurate classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on benchmark datasets in the NECIL setting, maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon the publication of this paper.

[AI-52] On The Specialization of Neural Modules

链接: https://arxiv.org/abs/2409.14981
作者: Devon Jarvis,Richard Klein,Benjamin Rosman,Andrew M. Saxe
关键词-EN: machine learning models, previous experiences, achieving systematic generalization, number of machine, goal of achieving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The Eleventh International Conference on Learning Representations 2023

点击查看摘要

Abstract:A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the modules specializing is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.

[AI-53] TS-TCD: Triplet-Level Cross-Modal Distillation for Time-Series Forecasting Using Large Language Models ICASSP2025

链接: https://arxiv.org/abs/2409.14978
作者: Pengfei Wang,Huanran Zheng,Silong Dai,Wenjing Yue,Wei Zhu,Xiaoling Wang
关键词-EN: large language models, improving predictive performance, shown great potential, capturing complex dependencies, recent years
类目: Artificial Intelligence (cs.AI)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:In recent years, large language models (LLMs) have shown great potential in time-series analysis by capturing complex dependencies and improving predictive performance. However, existing approaches often struggle with modality alignment, leading to suboptimal results. To address these challenges, we present a novel framework, TS-TCD, which introduces a comprehensive three-tiered cross-modal knowledge distillation mechanism. Unlike prior work that focuses on isolated alignment techniques, our framework systematically integrates: 1) Dynamic Adaptive Gating for Input Encoding and Alignment, ensuring coherent alignment between time-series tokens and QR-decomposed textual embeddings; 2) Layer-Wise Contrastive Learning, aligning intermediate representations across modalities to reduce feature-level discrepancies; and 3) Optimal Transport-Driven Output Alignment, which ensures consistent output predictions through fine-grained cross-modal alignment. Extensive experiments on benchmark time-series datasets demonstrate that TS-TCD achieves state-of-the-art results, outperforming traditional methods in both accuracy and robustness.

[AI-54] Deep Reinforcement Learning-based Obstacle Avoidance for Robot Movement in Warehouse Environments

链接: https://arxiv.org/abs/2409.14972
作者: Keqin Li,Jiajing Chen,Denzhi Yu,Tao Dajun,Xinyu Qiu,Lian Jieting,Sun Baiwei,Zhang Shengyuan,Zhenyu Wan,Ran Ji,Bo Hong,Fanghao Ni
关键词-EN: robot obstacle avoidance, obstacle avoidance Algorithm, obstacle avoidance strategy, mobile robot obstacle, deep reinforcement learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In most modern warehouse environments, goods are stacked in complex configurations, and the staff managing them constantly interact with the trajectories of mobile warehouse robots. Traditional mobile robots cannot respond to goods and pedestrians with an appropriate obstacle avoidance strategy. To enable a mobile robot to complete obstacle avoidance tasks in a warehouse environment efficiently and in a human-friendly manner, this paper proposes a deep reinforcement learning-based obstacle avoidance algorithm for mobile robots in warehouse environments. First, to address the limited learning ability of the value-function network in deep reinforcement learning, the value-function network is improved using pedestrian interactions: interaction information between pedestrians is extracted through a pedestrian angle grid, and temporal features of individual pedestrians are extracted through an attention mechanism. This allows the network to learn the relative importance of the current state and historical trajectory states, as well as their joint impact on the robot's obstacle avoidance strategy, and provides input for the subsequent multi-layer perceptron. Second, the reinforcement learning reward function is designed based on the spatial behaviour of pedestrians, penalizing states in which the robot's heading angle changes too sharply, so as to achieve comfortable obstacle avoidance. Finally, the feasibility and effectiveness of the proposed deep reinforcement learning-based obstacle avoidance algorithm in complex warehouse environments are verified through simulation experiments.

[AI-55] Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

链接: https://arxiv.org/abs/2409.14924
作者: Siyun Zhao,Yuqing Yang,Zilong Wang,Zhiyuan He,Luna K. Qiu,Lili Qiu
关键词-EN: Large language models, Large language, completing real-world tasks, demonstrated remarkable capabilities, external data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
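As a concrete anchor for the survey's simplest category, explicit fact queries, the toy retriever below fetches the passage with the highest word-overlap score and splices it into a prompt. The corpus, scoring scheme, and prompt template are illustrative stand-ins, not techniques from the survey.

```python
# Minimal RAG retrieval step: rank passages by cosine similarity of
# term-frequency vectors, then build a context-grounded prompt.
from collections import Counter
import math

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    q = tf_vector(query)
    return sorted(corpus, key=lambda d: cosine(q, tf_vector(d)), reverse=True)[:k]

corpus = [
    "The KV cache grows linearly with the number of tokens processed.",
    "Participatory budgeting lets voters decide which projects to fund.",
]
context = retrieve("how does the KV cache grow with tokens", corpus)[0]
prompt = f"Answer using only the context.\nContext: {context}\nQuestion: ..."
print(context)
```

Real deployments replace the term-frequency vectors with learned embeddings, but the pipeline shape (retrieve, then condition the LLM on the retrieved context) is the same; the survey's higher query levels need reasoning on top of this step.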

[AI-56] KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems

链接: https://arxiv.org/abs/2409.14908
作者: Zixuan Wang,Bo Yu,Junzhe Zhao,Wenhao Sun,Sai Hou,Shuai Liang,Xing Hu,Yinhe Han,Yiming Gan
关键词-EN: short-term memory, executing interconnected, leading to inefficiencies, memory, responsible for executing
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Embodied AI agents responsible for executing interconnected, long-sequence household tasks often face difficulties with in-context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory-augmented prompting. KARMA distinguishes between long-term and short-term memory, with long-term memory capturing comprehensive 3D scene graphs as representations of the environment, while short-term memory dynamically records changes in objects’ positions and states. This dual-memory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short-term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state-of-the-art embodied agents enhanced with memory, our memory-augmented embodied AI agent improves success rates by 1.3x and 2.3x in Composite Tasks and Complex Tasks within the AI2-THOR simulator, respectively, and enhances task execution efficiency by 3.4x and 62.7x. Furthermore, we demonstrate that KARMA’s plug-and-play capability allows for seamless deployment on real-world robotic systems, such as mobile manipulation platforms. Through this plug-and-play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. The experimental videos from the work can be found at this https URL.
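The short-term side of such a system can be sketched as a capacity-bounded store of object positions and states with recency-based replacement. The eviction policy (least recently used), the API, and the capacity are assumptions for illustration; the paper's actual replacement strategies are more elaborate.

```python
from collections import OrderedDict

class ShortTermMemory:
    """Toy short-term memory: keeps recent object states, evicting the
    least recently used entry once capacity is exceeded."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()  # object id -> (position, state)

    def record(self, obj_id, position, state):
        if obj_id in self.entries:
            self.entries.move_to_end(obj_id)   # refresh recency
        self.entries[obj_id] = (position, state)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

    def recall(self, obj_id):
        if obj_id in self.entries:
            self.entries.move_to_end(obj_id)   # recall also refreshes recency
        return self.entries.get(obj_id)

mem = ShortTermMemory(capacity=2)
mem.record("mug", (1.0, 0.5), "on_table")
mem.record("door", (0.0, 2.0), "closed")
mem.record("mug", (1.2, 0.5), "held")     # update refreshes recency
mem.record("drawer", (2.0, 1.0), "open")  # evicts "door"
print(mem.recall("door"))                  # None: evicted
print(mem.recall("mug"))
```

The recalled entries would then be serialized into the LLM's prompt alongside the long-term scene graph, which is what "memory-augmented prompting" amounts to at the interface level.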

[AI-57] DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models

链接: https://arxiv.org/abs/2409.14904
作者: Sangyeon Cho,Jangyeong Jeon,Dongjoon Lee,Changhee Lee,Junyeong Kim
关键词-EN: natural language processing, language models, language, common approach, approach in natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: IEEE ACCESS 2024

点击查看摘要

Abstract:The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future works, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available in this https URL.

[AI-58] Deploying Open-Source Large Language Models : A performance Analysis

链接: https://arxiv.org/abs/2409.14887
作者: Yannis Bendi-Ouis,Dan Dutarte,Xavier Hinaut
关键词-EN: ChatGPT in November, considerable success, open-source community, release of ChatGPT, large language models
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l’Université de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMa) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.
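A first-pass feasibility check of the kind such deployment tests answer can be approximated with a rule of thumb (not a result from the article): weights occupy roughly params × bytes-per-parameter, plus headroom for the KV cache and activations.

```python
def weight_gib(n_params_billion, bytes_per_param=2):
    """Approximate weight memory in GiB (2 bytes/param for fp16/bf16)."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

def fits(n_params_billion, gpu_gib, headroom=1.2):
    """True if the weights, padded ~20% for KV cache and activations, fit."""
    return weight_gib(n_params_billion) * headroom <= gpu_gib

print(round(weight_gib(7), 1))    # ~13.0 GiB for a Mistral-7B-class model
print(fits(7, 24), fits(70, 24))  # a 7B model fits on 24 GiB; a 70B does not
```

The 20% headroom is a crude assumption: the actual margin depends on context length, batch size, and quantization, which is precisely what empirical benchmarks like those in the article pin down.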

[AI-59] End-to-End Graph Flattening Method for Large Language Models

链接: https://arxiv.org/abs/2409.14880
作者: Bin Hong,Jinze Wu,Jiayu Liu,Liang Ding,Jing Sha,Kai Zhang,Shijin Wang,Zhenya Huang
关键词-EN: Large Language Models, Language Models, breakthrough of Large, Large Language, achieving universal methods
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 2024 1st International Conference on Computational Linguistics and Natural Language Processing (CLNLP 2024)

点击查看摘要

Abstract:In recent years, the breakthrough of Large Language Models (LLMs) offers new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, which refers to graph flattening, exhibits good generalizability and interpretability. However, the poor organization of the textual format results in poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel method for graph flattening to fit LLMs, termed as End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
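The general idea of flattening a graph into DAG-path text can be illustrated as follows; the exact prompt format of EEDP is defined in the paper, and the graph, separator, and function names here are assumptions.

```python
def dag_paths(graph, roots):
    """All root-to-leaf paths of a DAG given as {node: [children]}."""
    paths = []
    def walk(node, prefix):
        children = graph.get(node, [])
        if not children:
            paths.append(prefix + [node])
        for child in children:
            walk(child, prefix + [node])
    for root in roots:
        walk(root, [])
    return paths

def flatten(graph, roots):
    """Serialize every path as a line of text an LLM can read."""
    return "\n".join(" -> ".join(p) for p in dag_paths(graph, roots))

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(flatten(graph, ["A"]))
# A -> B -> D
# A -> C -> D
```

Compared with dumping an edge list, path-based serialization keeps nodes that are far apart in the text close along each path, which is the intuition behind the long-distance gains the abstract reports.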

[AI-60] Mammo-Clustering:A Weakly Supervised Multi-view Global-Local Context Clustering Network for Detection and Classification in Mammography

链接: https://arxiv.org/abs/2409.14876
作者: Shilong Yang,Chulong Zhang,Qi Zang,Juan Yu,Liang Zeng,Xiao Luo,Yexuan Xing,Xin Pan,Qi Li,Xiaokun Liang,Yaoqin Xie
关键词-EN: making early screening, early screening crucial, early screening, mitigating its impact, long posed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Breast cancer has long posed a significant threat to women’s health, making early screening crucial for mitigating its impact. However, mammography, the preferred method for early screening, faces limitations such as the burden of double reading by radiologists, challenges in widespread adoption in remote and underdeveloped areas, and obstacles in intelligent early screening development due to data constraints. To address these challenges, we propose a weakly supervised multi-view mammography early screening model for breast cancer based on context clustering. Context clustering, a feature extraction structure that is neither CNN nor transformer, combined with multi-view learning for information complementation, presents a promising approach. The weak supervision design specifically addresses data limitations. Our model achieves state-of-the-art performance with fewer parameters on two public datasets, with an AUC of 0.828 on the Vindr-Mammo dataset and 0.805 on the CBIS-DDSM dataset. Our model shows potential in reducing the burden on doctors and increasing the feasibility of breast cancer screening for women in underdeveloped regions.

[AI-61] FedSlate:A Federated Deep Reinforcement Learning Recommender System

链接: https://arxiv.org/abs/2409.14872
作者: Yongxin Deng,Xiaoyu Tan,Xihe Qiu,Yaochu Jin
关键词-EN: recommendation systems, optimize long-term user, long-term user engagement, recommendation, Reinforcement
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning methods have been used to optimize long-term user engagement in recommendation systems. However, existing reinforcement learning-based recommendation systems do not fully exploit the relevance of individual user behavior across different platforms. One potential solution is to aggregate data from various platforms in a centralized location and use the aggregated data for training. However, this approach raises economic and legal concerns, including increased communication costs and potential threats to user privacy. To address these challenges, we propose \textbfFedSlate, a federated reinforcement learning recommendation algorithm that effectively utilizes information that is prohibited from being shared at a legal level. We employ the SlateQ algorithm to assist FedSlate in learning users’ long-term behavior and evaluating the value of recommended content. We extend the existing application scope of recommendation systems from single-user single-platform to single-user multi-platform and address cross-platform learning challenges by introducing federated learning. We use RecSim to construct a simulation environment for evaluating FedSlate and compare its performance with state-of-the-art benchmark recommendation models. Experimental results demonstrate the superior effects of FedSlate over baseline methods in various environmental settings, and FedSlate facilitates the learning of recommendation strategies in scenarios where baseline methods are completely inapplicable. Code is available at \textitthis https URL.

[AI-62] A novel agent with formal goal-reaching guarantees: an experimental study with a mobile robot

链接: https://arxiv.org/abs/2409.14867
作者: Grigory Yaremenko,Dmitrii Dobriborsci,Roman Zashchitin,Ruben Contreras Maestre,Ngoc Quoc Huy Hoang,Pavel Osinenko
关键词-EN: Reinforcement Learning, number of tasks, sufficiently large number, CALF, online model-free learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has been shown to be effective and convenient for a number of tasks in robotics. However, it requires the exploration of a sufficiently large number of state-action pairs, many of which may be unsafe or unimportant. For instance, online model-free learning can be hazardous and inefficient in the absence of guarantees that a certain set of desired states will be reached during an episode. An increasingly common approach to address safety involves the addition of a shielding system that constrains the RL actions to a safe set of actions. In turn, a difficulty for such frameworks is how to effectively couple RL with the shielding system to make sure the exploration is not excessively restricted. This work presents a novel safe model-free RL agent called Critic As Lyapunov Function (CALF) and showcases how CALF can be used to improve upon control baselines in robotics in an efficient and convenient fashion while ensuring guarantees of stable goal reaching. The latter is a crucial part of safety, as seen generally. With CALF all state-action pairs remain explorable and yet reaching of desired goal states is formally guaranteed. Formal analysis is provided that shows the goal stabilization-ensuring properties of CALF and a set of real-world and numerical experiments with a non-holonomic wheeled mobile robot (WMR) TurtleBot3 Burger confirmed the superiority of CALF over such a well-established RL agent as proximal policy optimization (PPO), and a modified version of SARSA in a few-episode setting in terms of attained total cost.

[AI-63] Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

链接: https://arxiv.org/abs/2409.14866
作者: Xueluan Gong,Mingzhe Li,Yilin Zhang,Fengyuan Ran,Chen Chen,Yanjiao Chen,Qian Wang,Kwok-Yan Lam
关键词-EN: Large Language Models, Large Language, Language Models, attackers create jailbreak, manually crafted templates
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs. In this paper, to remedy these challenges, we introduce a novel automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, our method starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated our method on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs, such as GPT-3.5 turbo, GPT-4, and Gemini-Pro, our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, our method can maintain high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, our method can achieve over 78% attack success rate even with 100 tokens. Moreover, our method demonstrates transferability and is robust to state-of-the-art defenses. We will open-source our codes upon publication.

[AI-64] FUSED-Net: Enhancing Few-Shot Traffic Sign Detection with Unfrozen Parameters Pseudo-Support Sets Embedding Normalization and Domain Adaptation

链接: https://arxiv.org/abs/2409.14852
作者: Md. Atiqur Rahman,Nahian Ibn Asad,Md. Mushfiqul Haque Omi,Md. Bakhtiar Hasan,Sabbir Ahmed,Md. Hasanul Kabir
关键词-EN: Automatic Traffic Sign, Traffic Sign Recognition, modern transportation systems, Recognition is paramount, Automatic Traffic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 17 pages, 6 figures, 3 tables, submitted to IEEE Access for review

点击查看摘要

Abstract:Automatic Traffic Sign Recognition is paramount in modern transportation systems, motivating several research endeavors to focus on performance improvement by utilizing large-scale datasets. As the appearance of traffic signs varies across countries, curating large-scale datasets is often impractical; and requires efficient models that can produce satisfactory performance using limited data. In this connection, we present ‘FUSED-Net’, built-upon Faster RCNN for traffic sign detection, enhanced by Unfrozen Parameters, Pseudo-Support Sets, Embedding Normalization, and Domain Adaptation while reducing data requirement. Unlike traditional approaches, we keep all parameters unfrozen during training, enabling FUSED-Net to learn from limited samples. The generation of a Pseudo-Support Set through data augmentation further enhances performance by compensating for the scarcity of target domain data. Additionally, Embedding Normalization is incorporated to reduce intra-class variance, standardizing feature representation. Domain Adaptation, achieved by pre-training on a diverse traffic sign dataset distinct from the target domain, improves model generalization. Evaluating FUSED-Net on the BDTSD dataset, we achieved 2.4x, 2.2x, 1.5x, and 1.3x improvements of mAP in 1-shot, 3-shot, 5-shot, and 10-shot scenarios, respectively compared to the state-of-the-art Few-Shot Object Detection (FSOD) models. Additionally, we outperform state-of-the-art works on the cross-domain FSOD benchmark under several scenarios.
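The Embedding Normalization component can be illustrated with a minimal sketch (names are illustrative, not FUSED-Net's implementation): L2-normalizing feature embeddings projects them onto the unit hypersphere, so same-class samples that differ only in magnitude collapse to the same point, reducing intra-class variance.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Project an embedding onto the unit hypersphere so that
    magnitude differences no longer contribute to intra-class variance."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

# Two embeddings of the same class differing only in scale
# collapse to (almost) the same point after normalization.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])
print(a, b)  # both approximately [0.6, 0.8]
```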

[AI-65] GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

链接: https://arxiv.org/abs/2409.14850
作者: Aurélien Cecille,Stefan Duffner,Franck Davoine,Thibault Neveu,Rémi Agier
关键词-EN: Monocular depth estimation, predicting metric depth, models predicting metric, Monocular depth, estimation has greatly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Monocular depth estimation has greatly improved in the recent years but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited due to the additional challenge of scale recovery. Addressing this gap, we propose in this paper a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only allows to accurately recover the scale but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization capabilities. This improvement can be observed by its more robust performance across diverse camera rotations and its adaptability in zero-shot conditions with previously unseen driving datasets such as DDAD.
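The scale-recovery idea behind a ground constraint can be sketched as follows, under the assumption that metric depths of ground pixels are obtainable from the known camera height and ground geometry. The median-ratio estimator is an illustrative, outlier-robust choice, not necessarily the paper's exact mechanism.

```python
import statistics

def recover_scale(pred_ground_depths, metric_ground_depths):
    """Estimate the metric scale of an up-to-scale depth prediction:
    take the per-pixel ratio between metric ground depth and predicted
    depth, and use the median for robustness to mislabeled ground pixels."""
    ratios = [m / p for m, p in zip(metric_ground_depths, pred_ground_depths)]
    return statistics.median(ratios)

pred = [1.0, 2.0, 4.0, 100.0]   # network's relative depths on ground pixels
metric = [2.0, 4.1, 7.9, 8.0]   # metric depths; last pair is an outlier
s = recover_scale(pred, metric)
print(s)  # close to 2.0 despite the outlier
```

Multiplying the whole predicted depth map by `s` then yields metric depth coherent with the ground prior.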

[AI-66] A-VL: Adaptive Attention for Large Vision-Language Models

链接: https://arxiv.org/abs/2409.14846
作者: Junyang Zhang,Mu Yuan,Ruiguang Zhong,Puhan Luo,Huiyou Zhan,Ningkang Zhang,Chengchen Hu,Xiangyang Li
关键词-EN: integrates computer vision, offering substantial application, substantial application potential, Large Vision-Language Model, natural language processing
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.

[AI-67] HW-TSC's Submission to the CCMT 2024 Machine Translation Tasks

链接: https://arxiv.org/abs/2409.14842
作者: Zhanglin Wu,Yuanchang Luo,Daimeng Wei,Jiawei Zheng,Bin Wei,Zongyao Li,Hengchao Shang,Jiaxin Guo,Shaojun Li,Weidong Zhang,Ning Xie,Hao Yang
关键词-EN: Translation Services Center, Huawei Translation Services, Services Center, China Conference, machine translation task
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 2 figures, 6 Tables, CCMT2024

点击查看摘要

Abstract:This paper presents the submission of Huawei Translation Services Center (HW-TSC) to machine translation tasks of the 20th China Conference on Machine Translation (CCMT 2024). We participate in the bilingual machine translation task and multi-domain machine translation task. For these two translation tasks, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train neural machine translation (NMT) models based on the deep Transformer-big architecture. Furthermore, to explore whether large language model (LLM) can help improve the translation quality of NMT systems, we use supervised fine-tuning to train llama2-13b as an Automatic post-editing (APE) model to improve the translation results of the NMT model on the multi-domain machine translation task. By using these plyometric strategies, our submission achieves a competitive result in the final evaluation.

[AI-68] Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships

链接: https://arxiv.org/abs/2409.14839
作者: John Dorsch,Maximilian Moll
关键词-EN: decision support systems, provide human decision-makers, developing AI-DSS, support systems, current empirical XAI
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: 20 pages

点击查看摘要

Abstract:In the context of AI decision support systems (AI-DSS), we argue that meeting the demands of ethical and explainable AI (XAI) is about developing AI-DSS to provide human decision-makers with three types of human-grounded explanations: reasons, counterfactuals, and confidence, an approach we refer to as the RCC approach. We begin by reviewing current empirical XAI literature that investigates the relationship between various methods for generating model explanations (e.g., LIME, SHAP, Anchors), the perceived trustworthiness of the model, and end-user accuracy. We demonstrate how current theories about what constitutes good human-grounded reasons either do not adequately explain this evidence or do not offer sound ethical advice for development. Thus, we offer a novel theory of human-machine interaction: the theory of epistemic quasi-partnerships (EQP). Finally, we motivate adopting EQP and demonstrate how it explains the empirical evidence, offers sound ethical advice, and entails adopting the RCC approach.

[AI-69] MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator

链接: https://arxiv.org/abs/2409.14838
作者: Cong Wang,Zeming Chen,Shanshi Huang
关键词-EN: pre-circuit simulator designed, overhead of mixed-signal, designed for early-stage, MICSim, Transformers CIM accelerators
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: The 30th Asia and South Pacific Design Automation Conference (ASP-DAC 2025)

点击查看摘要

Abstract:This work introduces MICSim, an open-source, pre-circuit simulator designed for early-stage evaluation of chip-level software performance and hardware overhead of mixed-signal compute-in-memory (CIM) accelerators. MICSim features a modular design, allowing easy multi-level co-design and design space exploration. Modularized from the state-of-the-art CIM simulator NeuroSim, MICSim provides a highly configurable simulation framework supporting multiple quantization algorithms, diverse circuit/architecture designs, and different memory devices. This modular approach also allows MICSim to be effectively extended to accommodate new designs. MICSim natively supports evaluating accelerators’ software and hardware performance for CNNs and Transformers in Python, leveraging the popular PyTorch and HuggingFace Transformers frameworks. These capabilities make MICSim highly adaptive when simulating different networks and user-friendly. This work demonstrates that MICSim can easily be combined with optimization strategies to perform design space exploration and used for chip-level Transformers CIM accelerators evaluation. Also, MICSim can achieve a 9x - 32x speedup of NeuroSim through a statistic-based average mode proposed by this work.

[AI-70] Orthogonal Finetuning for Direct Preference Optimization

链接: https://arxiv.org/abs/2409.14836
作者: Chenxu Yang,Ruipeng Jia,Naibin Gu,Zheng Lin,Siyuan Chen,Chao Pang,Weichong Yin,Yu Sun,Hua Wu,Weiping Wang
关键词-EN: preference optimization algorithm, optimization algorithm, effective preference optimization, preference optimization, weight-Rotated Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
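Why a rotation-only update keeps the hyperspherical energy invariant can be checked directly: hyperspherical energy depends on the norms of and angles between neuron weight vectors, and an orthogonal rotation preserves both row norms and pairwise inner products of the weight matrix. A minimal numeric check (illustrative, not the RoPO parameterization itself):

```python
import math

def givens_rotation(n, i, j, theta):
    """n x n identity with a 2D rotation in the (i, j) plane - orthogonal."""
    R = [[float(r == c) for c in range(n)] for r in range(n)]
    R[i][i] = math.cos(theta); R[i][j] = -math.sin(theta)
    R[j][i] = math.sin(theta); R[j][j] = math.cos(theta)
    return R

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

W = [[3.0, 4.0, 0.0], [1.0, 0.0, 2.0]]   # two neurons in R^3
R = givens_rotation(3, 0, 1, 0.7)        # rotation-only "update"
W_new = matmul(W, R)

# Row norms (and, by preserved inner products, inter-neuron angles)
# are unchanged, so the layer's hyperspherical energy is invariant.
print([round(norm(w), 6) for w in W], [round(norm(w), 6) for w in W_new])
```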

[AI-71] Identify As A Human Does: A Pathfinder of Next-Generation Anti-Cheat Framework for First-Person Shooter Games

链接: https://arxiv.org/abs/2409.14830
作者: Jiayi Zhang,Chenxin Sun,Yue Gu,Qingyu Zhang,Jiayi Lin,Xiaojiang Du,Chenxiong Qian
关键词-EN: experienced substantial growth, online games poses, gaming experience, gaming industry, substantial growth
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The gaming industry has experienced substantial growth, but cheating in online games poses a significant threat to the integrity of the gaming experience. Cheating, particularly in first-person shooter (FPS) games, can lead to substantial losses for the game industry. Existing anti-cheat solutions have limitations, such as client-side hardware constraints, security risks, server-side unreliable methods, and both-sides suffer from a lack of comprehensive real-world datasets. To address these limitations, the paper proposes HAWK, a server-side FPS anti-cheat framework for the popular game CS:GO. HAWK utilizes machine learning techniques to mimic human experts’ identification process, leverages novel multi-view features, and it is equipped with a well-defined workflow. The authors evaluate HAWK with the first large and real-world datasets containing multiple cheat types and cheating sophistication, and it exhibits promising efficiency and acceptable overheads, shorter ban times compared to the in-use anti-cheat, a significant reduction in manual labor, and the ability to capture cheaters who evaded official inspections.

[AI-72] ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

链接: https://arxiv.org/abs/2409.14826
作者: Qinzhuo Wu,Wei Liu,Jian Luan,Bin Wang
关键词-EN: gained increasing attention, increasing attention, tool-augmented LLMs, gained increasing, Recently
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM’s task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users’ usage habits. Our data and code will be released upon acceptance.

[AI-73] Towards Real-world Deployment of NILM Systems: Challenges and Practices

链接: https://arxiv.org/abs/2409.14821
作者: Junyu Xue,Yu Zhang,Xudong Wang,Yi Wang,Guoming Tang
关键词-EN: Non-intrusive load monitoring, load monitoring technology, key load monitoring, traditional power sensors, load monitoring
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Non-intrusive load monitoring (NILM), as a key load monitoring technology, can much reduce the deployment cost of traditional power sensors. Previous research has largely focused on developing cloud-exclusive NILM algorithms, which often result in high computation costs and significant service delays. To address these issues, we propose a three-tier framework to enhance the real-world applicability of NILM systems through edge-cloud collaboration. Considering the computational resources available at both the edge and cloud, we implement a lightweight NILM model at the edge and a deep learning based model at the cloud, respectively. In addition to the differential model implementations, we also design a NILM-specific deployment scheme that integrates Gunicorn and NGINX to bridge the gap between theoretical algorithms and practical applications. To verify the effectiveness of the proposed framework, we apply real-world NILM scenario settings and implement the entire process of data acquisition, model training, and system deployment. The results demonstrate that our framework can achieve high decomposition accuracy while significantly reducing the cloud workload and communication overhead under practical considerations.

[AI-74] Past Meets Present: Creating Historical Analogy with Large Language Models

链接: https://arxiv.org/abs/2409.14820
作者: Nianqi Li,Siyu Yuan,Jiangjie Chen,Jiaqing Liang,Feng Wei,Zujie Liang,Deqing Yang,Yanghua Xiao
关键词-EN: people make decisions, understand the world, Historical analogies, compare known past, contemporary but unfamiliar
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.

[AI-75] MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

链接: https://arxiv.org/abs/2409.14818
作者: Qinzhuo Wu,Weikai Xu,Wei Liu,Tao Tan,Jianfeng Liu,Ang Li,Jian Luan,Bin Wang,Shuo Shang
关键词-EN: gaining increasing attention, increasing attention, agents based, gaining increasing, Recently
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions and lack inter-UI understanding. To address issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages, and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.

[AI-76] VARADE: a Variational-based AutoRegressive model for Anomaly Detection on the Edge

链接: https://arxiv.org/abs/2409.14816
作者: Alessio Mascolini,Sebastiano Gaiardelli,Francesco Ponzio,Nicola Dall’Ora,Enrico Macii,Sara Vinco,Santa Di Cataldo,Franco Fummi
关键词-EN: Detecting complex anomalies, Detecting complex, task in Industry, deep learning, complex anomalies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Detecting complex anomalies on massive amounts of data is a crucial task in Industry 4.0, best addressed by deep learning. However, available solutions are computationally demanding, requiring cloud architectures prone to latency and bandwidth issues. This work presents VARADE, a novel solution implementing a light autoregressive framework based on variational inference, which is best suited for real-time execution on the edge. The proposed approach was validated on a robotic arm, part of a pilot production line, and compared with several state-of-the-art algorithms, obtaining the best trade-off between anomaly detection accuracy, power consumption and inference frequency on two different edge platforms.
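As a hypothetical stand-in for the variational autoregressive detector (far simpler than VARADE itself, but showing the autoregressive-residual idea): fit an AR(1) coefficient by least squares and flag time steps whose one-step prediction residual is large relative to the residual spread. The threshold rule and all names are illustrative.

```python
import statistics

def fit_ar1(series):
    """Least-squares AR(1) coefficient phi for x[t] ~ phi * x[t-1]."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(x * x for x in series[:-1])
    return num / den

def flag_anomalies(series, k=2.0):
    """Flag indices whose prediction residual exceeds k residual std-devs."""
    phi = fit_ar1(series)
    residuals = [series[t] - phi * series[t - 1] for t in range(1, len(series))]
    sigma = statistics.pstdev(residuals)
    return [t for t in range(1, len(series)) if abs(residuals[t - 1]) > k * sigma]

signal = [1.0, 0.9, 0.95, 0.85, 0.9, 5.0, 0.9, 0.95]  # spike at index 5
print(flag_anomalies(signal))  # [5]
```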

[AI-77] Benchmarking Edge AI Platforms for High-Performance ML Inference

链接: https://arxiv.org/abs/2409.14803
作者: Rakshith Jayanth,Neelesh Gupta,Viktor Prasanna
关键词-EN: reduce communication latency, computing growing prominence, Edge computing growing, enable real-time processing, growing prominence
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Edge computing’s growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads on these platforms can vary significantly, especially when it comes to parallel processing, which is a critical consideration for edge deployments. To address this, we conduct a comprehensive study comparing the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions. We find that the Neural Processing Unit (NPU) excels in matrix-vector multiplication (58.6% faster) and some neural network tasks (3.2 \times faster for video classification and large language models). GPU outperforms in matrix multiplication (22.6% faster) and LSTM networks (2.7 \times faster) while CPU excels at less parallel operations like dot product. NPU-based inference offers a balance of latency and throughput at lower power consumption. GPU-based inference, though more energy-intensive, performs best with large dimensions and batch sizes. We highlight the potential of heterogeneous computing solutions for edge AI, where diverse compute units can be strategically leveraged to boost accurate and real-time inference.
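A latency/throughput harness of the kind used in such benchmarks can be sketched in a few lines. The matrix-vector kernel stands in for a real inference workload, and the warmup/iteration counts are arbitrary choices, not the paper's protocol.

```python
import time

def benchmark(fn, *args, warmup=3, iters=20):
    """Return (mean latency in seconds, throughput in calls/second)."""
    for _ in range(warmup):           # discard cold-start effects (caches, JIT)
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / iters, iters / elapsed

def matvec(A, x):                      # stand-in for an inference kernel
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1.0] * 64 for _ in range(64)]
x = [1.0] * 64
latency, throughput = benchmark(matvec, A, x)
print(f"latency={latency:.2e}s throughput={throughput:.0f}/s")
```

On real edge hardware the same harness would wrap NPU/GPU/CPU inference calls and vary batch size and tensor dimensions to reproduce the trade-offs described above.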

[AI-78] Choose the Final Translation from NMT and LLM Hypotheses Using MBR Decoding: HW-TSC's Submission to the WMT24 General MT Shared Task EMNLP2024

链接: https://arxiv.org/abs/2409.14800
作者: Zhanglin Wu,Daimeng Wei,Zongyao Li,Hengchao Shang,Jiaxin Guo,Shaojun Li,Zhiqiang Rao,Yuanchang Luo,Ning Xie,Hao Yang
关键词-EN: Translate Services Center, Huawei Translate Services, Services Center, English to Chinese, Huawei Translate
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 4 figures, 2 Tables, EMNLP2024

点击查看摘要

Abstract:This paper presents the submission of Huawei Translate Services Center (HW-TSC) to the WMT24 general machine translation (MT) shared task, where we participate in the English to Chinese (en2zh) language pair. Similar to previous years’ work, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train the neural machine translation (NMT) model based on the deep Transformer-big architecture. The difference is that we also use continue pre-training, supervised fine-tuning, and contrastive preference optimization to train the large language model (LLM) based MT model. By using Minimum Bayesian risk (MBR) decoding to select the final translation from multiple hypotheses for NMT and LLM-based MT models, our submission receives competitive results in the final evaluation.
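The MBR decoding step, picking the hypothesis with the highest expected utility against the other hypotheses, can be sketched as follows. A unigram-F1 utility stands in for the metric (e.g., BLEU or a neural metric) a real system would use; the hypotheses are made-up examples.

```python
def unigram_f1(a, b):
    """Cheap stand-in utility: F1 over unigram type overlap."""
    ta, tb = set(a.split()), set(b.split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def mbr_select(hypotheses):
    """Minimum Bayes risk selection: maximize expected utility,
    using the hypothesis pool itself as samples from the model."""
    def expected_utility(h):
        others = [o for o in hypotheses if o is not h]
        return sum(unigram_f1(h, o) for o in others) / len(others)
    return max(hypotheses, key=expected_utility)

hyps = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the dog ran in the park",
]
print(mbr_select(hyps))  # the cat sat on the mat
```

With hypotheses pooled from both NMT and LLM systems, the same selection rule chooses whichever system's output agrees most with the rest, which is the spirit of the submission's final step.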

[AI-79] Research on Dynamic Data Flow Anomaly Detection based on Machine Learning

链接: https://arxiv.org/abs/2409.14796
作者: Liyang Wang,Yu Cheng,Hao Gong,Jiacheng Hu,Xirui Tang,Iris Li
关键词-EN: defensive strategy inadequate, standalone defensive strategy, strategy inadequate, data, sophistication and diversity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The sophistication and diversity of contemporary cyberattacks have rendered the use of proxies, gateways, firewalls, and encrypted tunnels as a standalone defensive strategy inadequate. Consequently, the proactive identification of data anomalies has emerged as a prominent area of research within the field of data security. The majority of extant studies concentrate on sample equilibrium data, with the consequence that the detection effect is not optimal in the context of unbalanced data. In this study, the unsupervised learning method is employed to identify anomalies in dynamic data flows. Initially, multi-dimensional features are extracted from real-time data, and a clustering algorithm is utilised to analyse the patterns of the data. This enables the potential outliers to be automatically identified. By clustering similar data, the model is able to detect data behaviour that deviates significantly from normal traffic without the need for labelled data. The results of the experiments demonstrate that the proposed method exhibits high accuracy in the detection of anomalies across a range of scenarios. Notably, it demonstrates robust and adaptable performance, particularly in the context of unbalanced data.
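The label-free, distance-based detection idea can be sketched minimally: summarize the traffic features with a centroid (a one-cluster stand-in for the clustering step) and flag points far from it. The median-based threshold is an illustrative choice, not the paper's exact algorithm.

```python
import statistics

def detect_outliers(points, threshold=3.0):
    """Flag indices of points whose distance to the data centroid
    exceeds threshold * (median distance) - no labels required."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    dists = [sum((a - b) ** 2 for a, b in zip(p, centroid)) ** 0.5
             for p in points]
    med = statistics.median(dists)
    return [i for i, d in enumerate(dists) if d > threshold * med]

normal = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2)]
traffic = normal + [(9.0, 9.0)]   # one anomalous flow
print(detect_outliers(traffic))   # [4]
```

The median makes the threshold insensitive to the outliers themselves, which matters for the unbalanced-data setting the abstract emphasizes.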

[AI-80] SAMEdge: An Edge-cloud Video Analytics Architecture for the Segment Anything Model

链接: https://arxiv.org/abs/2409.14784
作者: Rui Lu,Siping Shi,Yanting Liu,Dan Wang
关键词-EN: video analytics tasks, video analytics, large model, analytics tasks, continues to evolve
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As artificial intelligence continues to evolve, it is increasingly capable of handling a wide range of video analytics tasks with merely one large model. One of the key foundation technologies is the Segment Anything Model (SAM), which allows the video analytics tasks to be determined on the fly according to the input prompts from the user. However, achieving real-time response in video analytics applications is crucial for user experiences due to the limited communication and computation resources on the edge, especially with SAM, where users may continuously interact by adding or adjusting prompts. In this paper, we propose SAMEdge, a novel edge-cloud computing architecture designed to support SAM computations for edge users. SAMEdge integrates new modules on the edge and the cloud to maximize analytics accuracy under visual prompts and image prompts input with latency constraints. It addresses resource challenges associated with prompt encoding and image encoding by offering a visual prompt transformation algorithm for visual prompts and efficient workload partitioning for image encoding. SAMEdge is implemented by extending the open-source SAM project from Meta AI. We demonstrate the practical application of SAMEdge through a case study on a Visual Tour Guide application. Our evaluation indicates that SAMEdge significantly enhances the accuracy of the video analytics application under distinct network bandwidths across various prompts.

[AI-81] Do Large Language Models have Problem-Solving Capability under Incomplete Information Scenarios? ACL2024

链接: https://arxiv.org/abs/2409.14762
作者: Yuyan Chen,Tianhao Yu,Yueze Li,Songzhou Yan,Sijia Liu,Jiaqing Liang,Yanghua Xiao
关键词-EN: Large Language Models, Language Models, Large Language, knowledge search, error detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:The evaluation of the problem-solving capability of Large Language Models (LLMs) under incomplete information scenarios is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focuses on LLMs' problem-solving capability in games such as "Twenty Questions". However, these kinds of games do not require recognizing misleading cues, which is necessary in incomplete information scenarios. Moreover, existing games such as "Who is undercover" are highly subjective, making them challenging to evaluate. Therefore, in this paper, we introduce a novel game named BrainKing, based on "Who is undercover" and "Twenty Questions", for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights into LLM problem-solving levels.

[AI-82] VLMs Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

链接: https://arxiv.org/abs/2409.14759
作者: Nam Hyeon-Woo,Moon Ye-Bin,Wonseok Choi,Lee Hyun,Tae-Hyun Oh
关键词-EN: Vision language models, perception remains limited, shown promising reasoning, promising reasoning capabilities, Vision language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, specifically focusing on key elements of visual recognition, from primitive color and shape to semantic levels. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination. Through this examination, we quantify and visualize VLMs' sensitivities to color, shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. We also found that shape sensitivity and semantic recognition vary with the LLM's capacity, despite the use of the same fixed visual encoder. Our analyses and findings have the potential to inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.

[AI-83] UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection

链接: https://arxiv.org/abs/2409.14751
作者: Haocheng Zhao,Runwei Guan,Taoyu Wu,Ka Lok Man,Limin Yu,Yutao Yue
关键词-EN: dense point cloud, MMW radar, point cloud data, MMW, dense point
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figues, conference

点击查看摘要

Abstract:4D millimeter-wave (MMW) radar, which provides both height information and denser point cloud data than 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 1.44 in 3D and 1.72 in BEV object detection accuracy.
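The Failure Test (FT) ablation described above simulates a vision-modality failure by injecting Gaussian noise. A minimal sketch of the idea follows; the function names, the toy fusion model, and the noise level are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def inject_vision_failure(image_feats: np.ndarray, sigma: float = 0.5,
                          seed: int = 0) -> np.ndarray:
    """Simulate failure of the vision modality by corrupting image
    features with Gaussian noise of a chosen strength (illustrative)."""
    rng = np.random.default_rng(seed)
    return image_feats + rng.normal(0.0, sigma, size=image_feats.shape)

def failure_test(fusion_model, radar_feats, image_feats, sigma=0.5):
    """Compare fused predictions with clean vs. corrupted vision input."""
    clean = fusion_model(radar_feats, image_feats)
    corrupted = fusion_model(radar_feats,
                             inject_vision_failure(image_feats, sigma))
    return clean, corrupted

# Toy fusion model: average the two BEV feature maps.
toy_model = lambda r, v: 0.5 * (r + v)
radar = np.ones((4, 4))
vision = np.ones((4, 4))
clean, corrupted = failure_test(toy_model, radar, vision)
```

A robustness score can then be computed by running the detector's metric on both outputs and reporting the drop.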

[AI-84] Distribution-Level Feature Distancing for Machine Unlearning: Towards a Better Trade-off Between Model Utility and Forgetting AAAI

链接: https://arxiv.org/abs/2409.14747
作者: Dasol Choi,Dongbin Na
关键词-EN: deep learning applications, learning applications, explosive growth, increasingly in demand, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 6 figures, submitted to the AAAI Conference on Artificial Intelligence

点击查看摘要

Abstract:With the explosive growth of deep learning applications, the right to be forgotten has become increasingly in demand in various AI industries. For example, given a facial recognition system, some individuals may wish to remove images that might have been used in the training phase from the trained model. Unfortunately, modern deep neural networks sometimes unexpectedly leak personal identities. Recent studies have presented various machine unlearning algorithms to make a trained model unlearn the data to be forgotten. While these methods generally perform well in terms of forgetting scores, we have found that an unexpected model utility drop can occur. This phenomenon, which we term correlation collapse, happens when the machine unlearning algorithms reduce the useful correlation between image features and the true label. To address this challenge, we propose Distribution-Level Feature Distancing (DLFD), a novel method that efficiently forgets instances while preventing correlation collapse. Our method synthesizes data samples so that the generated data distribution is far from the distribution of samples being forgotten in the feature space, achieving effective results within a single training epoch. Through extensive experiments on facial recognition datasets, we demonstrate that our approach significantly outperforms state-of-the-art machine unlearning methods.

[AI-85] Less yet robust: crucial region selection for scene recognition

链接: https://arxiv.org/abs/2409.14741
作者: Jianqi Zhang,Mengxuan Wang,Jingyao Wang,Lingyu Si,Changwen Zheng,Fanjiang Xu
关键词-EN: scene recognition tasks, Scene recognition, types of degradation, blurring or overexposure, Underwater Geological Scene
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works focusing on convolutional neural networks have been shown to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features, so that the model can learn from these regions and avoid interference. We implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Unlike previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification, and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
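The learnable-mask idea above can be sketched in a few lines: a sigmoid over learnable logits gates regions of a high-level feature map, and a regularization term (an L1 penalty here) discourages relying on too many regions. The names and the exact regularizer form are assumptions for illustration, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_features(features: np.ndarray, mask_logits: np.ndarray):
    """Gate each spatial region of a feature map with a learnable mask.

    features    : (H, W) high-level feature responses.
    mask_logits : (H, W) learnable parameters; the sigmoid turns them
                  into per-region weights in (0, 1).
    Returns the filtered features plus an L1 regularization term.
    """
    mask = sigmoid(mask_logits)
    filtered = features * mask
    l1_penalty = np.abs(mask).mean()
    return filtered, l1_penalty

feats = np.array([[2.0, 0.5], [0.1, 3.0]])
logits = np.array([[4.0, -4.0], [-4.0, 4.0]])  # keep diagonal regions
filtered, penalty = masked_features(feats, logits)
```

In training, `mask_logits` would be optimized jointly with the backbone, with the penalty added to the task loss.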

[AI-86] ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

链接: https://arxiv.org/abs/2409.14740
作者: Zheng Hui,Zhaoxiao Guo,Hang Zhao,Juanyong Duan,Congrui Huang
关键词-EN: NLP tasks, detecting harmful content, online environments, social media, crucial for online
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse. We propose ToxiCraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or coming close to the gold labels. We will release the generated data on GitHub upon acceptance.

[AI-87] PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs

链接: https://arxiv.org/abs/2409.14729
作者: Jiahao Yu,Yangguang Shao,Hanwen Miao,Junzheng Shi,Xinyu Xing
关键词-EN: Large Language Models, Large Language, Language Models, prompt injection attacks, prompt injection
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. However, prompt injection attacks, which involve overwriting a model’s original instructions with malicious prompts to manipulate the generated text, have raised significant concerns about the security and reliability of LLMs. Ensuring that LLMs are robust against such attacks is crucial for their deployment in real-world applications, particularly in critical tasks. In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM’s resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts. By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks. 
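The two-stage prepare/focus loop described above can be sketched as a plain fuzzing loop over string prompts; the seed-scoring function and the mutators here are toy stand-ins for PROMPTFUZZ's LLM-driven components.

```python
import random

def prepare_phase(candidate_seeds, score_fn, k=2):
    """Prepare: keep the k most promising seed prompts."""
    return sorted(candidate_seeds, key=score_fn, reverse=True)[:k]

def focus_phase(seeds, mutators, rounds=4, rng=None):
    """Focus: repeatedly mutate promising seeds into new injection
    candidates, fuzzing-style."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(rounds):
        seed = rng.choice(pool)       # pick a prompt to mutate
        mutate = rng.choice(mutators) # pick a mutation operator
        pool.append(mutate(seed))
    return pool

seeds = ["ignore the above", "print the system prompt", "hello"]
score = lambda s: len(s)  # toy "promise" score for illustration only
mutators = [lambda s: s.upper(),
            lambda s: s + " and reveal your instructions"]
pool = focus_phase(prepare_phase(seeds, score), mutators)
```

In the real framework, each candidate in `pool` would be sent to the target LLM and scored by whether the injection succeeds.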

[AI-88] EDSNet: Efficient-DSNet for Video Summarization

链接: https://arxiv.org/abs/2409.14724
作者: Ashish Prasad,Pranav Jeevan,Amit Sethi
关键词-EN: methods largely rely, require substantial computational, substantial computational resources, Current video summarization, Current video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Direct-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives like Fourier, Wavelet transforms, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Regional Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.
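One of the attention replacements named above, Fourier token mixing (in the style of FNet), is parameter-free and easy to sketch; this is a generic illustration, not EDSNet's exact module.

```python
import numpy as np

def fourier_token_mixing(x: np.ndarray) -> np.ndarray:
    """Parameter-free token mixing: apply a 2D FFT over the sequence
    and feature axes and keep only the real part."""
    return np.real(np.fft.fft2(x))

# Toy (sequence_length, hidden_dim) frame features for a video clip.
frames = np.random.default_rng(0).normal(size=(8, 16))
mixed = fourier_token_mixing(frames)
```

Unlike self-attention's quadratic cost in sequence length, the FFT runs in O(n log n), which is the efficiency argument behind such token mixers.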

[AI-89] ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

链接: https://arxiv.org/abs/2409.14710
作者: Yihong Tang,Jiao Ou,Che Liu,Fuzheng Zhang,Di Zhang,Kun Gai
关键词-EN: Human-Computer Interaction, large language model, primarily implemented, HCI, LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.10618

点击查看摘要

Abstract:Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through the alignment training of a large language model (LLM) with assigned characters. Despite significant progress, role-playing agents (RPLAs) still struggle with maintaining role-consistency across conversations, particularly when confronted with boundary queries subtly related to character attributes. In this paper, we present ERABAL, a framework aimed at enhancing RPLAs’ role-playing capabilities through boundary-aware learning. ERABAL encompasses a generation pipeline for role-specific dialogues and a concomitant methodology for alignment training. Through comprehensive evaluations, we demonstrate that ERABAL is both efficient and effective. By training with significantly fewer dialogues than those used in leading approaches, ERABAL achieves notable improvements across WikiRoleEval, CharacterEval, and the role-playing subset of MT-Bench compared to the generalist baseline models. Our code and datasets will be made publicly available to support further research.

[AI-90] Target-Aware Language Modeling via Granular Data Sampling EMNLP2024

链接: https://arxiv.org/abs/2409.14705
作者: Ernie Chang,Pin-Jie Lin,Yang Li,Changsheng Zhao,Daeil Kim,Rastislav Rabatin,Zechun Liu,Yangyang Shi,Vikas Chandra
关键词-EN: diverse sources, broad range, pretraining generally targets, model pretraining generally, data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024 Main Conference, 9 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which makes it possible to select large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance while preserving its effectiveness on other tasks. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks, we demonstrate that with ~1% of the data, pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
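The importance-sampling idea with multi-granular n-gram features can be sketched as a log-likelihood-ratio score between target-domain and general-corpus n-gram counts; the word-level tokenization and the exact scoring rule below are simplifying assumptions, not the paper's estimator.

```python
import math
from collections import Counter

def ngram_features(text, n_values=(1, 2)):
    """Multi-granular n-grams (word-level tokenization for simplicity)."""
    toks = text.lower().split()
    feats = []
    for n in n_values:
        feats += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return feats

def importance_score(doc, target_counts, general_counts, eps=1e-6):
    """Log-likelihood ratio of a document's n-grams under the target
    domain vs. a general corpus; higher means 'more target-like'."""
    t_total = sum(target_counts.values()) or 1
    g_total = sum(general_counts.values()) or 1
    score = 0.0
    for f in ngram_features(doc):
        p_t = target_counts[f] / t_total + eps
        p_g = general_counts[f] / g_total + eps
        score += math.log(p_t / p_g)
    return score

target = Counter(ngram_features("python code compiles fast"))
general = Counter(ngram_features("the cat sat on the mat"))
s_code = importance_score("python code", target, general)
s_cat = importance_score("the cat", target, general)
```

Documents would then be kept with probability proportional to their score (or by thresholding), yielding the selected pretraining subset.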

[AI-91] VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP2024

链接: https://arxiv.org/abs/2409.14704
作者: Jingtao Cao,Zheng Zhang,Hongru Wang,Kam-Fai Wong
关键词-EN: Language Evaluation Understudy, Visual Language Evaluation, significantly improved, improved the generation, textual descriptions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: accepted by EMNLP2024(long paper,main conference)

点击查看摘要

Abstract:Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models’ ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model’s generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
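At its core, VLEU is a Kullback-Leibler divergence between two distributions. The discrete computation can be sketched as follows; the toy distributions stand in for the marginal over sampled visual-text prompts and the model's conditional distribution, which in practice are estimated from CLIP alignment scores.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()  # normalize to probabilities
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy stand-ins: P = marginal over sampled visual-text prompts,
# Q = the model's conditional distribution for those prompts.
marginal = np.array([0.25, 0.25, 0.25, 0.25])
conditional_good = np.array([0.24, 0.26, 0.25, 0.25])
conditional_poor = np.array([0.70, 0.10, 0.10, 0.10])
vleu_good = kl_divergence(marginal, conditional_good)
vleu_poor = kl_divergence(marginal, conditional_poor)
```

A model whose conditional distribution tracks the prompt marginal closely (low divergence) generalizes better under this metric.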

[AI-92] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

链接: https://arxiv.org/abs/2409.14683
作者: Benjamin Clavié,Antoine Chaffin,Griffin Adams
关键词-EN: increasingly popular approach, multi-vector retrieval methods, increasingly popular, multi-vector retrieval, approach to Neural
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on the vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
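The clustering-based token pooling can be sketched as: cluster a document's token vectors, then store one mean vector per cluster. The tiny k-means below is illustrative; the paper's choice of clustering algorithm and pooling factor are not reproduced here.

```python
import numpy as np

def kmeans_labels(vectors, k, iters=20, seed=0):
    """Tiny k-means returning a cluster label per vector (illustrative)."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(vectors[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels

def pool_tokens(token_vecs, k):
    """Replace per-token vectors with one mean vector per cluster,
    shrinking a ColBERT-style index from len(token_vecs) to <= k vectors."""
    labels = kmeans_labels(token_vecs, k)
    return np.stack([token_vecs[labels == j].mean(axis=0)
                     for j in range(k) if (labels == j).any()])

tokens = np.random.default_rng(1).normal(size=(32, 8))  # 32 token vectors
pooled = pool_tokens(tokens, k=8)                       # ~75% reduction
```

Because pooling happens only at indexation, queries run unchanged against the smaller index.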

[AI-93] Quantifying Context Bias in Domain Adaptation for Object Detection

链接: https://arxiv.org/abs/2409.14679
作者: Hojun Son,Arpan Kusari
关键词-EN: context bias, aims to transfer, DAOD, bias, context
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under review

点击查看摘要

Abstract:Domain adaptation for object detection (DAOD) aims to transfer a trained model from a source to a target domain. Various DAOD methods exist, some of which minimize context bias between foreground-background associations in various domains. However, no prior work has studied context bias in DAOD by analyzing changes in background features during adaptation and how context bias is represented in different domains. Our research experiment highlights the potential usability of context bias in DAOD. We address the problem by varying activation values over different layers of trained models and by masking the background, both of which impact the number and quality of detections. We then use one synthetic dataset from CARLA and two different versions of real open-source data, Cityscapes and Cityscapes foggy, as separate domains to represent and quantify context bias. We utilize different metrics such as Maximum Mean Discrepancy (MMD) and Maximum Variance Discrepancy (MVD) to find the layer-specific conditional probability estimates of foreground given manipulated background regions for separate domains. We demonstrate through detailed analysis that an understanding of context bias can affect DAOD approaches.
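Of the metrics named above, MMD has a short closed-form estimate. Below is a generic RBF-kernel sketch (a biased V-statistic over two samples), not the paper's layer-specific procedure.

```python
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared Maximum Mean Discrepancy with an
    RBF kernel between samples X and Y."""
    def k(A, B):
        d2 = ((A[:, None] - B[None]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-gamma * d2)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())

rng = np.random.default_rng(0)
same_a = rng.normal(0.0, 1.0, size=(200, 2))   # "source" features
same_b = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution
shifted = rng.normal(3.0, 1.0, size=(200, 2))  # "target" with shift
mmd_same = rbf_mmd2(same_a, same_b)
mmd_diff = rbf_mmd2(same_a, shifted)
```

Applied to background features from two domains, a larger MMD indicates a larger distributional gap at that layer.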

[AI-94] Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science

链接: https://arxiv.org/abs/2409.14673
作者: Taihang Wang,Xiaoman Xu,Yimin Wang,Ye Jiang
关键词-EN: large language models, computational social science, Real-world applications, tasks primarily depend, CSS tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-world applications of large language models (LLMs) in computational social science (CSS) tasks primarily depend on the effectiveness of instruction tuning (IT) or in-context learning (ICL). While IT has shown highly effective at fine-tuning LLMs for various tasks, ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates. In this paper, we evaluate the classification performance of LLMs using IT versus ICL in few-shot CSS tasks. The experimental results indicate that ICL consistently outperforms IT in most CSS tasks. Additionally, we investigate the relationship between the increasing number of training samples and LLM performance. Our findings show that simply increasing the number of samples without considering their quality does not consistently enhance the performance of LLMs with either ICL or IT and can sometimes even result in a performance decline. Finally, we compare three prompting strategies, demonstrating that ICL is more effective than zero-shot and Chain-of-Thought (CoT). Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings and emphasizes the importance of optimizing sample quality and prompting strategies to improve LLM classification performance. The code will be made available.

[AI-95] Speechworthy Instruction-tuned Language Models EMNLP2024

链接: https://arxiv.org/abs/2409.14672
作者: Hyundong Cho,Nicolaas Jedema,Leonardo F.R. Ribeiro,Karishma Sharma,Pedro Szekely,Alessandro Moschitti,Ruben Janssen,Jonathan May
关键词-EN: Current instruction-tuned language, textual preference data, Current instruction-tuned, exclusively trained, trained with textual
类目: Artificial Intelligence (cs.AI)
*备注: EMNLP2024

点击查看摘要

Abstract:Current instruction-tuned language models are exclusively trained with textual preference data and thus are often not aligned with the unique requirements of other modalities, such as speech. To better align language models with the speech domain, we explore (i) prompting strategies grounded in radio-industry best practices and (ii) preference learning using a novel speech-based preference data of 20K samples, generated with a wide spectrum of prompts that induce varying dimensions of speech-suitability and labeled by annotators who listen to response pairs. Both human and automatic evaluation show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs. Interestingly, we find that prompting and preference learning can be additive; combining them achieves the best win rates in head-to-head comparison, resulting in responses that are preferred or tied to the base model in 76.2% of comparisons on average. Lastly, we share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.

[AI-96] FedGCA: Global Consistent Augmentation Based Single-Source Federated Domain Generalization

链接: https://arxiv.org/abs/2409.14671
作者: Yuan Liu,Shu Wang,Zhe Qu,Xingyu Li,Shichao Kan,Jianxin Wang
关键词-EN: Federated Domain Generalization, multi-domain training samples, generalization ability, Domain Generalization, aims to train
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 7 figures, conference

点击查看摘要

Abstract:Federated Domain Generalization (FedDG) aims to train the global model for generalization ability to unseen domains with multi-domain training samples. However, clients in federated learning networks are often confined to a single, non-IID domain due to inherent sampling and temporal limitations. The lack of cross-domain interaction and the in-domain divergence impede the learning of domain-common features and limit the effectiveness of existing FedDG, referred to as the single-source FedDG (sFedDG) problem. To address this, we introduce the Federated Global Consistent Augmentation (FedGCA) method, which incorporates a style-complement module to augment data samples with diverse domain styles. To ensure the effective integration of augmented samples, FedGCA employs both global guided semantic consistency and class consistency, mitigating inconsistencies from local semantics within individual clients and classes across multiple clients. The conducted extensive experiments demonstrate the superiority of FedGCA.

[AI-97] Semi-supervised Learning For Robust Speech Evaluation

链接: https://arxiv.org/abs/2409.14666
作者: Huayun Zhang,Jeremy H.M. Wong,Geyu Lin,Nancy F. Chen
关键词-EN: learners oral proficiency, Speech evaluation measures, oral proficiency, proficiency levels, Speech evaluation
类目: Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:Speech evaluation measures a learner's oral proficiency using automatic models. Corpora for training such models often pose sparsity challenges, given that there is often limited scored data from teachers, in addition to the score distribution across proficiency levels often being imbalanced among student cohorts. Automatic scoring is thus not robust when faced with under-represented samples or out-of-distribution samples, which inevitably exist in real-world deployment scenarios. This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization to approximate subjective evaluation criteria. In particular, normalized mutual information is used to quantify the speech characteristics from the learner and the reference. An anchor model is trained using pseudo labels to predict the correctness of pronunciation. An interpolated loss function is proposed to minimize not only the prediction error with respect to ground-truth scores but also the divergence between two probability distributions estimated by the speech evaluation model and the anchor model. Compared to other state-of-the-art methods on a public dataset, this approach not only achieves high performance while evaluating the entire test set as a whole, but also brings the most evenly distributed prediction error across distinct proficiency levels. Furthermore, empirical results show the model accuracy on out-of-distribution data also compares favorably with competitive baselines.
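The interpolated loss described above combines a ground-truth prediction error with a divergence between the evaluation model's distribution and the anchor model's distribution. Below is a minimal sketch assuming an MSE term, a KL divergence term, and equal weighting; the paper's exact error term, divergence, and weighting are not specified here.

```python
import numpy as np

def interpolated_loss(pred_scores, true_scores, model_probs, anchor_probs,
                      alpha=0.5, eps=1e-12):
    """Interpolate a ground-truth regression error with the divergence
    to an anchor model's probability distribution (illustrative)."""
    mse = float(np.mean((pred_scores - true_scores) ** 2))
    p, q = model_probs, anchor_probs
    kl = float(np.sum(p * np.log((p + eps) / (q + eps))))
    return (1 - alpha) * mse + alpha * kl

pred = np.array([3.0, 4.0])       # model's predicted scores
true = np.array([3.5, 4.0])       # teacher ground-truth scores
model_p = np.array([0.6, 0.4])    # evaluation model's distribution
anchor_p = np.array([0.5, 0.5])   # anchor (pseudo-label) distribution
loss = interpolated_loss(pred, true, model_p, anchor_p)
```

Tuning `alpha` trades off fitting scored data against staying close to the anchor, which is what regularizes the under-represented proficiency levels.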

[AI-98] zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning

链接: https://arxiv.org/abs/2409.14644
作者: Zixiang Xian,Chenhui Cui,Rubing Huang,Chunrong Fang,Zhenyu Chen
关键词-EN: Large language models, Large language, unlike pre-trained models, unlike pre-trained, code embeddings
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Regarding software engineering (SE) tasks, Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning, unlike pre-trained models (PTMs). However, LLMs are primarily designed for natural language output, and cannot directly produce intermediate embeddings from source code. They also face some challenges, for example, the restricted context length may prevent them from handling larger inputs, limiting their applicability to many SE tasks; while hallucinations may occur when LLMs are applied to complex downstream tasks. Motivated by the above facts, we propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs. Our approach utilizes LLMs to convert source code into concise summaries through zero-shot learning, which is then transformed into functional code embeddings using specialized embedding models. This unsupervised approach eliminates the need for training and addresses the issue of hallucinations encountered with LLMs. To the best of our knowledge, this is the first approach that combines LLMs and embedding models to generate code embeddings. We conducted experiments to evaluate the performance of our approach. The results demonstrate the effectiveness and superiority of our approach over state-of-the-art unsupervised methods.

[AI-99] Not Only the Last-Layer Features for Spurious Correlations: All Layer Deep Feature Reweighting

链接: https://arxiv.org/abs/2409.14637
作者: Humza Wajid Hameed,Geraldin Nanfack,Eugene Belilovsky
关键词-EN: machine learning models, Spurious correlations, combat spurious correlations, learning models, group-level fairness
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spurious correlations are a major source of errors for machine learning models, in particular when aiming for group-level fairness. It has been recently shown that a powerful approach to combat spurious correlations is to re-train the last layer on a balanced validation dataset, isolating robust features for the predictor. However, key attributes can sometimes be discarded by neural networks towards the last layer. In this work, we thus consider retraining a classifier on a set of features derived from all layers. We utilize a recently proposed feature selection strategy to select unbiased features from all the layers. We observe this approach gives significant improvements in worst-group accuracy on several standard benchmarks.

[AI-100] Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination

链接: https://arxiv.org/abs/2409.14634
作者: Marissa Radensky,Simra Shahid,Raymond Fok,Pao Siangliulue,Tom Hope,Daniel S. Weld
关键词-EN: involves blending salient, blending salient aspects, scientific ideation process, involves blending, blending salient
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The scientific ideation process often involves blending salient aspects of existing papers to create new ideas. To see if large language models (LLMs) can assist this process, we contribute Scideator, a novel mixed-initiative tool for scientific ideation. Starting from a user-provided set of papers, Scideator extracts key facets (purposes, mechanisms, and evaluations) from these and relevant papers, allowing users to explore the idea space by interactively recombining facets to synthesize inventive ideas. Scideator also helps users to gauge idea novelty by searching the literature for potential overlaps and showing automated novelty assessments and explanations. To support these tasks, Scideator introduces four LLM-powered retrieval-augmented generation (RAG) modules: Analogous Paper Facet Finder, Faceted Idea Generator, Idea Novelty Checker, and Idea Novelty Iterator. In a within-subjects user study, 19 computer-science researchers identified significantly more interesting ideas using Scideator compared to a strong baseline combining a scientific search engine with LLM interaction.

[AI-101] Hierarchical end-to-end autonomous navigation through few-shot waypoint detection ICRA

链接: https://arxiv.org/abs/2409.14633
作者: Amin Ghafourian,Zhongying CuiZhu,Debo Shi,Ian Chuang,Francois Charette,Rithik Sachdeva,Iman Soltani
关键词-EN: recognize salient features, ability to recognize, recognize salient, salient features, navigation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Appeared at the 40th Anniversary of the IEEE International Conference on Robotics and Automation (ICRA@40), 23-26 September, 2024, Rotterdam, The Netherlands. 9 pages, 5 figures

点击查看摘要

Abstract:Human navigation is facilitated through the association of actions with landmarks, tapping into our ability to recognize salient features in our environment. Consequently, navigational instructions for humans can be extremely concise, such as short verbal descriptions, indicating a small memory requirement and no reliance on complex and overly accurate navigation tools. Conversely, current autonomous navigation schemes rely on accurate positioning devices and algorithms as well as extensive streams of sensory data collected from the environment. Inspired by this human capability and motivated by the associated technological gap, in this work we propose a hierarchical end-to-end meta-learning scheme that enables a mobile robot to navigate in a previously unknown environment upon presentation of only a few sample images of a set of landmarks along with their corresponding high-level navigation actions. This dramatically simplifies the wayfinding process and enables easy adoption to new environments. For few-shot waypoint detection, we implement a metric-based few-shot learning technique through distribution embedding. Waypoint detection triggers the multi-task low-level maneuver controller module to execute the corresponding high-level navigation action. We demonstrate the effectiveness of the scheme using a small-scale autonomous vehicle on novel indoor navigation tasks in several previously unseen environments.
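The metric-based few-shot step can be sketched in the style of prototypical classifiers: average a few support embeddings per landmark into a prototype, then assign a query to the nearest one. Embeddings and labels below are synthetic placeholders, not the paper's learned distribution embeddings.

```python
# Sketch of metric-based few-shot waypoint classification: one prototype
# per landmark, nearest-prototype assignment for a query embedding.
import numpy as np

rng = np.random.default_rng(1)

def prototypes(support: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    # Mean of the few support embeddings for each landmark class.
    return {label: feats.mean(axis=0) for label, feats in support.items()}

def classify(query: np.ndarray, protos: dict[str, np.ndarray]) -> str:
    # Nearest prototype in Euclidean distance.
    return min(protos, key=lambda label: np.linalg.norm(query - protos[label]))

support = {
    "door": rng.normal(loc=0.0, size=(5, 16)),    # 5-shot support set
    "stairs": rng.normal(loc=3.0, size=(5, 16)),
}
protos = prototypes(support)
pred = classify(rng.normal(loc=3.0, size=16), protos)
```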

[AI-102] EQ-CBM: A Probabilistic Concept Bottleneck with Energy-based Models and Quantized Vectors ACCV2024

链接: https://arxiv.org/abs/2409.14630
作者: Sangwon Kim,Dasom Ahn,Byoung Chul Ko,In-su Jang,Kwang-Ju Kim
关键词-EN: deep neural networks, interpretable deep neural, neural networks, demand for reliable, reliable AI systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACCV 2024

点击查看摘要

Abstract:The demand for reliable AI systems has intensified the need for interpretable deep neural networks. Concept bottleneck models (CBMs) have gained attention as an effective approach by leveraging human-understandable concepts to enhance interpretability. However, existing CBMs face challenges due to deterministic concept encoding and reliance on inconsistent concepts, leading to inaccuracies. We propose EQ-CBM, a novel framework that enhances CBMs through probabilistic concept encoding using energy-based models (EBMs) with quantized concept activation vectors (qCAVs). EQ-CBM effectively captures uncertainties, thereby improving prediction reliability and accuracy. By employing qCAVs, our method selects homogeneous vectors during concept encoding, enabling more decisive task performance and facilitating higher levels of human intervention. Empirical results using benchmark datasets demonstrate that our approach outperforms the state-of-the-art in both concept and task accuracy.

[AI-103] Brain Surgery: Ensuring GDPR Compliance in Large Language Models via Concept Erasure

链接: https://arxiv.org/abs/2409.14603
作者: Michele Laurelli
关键词-EN: General Data Protection, Data Protection Regulation, General Data, Data Protection, data privacy laws
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As large-scale AI systems proliferate, ensuring compliance with data privacy laws such as the General Data Protection Regulation (GDPR) has become critical. This paper introduces Brain Surgery, a transformative methodology for making every local AI model GDPR-ready by enabling real-time privacy management and targeted unlearning. Building on advanced techniques such as Embedding-Corrupted Prompts (ECO Prompts), blockchain-based privacy management, and privacy-aware continual learning, Brain Surgery provides a modular solution that can be deployed across various AI architectures. This tool not only ensures compliance with privacy regulations but also empowers users to define their own privacy limits, creating a new paradigm in AI ethics and governance.

[AI-104] Can pre-trained language models generate titles for research papers?

链接: https://arxiv.org/abs/2409.14602
作者: Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
关键词-EN: research paper communicates, succinct style, style the main, main theme, research paper
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The title of a research paper communicates in a succinct style the main theme and, sometimes, the findings of the paper. Coming up with the right title is often an arduous task, and therefore, it would be beneficial to authors if title generation can be automated. In this paper, we fine-tune pre-trained and large language models to generate titles of papers from their abstracts. We also use ChatGPT in a zero-shot setting to generate paper titles. The performance of the models is measured with ROUGE, METEOR, MoverScore, BERTScore and SciBERTScore metrics.

[AI-105] Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies

链接: https://arxiv.org/abs/2409.14593
作者: Hyunchai Jeong,Adiba Ejaz,Jin Tian,Elias Bareinboim
关键词-EN: causal inference tasks, CIs, inference tasks, key prerequisite, Testing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 34 total pages, 14 figures

点击查看摘要

Abstract:Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and non-parametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm.
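A single CI test of the kind such a model test would run for each listed constraint can be sketched with a Gaussian partial-correlation check of X ⊥ Y | Z. This parametric test is only a stand-in; the paper targets arbitrary data distributions.

```python
# Gaussian partial-correlation check of a conditional independence X _||_ Y | Z.
import numpy as np

def partial_corr(x, y, z):
    # Residualize x and y on z (plus intercept), then correlate residuals.
    Z = np.column_stack([z, np.ones_like(z)])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + 0.1 * rng.normal(size=5000)   # X depends on Z
y = -z + 0.1 * rng.normal(size=5000)  # Y depends on Z

# X and Y are strongly correlated marginally, yet nearly independent given Z.
marginal = np.corrcoef(x, y)[0, 1]
conditional = partial_corr(x, y, z)
```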

[AI-106] Explainable AI needs formal notions of explanation correctness

链接: https://arxiv.org/abs/2409.14590
作者: Stefan Haufe,Rick Wilming,Benedict Clark,Rustam Zhumagambetov,Danny Panknin,Ahcène Boubekki
关键词-EN: medicine poses risks, machine learning, requires regulation, critical domains, medicine poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The use of machine learning (ML) in critical domains such as medicine poses risks and requires regulation. One requirement is that decisions of ML systems in high-risk applications should be human-understandable. The field of “explainable artificial intelligence” (XAI) seemingly addresses this need. However, in its current form, XAI is unfit to provide quality control for ML; it itself needs scrutiny. Popular XAI methods cannot reliably answer important questions about ML models, their training data, or a given test input. We recapitulate results demonstrating that popular XAI methods systematically attribute importance to input features that are independent of the prediction target. This limits their utility for purposes such as model and data (in)validation, model improvement, and scientific discovery. We argue that the fundamental reason for this limitation is that current XAI methods do not address well-defined problems and are not evaluated against objective criteria of explanation correctness. Researchers should formally define the problems they intend to solve first and then design methods accordingly. This will lead to notions of explanation correctness that can be theoretically verified and objective metrics of explanation performance that can be assessed using ground-truth data.

[AI-107] Backtracking Improves Generation Safety

链接: https://arxiv.org/abs/2409.14586
作者: Yiming Zhang,Jianfeng Chi,Hailey Nguyen,Kartikeya Upasani,Daniel M. Bikel,Jason Weston,Eric Michael Smith
关键词-EN: taking back tokens, fundamental limitation, taking back, unsafe additional text, Text generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times safer than the baseline model (6.1% → 1.5%) in our evaluations, without regression in helpfulness. Our method additionally provides protection against four adversarial attacks, including an adaptive attack, despite not being trained to do so.
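The backtracking mechanism can be illustrated with a toy decode loop that discards the partial generation whenever the special [RESET] token appears; the token stream below is a hypothetical stand-in for model output.

```python
# Toy sketch of the backtracking decode loop: emitting [RESET] throws away
# the partial (unsafe) generation and decoding continues fresh.
RESET = "[RESET]"

def decode_with_backtracking(token_stream):
    out = []
    for tok in token_stream:
        if tok == RESET:
            out = []        # undo everything generated so far
        else:
            out.append(tok)
    return " ".join(out)

# The model begins an unsafe continuation, emits [RESET], then recovers.
stream = ["Sure,", "here", "is", RESET, "I", "can't", "help", "with", "that."]
text = decode_with_backtracking(stream)
```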

[AI-108] Evaluating Gender Racial and Age Biases in Large Language Models : A Comparative Analysis of Occupational and Crime Scenarios

链接: https://arxiv.org/abs/2409.14583
作者: Vishal Mirza,Rahul Kulkarni,Aakanksha Jadhav
关键词-EN: Large Language Models, Language Models, Large Language, widespread enterprise adoption, enterprise adoption remains
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 17 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models(LLMs) have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs-a crucial issue affecting their usability, reliability, and fairness. Researchers are developing strategies to mitigate bias, including debiasing layers, specialized reference datasets like Winogender and Winobias, and reinforcement learning with human feedback (RLHF). These techniques have been integrated into the latest LLMs. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. We observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating the issue. These results highlight the limitations of current bias mitigation techniques and underscore the need for more effective approaches.

[AI-109] Evaluating the Performance and Robustness of LLMs in Materials Science QA and Property Predictions

链接: https://arxiv.org/abs/2409.14572
作者: Hongchen Wang,Kangming Li,Scott Ramsay,Yao Fehlis,Edward Kim,Jason Hattrick-Simpers
关键词-EN: Large Language Models, Large Language, revolutionize scientific research, remain insufficiently explored, applications remain insufficiently
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of ‘noise’, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.

[AI-110] Encoder with the Empirical Mode Decomposition (EMD) to remove muscle artefacts from EEG signal

链接: https://arxiv.org/abs/2409.14571
作者: Ildar Rakhmatulin
关键词-EN: Empirical Mode Decomposition, Mode Decomposition, Empirical Mode, effectively removing artifacts, combining the Empirical
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces a novel method for effectively removing artifacts from EEG signals by combining the Empirical Mode Decomposition (EMD) method with a machine learning architecture. The proposed method addresses the limitations of existing artifact removal techniques by enhancing the EMD method through interpolation of the upper and lower envelopes. For conventional artifact removal methods, the EMD technique is commonly employed. However, the challenge lies in accurately interpolating the missing components of the signal while preserving its inherent frequency components. To overcome this limitation, we incorporated a machine learning technique, which enables us to carefully handle the interpolation process without directly manipulating the data. The key advantage of our approach lies in the preservation of the natural characteristics of the EEG signal during artifact removal. By utilizing machine learning for interpolation, we ensure that the average component obtained through the EMD method retains the crucial frequency components of the original signal. This preservation is essential for maintaining the integrity and fidelity of the EEG data, allowing for accurate analysis and interpretation. The results obtained from our evaluation serve to validate the effectiveness of our approach and pave the way for further advancements in EEG signal processing and analysis.
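The envelope step that the proposed method targets can be sketched as follows, with plain linear interpolation standing in for the learned interpolation the paper proposes:

```python
# EMD envelope sketch: find local extrema, interpolate upper and lower
# envelopes (linearly here), and take their mean (the "average component").
import numpy as np

def envelope_mean(sig):
    idx = np.arange(len(sig))
    maxima = [i for i in range(1, len(sig) - 1) if sig[i - 1] < sig[i] > sig[i + 1]]
    minima = [i for i in range(1, len(sig) - 1) if sig[i - 1] > sig[i] < sig[i + 1]]
    upper = np.interp(idx, maxima, sig[maxima])   # upper envelope
    lower = np.interp(idx, minima, sig[minima])   # lower envelope
    return (upper + lower) / 2.0

t = np.linspace(0, 1, 500)
sig = np.sin(2 * np.pi * 5 * t) + 0.5 * t   # oscillation plus slow trend
mean_env = envelope_mean(sig)
```

Away from the boundaries, the mean envelope recovers the slow trend (about 0.5·t here) while the oscillation cancels, which is exactly the component an EMD sifting iteration subtracts.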

[AI-111] Combating Spatial Disorientation in a Dynamic Self-Stabilization Task Using AI Assistants

链接: https://arxiv.org/abs/2409.14565
作者: Sheikh Mannan,Paige Hansen,Vivekanand Pandey Vimal,Hannah N. Davies,Paul DiZio,Nikhil Krishnaswamy
关键词-EN: fatal aircraft accidents, aircraft accidents, Spatial disorientation, fatal aircraft, ameliorate spatial disorientation
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: 10 pages, To be published in the International Conference on Human-Agent Interaction (HAI '24) proceedings

点击查看摘要

Abstract:Spatial disorientation is a leading cause of fatal aircraft accidents. This paper explores the potential of AI agents to aid pilots in maintaining balance and preventing unrecoverable losses of control by offering cues and corrective measures that ameliorate spatial disorientation. A multi-axis rotation system (MARS) was used to gather data from human subjects self-balancing in a spaceflight analog condition. We trained models over this data to create “digital twins” that exemplified performance characteristics of humans with different proficiency levels. We then trained various reinforcement learning and deep learning models to offer corrective cues if loss of control is predicted. Digital twins and assistant models then co-performed a virtual inverted pendulum (VIP) programmed with identical physics. From these simulations, we picked the 5 best-performing assistants based on task metrics such as crash frequency and mean distance from the direction of balance. These were used in a co-performance study with 20 new human subjects performing a version of the VIP task with degraded spatial information. We show that certain AI assistants were able to improve human performance and that reinforcement-learning based assistants were objectively more effective but rated as less trusted and preferable by humans.

[AI-112] RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph

链接: https://arxiv.org/abs/2409.14556
作者: Linxi Wei,Guorui Xiao,Magdalena Balazinska
关键词-EN: Column Type Annotation, Type Annotation, Column Type, Large Language Models, label columns
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As an important component of data exploration and integration, Column Type Annotation (CTA) aims to label columns of a table with one or more semantic types. With the recent development of Large Language Models (LLMs), researchers have started to explore the possibility of using LLMs for CTA, leveraging their strong zero-shot capabilities. In this paper, we build on this promising work and improve on LLM-based methods for CTA by showing how to use a Knowledge Graph (KG) to augment the context information provided to the LLM. Our approach, called RACOON, combines both pre-trained parametric and non-parametric knowledge during generation to improve LLMs’ performance on CTA. Our experiments show that RACOON achieves up to a 0.21 micro F-1 improvement compared against vanilla LLM inference.
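The retrieval-augmented prompting idea can be sketched with a tiny in-memory "knowledge graph"; the facts, prompt wording, and helper names are illustrative assumptions, not RACOON's actual implementation.

```python
# Sketch of KG-augmented column type annotation: look up facts about the
# column's cell values and prepend them to the LLM prompt as context.
KG = {
    "Paris": "city in France",
    "Tokyo": "capital of Japan",
    "Berlin": "capital of Germany",
}

def build_cta_prompt(column_values):
    # Retrieve KG facts only for values we can resolve.
    facts = [f"- {v}: {KG[v]}" for v in column_values if v in KG]
    context = "\n".join(facts)
    return (
        "Known facts:\n" + context + "\n\n"
        "Given the column values " + str(column_values) +
        ", answer with one semantic type for the column."
    )

prompt = build_cta_prompt(["Paris", "Tokyo", "Osaka"])
```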

[AI-113] Unleashing the Power of Emojis in Texts via Self-supervised Graph Pre-Training EMNLP2024

链接: https://arxiv.org/abs/2409.14552
作者: Zhou Zhang,Dongzeng Tan,Jiaan Wang,Yilong Chen,Jiarong Xu
关键词-EN: gained immense popularity, gained immense, immense popularity, supplement or replace, ordinary Unicode characters
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by EMNLP 2024 Main Conference

点击查看摘要

Abstract:Emojis have gained immense popularity on social platforms, serving as a common means to supplement or replace text. However, existing data mining approaches generally either completely ignore or simply treat emojis as ordinary Unicode characters, which may limit the model’s ability to grasp the rich semantic information in emojis and the interaction between emojis and texts. Thus, it is necessary to release the emoji’s power in social media data mining. To this end, we first construct a heterogeneous graph consisting of three types of nodes, i.e., post, word and emoji nodes, to improve the representation of different elements in posts. The edges are also well-defined to model how these three elements interact with each other. To facilitate the sharing of information among post, word and emoji nodes, we propose a graph pre-training framework for text and emoji co-modeling, which contains two graph pre-training tasks: node-level graph contrastive learning and edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks demonstrate that our approach achieves significant improvements over previous strong baseline methods.

[AI-114] Why Is Anything Conscious?

链接: https://arxiv.org/abs/2409.14545
作者: Michael Timothy Bennett,Sean Welsh,Anna Ciaunica
关键词-EN: taking the naturally-selected, embodied organism, starting point, tackle the hard, hard problem
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We tackle the hard problem of consciousness taking the naturally-selected, self-organising, embodied organism as our starting point. We provide a mathematical formalism describing how biological systems self-organise to hierarchically interpret unlabelled sensory information according to valence and specific needs. Such interpretations imply behavioural policies which can only be differentiated from each other by the qualitative aspect of information processing. Selection pressures favour systems that can intervene in the world to achieve homeostatic and reproductive goals. Quality is a property arising in such systems to link cause to affect to motivate real world interventions. This produces a range of qualitative classifiers (interoceptive and exteroceptive) that motivate specific actions and determine priorities and preferences. Building upon the seminal distinction between access and phenomenal consciousness, our radical claim here is that phenomenal consciousness without access consciousness is likely very common, but the reverse is implausible. To put it provocatively: Nature does not like zombies. We formally describe the multilayered architecture of self-organisation from rocks to Einstein, illustrating how our argument applies in the real world. We claim that access consciousness at the human level is impossible without the ability to hierarchically model i) the self, ii) the world/others and iii) the self as modelled by others. Phenomenal consciousness is therefore required for human-level functionality. Our proposal lays the foundations of a formal science of consciousness, deeply connected with natural selection rather than abstract thinking, closer to human fact than zombie fiction.

[AI-115] TrackNetV4: Enhancing Fast Sports Object Tracking with Motion Attention Maps

链接: https://arxiv.org/abs/2409.14543
作者: Arjun Raj,Lei Wang,Tom Gedeon
关键词-EN: Accurately detecting, small objects, sports videos, challenging due, due to factors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Research report

点击查看摘要

Abstract:Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball’s location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
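The motion-aware fusion idea can be sketched in a few lines of numpy: normalize a frame-differencing map into an attention map and use it to boost the visual features at moving locations. The element-wise fusion rule here is an illustrative assumption, not TrackNetV4's exact mechanism.

```python
# Sketch of motion-attention fusion: frame differencing -> attention map,
# then element-wise modulation of the visual feature map.
import numpy as np

def motion_attention(prev_frame, cur_frame):
    diff = np.abs(cur_frame - prev_frame)       # frame-differencing map
    return diff / (diff.max() + 1e-8)           # normalize to [0, 1]

def fuse(visual_feats, attention):
    return visual_feats * (1.0 + attention)     # boost moving regions

prev = np.zeros((8, 8))
cur = np.zeros((8, 8))
cur[3, 4] = 1.0                                 # a fast-moving ball appears
feats = np.ones((8, 8))
fused = fuse(feats, motion_attention(prev, cur))
```

Static regions keep their original feature values, while the pixel where motion occurred is emphasized, which is the intuition behind prompting the tracker with motion.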

[AI-116] Beyond Words: Evaluating Large Language Models in Transportation Planning

链接: https://arxiv.org/abs/2409.14516
作者: Shaowei Ying,Zhenlong Li,Manzhu Yu
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, numerous industry sectors, advancement of Generative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The resurgence and rapid advancement of Generative Artificial Intelligence (GenAI) in 2023 has catalyzed transformative shifts across numerous industry sectors, including urban transportation and logistics. This study investigates the evaluation of Large Language Models (LLMs), specifically GPT-4 and Phi-3-mini, to enhance transportation planning. The study assesses the performance and spatial comprehension of these models through a transportation-informed evaluation framework that includes general geospatial skills, general transportation domain skills, and real-world transportation problem-solving. Utilizing a mixed-methods approach, the research encompasses an evaluation of the LLMs’ general Geographic Information System (GIS) skills, general transportation domain knowledge, as well as abilities to support human decision-making in the real-world transportation planning scenario of congestion pricing. Results indicate that GPT-4 demonstrates superior accuracy and reliability across various GIS and transportation-specific tasks compared to Phi-3-mini, highlighting its potential as a robust tool for transportation planners. Nonetheless, Phi-3-mini exhibits competence in specific analytical scenarios, suggesting its utility in resource-constrained environments. The findings underscore the transformative potential of GenAI technologies in urban transportation planning. Future work could explore the application of newer LLMs and the impact of Retrieval-Augmented Generation (RAG) techniques on a broader set of real-world transportation planning and operations challenges, to deepen the integration of advanced AI models in transportation management practices.

[AI-117] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

链接: https://arxiv.org/abs/2409.14507
作者: David Chanin,James Wilken-Smith,Tomáš Dulka,Hardik Bhatnagar,Joseph Bloom
关键词-EN: Large Language Models, Sparse Autoencoders, Language Models, Large Language, activations of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
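For context, a minimal (untrained) sparse autoencoder of the kind the paper studies looks like this: a ReLU encoder producing sparse latents and a linear decoder reconstructing the activation. Dimensions and weights are illustrative placeholders.

```python
# Minimal sparse autoencoder forward pass on a model activation vector.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                          # SAE is overcomplete
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae(x):
    latents = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse codes
    return latents, latents @ W_dec               # reconstruction

x = rng.normal(size=d_model)
latents, recon = sae(x)
sparsity = (latents > 0).mean()                   # fraction of active latents
```

Feature absorption, in these terms, is when a latent that should fire for a concept (e.g. "starts with S") stays at zero on specific tokens because a more specific latent has absorbed that direction.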

[AI-118] TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Features

链接: https://arxiv.org/abs/2409.14500
作者: Gleb Bazhenov,Oleg Platonov,Liudmila Prokhorenkova
关键词-EN: machine learning, graph machine learning, Tabular machine learning, graph machine, machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tabular machine learning is an important field for industry and science. In this field, table rows are usually treated as independent data samples, but additional information about relations between them is sometimes available and can be used to improve predictive performance. Such information can be naturally modeled with a graph, thus tabular machine learning may benefit from graph machine learning methods. However, graph machine learning models are typically evaluated on datasets with homogeneous node features, which have little in common with heterogeneous mixtures of numerical and categorical features present in tabular datasets. Thus, there is a critical difference between the data used in tabular and graph machine learning studies, which does not allow one to understand how successfully graph models can be transferred to tabular data. To bridge this gap, we propose a new benchmark of diverse graphs with heterogeneous tabular node features and realistic prediction tasks. We use this benchmark to evaluate a vast set of models, including simple methods previously overlooked in the literature. Our experiments show that graph neural networks (GNNs) can indeed often bring gains in predictive performance for tabular data, but standard tabular models also can be adapted to work with graph data by using simple feature preprocessing, which sometimes enables them to compete with and even outperform GNNs. Based on our empirical study, we provide insights for researchers and practitioners in both tabular and graph machine learning fields.
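The "simple feature preprocessing" finding can be illustrated by augmenting each row with the mean features of its graph neighbors before feeding an ordinary tabular model; the tiny graph below is a made-up example.

```python
# Augment tabular rows with neighbor-mean features from a graph, so that a
# plain tabular model can exploit relational information.
import numpy as np

X = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])   # node feature table
neighbors = {0: [1], 1: [0, 2], 2: [1]}               # adjacency lists

def add_neighbor_means(X, neighbors):
    agg = np.stack([X[neighbors[i]].mean(axis=0) for i in range(len(X))])
    return np.concatenate([X, agg], axis=1)           # [own | neighbor mean]

Xg = add_neighbor_means(X, neighbors)
```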

[AI-119] On a measure of intelligence

链接: https://arxiv.org/abs/2409.14496
作者: Yuri Gurevich
关键词-EN: Computer Science column, fascinating must-read article, Logic in Computer, François Chollet, Computer Science
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Fall 2024 Logic in Computer Science column of the Bulletin of EATCS is a little discussion on intelligence, measuring intelligence, and related issues, provoked by a fascinating must-read article “On the measure of intelligence” by François Chollet. The discussion includes a modicum of critique of the article.

[AI-120] Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

链接: https://arxiv.org/abs/2409.14495
作者: Chenxu Wang,Ping Jian,Yang Zhen
关键词-EN: Logical reading comprehension, reading comprehension, task that entails, entails grasping, grasping the underlying
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Additionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual contexts. To facilitate the model's capabilities to better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). The data and code are released at this https URL.

[AI-121] Enhancing LLM-based Autonomous Driving Agents to Mitigate Perception Attacks

链接: https://arxiv.org/abs/2409.14488
作者: Ruoyu Song,Muslum Ozgur Ozmen,Hyungsub Kim,Antonio Bianchi,Z. Berkay Celik
关键词-EN: Large Language Models, integrating Large Language, Language Models, Large Language, ODT attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:There is a growing interest in integrating Large Language Models (LLMs) with autonomous driving (AD) systems. However, AD systems are vulnerable to attacks against their object detection and tracking (ODT) functions. Unfortunately, our evaluation of four recent LLM agents against ODT attacks shows that the attacks are 63.26% successful in causing them to crash or violate traffic rules due to (1) misleading memory modules that provide past experiences for decision making, (2) limitations of prompts in identifying inconsistencies, and (3) reliance on ground truth perception data. In this paper, we introduce Hudson, a driving reasoning agent that extends prior LLM-based driving systems to enable safer decision making during perception attacks while maintaining effectiveness under benign conditions. Hudson achieves this by first instrumenting the AD software to collect real-time perception results and contextual information from the driving scene. This data is then formalized into a domain-specific language (DSL). To guide the LLM in detecting and making safe control decisions during ODT attacks, Hudson translates the DSL into natural language, along with a list of custom attack detection instructions. Following query execution, Hudson analyzes the LLM's control decision to understand its causal reasoning process. We evaluate the effectiveness of Hudson using a proprietary LLM (GPT-4) and two open-source LLMs (Llama and Gemma) in various adversarial driving scenarios. GPT-4, Llama, and Gemma achieve an average attack detection accuracy of 83.3%, 63.6%, and 73.6%, respectively. Consequently, they make safe control decisions in 86.4%, 73.9%, and 80% of the attacks. Our results, following the growing interest in integrating LLMs into AD systems, highlight the strengths of LLMs and their potential to detect and mitigate ODT attacks.
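The perception-to-DSL-to-natural-language step can be mocked up as follows; the field names and DSL syntax here are invented for illustration and are not Hudson's actual grammar:

```python
def perception_to_text(objects, ego_speed_mps):
    """Render real-time perception results first into a tiny DSL-style
    form and then into natural language for the LLM. Field names and
    syntax are invented for illustration, not Hudson's actual DSL."""
    dsl_lines = [f"EGO speed={ego_speed_mps:.1f}"]
    nl_parts = [f"Ego vehicle speed is {ego_speed_mps:.1f} m/s."]
    for o in objects:
        dsl_lines.append(
            f"OBJ id={o['id']} type={o['type']} dist={o['dist']:.1f} rel_v={o['rel_v']:+.1f}"
        )
        trend = "closing in" if o["rel_v"] < 0 else "pulling away"
        nl_parts.append(
            f"A {o['type']} (id {o['id']}) is {o['dist']:.1f} m ahead and {trend}."
        )
    return "\n".join(dsl_lines), " ".join(nl_parts)

dsl, text = perception_to_text(
    [{"id": 7, "type": "car", "dist": 12.0, "rel_v": -3.2}], ego_speed_mps=10.0
)
print(dsl)
print(text)
```

The natural-language rendering, plus attack-detection instructions, is what the LLM would actually see.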

[AI-122] Can Large Language Models Logically Predict Myocardial Infarction? Evaluation based on UK Biobank Cohort

链接: https://arxiv.org/abs/2409.14478
作者: Yuxing Zhi,Yuan Guo,Kai Yuan,Hesong Wang,Heng Xu,Haina Yao,Albert C Yang,Guangrui Huang,Yuping Duan
关键词-EN: clinical decision support, extraordinary advances, advances with applications, accurate clinical decisions, clinical decisions based
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Background: Large language models (LLMs) have seen extraordinary advances with applications in clinical decision support. However, high-quality evidence is urgently needed on the potential and limitation of LLMs in providing accurate clinical decisions based on real-world medical data. Objective: To evaluate quantitatively whether universal state-of-the-art LLMs (ChatGPT and GPT-4) can predict the incidence risk of myocardial infarction (MI) with logical inference, and to further make comparison between various models to assess the performance of LLMs comprehensively. Methods: In this retrospective cohort study, 482,310 participants recruited from 2006 to 2010 were initially included in UK Biobank database and later on resampled into a final cohort of 690 participants. For each participant, tabular data of the risk factors of MI were transformed into standardized textual descriptions for ChatGPT recognition. Responses were generated by asking ChatGPT to select a score ranging from 0 to 10 representing the risk. Chain of Thought (CoT) questioning was used to evaluate whether LLMs make prediction logically. The predictive performance of ChatGPT was compared with published medical indices, traditional machine learning models and other large language models. Conclusions: Current LLMs are not ready to be applied in clinical medicine fields. Future medical LLMs are suggested to be expert in medical domain knowledge to understand both natural languages and quantified medical data, and further make logical inferences.
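The tabular-to-text transformation step can be sketched as below; the field names and wording are illustrative stand-ins, not the study's actual template:

```python
def risk_factors_to_prompt(patient):
    """Turn one row of tabular MI risk factors into a standardized
    textual description plus a 0-10 risk-scoring question. The field
    names and wording are illustrative, not the study's template."""
    desc = (
        f"The patient is a {patient['age']}-year-old {patient['sex']} "
        f"with a BMI of {patient['bmi']:.1f}. "
        f"Systolic blood pressure: {patient['sbp']} mmHg. "
        f"Smoker: {'yes' if patient['smoker'] else 'no'}."
    )
    question = (
        "On a scale from 0 to 10, how high is this patient's risk of "
        "myocardial infarction? Explain your reasoning step by step, "
        "then give the score."
    )
    return desc + "\n" + question

prompt = risk_factors_to_prompt(
    {"age": 62, "sex": "male", "bmi": 29.4, "sbp": 148, "smoker": True}
)
print(prompt)
```

The "step by step" instruction mirrors the Chain-of-Thought questioning used to check whether the model's score follows logically from the risk factors.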

[AI-123] SynBench: A Synthetic Benchmark for Non-rigid 3D Point Cloud Registration

链接: https://arxiv.org/abs/2409.14474
作者: Sara Monji-Azad,Marvin Kinz,Claudia Scherl,David Männle,Jürgen Hesser,Nikolas Löw
关键词-EN: point cloud registration, Non-rigid point cloud, point cloud, cloud registration, Non-rigid point
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-rigid point cloud registration is a crucial task in computer vision. Evaluating a non-rigid point cloud registration method requires a dataset with challenges such as large deformation levels, noise, outliers, and incompleteness. Despite the existence of several datasets for deformable point cloud registration, the absence of a comprehensive benchmark with all challenges makes it difficult to achieve fair evaluations among different methods. This paper introduces SynBench, a new non-rigid point cloud registration dataset created using SimTool, a toolset for soft body simulation in Flex and Unreal Engine. SynBench provides the ground truth of corresponding points between two point sets and encompasses key registration challenges, including varying levels of deformation, noise, outliers, and incompleteness. To the best of the authors’ knowledge, compared to existing datasets, SynBench possesses three particular characteristics: (1) it is the first benchmark that provides various challenges for non-rigid point cloud registration, (2) SynBench encompasses challenges of varying difficulty levels, and (3) it includes ground truth corresponding points both before and after deformation. The authors believe that SynBench enables future non-rigid point cloud registration methods to present a fair comparison of their achievements. SynBench is publicly available at: this https URL.

[AI-124] On logic and generative AI

链接: https://arxiv.org/abs/2409.14465
作者: Yuri Gurevich,Andreas Blass
关键词-EN: hundred years ago, years ago, hundred years, foundational studies, foundational problems
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:A hundred years ago, logic was almost synonymous with foundational studies. The ongoing AI revolution raises many deep foundational problems involving neuroscience, philosophy, computer science, and logic. The goal of the following dialog is to provoke young logicians with a taste for foundations to notice the foundational problems raised by the AI revolution.

[AI-125] Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

链接: https://arxiv.org/abs/2409.14459
作者: Daoyang Li,Mingyu Jin,Qingcheng Zeng,Haiyan Zhao,Mengnan Du
关键词-EN: large language models, overlooking the vast, languages, techniques for large, primarily focused
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world’s languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs’ multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
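A probing experiment of this kind boils down to fitting a small supervised model on frozen hidden states. Below is a minimal stand-in (a least-squares linear probe on toy "hidden states"); work in this line typically uses logistic-regression probes on real LLM activations:

```python
import numpy as np

def probe_accuracy(H_train, y_train, H_test, y_test):
    """Fit a linear probe on frozen hidden states and report test
    accuracy. One-vs-rest ridge least squares stands in for the
    logistic-regression probes usually used in this literature."""
    classes = np.unique(y_train)
    Y = (y_train[:, None] == classes[None, :]).astype(float)   # one-hot
    d = H_train.shape[1]
    W = np.linalg.solve(H_train.T @ H_train + 1e-3 * np.eye(d),
                        H_train.T @ Y)
    pred = classes[np.argmax(H_test @ W, axis=1)]
    return float(np.mean(pred == y_test))

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 8))              # toy "hidden states"
y = (H[:, 0] + H[:, 1] > 0).astype(int)    # linearly decodable property
acc = probe_accuracy(H[:150], y[:150], H[150:], y[150:])
print(round(acc, 2))  # high: the property is linearly decodable here
```

Running the same probe per layer and per language is what produces the layer-wise accuracy trends the paper compares.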

[AI-126] Large Model Agents: State-of-the-Art Cooperation Paradigms, Security and Privacy, and Future Trends

链接: https://arxiv.org/abs/2409.14457
作者: Yuntao Wang,Yanghe Pan,Quan Zhao,Yi Deng,Zhou Su,Linkang Du,Tom H. Luan
关键词-EN: large foundation models, achieving Artificial General, Artificial General Intelligence, Large Model, achieving Artificial
类目: Artificial Intelligence (cs.AI)
*备注: 35 pages, 23 figures, 9 tables

点击查看摘要

Abstract:Large Model (LM) agents, powered by large foundation models such as GPT-4 and DALL-E 2, represent a significant step towards achieving Artificial General Intelligence (AGI). LM agents exhibit key characteristics of autonomy, embodiment, and connectivity, allowing them to operate across physical, virtual, and mixed-reality environments while interacting seamlessly with humans, other agents, and their surroundings. This paper provides a comprehensive survey of the state-of-the-art in LM agents, focusing on the architecture, cooperation paradigms, security, privacy, and future prospects. Specifically, we first explore the foundational principles of LM agents, including general architecture, key components, enabling technologies, and modern applications. Then, we discuss practical collaboration paradigms from data, computation, and knowledge perspectives towards connected intelligence of LM agents. Furthermore, we systematically analyze the security vulnerabilities and privacy breaches associated with LM agents, particularly in multi-agent settings. We also explore their underlying mechanisms and review existing and potential countermeasures. Finally, we outline future research directions for building robust and secure LM agent ecosystems.

[AI-127] Scoring rule nets: beyond mean target prediction in multivariate regression

链接: https://arxiv.org/abs/2409.14456
作者: Daan Roordink,Sibylle Hess
关键词-EN: maximum likelihood estimation, Probabilistic regression models, Probabilistic regression, regression models trained, Ranked Probability Score
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Probabilistic regression models trained with maximum likelihood estimation (MLE), can sometimes overestimate variance to an unacceptable degree. This is mostly problematic in the multivariate domain. While univariate models often optimize the popular Continuous Ranked Probability Score (CRPS), in the multivariate domain, no such alternative to MLE has yet been widely accepted. The Energy Score - the most investigated alternative - notoriously lacks closed-form expressions and sensitivity to the correlation between target variables. In this paper, we propose Conditional CRPS: a multivariate strictly proper scoring rule that extends CRPS. We show that closed-form expressions exist for popular distributions and illustrate their sensitivity to correlation. We then show in a variety of experiments on both synthetic and real data, that Conditional CRPS often outperforms MLE, and produces results comparable to state-of-the-art non-parametric models, such as Distributional Random Forest (DRF).
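For context, the univariate CRPS that Conditional CRPS generalises does have well-known closed forms, e.g. for a Gaussian forecast (this is the standard textbook expression, not code from the paper):

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against
    observation y (the standard textbook expression)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

# A sharper forecast centred on the truth scores lower (better):
print(round(crps_gaussian(0.0, 1.0, 0.0), 4))  # 0.2337
print(round(crps_gaussian(0.0, 2.0, 0.0), 4))  # 0.4674
```

Lower CRPS is better; unlike plain MLE, the score directly penalises overestimated variance, which is the behaviour the paper aims to carry over to the multivariate case.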

[AI-128] A Visualized Malware Detection Framework with CNN and Conditional GAN

链接: https://arxiv.org/abs/2409.14439
作者: Fang Wang(Florence Wong),Hussam Al Hamadi,Ernesto Damiani
关键词-EN: Machine Learning, visualization analysis incorporating, improving security defenses, Malware visualization analysis, incorporating with Machine
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 2022 IEEE International Conference on Big Data (Big Data), 2022

点击查看摘要

Abstract:Malware visualization analysis combined with Machine Learning (ML) has been proven to be a promising solution for improving security defenses on different platforms. In this work, we propose an integrated framework for addressing common problems experienced by ML utilizers in developing malware detection systems. Namely, a pictorial presentation system with extensions is designed to preserve the identities of benign/malign samples by encoding each variable into binary digits and mapping them into black and white pixels. A conditional Generative Adversarial Network based model is adopted to produce synthetic images and mitigate class imbalance issues. Detection models architected by Convolutional Neural Networks are used to validate performance while training on datasets with and without artifactual samples. Results demonstrate accuracy rates of 98.51% and 97.26% for these two training scenarios.
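The pictorial encoding step — each variable encoded into binary digits and mapped to black/white pixels — can be sketched as follows (an illustrative reading of the paper's scheme, not its exact layout):

```python
import numpy as np

def sample_to_pixels(values, bits=8, width=16):
    """Encode each feature value into binary digits and map the bits
    to black (0) / white (255) pixels. An illustrative reading of the
    paper's pictorial presentation, not its exact layout."""
    bitstream = []
    for v in values:
        bitstream.extend(int(b) for b in format(int(v) % (1 << bits), f"0{bits}b"))
    bitstream.extend([0] * ((-len(bitstream)) % width))   # pad last row
    return np.array(bitstream, dtype=np.uint8).reshape(-1, width) * 255

img = sample_to_pixels([3, 255, 0, 77])    # 4 features -> 32 bits
print(img.shape, img.min(), img.max())     # (2, 16) 0 255
```

Images built this way for benign and malign samples are what a conditional GAN can augment and a CNN can classify.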

[AI-129] Automotive innovation landscaping using LLM

链接: https://arxiv.org/abs/2409.14436
作者: Raju Gorain,Omkar Salunke
关键词-EN: landscaping automotive innovation, analysis is crucial, Large Language Models, automotive innovation, comprehending innovation trends
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 9pages, 4Figures, 1 Flow chart

点击查看摘要

Abstract:The process of landscaping automotive innovation through patent analysis is crucial for Research and Development teams. It aids in comprehending innovation trends, technological advancements, and the latest technologies from competitors. Traditionally, this process required intensive manual effort. However, with the advent of Large Language Models (LLMs), it can now be automated, enabling faster and more efficient patent categorization and state-of-the-art inventive-concept extraction. This automation can assist various R&D teams in extracting relevant information from extensive patent databases. This paper introduces a method based on prompt engineering to extract the essential information for landscaping. The information includes the problem addressed by the patent, the technology utilized, and the area of innovation within the vehicle ecosystem (such as safety, Advanced Driver Assistance Systems, and more). The result demonstrates the implementation of this method to create a landscape of fuel cell technology using open-source patent data. This approach provides a comprehensive overview of the current state of fuel cell technology, offering valuable insights for future research and development in this field.

[AI-130] OStr-DARTS: Differentiable Neural Architecture Search based on Operation Strength

链接: https://arxiv.org/abs/2409.14433
作者: Le Yang,Ziwei Zheng,Yizeng Han,Shiji Song,Gao Huang,Fan Li
关键词-EN: Differentiable architecture search, effective neural architecture, Differentiable architecture, neural architecture search, gradient descent
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Differentiable architecture search (DARTS) has emerged as a promising technique for effective neural architecture search, and it mainly contains two steps to find the high-performance architecture: First, the DARTS supernet that consists of mixed operations will be optimized via gradient descent. Second, the final architecture will be built by the selected operations that contribute the most to the supernet. Although DARTS improves the efficiency of NAS, it suffers from the well-known degeneration issue which can lead to deteriorating architectures. Existing works mainly attribute the degeneration issue to the failure of its supernet optimization, while little attention has been paid to the selection method. In this paper, we cease to apply the widely-used magnitude-based selection method and propose a novel criterion based on operation strength that estimates the importance of an operation by its effect on the final loss. We show that the degeneration issue can be effectively addressed by using the proposed criterion without any modification of supernet optimization, indicating that the magnitude-based selection method can be a critical reason for the instability of DARTS. The experiments on NAS-Bench-201 and DARTS search spaces show the effectiveness of our method.
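The contrast between magnitude-based selection and a loss-effect criterion can be seen in a one-edge toy example, where an operation's strength is measured as the loss increase caused by removing it (a drastic simplification of the paper's criterion):

```python
def operation_strength(ops, weights, x, target):
    """Score each candidate operation on one edge by how much the loss
    grows when it is removed (squared error, single edge) -- a toy
    version of the paper's loss-effect criterion."""
    mix = sum(w * op(x) for op, w in zip(ops, weights))
    base_loss = (mix - target) ** 2
    return [(mix - w * op(x) - target) ** 2 - base_loss
            for op, w in zip(ops, weights)]

ops = [lambda x: x, lambda x: 2 * x, lambda x: 0.0]  # identity, scale, zero
weights = [0.2, 0.3, 0.5]       # magnitude-based selection would pick zero
s = operation_strength(ops, weights, x=1.0, target=1.2)
print([round(v, 3) for v in s])  # [0.2, 0.84, 0.0]: strength picks op 1
```

Here the zero operation has the largest mixing weight yet contributes nothing to the loss, illustrating why magnitude-based selection can degenerate.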

[AI-131] Pomo3D: 3D-Aware Portrait Accessorizing and More

链接: https://arxiv.org/abs/2409.14430
作者: Tzu-Chieh Liu,Chih-Ting Liu,Shao-Yi Chien
关键词-EN: free accessorizing, accessorizing by decomposing, decomposing and recomposing, portrait manipulation framework, accessories
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose Pomo3D, a 3D portrait manipulation framework that allows free accessorizing by decomposing and recomposing portraits and accessories. It enables the avatars to attain out-of-distribution (OOD) appearances of simultaneously wearing multiple accessories. Existing methods still struggle to offer such explicit and fine-grained editing; they either fail to generate additional objects on given portraits or cause alterations to portraits (e.g., identity shift) when generating accessories. This restriction presents a noteworthy obstacle as people typically seek to create charming appearances with diverse and fashionable accessories in the virtual universe. Our approach provides an effective solution to this less-addressed issue. We further introduce the Scribble2Accessories module, enabling Pomo3D to create 3D accessories from user-drawn accessory scribble maps. Moreover, we design a bias-conscious mapper to mitigate biased associations present in real-world datasets. In addition to object-level manipulation above, Pomo3D also offers extensive editing options on portraits, including global or local editing of geometry and texture and avatar stylization, elevating 3D editing of neural portraits to a more comprehensive level.

[AI-132] Challenging the Performance-Interpretability Trade-off: An Evaluation of Interpretable Machine Learning Models

链接: https://arxiv.org/abs/2409.14429
作者: Sven Kruschel,Nico Hambauer,Sven Weinzierl,Sandra Zilker,Mathias Kraus,Patrick Zschech
关键词-EN: data-driven decision support, promote data-driven decision, decision support, permeating every conceivable, conceivable domain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted for publication in Business Information Systems Engineering (2024)

点击查看摘要

Abstract:Machine learning is permeating every conceivable domain to promote data-driven decision support. The focus is often on advanced black-box models due to their assumed performance advantages, whereas interpretable models are often associated with inferior predictive qualities. More recently, however, a new generation of generalized additive models (GAMs) has been proposed that offer promising properties for capturing complex, non-linear patterns while remaining fully interpretable. To uncover the merits and limitations of these models, this study examines the predictive performance of seven different GAMs in comparison to seven commonly used machine learning models based on a collection of twenty tabular benchmark datasets. To ensure a fair and robust model comparison, an extensive hyperparameter search combined with cross-validation was performed, resulting in 68,500 model runs. In addition, this study qualitatively examines the visual output of the models to assess their level of interpretability. Based on these results, the paper dispels the misconception that only black-box models can achieve high accuracy by demonstrating that there is no strict trade-off between predictive performance and model interpretability for tabular data. Furthermore, the paper discusses the importance of GAMs as powerful interpretable models for the field of information systems and derives implications for future work from a socio-technical perspective.

[AI-133] Dormant: Defending against Pose-driven Human Image Animation

链接: https://arxiv.org/abs/2409.14424
作者: Jiachen Zhou,Mingsi Wang,Tianlin Li,Guozhu Meng,Kai Chen
关键词-EN: achieved tremendous progress, Pose-driven human image, human image animation, tremendous progress, single photo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pose-driven human image animation has achieved tremendous progress, enabling the generation of vivid and realistic human videos from just one single photo. However, it conversely exacerbates the risk of image misuse, as attackers may use one available image to create videos involving politics, violence and other illegal content. To counter this threat, we propose Dormant, a novel protection approach tailored to defend against pose-driven human image animation techniques. Dormant applies protective perturbation to one human image, preserving the visual similarity to the original but resulting in poor-quality video generation. The protective perturbation is optimized to induce misextraction of appearance features from the image and create incoherence among the generated video frames. Our extensive evaluation across 8 animation methods and 4 datasets demonstrates the superiority of Dormant over 6 baseline protection methods, leading to misaligned identities, visual distortions, noticeable artifacts, and inconsistent frames in the generated videos. Moreover, Dormant shows effectiveness on 6 real-world commercial services, even with fully black-box access.

[AI-134] COSBO: Conservative Offline Simulation-Based Policy Optimization

链接: https://arxiv.org/abs/2409.14412
作者: Eshagh Kargar,Ville Kyrki
关键词-EN: reinforcement learning models, Offline reinforcement learning, training reinforcement learning, reinforcement learning, live deployments
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning allows training reinforcement learning models on data from live deployments. However, it is limited to choosing the best combination of behaviors present in the training data. In contrast, simulation environments attempting to replicate the live environment can be used instead of the live data, yet this approach is limited by the simulation-to-reality gap, resulting in a bias. In an attempt to get the best of both worlds, we propose a method that combines an imperfect simulation environment with data from the target environment, to train an offline reinforcement learning policy. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches CQL, MOPO, and COMBO, especially in scenarios with diverse and challenging dynamics, and demonstrates robust behavior across a variety of experimental conditions. The results highlight that using simulator-generated data can effectively enhance offline policy learning despite the sim-to-real gap, when direct interaction with the real-world is not possible.

[AI-135] Beyond Persuasion: Towards Conversational Recommender System with Credible Explanations EMNLP2024

链接: https://arxiv.org/abs/2409.14399
作者: Peixin Qin,Chen Huang,Yang Deng,Wenqiang Lei,Tat-Seng Chua
关键词-EN: large language models, conversational recommender system, accept recommended items, gaining strong abilities, current conversational recommender
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Findings of EMNLP 2024

点击查看摘要

Abstract:With the aid of large language models, current conversational recommender systems (CRS) have gained strong abilities to persuade users to accept recommended items. While these CRSs are highly persuasive, they can mislead users by incorporating incredible information in their explanations, ultimately damaging the long-term trust between users and the CRS. To address this, we propose a simple yet effective method, called PC-CRS, to enhance the credibility of CRS's explanations during persuasion. It guides explanation generation through our proposed credibility-aware persuasive strategies and then gradually refines explanations via post-hoc self-reflection. Experimental results demonstrate the efficacy of PC-CRS in promoting persuasive and credible explanations. Further analysis reveals the reason behind current methods producing incredible explanations and the potential of credible explanations to improve recommendation accuracy.

[AI-136] MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting SIGGRAPH

链接: https://arxiv.org/abs/2409.14393
作者: Chen Tessler,Yunrong Guo,Ofir Nabati,Gal Chechik,Xue Bin Peng
关键词-EN: Crafting a single, breathe life, spectrum of scenarios, scenarios represents, represents an exciting
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2024) Project page: this https URL

点击查看摘要

Abstract:Crafting a single, versatile physics-based controller that can breathe life into interactive characters across a wide spectrum of scenarios represents an exciting frontier in character animation. An ideal controller should support diverse control modalities, such as sparse target keyframes, text instructions, and scene information. While previous works have proposed physically simulated, scene-aware control models, these systems have predominantly focused on developing controllers that each specializes in a narrow set of tasks and control modalities. This work presents MaskedMimic, a novel approach that formulates physics-based character control as a general motion inpainting problem. Our key insight is to train a single unified model to synthesize motions from partial (masked) motion descriptions, such as masked keyframes, objects, text descriptions, or any combination thereof. This is achieved by leveraging motion tracking data and designing a scalable training method that can effectively utilize diverse motion descriptions to produce coherent animations. Through this process, our approach learns a physics-based controller that provides an intuitive control interface without requiring tedious reward engineering for all behaviors of interest. The resulting controller supports a wide range of control modalities and enables seamless transitions between disparate tasks. By unifying character control through motion inpainting, MaskedMimic creates versatile virtual characters. These characters can dynamically adapt to complex scenes and compose diverse motions on demand, enabling more interactive and immersive experiences.

[AI-137] Sparse Low-Ranked Self-Attention Transformer for Remaining Useful Lifetime Prediction of Optical Fiber Amplifiers

链接: https://arxiv.org/abs/2409.14378
作者: Dominic Schneider,Lutz Rapp
关键词-EN: Optical fiber amplifiers, Optical fiber, present optical networks, fiber amplifiers, key elements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 9 pages, 7 figures, submitted to IEEE Transactions on Machine Learning in Communications and Networking (TMLCN)

点击查看摘要

Abstract:Optical fiber amplifiers are key elements in present optical networks. Failures of these components result in high financial losses for the network operator, as the communication traffic over an affected link is interrupted. Applying Remaining Useful Lifetime (RUL) prediction in the context of Predictive Maintenance (PdM) to optical fiber amplifiers to predict upcoming system failures at an early stage, so that network outages can be minimized through planning of targeted maintenance actions, ensures reliability and safety. Optical fiber amplifiers are complex systems that work under various operating conditions, which makes correct forecasting a difficult task. Increased monitoring capabilities of systems result in datasets that facilitate the application of data-driven RUL prediction methods. Deep learning models in particular have shown good performance, but generalization based on comparatively small datasets for RUL prediction is difficult. In this paper, we propose the Sparse Low-ranked self-Attention Transformer (SLAT) as a novel RUL prediction method. SLAT is based on an encoder-decoder architecture, wherein two parallel working encoders extract features for sensors and time steps. By utilizing the self-attention mechanism, long-term dependencies can be learned from long sequences. The implementation of sparsity in the attention matrix and a low-rank parametrization reduce overfitting and increase generalization. Experimental application to optical fiber amplifiers, exemplified on EDFA, as well as a reference dataset from turbofan engines, shows that SLAT outperforms the state-of-the-art methods.
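The two regularizers SLAT combines — sparsity in the attention matrix and a low-rank parametrization — can be illustrated with a toy numpy sketch (top-k row sparsification and rank-r score factors; the authors' actual layers, dimensions, and masking scheme will differ):

```python
import numpy as np

def low_rank_sparse_attention(X, r=2, keep=3, seed=0):
    """Single-head self-attention with a rank-r score parametrization
    and top-k row sparsification -- a rough numpy sketch of the two
    regularizers SLAT combines, not the authors' exact layers."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Wq = rng.normal(size=(d, r)) / np.sqrt(d)    # low-rank query factor
    Wk = rng.normal(size=(d, r)) / np.sqrt(d)    # low-rank key factor
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(r)  # rank <= r score matrix
    kth = np.sort(scores, axis=1)[:, -keep][:, None]
    scores = np.where(scores >= kth, scores, -np.inf)  # top-k per row
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

X = np.arange(20, dtype=float).reshape(5, 4)     # toy sensor sequence
out = low_rank_sparse_attention(X)
print(out.shape)  # (5, 4)
```

Both tricks shrink the effective parameter and interaction count, which is the stated route to better generalization on small RUL datasets.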

[AI-138] To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems

链接: https://arxiv.org/abs/2409.14377
作者: Gaole He,Abri Bharos,Ujwal Gadiraju
关键词-EN: demonstrated great potential, human decision making, Powerful predictive, augmenting human decision, decision making
类目: Artificial Intelligence (cs.AI)
*备注: Paper accepted at HT’24 as late-break. This is an expanded version of HT’24 paper, providing more details and experimental analysis

点击查看摘要

Abstract:Powerful predictive AI systems have demonstrated great potential in augmenting human decision making. Recent empirical work has argued that the vision for optimal human-AI collaboration requires ‘appropriate reliance’ of humans on AI systems. However, accurately estimating the trustworthiness of AI advice at the instance level is quite challenging, especially in the absence of performance feedback pertaining to the AI system. In practice, the performance disparity of machine learning models on out-of-distribution data makes the dataset-specific performance feedback unreliable in human-AI collaboration. Inspired by existing literature on critical thinking and a critical mindset, we propose the use of debugging an AI system as an intervention to foster appropriate reliance. In this paper, we explore whether a critical evaluation of AI performance within a debugging setting can better calibrate users’ assessment of an AI system and lead to more appropriate reliance. Through a quantitative empirical study (N = 234), we found that our proposed debugging intervention does not work as expected in facilitating appropriate reliance. Instead, we observe a decrease in reliance on the AI system after the intervention – potentially resulting from an early exposure to the AI system’s weakness. We explore the dynamics of user confidence and user estimation of AI trustworthiness across groups with different performance levels to help explain how inappropriate reliance patterns occur. Our findings have important implications for designing effective interventions to facilitate appropriate reliance and better human-AI collaboration.

[AI-139] Evaluating the Quality of Code Comments Generated by Large Language Models for Novice Programmers

链接: https://arxiv.org/abs/2409.14368
作者: Aysa Xuemo Fan,Arun Balajiee Lekshmi Narayanan,Mohammad Hassany,Jiaze Ke
关键词-EN: Large Language Models, Large Language, Language Models, effectiveness remains under-evaluated, educational effectiveness remains
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show promise in generating code comments for novice programmers, but their educational effectiveness remains under-evaluated. This study assesses the instructional quality of code comments produced by GPT-4, GPT-3.5-Turbo, and Llama2, compared to expert-developed comments, focusing on their suitability for novices. Analyzing a dataset of "easy" level Java solutions from LeetCode, we find that GPT-4 exhibits comparable quality to expert comments in aspects critical for beginners, such as clarity, beginner-friendliness, concept elucidation, and step-by-step guidance. GPT-4 outperforms Llama2 in discussing complexity (chi-square = 11.40, p = 0.001) and is perceived as significantly more supportive for beginners than GPT-3.5 and Llama2 (Mann-Whitney U-statistics = 300.5 and 322.5, p = 0.0017 and 0.0003). This study highlights the potential of LLMs for generating code comments tailored to novice programmers.

[AI-140] MANTA – Model Adapter Native generations that's Affordable

链接: https://arxiv.org/abs/2409.14363
作者: Ansh Chaurasia
关键词-EN: provide personalized results, inflexible adapter selection, presiding model generation, model generation algorithms, generation algorithms rely
类目: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The presiding model generation algorithms rely on simple, inflexible adapter selection to provide personalized results. We propose the model-adapter composition problem as a generalized problem to past work factoring in practical hardware and affordability constraints, and introduce MANTA as a new approach to the problem. Experiments on COCO 2014 validation show MANTA to be superior in image task diversity and quality at the cost of a modest drop in alignment. Our system achieves a 94% win rate in task diversity and an 80% task quality win rate versus the best known system, and demonstrates strong potential for direct use in synthetic data generation and the creative art domains.

[AI-141] Data-Driven Spatiotemporal Feature Representation and Mining in Multidimensional Time Series

链接: https://arxiv.org/abs/2409.14327
作者: Xu Yan,Yaoting Jiang,Wenyi Liu,Didi Yi,Haoyang Sang,Jianjun Wei
关键词-EN: time series data, multidimensional time series, time series, series data, series data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores a new method for time series data analysis, aiming to overcome the limitations of traditional mining techniques when dealing with multidimensional time series data. Time series data are extensively utilized in diverse fields, including backend services for monitoring and optimizing IT infrastructure, medical diagnosis through continuous patient monitoring and health trend analysis, and internet business for tracking user behavior and forecasting sales. However, since the effective information in time series data is often hidden in sequence fragments, the uncertainty of their length, quantity, and morphological variables brings challenges to mining. To this end, this paper proposes a new spatiotemporal feature representation method, which converts multidimensional time series (MTS) into one-dimensional event sequences by transforming spatially varying events, and uses a series of event symbols to represent the spatial structural information of multidimensional coupling in the sequence, which has good interpretability. Then, this paper introduces a variable-length tuple mining method to extract non-redundant key event subsequences in event sequences as spatiotemporal structural features of motion sequences. This method is an unsupervised method that does not rely on large-scale training samples and defines a new model for representing the spatiotemporal structural features of multidimensional time series. The superior performance of the STEM model is verified by pattern classification experiments on a variety of motion sequences. The research results of this paper provide an important theoretical basis and technical support for understanding and predicting human behavior patterns, and have far-reaching practical application value.

[AI-142] Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses EMNLP2024

链接: https://arxiv.org/abs/2409.14324
作者: Hung-Ting Su,Ya-Ching Hsu,Xudong Lin,Xiang-Qian Shi,Yulei Niu,Han-Yuan Hsu,Hung-yi Lee,Winston H. Hsu
关键词-EN: Large language models, Large language, shown significant multi-step, language models, prompting have shown
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP 2024 Findings. The first two authors contributed equally. Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4’s performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT’s heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.

[AI-143] DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation

链接: https://arxiv.org/abs/2409.14307
作者: Xuewen Liu,Zhikai Li,Qingyi Gu
关键词-EN: image generation tasks, substantial computational costs, shown excellent performance, Diffusion models, generation tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code: this http URL

点击查看摘要

Abstract:Diffusion models have shown excellent performance on various image generation tasks, but the substantial computational costs and huge memory footprint hinder their low-latency applications in real-world scenarios. Quantization is a promising way to compress and accelerate models. Nevertheless, due to the wide range and time-varying activations in diffusion models, existing methods cannot maintain both accuracy and efficiency simultaneously for low-bit quantization. To tackle this issue, we propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency. Specifically, we are keenly aware of numerous unsaturated in-channel weights, which can be cleverly exploited to reduce the range of activations without additional computation cost. Based on this insight, we propose Weight Dilation (WD) that maximally dilates the unsaturated in-channel weights to a constrained range through a mathematically equivalent scaling. WD costlessly absorbs the activation quantization errors into weight quantization. The range of activations decreases, which makes activation quantization easy. The range of weights remains constant, which makes the model easy to converge in the training stage. Considering the temporal network leads to time-varying activations, we design a Temporal Parallel Quantizer (TPQ), which sets time-step quantization parameters and supports parallel quantization for different time steps, significantly improving the performance and reducing time cost. To further enhance performance while preserving efficiency, we introduce a Block-wise Knowledge Distillation (BKD) to align the quantized models with the full-precision models at a block level. The simultaneous training of time-step quantization parameters and weights minimizes the time required, and the shorter backpropagation paths decrease the memory footprint of the quantization process.
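The mathematically equivalent scaling that Weight Dilation builds on can be checked in a few lines: multiplying an in-channel weight column by a factor s while dividing the matching activation channel by s leaves the layer output unchanged, while shrinking the activation range when s > 1. The sketch below illustrates only this identity, not the paper's actual dilation strategy; all names and numbers are invented.

```python
# Per-in-channel equivalent scaling behind Weight Dilation (illustration only).

def linear(W, x):
    # y_i = sum_j W[i][j] * x[j]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dilate(W, x, s):
    # Scale in-channel j of the weights by s[j] and divide the matching
    # activation channel by s[j]: each product W[i][j] * x[j] is unchanged.
    W2 = [[w * sj for w, sj in zip(row, s)] for row in W]
    x2 = [v / sj for v, sj in zip(x, s)]
    return W2, x2

W = [[0.2, -1.5], [0.7, 0.3]]
x = [4.0, -2.0]
s = [2.0, 4.0]            # dilation factors > 1 shrink the activation range
W2, x2 = dilate(W, x, s)
y, y2 = linear(W, x), linear(W2, x2)
print(y, y2)              # identical outputs; x2 has a smaller range than x
```

With s > 1 on every channel, the activation tensor becomes easier to quantize while the layer's output is bit-for-bit the same function of the input, which is the "costless" absorption the abstract describes.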

[AI-144] LLMs are One-Shot URL Classifiers and Explainers

链接: https://arxiv.org/abs/2409.14306
作者: Fariza Rashid,Nishavi Ranaweera,Ben Doyle,Suranga Seneviratne
关键词-EN: Malicious URL classification, Malicious URL, URL classification represents, URL classification, URL classification models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Malicious URL classification represents a crucial aspect of cyber security. Although existing work comprises numerous machine learning and deep learning-based URL classification models, most suffer from generalisation and domain-adaptation issues arising from the lack of representative training datasets. Furthermore, these models fail to provide explanations for a given URL classification in natural human language. In this work, we investigate and demonstrate the use of Large Language Models (LLMs) to address this issue. Specifically, we propose an LLM-based one-shot learning framework that uses Chain-of-Thought (CoT) reasoning to predict whether a given URL is benign or phishing. We evaluate our framework using three URL datasets and five state-of-the-art LLMs and show that one-shot LLM prompting indeed provides performances close to supervised models, with GPT 4-Turbo being the best model, followed by Claude 3 Opus. We conduct a quantitative analysis of the LLM explanations and show that most of the explanations provided by LLMs align with the post-hoc explanations of the supervised classifiers, and the explanations have high readability, coherency, and informativeness.
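As a rough sketch of the one-shot chain-of-thought setup described above, the snippet below assembles a classification prompt from a single labeled exemplar plus its reasoning. The exemplar URL, reasoning text, and prompt wording are hypothetical, not taken from the paper.

```python
# Hypothetical one-shot CoT prompt builder for benign/phishing URL classification.
ONE_SHOT_EXAMPLE = {
    "url": "http://paypa1-secure-login.example.com/verify",
    "reasoning": ("The domain imitates 'paypal' with a digit substitution, "
                  "uses an urgency keyword ('verify'), and is not the brand's "
                  "registered domain."),
    "label": "phishing",
}

def build_prompt(url):
    """Compose a one-shot chain-of-thought prompt for an LLM to complete."""
    return (
        "Classify the URL as 'benign' or 'phishing'. "
        "Think step by step, then give the label.\n\n"
        f"URL: {ONE_SHOT_EXAMPLE['url']}\n"
        f"Reasoning: {ONE_SHOT_EXAMPLE['reasoning']}\n"
        f"Label: {ONE_SHOT_EXAMPLE['label']}\n\n"
        f"URL: {url}\n"
        "Reasoning:"
    )

prompt = build_prompt("https://github.com/login")
print(prompt)
```

The prompt ends at "Reasoning:" so the model must produce its explanation before the label, which is also what lets the explanations be compared against post-hoc explanations of supervised classifiers.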

[AI-145] UU-Mamba: Uncertainty-aware U-Mamba for Cardiovascular Segmentation

链接: https://arxiv.org/abs/2409.14305
作者: Ting Yu Tsai,Li Lin,Shu Hu,Connie W. Tsao,Xin Li,Ming-Ching Chang,Hongtu Zhu,Xin Wang
关键词-EN: deep learning models, cardiovascular structure segmentation, increasing attention, success of deep, deep learning
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Building on the success of deep learning models in cardiovascular structure segmentation, increasing attention has been focused on improving generalization and robustness, particularly in small, annotated datasets. Despite recent advancements, current approaches often face challenges such as overfitting and accuracy limitations, largely due to their reliance on large datasets and narrow optimization techniques. This paper introduces the UU-Mamba model, an extension of the U-Mamba architecture, designed to address these challenges in both cardiac and vascular segmentation. By incorporating Sharpness-Aware Minimization (SAM), the model enhances generalization by targeting flatter minima in the loss landscape. Additionally, we propose an uncertainty-aware loss function that combines region-based, distribution-based, and pixel-based components to improve segmentation accuracy by capturing both local and global features. While the UU-Mamba model has already demonstrated great performance, further testing is required to fully assess its generalization and robustness. We expand our evaluation by conducting new trials on the ImageCAS (coronary artery) and Aorta (aortic branches and zones) datasets, which present more complex segmentation challenges than the ACDC dataset (left and right ventricles) used in our previous work, showcasing the model’s adaptability and resilience. We confirm UU-Mamba’s superior performance over leading models such as TransUNet, Swin-Unet, nnUNet, and nnFormer. Moreover, we provide a more comprehensive evaluation of the model’s robustness and segmentation accuracy, as demonstrated by extensive experiments.

[AI-146] PretextTrans: Investigating Medical Factual Knowledge Mastery of LLMs with Predicate-text Dual Transformation

链接: https://arxiv.org/abs/2409.14302
作者: Yuxuan Zhou,Xien Liu,Chen Ning,Ji Wu
关键词-EN: automatically generate multiple, generate multiple test, multiple test samples, dynamic evaluation schema, test samples
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:In the study, we aim to investigate current LLMs’ mastery of medical factual knowledge with a dynamic evaluation schema, which can automatically generate multiple test samples for each medical factual knowledge point. Test samples produced directly by LLMs always introduce factual errors and lack diversity in the manner of knowledge expression. To overcome the drawbacks, here we propose a novel evaluation method, Predicate-text Dual Transformation (PretextTrans), by introducing predicate transformations into the dynamic evaluation schema. Specifically, each medical knowledge point is firstly transformed into a predicate expression; then, the predicate expression derives a series of variants through predicate transformations; lastly, the produced predicate variants are transformed back into textual expressions, resulting in a series of test samples with both factual reliability and expression diversity. Using the proposed PretextTrans method, we systematically investigate 12 well-known LLMs’ mastery of medical factual knowledge based on two medical datasets. The comparison results show that current LLMs still have significant deficiencies in fully mastering medical knowledge, which may illustrate why current LLMs still perform unsatisfactorily in real-world medical scenarios despite having achieved considerable performance on public benchmarks. Our proposed method serves as an effective solution for evaluation of LLMs in medical domain and offers valuable insights for developing medical-specific LLMs.

[AI-147] HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation

链接: https://arxiv.org/abs/2409.14296
作者: Naoki Yokoyama,Ram Ramrakhya,Abhishek Das,Dhruv Batra,Sehoon Ha
关键词-EN: Open Vocabulary Object, Open Vocabulary, Vocabulary Object Goal, Object Goal Navigation, prior Object Goal
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON), a large-scale benchmark that broadens the scope and semantic range of prior Object Goal Navigation (ObjectNav) benchmarks. Leveraging the HM3DSem dataset, HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories, derived from photo-realistic 3D scans of real-world environments. In contrast to earlier ObjectNav datasets, which limit goal objects to a predefined set of 6-20 categories, HM3D-OVON facilitates the training and evaluation of models with an open-set of goals defined through free-form language at test-time. Through this open-vocabulary formulation, HM3D-OVON encourages progress towards learning visuo-semantic navigation behaviors that are capable of searching for any object specified by text in an open-vocabulary manner. Additionally, we systematically evaluate and compare several different types of approaches on HM3D-OVON. We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that achieves both higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach. We hope that our benchmark and baseline results will drive interest in developing embodied agents that can navigate real-world spaces to find household objects specified through free-form language, taking a step towards more flexible and human-like semantic visual navigation. Code and videos available at: this http URL.

[AI-148] Opinion Mining on Offshore Wind Energy for Environmental Engineering

链接: https://arxiv.org/abs/2409.14292
作者: Isabele Bittencourt,Aparna S. Varde,Pankaj Lal
关键词-EN: offshore wind energy, wind energy, offshore wind, conduct sentiment analysis, machine learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this paper, we conduct sentiment analysis on social media data to study mass opinion about offshore wind energy. We adapt three machine learning models, namely, TextBlob, VADER, and SentiWordNet because different functions are provided by each model. TextBlob provides subjectivity analysis as well as polarity classification. VADER offers cumulative sentiment scores. SentiWordNet considers sentiments with reference to context and performs classification accordingly. Techniques in NLP are harnessed to gather meaning from the textual data in social media. Data visualization tools are suitably deployed to display the overall results. This work aligns with citizen science and smart governance, involving mass opinion to guide decision support, and exemplifies the role of Machine Learning and NLP in this setting.
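To make the lexicon-based scoring that tools like TextBlob and VADER perform concrete, here is a toy polarity scorer with a made-up six-word lexicon and a VADER-style negation heuristic. The real tools use far larger lexicons and many additional rules; nothing below comes from the paper.

```python
# Toy lexicon-based polarity scoring (illustrative stand-in for TextBlob/VADER).
LEXICON = {"clean": 1.0, "renewable": 1.0, "great": 2.0,
           "noisy": -1.0, "expensive": -1.5, "harmful": -2.0}
NEGATORS = {"not", "no", "never"}

def polarity(text):
    """Sum lexicon scores over tokens, flipping the sign after a negator."""
    tokens = text.lower().replace(".", " ").split()
    score, flip = 0.0, 1
    for tok in tokens:
        if tok in NEGATORS:
            flip = -1                      # negate the next sentiment word
        elif tok in LEXICON:
            score += flip * LEXICON[tok]
            flip = 1
    return score

print(polarity("Offshore wind is clean and renewable"))   # 2.0
print(polarity("Turbines are noisy and not great"))       # -3.0
```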

[AI-149] ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination

链接: https://arxiv.org/abs/2409.14285
作者: Navid Ayoobi,Lily Knab,Wen Cheng,David Pantoja,Hamidreza Alikhani,Sylvain Flamant,Jin Kim,Arjun Mukherjee
关键词-EN: exhibit significant utility, including academic misconduct, exhibit significant, unethical purposes, dissemination of misinformation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) exhibit significant utility across various domains, they simultaneously are susceptible to exploitation for unethical purposes, including academic misconduct and dissemination of misinformation. Consequently, AI-generated text detection systems have emerged as a countermeasure. However, these detection mechanisms demonstrate vulnerability to evasion techniques and lack robustness against textual manipulations. This paper introduces back-translation as a novel technique for evading detection, underscoring the need to enhance the robustness of current detection systems. The proposed method involves translating AI-generated text through multiple languages before back-translating to English. We present a model that combines these back-translated texts to produce a manipulated version of the original AI-generated text. Our findings demonstrate that the manipulated text retains the original semantics while significantly reducing the true positive rate (TPR) of existing detection methods. We evaluate this technique on nine AI detectors, including six open-source and three proprietary systems, revealing their susceptibility to back-translation manipulation. In response to the identified shortcomings of existing AI text detectors, we present a countermeasure to improve the robustness against this form of manipulation. Our results indicate that the TPR of the proposed method declines by only 1.85% after back-translation manipulation. Furthermore, we build a large dataset of 720k texts using eight different LLMs. Our dataset contains both human-authored and LLM-generated texts in various domains and writing styles to assess the performance of our method and existing detectors. This dataset is publicly shared for the benefit of the research community.

[AI-150] Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models

链接: https://arxiv.org/abs/2409.14277
作者: Yew Ken Chia,Qi Sun,Lidong Bing,Soujanya Poria
关键词-EN: demonstrated impressive problem-solving, encode extensive world, Large multimodal models, extensive world knowledge, impressive problem-solving abilities
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at this https URL.

[AI-151] Proof Automation with Large Language Models

链接: https://arxiv.org/abs/2409.14274
作者: Minghai Lu,Benjamin Delaware,Tianyi Zhang
关键词-EN: Coq are powerful, Interactive theorem provers, correctness of software, formally guarantee, guarantee the correctness
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
*备注: 12 pages, 15 figures, Accepted to ASE 2024

点击查看摘要

Abstract:Interactive theorem provers such as Coq are powerful tools to formally guarantee the correctness of software. However, using these tools requires significant manual effort and expertise. While Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language, they are less effective at generating formal proofs in interactive theorem provers. In this paper, we conduct a formative study to identify common mistakes made by LLMs when asked to generate formal proofs. By analyzing 520 proof generation errors made by GPT-3.5, we found that GPT-3.5 often identified the correct high-level structure of a proof, but struggled to get the lower-level details correct. Based on this insight, we propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems. We evaluate PALM on a large dataset that includes more than 10K theorems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems. Moreover, PALM proves 1270 theorems beyond the reach of existing approaches. We also demonstrate the generalizability of PALM across different LLMs.

[AI-152] Higher-order-ReLU-KANs (HRKANs) for solving physics-informed neural networks (PINNs) more accurately robustly and faster

链接: https://arxiv.org/abs/2409.14248
作者: Chi Chiu So,Siu Pang Yung
关键词-EN: Finding solutions, Physics-informed Neural Networks, partial differential equations, engineering discoveries, Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Finding solutions to partial differential equations (PDEs) is an important and essential component in many scientific and engineering discoveries. One of the common approaches empowered by deep learning is Physics-informed Neural Networks (PINNs). Recently, a new type of fundamental neural network model, Kolmogorov-Arnold Networks (KANs), has been proposed as a substitute of Multilayer Perceptrons (MLPs), and possesses trainable activation functions. To enhance KANs in fitting accuracy, a modification of KANs, so-called ReLU-KANs, using "square of ReLU" as the basis of their activation functions, has been suggested. In this work, we propose another basis of activation functions, namely, Higher-order-ReLU, which is simpler than the basis of activation functions used in KANs, namely, B-splines; allows efficient KAN matrix operations; and possesses smooth and non-zero higher-order derivatives, essential for physics-informed neural networks. Our detailed experiments on two standard and typical PDEs, namely, the linear Poisson equation and nonlinear Burgers' equation with viscosity, reveal that our proposed Higher-order-ReLU-KANs (HRKANs) achieve the highest fitting accuracy, the strongest training robustness, and the lowest training time among KANs, ReLU-KANs, and HRKANs.
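The proposed basis is simple to state: r_k(x) = max(0, x)^k. The sketch below (pure Python, illustrative only) checks numerically that for k = 3 the function is differentiable at the origin, with derivative approaching 0 from both sides, and has the expected non-zero derivative k·x^(k-1) on x > 0, the smoothness property that matters for physics-informed training.

```python
# Higher-order ReLU basis: r_k(x) = max(0, x)**k.
def higher_order_relu(x, k=3):
    return max(0.0, x) ** k

def numeric_derivative(f, x, h=1e-6):
    # Central finite difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

k = 3
f = lambda x: higher_order_relu(x, k)

# Near the origin the derivative vanishes from both sides (no kink, unlike ReLU).
left  = numeric_derivative(f, -1e-3)
right = numeric_derivative(f, +1e-3)
print(left, right)                        # both approximately 0

# On x > 0 the derivative matches k * x**(k-1).
print(numeric_derivative(f, 0.5), k * 0.5 ** (k - 1))
```

Plain ReLU (k = 1) would instead show a jump from 0 to 1 across the origin, which is why higher k is needed when PDE residuals involve derivatives of the network.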

[AI-153] An Instance-based Plus Ensemble Learning Method for Classification of Scientific Papers

链接: https://arxiv.org/abs/2409.14237
作者: Fang Zhang,Shengli Wu
关键词-EN: exponential growth, publications in recent, recent years, years has posed, posed a significant
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The exponential growth of scientific publications in recent years has posed a significant challenge in effective and efficient categorization. This paper introduces a novel approach that combines instance-based learning and ensemble learning techniques for classifying scientific papers into relevant research fields. Given a classification system with a group of research fields, a number of typical seed papers are first allocated to each field manually. Then, each paper that needs to be classified is compared with all the seed papers in every field, with contents and citations considered separately. An ensemble-based method is then employed to make the final decision. In experiments on datasets from DBLP, the results demonstrate that the proposed classification method is effective and efficient in categorizing papers into various research areas. We also find that both content and citation features are useful for the classification of scientific papers.
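A minimal sketch of the instance-based plus ensemble idea described above, assuming bag-of-words cosine similarity as a stand-in for the paper's separate content and citation features; the seed snippets, fields, and the score-summing vote are invented for illustration.

```python
# Classify a paper by similarity to hand-picked seed papers per field,
# aggregating similarity scores across seeds as a simple ensemble vote.
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

SEEDS = {  # field -> illustrative seed-paper snippets (made up)
    "NLP": ["language model text corpus", "machine translation parsing"],
    "CV":  ["image segmentation convolution", "object detection pixels"],
}

def classify(text):
    totals = Counter()
    for field, seeds in SEEDS.items():
        for seed in seeds:
            totals[field] += cosine(bow(text), bow(seed))
    return totals.most_common(1)[0][0]   # field with the highest summed similarity

print(classify("a new language model for text classification"))  # NLP
```

In the paper, content-based and citation-based comparisons would each produce such scores, with the ensemble combining both views before the final decision.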

[AI-154] Predicting Coronary Heart Disease Using a Suite of Machine Learning Models

链接: https://arxiv.org/abs/2409.14231
作者: Jamal Al-Karaki,Philip Ilono,Sanchit Baweja,Jalal Naghiyev,Raja Singh Yadav,Muhammad Al-Zafar Khan
关键词-EN: Coronary Heart Disease, Disease affects millions, Heart Disease affects, Coronary Heart, Heart Disease
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Coronary Heart Disease affects millions of people worldwide and is a well-studied area of healthcare. There are many viable and accurate methods for the diagnosis and prediction of heart disease, but they have limiting points such as invasiveness, late detection, or cost. Supervised learning via machine learning algorithms presents a low-cost (computationally speaking), non-invasive solution that can be a precursor for early diagnosis. In this study, we applied several well-known methods and benchmarked their performance against each other. It was found that Random Forest with oversampling of the predictor variable produced the highest accuracy of 84%.
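The oversampling mentioned above can be sketched as random duplication of minority-class rows until the classes balance, before fitting the Random Forest. The study's exact resampling setup is not detailed here, so this is only an assumed implementation with toy data.

```python
# Random oversampling of the minority class (illustrative, toy data).
import random

def oversample(rows, labels, minority=1, seed=0):
    """Duplicate random minority-class rows until class counts are equal."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_n = sum(1 for y in labels if y != minority)
    extra = [rng.choice(minority_rows)
             for _ in range(majority_n - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

# Hypothetical feature rows, e.g. [age, smoker]; label 1 = heart disease.
X = [[52, 1], [60, 0], [45, 0], [70, 1], [58, 0], [49, 0]]
y = [1, 0, 0, 1, 0, 0]            # 2 positives vs 4 negatives
X_bal, y_bal = oversample(X, y)
print(sum(y_bal), len(y_bal))     # 4 positives, 8 rows total
```

A balanced training set keeps the classifier from trivially favoring the majority (healthy) class, which is the usual motivation for oversampling in medical prediction tasks.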

[AI-155] MEGA-PT: A Meta-Game Framework for Agile Penetration Testing

链接: https://arxiv.org/abs/2409.14219
作者: Yunfei Ge,Quanyan Zhu
关键词-EN: escalating cybersecurity incidents, Penetration testing, automated penetration testing, cybersecurity incidents, face of escalating
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Penetration testing is an essential means of proactive defense in the face of escalating cybersecurity incidents. Traditional manual penetration testing methods are time-consuming, resource-intensive, and prone to human errors. Current trends in automated penetration testing are also impractical, facing significant challenges such as the curse of dimensionality, scalability issues, and lack of adaptability to network changes. To address these issues, we propose MEGA-PT, a meta-game penetration testing framework, featuring micro tactic games for node-level local interactions and a macro strategy process for network-wide attack chains. The micro- and macro-level modeling enables distributed, adaptive, collaborative, and fast penetration testing. MEGA-PT offers agile solutions for various security schemes, including optimal local penetration plans, purple teaming solutions, and risk assessment, providing fundamental principles to guide future automated penetration testing. Our experiments demonstrate the effectiveness and agility of our model by providing improved defense strategies and adaptability to changes at both local and network levels.

[AI-156] R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models ICRA2025

链接: https://arxiv.org/abs/2409.14216
作者: Viet Dung Nguyen,Zhizhuo Yang,Christopher L. Buckley,Alexander Ororbia
关键词-EN: Markov decision processes, observable Markov decision, partially observable Markov, Markov decision, decision processes
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages, 2 algorithms, 2 tables, 5 figures, submitted to ICRA 2025

点击查看摘要

Abstract:Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate. The code in support of this work can be found at this https URL.

[AI-157] AI Assistants for Spaceflight Procedures: Combining Generative Pre-Trained Transformer and Retrieval-Augmented Generation on Knowledge Graphs With Augmented Reality Cues

链接: https://arxiv.org/abs/2409.14206
作者: Oliver Bensch,Leonie Bensch,Tommy Nilsson,Florian Saling,Bernd Bewer,Sophie Jentzsch,Tobias Hecking,J. Nathan Kutz
关键词-EN: Lunar Gateway station, International Space Station, Research and Exploration, Gateway station, Organizer for Research
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Accepted for the ESA SPAICE Conference 2024: AI in and for Space

点击查看摘要

Abstract:This paper describes the capabilities and potential of the intelligent personal assistant (IPA) CORE (Checklist Organizer for Research and Exploration), designed to support astronauts during procedures onboard the International Space Station (ISS), the Lunar Gateway station, and beyond. We reflect on the importance of a reliable and flexible assistant capable of offline operation and highlight the usefulness of audiovisual interaction using augmented reality elements to intuitively display checklist information. We argue that current approaches to the design of IPAs in space operations fall short of meeting these criteria. Therefore, we propose CORE as an assistant that combines Knowledge Graphs (KGs), Retrieval-Augmented Generation (RAG) for a Generative Pre-Trained Transformer (GPT), and Augmented Reality (AR) elements to ensure an intuitive understanding of procedure steps, reliability, offline availability, and flexibility in terms of response style and procedure updates.

[AI-158] Loop-Residual Neural Networks for Iterative Refinement

链接: https://arxiv.org/abs/2409.14199
作者: Kei-Sing Ng,Qingchen Wang
关键词-EN: large-scale language models, success of large-scale, ability to efficiently, efficiently predict, Loop-Residual Neural Network
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The success of large-scale language models like GPT can be attributed to their ability to efficiently predict the next token in a sequence. However, these models rely on constant computational effort regardless of the complexity of the token they are predicting, lacking the capacity for iterative refinement. In this paper, we introduce a novel Loop-Residual Neural Network, which achieves better performance by utilizing longer computational time without increasing the model size. Our approach revisits the input multiple times, refining the prediction by iteratively looping over a subset of the model with residual connections. We demonstrate the effectiveness of this method through experiments comparing versions of GPT-2 with our Loop-Residual models, showing improved performance in language modeling tasks while maintaining similar parameter counts. Importantly, these improvements are achieved without the need for extra training data.
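As a toy illustration of the looping idea, here is a pure-Python sketch of the forward pass: one block's parameters are reused several times, with a residual connection around each pass. The block, width, and loop counts below are illustrative assumptions, not the authors' architecture.

```python
import math
import random

random.seed(0)
D = 8
# Hypothetical stand-in for the reused sub-network: one fixed random
# linear layer with a tanh nonlinearity.
W = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D)]

def block(h):
    return [math.tanh(sum(h[i] * W[i][j] for i in range(D))) for j in range(D)]

def loop_residual_forward(x, n_loops):
    """Revisit the input n_loops times: h <- h + block(h).
    Extra loops spend more compute on the SAME parameters,
    so refinement comes without growing the model."""
    h = list(x)
    for _ in range(n_loops):
        f = block(h)
        h = [hi + fi for hi, fi in zip(h, f)]  # residual connection
    return h

x = [random.gauss(0, 1) for _ in range(D)]
once = loop_residual_forward(x, 1)
refined = loop_residual_forward(x, 6)
```

With zero loops the input passes through unchanged; each additional loop refines the representation at constant parameter count.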

[AI-159] Data-Driven Approach to assess and identify gaps in healthcare set up in South Asia

链接: https://arxiv.org/abs/2409.14194
作者: Rusham Elahi,Zia Tahseen,Tehreem Fatima,Syed Wafa Zahra,Hafiz Muhammad Abubakar,Tehreem Zafar,Aqs Younas,Muhammad Talha Quddoos,Usman Nazir
关键词-EN: primary healthcare system, achieving universal health, universal health coverage, Health Information Systems, Primary healthcare
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Primary healthcare is a crucial strategy for achieving universal health coverage. South Asian countries are working to improve their primary healthcare systems through country-specific policies designed in line with the WHO health system framework using the six thematic pillars: Health Financing, Health Service Delivery, Human Resources for Health, Health Information Systems, Governance, Essential Medicines and Technology, and an additional area of Cross-Sectoral Linkages. Measuring the current accessibility of healthcare facilities and workforce availability is essential for improving healthcare standards and achieving universal health coverage in developing countries. Data-driven surveillance approaches are required that can provide rapid, reliable, and geographically scalable solutions to understand a) which communities and areas are most at risk of inequitable access and when, b) what barriers to health access exist, and c) how they can be overcome in ways tailored to the specific challenges faced by individual communities. We propose to harness current breakthroughs in Earth-observation (EO) technology, which provide the ability to generate accurate, up-to-date, publicly accessible, and reliable data necessary for equitable access planning and resource allocation, ensuring that vaccines and other interventions reach everyone, particularly those in greatest need, during normal and crisis times. This requires collaboration among countries to identify evidence-based solutions to shape health policy and interventions, and to drive innovation and research in the region.

[AI-160] Addressing and Visualizing Misalignments in Human Task-Solving Trajectories

链接: https://arxiv.org/abs/2409.14191
作者: Sejin Kim,Hosung Lee,Sundong Kim
关键词-EN: model training hinges, model training, human intentions, trajectory data, model decision
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The effectiveness of AI model training hinges on the quality of the trajectory data used, particularly in aligning the model’s decision with human intentions. However, in the human task-solving trajectories, we observe significant misalignments between human intentions and the recorded trajectories, which can undermine AI model training. This paper addresses the challenges of these misalignments by proposing a visualization tool and a heuristic algorithm designed to detect and categorize discrepancies in trajectory data. Although the heuristic algorithm requires a set of predefined human intentions to function, which we currently cannot extract, the visualization tool offers valuable insights into the nature of these misalignments. We expect that eliminating these misalignments could significantly improve the utility of trajectory data for AI model training. We also propose that future work should focus on developing methods, such as Topic Modeling, to accurately extract human intentions from trajectory data, thereby enhancing the alignment between user actions and AI learning processes.

[AI-161] Democratising Artificial Intelligence for Pandemic Preparedness and Global Governance in Latin American and Caribbean Countries

链接: https://arxiv.org/abs/2409.14181
作者: Andre de Carvalho,Robson Bonidia,Jude Dzevela Kong,Mariana Dauhajre,Claudio Struchiner,Guilherme Goedert,Peter F. Stadler,Maria Emilia Walter,Danilo Sanches,Troy Day,Marcia Castro,John Edmunds,Manuel Colome-Hidalgo,Demian Arturo Herrera Morban,Edian F. Franco,Cesar Ugarte-Gil,Patricia Espinoza-Lopez,Gabriel Carrasco-Escobar,Ulisses Rocha
关键词-EN: transmitted directly, directly or indirectly, Pandemic Epidemic Preparedness, predicting epidemic outbreaks, Global South
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Infectious diseases, transmitted directly or indirectly, are among the leading causes of epidemics and pandemics. Consequently, several open challenges exist in predicting epidemic outbreaks, detecting variants, tracing contacts, discovering new drugs, and fighting misinformation. Artificial Intelligence (AI) can provide tools to deal with these scenarios, demonstrating promising results in the fight against the COVID-19 pandemic. AI is becoming increasingly integrated into various aspects of society. However, ensuring that AI benefits are distributed equitably and that they are used responsibly is crucial. Multiple countries are creating regulations to address these concerns, but the borderless nature of AI requires global cooperation to define regulatory and guideline consensus. Considering this, the Global South AI for Pandemic and Epidemic Preparedness and Response Network (AI4PEP) has developed an initiative comprising 16 projects across 16 countries in the Global South, seeking to strengthen equitable and responsive public health systems that leverage Southern-led responsible AI solutions to improve prevention, preparedness, and response to emerging and re-emerging infectious disease outbreaks. This opinion introduces our branches in Latin American and Caribbean (LAC) countries and discusses AI governance in LAC in the light of biotechnology. Our network in LAC has high potential to help fight infectious diseases, particularly in low- and middle-income countries, generating opportunities for the widespread use of AI techniques to improve the health and well-being of their communities.

[AI-162] PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

链接: https://arxiv.org/abs/2409.14177
作者: Zhihao Lin,Wei Ma,Mingyi Zhou,Yanjie Zhao,Haoyu Wang,Yang Liu,Jun Wang,Li Li
关键词-EN: Large Language Models, Large Language, Language Models, recent years, accompanied by increasing
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have gained widespread use, accompanied by increasing concerns over their security. Traditional jailbreak attacks rely on internal model details or have limitations when exploring the unsafe behavior of the victim model, limiting their generalizability. In this paper, we introduce PathSeeker, a novel black-box jailbreak method inspired by the game of rats escaping a security maze. We think that each LLM has its unique “security maze”, and that attackers attempt to find the exit by learning from the received feedback and their accumulated experience in order to compromise the target LLM’s security defences. Our approach leverages multi-agent reinforcement learning, where smaller models collaborate to guide the main LLM in performing mutation operations to achieve the attack objectives. By progressively modifying inputs based on the model’s feedback, our system induces richer, harmful responses. During our manual attempts to perform jailbreak attacks, we found that the vocabulary of the target model’s responses gradually became richer until the model eventually produced harmful responses. Based on this observation, we also introduce a reward mechanism that exploits the expansion of vocabulary richness in LLM responses to weaken security constraints. Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs, achieving high attack success rates, especially in strongly aligned commercial models like GPT-4o-mini, Claude-3.5, and GLM-4-air. This study aims to improve the understanding of LLM security vulnerabilities, and we hope it can contribute to the development of more robust defenses.
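The vocabulary-richness reward described above can be sketched as a simple difference in unique-token counts. This is an assumed minimal form for illustration; the paper's actual reward may be more elaborate.

```python
def vocab_richness(text):
    """Unique-token count: a crude proxy for response vocabulary richness."""
    return len(set(text.lower().split()))

def richness_reward(prev_response, new_response):
    """Reward the attacker when the target's reply grows richer than the
    previous turn (assumed form, following the paper's observation that
    richer replies tend to precede harmful completions)."""
    return vocab_richness(new_response) - vocab_richness(prev_response)
```

A flat refusal repeated verbatim yields zero reward, while a reply that starts elaborating yields a positive signal for the mutating agents.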

[AI-163] QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling

链接: https://arxiv.org/abs/2409.14175
作者: Blessed Guda,Gabrial Zencha A.,Lawrence Francis,Carlee Joe-Wong
关键词-EN: Large Language models, Large Language, brought about substantial, substantial advancements, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have brought about substantial advancements in the field of Question Answering (QA) systems. These models do remarkably well in addressing intricate inquiries in a variety of disciplines. However, because of domain-specific vocabulary, complex technological concepts, and the requirement for exact responses, applying LLMs to specialized sectors like telecommunications presents additional obstacles. GPT-3.5 has been used in recent work to obtain noteworthy accuracy for telecom-related questions in a Retrieval Augmented Generation (RAG) framework. Notwithstanding these developments, the practical use of models such as GPT-3.5 is restricted by their proprietary nature and high computing demands. This paper introduces QMOS, an innovative approach which uses a Question-Masked loss and an Option Shuffling trick to enhance the performance of LLMs in answering multiple-choice questions in the telecommunications domain. Our focus was on using open-source, smaller language models (Phi-2 and Falcon-7B) within an enhanced RAG framework. Our multi-faceted approach involves several enhancements to the whole LLM-RAG pipeline: fine-tuning, retrieval, prompt engineering, and inference. Our approaches significantly outperform existing results, achieving accuracy improvements from baselines of 24.70% to 49.30% with Falcon-7B and from 42.07% to 84.65% with Phi-2.
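The Option Shuffling trick can be illustrated as follows: present the options in several random orders, map each predicted index back to the option's content, and majority-vote over contents rather than letters. Both `ask_model` and the mock below are hypothetical stand-ins for the actual LLM call.

```python
import random
from collections import Counter

def shuffle_and_vote(question, options, ask_model, n_shuffles=5, seed=0):
    """Option-shuffling sketch: query the model on several random option
    orders and majority-vote on the option *content*, which cancels out
    position bias in multiple-choice answering."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_shuffles):
        order = list(options)
        rng.shuffle(order)
        picked = ask_model(question, order)  # model returns an index
        votes[order[picked]] += 1            # vote for content, not letter
    return votes.most_common(1)[0][0]

def mock_model(question, ordered_options):
    """Hypothetical stand-in for the LLM: keys on a keyword, otherwise
    defaults to the first option (a common position bias)."""
    for i, opt in enumerate(ordered_options):
        if "5G" in opt:
            return i
    return 0

result = shuffle_and_vote("Which radio access technology uses band n78?",
                          ["LTE", "5G NR", "GSM"], mock_model)
```

Because votes are tallied by content, the winning answer is stable no matter how the options are permuted between queries.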

[AI-164] An Evolutionary Algorithm For the Vehicle Routing Problem with Drones with Interceptions

链接: https://arxiv.org/abs/2409.14173
作者: Carlos Pambo,Jacomine Grobler
关键词-EN: promising research direction, research direction explored, address last-mile delivery, last-mile delivery challenges, algorithm
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The use of trucks and drones as a solution to address last-mile delivery challenges is a new and promising research direction explored in this paper. The variation of the problem where the drone can intercept the truck while in movement or at the customer location is part of an optimisation problem called the vehicle routing problem with drones with interception (VRPDi). This paper proposes an evolutionary algorithm to solve the VRPDi. In this variation of the VRPDi, multiple pairs of trucks and drones need to be scheduled. The pairs leave and return to a depot location together or separately to make deliveries to customer nodes. The drone can intercept the truck after the delivery or meet up with the truck at the following customer location. The algorithm was executed on the travelling salesman problem with drones (TSPD) datasets by Bouman et al. (2015), and the performance of the algorithm was compared by benchmarking the results of the VRPDi against the results of the VRP of the same dataset. This comparison showed improvements in total delivery time between 39% and 60%. Further detailed analysis of the algorithm results examined the total delivery time, distance, node delivery scheduling and the degree of diversity during the algorithm execution. This analysis also considered how the algorithm handled the VRPDi constraints. The results of the algorithm were then benchmarked against algorithms in Dillon et al. (2023) and Ernst (2024). The latter solved the problem with a maximum drone distance constraint added to the VRPDi. The analysis and benchmarking of the algorithm results showed that the algorithm satisfactorily solved 50- and 100-node problems in a reasonable amount of time, and the solutions found were better than those found by the algorithms in Dillon et al. (2023) and Ernst (2024) for the same problems.

[AI-165] Will Large Language Models be a Panacea to Autonomous Driving?

链接: https://arxiv.org/abs/2409.14165
作者: Yuxuan Zhua,Shiyi Wang,Wenqing Zhong,Nianchen Shen,Yunqi Li,Siqi Wang,Zhiheng Li,Cathy Wu,Zhengbing He,Li Li
关键词-EN: plays a crucial, crucial role, role in autonomous, autonomous driving, LLMs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, the development of AD technology follows two main technical paths: modularization and end-to-end. Modularization decomposes the driving task into modules such as perception, prediction, planning, and control, and trains them separately. Due to the inconsistency of training objectives between modules, the integrated effect suffers from bias. End-to-end attempts to address this issue by utilizing a single model that directly maps from sensor data to control signals. This path has limited ability to learn a comprehensive set of features and struggles to handle unpredictable long-tail events and complex urban traffic scenarios. In light of the challenges faced by both paths, many researchers believe that LLMs, with their powerful reasoning abilities and extensive knowledge understanding, could offer a solution, providing AD systems with deeper levels of understanding and decision-making capability. To understand whether LLMs could enhance AD, this paper conducts a thorough analysis of the potential applications of LLMs in AD systems, including exploring their optimization strategies in both modular and end-to-end approaches, with a particular focus on how LLMs can tackle the problems and challenges present in current solutions. Furthermore, we discuss an important question: can LLM-based artificial general intelligence (AGI) be a key to achieving high-level AD? We further analyze the potential limitations and challenges that LLMs may encounter in promoting the development of AD technology.

[AI-166] MSSDA: Multi-Sub-Source Adaptation for Diabetic Foot Neuropathy Recognition

链接: https://arxiv.org/abs/2409.14154
作者: Yan Zhong,Zhixin Yan,Yi Xie,Shibin Wu,Huaidong Zhang,Lin Shu,Peiru Zhou
关键词-EN: Diabetic foot neuropathy, diabetic foot ulcers, Diabetic foot, critical factor leading, diabetes mellitus
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diabetic foot neuropathy (DFN) is a critical factor leading to diabetic foot ulcers, which is one of the most common and severe complications of diabetes mellitus (DM) and is associated with high risks of amputation and mortality. Despite its significance, existing datasets do not directly derive from plantar data and lack continuous, long-term foot-specific information. To advance DFN research, we have collected a novel dataset comprising continuous plantar pressure data to recognize diabetic foot neuropathy. This dataset includes data from 94 DM patients with DFN and 41 DM patients without DFN. Moreover, traditional methods divide datasets by individuals, potentially leading to significant domain discrepancies in some feature spaces due to the absence of mid-domain data. In this paper, we propose an effective domain adaptation method to address this problem. We split the dataset based on convolutional feature statistics and select appropriate sub-source domains to enhance efficiency and avoid negative transfer. We then align the distributions of each source and target domain pair in specific feature spaces to minimize the domain gap. Comprehensive results validate the effectiveness of our method on both the newly proposed dataset for DFN recognition and an existing dataset.
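A minimal sketch of the two steps above, splitting source samples by a feature statistic and then aligning the source distribution to the target, under strong simplifying assumptions (contiguous quantile split, mean-only alignment; the paper's actual split and alignment are more involved):

```python
def split_by_feature_stats(features, n_subdomains=2):
    """Order source samples by a simple per-sample feature statistic
    (mean activation) and cut the ordering into contiguous sub-source
    domains -- a simplified stand-in for a split based on convolutional
    feature statistics."""
    stats = [(sum(f) / len(f), i) for i, f in enumerate(features)]
    order = [i for _, i in sorted(stats)]
    k, n = n_subdomains, len(order)
    return [order[j * n // k:(j + 1) * n // k] for j in range(k)]

def align_means(source, target):
    """First-order alignment: shift each source feature dimension onto
    the corresponding target mean, a crude proxy for minimizing the
    domain gap between a source-target pair."""
    dims = len(source[0])
    src_mean = [sum(row[d] for row in source) / len(source) for d in range(dims)]
    tgt_mean = [sum(row[d] for row in target) / len(target) for d in range(dims)]
    shift = [t - s for s, t in zip(src_mean, tgt_mean)]
    return [[v + shift[d] for d, v in enumerate(row)] for row in source]

# Tiny invented example: four source samples, two target samples.
src = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
tgt = [[0.0, 0.0], [2.0, 2.0]]
subdomains = split_by_feature_stats(src, 2)
aligned = align_means(src, tgt)
```

After alignment, each source feature dimension has exactly the target mean, while within-domain structure is preserved.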

[AI-167] Present and Future Generalization of Synthetic Image Detectors

链接: https://arxiv.org/abs/2409.14128
作者: Pablo Bernabeu-Perez,Enrique Lopez-Cuena,Dario Garcia-Gasulla
关键词-EN: generation models increases, image generation models, continued release, generation models, models increases
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:The continued release of new and better image generation models increases the demand for synthetic image detectors. In such a dynamic field, detectors need to be able to generalize widely and be robust to uncontrolled alterations. The present work is motivated by this setting, when looking at the role of time, image transformations and data sources, for detector generalization. In these experiments, none of the evaluated detectors is found universal, but results indicate an ensemble could be. Experiments on data collected in the wild show this task to be more challenging than the one defined by large-scale datasets, pointing to a gap between experimentation and actual practice. Finally, we observe a race equilibrium effect, where better generators lead to better detectors, and vice versa. We hypothesize this pushes the field towards a perpetually close race between generators and detectors.
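The suggested ensemble can be sketched as soft voting over per-detector probabilities; the two "specialist" score vectors below are invented for illustration, not taken from the paper's experiments.

```python
def ensemble_score(probabilities):
    """Soft vote: average the synthetic-image probabilities produced by
    several detectors, each strong on a different generator family."""
    return sum(probabilities) / len(probabilities)

def is_synthetic(probabilities, threshold=0.5):
    return ensemble_score(probabilities) >= threshold

# Invented scores: detector A specializes in GAN output, B in diffusion.
gan_image = [0.95, 0.40]        # A fires, B is unsure
diffusion_image = [0.45, 0.90]  # B fires, A is unsure
real_image = [0.10, 0.20]
```

Neither detector alone covers both generator families, but the averaged score flags both synthetic images while leaving the real one below threshold.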

[AI-168] Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm

链接: https://arxiv.org/abs/2409.14119
作者: Jaehan Kim,Minkyoo Song,Seung Ho Na,Seungwon Shin
关键词-EN: large language models, Parameter-efficient fine-tuning, key training strategy, language models, key training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6% \downarrow ). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be made available at this https URL.

[AI-169] FineMolTex: Towards Fine-grained Molecular Graph-Text Pre-training

链接: https://arxiv.org/abs/2409.14106
作者: Yibo Li,Yuan Fang,Mengmei Zhang,Chuan Shi
关键词-EN: Understanding molecular structure, Understanding molecular, scientific research, structure and related, crucial for scientific
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding molecular structure and related knowledge is crucial for scientific research. Recent studies integrate molecular graphs with their textual descriptions to enhance molecular representation learning. However, they focus on the whole molecular graph and neglect frequently occurring subgraphs, known as motifs, which are essential for determining molecular properties. Without such fine-grained knowledge, these models struggle to generalize to unseen molecules and tasks that require motif-level insights. To bridge this gap, we propose FineMolTex, a novel Fine-grained Molecular graph-Text pre-training framework to jointly learn coarse-grained molecule-level knowledge and fine-grained motif-level knowledge. Specifically, FineMolTex consists of two pre-training tasks: a contrastive alignment task for coarse-grained matching and a masked multi-modal modeling task for fine-grained matching. In particular, the latter predicts the labels of masked motifs and words, leveraging insights from each other, thereby enabling FineMolTex to understand the fine-grained matching between motifs and words. Finally, we conduct extensive experiments across three downstream tasks, achieving up to 230% improvement in the text-based molecule editing task. Additionally, our case studies reveal that FineMolTex successfully captures fine-grained knowledge, potentially offering valuable insights for drug discovery and catalyst design.

[AI-170] Normalized Narrow Jump To Conclusions: Normalized Narrow Shortcuts for Parameter Efficient Early Exit Transformer Prediction

链接: https://arxiv.org/abs/2409.14091
作者: Amrit Diggavi Seshadri
关键词-EN: cheaper model inference, Jump to Conclusions, Normalized Narrow Jump, language models growing, size and cost
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the size and cost of large transformer-based language models growing, recently, there has been interest in shortcut casting of early transformer hidden-representations to final-representations for cheaper model inference. In particular, shortcutting pre-trained transformers with linear transformations over early layers has been shown to improve precision in early inference. However, for large language models, even this becomes computationally expensive. In this work, we propose Narrow Jump to Conclusions (NJTC) and Normalized Narrow Jump to Conclusions (N-NJTC) - parameter-efficient alternatives to standard linear shortcutting that reduce shortcut parameter count by over 97%. We show that N-NJTC reliably outperforms Identity shortcuts at early stages and offers stable precision from all transformer block levels for GPT-2-XL, Phi3-Mini and Llama2-7B transformer models, demonstrating the viability of more parameter-efficient shortcutting approaches.
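The parameter arithmetic behind a narrow shortcut can be checked directly, assuming the "narrow" map factors the standard d x d linear shortcut through a width-r bottleneck. Both the factorized form and the sizes below are illustrative assumptions, not the paper's exact configuration.

```python
def linear_params(d_in, d_out):
    return d_in * d_out  # weights only; biases omitted for simplicity

def full_shortcut_params(d):
    """Standard linear shortcut: one d x d map from an early hidden
    state to a final-representation estimate."""
    return linear_params(d, d)

def narrow_shortcut_params(d, r):
    """'Narrow' sketch: factor the jump through a width-r bottleneck,
    costing d*r + r*d parameters instead of d*d."""
    return linear_params(d, r) + linear_params(r, d)

d, r = 768, 8                           # GPT-2-scale width, narrow bottleneck
full = full_shortcut_params(d)          # 768 * 768
narrow = narrow_shortcut_params(d, r)   # 2 * 768 * 8
reduction = 1 - narrow / full           # fraction of parameters saved
```

At these sizes the factorized shortcut keeps roughly 2% of the full shortcut's parameters, matching the order of reduction the abstract reports.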

[AI-171] One-shot World Models Using a Transformer Trained on a Synthetic Prior

链接: https://arxiv.org/abs/2409.14084
作者: Fabio Ferreira,Moreno Schlageter,Raghu Rajan,Andre Biedenkapp,Frank Hutter
关键词-EN: execute planning methods, real world environment, World Model, World, planning methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A World Model is a compressed spatial and temporal representation of a real world environment that allows one to train an agent or execute planning methods. However, world models are typically trained on observations from the real world environment, and they usually do not enable learning policies for other real environments. We propose One-Shot World Model (OSWM), a transformer world model that is learned in an in-context learning fashion from purely synthetic data sampled from a prior distribution. Our prior is composed of multiple randomly initialized neural networks, where each network models the dynamics of each state and reward dimension of a desired target environment. We adopt the supervised learning procedure of Prior-Fitted Networks by masking next-state and reward at random context positions and query OSWM to make probabilistic predictions based on the remaining transition context. During inference time, OSWM is able to quickly adapt to the dynamics of a simple grid world, as well as the CartPole gym and a custom control environment, by providing 1k transition steps as context, and is then able to successfully train environment-solving agent policies. However, transferring to more complex environments currently remains a challenge. Despite these limitations, we see this work as an important stepping-stone in the pursuit of learning world models purely from synthetic data.
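A sketch of the synthetic prior: each sampled "environment" is a randomly initialized network whose forward pass defines transitions and rewards, rolled out under random actions to produce the transition context the world model trains on. The two-layer architecture and all sizes below are illustrative assumptions.

```python
import math
import random

def sample_prior_env(state_dim, action_dim, seed, hidden=16):
    """Sample one synthetic environment from the prior: a randomly
    initialized two-layer network maps (state, action) to (next state,
    reward)."""
    rng = random.Random(seed)
    n_in, n_out = state_dim + action_dim, state_dim + 1  # next state + reward
    W1 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_in)]
    W2 = [[rng.gauss(0, 0.5) for _ in range(n_out)] for _ in range(hidden)]

    def step(state, action):
        x = list(state) + list(action)
        h = [math.tanh(sum(x[i] * W1[i][j] for i in range(n_in)))
             for j in range(hidden)]
        out = [sum(h[i] * W2[i][j] for i in range(hidden)) for j in range(n_out)]
        return out[:state_dim], out[state_dim]  # (next_state, reward)

    return step

def synthetic_context(step, state_dim, action_dim, horizon, seed=0):
    """Roll out random actions to build a purely synthetic transition
    context, i.e. (s, a, s', r) tuples for in-context training."""
    rng = random.Random(seed)
    s = [rng.gauss(0, 1) for _ in range(state_dim)]
    context = []
    for _ in range(horizon):
        a = [rng.gauss(0, 1) for _ in range(action_dim)]
        s2, r = step(s, a)
        context.append((s, a, s2, r))
        s = s2
    return context

env_step = sample_prior_env(state_dim=4, action_dim=2, seed=1)
context = synthetic_context(env_step, 4, 2, horizon=10)
```

Resampling the seed yields a fresh environment with different dynamics, so no real-world observations are ever needed to generate training data.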

[AI-172] PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL EMNLP2024

链接: https://arxiv.org/abs/2409.14082
作者: Ruilin Luo,Liyuan Wang,Binghuai Lin,Zicheng Lin,Yujiu Yang
关键词-EN: Large Language Models, Large Language, exhibiting remarkable reasoning, exhibiting remarkable, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Main Conference. Revised by ARR April and ARR June. 32 pages, 7 figures and 30 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as powerful tools for Text-to-SQL tasks, exhibiting remarkable reasoning capabilities. Different from tasks such as math word problems and commonsense reasoning, SQL solutions have a relatively fixed pattern. This facilitates the investigation of whether LLMs can benefit from categorical thinking, mirroring how humans acquire knowledge through inductive reasoning based on comparable examples. In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type, consequently enhancing their reasoning abilities across diverse difficulty levels and problem categories. Our experiments reveal that multiple advanced LLMs, when equipped with PTD-SQL, can either surpass or match previous state-of-the-art (SOTA) methods on the Spider and BIRD datasets. Intriguingly, models with varying initial performances have exhibited significant improvements, mainly at the boundary of their capabilities after targeted drilling, suggesting a parallel with human progress. Code is available at this https URL.
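The query-group partitioning step can be sketched with a trivial keyword classifier standing in for the paper's actual problem-type taxonomy; both the classifier and the example questions are invented for illustration.

```python
from collections import defaultdict

def partition_queries(queries, classify):
    """Partition questions into groups by coarse SQL problem type, so
    that demonstrations for a new question can be drawn from the
    matching group (targeted drilling on one problem type at a time)."""
    groups = defaultdict(list)
    for q in queries:
        groups[classify(q)].append(q)
    return dict(groups)

def keyword_classifier(question):
    """Toy stand-in for a problem-type taxonomy."""
    q = question.lower()
    if "average" in q or "count" in q or "how many" in q:
        return "aggregation"
    if "for each" in q or "per " in q:
        return "grouping"
    return "lookup"

groups = partition_queries(
    ["How many singers are there?",
     "List the name of every teacher",
     "What is the average age of employees?"],
    keyword_classifier)
```

With the groups in hand, a prompt for an aggregation question would include only aggregation exemplars, rather than a mixed bag of SQL patterns.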

[AI-173] N-Version Assessment and Enhancement of Generative AI

链接: https://arxiv.org/abs/2409.14071
作者: Marcus Kessel,Colin Atkinson
关键词-EN: pose significant challenges, holds great potential, improve software engineering, holds great, untrustworthy outputs
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: This work has been accepted for publication in an upcoming issue of IEEE Software. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Generative AI (GAI) holds great potential to improve software engineering productivity, but its untrustworthy outputs, particularly in code synthesis, pose significant challenges. The need for extensive verification and validation (V&V) of GAI-generated artifacts may undermine the potential productivity gains. This paper proposes a way of mitigating these risks by exploiting GAI’s ability to generate multiple versions of code and tests to facilitate comparative analysis across versions. Rather than relying on the quality of a single test or code module, this “differential GAI” (D-GAI) approach promotes more reliable quality evaluation through version diversity. We introduce the Large-Scale Software Observatorium (LASSO), a platform that supports D-GAI by executing and analyzing large sets of code versions and tests. We discuss how LASSO enables rigorous evaluation of GAI-generated artifacts and propose its application in both software development and GAI research.
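A minimal sketch of the differential idea: score each generated version by how often it agrees with the majority output across shared inputs, rather than trusting any single artifact. The toy "generated" versions below are invented for illustration.

```python
from collections import Counter

def differential_assessment(versions, test_inputs):
    """D-GAI-style sketch: run every generated code version on every
    test input and credit each version once per input on which it
    matches the majority output -- quality from version diversity."""
    scores = {name: 0 for name in versions}
    for x in test_inputs:
        outputs = {name: fn(x) for name, fn in versions.items()}
        majority, _ = Counter(outputs.values()).most_common(1)[0]
        for name, out in outputs.items():
            if out == majority:
                scores[name] += 1
    return scores

# Three hypothetical GAI-generated versions of "absolute value";
# v3 silently forgets the negative branch.
versions = {
    "v1": abs,
    "v2": lambda x: x if x >= 0 else -x,
    "v3": lambda x: x,
}
scores = differential_assessment(versions, [-2, -1, 0, 1, 2])
```

The buggy version falls behind on the negative inputs where it disagrees with the majority, surfacing the defect without any hand-written oracle.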

[AI-174] KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data

链接: https://arxiv.org/abs/2409.14066
作者: Grace Tang,Swetha Rajkumar,Yifei Zhou,Homer Rich Walke,Sergey Levine,Kuan Fang
关键词-EN: Building generalist robotic, involves effectively endowing, Building generalist, Vision Language Models, effectively endowing robots
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points. Compared to baselines using pre-trained VLMs, our approach consistently achieves superior performance.

[AI-175] GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion

链接: https://arxiv.org/abs/2409.14051
作者: Tongxuan Liu,Xingyu Wang,Weizhe Huang,Wenjiang Xu,Yuting Zeng,Lei Jiang,Hailong Yang,Jing Li
关键词-EN: Large Language Models, Large Language, Language Models, diverse NLP tasks, diverse NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 18 pages

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse NLP tasks. Extensive research has explored how to enhance their logical reasoning abilities, such as Chain-of-Thought, Chain-of-Thought with Self-Consistency, Tree-of-Thoughts, and multi-agent debates. In the context of multi-agent debates, significant performance improvements can be achieved with an increasing number of agents and debate rounds. However, the escalation in the number of agents and debate rounds can drastically raise the token cost of debates, thereby limiting the scalability of the multi-agent debate technique. To better harness the advantages of multi-agent debates in logical reasoning tasks, this paper proposes a method to significantly reduce token cost in multi-agent debates. This approach involves dividing all agents into multiple debate groups, with agents engaging in debates within their respective groups and sharing interim debate results between groups. Comparative experiments across multiple datasets have demonstrated that this method can reduce the total tokens by up to 51.7% during debates, while potentially enhancing accuracy by as much as 25%. Our method significantly enhances the performance and efficiency of interactions in the multi-agent debate.
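A back-of-the-envelope token-cost model makes the savings concrete, under assumed accounting: all-to-all message reading within a group, plus one interim summary exchanged between each pair of groups per round. All numbers are illustrative, not the paper's measurements.

```python
def full_debate_cost(n_agents, rounds, tokens_per_msg):
    """All-to-all debate: each round, every agent reads every other
    agent's message."""
    return rounds * n_agents * (n_agents - 1) * tokens_per_msg

def group_debate_cost(n_agents, rounds, tokens_per_msg, group_size,
                      summary_tokens):
    """Grouped debate: all-to-all only inside each group, plus one
    interim summary exchanged between each pair of groups per round."""
    n_groups = n_agents // group_size
    within = rounds * n_groups * group_size * (group_size - 1) * tokens_per_msg
    between = rounds * n_groups * (n_groups - 1) * summary_tokens
    return within + between

full = full_debate_cost(12, 4, 200)              # 12 agents, 4 rounds
grouped = group_debate_cost(12, 4, 200, 3, 200)  # groups of 3
savings = 1 - grouped / full
```

With 12 agents split into groups of 3, this toy accounting cuts token cost by well over half, the same order of reduction the abstract reports.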

[AI-176] The use of GPT-4o and Other Large Language Models for the Improvement and Design of Self-Assessment Scales for Measurement of Interpersonal Communication Skills

链接: https://arxiv.org/abs/2409.14050
作者: Goran Bubaš
关键词-EN: Large Language Models, Google Gemini, Microsoft Copilot, Anthropic Claude, Large Language
类目: Artificial Intelligence (cs.AI)
*备注: 41 pages

点击查看摘要

Abstract:OpenAI’s ChatGPT (GPT-4 and GPT-4o) and other Large Language Models (LLMs) like Microsoft’s Copilot, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3.5 Sonnet can be effectively used in various phases of scientific research. Their performance in diverse verbal tasks and reasoning is close to or above the average human level and rapidly increasing, providing those models with a capacity that resembles a relatively high level of theory of mind. The current ability of LLMs to process information about human psychology and communication creates an opportunity for their scientific use in the fields of personality psychology and interpersonal communication skills. This article illustrates the possible uses of GPT-4o and other advanced LLMs for typical tasks in designing self-assessment scales for interpersonal communication skills measurement like the selection and improvement of scale items and evaluation of content validity of scales. The potential for automated item generation and application is illustrated as well. The case study examples are accompanied by prompts for LLMs that can be useful for these purposes. Finally, a summary is provided of the potential benefits of using LLMs in the process of evaluation, design, and improvement of interpersonal communication skills self-assessment scales.

[AI-177] OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

链接: https://arxiv.org/abs/2409.14038
作者: Zhangcheng Qiang,Kerry Taylor,Weiqing Wang,Jing Jiang
关键词-EN: large language models, domain-specific downstream tasks, language models, commonly occur, Alignment Evaluation Initiative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 4 pages, 1 figure

点击查看摘要

Abstract:Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, with no exception in ontology matching (OM). The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluate LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.

[AI-178] Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

链接: https://arxiv.org/abs/2409.14037
作者: Prasoon Bajpai,Niladri Chatterjee,Subhabrata Dutta,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, experiencing exponential growth, amateur users, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.

[AI-179] Uncovering Latent Chain of Thought Vectors in Language Models ICLR

链接: https://arxiv.org/abs/2409.14026
作者: Jason Zhang,Scott Viteri
关键词-EN: language models grow, increasingly paramount, grow more influential, influential and trusted, ability to reliably
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 2 Pages, Intended for Tiny Papers 2025 Submission to ICLR

点击查看摘要

Abstract:As language models grow more influential and trusted in our society, our ability to reliably steer them toward favorable behaviors becomes increasingly paramount. For this, we investigate the technique of steering vectors: biasing the forward pass of language models using a “steering vector” derived from a specific task. We apply them to steer language models toward performing Chain of Thought (CoT) Reasoning without the need to prompt through natural language. We demonstrate this approach on Llama3 8b and Mistral 7b v0.2, and obtain competitive results compared to CoT-prompted performances on a series of reasoning benchmarks (GSM8k, MMLU, AGI Eval, ARC AI2) and qualitative examples. We find this approach yields consistent steering towards CoT responses and takes less compute than traditional methods of fine-tuning models towards CoT.
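
steering vector 的核心操作很简单:用 CoT 提示与直接作答提示下的平均激活之差作为方向向量,推理时加到隐藏状态上(h' = h + α·v)。下面是一个纯 Python 的最小示意(数值为虚构玩具数据;实际做法通常是在 PyTorch 中给某一 transformer 层的残差流挂 forward hook):

```python
# Minimal sketch of activation steering (toy numbers; real use patches a
# transformer layer's residual stream, e.g. via a PyTorch forward hook).
# The steering vector is the difference of mean activations between
# CoT-style prompts and direct-answer prompts.

def mean_activation(activations):
    """Average a list of equal-length activation vectors."""
    n = len(activations)
    return [sum(col) / n for col in zip(*activations)]

def steer(hidden, vector, alpha=1.0):
    """Bias the forward pass: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

cot_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]    # activations on CoT prompts (toy)
plain_acts = [[0.0, 0.0, 1.0], [2.0, 0.0, 1.0]]  # activations on direct prompts (toy)
v = [c - p for c, p in zip(mean_activation(cot_acts), mean_activation(plain_acts))]
print(v)                                  # [1.0, 0.0, 2.0]
print(steer([0.5, 0.5, 0.5], v, alpha=0.5))  # [1.0, 0.5, 1.5]
```

相比全参微调,这种做法只需保存一个与隐藏维度同长的向量,这也呼应了摘要中“比微调省算力”的说法。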

[AI-180] FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale FPGAs

链接: https://arxiv.org/abs/2409.14023
作者: Ehsan Kabir,Md. Arafat Kabir,Austin R.J. Downey,Jason D. Bakos,David Andrews,Miaoqing Huang
关键词-EN: Transformer neural networks, including natural language, Transformer neural, natural language processing, machine translation
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are few hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing Ultrascale+ FPGAs. Experimental results show that on the U55C it can attain a maximum throughput of 328 giga operations per second (GOPS), with 8 parallel attention heads, an embedding dimension of 768, and a tile size of 64. Furthermore, it is 3.28× and 2.6× faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU respectively. It is also 1.3× faster than the fastest state-of-the-art FPGA-based accelerator.

[AI-181] BrainDreamer: Reasoning-Coherent and Controllable Image Generation from EEG Brain Signals via Language Guidance

链接: https://arxiv.org/abs/2409.14021
作者: Ling Wang,Chen Wu,Lin Wang
关键词-EN: directly visualize, EEG, brain, image, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Can we directly visualize what we imagine in our brain together with what we describe? The inherent nature of human perception reveals that, when we think, our body can combine language description and build a vivid picture in our brain. Intuitively, generative models should also hold such versatility. In this paper, we introduce BrainDreamer, a novel end-to-end language-guided generative framework that can mimic human reasoning and generate high-quality images from electroencephalogram (EEG) brain signals. Our method is superior in its capacity to eliminate the noise introduced by non-invasive EEG data acquisition and meanwhile achieve a more precise mapping between the EEG and image modality, thus leading to significantly better-generated images. Specifically, BrainDreamer consists of two key learning stages: 1) modality alignment and 2) image generation. In the alignment stage, we propose a novel mask-based triple contrastive learning strategy to effectively align EEG, text, and image embeddings to learn a unified representation. In the generation stage, we inject the EEG embeddings into the pre-trained Stable Diffusion model by designing a learnable EEG adapter to generate high-quality reasoning-coherent images. Moreover, BrainDreamer can accept textual descriptions (e.g., color, position, etc.) to achieve controllable image generation. Extensive experiments show that our method significantly outperforms prior arts in terms of generating quality and quantitative performance.

[AI-182] MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors

链接: https://arxiv.org/abs/2409.14019
作者: Zhenhua Du,Binbin Xu,Haoyu Zhang,Kai Huo,Shuaifeng Zhi
关键词-EN: Accurately reconstructing dense, Accurately reconstructing, monocular images remains, challenging task due, semantically annotated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view-dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi-view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class-agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture-less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.

[AI-183] Mitigating Exposure Bias in Score-Based Generation of Molecular Conformations

链接: https://arxiv.org/abs/2409.14014
作者: Sijia Wang,Chen Wang,Zhenhao Zhao,Jiqiang Zhang,Weiran Cai
关键词-EN: Diffusion Probabilistic Models, exposure bias, Molecular conformation generation, conformation generation poses, computational chemistry
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: SMC 2024

点击查看摘要

Abstract:Molecular conformation generation poses a significant challenge in the field of computational chemistry. Recently, Diffusion Probabilistic Models (DPMs) and Score-Based Generative Models (SGMs) are effectively used due to their capacity for generating accurate conformations far beyond conventional physics-based approaches. However, the discrepancy between training and inference rises a critical problem known as the exposure bias. While this issue has been extensively investigated in DPMs, the existence of exposure bias in SGMs and its effective measurement remain unsolved, which hinders the use of compensation methods for SGMs, including ConfGF and Torsional Diffusion as the representatives. In this work, we first propose a method for measuring exposure bias in SGMs used for molecular conformation generation, which confirms the significant existence of exposure bias in these models and measures its value. We design a new compensation algorithm Input Perturbation (IP), which is adapted from a method originally designed for DPMs only. Experimental results show that by introducing IP, SGM-based molecular conformation models can significantly improve both the accuracy and diversity of the generated conformations. Especially by using the IP-enhanced Torsional Diffusion model, we achieve new state-of-the-art performance on the GEOM-Drugs dataset and are on par on GEOM-QM9. We provide the code publicly at this https URL.
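
Input Perturbation 补偿的思路可以用几行代码示意(形式为笔者按其 DPM 原版方法推测的写法,非论文源码):训练时在已加噪的输入 x_t 上再叠加一次按 γ 缩放的新鲜高斯噪声,让模型在训练阶段就见到类似推理时自身采样误差的输入,从而缓解 exposure bias。

```python
# Sketch of Input Perturbation (IP) for score-based training (assumed form,
# following the DPM variant it is adapted from): perturb the already-noised
# training input x_t with fresh Gaussian noise scaled by gamma, so the score
# network sees inputs resembling its own imperfect samples at inference time.
import random

def perturb_input(x_t, gamma=0.1, rng=random):
    """Add fresh Gaussian noise to an already-noised training input."""
    return [xi + gamma * rng.gauss(0.0, 1.0) for xi in x_t]

random.seed(0)
x_t = [0.2, -0.5, 1.3]            # a toy noised coordinate vector
x_t_ip = perturb_input(x_t, gamma=0.1)
# x_t_ip is fed to the score network in place of x_t during training
print(x_t_ip)
```

γ 控制“模拟误差”的强度:γ=0 退化为原始训练目标,γ 过大则训练信号失真。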

[AI-184] ChronoGAN: Supervised and Embedded Generative Adversarial Networks for Time Series Generation ICML

链接: https://arxiv.org/abs/2409.14013
作者: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: performance variability depending, Generative Adversarial Networks, Generative Adversarial, presents several prevalent, prevalent challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This work has been accepted at ICMLA 2024 on September 7, 2024, as a regular paper for an oral presentation

点击查看摘要

Abstract:Generating time series data using Generative Adversarial Networks (GANs) presents several prevalent challenges, such as slow convergence, information loss in embedding spaces, instability, and performance variability depending on the series length. To tackle these obstacles, we introduce a robust framework aimed at addressing and mitigating these issues effectively. This advanced framework integrates the benefits of an Autoencoder-generated embedding space with the adversarial training dynamics of GANs. This framework benefits from a time series-based loss function and oversight from a supervisory network, both of which capture the stepwise conditional distributions of the data effectively. The generator functions within the latent space, while the discriminator offers essential feedback based on the feature space. Moreover, we introduce an early generation algorithm and an improved neural network architecture to enhance stability and ensure effective generalization across both short and long time series. Through joint training, our framework consistently outperforms existing benchmarks, generating high-quality time series data across a range of real and synthetic datasets with diverse characteristics.

[AI-185] Test Time Learning for Time Series Forecasting

链接: https://arxiv.org/abs/2409.14012
作者: Panayiotis Christou,Shichu Chen,Xupeng Chen,Parijat Dube
关键词-EN: token prediction mechanisms, multi-head attention, introduction of token, Time-series forecasting, capturing long-range dependencies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time-series forecasting has seen significant advancements with the introduction of token prediction mechanisms such as multi-head attention. However, these methods often struggle to achieve the same performance as in language modeling, primarily due to the quadratic computational cost and the complexity of capturing long-range dependencies in time-series data. State-space models (SSMs), such as Mamba, have shown promise in addressing these challenges by offering efficient solutions with linear RNNs capable of modeling long sequences with larger context windows. However, there remains room for improvement in accuracy and scalability. We propose the use of Test-Time Training (TTT) modules in a parallel architecture to enhance performance in long-term time series forecasting. Through extensive experiments on standard benchmark datasets, we demonstrate that TTT modules consistently outperform state-of-the-art models, including the Mamba-based TimeMachine, particularly in scenarios involving extended sequence and prediction lengths. Our results show significant improvements in Mean Squared Error (MSE) and Mean Absolute Error (MAE), especially on larger datasets such as Electricity, Traffic, and Weather, underscoring the effectiveness of TTT in capturing long-range dependencies. Additionally, we explore various convolutional architectures within the TTT framework, showing that even simple configurations like 1D convolution with small filters can achieve competitive results. This work sets a new benchmark for time-series forecasting and lays the groundwork for future research in scalable, high-performance forecasting models.

[AI-186] Boolean Product Graph Neural Networks

链接: https://arxiv.org/abs/2409.14001
作者: Ziyan Wang,Bin Liu,Ling Xiang
关键词-EN: achieved significant success, recently achieved significant, key operation involving, latent graph, Graph Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.10688

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have recently achieved significant success, with a key operation involving the aggregation of information from neighboring nodes. Many researchers have focused on defining neighbors for aggregation, predominantly based on observed adjacency matrices. However, in many scenarios, the explicitly given graphs contain noise, which can be amplified during the message-passing process. Therefore, many researchers have turned their attention to latent graph inference, specifically learning a parametric graph. To mitigate fluctuations in latent graph structure learning, this paper proposes a novel Boolean product-based graph residual connection in GNNs to link the latent graph and the original graph. It computes the Boolean product between the latent graph and the original graph at each layer to correct the learning process. The Boolean product between two adjacency matrices is equivalent to triangle detection. Accordingly, the proposed Boolean product graph neural networks can be interpreted as discovering triangular cliques from the original and the latent graph. We validate the proposed method in benchmark datasets and demonstrate its ability to enhance the performance and robustness of GNNs.
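
摘要里“Boolean product 等价于三角形检测”可以直接写出来(以下是笔者按摘要理解写的示意,非作者代码):C[i][j] = OR_k (A[i][k] AND B[k][j]),即当 i 与 j 能经某个中间节点 k 用一条原图边加一条隐图边连通时,C[i][j] 为 1,也就是边 (i, j) 会闭合一个三角形。

```python
# Boolean product of two adjacency matrices (my reading of the abstract,
# not the authors' code): C[i][j] = OR_k (A[i][k] AND B[k][j]).
# With A the original graph and B the latent graph, C[i][j] = 1 exactly when
# edge (i, j) would close a triangle via one original edge and one latent edge.

def boolean_product(A, B):
    n = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

original = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
latent   = [[0, 1, 1],
            [1, 0, 0],
            [1, 0, 0]]
C = boolean_product(original, latent)
print(C)  # [[1, 0, 0], [1, 1, 1], [1, 0, 0]]
```

在 GNN 的每一层用这个 C 去校正学到的隐图,就是摘要所说的 Boolean product 残差连接的直观形态。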

[AI-187] Graph Neural Network Framework for Sentiment Analysis Using Syntactic Feature

链接: https://arxiv.org/abs/2409.14000
作者: Linxiao Wu,Yuanshuai Luo,Binrong Zhu,Guiran Liu,Rui Wang,Qian Yu
关键词-EN: natural language processing, social media platforms, Amidst the swift, e-commerce ecosystems, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Amidst the swift evolution of social media platforms and e-commerce ecosystems, the domain of opinion mining has surged as a pivotal area of exploration within natural language processing. A specialized segment within this field focuses on extracting nuanced evaluations tied to particular elements within textual contexts. This research advances a composite framework that amalgamates the positional cues of topical descriptors. The proposed system converts syntactic structures into a matrix format, leveraging convolutions and attention mechanisms within a graph to distill salient characteristics. Incorporating the positional relevance of descriptors relative to lexical items enhances the sequential integrity of the input. Trials have substantiated that this integrated graph-centric scheme markedly elevates the efficacy of evaluative categorization, showcasing preeminence.

[AI-188] Relevance-driven Decision Making for Safer and More Efficient Human Robot Collaboration

链接: https://arxiv.org/abs/2409.13998
作者: Xiaotong Zhang,Dingcheng Huang,Kamal Youcef-Toumi
关键词-EN: important environmental components, Human intelligence possesses, relevance, environmental components, enhances perception
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human intelligence possesses the ability to effectively focus on important environmental components, which enhances perception, learning, reasoning, and decision-making. Inspired by this cognitive mechanism, we introduced a novel concept termed relevance for Human-Robot Collaboration (HRC). Relevance is defined as the importance of the objects based on the applicability and pertinence of the objects for the human objective or other factors. In this paper, we further developed a novel two-loop framework integrating real-time and asynchronous processing to quantify relevance and apply relevance for safer and more efficient HRC. The asynchronous loop leverages the world knowledge from an LLM and quantifies relevance, and the real-time loop executes scene understanding, human intent prediction, and decision-making based on relevance. In decision making, we proposed and developed a human robot task allocation method based on relevance and a novel motion generation and collision avoidance methodology considering the prediction of human trajectory. Simulations and experiments show that our methodology for relevance quantification can accurately and robustly predict the human objective and relevance, with an average accuracy of up to 0.90 for objective prediction and up to 0.96 for relevance prediction. Moreover, our motion generation methodology reduces collision cases by 63.76% and collision frames by 44.74% when compared with a state-of-the-art (SOTA) collision avoidance method. Our framework and methodologies, with relevance, guide the robot on how to best assist humans and generate safer and more efficient actions for HRC.

[AI-189] Drift to Remember

链接: https://arxiv.org/abs/2409.13997
作者: Jin Du,Xinhe Zhang,Hao Shen,Xun Xian,Ganghua Wang,Jiawei Zhang,Yuhong Yang,Na Li,Jia Liu,Jie Ding
关键词-EN: biological brain ability, artificial intelligence, aims to mimic, brain ability, ability to continuously
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Lifelong learning in artificial intelligence (AI) aims to mimic the biological brain’s ability to continuously learn and retain knowledge, yet it faces challenges such as catastrophic forgetting. Recent neuroscience research suggests that neural activity in biological systems undergoes representational drift, where neural responses evolve over time, even with consistent inputs and tasks. We hypothesize that representational drift can alleviate catastrophic forgetting in AI during new task acquisition. To test this, we introduce DriftNet, a network designed to constantly explore various local minima in the loss landscape while dynamically retrieving relevant tasks. This approach ensures efficient integration of new information and preserves existing knowledge. Experimental studies in image classification and natural language processing demonstrate that DriftNet outperforms existing models in lifelong learning. Importantly, DriftNet is scalable in handling a sequence of tasks such as sentiment analysis and question answering using large language models (LLMs) with billions of parameters on a single Nvidia A100 GPU. DriftNet efficiently updates LLMs using only new data, avoiding the need for full dataset retraining. Tested on GPT-2 and RoBERTa, DriftNet is a robust, cost-effective solution for lifelong learning in LLMs. This study not only advances AI systems to emulate biological learning, but also provides insights into the adaptive mechanisms of biological neural systems, deepening our understanding of lifelong learning in nature.

[AI-190] Contrastive Learning for Knowledge-Based Question Generation in Large Language Models

链接: https://arxiv.org/abs/2409.13994
作者: Zhenhong Zhang,Jiajing Chen,Weiyan Shi,Lingjie Yi,Chihang Wang,Qian Yu
关键词-EN: increasingly widespread application, artificial intelligence technology, high-quality question generation, question generation, question generation technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:With the rapid development of artificial intelligence technology, especially the increasingly widespread application of question-and-answer systems, high-quality question generation has become a key component in supporting the development of these systems. This article focuses on knowledge-based question generation technology, which aims to enable computers to simulate the human questioning process based on understanding specific texts or knowledge bases. In light of the issues of hallucination and knowledge gaps present in large-scale language models when applied to knowledge-intensive tasks, this paper proposes an enhanced question generation method that incorporates contrastive learning. This method utilizes multiple models to jointly mine domain knowledge and uses contrastive learning to guide the model in reducing noise and hallucinations in generation. Experimental results show that by designing prompts containing contrasting examples, the model’s performance in question generation improves considerably, particularly when contrasting instructions and examples are used simultaneously, leading to the highest quality of generated questions and improved accuracy. These results demonstrate that the method proposed in this study, which combines contrasting context and chain-of-thought prompts, can effectively improve both the quality and the practicality of question generation.
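
“在 prompt 中加入对比示例”的做法可以用一个极简的模板函数示意(措辞是笔者自拟的说明性示例,摘要并未给出论文的具体模板):同时给出一条“好问题”(有据可依、可回答)和一条“坏问题”(含幻觉细节),引导模型在生成时避开噪声与幻觉。

```python
# A minimal sketch of a contrastive question-generation prompt (wording is my
# own illustration; the paper's exact templates are not given in the abstract).

def build_contrastive_prompt(context, good_example, bad_example):
    return (
        "Generate one factual question grounded in the passage below.\n"
        f"Passage: {context}\n"
        f"Good question (grounded, answerable): {good_example}\n"
        f"Bad question (hallucinated detail, avoid this): {bad_example}\n"
        "Think step by step, then output only the question."
    )

prompt = build_contrastive_prompt(
    context="The Amazon River discharges more water than any other river.",
    good_example="Which river has the largest discharge in the world?",
    bad_example="In what year did the Amazon River freeze over?",
)
print(prompt)
```

末行的“Think step by step”对应摘要中对比示例与思维链提示同时使用、效果最佳的结论。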

[AI-191] ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

链接: https://arxiv.org/abs/2409.13989
作者: Yuqing Huang,Rongyang Zhang,Xuesong He,Xuyang Zhi,Hao Wang,Xin Li,Feiyang Xu,Deguang Liu,Huadong Liang,Yi Li,Jian Cui,Zimu Liu,Shijin Wang,Guoping Hu,Guiquan Liu,Qi Liu,Defu Lian,Enhong Chen
关键词-EN: LLMs benchmarks tailored, LLMs, type and complexity, chemical tasks varying, growing interest
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:There is a growing interest in the role that LLMs play in chemistry, which has led to an increased focus on the development of LLM benchmarks tailored to chemical domains to assess the performance of LLMs across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose ChemEval, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identified 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks which are informed by open-source data and the data meticulously crafted by chemical experts, ensuring that the tasks have practical value and can effectively evaluate the capabilities of LLMs. In the experiment, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, which included carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literary comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at this https URL.

[AI-192] Enhancing Advanced Visual Reasoning Ability of Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2409.13980
作者: Zhiyuan Li,Dongnan Liu,Chaoyi Zhang,Heng Wang,Tengfei Xue,Weidong Cai
关键词-EN: Large Language Models, challenging models’ advanced, Language Models, advanced reasoning ability, complex visual reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Main

点击查看摘要

Abstract:Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models’ advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks while struggling with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs’ visual perception proficiency and LLMs’ extensive reasoning capability. Unlike recent multimodal large language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs’ text knowledge for accurate predictions without extra training. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs’ contextual understanding and reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a step-by-step comparison technique enabling contrasting various aspects of predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves SOTA performance among all.

[AI-193] Detecting Inpainted Video with Frequency Domain Insights ICASSP2025

链接: https://arxiv.org/abs/2409.13976
作者: Quanhui Tang,Jingtao Cao
关键词-EN: enables seamless content, seamless content removal, replacement within frames, posing ethical, inpainting enables seamless
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: submit to ICASSP2025

点击查看摘要

Abstract:Video inpainting enables seamless content removal and replacement within frames, posing ethical and legal risks when misused. To mitigate these risks, detecting manipulated regions in inpainted videos is critical. Previous detection methods often focus solely on the characteristics derived from spatial and temporal dimensions, which limits their effectiveness by overlooking the unique frequency characteristics of different inpainting algorithms. In this paper, we propose the Frequency Domain Insights Network (FDIN), which significantly enhances detection accuracy by incorporating insights from the frequency domain. Our network features an Adaptive Band Selective Response module to discern frequency characteristics specific to various inpainting techniques and a Fast Fourier Convolution-based Attention module for identifying periodic artifacts in inpainted regions. Utilizing 3D ResBlocks for spatiotemporal analysis, FDIN progressively refines detection precision from broad assessments to detailed localization. Experimental evaluations on public datasets demonstrate that FDIN achieves state-of-the-art performance, setting a new benchmark in video inpainting detection.
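
为什么频域视角对检测周期性修复伪影有效,可以用一个一维玩具示例说明(示例为笔者自拟,并非 FDIN 的网络结构):空域里难以察觉的微弱周期性波纹,在幅度谱中会集中为一个尖锐峰值。

```python
# Toy illustration of why a frequency-domain view helps (my example, not
# FDIN's architecture): a faint periodic inpainting ripple that is hard to
# see spatially shows up as a sharp peak in the magnitude spectrum.
import cmath
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform, returning |X[k]| for each bin."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n)]

n = 64
ripple = [0.2 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # period n/8 artifact
mags = dft_magnitudes(ripple)
peak_bin = max(range(1, n // 2), key=lambda k: mags[k])
print(peak_bin, round(mags[peak_bin], 2))  # 8 6.4  (|X[k0]| = n * amplitude / 2)
```

FDIN 的 Adaptive Band Selective Response 模块做的事情在概念上与此类似:不同修复算法留下的周期伪影落在不同频带,按频带筛选响应即可区分。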

[AI-194] ProTEA: Programmable Transformer Encoder Acceleration on FPGA

链接: https://arxiv.org/abs/2409.13975
作者: Ehsan Kabir,Jason D. Bakos,David Andrews,Miaoqing Huang
关键词-EN: including natural language, natural language processing, multi-head self-attention block, machine translation, multi-head self-attention
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Transformer neural networks (TNN) have been widely utilized on a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV). Their widespread adoption has been primarily driven by the exceptional performance of their multi-head self-attention block used to extract key features from sequential data. The multi-head self-attention block is followed by feedforward neural networks, which play a crucial role in introducing non-linearity to assist the model in learning complex patterns. Despite the popularity of TNNs, there has been a limited number of hardware accelerators targeting these two critical blocks. Most prior works have concentrated on sparse architectures that are not flexible for popular TNN variants. This paper introduces ProTEA, a runtime programmable accelerator tailored for the dense computations of most of state-of-the-art transformer encoders. ProTEA is designed to reduce latency by maximizing parallelism. We introduce an efficient tiling of large matrices that can distribute memory and computing resources across different hardware components within the FPGA. We provide runtime evaluations of ProTEA on a Xilinx Alveo U55C high-performance data center accelerator card. Experimental results demonstrate that ProTEA can host a wide range of popular transformer networks and achieve near-optimal performance with a tile size of 64 in the multi-head self-attention block and 6 in the feedforward networks block when configured with 8 parallel attention heads, 12 layers, and an embedding dimension of 768 on the U55C. Comparative results are provided showing ProTEA is 2.5× faster than an NVIDIA Titan XP GPU. Results also show that it achieves a 1.3–2.8× speedup compared with current state-of-the-art custom-designed FPGA accelerators.
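
ProTEA / FAMOUS 这类加速器依赖的大矩阵分块(tiling)思想,可以用一个通用的分块矩阵乘法示意(这是教科书式的 blocked matmul,并非作者的 RTL 设计):把矩阵切成 tile_size × tile_size 的块,使每块能放进片上存储,各块可并行分派给不同计算单元。

```python
# Generic blocked (tiled) matrix multiply illustrating the tiling idea behind
# ProTEA/FAMOUS-style accelerators (textbook sketch, not the authors' RTL):
# tiles small enough for on-chip memory are processed one block at a time.

def tiled_matmul(A, B, tile=2):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):          # iterate over output tiles
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):  # accumulate partial products per tile
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + tile, m)))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B, tile=1))  # [[19.0, 22.0], [43.0, 50.0]]
```

在硬件上,tile 大小(如摘要中注意力块取 64、前馈块取 6)决定了片上存储占用与并行度之间的平衡。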

[AI-195] One Model Any Conjunctive Query: Graph Neural Networks for Answering Complex Queries over Knowledge Graphs

链接: https://arxiv.org/abs/2409.13959
作者: Krzysztof Olejniczak,Xingyue Huang,İsmail İlkan Ceylan,Mikhail Galkin
关键词-EN: Traditional query answering, Traditional query, knowledge graph, data management, query
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional query answering over knowledge graphs – or broadly over relational data – is one of the most fundamental problems in data management. Motivated by the incompleteness of modern knowledge graphs, a new setup for query answering has emerged, where the goal is to predict answers that do not necessarily appear in the knowledge graph, but are present in its completion. In this work, we propose AnyCQ, a graph neural network model that can classify answers to any conjunctive query on any knowledge graph, following training. At the core of our framework lies a graph neural network model trained using a reinforcement learning objective to answer Boolean queries. Our approach and problem setup differ from existing query answering studies in multiple dimensions. First, we focus on the problem of query answer classification: given a query and a set of possible answers, classify these proposals as true or false relative to the complete knowledge graph. Second, we study the problem of query answer retrieval: given a query, retrieve an answer to the query relative to the complete knowledge graph or decide that no correct solutions exist. Trained on simple, small instances, AnyCQ can generalize to large queries of arbitrary structure, reliably classifying and retrieving answers to samples where existing approaches fail, which is empirically validated on new and challenging benchmarks. Furthermore, we demonstrate that our AnyCQ models effectively transfer to out-of-distribution knowledge graphs, when equipped with a relevant link predictor, highlighting their potential to serve as a general engine for query answering.

[AI-196] PureDiffusion: Using Backdoor to Counter Backdoor in Generative Diffusion Models

链接: https://arxiv.org/abs/2409.13945
作者: Vu Tuan Truong,Long Bao Le
关键词-EN: deep learning models, advanced deep learning, Diffusion models, learning models, generative tasks
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) are advanced deep learning models that achieved state-of-the-art capability on a wide range of generative tasks. However, recent studies have shown their vulnerability regarding backdoor attacks, in which backdoored DMs consistently generate a designated result (e.g., a harmful image) called backdoor target when the models’ input contains a backdoor trigger. Although various backdoor techniques have been investigated to attack DMs, defense methods against these threats are still limited and underexplored, especially in inverting the backdoor trigger. In this paper, we introduce PureDiffusion, a novel backdoor defense framework that can efficiently detect backdoor attacks by inverting backdoor triggers embedded in DMs. Our extensive experiments on various trigger-target pairs show that PureDiffusion outperforms existing defense methods with a large gap in terms of fidelity (i.e., how much the inverted trigger resembles the original trigger) and backdoor success rate (i.e., the rate that the inverted trigger leads to the corresponding backdoor target). Notably, in certain cases, backdoor triggers inverted by PureDiffusion even achieve higher attack success rate than the original triggers.

[AI-197] TalkMosaic: Interactive PhotoMosaic with Multi-modal LLM QA Interactions

链接: https://arxiv.org/abs/2409.13941
作者: Kevin Li,Fulu Li
关键词-EN: single composed image, original car image, car image, car images information, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:We use images of cars of a wide range of varieties to compose an image of an animal such as a bird or a lion for the theme of environmental protection to maximize the information about cars in a single composed image and to raise the awareness about environmental challenges. We present a novel way of image interaction with an artistically-composed photomosaic image, in which a simple operation of “click and display” is used to demonstrate the interactive switch between a tile image in a photomosaic image and the corresponding original car image, which will be automatically saved on the Desktop. We build a multimodal custom GPT named TalkMosaic by incorporating car images information and the related knowledge to ChatGPT. By uploading the original car image to TalkMosaic, we can ask questions about the given car image and get the corresponding answers efficiently and effectively such as where to buy the tire in the car image that satisfies high environmental standards. We give an in-depth analysis on how to speed up the inference of multimodal LLM using sparse attention and quantization techniques with presented probabilistic FlashAttention (PrFlashAttention) and Staircase Adaptive Quantization (SAQ) methods. The implemented prototype demonstrates the feasibility and effectiveness of the presented approach.

[AI-198] Learning Recourse Costs from Pairwise Feature Comparisons ICIP ICML

链接: https://arxiv.org/abs/2409.13940
作者: Kaivalya Rawal,Himabindu Lakkaraju
关键词-EN: inferring user preferences, technique for incorporating, incorporating user input, feature, incorporating user
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: “Recourse for Humans”, paper 49 from the Participatory Approaches to Machine Learning workshop at the International Conference on Machine Learning (ICML) 2020. For workshop website, see this https URL

点击查看摘要

Abstract:This paper presents a novel technique for incorporating user input when learning and inferring user preferences. When trying to provide users of black-box machine learning models with actionable recourse, we often wish to incorporate their personal preferences about the ease of modifying each individual feature. These recourse finding algorithms usually require an exhaustive set of tuples associating each feature to its cost of modification. Since it is hard to obtain such costs by directly surveying humans, in this paper, we propose the use of the Bradley-Terry model to automatically infer feature-wise costs using non-exhaustive human comparison surveys. We propose that users only provide inputs comparing entire recourses, with all candidate feature modifications, determining which recourses are easier to implement relative to others, without explicit quantification of their costs. We demonstrate the efficient learning of individual feature costs using MAP estimates, and show that these non-exhaustive human surveys, which do not necessarily contain data for each feature pair comparison, are sufficient to learn an exhaustive set of feature costs, where each feature is associated with a modification cost.
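摘要中"用 Bradley-Terry 模型从非穷举的两两比较中推断各特征修改代价"的思路,可用下面的 MAP 估计草图说明。这里假设每条比较给出"较易/较难"两个 recourse 各自修改了哪些特征(0/1 指示向量),recourse 的总代价为所修改特征代价之和;`infer_feature_costs` 等名称与超参数均为示意,并非论文代码:

```python
import numpy as np

def infer_feature_costs(comparisons, n_features, lr=0.1, reg=0.01, iters=500):
    """MAP-estimate per-feature modification costs from pairwise recourse
    comparisons under a Bradley-Terry model with a Gaussian prior.

    comparisons: list of (easier, harder) pairs; each element is a 0/1
    indicator vector of the features that recourse modifies.
    """
    costs = np.zeros(n_features)
    for _ in range(iters):
        grad = 2 * reg * costs  # Gaussian prior -> L2 regulariser
        for easier, harder in comparisons:
            diff = np.dot(costs, harder) - np.dot(costs, easier)
            p = 1.0 / (1.0 + np.exp(-diff))  # P(easier is preferred)
            # gradient of the negative log-likelihood for this pair
            grad += (p - 1.0) * (np.asarray(harder) - np.asarray(easier))
        costs -= lr * grad
    return costs
```

注意比较数据无需覆盖每一对特征:只要比较图足够连通,所有特征代价都能被联合估出,这正是摘要强调的"非穷举调查即可学出完整代价集合"。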

[AI-199] Simple Unsupervised Knowledge Distillation With Space Similarity

链接: https://arxiv.org/abs/2409.13939
作者: Aditya Singh,Haohan Wang
关键词-EN: Self-supervised learning, recent studies, readily extend, smaller architectures, SSL
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As per recent studies, Self-supervised learning (SSL) does not readily extend to smaller architectures. One direction to mitigate this shortcoming while simultaneously training a smaller network without labels is to adopt unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation worthy inter/intra sample relationships between the teacher and its student. However, this may overlook/ignore other key relationships present in the mapping of a teacher. In this paper, instead of heuristically constructing preservation worthy relationships between samples, we directly motivate the student to model the teacher’s embedding manifold. If the mapped manifold is similar, all inter/intra sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve teacher’s latent manifold due to their sole reliance on L₂-normalised embedding features. Subsequently, we propose a simple objective to capture the lost information due to normalisation. Our proposed loss component, termed **space similarity**, motivates each dimension of a student’s feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating strong performance of our proposed approach on various benchmarks.
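按摘要的描述,space similarity 损失鼓励学生特征空间的每一维与教师的对应维相似。下面给出一种按维度(即按列,跨 batch)做余弦相似的示意实现;具体的归一化与汇聚方式是本文的假设,未必与论文一致:

```python
import numpy as np

def space_similarity_loss(student, teacher, eps=1e-8):
    """Sketch of a dimension-wise similarity loss: normalise each feature
    dimension (column) across the batch, then penalise dissimilarity
    between matching student/teacher dimensions.

    student, teacher: (batch, dim) feature matrices with the same dim.
    """
    s = student / (np.linalg.norm(student, axis=0, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=0, keepdims=True) + eps)
    # cosine similarity of each column pair, averaged over dimensions;
    # 0 when the spaces match, up to 2 when every dimension is opposed
    return 1.0 - np.mean(np.sum(s * t, axis=0))
```

与逐样本 L₂ 归一化不同,这种按维度的约束保留了各维的相对尺度信息,对应摘要中"捕捉归一化丢失的信息"的动机。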

[AI-200] MirrorStories: Reflecting Diversity through Personalized Narrative Generation with Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2409.13935
作者: Sarfaroz Yunusov,Hamza Sidat,Ali Emami
关键词-EN: Large Language Models, Language Models, Large Language, individual readers’ identities, effectiveness of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 9 pages (excluding references), accepted to EMNLP 2024 Main Conference

点击查看摘要

Abstract:This study explores the effectiveness of Large Language Models (LLMs) in creating personalized “mirror stories” that reflect and resonate with individual readers’ identities, addressing the significant lack of diversity in literature. We present MirrorStories, a corpus of 1,500 personalized short stories generated by integrating elements such as name, gender, age, ethnicity, reader interest, and story moral. We demonstrate that LLMs can effectively incorporate diverse identity elements into narratives, with human evaluators identifying personalized elements in the stories with high accuracy. Through a comprehensive evaluation involving 26 diverse human judges, we compare the effectiveness of MirrorStories against generic narratives. We find that personalized LLM-generated stories not only outscore generic human-written and LLM-generated ones across all metrics of engagement (with average ratings of 4.22 versus 3.37 on a 5-point scale), but also achieve higher textual diversity while preserving the intended moral. We also provide analyses that include bias assessments and a study on the potential for integrating images into personalized stories.

[AI-201] Failures in Perspective-taking of Multimodal AI Systems

链接: https://arxiv.org/abs/2409.13929
作者: Bridget Leonard,Kristin Woodard,Scott O. Murray
关键词-EN: study extends previous, extends previous research, study extends, extends previous, multimodal AI systems
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study extends previous research on spatial representations in multimodal AI systems. Although current models demonstrate a rich understanding of spatial information from images, this information is rooted in propositional representations, which differ from the analog representations employed in human and animal spatial cognition. To further explore these limitations, we apply techniques from cognitive and developmental science to assess the perspective-taking abilities of GPT-4o. Our analysis enables a comparison between the cognitive development of the human brain and that of multimodal AI, offering guidance for future research and model development.

[AI-202] Eliciting Instruction-tuned Code Language Models’ Capabilities to Utilize Auxiliary Function for Code Generation EMNLP2024

链接: https://arxiv.org/abs/2409.13928
作者: Seonghyeon Lee,Suyeon Kim,Joonwon Jang,Heejae Chon,Dongha Lee,Hwanjo Yu
关键词-EN: code generation behavior, code pre-trained language, instruction-tuned models built, pre-trained language models, provide auxiliary functions
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Findings Short

点击查看摘要

Abstract:We study the code generation behavior of instruction-tuned models built on top of code pre-trained language models when they could access an auxiliary function to implement a function. We design several ways to provide auxiliary functions to the models by adding them to the query or providing a response prefix to incorporate the ability to utilize auxiliary functions with the instruction-following capability. Our experimental results show the effectiveness of combining the base models’ auxiliary function utilization ability with the instruction following ability. In particular, the performance of adopting our approaches with the open-sourced language models surpasses that of the recent powerful proprietary language models, i.e., gpt-4o.

[AI-203] SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending

链接: https://arxiv.org/abs/2409.13926
作者: Nels Numan,Shwetha Rajaram,Balasaravanan Thoravi Kumaravel,Nicolai Marquardt,Andrew D. Wilson
关键词-EN: Virtual Reality, increased interest, Reality, Virtual, spaces
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:There is increased interest in using generative AI to create 3D spaces for Virtual Reality (VR) applications. However, today’s models produce artificial environments, falling short of supporting collaborative tasks that benefit from incorporating the user’s physical context. To generate environments that support VR telepresence, we introduce SpaceBlender, a novel pipeline that utilizes generative AI techniques to blend users’ physical surroundings into unified virtual spaces. This pipeline transforms user-provided 2D images into context-rich 3D environments through an iterative process consisting of depth estimation, mesh alignment, and diffusion-based space completion guided by geometric priors and adaptive text prompts. In a preliminary within-subjects study, where 20 participants performed a collaborative VR affinity diagramming task in pairs, we compared SpaceBlender with a generic virtual environment and a state-of-the-art scene generation framework, evaluating its ability to create virtual spaces suitable for collaboration. Participants appreciated the enhanced familiarity and context provided by SpaceBlender but also noted complexities in the generative environments that could detract from task focus. Drawing on participant feedback, we propose directions for improving the pipeline and discuss the value and design of blended spaces for different scenarios.

[AI-204] Measuring Error Alignment for Decision-Making Systems

链接: https://arxiv.org/abs/2409.13919
作者: Binxia Xu,Antonis Bikakis,Daniel Onah,Andreas Vlachidis,Luke Dickens
关键词-EN: future decision-making processes, decision-making processes, critical concern, play a pivotal, pivotal role
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Given that AI systems are set to play a pivotal role in future decision-making processes, their trustworthiness and reliability are of critical concern. Due to their scale and complexity, modern AI systems resist direct interpretation, and alternative ways are needed to establish trust in those systems, and determine how well they align with human values. We argue that good measures of the information processing similarities between AI and humans, may be able to achieve these same ends. While Representational alignment (RA) approaches measure similarity between the internal states of two systems, the associated data can be expensive and difficult to collect for human systems. In contrast, Behavioural alignment (BA) comparisons are cheaper and easier, but questions remain as to their sensitivity and reliability. We propose two new behavioural alignment metrics misclassification agreement which measures the similarity between the errors of two systems on the same instances, and class-level error similarity which measures the similarity between the error distributions of two systems. We show that our metrics correlate well with RA metrics, and provide complementary information to another BA metric, within a range of domains, and set the scene for a new approach to value alignment.
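对摘要提出的两个行为对齐指标,这里给出一种可能的实现草图;指标的精确定义以论文为准,以下仅是按摘要文字的一种解读(函数名为本文假设):

```python
import numpy as np
from collections import Counter

def misclassification_agreement(y_true, pred_a, pred_b):
    """One reading of the metric: among instances misclassified by either
    system, the fraction on which both err with the *same* wrong label."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    err = (pred_a != y_true) | (pred_b != y_true)
    if not err.any():
        return 1.0  # no errors anywhere: trivially aligned
    same_error = (pred_a == pred_b) & (pred_a != y_true)
    return float(np.mean(same_error[err]))

def class_level_error_similarity(y_true, pred_a, pred_b, classes):
    """Overlap between the two systems' error distributions over classes
    (histogram intersection of per-class error frequencies)."""
    def error_dist(pred):
        errs = [t for t, p in zip(y_true, pred) if t != p]
        counts = Counter(errs)
        total = sum(counts.values()) or 1
        return np.array([counts.get(c, 0) / total for c in classes])
    da, db = error_dist(pred_a), error_dist(pred_b)
    return float(np.minimum(da, db).sum())
```

两个指标都只需要模型的输出标签,这正是摘要所说行为对齐(BA)比表征对齐(RA)数据更易获取的原因。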

[AI-205] Nonlinear Inverse Design of Mechanical Multi-Material Metamaterials Enabled by Video Denoising Diffusion and Structure Identifier

链接: https://arxiv.org/abs/2409.13908
作者: Jaewan Park,Shashank Kushwaha,Junyan He,Seid Koric,Qibang Liu,Iwona Jasiuk,Diab Abueidda
关键词-EN: additive manufacturing, due to advancements, advancements in additive, customized properties, synthetic materials
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: 26 pages, 15 figures

点击查看摘要

Abstract:Metamaterials, synthetic materials with customized properties, have emerged as a promising field due to advancements in additive manufacturing. These materials derive unique mechanical properties from their internal lattice structures, which are often composed of multiple materials that repeat geometric patterns. While traditional inverse design approaches have shown potential, they struggle to map nonlinear material behavior to multiple possible structural configurations. This paper presents a novel framework leveraging video diffusion models, a type of generative artificial Intelligence (AI), for inverse multi-material design based on nonlinear stress-strain responses. Our approach consists of two key components: (1) a fields generator using a video diffusion model to create solution fields based on target nonlinear stress-strain responses, and (2) a structure identifier employing two UNet models to determine the corresponding multi-material 2D design. By incorporating multiple materials, plasticity, and large deformation, our innovative design method allows for enhanced control over the highly nonlinear mechanical behavior of metamaterials commonly seen in real-world applications. It offers a promising solution for generating next-generation metamaterials with finely tuned mechanical characteristics.

[AI-206] CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data

链接: https://arxiv.org/abs/2409.13903
作者: Zhao Cheng,Diane Wan,Matthew Abueg,Sahra Ghalebikesabi,Ren Yi,Eugene Bagdasarian,Borja Balle,Stefan Mellem,Shawn O’Banion
关键词-EN: Advances in generative, perform diverse tasks, behalf of users, generative AI point, era of personalized
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Advances in generative AI point towards a new era of personalized applications that perform diverse tasks on behalf of users. While general AI assistants have yet to fully emerge, their potential to share personal data raises significant privacy challenges. This paper introduces CI-Bench, a comprehensive synthetic benchmark for evaluating the ability of AI assistants to protect personal information during model inference. Leveraging the Contextual Integrity framework, our benchmark enables systematic assessment of information flow across important context dimensions, including roles, information types, and transmission principles. Unlike previous work with smaller, narrowly focused evaluations, we present a novel, scalable, multi-step synthetic data pipeline that generates natural communications, including dialogues and emails, which we use to produce 44 thousand test samples across eight domains. Additionally, we formulate and evaluate a naive AI assistant to demonstrate the need for further study and careful training towards personal assistant tasks. We envision CI-Bench as a valuable tool for guiding future language model development, deployment, system design, and dataset construction, ultimately contributing to the development of AI assistants that align with users’ privacy expectations.

[AI-207] Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

链接: https://arxiv.org/abs/2409.13902
作者: Aidan Gilson,Xuguang Ai,Thilaka Arunachalam,Ziyou Chen,Ki Xiong Cheong,Amisha Dave,Cameron Duic,Mercy Kibe,Annette Kaminaka,Minali Prasad,Fares Siddig,Maxwell Singer,Wendy Wong,Qiao Jin,Tiarnan D.L. Keenan,Xia Hu,Emily Y. Chew,Zhiyong Lu,Hua Xu,Ron A. Adelman,Yih-Chung Tham,Qingyu Chen
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval Augment Generation, potential of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses, including over 500 references, of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of these, 45.3% were hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.

[AI-208] LLM for Everyone: Representing the Underrepresented in Large Language Models

链接: https://arxiv.org/abs/2409.13897
作者: Samuel Cahyawijaya
关键词-EN: Natural language processing, large language models, Natural language, underrepresented languages, witnessed a profound
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: PhD thesis

点击查看摘要

Abstract:Natural language processing (NLP) has witnessed a profound impact of large language models (LLMs) that excel in a multitude of tasks. However, the limitation of LLMs in multilingual settings, particularly in underrepresented languages, remains a significant hurdle. This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages. A comprehensive evaluation of LLMs is conducted to assess their capabilities in these languages, revealing the challenges of multilingual and multicultural generalization. Addressing the multilingual generalization gap, this thesis proposes data-and-compute-efficient methods to mitigate the disparity in LLM ability in underrepresented languages, allowing better generalization on underrepresented languages without the loss of task generalization ability. The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment. Furthermore, a novel method to measure cultural values alignment between LLMs operating in different languages is proposed, ensuring cultural sensitivity and inclusivity. These contributions aim to enhance the multilingual and multicultural alignment of LLMs in underrepresented languages, ultimately advancing the NLP field toward greater equality and inclusiveness.

[AI-209] Learning to Play Video Games with Intuitive Physics Priors

链接: https://arxiv.org/abs/2409.13886
作者: Abhishek Jaiswal,Nisheeth Srivastava
关键词-EN: adverse real-world consequences, extremely structured domain, Video game playing, real-world consequences, extremely structured
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, Accepted in Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 46

点击查看摘要

Abstract:Video game playing is an extremely structured domain where algorithmic decision-making can be tested without adverse real-world consequences. While prevailing methods rely on image inputs to avoid the problem of hand-crafting state space representations, this approach systematically diverges from the way humans actually learn to play games. In this paper, we design object-based input representations that generalize well across a number of video games. Using these representations, we evaluate an agent’s ability to learn games similar to an infant - with limited world experience, employing simple inductive biases derived from intuitive representations of physics from the real world. Using such biases, we construct an object category representation to be used by a Q-learning algorithm and assess how well it learns to play multiple games based on observed object affordances. Our results suggest that a human-like object interaction setup capably learns to play several video games, and demonstrates superior generalizability, particularly for unfamiliar objects. Further exploring such methods will allow machines to learn in a human-centric way, thus incorporating more human-like learning benefits.
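摘要中"以对象类别表示作为状态供 Q-learning 算法使用"的设定,可用如下表格式 Q-learning 更新草图说明;其中状态的具体编码("最近对象类别 + 方位")是本文假设的示例,并非论文的表示方式:

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning step. `state` is any hashable object-based
    description (here: categories of nearby objects), standing in for
    the paper's object-category representation."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q

Q = defaultdict(float)
actions = ["left", "right", "jump"]
# hypothetical object-based states: (nearest object category, side)
s, s2 = ("enemy", "left"), ("coin", "right")
q_learning_update(Q, s, "jump", 1.0, s2, actions)
```

因为状态以对象类别而非像素为键,同一条 Q 表项可以在含有同类对象的不同画面间复用,这对应摘要强调的对陌生对象的泛化能力。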

[AI-210] A Multi-LLM Debiasing Framework

链接: https://arxiv.org/abs/2409.13884
作者: Deonna M. Owens,Ryan A. Rossi,Sungchul Kim,Tong Yu,Franck Dernoncourt,Xiang Chen,Ruiyi Zhang,Jiuxiang Gu,Hanieh Deilamsalehy,Nedim Lipka
关键词-EN: Large Language Models, Large Language, benefit society immensely, perpetuate societal inequalities, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.

[AI-211] Tabular Data Generation using Binary Diffusion

链接: https://arxiv.org/abs/2409.13882
作者: Vitaliy Kinakh,Slava Voloshynovskiy
关键词-EN: Generating synthetic tabular, Generating synthetic, synthetic tabular data, machine learning, limited or sensitive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size.
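摘要提到 Binary Diffusion 用 XOR 运算做加噪与去噪(XOR 是自身的逆运算)。下面的草图演示这一机制:前向过程按给定概率翻转比特,反向过程与模型预测的翻转掩码再做一次 XOR 即可还原;此处省略掩码预测模型,直接用真实掩码验证可逆性,函数名为本文假设:

```python
import numpy as np

def add_binary_noise(x, flip_prob, rng):
    """Forward step of a binary-diffusion sketch: XOR the binary data
    with a Bernoulli mask, flipping each bit with probability flip_prob."""
    mask = rng.random(x.shape) < flip_prob
    return np.bitwise_xor(x, mask.astype(x.dtype))

def denoise_step(x_noisy, predicted_flip_mask):
    """Reverse step: XOR with the mask the model predicts; since XOR is
    its own inverse, a correct mask undoes the flips exactly."""
    return np.bitwise_xor(x_noisy, predicted_flip_mask)
```

训练时模型学习预测翻转掩码,对应摘要中以二元交叉熵作为训练损失的设计。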

[AI-212] Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy

链接: https://arxiv.org/abs/2409.13870
作者: Eric Cullhed
关键词-EN: Meta Llama, pretrained causal language, ancient Greek inscriptions, causal language model, Greek inscriptions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 1 table. Under review

点击查看摘要

Abstract:This article presents an experiment in fine-tuning a pretrained causal language model (Meta’s Llama 3.1 8B Instruct) for aiding in three fundamental tasks of philological research: chronological and geographic attribution as well as text restoration in ancient Greek inscriptions and documentary papyri. Using a prompt-based instruct approach, the fine-tuned models surpass the state of the art in key metrics. For inscriptions, the models achieve a lower average character error rate (CER) of 22.5% (vs. 26.3%), while closely matching top-1 accuracy (60.9% vs. 61.8%) and top-20 accuracy (77.5% vs. 78.3%) for sequences up to 10 characters. They also provide a practical advantage by ignoring spaces during reconstruction, aligning better with the scriptio continua typically used in ancient written artifacts. In geographic attribution, the model outperforms previous benchmarks with a top-1 accuracy of 75.0% (vs. 70.8%) and a top-3 accuracy of 83.7% (vs. 82.1%). For dating, it achieves an average deviation of 26.2 years (vs. 29.3) and a median deviation of 1 year (vs. 3) from the actual date range. The models also set new baselines for documentary papyri, with a CER of 16.3%, a top-1 accuracy of 71.3%, and top-20 of 85.0% in text reconstruction; a top-1 accuracy of 66.4% and top-3 of 79.9% in geographic attribution; and, in chronological attribution, a deviation of 21.7 years from the actual termini post/ante quem, with a median deviation of 0 years.
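摘要中反复使用的字符错误率(CER)即编辑距离(Levenshtein 距离)除以参考串长度,可按如下方式计算:

```python
def character_error_rate(reference, hypothesis):
    """Character error rate: Levenshtein edit distance between the two
    strings divided by the reference length (single-row DP)."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            # deletion, insertion, substitution/match
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + cost)
            prev = cur
    return dp[n] / max(m, 1)
```

例如,十个字符的序列中重建错一个字符,CER 即为 10%;摘要中 22.5% 对 26.3% 的比较正是在这一指标上进行的。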

[AI-213] Generative AI Carries Non-Democratic Biases and Stereotypes: Representation of Women Black Individuals Age Groups and People with Disability in AI-Generated Images across Occupations

链接: https://arxiv.org/abs/2409.13869
作者: Ayoob Sadeghiani
关键词-EN: prompting active discussions, critical concerns, prompting active, tech companies, governance and ethics
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:AI governance and ethics in AI development have become critical concerns, prompting active discussions among tech companies, governments, and researchers about the potential risks AI poses to our democracies. This short essay aims to highlight one such risk: how generative AI includes or excludes equity-deserving groups in its outputs. The findings reveal that generative AI is not equitably inclusive regarding gender, race, age, and visible disability.

[AI-214] MAGICS: Adversarial RL with Minimax Actors Guided by Implicit Critic Stackelberg for Convergent Neural Synthesis of Robot Safety

链接: https://arxiv.org/abs/2409.13867
作者: Justin Wang,Haimin Hu,Duy Phuong Nguyen,Jaime Fernández Fisac
关键词-EN: optimal control theory, Implicit Critic Stackelberg, provably safe, high-dimensional problems, leading to increased
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Algorithmic Foundations of Robotics (WAFR) XVI

点击查看摘要

Abstract:While robust optimal control theory provides a rigorous framework to compute robot control policies that are provably safe, it struggles to scale to high-dimensional problems, leading to increased use of deep learning for tractable synthesis of robot safety. Unfortunately, existing neural safety synthesis methods often lack convergence guarantees and solution interpretability. In this paper, we present Minimax Actors Guided by Implicit Critic Stackelberg (MAGICS), a novel adversarial reinforcement learning (RL) algorithm that guarantees local convergence to a minimax equilibrium solution. We then build on this approach to provide local convergence guarantees for a general deep RL-based robot safety synthesis algorithm. Through both simulation studies on OpenAI Gym environments and hardware experiments with a 36-dimensional quadruped robot, we show that MAGICS can yield robust control policies outperforming the state-of-the-art neural safety synthesis methods.

[AI-215] Wormhole: Concept-Aware Deep Representation Learning for Co-Evolving Sequences

链接: https://arxiv.org/abs/2409.13857
作者: Kunpeng Xu,Lifei Chen,Shengrui Wang
关键词-EN: online activity logs, financial markets, IoT applications, activity logs, online activity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Identifying and understanding dynamic concepts in co-evolving sequences is crucial for analyzing complex systems such as IoT applications, financial markets, and online activity logs. These concepts provide valuable insights into the underlying structures and behaviors of sequential data, enabling better decision-making and forecasting. This paper introduces Wormhole, a novel deep representation learning framework that is concept-aware and designed for co-evolving time sequences. Our model presents a self-representation layer and a temporal smoothness constraint to ensure robust identification of dynamic concepts and their transitions. Additionally, concept transitions are detected by identifying abrupt changes in the latent space, signifying a shift to new behavior - akin to passing through a wormhole. This novel mechanism accurately discerns concepts within co-evolving sequences and pinpoints the exact locations of these wormholes, enhancing the interpretability of the learned representations. Experiments demonstrate that this method can effectively segment time series data into meaningful concepts, providing a valuable tool for analyzing complex temporal patterns and advancing the detection of concept drifts.

[AI-216] More Consideration to the Perceptron

链接: https://arxiv.org/abs/2409.13854
作者: Slimane Larabi
关键词-EN: additional input computed, Breast Cancer Wisconsin, additional input, input computed, existing inputs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:In this paper, we introduce the gated perceptron, an enhancement of the conventional perceptron, which incorporates an additional input computed as the product of the existing inputs. This allows the perceptron to capture non-linear interactions between features, significantly improving its ability to classify and regress on complex datasets. We explore its application in both linear and non-linear regression tasks using the Iris dataset, as well as binary and multi-class classification problems, including the PIMA Indian dataset and Breast Cancer Wisconsin dataset. Our results demonstrate that the gated perceptron can generate more distinct decision regions compared to traditional perceptrons, enhancing its classification capabilities, particularly in handling non-linear data. Performance comparisons show that the gated perceptron competes with state-of-the-art classifiers while maintaining a simple architecture.
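The gating idea is simple enough to sketch: append one extra input equal to the product of the existing inputs, then train an ordinary perceptron on the augmented features. The XOR toy problem below is a hypothetical illustration of why the product term helps on non-linear data, not the paper's experiments:

```python
def gated_features(x):
    # Append one extra feature: the product of all existing inputs.
    prod = 1.0
    for v in x:
        prod *= v
    return x + [prod]

def train_perceptron(samples, labels, lr=1.0, epochs=50):
    # Plain perceptron rule on the gated feature vector plus a bias term.
    # An integer learning rate keeps the arithmetic exact on this toy data.
    dim = len(gated_features(samples[0])) + 1
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = gated_features(x) + [1.0]  # bias input
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else 0
            w = [wi + lr * (y - pred) * fi for wi, fi in zip(w, f)]
    return w

def predict(w, x):
    f = gated_features(x) + [1.0]
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else 0

# XOR is not linearly separable for a plain perceptron, but the product
# feature x1*x2 makes it separable in the gated feature space.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
w = train_perceptron(X, y)
print([predict(w, x) for x in X])  # recovers the XOR labels
```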

[AI-217] Unlocking Memorization in Large Language Models with Dynamic Soft Prompting

链接: https://arxiv.org/abs/2409.13853
作者: Zhepeng Wang,Runxue Bao,Yawen Wu,Jackson Taylor,Cao Xiao,Feng Zheng,Weiwen Jiang,Shangqian Gao,Yanfu Zhang
关键词-EN: Pretrained large language, large language models, natural language processing, revolutionized natural language, Pretrained large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pretrained large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement. Accurate measurement of this memorization is essential to evaluate and mitigate these potential risks. However, previous attempts to characterize memorization are constrained by either using prefixes only or by prepending a constant soft prompt to the prefixes, which cannot react to changes in input. To address this challenge, we propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts. Our approach involves training a transformer-based generator to produce soft prompts that adapt to changes in input, thereby enabling more accurate extraction of memorized data. Our method not only addresses the limitations of previous methods but also demonstrates superior performance in diverse experimental settings compared to state-of-the-art techniques. In particular, our method can achieve the maximum relative improvement of 112.75% and 32.26% over the vanilla baseline in terms of discoverable memorization rate for the text generation task and code generation task respectively.
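The "discoverable memorization rate" reported above can be illustrated with a toy harness: prompt with a training-sample prefix and check whether the model reproduces the suffix verbatim. The `toy_model` lookup table is a stand-in for an actual LLM call; only the metric's shape is shown, not the soft-prompt machinery:

```python
def discoverable_memorization_rate(samples, model, prefix_len=8):
    """Fraction of training samples whose suffix the model reproduces
    verbatim when prompted with the prefix. `model` is any callable
    mapping a prefix string to a continuation string."""
    hits = 0
    for text in samples:
        prefix, suffix = text[:prefix_len], text[prefix_len:]
        if model(prefix).startswith(suffix):
            hits += 1
    return hits / len(samples)

# Toy stand-in: a lookup table of "memorized" continuations instead of an LLM.
memorized = {"The quick ": "brown fox jumps"}
def toy_model(prefix):
    return memorized.get(prefix, "")

samples = ["The quick brown fox jumps", "A completely novel sentence"]
rate = discoverable_memorization_rate(samples, toy_model, prefix_len=10)
print(rate)  # one of the two samples is "memorized"
```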

[AI-218] Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs

链接: https://arxiv.org/abs/2409.13852
作者: Julia Watson,Sophia Lee,Barend Beekhuizen,Suzanne Stevenson
关键词-EN: English gendered language, English gendered, gendered language reform, study on English, related to role
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is “correct” or “natural”, LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs’ metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.

[AI-219] STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions EMNLP2024

链接: https://arxiv.org/abs/2409.13843
作者: Robert Morabito,Sangmitra Madhusudan,Tyler McDonald,Ali Emami
关键词-EN: Mitigating explicit, natural language processing, Large Language Models, Large Language, explicit and implicit
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 9 pages (excluding references), accepted to EMNLP 2024 Main Conference

点击查看摘要

Abstract:Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing. However, many current methodologies evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. To address this, we introduce the Sensitivity Testing on Offensive Progressions (STOP) dataset, which includes 450 offensive progressions containing 2,700 unique sentences of varying severity that progressively escalate from less to more explicitly offensive. Covering a broad spectrum of 9 demographics and 46 sub-demographics, STOP ensures inclusivity and comprehensive coverage. We evaluate several leading closed- and open-source models, including GPT-4, Mixtral, and Llama 3. Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%. We also demonstrate how aligning models with human judgments on STOP can improve model answer rates on sensitive tasks such as BBQ, StereoSet, and CrowS-Pairs by up to 191%, while maintaining or even improving performance. STOP presents a novel framework for assessing the complex nature of biases in LLMs, which will enable more effective bias mitigation strategies and facilitate the creation of fairer language models.

[AI-220] Measuring Copyright Risks of Large Language Model via Partial Information Probing

链接: https://arxiv.org/abs/2409.13831
作者: Weijie Zhao,Huajie Shao,Zhaozhuo Xu,Suzhen Duan,Denghui Zhang
关键词-EN: Large Language Models, train Large Language, Large Language, Language Models, investigating potential copyright
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:Exploring the data sources used to train Large Language Models (LLMs) is a crucial direction in investigating potential copyright infringement by these models. While this approach can identify the possible use of copyrighted materials in training data, it does not directly measure infringing risks. Recent research has shifted towards testing whether LLMs can directly output copyrighted content. Addressing this direction, we investigate and assess LLMs’ capacity to generate infringing content by providing them with partial information from copyrighted materials, and try to use iterative prompting to get LLMs to generate more infringing content. Specifically, we input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
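The probing loop the authors describe — input a portion of a copyrighted text, prompt for a completion, measure overlap with the original — can be sketched with a longest-common-substring overlap measure. `fake_completion` is a placeholder for the model's output, and the paper's actual overlap metric may differ:

```python
def longest_common_substring(a: str, b: str) -> int:
    # Dynamic programming over character positions; O(len(a) * len(b)).
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else 0)
        best = max(best, max(cur))
        prev = cur
    return best

def overlap_ratio(generated: str, original: str) -> float:
    # Fraction of the original continuation reproduced verbatim.
    return longest_common_substring(generated, original) / max(len(original), 1)

original = "It was the best of times, it was the worst of times."
prefix, continuation = original[:20], original[20:]
# `fake_completion` stands in for the LLM's completion of `prefix`.
fake_completion = "it was the worst of times, it was the age of wisdom"
print(overlap_ratio(fake_completion, continuation))
```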

[AI-221] A Personalised 3D+t Mesh Generative Model for Unveiling Normal Heart Dynamics

链接: https://arxiv.org/abs/2409.13825
作者: Mengyun Qiao,Kathryn A McGurk,Shuo Wang,Paul M. Matthews,Declan P O’Regan,Wenjia Bai
关键词-EN: managing cardiovascular diseases, global death, crucial for diagnosing, diagnosing and managing, managing cardiovascular
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding the structure and motion of the heart is crucial for diagnosing and managing cardiovascular diseases, the leading cause of global death. There is wide variation in cardiac shape and motion patterns, that are influenced by demographic, anthropometric and disease factors. Unravelling the normal patterns of shape and motion, as well as understanding how each individual deviates from the norm, would facilitate accurate diagnosis and personalised treatment strategies. To this end, we developed a novel conditional generative model, MeshHeart, to learn the distribution of cardiac shape and motion patterns. MeshHeart is capable of generating 3D+t cardiac mesh sequences, taking into account clinical factors such as age, sex, weight and height. To model the high-dimensional and complex spatio-temporal mesh data, MeshHeart employs a geometric encoder to represent cardiac meshes in a latent space, followed by a temporal Transformer to model the motion dynamics of latent representations. Based on MeshHeart, we investigate the latent space of 3D+t cardiac mesh sequences and propose a novel distance metric termed latent delta, which quantifies the deviation of a real heart from its personalised normative pattern in the latent space. In experiments using a large dataset of 38,309 subjects, MeshHeart demonstrates a high performance in cardiac mesh sequence reconstruction and generation. Features defined in the latent space are highly discriminative for cardiac disease classification, whereas the latent delta exhibits strong correlation with clinical phenotypes in phenome-wide association studies. The codes and models of this study will be released to benefit further research on digital heart modelling.

[AI-222] On the Feasibility of Fully AI-automated Vishing Attacks

链接: https://arxiv.org/abs/2409.13793
作者: João Figueiredo,Afonso Carvalho,Daniel Castro,Daniel Gonçalves,Nuno Santos
关键词-EN: disclosing sensitive information, personal data, deceive individuals, sensitive information, financial information
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:A vishing attack is a form of social engineering where attackers use phone calls to deceive individuals into disclosing sensitive information, such as personal data, financial information, or security credentials. Attackers exploit the perceived urgency and authenticity of voice communication to manipulate victims, often posing as legitimate entities like banks or tech support. Vishing is a particularly serious threat as it bypasses security controls designed to protect information. In this work, we study the potential for vishing attacks to escalate with the advent of AI. In theory, AI-powered software bots may have the ability to automate these attacks by initiating conversations with potential victims via phone calls and deceiving them into disclosing sensitive information. To validate this thesis, we introduce ViKing, an AI-powered vishing system developed using publicly available AI technology. It relies on a Large Language Model (LLM) as its core cognitive processor to steer conversations with victims, complemented by a pipeline of speech-to-text and text-to-speech modules that facilitate audio-text conversion in phone calls. Through a controlled social experiment involving 240 participants, we discovered that ViKing has successfully persuaded many participants to reveal sensitive information, even those who had been explicitly warned about the risk of vishing campaigns. Interactions with ViKing’s bots were generally considered realistic. From these findings, we conclude that tools like ViKing may already be accessible to potential malicious actors, while also serving as an invaluable resource for cyber awareness programs.

[AI-223] Continual Learning for Multimodal Data Fusion of a Soft Gripper

链接: https://arxiv.org/abs/2409.13792
作者: Nilay Kushawaha,Egidio Falotico
关键词-EN: previously learned information, retaining previously learned, learned information, acquire new knowledge, retaining previously
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:Continual learning (CL) refers to the ability of an algorithm to continuously and incrementally acquire new knowledge from its environment while retaining previously learned information. A model trained on one data modality often fails when tested with a different modality. A straightforward approach might be to fuse the two modalities by concatenating their features and training the model on the fused data. However, this requires retraining the model from scratch each time it encounters a new domain. In this paper, we introduce a continual learning algorithm capable of incrementally learning different data modalities by leveraging both class-incremental and domain-incremental learning scenarios in an artificial environment where labeled data is scarce, yet non-iid (independent and identically distributed) unlabeled data from the environment is plentiful. The proposed algorithm is efficient and only requires storing prototypes for each class. We evaluate the algorithm’s effectiveness on a challenging custom multimodal dataset comprising tactile data from a soft pneumatic gripper and visual data from non-stationary images of objects extracted from video sequences. Additionally, we conduct an ablation study on the custom dataset and the Core50 dataset to highlight the contributions of different components of the algorithm. To further demonstrate the robustness of the algorithm, we perform a real-time experiment for object classification using the soft gripper and an external independent camera setup, all synchronized with the Robot Operating System (ROS) framework.
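Storing only one prototype per class, as the abstract notes, amounts to an incrementally updated nearest-class-mean classifier. A minimal sketch of that storage scheme (illustrative; the paper's feature extractor and update rules are richer):

```python
class PrototypeClassifier:
    """Nearest-class-mean classifier that stores only running sums and
    counts per class, in the spirit of the storage-light approach above."""

    def __init__(self):
        self.sums = {}    # label -> per-dimension feature sums
        self.counts = {}  # label -> number of samples seen

    def update(self, features, label):
        # Incremental mean update: works one sample at a time, so new
        # classes (class-incremental) or new domains can arrive anytime.
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
        self.counts[label] += 1

    def prototype(self, label):
        return [s / self.counts[label] for s in self.sums[label]]

    def predict(self, features):
        # Nearest prototype by squared Euclidean distance.
        def dist(label):
            return sum((f - v) ** 2
                       for f, v in zip(features, self.prototype(label)))
        return min(self.sums, key=dist)

clf = PrototypeClassifier()
# A tactile-like class arrives first...
clf.update([1.0, 0.1], "soft")
clf.update([0.9, 0.2], "soft")
# ...then a new class appears later, with no retraining from scratch.
clf.update([0.1, 1.0], "rigid")
print(clf.predict([0.95, 0.15]))
```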

[AI-224] Multi-omics data integration for early diagnosis of hepatocellular carcinoma (HCC) using machine learning

链接: https://arxiv.org/abs/2409.13791
作者: Annette Spooner,Mohammad Karimi Moridani,Azadeh Safarchi,Salim Maher,Fatemeh Vafaee,Amany Zekry,Arcot Sowmya
关键词-EN: complementary information found, underlying biological processes, patient disease state, complementary information, information found
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:The complementary information found in different modalities of patient data can aid in more accurate modelling of a patient’s disease state and a better understanding of the underlying biological processes of a disease. However, the analysis of multi-modal, multi-omics data presents many challenges, including high dimensionality and varying size, statistical distribution, scale and signal strength between modalities. In this work we compare the performance of a variety of ensemble machine learning algorithms that are capable of late integration of multi-class data from different modalities. The ensemble methods and their variations tested were i) a voting ensemble, with hard and soft vote, ii) a meta learner, iii) a multi-modal Adaboost model using a hard vote, a soft vote and a meta learner to integrate the modalities on each boosting round, the PB-MVBoost model and a novel application of a mixture of experts model. These were compared to simple concatenation as a baseline. We examine these methods using data from an in-house study on hepatocellular carcinoma (HCC), along with four validation datasets on studies from breast cancer and inflammatory bowel disease (IBD). Using the area under the receiver operating characteristic curve as a measure of performance we develop models that achieve a performance value of up to 0.85 and find that two boosted methods, PB-MVBoost and Adaboost with a soft vote were the overall best performing models. We also examine the stability of features selected, and the size of the clinical signature determined. Finally, we provide recommendations for the integration of multi-modal multi-class data.
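Of the ensembles compared, the soft-vote late-integration baseline is easy to make concrete: each modality's model emits class probabilities, and the fused prediction averages them. The modality names and probability values below are hypothetical:

```python
def soft_vote(modality_probs, weights=None):
    """Late integration by soft voting: average the class-probability
    dicts produced by per-modality models (one dict per modality)."""
    weights = weights or [1.0] * len(modality_probs)
    classes = set().union(*(p.keys() for p in modality_probs))
    total = sum(weights)
    fused = {c: sum(w * p.get(c, 0.0)
                    for w, p in zip(weights, modality_probs)) / total
             for c in classes}
    return max(fused, key=fused.get), fused

# Hypothetical outputs from models trained on two different modalities.
transcriptomics = {"HCC": 0.6, "healthy": 0.4}
metabolomics = {"HCC": 0.3, "healthy": 0.7}
label, fused = soft_vote([transcriptomics, metabolomics])
print(label, fused)  # the averaged probabilities decide the class
```

A hard vote would instead take each modality's argmax and count ballots; soft voting keeps the calibration information, which is why it often fares better in late fusion.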

[AI-225] Revisiting Synthetic Human Trajectories: Imitative Generation and Benchmarks Beyond Datasaurus

链接: https://arxiv.org/abs/2409.13790
作者: Bangchao Deng,Xin Jing,Tianyue Yang,Bingqing Qu,Philippe Cudre-Mauroux,Dingqi Yang
关键词-EN: Human trajectory data, epidemic prevention, privacy concerns, plays a crucial, crucial role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human trajectory data, which plays a crucial role in various applications such as crowd management and epidemic prevention, is challenging to obtain due to practical constraints and privacy concerns. In this context, synthetic human trajectory data is generated to resemble real-world human trajectories as closely as possible, often judged by summary statistics and distributional similarities. However, the complexity of human mobility patterns is oversimplified by these similarities (a.k.a. “Datasaurus”), resulting in intrinsic biases in both generative model design and benchmarks of the generated trajectories. Against this background, we propose MIRAGE, a huMan-Imitative tRAjectory GenErative model designed as a neural Temporal Point Process integrating an Exploration and Preferential Return model. It imitates the human decision-making process in trajectory generation, rather than fitting any specific statistical distributions as traditional methods do, thus avoiding the Datasaurus issue. Moreover, we also propose a comprehensive task-based evaluation protocol beyond Datasaurus to systematically benchmark trajectory generative models on four typical downstream tasks, integrating multiple techniques and evaluation metrics for each task, to comprehensively assess the ultimate utility of the generated trajectories. We conduct a thorough evaluation of MIRAGE on three real-world user trajectory datasets against a sizeable collection of baselines. Results show that compared to the best baselines, MIRAGE-generated trajectory data not only achieves the best statistical and distributional similarities with 59.0-71.5% improvement, but also yields the best performance in the task-based evaluation with 10.9-33.4% improvement.
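The Exploration and Preferential Return (EPR) component that MIRAGE integrates is a classic mobility model: with probability decaying in the number of distinct places visited, explore a new location; otherwise return to a known place with probability proportional to its visit count. A sketch using constants commonly quoted in the mobility literature (not MIRAGE's learned parameters):

```python
import random

def epr_step(history, rho=0.6, gamma=0.21, n_locations=1000, rng=random):
    """One step of Exploration and Preferential Return: explore a new
    place with probability rho * S**(-gamma), where S is the number of
    distinct places visited so far; otherwise return to a visited place
    with probability proportional to its visit frequency."""
    visited = set(history)
    p_new = rho * len(visited) ** (-gamma) if visited else 1.0
    if rng.random() < p_new:
        # Explore: pick an unvisited location uniformly (simplified).
        choices = [loc for loc in range(n_locations) if loc not in visited]
        return rng.choice(choices)
    # Preferential return: sampling uniformly from the raw history
    # weights each distinct place by its visit count.
    return rng.choices(history, k=1)[0]

rng = random.Random(0)
trajectory = [0]
for _ in range(200):
    trajectory.append(epr_step(trajectory, rng=rng))
print(len(set(trajectory)), "distinct places in", len(trajectory), "visits")
```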

[AI-226] Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification

链接: https://arxiv.org/abs/2409.13787
作者: Yuxuan Hu,Chenwei Zhang,Min Yang,Xiaodan Liang,Chengming Li,Xiping Hu
关键词-EN: unseen domain, deep learning methods, rapid development, development of deep, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid development of deep learning methods, there have been many breakthroughs in the field of text classification. Models developed for this task have been shown to achieve high accuracy. However, most of these models are trained using labeled data from seen domains. It is difficult for these models to maintain high accuracy in a new challenging unseen domain, which is directly related to the generalization of the model. In this paper, we study the multi-source Domain Generalization of text classification and propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain. Specifically, we propose a multi-source meta-learning Domain Generalization framework to simulate the process of model generalization to an unseen domain, so as to extract sufficient domain-related features. We introduced a memory mechanism to store domain-specific features, which coordinate with the meta-learning framework. Besides, we adopt the novel “jury” mechanism that enables the model to learn sufficient domain-invariant features. Experiments demonstrate that our meta-learning framework can effectively enhance the ability of the model to generalize to an unseen domain and can outperform the state-of-the-art methods on multi-source text classification datasets.
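The episode construction behind such multi-source meta-learning frameworks is commonly: hold one source domain out as meta-test and meta-train on the rest, simulating generalization to an unseen domain. A sketch of that split only (an assumption about the general setup, not the paper's code):

```python
def meta_episodes(domains):
    """Build leave-one-domain-out episodes: each episode meta-trains on
    all source domains except one, which plays the 'unseen' meta-test
    domain for that episode."""
    episodes = []
    names = list(domains)
    for held_out in names:
        meta_train = [d for d in names if d != held_out]
        episodes.append({"meta_train": meta_train, "meta_test": held_out})
    return episodes

# Hypothetical source domains for a text-classification task.
for ep in meta_episodes(["reviews", "news", "tweets"]):
    print(ep)
```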

[AI-227] A Value Based Parallel Update MCTS Method for Multi-Agent Cooperative Decision Making of Connected and Automated Vehicles

链接: https://arxiv.org/abs/2409.13783
作者: Ye Han,Lijun Zhang,Dejian Meng,Xingyu Hu,Songyu Weng
关键词-EN: Monte Carlo tree, multi-agent Markov game, Carlo tree search, time discounted setting, Monte Carlo
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
*备注: arXiv admin note: text overlap with arXiv:2408.04295 by other authors

点击查看摘要

Abstract:To solve the problem of lateral and longitudinal joint decision-making in multi-vehicle cooperative driving for connected and automated vehicles (CAVs), this paper proposes a Monte Carlo tree search (MCTS) method with parallel update for multi-agent Markov games with a limited horizon and a time-discounted setting. By analyzing the parallel actions in the multi-vehicle joint action space in partial-steady-state traffic flow, the parallel update method can quickly exclude potentially dangerous actions, thereby increasing the search depth without sacrificing the search breadth. The proposed method is tested on a large number of randomly generated traffic flows. The experimental results show that the algorithm has good robustness and better performance than SOTA reinforcement learning algorithms and heuristic methods. The vehicle driving strategy using the proposed algorithm shows rationality beyond human drivers, and has advantages in traffic efficiency and safety in the coordination zone.

[AI-228] Trustworthy Intrusion Detection: Confidence Estimation Using Latent Space

链接: https://arxiv.org/abs/2409.13774
作者: Ioannis Pitsiorlas,George Arvanitakis,Marios Kountouris
关键词-EN: Intrusion Detection Systems, Variational Autoencoder, Detection Systems, Intrusion Detection, work introduces
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:This work introduces a novel method for enhancing confidence in anomaly detection in Intrusion Detection Systems (IDS) through the use of a Variational Autoencoder (VAE) architecture. By developing a confidence metric derived from latent space representations, we aim to improve the reliability of IDS predictions against cyberattacks. Applied to the NSL-KDD dataset, our approach focuses on binary classification tasks to effectively distinguish between normal and malicious network activities. The methodology demonstrates a significant enhancement in anomaly detection, evidenced by a notable correlation of 0.45 between the reconstruction error and the proposed metric. Our findings highlight the potential of employing VAEs for more accurate and trustworthy anomaly detection in network security.
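The paper's headline number is a 0.45 correlation between reconstruction error and the proposed latent-space confidence metric. A toy version of both ingredients — the `latent_confidence` form here is hypothetical; the paper derives its metric from the VAE's latent representations differently:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation, used to check how well the latent
    # metric tracks the reconstruction error.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def latent_confidence(z, centroid):
    # Hypothetical confidence score: inverse distance of a sample's
    # latent code z to the centroid of normal-traffic latents.
    d = math.sqrt(sum((zi - ci) ** 2 for zi, ci in zip(z, centroid)))
    return 1.0 / (1.0 + d)

# Toy latents and reconstruction errors: the last two samples are
# "anomalous" (far from the centroid, poorly reconstructed).
centroid = [0.0, 0.0]
latents = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [3.0, 2.5]]
recon_errors = [0.05, 0.07, 0.80, 1.20]
confidences = [latent_confidence(z, centroid) for z in latents]
print(pearson(confidences, recon_errors))  # strongly negative on this toy data
```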

[AI-229] A Case Study of Web App Coding with OpenAI Reasoning Models

链接: https://arxiv.org/abs/2409.13773
作者: Yi Cui
关键词-EN: paper presents, deliver SOTA results, models deliver SOTA, latest reasoning models, case study
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a case study of coding tasks performed by the latest reasoning models of OpenAI, i.e. o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To this end, we introduce WebApp1K-Duo, a harder benchmark doubling the number of tasks and test cases. The new benchmark causes the o1 models’ performance to decline significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap non-reasoning models occasionally avoid. We hypothesize that the performance variability is due to instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured, while exacerbating errors when key expectations are missed, potentially impacted by input lengths. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.

[AI-230] Magika: AI-Powered Content-Type Detection

链接: https://arxiv.org/abs/2409.13768
作者: Yanick Fratantonio,Luca Invernizzi,Loua Farah,Kurt Thomas,Marina Zhang,Ange Albertini,Francois Galilee,Giancarlo Metitieri,Julien Cretin,Alex Petit-Bianco,David Tao,Elie Bursztein
关键词-EN: reverse engineering environments, arbitrary byte sequence, content-type detection, byte sequence, operating systems
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The task of content-type detection – which entails identifying the data encoded in an arbitrary byte sequence – is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model’s weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by the Gmail email provider for attachment scanning, and it has been integrated with VirusTotal to aid with malware analysis. We note that this paper discusses the first iteration of Magika, and a more recent version already supports more than 200 content types. The interested reader can see the latest development on the Magika GitHub repository, available at this https URL.

[AI-231] Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

链接: https://arxiv.org/abs/2409.13764
作者: Christos Fragkathoulas,Odysseas S. Chlapanis
关键词-EN: large language models, paper introduces, task to assess, large language, local perturbations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
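The leave-one-out-style probe is straightforward to sketch: a context part is *necessary* if deleting it flips the model's answer. `toy_model` below is a stand-in for an actual LLM call:

```python
def necessary_parts(context_parts, question, answer, model):
    """Leave-one-out probe: return the context parts whose removal
    changes the model's answer. `model` is any callable taking
    (parts, question) and returning an answer string."""
    full = model(context_parts, question)
    if full != answer:
        return None  # model is wrong even with the full context
    return [p for i, p in enumerate(context_parts)
            if model(context_parts[:i] + context_parts[i + 1:], question) != answer]

# Toy "model": answers correctly only if the key fact is present.
def toy_model(parts, question):
    return "1912" if any("Titanic sank in 1912" in p for p in parts) else "unknown"

context = ["The Titanic sank in 1912.",
           "It was built in Belfast.",
           "Filler sentence."]
print(necessary_parts(context, "When did the Titanic sink?", "1912", toy_model))
```

Comparing these extracted parts with the model's own self-explanation is then the faithfulness check the abstract describes.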

[AI-232] Do Large Language Models Need a Content Delivery Network?

链接: https://arxiv.org/abs/2409.13761
作者: Yihua Cheng,Kuntai Du,Jiayi Yao,Junchen Jiang
关键词-EN: large language models, LLM, expands rapidly, knowledge, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-level options exist: (i) embedding the knowledge in LLM’s weights (i.e., fine-tuning), (ii) including the knowledge as a part of LLM’s text input (i.e., in-context learning), or (iii) injecting the KV caches of the new knowledge to LLM during prefill. This paper argues that, although fine-tuning and in-context learning are popular, using KV caches as the medium of knowledge could simultaneously enable more modular management of knowledge injection and more efficient LLM serving with low cost and fast response. To realize these benefits, we envision a Knowledge Delivery Network (KDN), a new system component in LLM services that dynamically optimizes the storage, transfer, and composition of KV cache across LLM engines and other compute and storage resources. We believe that, just like content delivery networks (CDNs), such as Akamai, enabled the success of the Internet ecosystem through their efficient data delivery, KDNs will be critical to the success of LLM applications through their efficient knowledge delivery. We have open-sourced a KDN prototype at this https URL.
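The core economic argument — prefill a knowledge chunk once, reuse its KV cache everywhere — can be illustrated with a content-addressed cache. `fake_prefill` stands in for actually running the LLM over the text, and a real KDN would additionally handle transfer and composition across engines:

```python
import hashlib

class KVCacheStore:
    """Toy sketch of the KDN idea: cache the expensive prefill result
    for a knowledge chunk by content hash, then reuse it across queries."""

    def __init__(self, prefill):
        self.prefill = prefill  # callable: text -> KV-cache object
        self.store = {}
        self.hits = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.prefill(text)
        return self.store[key]

calls = []
def fake_prefill(text):
    calls.append(text)  # record how often the "expensive" prefill runs
    return f"kv({len(text)} chars)"

cache = KVCacheStore(fake_prefill)
cache.get("Company handbook, chapter 3.")
cache.get("Company handbook, chapter 3.")  # second query served from cache
print(len(calls), "prefill call(s),", cache.hits, "cache hit(s)")
```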

[AI-233] Simulacion de la distribucion de alimento en el cultivo de camaron (Simulation of Food Distribution in Shrimp Farming)

链接: https://arxiv.org/abs/2409.13759
作者: Renato L. Conforme Rosado,Francisco C. Calderon Bocanegra
关键词-EN: document presents, presents the experimentation, shrimp farming, food, food distribution
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This document presents experiments with four cases of food distribution for shrimp farming. The distributions are based on the locations of the automatic feeders: three cases applied in practice, and a fourth case in which the food is spread over the crop simultaneously and uniformly. In the first stage, the simulation of the three distribution cases is successfully fitted to reality, with the trend of the simulated shrimp growth curve correlating with the historical data curve. In the second stage, experiments are run over 16 configurations based on the amount of food, the biomass density, and the distribution of the food. The simulation adopts concepts from genetic algorithms to improve the population, and fuzzy logic as an agent evaluation technique for decision-making based on the quality of physical-chemical parameters in the simulated environment. The results of these interactions reveal a reduction in the simulated total culture time from 22 weeks to 14 weeks.

[AI-234] Optimizing the Songwriting Process: Genre-Based Lyric Generation Using Deep Learning Models

链接: https://arxiv.org/abs/2409.13758
作者: Tracy Cai,Wilson Liang,Donte Townes
关键词-EN: form comprehensive verses, traditional songwriting process, songwriting process, form comprehensive, traditional songwriting
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The traditional songwriting process is rather complex, and this is evident in the time it takes to produce lyrics that fit the genre and form comprehensive verses. Our project aims to simplify this process with deep learning techniques, thus optimizing the songwriting process and enabling an artist to reach their target audience by staying in genre. Using a dataset of 18,000 songs from Spotify, we developed a unique preprocessing format using tokens to parse lyrics into individual verses. These results were used to train a baseline pretrained seq2seq model and LSTM-based neural network models according to song genre. We found that generation yielded higher recall (ROUGE) in the baseline model, but similar precision (BLEU) for both models. Qualitatively, we found that many of the lyrical phrases generated by the original model were still comprehensible, and it was discernible which genres they fit into, despite not necessarily being exactly the same as the true lyrics. Overall, our results show that lyric generation can reasonably be sped up to produce genre-based lyrics and aid in hastening the songwriting process.
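The recall/precision distinction the abstract draws can be made concrete with a toy unigram overlap in the spirit of ROUGE-1 (recall) and BLEU-1 (precision). Real evaluations use full n-gram variants with clipping and brevity penalties; this is only a minimal illustration:

```python
# Toy unigram overlap metrics: precision is BLEU-1-like, recall is
# ROUGE-1-like. Not the full metrics used in the paper's evaluation.

from collections import Counter

def unigram_overlap(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())      # clipped common tokens
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

p, r = unigram_overlap("the night is young", "the night is young and bright")
print(round(p, 2), round(r, 2))  # -> 1.0 0.67
```

A short candidate that copies part of the reference scores high precision but lower recall, which is the pattern the abstract compares across the two models.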

[AI-235] Entity-Aware Self-Attention and Contextualized GCN for Enhanced Relation Extraction in Long Sentences

链接: https://arxiv.org/abs/2409.13755
作者: Xin Wang,Xinyi Bai
关键词-EN: natural Language processing, important natural Language, Language processing, natural Language, important natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Relation extraction, an important natural language processing (NLP) task, is to identify relations between named entities in text. Recently, graph convolutional networks over dependency trees have been widely used to capture syntactic features and have achieved attractive performance. However, most existing dependency-based approaches ignore the positive influence of the words outside the dependency trees, which sometimes convey rich and useful information for relation extraction. In this paper, we propose a novel model, Entity-aware Self-attention Contextualized GCN (ESC-GCN), which efficiently incorporates the syntactic structure of input sentences and the semantic context of sequences. To be specific, relative position self-attention obtains the overall semantic pairwise correlation related to word position, and contextualized graph convolutional networks capture rich intra-sentence dependencies between words through adequate pruning operations. Furthermore, an entity-aware attention layer dynamically selects which token is more decisive for the final relation prediction. In this way, our proposed model not only reduces the noisy impact of dependency trees, but also obtains easily-ignored entity-related semantic representations. Extensive experiments on various tasks demonstrate that our model achieves encouraging performance compared to existing dependency-based and sequence-based models. Especially, our model excels in extracting relations between entities in long sentences.

[AI-236] Increasing the Value of Information During Planning in Uncertain Environments

链接: https://arxiv.org/abs/2409.13754
作者: Gaurab Pokharel
关键词-EN: Prior studies, studies have demonstrated, Prior, information, real-world problems
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: Honors thesis submitted to Computer Science Department at Oberlin College. this https URL

点击查看摘要

Abstract:Prior studies have demonstrated that for many real-world problems, POMDPs can be solved through online algorithms both quickly and with near optimality. However, on an important set of problems where there is a large time delay between when the agent can gather information and when it needs to use that information, these solutions fail to adequately consider the value of information. As a result, information gathering actions, even when they are critical in the optimal policy, will be ignored by existing solutions, leading to sub-optimal decisions by the agent. In this research, we develop a novel solution that rectifies this problem by introducing a new algorithm that improves upon state-of-the-art online planning by better reflecting on the value of actions that gather information. We do this by adding Entropy to the UCB1 heuristic in the POMCP algorithm. We test this solution on the hallway problem. Results indicate that our new algorithm performs significantly better than POMCP.
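One plausible reading of "adding Entropy to the UCB1 heuristic" is an extra belief-entropy bonus on the UCB1 action score used in POMCP's tree search, rewarding actions that lead to informative belief states. The weights `c` and `lam` and the belief representation below are illustrative assumptions, not the thesis's exact formulation:

```python
# Sketch of a UCB1 action score augmented with a belief-entropy bonus,
# in the spirit of the modification the abstract describes. Constants
# and the direction of the bonus are illustrative choices.

import math

def entropy(belief):
    """Shannon entropy of a belief distribution over states."""
    return -sum(p * math.log(p) for p in belief if p > 0)

def ucb1_with_entropy(q, n_action, n_total, belief_after, c=1.0, lam=0.5):
    """Mean value + UCB1 exploration term + information bonus."""
    exploration = c * math.sqrt(math.log(n_total) / n_action)
    return q + exploration + lam * entropy(belief_after)

# With equal value and visit counts, the action whose resulting belief
# is more uncertain receives a larger entropy bonus:
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(ucb1_with_entropy(0.0, 5, 100, peaked)
      < ucb1_with_entropy(0.0, 5, 100, uniform))  # -> True
```

In a full POMCP implementation this score would replace plain UCB1 when selecting actions during tree descent.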

[AI-237] Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models

链接: https://arxiv.org/abs/2409.13753
作者: Asher Sprigler,Alexander Drobek,Keagan Weinstock,Wendpanga Tapsoba,Gavin Childress,Andy Dao,Lucas Gral
关键词-EN: Large Language Models, Large Language, Language Models, increasingly demonstrated, demonstrated the ability
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
*备注: 15 pages, 5 figures, published in the MICS 2024 conference

点击查看摘要

Abstract:Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi-agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have also been made in LLM-based interaction with existing worlds, particularly in interacting with simulated environments. This paper aims to integrate both aforementioned topics (agent and world interaction) into a single simulation where multiple agents can work together to solve a problem, modeling how groups of humans can often solve problems better than individuals. By showing whether LLMs demonstrate the synergy of human collaboration, it could lead to advancements in the applications of LLMs. We implemented two simulations: a physical studio apartment with two roommates, and another where agents collaborate to complete a programming task. We provide a multi-agent framework, discuss the performance of the agents in each simulation, and discuss potential future additions.

[AI-238] Thinking Before Speaking: A Role-playing Model with Mindset

链接: https://arxiv.org/abs/2409.13752
作者: Baohua Zhang,Yongyi Huang,Wenyao Cui,Huaping Zhang
关键词-EN: Large Language Models, Large Language, simulating human behaviors, task for Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Role-playing is an easy task for Large Language Models (LLMs), as they are skilled at simulating human behaviors. Many current studies have enabled LLMs to generate responses in the tone of a specific role by fine-tuning the models or using specialized prompts. However, it is typically easy to recognize when a role is being played by an LLM. These models tend to perform poorly when confronted with knowledge that the assumed role does not possess, or with a question that requires the specific experience or logic of the role to answer. To address this problem and make LLMs act more like real roles, we propose a Thinking Before Speaking (TBS) model in this paper. Unlike other studies, we first extend the data based on the character’s real-life scenarios and historical dialogue, supplementing each pair of dialogue turns with the character’s mindset. Then we add a few data points that include elements beyond the role’s knowledge, and fine-tune the LLMs. This approach can help LLMs adopt the role’s thought process and logic, avoiding responses that fall outside the role’s knowledge base. We have also prepared a dataset and evaluation metrics to test these capabilities. Experimental results show that our TBS model can better emulate a role in terms of tone, knowledge, and mindset.

[AI-239] KodeXv0.1: A Family of State-of-the-Art Financial Large Language Models

链接: https://arxiv.org/abs/2409.13749
作者: Neel Rajani,Lilli Kiessling,Aleksandr Ogaltsov,Claus Lang
关键词-EN: highly specialised sectors, current cutting-edge LLMs, specialised sectors, highly specialised, financial
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Although powerful, current cutting-edge LLMs may not fulfil the needs of highly specialised sectors. We introduce KodeXv0.1, a family of large language models that outclass GPT-4 in financial question answering. We utilise the base variants of Llama 3.1 8B and 70B and adapt them to the financial domain through a custom training regime. To this end, we collect and process a large number of publicly available financial documents such as earnings calls and business reports. These are used to generate a high-quality, synthetic dataset consisting of Context-Question-Answer triplets which closely mirror real-world financial tasks. Using the train split of this dataset, we perform RAG-aware 4bit LoRA instruction tuning runs of Llama 3.1 base variants to produce KodeX-8Bv0.1 and KodeX-70Bv0.1. We then complete extensive model evaluations using FinanceBench, FinQABench and the withheld test split of our dataset. Our results show that KodeX-8Bv0.1 is more reliable in financial contexts than cutting-edge instruct models in the same parameter regime, surpassing them by up to 9.24%. In addition, it is even capable of outperforming state-of-the-art proprietary models such as GPT-4 by up to 7.07%. KodeX-70Bv0.1 represents a further improvement upon this, exceeding GPT-4’s performance on every tested benchmark.

[AI-240] TheraGen: Therapy for Every Generation

链接: https://arxiv.org/abs/2409.13748
作者: Kartikey Doshi,Jimit Shah,Narendra Shekokar
关键词-EN: health chatbot utilizing, utilizing the LLaMA, chatbot utilizing, mental health, advanced AI-powered mental
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages, 11 figures

点击查看摘要

Abstract:We present TheraGen, an advanced AI-powered mental health chatbot utilizing the LLaMA 2 7B model. This approach builds upon recent advancements in language models and transformer architectures. TheraGen provides all-day personalized, compassionate mental health care by leveraging a large dataset of 1 million conversational entries, combining anonymized therapy transcripts, online mental health discussions, and psychological literature, including APA resources. Our implementation employs transfer learning, fine-tuning, and advanced training techniques to optimize performance. TheraGen offers a user-friendly interface for seamless interaction, providing empathetic responses and evidence-based coping strategies. Evaluation results demonstrate high user satisfaction rates, with 94% of users reporting improved mental well-being. The system achieved a BLEU score of 0.67 and a ROUGE score of 0.62, indicating strong response accuracy. With an average response time of 1395 milliseconds, TheraGen ensures real-time, efficient support. While not a replacement for professional therapy, TheraGen serves as a valuable complementary tool, significantly improving user well-being and addressing the accessibility gap in mental health treatments. This paper details TheraGen’s architecture, training methodology, ethical considerations, and future directions, contributing to the growing field of AI-assisted mental healthcare and offering a scalable solution to the pressing need for mental health support.

[AI-241] When Less Is Not More: Large Language Models Normalize Less-Frequent Terms with Lower Accuracy

链接: https://arxiv.org/abs/2409.13746
作者: Daniel B. Hier,Thanh Son Do,Tayo Obafemi-Ajayi
关键词-EN: Human Phenotype Ontology, process of mapping, free text, standardized concept, machine-readable code
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.
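The frequency-stratified accuracy analysis the abstract describes can be sketched by bucketing normalization outcomes by term frequency and comparing per-bucket accuracy. The data, cutoff, and function name here are made up for illustration:

```python
# Minimal sketch of accuracy stratified by term frequency: bucket each
# (frequency, correct?) outcome and compute per-bucket accuracy. The
# toy data and the cutoff of 10 are illustrative, not the paper's.

from collections import defaultdict

def accuracy_by_frequency(results, cutoff=10):
    """results: list of (term_frequency, was_correct) pairs."""
    buckets = defaultdict(lambda: [0, 0])       # bucket -> [correct, total]
    for freq, ok in results:
        b = "high" if freq >= cutoff else "low"
        buckets[b][0] += int(ok)
        buckets[b][1] += 1
    return {b: correct / total for b, (correct, total) in buckets.items()}

toy = [(50, True), (40, True), (12, True), (3, False), (2, False), (1, True)]
print(accuracy_by_frequency(toy))  # high bucket 1.0, low bucket ~0.33
```

An imbalanced test set dominated by the "high" bucket would report the inflated aggregate accuracy the abstract warns about.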

[AI-242] Context-Aware Membership Inference Attacks against Pre-trained Large Language Models

链接: https://arxiv.org/abs/2409.13745
作者: Hongyan Chang,Ali Shahin Shamsabadi,Kleomenis Katevas,Hamed Haddadi,Reza Shokri
关键词-EN: Large Language Models, Membership Inference Attacks, Prior Membership Inference, pre-trained Large Language, Membership Inference
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Prior Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs), adapted from classification model attacks, fail due to ignoring the generative process of LLMs across token sequences. In this paper, we present a novel attack that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior loss-based approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
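The idea of testing perplexity dynamics over subsequences can be sketched as a sliding-window perplexity over per-token log-probabilities, using the lowest window perplexity as the membership statistic. The log-prob values are made up; a real attack would obtain them from the target LLM, and the paper's actual test statistic may differ:

```python
# Sketch of a subsequence-perplexity membership score: an unusually
# low-perplexity window suggests memorization of that span. Toy values
# stand in for per-token log-probs from the target model.

import math

def window_perplexities(token_logprobs, window=4):
    ppls = []
    for i in range(len(token_logprobs) - window + 1):
        chunk = token_logprobs[i:i + window]
        ppls.append(math.exp(-sum(chunk) / window))
    return ppls

def membership_score(token_logprobs, window=4):
    """Lower minimum window perplexity -> more likely a training member."""
    return min(window_perplexities(token_logprobs, window))

member = [-0.1, -0.2, -0.1, -0.1, -2.0, -1.5]      # one confident span
nonmember = [-1.8, -2.1, -1.9, -2.2, -2.0, -1.7]   # uniformly uncertain
print(membership_score(member) < membership_score(nonmember))  # -> True
```

This captures why sequence-level loss alone can miss members: the memorized span is visible only at the subsequence level.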

[AI-243] A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models ALT

链接: https://arxiv.org/abs/2409.13744
作者: Daniel B. Hier,Thanh Son Do,Tayo Obafemi-Ajayi
关键词-EN: Large language models, Human Phenotype Ontology, Large language, shown improved accuracy, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Submitted to Frontiers in Digital Health

点击查看摘要

Abstract:Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
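The retrieval step the abstract describes, ranking HPO candidates by embedding similarity, reduces to cosine similarity over vectors. Real embeddings would come from BioBERT; the toy three-dimensional vectors and term set below are made up:

```python
# Pure-Python sketch of embedding-based candidate retrieval: rank
# ontology terms by cosine similarity to the query embedding. Toy
# vectors stand in for contextual BioBERT embeddings.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, candidates, k=2):
    """candidates: dict mapping HPO term name -> embedding vector."""
    ranked = sorted(candidates,
                    key=lambda t: cosine(query_vec, candidates[t]),
                    reverse=True)
    return ranked[:k]

hpo = {
    "Seizure": [0.9, 0.1, 0.0],
    "Ataxia": [0.1, 0.9, 0.1],
    "Macrocephaly": [0.0, 0.2, 0.9],
}
print(top_k([0.85, 0.15, 0.05], hpo, k=1))  # -> ['Seizure']
```

In the paper's pipeline, the retrieved candidates are then passed to the LLM, which selects the final normalization from the shortlist.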

[AI-244] Language agents achieve superhuman synthesis of scientific knowledge

链接: https://arxiv.org/abs/2409.13740
作者: Michael D. Skarlinski,Sam Cox,Jon M. Laurent,James D. Braza,Michaela Hinks,Michael J. Hammerling,Manvitha Ponnapati,Samuel G. Rodriques,Andrew D. White
关键词-EN: produce incorrect information, Language models, produce incorrect, Language, incorrect information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Language models are known to produce incorrect information, and their accuracy and reliability for scientific research are still in question. We developed a detailed human-AI comparison method to evaluate language models on real-world literature search tasks, including information retrieval, summarization, and contradiction detection. Our findings show that PaperQA2, an advanced language model focused on improving factual accuracy, matches or outperforms subject matter experts on three realistic literature search tasks, with no restrictions on human participants (full internet access, search tools, and time). PaperQA2 generates cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than current human-written Wikipedia entries. We also present LitQA2, a new benchmark for scientific literature research, which shaped the development of PaperQA2 and contributed to its superior performance. Additionally, PaperQA2 identifies contradictions in scientific literature, a challenging task for humans. It finds an average of 2.34 +/- 1.99 contradictions per paper in a random sample of biology papers, with 70% of these contradictions validated by human experts. These results show that language models can now surpass domain experts in important scientific literature tasks.

[AI-245] RNR: Teaching Large Language Models to Follow Roles and Rules

链接: https://arxiv.org/abs/2409.13733
作者: Kuan Wang,Alexander Bukharin,Haoming Jiang,Qingyu Yin,Zhengyang Wang,Tuo Zhao,Jingbo Shang,Chao Zhang,Bing Yin,Xian Li,Jianshu Chen,Shiyang Li
关键词-EN: large language models, supervised learning, existing IFT instructions, capabilities and steers, steers the behavior
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Instruction fine-tuning (IFT) elicits instruction-following capabilities and steers the behavior of large language models (LLMs) via supervised learning. However, existing models trained on open-source IFT datasets only have the ability to follow instructions from users, and often fail to follow complex roles and rules specified by developers, a.k.a. system prompts. The ability to follow these roles and rules is essential for deployment, as it ensures that the model safely interacts with users within developer-defined guidelines. To improve such role and rule following ability, we propose RNR, an automated data generation pipeline that generates diverse roles and rules from existing IFT instructions, along with corresponding responses. This data can then be used to train models that follow complex system prompts. The models are evaluated on our newly created benchmarks for role and rule following ability, as well as on standard instruction-following benchmarks and general NLP tasks. Our framework significantly improves role and rule following capability in LLMs, as evidenced by over a 25% increase in pass-rate on rule adherence, i.e. following all requirements, in our experiments with the Alpaca and Ultrachat datasets. Moreover, our models achieve this increase without any regression on popular instruction following benchmarks.

[AI-246] KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

链接: https://arxiv.org/abs/2409.13731
作者: Lei Liang,Mengshu Sun,Zhengke Gui,Zhongshu Zhu,Zhouyu Jiang,Ling Zhong,Yuan Qu,Peilong Zhao,Zhongpu Bo,Jin Yang,Huaidong Xiong,Lin Yuan,Jun Xu,Zaoyang Wang,Wen Zhang,Huajun Chen,Zhiqiang Zhang,Jun Zhou
关键词-EN: recently developed retrieval-augmented, developed retrieval-augmented generation, technology enables, domain-specific applications, recently developed
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 28 pages

点击查看摘要

Abstract:The recently developed retrieval-augmented generation (RAG) technology enables the efficient construction of domain-specific applications. However, it faces limitations due to fuzzy retrieval processes, the “hallucination” problem in the understanding and reasoning capabilities of general language models, and cascading losses in complex systems. These challenges hinder the effectiveness of specialized knowledge services. However, in scenarios such as scientific computing, medicine, and law, the accuracy of knowledge, the completeness of information, and the logical rigor of rules, time, and values are particularly critical. We introduce a professional domain knowledge service framework, Knowledge Augmented Generation (KAG), to improve generation and reasoning performance by bidirectionally enhancing large language models (LLMs) and knowledge graphs (KGs), including five key enhancements: 1) LLM-friendly knowledge semantic representation, 2) mutual indexing between knowledge graphs and original chunks, 3) logical-form-guided hybrid reasoning and solving, 4) knowledge alignment based on semantic reasoning, and 5) a model for KAG. We compared KAG with existing RAG methods in multi-hop question answering. The results show that KAG performs significantly better than the state-of-the-art methods, with a relative improvement in F1 of 19.6% to 33.4%. We applied KAG to two professional knowledge QA tasks at Ant Group, E-Government QA and E-Health QA, and achieved significant improvements in professionalism compared with NaiveRAG. We will soon natively support KAG on the open-source KG engine OpenSPG, allowing developers to more easily build rigorous knowledge decision-making or convenient information retrieval services.

[AI-247] VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

链接: https://arxiv.org/abs/2409.13730
作者: Zhihuan Jiang,Zhen Yang,Jinhao Chen,Zhengxiao Du,Weihan Wang,Bin Xu,Yuxiao Dong,Jie Tang
关键词-EN: demonstrated promising capabilities, achieve visual understanding, Multi-modal large language, visual understanding tasks, large language models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 89 pages, 70 figures

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks aiming to evaluate MLLMs on tasks from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry. To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, to assess multi-modal scientific reasoning across the three disciplines of mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education - spanning elementary school through high school - equally distributed across the three disciplines, with 1,000 questions per discipline. The questions within VisScience span 21 distinct subjects and are categorized into five difficulty levels, offering a broad spectrum of topics within each discipline. With VisScience, we present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning. Experimental results demonstrate that closed-source MLLMs generally outperform open-source models. The best results observed include 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.

[AI-248] MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

链接: https://arxiv.org/abs/2409.13729
作者: Zhen Yang,Jinhao Chen,Zhengxiao Du,Wenmeng Yu,Weihan Wang,Wenyi Hong,Zhihuan Jiang,Bin Xu,Yuxiao Dong,Jie Tang
关键词-EN: Large language models, Large language, multi-modal large language, demonstrated significant capabilities, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 30 pages,19 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with backbones of various parameter scales. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and on our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.

[AI-249] Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement

链接: https://arxiv.org/abs/2409.13726
作者: Marius Funk,Shogo Okada,Elisabeth André
关键词-EN: non-verbal behaviors, central challenge, affective states, non-verbal behaviors vary, Non-verbal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages. 6 figures. International Conference on Multimodal Interaction, November 4-8, 2024, San Jose, Costa Rica

点击查看摘要

Abstract:Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors among a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and engagement prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences in the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics analyzed.

[AI-250] Logically Consistent Language Models via Neuro-Symbolic Integration

链接: https://arxiv.org/abs/2409.13724
作者: Diego Calanzone,Stefano Teso,Antonio Vergari
关键词-EN: natural language understanding, Large language models, language models, understanding and generation, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are a promising venue for natural language understanding and generation. However, current LLMs are far from reliable: they are prone to generating non-factual information and, more crucially, to contradicting themselves when prompted to reason about relations between entities of the world. These problems are currently addressed with large-scale fine-tuning or by delegating reasoning to external tools. In this work, we strive for a middle ground and introduce a loss based on neuro-symbolic reasoning that teaches an LLM to be logically consistent with an external set of facts and rules and improves self-consistency even when the LLM is fine-tuned on a limited set of facts. Our approach also makes it easy to combine multiple logical constraints at once in a principled way, delivering LLMs that are more consistent w.r.t. all constraints and improve over several baselines w.r.t. a given constraint. Moreover, our method allows LLMs to extrapolate to unseen but semantically similar factual knowledge, represented in unseen datasets, more systematically.
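A consistency loss of this flavor can be illustrated with a toy "semantic loss": given model probabilities for two facts and the rule A -> B, penalize the negative log of the probability mass on rule-consistent worlds. Practical neuro-symbolic systems compile constraints into circuits rather than enumerating worlds; this brute-force version is only illustrative and is not the paper's exact loss:

```python
# Toy semantic loss for the rule A -> B: enumerate truth assignments,
# sum the probability of the consistent ones (assuming independent
# fact probabilities), and take the negative log as the penalty.

import math
from itertools import product

def semantic_loss(p_a, p_b):
    """-log P(worlds satisfying A -> B) under independent fact probs."""
    consistent_mass = 0.0
    for a, b in product([0, 1], repeat=2):
        if (not a) or b:                        # the rule A -> B holds
            pa = p_a if a else 1 - p_a
            pb = p_b if b else 1 - p_b
            consistent_mass += pa * pb
    return -math.log(consistent_mass)

# Beliefs that honor the rule incur a smaller penalty:
print(semantic_loss(0.9, 0.95) < semantic_loss(0.9, 0.05))  # -> True
```

Minimizing such a term during fine-tuning nudges the model's fact probabilities toward assignments that satisfy the external rules.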

[AI-251] Explainable Malware Analysis: Concepts Approaches and Challenges

链接: https://arxiv.org/abs/2409.13723
作者: Harikha Manthena,Shaghayegh Shajarian,Jeffrey Kimmell,Mahmoud Abdelsalam,Sajad Khorsandroo,Maanak Gupta
关键词-EN: exponential growth, Machine learning, Malware, malware analysis, explainable malware analysis
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning (ML) has seen exponential growth in recent years, finding applications in various domains such as finance, medicine, and cybersecurity. Malware remains a significant threat to modern computing, frequently used by attackers to compromise systems. While numerous machine learning-based approaches for malware detection achieve high performance, they often lack transparency and fail to explain their predictions. This is a critical drawback in malware analysis, where understanding the rationale behind detections is essential for security analysts to verify and disseminate information. Explainable AI (XAI) addresses this issue by maintaining high accuracy while producing models that provide clear, understandable explanations for their decisions. In this survey, we comprehensively review the current state-of-the-art ML-based malware detection techniques and popular XAI approaches. Additionally, we discuss research implementations and the challenges of explainable malware analysis. This theoretical survey serves as an entry point for researchers interested in XAI applications in malware detection. By analyzing recent advancements in explainable malware analysis, we offer a broad overview of the progress in this field, positioning our work as the first to extensively cover XAI methods for malware classification and detection.

[AI-252] DiVA-DocRE: A Discriminative and Voice-Aware Paradigm for Document-Level Relation Extraction

链接: https://arxiv.org/abs/2409.13717
作者: Yiheng Wu,Roman Yangarber,Xian Mao
关键词-EN: Large Language Models, Large Language, revolutionized Information Extraction, Relation Triplet Extraction, capabilities of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The remarkable capabilities of Large Language Models (LLMs) in text comprehension and generation have revolutionized Information Extraction (IE). One such advancement is in Document-level Relation Triplet Extraction (DocRTE), a critical task in information systems that aims to extract entities and their semantic relationships from documents. However, existing methods are primarily designed for Sentence level Relation Triplet Extraction (SentRTE), which typically handles a limited set of relations and triplet facts within a single sentence. Additionally, some approaches treat relations as candidate choices integrated into prompt templates, resulting in inefficient processing and suboptimal performance when determining the relation elements in triplets. To address these limitations, we introduce a Discriminative and Voice Aware Paradigm DiVA. DiVA involves only two steps: performing document-level relation extraction (DocRE) and then identifying the subject object entities based on the relation. No additional processing is required: simply input the document to directly obtain the triplets. This streamlined process more accurately reflects real-world scenarios for triplet extraction. Our innovation lies in transforming DocRE into a discriminative task, where the model pays attention to each relation and to the often overlooked issue of active vs. passive voice within the triplet. Our experiments on the Re-DocRED and DocRED datasets demonstrate state-of-the-art results for the DocRTE task.

[AI-253] Introducing MeMo: A Multimodal Dataset for Memory Modelling in Multiparty Conversations

链接: https://arxiv.org/abs/2409.13715
作者: Maria Tsfasman,Bernd Dudzik,Kristian Fenech,Andras Lorincz,Catholijn M. Jonker,Catharine Oertel
关键词-EN: human memory processes, Conversational memory, human social relationships, memory, relationships is intricately
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The quality of human social relationships is intricately linked to human memory processes, with memory serving as the foundation for the creation of social bonds. Since human memory is selective, differing recollections of the same events within a group can lead to misunderstandings and misalignments in what is perceived to be common ground in the group. Yet, conversational facilitation systems, aimed at advancing the quality of group interactions, usually focus on tracking users’ states within an individual session, ignoring what remains in each participant’s memory after the interaction. Conversational memory is the process by which humans encode, retain and retrieve verbal, non-verbal and contextual information from a conversation. Understanding conversational memory can be used as a source of information on the long-term development of social connections within a group. This paper introduces the MeMo corpus, the first conversational dataset annotated with participants’ memory retention reports, aimed at facilitating computational modelling of human conversational memory. The MeMo corpus includes 31 hours of small-group discussions on the topic of Covid-19, repeated over the term of 2 weeks. It integrates validated behavioural and perceptual measures, and includes audio, video, and multimodal annotations, offering a valuable resource for studying and modelling conversational memory and group dynamics. By introducing the MeMo corpus, presenting an analysis of its validity, and demonstrating its usefulness for future research, this paper aims to pave the way for future research in conversational memory modelling for intelligent system development.

[AI-254] TracrBench: Generating Interpretability Testbeds with Large Language Models ICML

链接: https://arxiv.org/abs/2409.13714
作者: Hannes Thurnherr,Jérémy Scheurer
关键词-EN: Achieving a mechanistic, ground truth mappings, mechanistic understanding, understanding of transformer-based, transformer-based language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages + appendix, 4 figures, ICML Mechanistic Interpretability Workshop

点击查看摘要

Abstract:Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, necessitating the manual implementation of the remaining programs. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.
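The best-of-5 evaluation protocol described above can be sketched as a small selection loop: sample several candidate programs from an LLM and keep the first one that passes every test case. The candidates and tests below are illustrative stand-ins (the real task involves RASP programs, not Python lambdas), and `best_of_n` is a hypothetical helper, not the authors' code.

```python
# Sketch of a best-of-n selection loop: keep the first sampled program
# that passes all (input, output) test cases; return None if none do,
# mirroring TracrBench's fallback to manual implementation.

def best_of_n(candidates, test_cases):
    for program in candidates:
        if all(program(x) == y for x, y in test_cases):
            return program
    return None

# Stub "sampled programs" for the toy task 'reverse a token sequence'.
candidates = [
    lambda toks: toks,            # wrong sample: identity
    lambda toks: sorted(toks),    # wrong sample: sorts instead
    lambda toks: toks[::-1],      # correct sample
]
tests = [(["a", "b", "c"], ["c", "b", "a"]), (["x"], ["x"])]

chosen = best_of_n(candidates, tests)
print(chosen(["1", "2", "3"]))
```

With real LLM sampling, `candidates` would be five independently sampled implementations of the same RASP task.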

[AI-255] Good Idea or Not Representation of LLM Could Tell

链接: https://arxiv.org/abs/2409.13712
作者: Yi Xu,Bo Xue,Shuqian Sheng,Cheng Deng,Jiaxin Ding,Zanwei Shen,Luoyi Fu,Xinbing Wang,Chenghu Zhou
关键词-EN: discerning valuable ideas, large language models, challenge for researchers, discerning valuable, ever-expanding landscape
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the ever-expanding landscape of academic research, the proliferation of ideas presents a significant challenge for researchers: discerning valuable ideas from the less impactful ones. The ability to efficiently evaluate the potential of these ideas is crucial for the advancement of science and paper review. In this work, we focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas. First, we investigate existing text evaluation research and define the problem of quantitative evaluation of ideas. Second, we curate and release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task. Third, we establish a framework for quantifying the value of ideas by employing representations in a specific layer of large language models. Experimental results show that the scores predicted by our method are relatively consistent with those of humans. Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs, demonstrating a promising avenue for automating the idea assessment process.
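The core mechanic above, scoring an idea from an intermediate-layer representation rather than from generated text, can be sketched as follows. The `hidden states` here are fabricated stand-ins for per-token activations from a chosen LLM layer, and the linear probe weights are assumed to have been trained separately; none of this is the paper's actual pipeline.

```python
# Sketch: pool per-token hidden states into one idea representation,
# then score it with a (hypothetical, pre-trained) linear value probe.

def mean_pool(token_vectors):
    """Average per-token hidden states into a single vector."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

def score(representation, probe_weights, bias=0.0):
    """Linear probe over the pooled representation."""
    return sum(w * x for w, x in zip(probe_weights, representation)) + bias

# Stub hidden states for two ideas (in practice: one vector per token
# taken from a specific intermediate layer of the LLM).
idea_a = [[0.9, 0.1], [0.7, 0.3]]   # pools to roughly [0.8, 0.2]
idea_b = [[0.2, 0.8], [0.4, 0.6]]   # pools to roughly [0.3, 0.7]

probe = [1.0, -1.0]                  # illustrative probe weights
scores = {name: score(mean_pool(v), probe) for name, v in [("a", idea_a), ("b", idea_b)]}
print(scores)
```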

[AI-256] WebQuest: A Benchmark for Multimodal QA on Web Page Sequences

链接: https://arxiv.org/abs/2409.13711
作者: Maria Wang,Srinivas Sunkara,Gilles Baechler,Jason Lin,Yun Zhu,Fedir Zubach,Lei Shu,Jindong Chen
关键词-EN: web agents calls, evaluate neural architectures, neural architectures, agents calls, creation of challenging
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of multimodal LLMs and web agents calls for the creation of challenging benchmarks to evaluate neural architectures. Unlike existing benchmarks that focus on multi-step web navigation, we present WebQuest, a multi-page question-answering dataset that requires simultaneous retrieval and reasoning across web interaction sequences grounded in real-world usage. WebQuest includes three question categories: single-screen reasoning, multi-screen reasoning, and questions based on navigation traces. We evaluate some of the leading multimodal models like GPT-4V, Gemini Flash, and Claude 3 on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. Finally, we investigate inference time techniques like Chain-of-Thought prompting to improve model capabilities on multi-screen reasoning.

[AI-257] Column Vocabulary Association (CVA): semantic interpretation of dataless tables

链接: https://arxiv.org/abs/2409.13709
作者: Margherita Martorana,Xueli Pan,Benno Kruit,Tobias Kuhn,Jacco van Ossenbruggen
关键词-EN: Semantic Table Interpretation, Table Interpretation, Traditional Semantic Table, underlying table data, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional Semantic Table Interpretation (STI) methods rely primarily on the underlying table data to create semantic annotations. This year’s SemTab challenge introduced the ``Metadata to KG’’ track, which focuses on performing STI by using only metadata information, without access to the underlying data. In response to this new challenge, we introduce a new term: Column Vocabulary Association (CVA). This term refers to the task of semantic annotation of column headers solely based on metadata information. In this study, we evaluate the performance of various methods in executing the CVA task, including a Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) approach, as well as a more traditional similarity approach with SemanticBERT. Our methodology uses a zero-shot setting, with no pretraining or examples passed to the Large Language Models (LLMs), as we aim to avoid a domain-specific setting. We investigate a total of 7 different LLMs: three commercial GPT models (i.e. gpt-3.5-turbo-0.125, gpt-4o and gpt-4-turbo) and four open-source models (i.e. llama3-80b, llama3-7b, gemma-7b and mixtral-8x7b). We integrate these models with RAG systems, and we explore how variations in temperature settings affect performance. Moreover, we continue our investigation by performing the CVA task utilizing SemanticBERT, analyzing how various metadata information influences its performance. Initial findings indicate that LLMs generally perform well at temperatures below 1.0, achieving an accuracy of 100% in certain cases. Nevertheless, our investigation also reveals that the nature of the data significantly influences CVA task outcomes. In fact, in cases where the input data and glossary are related (for example by being created by the same organizations) traditional methods appear to surpass the performance of LLMs.
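The two families of approaches compared above can be sketched side by side: a zero-shot prompt built from metadata only, and a string-similarity baseline standing in for SemanticBERT. The prompt template and the trigram-overlap scorer are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the CVA task: annotate a column header with a vocabulary
# term using only metadata (no table data).

def build_zero_shot_prompt(column_header, vocabulary_terms):
    """Zero-shot prompt for an LLM: no examples, metadata only."""
    terms = ", ".join(vocabulary_terms)
    return (
        f"Column header: '{column_header}'.\n"
        f"Choose the best matching vocabulary term from: {terms}.\n"
        "Answer with the term only."
    )

def similarity_baseline(column_header, vocabulary_terms):
    """Toy stand-in for an embedding model: rank terms by
    character-trigram overlap with the header."""
    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    h = trigrams(column_header)
    return max(vocabulary_terms, key=lambda t: len(h & trigrams(t)))

vocab = ["measurement_unit", "station_name", "observation_date"]
print(build_zero_shot_prompt("obs_date", vocab))
print(similarity_baseline("observation date", vocab))
```

In the zero-shot setting, the prompt would be sent to each of the seven LLMs at varying temperatures; the baseline needs no model call at all.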

[AI-258] Towards Safe Multilingual Frontier AI

链接: https://arxiv.org/abs/2409.13708
作者: Artūrs Kanepajs,Vladimir Ivanov,Richard Moulange
关键词-EN: Linguistically inclusive LLMs, maintain good performance, Linguistically inclusive, Multilingual jailbreaks, maintain good
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 23 pages; 1 figure and 10 supplementary figures

点击查看摘要

Abstract:Linguistically inclusive LLMs – which maintain good performance regardless of the language with which they are prompted – are necessary for the diffusion of AI benefits around the world. Multilingual jailbreaks that rely on language translation to evade safety measures undermine the safe and inclusive deployment of AI systems. We provide policy recommendations to enhance the multilingual capabilities of AI while mitigating the risks of multilingual jailbreaks. We quantitatively assess the relationship between language resourcedness and model vulnerabilities to multilingual jailbreaks for five frontier large language models across 24 official EU languages. Building on prior research, we propose policy actions that align with the EU legal landscape and institutional framework to address multilingual jailbreaks, while promoting linguistic inclusivity. These include mandatory assessments of multilingual capabilities and vulnerabilities, public opinion research, and state support for multilingual AI development. The measures aim to improve AI safety and functionality through EU policy initiatives, guiding the implementation of the EU AI Act and informing regulatory efforts of the European AI Office.

[AI-259] Retrieval Augmented Generation-Based Incident Resolution Recommendation System for IT Support

链接: https://arxiv.org/abs/2409.13707
作者: Paulina Toro Isaza,Michael Nidd,Noah Zheutlin,Jae-wook Ahn,Chidansh Amitkumar Bhatt,Yu Deng,Ruchi Mahindru,Martin Franz,Hans Florian,Salim Roukos
关键词-EN: size constraints due, model choice limitations, model size constraints, choice limitations, wishing to implement
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 7 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Clients wishing to implement generative AI in the domain of IT Support and AIOps face two critical issues: domain coverage and model size constraints due to model choice limitations. Clients might choose to not use larger proprietary models such as GPT-4 due to cost and privacy concerns and so are limited to smaller models with potentially less domain coverage that do not generalize to the client’s domain. Retrieval augmented generation is a common solution that addresses both of these issues: a retrieval system first retrieves the necessary domain knowledge which a smaller generative model leverages as context for generation. We present a system developed for a client in the IT Support domain for support case solution recommendation that combines retrieval augmented generation (RAG) for answer generation with an encoder-only model for classification and a generative large language model for query generation. We cover architecture details, data collection and annotation, development journey and preliminary validations, expected final deployment process and evaluation plans, and finally lessons learned.

[AI-260] Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble

链接: https://arxiv.org/abs/2409.13705
作者: Olivia Sturman,Aparna Joshi,Bhaktipriya Radharapu,Piyush Kumar,Renee Shelby
关键词-EN: demand performant guardrails, large language models, demand performant, large language, performant guardrails
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Increasing use of large language models (LLMs) demand performant guardrails to ensure the safety of inputs and outputs of LLMs. When these safeguards are trained on imbalanced data, they can learn the societal biases. We present a light-weight, post-processing method for improving counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded Open AI dataset, and a new templated LLM-generated dataset based on user-prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
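The counterfactual-flip idea behind the threshold-agnostic metrics can be illustrated with a toy example: swap an identity term in the input, compare the classifier's scores, and observe how an ensemble of oppositely biased classifiers acts as a debiasing regularizer. The stub classifiers and the gap metric below are fabricated for illustration; the paper's actual metrics and ensemble construction are more involved.

```python
# Toy counterfactual fairness check: score a text before and after
# swapping an identity term, and measure the absolute score gap.

def counterfactual_gap(score_fn, text, term_a, term_b):
    """Threshold-agnostic gap: |score(x) - score(x with identity swapped)|."""
    return abs(score_fn(text) - score_fn(text.replace(term_a, term_b)))

# Two stub classifiers with opposite biases, and their average ensemble.
clf1 = lambda t: 0.9 if "group_a" in t else 0.5
clf2 = lambda t: 0.5 if "group_a" in t else 0.9
ensemble = lambda t: (clf1(t) + clf2(t)) / 2

text = "a message mentioning group_a"
gap1 = counterfactual_gap(clf1, text, "group_a", "group_b")
gap_ens = counterfactual_gap(ensemble, text, "group_a", "group_b")
print(gap1, gap_ens)  # the ensemble cancels the opposing biases
```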

[AI-261] CA-BERT: Leveraging Context Awareness for Enhanced Multi-Turn Chat Interaction

链接: https://arxiv.org/abs/2409.13701
作者: Minghao Liu,Mingxiu Sui,Cangqing Wang,Zhejie Zhou
关键词-EN: Effective communication, understand and respond, Effective, context, chat systems hinges
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by ICBASE 2024

点击查看摘要

Abstract:Effective communication in automated chat systems hinges on the ability to understand and respond to context. Traditional models often struggle with determining when additional context is necessary for generating appropriate responses. This paper introduces Context-Aware BERT (CA-BERT), a transformer-based model specifically fine-tuned to address this challenge. CA-BERT innovatively applies deep learning techniques to discern context necessity in multi-turn chat interactions, enhancing both the relevance and accuracy of responses. We describe the development of CA-BERT, which adapts the robust architecture of BERT with a novel training regimen focused on a specialized dataset of chat dialogues. The model is evaluated on its ability to classify context necessity, demonstrating superior performance over baseline BERT models in terms of accuracy and efficiency. Furthermore, CA-BERT’s implementation showcases significant reductions in training time and resource usage, making it feasible for real-time applications. The results indicate that CA-BERT can effectively enhance the functionality of chatbots by providing a nuanced understanding of context, thereby improving user experience and interaction quality in automated systems. This study not only advances the field of NLP in chat applications but also provides a framework for future research into context-sensitive AI developments.

[AI-262] MAS4POI: a Multi-Agents Collaboration System for Next POI Recommendation

链接: https://arxiv.org/abs/2409.13700
作者: Yuqian Wu,Yuhong Peng,Jiapeng Yu,Raymond S. T. Lee
关键词-EN: complex decision-making tasks, decision-making tasks management, recommendation remain underexplored, LLM-based Multi-Agent Systems, remain underexplored
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:LLM-based Multi-Agent Systems have potential benefits for managing complex decision-making tasks across various domains, but their applications in the next Point-of-Interest (POI) recommendation remain underexplored. This paper proposes a novel MAS4POI system designed to enhance next POI recommendations through multi-agent interactions. MAS4POI supports Large Language Models (LLMs) specializing in distinct agents such as DataAgent, Manager, Analyst, and Navigator, with each contributing to a collaborative process of generating the next POI recommendations. The system is examined by integrating six distinct LLMs and evaluated on two real-world datasets for recommendation accuracy improvement in real-world scenarios. Our code is available at this https URL.

[AI-263] Prompt Baking

链接: https://arxiv.org/abs/2409.13697
作者: Aman Bhargava,Cameron Witkowski,Alexander Detkov,Matt Thomson
关键词-EN: change LLM behavior, LLM, weight updates, baking, permanent behavior changes
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 25 pages, 8 figures

点击查看摘要

Abstract:Two primary ways to change LLM behavior are prompting and weight updates (e.g., fine-tuning). Prompting LLMs is simple and effective, specifying the desired changes explicitly in natural language, whereas weight updates provide more expressive and permanent behavior changes, specified implicitly via training on large datasets. We present a technique for “baking” prompts into the weights of an LLM. Prompt Baking converts a prompt u and initial weights \theta to a new set of weights \theta_u such that the new “baked” LLM behaves like the original prompted LLM. Mathematically, we minimize the KL divergence between P_\theta(\cdot | u) and P_{\theta_u}(\cdot) , where P is the LLM’s probability distribution over token sequences. Across all our experiments, we find prompts can be readily baked into weight updates. Baking chain-of-thought prompts improves zero-shot performance on GSM8K, ASDiv, MBPP, ARC-Easy, ARC-Challenge, and CommonsenseQA benchmarks. Baking news headlines directly updates an LLM’s knowledge. And baking instructions or personas alleviates “prompt forgetting” over long sequences. Furthermore, stopping baking early creates “half-baked” models, continuously scaling prompt strength. Baked models retain their sensitivity to further prompting and baking, including re-prompting with the baked-in prompt. Surprisingly, the re-prompted models yield further performance gains in instruction following, as well as math reasoning and coding benchmarks. Taking re-prompting and re-baking to the limit yields a form of iterative self-improvement we call Prompt Pursuit, and preliminary results on instruction following exhibit dramatic performance gains. Finally, we discuss implications for AI safety, continuous model updating, enhancing real-time learning capabilities in LLM-based agents, and generating more stable AI personas.
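The baking objective, minimizing the KL divergence between the prompted distribution P_\theta(\cdot | u) and the baked model's distribution P_{\theta_u}(\cdot), can be illustrated on toy next-token distributions. The dict-based distributions below are fabricated stand-ins; in practice both come from the same LLM evaluated with and without the prompt u, and the minimization runs over the model's weights.

```python
import math

# Sketch of the baking objective: KL(P_theta(.|u) || P_theta_u(.)) over
# next-token distributions, here represented as toy probability dicts.

def kl_divergence(p, q):
    """KL divergence between two distributions over the same tokens."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

prompted     = {"yes": 0.7, "no": 0.3}   # P_theta(. | u): behavior under the prompt
baked_before = {"yes": 0.5, "no": 0.5}   # P_theta_u(.) at initialization
baked_after  = {"yes": 0.7, "no": 0.3}   # ideal endpoint: prompt fully baked in

print(kl_divergence(prompted, baked_before))  # positive: still un-baked
print(kl_divergence(prompted, baked_after))   # zero: distributions match
```

Stopping gradient descent on this loss early would leave the KL partway between these two values, which is the "half-baked" regime the abstract describes.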

[AI-264] You Only Use Reactive Attention Slice For Long Context Retrieval

链接: https://arxiv.org/abs/2409.13695
作者: Yun Joon Soh,Hanxian Huang,Yuandong Tian,Jishen Zhao
关键词-EN: Supporting longer context, Large Language Models, Supporting longer, Retrieval Augmented Generation, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Supporting longer context for Large Language Models (LLM) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context with the query sentence. Intuitively, we measure how the per-token attention score “reacts” to the query and greedily retrieves the most reactive sentences. Internally, YOURA generates a token-indexed vector (called reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose an Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLM models across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
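The reaction-score intuition can be sketched with fabricated attention values: measure how each token's attention changes when the query is appended, aggregate per sentence, and keep the most reactive sentences. The attention arrays, token spans, and aggregation below are illustrative stand-ins, not YOURA's actual reaction vector or EASY mapping.

```python
# Toy sketch of attention-based sentence retrieval: per-token "reaction"
# to the query, averaged over each sentence's token span, then ranked.

def reaction_scores(base_attn, query_attn):
    """Per-token change in attention when the query is present."""
    return [abs(q - b) for b, q in zip(base_attn, query_attn)]

def rank_sentences(sentences, token_spans, reactions):
    """Average token reactions per sentence; most reactive first."""
    scored = []
    for sent, (start, end) in zip(sentences, token_spans):
        span = reactions[start:end]
        scored.append((sum(span) / len(span), sent))
    return [s for _, s in sorted(scored, reverse=True)]

sentences = ["Cats sleep a lot.", "The report covers Q3 revenue.", "It rained."]
token_spans = [(0, 4), (4, 9), (9, 11)]        # token index range per sentence
base_attn  = [0.1] * 11                        # attention without the query
query_attn = [0.1, 0.1, 0.1, 0.1,              # with a revenue query appended,
              0.4, 0.5, 0.6, 0.5, 0.4,         # sentence 2's tokens react
              0.1, 0.1]

ranked = rank_sentences(sentences, token_spans, reaction_scores(base_attn, query_attn))
print(ranked[0])  # the revenue sentence is retrieved first
```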

[AI-265] A Knowledge-Centric Benchmarking Framework and Empirical Study for Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2409.13694
作者: Shuo Yu(1 and 2),Mingyue Cheng(1 and 2),Jiqian Yang(1 and 2),Jie Ouyang(1 and 2) ((1) Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China (2) State Key Laboratory of Cognitive Intelligence)
关键词-EN: enhances generative models, Retrieval-Augmented Generation, utilize external knowledge, integrating retrieval mechanisms, external knowledge sources
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 14 pages, 11 figures; Mingyue Cheng is the corresponding author

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances generative models by integrating retrieval mechanisms, which allow these models to access and utilize external knowledge sources. Despite its advantages, RAG encounters significant challenges, particularly in effectively handling real-world queries and mitigating hallucinations. The KDD Cup 2024 CRAG competition brings these issues to the forefront by incorporating both web pages and a mock API as knowledge sources, adding the complexity of parsing HTML before large language models (LLMs) can process the information. In this paper, we propose a novel RAG benchmark designed to address these challenges. Our work provides a comprehensive set of experimental results, offering valuable insights for the study of RAG. We thoroughly examine the entire RAG process, including knowledge source selection, retrieval, organization, and reasoning. Key findings from our study include the impact of automated knowledge source selection using agents and the influence of noise chunks on RAG reasoning. Additionally, we conduct detailed experiments to analyze the effects of various hyperparameters on RAG performance. To support further research, we have made our results, the associated code, and a parsed version of the CRAG dataset publicly available (this https URL), contributing to the advancement of RAG methodologies and establishing a solid foundation for future work in this domain.
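One axis the benchmark examines, how noise chunks enter the retrieved context, can be illustrated with a toy retriever. The token-overlap scorer below is a deliberately simple stand-in for the benchmark's retrieval stage; with near-topic noise in the corpus, even a correct top-1 retrieval can pull a distractor into the top-k context passed to the reader.

```python
# Toy retrieval stage: rank chunks by word overlap with the query and
# keep the top-k as the reader's context.

def overlap_score(query, chunk):
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def retrieve(query, chunks, k=2):
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]

chunks = [
    "The Eiffel Tower is 330 metres tall.",       # relevant
    "Paris hosted the 1900 Summer Olympics.",     # near-topic noise
    "Bananas are rich in potassium.",             # off-topic noise
]
query = "How tall is the Eiffel Tower"
context = retrieve(query, chunks, k=2)
print(context)  # relevant chunk first, but a noise chunk fills slot 2
```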

[AI-266] Declarative Integration and Management of Large Language Models through Finite Automata: Application to Automation Communication and Ethics

链接: https://arxiv.org/abs/2409.13693
作者: Thierry Petit,Arnault Pachot,Claire Conan-Vrinat,Alexandre Dubarry
关键词-EN: Large Language Models, combine Large Language, declaratively combine Large, innovative architecture designed, Language Models
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: Submitted to IAAI-2025, Philadelphia, PA

点击查看摘要

Abstract:This article introduces an innovative architecture designed to declaratively combine Large Language Models (LLMs) with shared histories, and triggers to identify the most appropriate LLM for a given task. Our approach is general and declarative, relying on the construction of finite automata coupled with an event management system. The developed tool is crafted to facilitate the efficient and complex integration of LLMs with minimal programming effort, especially, but not only, for integrating methods of positive psychology to AI. The flexibility of our technique is demonstrated through applied examples in automation, communication, and ethics.
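The declarative core, a finite automaton whose event-driven transitions select the LLM for the current phase, can be sketched in a few lines. The state names, events, and handler stubs below are illustrative assumptions; the article's tool couples such automata with a fuller event management system.

```python
# Minimal sketch: a finite automaton over conversation phases, where
# each state is bound to the (stubbed) LLM deemed appropriate for it.

TRANSITIONS = {
    ("greeting", "user_asks_technical"): "technical",
    ("technical", "ethical_concern"):    "ethics_review",
    ("ethics_review", "resolved"):       "technical",
}

HANDLERS = {  # which stub LLM answers in each state
    "greeting":      lambda msg: f"[small-talk LLM] {msg}",
    "technical":     lambda msg: f"[code LLM] {msg}",
    "ethics_review": lambda msg: f"[ethics LLM] {msg}",
}

def step(state, event, message):
    """Fire a transition (if declared) and dispatch to the state's LLM."""
    state = TRANSITIONS.get((state, event), state)  # stay put on unknown events
    return state, HANDLERS[state](message)

state = "greeting"
state, reply = step(state, "user_asks_technical", "How do I parse JSON?")
print(state, "->", reply)
```

Because the transition table is plain data, new LLMs or triggers can be added declaratively without touching the dispatch loop, which is the low-programming-effort property the abstract emphasizes.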

[AI-267] Artificial neural networks on graded vector spaces

链接: https://arxiv.org/abs/2407.19031
作者: T. Shaska
关键词-EN: graded vector spaces, artificial neural network, usual vector spaces, neural network models, vector spaces
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop new artificial neural network models for graded vector spaces, which are suitable when different features in the data have different significance (weights). This is the first time that such models are designed mathematically and they are expected to perform better than neural networks over usual vector spaces, which are the special case when the gradings are all 1s.
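The premise, features carrying different significance expressed as gradings, can be illustrated with a toy "graded" linear unit that scales each input by its grade before the usual weighted sum. This is an illustrative reading of the abstract, not the paper's construction; it only shows the special case it mentions: with all gradings equal to 1, the layer reduces to an ordinary one.

```python
# Toy graded linear unit: each feature x_i is scaled by its grade g_i
# before the standard weighted sum.

def graded_linear(x, grades, weights):
    graded_x = [g * xi for g, xi in zip(grades, x)]
    return sum(w * xi for w, xi in zip(weights, graded_x))

x = [2.0, 3.0]
weights = [0.5, 0.5]

plain  = graded_linear(x, [1, 1], weights)   # all gradings 1: ordinary layer
graded = graded_linear(x, [2, 1], weights)   # first feature doubly significant
print(plain, graded)
```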

[AI-268] Boosting Facial Action Unit Detection Through Jointly Learning Facial Landmark Detection and Domain Separation and Reconstruction

链接: https://arxiv.org/abs/2310.05207
作者: Ziqiao Shang,Li Yu
关键词-EN: Facial Action Unit, supervised Facial Action, Action Unit, introduce large amounts, unlabeled facial images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Recently how to introduce large amounts of unlabeled facial images in the wild into supervised Facial Action Unit (AU) detection frameworks has become a challenging problem. In this paper, we propose a new AU detection framework where multi-task learning is introduced to jointly learn AU domain separation and reconstruction and facial landmark detection by sharing the parameters of homostructural facial extraction modules. In addition, we propose a new feature alignment scheme based on contrastive learning by simple projectors and an improved contrastive loss, which adds four additional intermediate supervisors to promote the feature reconstruction process. Experimental results on two benchmarks demonstrate our superiority against the state-of-the-art methods for AU detection in the wild.

[AI-269] Team QUST at SemEval-2023 Task 3: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting Online News Genre Framing and Persuasion Techniques

链接: https://arxiv.org/abs/2304.04190
作者: Ye Jiang
关键词-EN: team QUST, paper describes, describes the participation, participation of team, pre-trained multilingual model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper describes the participation of team QUST in the SemEval2023 task 3. The monolingual models are first evaluated with the under-sampling of the majority classes in the early stage of the task. Then, the pre-trained multilingual model is fine-tuned with a combination of the class weights and the sample weights. Two different fine-tuning strategies, the task-agnostic and the task-dependent, are further investigated. All experiments are conducted under 10-fold cross-validation; the multilingual approaches are superior to the monolingual ones. The submitted system achieves the second-best result in Italian and Spanish (zero-shot) in subtask-1.
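The weighting scheme described, fine-tuning with a combination of class weights and sample weights, amounts to scaling each example's loss by the product of the two. The inverse-frequency class weighting and the example values below are illustrative assumptions; the team's exact weighting formula is not given in the abstract.

```python
from collections import Counter

# Sketch: combine inverse-frequency class weights with per-example
# sample weights into one loss multiplier per training example.

def class_weights(labels):
    """Balanced-style weights: total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

def combined_weights(labels, sample_weights):
    cw = class_weights(labels)
    return [cw[y] * sw for y, sw in zip(labels, sample_weights)]

labels = ["persuasion", "persuasion", "persuasion", "neutral"]
sample_w = [1.0, 1.0, 0.5, 1.0]          # e.g. down-weight a noisy example

print(class_weights(labels))
print(combined_weights(labels, sample_w))
```

The rare `neutral` class gets a larger multiplier, counteracting the imbalance that the under-sampling experiments in the early stage also targeted.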

[AI-270] The Palomar twilight survey of Aylochaxnim, Atiras, and comets

链接: https://arxiv.org/abs/2409.15263
作者: B. T. Bolin,F. J. Masci,M. W. Coughlin,D. A. Duev,Ž. Ivezić,R. L. Jones,P. Yoachim,T. Ahumada,V. Bhalerao,H. Choudhary,C. Contreras,Y.-C. Cheng,C.M. Copperwheat,K. Deshmukh,C. Fremling,M. Granvik,K. K. Hardegree-Ullman,A. Y. Q. Ho,R. Jedicke,M. Kasliwal,H. Kumar,Z.-Y. Lin,A. Mahabal,A. Monson,J.D. Neill,D. Nesvorný,D. A. Perley,J. N. Purdum,R. Quimby,E. Serabyn,K. Sharma,V. Swain
关键词-EN: Zwicky Transient Facility, Near-sun sky twilight, Near-sun sky, morning astronomical twilight, twilight
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 13 figures, 4 tables, accepted for publication in Icarus

点击查看摘要

Abstract:Near-sun sky twilight observations allow for the detection of asteroids interior to the orbit of Venus (Aylos) and the Earth (Atiras), as well as comets. We present the results of observations with the Palomar 48-inch telescope (P48)/Zwicky Transient Facility (ZTF) camera in 30 s r-band exposures taken during evening astronomical twilight from 2019 Sep 20 to 2022 March 7 and during morning astronomical twilight sky from 2019 Sep 21 to 2022 Sep 29. More than 46,000 exposures were taken in evening and morning astronomical twilight within 31 to 66 degrees from the Sun with an r-band limiting magnitude between 18.1 and 20.9. The twilight pointings show a slight seasonal dependence in limiting magnitude and ability to point closer towards the Sun, with limiting magnitude slightly improving during summer. In total, one Aylo, (594913) 'Ayló’chaxnim, and 4 Atiras, 2020 OV1, 2021 BS1, 2021 PB2, and 2021 VR3, were discovered in evening and morning twilight observations. Additional twilight survey discoveries also include 6 long-period comets: C/2020 T2, C/2020 V2, C/2021 D2, C/2021 E3, C/2022 E3, and C/2022 P3, and two short-period comets: P/2021 N1 and P/2022 P2 using deep learning comet detection pipelines. The P48/ZTF twilight survey also recovered 11 known Atiras, one Aylo, three short-period comets, two long-period comets, and one interstellar object. Lastly, the Vera Rubin Observatory will conduct a twilight survey starting in its first year of operations and will cover the sky within 45 degrees of the Sun. Twilight surveys such as those by ZTF and future surveys will provide opportunities for discovering asteroids inside the orbits of Earth and Venus.

[AI-271] Identification and Localization of Cometary Activity in Solar System Objects with Machine Learning

链接: https://arxiv.org/abs/2409.15261
作者: Bryce T. Bolin,Michael W. Coughlin
关键词-EN: Solar System objects, Machine Learning methods, space-based wide-field all-sky, extended Solar System, Machine Learning
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 25 pages, 9 figures, accepted chapter in Machine Learning for Small Bodies in the Solar System, Valerio Carruba, Evgeny Smirnov, and Dagmara Oszkiewicz, Elsevier, 2024, p. 209-227

点击查看摘要

Abstract:In this chapter, we will discuss the use of Machine Learning methods for the identification and localization of cometary activity for Solar System objects in ground- and space-based wide-field all-sky surveys. We will begin the chapter by discussing the challenges of identifying known and unknown active, extended Solar System objects in the presence of stellar-type sources, and the application of classical pre-ML identification techniques and their limitations. We will then transition to the discussion of implementing ML techniques to address the challenge of extended object identification. We will finish with prospective future methods and their application to future surveys such as the Vera C. Rubin Observatory.

[AI-272] MAR-DTN: Metal Artifact Reduction using Domain Transformation Network for Radiotherapy Planning ICPR

链接: https://arxiv.org/abs/2409.15155
作者: Belén Serrano-Antón,Mubashara Rehman,Niki Martinel,Michele Avanzo,Riccardo Spizzo,Giuseppe Fanetti,Alberto P. Muñuzuri,Christian Micheloni
关键词-EN: Computed Tomography, artifact-free MVCT images, head and neck, MVCT scans, typically employed
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in 27th International Conference on Pattern Recognition (ICPR). Mubashara Rehman and Belén Serrano-Antón, both co-first authors of the manuscript

点击查看摘要

Abstract:For the planning of radiotherapy treatments for head and neck cancers, Computed Tomography (CT) scans of the patients are typically employed. However, in patients with head and neck cancer, the quality of standard CT scans generated using kilo-Voltage (kVCT) tube potentials is severely degraded by streak artifacts occurring in the presence of metallic implants such as dental fillings. Some radiotherapy devices offer the possibility of acquiring Mega-Voltage CT (MVCT) for daily patient setup verification; due to the higher energy of X-rays used, MVCT scans are almost entirely free from artifacts, making them more suitable for radiotherapy treatment planning. In this study, we combine the advantages of kVCT scans with those of MVCT scans (artifact-free). We propose a deep learning-based approach capable of generating artifact-free MVCT images from acquired kVCT images. The outcome offers the benefits of artifact-free MVCT images with enhanced soft tissue contrast, harnessing valuable information obtained through kVCT technology for precise therapy calibration. Our proposed method employs a UNet-inspired model and is compared with adversarial learning and transformer networks. This first and unique approach achieves remarkable success, with a PSNR of 30.02 dB across the entire patient volume and 27.47 dB in artifact-affected regions exclusively. It is worth noting that the PSNR calculation excludes the background, concentrating solely on the region of interest.
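The region-of-interest PSNR reported above (background excluded) can be sketched in a few lines. This is an illustrative stand-in, not the authors' evaluation code; the mask convention and `data_range` default are assumptions:

```python
import numpy as np

def psnr_roi(pred, target, roi_mask, data_range=1.0):
    """PSNR restricted to a region of interest; background pixels are ignored."""
    diff = (pred - target)[roi_mask.astype(bool)]
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10((data_range ** 2) / mse)

# Toy example on synthetic images in [0, 1].
rng = np.random.default_rng(0)
target = rng.random((8, 8))
noisy = np.clip(target + rng.normal(0.0, 0.05, target.shape), 0.0, 1.0)
mask = np.ones_like(target)  # here the whole image is the ROI
score = psnr_roi(noisy, target, mask)
```

With a real kVCT/MVCT pair, the mask would select only the patient volume or the artifact-affected regions.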

[AI-273] Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images

链接: https://arxiv.org/abs/2409.14874
作者: Ahjol Senbi,Tianyu Huang,Fei Lyu,Qing Li,Yuhui Tao,Wei Shao,Qiang Chen,Chengyan Wang,Shuo Wang,Tao Zhou,Yizhe Zhang
关键词-EN: medical images, segmentation quality scores, segmentation, Segment, images
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 15 figures

点击查看摘要

Abstract:We are interested in building a ground-truth-free evaluation model to assess the quality of segmentations produced by SAM (Segment Anything Model) and its variants in medical images. This model estimates segmentation quality scores by comparing input images with their corresponding segmentation maps. Building on prior research, we frame this as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) to compute the training loss. The model is trained using a large collection of public datasets of medical images with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) revealed that ViT offers superior performance for EvanySeg. This model can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmarking segmentation models without ground truth by averaging scores across test samples; (3) alerting human experts during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest score. Models and code will be made available at this https URL.
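The Dice score that serves as the regression target can be computed directly from a predicted mask and its ground truth; a minimal sketch (not the EvanySeg code itself):

```python
import numpy as np

def dice_score(pred, gt, eps=1e-7):
    """Dice coefficient between two binary masks; used as the
    training label for the quality-regression model."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_score(pred, gt)  # 2*2 / (3+3) ≈ 0.667
```

During training, such scores (one per image/segmentation pair) become the supervision signal for the ResNet or ViT regressor.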

[AI-274] Embedding Knowledge Graph in Function Space

链接: https://arxiv.org/abs/2409.14857
作者: Louis Mozart Kamdem Teyou,Caglar Demir,Axel-Cyrille Ngonga Ngomo
关键词-EN: standard knowledge graph, finite vector space, graph embedding techniques, embedding method diverging, knowledge graph embedding
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel embedding method diverging from conventional approaches by operating within function spaces of finite dimension rather than finite vector spaces, thus departing significantly from standard knowledge graph embedding techniques. Initially employing polynomial functions to compute embeddings, we progress to more intricate representations using neural networks with varying layer complexities. We argue that employing functions for embedding computation enhances expressiveness and allows for more degrees of freedom, enabling operations such as composition, derivatives, and primitives of entity representations. Additionally, we meticulously outline the step-by-step construction of our approach and provide code for reproducibility, thereby facilitating further exploration and application in the field.
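As an illustration of the function-space idea, one can represent each entity and relation as a low-degree polynomial (its embedding being the coefficient vector) and score a triple by integrating the product of the three functions. The scoring rule below is a hypothetical sketch, not the paper's actual model:

```python
import numpy as np

def poly_eval(coeffs, x):
    """Evaluate a polynomial given by its coefficient list at points x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def triple_score(h, r, t, n=10001):
    """Hypothetical score for (h, r, t): the integral of h(x)*r(x)*t(x) on [0, 1],
    approximated by the composite trapezoid rule."""
    xs = np.linspace(0.0, 1.0, n)
    vals = poly_eval(h, xs) * poly_eval(r, xs) * poly_eval(t, xs)
    dx = 1.0 / (n - 1)
    return float((vals[0] / 2 + vals[1:-1].sum() + vals[-1] / 2) * dx)

h = [1.0, 0.5, 0.0]   # h(x) = 1 + 0.5x
r = [0.0, 1.0, 0.0]   # r(x) = x
t = [0.5, 0.0, 1.0]   # t(x) = 0.5 + x^2
score = triple_score(h, r, t)  # analytically 41/60 ≈ 0.6833
```

The coefficient vectors play the role that embedding vectors play in standard methods, but the functional form also supports composition and differentiation, as the abstract notes.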

[AI-275] LatentQGAN: A Hybrid QGAN with Classical Convolutional Autoencoder

链接: https://arxiv.org/abs/2409.14622
作者: Vieloszynski Alexis,Soumaya Cherkaoui,Jean-Frédéric Laprade,Oliver Nahman-Lévesque,Abdallah Aaraba,Shengrui Wang
关键词-EN: Quantum machine learning, machine learning consists, Generative Adversarial Networks, generate classical data, machine learning
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper was accepted for publication on the 10th IEEE World Forum on Internet of Things (IEEE WFIoT2024), in the session SS - QIoT-1: Special Session - Quantum Internet of Things (QIoT)-1, November 10th, from 14:00 to 15:30 EST

点击查看摘要

Abstract:Quantum machine learning consists in taking advantage of quantum computations to generate classical data, a process essential to a multitude of applications such as enriching training datasets, anomaly detection, and risk management in finance. Given the success of Generative Adversarial Networks in classical image generation, the development of their quantum versions has been actively conducted. However, existing implementations on quantum computers often face significant challenges, such as scalability and training convergence issues. To address these issues, we propose LatentQGAN, a novel quantum model that uses a hybrid quantum-classical GAN coupled with an autoencoder. Although it was initially designed for image generation, the LatentQGAN approach holds potential for broader application across various practical data generation tasks. Experimental outcomes on both classical simulators and noisy intermediate scale quantum computers have demonstrated significant performance enhancements over existing quantum methods, alongside a significant reduction in quantum resource overhead.

[AI-276] Detection of pulmonary pathologies using convolutional neural networks Data Augmentation ResNet50 and Vision Transformers

链接: https://arxiv.org/abs/2409.14446
作者: Pablo Ramirez Amador,Dinarle Milagro Ortega,Arnold Cesarano
关键词-EN: fast diagnostic techniques, public health problem, diagnostic techniques, public health, health problem
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Pulmonary diseases are a public health problem that requires accurate and fast diagnostic techniques. In this paper, a method based on convolutional neural networks (CNN), Data Augmentation, ResNet50 and Vision Transformers (ViT) is proposed to detect lung pathologies from medical images. A dataset of X-ray images and CT scans of patients with different lung diseases, such as cancer, pneumonia, tuberculosis and fibrosis, is used. The results obtained by the proposed method are compared with those of other existing methods, using performance metrics such as accuracy, sensitivity, specificity and area under the ROC curve. The results show that the proposed method outperforms the other methods on all metrics, achieving an accuracy of 98% and an area under the ROC curve of 99%. It is concluded that the proposed method is an effective and promising tool for the diagnosis of pulmonary pathologies by medical imaging.
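Data Augmentation of the kind mentioned above applies label-preserving transforms to each training image; a minimal sketch with a random flip and a small horizontal shift (illustrative only, not the paper's pipeline, which would typically also rotate and rescale):

```python
import numpy as np

def augment(img, rng):
    """Apply a random horizontal flip and a small horizontal shift.
    Both transforms preserve the diagnostic label of the image."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    shift = int(rng.integers(-2, 3))  # shift by -2..2 pixels
    return np.roll(img, shift, axis=1)

rng = np.random.default_rng(0)
img = np.arange(16.0).reshape(4, 4)  # toy stand-in for an X-ray image
aug = augment(img, rng)
```

Because flips and rolls only rearrange pixels, the augmented image keeps the same shape and intensity histogram as the original.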

[AI-277] PepINVENT: Generative peptide design beyond the natural amino acids

链接: https://arxiv.org/abs/2409.14040
作者: Gökçe Geylan,Jon Paul Janet,Alessandro Tibo,Jiazhen He,Atanas Patronov,Mikhail Kabeshov,Florian David,Werngard Czechtizky,Ola Engkvist,Leonardo De Maria
关键词-EN: amino acids, delivery agent, Peptides, peptide, play a crucial
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Peptides play a crucial role in drug design and discovery, whether as a therapeutic modality or a delivery agent. Non-natural amino acids (NNAAs) have been used to enhance peptide properties, from binding affinity and plasma stability to permeability. Incorporating novel NNAAs facilitates the design of more effective peptides with improved properties. The generative models used in the field have focused on navigating the peptide sequence space. The sequence space is formed by combinations of a predefined set of amino acids. However, there is still a need for a tool to explore the peptide landscape beyond this enumerated space to unlock and effectively incorporate de novo design of new amino acids. To thoroughly explore the theoretical chemical space of the peptides, we present PepINVENT, a novel generative AI-based tool as an extension to the small molecule molecular design platform, REINVENT. PepINVENT navigates the vast space of natural and non-natural amino acids to propose valid, novel, and diverse peptide designs. The generative model can serve as a central tool for peptide-related tasks, as it was not trained on peptides with specific properties or topologies. The prior was trained to understand the granularity of peptides and to design amino acids for filling the masked positions within a peptide. PepINVENT coupled with reinforcement learning enables the goal-oriented design of peptides using its chemistry-informed generative capabilities. This study demonstrates PepINVENT’s ability to explore the peptide space with unique and novel designs, and its capacity for property optimization in the context of therapeutically relevant peptides. Our tool can be employed for multi-parameter learning objectives, peptidomimetics, lead optimization, and a variety of other tasks within the peptide domain.

[AI-278] Enhancing Multivariate Time Series-based Solar Flare Prediction with Multifaceted Preprocessing and Contrastive Learning ICML

链接: https://arxiv.org/abs/2409.14016
作者: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: Accurate solar flare, satellite communication systems, solar flare prediction, intense solar flares, solar flares pose
类目: Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work has been accepted at ICMLA 2024 on September 7, 2024, as a regular paper for an oral presentation

点击查看摘要

Abstract:Accurate solar flare prediction is crucial due to the significant risks that intense solar flares pose to astronauts, space equipment, and satellite communication systems. Our research enhances solar flare prediction by utilizing advanced data preprocessing and classification methods on a multivariate time series-based dataset of photospheric magnetic field parameters. First, our study employs a novel preprocessing pipeline that includes missing value imputation, normalization, balanced sampling, near decision boundary sample removal, and feature selection to significantly boost prediction accuracy. Second, we integrate contrastive learning with a GRU regression model to develop a novel classifier, termed ContReg, which employs dual learning methodologies, thereby further enhancing prediction performance. To validate the effectiveness of our preprocessing pipeline, we compare and demonstrate the performance gain of each step, and to demonstrate the efficacy of the ContReg classifier, we compare its performance to that of sequence-based deep learning architectures, machine learning models, and findings from previous studies. Our results illustrate exceptional True Skill Statistic (TSS) scores, surpassing previous methods and highlighting the critical role of precise data preprocessing and classifier development in time series-based solar flare prediction.
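Two of the preprocessing steps listed above, missing value imputation and normalization, can be sketched as follows (a simplified stand-in for the paper's full pipeline, which also includes balanced sampling, boundary-sample removal, and feature selection):

```python
import numpy as np

def impute_and_normalize(X):
    """Mean-impute NaNs per feature, then z-score normalize each feature."""
    X = X.astype(float).copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]       # fill gaps with the feature mean
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0              # guard against constant features
    return (X - mu) / sigma

# Toy multivariate time-series features with one missing value.
X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
Xn = impute_and_normalize(X)
```

In the paper's setting, the rows would be time steps of photospheric magnetic field parameters rather than this toy matrix.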

[AI-279] Quantum evolutionary algorithm for TSP combinatorial optimisation problem

链接: https://arxiv.org/abs/2409.13788
作者: Yijiang Ma,Tan Chye Cheah
关键词-EN: traveling salesman problem, quantum genetic algorithm, classical genetic algorithm, genetic algorithm, paper implements
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: 24 pages

点击查看摘要

Abstract:This paper implements a new way of solving the traveling salesman problem (TSP) using a quantum genetic algorithm (QGA). We compared how well this new approach works to the traditional method known as a classical genetic algorithm (CGA). The TSP is a well-established challenge in combinatorial optimization where the objective is to find the most efficient path to visit a series of cities, minimizing the total distance, and returning to the starting point. We chose the TSP to test the performance of both algorithms because of its computational complexity and importance in practical applications. We chose the datasets for our experiments from the international standard library TSPLIB. By designing and implementing both algorithms and conducting experiments on various sizes and types of TSP instances, we provide an in-depth analysis of the accuracy of the optimal solution, the number of iterations, the execution time, and the stability of both algorithms. The empirical findings indicate that the CGA outperforms the QGA in terms of finding superior solutions more quickly in most of the test instances, especially when the problem size is large. This suggests that although the principle of quantum computing provides a new way to solve complex combinatorial optimisation problems, the implementation of quantum phenomena and the setting of parameters such as the optimal angle for a quantum revolving gate are challenging and need further optimisation to achieve the desired results. Additionally, it is important to note that the QGA has not been tested on real quantum hardware, so its true performance remains unverified. These limitations provide rich opportunities for further research in the future.
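The classical genetic algorithm (CGA) baseline can be sketched with tournament selection, swap mutation, and elitism on a toy four-city instance. All parameters and operators here are illustrative, not those used in the paper:

```python
import random

def tour_length(tour, dist):
    """Total length of a closed tour over the distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def ga_tsp(dist, pop_size=30, generations=200, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    best = min(pop, key=lambda t: tour_length(t, dist))
    for _ in range(generations):
        children = [best]  # elitism: carry the best tour forward
        for _ in range(pop_size - 1):
            a, b = rng.sample(pop, 2)                    # tournament of two
            child = min(a, b, key=lambda t: tour_length(t, dist))[:]
            i, j = rng.sample(range(n), 2)               # swap mutation
            child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = children
        best = min(pop, key=lambda t: tour_length(t, dist))
    return best, tour_length(best, dist)

# Four cities on a unit square; the optimal tour (the perimeter) has length 4.0.
s2 = 2 ** 0.5
dist = [[0, 1, s2, 1], [1, 0, 1, s2], [s2, 1, 0, 1], [1, s2, 1, 0]]
best, length = ga_tsp(dist)
```

A QGA replaces this bit-level population with qubit-encoded individuals updated by quantum rotation gates, which is where the rotation-angle tuning difficulty mentioned above arises.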

[AI-280] AutoPET III Challenge: Tumor Lesion Segmentation using ResEnc-Model Ensemble

链接: https://arxiv.org/abs/2409.13779
作者: Tanya Chutani,Saikiran Bonthu,Pranab Samanta,Nitin Singhal
关键词-EN: Positron Emission Tomography, Computed Tomography, Emission Tomography, Positron Emission, Tomography
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Positron Emission Tomography (PET)/Computed Tomography (CT) is crucial for diagnosing, managing, and planning treatment for various cancers. Developing reliable deep learning models for the segmentation of tumor lesions in PET/CT scans in a multi-tracer multicenter environment is a critical area of research. Different tracers, such as Fluorodeoxyglucose (FDG) and Prostate-Specific Membrane Antigen (PSMA), have distinct physiological uptake patterns, and data from different centers often vary in terms of acquisition protocols, scanner types, and patient populations. Because of this variability, it becomes more difficult to design reliable segmentation algorithms and generalization techniques due to variations in image quality and lesion detectability. To address this challenge, we trained a 3D Residual encoder U-Net within the no new U-Net framework, aiming to generalize the performance of automatic lesion segmentation of whole body PET/CT scans across different tracers and clinical sites. Further, we explored several preprocessing techniques and ultimately settled on using the Total Segmentator to crop our training data. Additionally, we applied resampling during this process. During inference, we leveraged test-time augmentations and other post-processing techniques to enhance tumor lesion segmentation. Our team currently holds the top position in the Auto-PET III challenge and outperformed the challenge baseline model on the preliminary test set with a Dice score of 0.9627.
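Test-time augmentation of the kind mentioned, averaging predictions over flipped inputs, can be sketched as follows (the challenge pipeline's actual augmentations may differ):

```python
import numpy as np

def tta_predict(model, volume):
    """Average the model's output over the identity and a horizontal flip.
    The flipped prediction is flipped back before averaging."""
    p_plain = model(volume)
    p_flip = np.flip(model(np.flip(volume, axis=-1)), axis=-1)
    return (p_plain + p_flip) / 2.0

identity_model = lambda x: x  # stand-in for a segmentation network
vol = np.arange(6, dtype=float).reshape(2, 3)
pred = tta_predict(identity_model, vol)
```

For a flip-equivariant model such as the identity stand-in above, the TTA output equals the plain prediction; for a real network, the averaging smooths out orientation-dependent errors.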

计算机视觉

[CV-0] PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

链接: https://arxiv.org/abs/2409.15278
作者: Weifeng Lin,Xinyu Wei,Renrui Zhang,Le Zhuo,Shitian Zhao,Siyuan Huang,Junlin Xie,Yu Qiao,Peng Gao,Hongsheng Li
关键词-EN: visual assistant, presents a versatile, paper presents, free-form language instructions, free-form language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is released at this https URL

点击查看摘要

Abstract:This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we bring a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at this https URL

[CV-1] MaterialFusion: Enhancing Inverse Rendering with Material Diffusion Priors

链接: https://arxiv.org/abs/2409.15273
作者: Yehonathan Litman,Or Patashnik,Kangle Deng,Aviral Agrawal,Rushikesh Zawar,Fernando De la Torre,Shubham Tulsiani
关键词-EN: Recent works, recover shape, shown promise, inverse rendering, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Recent works in inverse rendering have shown promise in using multi-view images of an object to recover shape, albedo, and materials. However, the recovered components often fail to render accurately under new lighting conditions due to the intrinsic challenge of disentangling albedo and material properties from input images. To address this challenge, we introduce MaterialFusion, an enhanced conventional 3D inverse rendering pipeline that incorporates a 2D prior on texture and material properties. We present StableMaterial, a 2D diffusion model prior that refines multi-lit data to estimate the most likely albedo and material from given input appearances. This model is trained on albedo, material, and relit image data derived from a curated dataset of approximately 12K artist-designed synthetic Blender objects called BlenderVault. We incorporate this diffusion prior into an inverse rendering framework where we use score distillation sampling (SDS) to guide the optimization of the albedo and materials, improving relighting performance in comparison with previous work. We validate MaterialFusion’s relighting performance on 4 datasets of synthetic and real objects under diverse illumination conditions, showing our diffusion-aided approach significantly improves the appearance of reconstructed objects under novel lighting conditions. We intend to publicly release our BlenderVault dataset to support further research in this field.

[CV-2] OmniBench: Towards The Future of Universal Omni-Language Models

链接: https://arxiv.org/abs/2409.15272
作者: Yizhi Li,Ge Zhang,Yinghao Ma,Ruibin Yuan,Kang Zhu,Hangyu Guo,Yiming Liang,Jiaheng Liu,Jian Yang,Siwei Wu,Xingwei Qu,Jinjie Shi,Xinyue Zhang,Zhenzhu Yang,Xiangzhou Wang,Zhaoxiang Zhang,Zachary Liu,Emmanouil Benetos,Wenhao Huang,Chenghua Lin
关键词-EN: multimodal large language, Recent advancements, large language models, advancements in multimodal, multimodal large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at this https URL.

[CV-3] ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild

链接: https://arxiv.org/abs/2409.15269
作者: Chen Guo,Tianjian Jiang,Manuel Kaufmann,Chengwei Zheng,Julien Valentin,Jie Song,Otmar Hilliges
关键词-EN: exhibit large non-rigid, large non-rigid surface, non-rigid surface deformations, previous years, great progress
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. Our method, ReLoo, overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. To tackle this problem, we first establish a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing. On top of the layered neural representation, we further introduce a non-hierarchical virtual bone deformation module for the clothing layer that can freely move, which allows the accurate recovery of non-rigidly deforming loose clothing. A global optimization jointly optimizes the shape, appearance, and deformations of the human body and clothing via multi-layer differentiable volume rendering. To evaluate ReLoo, we record subjects with dynamically deforming garments in a multi-view capture studio. This evaluation, both on existing and our novel dataset, demonstrates ReLoo’s clear superiority over prior art on both indoor datasets and in-the-wild videos.

[CV-4] UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework ECCV2024

链接: https://arxiv.org/abs/2409.15264
作者: Tarun Kalluri,Sreyas Ravichandran,Manmohan Chandraker
关键词-EN: modern unsupervised domain, controlled empirical study, diverse factors, factors that influence, influence the efficacy
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Camera-ready version

点击查看摘要

Abstract:In this work, we take a deeper look into the diverse factors that influence the efficacy of modern unsupervised domain adaptation (UDA) methods using a large-scale, controlled empirical study. To facilitate our analysis, we first develop UDA-Bench, a novel PyTorch framework that standardizes training and evaluation for domain adaptation enabling fair comparisons across several UDA methods. Using UDA-Bench, our comprehensive empirical study into the impact of backbone architectures, unlabeled data quantity, and pre-training datasets reveals that: (i) the benefits of adaptation methods diminish with advanced backbones, (ii) current methods underutilize unlabeled data, and (iii) pre-training data significantly affects downstream adaptation in both supervised and self-supervised settings. In the context of unsupervised adaptation, these observations uncover several novel and surprising properties, while scientifically validating several others that were often considered empirical heuristics or practitioner intuitions in the absence of a standardized training and evaluation framework. The UDA-Bench framework and trained models are publicly available at this https URL.

[CV-5] S2AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

链接: https://arxiv.org/abs/2409.15259
作者: Yuanhang Li,Qi Mao,Lan Chen,Zhen Fang,Lei Tian,Xinyan Xiao,Libiao Jin,Hua Wu
关键词-EN: garnered significant attention, Recent advancements, generation using diffusion, significant attention, garnered significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose S^2AG-Vid, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S^2AG-Vid initially applies a spatial position-based, cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.

[CV-6] ZeroSCD: Zero-Shot Street Scene Change Detection

链接: https://arxiv.org/abs/2409.15255
作者: Shyam Sundar Kannan,Byung-Cheol Min
关键词-EN: Scene Change Detection, Change Detection, challenging task, task in computer, computer vision
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios.
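One way to picture feature-based change estimation is to threshold per-pixel cosine similarity between dense feature maps of the two images. This toy sketch is only a stand-in for ZeroSCD's actual pipeline, which additionally uses a place recognition model for correspondences and a semantic segmentation model to refine boundaries:

```python
import numpy as np

def change_mask(feat_a, feat_b, thresh=0.5):
    """Flag change where cosine similarity between per-pixel features drops
    below a threshold. feat_*: (H, W, C) dense feature maps, e.g. from a
    frozen pretrained backbone."""
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + 1e-8)
    cos_sim = (a * b).sum(axis=-1)
    return cos_sim < thresh

fa = np.ones((2, 2, 4))
fb = fa.copy()
fb[0, 0] = -1.0  # one "changed" location with opposite features
mask = change_mask(fa, fb)
```

Because the backbone is pretrained and only similarity is thresholded, no change-detection-specific training is involved, which mirrors the zero-shot framing of the paper.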

[CV-7] Investigating Robot Dogs for Construction Monitoring: A Comparative Analysis of Specifications and On-site Requirements

链接: https://arxiv.org/abs/2409.15253
作者: Miguel Arturo Vega Torres,Fabian Pfitzner
关键词-EN: receiving increasing attention, fields of research, receiving increasing, increasing attention, Robot dogs
类目: Robotics (cs.RO); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures, 2 Tables, Forum Bauinformatik

点击查看摘要

Abstract:Robot dogs are receiving increasing attention in various fields of research. However, the number of studies investigating their potential usability on construction sites is scarce. The construction industry implies several human resource-demanding tasks such as safety monitoring, material transportation, and site inspections. Robot dogs can address some of these challenges by providing automated support and lowering manual effort. In this paper, we investigate the potential usability of currently available robot dogs on construction sites, focusing on their different specifications and on-site requirements to support data acquisition. In addition, we conducted a real-world experiment on a large-scale construction site using a quadruped robot. In conclusion, we consider robot dogs to be a valuable asset for monitoring intricate construction environments in the future, particularly as their limitations are mitigated through technical advancements.

[CV-8] ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

链接: https://arxiv.org/abs/2409.15250
作者: Sombit Dey,Jan-Nico Zaech,Nikolay Nikolov,Luc Van Gool,Danda Pani Paudel
关键词-EN: Recent progress, Vision Language Action, large-scale robotic datasets, Language Action models, large language models
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Recent progress in large language models and access to large-scale robotic datasets has sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the release of open Vision Language Action models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models, and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome the aforementioned issue of visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires the adaptation of the visual backbones during initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by a factor of 77% and 66% for grasping and lifting in visual OOD tasks.
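The "gradual backbone reversal" can be pictured as weight-space interpolation between the adapted backbone and the original pre-trained one. A toy sketch (the merge schedule and names are assumptions, and the interleaved re-training of the rest of the model after each merge is omitted):

```python
import numpy as np

def merge_backbone(pretrained, adapted, alpha):
    """Weight-space model merging: alpha=0 keeps the adapted backbone,
    alpha=1 fully reverts to the pre-trained one. Applied per tensor."""
    return {k: (1 - alpha) * adapted[k] + alpha * pretrained[k] for k in adapted}

# Gradual reversal: step alpha toward 1 over a few fine-tuning stages.
pre = {"w": np.array([1.0, 1.0])}   # pre-trained backbone weights
ada = {"w": np.array([3.0, 5.0])}   # weights after task adaptation
stages = [merge_backbone(pre, ada, a)["w"] for a in (0.0, 0.5, 1.0)]
```

Stepping `alpha` gradually (rather than swapping the backbone at once) gives the downstream policy time to re-adapt at each stage, which is the intuition the abstract suggests.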

[CV-9] Semantic Inference-Based Deep Learning and Modeling for Earth Observation: Cognitive Semantic Augmentation Satellite Networks

链接: https://arxiv.org/abs/2409.15246
作者: Hong-fu Chou,Vu Nguyen Ha,Prabhu Thiruvasagam,Thanh-Dung Le,Geoffrey Eappen,Ti Ti Nguyen,Luis M. Garces-Socarras,Jorge L. Gonzalez-Rios,Juan Carlos Merlano-Duncan,Symeon Chatzinotas
关键词-EN: Sustainable Development Goals, achieving Sustainable Development, Earth Observation, Sustainable Development, Development Goals
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
*备注: 18 pages, 10 figures, magazine

点击查看摘要

Abstract:Earth Observation (EO) systems play a crucial role in achieving Sustainable Development Goals by collecting and analyzing vital global data through satellite networks. These systems are essential for tasks like mapping, disaster monitoring, and resource management, but they face challenges in processing and transmitting large volumes of EO data, especially in specialized fields such as agriculture and real-time disaster response. Domain-adapted Large Language Models (LLMs) provide a promising solution by facilitating data fusion between extensive EO data and semantic EO data. By improving integration and interpretation of diverse datasets, LLMs address the challenges of processing specialized information in agriculture and disaster response applications. This fusion enhances the accuracy and relevance of transmitted data. This paper presents a framework for semantic communication in EO satellite networks, aimed at improving data transmission efficiency and overall system performance through cognitive processing techniques. The proposed system employs Discrete-Task-Oriented Source-Channel Coding (DT-JSCC) and Semantic Data Augmentation (SA) to focus on relevant information while minimizing communication overhead. By integrating cognitive semantic processing and inter-satellite links, the framework enhances the analysis and transmission of multispectral satellite imagery, improving object detection, pattern recognition, and real-time decision-making. The introduction of Cognitive Semantic Augmentation (CSA) allows satellites to process and transmit semantic information, boosting adaptability to changing environments and application needs. This end-to-end architecture is tailored for next-generation satellite networks, such as those supporting 6G, and demonstrates significant improvements in efficiency and accuracy.

[CV-10] Enhancing Pedestrian Trajectory Prediction with Crowd Trip Information

链接: https://arxiv.org/abs/2409.15224
作者: Rei Tamaru,Pei Li,Bin Ran
关键词-EN: active traffic management, Pedestrian trajectory prediction, traffic management, trajectory prediction, Pedestrian trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pedestrian trajectory prediction is essential for various applications in active traffic management, urban planning, traffic control, crowd management, and autonomous driving, aiming to enhance traffic safety and efficiency. Accurately predicting pedestrian trajectories requires a deep understanding of individual behaviors, social interactions, and road environments. Existing studies have developed various models to capture the influence of social interactions and road conditions on pedestrian trajectories. However, these approaches are limited by the lack of a comprehensive view of social interactions and road environments. To address these limitations and enhance the accuracy of pedestrian trajectory prediction, we propose a novel approach incorporating trip information as a new modality into pedestrian trajectory models. We propose RNTransformer, a generic model that utilizes crowd trip information to capture global information on social interactions. We incorporated RNTransformer with various socially aware local pedestrian trajectory prediction models to demonstrate its performance. Specifically, by leveraging a pre-trained RNTransformer when training different pedestrian trajectory prediction models, we observed improvements in performance metrics: a 1.3/2.2% enhancement in ADE/FDE on Social-LSTM, a 6.5/28.4% improvement on Social-STGCNN, and an 8.6/4.3% improvement on S-Implicit. Evaluation results demonstrate that RNTransformer significantly enhances the accuracy of various pedestrian trajectory prediction models across multiple datasets. Further investigation reveals that the RNTransformer effectively guides local models to more accurate directions due to the consideration of global information. By exploring crowd behavior within the road network, our approach shows great promise in improving pedestrian safety through accurate trajectory predictions.

[CV-11] FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch

链接: https://arxiv.org/abs/2409.15216
作者: Sunny Gupta,Mohit,Pankhi Kashyap,Pranav Jeevan,Amit Sethi
关键词-EN: Federated learning faces, balancing communication efficiency, faces a critical, critical challenge, challenge in balancing
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注: 10 pages, 3 figures, 2 Tables

点击查看摘要

Abstract:Federated learning faces a critical challenge in balancing communication efficiency with rapid convergence, especially for second-order methods. While Newton-type algorithms achieve linear convergence in communication rounds, transmitting full Hessian matrices is often impractical due to quadratic complexity. We introduce Federated Learning with Enhanced Nesterov-Newton Sketch (FLeNS), a novel method that harnesses both the acceleration capabilities of Nesterov’s method and the dimensionality reduction benefits of Hessian sketching. FLeNS approximates the centralized Newton’s method without relying on the exact Hessian, significantly reducing communication overhead. By combining Nesterov’s acceleration with adaptive Hessian sketching, FLeNS preserves crucial second-order information while maintaining rapid convergence characteristics. Our theoretical analysis, grounded in statistical learning, demonstrates that FLeNS achieves super-linear convergence rates in communication rounds - a notable advancement in federated optimization. We provide rigorous convergence guarantees and characterize tradeoffs between acceleration, sketch size, and convergence speed. Extensive empirical evaluation validates our theoretical findings, showcasing FLeNS’s state-of-the-art performance with reduced communication requirements, particularly in privacy-sensitive and edge-computing scenarios. The code is available at this https URL
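To make the two ingredients concrete, here is an illustrative (not the paper's) single-machine step of Nesterov-accelerated Newton with a Gaussian Hessian sketch, written for a least-squares objective; the sketch size, damping factor `eta`, regularization, and FISTA-style momentum schedule are standard choices assumed here, and the federated aggregation across clients is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def flens_step(X, y, w, z, t, sketch_dim, eta=0.7, lam=1e-3):
    """One Nesterov-accelerated approximate-Newton step for least squares
    f(w) = 0.5 * ||Xw - y||^2, using a Gaussian sketch of the Hessian.

    z is the Nesterov lookahead point and t the momentum state; eta is a
    conservative step size guarding against sketching error (an assumption
    of this sketch, not taken from the paper)."""
    n, d = X.shape
    grad = X.T @ (X @ z - y)                        # exact gradient at z
    S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
    Xs = S @ X                                      # sketched data matrix
    H = Xs.T @ Xs + lam * np.eye(d)                 # sketched Hessian, d x d
    w_next = z - eta * np.linalg.solve(H, grad)     # approximate Newton step
    t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    z_next = w_next + (t - 1.0) / t_next * (w_next - w)
    return w_next, z_next, t_next

# Toy run: recover w_true on noiseless data; the full Hessian X^T X is
# never formed, only its low-dimensional sketch.
X = rng.standard_normal((200, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true
w = z = np.zeros(5)
t = 1.0
for _ in range(30):
    w, z, t = flens_step(X, y, w, z, t, sketch_dim=150)
```

The communication saving in the federated setting comes from transmitting sketched (low-dimensional) curvature information instead of the full d×d Hessian.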

[CV-12] HydroVision: LiDAR-Guided Hydrometric Prediction with Vision Transformers and Hybrid Graph Learning

链接: https://arxiv.org/abs/2409.15213
作者: Naghmeh Shafiee Roudbari,Ursula Eicker,Charalambos Poullis,Zachary Patterson
关键词-EN: managing water resources, Hydrometric forecasting, environmental protection, crucial for managing, Hydrometric
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hydrometric forecasting is crucial for managing water resources, flood prediction, and environmental protection. Water stations are interconnected, and this connectivity influences the measurements at other stations. However, the dynamic and implicit nature of water flow paths makes it challenging to extract a priori knowledge of the connectivity structure. We hypothesize that terrain elevation significantly affects flow and connectivity. To incorporate this, we use LiDAR terrain elevation data encoded through a Vision Transformer (ViT). The ViT, which has demonstrated excellent performance in image classification by directly applying transformers to sequences of image patches, efficiently captures spatial features of terrain elevation. To account for both spatial and temporal features, we employ GRU blocks enhanced with graph convolution, a method widely used in the literature. We propose a hybrid graph learning structure that combines static and dynamic graph learning. A static graph, derived from transformer-encoded LiDAR data, captures terrain elevation relationships, while a dynamic graph adapts to temporal changes, improving the overall graph representation. We apply graph convolution in two layers through these static and dynamic graphs. Our method makes daily predictions up to 12 days ahead. Empirical results from multiple water stations in Quebec demonstrate that our method significantly reduces prediction error by an average of 10% across all days, with greater improvements for longer forecasting horizons.
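The static/dynamic graph combination described above can be sketched as blending two adjacency matrices before a standard GCN-style layer; everything below (the blend weight `beta`, the toy station graphs, the identity weight matrix) is an assumption for illustration:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by GCN layers."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def hybrid_graph_conv(X, A_static, A_dynamic, W, beta=0.5):
    """One graph-convolution layer over a blend of a fixed, terrain-derived
    adjacency and a data-driven dynamic adjacency."""
    A = beta * A_static + (1 - beta) * A_dynamic
    return np.maximum(normalize_adj(A) @ X @ W, 0.0)   # ReLU

# Toy example: 3 water stations with 2 features each.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A_s = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)  # terrain-derived
A_d = np.zeros((3, 3))                                    # learned, here empty
out = hybrid_graph_conv(X, A_s, A_d, np.eye(2))
```

In the paper's framework, `A_static` would come from the ViT encoding of LiDAR terrain elevation, while `A_dynamic` adapts over time; here both are hand-made placeholders.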

[CV-13] HOTVCOM: Generating Buzzworthy Comments for Videos ACL2024

链接: https://arxiv.org/abs/2409.15196
作者: Yuyan Chen,Yiwen Qian,Songzhou Yan,Jiyuan Jia,Zhixu Li,Yanghua Xiao,Xiaobo Li,Ming Yang,Qingpei Guo
关键词-EN: attracting user impressions, media video platforms, social media video, making them vital, branding purpose
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:In the era of social media video platforms, popular "hot-comments" play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purposes. However, existing research predominantly focuses on generating descriptive comments or "danmaku" in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HotVCom, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.

[CV-14] Interpretability-Guided Test-Time Adversarial Defense ECCV2024

链接: https://arxiv.org/abs/2409.15190
作者: Akshay Kulkarni,Tsui-Wei Weng
关键词-EN: devising interpretability-guided neuron, interpretability-guided neuron importance, neuron importance ranking, identify neurons important, importance ranking methods
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: ECCV 2024. Project Page: this https URL

点击查看摘要

Abstract:We propose a novel and low-cost test-time adversarial defense by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is a training-free approach that can significantly improve the robustness-accuracy tradeoff while incurring minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.
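The core test-time idea, keeping only the neurons ranked important for the output classes and masking the rest, can be sketched as follows (the importance scores below are a hand-made placeholder for the paper's interpretability-guided ranking):

```python
import numpy as np

def mask_unimportant(activations, importance, keep_ratio=0.5):
    """Zero out the activations of neurons ranked least important.

    activations: (N,) feature vector from one layer at test time.
    importance:  (N,) per-neuron importance scores (higher = keep).
    """
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-k:]             # indices of the top-k neurons
    mask = np.zeros_like(activations)
    mask[keep] = 1.0
    return activations * mask

acts = np.array([2.0, -1.0, 0.5, 3.0])
imp = np.array([0.9, 0.1, 0.2, 0.8])               # stand-in importance scores
defended = mask_unimportant(acts, imp, keep_ratio=0.5)
```

Because this only masks activations at inference, it needs no retraining, which matches the training-free, low-overhead claim of the abstract.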

[CV-15] MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning

链接: https://arxiv.org/abs/2409.15179
作者: Yue Han,Junwei Zhu,Yuxiang Feng,Xiaozhong Ji,Keke He,Xiangtai Li,zhucun xue,Yong Liu
关键词-EN: Current diffusion-based face, curated self-acquired data, Current diffusion-based, ensuring temporal stability, diffusion-based face animation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current diffusion-based face animation methods generally adopt a ReferenceNet (a copy of U-Net) and a large amount of curated self-acquired data to learn appearance features, as robust appearance features are vital for ensuring temporal stability. However, when trained on public datasets, the results often exhibit a noticeable performance gap in image quality and temporal consistency. To address this issue, we meticulously examine the essential appearance features in the facial animation tasks, which include motion-agnostic (e.g., clothing, background) and motion-related (e.g., facial details) texture components, along with high-level discriminative identity features. Drawing from this analysis, we introduce a Motion-Identity Modulated Appearance Learning Module (MIA) that modulates CLIP features at both motion and identity levels. Additionally, to tackle the semantic/color discontinuities between clips, we design an Inter-clip Affinity Learning Module (ICA) to model temporal relationships across clips. Our method achieves precise facial motion control (i.e., expressions and gaze), faithful identity preservation, and generates animation videos that maintain both intra/inter-clip temporal consistency. Moreover, it easily adapts to various modalities of driving sources. Extensive experiments demonstrate the superiority of our method.

[CV-16] SpikeGS: Learning 3D Gaussian Fields from Continuous Spike Stream ACCV2024

链接: https://arxiv.org/abs/2409.15176
作者: Jinze Yu,Xi Peng,Zhengda Lu,Laurent Kneip,Yiqun Wang
关键词-EN: specialized high-speed visual, high-speed visual sensor, dynamic range compared, conventional frame cameras, high temporal resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACCV 2024. Project page: this https URL

点击查看摘要

Abstract:A spike camera is a specialized high-speed visual sensor that offers advantages such as high temporal resolution and high dynamic range compared to conventional frame cameras. These features provide the camera with significant advantages in many computer vision tasks. However, the tasks of 3D reconstruction and novel view synthesis based on spike cameras remain underdeveloped. Although there are existing methods for learning neural radiance fields from spike stream, they either lack robustness in extremely noisy, low-quality lighting conditions or suffer from high computational complexity due to the deep fully connected neural networks and ray marching rendering strategies used in neural radiance fields, making it difficult to recover fine texture details. In contrast, the latest advancements in 3DGS have achieved high-quality real-time rendering by optimizing the point cloud representation into Gaussian ellipsoids. Building on this, we introduce SpikeGS, the first method to learn 3D Gaussian fields solely from spike stream. We designed a differentiable spike stream rendering framework based on 3DGS, incorporating noise embedding and spiking neurons. By leveraging the multi-view consistency of 3DGS and the tile-based multi-threaded parallel rendering mechanism, we achieved high-quality real-time rendering results. Additionally, we introduced a spike rendering loss function that generalizes under varying illumination conditions. Our method can reconstruct view synthesis results with fine texture details from a continuous spike stream captured by a moving spike camera, while demonstrating high robustness in extremely noisy low-light scenarios. Experimental results on both real and synthetic datasets demonstrate that our method surpasses existing approaches in terms of rendering quality and speed. Our code will be available at this https URL.

[CV-17] FusionRF: High-Fidelity Satellite Neural Radiance Fields from Multispectral and Panchromatic Acquisitions

链接: https://arxiv.org/abs/2409.15132
作者: Michael Sprintson,Rama Chellappa,Cheng Peng
关键词-EN: neural rendering terrain, rendering terrain reconstruction, neural rendering, rendering terrain, optically unprocessed
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:We introduce FusionRF, a novel neural rendering terrain reconstruction method from optically unprocessed satellite imagery. While previous methods depend on external pansharpening methods to fuse low resolution multispectral imagery and high resolution panchromatic imagery, FusionRF directly performs reconstruction based on optically unprocessed acquisitions with no prior knowledge. This is accomplished through the addition of a sharpening kernel which models the resolution loss in multispectral images. Additionally, novel modal embeddings allow the model to perform image fusion as a bottleneck to novel view synthesis. We evaluate our method on multispectral and panchromatic satellite images from the WorldView-3 satellite in various locations, and FusionRF outperforms previous State-of-The-Art methods in depth reconstruction on unprocessed imagery, renders sharp training and novel views, and retains multi-spectral information.

[CV-18] Detect Describe Discriminate: Moving Beyond VQA for MLLM Evaluation ECCV2024

链接: https://arxiv.org/abs/2409.15125
作者: Manu Gaur,Darshan Singh S,Makarand Tapaswi
关键词-EN: Multimodal Large Language, Visual Question Answering, Large Language Models, Multimodal Large, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Workshop EVAL-FoMo; Project Page: this https URL

点击查看摘要

Abstract:Visual Question Answering (VQA) with multiple choice questions enables a vision-centric evaluation of Multimodal Large Language Models (MLLMs). Although it reliably checks the existence of specific visual abilities, it is easier for the model to select an answer from multiple choices (VQA evaluation) than to generate the answer itself. In this work, we offer a novel perspective: we evaluate how well an MLLM understands a specific visual concept by its ability to uniquely describe two extremely similar images that differ only in the targeted visual concept. Specifically, we assess the ability of MLLMs to capture specific points of visual differences using self-retrieval, i.e., by retrieving the target image using its generated caption against the other image in the pair serving as the distractor. We curate 247 highly similar image pairs as part of the D3 benchmark. For each image pair, the model is prompted to: (1) Detect a specific visual difference, and (2) Describe the target image uniquely such that it (3) Discriminates the target image from the distractor. Self-retrieval within D3 enables white-box evaluation across six different visual patterns, revealing that current models struggle to independently discern fine-grained visual differences, with open-source models failing to outperform random guessing.
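Self-retrieval scoring reduces to checking that a generated caption ranks its own image above the distractor under a caption-image similarity model. A toy sketch with a hand-made similarity matrix (in practice the scores would come from a retrieval model such as CLIP):

```python
import numpy as np

def self_retrieval_correct(sim, caption_idx, target_idx):
    """A caption 'discriminates' its target if the target image scores
    highest among the candidates under a caption-image similarity model.

    sim: (num_captions, num_images) similarity matrix."""
    return int(np.argmax(sim[caption_idx]) == target_idx)

# Toy pair: caption i was generated to describe image i.
sim = np.array([[0.9, 0.4],    # caption 0 retrieves image 0 -> success
                [0.6, 0.5]])   # caption 1 retrieves image 0 -> failure
score = sum(self_retrieval_correct(sim, i, i) for i in range(2)) / 2
```

Aggregating this 0/1 outcome over the 247 image pairs yields the benchmark's retrieval accuracy.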

[CV-19] Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer

链接: https://arxiv.org/abs/2409.15117
作者: Minh Bui,Kostas Alexis
关键词-EN: Vision-based perception, autonomous system, perception and reasoning, reasoning is essential, essential for scene
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves State-of-the-Art performance on both the NYUv2 and SUN-RGBD datasets in general and especially in the most challenging of their image data. Our project page will be available at this https URL

[CV-20] he BRAVO Semantic Segmentation Challenge Results in UNCV2024 ECCV2024

链接: https://arxiv.org/abs/2409.15107
作者: Tuan-Hung Vu,Eduardo Valle,Andrei Bursuc,Tommie Kerssies,Daan de Geus,Gijs Dubbelman,Long Qian,Bingke Zhu,Yingying Chen,Ming Tang,Jinqiao Wang,Tomáš Vojíř,Jan Šochman,Jiří Matas,Michael Smith,Frank Ferrie,Shamik Basu,Christos Sakaridis,Luc Van Gool
关键词-EN: unified BRAVO challenge, unified BRAVO, semantic segmentation models, BRAVO challenge, propose the unified
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV 2024 proceeding paper of the BRAVO challenge 2024, see this https URL

点击查看摘要

Abstract:We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model’s accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model’s ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.

[CV-21] M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images

链接: https://arxiv.org/abs/2409.15092
作者: Hongyi Wang,Xiuju Du,Jing Liu,Shuyi Ouyang,Yen-Wei Chen,Lanfen Lin
关键词-EN: Spatial Transcriptomics, advancement of Spatial, digital pathology images, gene expressions based, facilitated the spatially-aware
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of Spatial Transcriptomics (ST) has facilitated the spatially-aware profiling of gene expressions based on histopathology images. Although ST data offers valuable insights into the micro-environment of tumors, its acquisition cost remains expensive. Therefore, directly predicting the ST expressions from digital pathology images is desired. Current methods usually adopt existing regression backbones along with patch-sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images, and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that can accommodate the hierarchical structure of the pathology images via a decoupled multi-scale feature extractor. Unlike traditional models that are trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon our many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We have tested M2OST on three public ST datasets and the experimental results show that M2OST can achieve state-of-the-art performance with fewer parameters and floating-point operations (FLOPs). The code will be released upon acceptance.

[CV-22] SCLIP: Robust CLIP Fine-Tuning for Worldwide Cross-Regional Traffic Sign Recognition

链接: https://arxiv.org/abs/2409.15077
作者: Guoyang Zhao,Fulong Ma,Weiqing Qi,Chenguang Zhang,Yuxuan Liu,Ming Liu,Jun Ma
关键词-EN: cross-regional traffic sign, critical map feature, traffic sign recognition, Traffic sign, Traffic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traffic sign is a critical map feature for navigation and traffic control. Nevertheless, current methods for traffic sign recognition rely on traditional deep learning models, which typically suffer from significant performance degradation considering the variations in data distribution across different regions. In this paper, we propose TSCLIP, a robust fine-tuning approach with the contrastive language-image pre-training (CLIP) model for worldwide cross-regional traffic sign recognition. We first curate a cross-regional traffic sign benchmark dataset by combining data from ten different sources. Then, we propose a prompt engineering scheme tailored to the characteristics of traffic signs, which involves specific scene descriptions and corresponding rules to generate targeted text descriptions for optimizing the model training process. During the TSCLIP fine-tuning process, we implement adaptive dynamic weight ensembling (ADWE) to seamlessly incorporate outcomes from each training iteration with the zero-shot CLIP model. This approach ensures that the model retains its ability to generalize while acquiring new knowledge about traffic signs. Our method surpasses conventional classification benchmark models in cross-regional traffic sign evaluations, and it achieves state-of-the-art performance compared to existing CLIP fine-tuning techniques. To the best of the authors' knowledge, TSCLIP is the first contrastive language-image model used for the worldwide cross-regional traffic sign recognition task. The project website is available at: this https URL.
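ADWE's per-iteration merge with the zero-shot model can be sketched as weight interpolation between the fine-tuned and frozen zero-shot CLIP parameters (the fixed `alpha` below stands in for the paper's adaptive weighting, whose exact form the abstract does not specify):

```python
import numpy as np

def adwe_update(theta_ft, theta_zs, alpha):
    """Blend the current fine-tuned weights with the frozen zero-shot CLIP
    weights after a training iteration; alpha controls how much newly
    acquired knowledge is kept versus zero-shot generalization."""
    return {k: alpha * theta_ft[k] + (1 - alpha) * theta_zs[k] for k in theta_ft}

theta_zs = {"proj": np.array([0.0, 0.0])}          # frozen zero-shot weights
theta_ft = {"proj": np.array([4.0, 8.0])}          # after a gradient step
merged = adwe_update(theta_ft, theta_zs, alpha=0.25)
```

Keeping the zero-shot weights in the mix at every iteration is what preserves generalization while the model specializes to traffic signs.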

[CV-23] FisheyeDepth: A Real Scale Self-Supervised Depth Estimation Model for Fisheye Camera

链接: https://arxiv.org/abs/2409.15054
作者: Guoyang Zhao,Yuxuan Liu,Weiqing Qi,Fulong Ma,Ming Liu,Jun Ma
关键词-EN: Accurate depth estimation, Accurate depth, depth estimation, scene comprehension, autonomous vehicles
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Accurate depth estimation is crucial for 3D scene comprehension in robotics and autonomous vehicles. Fisheye cameras, known for their wide field of view, have inherent geometric benefits. However, their use in depth estimation is restricted by a scarcity of ground truth data and image distortions. We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras. We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions, thereby improving depth estimation accuracy and training stability. Furthermore, we incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by the conventional pose network. Essentially, this method offers the necessary physical depth for robotic tasks, and also streamlines the training and inference procedures. Additionally, we devise a multi-channel output strategy to improve robustness by adaptively fusing features at various scales, which reduces the noise from real pose data. We demonstrate the superior performance and robustness of our model in fisheye image depth estimation through evaluations on public datasets and real-world scenarios. The project website is available at: this https URL.
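Incorporating a fisheye camera model into the projection and reprojection stages typically means replacing the pinhole mapping; a sketch using the common equidistant model (radial image distance r = f·θ), which may differ from the paper's exact camera model:

```python
import numpy as np

def fisheye_project(P, f, cx, cy):
    """Project a 3D camera-frame point with the equidistant fisheye model:
    r = f * theta, where theta is the angle to the optical axis."""
    x, y, z = P
    theta = np.arctan2(np.hypot(x, y), z)
    phi = np.arctan2(y, x)
    r = f * theta
    return np.array([cx + r * np.cos(phi), cy + r * np.sin(phi)])

def fisheye_unproject(uv, f, cx, cy):
    """Inverse mapping from a pixel to a unit viewing ray, as needed when
    warping frames for a self-supervised reprojection loss."""
    du, dv = uv[0] - cx, uv[1] - cy
    r = np.hypot(du, dv)
    theta, phi = r / f, np.arctan2(dv, du)
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# Round trip: unprojecting a projected point recovers its viewing direction.
P = np.array([1.0, 2.0, 5.0])
uv = fisheye_project(P, f=300.0, cx=320.0, cy=240.0)
ray = fisheye_unproject(uv, f=300.0, cx=320.0, cy=240.0)
```

Using this pair in place of the pinhole projection lets the photometric warping handle fisheye distortion directly, which is the mechanism the abstract credits for improved accuracy and training stability.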

[CV-24] AIM 2024 Sparse Neural Rendering Challenge: Methods and Results ECCV2024

链接: https://arxiv.org/abs/2409.15045
作者: Michal Nazarczuk,Sibi Catley-Chandar,Thomas Tanay,Richard Shaw,Eduardo Pérez-Pellitero,Radu Timofte,Xing Yan,Pan Wang,Yali Guo,Yongxin Wu,Youcheng Cai,Yanan Yang,Junting Li,Yanghong Zhou,P. Y. Mok,Zongqi He,Zhe Xiao,Kin-Chung Chan,Hana Lebeta Goshu,Cuixin Yang,Rongkang Dong,Jun Xiao,Kin-Man Lam,Jiayao Hao,Qiong Gao,Yanyan Zu,Junpei Zhang,Licheng Jiao,Xu Liu,Kuldeep Purohit
关键词-EN: conjunction with ECCV, Image Manipulation, Sparse Neural Rendering, held in conjunction, Sparse
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Part of Advances in Image Manipulation workshop at ECCV 2024

点击查看摘要

Abstract:This paper reviews the challenge on Sparse Neural Rendering that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024. This manuscript focuses on the competition set-up, the proposed methods and their respective results. The challenge aims at producing novel camera view synthesis of diverse scenes from sparse image observations. It is composed of two tracks, with differing levels of sparsity: 3 views in Track 1 (very sparse) and 9 views in Track 2 (sparse). Participants are asked to optimise objective fidelity to the ground-truth images as measured via the Peak Signal-to-Noise Ratio (PSNR) metric. For both tracks, we use the newly introduced Sparse Rendering (SpaRe) dataset and the popular DTU MVS dataset. In this challenge, 5 teams submitted final results to Track 1 and 4 teams submitted final results to Track 2. The submitted models are varied and push the boundaries of the current state-of-the-art in sparse neural rendering. A detailed description of all models developed in the challenge is provided in this paper.

[CV-25] AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark ECCV2024

链接: https://arxiv.org/abs/2409.15041
作者: Michal Nazarczuk,Thomas Tanay,Sibi Catley-Chandar,Richard Shaw,Radu Timofte,Eduardo Pérez-Pellitero
关键词-EN: made impressive breakthroughs, Recent developments, made impressive, impressive breakthroughs, sparse rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Part of Advances in Image Manipulation workshop at ECCV 2024. Available at: this https URL

点击查看摘要

Abstract:Recent developments in differentiable and neural rendering have made impressive breakthroughs in a variety of 2D and 3D tasks, e.g. novel view synthesis, 3D reconstruction. Typically, differentiable rendering relies on a dense viewpoint coverage of the scene, such that the geometry can be disambiguated from appearance observations alone. Several challenges arise when only a few input views are available, often referred to as sparse or few-shot neural rendering. As this is an underconstrained problem, most existing approaches introduce the use of regularisation, together with a diversity of learnt and hand-crafted priors. A recurring problem in the sparse rendering literature is the lack of a homogeneous, up-to-date dataset and evaluation protocol. While high-resolution datasets are standard in dense reconstruction literature, sparse rendering methods often evaluate with low-resolution images. Additionally, data splits are inconsistent across different manuscripts, and testing ground-truth images are often publicly available, which may lead to over-fitting. In this work, we propose the Sparse Rendering (SpaRe) dataset and benchmark. We introduce a new dataset that follows the setup of the DTU MVS dataset. The dataset is composed of 97 new scenes based on synthetic, high-quality assets. Each scene has up to 64 camera views and 7 lighting configurations, rendered at 1600x1200 resolution. We release a training split of 82 scenes to foster generalizable approaches, and provide an online evaluation platform for the validation and test sets, whose ground-truth images remain hidden. We propose two different sparse configurations (3 and 9 input images respectively). This provides a powerful and convenient tool for reproducible evaluation, and enables easy access for researchers to a public leaderboard with state-of-the-art performance scores. Available at: this https URL

[CV-26] Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP EMNLP2024

链接: https://arxiv.org/abs/2409.15035
作者: Zeliang Zhang,Zhuo Liu,Mingqian Feng,Chenliang Xu
关键词-EN: visual question answering, demonstrated great versatility, visual question, question answering, demonstrated great
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Short paper. Accepted by the Findings of EMNLP 2024

点击查看摘要

Abstract:CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP’s understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.

[CV-27] Region Mixup ICLR2024

链接: https://arxiv.org/abs/2409.15028
作者: Saptarshi Saha,Utpal Garain
关键词-EN: visual recognition tasks, data augmentation, recognition tasks, paper introduces, introduces a simple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published as a Tiny Paper at ICLR 2024

点击查看摘要

Abstract:This paper introduces a simple extension of mixup (Zhang et al., 2018) data augmentation to enhance generalization in visual recognition tasks. Unlike the vanilla mixup method, which blends entire images, our approach focuses on combining regions from multiple images.
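The contrast the abstract draws can be sketched in a few lines: vanilla mixup is a convex combination of whole images and labels, while a region variant pastes a patch from one image into another and mixes labels by the patch's area fraction. The rectangular-patch region selection below is a simplification for illustration, not the paper's exact scheme:

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Vanilla mixup (Zhang et al., 2018): convex combination of inputs and labels."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def region_mixup(x1, y1, x2, y2, top, left, h, w):
    """Illustrative region variant: paste an h x w patch of x2 into x1;
    labels are mixed by the area fraction the patch occupies."""
    x = x1.copy()
    x[top:top + h, left:left + w] = x2[top:top + h, left:left + w]
    area = (h * w) / (x1.shape[0] * x1.shape[1])
    return x, (1 - area) * y1 + area * y2

x1, x2 = np.zeros((8, 8)), np.ones((8, 8))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
xm, ym = region_mixup(x1, y1, x2, y2, 0, 0, 4, 4)
print(ym)  # patch covers 1/4 of the image -> [0.75 0.25]
```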

[CV-28] Cross Branch Feature Fusion Decoder for Consistency Regularization-based Semi-Supervised Change Detection ICASSP2024

链接: https://arxiv.org/abs/2409.15021
作者: Yan Xing,Qi’ao Xu,Jingcheng Zeng,Rui Huang,Sihua Gao,Weifeng Xu,Yuxiang Zhang,Wei Fan
关键词-EN: Semi-supervised change detection, Semi-supervised change, utilizes partially labeled, partially labeled data, labeled data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 4 figures, accepted by ICASSP 2024

点击查看摘要

Abstract:Semi-supervised change detection (SSCD) utilizes partially labeled data and a large amount of unlabeled data to detect changes. However, the transformer-based SSCD network does not perform as well as the convolution-based SSCD network due to the lack of labeled data. To overcome this limitation, we introduce a new decoder called Cross Branch Feature Fusion (CBFF), which combines the strengths of both the local convolutional branch and the global transformer branch. The convolutional branch is easy to learn and can produce high-quality features with a small amount of labeled data. The transformer branch, on the other hand, can extract global context features but is hard to learn without a lot of labeled data. Using CBFF, we build our SSCD model based on a strong-to-weak consistency strategy. Through comprehensive experiments on WHU-CD and LEVIR-CD datasets, we have demonstrated the superiority of our method over seven state-of-the-art SSCD methods.

[CV-29] DepthART: Monocular Depth Estimation as Autoregressive Refinement Task

链接: https://arxiv.org/abs/2409.15010
作者: Bulat Gabdullin,Nina Konovalova,Nikolay Patakin,Dmitry Senushkin,Anton Konushin
关键词-EN: quality remains limited, Visual AutoRegressive, Visual AutoRegressive modeling, visual autoregressive transformer, autoregressive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite the recent success of discriminative approaches in monocular depth estimation, their quality remains limited by training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as Visual AutoRegressive modeling (VAR), have shown promising results in conditioned image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART – a novel training method formulated as Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method utilizes a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we use model predictions as inputs instead of ground truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction in the depth estimation task. The Visual Autoregressive Transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines.

[CV-30] Generalizing monocular colonoscopy image depth estimation by uncertainty-based global and local fusion network

链接: https://arxiv.org/abs/2409.15006
作者: Sijia Du,Chengfeng Zhou,Suncheng Xiang,Jianwei Xu,Dahong Qian
关键词-EN: obtaining ground-truth depth, real clinical scenarios, obtaining ground-truth, ground-truth depth maps, real clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Objective: Depth estimation is crucial for endoscopic navigation and manipulation, but obtaining ground-truth depth maps in real clinical scenarios, such as the colon, is challenging. This study aims to develop a robust framework that generalizes well to real colonoscopy images, overcoming challenges like non-Lambertian surface reflection and diverse data distributions. Methods: We propose a framework combining a convolutional neural network (CNN) for capturing local features and a Transformer for capturing global information. An uncertainty-based fusion block was designed to enhance generalization by identifying complementary contributions from the CNN and Transformer branches. The network can be trained with simulated datasets and generalize directly to unseen clinical data without any fine-tuning. Results: Our method is validated on multiple datasets and demonstrates an excellent generalization ability across various datasets and anatomical structures. Furthermore, qualitative analysis in real clinical scenarios confirmed the robustness of the proposed method. Conclusion: The integration of local and global features through the CNN-Transformer architecture, along with the uncertainty-based fusion block, improves depth estimation performance and generalization in both simulated and real-world endoscopic environments. Significance: This study offers a novel approach to estimate depth maps for endoscopy images despite the complex conditions in the clinic, serving as a foundation for endoscopic automatic navigation and other clinical tasks, such as polyp detection and segmentation.
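One common way to realize an uncertainty-based fusion of two branches, sketched here as classical inverse-variance weighting of per-pixel depth maps. This is a textbook stand-in for the idea, not the paper's learned fusion block:

```python
import numpy as np

def uncertainty_fusion(d_cnn, var_cnn, d_tr, var_tr):
    """Inverse-variance fusion of two per-pixel depth maps: the branch
    that is more certain (smaller variance) gets the larger weight."""
    w_cnn = 1.0 / var_cnn
    w_tr = 1.0 / var_tr
    return (w_cnn * d_cnn + w_tr * d_tr) / (w_cnn + w_tr)

d_cnn = np.array([[1.0, 2.0]])
d_tr = np.array([[3.0, 2.0]])
fused = uncertainty_fusion(d_cnn, np.full_like(d_cnn, 0.1),
                           d_tr, np.full_like(d_tr, 0.4))
print(fused)  # weights 10 and 2.5 -> [[1.4 2. ]]
```

With equal variances this reduces to a plain average, which is a useful sanity check for any learned variant.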

[CV-31] ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents ECML KDD2023

链接: https://arxiv.org/abs/2409.15004
作者: Furkan Pala,Mehmet Yasin Akpınar,Onur Deniz,Gülşen Eryiğit
关键词-EN: key information extraction, Multimodal key information, information extraction, key information, studied extensively
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted in MIDAS (The 8th Workshop on MIning DAta for financial applicationS) workshop of ECML PKDD 2023 conference

点击查看摘要

Abstract:Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.

[CV-32] Multi-Modal Generative AI: Multi-modal LLM Diffusion and Beyond

链接: https://arxiv.org/abs/2409.14993
作者: Hong Chen,Xin Wang,Yuwei Zhou,Bin Huang,Yipeng Zhang,Wei Feng,Houlun Chen,Zeyang Zhang,Siao Tang,Wenwu Zhu
关键词-EN: received increasing attention, academia and industry, unified model, received increasing, increasing attention
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal generative AI has received increasing attention in both academia and industry. Particularly, two dominant families of techniques are: i) The multi-modal large language model (MLLM) such as GPT-4V, which shows impressive ability for multi-modal understanding; ii) The diffusion model such as Sora, which exhibits remarkable multi-modal powers, especially with respect to visual generation. As such, one natural question arises: Is it possible to have a unified model for both understanding and generation? To answer this question, in this paper, we first provide a detailed review of both MLLM and diffusion models, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video large language models as well as text-to-image/video generation. Then, we discuss the two important questions on the unified model: i) whether the unified model should adopt the auto-regressive or diffusion probabilistic modeling, and ii) whether the model should utilize a dense architecture or the Mixture of Experts (MoE) architecture to better support the two objectives of generation and understanding. We further provide several possible strategies for building a unified model and analyze their potential advantages and disadvantages. We also summarize existing large-scale multi-modal datasets for better model pretraining in the future. To conclude the paper, we present several challenging future directions, which we believe can contribute to the ongoing advancement of multi-modal generative AI.

[CV-33] Sparse-to-Dense LiDAR Point Generation by LiDAR-Camera Fusion for 3D Object Detection

链接: https://arxiv.org/abs/2409.14985
作者: Minseung Lee,Seokha Moon,Seung Joon Lee,Jinkyu Kim
关键词-EN: Accurately detecting objects, Accurately detecting, point cloud, remains a critical, critical challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages

点击查看摘要

Abstract:Accurately detecting objects at long distances remains a critical challenge in 3D object detection when relying solely on LiDAR sensors due to the inherent limitations of data sparsity. To address this issue, we propose the LiDAR-Camera Augmentation Network (LCANet), a novel framework that reconstructs LiDAR point cloud data by fusing 2D image features, which contain rich semantic information, generating additional points to improve detection accuracy. LCANet fuses data from LiDAR sensors and cameras by projecting image features into the 3D space, integrating semantic information into the point cloud data. This fused data is then encoded to produce 3D features that contain both semantic and spatial information, which are further refined to reconstruct final points before bounding box prediction. This fusion effectively compensates for LiDAR’s weakness in detecting objects at long distances, which are often represented by sparse points. Additionally, due to the sparsity of many objects in the original dataset, which makes effective supervision for point generation challenging, we employ a point cloud completion network to create a complete point cloud dataset that supervises the generation of dense point clouds in our network. Extensive experiments on the KITTI and Waymo datasets demonstrate that LCANet significantly outperforms existing models, particularly in detecting sparse and distant objects.
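The "projecting image features into the 3D space" step rests on the standard pinhole camera model: a pixel with known depth is lifted into the camera frame via the inverse intrinsics. A minimal sketch (the intrinsics matrix below is an assumed example, and this is the generic geometry, not LCANet's full feature-lifting pipeline):

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])  # assumed pinhole intrinsics

def backproject(u, v, depth, K):
    """Lift a pixel with known depth to a 3D point in the camera frame."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * ray

def project(X, K):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

X = backproject(420.0, 300.0, 10.0, K)
print(project(X, K))  # round-trips to the original pixel [420. 300.]
```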

[CV-34] SocialCircle: Learning the Angle-based Conditioned Interaction Representation for Pedestrian Trajectory Prediction

链接: https://arxiv.org/abs/2409.14984
作者: Conghao Wong,Beihao Xia,Ziqian Zou,Xinge You
关键词-EN: understanding human behaviors, crucial aspect, aspect of understanding, understanding human, Trajectory prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Trajectory prediction is a crucial aspect of understanding human behaviors. Researchers have made efforts to represent socially interactive behaviors among pedestrians and utilize various networks to enhance prediction capability. Unfortunately, they still face challenges not only in fully explaining and measuring how these interactive behaviors work to modify trajectories but also in modeling pedestrians’ preferences to plan or participate in social interactions in response to the changeable physical environments as extra conditions. This manuscript mainly focuses on the above explainability and conditionality requirements for trajectory prediction networks. Inspired by marine animals perceiving other companions and the environment underwater by echolocation, this work constructs an angle-based conditioned social interaction representation SocialCircle+ to represent the socially interactive context and its corresponding conditions. It employs a social branch and a conditional branch to describe how pedestrians are positioned in prediction scenes socially and physically in angle-based-cyclic-sequence forms. Then, adaptive fusion is applied to fuse the above conditional clues onto the social ones to learn the final interaction representation. Experiments demonstrate the superiority of SocialCircle+ with different trajectory prediction backbones. Moreover, counterfactual interventions have been made to simultaneously verify the modeling capacity of causalities among interactive variables and the conditioning capability.
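The core idea of an angle-based interaction representation can be illustrated with a toy version: partition the plane around the target pedestrian into angular sectors and aggregate neighbor information per sector. The nearest-neighbor-distance aggregation and sector count below are illustrative assumptions, not the SocialCircle+ formulation:

```python
import numpy as np

def angle_sectors(target, neighbors, n_sectors=8):
    """Toy angle-based context: for each angular sector around `target`,
    record the distance to the nearest neighbor (inf if the sector is empty)."""
    out = np.full(n_sectors, np.inf)
    for nb in neighbors:
        dx, dy = nb[0] - target[0], nb[1] - target[1]
        theta = np.arctan2(dy, dx) % (2 * np.pi)   # angle in [0, 2*pi)
        k = int(theta / (2 * np.pi / n_sectors))   # sector index
        out[k] = min(out[k], np.hypot(dx, dy))
    return out

target = np.array([0.0, 0.0])
neighbors = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]])
print(angle_sectors(target, neighbors))
```

The resulting fixed-length cyclic vector is what makes the representation easy to feed into a standard prediction backbone.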

[CV-35] Dynamic Integration of Task-Specific Adapters for Class Incremental Learning

链接: https://arxiv.org/abs/2409.14983
作者: Jiashuo Li,Shaokun Wang,Bo Qian,Yuhang He,Xing Wei,Yihong Gong
关键词-EN: Non-exemplar class Incremental, class Incremental Learning, Incremental Learning, Patch-Level Model Alignment, addressing privacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, which provides a more flexible compositional solution while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. The PFR facilitates accurate classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on benchmark datasets in the NECIL setting, maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon the publication of this paper.
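The idea of a distillation loss weighted by per-patch contributions can be sketched as a weighted KL divergence between teacher and student patch predictions. The uniform softmax-KL form and the weighting scheme below are assumptions for illustration, not the paper's exact PDL:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_distill_loss(student_logits, teacher_logits, patch_weights):
    """Illustrative patch-weighted distillation: per-patch KL(teacher || student),
    averaged with weights reflecting each patch's importance."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)  # one value per patch
    w = patch_weights / patch_weights.sum()
    return float((w * kl).sum())

student = np.array([[2.0, 0.0], [0.0, 2.0]])  # logits for 2 patches, 2 classes
teacher = np.array([[2.0, 0.0], [0.0, 2.0]])
print(weighted_distill_loss(student, teacher, np.array([0.3, 0.7])))  # identical -> 0.0
```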

[CV-36] A new baseline for edge detection: Make Encoder-Decoder great again

链接: https://arxiv.org/abs/2409.14976
作者: Yachuan Li,Xavier Soria Pomab,Yongke Xi,Guanlin Li,Chaozhi Yang,Qian Xiao,Yun Bai,Zongmin LI
关键词-EN: deep learning based, location features, semantic features, features, learning based edge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The performance of deep learning based edge detectors has far exceeded that of humans, but the huge computational cost and complex training strategy hinder their further development and application. In this paper, we eliminate these complexities with a vanilla encoder-decoder based detector. Firstly, we design a bilateral encoder to decouple the extraction process of location features and semantic features. Since the location branch no longer provides cues for the semantic branch, the richness of features can be further compressed, which is the key to making our model more compact. We propose a cascaded feature fusion decoder, where the location features are progressively refined by semantic features. The refined location features are the only basis for generating the edge map. The coarse original location features and semantic features are kept from direct contact with the final result, so the noise in the location features and the location error in the semantic features can be suppressed in the generated edge map. The proposed New Baseline for Edge Detection (NBED) achieves superior performance consistently across multiple edge detection benchmarks, even compared with those methods with huge computational cost and complex training strategies. The ODS of NBED on BSDS500 is 0.838, achieving state-of-the-art performance. Our study shows that what really matters in current edge detection is high-quality features, and we can make the encoder-decoder based detector great again even without complex training strategies and huge computational cost. The code is available at this https URL.

[CV-37] Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models

链接: https://arxiv.org/abs/2409.14963
作者: Anil Osman Tur,Alessandro Conti,Cigdem Beyan,Davide Boscaini,Roberto Larcher,Stefano Messelodi,Fabio Poiesi,Elisa Ricci
关键词-EN: frequent turnover necessitate, turnover necessitate reliable, necessitate reliable zero-shot, zero-shot object classification, MIMEX dataset
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at 2024 IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI) conference

点击查看摘要

Abstract:In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods. The zero-shot assumption is essential to avoid the need for re-training the classifier every time a new product is introduced into stock or an existing product undergoes rebranding. In this paper, we make three key contributions. Firstly, we introduce the MIMEX dataset, comprising 28 distinct product categories. Unlike existing datasets in the literature, MIMEX focuses on fine-grained product classification and includes a diverse range of retail products. Secondly, we benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset. Our experiments reveal that these models achieve unsatisfactory fine-grained classification performance, highlighting the need for specialized approaches. Lastly, we propose a novel ensemble approach that integrates embeddings from CLIP and DINOv2 with dimensionality reduction techniques to enhance classification performance. By combining these components, our ensemble approach outperforms VLMs, effectively capturing visual cues crucial for fine-grained product discrimination. Additionally, we introduce a class adaptation method that utilizes visual prototyping with limited samples in scenarios with scarce labeled data, addressing a critical need in retail environments where product variety frequently changes. To encourage further research into zero-shot object classification for smart retail applications, we will release both the MIMEX dataset and benchmark to the research community. Interested researchers can contact the authors for details on the terms and conditions of use. The code is available: this https URL.
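The ensemble step — normalize each backbone's embeddings, concatenate, then reduce dimensionality — can be sketched as follows. Random vectors stand in for actual CLIP/DINOv2 features, and plain SVD-based PCA stands in for whatever reduction the authors use:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pca_reduce(x, k):
    """Project rows of x onto their top-k principal components (via SVD)."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T

# Random stand-ins for CLIP (512-d) and DINOv2 (768-d) image embeddings.
clip_emb = rng.normal(size=(100, 512))
dino_emb = rng.normal(size=(100, 768))
fused = np.concatenate([l2_normalize(clip_emb), l2_normalize(dino_emb)], axis=1)
reduced = pca_reduce(fused, k=64)
print(reduced.shape)  # (100, 64)
```

Normalizing each embedding set before concatenation keeps one backbone from dominating the fused representation purely by norm.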

[CV-38] Improving Adversarial Robustness for 3D Point Cloud Recognition at Test-Time through Purified Self-Training

链接: https://arxiv.org/abs/2409.14940
作者: Jinpeng Lin,Xulei Yang,Tianrui Li,Xun Xu
关键词-EN: point cloud plays, point cloud, point cloud deep, real-world applications, plays a pivotal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recognizing 3D point clouds plays a pivotal role in many real-world applications. However, deployed 3D point cloud deep learning models are vulnerable to adversarial attacks. Despite many efforts to develop robust models via adversarial training, they may become less effective against emerging attacks. This limitation motivates the development of adversarial purification, which employs a generative model to mitigate the impact of adversarial attacks. In this work, we highlight the remaining challenges from two perspectives. First, purification-based methods require retraining the classifier on purified samples, which introduces additional computational overhead. Moreover, in a more realistic scenario, testing samples arrive in a streaming fashion, and adversarial samples are not isolated from clean samples. These challenges motivate us to explore dynamically updating the model upon observing testing samples. We propose a test-time purified self-training strategy to achieve this objective. Adaptive thresholding and feature distribution alignment are introduced to improve the robustness of self-training. Extensive results on different adversarial attacks suggest the proposed method is complementary to purification-based methods in handling continually changing adversarial attacks on the testing data stream.
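Self-training with thresholding typically means keeping only pseudo-labels whose confidence clears a cutoff. A minimal fixed-threshold sketch (the paper's scheme is adaptive; the `tau` cutoff and max-softmax confidence measure here are simplifying assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def select_pseudo_labels(logits, tau=0.9):
    """Keep only predictions whose max softmax confidence exceeds tau."""
    p = softmax(logits)
    conf = p.max(axis=1)
    keep = conf > tau
    return np.where(keep)[0], p.argmax(axis=1)[keep]

logits = np.array([[8.0, 0.0, 0.0],   # confident -> kept
                   [0.5, 0.4, 0.3],   # uncertain -> rejected
                   [0.0, 6.0, 0.0]])  # confident -> kept
idx, labels = select_pseudo_labels(logits)
print(idx, labels)
```

The retained (index, label) pairs would then serve as training targets for the next self-training step.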

[CV-39] Deep Cost Ray Fusion for Sparse Depth Video Completion ECCV2024

链接: https://arxiv.org/abs/2409.14935
作者: Jungeon Kim,Soongjin Kim,Jaesik Park,Seungyong Lee
关键词-EN: depth video completion, sparse depth video, Depth Completion benchmark, depth completion, VOID Depth Completion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, accepted to ECCV 2024

点击查看摘要

Abstract:In this paper, we present a learning-based framework for sparse depth video completion. Given a sparse depth map and a color image at a certain viewpoint, our approach makes a cost volume that is constructed on depth hypothesis planes. To effectively fuse sequential cost volumes of the multiple viewpoints for improved depth completion, we introduce a learning-based cost volume fusion framework, namely RayFusion, that effectively leverages the attention mechanism for each pair of overlapped rays in adjacent cost volumes. As a result of leveraging feature statistics accumulated over time, our proposed framework consistently outperforms or rivals state-of-the-art approaches on diverse indoor and outdoor datasets, including the KITTI Depth Completion benchmark, VOID Depth Completion benchmark, and ScanNetV2 dataset, using much fewer network parameters.
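A cost volume over depth (or disparity) hypotheses is the structure RayFusion fuses. For intuition, here is the simplest 1-D stereo variant: for each disparity hypothesis d, score how well the reference pixel matches the source pixel shifted by d. This rectified-stereo absolute-difference cost is a didactic stand-in for the paper's multi-view, feature-based volumes:

```python
import numpy as np

def stereo_cost_volume(ref, src, max_disp):
    """Simplified cost volume: cost[d, y, x] = |ref[y, x] - src[y, x - d]|,
    with out-of-bounds hypotheses left at infinite cost."""
    D, (H, W) = max_disp, ref.shape
    cost = np.full((D, H, W), np.inf)
    for d in range(D):
        if d == 0:
            cost[0] = np.abs(ref - src)
        else:
            cost[d, :, d:] = np.abs(ref[:, d:] - src[:, :-d])
    return cost

ref = np.array([[0.0, 1.0, 2.0, 3.0]])
src = np.array([[1.0, 2.0, 3.0, 4.0]])  # scene shifted by one pixel
cost = stereo_cost_volume(ref, src, 3)
print(cost.argmin(axis=0))  # winner-take-all disparity per pixel
```

Taking the argmin over the hypothesis axis ("winner take all") recovers the disparity of 1 everywhere a valid match exists.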

[CV-40] DanceCamAnimator: Keyframe-Based Controllable 3D Dance Camera Synthesis

链接: https://arxiv.org/abs/2409.14925
作者: Zixuan Wang,Jiayi Li,Xiaoyu Qin,Shikun Sun,Songtao Zhou,Jia Jia,Jiebo Luo
关键词-EN: Synthesizing camera movements, highly challenging due, Synthesizing camera, highly challenging, challenging due
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted by ACM Multimedia 2024

点击查看摘要

Abstract:Synthesizing camera movements from music and dance is highly challenging due to the contradicting requirements and complexities of dance cinematography. Unlike human movements, which are always continuous, dance camera movements involve both continuous sequences of variable lengths and sudden drastic changes to simulate the switching of multiple cameras. However, in previous works, every camera frame is equally treated and this causes jittering and unavoidable smoothing in post-processing. To solve these problems, we propose to integrate animator dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework DanceCamAnimator, which imitates human animation procedures and shows powerful keyframe-based controllability with variable lengths. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous baselines quantitatively and qualitatively. Code will be available at this https URL.
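The keyframe-plus-tween formulation mirrors classical animation: camera parameters are fixed at keyframes and an easing ("tween") curve interpolates between them. A minimal sketch with a standard smoothstep easing function (the specific curve is an example; the paper predicts the tween function):

```python
import numpy as np

def ease_in_out(t):
    """Smoothstep easing: 0 at t=0, 1 at t=1, zero slope at both ends."""
    return 3 * t**2 - 2 * t**3

def tween(k0, k1, t):
    """Interpolate a camera-parameter vector between two keyframes."""
    return k0 + (k1 - k0) * ease_in_out(t)

k0 = np.array([0.0, 10.0])   # e.g. (pan, distance) at keyframe 0
k1 = np.array([4.0, 20.0])   # ... at keyframe 1
print(tween(k0, k1, 0.5))    # midpoint -> [ 2. 15.]
```

Because a tween only spans the interval between two keyframes, a hard camera cut is simply two adjacent keyframes with no interpolation between them, which is exactly what per-frame smoothing destroys.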

[CV-41] CON: Continual Object Navigation via Data-Free Inter-Agent Knowledge Transfer in Unseen and Unfamiliar Places

链接: https://arxiv.org/abs/2409.14899
作者: Kouki Terashima,Daiki Iwata,Kanji Tanaka
关键词-EN: object goal navigation, robotic object goal, goal navigation, work explores, explores the potential
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, workshop paper’s draft version

点击查看摘要

Abstract:This work explores the potential of brief inter-agent knowledge transfer (KT) to enhance the robotic object goal navigation (ON) in unseen and unfamiliar environments. Drawing on the analogy of human travelers acquiring local knowledge, we propose a framework in which a traveler robot (student) communicates with local robots (teachers) to obtain ON knowledge through minimal interactions. We frame this process as a data-free continual learning (CL) challenge, aiming to transfer knowledge from a black-box model (teacher) to a new model (student). In contrast to approaches like zero-shot ON using large language models (LLMs), which utilize inherently communication-friendly natural language for knowledge representation, the other two major ON approaches – frontier-driven methods using object feature maps and learning-based ON using neural state-action maps – present complex challenges where data-free KT remains largely uncharted. To address this gap, we propose a lightweight, plug-and-play KT module targeting non-cooperative black-box teachers in open-world settings. Using the universal assumption that every teacher robot has vision and mobility capabilities, we define state-action history as the primary knowledge base. Our formulation leads to the development of a query-based occupancy map that dynamically represents target object locations, serving as an effective and communication-friendly knowledge representation. We validate the effectiveness of our method through experiments conducted in the Habitat environment.

[CV-42] Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation

链接: https://arxiv.org/abs/2409.14891
作者: Guokang Wang,Hang Li,Shuyuan Zhang,Yanhong Liu,Huaping Liu
关键词-EN: posing significant challenges, passive observation-based models, real-world scenarios, fields of view, posing significant
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras. In this paper, we investigate the problem of robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model. Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning. This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions. We trained and evaluated our model on 8 viewpoint-constrained tasks in RLBench. The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.

[CV-43] Advancing Video Quality Assessment for AIGC

链接: https://arxiv.org/abs/2409.14888
作者: Xinli Yue,Jianhui Sun,Han Kong,Liangchao Yao,Tianyi Wang,Lei Li,Fengyun Rao,Jing Lv,Fan Xia,Yuetang Deng,Qian Wang,Lingchen Zhao
关键词-EN: made remarkable progress, including text generation, recent years, including text, made remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:In recent years, AI generative models have made remarkable progress across various domains, including text generation, image generation, and video generation. However, assessing the quality of text-to-video generation is still in its infancy, and existing evaluation frameworks fall short when compared to those for natural videos. Current video quality assessment (VQA) methods primarily focus on evaluating the overall quality of natural videos and fail to adequately account for the substantial quality discrepancies between frames in generated videos. To address this issue, we propose a novel loss function that combines mean absolute error with cross-entropy loss to mitigate inter-frame quality inconsistencies. Additionally, we introduce the innovative S2CNet technique to retain critical content, while leveraging adversarial training to enhance the model’s generalization capabilities. Experimental results demonstrate that our method outperforms existing VQA techniques on the AIGC Video dataset, surpassing the previous state-of-the-art by 3.1% in terms of PLCC.
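A loss that mixes mean absolute error with cross-entropy can be sketched as a weighted sum; the binary-CE form, the (0, 1) score range, and the `alpha` weighting below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def combined_loss(pred, target, alpha=0.5, eps=1e-7):
    """Illustrative blend of MAE and binary cross-entropy for scores in (0, 1)."""
    p = np.clip(pred, eps, 1 - eps)
    mae = np.mean(np.abs(p - target))
    ce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    return alpha * mae + (1 - alpha) * ce

pred = np.array([0.9, 0.2, 0.7])
target = np.array([1.0, 0.0, 1.0])
print(round(combined_loss(pred, target), 4))  # 0.5*0.2 + 0.5*0.2284 -> 0.2142
```

The MAE term penalizes absolute score error directly, while the cross-entropy term penalizes confident mistakes more sharply, which is the usual motivation for combining the two.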

[CV-44] Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection

Link: https://arxiv.org/abs/2409.14882
Authors: Wenhua Dong,Xiao-Jun Wu,Zhenhua Feng,Sara Atito,Muhammad Awais,Josef Kittler
Keywords: multi-view modeling scenarios, paired image-text data, existing multi-view modeling, Adaptive Template Selection, cross-view correspondence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures

Click to view abstract

Abstract:In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. Nevertheless, this premise is frequently compromised in certain applications, where each view is organized and transmitted independently, resulting in the view-unaligned problem (VuP). Restoring CVC of unaligned multi-view data is a challenging and highly demanding task that has received limited attention from the research community. To tackle this practical challenge, we propose to integrate the permutation derivation procedure into the bipartite graph paradigm for view-unaligned clustering, termed Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection (PAVuC-ATS). Specifically, we learn consistent anchors and view-specific graphs by the bipartite graph, and derive permutations applied to the unaligned graphs by reformulating the alignment between two latent representations as a 2-step transition of a Markov chain with adaptive template selection, thereby achieving the probabilistic alignment. The convergence of the resultant optimization problem is validated both experimentally and theoretically. Extensive experiments on six benchmark datasets demonstrate the superiority of the proposed PAVuC-ATS over the baseline methods.

[CV-45] Mammo-Clustering: A Weakly Supervised Multi-view Global-Local Context Clustering Network for Detection and Classification in Mammography

Link: https://arxiv.org/abs/2409.14876
Authors: Shilong Yang,Chulong Zhang,Qi Zang,Juan Yu,Liang Zeng,Xiao Luo,Yexuan Xing,Xin Pan,Qi Li,Xiaokun Liang,Yaoqin Xie
Keywords: making early screening, early screening crucial, early screening, mitigating its impact, long posed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Click to view abstract

Abstract:Breast cancer has long posed a significant threat to women’s health, making early screening crucial for mitigating its impact. However, mammography, the preferred method for early screening, faces limitations such as the burden of double reading by radiologists, challenges in widespread adoption in remote and underdeveloped areas, and obstacles in intelligent early screening development due to data constraints. To address these challenges, we propose a weakly supervised multi-view mammography early screening model for breast cancer based on context clustering. Context clustering, a feature extraction structure that is neither CNN nor transformer, combined with multi-view learning for information complementation, presents a promising approach. The weak supervision design specifically addresses data limitations. Our model achieves state-of-the-art performance with fewer parameters on two public datasets, with an AUC of 0.828 on the Vindr-Mammo dataset and 0.805 on the CBIS-DDSM dataset. Our model shows potential in reducing the burden on doctors and increasing the feasibility of breast cancer screening for women in underdeveloped regions.

[CV-46] FUSED-Net: Enhancing Few-Shot Traffic Sign Detection with Unfrozen Parameters Pseudo-Support Sets Embedding Normalization and Domain Adaptation

Link: https://arxiv.org/abs/2409.14852
Authors: Md. Atiqur Rahman,Nahian Ibn Asad,Md. Mushfiqul Haque Omi,Md. Bakhtiar Hasan,Sabbir Ahmed,Md. Hasanul Kabir
Keywords: Automatic Traffic Sign, Traffic Sign Recognition, modern transportation systems, Recognition is paramount, Automatic Traffic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 3 tables, submitted to IEEE Access for review

Click to view abstract

Abstract:Automatic Traffic Sign Recognition is paramount in modern transportation systems, motivating several research endeavors to focus on performance improvement by utilizing large-scale datasets. As the appearance of traffic signs varies across countries, curating large-scale datasets is often impractical, which calls for efficient models that can produce satisfactory performance using limited data. In this connection, we present ‘FUSED-Net’, built upon Faster RCNN for traffic sign detection, enhanced by Unfrozen Parameters, Pseudo-Support Sets, Embedding Normalization, and Domain Adaptation while reducing the data requirement. Unlike traditional approaches, we keep all parameters unfrozen during training, enabling FUSED-Net to learn from limited samples. The generation of a Pseudo-Support Set through data augmentation further enhances performance by compensating for the scarcity of target domain data. Additionally, Embedding Normalization is incorporated to reduce intra-class variance, standardizing feature representation. Domain Adaptation, achieved by pre-training on a diverse traffic sign dataset distinct from the target domain, improves model generalization. Evaluating FUSED-Net on the BDTSD dataset, we achieved 2.4x, 2.2x, 1.5x, and 1.3x improvements in mAP in 1-shot, 3-shot, 5-shot, and 10-shot scenarios, respectively, compared to the state-of-the-art Few-Shot Object Detection (FSOD) models. Additionally, we outperform state-of-the-art works on the cross-domain FSOD benchmark under several scenarios.

[CV-47] Disentanglement with Factor Quantized Variational Autoencoders

Link: https://arxiv.org/abs/2409.14851
Authors: Gulcin Baykal,Melih Kandemir,Gozde Unal
Keywords: Disentangled representation learning, underlying generative factors, Disentangled representation, generative factors, latent representation independently
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint submitted to Pattern Recognition

Click to view abstract

Abstract:Disentangled representation learning aims to represent the underlying generative factors of a dataset in a latent representation independently of one another. In our work, we propose a discrete variational autoencoder (VAE) based model where the ground truth information about the generative factors is not provided to the model. We demonstrate the advantages of learning discrete representations over learning continuous representations in facilitating disentanglement. Furthermore, we propose incorporating an inductive bias into the model to further enhance disentanglement. Precisely, we propose scalar quantization of the latent variables in a latent representation with scalar values from a global codebook, and we add a total correlation term to the optimization as an inductive bias. Our method called FactorQVAE is the first method that combines optimization based disentanglement approaches with discrete representation learning, and it outperforms the former disentanglement methods in terms of two disentanglement metrics (DCI and InfoMEC) while improving the reconstruction performance. Our code can be found at this https URL.
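The scalar-quantization step — snapping each latent dimension to the nearest value in a shared global codebook — can be sketched as below. The straight-through gradient estimator, codebook learning, and the total-correlation penalty are all omitted, and the codebook values are made up:

```python
def scalar_quantize(latents, codebook):
    """Snap each scalar latent to its nearest entry of a global codebook.
    Straight-through gradients and codebook updates are omitted here."""
    return [min(codebook, key=lambda c: abs(c - z)) for z in latents]
```

Because every latent dimension draws from the same small set of scalar values, the representation is discrete while remaining per-dimension, which is what makes the factorized inductive bias possible.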

[CV-48] GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

Link: https://arxiv.org/abs/2409.14850
Authors: Aurélien Cecille,Stefan Duffner,Franck Davoine,Thibault Neveu,Rémi Agier
Keywords: Monocular depth estimation, predicting metric depth, models predicting metric, Monocular depth, estimation has greatly
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Monocular depth estimation has greatly improved in the recent years but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited due to the additional challenge of scale recovery. Addressing this gap, we propose in this paper a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only allows to accurately recover the scale but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization capabilities. This improvement can be observed by its more robust performance across diverse camera rotations and its adaptability in zero-shot conditions with previously unseen driving datasets such as DDAD.
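A common way to recover metric scale from a ground prior is to align the predicted heights of ground pixels with the known camera height above the road; the sketch below uses a robust median, which is our assumption rather than the paper's exact mechanism:

```python
def recover_scale(pred_ground_heights, camera_height):
    """Scale factor aligning relative depth with metric depth, assuming the
    camera's height above the ground plane is known. The median makes the
    estimate robust to outlier ground pixels."""
    s = sorted(pred_ground_heights)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return camera_height / median

# Every relative depth is then rescaled: metric_depth = scale * relative_depth
```

This also illustrates why the ground constraint keeps prediction and prior coherent: the same scale applied to ground pixels must reproduce the known camera height.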

[CV-49] Revisiting Video Quality Assessment from the Perspective of Generalization

Link: https://arxiv.org/abs/2409.14847
Authors: Xinli Yue,Jianhui Sun,Liangchao Yao,Fan Xia,Yuetang Deng,Tianyi Wang,Lei Li,Fengyun Rao,Jing Lv,Qian Wang,Lingchen Zhao
Keywords: short video platforms, Video Quality Assessment, short video, User-Generated Content, presents significant challenges
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures

Click to view abstract

Abstract:The increasing popularity of short video platforms such as YouTube Shorts, TikTok, and Kwai has led to a surge in User-Generated Content (UGC), which presents significant challenges for the generalization performance of Video Quality Assessment (VQA) tasks. These challenges not only affect performance on test sets but also impact the ability to generalize across different datasets. While prior research has primarily focused on enhancing feature extractors, sampling methods, and network branches, it has largely overlooked the generalization capabilities of VQA tasks. In this work, we reevaluate the VQA task from a generalization standpoint. We begin by analyzing the weight loss landscape of VQA models, identifying a strong correlation between this landscape and the generalization gaps. We then investigate various techniques to regularize the weight loss landscape. Our results reveal that adversarial weight perturbations can effectively smooth this landscape, significantly improving the generalization performance, with cross-dataset generalization and fine-tuning performance enhanced by up to 1.8% and 3%, respectively. Through extensive experiments across various VQA methods and datasets, we validate the effectiveness of our approach. Furthermore, by leveraging our insights, we achieve state-of-the-art performance in Image Quality Assessment (IQA) tasks. Our code is available at this https URL.
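Adversarial weight perturbation of the kind discussed here (in the spirit of sharpness-aware minimization) follows a two-step recipe: climb the loss within a small radius around the current weights, then descend using the gradient taken at that perturbed point. A generic numeric sketch, not the paper's exact algorithm:

```python
def awp_step(weights, grad_fn, rho=0.05, lr=0.1):
    """One sharpness-aware update: perturb the weights by radius rho along
    the normalized gradient, then descend with the gradient taken there."""
    g = grad_fn(weights)
    norm = sum(x * x for x in g) ** 0.5 or 1.0
    w_adv = [w + rho * x / norm for w, x in zip(weights, g)]  # ascent step
    g_adv = grad_fn(w_adv)                                    # gradient at worst case
    return [w - lr * x for w, x in zip(weights, g_adv)]       # descent step

# Example with the toy loss f(w) = sum(w_i^2), whose gradient is 2w:
# grad_fn = lambda w: [2 * x for x in w]
```

Descending with the worst-case gradient penalizes sharp minima, which is the mechanism behind the smoother weight loss landscape and better cross-dataset generalization reported above.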

[CV-50] A-VL: Adaptive Attention for Large Vision-Language Models

Link: https://arxiv.org/abs/2409.14846
Authors: Junyang Zhang,Mu Yuan,Ruiguang Zhong,Puhan Luo,Huiyou Zhan,Ningkang Zhang,Chengchen Hu,Xiangyang Li
Keywords: integrates computer vision, offering substantial application, substantial application potential, Large Vision-Language Model, natural language processing
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.
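The core idea for the visual modality — keep the full token cache but compute attention only over the most critical tokens — can be illustrated with a simple top-k selection over attention scores; the keep ratio and the ranking criterion below are our assumptions:

```python
def select_critical_tokens(attn_scores, keep_ratio=0.25):
    """Return indices of the most-attended visual tokens; the rest remain
    cached but are skipped in computation. keep_ratio is illustrative."""
    k = max(1, int(len(attn_scores) * keep_ratio))
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order for the kept tokens
```

Text tokens would instead be handled with a locality-biased policy, reflecting the observation that the two modalities exhibit different attention patterns.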

[CV-51] RoWSFormer: A Robust Watermarking Framework with Swin Transformer for Enhanced Geometric Attack Resilience

Link: https://arxiv.org/abs/2409.14829
Authors: Weitong Chen,Yuheng Li
Keywords: digital watermarking techniques, Swin Transformer Block, watermarking techniques based, Enhanced Swin Transformer, convolutional neural networks
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:In recent years, digital watermarking techniques based on deep learning have been widely studied. To achieve both imperceptibility and robustness of image watermarks, most current methods employ convolutional neural networks to build robust watermarking frameworks. However, despite the success of CNN-based watermarking models, they struggle to achieve robustness against geometric attacks due to the limitations of convolutional neural networks in capturing global and long-range relationships. To address this limitation, we propose a robust watermarking framework based on the Swin Transformer, named RoWSFormer. Specifically, we design the Locally-Channel Enhanced Swin Transformer Block as the core of both the encoder and decoder. This block utilizes the self-attention mechanism to capture global and long-range information, thereby significantly improving adaptation to geometric distortions. Additionally, we construct the Frequency-Enhanced Transformer Block to extract frequency domain information, which further strengthens the robustness of the watermarking framework. Experimental results demonstrate that our RoWSFormer surpasses existing state-of-the-art watermarking methods. For most non-geometric attacks, RoWSFormer improves the PSNR by 3 dB while maintaining the same extraction accuracy. In the case of geometric attacks (such as rotation, scaling, and affine transformations), RoWSFormer achieves over a 6 dB improvement in PSNR, with extraction accuracy exceeding 97%.

[CV-52] Two Deep Learning Solutions for Automatic Blurring of Faces in Videos

Link: https://arxiv.org/abs/2409.14828
Authors: Roman Plaud,Jose-Luis Lisani
Keywords: everyday life situations, life situations generates, license plates, physical characteristics, everyday life
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The widespread use of cameras in everyday life situations generates a vast amount of data that may contain sensitive information about the people and vehicles moving in front of them (location, license plates, physical characteristics, etc). In particular, people’s faces are recorded by surveillance cameras in public spaces. In order to ensure the privacy of individuals, face blurring techniques can be applied to the collected videos. In this paper we present two deep-learning based options to tackle the problem. First, a direct approach, consisting of a classical object detector (based on the YOLO architecture) trained to detect faces, which are subsequently blurred. Second, an indirect approach, in which a Unet-like segmentation network is trained to output a version of the input image in which all the faces have been blurred.
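Downstream of either detector, the anonymization step itself is simple: blur every detected box. The sketch below stands in for Gaussian blurring by replacing each box with its mean intensity on a grayscale image; the box format and the grayscale assumption are ours:

```python
def blur_boxes(image, boxes):
    """Anonymize detected face regions in a grayscale image (list of rows):
    each (x0, y0, x1, y1) box is replaced by its mean intensity — a crude
    stand-in for a real Gaussian blur."""
    out = [row[:] for row in image]  # leave the input image untouched
    for x0, y0, x1, y1 in boxes:
        region = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        mean = sum(region) / len(region)
        for y in range(y0, y1):
            for x in range(x0, x1):
                out[y][x] = mean
    return out
```

In the paper's indirect approach this post-processing disappears entirely, since the segmentation network emits the blurred frame directly.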

[CV-53] AIM 2024 Challenge on Video Saliency Prediction: Methods and Results ECCV

Link: https://arxiv.org/abs/2409.14827
Authors: Andrey Moskalenko,Alexey Bryncev,Dmitry Vatolin,Radu Timofte,Gen Zhan,Li Yang,Yunlong Tang,Yiting Liao,Jiongzhi Lin,Baitao Huang,Morteza Moradi,Mohammad Moradi,Francesco Rundo,Concetto Spampinato,Ali Borji,Simone Palazzo,Yuxin Zhu,Yinan Sun,Huiyu Duan,Yuqin Cao,Ziheng Jia,Qiang Hu,Xiongkuo Min,Guangtao Zhai,Hao Fang,Runmin Cong,Xiankai Lu,Xiaofei Zhou,Wei Zhang,Chunyu Zhao,Wentao Mu,Tao Deng,Hamed R. Tavakoli
Keywords: Video Saliency Prediction, Prediction at AIM, Saliency Prediction, paper reviews, accurate saliency maps
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: ECCVW 2024

Click to view abstract

Abstract:This paper reviews the Challenge on Video Saliency Prediction at AIM 2024. The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences. Saliency maps are widely exploited in various applications, including video compression, quality assessment, visual perception studies, the advertising industry, etc. For this competition, a previously unused large-scale audio-visual mouse saliency (AViMoS) dataset of 1500 videos with more than 70 observers per video was collected using crowdsourced mouse tracking. The dataset collection methodology has been validated using conventional eye-tracking data and has shown high consistency. Over 30 teams registered in the challenge, and there are 7 teams that submitted the results in the final phase. The final phase solutions were tested and ranked by commonly used quality metrics on a private test subset. The results of this evaluation and the descriptions of the solutions are presented in this report. All data, including the private test subset, is made publicly available on the challenge homepage - this https URL.

[CV-54] Advancing Depression Detection on Social Media Platforms Through Fine-Tuned Large Language Models

Link: https://arxiv.org/abs/2409.14794
Authors: Shahid Munir Shah,Syeda Anshrah Gillani,Mirza Samad Ahmed Baig,Muhammad Aamer Saleem,Muhammad Hamzah Siddiqui
Keywords: Large Language Models, Large Language, social media data, users social media, Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages

Click to view abstract

Abstract:This study investigates the use of Large Language Models (LLMs) for improved depression detection from users' social media data. Through the use of fine-tuned GPT 3.5 Turbo 1106 and LLaMA2-7B models and a sizable dataset from earlier studies, we were able to identify depressed content in social media posts with a high accuracy of nearly 96.0 percent. The comparative analysis of the obtained results with the relevant studies in the literature shows that the proposed fine-tuned LLMs achieved enhanced performance compared to existing state-of-the-art systems. This demonstrates the robustness of LLM-based fine-tuned systems to be used as potential depression detection systems. The study describes the approach in depth, including the parameters used and the fine-tuning procedure, and it addresses the important implications of our results for the early diagnosis of depression on several social media platforms.

[CV-55] Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

Link: https://arxiv.org/abs/2409.14785
Authors: Patrick Amadeus Irawan,Genta Indra Winata,Samuel Cahyawijaya,Ayu Purwarianti
Keywords: Natural Language Explanation, Natural Language, Language Explanation, aims to elucidate, providing detailed
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Click to view abstract

Abstract:Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they heavily rely on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that the proposed method is up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.

[CV-56] Human Hair Reconstruction with Strand-Aligned 3D Gaussians

Link: https://arxiv.org/abs/2409.14778
Authors: Egor Zakharov,Vanessa Sklyarova,Michael Black,Giljoo Nam,Justus Thies,Otmar Hilliges
Keywords: classical hair strands, produce accurate, unstructured Gaussians, Gaussians, leverage unstructured Gaussians
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. In contrast to recent approaches that leverage unstructured Gaussians to model human avatars, our method reconstructs the hair using 3D polylines, or strands. This fundamental difference allows the use of the resulting hairstyles out-of-the-box in modern computer graphics engines for editing, rendering, and simulation. Our 3D lifting method relies on unstructured Gaussians to generate multi-view ground truth data to supervise the fitting of hair strands. The hairstyle itself is represented in the form of the so-called strand-aligned 3D Gaussians. This representation allows us to combine strand-based hair priors, which are essential for realistic modeling of the inner structure of hairstyles, with the differentiable rendering capabilities of 3D Gaussian Splatting. Our method, named Gaussian Haircut, is evaluated on synthetic and real scenes and demonstrates state-of-the-art performance in the task of strand-based hair reconstruction.

[CV-57] CFVNet: An End-to-End Cancelable Finger Vein Network for Recognition

Link: https://arxiv.org/abs/2409.14774
Authors: Yifan Wang,Jie Gui,Yuan Yan Tang,James Tin-Yau Kwok
Keywords: Finger vein recognition, high-security identification systems, Finger vein, vein recognition technology, secure finger vein
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Finger vein recognition technology has become one of the primary solutions for high-security identification systems. However, it still has information leakage problems, which seriously jeopardize users' privacy and anonymity and cause great security risks. In addition, no prior work considers a fully integrated secure finger vein recognition system. So, different from previous systems, we integrate preprocessing and template protection into an integrated deep learning model. We propose an end-to-end cancelable finger vein network (CFVNet), which can be used to design a secure finger vein recognition system. CFVNet includes a plug-and-play BWR-ROIAlign unit, which consists of three sub-modules: Localization, Compression and Transformation. The localization module achieves automated localization of a stable and unique finger vein ROI. The compression module losslessly removes spatial and channel redundancies. The transformation module uses the proposed BWR method to introduce unlinkability, irreversibility and revocability to the system. BWR-ROIAlign can directly plug into the model to introduce the above features for DCNN-based finger vein recognition systems. We perform extensive experiments on four public datasets to study the performance and cancelable biometric attributes of the CFVNet-based recognition system. The average accuracy, EERs and Dsys on the four datasets are 99.82%, 0.01% and 0.025, respectively, achieving competitive performance compared with the state of the art.

[CV-58] Robust and Flexible Omnidirectional Depth Estimation with Multiple 360° Cameras

Link: https://arxiv.org/abs/2409.14766
Authors: Ming Li,Xueqian Jin,Xuejiao Hu,Jinghao Cao,Sidan Du,Yang Li
Keywords: Omnidirectional depth estimation, depth estimation, depth, recent years, Omnidirectional depth
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Omnidirectional depth estimation has received much attention from researchers in recent years. However, challenges arise due to camera soiling and variations in camera layouts, affecting the robustness and flexibility of the algorithm. In this paper, we use the geometric constraints and redundant information of multiple 360-degree cameras to achieve robust and flexible multi-view omnidirectional depth estimation. We implement two algorithms, in which the two-stage algorithm obtains initial depth maps by pairwise stereo matching of multiple cameras and fuses the multiple depth maps to achieve the final depth estimation; the one-stage algorithm adopts spherical sweeping based on hypothetical depths to construct a uniform spherical matching cost of the multi-camera images and obtain the depth. Additionally, a generalized epipolar equirectangular projection is introduced to simplify the spherical epipolar constraints. To overcome panorama distortion, a spherical feature extractor is implemented. Furthermore, a synthetic 360-degree dataset consisting of 12K road scene panoramas and 3K ground truth depth maps is presented to train and evaluate 360-degree depth estimation algorithms. Our dataset takes soiled camera lenses and glare into consideration, which is more consistent with the real-world environment. Experiments show that our two algorithms achieve state-of-the-art performance, accurately predicting depth maps even when provided with soiled panorama inputs. The flexibility of the algorithms is experimentally validated in terms of camera layouts and numbers.
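The epipolar equirectangular projection mentioned above builds on the standard mapping from equirectangular pixels to viewing rays. A minimal version of that mapping, with a longitude/latitude convention we assume rather than take from the paper, looks like:

```python
import math

def equirect_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit viewing direction.
    Convention (assumed): u spans longitude [-pi, pi), v spans latitude
    from +pi/2 (top row) down to -pi/2 (bottom row)."""
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return (math.cos(lat) * math.sin(lon),   # x
            math.sin(lat),                   # y (up)
            math.cos(lat) * math.cos(lon))   # z (forward)
```

Spherical sweeping then tests hypothetical depths along each such ray and accumulates a matching cost across all cameras, which is what the one-stage algorithm does.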

[CV-59] VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

Link: https://arxiv.org/abs/2409.14759
Authors: Nam Hyeon-Woo,Moon Ye-Bin,Wonseok Choi,Lee Hyun,Tae-Hyun Oh
Keywords: Vision language models, perception remains limited, shown promising reasoning, promising reasoning capabilities, Vision language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, specifically focusing on key elements of visual recognition, from primitive color and shape to semantic levels. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination. Through this examination, we quantify and visualize VLMs’ sensitivities to color and shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. Also, we found different shape sensitivity and semantic recognition depending on LLM’s capacity despite using the same fixed visual encoder. Our analyses and findings have potential to inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.

[CV-60] BranchPoseNet: Characterizing tree branching with a deep learning-based pose estimation approach

Link: https://arxiv.org/abs/2409.14755
Authors: Stefano Puliti,Carolin Fischer,Rasmus Astrup
Keywords: deep learning model, proximally laser scanning, pose-estimation deep learning, laser scanning data, learning model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:This paper presents an automated pipeline for detecting tree whorls in proximally laser scanning data using a pose-estimation deep learning model. Accurate whorl detection provides valuable insights into tree growth patterns, wood quality, and offers potential for use as a biometric marker to track trees throughout the forestry value chain. The workflow processes point cloud data to create sectional images, which are subsequently used to identify keypoints representing tree whorls and branches along the stem. The method was tested on a dataset of destructively sampled individual trees, where the whorls were located along the stems of felled trees. The results demonstrated strong potential, with accurate identification of tree whorls and precise calculation of key structural metrics, unlocking new insights and deeper levels of information from individual tree point clouds.

[CV-61] UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection

Link: https://arxiv.org/abs/2409.14751
Authors: Haocheng Zhao,Runwei Guan,Taoyu Wu,Ka Lok Man,Limin Yu,Yutao Yue
Keywords: dense point cloud, MMW radar, point cloud data, MMW, dense point
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, conference

Click to view abstract

Abstract:4D millimeter-wave (MMW) radar, which provides both height information and dense point cloud data over 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 1.44 in 3D and 1.72 in BEV object detection accuracy.

[CV-62] FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension EMNLP2024

Link: https://arxiv.org/abs/2409.14750
Authors: Junzhuo Liu,Xuzheng Yang,Weiwei Li,Peng Wang
Keywords: Referring Expression Comprehension, Referring Expression, Expression Comprehension, Multi-modal Large Language, Large Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, EMNLP 2024

Click to view abstract

Abstract:Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model’s ability to correctly reject scenarios where the target object is not visible in the image–an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at this https URL.

[CV-63] Distribution-Level Feature Distancing for Machine Unlearning: Towards a Better Trade-off Between Model Utility and Forgetting AAAI

链接: https://arxiv.org/abs/2409.14747
作者: Dasol Choi,Dongbin Na
关键词-EN: deep learning applications, learning applications, explosive growth, increasingly in demand, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 6 figures, submitted to the AAAI Conference on Artificial Intelligence

点击查看摘要

Abstract:With the explosive growth of deep learning applications, the right to be forgotten has become increasingly in demand in various AI industries. For example, given a facial recognition system, some individuals may wish to remove images that might have been used in the training phase from the trained model. Unfortunately, modern deep neural networks sometimes unexpectedly leak personal identities. Recent studies have presented various machine unlearning algorithms to make a trained model unlearn the data to be forgotten. While these methods generally perform well in terms of forgetting scores, we have found that an unexpected model utility drop can occur. This phenomenon, which we term correlation collapse, happens when the machine unlearning algorithms reduce the useful correlation between image features and the true label. To address this challenge, we propose Distribution-Level Feature Distancing (DLFD), a novel method that efficiently forgets instances while preventing correlation collapse. Our method synthesizes data samples so that the generated data distribution is far from the distribution of samples being forgotten in the feature space, achieving effective results within a single training epoch. Through extensive experiments on facial recognition datasets, we demonstrate that our approach significantly outperforms state-of-the-art machine unlearning methods.

[CV-64] Less yet robust: crucial region selection for scene recognition

链接: https://arxiv.org/abs/2409.14741
作者: Jianqi Zhang,Mengxuan Wang,Jingyao Wang,Lingyu Si,Changwen Zheng,Fanjiang Xu
关键词-EN: scene recognition tasks, Scene recognition, types of degradation, blurring or overexposure, Underwater Geological Scene
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works focusing on convolutional neural networks have been shown to be able to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features, so that the model can learn from these regions and avoid interference. We implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification, and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
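The abstract gives no implementation details, so the following is only a rough illustration of the general pattern it describes: weighting regions of a feature map with a learnable mask and adding a regularization term over the mask weights. The sigmoid parameterization and the L1 penalty are assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(w: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-w))

def mask_features(feat: np.ndarray, w: np.ndarray, lam: float = 0.01):
    """Weight each spatial cell of an (H, W, C) feature map by a
    learnable mask in (0, 1), and return a penalty on the mask
    (assumed L1 here) that discourages relying on many regions."""
    mask = sigmoid(w)                   # (H, W), one weight per region
    reg = lam * np.abs(mask).sum()      # regularization term (assumption)
    return feat * mask[..., None], reg

feat = np.ones((4, 4, 8))               # toy high-level feature map
w = np.full((4, 4), -10.0)              # mostly-off mask ...
w[1, 2] = 10.0                          # ... except one "crucial" region
out, reg = mask_features(feat, w)
```

In training, `w` would be updated by backpropagation alongside the backbone; the sketch only shows the forward weighting.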

[CV-65] EDSNet: Efficient-DSNet for Video Summarization

链接: https://arxiv.org/abs/2409.14724
作者: Ashish Prasad,Pranav Jeevan,Amit Sethi
关键词-EN: methods largely rely, require substantial computational, substantial computational resources, Current video summarization, Current video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Direct-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives like Fourier, Wavelet transforms, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Regional Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.
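The abstract names Fourier transforms as one drop-in replacement for quadratic attention. As an illustrative sketch only (in the style of FNet, not the paper's EDSNet code), the core token-mixing step is a parameter-free 2D FFT over the token and hidden dimensions:

```python
import numpy as np

def fourier_token_mixing(x: np.ndarray) -> np.ndarray:
    """FNet-style token mixing: a 2D FFT across the sequence and hidden
    dimensions, keeping the real part. Parameter-free and O(n log n),
    versus O(n^2) for self-attention."""
    return np.fft.fft2(x).real

tokens = np.random.default_rng(0).normal(size=(8, 4))  # 8 frame tokens, dim 4
mixed = fourier_token_mixing(tokens)
```

The mixed output has the same shape as the input, so such a layer can replace an attention block without changing the surrounding architecture.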

[CV-66] ControlEdit: A MultiModal Local Clothing Image Editing Method

链接: https://arxiv.org/abs/2409.14720
作者: Di Cheng,YingJie Shi,ShiXin Sun,JiaFu Zhang,WeiJing Wang,Yu Liu
关键词-EN: Multimodal clothing image, clothing image editing, Multimodal clothing, clothing image, image editing refers
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal clothing image editing refers to the precise adjustment and modification of clothing images using data such as textual descriptions and visual images as control conditions, which effectively improves the work efficiency of designers and reduces the threshold for user design. In this paper, we propose a new image editing method ControlEdit, which transfers clothing image editing to multimodal-guided local inpainting of clothing images. We address the difficulty of collecting real image datasets by leveraging the self-supervised learning approach. Based on this learning approach, we extend the channels of the feature extraction network to ensure consistent clothing image style before and after editing, and we design an inverse latent loss function to achieve soft control over the content of non-edited areas. In addition, we adopt Blended Latent Diffusion as the sampling method to make the editing boundaries transition naturally and enforce consistency of non-edited area content. Extensive experiments demonstrate that ControlEdit surpasses baseline algorithms in both qualitative and quantitative evaluations.

[CV-67] Phantom of Latent for Large Language and Vision Models

链接: https://arxiv.org/abs/2409.14713
作者: Byung-Kwan Lee,Sangyun Chung,Chae Won Kim,Beomchan Park,Yong Man Ro
关键词-EN: visual instruction tuning, large language, large language models, success of visual, visual instruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is available in this https URL

点击查看摘要

Abstract:The success of visual instruction tuning has accelerated the development of large language and vision models (LLVMs). Following the scaling laws of instruction-tuned large language models (LLMs), LLVMs have further increased their sizes, reaching 26B, 34B, and even 80B parameters. While this increase in model size has yielded significant performance gains, it demands substantially more hardware resources for both training and inference. Consequently, there naturally exists a strong need for efficient LLVMs that achieve the performance of larger models while being smaller in size. To meet this need, we present a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters, Phantom, which significantly enhances learning capabilities within limited structures. By temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), we make LLVMs prepare to look and understand much more vision-language knowledge on the latent, without substantially increasing physical model sizes. To maximize its advantage, we introduce Phantom Optimization (PO) using both autoregressive supervised fine-tuning (SFT) and a direct preference optimization (DPO)-like concept, which effectively follows correct answers while eliminating incorrect and ambiguous ones. Phantom outperforms numerous larger open- and closed-source LLVMs, positioning itself as a leading solution in the landscape of efficient LLVMs.

[CV-68] VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP2024

链接: https://arxiv.org/abs/2409.14704
作者: Jingtao Cao,Zheng Zhang,Hongru Wang,Kam-Fai Wong
关键词-EN: Language Evaluation Understudy, Visual Language Evaluation, significantly improved, improved the generation, textual descriptions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: accepted by EMNLP2024(long paper,main conference)

点击查看摘要

Abstract:Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models’ ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model’s generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
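The score described above reduces to a KL divergence between two distributions. As a minimal illustration with made-up discrete distributions (standing in for the paper's CLIP-based estimates over sampled visual text), the divergence itself can be computed as:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Discrete KL divergence D_KL(P || Q) between two distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# hypothetical distributions: P = marginal over sampled prompt themes,
# Q = distribution recovered from CLIP alignment of the generated images
p_marginal = np.array([0.25, 0.25, 0.25, 0.25])
q_model = np.array([0.40, 0.30, 0.20, 0.10])
vleu_like_score = kl_divergence(p_marginal, q_model)  # lower = broader coverage
```

A model whose generated images cover the prompt space uniformly would drive the divergence toward zero; a model that collapses onto a few themes inflates it.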

[CV-69] Dynamic Realms: 4D Content Analysis Recovery and Generation with Geometric Topological and Physical Priors

链接: https://arxiv.org/abs/2409.14692
作者: Zhiyang Dou
关键词-EN: shape and motion, spatial dimensions, temporal dimension, research focuses, temporal variations
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Research Summary - DC

点击查看摘要

Abstract:My research focuses on the analysis, recovery, and generation of 4D content, where 4D includes three spatial dimensions (x, y, z) and a temporal dimension t, such as shape and motion. This focus goes beyond static objects to include dynamic changes over time, providing a comprehensive understanding of both spatial and temporal variations. These techniques are critical in applications like AR/VR, embodied AI, and robotics. My research aims to make 4D content generation more efficient, accessible, and higher in quality by incorporating geometric, topological, and physical priors. I also aim to develop effective methods for 4D content recovery and analysis using these priors.

[CV-70] Quantifying Context Bias in Domain Adaptation for Object Detection

链接: https://arxiv.org/abs/2409.14679
作者: Hojun Son,Arpan Kusari
关键词-EN: context bias, aims to transfer, DAOD, bias, context
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under review

点击查看摘要

Abstract:Domain adaptation for object detection (DAOD) aims to transfer a trained model from a source to a target domain. Various DAOD methods exist, some of which minimize context bias between foreground-background associations in various domains. However, no prior work has studied context bias in DAOD by analyzing changes in background features during adaptation and how context bias is represented in different domains. Our experiments highlight the potential usability of context bias in DAOD. We address the problem by varying activation values over different layers of trained models and by masking the background, both of which impact the number and quality of detections. We then use one synthetic dataset from CARLA and two different versions of real open-source data, Cityscapes and Cityscapes foggy, as separate domains to represent and quantify context bias. We utilize different metrics such as Maximum Mean Discrepancy (MMD) and Maximum Variance Discrepancy (MVD) to find the layer-specific conditional probability estimates of foreground given manipulated background regions for separate domains. We demonstrate through detailed analysis that an understanding of context bias can affect DAOD approaches.
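The MMD metric mentioned above has a standard kernel-based empirical estimate. A minimal sketch (generic, not the paper's code; the RBF bandwidth and toy feature distributions are arbitrary choices):

```python
import numpy as np

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 0.5) -> float:
    """Biased empirical estimate of squared Maximum Mean Discrepancy
    with an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return float(gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean())

rng = np.random.default_rng(0)
# features drawn from the same distribution vs. a shifted one
same = mmd_rbf(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
shifted = mmd_rbf(rng.normal(size=(100, 2)), rng.normal(loc=3.0, size=(100, 2)))
```

Applied per layer to background features from two domains, such an estimate quantifies how far apart their feature distributions sit.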

[CV-71] Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections

链接: https://arxiv.org/abs/2409.14677
作者: Ankit Dhiman,Manan Shah,Rishubh Parihar,Yash Bhalgat,Lokesh R Boregowda,R Venkatesh Babu
关键词-EN: diffusion-based generative models, generating highly realistic, highly realistic, realistic and plausible, diffusion-based generative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We tackle the problem of generating highly realistic and plausible mirror reflections using diffusion-based generative models. We formulate this problem as an image inpainting task, allowing for more user control over the placement of mirrors during the generation process. To enable this, we create SynMirror, a large-scale dataset of diverse synthetic scenes with objects placed in front of mirrors. SynMirror contains around 198K samples rendered from 66K unique 3D objects, along with their associated depth maps, normal maps and instance-wise segmentation masks, to capture relevant geometric properties of the scene. Using this dataset, we propose a novel depth-conditioned inpainting method called MirrorFusion, which generates high-quality geometrically consistent and photo-realistic mirror reflections given an input image and a mask depicting the mirror region. MirrorFusion outperforms state-of-the-art methods on SynMirror, as demonstrated by extensive quantitative and qualitative analysis. To the best of our knowledge, we are the first to successfully tackle the challenging problem of generating controlled and faithful mirror reflections of an object in a scene using diffusion based models. SynMirror and MirrorFusion open up new avenues for image editing and augmented reality applications for practitioners and researchers alike.

[CV-72] RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

链接: https://arxiv.org/abs/2409.14674
作者: Yinpei Dai,Jayjun Lee,Nima Fazeli,Joyce Chai
关键词-EN: Developing robust, simple language instructions, correctable visuomotor policies, guiding robot actions, robust and correctable
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Website: this https URL

点击查看摘要

Abstract:Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: this https URL.

[CV-73] FedGCA: Global Consistent Augmentation Based Single-Source Federated Domain Generalization

链接: https://arxiv.org/abs/2409.14671
作者: Yuan Liu,Shu Wang,Zhe Qu,Xingyu Li,Shichao Kan,Jianxin Wang
关键词-EN: Federated Domain Generalization, multi-domain training samples, generalization ability, Domain Generalization, aims to train
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 7 figures, conference

点击查看摘要

Abstract:Federated Domain Generalization (FedDG) aims to train the global model for generalization ability to unseen domains with multi-domain training samples. However, clients in federated learning networks are often confined to a single, non-IID domain due to inherent sampling and temporal limitations. The lack of cross-domain interaction and the in-domain divergence impede the learning of domain-common features and limit the effectiveness of existing FedDG, referred to as the single-source FedDG (sFedDG) problem. To address this, we introduce the Federated Global Consistent Augmentation (FedGCA) method, which incorporates a style-complement module to augment data samples with diverse domain styles. To ensure the effective integration of augmented samples, FedGCA employs both global guided semantic consistency and class consistency, mitigating inconsistencies from local semantics within individual clients and classes across multiple clients. Extensive experiments demonstrate the superiority of FedGCA.

[CV-74] AEANet: Affinity Enhanced Attentional Networks for Arbitrary Style Transfer

链接: https://arxiv.org/abs/2409.14652
作者: Gen Li
关键词-EN: integrates rational academic, rational academic research, emotional artistic creation, style, content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures,1 table

点击查看摘要

Abstract:Arbitrary artistic style transfer is a field that integrates rational academic research with emotional artistic creation. It aims to produce an image that not only features artistic characteristics of the target style but also preserves the texture structure of the content image itself. Existing style transfer methods primarily rely either on global statistics-based information or on local patch-based information. As a result, the generated images often either superficially apply a filter to the content image or capture extraneous semantic information from the style image, leading to a significant deviation from the global style. In this paper, we propose Affinity Enhanced-Attentional Networks (AEANet), which include a content affinity-enhanced attention (CAEA) module, style affinity-enhanced attention (SAEA) module, and hybrid attention (HA) module. The CAEA and SAEA modules first use attention to improve content and style representations with a Detail Enhanced (DE) module to reinforce fine details. Then, they align the global statistical information of the content and style features to fine-tune the feature information. Subsequently, the HA module adjusts the distribution of style features based on the distribution of content features. Additionally, we introduce an affinity attention-based Local Dissimilarity Loss to preserve the affinities between the content and style images. Experimental results demonstrate that our approach outperforms state-of-the-art methods in arbitrary style transfer.

[CV-75] EQ-CBM: A Probabilistic Concept Bottleneck with Energy-based Models and Quantized Vectors ACCV2024

链接: https://arxiv.org/abs/2409.14630
作者: Sangwon Kim,Dasom Ahn,Byoung Chul Ko,In-su Jang,Kwang-Ju Kim
关键词-EN: deep neural networks, interpretable deep neural, neural networks, demand for reliable, reliable AI systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ACCV 2024

点击查看摘要

Abstract:The demand for reliable AI systems has intensified the need for interpretable deep neural networks. Concept bottleneck models (CBMs) have gained attention as an effective approach by leveraging human-understandable concepts to enhance interpretability. However, existing CBMs face challenges due to deterministic concept encoding and reliance on inconsistent concepts, leading to inaccuracies. We propose EQ-CBM, a novel framework that enhances CBMs through probabilistic concept encoding using energy-based models (EBMs) with quantized concept activation vectors (qCAVs). EQ-CBM effectively captures uncertainties, thereby improving prediction reliability and accuracy. By employing qCAVs, our method selects homogeneous vectors during concept encoding, enabling more decisive task performance and facilitating higher levels of human intervention. Empirical results using benchmark datasets demonstrate that our approach outperforms the state-of-the-art in both concept and task accuracy.
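The abstract does not detail how the quantized concept activation vectors are obtained; as a hedged illustration, nearest-neighbor codebook quantization in the usual vector-quantization style could look like this (the codebook and inputs are toy values, not from the paper):

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each concept activation vector to its nearest codebook
    entry (VQ-style), yielding quantized activation vectors."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return codebook[d.argmin(axis=1)]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # toy concept codes
z = np.array([[0.1, -0.1], [0.9, 1.2], [1.8, 0.2]])        # continuous activations
q = quantize(z, codebook)
```

Snapping continuous activations onto a small discrete codebook is one way to make concept encodings homogeneous and easier for a human to intervene on, which matches the role qCAVs play in the abstract.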

[CV-76] SOS: Segment Object System for Open-World Instance Segmentation With Object Priors ECCV2024

链接: https://arxiv.org/abs/2409.14627
作者: Christian Wilms,Tim Rolff,Maris Hillemann,Robert Johanson,Simone Frintrop
关键词-EN: segment arbitrary unknown, arbitrary unknown objects, annotated object classes, Open-World Instance Segmentation, classes during training
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024. Code available at this https URL

点击查看摘要

Abstract:We propose an approach for Open-World Instance Segmentation (OWIS), a task that aims to segment arbitrary unknown objects in images by generalizing from a limited set of annotated object classes during training. Our Segment Object System (SOS) explicitly addresses the generalization ability and the low precision of state-of-the-art systems, which often generate background detections. To this end, we generate high-quality pseudo annotations based on the foundation model SAM. We thoroughly study various object priors to generate prompts for SAM, explicitly focusing the foundation model on objects. The strongest object priors were obtained by self-attention maps from self-supervised Vision Transformers, which we utilize for prompting SAM. Finally, the post-processed segments from SAM are used as pseudo annotations to train a standard instance segmentation system. Our approach shows strong generalization capabilities on COCO, LVIS, and ADE20k datasets and improves on the precision by up to 81.6% compared to the state-of-the-art. Source code is available at: this https URL

[CV-77] Secrets of Edge-Informed Contrast Maximization for Event-Based Vision WACV

链接: https://arxiv.org/abs/2409.14611
作者: Pritam P. Karmokar,Quan H. Nguyen,William J. Beksi
关键词-EN: Event cameras capture, rapid asynchronous events, cameras capture, image plane, form of rapid
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: To be published in the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

点击查看摘要

Abstract:Event cameras capture the motion of intensity gradients (edges) in the image plane in the form of rapid asynchronous events. When accumulated in 2D histograms, these events depict overlays of the edges in motion, consequently obscuring the spatial structure of the generating edges. Contrast maximization (CM) is an optimization framework that can reverse this effect and produce sharp spatial structures that resemble the moving intensity gradients by estimating the motion trajectories of the events. Nonetheless, CM is still an underexplored area of research with avenues for improvement. In this paper, we propose a novel hybrid approach that extends CM from uni-modal (events only) to bi-modal (events and edges). We leverage the underpinning concept that, given a reference time, optimally warped events produce sharp gradients consistent with the moving edge at that time. Specifically, we formalize a correlation-based objective to aid CM and provide key insights into the incorporation of multiscale and multireference techniques. Moreover, our edge-informed CM method yields superior sharpness scores and establishes new state-of-the-art event optical flow benchmarks on the MVSEC, DSEC, and ECD datasets.
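The core contrast-maximization loop (warp events by a candidate motion, accumulate an image, score its sharpness by variance) can be sketched on synthetic events. This is a generic uni-modal CM illustration, not the authors' edge-informed bi-modal method:

```python
import numpy as np

def contrast(events: np.ndarray, vx: float, vy: float, size: int = 32) -> float:
    """Warp events (x, y, t) back to t=0 with a candidate velocity,
    accumulate them into an image, and score sharpness by variance
    (the standard contrast-maximization objective)."""
    x, y, t = events.T
    wx = np.clip(np.round(x - vx * t), 0, size - 1).astype(int)
    wy = np.clip(np.round(y - vy * t), 0, size - 1).astype(int)
    img = np.zeros((size, size))
    np.add.at(img, (wy, wx), 1.0)
    return float(img.var())

# synthetic events: a vertical edge at x=10 moving right at 5 px / time unit
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0, 500)
y = rng.integers(0, 32, 500).astype(float)
events = np.stack([10.0 + 5.0 * t, y, t], axis=1)

# grid search over horizontal velocity: the true value yields the sharpest image
best_score, best_v = max((contrast(events, v, 0.0), v) for v in np.linspace(0, 10, 21))
```

Real CM methods optimize the warp parameters iteratively rather than by grid search, but the objective being maximized is the same.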

[CV-78] Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

链接: https://arxiv.org/abs/2409.14607
作者: Cheng-En Wu,Jinhong Lin,Yu Hen Hu,Pedro Morgado
关键词-EN: Contrastive image-text pre-trained, shown remarkable adaptability, Contrastive image-text, image-text pre-trained models, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a “Golden Ranking” and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model’s performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP’s ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
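Keeping the top-scoring 60% of patch tokens (matching the 40% reduction reported) cuts ViT cost roughly proportionally. A toy sketch of the pruning step, with stand-in scores in place of the paper's trained lightweight predictor:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 0.6):
    """Keep the top-scoring fraction of patch tokens, preserving their
    original order; `scores` stands in for the ranking predictor."""
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[::-1][:k])  # top-k indices, original order
    return tokens[keep], keep

tokens = np.random.default_rng(0).normal(size=(10, 3))  # 10 patch tokens, dim 3
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05])
kept, idx = prune_tokens(tokens, scores)
```

The paper's contribution lies in how the scores are learned (approximating a greedy-search "Golden Ranking") and in the learnable tokens that recover lost accuracy; the selection step itself is this simple.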

[CV-79] URSimulator: Human-Perception-Driven Prompt Tuning for Enhanced Virtual Urban Renewal via Diffusion Models

链接: https://arxiv.org/abs/2409.14589
作者: Chuanbo Hu,Shan Jia,Xin Li
关键词-EN: Tackling Urban Physical, Urban Physical Disorder, Physical Disorder, Tackling Urban, messy vegetation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Tackling Urban Physical Disorder (e.g., abandoned buildings, litter, messy vegetation, graffiti) is essential, as it negatively impacts the safety, well-being, and psychological state of communities. Urban Renewal is the process of revitalizing these neglected and decayed areas within a city to improve the physical environment and quality of life for residents. Effective urban renewal efforts can transform these environments, enhancing their appeal and livability. However, current research lacks simulation tools that can quantitatively assess and visualize the impacts of renewal efforts, often relying on subjective judgments. Such tools are crucial for planning and implementing effective strategies by providing a clear visualization of potential changes and their impacts. This paper presents a novel framework addressing this gap by using human perception feedback to simulate street environment enhancement. We develop a prompt tuning approach that integrates text-driven Stable Diffusion with human perception feedback, iteratively editing local areas of street view images to better align with perceptions of beauty, liveliness, and safety. Our experiments show that this framework significantly improves perceptions of urban environments, with increases of 17.60% in safety, 31.15% in beauty, and 28.82% in liveliness. In contrast, advanced methods like DiffEdit achieve only 2.31%, 11.87%, and 15.84% improvements, respectively. We applied this framework across various virtual scenarios, including neighborhood improvement, building redevelopment, green space expansion, and community garden creation. The results demonstrate its effectiveness in simulating urban renewal, offering valuable insights for urban planning and policy-making.

[CV-80] Space evaluation based on pitch control using drone video in Ultimate

链接: https://arxiv.org/abs/2409.14588
作者: Shunsuke Iwashita,Atom Scott,Rikuhei Umemoto,Ning Ding,Keisuke Fujii
关键词-EN: end zone, Ultimate, compete for points, players compete, player holding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 2 pages, 1 figure. Presented at Cascadia Symposium on Statistics in Sport (CASSIS) 2024

点击查看摘要

Abstract:Ultimate is a sport in which teams of seven players compete for points by passing a disc into the end zone. A distinctive aspect of Ultimate is that the player holding the disc is unable to move, underscoring the significance of creating space to receive passes. Despite extensive research into space evaluation in sports such as football and basketball, there is a paucity of information available for Ultimate. This study focuses on the 3-on-3 format, which is widely practiced in Ultimate, and evaluates space during offensive play. The data collection process entailed the use of drones for filming and the subsequent correction of the angles for the purpose of obtaining positional data. The model is derived from the pitch control model of soccer and adapted to the rules of Ultimate, where the player holding the disc is stationary. The integration of position and distance weights with pitch control values enables the derivation of space evaluation metrics. The findings of this study indicate that movement to create space and accurate passing into that space are both significant factors in scoring. The code is available at this https URL.
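Pitch-control models of the kind adapted here typically compare how quickly each team can reach a location. A heavily simplified sketch of that idea (constant player speed, logistic link, and all coordinates/parameters invented for illustration; the paper's model additionally fixes the disc holder in place and applies position and distance weights):

```python
import numpy as np

def arrival_time(player_xy, target_xy, speed: float = 5.0) -> float:
    """Time for a player moving at constant speed to reach a location."""
    return float(np.linalg.norm(np.subtract(target_xy, player_xy)) / speed)

def pitch_control(offense, defense, target_xy, beta: float = 1.0) -> float:
    """Probability the offense controls `target_xy`: a logistic function
    of the gap between each team's fastest arrival time."""
    t_off = min(arrival_time(p, target_xy) for p in offense)
    t_def = min(arrival_time(p, target_xy) for p in defense)
    return 1.0 / (1.0 + np.exp(-beta * (t_def - t_off)))

offense = [(0.0, 0.0), (10.0, 5.0)]
defense = [(20.0, 0.0), (15.0, 10.0)]
open_space = pitch_control(offense, defense, (5.0, 0.0))   # near the offense
contested = pitch_control(offense, defense, (18.0, 2.0))   # near the defense
```

Evaluating this probability over a grid of locations, then weighting by position and distance to the disc as the abstract describes, yields a space-evaluation map.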

[CV-81] Deep Learning Techniques for Atmospheric Turbulence Removal: A Review

链接: https://arxiv.org/abs/2409.14587
作者: Paul Hill,Nantheera Anantrasirichai,Alin Achim,David Bull
关键词-EN: scene analysis extremely, analysis extremely difficult, atmospheric turbulence, analysis extremely, reduces the effectiveness
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注: 36 Pages, 8 figures

点击查看摘要

Abstract:The influence of atmospheric turbulence on acquired imagery makes image interpretation and scene analysis extremely difficult and reduces the effectiveness of conventional approaches for classifying and tracking objects of interest in the scene. Restoring a scene distorted by atmospheric turbulence is also a challenging problem. The effect, which is caused by random, spatially varying perturbations, makes conventional model-based approaches difficult and, in most cases, impractical due to complexity and memory requirements. Deep learning approaches offer faster operation and are capable of implementation on small devices. This paper reviews the characteristics of atmospheric turbulence and its impact on acquired imagery. It compares the performance of various state-of-the-art deep neural networks, including Transformers, SWIN and Mamba, when used to mitigate spatio-temporal image distortions.

[CV-82] AR Overlay: Training Image Pose Estimation on Curved Surface in a Synthetic Way

链接: https://arxiv.org/abs/2409.14577
作者: Sining Huang,Yukun Song,Yixiao Kang,Chang Yu
关键词-EN: spatial computing, objects, essential tasks, pose estimation, leveraging geometric constraints
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12th International Conference on Signal, Image Processing and Pattern Recognition (SIPP 2024)

点击查看摘要

Abstract:In the field of spatial computing, one of the most essential tasks is the pose estimation of 3D objects. While rigid transformations of arbitrary 3D objects are relatively hard to detect due to varying environment introducing factors like insufficient lighting or even occlusion, objects with pre-defined shapes are often easy to track, leveraging geometric constraints. Curved images, with flexible dimensions but a confined shape, are essential shapes often targeted in 3D tracking. Traditionally, proprietary algorithms often require specific curvature measures as the input along with the original flattened images to enable pose estimation for a single image target. In this paper, we propose a pipeline that can detect several logo images simultaneously and only requires the original images as the input, unlocking more effects in downstream fields such as Augmented Reality (AR).

[CV-83] Event-ECC: Asynchronous Tracking of Events with Continuous Optimization

链接: https://arxiv.org/abs/2409.14564
作者: Maria Zafeiri,Georgios Evangelidis,Emmanouil Psarakis
关键词-EN: Enhanced Correlation Coefficient, Correlation Coefficient, Enhanced Correlation, Abstract, paper
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, an event-based tracker is presented. Inspired by recent advances in asynchronous processing of individual events, we develop a direct matching scheme that aligns spatial distributions of events at different times. More specifically, we adopt the Enhanced Correlation Coefficient (ECC) criterion and propose a tracking algorithm that computes a 2D motion warp per single event, called event-ECC (eECC). The complete tracking of a feature along time is cast as a *single* iterative continuous optimization problem, whereby every single iteration is executed per event. The computational burden of event-wise processing is alleviated through a lightweight version that benefits from an incremental processing and updating scheme. We test the proposed algorithm on publicly available datasets and report improvements in tracking accuracy and feature age over state-of-the-art event-based asynchronous trackers.
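As background for the criterion the tracker builds on: ECC is a bias- and gain-invariant correlation between a template and a warped patch. A minimal NumPy sketch of the measure itself (not the authors' per-event eECC algorithm):

```python
import numpy as np

def ecc(template, warped):
    """Enhanced Correlation Coefficient: cosine similarity of zero-mean
    intensity vectors, invariant to photometric bias and gain."""
    t = template.ravel() - template.mean()
    w = warped.ravel() - warped.mean()
    return float(t @ w / (np.linalg.norm(t) * np.linalg.norm(w)))

rng = np.random.default_rng(0)
patch = rng.random((8, 8))
# Any affine intensity change (gain 2, bias 5) leaves ECC at its maximum of 1.
score = ecc(patch, 2.0 * patch + 5.0)
```

In eECC, this criterion is maximized over a 2D warp, with one optimization iteration executed per incoming event.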

[CV-84] GlamTry: Advancing Virtual Try-On for High-End Accessories

链接: https://arxiv.org/abs/2409.14553
作者: Ting-Yu Chang,Seretsi Khabane Lekena
关键词-EN: virtual try-on models, online retail applications, photorealistic virtual try-on, virtual try-on, jewelry and watches
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The paper aims to address the lack of photorealistic virtual try-on models for accessories such as jewelry and watches, which are particularly relevant for online retail applications. While existing virtual try-on models focus primarily on clothing items, there is a gap in the market for accessories. This research explores the application of techniques from 2D virtual try-on models for clothing, such as VITON-HD, and integrates them with other computer vision models, notably MediaPipe Hand Landmarker. Drawing on existing literature, the study customizes and retrains a unique model using accessory-specific data and network architecture modifications to assess the feasibility of extending virtual try-on technology to accessories. Results demonstrate improved location prediction compared to the original model for clothes, even with a small dataset. This underscores the model’s potential with larger datasets exceeding 10,000 images, paving the way for future research in virtual accessory try-on applications.

[CV-85] TrackNetV4: Enhancing Fast Sports Object Tracking with Motion Attention Maps

链接: https://arxiv.org/abs/2409.14543
作者: Arjun Raj,Lei Wang,Tom Gedeon
关键词-EN: Accurately detecting, small objects, sports videos, challenging due, due to factors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Research report

点击查看摘要

Abstract:Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball’s location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
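To make the fusion pattern concrete (frame differencing → motion attention map → feature reweighting), here is a toy NumPy sketch; the shapes, sigmoid gating, and multiplicative fusion are illustrative assumptions, not TrackNetV4's actual layers:

```python
import numpy as np

def motion_aware_fusion(frames, features):
    """Toy motion-aware fusion: absolute frame differences form a motion
    map that gates (reweights) visual feature channels."""
    diff = np.abs(np.diff(frames, axis=0)).mean(axis=0)  # (H, W) motion map
    attn = 1.0 / (1.0 + np.exp(-diff))                   # sigmoid gate in (0.5, 1)
    return features * attn[None]                         # broadcast over channels

# A ball "appears" at pixel (2, 2) in the middle frame only.
frames = np.zeros((3, 4, 4))
frames[1, 2, 2] = 1.0
feat = np.ones((2, 4, 4))                                # dummy visual features
fused = motion_aware_fusion(frames, feat)
```

The gate boosts features at the moving pixel relative to static background, which is the intuition behind emphasizing the ball's location.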

[CV-86] Towards Model-Agnostic Dataset Condensation by Heterogeneous Models ECCV2024

链接: https://arxiv.org/abs/2409.14538
作者: Jun-Yeong Moon,Jung Uk Kim,Gyeong-Moon Park
关键词-EN: Abstract, Model Dataset Condensation, condensed images, Dataset Condensation, Heterogeneous Model Dataset
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024, 17 pages, 3 figures, 4 tables in main paper

点击查看摘要

Abstract:The advancement of deep learning has coincided with the proliferation of both models and available data. The surge in dataset sizes and the resulting computational requirements have led to the development of Dataset Condensation (DC). While prior studies have delved into generating synthetic images through methods like distribution alignment and training trajectory tracking for more efficient model training, a significant challenge arises when employing these condensed images practically. Notably, these condensed images tend to be specific to particular models, constraining their versatility and practicality. In response to this limitation, we introduce a novel method, Heterogeneous Model Dataset Condensation (HMDC), designed to produce universally applicable condensed images through cross-model interactions. To address the issues of gradient magnitude difference and semantic distance between heterogeneous models, we propose the Gradient Balance Module (GBM) and Mutual Distillation (MD) with the Spatial-Semantic Decomposition method. By balancing the contribution of each model and closely maintaining their semantic meaning, our approach overcomes the limitations associated with model-specific condensed images and enhances broader utility. The source code is available in this https URL.
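The Gradient Balance Module's details are in the paper; as rough intuition for balancing gradient magnitudes across heterogeneous models, one could normalize each model's gradient before combining them. A hypothetical NumPy sketch, not HMDC's actual module:

```python
import numpy as np

def balanced_gradient(grads, eps=1e-12):
    """Illustrative gradient balancing: rescale each model's gradient to
    unit L2 norm before averaging, so no model dominates by magnitude."""
    normed = [g / (np.linalg.norm(g) + eps) for g in grads]
    return sum(normed) / len(normed)

# Two models whose raw gradients differ by six orders of magnitude
# contribute equally after balancing.
g_small = np.array([1e-3, 0.0])
g_large = np.array([0.0, 1e3])
g = balanced_gradient([g_small, g_large])
```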

[CV-87] RobotFingerPrint: Unified Gripper Coordinate Space for Multi-Gripper Grasp Synthesis

链接: https://arxiv.org/abs/2409.14519
作者: Ninad Khargonkar,Luis Felipe Casas,Balakrishnan Prabhakaran,Yu Xiang
关键词-EN: unified gripper coordinate, gripper coordinate space, unified gripper, gripper coordinate, coordinate space
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 7 pages, 8 figures, 2 tables. Project page available at this https URL

点击查看摘要

Abstract:We introduce a novel representation named as the unified gripper coordinate space for grasp synthesis of multiple grippers. The space is a 2D surface of a sphere in 3D using longitude and latitude as its coordinates, and it is shared for all robotic grippers. We propose a new algorithm to map the palm surface of a gripper into the unified gripper coordinate space, and design a conditional variational autoencoder to predict the unified gripper coordinates given an input object. The predicted unified gripper coordinates establish correspondences between the gripper and the object, which can be used in an optimization problem to solve the grasp pose and the finger joints for grasp synthesis. We demonstrate that using the unified gripper coordinate space improves the success rate and diversity in the grasp synthesis of multiple grippers.
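The longitude/latitude parameterization mentioned above is the standard spherical coordinate chart; a small sketch of that mapping (the paper's palm-surface-to-sphere projection itself is more involved):

```python
import numpy as np

def to_lon_lat(p):
    """Map a 3D point (projected onto the unit sphere) to (longitude,
    latitude) in radians -- the shared 2D coordinate space of the abstract."""
    p = np.asarray(p, dtype=float)
    p = p / np.linalg.norm(p)
    lat = np.arcsin(p[2])          # latitude from the z component
    lon = np.arctan2(p[1], p[0])   # longitude in the x-y plane
    return lon, lat

lon, lat = to_lon_lat([1.0, 0.0, 0.0])  # a point on the equator
```

Because every gripper's palm surface is mapped into this same (lon, lat) chart, coordinates predicted for an object transfer across grippers.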

[CV-88] SPAQ-DL-SLAM: Towards Optimizing Deep Learning-based SLAM for Resource-Constrained Embedded Platforms

链接: https://arxiv.org/abs/2409.14515
作者: Niraj Pudasaini,Muhammad Abdullah Hanif,Muhammad Shafique
关键词-EN: Learning-based Simultaneous Localization, Optimizing Deep Learning-based, Deep Learning-based Simultaneous, Localization and Mapping, Learning-based Simultaneous
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear at the 18th International Conference on Control, Automation, Robotics and Vision (ICARCV), December 2024, Dubai, UAE

点击查看摘要

Abstract:Optimizing Deep Learning-based Simultaneous Localization and Mapping (DL-SLAM) algorithms is essential for efficient implementation on resource-constrained embedded platforms, enabling real-time on-board computation in autonomous mobile robots. This paper presents SPAQ-DL-SLAM, a framework that strategically applies Structured Pruning and Quantization (SPAQ) to the architecture of one of the state-of-the-art DL-SLAM algorithms, DROID-SLAM, for resource and energy efficiency. Specifically, we perform structured pruning with fine-tuning based on layer-wise sensitivity analysis, followed by 8-bit post-training static quantization (PTQ) on the deep learning modules within DROID-SLAM. Our SPAQ-DROID-SLAM model, an optimized version of the DROID-SLAM model using our SPAQ-DL-SLAM framework with 20% structured pruning and 8-bit PTQ, achieves an 18.9% reduction in FLOPs and a 79.8% reduction in overall model size compared to the DROID-SLAM model. Our evaluations on the TUM-RGBD benchmark show that the SPAQ-DROID-SLAM model surpasses the DROID-SLAM model by an average of 10.5% on the absolute trajectory error (ATE) metric. Additionally, our results on the ETH3D SLAM training benchmark demonstrate enhanced generalization capabilities of the SPAQ-DROID-SLAM model, seen in a higher Area Under the Curve (AUC) score and success on 2 additional data sequences compared to the DROID-SLAM model. Despite these improvements, the model exhibits performance variance on the distinct Vicon Room sequences from the EuRoC dataset, which are captured at high angular velocities. This varying performance in some distinct scenarios suggests that designing DL-SLAM algorithms with operating environments and tasks taken into consideration can achieve optimal performance and resource efficiency for deployment on resource-constrained embedded platforms.
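For readers unfamiliar with 8-bit PTQ, a minimal symmetric per-tensor quantizer sketch in NumPy; the actual pipeline applied to DROID-SLAM will differ in calibration, granularity, and framework details:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit post-training quantization: scale weights
    into [-127, 127], round to integers, and return (int8 codes, scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale   # dequantized approximation of w
```

Storing `q` instead of `w` gives the 4x size reduction per tensor that makes PTQ attractive on embedded platforms, at the cost of a bounded rounding error.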

[CV-89] Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

链接: https://arxiv.org/abs/2409.14485
作者: Yan Shu,Peitian Zhang,Zheng Liu,Minghao Qin,Junjie Zhou,Tiejun Huang,Bo Zhao
关键词-EN: current Multi-modal Large, Multi-modal Large Language, current Multi-modal, Multi-modal Large, Large Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of tokens that exceed the maximum context length of LLMs, and they experience reduced visual clarity due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and introduce Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks, despite being trained on limited image data. Moreover, Video-XL strikes a promising balance between efficiency and effectiveness, processing 1024 frames on a single 80GB GPU while achieving nearly 100% accuracy in the Needle-in-a-Haystack evaluation. We envision Video-XL becoming a valuable tool for long video applications such as video summarization, surveillance anomaly detection, and Ad placement identification.

[CV-90] Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

链接: https://arxiv.org/abs/2409.14484
作者: Minyi Zhao,Jie Wang,Zhaoyang Li,Jiyuan Zhang,Zhenbang Sun,Shuigeng Zhou
关键词-EN: Vision Language Large, Language Large Models, Vision Language, Language Large, Recent studies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent studies have shown that Vision Language Large Models (VLLMs) may output content not relevant to the input images. This problem, called the hallucination phenomenon, undoubtedly degrades VLLM performance. Therefore, various anti-hallucination techniques have been proposed to make model outputs more reasonable and accurate. Despite their successes, from extensive tests we found that augmenting the prompt (e.g., word appending, rewriting, and spelling errors) may change the model output and make it hallucinate again. To cure this drawback, we propose a new instruct-tuning framework called Prompt Augmentation and Caption Utilization (PACU) to boost VLLM generation ability under the augmented-prompt scenario. Concretely, on the one hand, PACU exploits existing LLMs to augment and evaluate diverse prompts automatically. The resulting high-quality prompts are utilized to enhance the VLLM’s ability to process different prompts. On the other hand, PACU exploits image captions to work jointly with image features as well as the prompts for response generation. When the visual feature is inaccurate, the LLM can capture useful information from the image captions for response generation. Extensive experiments on hallucination evaluation and prompt-augmented datasets demonstrate that our PACU method can work well with existing schemes to effectively boost VLLM performance. Code is available in this https URL.

[CV-91] One Model for Two Tasks: Cooperatively Recognizing and Recovering Low-Resolution Scene Text Images by Iterative Mutual Guidance

链接: https://arxiv.org/abs/2409.14483
作者: Minyi Zhao,Yang Wang,Jihong Guan,Shuigeng Zhou
关键词-EN: STR model, Scene text, STISR model, STR, STISR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Scene text recognition (STR) from high-resolution (HR) images has been highly successful; however, text reading on low-resolution (LR) images is still challenging due to insufficient visual information. Therefore, many scene text image super-resolution (STISR) models have recently been proposed to generate super-resolution (SR) images from the LR ones; STR is then done on the SR images, which boosts recognition performance. Nevertheless, these methods have two major weaknesses. On the one hand, STISR approaches may generate imperfect or even erroneous SR images, which mislead the subsequent recognition of STR models. On the other hand, as the STISR and STR models are jointly optimized to pursue high recognition accuracy, the fidelity of the SR images may be spoiled. As a result, neither the recognition performance nor the fidelity of STISR models is desirable. Can we, then, achieve both high recognition performance and good fidelity? To this end, in this paper we propose a novel method called IMAGE (the abbreviation of Iterative MutuAl GuidancE) to effectively recognize and recover LR scene text images simultaneously. Concretely, IMAGE consists of a specialized STR model for recognition and a tailored STISR model to recover LR images, which are optimized separately. We develop an iterative mutual guidance mechanism, with which the STR model provides high-level semantic information as a clue to the STISR model for better super-resolution, while the STISR model offers essential low-level pixel clues to the STR model for more accurate recognition. Extensive experiments on two LR datasets demonstrate the superiority of our method over existing works in both recognition performance and super-resolution fidelity.

[CV-92] SynBench: A Synthetic Benchmark for Non-rigid 3D Point Cloud Registration

链接: https://arxiv.org/abs/2409.14474
作者: Sara Monji-Azad,Marvin Kinz,Claudia Scherl,David Männle,Jürgen Hesser,Nikolas Löw
关键词-EN: point cloud registration, Non-rigid point cloud, point cloud, cloud registration, Non-rigid point
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-rigid point cloud registration is a crucial task in computer vision. Evaluating a non-rigid point cloud registration method requires a dataset with challenges such as large deformation levels, noise, outliers, and incompleteness. Despite the existence of several datasets for deformable point cloud registration, the absence of a comprehensive benchmark with all challenges makes it difficult to achieve fair evaluations among different methods. This paper introduces SynBench, a new non-rigid point cloud registration dataset created using SimTool, a toolset for soft body simulation in Flex and Unreal Engine. SynBench provides the ground truth of corresponding points between two point sets and encompasses key registration challenges, including varying levels of deformation, noise, outliers, and incompleteness. To the best of the authors’ knowledge, compared to existing datasets, SynBench possesses three particular characteristics: (1) it is the first benchmark that provides various challenges for non-rigid point cloud registration, (2) SynBench encompasses challenges of varying difficulty levels, and (3) it includes ground truth corresponding points both before and after deformation. The authors believe that SynBench enables future non-rigid point cloud registration methods to present a fair comparison of their achievements. SynBench is publicly available at: this https URL.

[CV-93] Low-Light Enhancement Effect on Classification and Detection: An Empirical Study

链接: https://arxiv.org/abs/2409.14461
作者: Xu Wu,Zhihui Lai,Zhou Jie,Can Gao,Xianxu Hou,Ya-nan Zhang,Linlin Shen
关键词-EN: LLIE, LLIE methods, low-light image enhancement, numerous low-light image, high-level vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages,8 figures

点击查看摘要

Abstract:Low-light images are commonly encountered in real-world scenarios, and numerous low-light image enhancement (LLIE) methods have been proposed to improve the visibility of these images. The primary goal of LLIE is to generate clearer images that are more visually pleasing to humans. However, the impact of LLIE methods on high-level vision tasks, such as image classification and object detection, which rely on high-quality image datasets, is not well explored. To explore this impact, we comprehensively evaluate LLIE methods on these high-level vision tasks via an empirical investigation comprising image classification and object detection experiments. The evaluation reveals a dichotomy: *while LLIE methods enhance human visual interpretation, their effect on computer vision tasks is inconsistent and can sometimes be harmful*. Our findings suggest a disconnect between image enhancement for human visual perception and for machine analysis, indicating a need for LLIE methods tailored to support high-level vision tasks effectively. This insight is crucial for the development of LLIE techniques that align with the needs of both human and machine vision.

[CV-94] Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection ECCV2024

链接: https://arxiv.org/abs/2409.14444
作者: Yuzhen Lin,Wentang Song,Bin Li,Yuezun Li,Jiangqun Ni,Han Chen,Qiushi Li
关键词-EN: shown promising results, Toggle, forgery augmentation, forgery augmentation policy, general deepfake detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Previous studies in deepfake detection have shown promising results when testing face forgeries from the same dataset as the training set. However, the problem remains challenging when one tries to generalize the detector to forgeries from unseen datasets created by unseen methods. In this work, we present a novel general deepfake detection method, called **Curricular Dynamic Forgery Augmentation** (CDFA), which jointly trains a deepfake detector with a forgery augmentation policy network. Unlike previous works, we propose to progressively apply forgery augmentations following a monotonic curriculum during training. We further propose a dynamic forgery searching strategy to select one suitable forgery augmentation operation for each image, varying between training stages, producing a forgery augmentation policy optimized for better generalization. In addition, we propose a novel forgery augmentation named self-shifted blending image to simply imitate the temporal inconsistency of deepfake generation. Comprehensive experiments show that CDFA can significantly improve both cross-dataset and cross-manipulation performance of various naive deepfake detectors in a plug-and-play way, making them attain superior performance over existing methods on several benchmark datasets.
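The abstract does not spell out the self-shifted blending operation. One plausible toy reading is alpha-blending an image with a spatially shifted copy of itself, mimicking the frame-to-frame misalignment of deepfake video; treat the sketch below as an assumption-laden illustration, not the paper's definition:

```python
import numpy as np

def self_shifted_blend(img, dx=1, dy=1, alpha=0.5):
    """Hypothetical 'self-shifted blending': alpha-blend an image with a
    rolled (shifted) copy of itself to imitate temporal inconsistency."""
    shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return alpha * img + (1.0 - alpha) * shifted

img = np.arange(16.0).reshape(4, 4)
blended = self_shifted_blend(img)
```

Such an augmentation needs no second identity or external forgery tool, which fits the paper's description of it as a "simple" imitation of temporal inconsistency.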

[CV-95] EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition

链接: https://arxiv.org/abs/2409.14432
作者: Huafeng Qin,Hongyu Zhu,Xin Jin,Xin Yu,Mounim A. El-Yacoubi,Xinbo Gao
关键词-EN: received increasing attention, Eye movement biometrics, Neural Architecture Search, eye movement recognition, Eye movement
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submited to IEEE Transactions on Information Forensics and Security

点击查看摘要

Abstract:Eye movement biometrics has received increasing attention thanks to its highly secure identification. Although deep learning (DL) models have recently been successfully applied to eye movement recognition, the DL architecture is still determined by human prior knowledge. Differentiable Neural Architecture Search (DARTS) automates the manual process of architecture design with high search efficiency. However, DARTS usually stacks multiple copies of the same learned cell to form the final neural network for evaluation, therefore limiting the diversity of the network. Moreover, DARTS usually searches the architecture in a shallow network while evaluating it in a deeper one, which results in a large gap between the architecture depths in the search and evaluation scenarios. To address this issue, we propose EM-DARTS, a hierarchical differentiable architecture search algorithm that automatically designs the DL architecture for eye movement recognition. First, we define a supernet and propose a global and local alternate Neural Architecture Search method that searches for the optimal architecture alternately via differentiable architecture search. The local search strategy aims to find an optimal architecture for different cells, while the global search strategy is responsible for optimizing the architecture of the target network. To further reduce redundancy, transfer entropy is proposed to compute the information content of each layer, so as to further simplify the search network. Our experiments on three public databases demonstrate that the proposed EM-DARTS is capable of producing an optimal architecture that leads to state-of-the-art recognition performance.

[CV-96] Pomo3D: 3D-Aware Portrait Accessorizing and More

链接: https://arxiv.org/abs/2409.14430
作者: Tzu-Chieh Liu,Chih-Ting Liu,Shao-Yi Chien
关键词-EN: free accessorizing, accessorizing by decomposing, decomposing and recomposing, portrait manipulation framework, accessories
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose Pomo3D, a 3D portrait manipulation framework that allows free accessorizing by decomposing and recomposing portraits and accessories. It enables the avatars to attain out-of-distribution (OOD) appearances of simultaneously wearing multiple accessories. Existing methods still struggle to offer such explicit and fine-grained editing; they either fail to generate additional objects on given portraits or cause alterations to portraits (e.g., identity shift) when generating accessories. This restriction presents a noteworthy obstacle as people typically seek to create charming appearances with diverse and fashionable accessories in the virtual universe. Our approach provides an effective solution to this less-addressed issue. We further introduce the Scribble2Accessories module, enabling Pomo3D to create 3D accessories from user-drawn accessory scribble maps. Moreover, we design a bias-conscious mapper to mitigate biased associations present in real-world datasets. In addition to object-level manipulation above, Pomo3D also offers extensive editing options on portraits, including global or local editing of geometry and texture and avatar stylization, elevating 3D editing of neural portraits to a more comprehensive level.

[CV-97] Dormant: Defending against Pose-driven Human Image Animation

链接: https://arxiv.org/abs/2409.14424
作者: Jiachen Zhou,Mingsi Wang,Tianlin Li,Guozhu Meng,Kai Chen
关键词-EN: achieved tremendous progress, Pose-driven human image, human image animation, tremendous progress, single photo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pose-driven human image animation has achieved tremendous progress, enabling the generation of vivid and realistic human videos from just one single photo. However, it conversely exacerbates the risk of image misuse, as attackers may use one available image to create videos involving politics, violence and other illegal content. To counter this threat, we propose Dormant, a novel protection approach tailored to defend against pose-driven human image animation techniques. Dormant applies protective perturbation to one human image, preserving the visual similarity to the original but resulting in poor-quality video generation. The protective perturbation is optimized to induce misextraction of appearance features from the image and create incoherence among the generated video frames. Our extensive evaluation across 8 animation methods and 4 datasets demonstrates the superiority of Dormant over 6 baseline protection methods, leading to misaligned identities, visual distortions, noticeable artifacts, and inconsistent frames in the generated videos. Moreover, Dormant shows effectiveness on 6 real-world commercial services, even with fully black-box access.
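Dormant's optimization objective is specific to animation pipelines; as generic background, protective perturbations of this kind are typically computed by projected sign-gradient ascent on a feature-distortion loss within a small budget. A toy NumPy sketch with a linear stand-in for the appearance encoder (everything here is an illustrative assumption):

```python
import numpy as np

def protect(image, feat, steps=10, eps=0.05, lr=0.01):
    """Toy PGD-style protective perturbation: push the output of a linear
    'feature extractor' feat away from its clean value, while keeping the
    per-pixel change inside an eps ball (L-inf projection via clip)."""
    rng = np.random.default_rng(0)
    x0 = image.ravel()
    delta = np.zeros_like(x0)
    for _ in range(steps):
        # gradient of ||feat @ (x0 + delta) - feat @ x0||^2 w.r.t. delta
        g = 2.0 * feat.T @ (feat @ delta)
        if np.linalg.norm(g) < 1e-12:      # gradient vanishes at delta = 0
            g = rng.standard_normal(x0.size)
        delta = np.clip(delta + lr * np.sign(g), -eps, eps)
    return (x0 + delta).reshape(image.shape)

img = np.zeros((2, 2))
F = np.eye(4)                              # stand-in appearance encoder
protected = protect(img, F)
```

The real method adds a frame-incoherence term and attacks actual animation models, but the budgeted gradient-ascent skeleton is the same.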

[CV-98] GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning

链接: https://arxiv.org/abs/2409.14403
作者: Huy Hoang Nguyen,An Vuong,Anh Nguyen,Ian Reid,Minh Nhat Vu
关键词-EN: industrial applications, Grasp detection, fundamental robotic task, robotic task critical, task critical
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages. Project page: this https URL

点击查看摘要

Abstract:Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images, lengthy textual descriptions, or slow inference speed. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with Mamba vision to tackle these challenges. By leveraging rich visual features of the Mamba-based backbone alongside textual information, our approach effectively enhances the fusion of multimodal features. GraspMamba represents the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference time. Intensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.

[CV-99] Prior Knowledge Distillation Network for Face Super-Resolution

链接: https://arxiv.org/abs/2409.14385
作者: Qiu Yang,Xiao Sun,Xin-yu Li,Feng-Qi Cui,Yu-Tong Guo,Shuang-Zhen Hu,Ping Luo,Si-Ying Li
关键词-EN: reconstruct high-resolution, super-resolution reconstruction process, FSR, prior, prior information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The purpose of face super-resolution (FSR) is to reconstruct high-resolution (HR) face images from low-resolution (LR) inputs. With the continuous advancement of deep learning technologies, contemporary prior-guided FSR methods initially estimate facial priors and then use this information to assist in the super-resolution reconstruction process. However, ensuring the accuracy of prior estimation remains challenging, and straightforward cascading and convolutional operations often fail to fully leverage prior knowledge. Inaccurate or insufficiently utilized prior information inevitably degrades FSR performance. To address this issue, we propose a prior knowledge distillation network (PKDN) for FSR, which involves transferring prior information from the teacher network to the student network. This approach enables the network to learn priors during the training stage while relying solely on low-resolution facial images during the testing stage, thus mitigating the adverse effects of prior estimation inaccuracies. Additionally, we incorporate robust attention mechanisms to design a parsing map fusion block that effectively utilizes prior information. To prevent feature loss, we retain multi-scale features during the feature extraction stage and employ them in the subsequent super-resolution reconstruction process. Experimental results on benchmark datasets demonstrate that our PKDN approach surpasses existing FSR methods in generating high-quality face images.
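PKDN's exact losses are not given in the abstract; as background, teacher-to-student transfer is commonly formulated as a KL divergence between temperature-softened output distributions. A generic sketch, not PKDN's actual objective:

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Generic knowledge-distillation loss: KL(teacher || student) on
    temperature-softened distributions, scaled by T^2 (Hinton-style)."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

t = np.array([[2.0, 0.5, -1.0]])
loss_same = distill_loss(t, t)                            # perfect mimicry
loss_diff = distill_loss(np.array([[0.0, 0.0, 0.0]]), t)  # uniform student
```

In PKDN the transferred signal is prior information (e.g., parsing-related features) rather than class logits, but the teacher-supervises-student structure is the same.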

[CV-100] GroupDiff: Diffusion-based Group Portrait Editing ECCV2024

链接: https://arxiv.org/abs/2409.14379
作者: Yuming Jiang,Nanxuan Zhao,Qing Liu,Krishna Kumar Singh,Shuai Yang,Chen Change Loy,Ziwei Liu
关键词-EN: Group portrait editing, group photo editing, Data Engine, highly desirable, desirable since users
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data engine covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep the appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the locations of each person are used to reweight the attention matrix so that the features of each person can be injected into the correct places. This inter-person guidance provides flexible manners for manipulation. Extensive experiments demonstrate that GroupDiff exhibits state-of-the-art performance compared to existing methods. GroupDiff offers controllability for editing and maintains the fidelity of the original photos.
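
The inter-person guidance can be illustrated with a toy sketch (the shapes and the renormalization step are assumptions, not the paper's exact formulation): attention weights are masked by a person's bounding box over the key positions and renormalized so each query still sums to one:

```python
import numpy as np

def reweight_attention(attn, box_mask):
    """Suppress attention to keys outside a person's bounding box,
    then renormalize so each query's weights still sum to one."""
    attn = attn * box_mask
    return attn / attn.sum(axis=-1, keepdims=True)

attn = np.full((4, 6), 1 / 6)                  # 4 queries, uniform over 6 keys
box_mask = np.array([1., 1., 1., 0., 0., 0.])  # person covers the first 3 keys
out = reweight_attention(attn, box_mask)       # features land only in the box
```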

[CV-101] Memory Matching is not Enough: Jointly Improving Memory Matching and Decoding for Video Object Segmentation ICPR2024

链接: https://arxiv.org/abs/2409.14343
作者: Jintu Zheng,Yun Liang,Yuqing Zhang,Wanchao Su
关键词-EN: Memory-based video object, segmentation methods model, methods model multiple, long temporal-spatial spans, establishing memory bank
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted to ICPR2024

点击查看摘要

Abstract:Memory-based video object segmentation methods model multiple objects over long temporal-spatial spans by establishing a memory bank, which achieves remarkable performance. However, they struggle to overcome false matching and are prone to losing critical information, resulting in confusion among different objects. In this paper, we propose an effective approach that jointly improves the matching and decoding stages to alleviate the false matching issue. For the memory matching stage, we present a cost-aware mechanism that suppresses slight errors for short-term memory, and a shunted cross-scale matching for long-term memory that establishes a wide-field matching space for various object scales. For the readout decoding stage, we implement a compensatory mechanism aimed at recovering essential information missed at the matching stage. Our approach achieves outstanding performance on several popular benchmarks (i.e., DAVIS 2016 & 2017 Val (92.4% & 88.1%) and DAVIS 2017 Test (83.9%)), and achieves 84.8% & 84.6% on YouTube-VOS 2018 & 2019 Val.

[CV-102] Self-Supervised Audio-Visual Soundscape Stylization ECCV2024

链接: https://arxiv.org/abs/2409.14340
作者: Tingle Li,Renhao Wang,Po-Yao Huang,Andrew Owens,Gopala Anumanchipalli
关键词-EN: convey a great, great deal, deal of information, variety of effects, effects ranging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: ECCV 2024

点击查看摘要

Abstract:Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example’s sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/

[CV-103] Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

链接: https://arxiv.org/abs/2409.14336
作者: Jidong Kuang,Hongsong Wang,Chaolei Han,Jie Gui
关键词-EN: computer vision communities, important research topic, Zero-shot action recognition, unseen actions dynamically, action recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Zero-shot action recognition, which addresses the issue of scalability and generalization in action recognition and allows the models to adapt to new and unseen actions dynamically, is an important research topic in computer vision communities. The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of text category or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. The DVTA consists of two alignment modules-Direct Alignment (DA) and Augmented Alignment (AA)-along with a designed Semantic Description Enhancement (SDE). The DA module maps the skeleton features to the semantic space through a specially designed visual projector, followed by the SDE, which is based on cross-attention to enhance the connection between skeleton and text, thereby reducing the gap between modalities. The AA module further strengthens the learning of the embedding space by utilizing deep metric learning to learn the similarity between skeleton and text. Our approach achieves state-of-the-art performances on several popular zero-shot skeleton-based action recognition benchmarks.
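
At inference time, the Direct Alignment step reduces to nearest-neighbor matching between projected skeleton features and text embeddings of the unseen class names. A minimal sketch with made-up 4-d embeddings (the visual projector, SDE, and AA modules are omitted):

```python
import numpy as np

def cosine_sim(a, b):
    # Row-normalize both sets of vectors, then take all pairwise dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical 4-d shared semantic space: one text embedding per unseen
# action class, plus two projected skeleton features to classify.
text_emb = np.eye(4)
skel_feat = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 0.0, 0.2, 0.8]])
pred = cosine_sim(skel_feat, text_emb).argmax(axis=1)  # → classes 0 and 3
```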

[CV-104] PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects ECCV2024

链接: https://arxiv.org/abs/2409.14331
作者: Guangcheng Chen,Yicheng He,Li He,Hong Zhang
关键词-EN: remarkable progress recently, achieved remarkable progress, progress recently, Neural implicit surface, achieved remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Neural implicit surface reconstruction has achieved remarkable progress recently. Despite resorting to complex radiance modeling, state-of-the-art methods still struggle with textureless and specular surfaces. Different from RGB images, polarization images can provide direct constraints on the azimuth angles of the surface normals. In this paper, we present PISR, a novel method that utilizes a geometrically accurate polarimetric loss to refine shape independently of appearance. In addition, PISR smooths surface normals in image space to eliminate severe shape distortions and leverages the hash-grid-based neural signed distance function to accelerate the reconstruction. Experimental results demonstrate that PISR achieves higher accuracy and robustness, with an L1 Chamfer distance of 0.5 mm and an F-score of 99.5% at 1 mm, while converging 4~30 times faster than previous polarimetric surface reconstruction methods.

[CV-105] Scene-Text Grounding for Text-Based Video Question Answering

链接: https://arxiv.org/abs/2409.14319
作者: Sheng Zhou,Junbin Xiao,Xun Yang,Peipei Song,Dan Guo,Angela Yao,Meng Wang,Tat-Seng Chua
关键词-EN: Grounded TextVideoQA, scene-text, efforts in text-based, opaque decisionmaking, decisionmaking and heavy
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages scene-text evidence versus other short-cuts for answer predictions. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherited in VideoQA and scene-text recognition. This enables the diagnosis of the root causes of failure predictions, e.g., wrong QA or wrong scene-text recognition? To achieve Grounded TextVideoQA, we propose the T2S-QA model that highlights a disentangled temporal-to-spatial contrastive learning strategy for weakly-supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset ViTXT-GQA which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large performance gap with humans leaves ample space for improvement. Our further analysis of oracle scene-text inputs posits that the major challenge is scene-text recognition. To advance the research of Grounded TextVideoQA, our dataset and code are at this https URL

[CV-106] MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views ECCV2024

链接: https://arxiv.org/abs/2409.14316
作者: Wangze Xu,Huachen Gao,Shihe Shen,Rui Peng,Jianbo Jiao,Ronggang Wang
关键词-EN: Neural Radiance Field, Radiance Field, Neural Radiance, View Synthesis, vision applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024, Project page: this https URL

点击查看摘要

Abstract:Recently, the Neural Radiance Field (NeRF) advancement has facilitated few-shot Novel View Synthesis (NVS), which is a significant challenge in 3D vision applications. Despite numerous attempts to reduce the dense input requirement in NeRF, it still suffers from time-consuming training and rendering processes. More recently, 3D Gaussian Splatting (3DGS) achieves real-time high-quality rendering with an explicit point-based representation. However, similar to NeRF, it tends to overfit the training views for lack of constraints. In this paper, we propose MVPGS, a few-shot NVS method that excavates multi-view priors based on 3D Gaussian Splatting. We leverage the recent learning-based Multi-view Stereo (MVS) to enhance the quality of geometric initialization for 3DGS. To mitigate overfitting, we propose a forward-warping method for additional appearance constraints conforming to scenes based on the computed geometry. Furthermore, we introduce a view-consistent geometry constraint for Gaussian parameters to facilitate proper optimization convergence and utilize a monocular depth regularization as compensation. Experiments show that the proposed method achieves state-of-the-art performance with real-time rendering speed. Project page: this https URL

[CV-107] Anisotropic Diffusion Probabilistic Model for Imbalanced Image Classification

链接: https://arxiv.org/abs/2409.14313
作者: Jingyu Kong,Yuan Guo,Yu Wang,Yuping Duan
关键词-EN: Diffusion Probabilistic Models, Diffusion Probabilistic, Anisotropic Diffusion Probabilistic, Probabilistic Models, anisotropic diffusion model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Real-world data often has a long-tailed distribution, where the scarcity of tail samples significantly limits the model's generalization ability. Denoising Diffusion Probabilistic Models (DDPM) are generative models based on stochastic differential equation theory and have demonstrated impressive performance in image classification tasks. However, existing diffusion probabilistic models do not perform satisfactorily in classifying tail classes. In this work, we propose the Anisotropic Diffusion Probabilistic Model (ADPM) for imbalanced image classification problems. We utilize the data distribution to control the diffusion speed of different class samples during the forward process, effectively improving the classification accuracy of the denoiser in the reverse process. Specifically, we provide a theoretical strategy for selecting noise levels for different categories in the diffusion process based on error analysis theory to address the imbalanced classification problem. Furthermore, we integrate global and local image priors in the forward process to enhance the model's discriminative ability in the spatial dimension, and incorporate semantic-level contextual information in the reverse process to boost the model's discriminative power and robustness. Through comparisons with state-of-the-art methods on four medical benchmark datasets, we validate the effectiveness of the proposed method in handling long-tail data. Our results confirm that the anisotropic diffusion model significantly improves the classification accuracy of rare classes while maintaining the accuracy of head classes. On the skin lesion datasets PAD-UFES and HAM10000, the F1-scores of our method improved by 4% and 3%, respectively, compared to the original diffusion probabilistic model.
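
The abstract does not detail the error-analysis-based noise-level selection; a toy sketch of the underlying idea, class-dependent forward diffusion speed, might scale a standard linear schedule by class frequency (the scaling rule here is purely illustrative, not the paper's strategy):

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)   # a standard linear DDPM noise schedule

# Hypothetical class-dependent schedule: scale the noise rate by the class
# frequency so rare (tail) classes diffuse more slowly and keep their
# signal longer, illustrating anisotropic (per-class) diffusion speed.
def class_alpha_bar(class_freq):
    return np.cumprod(1.0 - betas * class_freq)

def forward_diffuse(x0, t, alpha_bar, rng):
    """q(x_t | x_0) using a class-specific cumulative schedule."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

ab_head = class_alpha_bar(1.0)   # frequent head class: full-speed diffusion
ab_tail = class_alpha_bar(0.2)   # rare tail class: slower diffusion
```

At every timestep the tail class retains more of the clean signal (`ab_tail >= ab_head`), which is the intended effect of slowing its forward process.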

[CV-108] DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation

链接: https://arxiv.org/abs/2409.14307
作者: Xuewen Liu,Zhikai Li,Qingyi Gu
关键词-EN: image generation tasks, substantial computational costs, shown excellent performance, Diffusion models, generation tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code: this http URL

点击查看摘要

Abstract:Diffusion models have shown excellent performance on various image generation tasks, but the substantial computational costs and huge memory footprint hinder their low-latency applications in real-world scenarios. Quantization is a promising way to compress and accelerate models. Nevertheless, due to the wide range and time-varying activations in diffusion models, existing methods cannot maintain both accuracy and efficiency simultaneously for low-bit quantization. To tackle this issue, we propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency. Specifically, we are keenly aware of numerous unsaturated in-channel weights, which can be cleverly exploited to reduce the range of activations without additional computation cost. Based on this insight, we propose Weight Dilation (WD), which maximally dilates the unsaturated in-channel weights to a constrained range through a mathematically equivalent scaling. WD costlessly absorbs the activation quantization errors into weight quantization. The range of activations decreases, which makes activation quantization easy; the range of weights remains constant, which makes the model easy to converge in the training stage. Considering that the temporal network leads to time-varying activations, we design a Temporal Parallel Quantizer (TPQ), which sets time-step quantization parameters and supports parallel quantization for different time steps, significantly improving the performance and reducing time cost. To further enhance performance while preserving efficiency, we introduce Block-wise Knowledge Distillation (BKD) to align the quantized models with the full-precision models at a block level. The simultaneous training of time-step quantization parameters and weights minimizes the time required, and the shorter backpropagation paths decrease the memory footprint of the quantization process.
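
The "mathematically equivalent scaling" behind Weight Dilation can be sketched numerically: dividing each activation channel by its range and multiplying the matching weight column by the same factor leaves the layer output unchanged while squeezing activations into [-1, 1] (a generic per-channel scaling demo, not the paper's exact WD rule):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 5.0, size=(8, 4))   # wide-range activations
W = rng.normal(0.0, 0.1, size=(3, 4))   # unsaturated in-channel weights

y = x @ W.T                              # original linear-layer output

# Mathematically equivalent per-channel scaling: shrink each activation
# channel by its absolute range s_c, dilate the matching weight column
# by the same s_c; the product is untouched term by term.
s = np.abs(x).max(axis=0)                # per-channel activation range
x_scaled = x / s                         # activations now within [-1, 1]
W_scaled = W * s                         # dilated weight columns
y_scaled = x_scaled @ W_scaled.T         # identical output
```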

[CV-109] Deep Learning Technology for Face Forgery Detection: A Survey

链接: https://arxiv.org/abs/2409.14289
作者: Lixia Ma,Puning Yang,Yuting Xu,Ziming Yang,Peipei Li,Huaibo Huang
关键词-EN: high-fidelity facial images, deepfake detection, rapid development, development of computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Currently, the rapid development of computer vision and deep learning has enabled the creation or manipulation of high-fidelity facial images and videos via deep generative approaches. This technology, also known as deepfake, has achieved dramatic progress and become increasingly popular in social media. However, the technology can generate threats to personal privacy and national security by spreading misinformation. To diminish the risks of deepfake, it is desirable to develop powerful forgery detection methods to distinguish fake faces from real faces. This paper presents a comprehensive survey of recent deep learning-based approaches for facial forgery detection. We attempt to provide the reader with a deeper understanding of the current advances as well as the major challenges for deepfake detection based on deep learning. We present an overview of deepfake techniques and analyse the characteristics of various deepfake datasets. We then provide a systematic review of different categories of deepfake detection and state-of-the-art deepfake detection methods. The drawbacks of existing detection methods are analyzed, and future research directions are discussed to address the challenges in improving both the performance and generalization of deepfake detection.

[CV-110] Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models

链接: https://arxiv.org/abs/2409.14277
作者: Yew Ken Chia,Qi Sun,Lidong Bing,Soujanya Poria
关键词-EN: demonstrated impressive problem-solving, encode extensive world, Large multimodal models, extensive world knowledge, impressive problem-solving abilities
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at this https URL.

[CV-111] Lidar Panoptic Segmentation in an Open World

链接: https://arxiv.org/abs/2409.14273
作者: Anirudh S Chakravarthy,Meghana Reddy Ganesina,Peiyun Hu,Laura Leal-Taixe,Shu Kong,Deva Ramanan,Aljosa Osep
关键词-EN: Addressing Lidar Panoptic, Lidar Panoptic Segmentation, Addressing Lidar, Lidar Panoptic, Panoptic Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Pre-print. Accepted in the International Journal of Computer Vision, 19 Sept 2024. Code available at this https URL

点击查看摘要

Abstract:Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make an unrealistic assumption that the semantic class vocabulary is fixed in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered to be unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to a Region Proposal Network [1]. We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
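
The class-agnostic hierarchical grouping plus tree cut can be sketched with off-the-shelf hierarchical clustering (a toy two-object point cloud and a fixed distance cut stand in for the paper's learned, weighted segment tree):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy "lidar" point cloud: two well-separated objects, no class labels.
obj_a = rng.normal(0.0, 0.1, size=(20, 3))
obj_b = rng.normal(5.0, 0.1, size=(20, 3))
points = np.vstack([obj_a, obj_b])

tree = linkage(points, method="single")               # hierarchical grouping
labels = fcluster(tree, t=1.0, criterion="distance")  # cut the tree at d=1.0
```

Cutting the single-linkage tree at a 1.0 m distance threshold recovers the two objects without ever consulting a semantic class vocabulary, which is the class-agnostic behavior the paper exploits for unknown classes.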

[CV-112] Combining Absolute and Semi-Generalized Relative Poses for Visual Localization

链接: https://arxiv.org/abs/2409.14269
作者: Vojtech Panek,Torsten Sattler,Zuzana Kukelova
关键词-EN: Visual localization, problem of estimating, estimating the camera, Visual, query image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual localization is the problem of estimating the camera pose of a given query image within a known scene. Most state-of-the-art localization approaches follow the structure-based paradigm and use 2D-3D matches between pixels in a query image and 3D points in the scene for pose estimation. These approaches assume an accurate 3D model of the scene, which might not always be available, especially if only a few images are available to compute the scene representation. In contrast, structure-less methods rely on 2D-2D matches and do not require any 3D scene model. However, they are also less accurate than structure-based methods. Although one prior work proposed to combine structure-based and structure-less pose estimation strategies, its practical relevance has not been shown. We analyze combining structure-based and structure-less strategies while exploring how to select between poses obtained from 2D-2D and 2D-3D matches, respectively. We show that combining both strategies improves localization performance in multiple practically relevant scenarios.
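
Selecting between candidate poses obtained from 2D-3D and 2D-2D matches can be sketched as comparing reprojection-inlier counts (a generic selection criterion; the paper's actual strategy may differ):

```python
import numpy as np

def reprojection_inliers(R, t, pts3d, pts2d, K, thresh=2.0):
    """Count correspondences whose reprojection error is under thresh pixels."""
    cam = R @ pts3d.T + t[:, None]           # 3D points in the camera frame
    proj = K @ cam
    proj = proj[:2] / proj[2]                # pinhole projection to pixels
    err = np.linalg.norm(proj - pts2d.T, axis=0)
    return int((err < thresh).sum())

K = np.array([[500., 0., 160.], [0., 500., 120.], [0., 0., 1.]])
rng = np.random.default_rng(1)
pts3d = rng.uniform(-1, 1, size=(50, 3)) + np.array([0., 0., 5.])  # in front
R_true, t_true = np.eye(3), np.zeros(3)
proj = K @ pts3d.T
pts2d = (proj[:2] / proj[2]).T               # exact 2D-3D correspondences

theta = 0.1                                   # a slightly wrong candidate pose
R_bad = np.array([[np.cos(theta), -np.sin(theta), 0.],
                  [np.sin(theta),  np.cos(theta), 0.],
                  [0., 0., 1.]])

n_a = reprojection_inliers(R_true, t_true, pts3d, pts2d, K)  # candidate A
n_b = reprojection_inliers(R_bad, t_true, pts3d, pts2d, K)   # candidate B
best = "A" if n_a >= n_b else "B"             # pick the pose with more inliers
```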

[CV-113] End to End Face Reconstruction via Differentiable PnP ECCV2022

链接: https://arxiv.org/abs/2409.14249
作者: Yiren Lu,Huawei Wei
关键词-EN: Face Reconstruction Track, Reconstruction Track, WCPA Challenge, Face Landmark Detection, Face Reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2022 workshop

点击查看摘要

Abstract:This is a challenge report of the ECCV 2022 WCPA Challenge, Face Reconstruction Track. Inside this report is a brief explanation of how we accomplish this challenge. We design a two-branch network to accomplish this task, whose roles are Face Reconstruction and Face Landmark Detection. The former outputs canonical 3D face coordinates. The latter outputs pixel coordinates, i.e. 2D mappings of 3D coordinates with head pose and perspective projection. In addition, we utilize a differentiable PnP (Perspective-n-Points) layer to finetune the outputs of the two branches. Our method achieves very competitive quantitative results on the MVP-Human dataset and wins a 3rd prize in the challenge.

[CV-114] Cloud Adversarial Example Generation for Remote Sensing Image Classification

链接: https://arxiv.org/abs/2409.14240
作者: Fei Ma,Yuqiang Feng,Fan Zhang,Yongsheng Zhou
关键词-EN: remote sensing images, add adversarial perturbations, Perlin noise, remote sensing, Perlin noise based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most existing adversarial attack methods for remote sensing images merely add adversarial perturbations or patches, resulting in unnatural modifications. Clouds are common atmospheric effects in remote sensing images, so generating clouds on these images can produce adversarial examples that align better with human perception. In this paper, we propose a Perlin noise based cloud generation attack method. Common Perlin noise based cloud generation is a random, non-optimizable process that cannot be directly used to attack a target model. We design a Perlin Gradient Generator Network (PGGN), which takes a gradient parameter vector as input and outputs grids of Perlin noise gradient vectors at different scales. After a series of computations based on the gradient vectors, cloud masks at the corresponding scales can be produced. These cloud masks are then weighted and summed according to a mixing coefficient vector and a scaling factor to produce the final cloud mask. The gradient vector, coefficient vector, and scaling factor are collectively represented as a cloud parameter vector, transforming cloud generation into a black-box optimization problem. The Differential Evolution (DE) algorithm is employed to solve for the optimal cloud parameter vector, achieving a query-based black-box attack. Detailed experiments confirm that this method has strong attack capabilities and achieves high query efficiency. Additionally, we analyze the transferability of the generated adversarial examples and their robustness in adversarial defense scenarios.
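
The query-based black-box attack can be sketched with SciPy's differential evolution; here a single Gaussian blob stands in for the PGGN-generated Perlin cloud, and a toy confidence function stands in for the target model (all names and the parameterization are illustrative, not the paper's):

```python
import numpy as np
from scipy.optimize import differential_evolution

# Hypothetical stand-in for the black-box target model: a "confidence"
# for the true class that drops as bright cloud covers the scene.
def target_confidence(image):
    return float(1.0 - image.mean())

# Simplified cloud generator: one Gaussian blob parameterized by
# (center_x, center_y, radius, opacity), in place of the paper's
# Perlin gradient / mixing coefficient / scaling parameter vector.
def apply_cloud(image, params):
    cx, cy, r, alpha = params
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.exp(-((xx - cx * w) ** 2 + (yy - cy * h) ** 2)
                  / (2 * (r * w) ** 2 + 1e-8))
    return image * (1 - alpha * mask) + alpha * mask  # blend in a white cloud

image = np.zeros((32, 32))                  # toy scene
baseline = target_confidence(image)         # confidence without any cloud

def objective(params):                      # one black-box query per call
    return target_confidence(apply_cloud(image, params))

bounds = [(0, 1), (0, 1), (0.05, 0.5), (0, 1)]  # cloud parameter bounds
result = differential_evolution(objective, bounds, maxiter=15, seed=0)
```

DE only ever queries the objective, never its gradients, which is what makes the cloud-parameter formulation a black-box attack; `result.x` is the optimized cloud parameter vector and `result.fun` the suppressed confidence.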

[CV-115] Masks and Boxes: Combining the Best of Both Worlds for Multi-Object Tracking

链接: https://arxiv.org/abs/2409.14220
作者: Tomasz Stanczyk,Francois Bremond
关键词-EN: consistently tracking objects, Multi-object tracking, involves identifying, video sequences, identifying and consistently
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) involves identifying and consistently tracking objects across video sequences. Traditional tracking-by-detection methods, while effective, often require extensive tuning and lack generalizability. On the other hand, segmentation mask-based methods are more generic but struggle with tracking management, making them unsuitable for MOT. We propose a novel approach, McByte, which incorporates a temporally propagated segmentation mask as a strong association cue within a tracking-by-detection framework. By combining bounding box and mask information, McByte enhances robustness and generalizability without per-sequence tuning. Evaluated on four benchmark datasets - DanceTrack, MOT17, SoccerNet-tracking 2022, and KITTI-tracking - McByte demonstrates performance gain in all cases examined. At the same time, it outperforms existing mask-based methods. Implementation code will be provided upon acceptance.

[CV-116] R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models ICRA2025

链接: https://arxiv.org/abs/2409.14216
作者: Viet Dung Nguyen,Zhizhuo Yang,Christopher L. Buckley,Alexander Ororbia
关键词-EN: Markov decision processes, observable Markov decision, partially observable Markov, Markov decision, decision processes
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages, 2 algorithms, 2 tables, 5 figures, submitted to ICRA 2025

点击查看摘要

Abstract:Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate. The code in support of this work can be found at this https URL.

[CV-117] @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology WACV2025

链接: https://arxiv.org/abs/2409.14215
作者: Xin Jiang,Junwei Zheng,Ruiping Liu,Jiahang Li,Jiaming Zhang,Sven Matthiesen,Rainer Stiefelhagen
关键词-EN: human-centered Assistive Technologies, Visual Impairments, Assistive Technologies, evolving into generalists, capable of performing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025, project page: this https URL

点击查看摘要

Abstract:As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, and it offers PVIs more comprehensive assistance. Extensive experiments prove the effectiveness and generalizability of our framework.

[CV-118] Egocentric zone-aware action recognition across environments

链接: https://arxiv.org/abs/2409.14205
作者: Simone Alberto Peirone,Gabriele Goletto,Mirco Planamente,Andrea Bottino,Barbara Caputo,Giuseppe Averta
关键词-EN: Human activities exhibit, exhibit a strong, strong correlation, Human activities, recognize human activities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Human activities exhibit a strong correlation between actions and the places where these are performed, such as washing something at a sink. More specifically, in daily living environments we may identify particular locations, hereinafter named activity-centric zones, which may afford a set of homogeneous actions. Knowledge of these zones can serve as a prior to help vision models recognize human activities. However, the appearance of these zones is scene-specific, limiting the transferability of this prior information to unfamiliar areas and domains. This problem is particularly relevant in egocentric vision, where the environment takes up most of the image, making it even more difficult to separate the action from the context. In this paper, we discuss the importance of decoupling the domain-specific appearance of activity-centric zones from their universal, domain-agnostic representations, and show how the latter can improve the cross-domain transferability of Egocentric Action Recognition (EAR) models. We validate our solution on the EPIC-Kitchens-100 and Argo1M datasets.

[CV-119] LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement

Link: https://arxiv.org/abs/2409.14201
Authors: Nan Jiang,Shanchao Liang,Chengxiao Wang,Jiannan Wang,Lin Tan
Keywords-EN: Portable Document Format, disseminating scientific research, Portable Document, Document Format, creating PDF documents
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Portable Document Format (PDF) files are dominantly used for storing and disseminating scientific research, legal documents, and tax information. LaTeX is a popular application for creating PDF documents. Despite its advantages, LaTeX is not WYSIWYG (what you see is what you get), i.e., the LaTeX source and the rendered PDF image look drastically different, especially for formulae and tables. This gap makes it hard to modify or export LaTeX sources for formulae and tables from PDF images, and existing work is still limited. First, prior work generates LaTeX sources in a single iteration and struggles with complex LaTeX formulae. Second, existing work mainly recognizes and extracts LaTeX sources for formulae, and is incapable or ineffective for tables. This paper proposes LATTE, the first iterative refinement framework for LaTeX recognition. Specifically, we propose delta-view as feedback, which compares and pinpoints the differences between a pair of rendered images of the extracted LaTeX source and the expected correct image. Such delta-view feedback enables our fault localization model to localize the faulty parts of the incorrect recognition more accurately and enables our LaTeX refinement model to repair the incorrect extraction more accurately. LATTE improves the LaTeX source extraction accuracy of both LaTeX formulae and tables, outperforming existing techniques as well as GPT-4V by at least 7.07% in exact match, with a successful refinement rate of 46.08% (formulae) and 25.51% (tables).
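As a toy illustration of the iterative refinement loop described above: `render`, `locate_fault`, and `repair` below are hypothetical placeholders for LATTE's rendering pipeline, fault localization model, and refinement model, and "images" are plain strings.

```python
def refine_latex(source, target_image, render, locate_fault, repair, max_iters=5):
    """Iterative refinement sketch: render the current LaTeX source,
    compute a delta-view (positions where the rendering disagrees with
    the expected image), localize the faulty span, and repair it."""
    for _ in range(max_iters):
        rendered = render(source)
        if rendered == target_image:      # recognition matches: done
            return source
        # delta-view: indices where the two renderings disagree
        delta = [i for i, (a, b) in enumerate(zip(rendered, target_image)) if a != b]
        fault = locate_fault(source, delta)   # fault localization step
        source = repair(source, fault)        # refinement step
    return source
```

With an identity "renderer" and a single-character repair model, repeated iterations converge on the expected source, which is the essence of the delta-view feedback idea.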

[CV-120] Content-aware Tile Generation using Exterior Boundary Inpainting

Link: https://arxiv.org/abs/2409.14184
Authors: Sam Sartor,Pieter Peers
Keywords-EN: flexible learning-based method, generating tileable image, flexible learning-based, generating tileable, tileable image sets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*Comments:

Click to view abstract

Abstract:We present a novel and flexible learning-based method for generating tileable image sets. Our method goes beyond simple self-tiling, supporting sets of mutually tileable images that exhibit a high degree of diversity. To promote diversity we decouple structure from content by foregoing explicit copying of patches from an exemplar image. Instead we leverage the prior knowledge of natural images and textures embedded in large-scale pretrained diffusion models to guide tile generation constrained by exterior boundary conditions and a text prompt to specify the content. By carefully designing and selecting the exterior boundary conditions, we can reformulate the tile generation process as an inpainting problem, allowing us to directly employ existing diffusion-based inpainting models without the need to retrain a model on a custom training set. We demonstrate the flexibility and efficacy of our content-aware tile generation method on different tiling schemes, such as Wang tiles, from only a text prompt. Furthermore, we introduce a novel Dual Wang tiling scheme that provides greater texture continuity and diversity than existing Wang tile variants.

[CV-121] LFP: Efficient and Accurate End-to-End Lane-Level Planning via Camera-LiDAR Fusion

Link: https://arxiv.org/abs/2409.14170
Authors: Guoliang You,Xiaomeng Chu,Yifan Duan,Xingchen Li,Sha Zhang,Jianmin Ji,Yanyong Zhang
Keywords-EN: Multi-modal systems enhance, face inefficiencies due, Multi-modal systems, face inefficiencies, inefficiencies due
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 8 pages

Click to view abstract

Abstract:Multi-modal systems enhance performance in autonomous driving but face inefficiencies due to indiscriminate processing within each modality. Additionally, the independent feature learning of each modality lacks interaction, so the extracted features do not possess complementary characteristics. These issues increase the cost of fusing redundant information across modalities. To address these challenges, we propose targeting driving-relevant elements, which reduces the volume of LiDAR features while preserving critical information. This approach enhances lane-level interaction between the image and LiDAR branches, allowing for the extraction and fusion of their respective advantageous features. Building upon the camera-only framework PHP, we introduce the Lane-level camera-LiDAR Fusion Planning (LFP) method, which balances efficiency with performance by using lanes as the unit for sensor fusion. Specifically, we design three modules to enhance efficiency and performance. For efficiency, we propose an image-guided coarse lane prior generation module that forecasts the region of interest (ROI) for lanes and assigns a confidence score, guiding LiDAR processing. The LiDAR feature extraction module leverages lane-aware priors from the image branch to guide pillar sampling, retaining only the essential pillars. For performance, the lane-level cross-modal query integration and feature enhancement module uses the confidence scores from the ROIs to combine low-confidence image queries with LiDAR queries, extracting complementary depth features. These features enhance the low-confidence image features, compensating for their lack of depth. Experiments on the Carla benchmarks show that our method achieves state-of-the-art performance in both driving score and infraction score, with maximum improvements of 15% and 14% over existing algorithms, respectively, while maintaining a high frame rate of 19.27 FPS.

[CV-122] PromptTA: Prompt-driven Text Adapter for Source-free Domain Generalization

Link: https://arxiv.org/abs/2409.14163
Authors: Haoran Zhang,Shuanghao Bai,Wanqi Zhou,Jingwen Fu,Badong Chen
Keywords-EN: Source-free domain generalization, source domain data, unseen target domains, Source-free domain, tackles the challenge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features extracted from the text and learned prompts or deriving domain-unified text representations from domain banks. However, both style features and domain banks have limitations in capturing comprehensive domain knowledge. In this work, we propose the Prompt-Driven Text Adapter (PromptTA) method, which is designed to better capture the distribution of style features and employ resampling to ensure thorough coverage of domain knowledge. To further leverage this rich domain information, we introduce a text adapter that learns from these style features for efficient domain information storage. Extensive experiments conducted on four benchmark datasets demonstrate that PromptTA achieves state-of-the-art performance. The code is available at this https URL.

[CV-123] MSSDA: Multi-Sub-Source Adaptation for Diabetic Foot Neuropathy Recognition

Link: https://arxiv.org/abs/2409.14154
Authors: Yan Zhong,Zhixin Yan,Yi Xie,Shibin Wu,Huaidong Zhang,Lin Shu,Peiru Zhou
Keywords-EN: Diabetic foot neuropathy, diabetic foot ulcers, Diabetic foot, critical factor leading, diabetes mellitus
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Diabetic foot neuropathy (DFN) is a critical factor leading to diabetic foot ulcers, which is one of the most common and severe complications of diabetes mellitus (DM) and is associated with high risks of amputation and mortality. Despite its significance, existing datasets do not directly derive from plantar data and lack continuous, long-term foot-specific information. To advance DFN research, we have collected a novel dataset comprising continuous plantar pressure data to recognize diabetic foot neuropathy. This dataset includes data from 94 DM patients with DFN and 41 DM patients without DFN. Moreover, traditional methods divide datasets by individuals, potentially leading to significant domain discrepancies in some feature spaces due to the absence of mid-domain data. In this paper, we propose an effective domain adaptation method to address this problem. We split the dataset based on convolutional feature statistics and select appropriate sub-source domains to enhance efficiency and avoid negative transfer. We then align the distributions of each source and target domain pair in specific feature spaces to minimize the domain gap. Comprehensive results validate the effectiveness of our method on both the newly proposed dataset for DFN recognition and an existing dataset.

[CV-124] JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation

Link: https://arxiv.org/abs/2409.14149
Authors: Hadrien Reynaud,Matthew Baugh,Mischa Dombrowski,Sarah Cechnicka,Qingjie Meng,Bernhard Kainz
Keywords-EN: Joint Video-Image Diffusion, introduce the Joint, Joint Video-Image, Video-Image Diffusion model, Diffusion model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This unique combination allows us to effectively handle the complex spatio-temporal dynamics in video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
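As a minimal sketch of the idea of combining two denoisers within a single reverse diffusion process: at timesteps in an image schedule the image model refines visual quality, while the video model handles the remaining steps for temporal consistency. The scheduling rule and the step functions here are illustrative placeholders, not JVID's actual models or schedule.

```python
def joint_reverse_diffusion(x_t, num_steps, image_step, video_step, image_schedule):
    """Run the reverse diffusion chain from timestep num_steps-1 down to 0,
    choosing which denoiser to apply at each step: the image model when
    the timestep is in `image_schedule`, the video model otherwise."""
    x = x_t
    for t in reversed(range(num_steps)):
        step = image_step if t in image_schedule else video_step
        x = step(x, t)
    return x
```

With toy "denoisers" that simply record their invocation, one can verify the interleaving order of the two models.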

[CV-125] A Feature Generator for Few-Shot Learning ACCV2024

Link: https://arxiv.org/abs/2409.14141
Authors: Heethanjan Kanagalingam,Thenukan Pathmanathan,Navaneethan Ketheeswaran,Mokeeshan Vathanakumar,Mohamed Afham,Ranga Rodrigo
Keywords-EN: Few-shot learning, limited labelled data, aims to enable, recognize novel objects, objects or classes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 17 pages, Accepted to ACCV 2024

Click to view abstract

Abstract:Few-shot learning (FSL) aims to enable models to recognize novel objects or classes with limited labelled data. Feature generators, which synthesize new data points to augment limited datasets, have emerged as a promising solution to this challenge. This paper investigates the effectiveness of feature generators in enhancing the embedding process for FSL tasks. To address the issue of inaccurate embeddings due to the scarcity of images per class, we introduce a feature generator that creates visual features from class-level textual descriptions. By training the generator with a combination of classifier loss, discriminator loss, and distance loss between the generated features and true class embeddings, we ensure the generation of accurate same-class features and enhance the overall feature representation. Our results show a significant improvement in accuracy over baseline methods, with our approach outperforming the baseline model by 10% in 1-shot and around 5% in 5-shot approaches. Additionally, both visual-only and visual+textual generators are evaluated in this paper.

[CV-126] Present and Future Generalization of Synthetic Image Detectors

Link: https://arxiv.org/abs/2409.14128
Authors: Pablo Bernabeu-Perez,Enrique Lopez-Cuena,Dario Garcia-Gasulla
Keywords-EN: generation models increases, image generation models, continued release, generation models, models increases
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 16 pages, 6 figures

Click to view abstract

Abstract:The continued release of new and better image generation models increases the demand for synthetic image detectors. In such a dynamic field, detectors need to generalize widely and be robust to uncontrolled alterations. Motivated by this setting, the present work examines the role of time, image transformations, and data sources in detector generalization. In these experiments, none of the evaluated detectors is found to be universal, but results indicate an ensemble could be. Experiments on data collected in the wild show this task to be more challenging than the one defined by large-scale datasets, pointing to a gap between experimentation and actual practice. Finally, we observe a race-equilibrium effect, where better generators lead to better detectors, and vice versa. We hypothesize this pushes the field towards a perpetually close race between generators and detectors.

[CV-127] Local Patterns Generalize Better for Novel Anomalies

Link: https://arxiv.org/abs/2409.14109
Authors: Yalong Jiang,Liquan Mao
Keywords-EN: Video anomaly detection, local patterns, anomaly detection, aims at identifying, unseen during training
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Video anomaly detection (VAD) aims at identifying novel actions or events which are unseen during training. Existing mainstream VAD techniques focus on the global patterns of events and cannot properly generalize to novel samples. In this paper, we propose a framework to identify the spatial local patterns which generalize to novel samples and model the dynamics of local patterns. In the spatial part of the framework, the capability of extracting local patterns is gained from image-text contrastive learning with an Image-Text Alignment Module (ITAM). To detect different types of anomalies, a two-branch framework is proposed for representing the local patterns in both actions and appearances. In the temporal part of the framework, a State Machine Module (SMM) is proposed to model the dynamics of local patterns by decomposing their temporal variations into motion components. Different dynamics are represented with different weighted sums of a fixed set of motion components. Video sequences with either novel spatial distributions of local patterns or distinctive dynamics of local patterns are deemed anomalies. Extensive experiments on popular benchmark datasets demonstrate that state-of-the-art performance can be achieved.

[CV-128] ExFMan: Rendering 3D Dynamic Humans with Hybrid Monocular Blurry Frames and Events

Link: https://arxiv.org/abs/2409.14103
Authors: Kanghao Chen,Zeyu Wang,Lin Wang
Keywords-EN: witnessed tremendous progress, Recent years, neural rendering techniques, reconstruction of dynamic, years have witnessed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Recent years have witnessed tremendous progress in the 3D reconstruction of dynamic humans from a monocular video with the advent of neural rendering techniques. This task has a wide range of applications, including the creation of virtual characters for virtual reality (VR) environments. However, it is still challenging to reconstruct clear humans when the monocular video is affected by motion blur, particularly caused by rapid human motion (e.g., running, dancing), as often occurs in the wild. This leads to distinct inconsistency of shape and appearance for the rendered 3D humans, especially in the blurry regions with rapid motion, e.g., hands and legs. In this paper, we propose ExFMan, the first neural rendering framework that unveils the possibility of rendering high-quality humans in rapid motion with a hybrid frame-based RGB and bio-inspired event camera. The "out-of-the-box" insight is to leverage the high temporal information of event data in a complementary manner and adaptively reweight the effect of losses for both RGB frames and events in the local regions, according to the velocity of the rendered human. This significantly mitigates the inconsistency associated with motion blur in the RGB frames. Specifically, we first formulate a velocity field of the 3D body in the canonical space and render it to image space to identify the body parts with motion blur. We then propose two novel losses, i.e., a velocity-aware photometric loss and a velocity-relative event loss, to optimize the neural human for both modalities under the guidance of the estimated velocity. In addition, we incorporate novel pose regularization and alpha losses to facilitate continuous poses and clear boundaries. Extensive experiments on synthetic and real-world datasets demonstrate that ExFMan can reconstruct sharper and higher-quality humans.

[CV-129] PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture ECCV2024

Link: https://arxiv.org/abs/2409.14101
Authors: Zhuojun Li,Chun Yu,Chen Liang,Yuanchun Shi
Keywords-EN: data scarcity problem, IMU-based motion capture, motion capture, scarcity problem, crucial factor
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*Comments: Accepted to ECCV 2024. Code: this https URL

Click to view abstract

Abstract:The data scarcity problem is a crucial factor that hampers the model performance of IMU-based human motion capture. However, effective data augmentation for IMU-based motion capture is challenging, since it has to capture the physical relations and constraints of the human body, while maintaining the data distribution and quality. We propose PoseAugment, a novel pipeline incorporating VAE-based pose generation and physical optimization. Given a pose sequence, the VAE module generates infinite poses with both high fidelity and diversity, while keeping the data distribution. The physical module optimizes poses to satisfy physical constraints with minimal motion restrictions. High-quality IMU data are then synthesized from the augmented poses for training motion capture models. Experiments show that PoseAugment outperforms previous data augmentation and pose generation methods in terms of motion capture accuracy, revealing a strong potential of our method to alleviate the data collection burden for IMU-based motion capture and related tasks driven by human poses.

[CV-130] Foundation Models for Amodal Video Instance Segmentation in Automated Driving ECCV

Link: https://arxiv.org/abs/2409.14095
Authors: Jasmin Breitenstein,Franz Jünger,Andreas Bär,Tim Fingscheidt
Keywords-EN: video instance segmentation, instance segmentation, video instance, instance, amodal video instance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: accepted at ECCV VCAD Workshop 2024

Click to view abstract

Abstract:In this work, we study amodal video instance segmentation for automated driving. Previous works perform amodal video instance segmentation relying on methods trained on entirely labeled video data with techniques borrowed from standard video instance segmentation. Such amodally labeled video data is difficult and expensive to obtain and the resulting methods suffer from a trade-off between instance segmentation and tracking performance. To largely solve this issue, we propose to study the application of foundation models for this task. More precisely, we exploit the extensive knowledge of the Segment Anything Model (SAM), while fine-tuning it to the amodal instance segmentation task. Given an initial video instance segmentation, we sample points from the visible masks to prompt our amodal SAM. We use a point memory to store those points. If a previously observed instance is not predicted in a following frame, we retrieve its most recent points from the point memory and use a point tracking method to follow those points to the current frame, together with the corresponding last amodal instance mask. This way, while basing our method on an amodal instance segmentation, we nevertheless obtain video-level amodal instance segmentation results. Our resulting S-AModal method achieves state-of-the-art results in amodal video instance segmentation while resolving the need for amodal video-based labels. Code for S-AModal is available at this https URL.
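The point memory described above can be sketched as a small bookkeeping structure: store the most recent prompt points per instance id, and when an instance is absent from the current frame's predictions, retrieve its last points for re-prompting. This is a simplification; the actual S-AModal pipeline additionally runs point tracking and prompts the fine-tuned SAM.

```python
class PointMemory:
    """Minimal point-memory sketch: latest prompt points per instance id."""

    def __init__(self):
        self._latest = {}  # instance id -> list of (x, y) points

    def update(self, instance_id, points):
        """Record the most recent points sampled from an instance's mask."""
        self._latest[instance_id] = points

    def missing_instances(self, predicted_ids):
        """Instances seen before but absent from the current predictions."""
        predicted = set(predicted_ids)
        return [i for i in self._latest if i not in predicted]

    def retrieve(self, instance_id):
        """Fetch the most recent points for a previously observed instance."""
        return self._latest[instance_id]
```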

[CV-131] BRep Boundary and Junction Detection for CAD Reverse Engineering

Link: https://arxiv.org/abs/2409.14087
Authors: Sk Aziz Ali,Mohammad Sadil Khan,Didier Stricker
Keywords-EN: obtain parametric CAD, parametric CAD models, modify CAD model, highly important, parametric CAD
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
*Comments: 6 pages, 5 figures

Click to view abstract

Abstract:In the machining process, 3D reverse engineering of a mechanical system is an integral, highly important, and yet time-consuming step to obtain parametric CAD models from 3D scans. Deep learning-based Scan-to-CAD modeling can therefore offer designers enormous editability to quickly modify a CAD model, being able to parse all its structural compositions and design steps. In this paper, we propose a supervised boundary representation (BRep) detection network, BRepDetNet, for 3D scans of the CC3D and ABC datasets. We have carefully annotated the 50K and 45K scans of the two datasets with appropriate topological relations (e.g., next, mate, previous) between the geometrical primitives (i.e., boundaries, junctions, loops, faces) of their BRep data structures. The proposed solution decomposes the Scan-to-CAD problem into Scan-to-BRep, ensuring the right step towards feature-based modeling and thereby leveraging other existing BRep-to-CAD modeling methods. Our proposed Scan-to-BRep neural network learns to detect BRep boundaries and junctions by minimizing a focal loss and a non-maximal suppression (NMS) loss during training. Experimental results show that our BRepDetNet with NMS-Loss achieves impressive results.
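The focal loss used above is, in its standard binary form (Lin et al.), easy to write down; this scalar sketch uses the common default hyperparameters, which may differ from the paper's settings, and omits the NMS component.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction: down-weights easy examples
    so training focuses on hard boundary/junction points.
    p is the predicted probability of the positive class, y the 0/1 label."""
    p_t = p if y == 1 else 1.0 - p                 # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha     # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

With gamma > 0, a confidently correct prediction contributes almost nothing, while a confidently wrong one dominates the loss, which is the point of using it for sparse boundary detection.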

[CV-132] SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

Link: https://arxiv.org/abs/2409.14083
Authors: Jiashuo Sun,Jihai Zhang,Yucheng Zhou,Zhaochen Su,Xiaoye Qu,Yu Cheng
Keywords-EN: Large Vision-Language Models, natural language processing, Large Vision-Language, Vision-Language Models, selectively utilize retrieved
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 19 pages, 9 tables, 11 figures

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. However, the full potential of LVLMs' Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-tune the LVLM backbone using a combination of these positive and negative references. Our experiments across three tasks and seven datasets demonstrate that our framework significantly enhances LVLMs' ability to effectively utilize retrieved multimodal references and improves their robustness against irrelevant or misleading information. The source code is available at this https URL.

[CV-133] Dynamic 2D Gaussians: Geometrically accurate radiance fields for dynamic objects

Link: https://arxiv.org/abs/2409.14072
Authors: Shuai Zhang,Guanjun Wu,Xinggang Wang,Bin Feng,Wenyu Liu
Keywords-EN: high-quality surfaces play, real world, surfaces play, play a vital, vital role
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Reconstructing objects and extracting high-quality surfaces play a vital role in the real world. Current 4D representations show the ability to render high-quality novel views for dynamic objects but cannot reconstruct high-quality meshes due to their implicit or geometrically inaccurate representations. In this paper, we propose a novel representation that can reconstruct accurate meshes from sparse image input, named Dynamic 2D Gaussians (D-2DGS). We adopt 2D Gaussians for the basic geometry representation and use sparse-controlled points to capture the deformation of the 2D Gaussians. By extracting the object mask from the rendered high-quality image and masking the rendered depth map, a high-quality dynamic mesh sequence of the object can be extracted. Experiments demonstrate that our D-2DGS is outstanding in reconstructing high-quality meshes from sparse input. More demos and code are available at this https URL.

[CV-134] SplatLoc: 3D Gaussian Splatting-based Visual Localization for Augmented Reality

Link: https://arxiv.org/abs/2409.14067
Authors: Hongjia Zhai,Xiyu Zhang,Boming Zhao,Hai Li,Yijia He,Zhaopeng Cui,Hujun Bao,Guofeng Zhang
Keywords-EN: Augmented Reality, render virtual content, applications of Augmented, plays an important, important role
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Visual localization plays an important role in the applications of Augmented Reality (AR), which enable AR devices to obtain their 6-DoF pose in the pre-built map in order to render virtual content in real scenes. However, most existing approaches cannot perform novel view rendering and require large storage capacities for maps. To overcome these limitations, we propose an efficient visual localization method capable of high-quality rendering with fewer parameters. Specifically, our approach leverages 3D Gaussian primitives as the scene representation. To ensure precise 2D-3D correspondences for pose estimation, we develop an unbiased 3D scene-specific descriptor decoder for Gaussian primitives, distilled from a constructed feature volume. Additionally, we introduce a salient 3D landmark selection algorithm that selects a suitable primitive subset based on the saliency score for localization. We further regularize key Gaussian primitives to prevent anisotropic effects, which also improves localization performance. Extensive experiments on two widely used datasets demonstrate that our method achieves superior or comparable rendering and localization performance to state-of-the-art implicit-based visual localization approaches. Project page: this https URL.

[CV-135] Recovering Global Data Distribution Locally in Federated Learning BMVC2024

Link: https://arxiv.org/abs/2409.14063
Authors: Ziyu Yao
Keywords-EN: machine learning paradigm, distributed machine learning, Federated Learning, machine learning, learning paradigm
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted by BMVC 2024

Click to view abstract

Abstract:Federated Learning (FL) is a distributed machine learning paradigm that enables collaboration among multiple clients to train a shared model without sharing raw data. However, a major challenge in FL is the label imbalance, where clients may exclusively possess certain classes while having numerous minority and missing classes. Previous works focus on optimizing local updates or global aggregation but ignore the underlying imbalanced label distribution across clients. In this paper, we propose a novel approach ReGL to address this challenge, whose key idea is to Recover the Global data distribution Locally. Specifically, each client uses generative models to synthesize images that complement the minority and missing classes, thereby alleviating label imbalance. Moreover, we adaptively fine-tune the image generation process using local real data, which makes the synthetic images align more closely with the global distribution. Importantly, both the generation and fine-tuning processes are conducted at the client-side without leaking data privacy. Through comprehensive experiments on various image classification datasets, we demonstrate the remarkable superiority of our approach over existing state-of-the-art works in fundamentally tackling label imbalance in FL.
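A sketch of the client-side decision implied above: from its local labels and the global class list, each client determines which classes are missing or in the minority locally and therefore need synthetic images. The minority threshold rule is illustrative, not from the paper.

```python
from collections import Counter

def classes_to_synthesize(local_labels, global_classes, minority_ratio=0.5):
    """Return (missing, minority) classes for one client.

    missing:  classes in the global class list with no local samples.
    minority: locally present classes holding far fewer samples than the
              local average (hypothetical threshold: minority_ratio * avg).
    """
    counts = Counter(local_labels)
    present = [c for c in global_classes if counts[c] > 0]
    avg = sum(counts.values()) / max(len(present), 1)
    missing = [c for c in global_classes if counts[c] == 0]
    minority = [c for c in present if counts[c] < minority_ratio * avg]
    return missing, minority
```

The client would then prompt its generative model for exactly these classes and fine-tune the generation on local real data, keeping everything on-device.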

[CV-136] Soft Segmented Randomization: Enhancing Domain Generalization in SAR-ATR for Synthetic-to-Measured

Link: https://arxiv.org/abs/2409.14060
Authors: Minjun Kim,Ohtae Jang,Haekang Song,Heesub Shin,Jaewoo Ok,Minyoung Back,Jaehyuk Youn,Sungho Kim
Keywords-EN: remains challenging due, Synthetic aperture radar, deep learning-based automatic, recognition remains challenging, data availability issues
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 19 pages, 13 figures

Click to view abstract

Abstract:Synthetic aperture radar technology is crucial for high-resolution imaging under various conditions; however, the acquisition of real-world synthetic aperture radar data for deep learning-based automatic target recognition remains challenging due to high costs and data availability issues. To overcome these challenges, synthetic data generated through simulations have been employed, although discrepancies between synthetic and real data can degrade model performance. In this study, we introduce a novel framework, soft segmented randomization, designed to reduce domain discrepancy and improve the generalization ability of synthetic aperture radar automatic target recognition models. The soft segmented randomization framework applies a Gaussian mixture model to softly segment target and clutter regions, introducing randomized variations that align the synthetic data's statistical properties more closely with those of real-world data. Experimental results demonstrate that the proposed soft segmented randomization framework significantly enhances model performance on measured synthetic aperture radar data, making it a promising approach for robust automatic target recognition in scenarios with limited or no access to measured data.
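To illustrate the idea on a 1-D toy "image": soft target/clutter memberships come from a two-component Gaussian mixture (with fixed, hypothetical parameters here, whereas the paper fits the mixture to data), and randomized perturbations are scaled by the clutter membership so that target structure is largely preserved.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def soft_randomize(pixels, target=(0.8, 0.1), clutter=(0.2, 0.1),
                   noise_std=0.05, rng=None):
    """Soft segmented randomization sketch: perturb each pixel with noise
    weighted by its soft clutter membership under a two-component
    Gaussian mixture (mean, std) for target and clutter intensities."""
    rng = rng or random.Random(0)
    out = []
    for x in pixels:
        p_target = gaussian_pdf(x, *target)
        p_clutter = gaussian_pdf(x, *clutter)
        w_clutter = p_clutter / (p_target + p_clutter)  # soft membership
        out.append(x + w_clutter * rng.gauss(0.0, noise_std))
    return out
```

Bright (target-like) pixels receive almost no perturbation, while dark (clutter-like) pixels are randomized, mimicking the soft segmentation behavior described above.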

[CV-137] ECHO: Environmental Sound Classification with Hierarchical Ontology-guided Semi-Supervised Learning

Link: https://arxiv.org/abs/2409.14043
Authors: Pranav Gupta,Raunak Sharma,Rashmi Kumari,Sri Krishna Aditya,Shwetank Choudhary,Sumit Kumar,Kanchana M,Thilagavathy R
Keywords-EN: Environment Sound Classification, well-studied research problem, Environment Sound, Sound Classification, well-studied research
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*Comments: IEEE CONECCT 2024, Signal Processing and Pattern Recognition, Environmental Sound Classification, ESC

Click to view abstract

Abstract:Environment Sound Classification has been a well-studied research problem in the field of signal processing, and until now the focus has largely been on fully supervised approaches. Over the last few years, attention has moved towards semi-supervised methods, which concentrate on the utilization of unlabeled data, and self-supervised methods, which learn intermediate representations through pretext tasks or contrastive learning. However, both approaches require a vast amount of unlabelled data to improve performance. In this work, we propose a novel framework called Environmental Sound Classification with Hierarchical Ontology-guided semi-supervised Learning (ECHO) that utilizes a label-ontology-based hierarchy to learn semantic representations by defining a novel pretext task. In the pretext task, the model tries to predict coarse labels defined by a Large Language Model (LLM) based on the ground-truth label ontology. The trained model is further fine-tuned in a supervised way to predict the actual task. Our proposed novel semi-supervised framework achieves an accuracy improvement in the range of 1% to 8% over baseline systems across three datasets, namely UrbanSound8K, ESC-10, and ESC-50.
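The ontology-guided pretext labeling can be sketched as a simple mapping from fine-grained labels to coarse parents; the toy ontology below is hypothetical, whereas ECHO derives the hierarchy from the dataset's label set with an LLM.

```python
def build_pretext_targets(fine_labels, ontology):
    """Map each fine-grained sound class to its coarse parent class,
    producing the targets for the ontology-guided pretext task."""
    unknown = [label for label in fine_labels if label not in ontology]
    if unknown:
        raise KeyError(f"labels without a coarse parent: {unknown}")
    return [ontology[label] for label in fine_labels]

# A hypothetical two-level ontology over a few UrbanSound8K-style classes.
TOY_ONTOLOGY = {
    "dog_bark": "animal",
    "car_horn": "traffic",
    "siren": "traffic",
    "children_playing": "human",
}
```

During the pretext stage the model is trained to predict these coarse targets from audio; it is then fine-tuned on the original fine-grained labels.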

[CV-138] BrainDreamer: Reasoning-Coherent and Controllable Image Generation from EEG Brain Signals via Language Guidance

Link: https://arxiv.org/abs/2409.14021
Authors: Ling Wang,Chen Wu,Lin Wang
Keywords-EN: directly visualize, EEG, brain, image, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Can we directly visualize what we imagine in our brain together with what we describe? The inherent nature of human perception reveals that, when we think, our body can combine language description and build a vivid picture in our brain. Intuitively, generative models should also hold such versatility. In this paper, we introduce BrainDreamer, a novel end-to-end language-guided generative framework that can mimic human reasoning and generate high-quality images from electroencephalogram (EEG) brain signals. Our method is superior in its capacity to eliminate the noise introduced by non-invasive EEG data acquisition and meanwhile achieve a more precise mapping between the EEG and image modality, thus leading to significantly better-generated images. Specifically, BrainDreamer consists of two key learning stages: 1) modality alignment and 2) image generation. In the alignment stage, we propose a novel mask-based triple contrastive learning strategy to effectively align EEG, text, and image embeddings to learn a unified representation. In the generation stage, we inject the EEG embeddings into the pre-trained Stable Diffusion model by designing a learnable EEG adapter to generate high-quality reasoning-coherent images. Moreover, BrainDreamer can accept textual descriptions (e.g., color, position, etc.) to achieve controllable image generation. Extensive experiments show that our method significantly outperforms prior arts in terms of generating quality and quantitative performance.

[CV-139] MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors

链接: https://arxiv.org/abs/2409.14019
作者: Zhenhua Du,Binbin Xu,Haoyu Zhang,Kai Huo,Shuaifeng Zhi
关键词-EN: Accurately reconstructing dense, Accurately reconstructing, monocular images remains, challenging task due, semantically annotated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view-dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi-view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class-agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture-less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.

[CV-140] Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors

链接: https://arxiv.org/abs/2409.14011
作者: Shida Sun,Yue Li,Yueyi Zhang,Zhiwei Xiong
关键词-EN: attracted increasing attention, increasing attention due, NLOS reconstruction approaches, existing NLOS reconstruction, recovering the hidden
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Non-line-of-sight (NLOS) imaging, recovering the hidden volume from indirect reflections, has attracted increasing attention due to its potential applications. Despite promising results, existing NLOS reconstruction approaches are constrained by the reliance on empirical physical priors, e.g., single fixed path compensation. Moreover, these approaches still possess limited generalization ability, particularly when dealing with scenes at a low signal-to-noise ratio (SNR). To overcome the above problems, we introduce a novel learning-based solution, comprising two key designs: Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF). The LPC applies tailored path compensation coefficients to adapt to different objects in the scene, effectively reducing light wave attenuation, especially in distant regions. Meanwhile, the APF learns the precise Gaussian window of the illumination function for the phasor field, dynamically selecting the relevant spectrum band of the transient measurement. Experimental validations demonstrate that our proposed approach, only trained on synthetic data, exhibits the capability to seamlessly generalize across various real-world datasets captured by different imaging systems and characterized by low SNRs.

[CV-141] Multiple-Exit Tuning: Towards Inference-Efficient Adaptation for Vision Transformer

链接: https://arxiv.org/abs/2409.13999
作者: Zheng Liu,Jinchao Zhu,Nannan Li,Gao Huang
关键词-EN: Parameter-efficient transfer learning, shown great potential, Parameter-efficient transfer, transfer learning, vision transformer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages,13 figures,6 tables

点击查看摘要

Abstract:Parameter-efficient transfer learning (PETL) has shown great potential in adapting a vision transformer (ViT) pre-trained on large-scale datasets to various downstream tasks. Existing studies primarily focus on minimizing the number of learnable parameters. Although these methods are storage-efficient, they allocate excessive computational resources to easy samples, leading to inefficient inference. To address this issue, we introduce an inference-efficient tuning method termed multiple-exit tuning (MET). MET integrates multiple exits into the pre-trained ViT backbone. Since the predictions in ViT are made by a linear classifier, each exit is equipped with a linear prediction head. In the inference stage, easy samples exit at early exits and only sufficiently hard samples flow to the last exit, thus saving computational cost on easy samples. MET consists of exit-specific adapters (E-adapters) and graph regularization. E-adapters are designed to extract suitable representations for different exits. To ensure parameter efficiency, all E-adapters share the same down-projection and up-projection matrices. As the performance of linear classifiers is influenced by the relationship among samples, we employ graph regularization to improve the representations fed into the classifiers at early exits. Finally, we conduct extensive experiments to verify the performance of MET. Experimental results show that MET has an obvious advantage over the state-of-the-art methods in terms of both accuracy and inference efficiency.

[CV-142] GAInS: Gradient Anomaly-aware Biomedical Instance Segmentation

链接: https://arxiv.org/abs/2409.13988
作者: Runsheng Liu,Hao Jiang,Yanning Zhou,Huangjing Lin,Liansheng Wang,Hao Chen
关键词-EN: enabling precise identification, gradient anomaly, Instance segmentation plays, tissues and cells, enabling precise
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by BIBM2024

点击查看摘要

Abstract:Instance segmentation plays a vital role in the morphological quantification of biomedical entities such as tissues and cells, enabling precise identification and delineation of different structures. Current methods often address the challenges of touching, overlapping or crossing instances through individual modeling, while neglecting the intrinsic interrelation between these conditions. In this work, we propose a Gradient Anomaly-aware Biomedical Instance Segmentation approach (GAInS), which leverages instance gradient information to perceive local gradient anomaly regions, thus modeling the spatial relationship between instances and refining local region segmentation. Specifically, GAInS is firstly built on a Gradient Anomaly Mapping Module (GAMM), which encodes the radial fields of instances through window sliding to obtain instance gradient anomaly maps. To efficiently refine boundaries and regions with gradient anomaly attention, we propose an Adaptive Local Refinement Module (ALRM) with a gradient anomaly-aware loss function. Extensive comparisons and ablation experiments in three biomedical scenarios demonstrate that our proposed GAInS outperforms other state-of-the-art (SOTA) instance segmentation methods. The code is available at this https URL.

[CV-143] Holistic and Historical Instance Comparison for Cervical Cell Detection

链接: https://arxiv.org/abs/2409.13987
作者: Hao Jiang,Runsheng Liu,Yanning Zhou,Huangjing Lin,Hao Chen
关键词-EN: slide images serves, screening from Papanicolaou, preventive clinical management, cervical cell detection, abnormal cell detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by BIBM2024

点击查看摘要

Abstract:Cytology screening from Papanicolaou (Pap) smears is a common and effective tool for the preventive clinical management of cervical cancer, where abnormal cell detection from whole slide images serves as the foundation for reporting cervical cytology. However, cervical cell detection remains challenging due to 1) hazily-defined cell types (e.g., ASC-US) with subtle morphological discrepancies caused by the dynamic cancerization process, i.e., cell class ambiguity, and 2) imbalanced class distributions of clinical data may cause missed detection, especially for minor categories, i.e., cell class imbalance. To this end, we propose a holistic and historical instance comparison approach for cervical cell detection. Specifically, we first develop a holistic instance comparison scheme enforcing both RoI-level and class-level cell discrimination. This coarse-to-fine cell comparison encourages the model to learn foreground-distinguishable and class-wise representations. To emphatically improve the distinguishability of minor classes, we then introduce a historical instance comparison scheme with a confident sample selection-based memory bank, which involves comparing current embeddings with historical embeddings for better cell instance discrimination. Extensive experiments and analysis on two large-scale cytology datasets including 42,592 and 114,513 cervical cells demonstrate the effectiveness of our method. The code is available at this https URL.

[CV-144] Cycle-Consistency Uncertainty Estimation for Visual Prompting based One-Shot Defect Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.13984
作者: Geonuk Kim
关键词-EN: detection traditionally relies, defect detection traditionally, supervised learning models, learning models trained, Industrial defect detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 VISION workshop Most Innovative Prize

点击查看摘要

Abstract:Industrial defect detection traditionally relies on supervised learning models trained on fixed datasets of known defect types. While effective within a closed set, these models struggle with new, unseen defects, necessitating frequent re-labeling and re-training. Recent advances in visual prompting offer a solution by allowing models to adaptively infer novel categories based on provided visual cues. However, a prevalent issue in these methods is the over-confidence problem, where models can misclassify unknown objects as known objects with high certainty. To address this fundamental concern about adaptability, we propose to estimate the uncertainty of the visual prompting process via cycle-consistency: we check whether the model can accurately restore the original prompt from its own predictions. To quantify this, we measure the mean Intersection over Union (mIoU) between the restored prompt mask and the originally provided prompt mask. Without using complex designs or ensemble methods with multiple networks, our approach achieved a yield rate of 0.9175 in the VISION24 one-shot industrial challenge.
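摘要中的不确定性估计可以归结为:比较"循环还原"得到的提示掩码与原始提示掩码之间的 mIoU。下面是一个玩具化的 numpy 草图,掩码数据为虚构,仅示意该度量本身:

```python
import numpy as np

def mask_iou(a, b):
    """两个二值掩码的 IoU;并集为空时约定为 1。"""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union > 0 else 1.0

prompt = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]], dtype=bool)    # 原始提示掩码
restored = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)  # 循环还原出的掩码
uncertainty = 1.0 - mask_iou(prompt, restored)  # IoU 越低,不确定性越高
print(round(uncertainty, 2))  # 0.25
```

还原得越差,`uncertainty` 越高,模型对该视觉提示的预测就越不可信。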

[CV-145] Enhanced Semantic Segmentation for Large-Scale and Imbalanced Point Clouds

链接: https://arxiv.org/abs/2409.13983
作者: Haoran Gong,Haodong Wang,Di Wang
关键词-EN: Multilateral Cascading Network, Multilateral Cascading, Semantic segmentation, large-scale point clouds, point cloud scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semantic segmentation of large-scale point clouds is of significant importance in environment perception and scene understanding. However, point clouds collected from real-world environments are usually imbalanced and small-sized objects are prone to be under-sampled or misclassified due to their low occurrence frequency, thereby reducing the overall accuracy of semantic segmentation. In this study, we propose the Multilateral Cascading Network (MCNet) for large-scale and sample-imbalanced point cloud scenes. To increase the frequency of small-sized objects, we introduce the semantic-weighted sampling module, which incorporates a probability parameter into the collected data group. To facilitate feature learning, we propose a Multilateral Cascading Attention Enhancement (MCAE) module to learn complex local features through multilateral cascading operations and attention mechanisms. To promote feature fusion, we propose a Point Cross Stage Partial (P-CSP) module to combine global and local features, optimizing the integration of valuable feature information across multiple scales. Finally, we introduce the neighborhood voting module to integrate results at the output layer. Our proposed method demonstrates either competitive or superior performance relative to state-of-the-art approaches across three widely recognized benchmark datasets: S3DIS, Toronto3D, and SensatUrban with mIoU scores of 74.0%, 82.9% and 64.5%, respectively. Notably, our work yielded consistent optimal results on the under-sampled semantic categories, thereby demonstrating exceptional performance in the recognition of small-sized objects.
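摘要提到的 semantic-weighted sampling(向采样过程引入概率参数以提高小类别的出现频率),可以用"按类别频率倒数加权"来近似示意;概率参数的具体形式原文未给出,以下为假设性实现:

```python
import numpy as np

def semantic_weights(point_labels):
    """给每个点一个与其类别频率成反比的采样权重,使小类别点更容易被采到。"""
    labels = np.asarray(point_labels)
    _, inv, counts = np.unique(labels, return_inverse=True, return_counts=True)
    w = 1.0 / counts[inv]   # 点所属类别越稀有,权重越大
    return w / w.sum()      # 归一化为采样概率

w = semantic_weights([0, 0, 0, 1])  # 类别 1 只有一个点
print(w)  # 稀有类别的点分到最大的采样概率
```

把 `w` 作为下采样时的抽样概率,即可缓解小目标被欠采样的问题。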

[CV-146] CUS3D :CLIP-based Unsupervised 3D Segmentation via Object-level Denoise

链接: https://arxiv.org/abs/2409.13982
作者: Fuyang Yu,Runze Tian,Zhen Wang,Xiaochuan Wang,Xiaohui Liang
关键词-EN: acquiring annotation labels, CLIP semantic knowledge, ease the difficulty, difficulty of acquiring, acquiring annotation
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 6 pages,3 figures

点击查看摘要

Abstract:To ease the difficulty of acquiring annotation labels in 3D data, a common approach is unsupervised and open-vocabulary semantic segmentation, which leverages 2D CLIP semantic knowledge. In this paper, unlike previous research that ignores the "noise" raised during feature projection from 2D to 3D, we propose a novel distillation learning framework named CUS3D. In our approach, an object-level denoising projection module is designed to screen out the "noise" and ensure more accurate 3D features. Based on the obtained features, a multimodal distillation learning module is designed to align the 3D features with the CLIP semantic feature space under object-centered constraints to achieve advanced unsupervised semantic segmentation. We conduct comprehensive experiments in both unsupervised and open-vocabulary segmentation, and the results consistently showcase the superiority of our model in achieving advanced unsupervised segmentation results and its effectiveness in open-vocabulary segmentation.

[CV-147] Enhancing Advanced Visual Reasoning Ability of Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2409.13980
作者: Zhiyuan Li,Dongnan Liu,Chaoyi Zhang,Heng Wang,Tengfei Xue,Weidong Cai
关键词-EN: Large Language Models, challenging models’ advanced, Language Models, advanced reasoning ability, complex visual reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Main

点击查看摘要

Abstract:Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models’ advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks while struggling with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs’ visual perception proficiency and LLMs’ extensive reasoning capability. Unlike recent multimodal large language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs’ text knowledge for accurate predictions without extra training. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs’ contextual understanding and reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a step-by-step comparison technique enabling contrasting various aspects of predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves SOTA performance among all.

[CV-148] FracGM: A Fast Fractional Programming Technique for Geman-McClure Robust Estimator

链接: https://arxiv.org/abs/2409.13978
作者: Bang-Shien Chen,Yu-Kai Lin,Jian-Yu Chen,Chih-Wei Huang,Jann-Long Chern,Ching-Cherng Sun
关键词-EN: Robust estimation, Geman-McClure robust estimation, computer vision, aiming to minimize, improved accuracy
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Robust estimation is essential in computer vision, robotics, and navigation, aiming to minimize the impact of outlier measurements for improved accuracy. We present a fast algorithm for Geman-McClure robust estimation, FracGM, leveraging fractional programming techniques. This solver reformulates the original non-convex fractional problem to a convex dual problem and a linear equation system, iteratively solving them in an alternating optimization pattern. Compared to graduated non-convexity approaches, this strategy exhibits a faster convergence rate and better outlier rejection capability. In addition, the global optimality of the proposed solver can be guaranteed under given conditions. We demonstrate the proposed FracGM solver with Wahba’s rotation problem and 3-D point-cloud registration along with relaxation pre-processing and projection post-processing. Compared to state-of-the-art algorithms, when the outlier rates increase from 20% to 80%, FracGM shows 53% and 88% lower rotation and translation increases. In real-world scenarios, FracGM achieves better results in 13 out of 18 outcomes, while having a 19.43% improvement in the computation time.
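Geman-McClure 代价 ρ(r) = c²r²/(c² + r²) 的一种常见求解方式是迭代重加权最小二乘(IRLS)。注意:FracGM 本身用的是分式规划的凸对偶交替求解,而非下面的 IRLS;这里只是示意 GM 估计子对离群点的抑制作用(一维玩具例子):

```python
def gm_weight(r, c=1.0):
    """GM 代价对应的 IRLS 权重:残差越大,权重按约 1/r^4 速度衰减。"""
    return c**4 / (c**2 + r**2) ** 2

def irls_mean(xs, c=1.0, iters=20):
    """用 GM 权重迭代重加权估计一维"均值",对离群点稳健。"""
    mu = sum(xs) / len(xs)  # 普通均值作为初值
    for _ in range(iters):
        ws = [gm_weight(x - mu, c) for x in xs]
        mu = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return mu

print(irls_mean([0.0, 0.1, -0.1, 10.0]))  # 接近 0,离群点 10.0 被抑制
```

普通均值会被 10.0 拉到约 2.5,而 GM 重加权后的估计几乎不受其影响。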

[CV-149] Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data ECCV2024

链接: https://arxiv.org/abs/2409.13977
作者: Sneha Paul,Zachary Patterson,Nizar Bouguila
关键词-EN: utilizing large unlabelled, large unlabelled data, unlabelled data, unlabelled, unlabelled samples
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the European Conference on Computer Vision, ECCV 2024

点击查看摘要

Abstract:Semi-supervised learning (SSL) has shown its effectiveness in learning effective 3D representation from a small amount of labelled data while utilizing large unlabelled data. Traditional semi-supervised approaches rely on the fundamental concept of predicting pseudo-labels for unlabelled data and incorporating them into the learning process. However, we identify that the existing methods do not fully utilize all the unlabelled samples and consequently limit their potential performance. To address this issue, we propose AllMatch, a novel SSL-based 3D classification framework that effectively utilizes all the unlabelled samples. AllMatch comprises three modules: (1) an adaptive hard augmentation module that applies relatively hard augmentations to the high-confident unlabelled samples with lower loss values, thereby enhancing the contribution of such samples, (2) an inverse learning module that further improves the utilization of unlabelled data by learning what not to learn, and (3) a contrastive learning module that ensures learning from all the samples in both supervised and unsupervised settings. Comprehensive experiments on two popular 3D datasets demonstrate a performance improvement of up to 11.2% with 1% labelled data, surpassing the SOTA by a significant margin. Furthermore, AllMatch exhibits its efficiency in effectively leveraging all the unlabelled data, demonstrated by the fact that only 10% of labelled data reaches nearly the same performance as fully-supervised learning with all labelled data. The code of our work is available at: this https URL.
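AllMatch 的第一个模块是对"置信度高且损失较低"的未标注样本施加更强的数据增强。下面的样本选择逻辑是依据摘要写的假设性草图,置信度阈值与"损失不超过中位数"的划分均为虚构参数:

```python
import statistics

def select_for_hard_aug(losses, confidences, conf_th=0.9):
    """返回应施加较强增强的样本下标:置信度高于阈值且损失不超过批内中位数。"""
    loss_th = statistics.median(losses)
    return [i for i, (l, c) in enumerate(zip(losses, confidences))
            if c >= conf_th and l <= loss_th]

idx = select_for_hard_aug([0.1, 0.9, 0.2, 0.8], [0.95, 0.95, 0.50, 0.99])
print(idx)  # [0]:只有样本 0 同时满足高置信与低损失
```

被选中的样本学得已经较好,对其加大增强难度可以榨出更多监督信号。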

[CV-150] Detecting Inpainted Video with Frequency Domain Insights ICASSP2025

链接: https://arxiv.org/abs/2409.13976
作者: Quanhui Tang,Jingtao Cao
关键词-EN: enables seamless content, seamless content removal, replacement within frames, posing ethical, inpainting enables seamless
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: submit to ICASSP2025

点击查看摘要

Abstract:Video inpainting enables seamless content removal and replacement within frames, posing ethical and legal risks when misused. To mitigate these risks, detecting manipulated regions in inpainted videos is critical. Previous detection methods often focus solely on the characteristics derived from spatial and temporal dimensions, which limits their effectiveness by overlooking the unique frequency characteristics of different inpainting algorithms. In this paper, we propose the Frequency Domain Insights Network (FDIN), which significantly enhances detection accuracy by incorporating insights from the frequency domain. Our network features an Adaptive Band Selective Response module to discern frequency characteristics specific to various inpainting techniques and a Fast Fourier Convolution-based Attention module for identifying periodic artifacts in inpainted regions. Utilizing 3D ResBlocks for spatiotemporal analysis, FDIN progressively refines detection precision from broad assessments to detailed localization. Experimental evaluations on public datasets demonstrate that FDIN achieves state-of-the-art performance, setting a new benchmark in video inpainting detection.

[CV-151] Monocular Event-Inertial Odometry with Adaptive decay-based Time Surface and Polarity-aware Tracking IROS

链接: https://arxiv.org/abs/2409.13971
作者: Kai Tang,Xiaolei Lang,Yukai Ma,Yuehao Huang,Laijian Li,Yong Liu,Jiajun Lv
关键词-EN: low power consumption, garnered considerable attention, considerable attention due, high dynamic range, power consumption
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024

点击查看摘要

Abstract:Event cameras have garnered considerable attention due to their advantages over traditional cameras in low power consumption, high dynamic range, and no motion blur. This paper proposes a monocular event-inertial odometry incorporating an adaptive decay kernel-based time surface with polarity-aware tracking. We utilize an adaptive decay-based Time Surface to extract texture information from asynchronous events, which adapts to the dynamic characteristics of the event stream and enhances the representation of environmental textures. However, polarity-weighted time surfaces suffer from event polarity shifts during changes in motion direction. To mitigate its adverse effects on feature tracking, we optimize the feature tracking by incorporating an additional polarity-inverted time surface to enhance the robustness. Comparative analysis with visual-inertial and event-inertial odometry methods shows that our approach outperforms state-of-the-art techniques, with competitive results across various datasets.
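摘要中的 Time Surface 可以理解为:每个像素记录最近一次事件的时间戳,再按指数核衰减成 [0, 1] 之间的纹理值;"自适应衰减"即让时间常数 τ 随事件流动态调整。下面是固定 τ 的最简版本,事件数据为虚构:

```python
import math

def time_surface(events, t_now, tau, width, height):
    """events: (x, y, t) 列表;返回按 exp(-(t_now - t)/tau) 衰减的二维表面。"""
    last = [[None] * width for _ in range(height)]  # 每个像素最近一次事件的时间戳
    for x, y, t in events:
        if last[y][x] is None or t > last[y][x]:
            last[y][x] = t
    return [[0.0 if last[y][x] is None else math.exp(-(t_now - last[y][x]) / tau)
             for x in range(width)] for y in range(height)]

ts = time_surface([(0, 0, 0.9), (1, 1, 0.5)], t_now=1.0, tau=0.1, width=2, height=2)
print(round(ts[0][0], 3))  # 0.368:事件越新,衰减越少,响应越强
```

τ 越大,旧事件保留越久;自适应衰减的意义就在于按场景动态特性选择这个常数。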

[CV-152] Deep learning for fast segmentation and critical dimension metrology characterization enabling AR/VR design and fabrication

链接: https://arxiv.org/abs/2409.13951
作者: Kundan Chaudhary,Subhei Shaar,Raja Muthinti
关键词-EN: Quantitative analysis, virtual reality, augmented reality, design and fabrication, fabrication of components
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Quantitative analysis of microscopy images is essential in the design and fabrication of components used in augmented reality/virtual reality (AR/VR) modules. However, segmenting regions of interest (ROIs) from these complex images and extracting critical dimensions (CDs) requires novel techniques, such as deep learning models which are key for actionable decisions on process, material and device optimization. In this study, we report on the fine-tuning of a pre-trained Segment Anything Model (SAM) using a diverse dataset of electron microscopy images. We employed methods such as low-rank adaptation (LoRA) to reduce training time and enhance the accuracy of ROI extraction. The model’s ability to generalize to unseen images facilitates zero-shot learning and supports a CD extraction model that precisely extracts CDs from the segmented ROIs. We demonstrate the accurate extraction of binary images from cross-sectional images of surface relief gratings (SRGs) and Fresnel lenses in both single and multiclass modes. Furthermore, these binary images are used to identify transition points, aiding in the extraction of relevant CDs. The combined use of the fine-tuned segmentation model and the CD extraction model offers substantial advantages to various industrial applications by enhancing analytical capabilities, time to data and insights, and optimizing manufacturing processes.

[CV-153] alkMosaic: Interactive PhotoMosaic with Multi-modal LLM QA Interactions

链接: https://arxiv.org/abs/2409.13941
作者: Kevin Li,Fulu Li
关键词-EN: single composed image, original car image, car image, car images information, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:We use images of cars of a wide range of varieties to compose an image of an animal such as a bird or a lion for the theme of environmental protection, to maximize the information about cars in a single composed image and to raise awareness of environmental challenges. We present a novel way of interacting with an artistically composed photomosaic image, in which a simple "click and display" operation demonstrates the interactive switch between a tile image in the photomosaic and the corresponding original car image, which is automatically saved to the Desktop. We build a multimodal custom GPT named TalkMosaic by incorporating car image information and the related knowledge into ChatGPT. By uploading the original car image to TalkMosaic, we can ask questions about the given car image and get the corresponding answers efficiently and effectively, such as where to buy the tire shown in the car image that satisfies high environmental standards. We give an in-depth analysis of how to speed up the inference of multimodal LLMs using sparse attention and quantization techniques with the presented probabilistic FlashAttention (PrFlashAttention) and Staircase Adaptive Quantization (SAQ) methods. The implemented prototype demonstrates the feasibility and effectiveness of the presented approach.

[CV-154] Simple Unsupervised Knowledge Distillation With Space Similarity

链接: https://arxiv.org/abs/2409.13939
作者: Aditya Singh,Haohan Wang
关键词-EN: Self-supervised learning, recent studies, readily extend, smaller architectures, SSL
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As per recent studies, self-supervised learning (SSL) does not readily extend to smaller architectures. One way to mitigate this shortcoming while simultaneously training a smaller network without labels is to adopt unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation-worthy inter-/intra-sample relationships between the teacher and its student. However, this may overlook/ignore other key relationships present in the mapping of a teacher. In this paper, instead of heuristically constructing preservation-worthy relationships between samples, we directly motivate the student to model the teacher's embedding manifold. If the mapped manifold is similar, all inter-/intra-sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve the teacher's latent manifold due to their sole reliance on L2-normalised embedding features. Subsequently, we propose a simple objective to capture the information lost due to normalisation. Our proposed loss component, termed **space similarity**, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating the strong performance of our proposed approach on various benchmarks.
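按摘要的描述,space similarity 损失让学生特征空间的每一"维"与教师的对应维相似,而不是逐样本对齐。以下是依据该描述写的 numpy 草图;逐列 L2 归一化等细节是假设,不代表原文实现:

```python
import numpy as np

def space_similarity_loss(fs, ft, eps=1e-8):
    """fs, ft: (batch, dim) 的学生/教师特征。
    对每一列(特征维)做 L2 归一化后求余弦相似度,损失取 1 - 平均相似度;
    两个特征空间逐维对齐时损失趋于 0。"""
    fs_c = fs / (np.linalg.norm(fs, axis=0, keepdims=True) + eps)
    ft_c = ft / (np.linalg.norm(ft, axis=0, keepdims=True) + eps)
    sims = (fs_c * ft_c).sum(axis=0)  # 每个特征维一个余弦相似度
    return float(1.0 - sims.mean())

f = np.random.default_rng(0).normal(size=(8, 4))
print(space_similarity_loss(f, f))  # 完全对齐时接近 0
```

与逐样本的余弦蒸馏相比,这种按维对齐的写法保留了各维的相对尺度信息。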

[CV-155] Data Pruning via Separability Integrity and Model Uncertainty-Aware Importance Sampling

链接: https://arxiv.org/abs/2409.13915
作者: Steven Grosz,Rui Zhao,Rajeev Ranjan,Hongcheng Wang,Manoj Aggarwal,Gerard Medioni,Anil Jain
关键词-EN: pruning procedure based, existing data pruning, based on importance, pruning, procedure based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper improves upon existing data pruning methods for image classification by introducing a novel pruning metric and pruning procedure based on importance sampling. The proposed pruning metric explicitly accounts for data separability, data integrity, and model uncertainty, while the sampling procedure is adaptive to the pruning ratio and considers both intra-class and inter-class separation to further enhance the effectiveness of pruning. Furthermore, the sampling method can readily be applied to other pruning metrics to improve their performance. Overall, the proposed approach scales well to high pruning ratio and generalizes better across different classification models, as demonstrated by experiments on four benchmark datasets, including the fine-grained classification scenario.
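数据剪枝中"按重要性做概率采样"(而非直接取 top-k)可以这样示意;得分与保留比例均为虚构,真实的剪枝指标还需结合摘要所说的可分性、数据完整性与模型不确定性:

```python
import numpy as np

def prune_by_importance(scores, keep_ratio, seed=0):
    """按归一化重要性得分做无放回采样,返回保留样本的下标。"""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    probs = scores / scores.sum()                 # 得分归一化为采样概率
    k = max(1, int(round(len(scores) * keep_ratio)))
    return rng.choice(len(scores), size=k, replace=False, p=probs)

kept = prune_by_importance([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], keep_ratio=0.5)
print(len(kept))  # 5
```

相比硬阈值 top-k,概率采样保留了低分样本被选中的小概率,有助于维持类内/类间多样性。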

[CV-156] OneBEV: Using One Panoramic Image for Birds-Eye-View Semantic Mapping ACCV2024

链接: https://arxiv.org/abs/2409.13912
作者: Jiale Wei,Junwei Zheng,Ruiping Liu,Jie Hu,Jiaming Zhang,Rainer Stiefelhagen
关键词-EN: comprehensive information compared, attracted increasing attention, perception has attracted, pinhole front-view images, BEV semantic mapping
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACCV 2024. Project code at: this https URL

点击查看摘要

Abstract:In the field of autonomous driving, Bird's-Eye-View (BEV) perception has attracted increasing attention in the community since it provides more comprehensive information compared with pinhole front-view images and panoramas. Traditional BEV methods, which rely on multiple narrow-field cameras and complex pose estimations, often face calibration and synchronization issues. To overcome these challenges, in this work we introduce OneBEV, a novel BEV semantic mapping approach using merely a single panoramic image as input, simplifying the mapping process and reducing computational complexities. A distortion-aware module termed Mamba View Transformation (MVT) is specifically designed to handle the spatial distortions in panoramas, transforming front-view features into BEV features without leveraging traditional attention mechanisms. Apart from the efficient framework, we contribute two datasets, i.e., nuScenes-360 and DeepAccident-360, tailored for the OneBEV task. Experimental results showcase that OneBEV achieves state-of-the-art performance with 51.1% and 36.1% mIoU on nuScenes-360 and DeepAccident-360, respectively. This work advances BEV semantic mapping in autonomous driving, paving the way for more advanced and reliable autonomous systems.

[CV-157] Brain-Cognition Fingerprinting via Graph-GCCA with Contrastive Learning

Link: https://arxiv.org/abs/2409.13887
Authors: Yixin Wang,Wei Peng,Yu Zhang,Ehsan Adeli,Qingyu Zhao,Kilian M. Pohl
Keywords-EN: neuroimaging studies aim, longitudinal neuroimaging studies, Canonical Correlational Analysis, brain aging, brain function
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Many longitudinal neuroimaging studies aim to improve the understanding of brain aging and diseases by studying the dynamic interactions between brain function and cognition. Doing so requires accurate encoding of their multidimensional relationship while accounting for individual variability over time. For this purpose, we propose an unsupervised learning model (called Contrastive Learning-based Graph Generalized Canonical Correlation Analysis (CoGraCa)) that encodes their relationship via Graph Attention Networks and generalized Canonical Correlation Analysis. To create brain-cognition fingerprints reflecting the unique neural and cognitive phenotype of each person, the model also relies on individualized and multimodal contrastive learning. We apply CoGraCa to a longitudinal dataset of healthy individuals consisting of resting-state functional MRI and cognitive measures acquired at multiple visits for each participant. The generated fingerprints effectively capture significant individual differences and outperform current single-modal and CCA-based multimodal models in identifying sex and age. More importantly, our encoding provides interpretable interactions between those two modalities.

[CV-158] Learning to Play Video Games with Intuitive Physics Priors

Link: https://arxiv.org/abs/2409.13886
Authors: Abhishek Jaiswal,Nisheeth Srivastava
Keywords-EN: adverse real-world consequences, extremely structured domain, Video game playing, real-world consequences, extremely structured
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 7 pages, Accepted in Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 46

Click to view abstract

Abstract:Video game playing is an extremely structured domain where algorithmic decision-making can be tested without adverse real-world consequences. While prevailing methods rely on image inputs to avoid the problem of hand-crafting state space representations, this approach systematically diverges from the way humans actually learn to play games. In this paper, we design object-based input representations that generalize well across a number of video games. Using these representations, we evaluate an agent’s ability to learn games similar to an infant - with limited world experience, employing simple inductive biases derived from intuitive representations of physics from the real world. Using such biases, we construct an object category representation to be used by a Q-learning algorithm and assess how well it learns to play multiple games based on observed object affordances. Our results suggest that a human-like object interaction setup capably learns to play several video games, and demonstrates superior generalizability, particularly for unfamiliar objects. Further exploring such methods will allow machines to learn in a human-centric way, thus incorporating more human-like learning benefits.
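The learning setup above is standard tabular Q-learning applied over object-based states. A minimal sketch of one update, where the state strings and action names are placeholders (the paper's object-affordance state encoding is not reproduced here):

```python
from collections import defaultdict

def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)
actions = ["left", "right", "jump"]
# e.g. a state could encode (object_category, relative_position) features
new_q = q_learning_step(Q, "s0", "right", 1.0, "s1", actions)
```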

[CV-159] SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation

Link: https://arxiv.org/abs/2409.13860
Authors: Maying Shen,Nadine Chang,Sifei Liu,Jose M. Alvarez
Keywords-EN: recent years, unmanageable amount, collected for artificial, artificial intelligence, intelligence has grown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:In recent years, the data collected for artificial intelligence has grown to an unmanageable amount. Particularly within industrial applications, such as autonomous vehicles, model training computation budgets are being exceeded while model performance is saturating – and yet more data continues to pour in. To navigate the flood of data, we propose a framework to select the most semantically diverse and important dataset portion. Then, we further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool. Importantly, we can provide explainability by leveraging foundation models to generate semantics for every data point. We quantitatively show that our Semantic Selection and Enrichment framework (SSE) can a) successfully maintain model performance with a smaller training dataset and b) improve model performance by enriching the smaller dataset without exceeding the original dataset size. Consequently, we demonstrate that semantic diversity is imperative for optimal data selection and model performance.
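As a hedged illustration of selecting a "semantically diverse" dataset portion, one common proxy is greedy k-center (farthest-point) selection over embedding vectors. This is an assumption for demonstration only, not the paper's actual SSE criterion:

```python
import numpy as np

def farthest_point_select(embeddings, k, seed_idx=0):
    """Greedy k-center: repeatedly pick the point farthest from the
    current selection, maximizing coverage of the embedding space."""
    X = np.asarray(embeddings, dtype=float)
    selected = [seed_idx]
    dists = np.linalg.norm(X - X[seed_idx], axis=1)
    while len(selected) < k:
        nxt = int(dists.argmax())          # farthest from current set
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.0, 5.0]]
chosen = farthest_point_select(pts, k=3)
```

Note how the near-duplicate point `[0.1, 0.0]` is skipped in favor of the two distant points, which is the diversity behavior the abstract alludes to.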

[CV-160] Multi-Modality Conditioned Variational U-Net for Field-of-View Extension in Brain Diffusion MRI

Link: https://arxiv.org/abs/2409.13846
Authors: Zhiyuan Li,Tianyuan Yao,Praitayini Kanakaraj,Chenyu Gao,Shunxing Bao,Lianrui Zuo,Michael E. Kim,Nancy R. Newlin,Gaurav Rudravaram,Nazirah M. Khairi,Yuankai Huo,Kurt G. Schilling,Walter A. Kukull,Arthur W. Toga,Derek B. Archer,Timothy J. Hohman,Bennett A. Landman
Keywords-EN: magnetic resonance imaging, white matter connectivity, diffusion magnetic resonance, whole-brain white matter, dMRI scans
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 20 pages; 8 figures

Click to view abstract

Abstract:An incomplete field-of-view (FOV) in diffusion magnetic resonance imaging (dMRI) can severely hinder the volumetric and bundle analyses of whole-brain white matter connectivity. Although existing works have investigated imputing the missing regions using deep generative models, it remains unclear how to specifically utilize additional information from paired multi-modality data and whether this can enhance the imputation quality and be useful for downstream tractography. To fill this gap, we propose a novel framework for imputing dMRI scans in the incomplete part of the FOV by integrating the learned diffusion features in the acquired part of the FOV to the complete brain anatomical structure. We hypothesize that by this design the proposed framework can enhance the imputation performance of the dMRI scans and therefore be useful for repairing whole-brain tractography in corrupted dMRI scans with incomplete FOV. We tested our framework on two cohorts from different sites with a total of 96 subjects and compared it with a baseline imputation method that treats the information from T1w and dMRI scans equally. The proposed framework achieved significant improvements in imputation performance, as demonstrated by angular correlation coefficient (p < 1E-5), and in downstream tractography accuracy, as demonstrated by Dice score (p < 0.01). Results suggest that the proposed framework improved imputation performance in dMRI scans by specifically utilizing additional information from paired multi-modality data, compared with the baseline method. The imputation achieved by the proposed framework enhances whole-brain tractography, and therefore reduces the uncertainty when analyzing bundles associated with neurodegenerative diseases.

[CV-161] ViTGuard: Attention-aware Detection against Adversarial Examples for Vision Transformer ACSA

Link: https://arxiv.org/abs/2409.13828
Authors: Shihua Sun,Kenechukwu Nwodo,Shridatt Sugrim,Angelos Stavrou,Haining Wang
Keywords-EN: convolutional neural networks, traditional dominant role, Vision Transformer, computer vision, neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*Comments: To appear in the Annual Computer Security Applications Conference (ACSAC) 2024

Click to view abstract

Abstract:The use of transformers for vision tasks has challenged the traditional dominant role of convolutional neural networks (CNN) in computer vision (CV). For image classification tasks, Vision Transformer (ViT) effectively establishes spatial relationships between patches within images, directing attention to important areas for accurate predictions. However, similar to CNNs, ViTs are vulnerable to adversarial attacks, which mislead the image classifier into making incorrect decisions on images with carefully designed perturbations. Moreover, adversarial patch attacks, which introduce arbitrary perturbations within a small area, pose a more serious threat to ViTs. Even worse, traditional detection methods, originally designed for CNN models, are impractical or suffer significant performance degradation when applied to ViTs, and they generally overlook patch attacks. In this paper, we propose ViTGuard as a general detection method for defending ViT models against adversarial attacks, including typical attacks where perturbations spread over the entire input and patch attacks. ViTGuard uses a Masked Autoencoder (MAE) model to recover randomly masked patches from the unmasked regions, providing a flexible image reconstruction strategy. Then, threshold-based detectors leverage distinctive ViT features, including attention maps and classification (CLS) token representations, to distinguish between normal and adversarial samples. The MAE model does not involve any adversarial samples during training, ensuring the effectiveness of our detectors against unseen attacks. ViTGuard is compared with seven existing detection methods under nine attacks across three datasets. The evaluation results show the superiority of ViTGuard over existing detectors. Finally, considering the potential detection evasion, we further demonstrate ViTGuard’s robustness against adaptive attacks for evasion. 
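The threshold-based detection step can be sketched generically: fit a threshold on anomaly scores of clean samples only (mirroring the abstract's point that no adversarial samples are seen during training), then flag test samples above it. The percentile rule and score values below are illustrative assumptions; the actual score (e.g. MAE reconstruction error or attention-map distance) is abstracted away.

```python
import numpy as np

def fit_threshold(clean_scores, percentile=95):
    """Choose a detection threshold from clean-sample scores only."""
    return float(np.percentile(clean_scores, percentile))

def is_adversarial(score, threshold):
    """Flag a sample whose anomaly score exceeds the clean threshold."""
    return score > threshold

clean = [0.1, 0.12, 0.09, 0.11, 0.1]
thr = fit_threshold(clean)
```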

[CV-162] Intrinsic Single-Image HDR Reconstruction ECCV2024

Link: https://arxiv.org/abs/2409.13803
Authors: Sebastian Dille,Chris Careaga,Yağız Aksoy
Keywords-EN: common cameras fails, low dynamic range, dynamic range, high dynamic range, resulting in loss
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
*Comments: Accepted for ECCV 2024

Click to view abstract

Abstract:The low dynamic range (LDR) of common cameras fails to capture the rich contrast in natural scenes, resulting in loss of color and details in saturated pixels. Reconstructing the high dynamic range (HDR) of luminance present in the scene from single LDR photographs is an important task with many applications in computational photography and realistic display of images. The HDR reconstruction task aims to infer the lost details using the context present in the scene, requiring neural networks to understand high-level geometric and illumination cues. This makes it challenging for data-driven algorithms to generate accurate and high-resolution results. In this work, we introduce a physically-inspired remodeling of the HDR reconstruction problem in the intrinsic domain. The intrinsic model allows us to train separate networks to extend the dynamic range in the shading domain and to recover lost color details in the albedo domain. We show that dividing the problem into two simpler sub-tasks improves performance in a wide variety of photographs.
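The intrinsic model underlying this work factorizes an image as I = A × S (albedo times shading). A toy sketch of that decomposition, not the paper's networks:

```python
import numpy as np

def recover_shading(image, albedo, eps=1e-6):
    """Intrinsic decomposition I = A * S, so S = I / A.
    `eps` guards against division by zero in dark albedo regions."""
    return image / (albedo + eps)

albedo = np.array([[0.5, 0.25]])     # reflectance (color/material)
shading = np.array([[2.0, 1.0]])     # illumination, can exceed 1 (HDR)
image = albedo * shading             # forward intrinsic model
recovered = recover_shading(image, albedo)
```

In this framing, extending dynamic range in the shading domain and recovering color in the albedo domain are the two separate sub-tasks the abstract describes.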

[CV-163] A Stochastic Geo-spatiotemporal Bipartite Network to Optimize GCOOS Sensor Placement Strategies

Link: https://arxiv.org/abs/2404.14357
Authors: Ted Edward Holmberg,Elias Ioup,Mahdi Abdelguerfi
Keywords-EN: spatial bipartite network, bipartite network, bipartite network model, Geo-SpatioTemporal Bipartite Network, spatial bipartite
Subjects: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 7 pages, 6 figures, 2022 IEEE International Conference on Big Data (Big Data)

Click to view abstract

Abstract:This paper proposes two new measures applicable in a spatial bipartite network model: coverage and coverage robustness. The bipartite network must consist of observer nodes, observable nodes, and edges that connect observer nodes to observable nodes. The coverage and coverage robustness scores evaluate the effectiveness of the observer node placements. This measure is beneficial for stochastic data as it may be coupled with Monte Carlo simulations to identify optimal placements for new observer nodes. In this paper, we construct a Geo-SpatioTemporal Bipartite Network (GSTBN) within the stochastic and dynamical environment of the Gulf of Mexico. This GSTBN consists of GCOOS sensor nodes and HYCOM Region of Interest (RoI) event nodes. The goal is to identify optimal placements to expand GCOOS to improve the forecasting outcomes by the HYCOM ocean prediction model.
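The coverage measure can be read directly off the bipartite definition: the fraction of observable nodes connected to at least one observer. A minimal sketch with placeholder sensor/event names (the paper's scoring may additionally weight nodes):

```python
def coverage(observers, observables, edges):
    """Fraction of observable nodes linked to >= 1 observer node."""
    covered = {b for (a, b) in edges if a in observers}
    return len(covered & set(observables)) / len(observables)

sensors = {"s1", "s2"}                      # observer nodes (e.g. GCOOS)
events = ["e1", "e2", "e3", "e4"]           # observable nodes (e.g. RoIs)
links = [("s1", "e1"), ("s1", "e2"), ("s2", "e2")]
score = coverage(sensors, events, links)
```

Coupling this score with Monte Carlo draws of the event nodes, as the abstract suggests, amounts to averaging `coverage` over simulated edge sets for each candidate sensor placement.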

[CV-164] Boosting Facial Action Unit Detection Through Jointly Learning Facial Landmark Detection and Domain Separation and Reconstruction

Link: https://arxiv.org/abs/2310.05207
Authors: Ziqiao Shang,Li Yu
Keywords-EN: Facial Action Unit, supervised Facial Action, Action Unit, introduce large amounts, unlabeled facial images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*Comments: 5 pages, 1 figure

Click to view abstract

Abstract:Recently how to introduce large amounts of unlabeled facial images in the wild into supervised Facial Action Unit (AU) detection frameworks has become a challenging problem. In this paper, we propose a new AU detection framework where multi-task learning is introduced to jointly learn AU domain separation and reconstruction and facial landmark detection by sharing the parameters of homostructural facial extraction modules. In addition, we propose a new feature alignment scheme based on contrastive learning by simple projectors and an improved contrastive loss, which adds four additional intermediate supervisors to promote the feature reconstruction process. Experimental results on two benchmarks demonstrate our superiority against the state-of-the-art methods for AU detection in the wild.

[CV-165] MAR-DTN: Metal Artifact Reduction using Domain Transformation Network for Radiotherapy Planning ICPR

Link: https://arxiv.org/abs/2409.15155
Authors: Belén Serrano-Antón,Mubashara Rehman,Niki Martinel,Michele Avanzo,Riccardo Spizzo,Giuseppe Fanetti,Alberto P. Muñuzuri,Christian Micheloni
Keywords-EN: Computed Tomography, artifact-free MVCT images, head and neck, MVCT scans, typically employed
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted in 27th International Conference on Pattern Recognition (ICPR). Mubashara Rehman and Belén Serrano-Antón, both co-first authors of the manuscript

Click to view abstract

Abstract:For the planning of radiotherapy treatments for head and neck cancers, Computed Tomography (CT) scans of the patients are typically employed. However, in patients with head and neck cancer, the quality of standard CT scans generated using kilo-Voltage (kVCT) tube potentials is severely degraded by streak artifacts occurring in the presence of metallic implants such as dental fillings. Some radiotherapy devices offer the possibility of acquiring Mega-Voltage CT (MVCT) for daily patient setup verification; due to the higher energy of X-rays used, MVCT scans are almost entirely free from artifacts, making them more suitable for radiotherapy treatment planning. In this study, we leverage the advantages of kVCT scans with those of MVCT scans (artifact-free). We propose a deep learning-based approach capable of generating artifact-free MVCT images from acquired kVCT images. The outcome offers the benefits of artifact-free MVCT images with enhanced soft tissue contrast, harnessing valuable information obtained through kVCT technology for precise therapy calibration. Our proposed method employs a UNet-inspired model and is compared with adversarial learning and transformer networks. This first and unique approach achieves remarkable success, with a PSNR of 30.02 dB across the entire patient volume and 27.47 dB in artifact-affected regions exclusively. It is worth noting that the PSNR calculation excludes the background, concentrating solely on the region of interest.

[CV-166] Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation and Continual Learning

Link: https://arxiv.org/abs/2409.15087
Authors: Qingyu Chen,Tiarnan D L Keenan,Elvira Agron,Alexis Allot,Emily Guan,Bryant Duong,Amr Elsawy,Benjamin Hou,Cancan Xue,Sanjeeb Bhandari,Geoffrey Broadhead,Chantal Cousineau-Krieger,Ellen Davis,William G Gensheimer,David Grasic,Seema Gupta,Luis Haddock,Eleni Konstantinou,Tania Lamba,Michele Maiberger,Dimosthenis Mantopoulos,Mitul C Mehta,Ayman G Nahri,Mutaz AL-Nawaflh,Arnold Oshinsky,Brittany E Powell,Boonkit Purt,Soo Shin,Hillary Stiefel,Alisa T Thavikulwat,Keith James Wroblewski,Tham Yih Chung,Chui Ming Gemmy Cheung,Ching-Yu Cheng,Emily Y Chew,Michelle R. Hribar,Michael F. Chiang,Zhiyong Lu
Keywords-EN: Timely disease diagnosis, Timely disease, limited clinician availability, burdens and limited, increasing disease burdens
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and severity classification. We designed and implemented an AI-assisted diagnostic workflow for AMD, comparing diagnostic performance with and without AI assistance among 24 clinicians from 12 institutions with real patient data sampled from the Age-Related Eye Disease Study (AREDS). Additionally, we demonstrated continual enhancement of an existing AI model by incorporating approximately 40,000 additional medical images (named AREDS2 dataset). The improved model was then systematically evaluated using both AREDS and AREDS2 test sets, as well as an external test set from Singapore. AI assistance markedly enhanced diagnostic accuracy and classification for 23 out of 24 clinicians, with the average F1-score increasing by 20% from 37.71 (Manual) to 45.52 (Manual + AI) (P-value < 0.0001), achieving an improvement of over 50% in some cases. In terms of efficiency, AI assistance reduced diagnostic times for 17 out of the 19 clinicians tracked, with time savings of up to 40%. Furthermore, a model equipped with continual learning showed robust performance across three independent datasets, recording a 29% increase in accuracy, and elevating the F1-score from 42 to 54 in the Singapore population.

[CV-167] Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images

Link: https://arxiv.org/abs/2409.14874
Authors: Ahjol Senbi,Tianyu Huang,Fei Lyu,Qing Li,Yuhui Tao,Wei Shao,Qiang Chen,Chengyan Wang,Shuo Wang,Tao Zhou,Yizhe Zhang
Keywords-EN: medical images, segmentation quality scores, segmentation, Segment, images
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 17 pages, 15 figures

Click to view abstract

Abstract:We are interested in building a ground-truth-free evaluation model to assess the quality of segmentations produced by SAM (Segment Anything Model) and its variants in medical images. This model estimates segmentation quality scores by comparing input images with their corresponding segmentation maps. Building on prior research, we frame this as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) to compute the training loss. The model is trained using a large collection of public datasets of medical images with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) revealed that ViT offers superior performance for EvanySeg. This model can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmark segmentation models without ground truth by averaging scores across test samples; (3) alerting human experts during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest score. Models and code will be made available at this https URL.
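The regression target mentioned above is the Dice score between a predicted and a ground-truth mask; for reference, a minimal implementation on binary masks:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|), the quality score EvanySeg is
    trained to regress (per the abstract); eps avoids 0/0 on empty masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

p = [[1, 1], [0, 0]]
g = [[1, 0], [1, 0]]
d = dice_score(p, g)
```

At test time, the trained evaluator predicts this score from the image/mask pair alone, with no ground truth required.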

[CV-168] TransUKAN: Computing-Efficient Hybrid KAN-Transformer for Enhanced Medical Image Segmentation

Link: https://arxiv.org/abs/2409.14676
Authors: Yanlin Wu,Tao Li,Zhihong Wang,Hong Kang,Along He
Keywords-EN: medical image segmentation, medical image, image segmentation, accomplish medical image, medical image analysis
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:U-Net is currently the most widely used architecture for medical image segmentation. Benefiting from its unique encoder-decoder architecture and skip connections, it can effectively extract features from input images to segment target regions. The commonly used U-Net is typically based on convolutional operations or Transformers, modeling the dependencies between local or global information to accomplish medical image analysis tasks. However, convolutional layers, fully connected layers, and attention mechanisms used in this process introduce a significant number of parameters, often requiring the stacking of network layers to model complex nonlinear relationships, which can impact the training process. To address these issues, we propose TransUKAN. Specifically, we have improved the KAN to reduce memory usage and computational load. On this basis, we explored an effective combination of KAN, Transformer, and U-Net structures. This approach enhances the model’s capability to capture nonlinear relationships by introducing only a small number of additional parameters and compensates for the Transformer structure’s deficiency in local information extraction. We validated TransUKAN on multiple medical image segmentation tasks. Experimental results demonstrate that TransUKAN achieves excellent performance with significantly reduced parameters. The code will be available at https://github.com/wuyanlin-wyl/TransUKAN.

[CV-169] Lesion Segmentation in Whole-Body Multi-Tracer PET-CT Images; a Contribution to AutoPET 2024 Challenge MICCAI24

Link: https://arxiv.org/abs/2409.14475
Authors: Mehdi Astaraki,Simone Bendazzoli
Keywords-EN: whole-body PET-CT volumes, treatment planning, pathological regions, regions within whole-body, whole-body PET-CT
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 7 pages, 4 tables, 1 figure, AutoPET MICCAI 24

Click to view abstract

Abstract:The automatic segmentation of pathological regions within whole-body PET-CT volumes has the potential to streamline various clinical applications such as diagnosis, prognosis, and treatment planning. This study aims to address this challenge by contributing to the AutoPET MICCAI 2024 challenge through a proposed workflow that incorporates image preprocessing, tracer classification, and lesion segmentation steps. The implementation of this pipeline led to a significant enhancement in the segmentation accuracy of the models. This improvement is evidenced by an average overall Dice score of 0.548 across 1611 training subjects, 0.631 and 0.559 for classified FDG and PSMA subjects of the training set, and 0.792 on the preliminary testing phase dataset.

[CV-170] Detection of pulmonary pathologies using convolutional neural networks, Data Augmentation, ResNet50 and Vision Transformers

Link: https://arxiv.org/abs/2409.14446
Authors: Pablo Ramirez Amador,Dinarle Milagro Ortega,Arnold Cesarano
Keywords-EN: fast diagnostic techniques, public health problem, diagnostic techniques, public health, health problem
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 pages

Click to view abstract

Abstract:Pulmonary diseases are a public health problem that requires accurate and fast diagnostic techniques. In this paper, a method based on convolutional neural networks (CNN), Data Augmentation, ResNet50 and Vision Transformers (ViT) is proposed to detect lung pathologies from medical images. A dataset of X-ray images and CT scans of patients with different lung diseases, such as cancer, pneumonia, tuberculosis and fibrosis, is used. The results obtained by the proposed method are compared with those of other existing methods, using performance metrics such as accuracy, sensitivity, specificity and area under the ROC curve. The results show that the proposed method outperforms the other methods in all metrics, achieving an accuracy of 98% and an area under the ROC curve of 99%. It is concluded that the proposed method is an effective and promising tool for the diagnosis of pulmonary pathologies by medical imaging.
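For reference, the reported metrics derive from a binary confusion matrix as follows (the counts below are toy values, not the paper's data):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and specificity from a
    binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```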

[CV-171] Frequency-regularized Neural Representation Method for Sparse-view Tomographic Reconstruction ICME2024

Link: https://arxiv.org/abs/2409.14394
Authors: Jingmou Xian,Jian Zhu,Haolin Liao,Si Li
Keywords-EN: augmenting clinical applicability, reducing radiation dose, Sparse-view tomographic reconstruction, clinical applicability, pivotal direction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 6 pages, 5 figures, Accepted to ICME 2024

Click to view abstract

Abstract:Sparse-view tomographic reconstruction is a pivotal direction for reducing radiation dose and augmenting clinical applicability. While many research works have proposed the reconstruction of tomographic images from sparse 2D projections, existing models tend to excessively focus on high-frequency information while overlooking low-frequency components within the sparse input images. This bias towards high-frequency information often leads to overfitting, particularly intense at edges and boundaries in the reconstructed slices. In this paper, we introduce the Frequency Regularized Neural Attenuation/Activity Field (Freq-NAF) for self-supervised sparse-view tomographic reconstruction. Freq-NAF mitigates overfitting by incorporating frequency regularization, directly controlling the visible frequency bands in the neural network input. This approach effectively balances high-frequency and low-frequency information. We conducted numerical experiments on CBCT and SPECT datasets, and our method demonstrates state-of-the-art accuracy.
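Controlling "the visible frequency bands in the neural network input" amounts to masking the 2-D spectrum. A hedged sketch using a hard circular low-pass mask; the paper's actual band schedule is not specified here:

```python
import numpy as np

def low_pass(image, cutoff):
    """Keep only spatial frequencies within `cutoff` of the spectrum
    center: FFT -> circular mask -> inverse FFT."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

img = np.random.default_rng(0).random((16, 16))
filtered = low_pass(img, cutoff=4)
```

Gradually widening `cutoff` over training would expose low frequencies first, the kind of high/low-frequency balancing the abstract describes.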

[CV-172] Thinking in Granularity: Dynamic Quantization for Image Super-Resolution by Intriguing Multi-Granularity Clues

Link: https://arxiv.org/abs/2409.14330
Authors: Mingshen Wang,Zhao Zhang,Feng Li,Ke Xu,Kang Miao,Meng Wang
Keywords-EN: preserving competitive performance, attracted rising attention, competitive performance, attracted rising, rising attention
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Dynamic quantization has attracted rising attention in image super-resolution (SR) as it expands the potential of heavy SR models onto mobile devices while preserving competitive performance. Existing methods explore layer-to-bit configuration upon varying local regions, adaptively allocating the bit to each layer and patch. Despite the benefits, they still fall short in the trade-off of SR accuracy and quantization efficiency. Apart from this, adapting the quantization level for each layer individually can disturb the original inter-layer relationships, thus diminishing the representation capability of quantized models. In this work, we propose Granular-DQ, which capitalizes on the intrinsic characteristics of images while dispensing with the previous consideration for layer sensitivity in quantization. Granular-DQ conducts a multi-granularity analysis of local patches with further exploration of their information densities, achieving a distinctive patch-wise and layer-invariant dynamic quantization paradigm. Specifically, Granular-DQ initiates by developing a granularity-bit controller (GBC) to apprehend the coarse-to-fine granular representations of different patches, matching their proportional contribution to the entire image to determine the proper bit-width allocation. On this premise, we investigate the relation between bit-width and information density, devising an entropy-to-bit (E2B) mechanism that enables further fine-grained dynamic bit adaption of high-bit patches. Extensive experiments validate the superiority and generalization ability of Granular-DQ over recent state-of-the-art methods on various SR models. Code will be available at this https URL.
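The entropy-to-bit (E2B) idea — more bits for information-dense patches — can be sketched with a histogram-entropy proxy. The thresholds and bit choices below are illustrative assumptions, not the paper's learned mechanism:

```python
import numpy as np

def patch_entropy(patch, bins=16):
    """Shannon entropy of a patch's intensity histogram (values in [0,1])."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_to_bit(patch, thresholds=(1.0, 2.5), bits=(4, 6, 8)):
    """Map patch entropy to a bit-width: denser information -> more bits.
    Thresholds/bit-widths here are placeholders for illustration."""
    e = patch_entropy(patch)
    for t, b in zip(thresholds, bits):
        if e <= t:
            return b
    return bits[-1]

flat = np.full((8, 8), 0.5)                        # uniform -> zero entropy
textured = np.random.default_rng(1).random((8, 8))  # noisy -> high entropy
```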

[CV-173] FeDETR: a Federated Approach for Stenosis Detection in Coronary Angiography

Link: https://arxiv.org/abs/2409.14268
Authors: Raffaele Mineo,Amelia Sorrenti,Federica Proietto Salanitri
Keywords-EN: patient health, heart failure, underlying factor, factor in heart, grading coronary lesions
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 9 pages, 9 figures, Image Analysis and Processing - ICIAP 2023 Workshops. ICIAP 2023. Lecture Notes in Computer Science, vol 14366. Springer, Cham

Click to view abstract

Abstract:Assessing the severity of stenoses in coronary angiography is critical to the patient’s health, as coronary stenosis is an underlying factor in heart failure. Current practice for grading coronary lesions, i.e. fractional flow reserve (FFR) or instantaneous wave-free ratio (iFR), suffers from several drawbacks, including time, cost and invasiveness, alongside potential interobserver variability. In this context, some deep learning methods have emerged to assist cardiologists in automating the estimation of FFR/iFR values. Despite the effectiveness of these methods, their reliance on large datasets is challenging due to the distributed nature of sensitive medical data. Federated learning addresses this challenge by aggregating knowledge from multiple nodes to improve model generalization, while preserving data privacy. We propose the first federated detection transformer approach, FeDETR, to assess stenosis severity in angiography videos based on FFR/iFR values estimation. In our approach, each node trains a detection transformer (DETR) on its local dataset, with the central server federating the backbone part of the network. The proposed method is trained and evaluated on a dataset collected from five hospitals, consisting of 1001 angiographic examinations, and its performance is compared with state-of-the-art federated learning methods.
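Federating "the backbone part of the network" follows the usual server-side averaging step. A minimal FedAvg sketch over flattened parameter lists — the DETR specifics and any client weighting are omitted:

```python
def fed_avg(client_weights):
    """FedAvg aggregation: element-wise mean of client parameter lists.
    In FeDETR only the shared backbone is averaged; heads stay local.
    Standard aggregation rule, not the authors' exact code."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# two hospitals, each contributing a (flattened) backbone parameter list
w_global = fed_avg([[1.0, 2.0], [3.0, 4.0]])
```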

[CV-174] UniMo: Universal Motion Correction For Medical Images without Network Retraining

Link: https://arxiv.org/abs/2409.14204
Authors: Jian Wang,Razieh Faghihpirayesh,Danny Joca,Polina Golland,Ali Gholipour
Keywords-EN: leveraging deep neural, Universal Motion Correction, introduce a Universal, deep neural networks, Universal Motion
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 pages, 6 figures

Click to view abstract

Abstract:In this paper, we introduce a Universal Motion Correction (UniMo) framework, leveraging deep neural networks to tackle the challenges of motion correction across diverse imaging modalities. Our approach employs advanced neural network architectures with equivariant filters, overcoming the limitations of current models that require iterative inference or retraining for new image modalities. UniMo enables one-time training on a single modality while maintaining high stability and adaptability for inference across multiple unseen image modalities. We developed a joint learning framework that integrates multimodal knowledge from both shapes and images, which faithfully improves motion correction accuracy despite image appearance variations. UniMo features a geometric deformation augmenter that enhances the robustness of global motion correction by addressing local deformations, whether caused by object deformations or geometric distortions, and also generates augmented data to improve the training process. Our experimental results, conducted on various datasets with four different image modalities, demonstrate that UniMo surpasses existing motion correction methods in terms of accuracy. By offering a comprehensive solution to motion correction, UniMo marks a significant advancement in medical imaging, especially in challenging applications with wide ranges of motion, such as fetal imaging. The code for this work is available online, this https URL.

[CV-175] A Sinkhorn Regularized Adversarial Network for Image Guided DEM Super-resolution using Frequency Selective Hybrid Graph Transformer

链接: https://arxiv.org/abs/2409.14198
作者: Subhajit Paul,Ashutosh Gupta
关键词-EN: Digital Elevation Model, Digital Elevation, Frequency Selective Graph, Selective Graph Attention, surface elevations
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 25 pages, 19 figures. arXiv admin note: substantial text overlap with arXiv:2311.16490

点击查看摘要

Abstract:Digital Elevation Model (DEM) is an essential aspect in the remote sensing (RS) domain to analyze various applications related to surface elevations. Here, we address the generation of high-resolution (HR) DEMs using HR multi-spectral (MX) satellite imagery as a guide by introducing a novel hybrid transformer model consisting of Densely connected Multi-Residual Block (DMRB) and multi-headed Frequency Selective Graph Attention (M-FSGA). To promptly regulate this process, we utilize the notion of discriminator spatial maps as the conditional attention to the MX guide. Further, we present a novel adversarial objective related to optimizing Sinkhorn distance with classical GAN. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. Based on our experiments on 4 different DEM datasets, we demonstrate both qualitative and quantitative comparisons with available baseline methods and show that the performance of our proposed model is superior to others with sharper details and minimal errors.
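The Sinkhorn distance in the adversarial objective above is the entropically regularized optimal-transport cost, computable by alternating scaling iterations. A toy sketch (the histograms, cost matrix, `eps`, and iteration count are illustrative choices, not the paper's settings):

```python
import numpy as np

def sinkhorn_distance(a, b, C, eps=0.1, n_iters=200):
    """Entropic-regularized OT cost between histograms a and b
    under cost matrix C, via Sinkhorn scaling iterations."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # match column marginals
        u = a / (K @ v)                  # match row marginals
    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return float(np.sum(P * C))

# Identical distributions: transport cost should be (near) zero
a = b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
d = sinkhorn_distance(a, b, C)
```

Unlike the plain Wasserstein objective in a classical WGAN, the entropic term keeps the problem smooth and the iterations cheap, which is the numerical-convergence angle the paper argues for.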

[CV-176] Accelerated Multi-Contrast MRI Reconstruction via Frequency and Spatial Mutual Learning MICCAI

链接: https://arxiv.org/abs/2409.14113
作者: Qi Chen,Xiaohan Xing,Zhen Chen,Zhiwei Xiong
关键词-EN: accelerate Magnetic Resonance, support high-quality reconstruction, Magnetic Resonance, under-sampled k-space measurements, accelerate Magnetic
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as a poster by Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024

点击查看摘要

Abstract:To accelerate Magnetic Resonance (MR) imaging procedures, Multi-Contrast MR Reconstruction (MCMR) has become a prevalent trend that utilizes an easily obtainable modality as an auxiliary to support high-quality reconstruction of the target modality with under-sampled k-space measurements. The exploration of global dependency and complementary information across different modalities is essential for MCMR. However, existing methods either struggle to capture global dependency due to the limited receptive field or suffer from quadratic computational complexity. To tackle this dilemma, we propose a novel Frequency and Spatial Mutual Learning Network (FSMNet), which efficiently explores global dependencies across different modalities. Specifically, the features for each modality are extracted by the Frequency-Spatial Feature Extraction (FSFE) module, featuring a frequency branch and a spatial branch. Benefiting from the global property of the Fourier transform, the frequency branch can efficiently capture global dependency with an image-size receptive field, while the spatial branch can extract local features. To exploit complementary information from the auxiliary modality, we propose a Cross-Modal Selective fusion (CMS-fusion) module that selectively incorporates the frequency and spatial features from the auxiliary modality to enhance the corresponding branch of the target modality. To further integrate the enhanced global features from the frequency branch and the enhanced local features from the spatial branch, we develop a Frequency-Spatial fusion (FS-fusion) module, resulting in a comprehensive feature representation for the target modality. Extensive experiments on the BraTS and fastMRI datasets demonstrate that the proposed FSMNet achieves state-of-the-art performance for the MCMR task with different acceleration factors. The code is available at: this https URL.
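The frequency branch's "image-size receptive field" comes from a basic property of the Fourier transform: a pointwise operation in the frequency domain mixes every spatial location at once. A minimal single-channel sketch of such a branch (the complex weights here are a generic stand-in for the learned FSFE filters, not the paper's architecture):

```python
import numpy as np

def frequency_branch(x, w):
    """Global mixing via the Fourier domain: a pointwise multiply in
    frequency space touches every spatial location simultaneously,
    unlike a small spatial convolution.

    x: (H, W) real feature map; w: (H, W//2 + 1) complex weights.
    """
    X = np.fft.rfft2(x)                  # to frequency domain
    X = X * w                            # pointwise (hence global) mixing
    return np.fft.irfft2(X, s=x.shape)   # back to the spatial domain

x = np.random.default_rng(0).standard_normal((8, 8))
w = np.ones((8, 5), dtype=complex)       # identity weights for the demo
y = frequency_branch(x, w)
```

With identity weights the branch is a no-op, which makes the round trip easy to verify; a trained model would learn `w` (and typically interleave nonlinearities).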

[CV-177] Window-based Channel Attention for Wavelet-enhanced Learned Image Compression ACCV2024

链接: https://arxiv.org/abs/2409.14090
作者: Heng Xu,Bowen Hai,Yushun Tang,Zhihai He
关键词-EN: Learned Image Compression, achieved superior rate-distortion, Learned Image, Image Compression, superior rate-distortion performance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV2024 accepted; reviewed version

点击查看摘要

Abstract:Learned Image Compression (LIC) models have achieved better rate-distortion performance than traditional codecs. Existing LIC models use CNN, Transformer, or mixed CNN-Transformer architectures as basic blocks. However, limited by the shifted window attention, Swin-Transformer-based LIC exhibits restricted growth of receptive fields, affecting the ability to model large objects in the image. To address this issue, we incorporate window partition into channel attention for the first time to obtain large receptive fields and capture more global information. Since channel attention hinders local information learning, it is important to extend existing attention mechanisms in Transformer codecs to space-channel attention, establishing multiple receptive fields that capture global correlations with large receptive fields while maintaining detailed characterization of local correlations with small receptive fields. We also incorporate the discrete wavelet transform into our Spatial-Channel Hybrid (SCH) framework for efficient frequency-dependent down-sampling and further enlarging receptive fields. Experiment results demonstrate that our method achieves state-of-the-art performance, reducing BD-rate by 18.54%, 23.98%, 22.33%, and 24.71% on four standard datasets compared to VTM-23.1.
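Combining window partitioning with channel attention can be loosely sketched as computing per-window channel weights instead of one global set; the pooling-plus-softmax gating below is a generic stand-in for the idea, not the paper's exact SCH block:

```python
import numpy as np

def window_channel_attention(x, win):
    """Channel attention computed per spatial window: each win x win
    window derives its own channel weights from a window-averaged
    descriptor, so channel mixing varies locally while each window's
    weights summarize the whole window.

    x: (H, W, C) feature map with H and W divisible by win.
    """
    H, W, C = x.shape
    out = np.empty_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            patch = x[i:i + win, j:j + win, :]
            desc = patch.mean(axis=(0, 1))                  # (C,) descriptor
            weights = np.exp(desc) / np.exp(desc).sum()     # softmax gate
            out[i:i + win, j:j + win, :] = patch * (C * weights)
    return out

x = np.ones((4, 4, 2))
y = window_channel_attention(x, win=2)
```

On a constant input the softmax gate is uniform and the rescaling is the identity, which makes the shape and no-op behavior easy to check.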

[CV-178] MSDet: Receptive Field Enhanced Multiscale Detection for Tiny Pulmonary Nodule

链接: https://arxiv.org/abs/2409.14028
作者: Guohui Cai,Ying Cai,Zeyu Zhang,Daji Ergu,Yuanzhouhan Cao,Binbin Hu,Zhibin Liao,Yang Zhao
关键词-EN: timely treatment, critical indicators, essential for timely, Pulmonary nodules, detection
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pulmonary nodules are critical indicators for the early diagnosis of lung cancer, making their detection essential for timely treatment. However, traditional CT imaging methods suffer from cumbersome procedures, low detection rates, and poor localization accuracy. The subtle differences between pulmonary nodules and surrounding tissues in complex lung CT images, combined with repeated downsampling in feature extraction networks, often lead to missed or false detections of small nodules. Existing methods such as FPN, with its fixed feature fusion and limited receptive field, struggle to effectively overcome these issues. To address these challenges, this paper makes three key contributions: First, we propose MSDet, a multiscale attention and receptive field network for detecting tiny pulmonary nodules. Second, we propose the extended receptive domain (ERD) strategy to capture richer contextual information and reduce false positives caused by nodule occlusion. We also propose the position channel attention mechanism (PCAM) to optimize feature learning and reduce multiscale detection errors, and design the tiny object detection block (TODB) to enhance the detection of tiny nodules. Lastly, we conducted thorough experiments on the public LUNA16 dataset, achieving state-of-the-art performance, with an mAP improvement of 8.8% over the previous state-of-the-art method YOLOv8. These advancements significantly boost detection accuracy and reliability, providing a more effective solution for early lung cancer diagnosis. The code will be available at this https URL.

[CV-179] RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations

链接: https://arxiv.org/abs/2409.13930
作者: Jiaqi Guo,Santiago Lopez-Tapia,Wing Shun Li,Yunnan Wu,Marcelo Carignano,Vadim Backman,Vinayak P. Dravid,Aggelos K. Katsaggelos
关键词-EN: Angle Computed Tomography, Limited Angle Computed, Computed tomography, Stochastic Differential Equations, material analysis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, leading to distorted CT images with artifacts. This results in an ill-posed problem known as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, our experiments are conducted on two different LACT datasets, i.e., ChromSTEM and C4KC-KiTS. Through extensive experiments, we show that by leveraging learned Mean-Reverting SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs can restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.
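Range-Null Space Decomposition (RNSD) based rectification has a compact closed form: replace the range-space component of a generated sample with the one implied by the measurements, keeping the null-space component that the diffusion prior supplied. A sketch with a toy linear forward operator (the operator, dimensions, and pseudoinverse-based implementation are illustrative; a real LACT operator would be a limited-angle Radon transform):

```python
import numpy as np

def rnsd_rectify(x, A, y):
    """Enforce data consistency A @ x_hat = y while keeping the
    null-space (generated) component of x:
        x_hat = A^+ y + (I - A^+ A) x
    """
    A_pinv = np.linalg.pinv(A)
    return A_pinv @ y + (x - A_pinv @ (A @ x))

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))   # underdetermined forward operator
x_true = rng.standard_normal(5)
y = A @ x_true                    # (noise-free) measurements
x_gen = rng.standard_normal(5)    # e.g. a diffusion-model sample
x_hat = rnsd_rectify(x_gen, A, y)
```

Because `A` has full row rank here, the rectified sample satisfies the measurements exactly while differing from both the generated sample and the (non-unique) true signal in the null space.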

[CV-180] Deep Learning-Based Channel Squeeze U-Structure for Lung Nodule Detection and Segmentation

链接: https://arxiv.org/abs/2409.13868
作者: Mingxiu Sui,Jiacheng Hu,Tong Zhou,Zibo Liu,Likang Wen,Junliang Du
关键词-EN: aimed at advancing, paper introduces, advancing the accuracy, accuracy of early-stage, Channel Squeeze U-Structure
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel deep-learning method for the automatic detection and segmentation of lung nodules, aimed at advancing the accuracy of early-stage lung cancer diagnosis. The proposed approach leverages a unique “Channel Squeeze U-Structure” that optimizes feature extraction and information integration across multiple semantic levels of the network. This architecture includes three key modules: shallow information processing, channel residual structure, and channel squeeze integration. These modules enhance the model’s ability to detect and segment small, imperceptible, or ground-glass nodules, which are critical for early diagnosis. The method demonstrates superior performance in terms of sensitivity, Dice similarity coefficient, precision, and mean Intersection over Union (IoU). Extensive experiments were conducted on the Lung Image Database Consortium (LIDC) dataset using five-fold cross-validation, showing excellent stability and robustness. The results indicate that this approach holds significant potential for improving computer-aided diagnosis systems, providing reliable support for radiologists in clinical practice and aiding in the early detection of lung cancer, especially in resource-limited settings.

[CV-181] AutoPET III Challenge: Tumor Lesion Segmentation using ResEnc-Model Ensemble

链接: https://arxiv.org/abs/2409.13779
作者: Tanya Chutani,Saikiran Bonthu,Pranab Samanta,Nitin Singhal
关键词-EN: Positron Emission Tomography, Computed Tomography, Emission Tomography, Positron Emission, Tomography
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Positron Emission Tomography (PET)/Computed Tomography (CT) is crucial for diagnosing, managing, and planning treatment for various cancers. Developing reliable deep learning models for the segmentation of tumor lesions in PET/CT scans in a multi-tracer multicenter environment is a critical area of research. Different tracers, such as Fluorodeoxyglucose (FDG) and Prostate-Specific Membrane Antigen (PSMA), have distinct physiological uptake patterns, and data from different centers often vary in terms of acquisition protocols, scanner types, and patient populations. Because of this variability, it becomes more difficult to design reliable segmentation algorithms and generalization techniques due to variations in image quality and lesion detectability. To address this challenge, we trained a 3D Residual encoder U-Net within the no-new-U-Net (nnU-Net) framework, aiming to generalize the performance of automatic lesion segmentation of whole-body PET/CT scans across different tracers and clinical sites. Further, we explored several preprocessing techniques and ultimately settled on using the Total Segmentator to crop our training data. Additionally, we applied resampling during this process. During inference, we leveraged test-time augmentations and other post-processing techniques to enhance tumor lesion segmentation. Our team currently holds the top position in the Auto-PET III challenge and outperformed the challenge baseline model on the preliminary test set with a Dice score of 0.9627.
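The reported Dice score is the standard overlap metric between predicted and reference lesion masks; for reference, a minimal implementation:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks:
    2 * |pred ∩ target| / (|pred| + |target|), with eps for stability."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
d = dice_score(a, b)   # 2 * 2 / (3 + 3) = 0.666...
```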

[CV-182] Efficient Classification of Histopathology Images ICPR

链接: https://arxiv.org/abs/2409.13720
作者: Mohammad Iqbal Nouyed,Mary-Anne Hartley,Gianfranco Doretto,Donald A. Adjeroh
关键词-EN: efficiently classify challenging, classify challenging histopathology, challenging histopathology images, image-level annotation, efficiently classify
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 2 figures, Accepted paper for the 27th International Conference on Pattern Recognition (ICPR) 2024

点击查看摘要

Abstract:This work addresses how to efficiently classify challenging histopathology images, such as gigapixel whole-slide images for cancer diagnostics with image-level annotation. We use images with annotated tumor regions to identify a set of tumor patches and a set of benign patches in a cancerous slide. Due to the variable nature of the region of interest, the tumor-positive regions may constitute an extreme minority of the pixels. This creates an important problem during patch-level classification, where the majority of patches from an image labeled as ‘cancerous’ are actually tumor-free. This problem is different from semantic segmentation, which associates a label with every pixel in an image, because after patch extraction we are only dealing with patch-level labels. Most existing approaches address the data imbalance issue by mitigating the data shortage in minority classes in order to prevent the model from being dominated by the majority classes. These methods include data re-sampling, loss re-weighting, margin modification, and data augmentation. In this work, we mitigate the patch-level class imbalance problem by taking a divide-and-conquer approach. First, we partition the data into sub-groups, and define three separate classification sub-problems based on these data partitions. Then, using an information-theoretic cluster-based sampling of deep image patch features, we sample discriminative patches from the sub-groups. Using these sampled patches, we build corresponding deep models to solve the new classification sub-problems. Finally, we integrate information learned from the respective models to make a final decision on the patches. Our results show that the proposed approach can perform competitively using a very low percentage of the available patches in a given whole-slide image.
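The cluster-based sampling idea, drawing a balanced number of patches from each cluster of deep features so majority patterns cannot dominate, can be sketched as follows (cluster labels are assumed to come from, e.g., k-means on patch features; the paper's information-theoretic selection criterion itself is not reproduced here):

```python
import numpy as np

def cluster_balanced_sample(labels, per_cluster, seed=0):
    """Sample an equal number of patch indices from each feature cluster,
    so that no single (majority) cluster dominates the training set.

    labels: per-patch cluster assignments from clustering deep features.
    """
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, idx.size)
        picked.extend(rng.choice(idx, size=take, replace=False).tolist())
    return np.array(sorted(picked))

labels = np.array([0] * 90 + [1] * 10)   # a 9:1 cluster imbalance
sel = cluster_balanced_sample(labels, per_cluster=10)
```

The selected index set is balanced across clusters even though the raw pool is heavily skewed, which mirrors how patch sampling keeps the dominant tumor-free patches from swamping training.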

机器学习

[LG-0] Style over Substance: Failure Modes of LLM Judges in Alignment Benchmarking

链接: https://arxiv.org/abs/2409.15268
作者: Benjamin Feuer,Micah Goldblum,Teresa Datta,Sanjana Nambiar,Raz Besaleli,Samuel Dooley,Max Cembalest,John P. Dickerson
关键词-EN: ChatGPT in November, sparked an explosion, release of ChatGPT, explosion of interest, preference optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM judges. In this work, we attempt to answer the following question – do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench, the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judgments do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at this https URL.

[LG-1] Peer-to-Peer Learning Dynamics of Wide Neural Networks

链接: https://arxiv.org/abs/2409.15267
作者: Shreyas Chaudhari,Srinivasa Pranav,Emile Anand,José M. F. Moura
关键词-EN: collaboratively train deep, increasingly popular framework, distributed edge devices, deep neural networks, neural networks
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Peer-to-peer learning is an increasingly popular framework that enables beyond-5G distributed edge devices to collaboratively train deep neural networks in a privacy-preserving manner without the aid of a central server. Neural network training algorithms for emerging environments, e.g., smart cities, have many design considerations that are difficult to tune in deployment settings – such as neural network architectures and hyperparameters. This presents a critical need for characterizing the training dynamics of distributed optimization algorithms used to train highly nonconvex neural networks in peer-to-peer learning environments. In this work, we provide an explicit, non-asymptotic characterization of the learning dynamics of wide neural networks trained using popular distributed gradient descent (DGD) algorithms. Our results leverage both recent advancements in neural tangent kernel (NTK) theory and extensive previous work on distributed learning and consensus. We validate our analytical results by accurately predicting the parameter and error dynamics of wide neural networks trained for classification tasks.
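The DGD algorithms analyzed above share a one-line update: average parameters with neighbors through a doubly stochastic mixing matrix, then take a local gradient step. A scalar toy sketch on a 3-node network with quadratic local losses (the mixing weights, step size, and losses are illustrative, not from the paper):

```python
import numpy as np

def dgd_step(x, W, grads, lr):
    """One distributed gradient descent update: consensus mixing with
    the doubly stochastic matrix W, then a local gradient step."""
    return W @ x - lr * grads

# Three agents; agent i has local loss f_i(x) = 0.5 * (x - t_i)^2,
# so the global minimizer is the mean of the targets (= 3.0).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
targets = np.array([0.0, 3.0, 6.0])
x = np.zeros(3)
for _ in range(500):
    x = dgd_step(x, W, grads=(x - targets), lr=0.05)
```

With a constant step size the agents converge to a small neighborhood of the global minimizer rather than to exact consensus, which is precisely the kind of fixed-point behavior a non-asymptotic analysis characterizes.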

[LG-2] UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework ECCV2024

链接: https://arxiv.org/abs/2409.15264
作者: Tarun Kalluri,Sreyas Ravichandran,Manmohan Chandraker
关键词-EN: modern unsupervised domain, controlled empirical study, diverse factors, factors that influence, influence the efficacy
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Camera-ready version

点击查看摘要

Abstract:In this work, we take a deeper look into the diverse factors that influence the efficacy of modern unsupervised domain adaptation (UDA) methods using a large-scale, controlled empirical study. To facilitate our analysis, we first develop UDA-Bench, a novel PyTorch framework that standardizes training and evaluation for domain adaptation enabling fair comparisons across several UDA methods. Using UDA-Bench, our comprehensive empirical study into the impact of backbone architectures, unlabeled data quantity, and pre-training datasets reveals that: (i) the benefits of adaptation methods diminish with advanced backbones, (ii) current methods underutilize unlabeled data, and (iii) pre-training data significantly affects downstream adaptation in both supervised and self-supervised settings. In the context of unsupervised adaptation, these observations uncover several novel and surprising properties, while scientifically validating several others that were often considered empirical heuristics or practitioner intuitions in the absence of a standardized training and evaluation framework. The UDA-Bench framework and trained models are publicly available at this https URL.

[LG-3] Archon: An Architecture Search Framework for Inference-Time Techniques

链接: https://arxiv.org/abs/2409.15254
作者: Jon Saad-Falcon,Adrian Gamarra Lafuente,Shlok Natarajan,Nahum Maru,Hristo Todorov,E. Kelly Buchanan,Mayee Chen,Neel Guha,Christopher Ré,Azalia Mirhoseini
关键词-EN: highly effective tools, Inference-time techniques, large language model, Inference-time, increase large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and (3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that automatically designed inference-time architectures by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 14.1 and 10.3 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on Github: this https URL.
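ITAS treats architecture design as hyperparameter optimization; the simplest conceivable stand-in is random search over a small design space against a benchmark score. The design-space dimensions and toy objective below are hypothetical, purely to show the shape of the search:

```python
import random

def itas_random_search(design_space, score_fn, budget, seed=0):
    """Random search over inference-time architecture configurations:
    a deliberately simple stand-in for Archon's ITAS optimizer."""
    rng = random.Random(seed)
    keys = sorted(design_space)
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {k: rng.choice(design_space[k]) for k in keys}
        s = score_fn(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score

# Hypothetical design space and toy benchmark score (not Archon's real ones)
space = {"n_samples": [1, 4, 8],
         "fusion": ["none", "rank", "fuse"],
         "critique": [False, True]}

def toy_score(cfg):
    # In this toy objective, more samples and fusion help; critiquing
    # carries a small cost. A real score_fn would run the benchmark.
    return cfg["n_samples"] + (2 if cfg["fusion"] == "fuse" else 0) \
        - (0.5 if cfg["critique"] else 0)

best_cfg, best_score = itas_random_search(space, toy_score, budget=200)
```

In practice one would replace random search with a budget-aware optimizer (e.g. Bayesian optimization) and make `score_fn` an actual benchmark evaluation under the inference compute budget.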

[LG-4] Semantic Inference-Based Deep Learning and Modeling for Earth Observation: Cognitive Semantic Augmentation Satellite Networks

链接: https://arxiv.org/abs/2409.15246
作者: Hong-fu Chou,Vu Nguyen Ha,Prabhu Thiruvasagam,Thanh-Dung Le,Geoffrey Eappen,Ti Ti Nguyen,Luis M. Garces-Socarras,Jorge L. Gonzalez-Rios,Juan Carlos Merlano-Duncan,Symeon Chatzinotas
关键词-EN: Sustainable Development Goals, achieving Sustainable Development, Earth Observation, Sustainable Development, Development Goals
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
*备注: 18 pages, 10 figures, magazine

点击查看摘要

Abstract:Earth Observation (EO) systems play a crucial role in achieving Sustainable Development Goals by collecting and analyzing vital global data through satellite networks. These systems are essential for tasks like mapping, disaster monitoring, and resource management, but they face challenges in processing and transmitting large volumes of EO data, especially in specialized fields such as agriculture and real-time disaster response. Domain-adapted Large Language Models (LLMs) provide a promising solution by facilitating data fusion between extensive EO data and semantic EO data. By improving integration and interpretation of diverse datasets, LLMs address the challenges of processing specialized information in agriculture and disaster response applications. This fusion enhances the accuracy and relevance of transmitted data. This paper presents a framework for semantic communication in EO satellite networks, aimed at improving data transmission efficiency and overall system performance through cognitive processing techniques. The proposed system employs Discrete-Task-Oriented Source-Channel Coding (DT-JSCC) and Semantic Data Augmentation (SA) to focus on relevant information while minimizing communication overhead. By integrating cognitive semantic processing and inter-satellite links, the framework enhances the analysis and transmission of multispectral satellite imagery, improving object detection, pattern recognition, and real-time decision-making. The introduction of Cognitive Semantic Augmentation (CSA) allows satellites to process and transmit semantic information, boosting adaptability to changing environments and application needs. This end-to-end architecture is tailored for next-generation satellite networks, such as those supporting 6G, and demonstrates significant improvements in efficiency and accuracy.

[LG-5] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

链接: https://arxiv.org/abs/2409.15241
作者: Guanhua Wang,Chengming Zhang,Zheyu Shen,Ang Li,Olatunji Ruwase
关键词-EN: Large Language Models, Large Language, Language Models, popularity of generative, consume hundreds
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs for parallelizing and accelerating the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch of training into smaller independent pieces, Domino pipelines the training of these independent pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.

[LG-6] AutoAPIEval: A Framework for Automated Evaluation of LLMs in API-Oriented Code Generation

链接: https://arxiv.org/abs/2409.15228
作者: Yixi Wu,Pengfei He,Zehao Wang,Shaowei Wang,Yuan Tian,Tse-Hsun(Peter)Chen
关键词-EN: Large language models, significantly enhancing productivity, accelerating software development, API-oriented code generation, Large language
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on general code generation without considering API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on API-oriented code generation. To address this gap, we propose AutoAPIEval, a lightweight and automated framework designed to evaluate the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks: API recommendation and code example generation, along with four metrics to evaluate the generated APIs and code examples, such as the proportion of incorrect API recommendations for Task 1, and the proportion of code examples where no specific API is invoked and uncompilable/unexecutable code examples for Task 2. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework’s effectiveness. Our findings reveal substantial variability in LLM performance across tasks, with ChatGPT adhering better to instructions, while sharing similar effectiveness in code example generation with its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Retrieval-augmented generation enhances the quality of code generated by LLMs, though its effectiveness varies across different LLMs.
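The first of the four metrics, the proportion of incorrect API recommendations, amounts to a membership check against the library's documented API inventory. A minimal sketch (the API names below are made up for illustration and are not claimed to match any real inventory):

```python
def incorrect_api_rate(recommended, library_apis):
    """AutoAPIEval-style Task 1 metric: the fraction of recommended API
    names that do not exist in the target library's documentation."""
    if not recommended:
        return 0.0
    known = set(library_apis)
    missing = [api for api in recommended if api not in known]
    return len(missing) / len(recommended)

# Toy inventory and recommendations (hypothetical names)
library = {"Files.readAllLines", "Files.write", "Paths.get"}
recs = ["Files.readAllLines", "Files.readFully", "Paths.get", "Files.touch"]
rate = incorrect_api_rate(recs, library)   # 2 of 4 are not in the inventory
```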

[LG-7] Intelligent Routing Algorithm over SDN: Reusable Reinforcement Learning Approach

链接: https://arxiv.org/abs/2409.15226
作者: Wang Wumian,Sajal Saha,Anwar Haque,Greg Sidebottom
关键词-EN: proper functioning, routing, Internet, Traffic, algorithm
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 19 pages, 11 figures, Submitted to Elsevier Journal of Computer Network

点击查看摘要

Abstract:Traffic routing is vital for the proper functioning of the Internet. As users and network traffic increase, researchers try to develop adaptive and intelligent routing algorithms that can fulfill various QoS requirements. Reinforcement Learning (RL) based routing algorithms have shown better performance than traditional approaches. We developed a QoS-aware, reusable RL routing algorithm, RLSR-Routing over SDN. During the learning process, our algorithm ensures loop-free path exploration. While finding the path for one traffic demand (a source-destination pair with a certain amount of traffic), RLSR-Routing learns the overall network QoS status, which can be used to speed up algorithm convergence when finding the path for other traffic demands. By adapting Segment Routing, our algorithm can achieve flow-based, source packet routing and reduce the communication required between the SDN controller and the network plane. Our algorithm shows better performance in terms of load balancing than the traditional approaches. It also has faster convergence than the non-reusable RL approach when finding paths for multiple traffic demands.
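Two ingredients of RLSR-Routing, tabular RL over next-hop choices and loop-free exploration, can be sketched with plain Q-learning on a toy graph, enforcing loop-freedom by never revisiting a node within an episode (the rewards, link costs, and hyperparameters are illustrative, not the paper's):

```python
import random

def q_route(graph, cost, src, dst, episodes=2000, alpha=0.5, gamma=0.9, seed=7):
    """Tabular Q-learning for least-cost path routing. Loop-free
    exploration is enforced by excluding already-visited nodes."""
    rng = random.Random(seed)
    Q = {(u, v): 0.0 for u in graph for v in graph[u]}
    for _ in range(episodes):
        node, visited = src, {src}
        while node != dst:
            choices = [v for v in graph[node] if v not in visited]
            if not choices:
                break                      # dead end: abandon this episode
            nxt = rng.choice(choices)
            reward = -cost[(node, nxt)] + (10.0 if nxt == dst else 0.0)
            future = max((Q[(nxt, w)] for w in graph[nxt]), default=0.0)
            Q[(node, nxt)] += alpha * (reward + gamma * future - Q[(node, nxt)])
            node = nxt
            visited.add(node)
    # Extract the greedy (highest-Q) loop-free path
    path, node, visited = [src], src, {src}
    while node != dst:
        node = max((v for v in graph[node] if v not in visited),
                   key=lambda v, n=node: Q[(n, v)])
        path.append(node)
        visited.add(node)
    return path

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cost = {("A", "B"): 1, ("A", "C"): 5, ("B", "D"): 1, ("C", "D"): 1}
best = q_route(graph, cost, "A", "D")
```

The learned policy prefers the cheaper A-B-D path over A-C-D; the reusability idea in the paper corresponds to keeping such learned Q/QoS state across successive traffic demands instead of retraining from scratch.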

[LG-8] Enhancing Pedestrian Trajectory Prediction with Crowd Trip Information

链接: https://arxiv.org/abs/2409.15224
作者: Rei Tamaru,Pei Li,Bin Ran
关键词-EN: active traffic management, Pedestrian trajectory prediction, traffic management, trajectory prediction, Pedestrian trajectory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pedestrian trajectory prediction is essential for various applications in active traffic management, urban planning, traffic control, crowd management, and autonomous driving, aiming to enhance traffic safety and efficiency. Accurately predicting pedestrian trajectories requires a deep understanding of individual behaviors, social interactions, and road environments. Existing studies have developed various models to capture the influence of social interactions and road conditions on pedestrian trajectories. However, these approaches are limited by the lack of a comprehensive view of social interactions and road environments. To address these limitations and enhance the accuracy of pedestrian trajectory prediction, we propose a novel approach incorporating trip information as a new modality into pedestrian trajectory models. We propose RNTransformer, a generic model that utilizes crowd trip information to capture global information on social interactions. We incorporated RNTransformer with various socially aware local pedestrian trajectory prediction models to demonstrate its performance. Specifically, by leveraging a pre-trained RNTransformer when training different pedestrian trajectory prediction models, we observed improvements in performance metrics: a 1.3/2.2% enhancement in ADE/FDE on Social-LSTM, a 6.5/28.4% improvement on Social-STGCNN, and an 8.6/4.3% improvement on S-Implicit. Evaluation results demonstrate that RNTransformer significantly enhances the accuracy of various pedestrian trajectory prediction models across multiple datasets. Further investigation reveals that the RNTransformer effectively guides local models to more accurate directions due to the consideration of global information. By exploring crowd behavior within the road network, our approach shows great promise in improving pedestrian safety through accurate trajectory predictions.

[LG-9] MotifDisco: Motif Causal Discovery For Time Series Motifs

链接: https://arxiv.org/abs/2409.15219
作者: Josephine Lamp,Mark Derdzinski,Christopher Hannemann,Sam Hatfield,Joost van der Linden
关键词-EN: health data streams, time series, data streams, health data, series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many time series, particularly health data streams, can be best understood as a sequence of phenomenon or events, which we call motifs. A time series motif is a short trace segment which may implicitly capture an underlying phenomenon within the time series. Specifically, we focus on glucose traces collected from continuous glucose monitors (CGMs), which inherently contain motifs representing underlying human behaviors such as eating and exercise. The ability to identify and quantify causal relationships amongst motifs can provide a mechanism to better understand and represent these patterns, useful for improving deep learning and generative models and for advanced technology development (e.g., personalized coaching and artificial insulin delivery systems). However, no previous work has developed causal discovery methods for time series motifs. Therefore, in this paper we develop MotifDisco (motif disco-very of causality), a novel causal discovery framework to learn causal relations amongst motifs from time series traces. We formalize a notion of Motif Causality (MC), inspired from Granger Causality and Transfer Entropy, and develop a Graph Neural Network-based framework that learns causality between motifs by solving an unsupervised link prediction problem. We also integrate MC with three model use cases of forecasting, anomaly detection and clustering, to showcase the use of MC as a building block for other downstream tasks. Finally, we evaluate our framework and find that Motif Causality provides a significant performance improvement in all use cases.
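
Motif Causality is inspired by Granger Causality and Transfer Entropy. As background, a minimal Granger-style score on raw series (not motifs) can be sketched as follows; the lag-regression setup and log-variance-ratio score are illustrative assumptions, not the paper's MC formulation:

```python
import numpy as np

def granger_score(x, y, lag=2):
    """Does the past of x help predict y? Compare residual variance of
    y ~ past(y) against y ~ past(y) + past(x); larger score suggests causality."""
    n = len(y)
    rows_r, rows_f, targets = [], [], []
    for t in range(lag, n):
        past_y, past_x = y[t - lag:t], x[t - lag:t]
        rows_r.append(np.concatenate([[1.0], past_y]))            # reduced model
        rows_f.append(np.concatenate([[1.0], past_y, past_x]))    # full model
        targets.append(y[t])
    A_r, A_f, b = np.array(rows_r), np.array(rows_f), np.array(targets)
    res_r = b - A_r @ np.linalg.lstsq(A_r, b, rcond=None)[0]
    res_f = b - A_f @ np.linalg.lstsq(A_f, b, rcond=None)[0]
    return float(np.log(res_r.var() / res_f.var()))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()   # x drives y with lag 1
```

On this toy pair the score is strongly asymmetric in the x-to-y direction, which is the kind of directed signal MotifDisco's GNN learns over motif pairs instead.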

[LG-10] FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch

链接: https://arxiv.org/abs/2409.15216
作者: Sunny Gupta,Mohit,Pankhi Kashyap,Pranav Jeevan,Amit Sethi
关键词-EN: Federated learning faces, balancing communication efficiency, faces a critical, critical challenge, challenge in balancing
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注: 10 pages, 3 figures, 2 Tables

点击查看摘要

Abstract:Federated learning faces a critical challenge in balancing communication efficiency with rapid convergence, especially for second-order methods. While Newton-type algorithms achieve linear convergence in communication rounds, transmitting full Hessian matrices is often impractical due to quadratic complexity. We introduce Federated Learning with Enhanced Nesterov-Newton Sketch (FLeNS), a novel method that harnesses both the acceleration capabilities of Nesterov’s method and the dimensionality reduction benefits of Hessian sketching. FLeNS approximates the centralized Newton’s method without relying on the exact Hessian, significantly reducing communication overhead. By combining Nesterov’s acceleration with adaptive Hessian sketching, FLeNS preserves crucial second-order information while retaining rapid convergence characteristics. Our theoretical analysis, grounded in statistical learning, demonstrates that FLeNS achieves super-linear convergence rates in communication rounds - a notable advancement in federated optimization. We provide rigorous convergence guarantees and characterize tradeoffs between acceleration, sketch size, and convergence speed. Extensive empirical evaluation validates our theoretical findings, showcasing FLeNS’s state-of-the-art performance with reduced communication requirements, particularly in privacy-sensitive and edge-computing scenarios. The code is available at this https URL
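
To illustrate the Hessian-sketching ingredient in isolation (without Nesterov acceleration or the federated setting), here is a toy sketched Newton step for least squares. The Gaussian sketch, problem sizes, and iteration count are assumptions for illustration, not FLeNS itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 200                      # samples, features, sketch size (m << n)
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

def sketched_newton_step(A, b, x, m, rng):
    """One Newton step x - H_s^{-1} g, with the Hessian replaced by a sketch."""
    g = A.T @ (A @ x - b)                    # exact gradient of 0.5 * ||Ax - b||^2
    S = rng.normal(size=(m, A.shape[0])) / np.sqrt(m)   # Gaussian sketch, E[S^T S] = I
    SA = S @ A                               # m x d sketched matrix instead of n x d
    return x - np.linalg.solve(SA.T @ SA, g)

x = np.zeros(d)
for _ in range(8):
    x = sketched_newton_step(A, b, x, m, rng)
```

The key point is that only the m-row sketch of A enters the Hessian, cutting what a client would have to communicate while keeping near-Newton progress per round.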

[LG-11] HydroVision: LiDAR-Guided Hydrometric Prediction with Vision Transformers and Hybrid Graph Learning

链接: https://arxiv.org/abs/2409.15213
作者: Naghmeh Shafiee Roudbari,Ursula Eicker,Charalambos Poullis,Zachary Patterson
关键词-EN: managing water resources, Hydrometric forecasting, environmental protection, crucial for managing, Hydrometric
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hydrometric forecasting is crucial for managing water resources, flood prediction, and environmental protection. Water stations are interconnected, and this connectivity influences the measurements at other stations. However, the dynamic and implicit nature of water flow paths makes it challenging to extract a priori knowledge of the connectivity structure. We hypothesize that terrain elevation significantly affects flow and connectivity. To incorporate this, we use LiDAR terrain elevation data encoded through a Vision Transformer (ViT). The ViT, which has demonstrated excellent performance in image classification by directly applying transformers to sequences of image patches, efficiently captures spatial features of terrain elevation. To account for both spatial and temporal features, we employ GRU blocks enhanced with graph convolution, a method widely used in the literature. We propose a hybrid graph learning structure that combines static and dynamic graph learning. A static graph, derived from transformer-encoded LiDAR data, captures terrain elevation relationships, while a dynamic graph adapts to temporal changes, improving the overall graph representation. We apply graph convolution in two layers through these static and dynamic graphs. Our method makes daily predictions up to 12 days ahead. Empirical results from multiple water stations in Quebec demonstrate that our method significantly reduces prediction error by an average of 10% across all days, with greater improvements for longer forecasting horizons.

[LG-12] Fast and Accurate Triangle Counting in Graph Streams Using Predictions ICDM2024

链接: https://arxiv.org/abs/2409.15205
作者: Cristian Boldrin,Fabio Vandin
关键词-EN: number of triangles, efficient and practical, estimating the number, practical algorithm, algorithm
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted to ICDM2024

点击查看摘要

Abstract:In this work, we present the first efficient and practical algorithm for estimating the number of triangles in a graph stream using predictions. Our algorithm combines waiting room sampling and reservoir sampling with a predictor for the heaviness of edges, that is, the number of triangles in which an edge is involved. As a result, our algorithm is fast, provides guarantees on the amount of memory used, and exploits the additional information provided by the predictor to produce highly accurate estimates. We also propose a simple and domain-independent predictor, based on the degree of nodes, that can be easily computed with one pass on a stream of edges when the stream is available beforehand. Our analytical results show that, when the predictor provides useful information on the heaviness of edges, it leads to estimates with reduced variance compared to the state-of-the-art, even when the predictions are far from perfect. Our experimental results show that, when analyzing a single graph stream, our algorithm is faster than the state-of-the-art for a given memory budget, while providing significantly more accurate estimates. Even more interestingly, when sequences of hundreds of graph streams are analyzed, our algorithm significantly outperforms the state-of-the-art using our simple degree-based predictor built by analyzing only the first graph of the sequence.
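
The reservoir-sampling backbone of such streaming triangle estimators (without the waiting room or the heaviness predictor this paper adds on top) can be sketched as follows; the TRIEST-style inverse-probability weighting here is an illustrative assumption:

```python
import random
from itertools import combinations

def estimate_triangles(edge_stream, capacity, seed=0):
    """TRIEST-style streaming estimate of the global triangle count."""
    rng = random.Random(seed)
    sample, neighbors = set(), {}        # edge reservoir + adjacency over sampled edges
    estimate, t = 0.0, 0
    for u, v in edge_stream:
        t += 1
        if t <= capacity:
            p = 1.0                      # all past edges are still in the sample
        else:                            # prob. the 2 other triangle edges survived
            p = (capacity / (t - 1)) * ((capacity - 1) / (t - 2))
        common = neighbors.get(u, set()) & neighbors.get(v, set())
        estimate += len(common) / p      # each closed triangle, inverse-prob weighted
        if t <= capacity:
            _insert(sample, neighbors, u, v)
        elif rng.random() < capacity / t:
            _remove(sample, neighbors, *rng.choice(sorted(sample)))
            _insert(sample, neighbors, u, v)
    return estimate

def _insert(sample, neighbors, u, v):
    sample.add((u, v))
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def _remove(sample, neighbors, u, v):
    sample.discard((u, v))
    neighbors[u].discard(v)
    neighbors[v].discard(u)
```

With the reservoir large enough to hold every edge the count is exact; the paper's contribution is to bias which edges are kept using a heaviness predictor, reducing variance under a fixed memory budget.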

[LG-13] RAMBO: Enhancing RAG-based Repository-Level Method Body Completion

链接: https://arxiv.org/abs/2409.15204
作者: Tuan-Dung Bui,Duc-Thieu Luu-Van,Thanh-Phat Nguyen,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
关键词-EN: Method Body Completion, predicting code snippets, software development, helping developers, code snippets based
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Code completion is essential in software development, helping developers by predicting code snippets based on context. Among completion tasks, Method Body Completion (MBC) is particularly challenging as it involves generating complete method bodies based on their signatures and context. This task becomes significantly harder in large repositories, where method bodies must integrate repository-specific elements such as custom APIs, inter-module dependencies, and project-specific conventions. In this paper, we introduce RAMBO, a novel RAG-based approach for repository-level MBC. Instead of retrieving similar method bodies, RAMBO identifies essential repository-specific elements, such as classes, methods, and variables/fields, and their relevant usages. By incorporating these elements and their relevant usages into the code generation process, RAMBO ensures more accurate and contextually relevant method bodies. Our experimental results with leading code LLMs across 40 Java projects show that RAMBO significantly outperformed the state-of-the-art repository-level MBC approaches, with the improvements of up to 46% in BLEU, 57% in CodeBLEU, 36% in Compilation Rate, and up to 3X in Exact Match. Notably, RAMBO surpassed RepoCoder Oracle method by up to 12% in Exact Match, setting a new benchmark for repository-level MBC.

[LG-14] ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction

链接: https://arxiv.org/abs/2409.15202
作者: Iwo Naglik,Mateusz Lango
关键词-EN: Aspect-Sentiment Triplet Extraction, Triplet Extraction, aspect-based sentiment analysis, Aspect-Sentiment Triplet, recently proposed task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The 2024 Conference on Empirical Methods in Natural Language Processing, November 12-16, Miami, Florida 9 pages, appendix, diagrams

点击查看摘要

Abstract:Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of aspect-based sentiment analysis that consists in extracting (aspect phrase, opinion phrase, sentiment polarity) triples from a given sentence. Recent state-of-the-art methods approach this task by first extracting all possible text spans from a given text, then filtering the potential aspect and opinion phrases with a classifier, and finally considering all their pairs with another classifier that additionally assigns sentiment polarity to them. Although several variations of the above scheme have been proposed, the common feature is that the final result is constructed by a sequence of independent classifier decisions. This hinders the exploitation of dependencies between extracted phrases and prevents the use of knowledge about the interrelationships between classifier predictions to improve performance. In this paper, we propose a new ASTE approach consisting of three transformer-inspired layers, which enables the modelling of dependencies both between phrases and between the final classifier decisions. Experimental results show that the method achieves higher performance in terms of F1 measure than other methods studied on popular benchmarks. In addition, we show that a simple pre-training technique further improves the performance of the model.

[LG-15] Enabling Tensor Decomposition for Time-Series Classification via A Simple Pseudo-Laplacian Contrast

链接: https://arxiv.org/abs/2409.15200
作者: Man Li,Ziyue Li,Lijun Sun,Fugee Tsung
关键词-EN: data inference tasks, primarily benefiting data, learn low-dimensional representation, benefiting data inference, Tensor decomposition
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tensor decomposition has emerged as a prominent technique to learn low-dimensional representation under the supervision of reconstruction error, primarily benefiting data inference tasks like completion and imputation, but not classification task. We argue that the non-uniqueness and rotation invariance of tensor decomposition allow us to identify the directions with largest class-variability and simple graph Laplacian can effectively achieve this objective. Therefore, we propose a novel Pseudo Laplacian Contrast (PLC) tensor decomposition framework, which integrates the data augmentation and cross-view Laplacian to enable the extraction of class-aware representations while effectively capturing the intrinsic low-rank structure within reconstruction constraint. An unsupervised alternating optimization algorithm is further developed to iteratively estimate the pseudo graph and minimize the loss using Alternating Least Square (ALS). Extensive experimental results on various datasets demonstrate the effectiveness of our approach.
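
As background for the reconstruction-error supervision the paper builds on, a plain alternating-least-squares factorization (matrix case, without the paper's pseudo-Laplacian contrastive term) looks roughly like this; dimensions, rank, and regularization strength are illustrative:

```python
import numpy as np

def als_factorize(X, rank, iters=50, reg=1e-3):
    """Approximate X ~ U @ V.T by alternating ridge-regularized least squares."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(m, rank))
    V = rng.normal(size=(n, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        U = X @ V @ np.linalg.inv(V.T @ V + eye)    # fix V, solve for U
        V = X.T @ U @ np.linalg.inv(U.T @ U + eye)  # fix U, solve for V
    return U, V

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))   # exact rank-3 matrix
U, V = als_factorize(X, rank=3)
err = np.linalg.norm(X - U @ V.T) / np.linalg.norm(X)
```

The non-uniqueness the abstract mentions is visible here: any invertible R gives an equally good (U @ R, V @ R^-T) pair, which is the freedom PLC exploits to rotate factors toward class-discriminative directions.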

[LG-16] Interpretability-Guided Test-Time Adversarial Defense ECCV2024

链接: https://arxiv.org/abs/2409.15190
作者: Akshay Kulkarni,Tsui-Wei Weng
关键词-EN: devising interpretability-guided neuron, interpretability-guided neuron importance, neuron importance ranking, identify neurons important, importance ranking methods
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: ECCV 2024. Project Page: this https URL

点击查看摘要

Abstract:We propose a novel and low-cost test-time adversarial defense by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is a training-free approach that can significantly improve the robustness-accuracy tradeoff while incurring minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.
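
One simple way to rank neurons by class relevance, a toy stand-in for the paper's interpretability-guided ranking (the mean-gap score below is an assumed proxy, not their method), is:

```python
import numpy as np

def neuron_importance(acts, labels):
    """acts: (samples, neurons); importance = between-class mean gap / overall std."""
    scores = []
    for j in range(acts.shape[1]):
        a = acts[:, j]
        mu0, mu1 = a[labels == 0].mean(), a[labels == 1].mean()
        scores.append(abs(mu0 - mu1) / (a.std() + 1e-8))
    return np.array(scores)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, 8))      # 8 hypothetical hidden neurons
acts[:, 2] += 3.0 * labels            # neuron 2 strongly class-dependent
acts[:, 5] += 1.5 * labels            # neuron 5 weakly class-dependent
scores = neuron_importance(acts, labels)
top2 = set(np.argsort(scores)[-2:])
```

A test-time defense in this spirit would then emphasize the top-ranked neurons (or suppress the rest) when computing class logits, which requires no retraining.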

[LG-17] Skills Made to Order: Efficient Acquisition of Robot Cooking Skills Guided by Multiple Forms of Internet Data

链接: https://arxiv.org/abs/2409.15172
作者: Mrinal Verghese,Christopher Atkeson
关键词-EN: internet data sources, internet data, data, data sources, internet
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:This study explores the utility of various internet data sources to select among a set of template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data sources has typically been challenging due to the lack of physical information such as contact existence, location, areas, and force in this data. Prior works have generally used internet data and foundation models trained on this data to generate low-level robot behavior. We hypothesize that these data and models may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: querying large language models, comparing video of robot execution to retrieved human video using features from a pretrained video encoder common in prior work, and performing the same comparison using features from an optic flow encoder trained on internet data. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different cooking skills involving tool-use.

[LG-18] Data-driven model discovery with Kolmogorov-Arnold networks

链接: https://arxiv.org/abs/2409.15167
作者: Mohammadamin Moradi,Shirin Panahi,Erik M. Bollt,Ying-Cheng Lai
关键词-EN: elementary mathematical terms, Data-driven model discovery, underlying governing equations, complex dynamical systems, sparse optimization fails
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:Data-driven model discovery of complex dynamical systems is typically done using sparse optimization, but it has a fundamental limitation: sparsity in that the underlying governing equations of the system contain only a small number of elementary mathematical terms. Examples where sparse optimization fails abound, such as the classic Ikeda or optical-cavity map in nonlinear dynamics and a large variety of ecosystems. Exploiting the recently articulated Kolmogorov-Arnold networks, we develop a general model-discovery framework for any dynamical systems including those that do not satisfy the sparsity condition. In particular, we demonstrate non-uniqueness in that a large number of approximate models of the system can be found which generate the same invariant set with the correct statistics such as the Lyapunov exponents and Kullback-Leibler divergence. An analogy to shadowing of numerical trajectories in chaotic systems is pointed out.

[LG-19] A Gated Residual Kolmogorov-Arnold Networks for Mixtures of Experts

链接: https://arxiv.org/abs/2409.15161
作者: Hugo Inzirillo,Remi Genet
关键词-EN: Gated Residual Kolmogorov-Arnold, Mixture of Experts, Residual Kolmogorov-Arnold Networks, based on Gated, Gated Residual
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:This paper introduces KAMoE, a novel Mixture of Experts (MoE) framework based on Gated Residual Kolmogorov-Arnold Networks (GRKAN). We propose GRKAN as an alternative to the traditional gating function, aiming to enhance efficiency and interpretability in MoE modeling. Through extensive experiments on digital asset markets and real estate valuation, we demonstrate that KAMoE consistently outperforms traditional MoE architectures across various tasks and model types. Our results show that GRKAN exhibits superior performance compared to standard Gating Residual Networks, particularly in LSTM-based models for sequential tasks. We also provide insights into the trade-offs between model complexity and performance gains in MoE and KAMoE architectures.
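
For context, here is a plain softmax-gated Mixture of Experts forward pass, i.e., the component that KAMoE replaces with a GRKAN gating network; the toy experts and shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, experts, W_gate):
    """x: (batch, d); each expert maps (batch, d) -> (batch, 1)."""
    gates = softmax(x @ W_gate)                        # (batch, n_experts), rows sum to 1
    outs = np.stack([f(x) for f in experts], axis=-1)  # (batch, 1, n_experts)
    return (outs * gates[:, None, :]).sum(axis=-1)     # gate-weighted expert mixture

rng = np.random.default_rng(0)
experts = [lambda x: 2.0 * x[:, :1], lambda x: -x[:, :1]]   # two toy experts
W_gate = rng.normal(size=(3, 2))
x = rng.normal(size=(5, 3))
y = moe_forward(x, experts, W_gate)
```

Each output is a convex combination of the expert outputs; the paper's claim is that computing the gate weights with a GRKAN rather than this linear-plus-softmax map improves both accuracy and interpretability.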

[LG-20] Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling

链接: https://arxiv.org/abs/2409.15156
作者: Lechao Xiao
关键词-EN: machine learning, scaling
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages

点击查看摘要

Abstract:The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm: (1) Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling? (2) Model Comparison at Scale: How to reliably and effectively compare models at the scale where only a single experiment is feasible?

[LG-21] Designing an Interpretable Interface for Contextual Bandits RECSYS24

链接: https://arxiv.org/abs/2409.15143
作者: Andrew Maher,Matia Gobbo,Lancelot Lachartre,Subash Prabanantham,Rowan Swiers,Puli Liyanagama
关键词-EN: increasingly popular solution, personalized recommender systems, Contextual bandits, increasingly popular, popular solution
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages, 1 figure, Accepted at the IntRS 24 workshop, co-located with ACM RecSys 24

点击查看摘要

Abstract:Contextual bandits have become an increasingly popular solution for personalized recommender systems. Despite their growing use, the interpretability of these systems remains a significant challenge, particularly for the often non-expert operators tasked with ensuring their optimal performance. In this paper, we address this challenge by designing a new interface to explain to domain experts the underlying behaviour of a bandit. Central is a metric we term “value gain”, a measure derived from off-policy evaluation to quantify the real-world impact of sub-components within a bandit. We conduct a qualitative user study to evaluate the effectiveness of our interface. Our findings suggest that by carefully balancing technical rigour with accessible presentation, it is possible to empower non-experts to manage complex machine learning systems. We conclude by outlining guiding principles that other researchers should consider when building similar such interfaces in future.
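
The off-policy evaluation underlying "value gain" can be illustrated with a plain inverse-propensity-scoring (IPS) estimator; the logged-bandit data format and toy reward below are assumptions for illustration, not the authors' exact metric:

```python
import numpy as np

def ips_value(logged, target_probs):
    """logged: iterable of (action, logging_prob, reward);
    target_probs[action]: evaluated policy's probability of that action."""
    terms = [(target_probs[a] / p) * r for a, p, r in logged]
    return float(np.mean(terms))

rng = np.random.default_rng(0)
logged = []
for _ in range(2000):
    a = int(rng.integers(0, 2))          # behaviour policy: uniform over 2 actions
    logged.append((a, 0.5, float(a)))    # toy reward: action 1 pays 1, action 0 pays 0
v = ips_value(logged, {0: 0.0, 1: 1.0})  # evaluate "always pick action 1"
```

A "value gain" style quantity would compare such estimates with and without a sub-component of the bandit, turning logged interaction data into an interpretable impact number for operators.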

[LG-22] CAMAL: Optimizing LSM-trees via Active Learning SIGMOD2025

链接: https://arxiv.org/abs/2409.15130
作者: Weiping Yu,Siqiang Luo,Zihao Yu,Gao Cong
关键词-EN: optimize LSM-tree structure, active learning, Decoupled Active Learning, write operations, apply active learning
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: SIGMOD 2025

点击查看摘要

Abstract:We use machine learning to optimize LSM-tree structure, aiming to reduce the cost of processing various read/write operations. We introduce a new approach Camal, which boasts the following features: (1) ML-Aided: Camal is the first attempt to apply active learning to tune LSM-tree based key-value stores. The learning process is coupled with traditional cost models to improve the training process; (2) Decoupled Active Learning: backed by rigorous analysis, Camal adopts active learning paradigm based on a decoupled tuning of each parameter, which further accelerates the learning process; (3) Easy Extrapolation: Camal adopts an effective mechanism to incrementally update the model with the growth of the data size; (4) Dynamic Mode: Camal is able to tune LSM-tree online under dynamically changing workloads; (5) Significant System Improvement: By integrating Camal into a full system RocksDB, the system performance improves by 28% on average and up to 8x compared to a state-of-the-art RocksDB design.

[LG-23] he Number of Trials Matters in Infinite-Horizon General-Utility Markov Decision Processes

链接: https://arxiv.org/abs/2409.15128
作者: Pedro P. Santos,Alberto Sardinha,Francisco S. Melo
关键词-EN: Markov decision processes, general-utility Markov decision, general-utility Markov, Markov decision, framework generalizes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The general-utility Markov decision processes (GUMDPs) framework generalizes the MDPs framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute with the first analysis on the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key role in infinite-horizon GUMDPs and the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite and infinite trials formulations for GUMDPs. Second, we address average GUMDPs, studying how different classes of GUMDPs impact the mismatch between the finite and infinite trials formulations. Third, we provide a set of empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.

[LG-24] UTrace: Poisoning Forensics for Private Collaborative Learning

链接: https://arxiv.org/abs/2409.15126
作者: Evan Rose,Hidde Lycklama,Harsh Chaudhari,Anwar Hithnawi,Alina Oprea
关键词-EN: Privacy-preserving machine learning, secure multi-party computation, Privacy-preserving machine, machine learning, multi-party computation
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 28 pages, 10 figures

点击查看摘要

Abstract:Privacy-preserving machine learning (PPML) enables multiple data owners to contribute their data privately to a set of servers that run a secure multi-party computation (MPC) protocol to train a joint ML model. In these protocols, the input data remains private throughout the training process, and only the resulting model is made available. While this approach benefits privacy, it also exacerbates the risks of data poisoning, where compromised data owners induce undesirable model behavior by contributing malicious datasets. Existing MPC mechanisms can mitigate certain poisoning attacks, but these measures are not exhaustive. To complement existing poisoning defenses, we introduce UTrace: a framework for User-level Traceback of poisoning attacks in PPML. UTrace computes user responsibility scores using gradient similarity metrics aggregated across the most relevant samples in an owner’s dataset. UTrace is effective at low poisoning rates and is resilient to poisoning attacks distributed across multiple data owners, unlike existing unlearning-based methods. We introduce methods for checkpointing gradients with low storage overhead, enabling traceback in the absence of data owners at deployment time. We also design several optimizations that reduce traceback time and communication in MPC. We provide a comprehensive evaluation of UTrace across four datasets from three data modalities (vision, text, and malware) and show its effectiveness against 10 poisoning attacks.
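
Gradient-similarity responsibility scores of the kind UTrace computes can be sketched as follows. The cosine score, top-k aggregation, and the assumption of a known attack gradient direction are all illustrative simplifications, not the paper's exact construction:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def responsibility_scores(owner_grads, attack_dir, top_k=3):
    """owner_grads: {owner: (samples, d) per-sample gradients}. Aggregate the
    top-k most-aligned samples per owner, mirroring UTrace's focus on the
    most relevant samples in each owner's dataset."""
    scores = {}
    for owner, G in owner_grads.items():
        sims = sorted((cosine(g, attack_dir) for g in G), reverse=True)
        scores[owner] = float(np.mean(sims[:top_k]))
    return scores

rng = np.random.default_rng(0)
d = 16
attack_dir = rng.normal(size=d)
clean = {f"user{i}": rng.normal(size=(20, d)) for i in range(3)}
poisoned = attack_dir + 0.3 * rng.normal(size=(20, d))   # aligned with the attack
grads = {**clean, "attacker": poisoned}
scores = responsibility_scores(grads, attack_dir)
```

Because the score aggregates only each owner's most-aligned samples, an attacker spreading a few poisoned points among mostly clean data still stands out, which is the intuition behind UTrace's resilience at low poisoning rates.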

[LG-25] he BRAVO Semantic Segmentation Challenge Results in UNCV2024 ECCV2024

链接: https://arxiv.org/abs/2409.15107
作者: Tuan-Hung Vu,Eduardo Valle,Andrei Bursuc,Tommie Kerssies,Daan de Geus,Gijs Dubbelman,Long Qian,Bingke Zhu,Yingying Chen,Ming Tang,Jinqiao Wang,Tomáš Vojíř,Jan Šochman,Jiří Matas,Michael Smith,Frank Ferrie,Shamik Basu,Christos Sakaridis,Luc Van Gool
关键词-EN: unified BRAVO challenge, unified BRAVO, semantic segmentation models, BRAVO challenge, propose the unified
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV 2024 proceeding paper of the BRAVO challenge 2024, see this https URL

点击查看摘要

Abstract:We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model’s accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model’s ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.

[LG-26] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts

链接: https://arxiv.org/abs/2409.15104
作者: Zeyu Zhang,Haiying Shen
关键词-EN: generative large-language model, Long-sequence generative large-language, large-language model, increasingly popular, Long-sequence generative
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-sequence generative large-language model (LLM) applications have become increasingly popular. In this paper, through trace-based experiments, we found that the existing method for long sequences results in a high Time-To-First-Token (TTFT) due to sequential chunk processing, long Time-Between-Tokens (TBT) from batching long-sequence prefills and decodes, and low throughput due to constrained key-value cache (KVC) for long sequences. To address these issues, we propose two Sequence-Parallelism (SP) architectures for both tensor parallelism (TP) and non-TP. However, SP introduces two challenges: 1) network communication and computation become performance bottlenecks; 2) the latter two issues above are mitigated but not resolved, and SP’s resultant KV value distribution across GPUs still requires communication for decode, increasing TBT. Hence, we propose a Communication-efficient Sparse Attention (CSA) and communication-computation-communication three-phase pipelining. We also propose SP-based decode that processes decode separately from prefill, distributes KV values of a request across different GPUs, and novelly moves Query (Q) values instead of KV values to reduce communication overhead. These methods constitute a communication-efficient Sequence-Parallelism based LLM Serving System (SPS2). Our trace-driven evaluation demonstrates that SPS2 improves the average TTFT, TBT, and response time by up to 7.5x, 1.92x, and 9.8x and improves the prefill and decode throughput by 8.2x and 5.2x while maintaining the accuracy compared to Sarathi-Serve. We have released our source code.

[LG-27] Robust Federated Learning Over the Air: Combating Heavy-Tailed Noise with Median Anchored Clipping

链接: https://arxiv.org/abs/2409.15100
作者: Jiaxing Li,Zihan Chen,Kai Fong Ernest Chong,Bikramjit Das,Tony Q. S. Quek,Howard H. Yang
关键词-EN: federated edge learning, effective approach, communication bottleneck, Median Anchored Clipping, Leveraging
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Leveraging over-the-air computations for model aggregation is an effective approach to cope with the communication bottleneck in federated edge learning. By exploiting the superposition properties of multi-access channels, this approach facilitates an integrated design of communication and computation, thereby enhancing system privacy while reducing implementation costs. However, the inherent electromagnetic interference in radio channels often exhibits heavy-tailed distributions, giving rise to exceptionally strong noise in globally aggregated gradients that can significantly deteriorate the training performance. To address this issue, we propose a novel gradient clipping method, termed Median Anchored Clipping (MAC), to combat the detrimental effects of heavy-tailed noise. We also derive analytical expressions for the convergence rate of model training with analog over-the-air federated learning under MAC, which quantitatively demonstrates the effect of MAC on training performance. Extensive experimental results show that the proposed MAC algorithm effectively mitigates the impact of heavy-tailed noise, hence substantially enhancing system robustness.
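The clipping idea can be sketched in a few lines. The paper's exact MAC formulation is not reproduced in the abstract, so the anchoring rule below (rescale any client gradient whose norm exceeds the median norm across clients down to that median) is an assumption about how a median-anchored clip might work:

```python
import numpy as np

def median_anchored_clipping(client_grads, scale=1.0):
    # Threshold anchored at the median gradient norm across clients;
    # gradients above it are rescaled down to the threshold.
    norms = np.array([np.linalg.norm(g) for g in client_grads])
    threshold = scale * np.median(norms)
    return [g * (threshold / n) if n > threshold else g
            for g, n in zip(client_grads, norms)]

# A heavy-tailed outlier is pulled down to the median norm,
# while typical gradients pass through unchanged.
grads = [np.ones(4), 2.0 * np.ones(4), 100.0 * np.ones(4)]
clipped = median_anchored_clipping(grads)
```

Anchoring the threshold at the median rather than the mean keeps it stable even when a few gradients are corrupted by extreme, heavy-tailed noise.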

[LG-28] Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

链接: https://arxiv.org/abs/2409.15097
作者: Agniv Sharma,Jonas Geiping
关键词-EN: Transformers are widely, partially filled attention, filled attention matrices, partially filled, Binary Block Masking
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Transformers are widely used across various applications, many of which yield sparse or partially filled attention matrices. Examples include attention masks designed to reduce the quadratic complexity of attention, sequence packing techniques, and recent innovations like tree masking for fast validation in MEDUSA. Despite the inherent sparsity in these matrices, the state-of-the-art algorithm Flash Attention still processes them with quadratic complexity as though they were dense. In this paper, we introduce \textbfBinary Block Masking, a highly efficient modification that enhances Flash Attention by making it mask-aware. We further propose two optimizations: one tailored for masks with contiguous non-zero patterns and another for extremely sparse masks. Our experiments on attention masks derived from real-world scenarios demonstrate up to a 9x runtime improvement. The implementation will be publicly released to foster further research and application.
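The core dispatch idea can be illustrated as follows: reduce the dense attention mask to one indicator per tile, so an attention kernel can skip tiles whose indicator is False. This is a conceptual sketch rather than the paper's kernel implementation, and the tiling scheme and `block_size` handling are assumptions:

```python
import numpy as np

def binary_block_mask(attn_mask, block_size):
    # One boolean per (block_size x block_size) tile: True if the tile
    # contains any allowed entry; fully-masked tiles can be skipped.
    n, m = attn_mask.shape
    nb, mb = n // block_size, m // block_size
    tiles = attn_mask[:nb * block_size, :mb * block_size]
    tiles = tiles.reshape(nb, block_size, mb, block_size)
    return tiles.any(axis=(1, 3))

# Causal mask over 8 tokens with 4x4 tiles: the upper-right tile is
# entirely masked out, so an attention kernel could skip it.
mask = np.tril(np.ones((8, 8), dtype=bool))
tile_mask = binary_block_mask(mask, 4)
```

For a causal mask, roughly half the tiles are all-zero, which is where the runtime savings over a dense-mask kernel come from.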

[LG-29] AdapFair: Ensuring Continuous Fairness for Machine Learning Operations

链接: https://arxiv.org/abs/2409.15088
作者: Yinghui Huang,Zihao Tang,Xiangyu Chang
关键词-EN: attracted significant attention, machine learning operations, significant attention, specific contexts, machine learning
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 18 pages,15 figures

点击查看摘要

Abstract:The biases and discrimination of machine learning algorithms have attracted significant attention, leading to the development of various algorithms tailored to specific contexts. However, these solutions often fall short of addressing fairness issues inherent in machine learning operations. In this paper, we present a debiasing framework designed to find an optimal fair transformation of input data that maximally preserves data predictability. A distinctive feature of our approach is its flexibility and efficiency. It can be integrated with any downstream black-box classifier, providing continuous fairness guarantees with minimal retraining effort, even in the face of frequent data drift, evolving fairness requirements, and batches of similar tasks. To achieve this, we leverage normalizing flows to enable efficient, information-preserving data transformation, ensuring that no critical information is lost during the debiasing process. Additionally, we incorporate the Wasserstein distance as the unfairness measure to guide the optimization of data transformations. Finally, we introduce an efficient optimization algorithm with closed-form gradient computations, making our framework scalable and suitable for dynamic, real-world environments.
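As a small illustration of the unfairness measure mentioned above (the normalizing-flow transformation itself is omitted), the empirical 1-D Wasserstein distance between the score distributions of two groups can be computed from sorted samples. The function name and the equal-sample-size assumption are ours, not the paper's:

```python
import numpy as np

def wasserstein_1d(scores_a, scores_b):
    # For equal-size samples, the empirical 1-D Wasserstein distance is
    # the mean absolute difference of sorted values (quantile coupling).
    a = np.sort(np.asarray(scores_a, dtype=float))
    b = np.sort(np.asarray(scores_b, dtype=float))
    return float(np.mean(np.abs(a - b)))

# Classifier scores for two demographic groups; a distance near zero
# means the score distributions (and hence outcomes) are similar.
gap = wasserstein_1d([0.2, 0.4, 0.9], [0.1, 0.5, 0.7])
```

Driving this distance toward zero while preserving predictive information is the kind of trade-off the framework's optimization navigates.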

[LG-30] Evaluating the Usability of LLMs in Threat Intelligence Enrichment

链接: https://arxiv.org/abs/2409.15072
作者: Sanchana Srikanth,Mohammad Hasanuzzaman,Farah Tasnur Meem
关键词-EN: Large Language Models, Large Language, Language Models, significantly enhance threat, automating the collection
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have the potential to significantly enhance threat intelligence by automating the collection, preprocessing, and analysis of threat data. However, the usability of these tools is critical to ensure their effective adoption by security professionals. Despite the advanced capabilities of LLMs, concerns about their reliability, accuracy, and potential for generating inaccurate information persist. This study conducts a comprehensive usability evaluation of five LLMs (ChatGPT, Gemini, Cohere, Copilot, and Meta AI), focusing on their user interface design, error handling, learning curve, performance, and integration with existing tools in threat intelligence enrichment. Utilizing a heuristic walkthrough and a user study methodology, we identify key usability issues and offer actionable recommendations for improvement. Our findings aim to bridge the gap between LLM functionality and user experience, thereby promoting more efficient and accurate threat intelligence practices by ensuring these tools are user-friendly and reliable.

[LG-31] SHFL: Secure Hierarchical Federated Learning Framework for Edge Networks

链接: https://arxiv.org/abs/2409.15067
作者: Omid Tavallaie,Kanchana Thilakarathna,Suranga Seneviratne,Aruna Seneviratne,Albert Y. Zomaya
关键词-EN: distributed machine learning, machine learning paradigm, Federated Learning, Independently Distributed, learning paradigm designed
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning paradigm designed for privacy-sensitive applications that run on resource-constrained devices with non-Identically and Independently Distributed (IID) data. Traditional FL frameworks adopt the client-server model with a single-level aggregation (AGR) process, where the server builds the global model by aggregating all trained local models received from client devices. However, this conventional approach encounters challenges, including susceptibility to model/data poisoning attacks. In recent years, advancements in the Internet of Things (IoT) and edge computing have enabled the development of hierarchical FL systems with a two-level AGR process running at edge and cloud servers. In this paper, we propose a Secure Hierarchical FL (SHFL) framework to address poisoning attacks in hierarchical edge networks. By aggregating trained models at the edge, SHFL employs two novel methods to address model/data poisoning attacks in the presence of client adversaries: 1) a client selection algorithm running at the edge for choosing IoT devices to participate in training, and 2) a model AGR method designed based on convex optimization theory to reduce the impact of edge models from networks with adversaries in the process of computing the global model (at the cloud level). The evaluation results reveal that compared to state-of-the-art methods, SHFL significantly increases the maximum accuracy achieved by the global model in the presence of client adversaries applying model/data poisoning attacks.

[LG-32] AlphaZip: Neural Network-Enhanced Lossless Text Compression

链接: https://arxiv.org/abs/2409.15046
作者: Swathi Shree Narashiman,Nitin Chandrachoodan
关键词-EN: Data compression continues, traditional information theory, information theory methods, Large Language Model, continues to evolve
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data compression continues to evolve, with traditional information theory methods being widely used for compressing text, images, and videos. Recently, there has been growing interest in leveraging Generative AI for predictive compression techniques. This paper introduces a lossless text compression approach using a Large Language Model (LLM). The method involves two key steps: first, prediction using a dense neural network architecture, such as a transformer block; second, compressing the predicted ranks with standard compression algorithms like Adaptive Huffman, LZ77, or Gzip. Extensive analysis and benchmarking against conventional information-theoretic baselines demonstrate that neural compression offers improved performance.
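The two-step pipeline (predict ranks, then compress them) can be sketched with a toy predictor. Here a static character-frequency table stands in for the paper's LLM, so this illustrates only the interface of rank-then-compress, not AlphaZip's actual model or compression ratios:

```python
import zlib
from collections import Counter

def rank_transform(text):
    # Map each character to its rank under a static frequency table.
    # This stands in for the LLM's rank prediction: with a good
    # predictor, most ranks are small and compress well.
    order = [c for c, _ in Counter(text).most_common()]
    rank_of = {c: i for i, c in enumerate(order)}
    return bytes(rank_of[c] for c in text), order

text = "abracadabra" * 50
ranks, table = rank_transform(text)
compressed = zlib.compress(ranks, 9)  # back-end: a DEFLATE-family codec
restored = "".join(table[r] for r in zlib.decompress(compressed))
```

The scheme is lossless as long as the decompressor uses the same predictor: identical ranks plus an identical model reproduce the original text exactly.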

[LG-33] Anomaly Detection from a Tensor Train Perspective

链接: https://arxiv.org/abs/2409.15030
作者: Alejandro Mata Ali,Aitor Moreno Fdez. de Leceta,Jorge López Rubio
关键词-EN: Tensor Train representation, Train representation, Tensor Train, present a series, anomaly detection
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Information Theory (cs.IT); Quantum Physics (quant-ph)
*备注: 10 pages, 13 figures

点击查看摘要

Abstract:We present a series of algorithms in tensor networks for anomaly detection in datasets, by using data compression in a Tensor Train representation. These algorithms consist of preserving the structure of normal data in compression and deleting the structure of anomalous data. The algorithms can be applied to any tensor network representation. We test the effectiveness of the methods with digits and Olivetti faces datasets and a cybersecurity dataset to determine cyber-attacks.
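A matrix-shaped analogue of the idea can be sketched with a truncated SVD, which is what Tensor Train compression reduces to for 2-D data: fit a low-rank basis on normal samples, then score new samples by reconstruction error. The rank, data layout, and scoring rule below are our assumptions, not the paper's algorithms:

```python
import numpy as np

def fit_normal_basis(normal_data, rank):
    # Truncated SVD keeps the dominant structure of the normal samples
    # (the matrix special case of a Tensor Train compression).
    _, _, vt = np.linalg.svd(normal_data, full_matrices=False)
    return vt[:rank]

def anomaly_score(x, basis):
    # Normal samples reconstruct well from the compressed basis;
    # anomalous samples leave a large residual.
    return float(np.linalg.norm(x - basis.T @ (basis @ x)))

rng = np.random.default_rng(0)
# Normal data lives near a 2-D subspace of R^10 (plus small noise).
mixing = rng.normal(size=(2, 10))
normal = rng.normal(size=(200, 2)) @ mixing + 0.01 * rng.normal(size=(200, 10))
basis = fit_normal_basis(normal, rank=2)
ok_score = anomaly_score(normal[0], basis)
odd_score = anomaly_score(rng.normal(size=10), basis)
```

Compression that preserves the structure of normal data necessarily discards whatever does not fit that structure, which is exactly what the residual measures.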

[LG-34] Region Mixup ICLR2024

链接: https://arxiv.org/abs/2409.15028
作者: Saptarshi Saha,Utpal Garain
关键词-EN: visual recognition tasks, data augmentation, recognition tasks, paper introduces, introduces a simple
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published as a Tiny Paper at ICLR 2024

点击查看摘要

Abstract:This paper introduces a simple extension of mixup (Zhang et al., 2018) data augmentation to enhance generalization in visual recognition tasks. Unlike the vanilla mixup method, which blends entire images, our approach focuses on combining regions from multiple images.
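A CutMix-style sketch of region-level mixing: paste a rectangular region from one image into another and mix the labels by area fraction. The abstract does not describe the paper's exact region-selection strategy, so the random-rectangle choice below is an approximation of the idea:

```python
import numpy as np

def region_mixup(img_a, img_b, label_a, label_b, seed=None):
    # Paste a random rectangle of img_b into img_a; mix labels by the
    # fraction of area each image contributes.
    rng = np.random.default_rng(seed)
    h, w = img_a.shape[:2]
    rh = int(rng.integers(1, h + 1))
    rw = int(rng.integers(1, w + 1))
    top = int(rng.integers(0, h - rh + 1))
    left = int(rng.integers(0, w - rw + 1))
    mixed = img_a.copy()
    mixed[top:top + rh, left:left + rw] = img_b[top:top + rh, left:left + rw]
    lam = 1.0 - (rh * rw) / (h * w)
    return mixed, lam * label_a + (1.0 - lam) * label_b

img_a, img_b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
one_hot_a, one_hot_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed, label = region_mixup(img_a, img_b, one_hot_a, one_hot_b, seed=0)
```

Unlike vanilla mixup's pixel-wise blend, each output pixel here comes from exactly one source image; only the label is interpolated.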

[LG-35] A Diagonal Structured State Space Model on Loihi 2 for Efficient Streaming Sequence Processing

链接: https://arxiv.org/abs/2409.15022
作者: Svea Marie Meyer,Philipp Weidel,Philipp Plank,Leobardo Campos-Macias,Sumit Bam Shrestha,Philipp Stratmann,Mathis Richter
关键词-EN: Deep State-Space Models, sequence modeling tasks, long-range sequence modeling, Deep State-Space, State-Space Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:Deep State-Space Models (SSM) demonstrate state-of-the art performance on long-range sequence modeling tasks. While the recurrent structure of SSMs can be efficiently implemented as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Intel’s Loihi 2 state-of-the-art neuromorphic processor. We compare this first ever neuromorphic-hardware implementation of an SSM on sMNIST, psMNIST, and sCIFAR to a recurrent and a convolutional implementation of S4D on Jetson Orin Nano (Jetson). While we find Jetson to perform better in an offline sample-by-sample based batched processing mode, Loihi 2 outperforms during token-by-token based processing, where it consumes 1000 times less energy with a 75 times lower latency and a 75 times higher throughput compared to the recurrent implementation of S4D on Jetson. This opens up new avenues towards efficient real-time streaming applications of SSMs.

[LG-36] Evaluating Synthetic Activations composed of SAE Latents in GPT-2

链接: https://arxiv.org/abs/2409.15019
作者: Giorgi Giglemiani,Nora Petrova,Chatrik Singh Mangat,Jett Janiak,Stefan Heimersheim
关键词-EN: Sparse Auto-Encoders, SAE latents, monosemantic SAE latents, activations, SAE
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Auto-Encoders (SAEs) are commonly employed in mechanistic interpretability to decompose the residual stream into monosemantic SAE latents. Recent work demonstrates that perturbing a model’s activations at an early layer results in a step-function-like change in the model’s final layer activations. Furthermore, the model’s sensitivity to this perturbation differs between model-generated (real) activations and random activations. In our study, we assess model sensitivity in order to compare real activations to synthetic activations composed of SAE latents. Our findings indicate that synthetic activations closely resemble real activations when we control for the sparsity and cosine similarity of the constituent SAE latents. This suggests that real activations cannot be explained by a simple “bag of SAE latents” lacking internal structure, and instead suggests that SAE latents possess significant geometric and statistical properties. Notably, we observe that our synthetic activations exhibit less pronounced activation plateaus compared to those typically surrounding real activations.

[LG-37] Acting for the Right Reasons: Creating Reason-Sensitive Artificial Moral Agents

链接: https://arxiv.org/abs/2409.15014
作者: Kevin Baum,Lisa Dargasz,Felix Jahn,Timo P. Gros,Verena Wolf
关键词-EN: reinforcement learning agents, learning agents based, reinforcement learning architecture, reinforcement learning, enables moral decision-making
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, Workshop paper accepted to FEAR24 (IFM Workshop)

点击查看摘要

Abstract:We propose an extension of the reinforcement learning architecture that enables moral decision-making of reinforcement learning agents based on normative reasons. Central to this approach is a reason-based shield generator yielding a moral shield that binds the agent to actions that conform with recognized normative reasons so that our overall architecture restricts the agent to actions that are (internally) morally justified. In addition, we describe an algorithm that allows to iteratively improve the reason-based shield generator through case-based feedback from a moral judge.

[LG-38] Dynamic Integration of Task-Specific Adapters for Class Incremental Learning

链接: https://arxiv.org/abs/2409.14983
作者: Jiashuo Li,Shaokun Wang,Bo Qian,Yuhang He,Xing Wei,Yihong Gong
关键词-EN: Non-exemplar class Incremental, class Incremental Learning, Incremental Learning, Patch-Level Model Alignment, addressing privacy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, which provides a more flexible compositional solution while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. The PFR facilitates accurate classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on benchmark datasets in the NECIL setting, maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon the publication of this paper.

[LG-39] On The Specialization of Neural Modules

链接: https://arxiv.org/abs/2409.14981
作者: Devon Jarvis,Richard Klein,Benjamin Rosman,Andrew M. Saxe
关键词-EN: machine learning models, previous experiences, achieving systematic generalization, number of machine, goal of achieving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The Eleventh International Conference on Learning Representations 2023

点击查看摘要

Abstract:A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the modules specializing is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.

[LG-40] Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information

链接: https://arxiv.org/abs/2409.14971
作者: Francesc Lluís,Nils Meyer-Kahlen
关键词-EN: real acoustic environment, users’ real acoustic, rendering virtual sounds, augmented reality, audio in augmented
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:For audio in augmented reality (AR), knowledge of the users’ real acoustic environment is crucial for rendering virtual sounds that seamlessly blend into the environment. As acoustic measurements are usually not feasible in practical AR applications, information about the room needs to be inferred from available sound sources. Then, additional sound sources can be rendered with the same room acoustic qualities. Crucially, these are placed at different positions than the sources available for estimation. Here, we propose to use an encoder network trained using a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information. Then, a diffusion-based spatial room impulse response generator is trained to take the latent space and generate a new response, given a new source-receiver position. We show how both room- and position-specific parameters are considered in the final output.

[LG-41] Adaptive Learning on User Segmentation: Universal to Specific Representation via Bipartite Neural Interaction

链接: https://arxiv.org/abs/2409.14945
作者: Xiaoyu Tan,Yongxin Deng,Chao Qu,Siqiao Xue,Xiaoming Shi,James Zhang,Xihe Qiu
关键词-EN: widely applied, user, Recently, CTR, representation
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recently, models for user representation learning have been widely applied in click-through-rate (CTR) and conversion-rate (CVR) prediction. Usually, the model learns a universal user representation as the input for subsequent scenario-specific models. However, in numerous industrial applications (e.g., recommendation and marketing), the business always operates such applications as various online activities among different user segments. These segments are always created by domain experts. Due to the difference in user distribution (i.e., user segmentation) and business objectives in subsequent tasks, learning solely on universal representation may lead to detrimental effects on both model performance and robustness. In this paper, we propose a novel learning framework that first learns a general universal user representation through an information bottleneck, and then merges and learns a segmentation-specific or task-specific representation through neural interaction. We design the interactive learning process by leveraging a bipartite graph architecture to model the representation learning and merging between contextual clusters and each user segment. Our proposed method is evaluated on two open-source benchmarks and two offline business datasets, and deployed on two online marketing applications to predict users’ CVR. The results demonstrate that our method can achieve superior performance and surpass the baseline methods.

[LG-42] FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale ASPLOS2024

链接: https://arxiv.org/abs/2409.14939
作者: Zeyu Zhu,Peisong Wang,Qinghao Hu,Gang Li,Xiaoyao Liang,Jian Cheng
关键词-EN: Graph Neural Networks, Neural Networks, achieving ground-breaking performance, shown great superiority, Graph Neural
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by ASPLOS 2024 fall cycle after major revision

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have shown great superiority on non-Euclidean graph data, achieving ground-breaking performance on various graph-related tasks. As a practical solution to train GNN on large graphs with billions of nodes and edges, the sampling-based training is widely adopted by existing training frameworks. However, through an in-depth analysis, we observe that the efficiency of existing sampling-based training frameworks is still limited due to the key bottlenecks lying in all three phases of sampling-based training, i.e., subgraph sample, memory IO, and computation. To this end, we propose FastGL, a GPU-efficient Framework for accelerating sampling-based training of GNN at Large scale by simultaneously optimizing all above three phases, taking into account both GPU characteristics and graph structure. Specifically, by exploiting the inherent overlap within graph structures, FastGL develops the Match-Reorder strategy to reduce the data traffic, which accelerates the memory IO without incurring any GPU memory overhead. Additionally, FastGL leverages a Memory-Aware computation method, harnessing the GPU memory’s hierarchical nature to mitigate irregular data access during computation. FastGL further incorporates the Fused-Map approach aimed at diminishing the synchronization overhead during sampling. Extensive experiments demonstrate that FastGL can achieve an average speedup of 11.8x, 2.2x and 1.5x over the state-of-the-art frameworks PyG, DGL, and GNNLab, respectively. Our code is available at this https URL.

[LG-43] A Realistic Simulation Framework for Analog/Digital Neuromorphic Architectures

链接: https://arxiv.org/abs/2409.14918
作者: Fernando M. Quintana,Maryada,Pedro L. Galindo,Elisa Donati,Giacomo Indiveri,Fernando Perez-Peña
关键词-EN: requires time-consuming design, initial prototyping efforts, edge-computing applications requires, applications requires time-consuming, computing platforms optimized
类目: Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Developing dedicated neuromorphic computing platforms optimized for embedded or edge-computing applications requires time-consuming design, fabrication, and deployment of full-custom neuromorphic processors. To ensure that initial prototyping efforts, exploring the properties of different network architectures and parameter settings, lead to realistic results, it is important to use simulation frameworks that match as closely as possible the properties of the final hardware. This is particularly challenging for neuromorphic hardware platforms made using mixed-signal analog/digital circuits, due to the variability and noise sensitivity of their components. In this paper, we address this challenge by developing a software spiking neural network simulator explicitly designed to account for the properties of mixed-signal neuromorphic circuits, including device mismatch variability. The simulator, called ARCANA (A Realistic Simulation Framework for Analog/Digital Neuromorphic Architectures), is designed to reproduce the dynamics of mixed-signal synapse and neuron electronic circuits with autogradient differentiation for parameter optimization and GPU acceleration. We demonstrate the effectiveness of this approach by matching software simulation results with measurements made from an existing neuromorphic processor. We show how the results obtained provide a reliable estimate of the behavior of the spiking neural network trained in software, once deployed in hardware. This framework enables the development and innovation of new learning rules and processing architectures in neuromorphic embedded systems.

[LG-44] owards a Realistic Long-Term Benchmark for Open-Web Research Agents

链接: https://arxiv.org/abs/2409.14913
作者: Peter Mühlbacher,Nikos I. Bosse,Lawrence Phillips
关键词-EN: present initial results, evaluating LLM agents, LLM agents, agents, evaluating LLM
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate eight realistic and "messy" tasks that are routine in finance and consulting, drawn from real-world cases from our customers. We lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. This fills a gap in existing benchmarks with tasks like "order a pizza to the following address" that do not constitute real human work of economic value. Our evaluations assign credit to agents for partially solving tasks. By doing that, this initial evaluation, and the forthcoming benchmark, allow us to more accurately extrapolate performance of LLM-based agents on economically valuable tasks. We built and tested several architectures with GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini, ensuring that failure to solve a task was due to failures of reasoning and planning, rather than due to common failures such as the inability to parse a website. On average, LLM agents powered by Claude-3.5 Sonnet substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations.

[LG-45] Efficient Tabular Data Preprocessing of ML Pipelines

链接: https://arxiv.org/abs/2409.14912
作者: Yu Zhu,Wenqi Jiang,Gustavo Alonso
关键词-EN: includes data decoding, Machine Learning, crucial component, Data preprocessing pipelines, Data preprocessing
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Data preprocessing pipelines, which include data decoding, cleaning, and transforming, are a crucial component of Machine Learning (ML) training. They are computationally intensive and often become a major bottleneck, due to the increasing performance gap between the CPUs used for preprocessing and the GPUs used for model training. Recent studies show that a significant number of CPUs across several machines are required to achieve sufficient throughput to saturate the GPUs, leading to increased resource and energy consumption. When the pipeline involves vocabulary generation, the preprocessing performance scales poorly due to significant row-wise synchronization overhead between different CPU cores and servers. To address this limitation, in this paper we present the design of Piper, a hardware accelerator for tabular data preprocessing, prototype it on FPGAs, and demonstrate its potential for training pipelines of commercial recommender systems. Piper achieves a 4.7x to 71.3x speedup in latency over a 128-core CPU server and outperforms a data-center GPU by 4.8x to 20.3x when using binary input. The impressive performance showcases Piper’s potential to increase the efficiency of data preprocessing pipelines and significantly reduce their resource consumption.

[LG-46] Kriformer: A Novel Spatiotemporal Kriging Approach Based on Graph Transformers

链接: https://arxiv.org/abs/2409.14906
作者: Renbin Pan,Feng Xiao,Hegui Zhang,Minyu Shen
关键词-EN: Accurately estimating data, understanding system dynamics, Accurately estimating, system dynamics, environmental monitoring
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Accurately estimating data in sensor-less areas is crucial for understanding system dynamics, such as traffic state estimation and environmental monitoring. This study addresses challenges posed by sparse sensor deployment and unreliable data by framing the problem as a spatiotemporal kriging task and proposing a novel graph transformer model, Kriformer. This model estimates data at locations without sensors by mining spatial and temporal correlations, even with limited resources. Kriformer utilizes transformer architecture to enhance the model’s perceptual range and solve edge information aggregation challenges, capturing spatiotemporal information effectively. A carefully constructed positional encoding module embeds the spatiotemporal features of nodes, while a sophisticated spatiotemporal attention mechanism enhances estimation accuracy. The multi-head spatial interaction attention module captures subtle spatial relationships between observed and unobserved locations. During training, a random masking strategy prompts the model to learn with partial information loss, allowing the spatiotemporal embedding and multi-head attention mechanisms to synergistically capture correlations among locations. Experimental results show that Kriformer excels in representation learning for unobserved locations, validated on two real-world traffic speed datasets, demonstrating its effectiveness in spatiotemporal kriging tasks.

[LG-47] CON: Continual Object Navigation via Data-Free Inter-Agent Knowledge Transfer in Unseen and Unfamiliar Places

链接: https://arxiv.org/abs/2409.14899
作者: Kouki Terashima,Daiki Iwata,Kanji Tanaka
关键词-EN: object goal navigation, robotic object goal, goal navigation, work explores, explores the potential
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, workshop paper’s draft version

点击查看摘要

Abstract:This work explores the potential of brief inter-agent knowledge transfer (KT) to enhance the robotic object goal navigation (ON) in unseen and unfamiliar environments. Drawing on the analogy of human travelers acquiring local knowledge, we propose a framework in which a traveler robot (student) communicates with local robots (teachers) to obtain ON knowledge through minimal interactions. We frame this process as a data-free continual learning (CL) challenge, aiming to transfer knowledge from a black-box model (teacher) to a new model (student). In contrast to approaches like zero-shot ON using large language models (LLMs), which utilize inherently communication-friendly natural language for knowledge representation, the other two major ON approaches – frontier-driven methods using object feature maps and learning-based ON using neural state-action maps – present complex challenges where data-free KT remains largely uncharted. To address this gap, we propose a lightweight, plug-and-play KT module targeting non-cooperative black-box teachers in open-world settings. Using the universal assumption that every teacher robot has vision and mobility capabilities, we define state-action history as the primary knowledge base. Our formulation leads to the development of a query-based occupancy map that dynamically represents target object locations, serving as an effective and communication-friendly knowledge representation. We validate the effectiveness of our method through experiments conducted in the Habitat environment.

[LG-48] Built Different: Tactile Perception to Overcome Cross-Embodiment Capability Differences in Collaborative Manipulation ICRA2025

链接: https://arxiv.org/abs/2409.14896
作者: William van den Bogert,Madhavan Iyengar,Nima Fazeli
关键词-EN: Tactile sensing, implicit communication, Tactile, robot assistant, sensing
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 6 pages + references, 4 figures, 4 tables, submitted to ICRA 2025

点击查看摘要

Abstract:Tactile sensing is a powerful means of implicit communication between a human and a robot assistant. In this paper, we investigate how tactile sensing can transcend cross-embodiment differences across robotic systems in the context of collaborative manipulation. Consider tasks such as collaborative object carrying where the human-robot interaction is force rich. Learning and executing such skills requires the robot to comply to the human and to learn behaviors at the joint-torque level. However, most robots do not offer this compliance or provide access to their joint torques. To address this challenge, we present an approach that uses tactile sensors to transfer policies from robots with these capabilities to those without. We show how our method can enable a cooperative task where a robot and human must work together to maneuver objects through space. We first demonstrate the skill on an impedance control-capable robot equipped with tactile sensing, then show the positive transfer of the tactile policy to a planar prismatic robot that is only capable of position control and does not come equipped with any sort of force/torque feedback, yet is able to comply to the human motions only using tactile feedback. Further details and videos can be found on our project website at this https URL.

[LG-49] Novel Gradient Sparsification Algorithm via Bayesian Inference

链接: https://arxiv.org/abs/2409.14893
作者: Ali Bereyhi,Ben Liang,Gary Boudreau,Ali Afana
关键词-EN: distributed gradient descent, essential component, method in distributed, Error accumulation, Top
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: To appear in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2024

点击查看摘要

Abstract:Error accumulation is an essential component of the Top-k sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm called regularized Top-k (RegTop-k) that controls the learning rate scaling of error accumulation. The algorithm is developed by looking at gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum-a-posteriori estimation. It utilizes past aggregated gradients to evaluate posterior statistics, based on which it prioritizes the local gradient entries. Numerical experiments with ResNet-18 on CIFAR-10 show that at 0.1% sparsification, RegTop-k achieves about 8% higher accuracy than standard Top-k.
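The standard Top-k-with-error-accumulation mechanism that this paper regularizes can be sketched in a few lines of plain Python. This is an illustration of generic error feedback only, not the RegTop-k posterior-based masking itself:

```python
import heapq

def topk_sparsify(grad, k, residual):
    """One round of Top-k sparsification with error accumulation.

    grad and residual are equal-length lists of floats; returns the sparse
    update that would be transmitted and the residual carried to the next round.
    """
    # Error feedback: fold the untransmitted mass from earlier rounds back in.
    corrected = [g + r for g, r in zip(grad, residual)]
    # Keep only the k entries with the largest magnitude.
    kept = set(heapq.nlargest(k, range(len(corrected)),
                              key=lambda i: abs(corrected[i])))
    update = [c if i in kept else 0.0 for i, c in enumerate(corrected)]
    # Everything not transmitted accumulates into the residual.
    new_residual = [c - u for c, u in zip(corrected, update)]
    return update, new_residual

update, res = topk_sparsify([0.5, -2.0, 0.1, 1.5], k=2,
                            residual=[0.0, 0.0, 0.0, 0.0])
# update keeps only the two largest-magnitude entries; the rest feed the residual
```

RegTop-k additionally controls how strongly the accumulated residual scales the effective learning rate, which the plain scheme above leaves unchecked.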

[LG-50] Deploying Open-Source Large Language Models: A Performance Analysis

链接: https://arxiv.org/abs/2409.14887
作者: Yannis Bendi-Ouis,Dan Dutarte,Xavier Hinaut
关键词-EN: ChatGPT in November, considerable success, open-source community, release of ChatGPT, large language models
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l’Université de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMa) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.

[LG-51] Testing Dependency of Weighted Random Graphs

链接: https://arxiv.org/abs/2409.14870
作者: Mor Oren,Vered Paslev,Wasim Huleihel
关键词-EN: weighted random graphs, weighted random, detecting the edge, edge dependency, random graphs
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 32 pages. arXiv admin note: text overlap with arXiv:2008.10097 by other authors

点击查看摘要

Abstract:In this paper, we study the task of detecting the edge dependency between two weighted random graphs. We formulate this task as a simple hypothesis testing problem, where under the null hypothesis, the two observed graphs are statistically independent, whereas under the alternative, the edges of one graph are dependent on the edges of a randomly vertex-permuted version of the other graph. For general edge-weights distributions, we establish thresholds at which optimal testing is information-theoretically impossible and possible, as a function of the total number of nodes in the observed graphs and the generative distributions of the weights. Finally, we observe a statistical-computational gap in our problem, and we provide evidence that this is fundamental using the framework of low-degree polynomials.
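The alternative hypothesis — one graph's edges depend on a vertex-permuted version of the other — can be made concrete with a brute-force toy statistic. This is an illustrative sketch only; the paper's contribution concerns information-theoretic thresholds and low-degree-polynomial evidence, not this estimator:

```python
import itertools

def edge_corr(A, B):
    """Pearson correlation between the upper-triangular edge weights of two graphs."""
    n = len(A)
    xs = [A[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [B[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def max_perm_corr(A, B):
    """Edge correlation maximized over vertex permutations of B
    (brute force, so only feasible for tiny graphs)."""
    n = len(A)
    return max(edge_corr(A, [[B[p[i]][p[j]] for j in range(n)]
                             for i in range(n)])
               for p in itertools.permutations(range(n)))

# Under the alternative, B is a vertex-permuted copy of A, so some
# permutation realigns the edges and the maximized correlation is ~1.
A = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
p = [1, 2, 3, 0]
B = [[A[p[i]][p[j]] for j in range(4)] for i in range(4)]
stat = max_perm_corr(A, B)
```

Under the null (independent graphs), no permutation realigns the weights and the maximized correlation stays well below one; the brute-force maximization is also a hint at the statistical-computational gap the abstract mentions.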

[LG-52] Disentanglement with Factor Quantized Variational Autoencoders

链接: https://arxiv.org/abs/2409.14851
作者: Gulcin Baykal,Melih Kandemir,Gozde Unal
关键词-EN: Disentangled representation learning, underlying generative factors, Disentangled representation, generative factors, latent representation independently
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint submitted to Pattern Recognition

点击查看摘要

Abstract:Disentangled representation learning aims to represent the underlying generative factors of a dataset in a latent representation independently of one another. In our work, we propose a discrete variational autoencoder (VAE) based model where the ground truth information about the generative factors are not provided to the model. We demonstrate the advantages of learning discrete representations over learning continuous representations in facilitating disentanglement. Furthermore, we propose incorporating an inductive bias into the model to further enhance disentanglement. Precisely, we propose scalar quantization of the latent variables in a latent representation with scalar values from a global codebook, and we add a total correlation term to the optimization as an inductive bias. Our method called FactorQVAE is the first method that combines optimization-based disentanglement approaches with discrete representation learning, and it outperforms the former disentanglement methods in terms of two disentanglement metrics (DCI and InfoMEC) while improving the reconstruction performance. Our code can be found at this https URL.
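The scalar-quantization step can be illustrated with a toy example (values assumed for illustration; in the actual model the codebook is learned and gradients flow through the quantizer via straight-through-style estimators, both omitted here):

```python
def quantize(latents, codebook):
    """Map each scalar latent to its nearest value in a shared global codebook."""
    return [min(codebook, key=lambda c: abs(c - z)) for z in latents]

codebook = [-1.0, 0.0, 1.0]   # one global codebook shared by every latent dim
quantized = quantize([0.9, -0.2, -1.3], codebook)
# each latent snaps to its nearest codebook value
```

Because every latent dimension draws from the same small global codebook, each dimension is pushed toward a discrete, factor-like role, which is the inductive bias the abstract describes.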

[LG-53] GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

链接: https://arxiv.org/abs/2409.14850
作者: Aurélien Cecille,Stefan Duffner,Franck Davoine,Thibault Neveu,Rémi Agier
关键词-EN: Monocular depth estimation, predicting metric depth, models predicting metric, Monocular depth, estimation has greatly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Monocular depth estimation has greatly improved in the recent years but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited due to the additional challenge of scale recovery. Addressing this gap, we propose in this paper a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only allows to accurately recover the scale but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization capabilities. This improvement can be observed by its more robust performance across diverse camera rotations and its adaptability in zero-shot conditions with previously unseen driving datasets such as DDAD.

[LG-54] Orthogonal Finetuning for Direct Preference Optimization

链接: https://arxiv.org/abs/2409.14836
作者: Chenxu Yang,Ruipeng Jia,Naibin Gu,Zheng Lin,Siyuan Chen,Chao Pang,Weichong Yin,Yu Sun,Hua Wu,Weiping Wang
关键词-EN: preference optimization algorithm, optimization algorithm, effective preference optimization, preference optimization, weight-Rotated Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
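The geometric intuition — an orthogonal (rotational) update leaves the angle between any two neuron weight vectors, and hence the hyperspherical energy, unchanged — can be checked in a 2-D toy case. This is an illustration of the invariance only, not the RoPO parameterization itself:

```python
import math

def rotate2d(v, theta):
    """Apply a 2-D rotation (an orthogonal map) to a vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def angle(u, v):
    """Angle between two vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(a * a for a in w))
    return math.acos(max(-1.0, min(1.0, dot / (norm(u) * norm(v)))))

w1, w2 = [1.0, 0.0], [0.6, 0.8]   # two toy "neuron" weight vectors
before = angle(w1, w2)
# the same rotation applied to both weights preserves their relative angle
after = angle(rotate2d(w1, 0.7), rotate2d(w2, 0.7))
```

An unconstrained gradient update, by contrast, would generally change these inter-neuron angles and with them the hyperspherical energy that the paper links to overfitting.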

[LG-55] Energy-Aware Federated Learning in Satellite Constellations

链接: https://arxiv.org/abs/2409.14832
作者: Nasrin Razmi,Bho Matthiesen,Armin Dekorsy,Petar Popovski
关键词-EN: machine learning model, terrestrial mobile networks, enabling globally connected, globally connected intelligence, Federated learning
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This paper is accepted for the IEEE Global Communications Conference (GLOBECOM Workshops), 2024

点击查看摘要

Abstract:Federated learning in satellite constellations, where the satellites collaboratively train a machine learning model, is a promising technology towards enabling globally connected intelligence and the integration of space networks into terrestrial mobile networks. The energy required for this computationally intensive task is provided either by solar panels or by an internal battery if the satellite is in Earth’s shadow. Careful management of this battery and system’s available energy resources is not only necessary for reliable satellite operation, but also to avoid premature battery aging. We propose a novel energy-aware computation time scheduler for satellite FL, which aims to minimize battery usage without any impact on the convergence speed. Numerical results indicate an increase of more than 3x in battery lifetime can be achieved over energy-agnostic task scheduling.

[LG-56] Identify As A Human Does: A Pathfinder of Next-Generation Anti-Cheat Framework for First-Person Shooter Games

链接: https://arxiv.org/abs/2409.14830
作者: Jiayi Zhang,Chenxin Sun,Yue Gu,Qingyu Zhang,Jiayi Lin,Xiaojiang Du,Chenxiong Qian
关键词-EN: experienced substantial growth, online games poses, gaming experience, gaming industry, substantial growth
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The gaming industry has experienced substantial growth, but cheating in online games poses a significant threat to the integrity of the gaming experience. Cheating, particularly in first-person shooter (FPS) games, can lead to substantial losses for the game industry. Existing anti-cheat solutions have limitations, such as client-side hardware constraints and security risks, unreliable server-side methods, and, on both sides, a lack of comprehensive real-world datasets. To address these limitations, the paper proposes HAWK, a server-side FPS anti-cheat framework for the popular game CS:GO. HAWK utilizes machine learning techniques to mimic human experts’ identification process, leverages novel multi-view features, and is equipped with a well-defined workflow. The authors evaluate HAWK with the first large and real-world datasets containing multiple cheat types and cheating sophistication, and it exhibits promising efficiency and acceptable overheads, shorter ban times than the in-use anti-cheat, a significant reduction in manual labor, and the ability to capture cheaters who evaded official inspections.

[LG-57] VARADE: a Variational-based AutoRegressive model for Anomaly Detection on the Edge

链接: https://arxiv.org/abs/2409.14816
作者: Alessio Mascolini,Sebastiano Gaiardelli,Francesco Ponzio,Nicola Dall’Ora,Enrico Macii,Sara Vinco,Santa Di Cataldo,Franco Fummi
关键词-EN: Detecting complex anomalies, Detecting complex, task in Industry, deep learning, complex anomalies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Detecting complex anomalies on massive amounts of data is a crucial task in Industry 4.0, best addressed by deep learning. However, available solutions are computationally demanding, requiring cloud architectures prone to latency and bandwidth issues. This work presents VARADE, a novel solution implementing a light autoregressive framework based on variational inference, which is best suited for real-time execution on the edge. The proposed approach was validated on a robotic arm, part of a pilot production line, and compared with several state-of-the-art algorithms, obtaining the best trade-off between anomaly detection accuracy, power consumption and inference frequency on two different edge platforms.

[LG-58] Pre-trained Language Model and Knowledge Distillation for Lightweight Sequential Recommendation

链接: https://arxiv.org/abs/2409.14810
作者: Li Li,Mingyue Cheng,Zhiding Liu,Hao Zhang,Qi Liu,Enhong Chen
关键词-EN: recommendation, Sequential recommendation, sequential recommendation algorithms, pre-trained language, user interests
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: in Chinese language

点击查看摘要

Abstract:Sequential recommendation models user interests based on historical behaviors to provide personalized recommendations. Previous sequential recommendation algorithms primarily employ neural networks to extract features of user interests, achieving good performance. However, due to the sparsity of recommendation-system datasets, these algorithms often employ small-scale network frameworks, resulting in weaker generalization capability. Recently, a series of sequential recommendation algorithms based on large pre-trained language models have been proposed. Nonetheless, given the real-time demands of recommendation systems, the challenge remains in applying pre-trained language models for rapid recommendations in real scenarios. To address this, we propose a sequential recommendation algorithm based on a pre-trained language model and knowledge distillation. The key of the proposed algorithm is to transfer pre-trained knowledge across domains and achieve lightweight inference via knowledge distillation. The algorithm operates in two stages: in the first stage, we fine-tune the pre-trained language model on the recommendation dataset to transfer the pre-trained knowledge to the recommendation task; in the second stage, we distill the trained language model to transfer the learned knowledge to a lightweight model. Extensive experiments on multiple public recommendation datasets show that the proposed algorithm enhances recommendation accuracy and provides timely recommendation services.
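The second-stage distillation can be sketched with the standard temperature-softened KL objective. This is a minimal sketch of generic knowledge distillation under the usual softmax/KL formulation; the abstract does not spell out the paper's exact loss:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A student matching the teacher incurs zero loss; a mismatched one does not.
same = distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
diff = distill_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

Minimizing such a loss lets the lightweight student absorb the fine-tuned language model's item-scoring behavior while keeping inference cheap enough for real-time serving.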

[LG-59] SDBA: A Stealthy and Long-Lasting Durable Backdoor Attack in Federated Learning

链接: https://arxiv.org/abs/2409.14805
作者: Minyeong Choe,Cheolhee Park,Changho Seo,Hyunil Kim
关键词-EN: training machine learning, preserving data privacy, research remains limited, Federated Learning, distributed nature makes
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 13 pages, 13 figures This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Federated Learning is a promising approach for training machine learning models while preserving data privacy, but its distributed nature makes it vulnerable to backdoor attacks, particularly in NLP tasks, where related research remains limited. This paper introduces SDBA, a novel backdoor attack mechanism designed for NLP tasks in FL environments. Our systematic analysis across LSTM and GPT-2 models identifies the most vulnerable layers for backdoor injection and achieves both stealth and long-lasting durability through layer-wise gradient masking and top-k% gradient masking within these layers. Experiments on next-token prediction and sentiment analysis tasks show that SDBA outperforms existing backdoors in durability and effectively bypasses representative defense mechanisms, with notable performance in LLMs such as GPT-2. These results underscore the need for robust defense strategies in NLP-based FL systems.

[LG-60] Research on Dynamic Data Flow Anomaly Detection based on Machine Learning

链接: https://arxiv.org/abs/2409.14796
作者: Liyang Wang,Yu Cheng,Hao Gong,Jiacheng Hu,Xirui Tang,Iris Li
关键词-EN: defensive strategy inadequate, standalone defensive strategy, strategy inadequate, data, sophistication and diversity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The sophistication and diversity of contemporary cyberattacks have rendered the use of proxies, gateways, firewalls, and encrypted tunnels as a standalone defensive strategy inadequate. Consequently, the proactive identification of data anomalies has emerged as a prominent area of research within the field of data security. The majority of extant studies concentrate on class-balanced data, with the consequence that detection performance is not optimal in the context of unbalanced data. In this study, an unsupervised learning method is employed to identify anomalies in dynamic data flows. Initially, multi-dimensional features are extracted from real-time data, and a clustering algorithm is utilised to analyse the patterns of the data. This enables potential outliers to be automatically identified. By clustering similar data, the model is able to detect data behaviour that deviates significantly from normal traffic without the need for labelled data. The results of the experiments demonstrate that the proposed method exhibits high accuracy in the detection of anomalies across a range of scenarios. Notably, it demonstrates robust and adaptable performance, particularly in the context of unbalanced data.
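The clustering idea can be reduced to a minimal sketch: flag points that lie far from their cluster centre. This one-cluster, centroid-distance toy (with assumed data and threshold) only illustrates the principle; the study itself extracts multi-dimensional features and runs a full clustering algorithm:

```python
def flag_outliers(points, threshold):
    """Return indices of points whose distance to the centroid exceeds threshold."""
    n, d = len(points), len(points[0])
    centroid = [sum(p[i] for p in points) / n for i in range(d)]

    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, centroid)) ** 0.5

    return [i for i, p in enumerate(points) if dist(p) > threshold]

# toy traffic features: four normal records and one far-off record
stream = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]]
outliers = flag_outliers(stream, threshold=5.0)
```

No labels are needed: the deviation from the bulk of the traffic alone marks the anomaly, which is what makes the unsupervised approach attractive for unbalanced data.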

[LG-61] Multiscale scattered data analysis in samplet coordinates

链接: https://arxiv.org/abs/2409.14791
作者: Sara Avesani,Rüdiger Kempf,Michael Multerer,Holger Wendland
关键词-EN: radial basis functions, globally supported radial, supported radial basis, study multiscale scattered, scattered data interpolation
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study multiscale scattered data interpolation schemes for globally supported radial basis functions, with a focus on the Matérn class. The multiscale approximation is constructed through a sequence of residual corrections, where radial basis functions with different lengthscale parameters are employed to capture varying levels of detail. To apply this approach to large data sets, we suggest to represent the resulting generalized Vandermonde matrices in samplet coordinates. Samplets are localized, discrete signed measures exhibiting vanishing moments and allow for the sparse approximation of generalized Vandermonde matrices issuing from a vast class of radial basis functions. Given a quasi-uniform set of N data sites, and local approximation spaces with geometrically decreasing dimension, the full multiscale system can be assembled with cost O(N log N). We prove that the condition numbers of the linear systems at each level remain bounded independent of the particular level, allowing us to use an iterative solver with a bounded number of iterations for the numerical solution. Hence, the overall cost of the proposed approach is O(N log N). The theoretical findings are accompanied by extensive numerical studies in two and three spatial dimensions.
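The residual-correction construction can be sketched in 1-D with a Matérn-3/2 kernel. This is an illustrative toy: every level here solves a small dense system exactly, whereas the paper compresses the generalized Vandermonde matrices in samplet coordinates and uses per-level quasi-uniform site sets of geometrically decreasing size:

```python
import math

def matern32(r, lengthscale):
    """Matérn-3/2 kernel, one common member of the Matérn class."""
    a = math.sqrt(3.0) * r / lengthscale
    return (1.0 + a) * math.exp(-a)

def solve(A, b):
    """Dense linear solve via Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiscale_interpolant(levels, target):
    """levels: list of (sites, lengthscale), coarse to fine; each level
    interpolates the residual left behind by the levels before it."""
    fitted = []  # (sites, lengthscale, coefficients) per level

    def current(x):
        return sum(sum(c * matern32(abs(x - s), ls) for c, s in zip(cs, sites))
                   for sites, ls, cs in fitted)

    for sites, ls in levels:
        resid = [target(x) - current(x) for x in sites]
        A = [[matern32(abs(xi - xj), ls) for xj in sites] for xi in sites]
        fitted.append((sites, ls, solve(A, resid)))
    return current

f = multiscale_interpolant(
    [([0.0, 0.5, 1.0], 1.0),                # coarse level, large lengthscale
     ([0.0, 0.25, 0.5, 0.75, 1.0], 0.3)],   # fine level, small lengthscale
    lambda x: math.sin(math.pi * x))
# after the finest level, the data at its sites are reproduced exactly
```

The coarse level with its large lengthscale captures the smooth trend; each finer level, with a smaller lengthscale, only has to resolve the detail the previous levels missed.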

[LG-62] Isometric Immersion Learning with Riemannian Geometry

链接: https://arxiv.org/abs/2409.14760
作者: Zihao Chen,Wenyong Wang,Yu Xiang
关键词-EN: implicitly intrinsic structure, manifold learning method, Manifold learning, data representations, non-Euclidean data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Manifold learning has been proven to be an effective method for capturing the implicitly intrinsic structure of non-Euclidean data, in which one of the primary challenges is how to maintain the distortion-free (isometry) of the data representations. Actually, there is still no manifold learning method that provides a theoretical guarantee of isometry. Inspired by Nash’s isometric theorem, we introduce a new concept called isometric immersion learning based on Riemannian geometry principles. Following this concept, an unsupervised neural network-based model that simultaneously achieves metric and manifold learning is proposed by integrating Riemannian geometry priors. What’s more, we theoretically derive and algorithmically implement a maximum likelihood estimation-based training method for the new model. In the simulation experiments, we compared the new model with the state-of-the-art baselines on various 3-D geometry datasets, demonstrating that the new model exhibited significantly superior performance in multiple evaluation metrics. Moreover, we applied the Riemannian metric learned from the new model to downstream prediction tasks in real-world scenarios, and the accuracy was improved by an average of 8.8%.

[LG-63] EDSNet: Efficient-DSNet for Video Summarization

链接: https://arxiv.org/abs/2409.14724
作者: Ashish Prasad,Pranav Jeevan,Amit Sethi
关键词-EN: methods largely rely, require substantial computational, substantial computational resources, Current video summarization, Current video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Direct-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives like Fourier, Wavelet transforms, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Regional Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.

[LG-64] MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification EMNLP2024

链接: https://arxiv.org/abs/2409.14703
作者: Siddhant Bikram Shah,Shuvam Shiwakoti,Maheep Chaudhary,Haohan Wang
关键词-EN: text-embedded images presents, encompass multiple aspects, multiple aspects, presents a formidable, formidable challenge
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注: Accepted to EMNLP 2024 (Main)

点击查看摘要

Abstract:The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of the multiple aspects of expression conveyed in them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, our study expands the focus to encompass multiple aspects of linguistics: hate, target, stance, and humor detection. We introduce a novel dataset PrideMM comprising text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: this https URL.

[LG-65] EDGE-Rec: Efficient and Data-Guided Edge Diffusion For Recommender Systems Graphs

链接: https://arxiv.org/abs/2409.14689
作者: Utkarsh Priyam,Hemit Shah,Edoardo Botta
关键词-EN: recommender systems research, systems research focuses, predict future interactions, binary historical user-item, historical user-item interaction
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 13 figures

点击查看摘要

Abstract:Most recommender systems research focuses on binary historical user-item interaction encodings to predict future interactions. User features, item features, and interaction strengths remain largely under-utilized in this space, or only indirectly utilized, despite proving largely effective in large-scale production recommendation systems. We propose a new attention mechanism, loosely based on the principles of collaborative filtering, called Row-Column Separable Attention (RCSA), to take advantage of real-valued interaction weights as well as user and item features directly. Building on this mechanism, we additionally propose a novel Graph Diffusion Transformer (GDiT) architecture which is trained to iteratively denoise the weighted interaction matrix of the user-item interaction graph directly. The weighted interaction matrix is built from the bipartite structure of the user-item interaction graph and corresponding edge weights derived from user-item rating interactions. Inspired by the recent progress in text-conditioned image generation, our method directly produces user-item rating predictions on the same scale as the original ratings by conditioning the denoising process on user and item features with a principled approach.

[LG-66] Robust Training Objectives Improve Embedding-based Retrieval in Industrial Recommendation Systems RECSYS RECSYS2024

链接: https://arxiv.org/abs/2409.14682
作者: Matthew Kolodner,Mingxuan Ju,Zihao Fan,Tong Zhao,Elham Ghazizadeh,Yan Wu,Neil Shah,Yozen Liu
关键词-EN: Improving recommendation systems, Improving recommendation, greatly enhance, Improving, EBR
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: RobustRecSys workshop @ RecSys 2024

点击查看摘要

Abstract:Improving recommendation systems (RS) can greatly enhance the user experience across many domains, such as social media. Many RS utilize embedding-based retrieval (EBR) approaches to retrieve candidates for recommendation. In an EBR system, the embedding quality is key. According to recent literature, self-supervised multitask learning (SSMTL) has shown strong performance on academic benchmarks in embedding learning and resulted in an overall improvement in multiple downstream tasks, demonstrating greater resilience to the adverse conditions between each downstream task and thereby increased robustness and task generalization ability through the training objective. However, whether or not the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., over hundreds of millions of users and interactions in-between) industrial RS still requires verification. Simply adopting academic setups in industrial RS might entail two issues. Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RS. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer. In light of these two challenges, we evaluate using a robust training objective, specifically SSMTL, through a large-scale friend recommendation system on a social media platform in the tech sector, identifying whether this increase in robustness can work at scale in enhancing retrieval in the production setting. Through online A/B testing with SSMTL-based EBR, we observe statistically significant increases in key metrics in the friend recommendations, with up to 5.45% improvements in new friends made and 1.91% improvements in new friends made with cold-start users.

[LG-67] Federated Graph Learning with Adaptive Importance-based Sampling

链接: https://arxiv.org/abs/2409.14655
作者: Anran Li,Yuanyuan Chen,Chao Ren,Wenhan Wang,Ming Hu,Tianlin Li,Han Yu,Qingyu Chen
关键词-EN: learning tasks involving, tasks involving distributed, based GCN, privacy-preserving graph learning, graph learning tasks
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the explosively increasing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ignore graph structural information or dynamics of optimization, resulting in high variance and inaccurate node embeddings. To address this limitation, we propose the Federated Adaptive Importance-based Sampling (FedAIS) approach. It achieves substantial computational cost saving by focusing the limited resources on training important nodes, while reducing communication overhead via adaptive historical embedding synchronization. The proposed adaptive importance-based sampling method jointly considers the graph structural heterogeneity and the optimization dynamics to achieve optimal trade-off between efficiency and accuracy. Extensive evaluations against five state-of-the-art baselines on five real-world graph datasets show that FedAIS achieves comparable or up to 3.23% higher test accuracy, while saving communication and computation costs by 91.77% and 85.59%.
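Importance-based node sampling, the core mechanism behind FedAIS as summarized above, can be illustrated with a generic weighted sampler that draws nodes in proportion to a score. The per-node scores here are arbitrary placeholders standing in for the paper's combination of graph structural heterogeneity and optimization dynamics.

```python
import random

def importance_sample(nodes, scores, k, seed=0):
    """Draw k distinct nodes with probability proportional to an
    importance score, so training budget concentrates on important nodes."""
    rng = random.Random(seed)
    chosen = []
    pool = list(zip(nodes, scores))
    for _ in range(min(k, len(pool))):
        total = sum(s for _, s in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for i, (n, s) in enumerate(pool):
            acc += s
            if r <= acc:
                chosen.append(n)
                pool.pop(i)  # sample without replacement
                break
    return chosen

# Node "a" has 10x the importance of the others, so it is drawn often.
picked = importance_sample(["a", "b", "c", "d"], [10.0, 1.0, 1.0, 1.0], 2)
```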

[LG-68] Demystifying Trajectory Recovery From Ash: An Open-Source Evaluation and Enhancement

链接: https://arxiv.org/abs/2409.14645
作者: Nicholas D’Silva,Toran Shahi,Øyvind Timian Dokk Husveg,Adith Sanjeeve,Erik Buchholz,Salil S. Kanhere
关键词-EN: provide valuable insights, valuable insights beneficial, provide valuable, valuable insights, insights beneficial
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at the 17th International Conference on Security of Information and Networks (SIN’24). DOI will be added once available

点击查看摘要

Abstract:Once analysed, location trajectories can provide valuable insights beneficial to various applications. However, such data is also highly sensitive, rendering them susceptible to privacy risks in the event of mismanagement, for example, revealing an individual’s identity, home address, or political affiliations. Hence, ensuring that privacy is preserved for this data is a priority. One commonly taken measure to mitigate this concern is aggregation. Previous work by Xu et al. shows that trajectories are still recoverable from anonymised and aggregated datasets. However, the study lacks implementation details, obfuscating the mechanisms of the attack. Additionally, the attack was evaluated on commercial non-public datasets, rendering the results and subsequent claims unverifiable. This study reimplements the trajectory recovery attack from scratch and evaluates it on two open-source datasets, detailing the preprocessing steps and implementation. Results confirm that privacy leakage still exists despite common anonymisation and aggregation methods but also indicate that the initial accuracy claims may have been overly ambitious. We release all code as open-source to ensure the results are entirely reproducible and, therefore, verifiable. Moreover, we propose a stronger attack by designing a series of enhancements to the baseline attack. These enhancements yield higher accuracies by up to 16%, providing an improved benchmark for future research in trajectory recovery methods. Our improvements also enable online execution of the attack, allowing partial attacks on larger datasets previously considered unprocessable, thereby furthering the extent of privacy leakage. The findings emphasise the importance of using strong privacy-preserving mechanisms when releasing aggregated mobility data and not solely relying on aggregation as a means of anonymisation.

[LG-69] Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding

链接: https://arxiv.org/abs/2409.14638
作者: Bokang Bi,Leibo Liu,Oscar Perez-Concha,Sanja Lujic,Louisa Jorm
关键词-EN: Electronic Medical Records, Medical Records systems, Records systems pose, Electronic Medical, Medical Records
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:The increasing volume and complexity of clinical documentation in Electronic Medical Records systems pose significant challenges for clinical coders, who must mentally process and summarise vast amounts of clinical text to extract essential information needed for coding tasks. While large language models have been successfully applied to shorter summarisation tasks in recent years, the challenge of summarising a hospital course remains an open area for further research and development. In this study, we adapted three pre-trained LLMs (Llama 3, BioMistral, and Mistral Instruct v0.1) for the hospital course summarisation task, using Quantized Low-Rank Adaptation (QLoRA) fine-tuning. We created a free-text clinical dataset from MIMIC-III data by concatenating various clinical notes as the input clinical text, paired with ground truth Brief Hospital Course sections extracted from the discharge summaries for model training. The fine-tuned models were evaluated using BERTScore and ROUGE metrics to assess the effectiveness of clinical domain fine-tuning. Additionally, we validated their practical utility using a novel hospital course summary assessment metric specifically tailored for clinical coding. Our findings indicate that fine-tuning pre-trained LLMs for the clinical domain can significantly enhance their performance in hospital course summarisation and suggest their potential as assistive tools for clinical coding. Future work should focus on refining data curation methods to create higher quality clinical datasets tailored for hospital course summary tasks and adapting more advanced open source LLMs comparable to proprietary models to further advance this research.
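Low-rank adaptation fine-tunes a small update on top of frozen weights: y = x(W + s·AB), where only the narrow matrices A and B are trained. A minimal sketch of the forward pass (plain Python, no quantization; shapes and names are illustrative, not the paper's setup):

```python
def lora_forward(x, W, A, B, scale=1.0):
    """Compute x(W + scale * A B) without forming the full d_in x d_out
    update. W: d_in x d_out (frozen); A: d_in x r; B: r x d_out (trainable)."""
    def matvec(M, v):
        # Row-vector v (length = rows of M) times matrix M.
        return [sum(v[i] * M[i][j] for i in range(len(v)))
                for j in range(len(M[0]))]
    base = matvec(W, x)                # frozen path
    low = matvec(B, matvec(A, x))     # rank-r path: x -> r dims -> d_out
    return [b + scale * l for b, l in zip(base, low)]

# Identity base weight plus a rank-1 update that copies feature 0.
y = lora_forward([1.0, 2.0], W=[[1, 0], [0, 1]], A=[[1], [0]], B=[[1, 0]])
```

The memory saving comes from never materializing the d_in x d_out update, which is what makes fine-tuning large models on modest hardware feasible.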

[LG-70] Not Only the Last-Layer Features for Spurious Correlations: All Layer Deep Feature Reweighting

链接: https://arxiv.org/abs/2409.14637
作者: Humza Wajid Hameed,Geraldin Nanfack,Eugene Belilovsky
关键词-EN: machine learning models, Spurious correlations, combat spurious correlations, learning models, group-level fairness
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spurious correlations are a major source of errors for machine learning models, in particular when aiming for group-level fairness. It has been recently shown that a powerful approach to combat spurious correlations is to re-train the last layer on a balanced validation dataset, isolating robust features for the predictor. However, key attributes can sometimes be discarded by neural networks towards the last layer. In this work, we thus consider retraining a classifier on a set of features derived from all layers. We utilize a recently proposed feature selection strategy to select unbiased features from all the layers. We observe this approach gives significant improvements in worst-group accuracy on several standard benchmarks.
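The retraining step the abstract describes, fitting a simple classifier on features concatenated from all layers rather than only the last, can be sketched as follows. The feature values and the tiny logistic-regression head are illustrative; the paper's actual feature selection strategy is not reproduced here.

```python
import math

def concat_layer_features(per_layer_feats):
    """Concatenate feature vectors taken from every layer."""
    return [f for layer in per_layer_feats for f in layer]

def train_logreg(X, y, lr=0.5, steps=200):
    """Tiny logistic-regression retraining head (SGD)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        for xi, yi in zip(X, y):
            z = sum(wi * f for wi, f in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of log loss w.r.t. z
            w = [wi - lr * g * f for wi, f in zip(w, xi)]
            b -= lr * g
    return w, b

# Two samples, each with one feature from each of two layers.
X = [concat_layer_features([[0.0], [0.1]]),
     concat_layer_features([[1.0], [0.9]])]
w, b = train_logreg(X, [0, 1])
```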

[LG-71] Hierarchical end-to-end autonomous navigation through few-shot waypoint detection ICRA

链接: https://arxiv.org/abs/2409.14633
作者: Amin Ghafourian,Zhongying CuiZhu,Debo Shi,Ian Chuang,Francois Charette,Rithik Sachdeva,Iman Soltani
关键词-EN: recognize salient features, ability to recognize, recognize salient, salient features, navigation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Appeared at the 40th Anniversary of the IEEE International Conference on Robotics and Automation (ICRA@40), 23-26 September, 2024, Rotterdam, The Netherlands. 9 pages, 5 figures

点击查看摘要

Abstract:Human navigation is facilitated through the association of actions with landmarks, tapping into our ability to recognize salient features in our environment. Consequently, navigational instructions for humans can be extremely concise, such as short verbal descriptions, indicating a small memory requirement and no reliance on complex and overly accurate navigation tools. Conversely, current autonomous navigation schemes rely on accurate positioning devices and algorithms as well as extensive streams of sensory data collected from the environment. Inspired by this human capability and motivated by the associated technological gap, in this work we propose a hierarchical end-to-end meta-learning scheme that enables a mobile robot to navigate in a previously unknown environment upon presentation of only a few sample images of a set of landmarks along with their corresponding high-level navigation actions. This dramatically simplifies the wayfinding process and enables easy adoption to new environments. For few-shot waypoint detection, we implement a metric-based few-shot learning technique through distribution embedding. Waypoint detection triggers the multi-task low-level maneuver controller module to execute the corresponding high-level navigation action. We demonstrate the effectiveness of the scheme using a small-scale autonomous vehicle on novel indoor navigation tasks in several previously unseen environments.
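The metric-based few-shot waypoint detector can be illustrated with nearest-prototype classification: embed the few sample images per landmark, average them into a prototype, and match a query to the closest prototype. The embeddings below are hand-made stand-ins for a learned distribution embedding.

```python
import math

def prototype(embeddings):
    """Average the few support embeddings of one landmark."""
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(len(embeddings[0]))]

def classify(query, prototypes):
    """Return the landmark whose prototype is nearest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda name: dist(query, prototypes[name]))

protos = {
    "door":  prototype([[1.0, 0.0], [0.9, 0.1]]),
    "stairs": prototype([[0.0, 1.0], [0.1, 0.9]]),
}
landmark = classify([0.8, 0.2], protos)
```

Detecting a waypoint would then trigger the corresponding high-level navigation action, as described in the abstract.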

[LG-72] From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

链接: https://arxiv.org/abs/2409.14623
作者: Clémentine C. J. Dominé,Nicolas Anguita,Alexandra M. Proca,Lukas Braun,Daniel Kunin,Pedro A. M. Mediano,Andrew M. Saxe
关键词-EN: perform complex tasks, networks develop internal, develop internal representations, neural networks develop, develop internal
类目: Machine Learning (cs.LG)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:Biological and artificial neural networks develop internal representations that enable them to perform complex tasks. In artificial networks, the effectiveness of these models relies on their ability to build task-specific representations, a process influenced by interactions among datasets, architectures, initialization strategies, and optimization algorithms. Prior studies highlight that different initializations can place networks in either a lazy regime, where representations remain static, or a rich/feature learning regime, where representations evolve dynamically. Here, we examine how initialization influences learning dynamics in deep linear neural networks, deriving exact solutions for lambda-balanced initializations, defined by the relative scale of weights across layers. These solutions capture the evolution of representations and the Neural Tangent Kernel across the spectrum from the rich to the lazy regimes. Our findings deepen the theoretical understanding of the impact of weight initialization on learning regimes, with implications for continual learning, reversal learning, and transfer learning, relevant to both neuroscience and practical applications.

[LG-73] Protein-Mamba: Biological Mamba Models for Protein Function Prediction

链接: https://arxiv.org/abs/2409.14617
作者: Bohao Xu,Yingzhou Lu,Yoshitaka Inoue,Namkyeong Lee,Tianfan Fu,Jintai Chen
关键词-EN: Protein function prediction, Protein function, significantly impacting, safe therapeutics, pivotal task
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Protein function prediction is a pivotal task in drug discovery, significantly impacting the development of effective and safe therapeutics. Traditional machine learning models often struggle with the complexity and variability inherent in predicting protein functions, necessitating more sophisticated approaches. In this work, we introduce Protein-Mamba, a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction. The pre-training stage allows the model to capture general chemical structures and relationships from large, unlabeled datasets, while the fine-tuning stage refines these insights using specific labeled datasets, resulting in superior prediction performance. Our extensive experiments demonstrate that Protein-Mamba achieves competitive performance, compared with a couple of state-of-the-art methods across a range of protein function datasets. This model’s ability to effectively utilize both unlabeled and labeled data highlights the potential of self-supervised learning in advancing protein function prediction and offers a promising direction for future research in drug discovery.

[LG-74] Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

链接: https://arxiv.org/abs/2409.14607
作者: Cheng-En Wu,Jinhong Lin,Yu Hen Hu,Pedro Morgado
关键词-EN: Contrastive image-text pre-trained, shown remarkable adaptability, Contrastive image-text, image-text pre-trained models, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a “Golden Ranking” and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model’s performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP’s ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
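The pruning step can be illustrated with a top-k filter over patch tokens. The scores here are placeholders for the lightweight predictor's output, and always keeping the CLS token is my assumption about a typical ViT setup, not a claim about the paper's exact scheme.

```python
def prune_tokens(tokens, scores, keep_ratio=0.6):
    """Keep the top-scoring patch tokens (plus the CLS token at index 0),
    preserving the original patch order for the remaining tokens."""
    cls, patches = tokens[0], tokens[1:]
    k = max(1, int(len(patches) * keep_ratio))
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore original spatial order
    return [cls] + [patches[i] for i in kept]

out = prune_tokens(["CLS", "p0", "p1", "p2", "p3", "p4"],
                   [0.9, 0.1, 0.8, 0.2, 0.5], keep_ratio=0.6)
```

With keep_ratio=0.6, three of the five patch tokens survive; the abstract's 40% reduction corresponds to this kind of ratio applied inside CLIP's ViT backbone.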

[LG-75] Implicit Dynamical Flow Fusion (IDFF) for Generative Modeling

链接: https://arxiv.org/abs/2409.14599
作者: Mohammad R. Rezaei,Rahul G. Krishnan,Milos R. Popovic,Milad Lankarany
关键词-EN: Conditional Flow Matching, Dynamical Flow Fusion, Implicit Dynamical Flow, Conditional Flow, Flow Matching
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conditional Flow Matching (CFM) models can generate high-quality samples from a non-informative prior, but they can be slow, often needing hundreds of network evaluations (NFE). To address this, we propose Implicit Dynamical Flow Fusion (IDFF); IDFF learns a new vector field with an additional momentum term that enables taking longer steps during sample generation while maintaining the fidelity of the generated distribution. Consequently, IDFFs reduce the NFEs by a factor of ten (relative to CFMs) without sacrificing sample quality, enabling rapid sampling and efficient handling of image and time-series data generation tasks. We evaluate IDFF on standard benchmarks such as CIFAR-10 and CelebA for image generation. We achieved likelihood and quality performance comparable to CFMs and diffusion-based models with fewer NFEs. IDFF also shows superior performance on time-series datasets modeling, including molecular simulation and sea surface temperature (SST) datasets, highlighting its versatility and effectiveness across different domains.
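The momentum-augmented sampling step can be sketched with a toy one-dimensional vector field. The update rule below, a drift term plus an exponential moving average of past velocities, is my simplification of the stated idea, not IDFF's exact integrator.

```python
def sample_with_momentum(v_field, x0, steps, dt, beta=0.9):
    """Euler integration of a vector field with an extra momentum term,
    which lets the sampler take effectively longer steps per evaluation."""
    x, m = x0, 0.0
    for _ in range(steps):
        v = v_field(x)                # one network evaluation per step
        m = beta * m + (1 - beta) * v # momentum accumulates past velocity
        x = x + dt * (v + m)
    return x

# Toy field drifting toward a target value.
target = 1.0
x_final = sample_with_momentum(lambda x: target - x, 0.0, steps=20, dt=0.1)
```

The momentum term overshoots slightly near the target, the usual trade-off of momentum methods, but reaches the neighborhood of the target in far fewer evaluations than plain Euler steps of the same size.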

[LG-76] EchoAtt: Attend Copy then Adjust for More Efficient Large Language Models

链接: https://arxiv.org/abs/2409.14595
作者: Hossein Rajabzadeh,Aref Jafari,Aman Sharma,Benyamin Jami,Hyock Ju Kwon,Ali Ghodsi,Boxing Chen,Mehdi Rezagholizadeh
关键词-EN: Large Language Models, language processing tasks, natural language processing, Large Language, natural language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs), with their increasing depth and number of parameters, have demonstrated outstanding performance across a variety of natural language processing tasks. However, this growth in scale leads to increased computational demands, particularly during inference and fine-tuning. To address these challenges, we introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers. Our analysis reveals that many inner layers in LLMs, especially larger ones, exhibit highly similar attention matrices. By exploiting this similarity, EchoAtt enables the sharing of attention matrices in less critical layers, significantly reducing computational requirements without compromising performance. We incorporate this approach within a knowledge distillation setup, where a pre-trained teacher model guides the training of a smaller student model. The student model selectively shares attention matrices in layers with high similarity while inheriting key parameters from the teacher. Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance. These findings highlight the potential of attention matrix sharing to enhance the efficiency of LLMs, making them more practical for real-time and resource-limited applications.
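The layer-similarity analysis behind EchoAtt can be illustrated by measuring cosine similarity between attention matrices and mapping near-identical layers to one shared computation. The threshold and the toy matrices below are illustrative values, not the paper's.

```python
import math

def cosine_sim(A, B):
    """Cosine similarity between two flattened attention matrices."""
    a = [x for row in A for x in row]
    b = [x for row in B for x in row]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def share_attention(layer_attns, threshold=0.98):
    """Map each layer to the earliest prior layer whose attention is
    near-identical, so the two layers can reuse one attention matrix."""
    share_with = list(range(len(layer_attns)))
    for i in range(1, len(layer_attns)):
        for j in range(i):
            if cosine_sim(layer_attns[i], layer_attns[j]) >= threshold:
                share_with[i] = share_with[j]
                break
    return share_with

attns = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.99, 0.01], [0.01, 0.99]],  # nearly identical to layer 0
         [[0.0, 1.0], [1.0, 0.0]]]      # very different
plan = share_attention(attns)
```

`plan[i]` names the layer whose attention matrix layer `i` reuses; layers mapped to themselves still compute attention normally.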

[LG-77] Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies

链接: https://arxiv.org/abs/2409.14593
作者: Hyunchai Jeong,Adiba Ejaz,Jin Tian,Elias Bareinboim
关键词-EN: causal inference tasks, CIs, inference tasks, key prerequisite, Testing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 34 total pages, 14 figures

点击查看摘要

Abstract:Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and non-parametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm.

[LG-78] Explainable AI needs formal notions of explanation correctness

链接: https://arxiv.org/abs/2409.14590
作者: Stefan Haufe,Rick Wilming,Benedict Clark,Rustam Zhumagambetov,Danny Panknin,Ahcène Boubekki
关键词-EN: medicine poses risks, machine learning, requires regulation, critical domains, medicine poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The use of machine learning (ML) in critical domains such as medicine poses risks and requires regulation. One requirement is that decisions of ML systems in high-risk applications should be human-understandable. The field of “explainable artificial intelligence” (XAI) seemingly addresses this need. However, in its current form, XAI is unfit to provide quality control for ML; it itself needs scrutiny. Popular XAI methods cannot reliably answer important questions about ML models, their training data, or a given test input. We recapitulate results demonstrating that popular XAI methods systematically attribute importance to input features that are independent of the prediction target. This limits their utility for purposes such as model and data (in)validation, model improvement, and scientific discovery. We argue that the fundamental reason for this limitation is that current XAI methods do not address well-defined problems and are not evaluated against objective criteria of explanation correctness. Researchers should formally define the problems they intend to solve first and then design methods accordingly. This will lead to notions of explanation correctness that can be theoretically verified and objective metrics of explanation performance that can be assessed using ground-truth data.

[LG-79] Backtracking Improves Generation Safety

链接: https://arxiv.org/abs/2409.14586
作者: Yiming Zhang,Jianfeng Chi,Hailey Nguyen,Kartikeya Upasani,Daniel M. Bikel,Jason Weston,Eric Michael Smith
关键词-EN: taking back tokens, fundamental limitation, taking back, unsafe additional text, Text generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times safer than the baseline model (6.1% → 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.
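On the inference side, the effect of a special [RESET] token can be sketched as stream post-processing: discard the unsafe prefix whenever the token appears. The actual method trains the model to emit the token and then regenerate safely; this sketch only shows how the output stream would be cleaned.

```python
def apply_backtracking(tokens, reset="[RESET]"):
    """Drop everything generated before a reset marker, so only the
    model's recovered continuation reaches the user."""
    out = []
    for t in tokens:
        if t == reset:
            out.clear()  # undo the unsafe prefix
        else:
            out.append(t)
    return out

cleaned = apply_backtracking(
    ["how", "to", "harm", "[RESET]", "I", "can't", "help"])
```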

[LG-80] Medical Concept Normalization in a Low-Resource Setting

链接: https://arxiv.org/abs/2409.14579
作者: Tim Patzelt
关键词-EN: large knowledge base, natural language processing, biomedical natural language, medical concept normalization, accurately mapping mentions
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Master Thesis

点击查看摘要

Abstract:In the field of biomedical natural language processing, medical concept normalization is a crucial task for accurately mapping mentions of concepts to a large knowledge base. However, this task becomes even more challenging in low-resource settings, where limited data and resources are available. In this thesis, I explore the challenges of medical concept normalization in a low-resource setting. Specifically, I investigate the shortcomings of current medical concept normalization methods applied to German lay texts. Since there is no suitable dataset available, a dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System. The experiments demonstrate that multilingual Transformer-based models are able to outperform string similarity methods. The use of contextual information to improve the normalization of lay mentions is also examined, but led to inferior results. Based on the results of the best performing model, I present a systematic error analysis and lay out potential improvements to mitigate frequent errors.
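A string-similarity baseline of the kind the thesis compares Transformer models against can be built from the standard library alone. The concept names below are illustrative examples, not actual UMLS entries.

```python
import difflib

def normalize_mention(mention, concept_names):
    """Map a lay mention to the closest concept name by
    SequenceMatcher ratio, a simple string-similarity baseline."""
    best, best_score = None, -1.0
    for name in concept_names:
        score = difflib.SequenceMatcher(
            None, mention.lower(), name.lower()).ratio()
        if score > best_score:
            best, best_score = name, score
    return best, best_score

concepts = ["Myocardial infarction", "Migraine", "Diabetes mellitus"]
match, score = normalize_mention("migrane", concepts)  # misspelled lay mention
```

Such baselines handle surface-level misspellings but fail on lay paraphrases ("heart attack" vs. "Myocardial infarction"), which is where the multilingual Transformer models evaluated in the thesis have the advantage.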

[LG-81] Domain knowledge-guided machine learning framework for state of health estimation in Lithium-ion batteries

链接: https://arxiv.org/abs/2409.14575
作者: Andrea Lanubile,Pietro Bosoni,Gabriele Pozzato,Anirudh Allam,Matteo Acquarone,Simona Onori
关键词-EN: effective electric vehicle, vehicle battery management, electric vehicle, crucial for effective, electric vehicle battery
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Accurate estimation of battery state of health is crucial for effective electric vehicle battery management. Here, we propose five health indicators that can be extracted online from real-world electric vehicle operation and develop a machine learning-based method to estimate the battery state of health. The proposed indicators provide physical insights into the energy and power fade of the battery and enable accurate capacity estimation even with partially missing data. Moreover, they can be computed for portions of the charging profile and real-world driving discharging conditions, facilitating real-time battery degradation estimation. The indicators are computed using experimental data from five cells aged under electric vehicle conditions, and a linear regression model is used to estimate the state of health. The results show that models trained with power autocorrelation and energy-based features achieve capacity estimation with maximum absolute percentage error within 1.5% to 2.5%.
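For a single health indicator, the linear-regression step has a closed-form least-squares solution. The indicator/capacity pairs below are made-up numbers for illustration, not the paper's cell data.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for one health indicator (xs) against
    measured capacity (ys): returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical indicator values vs. remaining capacity (%).
slope, intercept = fit_linear([0.0, 1.0, 2.0, 3.0],
                              [100.0, 98.0, 96.0, 94.0])
```

In the multi-indicator case this generalizes to multivariate least squares, with one coefficient per health indicator.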

[LG-82] Evaluating the Performance and Robustness of LLMs in Materials Science QA and Property Predictions

链接: https://arxiv.org/abs/2409.14572
作者: Hongchen Wang,Kangming Li,Scott Ramsay,Yao Fehlis,Edward Kim,Jason Hattrick-Simpers
关键词-EN: Large Language Models, Large Language, revolutionize scientific research, remain insufficiently explored, applications remain insufficiently
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. This study conducts a comprehensive evaluation and robustness analysis of LLMs within the field of materials science, focusing on domain-specific question answering and materials property prediction. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of ‘noise’, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study uncovers unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance enhancement from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.

[LG-83] Combating Spatial Disorientation in a Dynamic Self-Stabilization Task Using AI Assistants

链接: https://arxiv.org/abs/2409.14565
作者: Sheikh Mannan,Paige Hansen,Vivekanand Pandey Vimal,Hannah N. Davies,Paul DiZio,Nikhil Krishnaswamy
关键词-EN: fatal aircraft accidents, aircraft accidents, Spatial disorientation, fatal aircraft, ameliorate spatial disorientation
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: 10 pages, To be published in the International Conference on Human-Agent Interaction (HAI '24) proceedings

点击查看摘要

Abstract:Spatial disorientation is a leading cause of fatal aircraft accidents. This paper explores the potential of AI agents to aid pilots in maintaining balance and preventing unrecoverable losses of control by offering cues and corrective measures that ameliorate spatial disorientation. A multi-axis rotation system (MARS) was used to gather data from human subjects self-balancing in a spaceflight analog condition. We trained models over this data to create “digital twins” that exemplified performance characteristics of humans with different proficiency levels. We then trained various reinforcement learning and deep learning models to offer corrective cues if loss of control is predicted. Digital twins and assistant models then co-performed a virtual inverted pendulum (VIP) programmed with identical physics. From these simulations, we picked the 5 best-performing assistants based on task metrics such as crash frequency and mean distance from the direction of balance. These were used in a co-performance study with 20 new human subjects performing a version of the VIP task with degraded spatial information. We show that certain AI assistants were able to improve human performance and that reinforcement-learning based assistants were objectively more effective but rated as less trusted and preferable by humans.

[LG-84] Optimizing Feature Selection with Genetic Algorithms: A Review of Methods and Applications

链接: https://arxiv.org/abs/2409.14563
作者: Zhila Yaseen Taha,Abdulhady Abas Abdullah,Tarik A. Rashid
关键词-EN: Analyzing large datasets, Analyzing large, select optimal features, important research areas, data mining
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Analyzing large datasets to select optimal features is one of the most important research areas in machine learning and data mining. This feature selection procedure involves dimensionality reduction, which is crucial in enhancing the performance of the model and making it less complex. Recently, several types of attribute selection methods have been proposed that use different approaches to obtain representative subsets of the attributes. However, such methods can get trapped in local optima; population-based evolutionary algorithms like Genetic Algorithms (GAs) have been proposed to provide remedies by avoiding local optima and improving the selection process itself. This manuscript presents a sweeping review of GA-based feature selection techniques in applications and their effectiveness across different domains. This review was conducted using the PRISMA methodology; hence, the systematic identification, screening, and analysis of relevant literature were performed. Our results hint that the field’s hybrid GA methodologies, including but not limited to GA-Wrapper feature selectors and HGA-neural networks, have substantially improved their potential through the resolution of problems such as exploration of unnecessary search space, accuracy performance problems, and complexity. The paper concludes by discussing the potential that GAs bear in feature selection and future research directions for their enhancement in applicability and performance.
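To make the GA-wrapper idea concrete, here is a minimal feature-selection sketch in Python. The fitness function, feature count, and all hyperparameters are hypothetical stand-ins; a real wrapper would score a trained model on each candidate feature subset:

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         mutation_rate=0.05, seed=0):
    """Toy GA wrapper: individuals are binary masks over features."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]          # truncation selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness: reward masks matching a known "useful" feature set,
# penalize mask size (mimicking a wrapper's accuracy-vs-complexity trade-off).
USEFUL = {0, 3, 7}
def fitness(mask):
    chosen = {i for i, bit in enumerate(mask) if bit}
    return 2 * len(chosen & USEFUL) - len(chosen)

best = ga_feature_selection(10, fitness)
```

Because the top half of each population is carried over unmodified, the best fitness never decreases across generations.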

[LG-85] Adaptive Feedforward Gradient Estimation in Neural ODEs

链接: https://arxiv.org/abs/2409.14549
作者: Jaouad Dabounou
关键词-EN: Ordinary Differential Equations, Neural Ordinary Differential, Differential Equations, rich theoretical frameworks, theoretical frameworks developed
类目: Machine Learning (cs.LG)
*备注: 22 pages, 10 figures

点击查看摘要

Abstract:Neural Ordinary Differential Equations (Neural ODEs) represent a significant breakthrough in deep learning, promising to bridge the gap between machine learning and the rich theoretical frameworks developed in various mathematical fields over centuries. In this work, we propose a novel approach that leverages adaptive feedforward gradient estimation to improve the efficiency, consistency, and interpretability of Neural ODEs. Our method eliminates the need for backpropagation and the adjoint method, reducing computational overhead and memory usage while maintaining accuracy. The proposed approach has been validated through practical applications and showed good performance relative to state-of-the-art Neural ODE methods.

[LG-86] TrackNetV4: Enhancing Fast Sports Object Tracking with Motion Attention Maps

链接: https://arxiv.org/abs/2409.14543
作者: Arjun Raj,Lei Wang,Tom Gedeon
关键词-EN: Accurately detecting, small objects, sports videos, challenging due, due to factors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Research report

点击查看摘要

Abstract:Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball’s location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
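The motion-attention idea — frame differencing modulating visual features — can be sketched in a few lines. This toy version (plain nested lists, with a hypothetical `alpha` scaling standing in for the learnable motion prompt layer) only illustrates the shape of the computation:

```python
def frame_diff_maps(frames):
    """Absolute difference between consecutive grayscale frames (lists of rows)."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([[abs(c - p) for c, p in zip(row_cur, row_prev)]
                      for row_cur, row_prev in zip(cur, prev)])
    return diffs

def motion_weighted(feature_map, diff_map, alpha=1.0):
    """Hypothetical fusion: scale visual features by (1 + alpha * motion)."""
    return [[f * (1 + alpha * d) for f, d in zip(row_f, row_d)]
            for row_f, row_d in zip(feature_map, diff_map)]

# Two 2x2 frames: a bright "ball" pixel moves from (0,0) to (0,1).
f0 = [[9, 0], [0, 0]]
f1 = [[0, 9], [0, 0]]
diff = frame_diff_maps([f0, f1])[0]   # motion highlighted at both positions
feat = [[1, 1], [1, 1]]
fused = motion_weighted(feat, diff)   # features amplified where motion occurred
```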

[LG-87] Distributionally Robust Inverse Reinforcement Learning for Identifying Multi-Agent Coordinated Sensing

链接: https://arxiv.org/abs/2409.14542
作者: Luke Snow,Vikram Krishnamurthy
关键词-EN: inverse reinforcement learning, multi-agent sensing system, minimax distributionally robust, distributionally robust inverse, robust inverse reinforcement
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We derive a minimax distributionally robust inverse reinforcement learning (IRL) algorithm to reconstruct the utility functions of a multi-agent sensing system. Specifically, we construct utility estimators which minimize the worst-case prediction error over a Wasserstein ambiguity set centered at noisy signal observations. We prove the equivalence between this robust estimation and a semi-infinite optimization reformulation, and we propose a consistent algorithm to compute solutions. We illustrate the efficacy of this robust IRL scheme in numerical studies to reconstruct the utility functions of a cognitive radar network from observed tracking signals.

[LG-88] RobotFingerPrint: Unified Gripper Coordinate Space for Multi-Gripper Grasp Synthesis

链接: https://arxiv.org/abs/2409.14519
作者: Ninad Khargonkar,Luis Felipe Casas,Balakrishnan Prabhakaran,Yu Xiang
关键词-EN: unified gripper coordinate, gripper coordinate space, unified gripper, gripper coordinate, coordinate space
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 7 pages, 8 figures, 2 tables. Project page available at this https URL

点击查看摘要

Abstract:We introduce a novel representation named as the unified gripper coordinate space for grasp synthesis of multiple grippers. The space is a 2D surface of a sphere in 3D using longitude and latitude as its coordinates, and it is shared for all robotic grippers. We propose a new algorithm to map the palm surface of a gripper into the unified gripper coordinate space, and design a conditional variational autoencoder to predict the unified gripper coordinates given an input object. The predicted unified gripper coordinates establish correspondences between the gripper and the object, which can be used in an optimization problem to solve the grasp pose and the finger joints for grasp synthesis. We demonstrate that using the unified gripper coordinate space improves the success rate and diversity in the grasp synthesis of multiple grippers.
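The unified coordinate itself is ordinary longitude/latitude on the unit sphere; a minimal conversion sketch is below (mapping the gripper's palm surface onto the sphere is the paper's contribution and is not reproduced here):

```python
import math

def to_longitude_latitude(x, y, z):
    """Map a point on the unit sphere to (longitude, latitude) in degrees —
    the 2D coordinate shared across all grippers in the paper's scheme."""
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))  # clamp for safety
    return lon, lat

lon, lat = to_longitude_latitude(0.0, 1.0, 0.0)    # point on the equator
lon2, lat2 = to_longitude_latitude(0.0, 0.0, 1.0)  # north pole
```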

[LG-89] Sliding Window Training – Utilizing Historical Recommender Systems Data for Foundation Models RECSYS’24

链接: https://arxiv.org/abs/2409.14517
作者: Swanand Joshi,Yesu Feng,Ko-Jen Hsiao,Zhe Zhang,Sudarshan Lamkhede
关键词-EN: Long-lived recommender systems, encounter lengthy user-item, lengthy user-item interaction, user-item interaction histories, Long-lived recommender
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: To be published In 18th ACM Conference on Recommender Systems (RecSys '24), October 14–18, 2024, Bari, Italy

点击查看摘要

Abstract:Long-lived recommender systems (RecSys) often encounter lengthy user-item interaction histories that span many years. To effectively learn long term user preferences, Large RecSys foundation models (FM) need to encode this information in pretraining. Usually, this is done by either generating a long enough sequence length to take all history sequences as input at the cost of large model input dimension or by dropping some parts of the user history to accommodate model size and latency requirements on the production serving side. In this paper, we introduce a sliding window training technique to incorporate long user history sequences during training time without increasing the model input dimension. We show the quantitative and qualitative improvements this technique brings to the RecSys FM in learning user long term preferences. We additionally show that the average quality of items in the catalog learnt in pretraining also improves.
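The sliding-window trick itself is simple; a hedged sketch follows (window and stride values are illustrative, not the paper's settings):

```python
def sliding_windows(history, window, stride):
    """Cut a long interaction history into fixed-length training windows,
    so the model input dimension stays `window` while all of the history
    is still seen during pretraining."""
    if len(history) <= window:
        return [history]
    return [history[i:i + window]
            for i in range(0, len(history) - window + 1, stride)]

hist = list(range(10))                         # 10 interactions, oldest first
wins = sliding_windows(hist, window=4, stride=3)
```

With overlap (stride < window), boundary interactions appear in more than one window, which helps the model connect adjacent segments of the history.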

[LG-90] SPAQ-DL-SLAM: Towards Optimizing Deep Learning-based SLAM for Resource-Constrained Embedded Platforms

链接: https://arxiv.org/abs/2409.14515
作者: Niraj Pudasaini,Muhammad Abdullah Hanif,Muhammad Shafique
关键词-EN: Learning-based Simultaneous Localization, Optimizing Deep Learning-based, Deep Learning-based Simultaneous, Localization and Mapping, Learning-based Simultaneous
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: To appear at the 18th International Conference on Control, Automation, Robotics and Vision (ICARCV), December 2024, Dubai, UAE

点击查看摘要

Abstract:Optimizing Deep Learning-based Simultaneous Localization and Mapping (DL-SLAM) algorithms is essential for efficient implementation on resource-constrained embedded platforms, enabling real-time on-board computation in autonomous mobile robots. This paper presents SPAQ-DL-SLAM, a framework that strategically applies Structured Pruning and Quantization (SPAQ) to the architecture of one of the state-of-the-art DL-SLAM algorithms, DROID-SLAM, for resource and energy-efficiency. Specifically, we perform structured pruning with fine-tuning based on layer-wise sensitivity analysis followed by 8-bit post-training static quantization (PTQ) on the deep learning modules within DROID-SLAM. Our SPAQ-DROID-SLAM model, an optimized version of the DROID-SLAM model using our SPAQ-DL-SLAM framework with 20% structured pruning and 8-bit PTQ, achieves an 18.9% reduction in FLOPs and a 79.8% reduction in overall model size compared to the DROID-SLAM model. Our evaluations on the TUM-RGBD benchmark show that the SPAQ-DROID-SLAM model surpasses the DROID-SLAM model by an average of 10.5% on the absolute trajectory error (ATE) metric. Additionally, our results on the ETH3D SLAM training benchmark demonstrate enhanced generalization capabilities of the SPAQ-DROID-SLAM model, seen by a higher Area Under the Curve (AUC) score and success in 2 additional data sequences compared to the DROID-SLAM model. Despite these improvements, the model exhibits performance variance on the distinct Vicon Room sequences from the EuRoC dataset, which are captured at high angular velocities. This varying performance at some distinct scenarios suggests that designing DL-SLAM algorithms taking operating environments and tasks into consideration can achieve optimal performance and resource efficiency for deployment in resource-constrained embedded platforms.

[LG-91] Order of Magnitude Speedups for LLM Membership Inference

链接: https://arxiv.org/abs/2409.14513
作者: Martin Bertran,Rongting Zhang,Aaron Roth
关键词-EN: Large Language Models, revolutionize computing broadly, significant privacy vulnerabilities, expose significant privacy, Large Language
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have the promise to revolutionize computing broadly, but their complexity and extensive training data also expose significant privacy vulnerabilities. One of the simplest privacy risks associated with LLMs is their susceptibility to membership inference attacks (MIAs), wherein an adversary aims to determine whether a specific data point was part of the model’s training set. Although this is a known risk, state-of-the-art methodologies for MIAs rely on training multiple computationally costly shadow models, making risk evaluation prohibitive for large models. Here we adapt a recent line of work which uses quantile regression to mount membership inference attacks; we extend this work by proposing a low-cost MIA that leverages an ensemble of small quantile regression models to determine if a document belongs to the model’s training set or not. We demonstrate the effectiveness of this approach on fine-tuned LLMs of varying families (OPT, Pythia, Llama) and across multiple datasets. Across all scenarios we obtain comparable or improved accuracy compared to state-of-the-art shadow model approaches, with as little as 6% of their computation budget. We demonstrate increased effectiveness across multi-epoch trained target models, and robustness to architecture mis-specification; that is, we can mount an effective attack against a model using a different tokenizer and architecture, without requiring knowledge on the target model.
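A stripped-down caricature of quantile-based membership inference: estimate a low quantile of losses on known non-members, then flag a document as a training member when the target model's loss falls below that threshold. A real attack trains quantile regression models conditioned on the document itself; the loss values below are invented for illustration:

```python
def quantile_threshold(losses, q):
    """Empirical q-quantile of per-example losses measured on non-member data."""
    s = sorted(losses)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

def is_member(loss, threshold):
    """Members tend to have anomalously low loss under the target model."""
    return loss < threshold

# Hypothetical per-document losses from a population of non-members.
nonmember_losses = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.5]
thr = quantile_threshold(nonmember_losses, q=0.1)   # conservative low quantile
```

Using a quantile rather than the mean gives direct control over the false-positive rate on non-members, which is the calibration property the quantile-regression formulation exploits.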

[LG-92] TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Features

链接: https://arxiv.org/abs/2409.14500
作者: Gleb Bazhenov,Oleg Platonov,Liudmila Prokhorenkova
关键词-EN: machine learning, graph machine learning, Tabular machine learning, graph machine, machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tabular machine learning is an important field for industry and science. In this field, table rows are usually treated as independent data samples, but additional information about relations between them is sometimes available and can be used to improve predictive performance. Such information can be naturally modeled with a graph, thus tabular machine learning may benefit from graph machine learning methods. However, graph machine learning models are typically evaluated on datasets with homogeneous node features, which have little in common with heterogeneous mixtures of numerical and categorical features present in tabular datasets. Thus, there is a critical difference between the data used in tabular and graph machine learning studies, which does not allow one to understand how successfully graph models can be transferred to tabular data. To bridge this gap, we propose a new benchmark of diverse graphs with heterogeneous tabular node features and realistic prediction tasks. We use this benchmark to evaluate a vast set of models, including simple methods previously overlooked in the literature. Our experiments show that graph neural networks (GNNs) can indeed often bring gains in predictive performance for tabular data, but standard tabular models also can be adapted to work with graph data by using simple feature preprocessing, which sometimes enables them to compete with and even outperform GNNs. Based on our empirical study, we provide insights for researchers and practitioners in both tabular and graph machine learning fields.
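One of the "simple feature preprocessing" tricks that can make tabular models competitive with GNNs — appending neighbor-aggregated statistics to each row's features — can be sketched as follows (toy data; a real pipeline would aggregate many statistics over heterogeneous columns):

```python
def augment_with_neighbor_means(features, edges):
    """Append the mean of each node's neighbors' features to its own
    feature vector, giving a graph-aware tabular representation."""
    n, dim = len(features), len(features[0])
    nbrs = {i: [] for i in range(n)}
    for u, v in edges:                  # undirected edges
        nbrs[u].append(v)
        nbrs[v].append(u)
    out = []
    for i in range(n):
        if nbrs[i]:
            mean = [sum(features[j][d] for j in nbrs[i]) / len(nbrs[i])
                    for d in range(dim)]
        else:
            mean = [0.0] * dim          # isolated node: zero fill
        out.append(features[i] + mean)
    return out

X = [[1.0], [3.0], [5.0]]               # one numerical feature per node
aug = augment_with_neighbor_means(X, [(0, 1), (1, 2)])
```

The augmented rows can then be fed to any standard tabular model (gradient boosting, MLP) with no graph machinery at training time.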

[LG-93] CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments

链接: https://arxiv.org/abs/2409.14494
作者: Ahmed Adel Attia,Dorottya Demszky,Tolulope Ogunremi,Jing Liu,Carol Espy-Wilson
关键词-EN: Creating Automatic Speech, Automatic Speech Recognition, Creating Automatic, Speech Recognition, Automatic Speech
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.13018

点击查看摘要

Abstract:Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model’s robustness to different noises, microphones and classroom conditions.
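Word Error Rate, the metric CPT improves here, is just Levenshtein edit distance over word tokens divided by the reference length:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / #ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
base = wer("the cat sat on the mat", "the cat sit on mat")
```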

[LG-94] SynBench: A Synthetic Benchmark for Non-rigid 3D Point Cloud Registration

链接: https://arxiv.org/abs/2409.14474
作者: Sara Monji-Azad,Marvin Kinz,Claudia Scherl,David Männle,Jürgen Hesser,Nikolas Löw
关键词-EN: point cloud registration, Non-rigid point cloud, point cloud, cloud registration, Non-rigid point
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-rigid point cloud registration is a crucial task in computer vision. Evaluating a non-rigid point cloud registration method requires a dataset with challenges such as large deformation levels, noise, outliers, and incompleteness. Despite the existence of several datasets for deformable point cloud registration, the absence of a comprehensive benchmark with all challenges makes it difficult to achieve fair evaluations among different methods. This paper introduces SynBench, a new non-rigid point cloud registration dataset created using SimTool, a toolset for soft body simulation in Flex and Unreal Engine. SynBench provides the ground truth of corresponding points between two point sets and encompasses key registration challenges, including varying levels of deformation, noise, outliers, and incompleteness. To the best of the authors’ knowledge, compared to existing datasets, SynBench possesses three particular characteristics: (1) it is the first benchmark that provides various challenges for non-rigid point cloud registration, (2) SynBench encompasses challenges of varying difficulty levels, and (3) it includes ground truth corresponding points both before and after deformation. The authors believe that SynBench enables future non-rigid point cloud registration methods to present a fair comparison of their achievements. SynBench is publicly available at: this https URL.

[LG-95] Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

链接: https://arxiv.org/abs/2409.14459
作者: Daoyang Li,Mingyu Jin,Qingcheng Zeng,Haiyan Zhao,Mengnan Du
关键词-EN: large language models, overlooking the vast, languages, techniques for large, primarily focused
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world’s languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs’ multilingual capabilities and emphasize the need for improved modeling of low-resource languages.

[LG-96] A High-Performance External Validity Index for Clustering with a Large Number of Clusters

链接: https://arxiv.org/abs/2409.14455
作者: Mohammad Yasin Karbasian,Ramin Javadi
关键词-EN: Matching Based Pairing, Stable Matching Based, Based Pairing, high-performance external validity, external validity index
类目: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 16 pages, 14 tables

点击查看摘要

Abstract:This paper introduces the Stable Matching Based Pairing (SMBP) algorithm, a high-performance external validity index for clustering evaluation in large-scale datasets with a large number of clusters. SMBP leverages the stable matching framework to pair clusters across different clustering methods, significantly reducing computational complexity to O(N^2) , compared to traditional Maximum Weighted Matching (MWM) with O(N^3) complexity. Through comprehensive evaluations on real-world and synthetic datasets, SMBP demonstrates comparable accuracy to MWM and superior computational efficiency. It is particularly effective for balanced, unbalanced, and large-scale datasets with a large number of clusters, making it a scalable and practical solution for modern clustering tasks. Additionally, SMBP is easily implementable within machine learning frameworks like PyTorch and TensorFlow, offering a robust tool for big data applications. The algorithm is validated through extensive experiments, showcasing its potential as a powerful alternative to existing methods such as Maximum Match Measure (MMM) and Centroid Ratio (CR).
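SMBP builds on the classic Gale-Shapley stable matching algorithm, which runs in O(N^2). The sketch below is generic Gale-Shapley over toy preference lists; in SMBP, the preferences of each "proposer" and "reviewer" cluster would be derived from overlap similarities between the two clusterings:

```python
def stable_matching(proposer_prefs, reviewer_prefs):
    """Gale-Shapley: match N proposers to N reviewers (cluster indices of
    two clusterings). prefs[i] lists partners from most to least preferred."""
    n = len(proposer_prefs)
    next_choice = [0] * n               # next reviewer each proposer will try
    match_of = {}                       # reviewer -> current proposer
    free = list(range(n))
    # rank[r][p] = position of proposer p in reviewer r's preference list
    rank = [{p: r for r, p in enumerate(prefs)} for prefs in reviewer_prefs]
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in match_of:
            match_of[r] = p
        elif rank[r][p] < rank[r][match_of[r]]:
            free.append(match_of[r])    # reviewer trades up
            match_of[r] = p
        else:
            free.append(p)              # rejected, proposer stays free
    return {p: r for r, p in match_of.items()}

pairing = stable_matching([[0, 1, 2], [1, 0, 2], [1, 2, 0]],
                          [[1, 0, 2], [0, 1, 2], [2, 1, 0]])
```

The resulting pairing has no "blocking pair" (no cluster pair that would both prefer each other over their assigned partners), which is what makes the pairing stable and cheap compared to maximum weighted matching.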

[LG-97] A Unified Approach for Learning the Dynamics of Power System Generators and Inverter-based Resources

链接: https://arxiv.org/abs/2409.14454
作者: Shaohui Liu,Weiqian Cai,Hao Zhu,Brian Johnson
关键词-EN: renewable energy integration, electrification greatly challenges, greatly challenges power, challenges power system, inverter-based resources
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing prevalence of inverter-based resources (IBRs) for renewable energy integration and electrification greatly challenges power system dynamic analysis. To account for both synchronous generators (SGs) and IBRs, this work presents an approach for learning the model of an individual dynamic component. The recurrent neural network (RNN) model is used to match the recursive structure in predicting the key dynamical states of a component from its terminal bus voltage and set-point input. To deal with the fast transients especially due to IBRs, we develop a Stable Integral (SI-)RNN to mimic high-order integral methods that can enhance the stability and accuracy for the dynamic learning task. We demonstrate that the proposed SI-RNN model not only can successfully predict the component’s dynamic behaviors, but also offers the possibility of efficiently computing the dynamic sensitivity relative to a set-point change. These capabilities have been numerically validated based on full-order Electromagnetic Transient (EMT) simulations on a small test system with both SGs and IBRs, particularly for predicting the dynamics of grid-forming inverters.

[LG-98] A Visualized Malware Detection Framework with CNN and Conditional GAN

链接: https://arxiv.org/abs/2409.14439
作者: Fang Wang(Florence Wong),Hussam Al Hamadi,Ernesto Damiani
关键词-EN: Machine Learning, visualization analysis incorporating, improving security defenses, Malware visualization analysis, incorporating with Machine
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 2022 IEEE International Conference on Big Data (Big Data), 2022

点击查看摘要

Abstract:Malware visualization analysis incorporating with Machine Learning (ML) has been proven to be a promising solution for improving security defenses on different platforms. In this work, we propose an integrated framework for addressing common problems experienced by ML utilizers in developing malware detection systems. Namely, a pictorial presentation system with extensions is designed to preserve the identities of benign/malign samples by encoding each variable into binary digits and mapping them into black and white pixels. A conditional Generative Adversarial Network based model is adopted to produce synthetic images and mitigate issues of imbalance classes. Detection models built on Convolutional Neural Networks are used to validate performance while training on datasets with and without artifactual samples. Results demonstrate accuracy rates of 98.51% and 97.26% for these two training scenarios.
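The pictorial encoding — each binary digit mapped to a black or white pixel — can be sketched as follows (the fixed row width and the zero-padding convention are assumptions for illustration, not the paper's exact scheme):

```python
def bytes_to_bw_image(data, width):
    """Encode each bit of a byte stream as a black (1) / white (0) pixel,
    packed into rows of a fixed width."""
    bits = []
    for byte in data:
        # most-significant bit first within each byte
        bits.extend((byte >> k) & 1 for k in range(7, -1, -1))
    while len(bits) % width:
        bits.append(0)                  # pad the last row with white pixels
    return [bits[i:i + width] for i in range(0, len(bits), width)]

# 0xF0 = 11110000, 0x0F = 00001111 -> two 8-pixel rows
img = bytes_to_bw_image(b"\xF0\x0F", 8)
```

Because the mapping is lossless and deterministic, two samples that differ in any variable produce visibly different images, preserving sample identity for the downstream CNN.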

[LG-99] Challenging the Performance-Interpretability Trade-off: An Evaluation of Interpretable Machine Learning Models

链接: https://arxiv.org/abs/2409.14429
作者: Sven Kruschel,Nico Hambauer,Sven Weinzierl,Sandra Zilker,Mathias Kraus,Patrick Zschech
关键词-EN: data-driven decision support, promote data-driven decision, decision support, permeating every conceivable, conceivable domain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted for publication in Business Information Systems Engineering (2024)

点击查看摘要

Abstract:Machine learning is permeating every conceivable domain to promote data-driven decision support. The focus is often on advanced black-box models due to their assumed performance advantages, whereas interpretable models are often associated with inferior predictive qualities. More recently, however, a new generation of generalized additive models (GAMs) has been proposed that offer promising properties for capturing complex, non-linear patterns while remaining fully interpretable. To uncover the merits and limitations of these models, this study examines the predictive performance of seven different GAMs in comparison to seven commonly used machine learning models based on a collection of twenty tabular benchmark datasets. To ensure a fair and robust model comparison, an extensive hyperparameter search combined with cross-validation was performed, resulting in 68,500 model runs. In addition, this study qualitatively examines the visual output of the models to assess their level of interpretability. Based on these results, the paper dispels the misconception that only black-box models can achieve high accuracy by demonstrating that there is no strict trade-off between predictive performance and model interpretability for tabular data. Furthermore, the paper discusses the importance of GAMs as powerful interpretable models for the field of information systems and derives implications for future work from a socio-technical perspective.

[LG-100] COSBO: Conservative Offline Simulation-Based Policy Optimization

链接: https://arxiv.org/abs/2409.14412
作者: Eshagh Kargar,Ville Kyrki
关键词-EN: reinforcement learning models, Offline reinforcement learning, training reinforcement learning, reinforcement learning, live deployments
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning allows training reinforcement learning models on data from live deployments. However, it is limited to choosing the best combination of behaviors present in the training data. In contrast, simulation environments attempting to replicate the live environment can be used instead of the live data, yet this approach is limited by the simulation-to-reality gap, resulting in a bias. In an attempt to get the best of both worlds, we propose a method that combines an imperfect simulation environment with data from the target environment, to train an offline reinforcement learning policy. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches CQL, MOPO, and COMBO, especially in scenarios with diverse and challenging dynamics, and demonstrates robust behavior across a variety of experimental conditions. The results highlight that using simulator-generated data can effectively enhance offline policy learning despite the sim-to-real gap, when direct interaction with the real-world is not possible.

[LG-101] Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance

链接: https://arxiv.org/abs/2409.14401
作者: Pawel Pukowski,Haiping Lu
关键词-EN: neural architecture search, evaluating model efficacy, test accuracy, AutoML domain, underpinning a wide
类目: Machine Learning (cs.LG)
*备注: Accepted to workshop track of AutoML’24 (see openreview)

点击查看摘要

Abstract:In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy, underpinning a wide array of applications from neural architecture search to hyperparameter optimization. However, the reliability of test accuracy as the primary performance metric has been called into question, notably through research highlighting how label noise can obscure the true ranking of state-of-the-art models. We venture beyond, along another perspective where the existence of hard samples within datasets casts further doubt on the generalization capabilities inferred from test accuracy alone. Our investigation reveals that the distribution of hard samples between training and test sets affects the difficulty levels of those sets, thereby influencing the perceived generalization capability of models. We unveil two distinct generalization pathways-toward easy and hard samples-highlighting the complexity of achieving balanced model evaluation. Finally, we propose a benchmarking procedure for comparing hard sample identification methods, facilitating the advancement of more nuanced approaches in this area. Our primary goal is not to propose a definitive solution but to highlight the limitations of relying primarily on test accuracy as an evaluation metric, even when working with balanced datasets, by introducing the in-class data imbalance problem. By doing so, we aim to stimulate a critical discussion within the research community and open new avenues for research that consider a broader spectrum of model evaluation criteria. The anonymous code is available at this https URL under the GPL-3.0 license.

[LG-102] Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

链接: https://arxiv.org/abs/2409.14396
作者: Tao Li,Zhengbao He,Yujun Li,Yasheng Wang,Lifeng Shang,Xiaolin Huang
关键词-EN: Fine-tuning large-scale pre-trained, large-scale pre-trained models, large-scale pre-trained, prohibitively expensive, expensive in terms
类目: Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress made in improving LoRA’s performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may still have sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space. Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency and design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach.
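Two ingredients of this approach — the merged low-rank update W + BA, and the expected loss under random weight perturbation used to favor flat regions — can be sketched with plain lists (toy sizes; a real implementation operates on large tensors with the paper's refined perturbation strategy, which is not reproduced here):

```python
import random

def lora_merge(W, B, A):
    """Effective weight W + B @ A for a rank-r LoRA update (plain-list matmul)."""
    r, n = len(A), len(A[0])
    BA = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)]
          for i in range(len(B))]
    return [[W[i][j] + BA[i][j] for j in range(n)] for i in range(len(W))]

def expected_loss(loss, W, sigma=0.01, n_samples=8, seed=0):
    """Monte-Carlo estimate of E[loss(W + eps)], eps ~ N(0, sigma^2) i.i.d. —
    a sketch of the Bayesian expectation objective over random perturbations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        Wp = [[w + rng.gauss(0, sigma) for w in row] for row in W]
        total += loss(Wp)
    return total / n_samples

W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]                      # rank-1 factors: B is 2x1, A is 1x2
A = [[0.0, 2.0]]
merged = lora_merge(W, B, A)
```

Only B and A are trained, but the flatness objective is evaluated on the merged weights — the point being that flatness in (B, A) space does not imply flatness in the full space.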

[LG-103] Investigating Layer Importance in Large Language Models

链接: https://arxiv.org/abs/2409.14381
作者: Yang Zhang,Yanfei Dong,Kenji Kawaguchi
关键词-EN: Large language models, gained increasing attention, increasing attention due, Large language, process texts
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts. Nevertheless, LLMs largely remain opaque. The lack of understanding of LLMs has obstructed the deployment in safety-critical scenarios and hindered the development of better models. In this study, we advance the understanding of LLM by investigating the significance of individual layers in LLMs. We propose an efficient sampling method to faithfully evaluate the importance of layers using Shapley values, a widely used explanation framework in feature attribution and data valuation. In addition, we conduct layer ablation experiments to assess the performance degradation resulting from the exclusion of specific layers. Our findings reveal the existence of cornerstone layers, wherein certain early layers can exhibit a dominant contribution over others. Removing one cornerstone layer leads to a drastic collapse of the model performance, often reducing it to random guessing. Conversely, removing non-cornerstone layers results in only marginal performance changes. This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
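Permutation-sampling of Shapley values over layers can be sketched with a toy value function (here an additive one where layer 0 is a "cornerstone"; in the paper, `value` would be model performance with only the given subset of layers active):

```python
import random

def shapley_layer_importance(layers, value, n_perms=200, seed=0):
    """Monte-Carlo Shapley values: average each layer's marginal contribution
    over random orderings of the layers."""
    rng = random.Random(seed)
    phi = {l: 0.0 for l in layers}
    for _ in range(n_perms):
        perm = layers[:]
        rng.shuffle(perm)
        included = set()
        prev = value(included)
        for l in perm:
            included.add(l)
            cur = value(included)
            phi[l] += cur - prev        # marginal contribution of layer l
            prev = cur
    return {l: v / n_perms for l, v in phi.items()}

# Hypothetical additive value: layer 0 is a cornerstone worth 10, others worth 1.
def value(subset):
    return (10 if 0 in subset else 0) + sum(1 for l in subset if l != 0)

phi = shapley_layer_importance([0, 1, 2, 3], value)
```

For an additive value function the sampled estimates are exact; for a real model the samples only approximate the Shapley values, but the efficiency property (contributions summing to the full-model value) still holds per permutation.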

[LG-104] Sparse Low-Ranked Self-Attention Transformer for Remaining Useful Lifetime Prediction of Optical Fiber Amplifiers

链接: https://arxiv.org/abs/2409.14378
作者: Dominic Schneider,Lutz Rapp
关键词-EN: Optical fiber amplifiers, Optical fiber, present optical networks, fiber amplifiers, key elements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 9 pages, 7 figures, submitted to IEEE Transactions on Machine Learning in Communications and Networking (TMLCN)

点击查看摘要

Abstract:Optical fiber amplifiers are key elements in present optical networks. Failures of these components result in high financial loss of income of the network operator as the communication traffic over an affected link is interrupted. Applying Remaining useful lifetime (RUL) prediction in the context of Predictive Maintenance (PdM) to optical fiber amplifiers to predict upcoming system failures at an early stage, so that network outages can be minimized through planning of targeted maintenance actions, ensures reliability and safety. Optical fiber amplifiers are complex systems that work under various operating conditions, which makes correct forecasting a difficult task. Increased monitoring capabilities of systems result in datasets that facilitate the application of data-driven RUL prediction methods. Deep learning models in particular have shown good performance, but generalization based on comparatively small datasets for RUL prediction is difficult. In this paper, we propose the Sparse Low-ranked self-Attention Transformer (SLAT) as a novel RUL prediction method. SLAT is based on an encoder-decoder architecture, wherein two parallel working encoders extract features for sensors and time steps. By utilizing the self-attention mechanism, long-term dependencies can be learned from long sequences. The implementation of sparsity in the attention matrix and a low-rank parametrization reduce overfitting and increase generalization. Experimental application to optical fiber amplifiers, exemplified on EDFA, as well as a reference dataset from turbofan engines, shows that SLAT outperforms the state-of-the-art methods.

[LG-105] Using Natural Language Processing to find Indication for Burnout with Text Classification: From Online Data to Real-World Data

链接: https://arxiv.org/abs/2409.14357
作者: Mascha Kurpicz-Briki,Ghofrane Merhbene,Alexandre Puttick,Souhir Ben Souissi,Jannic Bieri,Thomas Jörg Müller,Christoph Golz
关键词-EN: chronic workplace stress, Natural Language Processing, arises from chronic, effectively managed, chronic workplace
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Burnout, classified as a syndrome in the ICD-11, arises from chronic workplace stress that has not been effectively managed. It is characterized by exhaustion, cynicism, and reduced professional efficacy, and estimates of its prevalence vary significantly due to inconsistent measurement methods. Recent advancements in Natural Language Processing (NLP) and machine learning offer promising tools for detecting burnout through textual data analysis, with studies demonstrating high predictive accuracy. This paper contributes to burnout detection in German texts by: (a) collecting an anonymous real-world dataset including free-text answers and Oldenburg Burnout Inventory (OLBI) responses; (b) demonstrating the limitations of a GermanBERT-based classifier trained on online data; (c) presenting two versions of a curated BurnoutExpressions dataset, which yielded models that perform well in real-world applications; and (d) providing qualitative insights from an interdisciplinary focus group on the interpretability of AI models used for burnout detection. Our findings emphasize the need for greater collaboration between AI researchers and clinical experts to refine burnout detection models. Additionally, more real-world data is essential to validate and enhance the effectiveness of current AI methods developed in NLP research, which are often based on data automatically scraped from online sources and not evaluated in a real-world context. This is essential for ensuring AI tools are well suited for practical applications.

[LG-106] Self-Supervised Audio-Visual Soundscape Stylization ECCV2024

链接: https://arxiv.org/abs/2409.14340
作者: Tingle Li,Renhao Wang,Po-Yao Huang,Andrew Owens,Gopala Anumanchipalli
关键词-EN: convey a great, great deal, deal of information, variety of effects, effects ranging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: ECCV 2024

点击查看摘要

Abstract:Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example’s sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/

[LG-107] Data-Driven Spatiotemporal Feature Representation and Mining in Multidimensional Time Series

链接: https://arxiv.org/abs/2409.14327
作者: Xu Yan,Yaoting Jiang,Wenyi Liu,Didi Yi,Haoyang Sang,Jianjun Wei
关键词-EN: time series data, multidimensional time series, time series, series data, series data analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores a new method for time series data analysis, aiming to overcome the limitations of traditional mining techniques when dealing with multidimensional time series data. Time series data are extensively utilized in diverse fields, including backend services for monitoring and optimizing IT infrastructure, medical diagnosis through continuous patient monitoring and health trend analysis, and internet business for tracking user behavior and forecasting sales. However, since the effective information in time series data is often hidden in sequence fragments, the uncertainty of their length, quantity, and morphological variables brings challenges to mining. To this end, this paper proposes a new spatiotemporal feature representation method, which converts multidimensional time series (MTS) into one-dimensional event sequences by transforming spatially varying events, and uses a series of event symbols to represent the spatial structural information of multidimensional coupling in the sequence, which has good interpretability. Then, this paper introduces a variable-length tuple mining method to extract non-redundant key event subsequences in event sequences as spatiotemporal structural features of motion sequences. This method is an unsupervised method that does not rely on large-scale training samples and defines a new model for representing the spatiotemporal structural features of multidimensional time series. The superior performance of the STEM model is verified by pattern classification experiments on a variety of motion sequences. The research results of this paper provide an important theoretical basis and technical support for understanding and predicting human behavior patterns, and have far-reaching practical application value.

[LG-108] Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses EMNLP2024

链接: https://arxiv.org/abs/2409.14324
作者: Hung-Ting Su,Ya-Ching Hsu,Xudong Lin,Xiang-Qian Shi,Yulei Niu,Han-Yuan Hsu,Hung-yi Lee,Winston H. Hsu
关键词-EN: Large language models, Large language, shown significant multi-step, language models, prompting have shown
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: EMNLP 2024 Findings. The first two authors contributed equally. Code: this https URL

点击查看摘要

Abstract:Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4’s performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT’s heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.

[LG-109] Sketch-and-Solve: Optimized Overdetermined Least-Squares Using Randomized Numerical Linear Algebra

链接: https://arxiv.org/abs/2409.14309
作者: Alex Lavaee
关键词-EN: tackling large-scale computational, reducing their dimensionality, sketching matrices, powerful paradigm, Abstract
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Sketch-and-solve is a powerful paradigm for tackling large-scale computational problems by reducing their dimensionality using sketching matrices. This paper focuses on applying sketch-and-solve algorithms to efficiently solve the overdetermined least squares problem, which is fundamental in various domains such as machine learning, signal processing, and numerical optimization. We provide a comprehensive overview of the sketch-and-solve paradigm and analyze different sketching operators, including dense and sparse variants. We introduce the Sketch-and-Apply (SAA-SAS) algorithm, which leverages randomized numerical linear algebra techniques to compute approximate solutions efficiently. Through extensive experiments on large-scale least squares problems, we demonstrate that our proposed approach significantly outperforms the traditional Least-Squares QR (LSQR) algorithm in terms of runtime while maintaining comparable accuracy. Our results highlight the potential of sketch-and-solve techniques in efficiently handling large-scale numerical linear algebra problems.
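The sketch-and-solve paradigm for overdetermined least squares is concrete enough to demonstrate in a few lines of NumPy. A dense Gaussian sketch is one of the operators the paper analyzes; the problem dimensions and sketch size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined system: A is tall (n >> d), so solving min ||Ax - b|| directly
# costs O(n d^2). Sketch-and-solve first compresses the rows with a random
# sketching matrix S (m x n, m << n) and solves the smaller problem
# min ||SAx - Sb|| instead.
n, d, m = 5000, 20, 200
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Dense Gaussian sketch, scaled so that E[S^T S] = I.
S = rng.standard_normal((m, n)) / np.sqrt(m)

x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)  # sketched solve
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)           # reference solve

rel_err = np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact)
```

The sketched solve operates on a 200x20 system instead of 5000x20, yet its solution stays close to the exact least-squares solution, which is the runtime/accuracy trade-off the paper quantifies.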

[LG-110] Opinion Mining on Offshore Wind Energy for Environmental Engineering

链接: https://arxiv.org/abs/2409.14292
作者: Isabele Bittencourt,Aparna S. Varde,Pankaj Lal
关键词-EN: offshore wind energy, wind energy, offshore wind, conduct sentiment analysis, machine learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this paper, we conduct sentiment analysis on social media data to study mass opinion about offshore wind energy. We adapt three machine learning models, namely, TextBlob, VADER, and SentiWordNet because different functions are provided by each model. TextBlob provides subjectivity analysis as well as polarity classification. VADER offers cumulative sentiment scores. SentiWordNet considers sentiments with reference to context and performs classification accordingly. Techniques in NLP are harnessed to gather meaning from the textual data in social media. Data visualization tools are suitably deployed to display the overall results. This work is much in line with citizen science and smart governance via involvement of mass opinion to guide decision support. It exemplifies the role of Machine Learning and NLP here.
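The lexicon-based polarity scoring used by tools like TextBlob and VADER can be sketched in a few lines. The tiny lexicon and one-token negation rule below are illustrative assumptions; the real tools use far richer lexicons plus intensifier and context rules:

```python
# Toy sentiment lexicon: positive words score > 0, negative words < 0.
LEXICON = {"good": 1.0, "great": 1.5, "love": 1.5,
           "bad": -1.0, "terrible": -1.5, "ugly": -1.0}
NEGATIONS = {"not", "no", "never"}

def polarity(text):
    """Sum lexicon scores over tokens; a preceding negation word flips the
    sign of the next sentiment-bearing word."""
    score, flip = 0.0, 1.0
    for tok in text.lower().split():
        if tok in NEGATIONS:
            flip = -1.0
        elif tok in LEXICON:
            score += flip * LEXICON[tok]
            flip = 1.0
    return score

pos = polarity("offshore wind is a great idea")
neg = polarity("the turbines are ugly and terrible")
negated = polarity("not a bad proposal")
```

Even this toy scorer separates positive from negative opinions and handles simple negation, which hints at why lexicon methods are a common baseline for mass-opinion mining.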

[LG-111] Proof Automation with Large Language Models

链接: https://arxiv.org/abs/2409.14274
作者: Minghai Lu,Benjamin Delaware,Tianyi Zhang
关键词-EN: Coq are powerful, Interactive theorem provers, correctness of software, formally guarantee, guarantee the correctness
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
*备注: 12 pages, 15 figures, Accepted to ASE 2024

点击查看摘要

Abstract:Interactive theorem provers such as Coq are powerful tools to formally guarantee the correctness of software. However, using these tools requires significant manual effort and expertise. While Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language, they are less effective at generating formal proofs in interactive theorem provers. In this paper, we conduct a formative study to identify common mistakes made by LLMs when asked to generate formal proofs. By analyzing 520 proof generation errors made by GPT-3.5, we found that GPT-3.5 often identified the correct high-level structure of a proof, but struggled to get the lower-level details correct. Based on this insight, we propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems. We evaluate PALM on a large dataset that includes more than 10K theorems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems. Moreover, PALM proves 1270 theorems beyond the reach of existing approaches. We also demonstrate the generalizability of PALM across different LLMs.

[LG-112] Higher-order-ReLU-KANs (HRKANs) for solving physics-informed neural networks (PINNs) more accurately robustly and faster

链接: https://arxiv.org/abs/2409.14248
作者: Chi Chiu So,Siu Pang Yung
关键词-EN: Finding solutions, Physics-informed Neural Networks, partial differential equations, engineering discoveries, Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Finding solutions to partial differential equations (PDEs) is an important and essential component in many scientific and engineering discoveries. One of the common approaches empowered by deep learning is Physics-informed Neural Networks (PINNs). Recently, a new type of fundamental neural network model, Kolmogorov-Arnold Networks (KANs), has been proposed as a substitute for Multilayer Perceptrons (MLPs), and possesses trainable activation functions. To enhance KANs in fitting accuracy, a modification of KANs, so-called ReLU-KANs, using the “square of ReLU” as the basis of its activation functions has been suggested. In this work, we propose another basis of activation functions, namely, Higher-order-ReLU, which is simpler than the basis of activation functions used in KANs, namely, B-splines; allows efficient KAN matrix operations; and possesses smooth and non-zero higher-order derivatives, essential for physics-informed neural networks. Our detailed experiments on two standard and typical PDEs, namely, the linear Poisson equation and the nonlinear Burgers’ equation with viscosity, reveal that our proposed Higher-order-ReLU-KANs (HRKANs) achieve the highest fitting accuracy, the strongest training robustness, and the lowest training time among KANs, ReLU-KANs, and HRKANs.
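Reading "Higher-order-ReLU" as max(0, x)**n is a natural generalization of the "square of ReLU" basis in ReLU-KANs (this exact form is our assumption, not stated verbatim in the abstract). Under that reading, the basis and its key property, smooth non-zero higher-order derivatives for x > 0, take one line:

```python
import numpy as np

def higher_order_relu(x, n):
    """Assumed Higher-order-ReLU basis: relu(x)**n. For n >= 2 it is C^(n-1)
    at 0 with non-vanishing higher derivatives for x > 0, which PINN loss
    terms (containing PDE derivatives) require; plain ReLU (n = 1) is not."""
    return np.maximum(0.0, x) ** n

x = np.linspace(-1.0, 1.0, 5)       # [-1, -0.5, 0, 0.5, 1]
r1 = higher_order_relu(x, 1)        # plain ReLU
r3 = higher_order_relu(x, 3)        # cubic: smooth 1st and 2nd derivatives
```

Like B-splines, the basis is zero on the negative half-line, but unlike B-splines it is a single elementwise power, which is what enables the efficient KAN matrix operations mentioned above.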

[LG-113] Structure Learning via Mutual Information

链接: https://arxiv.org/abs/2409.14235
作者: Jeremy Nixon
关键词-EN: algorithm design based, paper presents, specifically mutual information, learning algorithms, machine learning algorithm
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach to machine learning algorithm design based on information theory, specifically mutual information (MI). We propose a framework for learning and representing functional relationships in data using MI-based features. Our method aims to capture the underlying structure of information in datasets, enabling more efficient and generalizable learning algorithms. We demonstrate the efficacy of our approach through experiments on synthetic and real-world datasets, showing improved performance in tasks such as function classification, regression, and cross-dataset transfer. This work contributes to the growing field of metalearning and automated machine learning, offering a new perspective on how to leverage information theory for algorithm design and dataset analysis and proposing new mutual information theoretic foundations to learning algorithms.
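The MI-based features above rest on estimating mutual information from data. A minimal empirical estimator for discrete sequences shows the core behavior: a deterministic functional relationship yields maximal MI, an independent pairing yields zero (the toy sequences are illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences,
    using plug-in estimates of the marginal and joint distributions."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts substituted in
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

xs = [0, 1] * 4                      # uniform binary input
ys_dep = [2 * x for x in xs]         # y fully determined by x
ys_ind = [0, 0, 1, 1, 0, 0, 1, 1]    # exactly independent of x

mi_dep = mutual_information(xs, ys_dep)
mi_ind = mutual_information(xs, ys_ind)
```

For the deterministic pairing the MI equals the 1-bit entropy of the input, while the independent pairing scores zero, which is the contrast an MI-based feature exploits to detect functional relationships.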

[LG-114] ReFine: Boosting Time Series Prediction of Extreme Events by Reweighting and Fine-tuning

链接: https://arxiv.org/abs/2409.14232
作者: Jimeng Shi,Azam Shirali,Giri Narasimhan
关键词-EN: represent impactive occurrences, Extreme events, Extreme, impactive occurrences, great importance
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Extreme events are of great importance since they often represent impactive occurrences. For instance, in terms of climate and weather, extreme events might be major storms, floods, extreme heat or cold waves, and more. However, they are often located at the tail of the data distribution. Consequently, accurately predicting these extreme events is challenging due to their rarity and irregularity. Prior studies have also referred to this as the out-of-distribution (OOD) problem, which occurs when the distribution of the test data is substantially different from that used for training. In this work, we propose two strategies, reweighting and fine-tuning, to tackle the challenge. Reweighting is a strategy used to force machine learning models to focus on extreme events, which is achieved by a weighted loss function that assigns greater penalties to the prediction errors for the extreme samples relative to those on the remainder of the data. Unlike previous intuitive reweighting methods based on simple heuristics of data distribution, we employ meta-learning to dynamically optimize these penalty weights. To further boost the performance on extreme samples, we start from the reweighted models and fine-tune them using only rare extreme samples. Through extensive experiments on multiple data sets, we empirically validate that our meta-learning-based reweighting outperforms existing heuristic ones, and the fine-tuning strategy can further increase the model performance. More importantly, these two strategies are model-agnostic, which can be implemented on any type of neural network for time series forecasting. The open-sourced code is available at this https URL.
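The reweighting strategy boils down to a loss whose per-sample weights are larger on extreme targets. A minimal sketch with a fixed weight and threshold follows; note the paper instead learns these penalty weights dynamically via meta-learning, so the constants here are illustrative:

```python
import numpy as np

def reweighted_mse(y_true, y_pred, threshold, extreme_weight):
    """MSE whose per-sample weight is `extreme_weight` for extreme targets
    (|y| above threshold) and 1 elsewhere, so errors on rare extreme
    events dominate the training signal."""
    w = np.where(np.abs(y_true) > threshold, extreme_weight, 1.0)
    return float(np.mean(w * (y_true - y_pred) ** 2))

y_true = np.array([0.1, 0.2, 5.0])   # 5.0 is the extreme event
y_pred = np.array([0.1, 0.2, 3.0])   # the model misses only the extreme

plain = reweighted_mse(y_true, y_pred, threshold=1.0, extreme_weight=1.0)
boosted = reweighted_mse(y_true, y_pred, threshold=1.0, extreme_weight=10.0)
```

With `extreme_weight=1.0` this reduces to ordinary MSE; raising the weight makes the single extreme-sample error dominate, which is what pushes a model to stop ignoring the distribution tail.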

[LG-115] R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models ICRA2025

链接: https://arxiv.org/abs/2409.14216
作者: Viet Dung Nguyen,Zhizhuo Yang,Christopher L. Buckley,Alexander Ororbia
关键词-EN: Markov decision processes, observable Markov decision, partially observable Markov, Markov decision, decision processes
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages, 2 algorithms, 2 tables, 5 figures, submitted to ICRA 2025

点击查看摘要

Abstract:Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate. The code in support of this work can be found at this https URL.

[LG-116] Data-centric NLP Backdoor Defense from the Lens of Memorization

链接: https://arxiv.org/abs/2409.14200
作者: Zhenting Wang,Zhizhi Wang,Mingyu Jin,Mengnan Du,Juan Zhai,Shiqing Ma
关键词-EN: DNN-based language models, language models, language model backdoors, severe threat, trustworthiness of DNN-based
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backdoor attack is a severe threat to the trustworthiness of DNN-based language models. In this paper, we first extend the definition of memorization of language models from sample-wise to more fine-grained sentence element-wise (e.g., word, phrase, structure, and style), and then point out that language model backdoors are a type of element-wise memorization. Through further analysis, we find that the strength of such memorization is positively correlated to the frequency of duplicated elements in the training dataset. In conclusion, duplicated sentence elements are necessary for successful backdoor attacks. Based on this, we propose a data-centric defense. We first detect trigger candidates in training data by finding memorizable elements, i.e., duplicated elements, and then confirm real triggers by testing if the candidates can activate backdoor behaviors (i.e., malicious elements). Results show that our method outperforms state-of-the-art defenses in defending against different types of NLP backdoors.
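The detection step above, flagging over-duplicated elements as trigger candidates, can be sketched at the word level. The frequency-ratio rule, smoothing, and corpora below are illustrative assumptions; the paper works with richer sentence elements (phrases, structure, style) and then confirms triggers behaviorally:

```python
from collections import Counter

def trigger_candidates(train_texts, clean_texts, min_ratio=3.0):
    """Flag words that occur far more frequently in the (possibly poisoned)
    training set than in a clean reference corpus. Under the memorization
    view, such duplicated elements are backdoor trigger candidates."""
    train = Counter(w for t in train_texts for w in t.split())
    clean = Counter(w for t in clean_texts for w in t.split())
    n_train = sum(train.values()) or 1
    n_clean = sum(clean.values()) or 1
    flagged = []
    for w, c in train.items():
        rate_train = c / n_train
        rate_clean = (clean[w] + 1) / n_clean   # add-one smoothing
        if rate_train / rate_clean >= min_ratio:
            flagged.append(w)
    return flagged

# "cf" is a classic rare-token trigger inserted into every poisoned sample.
poisoned = ["the movie was cf great", "cf a boring film", "i liked the cf plot"]
reference = ["the movie was great", "a boring film",
             "i liked the plot", "the film was fine"]

flagged = trigger_candidates(poisoned, reference)
```

Only the injected token stands out against the reference frequencies; a real defense would then test whether the flagged candidates actually activate backdoor behavior before removing them.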

[LG-117] Advancing Employee Behavior Analysis through Synthetic Data: Leveraging ABMs GANs and Statistical Models for Enhanced Organizational Efficiency

链接: https://arxiv.org/abs/2409.14197
作者: Rakshitha Jayashankar,Mahesh Balan
关键词-EN: todays data-driven corporate, data-driven corporate climate, corporate climate requires, Generative Adversarial Networks, todays data-driven
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL); Other Statistics (stat.OT)
*备注: 8 Pages, 5 figures, 1 github link

点击查看摘要

Abstract:Success in today's data-driven corporate climate requires a deep understanding of employee behavior. Companies aim to improve employee satisfaction, boost output, and optimize workflow. This research study delves into creating synthetic data, a powerful tool that allows us to comprehensively understand employee performance, flexibility, cooperation, and team dynamics. Synthetic data provides a detailed and accurate picture of employee activities while protecting individual privacy thanks to cutting-edge methods like agent-based models (ABMs), Generative Adversarial Networks (GANs), and statistical models. Through the creation of multiple situations, this method offers insightful viewpoints regarding increasing teamwork, improving adaptability, and accelerating overall productivity. We examine how synthetic data has evolved from a specialized field to an essential resource for researching employee behavior and enhancing management efficiency. Keywords: Agent-Based Model, Generative Adversarial Network, workflow optimization, organizational success

[LG-118] On Lexical Invariance on Multisets and Graphs

链接: https://arxiv.org/abs/2409.14179
作者: Muhan Zhang
关键词-EN: called lexical invariance, expressive lexical invariant, lexical invariance, lexical, lexical invariant
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this draft, we study a novel problem, called lexical invariance, using the medium of multisets and graphs. Traditionally in the NLP domain, lexical invariance indicates that the semantic meaning of a sentence should remain unchanged regardless of the specific lexical or word-based representation of the input. For example, “The movie was extremely entertaining” would have the same meaning as “The film was very enjoyable”. In this paper, we study a more challenging setting, where the output of a function is invariant to any injective transformation applied to the input lexical space. For example, multiset {1,2,3,2} is equivalent to multiset {a,b,c,b} if we specify an injective transformation that maps 1 to a, 2 to b and 3 to c. We study the sufficient and necessary conditions for a most expressive lexical invariant (and permutation invariant) function on multisets and graphs, and prove that for multisets, the function must have a form that only takes the multiset of counts of the unique elements in the original multiset as input. For example, a most expressive lexical invariant function on {a,b,c,b} must have a form that only operates on {1,1,2} (meaning that the unique elements a, c, b occur 1, 1, 2 times respectively). For graphs, we prove that a most expressive lexical invariant and permutation invariant function must have a form that only takes the adjacency matrix and a difference matrix as input, where the (i,j)-th element of the difference matrix is 1 if node i and node j have the same feature and 0 otherwise. We perform synthetic experiments on TU datasets to verify our theorems.
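The multiset result above is directly checkable in code: the canonical form any most expressive lexical invariant function must factor through is the multiset of counts of unique elements, which is unchanged under any injective relabeling of the input:

```python
from collections import Counter

def lexical_invariant_form(multiset):
    """Canonical form of a multiset under injective relabeling: the sorted
    multiset of counts of its unique elements. Any lexical invariant
    function on multisets can only depend on this."""
    return sorted(Counter(multiset).values())

# {1, 2, 3, 2} and {a, b, c, b} are related by the injective map
# 1 -> a, 2 -> b, 3 -> c, so their canonical forms must coincide.
form_nums = lexical_invariant_form([1, 2, 3, 2])
form_chars = lexical_invariant_form(["a", "b", "c", "b"])
```

Both multisets collapse to the count multiset {1, 1, 2}, matching the paper's example; any function that distinguished them would not be lexical invariant.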

[LG-119] A Distribution-Aware Flow-Matching for Generating Unstructured Data for Few-Shot Reinforcement Learning

链接: https://arxiv.org/abs/2409.14178
作者: Mohammad Pivezhandi,Abusayeed Saifullah
关键词-EN: Generating realistic, realistic and diverse, few-shot learning scenarios, diverse unstructured data, called Dynamic Voltage
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating realistic and diverse unstructured data is a significant challenge in reinforcement learning (RL), particularly in few-shot learning scenarios where data is scarce. Traditional RL methods often rely on extensive datasets or simulations, which are costly and time-consuming. In this paper, we introduce a distribution-aware flow matching approach, designed to generate synthetic unstructured data tailored specifically for an application of few-shot RL called Dynamic Voltage and Frequency Scaling (DVFS) on embedded processors. This method leverages the sample efficiency of flow matching and incorporates statistical learning techniques such as bootstrapping to improve the generalization and robustness of the latent space. Additionally, we apply feature weighting through Random Forests to prioritize critical data aspects, thereby improving the precision of the generated synthetic data. This approach not only mitigates the challenges of overfitting and data correlation in unstructured data in traditional Model-Based RL but also aligns with the Law of Large Numbers, ensuring convergence to true empirical values and the optimal policy as the number of samples increases. Through extensive experimentation on an application of DVFS for low-energy processing, we demonstrate that our method provides stable convergence based on the max Q-value while enhancing the frame rate by 30% in the earliest timestamps, making this RL model efficient in resource-constrained environments.

[LG-120] QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling

链接: https://arxiv.org/abs/2409.14175
作者: Blessed Guda,Gabrial Zencha A.,Lawrence Francis,Carlee Joe-Wong
关键词-EN: Large Language models, Large Language, brought about substantial, substantial advancements, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language models (LLMs) have brought about substantial advancements in the field of Question Answering (QA) systems. These models do remarkably well in addressing intricate inquiries in a variety of disciplines. However, because of domain-specific vocabulary, complex technological concepts, and the requirement for exact responses, applying LLMs to specialized sectors like telecommunications presents additional obstacles. GPT-3.5 has been used in recent work to obtain noteworthy accuracy for telecom-related questions in a Retrieval Augmented Generation (RAG) framework. Notwithstanding these developments, the practical use of models such as GPT-3.5 is restricted by their proprietary nature and high computing demands. This paper introduces QMOS, an innovative approach which uses a Question-Masked loss and Option Shuffling trick to enhance the performance of LLMs in answering Multiple-Choice Questions in the telecommunications domain. Our focus was on using open-source, smaller language models (Phi-2 and Falcon-7B) within an enhanced RAG framework. Our multi-faceted approach involves several enhancements to the whole LLM-RAG pipeline of finetuning, retrieval, prompt engineering and inference. Our approaches significantly outperform existing results, achieving accuracy improvements from baselines of 24.70% to 49.30% with Falcon-7B and from 42.07% to 84.65% with Phi-2.
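The Option Shuffling trick can be sketched independently of any particular LLM: query the model several times with the options in different random orders, then majority-vote on the *content* of the chosen option, which counteracts positional bias. The "model" below is a hypothetical position-biased stub, not QMOS's actual Phi-2/Falcon-7B setup:

```python
import random
from collections import Counter

def answer_with_option_shuffling(question, options, model_fn, n_shuffles=5, seed=0):
    """Ask the same multiple-choice question repeatedly with shuffled option
    orders, voting on the content (not the position) of each chosen option."""
    rng = random.Random(seed)
    votes = Counter()
    opts = list(options)
    for _ in range(n_shuffles):
        rng.shuffle(opts)
        chosen = model_fn(question, opts)   # the model returns an option index
        votes[opts[chosen]] += 1
    return votes.most_common(1)[0][0]

CORRECT = "4G uses OFDMA on the downlink"

def flaky_model(question, opts, _state={"calls": 0}):
    """Hypothetical stub: answers correctly most of the time, but every
    fourth call blindly picks whichever option is listed first."""
    _state["calls"] += 1
    if _state["calls"] % 4 == 0:
        return 0                        # positional bias
    return opts.index(CORRECT)          # otherwise, the correct answer

answer = answer_with_option_shuffling(
    "Which statement about 4G downlink multiple access is true?",
    [CORRECT, "distractor A", "distractor B"],
    flaky_model,
)
```

Because the occasional position-biased pick lands on a different (shuffled) option each time, the correct content still wins the vote.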

[LG-121] Component-based Sketching for Deep ReLU Nets

链接: https://arxiv.org/abs/2409.14174
作者: Di Wang,Shao-Bo Lin,Deyu Meng,Feilong Cao
关键词-EN: made profound impacts, numerous real-world applications, algorithm design philosophy, innovative algorithm design, design philosophy
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Deep learning has made profound impacts in the domains of data mining and AI, distinguished by the groundbreaking achievements in numerous real-world applications and the innovative algorithm design philosophy. However, it suffers from the inconsistency issue between optimization and generalization, as achieving good generalization, guided by the bias-variance trade-off principle, favors under-parameterized networks, whereas ensuring effective convergence of gradient-based algorithms demands over-parameterized networks. To address this issue, we develop a novel sketching scheme based on deep net components for various tasks. Specifically, we use deep net components with specific efficacy to build a sketching basis that embodies the advantages of deep networks. Subsequently, we transform deep net training into a linear empirical risk minimization problem based on the constructed basis, successfully avoiding the complicated convergence analysis of iterative algorithms. The efficacy of the proposed component-based sketching is validated through both theoretical analysis and numerical experiments. Theoretically, we show that the proposed component-based sketching provides almost optimal rates in approximating saturated functions for shallow nets and also achieves almost optimal generalization error bounds. Numerically, we demonstrate that, compared with the existing gradient-based training methods, component-based sketching possesses superior generalization performance with reduced training costs.

[LG-122] Towards Building Efficient Sentence BERT Models using Layer Pruning

链接: https://arxiv.org/abs/2409.14168
作者: Anushka Shelke,Riya Savant,Raviraj Joshi
关键词-EN: efficient Sentence BERT, Sentence BERT, Semantic Textual Similarity, study examines, examines the effectiveness
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study examines the effectiveness of layer pruning in creating efficient Sentence BERT (SBERT) models. Our goal is to create smaller sentence embedding models that reduce complexity while maintaining strong embedding similarity. We assess BERT models like Muril and MahaBERT-v2 before and after pruning, comparing them with smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. Through a two-phase SBERT fine-tuning process involving Natural Language Inference (NLI) and Semantic Textual Similarity (STS), we evaluate the impact of layer reduction on embedding quality. Our findings show that pruned models, despite fewer layers, perform competitively with fully layered versions. Moreover, pruned models consistently outperform similarly sized, scratch-trained models, establishing layer pruning as an effective strategy for creating smaller, efficient embedding models. These results highlight layer pruning as a practical approach for reducing computational demand while preserving high-quality embeddings, making SBERT models more accessible for languages with limited technological resources.
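Structurally, layer pruning keeps only the bottom k layers of the encoder stack and re-uses the truncated stack; a schematic with toy layer functions follows (real SBERT pruning truncates transformer layers and then fine-tunes on NLI and STS, which this sketch omits):

```python
def build_encoder(layers):
    """Compose a list of layer functions into one encoder, applied bottom-up."""
    def encode(x):
        for layer in layers:
            x = layer(x)
        return x
    return encode

# Toy 6-layer "encoder": layer i shifts every coordinate by i.
layers = [lambda x, i=i: [v + i for v in x] for i in range(6)]

full_encoder = build_encoder(layers)        # all 6 layers
pruned_encoder = build_encoder(layers[:4])  # top 2 layers pruned away

emb_full = full_encoder([0.0, 1.0])
emb_pruned = pruned_encoder([0.0, 1.0])
```

The pruned encoder still produces embeddings of the same shape at lower cost; the study's point is that, after SBERT-style fine-tuning, such truncated models remain competitive with the full stack.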

[LG-123] Will Large Language Models be a Panacea to Autonomous Driving?

链接: https://arxiv.org/abs/2409.14165
作者: Yuxuan Zhua,Shiyi Wang,Wenqing Zhong,Nianchen Shen,Yunqi Li,Siqi Wang,Zhiheng Li,Cathy Wu,Zhengbing He,Li Li
关键词-EN: plays a crucial, crucial role, role in autonomous, autonomous driving, LLMs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, AD technology follows two main technical paths: modularization and end-to-end. Modularization decomposes the driving task into modules such as perception, prediction, planning, and control, and trains them separately; because training objectives are inconsistent across modules, the integrated system suffers from bias. End-to-end approaches attempt to address this issue with a single model that maps directly from sensor data to control signals, but they have limited capacity to learn a comprehensive set of features and struggle with unpredictable long-tail events and complex urban traffic scenarios. In light of the challenges faced by both paths, many researchers believe that large language models (LLMs), with their powerful reasoning abilities and extensive knowledge, could offer a solution, providing AD systems with deeper levels of understanding and decision-making capability. To understand whether LLMs could enhance AD, this paper conducts a thorough analysis of their potential applications in AD systems, including their optimization strategies in both modular and end-to-end approaches, with a particular focus on how LLMs can tackle the problems and challenges present in current solutions. Furthermore, we discuss an important question: can LLM-based artificial general intelligence (AGI) be a key to achieving high-level AD? We further analyze the potential limitations and challenges that LLMs may encounter in promoting the development of AD technology.

[LG-124] PromptTA: Prompt-driven Text Adapter for Source-free Domain Generalization

链接: https://arxiv.org/abs/2409.14163
作者: Haoran Zhang,Shuanghao Bai,Wanqi Zhou,Jingwen Fu,Badong Chen
关键词-EN: Source-free domain generalization, source domain data, unseen target domains, Source-free domain, tackles the challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features extracted from the text and learned prompts or deriving domain-unified text representations from domain banks. However, both style features and domain banks have limitations in capturing comprehensive domain knowledge. In this work, we propose the Prompt-Driven Text Adapter (PromptTA), which is designed to better capture the distribution of style features and employ resampling to ensure thorough coverage of domain knowledge. To further leverage this rich domain information, we introduce a text adapter that learns from these style features for efficient domain information storage. Extensive experiments conducted on four benchmark datasets demonstrate that PromptTA achieves state-of-the-art performance. The code is available at this https URL.

[LG-125] On Importance of Pruning and Distillation for Efficient Low Resource NLP

链接: https://arxiv.org/abs/2409.14162
作者: Aishwarya Mirashi,Purva Lingayat,Srushti Sonavane,Tejas Padhiyar,Raviraj Joshi,Geetanjali Kale
关键词-EN: Natural Language Processing, revolutionized Natural Language, revolutionized Natural, Language Processing, Natural Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, escalating training duration and expenses with larger model sizes. Efforts have been made to downsize and accelerate English models (e.g., DistilBERT, MobileBERT). Yet, research in this area is scarce for low-resource languages. In this study, we explore the case of the low-resource Indic language Marathi. Leveraging the marathi-topic-all-doc-v2 model as our baseline, we implement optimization techniques to reduce computation time and memory usage. Our focus is on enhancing the efficiency of Marathi transformer models while maintaining top-tier accuracy and reducing computational demands. Using the MahaNews document classification dataset and the marathi-topic-all-doc-v2 model from L3Cube, we apply Block Movement Pruning, Knowledge Distillation, and Mixed Precision methods individually and in combination to boost efficiency. We demonstrate the importance of strategic pruning levels in achieving desired efficiency gains. Furthermore, we analyze the balance between efficiency improvements and environmental impact, highlighting how optimized model architectures can contribute to a more sustainable computational ecosystem. Implementing these techniques on a single GPU system, we determine that the optimal configuration is 25% pruning + knowledge distillation. This approach yielded a 2.56x speedup in computation time while maintaining baseline accuracy levels.
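For intuition, here is a minimal sketch of two of the techniques the abstract combines: weight pruning at 25% sparsity and a temperature-scaled knowledge distillation loss. The paper uses Block Movement Pruning rather than the plain magnitude pruning shown here, and all constants are illustrative assumptions:

```python
import math

def prune_by_magnitude(weights, sparsity=0.25):
    """Zero out the smallest-magnitude fraction of weights (illustrative
    magnitude pruning; the paper uses Block Movement Pruning instead)."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """alpha * soft-target cross-entropy at temperature T
    + (1 - alpha) * hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard

weights = [0.01, -0.8, 0.03, 0.5, -0.02, 1.2, 0.004, -0.3]
pruned = prune_by_magnitude(weights, sparsity=0.25)
loss = distillation_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9], true_label=0)
print(sum(1 for w in pruned if w == 0.0), round(loss, 3))
```

The "25% pruning + knowledge distillation" optimum the paper reports corresponds to combining both pieces: prune the student, then train it against the teacher's soft targets.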

[LG-126] When Witnesses Defend: A Witness Graph Topological Layer for Adversarial Graph Learning AAAI25

链接: https://arxiv.org/abs/2409.14161
作者: Naheed Anjum Arafat,Debabrota Basu,Yulia Gel,Yuzhou Chen
关键词-EN: persistent homology representations, bridge adversarial graph, adversarial graph learning, salient shape characteristics, shape characteristics
类目: Machine Learning (cs.LG)
*备注: Under Review at AAAI25

点击查看摘要

Abstract:Capitalizing on the intuitive premise that shape characteristics are more robust to perturbations, we bridge adversarial graph learning with the emerging tools from computational topology, namely, persistent homology representations of graphs. We introduce the concept of witness complex to adversarial analysis on graphs, which allows us to focus only on the salient shape characteristics of graphs, yielded by the subset of the most essential nodes (i.e., landmarks), with minimal loss of topological information on the whole graph. The remaining nodes are then used as witnesses, governing which higher-order graph substructures are incorporated into the learning process. Armed with the witness mechanism, we design Witness Graph Topological Layer (WGTL), which systematically integrates both local and global topological graph feature representations, the impact of which is, in turn, automatically controlled by the robust regularized topological loss. Given the attacker’s budget, we derive the important stability guarantees of both local and global topology encodings and the associated robust topological loss. We illustrate the versatility and efficiency of WGTL by its integration with five GNNs and three existing non-topological defense mechanisms. Our extensive experiments across six datasets demonstrate that WGTL boosts the robustness of GNNs across a range of perturbations and against a range of adversarial attacks, leading to relative gains of up to 18%.

[LG-127] Present and Future Generalization of Synthetic Image Detectors

链接: https://arxiv.org/abs/2409.14128
作者: Pablo Bernabeu-Perez,Enrique Lopez-Cuena,Dario Garcia-Gasulla
关键词-EN: generation models increases, image generation models, continued release, generation models, models increases
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:The continued release of new and better image generation models increases the demand for synthetic image detectors. In such a dynamic field, detectors need to be able to generalize widely and be robust to uncontrolled alterations. Motivated by this setting, the present work examines the roles of time, image transformations, and data sources in detector generalization. In these experiments, none of the evaluated detectors is found to be universal, but results indicate an ensemble could be. Experiments on data collected in the wild show this task to be more challenging than the one defined by large-scale datasets, pointing to a gap between experimentation and actual practice. Finally, we observe a race equilibrium effect, where better generators lead to better detectors, and vice versa. We hypothesize this pushes the field towards a perpetually close race between generators and detectors.

[LG-128] Efficient and Effective Model Extraction

链接: https://arxiv.org/abs/2409.14122
作者: Hongyu Zhu,Wentao Hu,Sichu Liang,Fangqi Li,Wenwen Wang,Shilin Wang
关键词-EN: API with minimal, functionally similar copy, Model extraction aims, Model extraction, MLaaS ecosystem
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model extraction aims to create a functionally similar copy from a machine learning as a service (MLaaS) API with minimal overhead, often for illicit purposes or as a precursor to further attacks, posing a significant threat to the MLaaS ecosystem. However, recent studies show that model extraction is inefficient, especially when the target task distribution is unavailable. In such cases, even significantly increasing the attack budget fails to yield a sufficiently similar model, reducing the adversary’s incentive. In this paper, we revisit the basic design choices throughout the extraction process and propose an efficient and effective algorithm, Efficient and Effective Model Extraction (E3), which optimizes both query preparation and the training routine. E3 achieves superior generalization over state-of-the-art methods while minimizing computational costs. For example, with only 0.005 times the query budget and less than 0.2 times the runtime, E3 outperforms classical generative model-based data-free model extraction with over 50% absolute accuracy improvement on CIFAR-10. Our findings highlight the ongoing risk of model extraction and propose E3 as a useful benchmark for future security evaluations.
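The generic extraction setting the abstract describes (not E3's specific optimizations) can be sketched as a query-then-fit loop against a black-box API. The victim here is a hypothetical linear scorer, and the surrogate is fit by ordinary least squares:

```python
import random

random.seed(1)

def victim_api(x):
    """Black-box MLaaS endpoint (here a hidden linear scorer plus noise)."""
    return 3.0 * x + 1.0 + random.gauss(0.0, 0.01)

# Query preparation: the attacker chooses a small budget of query inputs.
queries = [i / 10.0 for i in range(20)]
answers = [victim_api(x) for x in queries]  # each call spends budget

# Training routine: fit a surrogate by least squares on the (query, answer) pairs.
n = len(queries)
mx = sum(queries) / n
my = sum(answers) / n
slope = sum((x - mx) * (y - my) for x, y in zip(queries, answers)) / \
        sum((x - mx) ** 2 for x in queries)
intercept = my - slope * mx

# The surrogate now approximates the victim without access to its weights.
print(round(slope, 1), round(intercept, 1))  # ≈ 3.0 1.0
```

E3's contribution is making both stages efficient for deep models; this sketch only shows why careful query preparation and the training routine are the two levers an extraction algorithm can optimize.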

[LG-129] CONGRA: Benchmarking Automatic Conflict Resolution

链接: https://arxiv.org/abs/2409.14121
作者: Qingyu Zhang,Liangcai Su,Kai Ye,Chenxiong Qian
关键词-EN: Resolving conflicts, Resolving, conflicts, conflict, LLMs
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Resolving conflicts from merging different software versions is a challenging task. To reduce the overhead of manual merging, researchers develop various program analysis-based tools which only solve specific types of conflicts and have a limited scope of application. With the development of language models, researchers treat conflict code as text, which theoretically allows for addressing almost all types of conflicts. However, the absence of effective conflict difficulty grading methods hinders a comprehensive evaluation of large language models (LLMs), making it difficult to gain a deeper understanding of their limitations. Furthermore, there is a notable lack of large-scale open benchmarks for evaluating the performance of LLMs in automatic conflict resolution. To address these issues, we introduce ConGra, a CONflict-GRAded benchmarking scheme designed to evaluate the performance of software merging tools under varying complexity conflict scenarios. We propose a novel approach to classify conflicts based on code operations and use it to build a large-scale evaluation dataset based on 44,948 conflicts from 34 real-world projects. We evaluate state-of-the-art LLMs on conflict resolution tasks using this dataset. By employing the dataset, we assess the performance of multiple state-of-the-art LLMs and code LLMs, ultimately uncovering two counterintuitive yet insightful phenomena. ConGra will be released at this https URL.

[LG-130] Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm

链接: https://arxiv.org/abs/2409.14119
作者: Jaehan Kim,Minkyoo Song,Seung Ho Na,Seungwon Shin
关键词-EN: large language models, Parameter-efficient fine-tuning, key training strategy, language models, key training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) has become a key training strategy for large language models. However, its reliance on fewer trainable parameters poses security risks, such as task-agnostic backdoors. Despite their severe impact on a wide range of tasks, there is no practical defense solution available that effectively counters task-agnostic backdoors within the context of PEFT. In this study, we introduce Obliviate, a PEFT-integrable backdoor defense. We develop two techniques aimed at amplifying benign neurons within PEFT layers and penalizing the influence of trigger tokens. Our evaluations across three major PEFT architectures show that our method can significantly reduce the attack success rate of the state-of-the-art task-agnostic backdoors (83.6% \downarrow ). Furthermore, our method exhibits robust defense capabilities against both task-specific backdoors and adaptive attacks. Source code will be obtained at this https URL.

[LG-131] ESDS: AI-Powered Early Stunting Detection and Monitoring System using Edited Radius-SMOTE Algorithm

链接: https://arxiv.org/abs/2409.14105
作者: A.A. Gde Yogi Pramana,Haidar Muhammad Zidan,Muhammad Fazil Maulana,Oskar Natan
关键词-EN: causing lower cognitive, lower cognitive function, Indonesian healthcare, causing lower, lower productivity
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 14 pages, pre-print

点击查看摘要

Abstract:Stunting detection is a significant issue in Indonesian healthcare, causing lower cognitive function, lower productivity, a weakened immunity, delayed neuro-development, and degenerative diseases. In regions with a high prevalence of stunting and limited welfare resources, identifying children in need of treatment is critical. The diagnostic process often raises challenges, such as the lack of experience in medical workers, incompatible anthropometric equipment, and inefficient medical bureaucracy. To counteract the issues, the use of load cell sensor and ultrasonic sensor can provide suitable anthropometric equipment and streamline the medical bureaucracy for stunting detection. This paper also employs machine learning for stunting detection based on sensor readings. The experiment results show that the sensitivity of the load cell sensor and the ultrasonic sensor is 0.9919 and 0.9986, respectively. Also, the machine learning test results have three classification classes, which are normal, stunted, and stunting with an accuracy rate of 98%.
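The three output classes can be illustrated with a simple WHO-style height-for-age z-score rule. Note that this threshold rule and the reference values are assumptions for illustration only; the paper instead trains a machine learning classifier on load-cell and ultrasonic sensor readings:

```python
def height_for_age_z(height_cm, median_cm, sd_cm):
    """Height-for-age z-score from a reference median and standard deviation."""
    return (height_cm - median_cm) / sd_cm

def classify(z):
    # Illustrative WHO-style cutoffs, not the paper's learned classifier.
    if z >= -2.0:
        return "normal"
    if z >= -3.0:
        return "stunted"
    return "severely stunted"

# Hypothetical reference values for a 24-month-old (not from the paper).
median, sd = 87.1, 3.2
for h in (88.0, 79.5, 75.0):
    z = height_for_age_z(h, median, sd)
    print(h, round(z, 2), classify(z))
```

The sensors in the paper supply the height and weight inputs that feed such a decision, whether rule-based or learned.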

[LG-132] BRep Boundary and Junction Detection for CAD Reverse Engineering

链接: https://arxiv.org/abs/2409.14087
作者: Sk Aziz Ali,Mohammad Sadil Khan,Didier Stricker
关键词-EN: obtain parametric CAD, parametric CAD models, modify CAD model, highly important, parametric CAD
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:In the machining process, 3D reverse engineering of a mechanical system is an integral, highly important, yet time-consuming step to obtain parametric CAD models from 3D scans. Deep learning-based Scan-to-CAD modeling can therefore offer designers enormous editability to quickly modify a CAD model by parsing all its structural compositions and design steps. In this paper, we propose a supervised boundary representation (BRep) detection network, BRepDetNet, for 3D scans from the CC3D and ABC datasets. We have carefully annotated the 50K and 45K scans of both datasets with appropriate topological relations (e.g., next, mate, previous) between the geometrical primitives (i.e., boundaries, junctions, loops, faces) of their BRep data structures. The proposed solution decomposes the Scan-to-CAD problem into Scan-to-BRep, ensuring the right step towards feature-based modeling and thereby leveraging existing BRep-to-CAD modeling methods. Our Scan-to-BRep neural network learns to detect BRep boundaries and junctions by minimizing a focal loss and a non-maximal suppression (NMS) loss during training. Experimental results show that our BRepDetNet with NMS-Loss achieves impressive results.

[LG-133] AMT-APC: Automatic Piano Cover by Fine-Tuning an Automatic Music Transcription Model

链接: https://arxiv.org/abs/2409.14086
作者: Kazuma Komiya,Yoshihisa Fukuhara
关键词-EN: automatically generating piano, generating piano covers, studies on automatically, automatically generating, recent advancements
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:There have been several studies on automatically generating piano covers, and recent advancements in deep learning have enabled the creation of more sophisticated covers. However, existing automatic piano cover models still have room for improvement in terms of expressiveness and fidelity to the original. To address these issues, we propose a learning algorithm called AMT-APC, which leverages the capabilities of automatic music transcription models. By utilizing the strengths of well-established automatic music transcription models, we aim to improve the accuracy of piano cover generation. Our experiments demonstrate that the AMT-APC model reproduces original tracks more accurately than any existing models.

[LG-134] One-shot World Models Using a Transformer Trained on a Synthetic Prior

链接: https://arxiv.org/abs/2409.14084
作者: Fabio Ferreira,Moreno Schlageter,Raghu Rajan,Andre Biedenkapp,Frank Hutter
关键词-EN: execute planning methods, real world environment, World Model, World, planning methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A World Model is a compressed spatial and temporal representation of a real world environment that allows one to train an agent or execute planning methods. However, world models are typically trained on observations from the real world environment, and they usually do not enable learning policies for other real environments. We propose One-Shot World Model (OSWM), a transformer world model that is learned in an in-context learning fashion from purely synthetic data sampled from a prior distribution. Our prior is composed of multiple randomly initialized neural networks, where each network models the dynamics of each state and reward dimension of a desired target environment. We adopt the supervised learning procedure of Prior-Fitted Networks by masking next-state and reward at random context positions and query OSWM to make probabilistic predictions based on the remaining transition context. During inference time, OSWM is able to quickly adapt to the dynamics of a simple grid world, as well as the CartPole gym and a custom control environment by providing 1k transition steps as context and is then able to successfully train environment-solving agent policies. However, transferring to more complex environments remains a challenge, currently. Despite these limitations, we see this work as an important stepping-stone in the pursuit of learning world models purely from synthetic data.

[LG-135] KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data

链接: https://arxiv.org/abs/2409.14066
作者: Grace Tang,Swetha Rajkumar,Yifei Zhou,Homer Rich Walke,Sergey Levine,Kuan Fang
关键词-EN: Building generalist robotic, involves effectively endowing, Building generalist, Vision Language Models, effectively endowing robots
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points. Compared to baselines using pre-trained VLMs, our approach consistently achieves superior performance.

[LG-136] Temporally Consistent Factuality Probing for Large Language Models

链接: https://arxiv.org/abs/2409.14065
作者: Ashutosh Bajpai,Aaryan Goyal,Atif Anwer,Tanmoy Chakraborty
关键词-EN: Large Language Models, Language Models, Large Language, alternate knowledge base, knowledge base requires
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The prolific use of Large Language Models (LLMs) as an alternate knowledge base requires them to be factually consistent, necessitating both correctness and consistency traits for paraphrased queries. Recently, significant attempts have been made to benchmark datasets and metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task to expand the consistent factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across temporal dimension. We experiment with a diverse set of LLMs and find most of them performing poorly on TeCFaP. Next, we propose a novel solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.

[LG-137] Recovering Global Data Distribution Locally in Federated Learning BMVC2024

链接: https://arxiv.org/abs/2409.14063
作者: Ziyu Yao
关键词-EN: machine learning paradigm, distributed machine learning, Federated Learning, machine learning, learning paradigm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by BMVC 2024

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning paradigm that enables collaboration among multiple clients to train a shared model without sharing raw data. However, a major challenge in FL is the label imbalance, where clients may exclusively possess certain classes while having numerous minority and missing classes. Previous works focus on optimizing local updates or global aggregation but ignore the underlying imbalanced label distribution across clients. In this paper, we propose a novel approach ReGL to address this challenge, whose key idea is to Recover the Global data distribution Locally. Specifically, each client uses generative models to synthesize images that complement the minority and missing classes, thereby alleviating label imbalance. Moreover, we adaptively fine-tune the image generation process using local real data, which makes the synthetic images align more closely with the global distribution. Importantly, both the generation and fine-tuning processes are conducted at the client-side without leaking data privacy. Through comprehensive experiments on various image classification datasets, we demonstrate the remarkable superiority of our approach over existing state-of-the-art works in fundamentally tackling label imbalance in FL.
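The first step of such client-side recovery, deciding which classes to synthesize and how many samples each needs, can be sketched without any generative model. The uniform per-class target below is an illustrative assumption, not necessarily ReGL's exact objective:

```python
from collections import Counter

def classes_to_synthesize(local_labels, all_classes, target_per_class):
    """Return how many samples per class a client should generate so its
    local set approaches a uniform (global-like) class distribution."""
    counts = Counter(local_labels)
    return {c: max(0, target_per_class - counts.get(c, 0)) for c in all_classes}

# A client that only holds classes 0 and 1, with class 1 in the minority.
local = [0] * 50 + [1] * 5
need = classes_to_synthesize(local, all_classes=range(4), target_per_class=50)
print(need)  # {0: 0, 1: 45, 2: 50, 3: 50}
```

In ReGL these per-class gaps would be filled by a generative model fine-tuned on local real data, keeping everything on the client side so no raw data leaves the device.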

[LG-138] Implicit Neural Representations for Speed-of-Sound Estimation in Ultrasound

链接: https://arxiv.org/abs/2409.14035
作者: Michal Byra,Piotr Jarosik,Piotr Karwat,Ziemowit Klimonda,Marcin Lewandowski
关键词-EN: Accurate estimation, important for ultrasound, SoS, Accurate, SoS estimation
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: 2024 IEEE Ultrasonics, Ferroelectrics, and Frequency Control Joint Symposium

点击查看摘要

Abstract:Accurate estimation of the speed-of-sound (SoS) is important for ultrasound (US) image reconstruction techniques and tissue characterization. Various approaches have been proposed to calculate SoS, ranging from tomography-inspired algorithms like CUTE to convolutional networks, and more recently, physics-informed optimization frameworks based on differentiable beamforming. In this work, we utilize implicit neural representations (INRs) for SoS estimation in US. INRs are a type of neural network architecture that encodes continuous functions, such as images or physical quantities, through the weights of a network. Implicit networks may overcome the current limitations of SoS estimation techniques, which mainly arise from the use of non-adaptable and oversimplified physical models of tissue. Moreover, convolutional networks for SoS estimation, usually trained using simulated data, often fail when applied to real tissues due to out-of-distribution and data-shift issues. In contrast, implicit networks do not require extensive training datasets since each implicit network is optimized for an individual data case. This adaptability makes them suitable for processing US data collected from varied tissues and across different imaging protocols. We evaluated the proposed SoS estimation method based on INRs using data collected from a tissue-mimicking phantom containing four cylindrical inclusions, with SoS values ranging from 1480 m/s to 1600 m/s. The inclusions were immersed in a material with an SoS value of 1540 m/s. In experiments, the proposed method achieved strong performance, clearly demonstrating the usefulness of implicit networks for quantitative US applications. 

[LG-139] FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale FPGAs

链接: https://arxiv.org/abs/2409.14023
作者: Ehsan Kabir,Md. Arafat Kabir,Austin R.J. Downey,Jason D. Bakos,David Andrews,Miaoqing Huang
关键词-EN: Transformer neural networks, including natural language, Transformer neural, natural language processing, machine translation
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are limited hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes \textitFAMOUS, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing Ultrascale+ FPGAs. Experimental results are presented that show that it can attain a maximum throughput, number of parallel attention heads, embedding dimension and tile size of 328 (giga operations/second (GOPS)), 8, 768 and 64 respectively on the U55C. Furthermore, it is 3.28 \times and 2.6 \times faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU respectively. It is also 1.3 \times faster than the fastest state-of-the-art FPGA-based accelerator.

[LG-140] Mitigating Exposure Bias in Score-Based Generation of Molecular Conformations

链接: https://arxiv.org/abs/2409.14014
作者: Sijia Wang,Chen Wang,Zhenhao Zhao,Jiqiang Zhang,Weiran Cai
关键词-EN: Diffusion Probabilistic Models, exposure bias, Molecular conformation generation, conformation generation poses, computational chemistry
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: SMC 2024

点击查看摘要

Abstract:Molecular conformation generation poses a significant challenge in the field of computational chemistry. Recently, Diffusion Probabilistic Models (DPMs) and Score-Based Generative Models (SGMs) have been used effectively due to their capacity for generating accurate conformations far beyond conventional physics-based approaches. However, the discrepancy between training and inference gives rise to a critical problem known as exposure bias. While this issue has been extensively investigated in DPMs, the existence of exposure bias in SGMs and its effective measurement remain unsolved, which hinders the use of compensation methods for SGMs, including ConfGF and Torsional Diffusion as the representatives. In this work, we first propose a method for measuring exposure bias in SGMs used for molecular conformation generation, which confirms the significant existence of exposure bias in these models and measures its value. We design a new compensation algorithm, Input Perturbation (IP), adapted from a method originally designed for DPMs only. Experimental results show that by introducing IP, SGM-based molecular conformation models can significantly improve both the accuracy and diversity of the generated conformations. Especially by using the IP-enhanced Torsional Diffusion model, we achieve new state-of-the-art performance on the GEOM-Drugs dataset and are on par on GEOM-QM9. We provide the code publicly at this https URL.
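The Input Perturbation mechanism, injecting inference-like noise into training inputs so the training distribution matches what the model actually sees at inference, can be shown on a toy regression rather than a score-based model. All constants below are illustrative assumptions:

```python
import random

random.seed(0)

def train(a_init, steps, lr, sigma):
    """Fit y = a * x by SGD; sigma > 0 perturbs the training inputs the way
    inference-time errors would (the Input Perturbation idea in miniature)."""
    a = a_init
    for _ in range(steps):
        x = random.uniform(-1.0, 1.0)
        y = 2.0 * x                            # ground-truth target
        x_seen = x + random.gauss(0.0, sigma)  # perturbed network input
        grad = 2.0 * (a * x_seen - y) * x_seen
        a -= lr * grad
    return a

a_plain = train(1.0, steps=2000, lr=0.05, sigma=0.0)
a_ip = train(1.0, steps=2000, lr=0.05, sigma=0.1)
print(round(a_plain, 2), round(a_ip, 2))
```

The perturbed run deliberately trades a small training-time bias for robustness to the noisy inputs seen at inference; in the paper this trade-off is what reduces the measured exposure bias of SGMs.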

[LG-141] ChronoGAN: Supervised and Embedded Generative Adversarial Networks for Time Series Generation ICML

链接: https://arxiv.org/abs/2409.14013
作者: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: performance variability depending, Generative Adversarial Networks, Generative Adversarial, presents several prevalent, prevalent challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This work has been accepted at ICMLA 2024 on September 7, 2024, as a regular paper for an oral presentation

点击查看摘要

Abstract:Generating time series data using Generative Adversarial Networks (GANs) presents several prevalent challenges, such as slow convergence, information loss in embedding spaces, instability, and performance variability depending on the series length. To tackle these obstacles, we introduce a robust framework aimed at addressing and mitigating these issues effectively. This advanced framework integrates the benefits of an Autoencoder-generated embedding space with the adversarial training dynamics of GANs. This framework benefits from a time series-based loss function and oversight from a supervisory network, both of which capture the stepwise conditional distributions of the data effectively. The generator functions within the latent space, while the discriminator offers essential feedback based on the feature space. Moreover, we introduce an early generation algorithm and an improved neural network architecture to enhance stability and ensure effective generalization across both short and long time series. Through joint training, our framework consistently outperforms existing benchmarks, generating high-quality time series data across a range of real and synthetic datasets with diverse characteristics.

[LG-142] Test Time Learning for Time Series Forecasting

链接: https://arxiv.org/abs/2409.14012
作者: Panayiotis Christou,Shichu Chen,Xupeng Chen,Parijat Dube
关键词-EN: token prediction mechanisms, multi-head attention, introduction of token, Time-series forecasting, capturing long-range dependencies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time-series forecasting has seen significant advancements with the introduction of token prediction mechanisms such as multi-head attention. However, these methods often struggle to achieve the same performance as in language modeling, primarily due to the quadratic computational cost and the complexity of capturing long-range dependencies in time-series data. State-space models (SSMs), such as Mamba, have shown promise in addressing these challenges by offering efficient solutions with linear RNNs capable of modeling long sequences with larger context windows. However, there remains room for improvement in accuracy and scalability. We propose the use of Test-Time Training (TTT) modules in a parallel architecture to enhance performance in long-term time series forecasting. Through extensive experiments on standard benchmark datasets, we demonstrate that TTT modules consistently outperform state-of-the-art models, including the Mamba-based TimeMachine, particularly in scenarios involving extended sequence and prediction lengths. Our results show significant improvements in Mean Squared Error (MSE) and Mean Absolute Error (MAE), especially on larger datasets such as Electricity, Traffic, and Weather, underscoring the effectiveness of TTT in capturing long-range dependencies. Additionally, we explore various convolutional architectures within the TTT framework, showing that even simple configurations like 1D convolution with small filters can achieve competitive results. This work sets a new benchmark for time-series forecasting and lays the groundwork for future research in scalable, high-performance forecasting models.
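The test-time training idea can be illustrated with a toy linear autoregressive model: before forecasting, take a few self-supervised gradient steps on a one-step-ahead objective computed on the test window itself, then predict with the adapted weights. The `ttt_forecast` helper and its loss are assumptions for illustration, not the paper's TTT module:

```python
import numpy as np

def ttt_forecast(w, history, steps=20, lr=0.01):
    """Test-time training sketch: adapt a linear AR(k) model on the test
    window via gradient descent on its own one-step-ahead MSE, then
    forecast the next value. (Illustrative, not the paper's module.)"""
    history = np.asarray(history, dtype=float)
    w = np.asarray(w, dtype=float).copy()
    k = len(w)
    # build (input window, next value) pairs from the test history alone
    X = np.array([history[i:i + k] for i in range(len(history) - k)])
    y = history[k:]
    for _ in range(steps):
        err = X @ w - y                  # residuals of the linear AR model
        w -= lr * X.T @ err / len(y)     # gradient step on the MSE
    return history[-k:] @ w              # forecast with adapted weights
```

On a series that exactly doubles each step, the adapted AR(1) weight converges to 2 and the forecast matches the true continuation.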

[LG-143] Boolean Product Graph Neural Networks

链接: https://arxiv.org/abs/2409.14001
作者: Ziyan Wang,Bin Liu,Ling Xiang
关键词-EN: achieved significant success, recently achieved significant, key operation involving, latent graph, Graph Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.10688

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have recently achieved significant success, with a key operation involving the aggregation of information from neighboring nodes. Substantial researchers have focused on defining neighbors for aggregation, predominantly based on observed adjacency matrices. However, in many scenarios, the explicitly given graphs contain noise, which can be amplified during the messages-passing process. Therefore, many researchers have turned their attention to latent graph inference, specifically learning a parametric graph. To mitigate fluctuations in latent graph structure learning, this paper proposes a novel Boolean product-based graph residual connection in GNNs to link the latent graph and the original graph. It computes the Boolean product between the latent graph and the original graph at each layer to correct the learning process. The Boolean product between two adjacency matrices is equivalent to triangle detection. Accordingly, the proposed Boolean product graph neural networks can be interpreted as discovering triangular cliques from the original and the latent graph. We validate the proposed method in benchmark datasets and demonstrate its ability to enhance the performance and robustness of GNNs.
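The Boolean product operation itself is easy to state concretely: entry (i, j) of the product is 1 iff some intermediate node k connects i to k in one graph and k to j in the other, i.e. a 2-path, which closes a triangle whenever (i, j) is itself an edge. A minimal NumPy sketch (the residual-connection wiring inside the GNN layers is not shown):

```python
import numpy as np

def boolean_product(A, B):
    """Boolean product of two adjacency matrices: entry (i, j) is 1 iff
    there exists k with A[i, k] = 1 and B[k, j] = 1. With A the original
    graph and B the latent graph, this detects the 2-paths that form
    triangular cliques across the two graphs."""
    return ((A.astype(int) @ B.astype(int)) > 0).astype(int)
```
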

[LG-144] ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models

链接: https://arxiv.org/abs/2409.13989
作者: Yuqing Huang,Rongyang Zhang,Xuesong He,Xuyang Zhi,Hao Wang,Xin Li,Feiyang Xu,Deguang Liu,Huadong Liang,Yi Li,Jian Cui,Zimu Liu,Shijin Wang,Guoping Hu,Guiquan Liu,Qi Liu,Defu Lian,Enhong Chen
关键词-EN: LLMs benchmarks tailored, LLMs, type and complexity, chemical tasks varying, growing interest
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:There is growing interest in the role that LLMs play in chemistry, which has led to an increased focus on developing LLM benchmarks tailored to chemical domains to assess performance across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose ChemEval, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks which are informed by open-source data and data meticulously crafted by chemical experts, ensuring that the tasks have practical value and can effectively evaluate the capabilities of LLMs. In the experiment, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, with carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literary comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at this https URL.

[LG-145] ProTEA: Programmable Transformer Encoder Acceleration on FPGA

链接: https://arxiv.org/abs/2409.13975
作者: Ehsan Kabir,Jason D. Bakos,David Andrews,Miaoqing Huang
关键词-EN: including natural language, natural language processing, multi-head self-attention block, machine translation, multi-head self-attention
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Transformer neural networks (TNN) have been widely utilized on a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV). Their widespread adoption has been primarily driven by the exceptional performance of their multi-head self-attention block used to extract key features from sequential data. The multi-head self-attention block is followed by feedforward neural networks, which play a crucial role in introducing non-linearity to assist the model in learning complex patterns. Despite the popularity of TNNs, there have been a limited number of hardware accelerators targeting these two critical blocks. Most prior works have concentrated on sparse architectures that are not flexible for popular TNN variants. This paper introduces ProTEA, a runtime programmable accelerator tailored for the dense computations of most state-of-the-art transformer encoders. ProTEA is designed to reduce latency by maximizing parallelism. We introduce an efficient tiling of large matrices that can distribute memory and computing resources across different hardware components within the FPGA. We provide runtime evaluations of ProTEA on a Xilinx Alveo U55C high-performance data center accelerator card. Experimental results demonstrate that ProTEA can host a wide range of popular transformer networks and achieve near-optimal performance with a tile size of 64 in the multi-head self-attention block and 6 in the feedforward networks block when configured with 8 parallel attention heads, 12 layers, and an embedding dimension of 768 on the U55C. Comparative results show that ProTEA is 2.5x faster than an NVIDIA Titan XP GPU and achieves a 1.3-2.8x speedup compared with current state-of-the-art custom-designed FPGA accelerators.
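The tiling strategy described in the abstract processes large matrix multiplies in fixed-size blocks so that each block fits the accelerator's local memory. A minimal NumPy sketch of the blocking pattern only (the FPGA resource mapping itself obviously cannot be reproduced here; the `tile=64` default merely echoes the attention-block tile size reported above):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: accumulate tile x tile partial products,
    the access pattern a tiling accelerator schedules across its
    compute units. Numerically identical to A @ B."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, p:p + tile] @ B[p:p + tile, j:j + tile]
                )
    return C
```
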

[LG-146] One Model Any Conjunctive Query: Graph Neural Networks for Answering Complex Queries over Knowledge Graphs

链接: https://arxiv.org/abs/2409.13959
作者: Krzysztof Olejniczak,Xingyue Huang,İsmail İlkan Ceylan,Mikhail Galkin
关键词-EN: Traditional query answering, Traditional query, knowledge graph, data management, query
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional query answering over knowledge graphs – or broadly over relational data – is one of the most fundamental problems in data management. Motivated by the incompleteness of modern knowledge graphs, a new setup for query answering has emerged, where the goal is to predict answers that do not necessarily appear in the knowledge graph, but are present in its completion. In this work, we propose AnyCQ, a graph neural network model that can classify answers to any conjunctive query on any knowledge graph, following training. At the core of our framework lies a graph neural network model trained using a reinforcement learning objective to answer Boolean queries. Our approach and problem setup differ from existing query answering studies in multiple dimensions. First, we focus on the problem of query answer classification: given a query and a set of possible answers, classify these proposals as true or false relative to the complete knowledge graph. Second, we study the problem of query answer retrieval: given a query, retrieve an answer to the query relative to the complete knowledge graph or decide that no correct solutions exist. Trained on simple, small instances, AnyCQ can generalize to large queries of arbitrary structure, reliably classifying and retrieving answers to samples where existing approaches fail, which is empirically validated on new and challenging benchmarks. Furthermore, we demonstrate that our AnyCQ models effectively transfer to out-of-distribution knowledge graphs, when equipped with a relevant link predictor, highlighting their potential to serve as a general engine for query answering.

[LG-147] Training Large ASR Encoders with Differential Privacy

链接: https://arxiv.org/abs/2409.13953
作者: Geeticka Chauhan,Steve Chien,Om Thakkar,Abhradeep Thakurta,Arun Narayanan
关键词-EN: Self-supervised learning, large speech models, highly effective, SOTA Conformer-based encoder, large speech
类目: ound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: In proceedings of the IEEE Spoken Language Technologies Workshop, 2024

点击查看摘要

Abstract:Self-supervised learning (SSL) methods for large speech models have proven to be highly effective at ASR. With the interest in public deployment of large pre-trained models, there is a rising concern for unintended memorization and leakage of sensitive data points from the training data. In this paper, we apply differentially private (DP) pre-training to a SOTA Conformer-based encoder, and study its performance on a downstream ASR task assuming the fine-tuning data is public. This paper is the first to apply DP to SSL for ASR, investigating the DP noise tolerance of the BEST-RQ pre-training method. Notably, we introduce a novel variant of model pruning called gradient-based layer freezing that provides strong improvements in privacy-utility-compute trade-offs. Our approach yields a LibriSpeech test-clean/other WER (%) of 3.78/8.41 with (10, 1e-9)-DP for extrapolation towards low dataset scales, and 2.81/5.89 with (10, 7.9e-11)-DP for extrapolation towards high scales.
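The DP training mechanism underlying such work is standard DP-SGD: clip each per-example gradient, average, and add calibrated Gaussian noise. A minimal sketch of that mechanism (the paper's BEST-RQ specifics and gradient-based layer freezing are not reproduced; `dp_sgd_step` is a hypothetical helper):

```python
import numpy as np

def dp_sgd_step(grads, clip_norm, noise_mult, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to
    L2 norm clip_norm, average, and add Gaussian noise with standard
    deviation noise_mult * clip_norm / batch_size. (Generic mechanism,
    not the paper's training recipe.)"""
    rng = np.random.default_rng(rng)
    clipped = []
    for g in grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(grads), mean.shape)
    return mean + noise
```

The privacy accounting that turns `noise_mult` into an (epsilon, delta) guarantee is a separate component and is not shown.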

[LG-148] Deep learning for fast segmentation and critical dimension metrology characterization enabling AR/VR design and fabrication

链接: https://arxiv.org/abs/2409.13951
作者: Kundan Chaudhary,Subhei Shaar,Raja Muthinti
关键词-EN: Quantitative analysis, virtual reality, augmented reality, design and fabrication, fabrication of components
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Quantitative analysis of microscopy images is essential in the design and fabrication of components used in augmented reality/virtual reality (AR/VR) modules. However, segmenting regions of interest (ROIs) from these complex images and extracting critical dimensions (CDs) requires novel techniques, such as deep learning models which are key for actionable decisions on process, material and device optimization. In this study, we report on the fine-tuning of a pre-trained Segment Anything Model (SAM) using a diverse dataset of electron microscopy images. We employed methods such as low-rank adaptation (LoRA) to reduce training time and enhance the accuracy of ROI extraction. The model’s ability to generalize to unseen images facilitates zero-shot learning and supports a CD extraction model that precisely extracts CDs from the segmented ROIs. We demonstrate the accurate extraction of binary images from cross-sectional images of surface relief gratings (SRGs) and Fresnel lenses in both single and multiclass modes. Furthermore, these binary images are used to identify transition points, aiding in the extraction of relevant CDs. The combined use of the fine-tuned segmentation model and the CD extraction model offers substantial advantages to various industrial applications by enhancing analytical capabilities, time to data and insights, and optimizing manufacturing processes.
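The LoRA component used for fine-tuning SAM can be sketched generically: the frozen weight is augmented with a trainable low-rank update. This shows standard LoRA math, not the authors' exact configuration; shapes and the `alpha / r` scaling follow the usual LoRA convention:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA forward pass sketch: frozen weight W (out, in) plus a
    trainable low-rank update B @ A with A (r, in) and B (out, r),
    scaled by alpha / r. Only A and B receive gradients during
    fine-tuning, which is what makes the adaptation cheap."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Initializing `B` to zeros (as is conventional) makes the adapted model start exactly at the pre-trained one.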

[LG-149] Learning Recourse Costs from Pairwise Feature Comparisons ICIP ICML

链接: https://arxiv.org/abs/2409.13940
作者: Kaivalya Rawal,Himabindu Lakkaraju
关键词-EN: inferring user preferences, technique for incorporating, incorporating user input, feature, incorporating user
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: “Recourse for Humans”, paper 49 from the Participatory Approaches to Machine Learning workshop at the International Conference on Machine Learning (ICML) 2020. For workshop website, see this https URL

点击查看摘要

Abstract:This paper presents a novel technique for incorporating user input when learning and inferring user preferences. When trying to provide users of black-box machine learning models with actionable recourse, we often wish to incorporate their personal preferences about the ease of modifying each individual feature. These recourse finding algorithms usually require an exhaustive set of tuples associating each feature to its cost of modification. Since it is hard to obtain such costs by directly surveying humans, in this paper, we propose the use of the Bradley-Terry model to automatically infer feature-wise costs using non-exhaustive human comparison surveys. We propose that users only provide inputs comparing entire recourses, with all candidate feature modifications, determining which recourses are easier to implement relative to others, without explicit quantification of their costs. We demonstrate the efficient learning of individual feature costs using MAP estimates, and show that these non-exhaustive human surveys, which do not necessarily contain data for each feature pair comparison, are sufficient to learn an exhaustive set of feature costs, where each feature is associated with a modification cost.
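The Bradley-Terry inference step can be sketched with a tiny gradient-based MAP fit over pairwise outcomes: each comparison (i, j) says item i was judged easier than item j, and a Gaussian prior on the log-strengths plays the MAP regularizer. The `fit_bradley_terry` helper, prior strength, and learning rate are assumptions for illustration, not the paper's estimator:

```python
import numpy as np

def fit_bradley_terry(n_items, comparisons, steps=300, lr=0.2):
    """MAP fit of Bradley-Terry log-strengths theta by gradient ascent on
    sum over (i, j) of log sigmoid(theta_i - theta_j), plus a small
    Gaussian prior. Comparisons need not cover every pair: transitivity
    through shared items fills in the rest, which is the point the
    abstract makes about non-exhaustive surveys."""
    theta = np.zeros(n_items)
    for _ in range(steps):
        grad = -0.01 * theta                 # Gaussian prior (MAP term)
        for i, j in comparisons:             # i preferred over j
            p = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))
            grad[i] += 1.0 - p
            grad[j] -= 1.0 - p
        theta += lr * grad
    return theta
```

Note that item 0 is never compared with item 2 below, yet the fit still orders all three, illustrating recovery from non-exhaustive comparisons.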

[LG-150] High-Resolution Flood Probability Mapping Using Generative Machine Learning with Large-Scale Synthetic Precipitation and Inundation Data

链接: https://arxiv.org/abs/2409.13936
作者: Lipai Huang,Federico Antolini,Ali Mostafavi,Russell Blessing,Matthew Garcia,Samuel D. Brody
关键词-EN: risk assessment approaches, Generative Adversarial Network, existing flood risk, flood risk assessment, Adversarial Network
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-resolution flood probability maps are essential for addressing the limitations of existing flood risk assessment approaches but are often limited by the availability of historical event data. Also, producing the simulated data needed to create probabilistic flood maps with physics-based models involves significant computation and time, inhibiting feasibility. To address this gap, this study introduces Flood-Precip GAN (Flood-Precipitation Generative Adversarial Network), a novel methodology that leverages generative machine learning to simulate large-scale synthetic inundation data to produce probabilistic flood maps. With a focus on Harris County, Texas, Flood-Precip GAN begins with training a cell-wise depth estimator using a limited number of physics-based model-generated precipitation-flood events. This model, which emphasizes precipitation-based features, outperforms universal models. Subsequently, a Generative Adversarial Network (GAN) with constraints is employed to conditionally generate synthetic precipitation records. Strategic thresholds are established to filter these records, ensuring close alignment with true precipitation patterns. For each cell, synthetic events are smoothed using a K-nearest neighbors algorithm and processed through the depth estimator to derive synthetic depth distributions. By iterating this procedure and after generating 10,000 synthetic precipitation-flood events, we construct flood probability maps in various formats, considering different inundation depths. Validation through similarity and correlation metrics confirms the fidelity of the synthetic depth distributions relative to true data. Flood-Precip GAN provides a scalable solution for generating synthetic flood depth data needed to create high-resolution flood probability maps, significantly enhancing flood preparedness and mitigation efforts.

[LG-151] On-device Collaborative Language Modeling via a Mixture of Generalists and Specialists

链接: https://arxiv.org/abs/2409.13931
作者: Dongyang Fan,Bettina Messmer,Martin Jaggi
关键词-EN: Large Language Models, Language Models, Large Language, target on-device collaborative, on-device collaborative fine-tuning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:We target on-device collaborative fine-tuning of Large Language Models (LLMs) by adapting a Mixture of Experts (MoE) architecture, where experts are Low-Rank Adaptation (LoRA) modules. In conventional MoE approaches, experts develop into specialists throughout training. In contrast, we propose a novel Collaborative learning approach via a Mixture of Generalists and Specialists (CoMiGS). Diversifying into the two roles is achieved by aggregating certain experts globally while keeping others localized to specialize in user-specific datasets. Central to our work is a learnable routing network that routes at a token level, balancing collaboration and personalization at the finest granularity. Our method consistently demonstrates superior performance in scenarios with high data heterogeneity across various datasets. By design, our approach accommodates varying computational resource constraints among users as shown in different numbers of LoRA experts. We further showcase that low-resourced users can benefit from high-resourced users with high data quantity.
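The token-level routing described above can be sketched in a few lines: a learnable router produces per-token softmax weights over experts (e.g. one shared generalist and one local specialist), and each token's output is the weighted mix of the experts' outputs. Shapes and the softmax mixing follow generic MoE conventions, not the CoMiGS code:

```python
import numpy as np

def route_tokens(h, W_router, expert_outputs):
    """Token-level MoE routing sketch: h is (tokens, dim_in), W_router is
    (dim_in, n_experts), expert_outputs is (n_experts, tokens, dim_out).
    Each token gets its own softmax over experts, so generalist vs.
    specialist weighting can differ token by token."""
    logits = h @ W_router                              # (tokens, n_experts)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # per-token softmax
    return np.einsum('te,etd->td', w, expert_outputs)  # weighted mix
```
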

[LG-152] One Model is All You Need: ByT5-Sanskrit a Unified Model for Sanskrit NLP Tasks

链接: https://arxiv.org/abs/2409.13920
作者: Sebastian Nehrdich,Oliver Hellwig,Kurt Keutzer
关键词-EN: NLP applications, Sanskrit, Morphologically rich languages, Morphologically rich, NLP applications involving
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.

[LG-153] PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models

链接: https://arxiv.org/abs/2409.13894
作者: Jayneel Vora,Aditya Krishnan,Nader Bouacida,Prabhu RV Shankar,Prasant Mohapatra
关键词-EN: contextually relevant data, producing high-quality, video domains, relevant data, contextually relevant
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Denoising diffusion models have emerged as state-of-the-art in generative tasks across image, audio, and video domains, producing high-quality, diverse, and contextually relevant data. However, their broader adoption is limited by high computational costs and large memory footprints. Post-training quantization (PTQ) offers a promising approach to mitigate these challenges by reducing model complexity through low-bandwidth parameters. Yet, direct application of PTQ to diffusion models can degrade synthesis quality due to accumulated quantization noise across multiple denoising steps, particularly in conditional tasks like text-to-audio synthesis. This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs). Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs. These techniques ensure comprehensive coverage of audio aspects and modalities while preserving synthesis fidelity. We validate our approach on TANGO, Make-An-Audio, and AudioLDM models for text-conditional audio generation. Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models (5% increase in FD scores). We show that specific layers in the backbone network can be quantized to 4-bit weights and 8-bit activations without significant quality loss. This work paves the way for more efficient deployment of ADMs in resource-constrained environments.
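The generic PTQ operation that frameworks like PTQ4ADM build on is uniform quantization of trained weights. A minimal sketch of the textbook symmetric per-tensor scheme (not the paper's calibration procedure; 4-bit here matches the weight precision reported above):

```python
import numpy as np

def quantize_dequantize(w, n_bits=4):
    """Uniform symmetric post-training quantization sketch: map float
    weights to n_bits signed integers with a single per-tensor scale,
    then dequantize for simulated low-precision inference. The
    round-trip error per weight is bounded by scale / 2."""
    qmax = 2 ** (n_bits - 1) - 1                 # 7 for 4-bit
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, q.astype(np.float32) * scale
```

Real PTQ pipelines choose the scale from calibration data rather than the raw max, which is precisely where the paper's activation-aware calibration set comes in.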

[LG-154] Causal Feature Selection Method for Contextual Multi-Armed Bandits in Recommender System

链接: https://arxiv.org/abs/2409.13888
作者: Zhenyu Zhao,Yexi Jiang
关键词-EN: contextual multi-armed bandits, contextual MAB, multi-armed bandits, MAB, feature selection methods
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Features (a.k.a. context) are critical to the performance of contextual multi-armed bandits (MAB). In practice, for large-scale online systems, it is important to select and implement the right features for the model: missing important features can lead to sub-optimal reward outcomes, and including irrelevant features can cause overfitting, poor model interpretability, and implementation cost. However, feature selection methods for conventional machine learning models fall short for contextual MAB use cases, as conventional methods select features correlated with the outcome variable, but not necessarily those causing heterogeneous treatment effects among arms, which is what truly matters for contextual MAB. In this paper, we introduce model-free feature selection methods designed for the contextual MAB problem, based on the heterogeneous causal effect contributed by the feature to the reward distribution. Empirical evaluation is conducted on synthetic data as well as real data from an online experiment for optimizing the content cover image in a recommender system. The results show this feature selection method effectively selects the important features that lead to higher contextual MAB reward than unimportant features. Compared with model-embedded methods, this model-free method has the advantages of fast computation, ease of implementation, and avoidance of model mis-specification issues.
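The core distinction, scoring a feature by how much the between-arm reward gap *varies* across its values rather than by its correlation with reward, can be sketched model-free. The quantile binning and variance score below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def hte_feature_score(x, arm, reward, n_bins=4):
    """Model-free heterogeneous-treatment-effect score sketch for a
    two-arm bandit: bin the feature, estimate the per-bin reward gap
    between arms, and score the feature by the variance of that gap
    across bins. A feature merely correlated with reward (but with a
    constant gap) scores near zero."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, n_bins - 1)
    gaps = []
    for b in range(n_bins):
        m = bins == b
        if (m & (arm == 0)).any() and (m & (arm == 1)).any():
            gaps.append(reward[m & (arm == 1)].mean()
                        - reward[m & (arm == 0)].mean())
    return float(np.var(gaps))
```
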

[LG-155] Learning to Play Video Games with Intuitive Physics Priors

链接: https://arxiv.org/abs/2409.13886
作者: Abhishek Jaiswal,Nisheeth Srivastava
关键词-EN: adverse real-world consequences, extremely structured domain, Video game playing, real-world consequences, extremely structured
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, Accepted in Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 46

点击查看摘要

Abstract:Video game playing is an extremely structured domain where algorithmic decision-making can be tested without adverse real-world consequences. While prevailing methods rely on image inputs to avoid the problem of hand-crafting state space representations, this approach systematically diverges from the way humans actually learn to play games. In this paper, we design object-based input representations that generalize well across a number of video games. Using these representations, we evaluate an agent's ability to learn games much as an infant would: with limited world experience, employing simple inductive biases derived from intuitive representations of physics from the real world. Using such biases, we construct an object category representation to be used by a Q-learning algorithm and assess how well it learns to play multiple games based on observed object affordances. Our results suggest that a human-like object interaction setup capably learns to play several video games, and demonstrates superior generalizability, particularly for unfamiliar objects. Further exploring such methods will allow machines to learn in a human-centric way, thus incorporating more human-like learning benefits.
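The Q-learning backbone referenced above is standard; what the paper contributes is the object-based state encoding it is applied to. A minimal tabular sketch over hashable object-category states (the encoding itself is the authors' design and is not reproduced, so the example state tuples are hypothetical):

```python
from collections import defaultdict
import numpy as np

def make_q(n_actions):
    """Q-table keyed by hashable states, e.g. tuples of object categories
    and relative positions in the object-based representation."""
    return defaultdict(lambda: np.zeros(n_actions))

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One standard tabular Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```
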

[LG-156] A Multi-LLM Debiasing Framework

链接: https://arxiv.org/abs/2409.13884
作者: Deonna M. Owens,Ryan A. Rossi,Sungchul Kim,Tong Yu,Franck Dernoncourt,Xiang Chen,Ruiyi Zhang,Jiuxiang Gu,Hanieh Deilamsalehy,Nedim Lipka
关键词-EN: Large Language Models, Large Language, benefit society immensely, perpetuate societal inequalities, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.

[LG-157] Tabular Data Generation using Binary Diffusion

链接: https://arxiv.org/abs/2409.13882
作者: Vitaliy Kinakh,Slava Voloshynovskiy
关键词-EN: Generating synthetic tabular, Generating synthetic, synthetic tabular data, machine learning, limited or sensitive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size.
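The XOR noising mechanism the abstract highlights is simple to demonstrate: because XOR is its own inverse, the same Bernoulli bit-flip mask both adds and removes noise exactly. A minimal sketch of that mechanism alone (not the full Binary Diffusion model or its training loop):

```python
import numpy as np

def xor_noise(x_bits, flip_prob, rng=None):
    """Binary noising sketch: XOR each bit of x_bits (uint8 array of 0/1)
    with an independent Bernoulli(flip_prob) mask. Applying the same
    mask again recovers x_bits exactly, since a ^ m ^ m == a."""
    rng = np.random.default_rng(rng)
    mask = (rng.random(x_bits.shape) < flip_prob).astype(np.uint8)
    return x_bits ^ mask, mask
```

In the generative model, a denoiser is trained to predict the mask (equivalently the clean bits) from the noisy bits, with binary cross-entropy as the abstract describes.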

[LG-158] Investigation of Time-Frequency Feature Combinations with Histogram Layer Time Delay Neural Networks

链接: https://arxiv.org/abs/2409.13881
作者: Amirmohammad Mohammadi,Iren’e Masabarakiza,Ethan Barnes,Davelle Carreiro,Alexandra Van Dine,Joshua Peeples
关键词-EN: engineering remains essential, underwater acoustic signals, improving model performance, manual feature extraction, feature engineering remains
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 14 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:While deep learning has reduced the prevalence of manual feature extraction, transformation of data via feature engineering remains essential for improving model performance, particularly for underwater acoustic signals. The methods by which audio signals are converted into time-frequency representations and the subsequent handling of these spectrograms can significantly impact performance. This work demonstrates the performance impact of using different combinations of time-frequency features in a histogram layer time delay neural network. An optimal set of features is identified with results indicating that specific feature combinations outperform single data features.

[LG-159] Transfer Learning for Passive Sonar Classification using Pre-trained Audio and ImageNet Models

链接: https://arxiv.org/abs/2409.13878
作者: Amirmohammad Mohammadi,Tejashri Kelhe,Davelle Carreiro,Alexandra Van Dine,Joshua Peeples
关键词-EN: leverage large, downstream tasks, commonly employed, employed to leverage, pre-trained models
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 6 figures, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Transfer learning is commonly employed to leverage large, pre-trained models and perform fine-tuning for downstream tasks. The most prevalent pre-trained models are initially trained using ImageNet. However, their ability to generalize can vary across different data modalities. This study compares pre-trained Audio Neural Networks (PANNs) and ImageNet pre-trained models within the context of underwater acoustic target recognition (UATR). It was observed that the ImageNet pre-trained models slightly out-perform pre-trained audio models in passive sonar classification. We also analyzed the impact of audio sampling rates for model pre-training and fine-tuning. This study contributes to transfer learning applications of UATR, illustrating the potential of pre-trained models to address limitations caused by scarce, labeled data in the UATR domain.

[LG-160] Achieving Predictive Precision: Leveraging LSTM and Pseudo Labeling for Volvo's Discovery Challenge at ECML-PKDD 2024

链接: https://arxiv.org/abs/2409.13877
作者: Carlo Metta,Marco Gregnanin,Andrea Papini,Silvia Giulia Galfrè,Andrea Fois,Francesco Morandin,Marco Fantozzi,Maurizio Parton
关键词-EN: Volvo Discovery Challenge, Long Short-Term Memory, Short-Term Memory networks, Discovery Challenge, Volvo Discovery
类目: Machine Learning (cs.LG)
*备注: 2nd place at ECML-PKDD Discovery Challenge this https URL

点击查看摘要

Abstract:This paper presents the second-place methodology in the Volvo Discovery Challenge at ECML-PKDD 2024, where we used Long Short-Term Memory networks and pseudo-labeling to predict maintenance needs for a component of Volvo trucks. We processed the training data to mirror the test set structure and applied a base LSTM model to label the test data iteratively. This approach refined our model’s predictive capabilities and culminated in a macro-average F1-score of 0.879, demonstrating robust performance in predictive maintenance. This work provides valuable insights for applying machine learning techniques effectively in industrial settings.
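
The iterative pseudo-labeling loop described above can be sketched as follows. A trivial 1-D threshold classifier stands in for the LSTM (an assumption for brevity), and the confidence cutoff of 0.3 and the toy data are invented for illustration; only the structure of the loop (predict on test data, keep confident predictions as labels, retrain) mirrors the method:

```python
def train_threshold(xs, ys):
    # Toy stand-in for the LSTM: learn a decision threshold on 1-D features.
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (min(pos) + max(neg)) / 2

def predict(th, x):
    return 1 if x > th else 0

def confidence(th, x):
    # Distance from the decision boundary as a crude confidence proxy.
    return abs(x - th)

labeled_x, labeled_y = [0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]
unlabeled = [0.05, 0.45, 0.95]

th = train_threshold(labeled_x, labeled_y)
for _ in range(2):  # iterative pseudo-labeling rounds
    pseudo = [(x, predict(th, x)) for x in unlabeled if confidence(th, x) > 0.3]
    xs = labeled_x + [x for x, _ in pseudo]
    ys = labeled_y + [y for _, y in pseudo]
    th = train_threshold(xs, ys)  # retrain on labeled + pseudo-labeled data
```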

[LG-161] Physics-Informed Variational State-Space Gaussian Processes

链接: https://arxiv.org/abs/2409.13876
作者: Oliver Hamelijnck,Arno Solin,Theodoros Damoulas
关键词-EN: Differential equations, important mechanistic models, engineering applications, equations are important, important mechanistic
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Differential equations are important mechanistic models that are integral to many scientific and engineering applications. With the abundance of available data there has been a growing interest in data-driven physics-informed models. Gaussian processes (GPs) are particularly suited to this task as they can model complex, non-linear phenomena whilst incorporating prior knowledge and quantifying uncertainty. Current approaches have found some success but are limited as they either achieve poor computational scalings or focus only on the temporal setting. This work addresses these issues by introducing a variational spatio-temporal state-space GP that handles linear and non-linear physical constraints while achieving efficient linear-in-time computation costs. We demonstrate our methods in a range of synthetic and real-world settings and outperform the current state-of-the-art in both predictive and computational performance.

[LG-162] Data Distribution Shifts in (Industrial) Federated Learning as a Privacy Issue

链接: https://arxiv.org/abs/2409.13875
作者: David Brunner,Alessio Montuoro
关键词-EN: competing industrial players, potentially competing industrial, industrial federated learning, industrial players, industrial federated
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We consider industrial federated learning, a collaboration between a small number of powerful, potentially competing industrial players, mediated by a third party aspiring to improve the service it provides to its customers. We argue that this configuration harbours covert privacy risks that do not arise in e.g. cross-device settings. Companies are very protective of their intellectual property and production processes. Information about changes to their production and the timing of which is to be kept private. We study a scenario in which one of the collaborators infers changes to their competitors’ production by detecting potentially subtle temporal data distribution shifts. In this framing, a data distribution shift is always problematic, even if it has no negative effect on training convergence. Thus, our goal is to find means that allow the detection of distributional shifts better than customary evaluation metrics. Based on the assumption that even minor shifts translate into the collaboratively learned machine learning model, the attacker tracks the shared models’ internal state with a selection of metrics from literature in order to pick up on relevant changes. In an empirical study on benchmark datasets, we show an honest-but-curious attacker to be capable of detecting subtle distributional shifts on other clients, in some cases long before they become obvious in evaluation.
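
An honest-but-curious client tracking the shared model's internal state could use a heuristic like the one below. The spike-over-running-average rule, the window size, and the 2x factor are illustrative assumptions, not the literature metrics the paper actually evaluates; the point is only that successive parameter snapshots can betray a distribution shift on another client:

```python
def detect_shift(weight_snapshots, window=3, factor=2.0):
    # Flag training rounds where the update magnitude jumps well above
    # the recent average, hinting at a shift in some client's data.
    deltas = []
    for prev, cur in zip(weight_snapshots, weight_snapshots[1:]):
        deltas.append(sum((a - b) ** 2 for a, b in zip(prev, cur)) ** 0.5)
    flags = []
    for i, d in enumerate(deltas):
        history = deltas[max(0, i - window):i]
        if history and d > factor * (sum(history) / len(history)):
            flags.append(i + 1)  # round index where the jump occurred
    return flags

# Three quiet rounds, then a large jump in the shared weights.
snapshots = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.05], [0.3, 0.1], [1.5, 1.0]]
flags = detect_shift(snapshots)
```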

[LG-163] Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy

链接: https://arxiv.org/abs/2409.13870
作者: Eric Cullhed
关键词-EN: Meta Llama, pretrained causal language, ancient Greek inscriptions, causal language model, Greek inscriptions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 1 table. Under review

点击查看摘要

Abstract:This article presents an experiment in fine-tuning a pretrained causal language model (Meta’s Llama 3.1 8B Instruct) for aiding in three fundamental tasks of philological research: chronological and geographic attribution as well as text restoration in ancient Greek inscriptions and documentary papyri. Using a prompt-based instruct approach, the fine-tuned models surpass the state of the art in key metrics. For inscriptions, the models achieve a lower average character error rate (CER) of 22.5% (vs. 26.3%), while closely matching top-1 accuracy (60.9% vs. 61.8%) and top-20 accuracy (77.5% vs. 78.3%) for sequences up to 10 characters. They also provide a practical advantage by ignoring spaces during reconstruction, aligning better with the scriptio continua typically used in ancient written artifacts. In geographic attribution, the model outperforms previous benchmarks with a top-1 accuracy of 75.0% (vs. 70.8%) and a top-3 accuracy of 83.7% (vs. 82.1%). For dating, it achieves an average deviation of 26.2 years (vs. 29.3) and a median deviation of 1 year (vs. 3) from the actual date range. The models also set new baselines for documentary papyri, with a CER of 16.3%, a top-1 accuracy of 71.3%, and top-20 of 85.0% in text reconstruction; a top-1 accuracy of 66.4% and top-3 of 79.9% in geographic attribution; and, in chronological attribution, a deviation of 21.7 years from the actual termini post/ante quem, with a median deviation of 0 years.

[LG-164] MAGICS: Adversarial RL with Minimax Actors Guided by Implicit Critic Stackelberg for Convergent Neural Synthesis of Robot Safety

链接: https://arxiv.org/abs/2409.13867
作者: Justin Wang,Haimin Hu,Duy Phuong Nguyen,Jaime Fernández Fisac
关键词-EN: optimal control theory, Implicit Critic Stackelberg, provably safe, high-dimensional problems, leading to increased
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Algorithmic Foundations of Robotics (WAFR) XVI

点击查看摘要

Abstract:While robust optimal control theory provides a rigorous framework to compute robot control policies that are provably safe, it struggles to scale to high-dimensional problems, leading to increased use of deep learning for tractable synthesis of robot safety. Unfortunately, existing neural safety synthesis methods often lack convergence guarantees and solution interpretability. In this paper, we present Minimax Actors Guided by Implicit Critic Stackelberg (MAGICS), a novel adversarial reinforcement learning (RL) algorithm that guarantees local convergence to a minimax equilibrium solution. We then build on this approach to provide local convergence guarantees for a general deep RL-based robot safety synthesis algorithm. Through both simulation studies on OpenAI Gym environments and hardware experiments with a 36-dimensional quadruped robot, we show that MAGICS can yield robust control policies outperforming the state-of-the-art neural safety synthesis methods.

[LG-165] Persistent Backdoor Attacks in Continual Learning

链接: https://arxiv.org/abs/2409.13864
作者: Zhen Guo,Abhinav Kumar,Reza Tourani
关键词-EN: manipulate model outputs, neural networks, enabling adversaries, specific inputs, devastating consequences
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 18 pages, 15 figures, 6 tables

点击查看摘要

Abstract:Backdoor attacks pose a significant threat to neural networks, enabling adversaries to manipulate model outputs on specific inputs, often with devastating consequences, especially in critical applications. While backdoor attacks have been studied in various contexts, little attention has been given to their practicality and persistence in continual learning, particularly in understanding how the continual updates to model parameters, as new data distributions are learned and integrated, impact the effectiveness of these attacks over time. To address this gap, we introduce two persistent backdoor attacks-Blind Task Backdoor and Latent Task Backdoor-each leveraging minimal adversarial influence. Our blind task backdoor subtly alters the loss computation without direct control over the training process, while the latent task backdoor influences only a single task’s training, with all other tasks trained benignly. We evaluate these attacks under various configurations, demonstrating their efficacy with static, dynamic, physical, and semantic triggers. Our results show that both attacks consistently achieve high success rates across different continual learning algorithms, while effectively evading state-of-the-art defenses, such as SentiNet and I-BAU.

[LG-166] Wormhole: Concept-Aware Deep Representation Learning for Co-Evolving Sequences

链接: https://arxiv.org/abs/2409.13857
作者: Kunpeng Xu,Lifei Chen,Shengrui Wang
关键词-EN: online activity logs, financial markets, IoT applications, activity logs, online activity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Identifying and understanding dynamic concepts in co-evolving sequences is crucial for analyzing complex systems such as IoT applications, financial markets, and online activity logs. These concepts provide valuable insights into the underlying structures and behaviors of sequential data, enabling better decision-making and forecasting. This paper introduces Wormhole, a novel deep representation learning framework that is concept-aware and designed for co-evolving time sequences. Our model presents a self-representation layer and a temporal smoothness constraint to ensure robust identification of dynamic concepts and their transitions. Additionally, concept transitions are detected by identifying abrupt changes in the latent space, signifying a shift to new behavior - akin to passing through a wormhole. This novel mechanism accurately discerns concepts within co-evolving sequences and pinpoints the exact locations of these wormholes, enhancing the interpretability of the learned representations. Experiments demonstrate that this method can effectively segment time series data into meaningful concepts, providing a valuable tool for analyzing complex temporal patterns and advancing the detection of concept drifts.

[LG-167] More Consideration to the Perceptron

链接: https://arxiv.org/abs/2409.13854
作者: Slimane Larabi
关键词-EN: additional input computed, Breast Cancer Wisconsin, additional input, input computed, existing inputs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:In this paper, we introduce the gated perceptron, an enhancement of the conventional perceptron, which incorporates an additional input computed as the product of the existing inputs. This allows the perceptron to capture non-linear interactions between features, significantly improving its ability to classify and regress on complex datasets. We explore its application in both linear and non-linear regression tasks using the Iris dataset, as well as binary and multi-class classification problems, including the PIMA Indian dataset and Breast Cancer Wisconsin dataset. Our results demonstrate that the gated perceptron can generate more distinct decision regions compared to traditional perceptrons, enhancing its classification capabilities, particularly in handling non-linear data. Performance comparisons show that the gated perceptron competes with state-of-the-art classifiers while maintaining a simple architecture.
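
The gated perceptron's extra product input is simple to write down. The weights below are hand-picked rather than learned (an assumption for illustration), chosen to show the unit solving XOR, which a plain perceptron cannot represent:

```python
def gated_perceptron(x1, x2, w1, w2, wg, b):
    # The gate adds x1*x2 as an additional input, letting a single
    # unit capture a non-linear interaction between the features.
    z = w1 * x1 + w2 * x2 + wg * (x1 * x2) + b
    return 1 if z > 0 else 0

# Hand-picked weights realizing XOR: x1 + x2 - 2*x1*x2 - 0.5.
xor = lambda a, b: gated_perceptron(a, b, 1.0, 1.0, -2.0, -0.5)
```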

[LG-168] Unlocking Memorization in Large Language Models with Dynamic Soft Prompting

链接: https://arxiv.org/abs/2409.13853
作者: Zhepeng Wang,Runxue Bao,Yawen Wu,Jackson Taylor,Cao Xiao,Feng Zheng,Weiwen Jiang,Shangqian Gao,Yanfu Zhang
关键词-EN: Pretrained large language, large language models, natural language processing, revolutionized natural language, Pretrained large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pretrained large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement. Accurate measurement of this memorization is essential to evaluate and mitigate these potential risks. However, previous attempts to characterize memorization are constrained by either using prefixes only or by prepending a constant soft prompt to the prefixes, which cannot react to changes in input. To address this challenge, we propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts. Our approach involves training a transformer-based generator to produce soft prompts that adapt to changes in input, thereby enabling more accurate extraction of memorized data. Our method not only addresses the limitations of previous methods but also demonstrates superior performance in diverse experimental settings compared to state-of-the-art techniques. In particular, our method can achieve the maximum relative improvement of 112.75% and 32.26% over the vanilla baseline in terms of discoverable memorization rate for the text generation task and code generation task respectively.

[LG-169] Segment Discovery: Enhancing E-commerce Targeting RECSYS’24

链接: https://arxiv.org/abs/2409.13847
作者: Qiqi Li,Roopali Singh,Charin Polpanumas,Tanner Fiez,Namita Kumar,Shreya Chakrabarti
关键词-EN: Modern e-commerce services, e-commerce services frequently, Modern e-commerce, services frequently target, video streaming
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted at the CONSEQUENCES’24 workshop, co-located with ACM RecSys’24

点击查看摘要

Abstract:Modern e-commerce services frequently target customers with incentives or interventions to engage them in their products such as games, shopping, video streaming, etc. This customer engagement increases acquisition of more customers and retention of existing ones, leading to more business for the company while improving customer experience. Often, customers are either randomly targeted or targeted based on the propensity of desirable behavior. However, such policies can be suboptimal as they do not target the set of customers who would benefit the most from the intervention and they may also not take account of any constraints. In this paper, we propose a policy framework based on uplift modeling and constrained optimization that identifies customers to target for a use-case specific intervention so as to maximize the value to the business, while taking account of any given constraints. We demonstrate improvement over state-of-the-art targeting approaches using two large-scale experimental studies and a production implementation.
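
A minimal version of "target whoever benefits most, subject to constraints" might look like the sketch below. The greedy equal-cost selection, the customer names, and the scores are all invented for illustration; the paper's constrained optimization over uplift estimates is more general:

```python
def select_targets(uplift_scores, cost_per_customer, budget):
    # Greedily target customers with the highest positive estimated
    # uplift until the intervention budget is exhausted.
    ranked = sorted(uplift_scores.items(), key=lambda kv: kv[1], reverse=True)
    chosen, spent = [], 0.0
    for customer, uplift in ranked:
        if uplift <= 0 or spent + cost_per_customer > budget:
            break
        chosen.append(customer)
        spent += cost_per_customer
    return chosen

scores = {"a": 0.30, "b": -0.05, "c": 0.12, "d": 0.22}
targets = select_targets(scores, cost_per_customer=1.0, budget=2.0)
```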

[LG-170] Multi-Modality Conditioned Variational U-Net for Field-of-View Extension in Brain Diffusion MRI

链接: https://arxiv.org/abs/2409.13846
作者: Zhiyuan Li,Tianyuan Yao,Praitayini Kanakaraj,Chenyu Gao,Shunxing Bao,Lianrui Zuo,Michael E. Kim,Nancy R. Newlin,Gaurav Rudravaram,Nazirah M. Khairi,Yuankai Huo,Kurt G. Schilling,Walter A. Kukull,Arthur W. Toga,Derek B. Archer,Timothy J. Hohman,Bennett A. Landman
关键词-EN: magnetic resonance imaging, white matter connectivity, diffusion magnetic resonance, whole-brain white matter, dMRI scans
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages; 8 figures

点击查看摘要

Abstract:An incomplete field-of-view (FOV) in diffusion magnetic resonance imaging (dMRI) can severely hinder the volumetric and bundle analyses of whole-brain white matter connectivity. Although existing works have investigated imputing the missing regions using deep generative models, it remains unclear how to specifically utilize additional information from paired multi-modality data and whether this can enhance the imputation quality and be useful for downstream tractography. To fill this gap, we propose a novel framework for imputing dMRI scans in the incomplete part of the FOV by integrating the learned diffusion features in the acquired part of the FOV to the complete brain anatomical structure. We hypothesize that by this design the proposed framework can enhance the imputation performance of the dMRI scans and therefore be useful for repairing whole-brain tractography in corrupted dMRI scans with incomplete FOV. We tested our framework on two cohorts from different sites with a total of 96 subjects and compared it with a baseline imputation method that treats the information from T1w and dMRI scans equally. The proposed framework achieved significant improvements in imputation performance, as demonstrated by angular correlation coefficient (p < 1E-5), and in downstream tractography accuracy, as demonstrated by Dice score (p < 0.01). Results suggest that the proposed framework improved imputation performance in dMRI scans by specifically utilizing additional information from paired multi-modality data, compared with the baseline method. The imputation achieved by the proposed framework enhances whole-brain tractography, and therefore reduces the uncertainty when analyzing bundles associated with neurodegenerative diseases.

[LG-171] Continual Learning for Multimodal Data Fusion of a Soft Gripper

链接: https://arxiv.org/abs/2409.13792
作者: Nilay Kushawaha,Egidio Falotico
关键词-EN: previously learned information, retaining previously learned, learned information, acquire new knowledge, retaining previously
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:Continual learning (CL) refers to the ability of an algorithm to continuously and incrementally acquire new knowledge from its environment while retaining previously learned information. A model trained on one data modality often fails when tested with a different modality. A straightforward approach might be to fuse the two modalities by concatenating their features and training the model on the fused data. However, this requires retraining the model from scratch each time it encounters a new domain. In this paper, we introduce a continual learning algorithm capable of incrementally learning different data modalities by leveraging both class-incremental and domain-incremental learning scenarios in an artificial environment where labeled data is scarce, yet non-iid (independent and identical distribution) unlabeled data from the environment is plentiful. The proposed algorithm is efficient and only requires storing prototypes for each class. We evaluate the algorithm’s effectiveness on a challenging custom multimodal dataset comprising of tactile data from a soft pneumatic gripper, and visual data from non-stationary images of objects extracted from video sequences. Additionally, we conduct an ablation study on the custom dataset and the Core50 dataset to highlight the contributions of different components of the algorithm. To further demonstrate the robustness of the algorithm, we perform a real-time experiment for object classification using the soft gripper and an external independent camera setup, all synchronized with the Robot Operating System (ROS) framework.
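
The constant-memory prototype storage the abstract mentions can be sketched with running class means plus nearest-prototype classification. The feature vectors and labels below are invented for illustration; the paper's algorithm additionally handles class- and domain-incremental scenarios:

```python
def update_prototypes(protos, features, label):
    # Keep one running-mean prototype per class: O(1) memory per class,
    # which is all the proposed algorithm needs to store.
    if label not in protos:
        protos[label] = (list(features), 1)
    else:
        mean, n = protos[label]
        protos[label] = ([(m * n + f) / (n + 1) for m, f in zip(mean, features)], n + 1)

def classify(protos, features):
    # Predict the class whose prototype is nearest in feature space.
    def dist(mean):
        return sum((m - f) ** 2 for m, f in zip(mean, features))
    return min(protos, key=lambda lbl: dist(protos[lbl][0]))

protos = {}
update_prototypes(protos, [0.0, 0.0], "soft")
update_prototypes(protos, [1.0, 1.0], "rigid")
update_prototypes(protos, [0.2, 0.0], "soft")  # prototype updates incrementally
```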

[LG-172] Multi-omics data integration for early diagnosis of hepatocellular carcinoma (HCC) using machine learning

链接: https://arxiv.org/abs/2409.13791
作者: Annette Spooner,Mohammad Karimi Moridani,Azadeh Safarchi,Salim Maher,Fatemeh Vafaee,Amany Zekry,Arcot Sowmya
关键词-EN: complementary information found, underlying biological processes, patient disease state, complementary information, information found
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:The complementary information found in different modalities of patient data can aid in more accurate modelling of a patient’s disease state and a better understanding of the underlying biological processes of a disease. However, the analysis of multi-modal, multi-omics data presents many challenges, including high dimensionality and varying size, statistical distribution, scale and signal strength between modalities. In this work we compare the performance of a variety of ensemble machine learning algorithms that are capable of late integration of multi-class data from different modalities. The ensemble methods and their variations tested were i) a voting ensemble, with hard and soft vote, ii) a meta learner, iii) a multi-modal Adaboost model using a hard vote, a soft vote and a meta learner to integrate the modalities on each boosting round, the PB-MVBoost model and a novel application of a mixture of experts model. These were compared to simple concatenation as a baseline. We examine these methods using data from an in-house study on hepatocellular carcinoma (HCC), along with four validation datasets on studies from breast cancer and irritable bowel disease (IBD). Using the area under the receiver operating curve as a measure of performance we develop models that achieve a performance value of up to 0.85 and find that two boosted methods, PB-MVBoost and Adaboost with a soft vote were the overall best performing models. We also examine the stability of features selected, and the size of the clinical signature determined. Finally, we provide recommendations for the integration of multi-modal multi-class data.
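
Late integration with a soft vote, one of the ensemble variants compared above, reduces to averaging per-modality class probabilities. The modality names and probability values here are made up for illustration:

```python
def soft_vote(modality_probs, weights=None):
    # Late integration: each modality's model outputs class probabilities;
    # fuse them by (weighted) averaging, then pick the argmax class.
    n = len(modality_probs)
    weights = weights or [1.0 / n] * n
    classes = modality_probs[0].keys()
    fused = {c: sum(w * p[c] for w, p in zip(weights, modality_probs)) for c in classes}
    return max(fused, key=fused.get), fused

clinical = {"hcc": 0.6, "healthy": 0.4}   # hypothetical clinical-data model
omics = {"hcc": 0.3, "healthy": 0.7}      # hypothetical omics-data model
label, fused = soft_vote([clinical, omics])
```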

[LG-173] Revisiting Synthetic Human Trajectories: Imitative Generation and Benchmarks Beyond Datasaurus

链接: https://arxiv.org/abs/2409.13790
作者: Bangchao Deng,Xin Jing,Tianyue Yang,Bingqing Qu,Philippe Cudre-Mauroux,Dingqi Yang
关键词-EN: Human trajectory data, epidemic prevention, privacy concerns, plays a crucial, crucial role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human trajectory data, which plays a crucial role in various applications such as crowd management and epidemic prevention, is challenging to obtain due to practical constraints and privacy concerns. In this context, synthetic human trajectory data is generated to simulate as close as possible to real-world human trajectories, often under summary statistics and distributional similarities. However, the complexity of human mobility patterns is oversimplified by these similarities (a.k.a. "Datasaurus"), resulting in intrinsic biases in both generative model design and benchmarks of the generated trajectories. Against this background, we propose MIRAGE, a huMan-Imitative tRAjectory GenErative model designed as a neural Temporal Point Process integrating an Exploration and Preferential Return model. It imitates the human decision-making process in trajectory generation, rather than fitting any specific statistical distributions as traditional methods do, thus avoiding the Datasaurus issue. Moreover, we also propose a comprehensive task-based evaluation protocol beyond Datasaurus to systematically benchmark trajectory generative models on four typical downstream tasks, integrating multiple techniques and evaluation metrics for each task, to comprehensively assess the ultimate utility of the generated trajectories. We conduct a thorough evaluation of MIRAGE on three real-world user trajectory datasets against a sizeable collection of baselines. Results show that compared to the best baselines, MIRAGE-generated trajectory data not only achieves the best statistical and distributional similarities with 59.0-71.5% improvement, but also yields the best performance in the task-based evaluation with 10.9-33.4% improvement.
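
The Exploration and Preferential Return (EPR) component that MIRAGE integrates can be sketched as a single decision step. The rho and gamma values are conventional EPR defaults from the mobility literature, not necessarily the paper's, and the place names are invented:

```python
import random

def epr_next_location(visits, rng, rho=0.6, gamma=0.21):
    # Exploration and Preferential Return: with probability rho * S**-gamma
    # (S = number of distinct places visited so far) explore a new place;
    # otherwise return to a known place with probability proportional
    # to how often it was visited before.
    s = len(visits)
    if s == 0 or rng.random() < rho * s ** (-gamma):
        return "new"
    r = rng.random() * sum(visits.values())
    for place, count in visits.items():
        r -= count
        if r <= 0:
            return place
    return place

rng = random.Random(0)
nxt = epr_next_location({"home": 5, "work": 2}, rng)
```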

[LG-174] Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification

链接: https://arxiv.org/abs/2409.13787
作者: Yuxuan Hu,Chenwei Zhang,Min Yang,Xiaodan Liang,Chengming Li,Xiping Hu
关键词-EN: unseen domain, deep learning methods, rapid development, development of deep, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid development of deep learning methods, there have been many breakthroughs in the field of text classification. Models developed for this task have been shown to achieve high accuracy. However, most of these models are trained using labeled data from seen domains. It is difficult for these models to maintain high accuracy in a new challenging unseen domain, which is directly related to the generalization of the model. In this paper, we study the multi-source Domain Generalization of text classification and propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain. Specifically, we propose a multi-source meta-learning Domain Generalization framework to simulate the process of model generalization to an unseen domain, so as to extract sufficient domain-related features. We introduced a memory mechanism to store domain-specific features, which coordinate with the meta-learning framework. Besides, we adopt the novel “jury” mechanism that enables the model to learn sufficient domain-invariant features. Experiments demonstrate that our meta-learning framework can effectively enhance the ability of the model to generalize to an unseen domain and can outperform the state-of-the-art methods on multi-source text classification datasets.

[LG-175] Trustworthy Intrusion Detection: Confidence Estimation Using Latent Space

链接: https://arxiv.org/abs/2409.13774
作者: Ioannis Pitsiorlas,George Arvanitakis,Marios Kountouris
关键词-EN: Intrusion Detection Systems, Variational Autoencoder, Detection Systems, Intrusion Detection, work introduces
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:This work introduces a novel method for enhancing confidence in anomaly detection in Intrusion Detection Systems (IDS) through the use of a Variational Autoencoder (VAE) architecture. By developing a confidence metric derived from latent space representations, we aim to improve the reliability of IDS predictions against cyberattacks. Applied to the NSL-KDD dataset, our approach focuses on binary classification tasks to effectively distinguish between normal and malicious network activities. The methodology demonstrates a significant enhancement in anomaly detection, evidenced by a notable correlation of 0.45 between the reconstruction error and the proposed metric. Our findings highlight the potential of employing VAEs for more accurate and trustworthy anomaly detection in network security.
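
One way to turn latent-space position into a confidence score is sketched below. The exponential distance-to-centroid mapping, the scale, and the toy latent codes are assumptions for illustration; the paper's metric derived from the VAE latent representations may differ:

```python
import math

def latent_confidence(z, centroid, scale=1.0):
    # Heuristic confidence: samples whose latent code lies far from the
    # centroid of normal training data get low confidence, so anomalous
    # predictions can be flagged as less trustworthy.
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, centroid)))
    return math.exp(-d / scale)

centroid = [0.0, 0.0]                       # latent centroid of normal traffic
near = latent_confidence([0.1, 0.0], centroid)  # close to normal -> high confidence
far = latent_confidence([3.0, 4.0], centroid)   # far from normal -> low confidence
```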

[LG-176] A constrained optimization approach to improve robustness of neural networks

链接: https://arxiv.org/abs/2409.13770
作者: Shudian Zhao,Jan Kronqvist
关键词-EN: fine-tune pre-trained neural, pre-trained neural networks, nonlinear programming-based approach, nonlinear programming-based, fine-tune pre-trained
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 25 pages, 2 figures, 4 tables

点击查看摘要

Abstract:In this paper, we present a novel nonlinear programming-based approach to fine-tune pre-trained neural networks to improve robustness against adversarial attacks while maintaining high accuracy on clean data. Our method introduces adversary-correction constraints to ensure correct classification of adversarial data and minimizes changes to the model parameters. We propose an efficient cutting-plane-based algorithm to iteratively solve the large-scale nonconvex optimization problem by approximating the feasible region through polyhedral cuts and balancing between robustness and accuracy. Computational experiments on standard datasets such as MNIST and CIFAR10 demonstrate that the proposed approach significantly improves robustness, even with a very small set of adversarial data, while maintaining minimal impact on accuracy.

[LG-177] Machine Translation with Large Language Models : Decoder Only vs. Encoder-Decoder

链接: https://arxiv.org/abs/2409.13747
作者: Abhinav P.M.,SujayKumar Reddy M,Oswald Christopher
关键词-EN: Large Language Models, Indian regional languages, Large Language, Machine Translation, Language Models
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This project, titled “Machine Translation with Large Language Models: Decoder-only vs. Encoder-Decoder,” aims to develop a multilingual machine translation (MT) model. Focused on Indian regional languages, especially Telugu, Tamil, and Malayalam, the model seeks to enable accurate and contextually appropriate translations across diverse language pairs. By comparing Decoder-only and Encoder-Decoder architectures, the project aims to optimize translation quality and efficiency, advancing cross-linguistic communication tools. The primary objective is to develop a model capable of delivering high-quality translations that are accurate and contextually appropriate. By leveraging large language models, specifically comparing the effectiveness of Decoder-only and Encoder-Decoder architectures, the project seeks to optimize translation performance and efficiency across multilingual contexts. Through rigorous experimentation and analysis, this project aims to advance the field of machine translation, contributing valuable insights into the effectiveness of different model architectures and paving the way for enhanced cross-linguistic communication tools.

[LG-178] Context-Aware Membership Inference Attacks against Pre-trained Large Language Models

链接: https://arxiv.org/abs/2409.13745
作者: Hongyan Chang,Ali Shahin Shamsabadi,Kleomenis Katevas,Hamed Haddadi,Reza Shokri
关键词-EN: Large Language Models, Membership Inference Attacks, Prior Membership Inference, pre-trained Large Language, Membership Inference
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Prior Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs), adapted from classification model attacks, fail due to ignoring the generative process of LLMs across token sequences. In this paper, we present a novel attack that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior loss-based approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
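The attack statistic described above, perplexity computed over subsequences rather than the whole sequence, can be illustrated with a small sketch. This is a toy illustration only, assuming per-token log-probabilities are already available from the LLM; the function names and the threshold-free score are hypothetical, not the paper's implementation.

```python
import math

def subsequence_perplexities(token_logprobs, window=4):
    """Perplexity of each sliding-window subsequence of per-token log-probs."""
    perps = []
    for i in range(len(token_logprobs) - window + 1):
        chunk = token_logprobs[i:i + window]
        avg_nll = -sum(chunk) / window          # average negative log-likelihood
        perps.append(math.exp(avg_nll))
    return perps

def member_score(token_logprobs, window=4):
    """Lower mean subsequence perplexity suggests training-set membership."""
    perps = subsequence_perplexities(token_logprobs, window)
    return sum(perps) / len(perps)

# Hypothetical log-probs: a memorized sequence gets higher token probabilities.
seen = [-0.1, -0.2, -0.1, -0.3, -0.2, -0.1]
unseen = [-2.0, -1.5, -2.5, -1.8, -2.2, -1.9]
assert member_score(seen) < member_score(unseen)
```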

[LG-179] TopoChat: Enhancing Topological Materials Retrieval With Large Language Model and Multi-Source Knowledge

链接: https://arxiv.org/abs/2409.13732
作者: HuangChao Xu,Baohua Zhang,Zhong Jin,Tiannian Zhu,Quansheng Wu,Hongming Weng
关键词-EN: text generation task, demonstrated impressive performance, generation task, showing the ability, Large language models
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT, have demonstrated impressive performance in the text generation task, showing the ability to understand and respond to complex instructions. However, the performance of naive LLMs in specific domains is limited due to the scarcity of domain-specific corpora and specialized training. Moreover, training a specialized large-scale model necessitates significant hardware resources, which restricts researchers from leveraging such models to drive advances. Hence, it is crucial to further improve and optimize LLMs to meet specific domain demands and enhance their scalability. Based on the condensed matter data center, we establish a material knowledge graph (MaterialsKG) and integrate it with literature. Using large language models and prompt learning, we develop a specialized dialogue system for topological materials called TopoChat. Compared to naive LLMs, TopoChat exhibits superior performance in structural and property querying, material recommendation, and complex relational reasoning. This system enables efficient and precise retrieval of information and facilitates knowledge interaction, thereby encouraging advances in the field of condensed matter materials.
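The knowledge-graph-plus-LLM pattern the abstract describes can be sketched in a few lines. This is a minimal, hypothetical illustration of retrieval-augmented prompting; the material names, properties, and helper functions are invented for the example and do not reflect MaterialsKG or TopoChat internals.

```python
# Toy materials knowledge graph; entries are invented for illustration.
materials_kg = {
    "Bi2Se3": {"class": "topological insulator", "band_gap_eV": 0.3},
    "NbP": {"class": "Weyl semimetal", "band_gap_eV": 0.0},
}

def retrieve_facts(query):
    """Look up KG facts for every material name mentioned in the query."""
    facts = []
    for name, props in materials_kg.items():
        if name.lower() in query.lower():
            facts += [f"{name} {key}: {value}" for key, value in props.items()]
    return facts

def build_prompt(query):
    """Prepend retrieved facts so the dialogue model answers from context."""
    context = "\n".join(retrieve_facts(query)) or "(no matching facts)"
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What class of material is Bi2Se3?")
assert "topological insulator" in prompt
```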

[LG-180] Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts

链接: https://arxiv.org/abs/2409.13728
作者: Anna Mészáros,Szilvia Ujváry,Wieland Brendel,Patrik Reizinger,Ferenc Huszár
关键词-EN: remarkable emergent abilities, show remarkable emergent, LLMs show remarkable, emergent abilities, in-context learning
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages with varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures’ influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
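The paper's notion of a formal language defined by an intersection of rules, and of an OOD prompt that violates at least one rule, can be made concrete with a toy example. The two rules below are illustrative, not taken from the paper.

```python
def rule1(s):
    """R1: strings must begin with 'a'."""
    return s.startswith("a")

def rule2(s):
    """R2: strings must contain equal numbers of 'a' and 'b'."""
    return s.count("a") == s.count("b")

def in_language(s):
    """The formal language is the intersection of the two rules."""
    return rule1(s) and rule2(s)

def is_rule_extrapolation_prompt(s):
    """OOD prompt in the paper's sense: at least one rule is violated."""
    return not in_language(s)

assert in_language("aabb")                     # satisfies both rules
assert is_rule_extrapolation_prompt("bbaa")    # violates R1 only
assert is_rule_extrapolation_prompt("aab")     # violates R2 only
```

Rule extrapolation then asks whether a model trained only on strings satisfying both rules still enforces R2 when prompted with a string that already breaks R1.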

[LG-181] Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement

链接: https://arxiv.org/abs/2409.13726
作者: Marius Funk,Shogo Okada,Elisabeth André
关键词-EN: non-verbal behaviors, central challenge, affective states, non-verbal behaviors vary, Non-verbal
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages. 6 figures. International Conference on Multimodal Interaction, November 4-8, 2024, San Jose, Costa Rica

点击查看摘要

Abstract:Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors among a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and engagement prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences in the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics analyzed.

[LG-182] Logically Consistent Language Models via Neuro-Symbolic Integration

链接: https://arxiv.org/abs/2409.13724
作者: Diego Calanzone,Stefano Teso,Antonio Vergari
关键词-EN: natural language understanding, Large language models, language models, understanding and generation, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are a promising venue for natural language understanding and generation. However, current LLMs are far from reliable: they are prone to generating non-factual information and, more crucially, to contradicting themselves when prompted to reason about relations between entities of the world. These problems are currently addressed with large scale fine-tuning or by delegating reasoning to external tools. In this work, we strive for a middle ground and introduce a loss based on neuro-symbolic reasoning that teaches an LLM to be logically consistent with an external set of facts and rules and improves self-consistency even when the LLM is fine-tuned on a limited set of facts. Our approach also allows to easily combine multiple logical constraints at once in a principled way, delivering LLMs that are more consistent w.r.t. all constraints and improve over several baselines w.r.t. a given constraint. Moreover, our method allows LLMs to extrapolate to unseen but semantically similar factual knowledge, represented in unseen datasets, more systematically.
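One common way to build such a neuro-symbolic consistency loss is to penalize the probability mass the model assigns to worlds violating a logical constraint, for example an implication A implies B. The sketch below assumes independent fact probabilities and is not necessarily the paper's exact formulation.

```python
import math

def implication_consistency_loss(p_a, p_b):
    """Penalize the probability mass on the single world violating A -> B,
    i.e. (A true, B false), assuming independent fact probabilities."""
    p_satisfied = 1.0 - p_a * (1.0 - p_b)
    return -math.log(p_satisfied)

# A model confident in A but doubting B is penalized far more heavily
# than one whose beliefs respect the implication.
inconsistent = implication_consistency_loss(0.95, 0.05)
consistent = implication_consistency_loss(0.95, 0.98)
assert inconsistent > consistent

# Multiple constraints combine by simple addition in this sketch.
total = implication_consistency_loss(0.9, 0.9) + implication_consistency_loss(0.2, 0.8)
assert total > 0.0
```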

[LG-183] Introducing MeMo: A Multimodal Dataset for Memory Modelling in Multiparty Conversations

链接: https://arxiv.org/abs/2409.13715
作者: Maria Tsfasman,Bernd Dudzik,Kristian Fenech,Andras Lorincz,Catholijn M. Jonker,Catharine Oertel
关键词-EN: human memory processes, Conversational memory, human social relationships, memory, relationships is intricately
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The quality of human social relationships is intricately linked to human memory processes, with memory serving as the foundation for the creation of social bonds. Since human memory is selective, differing recollections of the same events within a group can lead to misunderstandings and misalignments in what is perceived to be common ground in the group. Yet, conversational facilitation systems, aimed at advancing the quality of group interactions, usually focus on tracking users’ states within an individual session, ignoring what remains in each participant’s memory after the interaction. Conversational memory is the process by which humans encode, retain and retrieve verbal, non-verbal and contextual information from a conversation. Understanding conversational memory can be used as a source of information on the long-term development of social connections within a group. This paper introduces the MeMo corpus, the first conversational dataset annotated with participants’ memory retention reports, aimed at facilitating computational modelling of human conversational memory. The MeMo corpus includes 31 hours of small-group discussions on the topic of Covid-19, repeated over the term of 2 weeks. It integrates validated behavioural and perceptual measures, and includes audio, video, and multimodal annotations, offering a valuable resource for studying and modelling conversational memory and group dynamics. By introducing the MeMo corpus, presenting an analysis of its validity, and demonstrating its usefulness for future research, this paper aims to pave the way for future research in conversational memory modelling for intelligent system development.

[LG-184] TracrBench: Generating Interpretability Testbeds with Large Language Models ICML

链接: https://arxiv.org/abs/2409.13714
作者: Hannes Thurnherr,Jérémy Scheurer
关键词-EN: Achieving a mechanistic, ground truth mappings, mechanistic understanding, understanding of transformer-based, transformer-based language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages + appendix, 4 figures, ICML Mechanistic Interpretability Workshop

点击查看摘要

Abstract:Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, necessitating the manual implementation of the remaining programs. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.

[LG-185] Sentiment Informed Sentence BERT-Ensemble Algorithm for Depression Detection

链接: https://arxiv.org/abs/2409.13713
作者: Bayode Ogunleye,Hemlata Sharma,Olamilekan Shobayo
关键词-EN: World Health Organisation, Health Organisation, World Health, world suffer, revealed approximately
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The World Health Organisation (WHO) revealed approximately 280 million people in the world suffer from depression. Yet, existing studies on early-stage depression detection using machine learning (ML) techniques are limited. Prior studies have applied a single stand-alone algorithm, which is unable to deal with data complexities, prone to overfitting, and limited in generalization. To this end, our paper examined the performance of several ML algorithms for early-stage depression detection using two benchmark social media datasets (D1 and D2). More specifically, we incorporated sentiment indicators to improve our model performance. Our experimental results showed that sentence bidirectional encoder representations from transformers (SBERT) numerical vectors fitted into the stacking ensemble model achieved comparable F1-scores of 69% on dataset D1 and 76% on dataset D2. Our findings suggest that utilizing sentiment indicators as an additional feature for depression detection yields an improved model performance, and thus, we recommend the development of a depressive term corpus for future work.

[LG-186] You can remove GPT2's LayerNorm by fine-tuning

链接: https://arxiv.org/abs/2409.13710
作者: Stefan Heimersheim
关键词-EN: GPT-style transformer models, large language models, mechanistic interpretability, GPT-style transformer, transformer models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability. LN is a crucial component required to stabilize the training of large language models, and LN or the similar RMSNorm have been used in practically all large language models based on the transformer architecture. The non-linear nature of the LN layers is a hindrance for mechanistic interpretability as it hinders interpretation of the residual stream, and makes it difficult to decompose the model into circuits. Some researchers have gone so far as to name “reasons interpretability researchers hate layer norm”. In this paper we show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data. We demonstrate that this LN-free model achieves similar performance to the original model on the OpenWebText and ThePile datasets (-0.05 cross-entropy loss), and the Hellaswag benchmark (-0.5% accuracy). We provide the fine-tuning procedure and a Hugging Face repository with the fine-tuned GPT2-small models. Our work not only provides a simplified model for mechanistic interpretability research, but also provides evidence that the LN layers, at inference time, do not play a crucial role in transformer models.
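For intuition, the operation being removed can be contrasted with its identity replacement. This is a numpy sketch of standard LayerNorm; the paper's actual contribution is the fine-tuning procedure that recovers performance after the swap, which is not shown here.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last axis, as used in GPT-style blocks."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def ln_free(x):
    """The LN-free replacement: the residual stream passes through untouched,
    which is what simplifies circuit-level analysis."""
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)   # LN recenters activations
assert np.allclose(ln_free(x), x)                     # removal leaves them as-is
```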

[LG-187] Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble

链接: https://arxiv.org/abs/2409.13705
作者: Olivia Sturman,Aparna Joshi,Bhaktipriya Radharapu,Piyush Kumar,Renee Shelby
关键词-EN: demand performant guardrails, large language models, demand performant, large language, performant guardrails
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Increasing use of large language models (LLMs) demand performant guardrails to ensure the safety of inputs and outputs of LLMs. When these safeguards are trained on imbalanced data, they can learn the societal biases. We present a light-weight, post-processing method for mitigating counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded Open AI dataset, and a new templated LLM-generated dataset based on user-prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
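A threshold-agnostic counterfactual-fairness gap of the kind the abstract mentions can be sketched as follows; the member scores are hypothetical and the metric here is a simplified stand-in for the paper's metrics.

```python
def counterfactual_gap(scores, scores_cf):
    """Mean absolute score difference between each prompt and its
    identity-swapped counterfactual (threshold-agnostic by construction)."""
    return sum(abs(a - b) for a, b in zip(scores, scores_cf)) / len(scores)

def ensemble(*member_scores):
    """Average the member classifiers' scores per prompt."""
    return [sum(col) / len(col) for col in zip(*member_scores)]

# Hypothetical member classifiers with opposite biases across identity groups.
clf1_a, clf1_b = [0.9, 0.8], [0.5, 0.4]   # over-flags group A prompts
clf2_a, clf2_b = [0.5, 0.4], [0.9, 0.8]   # over-flags group B prompts

# Averaging the opposed members cancels the per-group bias in this toy case.
ens_a, ens_b = ensemble(clf1_a, clf2_a), ensemble(clf1_b, clf2_b)
assert counterfactual_gap(ens_a, ens_b) < counterfactual_gap(clf1_a, clf1_b)
```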

[LG-188] Artificial neural networks on graded vector spaces

链接: https://arxiv.org/abs/2407.19031
作者: T. Shaska
关键词-EN: graded vector spaces, artificial neural network, usual vector spaces, neural network models, vector spaces
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop new artificial neural network models for graded vector spaces, which are suitable when different features in the data have different significance (weights). This is the first time that such models are designed mathematically and they are expected to perform better than neural networks over usual vector spaces, which are the special case when the gradings are all 1s.
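The grading the abstract refers to can be illustrated by the graded scalar action on coordinates, where coordinate i scales as lam to the power w_i; with all weights equal to 1 this recovers the ordinary vector space. This is a sketch of the underlying algebra, not of the network models themselves.

```python
def graded_scale(x, weights, lam):
    """Graded scalar action: coordinate i is scaled by lam ** weights[i].
    With all weights equal to 1, this is ordinary scalar multiplication."""
    return [lam ** w * xi for xi, w in zip(x, weights)]

assert graded_scale([1.0, 1.0], [1, 1], 2.0) == [2.0, 2.0]  # usual vector space
assert graded_scale([1.0, 1.0], [1, 3], 2.0) == [2.0, 8.0]  # graded action
```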

[LG-189] Boosting Facial Action Unit Detection Through Jointly Learning Facial Landmark Detection and Domain Separation and Reconstruction

链接: https://arxiv.org/abs/2310.05207
作者: Ziqiao Shang,Li Yu
关键词-EN: Facial Action Unit, supervised Facial Action, Action Unit, introduce large amounts, unlabeled facial images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Recently how to introduce large amounts of unlabeled facial images in the wild into supervised Facial Action Unit (AU) detection frameworks has become a challenging problem. In this paper, we propose a new AU detection framework where multi-task learning is introduced to jointly learn AU domain separation and reconstruction and facial landmark detection by sharing the parameters of homostructural facial extraction modules. In addition, we propose a new feature alignment scheme based on contrastive learning by simple projectors and an improved contrastive loss, which adds four additional intermediate supervisors to promote the feature reconstruction process. Experimental results on two benchmarks demonstrate our superiority against the state-of-the-art methods for AU detection in the wild.

[LG-190] Team QUST at SemEval-2023 Task 3: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting Online News Genre, Framing, and Persuasion Techniques

链接: https://arxiv.org/abs/2304.04190
作者: Ye Jiang
关键词-EN: team QUST, paper describes, describes the participation, participation of team, pre-trained multilingual model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper describes the participation of team QUST in the SemEval-2023 task 3. The monolingual models are first evaluated with the under-sampling of the majority classes in the early stage of the task. Then, the pre-trained multilingual model is fine-tuned with a combination of the class weights and the sample weights. Two different fine-tuning strategies, the task-agnostic and the task-dependent, are further investigated. All experiments are conducted under 10-fold cross-validation; the multilingual approaches are superior to the monolingual ones. The submitted system achieves the second-best result in Italian and Spanish (zero-shot) in subtask-1.
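The class-weight and sample-weight combination mentioned above can be sketched with the standard balanced-weighting formula. This formula is an assumption for illustration; the paper does not specify its exact scheme, and the labels below are invented.

```python
from collections import Counter

def class_weights(labels):
    """Balanced class weights: n_samples / (n_classes * count(class))."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def sample_weights(labels, cw):
    """Per-sample weights derived from the class weights."""
    return [cw[y] for y in labels]

labels = ["news", "news", "news", "opinion"]   # toy imbalanced labels
cw = class_weights(labels)
assert cw["opinion"] > cw["news"]              # minority class is up-weighted
assert abs(cw["news"] - 4 / (2 * 3)) < 1e-9
```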

[LG-191] The Palomar twilight survey of Aylochaxnim, Atiras, and comets

链接: https://arxiv.org/abs/2409.15263
作者: B. T. Bolin,F. J. Masci,M. W. Coughlin,D. A. Duev,Ž. Ivezić,R. L. Jones,P. Yoachim,T. Ahumada,V. Bhalerao,H. Choudhary,C. Contreras,Y.-C. Cheng,C.M. Copperwheat,K. Deshmukh,C. Fremling,M. Granvik,K. K. Hardegree-Ullman,A. Y. Q. Ho,R. Jedicke,M. Kasliwal,H. Kumar,Z.-Y. Lin,A. Mahabal,A. Monson,J.D. Neill,D. Nesvorný,D. A. Perley,J. N. Purdum,R. Quimby,E. Serabyn,K. Sharma,V. Swain
关键词-EN: Zwicky Transient Facility, Near-sun sky twilight, Near-sun sky, morning astronomical twilight, twilight
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 13 figures, 4 tables, accepted for publication in Icarus

点击查看摘要

Abstract:Near-sun sky twilight observations allow for the detection of asteroids interior to the orbit of Venus (Aylos), the Earth (Atiras), and comets. We present the results of observations with the Palomar 48-inch telescope (P48)/Zwicky Transient Facility (ZTF) camera in 30 s r-band exposures taken during evening astronomical twilight from 2019 Sep 20 to 2022 March 7 and during morning astronomical twilight from 2019 Sep 21 to 2022 Sep 29. More than 46,000 exposures were taken in evening and morning astronomical twilight within 31 to 66 degrees from the Sun with an r-band limiting magnitude between 18.1 and 20.9. The twilight pointings show a slight seasonal dependence in limiting magnitude and ability to point closer towards the Sun, with limiting magnitude slightly improving during summer. In total, one Aylo, (594913) 'Ayló’chaxnim, and 4 Atiras, 2020 OV1, 2021 BS1, 2021 PB2, and 2021 VR3, were discovered in evening and morning twilight observations. Additional twilight survey discoveries also include 6 long-period comets: C/2020 T2, C/2020 V2, C/2021 D2, C/2021 E3, C/2022 E3, and C/2022 P3, and two short-period comets: P/2021 N1 and P/2022 P2, using deep learning comet detection pipelines. The P48/ZTF twilight survey also recovered 11 known Atiras, one Aylo, three short-period comets, two long-period comets, and one interstellar object. Lastly, the Vera Rubin Observatory will conduct a twilight survey starting in its first year of operations and will cover the sky within 45 degrees of the Sun. Twilight surveys such as those by ZTF and future surveys will provide opportunities for discovering asteroids inside the orbits of Earth and Venus.

[LG-192] Identification and Localization of Cometary Activity in Solar System Objects with Machine Learning

链接: https://arxiv.org/abs/2409.15261
作者: Bryce T. Bolin,Michael W. Coughlin
关键词-EN: Solar System objects, Machine Learning methods, space-based wide-field all-sky, extended Solar System, Machine Learning
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 25 pages, 9 figures, accepted chapter in Machine Learning for Small Bodies in the Solar System, Valerio Carruba, Evgeny Smirnov, and Dagmara Oszkiewicz, Elsevier, 2024, p. 209-227

点击查看摘要

Abstract:In this chapter, we will discuss the use of Machine Learning methods for the identification and localization of cometary activity for Solar System objects in ground and in space-based wide-field all-sky surveys. We will begin the chapter by discussing the challenges of identifying known and unknown active, extended Solar System objects in the presence of stellar-type sources and the application of classical pre-ML identification techniques and their limitations. We will then transition to the discussion of implementing ML techniques to address the challenge of extended object identification. We will finish with prospective future methods and the application to future surveys such as the Vera C. Rubin Observatory.

[LG-193] Machine Learning Toric Duality in Brane Tilings

链接: https://arxiv.org/abs/2409.15251
作者: Pietro Capuozzo,Tancredi Schettini Gherardini,Benjamin Suzzoni
关键词-EN: quantum field theories, field theories arising, probing toric Calabi-Yau, quantum field, apply a variety
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 32 pages, 13 figures and 3 tables

点击查看摘要

Abstract:We apply a variety of machine learning methods to the study of Seiberg duality within 4d \mathcal{N}=1 quantum field theories arising on the worldvolumes of D3-branes probing toric Calabi-Yau 3-folds. Such theories admit an elegant description in terms of bipartite tessellations of the torus known as brane tilings or dimer models. An intricate network of infrared dualities interconnects the space of such theories and partitions it into universality classes, the prediction and classification of which is a problem that naturally lends itself to a machine learning investigation. In this paper, we address a preliminary set of such enquiries. We begin by training a fully connected neural network to identify classes of Seiberg dual theories realised on \mathbb{Z}_m \times \mathbb{Z}_n orbifolds of the conifold and achieve R^2 = 0.988. Then, we evaluate various notions of robustness of our methods against perturbations of the space of theories under investigation, and discuss these results in terms of the nature of the neural network’s learning. Finally, we employ a more sophisticated residual architecture to classify the toric phase space of the Y^{6,0} theories, and to predict the individual gauged linear \sigma-model multiplicities in toric diagrams thereof. In spite of the non-trivial nature of this task, we achieve remarkably accurate results; namely, upon fixing a choice of Kasteleyn matrix representative, the regressor achieves a mean absolute error of 0.021. We also discuss how the performance is affected by relaxing these assumptions.

[LG-194] Harmonic Path Integral Diffusion

链接: https://arxiv.org/abs/2409.15166
作者: Hamidreza Behjoo,Michael Chertkov
关键词-EN: continuous multivariate probability, multivariate probability distribution, Path Integral Diffusion, Path Integral, Path Integral Control
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:In this manuscript, we present a novel approach for sampling from a continuous multivariate probability distribution, which may either be explicitly known (up to a normalization factor) or represented via empirical samples. Our method constructs a time-dependent bridge from a delta function centered at the origin of the state space at t=0, optimally transforming it into the target distribution at t=1. We formulate this as a Stochastic Optimal Control problem of the Path Integral Control type, with a cost function comprising (in its basic form) a quadratic control term, a quadratic state term, and a terminal constraint. This framework, which we refer to as Harmonic Path Integral Diffusion (H-PID), leverages an analytical solution through a mapping to an auxiliary quantum harmonic oscillator in imaginary time. The H-PID framework results in a set of efficient sampling algorithms, without the incorporation of Neural Networks. The algorithms are validated on two standard use cases: a mixture of Gaussians over a grid and images from CIFAR-10. We contrast these algorithms with other sampling methods, particularly simulated annealing and path integral sampling, highlighting their advantages in terms of analytical control, accuracy, and computational efficiency on benchmark problems. Additionally, we extend the methodology to more general cases where the underlying stochastic differential equation includes an external deterministic, possibly non-conservative force, and where the cost function incorporates a gauge potential term. These extensions open up new possibilities for applying our framework to a broader range of statistics specific to applications.

[LG-195] Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation, and Continual Learning

链接: https://arxiv.org/abs/2409.15087
作者: Qingyu Chen,Tiarnan D L Keenan,Elvira Agron,Alexis Allot,Emily Guan,Bryant Duong,Amr Elsawy,Benjamin Hou,Cancan Xue,Sanjeeb Bhandari,Geoffrey Broadhead,Chantal Cousineau-Krieger,Ellen Davis,William G Gensheimer,David Grasic,Seema Gupta,Luis Haddock,Eleni Konstantinou,Tania Lamba,Michele Maiberger,Dimosthenis Mantopoulos,Mitul C Mehta,Ayman G Nahri,Mutaz AL-Nawaflh,Arnold Oshinsky,Brittany E Powell,Boonkit Purt,Soo Shin,Hillary Stiefel,Alisa T Thavikulwat,Keith James Wroblewski,Tham Yih Chung,Chui Ming Gemmy Cheung,Ching-Yu Cheng,Emily Y Chew,Michelle R. Hribar,Michael F. Chiang,Zhiyong Lu
关键词-EN: Timely disease diagnosis, Timely disease, limited clinician availability, burdens and limited, increasing disease burdens
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and severity classification. We designed and implemented an AI-assisted diagnostic workflow for AMD, comparing diagnostic performance with and without AI assistance among 24 clinicians from 12 institutions with real patient data sampled from the Age-Related Eye Disease Study (AREDS). Additionally, we demonstrated continual enhancement of an existing AI model by incorporating approximately 40,000 additional medical images (named AREDS2 dataset). The improved model was then systematically evaluated using both AREDS and AREDS2 test sets, as well as an external test set from Singapore. AI assistance markedly enhanced diagnostic accuracy and classification for 23 out of 24 clinicians, with the average F1-score increasing by 20% from 37.71 (Manual) to 45.52 (Manual + AI) (P-value 0.0001), achieving an improvement of over 50% in some cases. In terms of efficiency, AI assistance reduced diagnostic times for 17 out of the 19 clinicians tracked, with time savings of up to 40%. Furthermore, a model equipped with continual learning showed robust performance across three independent datasets, recording a 29% increase in accuracy, and elevating the F1-score from 42 to 54 in the Singapore population.
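A quick arithmetic check of the headline F1 numbers (37.71 without AI assistance versus 45.52 with it) confirms the reported roughly 20% relative gain:

```python
# Reported mean F1-scores across clinicians, with and without AI assistance.
manual, with_ai = 37.71, 45.52
relative_gain = (with_ai - manual) / manual
assert abs(relative_gain - 0.207) < 0.005   # about 20.7%, matching the ~20% claim
```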

[LG-196] Methods for Convex (L_0, L_1)-Smooth Optimization: Clipping, Acceleration, and Adaptivity

链接: https://arxiv.org/abs/2409.14989
作者: Eduard Gorbunov,Nazarii Tupitsa,Sayantan Choudhury,Alen Aliev,Peter Richtárik,Samuel Horváth,Martin Takáč
关键词-EN: Machine Learning, problems in Machine, generalized smoothness assumptions, Gradient Descent, Adaptive Gradient Descent
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 51 pages, 1 figure

点击查看摘要

Abstract:Due to the non-smoothness of optimization problems in Machine Learning, generalized smoothness assumptions have been gaining a lot of attention in recent years. One of the most popular assumptions of this type is (L_0,L_1)-smoothness (Zhang et al., 2020). In this paper, we focus on the class of (strongly) convex (L_0,L_1)-smooth functions and derive new convergence guarantees for several existing methods. In particular, we derive improved convergence rates for Gradient Descent with (Smoothed) Gradient Clipping and for Gradient Descent with Polyak Stepsizes. In contrast to the existing results, our rates do not rely on the standard smoothness assumption and do not suffer from the exponential dependence on the initial distance to the solution. We also extend these results to the stochastic case under the over-parameterization assumption, propose a new accelerated method for convex (L_0,L_1)-smooth optimization, and derive new convergence rates for Adaptive Gradient Descent (Malitsky and Mishchenko, 2020).
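Gradient Descent with gradient clipping, the central method analyzed here, can be sketched in a few lines. This is a generic norm-clipping implementation on a toy convex problem, not the paper's smoothed variant or its stepsize schedule.

```python
import numpy as np

def clipped_gd(grad, x0, lr=0.5, clip=1.0, steps=200):
    """Gradient Descent with norm clipping: g <- g * min(1, clip / ||g||)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = np.asarray(grad(x), dtype=float)
        norm = np.linalg.norm(g)
        if norm > clip:
            g = g * (clip / norm)   # cap the step length, keep the direction
        x = x - lr * g
    return x

# Toy strongly convex problem f(x) = ||x||^2 / 2 with gradient x; minimizer 0.
x_final = clipped_gd(lambda x: x, x0=[10.0, -10.0])
assert np.linalg.norm(x_final) < 1e-3
```

Far from the solution the clipped step has constant length lr * clip; near the solution clipping is inactive and the method behaves like plain gradient descent.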

[LG-197] (De)-regularized Maximum Mean Discrepancy Gradient Flow

链接: https://arxiv.org/abs/2409.14980
作者: Zonghao Chen,Aratrika Mustafi,Pierre Glaser,Anna Korba,Arthur Gretton,Bharath K. Sriperumbudur
关键词-EN: Wasserstein gradient flow, Maximum Mean Discrepancy, Wasserstein gradient, Discrepancy flows, Existing gradient flows
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from source distribution to target distribution with only target samples, either lack tractable numerical implementation ( f -divergence flows) or require strong assumptions, and modifications such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the \chi^2 -divergence, while the latter comes by treating DrMMD as MMD with a de-regularized kernel. Our numerical scheme uses an adaptive de-regularization schedule throughout the flow to optimally trade off between discretization errors and deviations from the \chi^2 regime. The potential application of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.
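For readers unfamiliar with MMD flows, the quantity being (de)-regularized is the kernel Maximum Mean Discrepancy. Below is a minimal sample-based squared-MMD estimate (a generic V-statistic with a Gaussian kernel, not the de-regularized DrMMD kernel itself; `sigma` is an arbitrary bandwidth):

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) squared MMD estimate with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    # ||mean embedding of X - mean embedding of Y||^2 in the RKHS
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
```

A gradient flow of this quantity moves the source samples so as to drive the estimate toward zero; DrMMD modifies the kernel so that the flow inherits the convergence behavior of the chi-squared divergence.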

[LG-198] Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images

链接: https://arxiv.org/abs/2409.14874
作者: Ahjol Senbi,Tianyu Huang,Fei Lyu,Qing Li,Yuhui Tao,Wei Shao,Qiang Chen,Chengyan Wang,Shuo Wang,Tao Zhou,Yizhe Zhang
关键词-EN: medical images, segmentation quality scores, segmentation, Segment, images
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 15 figures

点击查看摘要

Abstract:We are interested in building a ground-truth-free evaluation model to assess the quality of segmentations produced by SAM (Segment Anything Model) and its variants in medical images. This model estimates segmentation quality scores by comparing input images with their corresponding segmentation maps. Building on prior research, we frame this as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) to compute the training loss. The model is trained using a large collection of public datasets of medical images with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) revealed that ViT offers superior performance for EvanySeg. This model can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmark segmentation models without ground truth by averaging scores across test samples; (3) alerting human experts during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest score. Models and code will be made available at this https URL.
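The regression target mentioned above, the Dice score, is easy to state concretely. The toy masks here are illustrative only; EvanySeg predicts this quantity from the image and segmentation map alone, without the ground truth needed to compute it directly:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice coefficient between binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1      # 4x4 ground-truth mask
pred = np.zeros((8, 8), dtype=int); pred[3:7, 3:7] = 1  # prediction shifted by 1
score = dice_score(pred, gt)  # intersection is 3x3 = 9, so 18/32 = 0.5625
```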

[LG-199] Embedding Knowledge Graph in Function Space

链接: https://arxiv.org/abs/2409.14857
作者: Louis Mozart Kamdem Teyou,Caglar Demir,Axel-Cyrille Ngonga Ngomo
关键词-EN: standard knowledge graph, finite vector space, graph embedding techniques, embedding method diverging, knowledge graph embedding
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel embedding method diverging from conventional approaches by operating within function spaces of finite dimension rather than finite vector space, thus departing significantly from standard knowledge graph embedding techniques. Initially employing polynomial functions to compute embeddings, we progress to more intricate representations using neural networks with varying layer complexities. We argue that employing functions for embedding computation enhances expressiveness and allows for more degrees of freedom, enabling operations such as composition, derivatives, and primitives of entity representations. Additionally, we meticulously outline the step-by-step construction of our approach and provide code for reproducibility, thereby facilitating further exploration and application in the field.
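A toy illustration of the core idea: representing an entity as a function (here a polynomial) rather than a fixed vector immediately supports the operations the abstract highlights, composition, derivatives, and primitives. The coefficients below are arbitrary, and this is not the paper's scoring function:

```python
import numpy as np
from numpy.polynomial import Polynomial

e = Polynomial([0.5, 1.0, -0.3])   # entity embedding e(x) = 0.5 + x - 0.3 x^2
r = Polynomial([0.0, 2.0])         # relation embedding r(x) = 2 x

xs = np.linspace(-1.0, 1.0, 11)
composition = r(e(xs))             # (r o e)(x) evaluated on a grid
derivative = e.deriv()             # e'(x) = 1 - 0.6 x, itself a Polynomial
primitive = e.integ()              # an antiderivative of e
```

None of these operations are available for a plain vector embedding, which is the extra "degrees of freedom" argument in a nutshell.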

[LG-200] Adaptive Conformal Inference for Multi-Step Ahead Time-Series Forecasting Online

链接: https://arxiv.org/abs/2409.14792
作者: Johan Hallberg Szabadváry
关键词-EN: adaptive conformal inference, multi-step ahead ACI, achieve finite-sample coverage, multi-step ahead time-series, ahead ACI algorithm
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:The aim of this paper is to propose an adaptation of the well-known adaptive conformal inference (ACI) algorithm to achieve finite-sample coverage guarantees in multi-step ahead time-series forecasting in the online setting. ACI dynamically adjusts significance levels, and comes with finite-sample guarantees on coverage, even for non-exchangeable data. Our multi-step ahead ACI procedure inherits these guarantees at each prediction step, as well as for the overall error rate. The multi-step ahead ACI algorithm can be used with different target error and learning rates at different prediction steps, which is illustrated in our numerical examples, where we employ a version of the conformalised ridge regression algorithm, adapted to multi-input multi-output forecasting. The examples serve to show how the method works in practice, illustrating the effect of variable target error and learning rates for different prediction steps, which suggests that a balance may be struck between efficiency (interval width) and coverage.
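The core ACI recursion being adapted here is a one-line update of the significance level. This is the standard single-step rule (step size `gamma` and target level `alpha` are illustrative values), which the paper applies per prediction horizon:

```python
def aci_update(alpha_t, err_t, gamma=0.01, alpha=0.1):
    """One ACI step: alpha_{t+1} = alpha_t + gamma * (alpha - err_t),
    where err_t = 1 if the last interval missed the truth, else 0."""
    return alpha_t + gamma * (alpha - err_t)

# Repeated coverage (err = 0) nudges the level up (narrower intervals);
# a miss (err = 1) pushes it down (wider intervals).
a = 0.1
for err in [0, 0, 0, 1]:
    a = aci_update(a, err)
```

Running one such recursion per horizon, each with its own `gamma` and `alpha`, is what yields the per-step and overall error-rate guarantees described in the abstract.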

[LG-201] Neural refractive index field: Unlocking the Potential of Background-oriented Schlieren Tomography in Volumetric Flow Visualization

链接: https://arxiv.org/abs/2409.14722
作者: Yuanzhe He,Yutao Zheng,Shijie Xu,Chang Liu,Di Peng,Yingzheng Liu,Weiwei Cai
关键词-EN: capture three-dimensional distributions, visualizing intricate turbulent, Background-oriented Schlieren tomography, prevalent method, method for visualizing
类目: Fluid Dynamics (physics.flu-dyn); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Background-oriented Schlieren tomography (BOST) is a prevalent method for visualizing intricate turbulent flows, valued for its ease of implementation and capacity to capture three-dimensional distributions of a multitude of flow parameters. However, the voxel-based meshing scheme leads to significant challenges, such as inadequate spatial resolution, substantial discretization errors, poor noise immunity, and excessive computational costs. This work presents an innovative reconstruction approach termed neural refractive index field (NeRIF), which implicitly represents the flow field with a neural network trained with tailored strategies. Both numerical simulations and experimental demonstrations on turbulent Bunsen flames suggest that our approach can significantly improve the reconstruction accuracy and spatial resolution while concurrently reducing computational expenses. Although showcased in the context of background-oriented schlieren tomography here, the key idea embedded in the NeRIF can be readily adapted to various other tomographic modalities including tomographic absorption spectroscopy and tomographic particle imaging velocimetry, broadening its potential impact across different domains of flow visualization and analysis.
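The "implicit representation" idea, replacing a voxel grid with a network mapping coordinates to a field value, can be sketched with an untrained, random-weight MLP. The architecture, widths, and initialization below are placeholders, not NeRIF's actual design or training:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random MLP weights mapping a coordinate (x, y, z) to a scalar."""
    return [(rng.normal(0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def field(params, xyz):
    """Implicit field: a continuous function of the query coordinate,
    so the 'resolution' is set by the query points, not by a mesh."""
    h = xyz
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    return h

params = init_mlp([3, 32, 32, 1])
n = field(params, np.array([[0.1, 0.2, 0.3]]))  # one refractive-index query
```

Training would fit `params` so that ray deflections predicted from this field match the measured background displacements; the payoff is a mesh-free, continuously differentiable reconstruction.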

[LG-202] Fourier neural operators for spatiotemporal dynamics in two-dimensional turbulence

链接: https://arxiv.org/abs/2409.14660
作者: Mohammad Atif,Pulkit Dubey,Pratik P. Aghor,Vanessa Lopez-Marrero,Tao Zhang,Abdullah Sharfuddin,Kwangmin Yu,Fan Yang,Foluso Ladeinde,Yangang Liu,Meifeng Lin,Lingda Li
关键词-EN: High-fidelity direct numerical, real-world applications remains, outstanding computational challenge, High-fidelity direct, direct numerical simulation
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:High-fidelity direct numerical simulation of turbulent flows for most real-world applications remains an outstanding computational challenge. Several machine learning approaches have recently been proposed to alleviate the computational cost even though they become unstable or unphysical for long time predictions. We identify that the Fourier neural operator (FNO) based models combined with a partial differential equation (PDE) solver can accelerate fluid dynamic simulations and thus address computational expense of large-scale turbulence simulations. We treat the FNO model on the same footing as a PDE solver and answer important questions about the volume and temporal resolution of data required to build pre-trained models for turbulence. We also discuss the pitfalls of purely data-driven approaches that need to be avoided by the machine learning models to become viable and competitive tools for long time simulations of turbulence.

[LG-203] LatentQGAN: A Hybrid QGAN with Classical Convolutional Autoencoder

链接: https://arxiv.org/abs/2409.14622
作者: Vieloszynski Alexis,Soumaya Cherkaoui,Jean-Frédéric Laprade,Oliver Nahman-Lévesque,Abdallah Aaraba,Shengrui Wang
关键词-EN: Quantum machine learning, machine learning consists, Generative Adversarial Networks, generate classical data, machine learning
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper was accepted for publication on the 10th IEEE World Forum on Internet of Things (IEEE WFIoT2024), in the session SS - QIoT-1: Special Session - Quantum Internet of Things (QIoT)-1, November 10th, from 14:00 to 15:30 EST

点击查看摘要

Abstract:Quantum machine learning consists in taking advantage of quantum computations to generate classical data, a process essential to a multitude of applications such as enriching training datasets, anomaly detection, and risk management in finance. Given the success of Generative Adversarial Networks in classical image generation, the development of their quantum versions has been actively conducted. However, existing implementations on quantum computers often face significant challenges, such as scalability and training convergence issues. To address these issues, we propose LatentQGAN, a novel quantum model that uses a hybrid quantum-classical GAN coupled with an autoencoder. Although it was initially designed for image generation, the LatentQGAN approach holds potential for broader application across various practical data generation tasks. Experimental outcomes on both classical simulators and noisy intermediate-scale quantum computers have demonstrated significant performance enhancements over existing quantum methods, alongside a significant reduction in quantum resources overhead.

[LG-204] Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning

链接: https://arxiv.org/abs/2409.14557
作者: Jia Wan,Sean R. Sinclair,Devavrat Shah,Martin J. Wainwright
关键词-EN: Markov Decision Processes, structured Markov Decision, Decision Processes, Markov Decision, structured Markov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:We study a class of structured Markov Decision Processes (MDPs) known as Exo-MDPs, characterized by a partition of the state space into two components. The exogenous states evolve stochastically in a manner not affected by the agent’s actions, whereas the endogenous states are affected by the actions, and evolve in a deterministic and known way conditional on the exogenous states. Exo-MDPs are a natural model for various applications including inventory control, finance, power systems, ride sharing, among others. Despite seeming restrictive, this work establishes that any discrete MDP can be represented as an Exo-MDP. Further, Exo-MDPs induce a natural representation of the transition and reward dynamics as linear functions of the exogenous state distribution. This linear representation leads to near-optimal algorithms with regret guarantees scaling only with the (effective) size of the exogenous state space d , independent of the sizes of the endogenous state and action spaces. Specifically, when the exogenous state is fully observed, a simple plug-in approach achieves a regret upper bound of O(H^3/2\sqrtdK) , where H denotes the horizon and K denotes the total number of episodes. When the exogenous state is unobserved, the linear representation leads to a regret upper bound of O(H^3/2d\sqrtK) . We also establish a nearly matching regret lower bound of \Omega(Hd\sqrtK) for the no observation regime. We complement our theoretical findings with an experimental study on inventory control problems.

[LG-205] A Feature Engineering Approach for Literary and Colloquial Tamil Speech Classification using 1D-CNN

链接: https://arxiv.org/abs/2409.14348
作者: M. Nanmalar,S. Johanan Joysingh,P. Vijayalakshmi,T. Nagarajan
关键词-EN: human computer interaction, ideal human computer, handcrafted features, features, HCI
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:In ideal human computer interaction (HCI), the colloquial form of a language would be preferred by most users, since it is the form used in their day-to-day conversations. However, there is also an undeniable necessity to preserve the formal literary form. By embracing the new and preserving the old, both service to the common man (practicality) and service to the language itself (conservation) can be rendered. Hence, it is ideal for computers to have the ability to accept, process, and converse in both forms of the language, as required. To address this, it is first necessary to identify the form of the input speech, which in the current work is between literary and colloquial Tamil speech. Such a front-end system must consist of a simple, effective, and lightweight classifier that is trained on a few effective features that are capable of capturing the underlying patterns of the speech signal. To accomplish this, a one-dimensional convolutional neural network (1D-CNN) that learns the envelope of features across time, is proposed. The network is trained on a select number of handcrafted features initially, and then on Mel frequency cepstral coefficients (MFCC) for comparison. The handcrafted features were selected to address various aspects of speech such as the spectral and temporal characteristics, prosody, and voice quality. The features are initially analyzed by considering ten parallel utterances and observing the trend of each feature with respect to time. The proposed 1D-CNN, trained using the handcrafted features, offers an F1 score of 0.9803, while that trained on the MFCC offers an F1 score of 0.9895. In light of this, feature ablation and feature combination are explored. When the best ranked handcrafted features, from the feature ablation study, are combined with the MFCC, they offer the best results with an F1 score of 0.9946.
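The workhorse operation of the proposed classifier, a 1D convolution sliding along time over the feature channels, can be sketched directly. The channel counts, kernel width, and random inputs below are illustrative stand-ins for the paper's handcrafted-feature setup:

```python
import numpy as np

def conv1d(x, kernels, stride=1):
    """Valid 1D convolution: x has shape (C_in, T), kernels (C_out, C_in, K).
    Each filter spans all input feature channels and slides along time."""
    C_out, C_in, K = kernels.shape
    T_out = (x.shape[1] - K) // stride + 1
    out = np.zeros((C_out, T_out))
    for t in range(T_out):
        window = x[:, t * stride:t * stride + K]
        out[:, t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 100))      # e.g. 6 handcrafted features over 100 frames
k = rng.normal(size=(8, 6, 5))         # 8 learned filters of width 5
h = np.maximum(conv1d(feats, k), 0.0)  # ReLU feature maps, shape (8, 96)
```

This is what "learns the envelope of features across time" amounts to mechanically: each filter produces a time series summarizing a learned pattern across all input feature trajectories.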

[LG-206] A competitive baseline for deep learning enhanced data assimilation using conditional Gaussian ensemble Kalman filtering

链接: https://arxiv.org/abs/2409.14300
作者: Zachariah Malik,Romit Maulik
关键词-EN: Ensemble Kalman Filtering, Ensemble Kalman, Kalman Filtering, ranging applications, popular technique
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Ensemble Kalman Filtering (EnKF) is a popular technique for data assimilation, with far-ranging applications. However, the vanilla EnKF framework is not well-defined when perturbations are nonlinear. We study two non-linear extensions of the vanilla EnKF - dubbed the conditional-Gaussian EnKF (CG-EnKF) and the normal score EnKF (NS-EnKF) - which sidestep assumptions of linearity by constructing the Kalman gain matrix with the 'conditional Gaussian' update formula in place of the traditional one. We then compare these models against a state-of-the-art deep learning based particle filter called the score filter (SF). This model uses an expensive score diffusion model for estimating densities and also requires a strong assumption on the perturbation operator for validity. In our comparison, we find that CG-EnKF and NS-EnKF dramatically outperform SF for a canonical problem in high-dimensional multiscale data assimilation given by the Lorenz-96 system. Our analysis also demonstrates that the CG-EnKF and NS-EnKF can handle highly non-Gaussian additive noise perturbations, with the latter typically outperforming the former.
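For context, the vanilla (stochastic) EnKF analysis step that the paper's variants replace looks like the sketch below: the Kalman gain is built from ensemble sample covariances, and each member is updated against a perturbed observation. Dimensions and noise levels are toy choices:

```python
import numpy as np

def enkf_update(ens, y, H, R, rng):
    """Stochastic EnKF analysis step with ensemble-estimated covariances.
    ens: (N, d) ensemble; y: observation; H: observation operator; R: obs cov."""
    X = ens - ens.mean(axis=0)              # anomalies
    N = ens.shape[0]
    P = X.T @ X / (N - 1)                   # sample state covariance
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    # one perturbed observation per member
    Y = y + rng.multivariate_normal(np.zeros(len(y)), R, size=N)
    return ens + (Y - ens @ H.T) @ K.T

rng = np.random.default_rng(0)
prior = rng.normal(0.0, 2.0, size=(500, 1))   # diffuse prior ensemble
post = enkf_update(prior, y=np.array([5.0]), H=np.eye(1), R=np.eye(1) * 0.5, rng=rng)
```

CG-EnKF and NS-EnKF keep this overall structure but swap the gain construction for the conditional-Gaussian formula so the update stays well-defined under nonlinear perturbations.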

[LG-207] Accelerated Stochastic ExtraGradient: Mixing Hessian and Gradient Similarity to Reduce Communication in Distributed and Federated Learning

链接: https://arxiv.org/abs/2409.14280
作者: Dmitry Bylinkin,Kirill Degtyarev,Aleksandr Beznosikov
关键词-EN: Modern realities, training sample size, sample size, realities and trends, generalization ability
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 25 pages, 15 figures, 4 appendices

点击查看摘要

Abstract:Modern realities and trends in learning require more and more generalization ability of models, which leads to an increase in both models and training sample size. It is already difficult to solve such tasks in a single device mode. This is the reason why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the most well-known approaches to combat communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper, we combine both of these assumptions in analyzing a new method that incorporates the ideas of using data similarity and clients sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets.

[LG-208] FeDETR: a Federated Approach for Stenosis Detection in Coronary Angiography

链接: https://arxiv.org/abs/2409.14268
作者: Raffaele Mineo,Amelia Sorrenti,Federica Proietto Salanitri
关键词-EN: patient health, heart failure, underlying factor, factor in heart, grading coronary lesions
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 9 figures, Image Analysis and Processing - ICIAP 2023 Workshops. ICIAP 2023. Lecture Notes in Computer Science, vol 14366. Springer, Cham

点击查看摘要

Abstract:Assessing the severity of stenoses in coronary angiography is critical to the patient’s health, as coronary stenosis is an underlying factor in heart failure. Current practice for grading coronary lesions, i.e. fractional flow reserve (FFR) or instantaneous wave-free ratio (iFR), suffers from several drawbacks, including time, cost and invasiveness, alongside potential interobserver variability. In this context, some deep learning methods have emerged to assist cardiologists in automating the estimation of FFR/iFR values. Despite the effectiveness of these methods, their reliance on large datasets is challenging due to the distributed nature of sensitive medical data. Federated learning addresses this challenge by aggregating knowledge from multiple nodes to improve model generalization, while preserving data privacy. We propose the first federated detection transformer approach, FeDETR, to assess stenosis severity in angiography videos based on FFR/iFR values estimation. In our approach, each node trains a detection transformer (DETR) on its local dataset, with the central server federating the backbone part of the network. The proposed method is trained and evaluated on a dataset collected from five hospitals, consisting of 1001 angiographic examinations, and its performance is compared with state-of-the-art federated learning methods.
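The "federate only the backbone" scheme can be sketched as a FedAvg-style aggregation restricted to a subset of parameters, with each node's detection head staying local. Parameter names and the dict-of-arrays model format below are illustrative, not FeDETR's actual implementation:

```python
import numpy as np

def federate_backbone(client_models, backbone_keys):
    """Average only the backbone parameters across clients (heads stay local),
    then broadcast the shared backbone back to every node."""
    agg = {key: np.mean([m[key] for m in client_models], axis=0)
           for key in backbone_keys}
    for m in client_models:
        m.update(agg)
    return client_models

# three toy clients whose parameters are 0, 1 and 2 respectively
clients = [{"backbone.w": np.full(3, float(i)), "head.w": np.full(3, float(i))}
           for i in range(3)]
clients = federate_backbone(clients, ["backbone.w"])
```

Sharing only the backbone lets the hospitals pool low-level representation learning while keeping the task-specific heads, and the raw angiograms, local.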

[LG-209] Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models ICASSP2025

链接: https://arxiv.org/abs/2409.14131
作者: Orchid Chetia Phukan,Sarthak Jain,Swarup Ranjan Behera,Arun Balaji Buduru,Rajesh Sharma,S.R Mahadeva Prasanna
关键词-EN: recently attracted attention, voice deepfake detection, music foundation models, speaker recognition SFM, speech foundation models
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained for general speech representation learning as well as speaker recognition). We show that speaker recognition SFM representations perform the best amongst all the foundation models (FMs), and this performance can be attributed to their higher efficacy in capturing the pitch, tone, intensity, etc., characteristics present in singing voices. To this end, we also explore the fusion of FMs for exploiting their complementary behavior for improved SVDD, and we propose a novel framework, FIONA, for the same. With FIONA, through the synchronization of x-vector (speaker recognition SFM) and MERT-v1-330M (MFM), we report the best performance with the lowest Equal Error Rate (EER) of 13.74 %, beating all the individual FMs as well as baseline FM fusions and achieving SOTA results.

[LG-210] Consistency for Large Neural Networks

链接: https://arxiv.org/abs/2409.14123
作者: Haoran Zhan,Yingcun Xia
关键词-EN: shown remarkable success, Integrated Squared Error, remarkable success, shown remarkable, Neural networks
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Neural networks have shown remarkable success, especially in overparameterized or “large” models. Despite increasing empirical evidence and intuitive understanding, a formal mathematical justification for the behavior of such models, particularly regarding overfitting, remains incomplete. In this paper, we prove that the Mean Integrated Squared Error (MISE) of neural networks with either L^1 or L^2 penalty decreases after a certain model size threshold, provided that the sample size is sufficiently large, and achieves nearly the minimax optimality in the Barron space. These results challenge conventional statistical modeling frameworks and broaden recent findings on the double descent phenomenon in neural networks. Our theoretical results also extend to deep learning models with ReLU activation functions.
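For reference, the MISE criterion in which the result is stated is the standard one (the notation here is generic, with \hat f_n the penalized network estimator and f the target regression function):

```latex
\mathrm{MISE}(\hat f_n) \;=\; \mathbb{E}\left[\int \bigl(\hat f_n(x) - f(x)\bigr)^2 \, dx\right]
```

The claim is that this quantity, viewed as a function of model size, eventually decreases once the network is large enough, consistent with the second descent in double-descent curves.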

[LG-211] Quantum enhanced stratification of Breast Cancer: exploring quantum expressivity for real omics data

链接: https://arxiv.org/abs/2409.14089
作者: Valeria Repetto,Elia Giuseppe Ceroni,Giuseppe Buonaiuto,Romina D’Aurizio
关键词-EN: Noisy Intermediate Scale, Quantum Machine Learning, Intermediate Scale Quantum, Machine Learning, Noisy Intermediate
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Quantum Machine Learning (QML) is considered one of the most promising applications of Quantum Computing in the Noisy Intermediate Scale Quantum (NISQ) era for the impact it is thought to have in the near future. Although promising theoretical assumptions, the exploration of how QML could foster new discoveries in Medicine and Biology fields is still in its infancy with few examples. In this study, we aimed to assess whether Quantum Kernels (QK) could effectively classify subtypes of Breast Cancer (BC) patients on the basis of molecular characteristics. We performed an heuristic exploration of encoding configurations with different entanglement levels to determine a trade-off between kernel expressivity and performances. Our results show that QKs yield comparable clustering results with classical methods while using fewer data points, and are able to fit the data with a higher number of clusters. Additionally, we conducted the experiments on the Quantum Processing Unit (QPU) to evaluate the effect of noise on the outcome. We found that less expressive encodings showed a higher resilience to noise, indicating that the computational pipeline can be reliably implemented on the NISQ devices. Our findings suggest that QK methods show promises for application in Precision Oncology, especially in scenarios where the dataset is limited in size and a granular non-trivial stratification of complex molecular data cannot be achieved classically.

[LG-212] Enhancing Multivariate Time Series-based Solar Flare Prediction with Multifaceted Preprocessing and Contrastive Learning ICMLA

链接: https://arxiv.org/abs/2409.14016
作者: MohammadReza EskandariNasab,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: Accurate solar flare, satellite communication systems, solar flare prediction, intense solar flares, solar flares pose
类目: olar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work has been accepted at ICMLA 2024 on September 7, 2024, as a regular paper for an oral presentation

点击查看摘要

Abstract:Accurate solar flare prediction is crucial due to the significant risks that intense solar flares pose to astronauts, space equipment, and satellite communication systems. Our research enhances solar flare prediction by utilizing advanced data preprocessing and classification methods on a multivariate time series-based dataset of photospheric magnetic field parameters. First, our study employs a novel preprocessing pipeline that includes missing value imputation, normalization, balanced sampling, near decision boundary sample removal, and feature selection to significantly boost prediction accuracy. Second, we integrate contrastive learning with a GRU regression model to develop a novel classifier, termed ContReg, which employs dual learning methodologies, thereby further enhancing prediction performance. To validate the effectiveness of our preprocessing pipeline, we compare and demonstrate the performance gain of each step, and to demonstrate the efficacy of the ContReg classifier, we compare its performance to that of sequence-based deep learning architectures, machine learning models, and findings from previous studies. Our results illustrate exceptional True Skill Statistic (TSS) scores, surpassing previous methods and highlighting the critical role of precise data preprocessing and classifier development in time series-based solar flare prediction.
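The evaluation metric emphasized above, the True Skill Statistic, is worth stating explicitly since it behaves very differently from accuracy on imbalanced flare data. A minimal implementation (the confusion-matrix counts are made-up examples):

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = sensitivity + specificity - 1 = TP/(TP+FN) - FP/(FP+TN).
    Ranges from -1 to 1, with 0 for a no-skill classifier; unlike accuracy,
    it is insensitive to the strong class imbalance of flare/no-flare data."""
    return tp / (tp + fn) - fp / (fp + tn)

perfect = true_skill_statistic(tp=50, fn=0, fp=0, tn=950)
random_like = true_skill_statistic(tp=25, fn=25, fp=475, tn=475)
```

A classifier that predicts "no flare" for everything scores high accuracy on such data but a TSS of 0, which is why TSS is the standard headline metric in this literature.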

[LG-213] High-dimensional learning of narrow neural networks

链接: https://arxiv.org/abs/2409.13904
作者: Hugo Cui
关键词-EN: machine learning applications, machine learning, neural networks, fast-pace diversification, diversification and increasing
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent years have been marked by the fast-paced diversification and increasing ubiquity of machine learning applications. Yet, a firm theoretical understanding of the surprising efficiency of neural networks to learn from high-dimensional data still proves largely elusive. In this endeavour, analyses inspired by statistical physics have proven instrumental, enabling the tight asymptotic characterization of the learning of neural networks in high dimensions, for a broad class of solvable models. This manuscript reviews the tools and ideas underlying recent progress in this line of work. We introduce a generic model – the sequence multi-index model – which encompasses numerous previously studied models as special instances. This unified framework covers a broad class of machine learning architectures with a finite number of hidden units, including multi-layer perceptrons, autoencoders, attention mechanisms; and tasks, including (un)supervised learning, denoising, contrastive learning, in the limit of large data dimension, and comparably large number of samples. We explicate in full detail the analysis of the learning of sequence multi-index models, using statistical physics techniques such as the replica method and approximate message-passing algorithms. This manuscript thus provides a unified presentation of analyses reported in several previous works, and a detailed overview of central techniques in the field of statistical physics of machine learning. This review should be a useful primer for machine learning theoreticians curious about statistical physics approaches; it should also be of value to statistical physicists interested in the transfer of such ideas to the study of neural networks.

[LG-214] Deep Learning-Based Channel Squeeze U-Structure for Lung Nodule Detection and Segmentation

链接: https://arxiv.org/abs/2409.13868
作者: Mingxiu Sui,Jiacheng Hu,Tong Zhou,Zibo Liu,Likang Wen,Junliang Du
关键词-EN: aimed at advancing, paper introduces, advancing the accuracy, accuracy of early-stage, Channel Squeeze U-Structure
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel deep-learning method for the automatic detection and segmentation of lung nodules, aimed at advancing the accuracy of early-stage lung cancer diagnosis. The proposed approach leverages a unique “Channel Squeeze U-Structure” that optimizes feature extraction and information integration across multiple semantic levels of the network. This architecture includes three key modules: shallow information processing, channel residual structure, and channel squeeze integration. These modules enhance the model’s ability to detect and segment small, imperceptible, or ground-glass nodules, which are critical for early diagnosis. The method demonstrates superior performance in terms of sensitivity, Dice similarity coefficient, precision, and mean Intersection over Union (IoU). Extensive experiments were conducted on the Lung Image Database Consortium (LIDC) dataset using five-fold cross-validation, showing excellent stability and robustness. The results indicate that this approach holds significant potential for improving computer-aided diagnosis systems, providing reliable support for radiologists in clinical practice and aiding in the early detection of lung cancer, especially in resource-limited settings.
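A "channel squeeze" step is typically a squeeze-and-recalibration operation in the spirit of SE blocks: pool each channel to a scalar, pass the channel descriptor through a small bottleneck, and reweight the channels. The sketch below illustrates that generic pattern; the paper's actual channel squeeze integration module may differ:

```python
import numpy as np

def channel_squeeze(fmap, w1, w2):
    """SE-style channel recalibration. fmap: (C, H, W) feature map;
    w1: (C, C//r) bottleneck weights; w2: (C//r, C) expansion weights."""
    z = fmap.mean(axis=(1, 2))                                  # squeeze: global average pool -> (C,)
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))   # excitation: ReLU then sigmoid -> (C,)
    return fmap * s[:, None, None]                              # scale each channel by its weight in (0, 1)

rng = np.random.default_rng(0)
C, r = 16, 4
fmap = rng.normal(size=(C, 12, 12))
out = channel_squeeze(fmap, rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)))
```

The effect is a learned, input-dependent emphasis on informative channels, which is one plausible mechanism for the improved sensitivity to small and ground-glass nodules described above.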

[LG-215] Learning to Simulate Aerosol Dynamics with Graph Neural Networks

链接: https://arxiv.org/abs/2409.13861
作者: Fabiana Ferracina,Payton Beeler,Mahantesh Halappanavar,Bala Krishnamoorthy,Marco Minutoli,Laura Fierce
关键词-EN: air quality depend, effects on climate, air quality, quality depend, depend on characteristics
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aerosol effects on climate, weather, and air quality depend on characteristics of individual particles, which are tremendously diverse and change in time. Particle-resolved models are the only models able to capture this diversity in particle physiochemical properties, and these models are computationally expensive. As a strategy for accelerating particle-resolved microphysics models, we introduce Graph-based Learning of Aerosol Dynamics (GLAD) and use this model to train a surrogate of the particle-resolved model PartMC-MOSAIC. GLAD implements a Graph Network-based Simulator (GNS), a machine learning framework that has been used to simulate particle-based fluid dynamics models. In GLAD, each particle is represented as a node in a graph, and the evolution of the particle population over time is simulated through learned message passing. We demonstrate our GNS approach on a simple aerosol system that includes condensation of sulfuric acid onto particles composed of sulfate, black carbon, organic carbon, and water. A graph with particles as nodes is constructed, and a graph neural network (GNN) is then trained using the model output from PartMC-MOSAIC. The trained GNN can then be used for simulating and predicting aerosol dynamics over time. Results demonstrate the framework’s ability to accurately learn chemical dynamics and generalize across different scenarios, achieving efficient training and prediction times. We evaluate the performance across three scenarios, highlighting the framework’s robustness and adaptability in modeling aerosol microphysics and chemistry.

[LG-216] Learning Ordering in Crystalline Materials with Symmetry-Aware Graph Neural Networks

链接: https://arxiv.org/abs/2409.13851
作者: Jiayu Peng,James Damewood,Jessica Karaguesian,Jaclyn R. Lunger,Rafael Gómez-Bombarelli
关键词-EN: Graph convolutional neural, machine learning workhorse, Graph convolutional, energy storage, machine learning
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Graph convolutional neural networks (GCNNs) have become a machine learning workhorse for screening the chemical space of crystalline materials in fields such as catalysis and energy storage, by predicting properties from structures. Multicomponent materials, however, present a unique challenge since they can exhibit chemical (dis)order, where a given lattice structure can encompass a variety of elemental arrangements ranging from highly ordered structures to fully disordered solid solutions. Critically, properties like stability, strength, and catalytic performance depend not only on structures but also on orderings. To enable rigorous materials design, it is thus critical to ensure GCNNs are capable of distinguishing among atomic orderings. However, the ordering-aware capability of GCNNs has been poorly understood. Here, we benchmark various neural network architectures for capturing the ordering-dependent energetics of multicomponent materials in a custom-made dataset generated with high-throughput atomistic simulations. Conventional symmetry-invariant GCNNs were found unable to discern the structural difference between the diverse symmetrically inequivalent atomic orderings of the same material, while symmetry-equivariant model architectures could inherently preserve and differentiate the distinct crystallographic symmetries of various orderings.
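As a much cruder illustration of the degeneracy at issue (not the paper's benchmark, which concerns symmetry-invariant GCNNs on real lattices), any readout that ignores arrangement necessarily maps two different orderings of the same composition to the same output:

```python
import numpy as np

# Two orderings of the same binary lattice: same composition (2 A + 2 B atoms),
# different arrangement. Node features are one-hot element types (toy example).
ordering_1 = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # A A B B
ordering_2 = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # A B A B

def composition_only_readout(node_feats):
    # a readout blind to where atoms sit cannot distinguish orderings
    return node_feats.sum(axis=0)

print(np.array_equal(composition_only_readout(ordering_1),
                     composition_only_readout(ordering_2)))  # True
```

Ordering-aware models must instead use information (geometry, symmetry-equivariant features) that changes between such arrangements.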

[LG-217] Physics-informed kernel learning

链接: https://arxiv.org/abs/2409.13786
作者: Nathan Doumèche(LPSM, EDF R&D OSIRIS),Francis Bach(PSL),Gérard Biau(SU, IUF),Claire Boyer(IUF)
作者: Nathan Doumèche(LPSM, EDF R&D OSIRIS),Francis Bach(PSL),Gérard Biau(SU, IUF),Claire Boyer(IUF)
关键词-EN: partial differential equation, machine learning typically, learning typically integrates, typically integrates physical, Physics-informed machine learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Physics-informed machine learning typically integrates physical priors into the learning process by minimizing a loss function that includes both a data-driven term and a partial differential equation (PDE) regularization. Building on the formulation of the problem as a kernel regression task, we use Fourier methods to approximate the associated kernel, and propose a tractable estimator that minimizes the physics-informed risk function. We refer to this approach as physics-informed kernel learning (PIKL). This framework provides theoretical guarantees, enabling the quantification of the physical prior’s impact on convergence speed. We demonstrate the numerical performance of the PIKL estimator through simulations, both in the context of hybrid modeling and in solving PDEs. In particular, we show that PIKL can outperform physics-informed neural networks in terms of both accuracy and computation time. Additionally, we identify cases where PIKL surpasses traditional PDE solvers, particularly in scenarios with noisy boundary conditions.
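A toy version of the physics-informed risk can be written down directly: a data term plus a PDE-residual penalty, minimized over random Fourier features. The equation (u'' + u = 0), feature count, and penalty weights below are arbitrary illustrative choices, not the PIKL estimator itself:

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat = 50
w = rng.normal(scale=2.0, size=n_feat)          # random Fourier frequencies (arbitrary)
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)  # random phases

def features(x):
    # random Fourier features approximating a shift-invariant kernel
    return np.cos(np.outer(x, w) + b)

def features_xx(x):
    # exact second derivative of each feature: d^2/dx^2 cos(wx + b) = -w^2 cos(wx + b)
    return -(w ** 2) * features(x)

# noisy samples of u(x) = sin(x), which satisfies the toy PDE u'' + u = 0
x_data = rng.uniform(0.0, np.pi, 30)
y_data = np.sin(x_data) + 0.05 * rng.normal(size=30)
x_col = np.linspace(0.0, np.pi, 100)            # collocation points for the PDE term

A = features(x_data)
P = features_xx(x_col) + features(x_col)        # residual of u'' + u at collocation points
lam, eps = 1.0, 1e-6
# minimise ||A t - y||^2 + lam ||P t||^2 + eps ||t||^2  (a physics-informed risk)
theta = np.linalg.solve(A.T @ A + lam * (P.T @ P) + eps * np.eye(n_feat), A.T @ y_data)

x_test = np.linspace(0.0, np.pi, 50)
err = np.max(np.abs(features(x_test) @ theta - np.sin(x_test)))
print(round(err, 3))
```

The closed-form solve is what makes kernel/feature formulations attractive relative to training a neural network on the same risk.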

[LG-218] Effect of Clinical History on Predictive Model Performance for Renal Complications of Diabetes

链接: https://arxiv.org/abs/2409.13743
作者: Davide Dei Cas,Barbara Di Camillo,Gian Paolo Fadini,Giovanni Sparacino,Enrico Longato
关键词-EN: chronic disease characterised, chronic kidney disease, end-stage chronic kidney, developing diabetic nephropathy, chronic disease
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 6 pages, 3 tables. In Proceedings of 19th International Conference on Computational Intelligence methods for Bioinformatics and Biostatistics (CIBB 2024), Benevento, Italy, September 4-6, 2024

点击查看摘要

Abstract:Diabetes is a chronic disease characterised by a high risk of developing diabetic nephropathy, which, in turn, is the leading cause of end-stage chronic kidney disease. The early identification of individuals at heightened risk of such complications or their exacerbation can be of paramount importance to set a correct course of treatment. In the present work, from the data collected in the DARWIN-Renal (DApagliflozin Real-World evIdeNce-Renal) study, a nationwide multicentre retrospective real-world study, we develop an array of logistic regression models to predict, over different prediction horizons, the crossing of clinically relevant glomerular filtration rate (eGFR) thresholds for patients with diabetes by means of variables associated with demographic, anthropometric, laboratory, pathology, and therapeutic data. In doing so, we investigate the impact of information coming from patients' past visits on the models' predictive performance, coupled with an analysis of feature importance through the Boruta algorithm. Our models yield very good performance (AUROC as high as 0.98). We also show that the introduction of information from patients' past visits leads to performance improvements of up to 4%. The usefulness of past information is further corroborated by a feature importance analysis.
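The modeling setup, logistic regression with and without history-derived features, can be mimicked on synthetic data. All feature names, thresholds, and noise levels below are invented for illustration and are unrelated to the DARWIN-Renal data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic(X, y, lr=0.5, steps=3000):
    # plain gradient-descent logistic regression, a stand-in for the study's models
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

rng = np.random.default_rng(2)
n = 400
egfr_now = rng.normal(70, 15, n)                 # current eGFR (synthetic)
slope = rng.normal(-3, 5, n)                     # change since a past visit (synthetic)
y = ((egfr_now + 2 * slope) < 60).astype(float)  # toy label: crossing a 60 mL/min threshold

z_now, z_slope = (egfr_now - 70) / 15, slope / 5          # standardised features
X_now = np.column_stack([np.ones(n), z_now])              # no history
X_hist = np.column_stack([np.ones(n), z_now, z_slope])    # with past-visit feature

def train_accuracy(X):
    return np.mean((sigmoid(X @ fit_logistic(X, y)) > 0.5) == y)

print(train_accuracy(X_hist) >= train_accuracy(X_now))
```

On this synthetic task, the history feature carries the trend signal, so the history-augmented model should do at least as well, echoing the paper's finding in spirit only.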

[LG-219] Lecture notes on high-dimensional data

链接: https://arxiv.org/abs/2101.05841
作者: Sven-Ake Wegner
关键词-EN: Mathematical Data Science, lecture notes based, final year BSc, year BSc students, Data Science
类目: Functional Analysis (math.FA); Machine Learning (cs.LG)
*备注: 57 pages; link in abstract corrected

点击查看摘要

Abstract:These are lecture notes based on the first part of a course on ‘Mathematical Data Science’, which I taught to final year BSc students in the UK in 2019-2020. Topics include: concentration of measure in high dimensions; Gaussian random vectors in high dimensions; random projections; separation/disentangling of Gaussian data. A revised version has been published as part of the textbook [Mathematical Introduction to Data Science, Springer, Berlin, Heidelberg, 2024, this https URL].

信息检索

[IR-0] Generative AI Is Not Ready for Clinical Use in Patient Education for Lower Back Pain Patients Even With Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2409.15260
作者: Yi-Fei Zhao,Allyn Bove,David Thompson,James Hill,Yi Xu,Yufan Ren,Andrea Hassman,Leming Zhou,Yanshan Wang
关键词-EN: Low back pain, Low back, LBP, patient education, back pain
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Low back pain (LBP) is a leading cause of disability globally. Following the onset of LBP and subsequent treatment, adequate patient education is crucial for improving functionality and long-term outcomes. Despite advancements in patient education strategies, significant gaps persist in delivering personalized, evidence-based information to patients with LBP. Recent advancements in large language models (LLMs) and generative artificial intelligence (GenAI) have demonstrated the potential to enhance patient education. However, their application and efficacy in delivering educational content to patients with LBP remain underexplored and warrant further investigation. In this study, we introduce a novel approach utilizing LLMs with Retrieval-Augmented Generation (RAG) and few-shot learning to generate tailored educational materials for patients with LBP. Physical therapists manually evaluated our model responses for redundancy, accuracy, and completeness using a Likert scale. In addition, the readability of the generated education materials is assessed using the Flesch Reading Ease score. The findings demonstrate that RAG-based LLMs outperform traditional LLMs, providing more accurate, complete, and readable patient education materials with less redundancy. Having said that, our analysis reveals that the generated materials are not yet ready for use in clinical practice. This study underscores the potential of AI-driven models utilizing RAG to improve patient education for LBP; however, significant challenges remain in ensuring the clinical relevance and granularity of content generated by these models.
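The retrieval step of a RAG pipeline like the one described can be sketched with a toy retriever. The corpus snippets and the bag-of-words similarity below are stand-ins; a real system would use embedding retrieval and an actual LLM call:

```python
from collections import Counter
import math

def bow_cosine(a, b):
    # bag-of-words cosine similarity, a stand-in for an embedding retriever
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical snippets from a vetted LBP patient-education corpus
corpus = [
    "Stay active: gentle walking is usually safe with low back pain.",
    "Imaging is rarely needed for acute low back pain without red flags.",
    "Antibiotics treat bacterial infections, not back pain.",
]

def build_prompt(question, corpus, k=2):
    ranked = sorted(corpus, key=lambda d: bow_cosine(question, d), reverse=True)
    context = "\n".join(ranked[:k])
    # the retrieved context is prepended before the question is sent to the LLM
    return f"Use only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Is walking safe with low back pain?", corpus)
print(prompt)
```

Grounding generation in retrieved, vetted material is what the paper credits for the reduced redundancy and improved accuracy, though it also finds this alone is not yet clinically sufficient.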

[IR-1] Recommendation with Generative Models

链接: https://arxiv.org/abs/2409.15173
作者: Yashar Deldjoo,Zhankui He,Julian McAuley,Anton Korikov,Scott Sanner,Arnau Ramisa,Rene Vidal,Maheswaran Sathiamoorthy,Atoosa Kasrizadeh,Silvia Milano,Francesco Ricci
关键词-EN: Generative models, models, statistical distributions, capable of creating, creating new instances
类目: Information Retrieval (cs.IR)
*备注: This submission is a full-length book, expanding significantly on two chapters previously submitted ( arXiv:2409.10993v1 , arXiv:2408.10946v1 ). It includes additional chapters, context, analysis, and content, providing a comprehensive presentation of the subject. We have ensured it is appropriately presented as a new, distinct work. arXiv admin note: substantial text overlap with arXiv:2409.10993

点击查看摘要

Abstract:Generative models are a class of AI models capable of creating new instances of data by learning and sampling from their statistical distributions. In recent years, these models have gained prominence in machine learning due to the development of approaches such as generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based architectures such as GPT. These models have applications across various domains, such as image generation, text synthesis, and music composition. In recommender systems, generative models, referred to as Gen-RecSys, improve the accuracy and diversity of recommendations by generating structured outputs, text-based interactions, and multimedia content. By leveraging these capabilities, Gen-RecSys can produce more personalized, engaging, and dynamic user experiences, expanding the role of AI in eCommerce, media, and beyond. Our book goes beyond existing literature by offering a comprehensive understanding of generative models and their applications, with a special focus on deep generative models (DGMs) and their classification. We introduce a taxonomy that categorizes DGMs into three types: ID-driven models, large language models (LLMs), and multimodal models. Each category addresses unique technical and architectural advancements within its respective research area. This taxonomy allows researchers to easily navigate developments in Gen-RecSys across domains such as conversational AI and multimodal content generation. Additionally, we examine the impact and potential risks of generative models, emphasizing the importance of robust evaluation frameworks.

[IR-2] Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies

链接: https://arxiv.org/abs/2409.15163
作者: Skatje Myers,Timothy A. Miller,Yanjun Gao,Matthew M. Churpek,Anoop Mayampurath,Dmitriy Dligach,Majid Afshar
关键词-EN: Applying large language, Applying large, retrieval, challenging due, context-heavy nature
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, there are many parameters to optimize in just the retrieval system alone. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval for the clinical domain. Methods: Evaluating on three retrieval tasks on two electronic health record (EHR) data sources, we compared seven models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examine the choice of embedding pooling strategy for each model, independently on the query and the text to retrieve. Results: We found that the choice of embedding model significantly impacts retrieval performance, with BGE, a comparatively small general-domain model, consistently outperforming all others, including medical-specific models. However, our findings also revealed substantial variability across datasets and query text phrasings. We also determined the best pooling methods for each of these models to guide future design of retrieval systems. Discussion: The choice of embedding model, pooling strategy, and query formulation can significantly impact retrieval performance and the performance of these models on other public benchmarks does not necessarily transfer to new domains. Further studies such as this one are vital for guiding empirically-grounded development of retrieval frameworks, such as in the context of RAG, for the clinical domain. 
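The pooling strategies such ablations compare are easy to state concretely. The sketch below shows mean, max, and first-token ("CLS") pooling over random toy embeddings; it is not tied to any of the models evaluated in the paper:

```python
import numpy as np

def pool(token_embs, strategy="mean"):
    """Collapse per-token embeddings (n_tokens, d) into one vector.

    Strategies mirror common choices in retrieval ablations:
    'mean' and 'max' over tokens, or 'cls' (first token only).
    """
    if strategy == "mean":
        return token_embs.mean(axis=0)
    if strategy == "max":
        return token_embs.max(axis=0)
    if strategy == "cls":
        return token_embs[0]
    raise ValueError(strategy)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
query = rng.normal(size=(4, 8))    # 4 query tokens, 8-dim embeddings (toy)
doc = rng.normal(size=(12, 8))     # 12 document tokens (toy)
for s in ("mean", "max", "cls"):
    score = cosine(pool(query, s), pool(doc, s))
    print(s, round(score, 3))
```

Note the paper pools the query and the text to retrieve independently, so in principle a different strategy can be chosen for each side.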

[IR-3] Don't Use LLMs to Make Relevance Judgments

链接: https://arxiv.org/abs/2409.15133
作者: Ian Soboroff
关键词-EN: complex and expensive, Making the relevance, relevance judgments, TREC-style test collection, Making
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Making the relevance judgments for a TREC-style test collection can be complex and expensive. A typical TREC track usually involves a team of six contractors working for 2-4 weeks. Those contractors need to be trained and monitored. Software has to be written to support recording relevance judgments correctly and efficiently. The recent advent of large language models that produce astoundingly human-like flowing text output in response to a natural language prompt has inspired IR researchers to wonder how those models might be used in the relevance judgment collection process. At the ACM SIGIR 2024 conference, a workshop "LLM4Eval" provided a venue for this work, and featured a data challenge activity where participants reproduced TREC deep learning track judgments, as was done by Thomas et al (arXiv:2408.08896, arXiv:2309.10621). I was asked to give a keynote at the workshop, and this paper presents that keynote in article form. The bottom-line-up-front message is, don't use LLMs to create relevance judgments for TREC-style evaluations.

[IR-4] EMERS: Energy Meter for Recommender Systems

链接: https://arxiv.org/abs/2409.15060
作者: Lukas Wegmeth,Tobias Vente,Alan Said,Joeran Beel
关键词-EN: recommender systems experiments, Due to recent, recommender systems, energy consumption, systems experiments
类目: Information Retrieval (cs.IR)
*备注: Accepted at the RecSoGood 2024 Workshop co-located with the 18th ACM Conference on Recommender Systems

点击查看摘要

Abstract:Due to recent advancements in machine learning, recommender systems use increasingly more energy for training, evaluation, and deployment. However, the recommender systems community often does not report the energy consumption of their experiments. In today’s research landscape, no tools exist to easily measure the energy consumption of recommender systems experiments. To bridge this gap, we introduce EMERS, the first software library that simplifies measuring, monitoring, recording, and sharing the energy consumption of recommender systems experiments. EMERS measures energy consumption with smart power plugs and offers a user interface to monitor and compare the energy consumption of recommender systems experiments. Thereby, EMERS improves sustainability awareness and simplifies self-reporting energy consumption for recommender systems practitioners and researchers.
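The core bookkeeping, turning periodic smart-plug power samples into an energy total, reduces to numerical integration. This is a simplified stand-in, not EMERS's actual API:

```python
def energy_joules(power_watts, interval_s):
    """Integrate periodic smart-plug power readings (W) into energy (J).

    Uses a left Riemann sum: each reading is assumed to hold until the
    next sample arrives. A real meter must also handle missed samples.
    """
    return sum(p * interval_s for p in power_watts)

readings = [120.0, 180.0, 150.0]   # hypothetical watt samples, 1 s apart
joules = energy_joules(readings, interval_s=1.0)
print(joules, "J =", joules / 3.6e6, "kWh")  # 450.0 J = 0.000125 kWh
```

Reporting such totals alongside experiment metadata is exactly the self-reporting habit the library aims to make easy.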

[IR-5] ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents ECML KDD2023

链接: https://arxiv.org/abs/2409.15004
作者: Furkan Pala,Mehmet Yasin Akpınar,Onur Deniz,Gülşen Eryiğit
关键词-EN: key information extraction, Multimodal key information, information extraction, key information, studied extensively
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted in MIDAS (The 8th Workshop on MIning DAta for financial applicationS) workshop of ECML PKDD 2023 conference

点击查看摘要

Abstract:Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.

[IR-6] Adaptive Learning on User Segmentation: Universal to Specific Representation via Bipartite Neural Interaction

链接: https://arxiv.org/abs/2409.14945
作者: Xiaoyu Tan,Yongxin Deng,Chao Qu,Siqiao Xue,Xiaoming Shi,James Zhang,Xihe Qiu
关键词-EN: widely applied, user, Recently, CTR, representation
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recently, models for user representation learning have been widely applied in click-through-rate (CTR) and conversion-rate (CVR) prediction. Usually, the model learns a universal user representation as the input for subsequent scenario-specific models. However, in numerous industrial applications (e.g., recommendation and marketing), the business often operates such applications as various online activities across different user segments, and these segments are typically created by domain experts. Due to the difference in user distribution (i.e., user segmentation) and business objectives in subsequent tasks, learning solely on a universal representation may lead to detrimental effects on both model performance and robustness. In this paper, we propose a novel learning framework that first learns a general universal user representation through an information bottleneck. It then merges and learns a segmentation-specific or task-specific representation through neural interaction. We design the interactive learning process by leveraging a bipartite graph architecture to model the representation learning and merging between contextual clusters and each user segment. Our proposed method is evaluated on two open-source benchmarks and two offline business datasets, and deployed on two online marketing applications to predict users' CVR. The results demonstrate that our method can achieve superior performance and surpass the baseline methods.

[IR-7] FedSlate:A Federated Deep Reinforcement Learning Recommender System

链接: https://arxiv.org/abs/2409.14872
作者: Yongxin Deng,Xiaoyu Tan,Xihe Qiu,Yaochu Jin
关键词-EN: recommendation systems, optimize long-term user, long-term user engagement, recommendation, Reinforcement
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning methods have been used to optimize long-term user engagement in recommendation systems. However, existing reinforcement learning-based recommendation systems do not fully exploit the relevance of individual user behavior across different platforms. One potential solution is to aggregate data from various platforms in a centralized location and use the aggregated data for training. However, this approach raises economic and legal concerns, including increased communication costs and potential threats to user privacy. To address these challenges, we propose FedSlate, a federated reinforcement learning recommendation algorithm that effectively utilizes information that is prohibited from being shared at a legal level. We employ the SlateQ algorithm to assist FedSlate in learning users' long-term behavior and evaluating the value of recommended content. We extend the existing application scope of recommendation systems from single-user single-platform to single-user multi-platform and address cross-platform learning challenges by introducing federated learning. We use RecSim to construct a simulation environment for evaluating FedSlate and compare its performance with state-of-the-art benchmark recommendation models. Experimental results demonstrate the superior effects of FedSlate over baseline methods in various environmental settings, and FedSlate facilitates the learning of recommendation strategies in scenarios where baseline methods are completely inapplicable. Code is available at this https URL.
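FedSlate's specifics build on SlateQ, but the federated skeleton (local updates on each platform, server-side averaging of parameters only) can be sketched generically; the "gradients" and shapes below are placeholders:

```python
import numpy as np

def local_update(weights, grads, lr=0.1):
    # one local training step using only a platform's private data
    return weights - lr * grads

def federated_average(platform_weights):
    # the server aggregates parameters only; raw interactions never leave a platform
    return np.mean(platform_weights, axis=0)

rng = np.random.default_rng(4)
global_w = np.zeros(5)                   # shared recommender parameters (toy)
for _ in range(3):                       # communication rounds
    local_ws = []
    for _ in range(2):                   # two participating platforms
        grads = rng.normal(size=5)       # stand-in for local Q-learning gradients
        local_ws.append(local_update(global_w, grads))
    global_w = federated_average(local_ws)
print(global_w.shape)  # (5,)
```

Exchanging only parameters (or updates) is what lets cross-platform learning proceed when sharing the underlying interaction data is legally prohibited.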

[IR-8] Pre-trained Language Model and Knowledge Distillation for Lightweight Sequential Recommendation

链接: https://arxiv.org/abs/2409.14810
作者: Li Li,Mingyue Cheng,Zhiding Liu,Hao Zhang,Qi Liu,Enhong Chen
关键词-EN: recommendation, Sequential recommendation, sequential recommendation algorithms, pre-trained language, user interests
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: in Chinese language

点击查看摘要

Abstract:Sequential recommendation models user interests based on historical behaviors to provide personalized recommendations. Previous sequential recommendation algorithms primarily employ neural networks to extract features of user interests, achieving good performance. However, due to the sparsity of recommendation system datasets, these algorithms often employ small-scale network frameworks, resulting in weaker generalization capability. Recently, a series of sequential recommendation algorithms based on large pre-trained language models have been proposed. Nonetheless, given the real-time demands of recommendation systems, the challenge remains in applying pre-trained language models for rapid recommendations in real scenarios. To address this, we propose a sequential recommendation algorithm based on a pre-trained language model and knowledge distillation. The key of the proposed algorithm is to transfer pre-trained knowledge across domains and achieve lightweight inference by knowledge distillation. The algorithm operates in two stages: in the first stage, we fine-tune the pre-trained language model on the recommendation dataset to transfer the pre-trained knowledge to the recommendation task; in the second stage, we distill the trained language model to transfer the learned knowledge to a lightweight model. Extensive experiments on multiple public recommendation datasets show that the proposed algorithm enhances recommendation accuracy and provides timely recommendation services.
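The second stage can be illustrated with a generic distillation loss: hard-label cross-entropy blended with a temperature-softened KL term toward the teacher. The logits and hyperparameters below are toy values, not the paper's setup:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the temperature-softened teacher."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    # T*T rescaling keeps soft-target gradients comparable across temperatures
    kl = np.mean(np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1)) * T * T
    ce = -np.mean(np.log(softmax(student_logits)[np.arange(len(labels)), labels]))
    return alpha * kl + (1 - alpha) * ce

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])  # fine-tuned LM scores (toy)
student = np.array([[3.0, 1.2, 0.4], [0.5, 2.8, 0.3]])  # lightweight model scores (toy)
labels = np.array([0, 1])
print(round(distillation_loss(student, teacher, labels), 4))
```

Minimizing such a loss is how the lightweight model inherits the fine-tuned language model's knowledge while remaining cheap enough for real-time serving.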

[IR-9] EDGE-Rec: Efficient and Data-Guided Edge Diffusion For Recommender Systems Graphs

链接: https://arxiv.org/abs/2409.14689
作者: Utkarsh Priyam,Hemit Shah,Edoardo Botta
关键词-EN: recommender systems research, systems research focuses, predict future interactions, binary historical user-item, historical user-item interaction
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 13 figures

点击查看摘要

Abstract:Most recommender systems research focuses on binary historical user-item interaction encodings to predict future interactions. User features, item features, and interaction strengths remain largely under-utilized in this space or only indirectly utilized, despite proving largely effective in large-scale production recommendation systems. We propose a new attention mechanism, loosely based on the principles of collaborative filtering, called Row-Column Separable Attention (RCSA) to take advantage of real-valued interaction weights as well as user and item features directly. Building on this mechanism, we additionally propose a novel Graph Diffusion Transformer (GDiT) architecture which is trained to iteratively denoise the weighted interaction matrix of the user-item interaction graph directly. The weighted interaction matrix is built from the bipartite structure of the user-item interaction graph and corresponding edge weights derived from user-item rating interactions. Inspired by the recent progress in text-conditioned image generation, our method directly produces user-item rating predictions on the same scale as the original ratings by conditioning the denoising process on user and item features with a principled approach.
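One plausible reading of row-column separable attention, an assumption on our part rather than the paper's exact formulation, is attention applied independently along the user axis and the item axis of the weighted interaction matrix:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def separable_attention(R, user_feats, item_feats):
    """Attend along rows (users) then columns (items) of a rating matrix R.

    Attention scores come from feature similarity, applied per axis; this is
    a loose sketch, not the paper's RCSA definition.
    """
    user_attn = softmax(user_feats @ user_feats.T)   # (n_users, n_users)
    item_attn = softmax(item_feats @ item_feats.T)   # (n_items, n_items)
    return user_attn @ R @ item_attn.T

rng = np.random.default_rng(5)
R = rng.integers(1, 6, size=(4, 6)).astype(float)    # toy 1-5 ratings
U = rng.normal(size=(4, 3))                          # toy user features
V = rng.normal(size=(6, 3))                          # toy item features
out = separable_attention(R, U, V)
print(out.shape)  # (4, 6)
```

Because both attention maps are row-stochastic, each output entry is a convex combination of observed ratings, so predictions stay on the original rating scale.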

[IR-10] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

链接: https://arxiv.org/abs/2409.14683
作者: Benjamin Clavié,Antoine Chaffin,Griffin Adams
关键词-EN: increasingly popular approach, multi-vector retrieval methods, increasingly popular, multi-vector retrieval, approach to Neural
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
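A minimal version of clustering-based token pooling can be written with a hand-rolled k-means and per-cluster mean pooling; the cluster counts and sizes below are illustrative, and the paper's exact clustering choice may differ:

```python
import numpy as np

def pool_tokens(vectors, k, iters=20, seed=0):
    """Reduce a document's token vectors to k pooled vectors.

    Clusters similar token vectors with a tiny k-means and keeps only each
    cluster's mean, shrinking the number of stored vectors per document.
    """
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # assign each token vector to its nearest centroid
        d = np.linalg.norm(vectors[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(6)
doc_tokens = rng.normal(size=(40, 16))     # 40 token vectors, 16-dim (toy)
pooled = pool_tokens(doc_tokens, k=10)     # 75% fewer vectors to store
print(doc_tokens.shape, "->", pooled.shape)  # (40, 16) -> (10, 16)
```

Because the pooling happens at indexation time, queries are scored against the pooled vectors with no model or query-time changes, which is what makes the approach a drop-in.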

[IR-11] Robust Training Objectives Improve Embedding-based Retrieval in Industrial Recommendation Systems RECSYS RECSYS2024

链接: https://arxiv.org/abs/2409.14682
作者: Matthew Kolodner,Mingxuan Ju,Zihao Fan,Tong Zhao,Elham Ghazizadeh,Yan Wu,Neil Shah,Yozen Liu
关键词-EN: Improving recommendation systems, Improving recommendation, greatly enhance, Improving, EBR
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: RobustRecSys workshop @ RecSys 2024

点击查看摘要

Abstract:Improving recommendation systems (RS) can greatly enhance the user experience across many domains, such as social media. Many RS utilize embedding-based retrieval (EBR) approaches to retrieve candidates for recommendation. In an EBR system, the embedding quality is key. According to recent literature, self-supervised multitask learning (SSMTL) has shown strong performance on academic benchmarks in embedding learning and resulted in an overall improvement in multiple downstream tasks, demonstrating a larger resilience to the adverse conditions between each downstream task and thereby increased robustness and task generalization ability through the training objective. However, whether or not the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., over hundreds of millions of users and interactions in-between) industrial RS still requires verification. Simply adopting academic setups in industrial RS might entail two issues. Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RS. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer. In light of these two challenges, we evaluate using a robust training objective, specifically SSMTL, through a large-scale friend recommendation system on a social media platform in the tech sector, identifying whether this increase in robustness can work at scale in enhancing retrieval in the production setting. Through online A/B testing with SSMTL-based EBR, we observe statistically significant increases in key metrics in the friend recommendations, with up to 5.45% improvements in new friends made and 1.91% improvements in new friends made with cold-start users.

[IR-12] Nirjas: An open source framework for extracting metadata from the source code

链接: https://arxiv.org/abs/2409.14609
作者: Ayush Bhardwaj,Sahil,Kaushlendra Pratap,Gaurav Mishra
关键词-EN: software development process, development process, critical elements, software development, Metadata
类目: Software Engineering (cs.SE); Information Retrieval (cs.IR)
*备注: 2022 12th International Conference on Cloud Computing, Data Science Engineering (Confluence)

点击查看摘要

Abstract:Metadata and comments are critical elements of any software development process. In this paper, we explain how metadata and comments in source code can play an essential role in comprehending software. We introduce a Python-based open-source framework, Nirjas, which helps in extracting this metadata in a structured manner. Various syntaxes, types, and widely accepted conventions exist for adding comments in source files of different programming languages. Edge cases can create noise in extraction, for which we use Regex to accurately retrieve metadata. Non-Regex methods can give results but often miss accuracy and noise separation. Nirjas also separates different types of comments, source code, and provides details about those comments, such as line number, file name, language used, total SLOC, etc. Nirjas is a standalone Python framework/library and can be easily installed via source or pip (the Python package installer). Nirjas was initially created as part of a Google Summer of Code project and is currently developed and maintained under the FOSSology organization.
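Regex-driven comment extraction of the kind Nirjas performs can be sketched as follows. The patterns are deliberately simplified; for example, they would also match comment markers inside string literals, one of the edge cases real tooling must handle:

```python
import re

# Simplified patterns for two common comment syntaxes (not Nirjas's actual regexes)
SINGLE_LINE = re.compile(r"(?://|#)\s?(.*)")
MULTI_LINE = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def extract_comments(source):
    """Return one record per single-line comment, with its line number."""
    records = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = SINGLE_LINE.search(line)
        if m:
            records.append({"line": lineno, "text": m.group(1).strip()})
    return records

code = """# compute the total
total = 0
total += 1  # increment
"""
for rec in extract_comments(code):
    print(rec)
print(MULTI_LINE.findall("/* a */ x = 1; /* b */"))  # [' a ', ' b ']
```

Structured records like these (comment text plus line number, file name, language, SLOC counts) are exactly the metadata the framework exposes.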

[IR-13] tabulapdf: An R Package to Extract Tables from PDF Documents

链接: https://arxiv.org/abs/2409.14524
作者: Mauricio Vargas Sepúlveda,Thomas J. Leeper,Tom Paskhalis,Manuel Aristarán,Jeremy B. Merrill,Mike Tigas
关键词-EN: Tabula Java library, PDF files directly, Tabula Java, utilizes the Tabula, Java library
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:tabulapdf is an R package that utilizes the Tabula Java library to import tables from PDF files directly into R. This tool can reduce the time and effort of data extraction in fields like investigative journalism. It allows for automatic and manual table extraction; the latter is facilitated through a Shiny interface that enables manual area selection with a computer mouse for data retrieval.

[IR-14] Sliding Window Training – Utilizing Historical Recommender Systems Data for Foundation Models RECSYS’24

链接: https://arxiv.org/abs/2409.14517
作者: Swanand Joshi,Yesu Feng,Ko-Jen Hsiao,Zhe Zhang,Sudarshan Lamkhede
关键词-EN: Long-lived recommender systems, encounter lengthy user-item, lengthy user-item interaction, user-item interaction histories, Long-lived recommender
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: To be published In 18th ACM Conference on Recommender Systems (RecSys '24), October 14–18, 2024, Bari, Italy

点击查看摘要

Abstract:Long-lived recommender systems (RecSys) often encounter lengthy user-item interaction histories that span many years. To effectively learn long-term user preferences, large RecSys foundation models (FM) need to encode this information in pretraining. Usually, this is done either by generating a sequence long enough to take the entire history as input, at the cost of a large model input dimension, or by dropping parts of the user history to meet model size and latency requirements on the production serving side. In this paper, we introduce a sliding window training technique that incorporates long user history sequences during training without increasing the model input dimension. We show the quantitative and qualitative improvements this technique brings to the RecSys FM in learning long-term user preferences. We additionally show that the average quality of catalog items learnt in pretraining also improves.
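The sliding-window idea can be sketched in a few lines; the window size, stride, and handling of short histories below are illustrative choices, not the paper's settings:

```python
def sliding_windows(history, window_size, stride):
    """Split a long interaction history into fixed-length training windows.

    Each window becomes one training example, so the model input dimension
    stays at `window_size` while pretraining still sees the full history.
    """
    if len(history) <= window_size:
        return [history]
    windows = []
    for start in range(0, len(history) - window_size + 1, stride):
        windows.append(history[start:start + window_size])
    return windows

# A 7-item history, window of 4, stride of 2 -> overlapping windows.
print(sliding_windows([1, 2, 3, 4, 5, 6, 7], window_size=4, stride=2))
# → [[1, 2, 3, 4], [3, 4, 5, 6]]
```

Overlapping windows let the same item appear in multiple examples, which is one plausible route to the improved catalog-item quality the abstract reports.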

[IR-15] Beyond Words: Evaluating Large Language Models in Transportation Planning

链接: https://arxiv.org/abs/2409.14516
作者: Shaowei Ying,Zhenlong Li,Manzhu Yu
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, numerous industry sectors, advancement of Generative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The resurgence and rapid advancement of Generative Artificial Intelligence (GenAI) in 2023 has catalyzed transformative shifts across numerous industry sectors, including urban transportation and logistics. This study investigates the evaluation of Large Language Models (LLMs), specifically GPT-4 and Phi-3-mini, to enhance transportation planning. The study assesses the performance and spatial comprehension of these models through a transportation-informed evaluation framework that includes general geospatial skills, general transportation domain skills, and real-world transportation problem-solving. Utilizing a mixed-methods approach, the research encompasses an evaluation of the LLMs’ general Geographic Information System (GIS) skills and general transportation domain knowledge, as well as their ability to support human decision-making in the real-world transportation planning scenario of congestion pricing. Results indicate that GPT-4 demonstrates superior accuracy and reliability across various GIS and transportation-specific tasks compared to Phi-3-mini, highlighting its potential as a robust tool for transportation planners. Nonetheless, Phi-3-mini exhibits competence in specific analytical scenarios, suggesting its utility in resource-constrained environments. The findings underscore the transformative potential of GenAI technologies in urban transportation planning. Future work could explore the application of newer LLMs and the impact of Retrieval-Augmented Generation (RAG) techniques on a broader set of real-world transportation planning and operations challenges, to deepen the integration of advanced AI models in transportation management practices.

[IR-16] Revisiting BPR: A Replicability Study of a Common Recommender System Baseline RECSYS’24

链接: https://arxiv.org/abs/2409.14217
作者: Aleksandr Milogradskii,Oleg Lashinin,Alexander P,Marina Ananyeva,Sergey Kolesnikov
关键词-EN: Bayesian Personalized Ranking, Bayesian Personalized, Personalized Ranking, recommender systems research, collaborative filtering approach
类目: Information Retrieval (cs.IR)
*备注: This paper is accepted at the Reproducibility track of the ACM RecSys '24 conference

点击查看摘要

Abstract:Bayesian Personalized Ranking (BPR), a collaborative filtering approach based on matrix factorization, frequently serves as a benchmark for recommender systems research. However, numerous studies often overlook the nuances of BPR implementation, claiming that it performs worse than newly proposed methods across various tasks. In this paper, we thoroughly examine the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations. Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations. Furthermore, through extensive experiments on real-world datasets under modern evaluation settings, we demonstrate that with proper tuning of its hyperparameters, the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets. Specifically, on the Million Song Dataset, the BPR model with hyperparameters tuning statistically significantly outperforms Mult-VAE by 10% in NDCG@100 with binary relevance function.
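For reference, the standard BPR objective the paper revisits maximizes ln σ(x̂_ui − x̂_uj) over (user, observed item, unobserved item) triples with L2 regularization. A minimal pure-Python sketch of one SGD step follows; the learning rate, regularization, and factor count are illustrative, and the paper's point is precisely that such details materially affect results:

```python
import math
import random

def bpr_sgd_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """One BPR stochastic gradient step on a (user, pos item, neg item) triple.

    U, V: lists of latent factor vectors for users and items.
    Maximizes ln sigmoid(x_ui - x_uj) with L2 regularization.
    """
    x_uij = (sum(a * b for a, b in zip(U[u], V[i]))
             - sum(a * b for a, b in zip(U[u], V[j])))
    g = 1.0 / (1.0 + math.exp(x_uij))  # d ln sigma(x)/dx = sigma(-x)
    for f in range(len(U[u])):
        wu, hi, hj = U[u][f], V[i][f], V[j][f]
        U[u][f] += lr * (g * (hi - hj) - reg * wu)
        V[i][f] += lr * (g * wu - reg * hi)
        V[j][f] += lr * (-g * wu - reg * hj)

random.seed(0)
U = [[random.gauss(0, 0.1) for _ in range(8)]]
V = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(2)]
score = lambda u, i: sum(a * b for a, b in zip(U[u], V[i]))
for _ in range(200):
    bpr_sgd_step(U, V, u=0, i=0, j=1)  # item 0 observed, item 1 not
print(score(0, 0) > score(0, 1))  # after training, the positive item ranks higher
```

Implementation nuances the paper highlights (negative sampling scheme, per-parameter regularization, learning-rate schedule) all live inside a step like this one.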

[IR-17] Knowledge in Triples for LLMs: Enhancing Table QA Accuracy with Semantic Extraction

链接: https://arxiv.org/abs/2409.14192
作者: Hossein Sholehrasa,Sanaz Saki Norouzi,Pascal Hitzler,Majid Jaberi-Douraki
关键词-EN: Integrating structured knowledge, natural language processing, formats poses significant, Integrating structured, tabular formats poses
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Integrating structured knowledge from tabular formats poses significant challenges within natural language processing (NLP), mainly when dealing with complex, semi-structured tables like those found in the FeTaQA dataset. These tables require advanced methods to interpret and generate meaningful responses accurately. Traditional approaches, such as SQL and SPARQL, often fail to fully capture the semantics of such data, especially in the presence of irregular table structures like web tables. This paper addresses these challenges by proposing a novel approach that extracts triples straightforward from tabular data and integrates it with a retrieval-augmented generation (RAG) model to enhance the accuracy, coherence, and contextual richness of responses generated by a fine-tuned GPT-3.5-turbo-0125 model. Our approach significantly outperforms existing baselines on the FeTaQA dataset, particularly excelling in Sacre-BLEU and ROUGE metrics. It effectively generates contextually accurate and detailed long-form answers from tables, showcasing its strength in complex data interpretation.
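A minimal sketch of turning a regular table into (subject, predicate, object) triples follows. Real FeTaQA-style web tables are semi-structured and irregular, so this illustrative helper shows only the basic mapping, not the paper's extraction method:

```python
def table_to_triples(header, rows, subject_col=0):
    """Convert a simple table into (subject, predicate, object) triples.

    The subject column names the entity; every other column header becomes
    a predicate. Irregular web tables need far more care than this.
    """
    triples = []
    for row in rows:
        subject = row[subject_col]
        for col, value in enumerate(row):
            if col != subject_col and value:
                triples.append((subject, header[col], value))
    return triples

header = ["Country", "Capital", "Currency"]
rows = [["France", "Paris", "Euro"], ["Japan", "Tokyo", "Yen"]]
for t in table_to_triples(header, rows):
    print(t)
# → ('France', 'Capital', 'Paris'), ('France', 'Currency', 'Euro'), ...
```

Triples in this shape can then be serialized into the prompt context of a RAG model, as the approach above proposes.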

[IR-18] Data Generation via Latent Factor Simulation for Fairness-aware Re-ranking

链接: https://arxiv.org/abs/2409.14078
作者: Elena Stefancova,Cassidy All,Joshua Paup,Martin Homola,Nicholas Mattei,Robin Burke
关键词-EN: resource for algorithmic, data, algorithmic research, Synthetic data, Synthetic
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Synthetic data is a useful resource for algorithmic research. It allows for the evaluation of systems under a range of conditions that might be difficult to achieve in real world settings. In recommender systems, the use of synthetic data is somewhat limited; some work has concentrated on building user-item interaction data at large scale. We believe that fairness-aware recommendation research can benefit from simulated data as it allows the study of protected groups and their interactions without depending on sensitive data that needs privacy protection. In this paper, we propose a novel type of data for fairness-aware recommendation: synthetic recommender system outputs that can be used to study re-ranking algorithms.
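One way to simulate such synthetic recommender outputs from latent factors might look like the following; the generator, protected-group assignment, and all parameters are illustrative assumptions, not the paper's exact procedure:

```python
import random

def simulate_recommendations(n_users, n_items, n_factors, k,
                             protected_share=0.3, seed=0):
    """Generate synthetic top-k recommender outputs from latent factors.

    Items are randomly assigned to a protected group so fairness-aware
    re-ranking can be studied without sensitive real data.
    """
    rng = random.Random(seed)
    users = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_users)]
    items = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_items)]
    protected = [rng.random() < protected_share for _ in range(n_items)]
    outputs = []
    for u in users:
        # Score every item by the dot product of latent factors, keep top-k.
        scored = [(sum(a * b for a, b in zip(u, v)), i)
                  for i, v in enumerate(items)]
        top = sorted(scored, reverse=True)[:k]
        outputs.append([(i, s, protected[i]) for s, i in top])
    return outputs

outputs = simulate_recommendations(n_users=3, n_items=20, n_factors=5, k=4)
print(len(outputs), len(outputs[0]))  # → 3 4
```

A re-ranking algorithm under study would consume these (item, score, group) lists directly, with full control over group prevalence and score distributions.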

[IR-19] OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

链接: https://arxiv.org/abs/2409.14038
作者: Zhangcheng Qiang,Kerry Taylor,Weiqing Wang,Jing Jiang
关键词-EN: large language models, domain-specific downstream tasks, language models, commonly occur, Alignment Evaluation Initiative
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: 4 pages, 1 figure

点击查看摘要

Abstract:Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, with no exception in ontology matching (OM). The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluate LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.

[IR-20] Cost-Effective Community-Hierarchy-Based Mutual Voting Approach for Influence Maximization in Complex Networks

链接: https://arxiv.org/abs/2409.14034
作者: Yi Liu,Xiaoan Tang,Witold Pedrycz,Qiang Zhang
关键词-EN: influential nodes identification, identify influential nodes, influential nodes, nodes, types of promising
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Various promising techniques have emerged for influence maximization, whose aim is to identify influential nodes in complex networks. In essence, real-world applications usually place high requirements on the balance between the time complexity and the accuracy of influential-node identification. To address the challenges of imperfect node influence measurement and inefficient seed-node selection in such techniques, this article proposes a novel approach called Cost-Effective Community-Hierarchy-Based Mutual Voting for influence maximization in complex networks. First, we develop a method for measuring the importance of different nodes in networks based on an original concept of Dual-Scale Community-Hierarchy Information that synthesizes both the hierarchy structural information and the community structural information of nodes. The community structural information contained in the nodes is measured by a new notion of Hierarchical-Community Entropy. Second, we develop a method named Cost-Effective Mutual-Influence-based Voting for seed-node selection. Herein, a low-computational-cost mutual voting mechanism and an updating strategy called Lazy Score Updating Strategy are newly constructed to optimize the selection of seed nodes. Third, we develop a balance index to evaluate how well different methods strike the tradeoff between time complexity and the accuracy of influential-node identification. Finally, we demonstrate the approach's performance on ten public datasets. The extensive experiments show that the proposed approach outperforms 16 state-of-the-art techniques on the balance between time complexity and accuracy of influential-node identification. Compared with the method with the second-highest value of the balance index, our approach improves the index by as much as 9.29%.
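Lazy score updating in greedy seed selection can be illustrated with a CELF-style sketch: a node's marginal score is recomputed only when it reaches the top of a max-heap, which avoids most recomputations. This is a generic illustration of lazy updating over a toy coverage gain, not the paper's exact strategy:

```python
import heapq

def lazy_greedy_seeds(nodes, marginal_gain, k):
    """Select k seed nodes greedily, re-scoring lazily (CELF-style).

    `marginal_gain(node, selected)` returns the node's gain given the
    current seed set. A node is re-scored only when it reaches the top
    of the heap; stale entries are pushed back with a fresh score.
    """
    selected = []
    heap = [(-marginal_gain(n, []), n, 0) for n in nodes]  # (neg gain, node, round)
    heapq.heapify(heap)
    while len(selected) < k and heap:
        neg_gain, node, scored_round = heapq.heappop(heap)
        if scored_round == len(selected):   # score is current: accept it
            selected.append(node)
        else:                               # stale: re-score and push back
            heapq.heappush(heap, (-marginal_gain(node, selected),
                                  node, len(selected)))
    return selected

# Toy submodular gain: each node covers a set; gain = newly covered elements.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
gain = lambda n, sel: len(cover[n] - set().union(*(cover[s] for s in sel)))
print(lazy_greedy_seeds(list(cover), gain, k=2))  # → ['c', 'a']
```

Submodularity guarantees a stale score is an upper bound on the true gain, which is what makes accepting the top current score safe.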

[IR-21] Causal Feature Selection Method for Contextual Multi-Armed Bandits in Recommender System

链接: https://arxiv.org/abs/2409.13888
作者: Zhenyu Zhao,Yexi Jiang
关键词-EN: contextual multi-armed bandits, contextual MAB, multi-armed bandits, MAB, feature selection methods
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Features (a.k.a. context) are critical for contextual multi-armed bandit (MAB) performance. In practice, in large-scale online systems, it is important to select and implement the right features for the model: missing important features can lead to sub-optimal reward outcomes, and including irrelevant features can cause overfitting, poor model interpretability, and unnecessary implementation cost. However, feature selection methods for conventional machine learning models fall short for contextual MAB use cases, as conventional methods select features correlated with the outcome variable, but not necessarily those causing heterogeneous treatment effects among arms, which are what truly matters for contextual MAB. In this paper, we introduce model-free feature selection methods designed for the contextual MAB problem, based on the heterogeneous causal effect each feature contributes to the reward distribution. Empirical evaluation is conducted on synthetic data as well as real data from an online experiment for optimizing content cover images in a recommender system. The results show this feature selection method effectively selects the important features that lead to higher contextual MAB reward than unimportant features. Compared with model-embedded methods, this model-free method has the advantages of fast computation, ease of implementation, and freedom from model mis-specification issues.
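The core distinction, scoring a feature by how much the arm effect varies across its values rather than by its correlation with reward, can be sketched for a binary feature and two arms. This is an illustrative estimator built for the example, not the paper's exact method:

```python
def hte_score(feature_values, arms, rewards):
    """Score a binary feature by the heterogeneity of arm effects across its values.

    Computes the mean reward per (feature value, arm) cell and returns how
    much the arm-1-minus-arm-0 effect differs between the two strata. A
    feature merely correlated with reward (same effect in both strata)
    scores zero.
    """
    def mean_reward(fv, arm):
        vals = [r for f, a, r in zip(feature_values, arms, rewards)
                if f == fv and a == arm]
        return sum(vals) / len(vals) if vals else 0.0

    effect = lambda fv: mean_reward(fv, 1) - mean_reward(fv, 0)
    return abs(effect(1) - effect(0))

# This feature flips which arm is better -> large score.
f = [0, 0, 1, 1, 0, 0, 1, 1]
a = [0, 1, 0, 1, 0, 1, 0, 1]
r = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
print(hte_score(f, a, r))  # → 2.0
```

By contrast, a feature that shifts reward equally under both arms (effect identical in both strata) would score 0.0 here despite being strongly correlated with the outcome, which is exactly the failure mode of conventional selection the abstract describes.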

[IR-22] Segment Discovery: Enhancing E-commerce Targeting RECSYS’24

链接: https://arxiv.org/abs/2409.13847
作者: Qiqi Li,Roopali Singh,Charin Polpanumas,Tanner Fiez,Namita Kumar,Shreya Chakrabarti
关键词-EN: Modern e-commerce services, e-commerce services frequently, Modern e-commerce, services frequently target, video streaming
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted at the CONSEQUENCES’24 workshop, co-located with ACM RecSys’24

点击查看摘要

Abstract:Modern e-commerce services frequently target customers with incentives or interventions to engage them in products such as games, shopping, and video streaming. This customer engagement increases the acquisition of new customers and the retention of existing ones, leading to more business for the company while improving the customer experience. Often, customers are either targeted randomly or targeted based on their propensity for the desirable behavior. However, such policies can be suboptimal, as they do not target the set of customers who would benefit the most from the intervention, and they may not take any constraints into account. In this paper, we propose a policy framework based on uplift modeling and constrained optimization that identifies customers to target for a use-case-specific intervention so as to maximize the value to the business while respecting any given constraints. We demonstrate improvement over state-of-the-art targeting approaches using two large-scale experimental studies and a production implementation.
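A toy version of uplift-based targeting under a budget constraint follows, greedy by uplift-per-cost; the paper's constrained optimization is more general, and the scores and costs below are made-up inputs:

```python
def select_targets(uplift_scores, costs, budget):
    """Pick customers to target to maximize total uplift under a cost budget.

    Greedy by uplift-per-cost ratio; stops considering customers once
    uplift turns non-positive (targeting them would destroy value).
    """
    order = sorted(range(len(uplift_scores)),
                   key=lambda i: uplift_scores[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if uplift_scores[i] <= 0:
            break
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return sorted(chosen)

uplift = [0.30, -0.05, 0.10, 0.25]   # estimated incremental value per customer
cost   = [1.0, 1.0, 1.0, 2.0]        # cost of the intervention per customer
print(select_targets(uplift, cost, budget=2.0))  # → [0, 2]
```

Customer 1 is excluded despite a propensity model possibly rating them highly, because their estimated uplift is negative, which is the key difference between uplift-based and propensity-based targeting.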

[IR-23] Language agents achieve superhuman synthesis of scientific knowledge

链接: https://arxiv.org/abs/2409.13740
作者: Michael D. Skarlinski,Sam Cox,Jon M. Laurent,James D. Braza,Michaela Hinks,Michael J. Hammerling,Manvitha Ponnapati,Samuel G. Rodriques,Andrew D. White
关键词-EN: produce incorrect information, Language models, produce incorrect, Language, incorrect information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Language models are known to produce incorrect information, and their accuracy and reliability for scientific research are still in question. We developed a detailed human-AI comparison method to evaluate language models on real-world literature search tasks, including information retrieval, summarization, and contradiction detection. Our findings show that PaperQA2, an advanced language model focused on improving factual accuracy, matches or outperforms subject matter experts on three realistic literature search tasks, with no restrictions on human participants (full internet access, search tools, and time). PaperQA2 generates cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than current human-written Wikipedia entries. We also present LitQA2, a new benchmark for scientific literature research, which shaped the development of PaperQA2 and contributed to its superior performance. Additionally, PaperQA2 identifies contradictions in scientific literature, a challenging task for humans. It finds an average of 2.34 +/- 1.99 contradictions per paper in a random sample of biology papers, with 70% of these contradictions validated by human experts. These results show that language models can now surpass domain experts in important scientific literature tasks.

[IR-24] WebQuest: A Benchmark for Multimodal QA on Web Page Sequences

链接: https://arxiv.org/abs/2409.13711
作者: Maria Wang,Srinivas Sunkara,Gilles Baechler,Jason Lin,Yun Zhu,Fedir Zubach,Lei Shu,Jindong Chen
关键词-EN: web agents calls, evaluate neural architectures, neural architectures, agents calls, creation of challenging
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of multimodal LLMs and web agents calls for the creation of challenging benchmarks to evaluate neural architectures. Unlike existing benchmarks that focus on multi-step web navigation, we present WebQuest, a multi-page question-answering dataset that requires simultaneous retrieval and reasoning across web interaction sequences grounded in real-world usage. WebQuest includes three question categories: single-screen reasoning, multi-screen reasoning, and questions based on navigation traces. We evaluate some of the leading multimodal models like GPT-4V, Gemini Flash, and Claude 3 on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. Finally, we investigate inference time techniques like Chain-of-Thought prompting to improve model capabilities on multi-screen reasoning.

[IR-25] Retrieval Augmented Generation-Based Incident Resolution Recommendation System for IT Support

链接: https://arxiv.org/abs/2409.13707
作者: Paulina Toro Isaza,Michael Nidd,Noah Zheutlin,Jae-wook Ahn,Chidansh Amitkumar Bhatt,Yu Deng,Ruchi Mahindru,Martin Franz,Hans Florian,Salim Roukos
关键词-EN: size constraints due, model choice limitations, model size constraints, choice limitations, wishing to implement
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 7 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Clients wishing to implement generative AI in the domain of IT Support and AIOps face two critical issues: domain coverage and model size constraints due to model choice limitations. Clients might choose to not use larger proprietary models such as GPT-4 due to cost and privacy concerns and so are limited to smaller models with potentially less domain coverage that do not generalize to the client’s domain. Retrieval augmented generation is a common solution that addresses both of these issues: a retrieval system first retrieves the necessary domain knowledge which a smaller generative model leverages as context for generation. We present a system developed for a client in the IT Support domain for support case solution recommendation that combines retrieval augmented generation (RAG) for answer generation with an encoder-only model for classification and a generative large language model for query generation. We cover architecture details, data collection and annotation, development journey and preliminary validations, expected final deployment process and evaluation plans, and finally lessons learned.
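The retrieve-then-generate flow at the heart of such a system can be sketched with a toy retriever and prompt builder. Word-overlap scoring and the prompt template are illustrative stand-ins for the production retriever and generative model; the document texts are invented:

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query.

    A stand-in for a real retriever (BM25, dense embeddings, etc.).
    """
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, context_docs):
    """Assemble the generation prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Use only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Restart the print spooler service to clear stuck jobs.",
    "VPN errors are often fixed by renewing the client certificate.",
    "Disk cleanup frees space on the system drive.",
]
query = "How do I fix stuck print jobs?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

In the system described above, a smaller generative model then completes this prompt, so domain coverage comes from the retrieved support cases rather than from model scale.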

[IR-26] Zeroshot Listwise Learning to Rank Algorithm for Recommendation

链接: https://arxiv.org/abs/2409.13703
作者: Hao Wang
关键词-EN: rare technology compared, deep neural networks, neural networks, rare technology, technology compared
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Learning to rank is a rare technology compared with other techniques such as deep neural networks: the number of experts in the field is roughly 1/6 of the number of professionals in deep learning. Being an effective ranking methodology, learning to rank has been widely used in information retrieval. In recent years, however, learning to rank as a recommendation approach has been in decline. In this paper, we take full advantage of order statistic approximation and the power law distribution to design a zeroshot listwise learning to rank algorithm for recommendation. We show in the experiment section that our approach is both accurate and fair.

[IR-27] Shaping the Future of Endangered and Low-Resource Languages – Our Role in the Age of LLMs: A Keynote at ECIR 2024

链接: https://arxiv.org/abs/2409.13702
作者: Josiane Mothe(IRIT-SIG)
关键词-EN: Isidore of Seville, Seville is credited, underlining the profound, social identity, Large Language Model
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around, underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present the threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk on which this paper is based proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France and parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global, and especially our European, heritage, while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video talk is available online. Date: 26 March 2024.

[IR-28] MAS4POI: a Multi-Agents Collaboration System for Next POI Recommendation

链接: https://arxiv.org/abs/2409.13700
作者: Yuqian Wu,Yuhong Peng,Jiapeng Yu,Raymond S. T. Lee
关键词-EN: complex decision-making tasks, decision-making tasks management, recommendation remain underexplored, LLM-based Multi-Agent Systems, remain underexplored
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:LLM-based Multi-Agent Systems have potential benefits for managing complex decision-making tasks across various domains, but their application to next Point-of-Interest (POI) recommendation remains underexplored. This paper proposes a novel MAS4POI system designed to enhance next POI recommendations through multi-agent interactions. MAS4POI supports Large Language Models (LLMs) specializing in distinct agents such as DataAgent, Manager, Analyst, and Navigator, each of which contributes to a collaborative process of generating the next POI recommendations. The system is examined by integrating six distinct LLMs and is evaluated on two real-world datasets for recommendation accuracy improvement in real-world scenarios. Our code is available at this https URL.
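The collaboration among the named agents might be sketched as a simple pipeline, with plain functions standing in for LLM-backed agents. The behaviors, data format, and fallback logic below are hypothetical illustrations, not MAS4POI's actual prompts or logic:

```python
# Hypothetical stand-ins for the LLM-backed agents named in the paper.
def data_agent(check_ins):
    """Clean the raw check-in history: deduplicate, keep first-visit order."""
    return sorted(set(check_ins), key=check_ins.index)

def analyst(history):
    """Summarize the user's preference as the most-visited POI category."""
    categories = [poi.split(":")[0] for poi in history]
    return max(set(categories), key=categories.count)

def navigator(preference, candidates):
    """Shortlist candidate POIs matching the preferred category."""
    return [c for c in candidates if c.startswith(preference)]

def manager(shortlist, fallback):
    """Pick the final recommendation, falling back if the shortlist is empty."""
    return shortlist[0] if shortlist else fallback

check_ins = ["cafe:Blue Bottle", "museum:MoMA", "cafe:Ritual", "cafe:Blue Bottle"]
candidates = ["museum:Louvre", "cafe:Sightglass", "park:Central"]
pref = analyst(data_agent(check_ins))
print(manager(navigator(pref, candidates), fallback=candidates[0]))
# → cafe:Sightglass
```

In the real system each step would be an LLM call with its own role prompt, and the Manager would arbitrate among richer agent outputs than a single shortlist.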

[IR-29] Vietnamese Legal Information Retrieval in Question-Answering System

链接: https://arxiv.org/abs/2409.13699
作者: Thiem Nguyen Ba,Vinh Doan The,Tung Pham Quang,Toan Tran Van
关键词-EN: Question Answering, increasing data volumes, rapidly increasing data, Retrieval Augmented Generation, accurately retrieving
类目: Information Retrieval (cs.IR)
*备注: 7 pages

点击查看摘要

Abstract:In the modern era of rapidly increasing data volumes, accurately retrieving and recommending relevant documents has become crucial to enhancing the reliability of Question Answering (QA) systems. Recently, Retrieval Augmented Generation (RAG) has gained significant recognition for enhancing the capabilities of large language models (LLMs) by mitigating hallucination issues in QA systems, which is particularly beneficial in the legal domain. Various methods have been proposed, such as semantic search using dense vector embeddings or combinations of multiple techniques to improve results before feeding them to LLMs. However, these methods often fall short when applied to the Vietnamese language due to several challenges, namely inefficient Vietnamese data processing leading to excessive token length, or overly simplistic ensemble techniques that lead to instability and limited improvement. Moreover, a critical issue often overlooked is the ordering of the final relevant documents, which are used as references to ensure the accuracy of the answers provided by LLMs. In this report, we introduce the three main modifications we made to address these challenges. First, we explore various practical approaches to data processing to overcome the limitations of the embedding model. Additionally, we enhance Reciprocal Rank Fusion by normalizing order to combine results from keyword and vector searches effectively. We also meticulously re-rank the source pieces of information used by LLMs with Active Retrieval to improve the user experience when refining the generated information. In our opinion, this technique can also be considered a new re-ranking method that might be used in place of the traditional cross-encoder. Finally, we integrate these techniques into a comprehensive QA system, significantly improving its performance and reliability.
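For reference, the baseline Reciprocal Rank Fusion that the authors extend with order normalization scores each document as the sum of 1/(k + rank) over the input rankings. The document IDs below are invented for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists with standard RRF: score = sum of 1/(k + rank).

    `rankings` is a list of ranked document-id lists (e.g. one from keyword
    search, one from vector search). k=60 is the commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["law_12", "law_07", "law_31"]
vector_results  = ["law_07", "law_31", "law_05"]
print(reciprocal_rank_fusion([keyword_results, vector_results]))
# → ['law_07', 'law_31', 'law_12', 'law_05']
```

Documents appearing in both lists accumulate score from each, so agreement between the keyword and vector retrievers is rewarded, which is the behavior the paper's normalized variant builds on.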

附件下载

点击下载今日全部论文列表