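This post lists the latest papers fetched each day from arXiv. It is updated automatically at about 11:30 every morning and groups papers into five broad areas: NLP, CV, ML, AI, and IR.

Note: to receive the daily paper list by email, leave your email address in the comments. The email is also sent automatically at about 11:30 each day.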


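Overview (2024-06-27)

387 papers were updated today, including:

  • Natural Language Processing: 101 papers (Computation and Language (cs.CL))
  • Computer Vision: 92 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 112 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 107 papers (Machine Learning (cs.LG))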
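The four per-category counts add up to 412, more than the 387 papers reported, presumably because a cross-listed paper is counted once under each of its categories. A minimal sketch of that counting, assuming the daily list arrives as an Atom feed in the shape the arXiv API returns (entries carrying `category` elements with a `term` attribute). The function name `count_by_category` and the two-entry sample feed are made up for illustration and are not taken from the blog's own pipeline.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the feed


def count_by_category(atom_xml, tracked=("cs.CL", "cs.CV", "cs.AI", "cs.LG")):
    """Return (number of unique papers, papers per tracked category)."""
    root = ET.fromstring(atom_xml)
    per_category = Counter()
    seen_ids = set()
    for entry in root.iter(f"{ATOM}entry"):
        paper_id = entry.findtext(f"{ATOM}id")
        if paper_id in seen_ids:
            continue  # count each paper once in the total
        seen_ids.add(paper_id)
        terms = {c.get("term") for c in entry.findall(f"{ATOM}category")}
        for cat in tracked:
            if cat in terms:
                per_category[cat] += 1  # a cross-listed paper counts in each category
    return len(seen_ids), per_category


# Hypothetical two-paper feed: the first paper is cross-listed in cs.CL and cs.LG.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>http://arxiv.org/abs/2406.18534v1</id>
    <category term="cs.CL"/><category term="cs.LG"/></entry>
  <entry><id>http://arxiv.org/abs/2406.18528v1</id>
    <category term="cs.CL"/></entry>
</feed>"""

total, counts = count_by_category(SAMPLE)
print(total, dict(counts))
```

With this sample the per-category counts sum to 3 while only 2 unique papers exist, which mirrors the gap between 412 and 387 above.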

Natural Language Processing

[NLP-0] Towards Compositionality in Concept Learning

Link: https://arxiv.org/abs/2406.18534
Authors: Adam Stein, Aaditya Naik, Yinjun Wu, Mayur Naik, Eric Wong
Keywords: Concept-based interpretability methods, Concept-based interpretability, interpretability methods offer, compositional concept representations, compositional concept
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICML 2024. 26 pages, 10 figures

Abstract:Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts which are not compositional. To automatically discover compositional concept representations, we identify two salient properties of such representations, and propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties. We evaluate CCE on five different datasets over image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at this https URL .

[NLP-1] Symbolic Learning Enables Self-Evolving Agents

Link: https://arxiv.org/abs/2406.18532
Authors: Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang
Keywords: language agents, large language models, agent symbolic learning, tool usage methods, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code available at this https URL

Abstract:The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing “language agents”, which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that they are model-centric, or engineering-centric. That’s to say, the progress on prompts, tools, and pipelines of language agents requires substantial manual engineering efforts from human experts rather than automatically learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in “self-evolving agents”. 
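The abstract above describes treating an agent as a symbolic network whose "weights" are prompts, with natural-language stand-ins for loss and gradients. The loop can be sketched roughly as follows. This is an illustrative reading of the abstract only, not the authors' released code: `Node`, `forward`, `backward`, `update`, the prompt wording, and `stub_llm` (a deterministic placeholder for a real model call) are all hypothetical, and tools and pipeline structure are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical text-in, text-out model interface


@dataclass
class Node:
    """One step of the agent pipeline; its prompt plays the role of a weight."""
    prompt: str


def forward(nodes: List[Node], task: str, llm: LLM) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the pipeline and record each node's (input, output) pair."""
    trajectory = []
    current = task
    for node in nodes:
        output = llm(f"{node.prompt}\nInput: {current}")
        trajectory.append((current, output))
        current = output
    return current, trajectory


def backward(nodes: List[Node], trajectory, language_loss: str, llm: LLM) -> List[str]:
    """Pass textual feedback from the last node back to the first."""
    gradients = [""] * len(nodes)
    feedback = language_loss
    for i in reversed(range(len(nodes))):
        node_input, node_output = trajectory[i]
        feedback = llm(
            "Given the downstream feedback, say how this step should change.\n"
            f"Prompt: {nodes[i].prompt}\nInput: {node_input}\n"
            f"Output: {node_output}\nFeedback: {feedback}"
        )
        gradients[i] = feedback  # the node's natural-language "gradient"
    return gradients


def update(nodes: List[Node], gradients: List[str], llm: LLM) -> None:
    """Rewrite each prompt using its textual gradient (the optimizer step)."""
    for node, gradient in zip(nodes, gradients):
        node.prompt = llm(
            f"Rewrite this prompt.\nPrompt: {node.prompt}\nFeedback: {gradient}"
        )


def stub_llm(text: str) -> str:
    """Deterministic stand-in so the sketch runs without a model."""
    return f"[reply to {len(text)} chars]"


agent = [Node("Plan the task."), Node("Carry out the plan.")]
answer, trajectory = forward(agent, "Summarise the paper.", stub_llm)
loss = stub_llm(f"Critique this answer: {answer}")  # natural-language "loss"
gradients = backward(agent, trajectory, loss, stub_llm)
update(agent, gradients, stub_llm)
```

Swapping `stub_llm` for a real model call is the only change needed to try the loop on actual tasks. Whether the rewritten prompts improve anything depends entirely on that model.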

[NLP-2] PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation

Link: https://arxiv.org/abs/2406.18528
Authors: Christoph Leiter, Steffen Eger
Keywords: Large language models, field of NLP, Large language, revolutionized the field, Large
Categories: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Large language models (LLMs) have revolutionized the field of NLP. Notably, their in-context learning capabilities also enable their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale prompt exploration for metrics, where we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totalling over 6.6M evaluations. This extensive comparison (1) serves as a benchmark of the performance of recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios for which prompts are stable. For instance, some LLMs show idiosyncratic preferences and favor to grade generated texts with textual labels while others prefer to return numeric scores. On the other hand, the stability of prompts and model rankings can be susceptible to seemingly innocuous changes. For example, changing the requested output format from “0 to 100” to “-1 to +1” can strongly affect the rankings in our evaluation. Our study contributes to understanding the impact of different prompting approaches on LLM-based metrics for MT and summarization evaluation, highlighting the most stable prompting patterns and potential limitations.
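To make the ranking-stability point concrete, here is a small sketch that builds a grid of prompt templates and measures how far a model ranking shifts between two output formats. Only the two formats ("0 to 100" and "-1 to +1") come from the abstract. The base prompt wording, the model names, the scores, and the choice of Kendall's tau (tau-a, ignoring ties) as the agreement measure are assumptions made for illustration.

```python
from itertools import combinations, product

BASE_PROMPTS = [  # hypothetical wording
    "Rate the quality of this translation.",
    "How well does the summary cover the source text?",
]
OUTPUT_FORMATS = ["0 to 100", "-1 to +1"]  # the two formats named in the abstract

# Every combination of base prompt and output format is one template.
templates = [
    f"{base} Answer with a score from {fmt}."
    for base, fmt in product(BASE_PROMPTS, OUTPUT_FORMATS)
]


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts that share the same keys."""
    models = list(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1  # both formats order this pair the same way
        elif sign < 0:
            discordant += 1  # the two formats disagree on this pair
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical correlation-with-human scores per model under each format.
scores_0_100 = {"model_a": 0.30, "model_b": 0.25, "model_c": 0.10}
scores_pm1 = {"model_a": 0.12, "model_b": 0.28, "model_c": 0.20}

tau = kendall_tau(scores_0_100, scores_pm1)
print(len(templates), round(tau, 3))
```

A tau near 1 means the format change left the ranking intact. Values near 0 or below indicate the kind of instability the abstract reports.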

[NLP-3] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

Link: https://arxiv.org/abs/2406.18522
Authors: Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan
Keywords: Sora and Lumiere, time-lapse videos, temporal coherence, videos, time-lapse video generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 31 pages, 15 figures

Abstract:We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model’s ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model’s capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos’ metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude.

[NLP-4] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Link: https://arxiv.org/abs/2406.18521
Authors: Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
Keywords: Multimodal Large Language, applying Multimodal Large, Large Language Models, Multimodal Large, Large Language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 121 pages, 90 figures

Abstract:Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: this https URL

[NLP-5] APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Link: https://arxiv.org/abs/2406.18518
Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
Keywords: Berkeley Function-Calling Benchmark, models requires diverse, function-calling, requires diverse, agent models requires
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: this https URL and the project homepage: this https URL
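The three hierarchical stages named in the abstract (format checking, actual function execution, semantic verification) could be chained along these lines. This is a sketch under stated assumptions, not APIGen's implementation: the sample schema (`query` plus `call` with `name` and `arguments`), the toy `get_weather` API, and the `semantic_checker` callable are all hypothetical. A real semantic stage would need something far stronger than the string match used here.

```python
import inspect
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical toy API standing in for a real executable API."""
    if unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"unsupported unit: {unit}")
    return {"city": city, "unit": unit, "temp": 21}


API_REGISTRY = {"get_weather": get_weather}


def check_format(raw):
    """Stage 1: valid JSON, known function, arguments that fit its signature."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(sample, dict):
        return None
    call = sample.get("call")
    if not isinstance(sample.get("query"), str) or not isinstance(call, dict):
        return None
    func = API_REGISTRY.get(call.get("name"))
    if func is None or not isinstance(call.get("arguments"), dict):
        return None
    try:
        inspect.signature(func).bind(**call["arguments"])
    except TypeError:
        return None  # missing or unexpected arguments
    return sample


def check_execution(sample):
    """Stage 2: actually run the call and keep its result."""
    call = sample["call"]
    try:
        return True, API_REGISTRY[call["name"]](**call["arguments"])
    except Exception:
        return False, None


def verify(raw, semantic_checker):
    """Run the three stages in order; stop at the first failure."""
    sample = check_format(raw)
    if sample is None:
        return "rejected: format"
    ok, result = check_execution(sample)
    if not ok:
        return "rejected: execution"
    if not semantic_checker(sample["query"], result):  # Stage 3
        return "rejected: semantic"
    return "accepted"


def city_matches(query, result):
    """Toy semantic check: the result's city must be mentioned in the query."""
    return result["city"].lower() in query.lower()


good = json.dumps({"query": "Weather in Paris?",
                   "call": {"name": "get_weather", "arguments": {"city": "Paris"}}})
print(verify(good, city_matches))
```

Ordering the stages from cheapest to most expensive means malformed samples never reach execution, and failed executions never reach the semantic check.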

[NLP-6] “Is ChatGPT a Better Explainer than My Professor?”: Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

Link: https://arxiv.org/abs/2406.18512
Authors: Grace Li, Milad Alshomary, Smaranda Muresan
Keywords: social dynamics, communication principles, learning theories, form the foundation, foundation of knowledge
Categories: Computation and Language (cs.CL)
Comments: 6 figures, 5 pages

Abstract:Explanations form the foundation of knowledge sharing and build upon communication principles, social dynamics, and learning theories. We focus specifically on conversational approaches for explanations because the context is highly adaptive and interactive. Our research leverages previous work on explanatory acts, a framework for understanding the different strategies that explainers and explainees employ in a conversation to explain, understand, and engage with the other party. We use the 5-Levels dataset, which was constructed from the WIRED YouTube series by Wachsmuth et al. and later annotated with explanatory acts by Booshehri et al. These annotations provide a framework for understanding how explainers and explainees structure their responses. With the rise of generative AI in the past year, we hope to better understand the capabilities of Large Language Models (LLMs) and how they can augment expert explainers’ capabilities in conversational settings. To achieve this goal, the 5-Levels dataset (we use Booshehri et al.'s 2023 annotated dataset with explanatory acts) allows us to audit the ability of LLMs to engage in explanation dialogues. To evaluate the effectiveness of LLMs in generating explainer responses, we asked human annotators to compare 3 different strategies: human explainer response, GPT4 standard response, and GPT4 response with Explanation Moves.

[NLP-7] WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Link: https://arxiv.org/abs/2406.18510
Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
Keywords: composes multiple tactics, automatic LLM safety, multiple tactics, safety red-teaming framework, framework that mines
Categories: Computation and Language (cs.CL)

Abstract:We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla and adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.

[NLP-8] Mental Modeling of Reinforcement Learning Agents by Language Models

Link: https://arxiv.org/abs/2406.18505
Authors: Wenhao Lu, Xufeng Zhao, Josua Spisak, Jae Hee Lee, Stefan Wermter
Keywords: language models faithfully, emergent language models, models faithfully model, language models, intelligence of decision-making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: this https URL

Abstract:Can emergent language models faithfully model the intelligence of decision-making agents? Though modern language models already exhibit some reasoning ability, and theoretically can potentially express any probable distribution over tokens, it remains underexplored how the world knowledge these pretrained models have memorized can be utilized to comprehend an agent’s behaviour in the physical world. This study empirically examines, for the first time, how well large language models (LLMs) can build a mental model of agents, termed agent mental modelling, by reasoning about an agent’s behaviour and its effect on states from agent interaction history. This research may unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in eXplainable reinforcement learning (XRL). To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. Our results disclose that LLMs are not yet capable of fully mentally modelling agents through inference alone without further innovations. This work thus provides new insights into the capabilities and limitations of modern LLMs.

[NLP-9] Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming

Link: https://arxiv.org/abs/2406.18501
Authors: Zhenghao Zhou, Robert Frank, R. Thomas McCoy
Keywords: Large language models, Large language, ICL, shown the emergent, emergent capability
Categories: Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have shown the emergent capability of in-context learning (ICL). One line of research has explained ICL as functionally performing gradient descent. In this paper, we introduce a new way of diagnosing whether ICL is functionally equivalent to gradient-based learning. Our approach is based on the inverse frequency effect (IFE) – a phenomenon in which an error-driven learner is expected to show larger updates when trained on infrequent examples than frequent ones. The IFE has previously been studied in psycholinguistics because humans show this effect in the context of structural priming (the tendency for people to produce sentence structures they have encountered recently); the IFE has been used as evidence that human structural priming must involve error-driven learning mechanisms. In our experiments, we simulated structural priming within ICL and found that LLMs display the IFE, with the effect being stronger in larger models. We conclude that ICL is indeed a type of gradient-based learning, supporting the hypothesis that a gradient component is implicitly computed in the forward pass during ICL. Our results suggest that both humans and LLMs make use of gradient-based, error-driven processing mechanisms.
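Why an error-driven learner shows the inverse frequency effect can be seen in a one-parameter toy model. This is an illustration only, not the paper's experimental setup: a single logit `z` encodes how strongly the learner expects a sentence structure, and one gradient step is taken on the cross-entropy loss after that structure is observed once. The function name `prime_once` and the learning rate are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prime_once(prior_prob, lr=0.5):
    """One gradient step after seeing the structure once.

    With loss L = -log(sigmoid(z)), the gradient is dL/dz = -(1 - sigmoid(z)),
    so the update to z is lr * (1 - prior_prob): the prediction error scales it.
    Returns (logit update, gain in log-probability of the structure).
    """
    z = math.log(prior_prob / (1.0 - prior_prob))
    update = lr * (1.0 - prior_prob)
    new_prob = sigmoid(z + update)
    return update, math.log(new_prob) - math.log(prior_prob)


for prior in (0.1, 0.5, 0.9):  # rare, neutral, and frequent structures
    update, gain = prime_once(prior)
    print(f"prior={prior:.1f}  logit update={update:.3f}  log-prob gain={gain:.3f}")
```

The rarer the structure was expected to be, the larger the error and hence the update, which is the pattern the paper tests for in in-context learning.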

[NLP-10] WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Link: https://arxiv.org/abs/2406.18495
Authors: Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
Keywords: identifying malicious intent, light-weight moderation tool, detecting safety risks, achieves three goals, determining model refusal
Categories: Computation and Language (cs.CL)
Comments: First two authors contributed equally. Third and fourth authors contributed equally

Abstract:We introduce WildGuard – an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models’ refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

[NLP-11] Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation

Link: https://arxiv.org/abs/2406.18460
Authors: Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, Fabrice Lefèvre
Keywords: Large Language Models, Large Language, proposed to create, create open-domain conversational, Recently
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Updated version of a paper originally submitted at SIGDIAL 2023

Abstract:Recently, various methods have been proposed to create open-domain conversational agents with Large Language Models (LLMs). These models are able to answer user queries, but in a one-way QA format rather than a true conversation. Fine-tuning on particular datasets is the usual way to modify their style to increase conversational ability, but this is expensive and usually only available in a few languages. In this study, we explore role-play zero-shot prompting as an efficient and cost-effective solution for open-domain conversation, using capable multilingual LLMs (Beeching et al., 2023) trained to obey instructions. We design a prompting system that, when combined with an instruction-following model - here Vicuna (Chiang et al., 2023) - produces conversational agents that match and even surpass fine-tuned models in human evaluation in French in two different tasks.

[NLP-12] Cascading Large Language Models for Salient Event Graph Generation

Link: https://arxiv.org/abs/2406.18449
Authors: Xingwei Tan, Yuxiang Zhou, Gabriele Pergola, Yulan He
Keywords: multiple tasks involved, reconciling unstructured input, Generating event graphs, Generating event, identifying their relationships
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 + 12 pages

Abstract:Generating event graphs from long documents is challenging due to the inherent complexity of multiple tasks involved such as detecting events, identifying their relationships, and reconciling unstructured input with structured graphs. Recent studies typically consider all events with equal importance, failing to distinguish salient events crucial for understanding narratives. This paper presents CALLMSAE, a CAscading Large Language Model framework for SAlient Event graph generation, which leverages the capabilities of LLMs and eliminates the need for costly human annotations. We first identify salient events by prompting LLMs to generate summaries, from which salient events are identified. Next, we develop an iterative code refinement prompting strategy to generate event relation graphs, removing hallucinated relations and recovering missing edges. Fine-tuning contextualised graph generation models on the LLM-generated graphs outperforms the models trained on CAEVO-generated data. Experimental results on a human-annotated test set show that the proposed method generates salient and more accurate graphs, outperforming competitive baselines.
摘要:从长文档中生成事件图是具有挑战性的,因为涉及到多个任务的固有复杂性,例如检测事件、识别事件关系以及协调非结构化输入和结构化图。最近的研究通常认为所有事件都具有同等的重要性,未能区分对理解叙事至关重要的显著事件。本文提出了CALLMSAE,这是一个用于显著事件图生成的级联大型语言模型框架,它利用了LLMS的能力并消除了对昂贵的人工注释的需要。我们首先通过提示LLM生成摘要来识别显著事件,从摘要中识别显著事件。接下来,我们开发了一种迭代的代码求精提示策略来生成事件关系图,去除幻觉关系并恢复丢失的边。在LLM生成的图形上微调上下文图形生成模型的性能优于在CAEVO生成的数据上训练的模型。在人工标注的测试集上的实验结果表明,该方法生成了显著且更准确的图形,性能优于竞争基线。

[NLP-13] IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons
[NLP-13] IRCAN:通过识别和重新加权上下文感知神经元来缓解LLM生成中的知识冲突

链接: https://arxiv.org/abs/2406.18406
作者: Dan Shi,Renren Jin,Tianhao Shen,Weilong Dong,Xinwei Wu,Deyi Xiong
关键词: large language models, encode a vast, mass data, widely acknowledged, acknowledged that large
中文关键词: 大型语言模型,编码大量数据,被广泛认可,承认大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 13 figures, 5 tables

Abstract:It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-and-play solution that can be integrated seamlessly with existing models.
摘要:人们普遍认为,大语言模型(LLM)经过海量数据训练后,编码了大量的知识。最近的研究揭示了LLM生成中的知识冲突,其中过时或不正确的参数知识(即编码的知识)与上下文中提供的新知识相矛盾。为了缓解这种知识冲突,我们提出了一个新的框架,IRCAN(识别和重新加权上下文感知神经元),以利用在处理上下文线索中至关重要的神经元。具体地说,IRCAN首先利用从综合梯度得出的上下文感知归因分数来识别对上下文处理有显著贡献的神经元。随后,通过重新加权来增强识别出的上下文感知神经元。在这样做的过程中,我们引导LLM针对上下文中提供的新知识生成上下文敏感的输出。在各种模型和任务上进行的广泛实验表明,IRCAN不仅在处理知识冲突方面取得了显著的改进,而且还提供了一个可扩展的、即插即用的解决方案,可以与现有模型无缝集成。

[NLP-14] LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
[NLP-14] 用LLM取代人类评判者?针对20项NLP评估任务的大规模实证研究

链接: https://arxiv.org/abs/2406.18403
作者: Anna Bavaresco,Raffaella Bernardi,Leonardo Bertolazzi,Desmond Elliott,Raquel Fernández,Albert Gatt,Esam Ghaleb,Mario Giulianelli,Michael Hanna,Alexander Koller,André F. T. Martins,Philipp Mondorf,Vera Neplenbroek,Sandro Pezzelle,Barbara Plank,David Schlangen,Alessandro Suglia,Aditya K Surikuchi,Ece Takmaz,Alberto Testoni
关键词: evaluating NLP models, increasing trend, trend towards evaluating, evaluating NLP, NLP
中文关键词: 评估NLP模型,增长趋势,评估趋势,评估NLP,NLP
类目: Computation and Language (cs.CL)
备注:

Abstract:There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
摘要:使用LLM生成的判断而不是人类判断来评估NLP模型的趋势日益明显。在缺乏与人类数据比较的情况下,这引发了对这些评估有效性的担忧;如果使用专有模型进行评估,这还会引发对可复现性的担忧。我们提供JUDGE-BENCH,这是20个带有人类注释的NLP数据集的集合,并全面评估了11个当前的LLM(涵盖开放权重模型和专有模型)复现这些注释的能力。我们的评估表明,每个LLM与人类判断的相关性在不同数据集之间表现出很大的差异。我们的结论是,LLM尚未准备好系统性地取代NLP中的人类评判者。

[NLP-15] Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
[NLP-15] LLM会梦见大象吗(当被告知不要这样做时)?Transformer中的潜在概念关联和联想记忆

链接: https://arxiv.org/abs/2406.18400
作者: Yibo Jiang,Goutham Rajendran,Pradeep Ravikumar,Bryon Aragam
关键词: Large Language Models, Large Language, Language Models, capacity to store, store and recall
中文关键词: 大型语言模型、大型语言、语言模型、存储、存储和召回的能力
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

Abstract:Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.
摘要:大型语言模型(LLM)具有存储和回忆事实的能力。通过对开源模型的实验,我们观察到这种检索事实的能力可以通过改变上下文来轻松操纵,即使不改变其事实含义。这些发现表明,LLM可能表现得像一个联想记忆模型,其中上下文中的某些词元充当检索事实的线索。我们通过研究Transformer(LLM的构建模块)如何完成此类记忆任务,从数学上探索了这一属性。我们用单层Transformer研究了一个简单的潜在概念关联问题,并从理论和经验上表明,Transformer使用自注意力收集信息,并使用值矩阵进行联想记忆。

[NLP-16] Dynamic Data Pruning for Automatic Speech Recognition
[NLP-16] 自动语音识别的动态数据修剪

链接: https://arxiv.org/abs/2406.18373
作者: Qiao Xiao,Pingchuan Ma,Adriana Fernandez-Lopez,Boqian Wu,Lu Yin,Stavros Petridis,Mykola Pechenizkiy,Maja Pantic,Decebal Constantin Mocanu,Shiwei Liu
关键词: Automatic Speech Recognition, Speech Recognition, Automatic Speech, success of Automatic, recent success
中文关键词: 自动语音识别,语音识别,自动语音,自动的成功,最近的成功
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024

Abstract:The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. Furthermore, we introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers several fine-grained pruning granularities specifically tailored for speech-related datasets, going beyond the conventional pruning of entire time sequences. Our intensive experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
摘要:自动语音识别(ASR)近年来的成功很大程度上归功于不断增长的训练数据量。然而,这一趋势使得模型训练的成本高得令人望而却步,并强加了计算要求。虽然数据修剪被提出通过识别一小部分相关数据来缓解这一问题,但它在ASR中的应用几乎没有被探索,现有的工作往往需要大量的开销才能获得有意义的结果。为了填补这一空白,本文首次对ASR的动态数据剪枝进行了研究,发现动态选取70%的数据可以达到全数据的性能。此外,我们引入了ASR的动态数据剪枝(DDP-ASR),它提供了几个专门针对语音相关数据集的细粒度剪枝,超越了传统的对整个时间序列的剪枝。我们的实验表明,DDP-ASR可以节省高达1.6倍的训练时间,而性能损失可以忽略不计。

[NLP-17] Themis: Towards Flexible and Interpretable NLG Evaluation
[NLP-17] Themis:迈向灵活且可解释的NLG评估

链接: https://arxiv.org/abs/2406.18365
作者: Xinyu Hu,Li Lin,Mingqi Gao,Xunjian Yin,Xiaojun Wan
关键词: longstanding research issue, natural language generation, research issue, significant and longstanding, longstanding research
中文关键词: 长期存在的研究问题,自然语言生成,研究问题,重要且长期存在的研究
类目: Computation and Language (cs.CL)
备注:

Abstract:The evaluation of natural language generation (NLG) tasks is a significant and longstanding research issue. With the recent emergence of powerful large language models (LLMs), some studies have turned to LLM-based automatic evaluation methods, which demonstrate great potential to become a new evaluation paradigm following traditional string-based and model-based metrics. However, despite the improved performance of existing methods, they still possess some deficiencies, such as dependency on references and limited evaluation flexibility. Therefore, in this paper, we meticulously construct a large-scale NLG evaluation corpus NLG-Eval with human and GPT-4 annotations to alleviate the lack of relevant data in this field. Furthermore, we propose Themis, an LLM dedicated to NLG evaluation, which has been trained with our designed multi-perspective consistency and rating-oriented preference alignment methods. Themis can conduct flexible and interpretable evaluations without references, and it exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
摘要:自然语言生成(NLG)任务的评估是一个重要且由来已久的研究课题。随着最近功能强大的大型语言模型(LLM)的出现,一些研究转向了基于LLM的自动评估方法,这类方法展现出继传统的基于字符串和基于模型的指标之后成为新评估范式的巨大潜力。然而,尽管现有方法的性能有所提高,但它们仍然存在一些不足,如依赖参考文本和评估灵活性有限。因此,在本文中,我们精心构建了一个大规模的NLG评估语料库NLG-Eval,其中包含人类和GPT-4的注释,以缓解该领域相关数据的缺乏。此外,我们还提出了Themis,一个专门用于NLG评估的LLM,并用我们设计的多视角一致性和面向评分的偏好对齐方法对其进行了训练。Themis可以在没有参考文本的情况下进行灵活且可解释的评估,在各种NLG任务上表现出优越的评估性能,同时对未见过的任务具有很好的泛化能力,并超过了包括GPT-4在内的其他评估模型。

[NLP-18] Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model
[NLP-18] 基于改进BERTSum-LSTM模型的LCSTS数据集信息提取研究

链接: https://arxiv.org/abs/2406.18364
作者: Yiming Chen,Haobin Chen,Simin Liu,Yunyun Liu,Fanhao Zhou,Bing Wei
关键词: natural language processing, language processing technology, Chinese news summaries, artificial intelligence, Chinese
中文关键词: 自然语言处理、语言处理技术、中文新闻摘要、人工智能、中文
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: submitted to ICMIII 2024

Abstract:With the continuous advancement of artificial intelligence, natural language processing technology has become widely utilized in various fields. At the same time, there are many challenges in creating Chinese news summaries. First of all, the semantics of Chinese news is complex, and the amount of information is enormous. Extracting critical information from Chinese news presents a significant challenge. Second, the news summary should be concise and clear, focusing on the main content and avoiding redundancy. In addition, the particularity of the Chinese language, such as polysemy, word segmentation, etc., makes it challenging to generate Chinese news summaries. Based on the above, this paper studies the information extraction method of the LCSTS dataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM model to make it perform better in generating Chinese news summaries. The experimental results show that the proposed method has a good effect on creating news summaries, which is of great importance to the construction of news summaries.
摘要:随着人工智能的不断推进,自然语言处理技术在各个领域得到了广泛的应用。与此同时,在创作中文新闻摘要方面也面临着许多挑战。首先,中文新闻的语义复杂,信息量巨大。从中文新闻中提取关键信息是一个巨大的挑战。第二,新闻摘要要简明扼要,突出主要内容,避免重复。此外,汉语的特殊性,如多义性、分词等,给中文新闻摘要的生成带来了挑战。在此基础上,本文研究了基于改进的BERTSum-LSTM模型的LCSTS数据集信息提取方法。我们对BERTSum-LSTM模型进行了改进,使其在生成中文新闻摘要时具有更好的性能。实验结果表明,该方法具有较好的新闻摘要生成效果,对新闻摘要的构建具有重要意义。

[NLP-19] Grammar Assistance Using Syntactic Structures (GAUSS)
[NLP-19] 使用句法结构的语法辅助(GAUSS)

链接: https://arxiv.org/abs/2406.18340
作者: Olga Zamaraeva,Lorena S. Allegue,Carlos Gómez-Rodríguez,Anastasiia Ogneva,Margarita Alonso-Ramos
关键词: established social roles, imposing social pressures, reinforcing established social, Automatic grammar coaching, social roles
中文关键词: 既定的社会角色、施加社会压力、强化既定的社会、自动语法辅导、社会角色
类目: Computation and Language (cs.CL)
备注: 5 pages, 4 figures, project summary for CEDI-SEPLN Seminar of the Spanish Society for Natural Language Processing at the 7th Spanish Conference on Informatics, June 19-20, 2024, A Coruña, Spain

Abstract:Automatic grammar coaching serves an important purpose of advising on standard grammar varieties while not imposing social pressures or reinforcing established social roles. Such systems already exist but most of them are for English and few of them offer meaningful feedback. Furthermore, they typically rely completely on neural methods and require huge computational resources which most of the world cannot afford. We propose a grammar coaching system for Spanish that relies on (i) a rich linguistic formalism capable of giving informative feedback; and (ii) a faster parsing algorithm which makes using this formalism practical in a real-world application. The approach is feasible for any language for which there is a computerized grammar and is less reliant on expensive and environmentally costly neural methods. We seek to contribute to Greener AI and to address global education challenges by raising the standards of inclusivity and engagement in grammar coaching.
摘要:自动语法辅导的重要作用是就标准语法变体提供建议,同时不施加社会压力或强化既定的社会角色。此类系统已经存在,但其中大部分面向英语,而且很少能提供有意义的反馈。此外,它们通常完全依赖神经方法,并需要世界上大多数人负担不起的大量计算资源。我们提出了一种西班牙语语法辅导系统,该系统依赖于(i)能够提供信息丰富的反馈的丰富语言形式体系;以及(ii)更快的解析算法,使这种形式体系在现实世界的应用中变得实用。该方法适用于任何拥有计算机化语法的语言,并且较少依赖昂贵且环境成本高昂的神经方法。我们寻求通过提高语法辅导的包容性和参与度标准,为更绿色的人工智能做出贡献,并应对全球教育挑战。

[NLP-20] PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
[NLP-20] PaCoST:大型语言模型中基准污染检测的配对置信度显著性检验

链接: https://arxiv.org/abs/2406.18326
作者: Huixuan Zhang,Yun Lin,Xiaojun Wan
关键词: Large language models, Large language, intentionally include data, trained on vast, vast amounts
中文关键词: 大型语言模型,大型语言,故意包括经过大量训练的数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to cheatingly high scores on model leaderboards, yet result in disappointing performance in real-world applications. To address this benchmark contamination problem, we first propose a set of requirements that practical contamination detection methods should follow. Following these proposed requirements, we introduce PaCoST, a Paired Confidence Significance Testing to effectively detect benchmark contamination in LLMs. Our method constructs a counterpart for each piece of data with the same distribution, and performs statistical analysis of the corresponding confidence to test whether the model is significantly more confident under the original benchmark. We validate the effectiveness of PaCoST and apply it on popular open-source models and benchmarks. We find that almost all models and benchmarks we tested are suspected contaminated more or less. We finally call for new LLM evaluation methods.
摘要:众所周知,大型语言模型(LLM)是在海量数据上进行训练的,这些数据可能无意或有意地包含了来自常用基准的数据。这种混入可能会使模型在排行榜上获得虚高的分数,但在现实世界的应用中却表现令人失望。为了解决这个基准污染问题,我们首先提出了一套实用的污染检测方法应该遵循的要求。根据这些要求,我们引入了PaCoST,一种配对置信度显著性检验,以有效地检测LLM中的基准污染。我们的方法为每条数据构造一个具有相同分布的对应样本,并对相应的置信度进行统计分析,以检验模型在原始基准下是否显著更有信心。我们验证了PaCoST的有效性,并将其应用于流行的开源模型和基准。我们发现,我们测试的几乎所有模型和基准都或多或少存在污染嫌疑。最后,我们呼吁采用新的LLM评估方法。
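
The paired comparison described in the PaCoST abstract above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the statistical test, so a one-sided paired sign-flip permutation test is swapped in here as an assumption, and the function name `paired_permutation_pvalue` and the confidence values in the usage note are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_pvalue(original_conf, counterpart_conf,
                              n_resamples=10_000, seed=0):
    """One-sided paired sign-flip permutation test (illustrative assumption).

    H0: the model is not more confident on original benchmark items
    than on their same-distribution counterparts.
    """
    if len(original_conf) != len(counterpart_conf):
        raise ValueError("confidence lists must be paired item by item")
    # Per-item confidence gap: original minus counterpart.
    diffs = [o - c for o, c in zip(original_conf, counterpart_conf)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under H0 the sign of each paired difference is exchangeable.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value would suggest the model is systematically more confident on the original items, which is the signal the abstract associates with contamination. For example, calling the function with `[0.90, 0.95, 0.92, 0.88]` against `[0.60, 0.70, 0.65, 0.55]` compares four hypothetical item pairs.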

[NLP-21] MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
[NLP-21] MathOdyssey:使用Odyssey数学数据对大型语言模型中的数学问题解决技能进行基准测试

链接: https://arxiv.org/abs/2406.18321
作者: Meng Fang,Xiangpeng Wan,Fei Lu,Fei Xing,Kai Zou
关键词: Large language models, Large language, advanced natural language, natural language understanding, strong problem-solving abilities
中文关键词: 大型语言模型、大型语言、高级自然语言、自然语言理解、解决问题能力强
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed “MathOdyssey” dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.
摘要:大型语言模型(LLM)显著推进了自然语言理解,并展现出强大的问题解决能力。尽管取得了这些成功,但由于所需的推理十分复杂,大多数LLM仍然难以解决数学问题。本文利用新开发的“MathOdyssey”数据集研究了LLM的数学问题解决能力。该数据集包括高中和大学层面的各种数学问题,由知名机构的专家创建,以在高级问题解决场景中严格测试LLM,并涵盖更广泛的学科领域。通过将MathOdyssey数据集作为资源提供给AI社区,我们的目标是为理解和提高AI在复杂数学问题解决中的能力做出贡献。我们对开源模型(如Llama-3和DBRX-Instruct)以及GPT系列和Gemini模型等闭源模型进行了基准测试。我们的结果表明,尽管LLM在常规和中等难度的任务中表现良好,但它们在奥林匹克级别的问题和复杂的大学级别问题上面临重大挑战。我们的分析显示,开源和闭源模型之间的性能差距正在缩小,但仍然存在实质性的挑战,特别是在最苛刻的问题上。这项研究强调了持续开展研究以增强LLM数学推理能力的必要性。数据集、结果和代码均已公开。

[NLP-22] Advancing Airport Tower Command Recognition: Integrating Squeeze-and-Excitation and Broadcasted Residual Learning
[NLP-22] 推进机场塔台指令识别:集成挤压与激励和广播残差学习

链接: https://arxiv.org/abs/2406.18313
作者: Yuanxi Lin,Tonglin Zhou,Yang Xiao
关键词: follow air traffic, air traffic control, control instructions precisely, traffic control instructions, Accurate recognition
中文关键词: 遵循空中交通,空中交通管制,精确控制指令,交通管制指令,准确识别
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by IALP 2024

Abstract:Accurate recognition of aviation commands is vital for flight safety and efficiency, as pilots must follow air traffic control instructions precisely. This paper addresses challenges in speech command recognition, such as noisy environments and limited computational resources, by advancing keyword spotting technology. We create a dataset of standardized airport tower commands, including routine and emergency instructions. We enhance broadcasted residual learning with squeeze-and-excitation and time-frame frequency-wise squeeze-and-excitation techniques, resulting in our BC-SENet model. This model focuses on crucial information with fewer parameters. Our tests on five keyword spotting models, including BC-SENet, demonstrate superior accuracy and efficiency. These findings highlight the effectiveness of our model advancements in improving speech command recognition for aviation safety and efficiency in noisy, high-stakes environments. Additionally, BC-SENet shows comparable performance on the common Google Speech Command dataset.
摘要:准确识别航空指令对飞行安全和效率至关重要,因为飞行员必须严格遵守空中交通管制指令。本文通过推进关键词检测技术来解决语音命令识别中的挑战,如噪声环境和有限的计算资源。我们创建了一个标准化机场塔台指令的数据集,包括例行指令和紧急指令。我们用挤压与激励以及时间帧频率方向的挤压与激励技术来增强广播残差学习,从而得到我们的BC-SENet模型。该模型以较少的参数关注关键信息。我们对包括BC-SENet在内的五个关键词检测模型进行了测试,结果显示出优越的准确率和效率。这些发现突显了我们的模型改进在提升语音命令识别方面的有效性,有助于在嘈杂、高风险的环境中保障航空安全和效率。此外,BC-SENet在常用的Google Speech Command数据集上表现出相当的性能。

[NLP-23] AI-native Memory: A Pathway from LLMs Towards AGI
[NLP-23] AI原生记忆:从LLM通向AGI的途径

链接: https://arxiv.org/abs/2406.18312
作者: Jingbo Shang,Zai Zheng,Xiang Ying,Felix Tao,Mindverse Team
关键词: artificial general intelligence, general intelligence, context length, demonstrated the world, sparks of artificial
中文关键词: 人工通用智能,通用智能,上下文长度,展示世界,人工的火花
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large language models (LLMs) have demonstrated the world with the sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs – (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Our reasoning-in-a-haystack experiments further demonstrate that simultaneously finding the relevant information from a long context and conducting (simple) reasoning is nearly impossible. In this paper, we envision a pathway from LLMs to AGI through the integration of memory. We believe that AGI should be a system where LLMs serve as core processors. In addition to raw data, the memory in this system would store a large number of important conclusions derived from reasoning processes. Compared with retrieval-augmented generation (RAG) that merely processing raw data, this approach not only connects semantically related information closer, but also simplifies complex inferences at the time of querying. As an intermediate stage, the memory will likely be in the form of natural language descriptions, which can be directly consumed by users too. Ultimately, every agent/person should have its own large personal model, a deep neural network model (thus AI-native) that parameterizes and compresses all types of memory, even the ones cannot be described by natural languages. Finally, we discuss the significant potential of AI-native memory as the transformative infrastructure for (proactive) engagement, personalization, distribution, and social in the AGI era, as well as the incurred privacy and security challenges with preliminary solutions.
摘要:大型语言模型(LLM)向世界展示了人工通用智能(AGI)的火花。一种观点,尤其是一些致力于LLM的初创公司的观点认为,上下文长度几乎不受限制的LLM可以实现AGI。然而,他们可能对(现有的)LLM的长上下文能力过于乐观:(1)最近的文献表明,它们的有效上下文长度显著小于其声称的上下文长度;(2)我们的“大海捞针式推理”实验进一步表明,同时从长上下文中找到相关信息并进行(简单)推理几乎是不可能的。在本文中,我们设想了一条通过整合记忆从LLM通向AGI的途径。我们认为AGI应该是一个以LLM为核心处理器的系统。除了原始数据外,该系统中的记忆还将存储大量从推理过程中得出的重要结论。与仅处理原始数据的检索增强生成(RAG)相比,该方法不仅将语义相关的信息联系得更紧密,而且在查询时简化了复杂的推理。作为一个中间阶段,记忆可能会以自然语言描述的形式出现,用户也可以直接使用。归根结底,每个智能体/人都应该有自己的大型个人模型,即一个深度神经网络模型(因此是AI原生的),它参数化并压缩所有类型的记忆,即使是那些无法用自然语言描述的记忆。最后,我们讨论了AI原生记忆作为AGI时代(主动)参与、个性化、分发和社交的变革性基础设施的巨大潜力,以及由此带来的隐私和安全挑战和初步解决方案。

[NLP-24] S3: A Simple Strong Sample-effective Multimodal Dialog System
[NLP-24] S3:一个简单、强大、样本有效的多模态对话系统

链接: https://arxiv.org/abs/2406.18305
作者: Elisei Rykov,Egor Malkershin,Alexander Panchenko
关键词: Journey Contest, compelling leaderboards, present a conceptually, conceptually simple, simple yet powerful
中文关键词: 旅程竞赛,引人注目的排行榜,呈现了概念上、概念上简单、简单但强大的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
摘要:在这项工作中,我们为多模态对话任务提出了一个概念简单但功能强大的基线,即S3模型,它在两个引人注目的排行榜上取得了接近最先进的结果:MMMU和AI Journey Contest 2023。该系统基于预训练的大型语言模型、预训练的图像和音频模态编码器以及可训练的模态投影器。所提出的用于训练此类架构的有效数据混合方案表明,基于强语言模型并在少量多模态数据上训练的多模态模型可以高效地完成多模态对话任务。

[NLP-25] FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning
[NLP-25] FactFinders at CheckThat! 2024:通过数据修剪用LLM改进值得核查的声明检测

链接: https://arxiv.org/abs/2406.18297
作者: Yufeng Li,Rrubaa Panchendrarajan,Arkaitz Zubiaga
关键词: claims needing fact-checking, filtering claims needing, Internet has posed, identifying check-worthy claims, needing fact-checking
中文关键词: 需要事实核查的主张、筛选需要的主张、互联网带来了、识别值得核查的主张、需要事实核查
类目: Computation and Language (cs.CL)
备注:

Abstract:The rapid dissemination of information through social media and the Internet has posed a significant challenge for fact-checking, among others in identifying check-worthy claims that fact-checkers should pay attention to, i.e. filtering claims needing fact-checking from a large pool of sentences. This challenge has stressed the need to focus on determining the priority of claims, specifically which claims are worth to be fact-checked. Despite advancements in this area in recent years, the application of large language models (LLMs), such as GPT, has only recently drawn attention in studies. However, many open-source LLMs remain underexplored. Therefore, this study investigates the application of eight prominent open-source LLMs with fine-tuning and prompt engineering to identify check-worthy statements from political transcriptions. Further, we propose a two-step data pruning approach to automatically identify high-quality training data instances for effective learning. The efficiency of our approach is demonstrated through evaluations on the English language dataset as part of the check-worthiness estimation task of CheckThat! 2024. Further, the experiments conducted with data pruning demonstrate that competitive performance can be achieved with only about 44% of the training data. Our team ranked first in the check-worthiness estimation task in the English language.
摘要:信息通过社交媒体和互联网迅速传播,对事实核查提出了重大挑战,尤其是在识别事实核查人员应该关注的值得核查的主张方面,即从大量句子中筛选需要事实核查的主张。这一挑战凸显了确定主张优先级的必要性,特别是哪些主张值得进行事实核查。尽管近年来这一领域取得了进展,但GPT等大型语言模型(LLM)的应用直到最近才引起研究的注意。然而,许多开源LLM仍未得到充分探索。因此,本研究考察了8个著名开源LLM结合微调和提示工程的应用,以从政治转录文本中识别值得核查的声明。此外,我们提出了一种两步数据修剪方法,自动识别高质量的训练数据实例,以便进行有效的学习。作为CheckThat! 2024值得核查性评估任务的一部分,我们的方法的效率通过在英语数据集上的评估得到了证明。此外,使用数据修剪进行的实验表明,仅使用约44%的训练数据就可以获得具有竞争力的性能。我们的团队在英语值得核查性评估任务中排名第一。

[NLP-26] Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs
[NLP-26] 分层上下文修剪:使用存储库级预训练代码LLM优化现实世界的代码补全

链接: https://arxiv.org/abs/2406.18294
作者: Lei Zhang,Yunshui Li,Jiaming Li,Xiaobo Xia,Jiaxi Yang,Run Luo,Minzheng Wang,Longze Chen,Junhao Liu,Min Yang
关键词: utilize cross-file information, Repo-Code LLMs, recognize repository structures, recently developed code, code
中文关键词: 利用跨文件信息、Repo-Code LLM、识别存储库结构、最近开发的代码、代码
类目: Computation and Language (cs.CL)
备注:

Abstract:Some recently developed code large language models (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code completion. However, in real-world development scenarios, simply concatenating the entire code repository often exceeds the context window limits of these Repo-Code LLMs, leading to significant performance degradation. In this study, we conducted extensive preliminary experiments and analyses on six Repo-Code LLMs. The results indicate that maintaining the topological dependencies of files and increasing the code file content in the completion prompts can improve completion accuracy; pruning the specific implementations of functions in all dependent files does not significantly reduce the accuracy of completions. Based on these findings, we proposed a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts with high informational code content. The HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content, significantly reduces the input length for repository-level code completion. We applied the HCP strategy in experiments with six Repo-Code LLMs, and the results demonstrate that our proposed method can significantly enhance completion accuracy while substantially reducing the length of input. Our code and data are available at this https URL.
摘要:最近开发的一些代码大型语言模型(Code LLM)已经在存储库级代码数据上进行了预训练(Repo-Code LLM),使这些模型能够识别存储库结构并利用跨文件信息进行代码补全。然而,在实际开发场景中,简单地拼接整个代码库通常会超出这些Repo-Code LLM的上下文窗口限制,从而导致显著的性能下降。在这项研究中,我们对六个Repo-Code LLM进行了广泛的初步实验和分析。结果表明,保持文件的拓扑依赖关系并增加补全提示中的代码文件内容可以提高补全准确率;修剪所有依赖文件中函数的具体实现不会显著降低补全的准确率。基于这些发现,我们提出了一种名为分层上下文修剪(HCP)的策略,用于构建代码信息含量高的补全提示。HCP在函数级别对代码库进行建模,在保留代码文件之间拓扑依赖关系的同时删除大量不相关的代码内容,显著减少了存储库级代码补全的输入长度。我们在六个Repo-Code LLM的实验中应用了HCP策略,结果表明,我们提出的方法可以在大幅减少输入长度的同时显著提高补全准确率。我们的代码和数据可在此https URL获取。
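
One observation in the abstract above is that the concrete implementations of functions in dependent files can be pruned without much loss in completion accuracy. The sketch below illustrates only that single idea for Python sources, using the standard `ast` module. It is not the authors' HCP implementation; the function name `prune_function_bodies` is hypothetical, and keeping docstrings is an assumption made here for readability of the pruned prompt. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast


def prune_function_bodies(source: str) -> str:
    """Keep signatures and docstrings; replace each function body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            new_body = []
            if docstring is not None:
                # Re-attach the docstring as the first statement of the body.
                new_body.append(ast.Expr(value=ast.Constant(value=docstring)))
            # An Ellipsis expression stands in for the removed implementation.
            new_body.append(ast.Expr(value=ast.Constant(value=...)))
            node.body = new_body
    return ast.unparse(tree)
```

Applied to a dependency file, this keeps imports, class layout, and function signatures visible to the completion model while dropping the implementation lines, which is where most of the input length usually sits.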

[NLP-27] Sanskrit Knowledge-based Systems: Annotation and Computational Tools
[NLP-27] 基于知识的梵文系统:注释和计算工具

链接: https://arxiv.org/abs/2406.18276
作者: Hrishikesh Terdalkar
关键词: question answering, address the challenges, challenges and opportunities, focus on question, Sanskrit text analysis
中文关键词: 回答问题,应对挑战、挑战和机遇,关注问题,梵文文本分析
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: PhD Thesis. 204 pages, 6 publications

Abstract:We address the challenges and opportunities in the development of knowledge systems for Sanskrit, with a focus on question answering. By proposing a framework for the automated construction of knowledge graphs, introducing annotation tools for ontology-driven and general-purpose tasks, and offering a diverse collection of web-interfaces, tools, and software libraries, we have made significant contributions to the field of computational Sanskrit. These contributions not only enhance the accessibility and accuracy of Sanskrit text analysis but also pave the way for further advancements in knowledge representation and language processing. Ultimately, this research contributes to the preservation, understanding, and utilization of the rich linguistic information embodied in Sanskrit texts.
摘要:我们探讨梵文知识系统开发中的挑战和机遇,重点关注问答。通过提出知识图谱自动构建的框架,引入用于本体驱动任务和通用任务的注释工具,并提供多样化的网络界面、工具和软件库集合,我们为计算梵文领域做出了重大贡献。这些贡献不仅增强了梵文文本分析的可访问性和准确性,而且还为知识表示和语言处理的进一步进步铺平了道路。最终,这项研究有助于保存、理解和利用梵文文本中所体现的丰富语言信息。

[NLP-28] “Vorbecsti Romanecste?” A Recipe to Train Powerful Romanian LLMs with English Instructions
[NLP-28] “Vorbecsti Romanecste?”用英语指令训练强大的罗马尼亚语LLM的方案

链接: https://arxiv.org/abs/2406.18266
作者: Mihai Masala,Denis C. Ilie-Ablachim,Alexandru Dima,Dragos Corlatescu,Miruna Zavelca,Ovio Olaru,Simina Terian-Dan,Andrei Terian-Dan,Marius Leordeanu,Horia Velicu,Marius Popescu,Mihai Dascalu,Traian Rebedea
关键词: Large Language Models, recent years, achieved almost human-like, English greatly exceeds, human-like performance
中文关键词: 大型语言模型、近年来、取得了近乎人类、英语远远超过、类似人类的表现
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2405.07703

Abstract:In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low or less-resourced languages.
摘要:近年来,大型语言模型(LLM)在各种任务上取得了近乎人类的表现。虽然一些LLM接受过多语言数据的训练,但大部分训练数据是英语的;因此,它们在英语方面的表现远远超过其他语言。据我们所知,我们是第一个收集和翻译大量文本、指令和基准,并训练、评估和发布为罗马尼亚语量身定制的开源LLM的团队。我们在四个不同的类别上对我们的方法进行评估,包括学术基准、MT-Bench(人工翻译)以及适配罗马尼亚语的专业构建的历史、文化和社会基准。我们通过全面取得最先进的结果来论证RoLLMs的实用性和高性能。我们公开发布所有资源(即数据、训练和评估代码、模型),以支持和鼓励对罗马尼亚语LLM的研究,同时形成一个可推广的方案,适用于其他低资源或资源较少的语言。

[NLP-29] Detecting Machine-Generated Texts: Not Just “AI vs Humans” and Explainability is Complicated
[NLP-29] 检测机器生成的文本:不仅仅是“人工智能与人类”,解释性很复杂

链接: https://arxiv.org/abs/2406.18259
作者: Jiazhou Ji,Ruizhe Li,Shujun Li,Jie Guo,Weidong Qiu,Zheng Huang,Chiyu Chen,Xiaoyu Jiang,Xinru Lu
关键词: increasing concerns arise, LLMs rapidly advance, rapidly advance, increasing concerns, real world
中文关键词: 越来越多的担忧出现,LLM迅速进步,迅速进步,越来越多的担忧,现实世界
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures

点击查看摘要

Abstract:As LLMs rapidly advance, increasing concerns arise regarding risks about actual authorship of texts we see online and in real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of considering LLM-generated text detection a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an “undecided” category for texts that could be attributed to either source, and we show that this new category is crucial to understand how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprised of texts from various LLMs and human authors. Based on new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in new ternary classification context. Our results highlight why “undecided” category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.
摘要:随着LLM的快速发展,人们越来越担心我们在网上和现实世界中看到的文本的实际作者身份所带来的风险。区分LLM创作的文本的任务因机器和人类微妙且相互重叠的行为而变得复杂。在本文中,我们质疑目前将LLM生成文本检测视为区分人类与人工智能的二分类任务的做法。相反,我们引入了一种新的三元文本分类方案,为可以归因于任一来源的文本添加了一个“未确定”类别,并且我们表明,这个新类别对于理解如何使检测结果更容易向非专业用户解释至关重要。这项研究将范式从单纯的分类转变为解释机器生成的文本,强调检测器需要向用户提供清晰易懂的解释。我们的研究涉及创建四个新的数据集,由来自不同LLM和人类作者的文本组成。基于新的数据集,我们进行了二分类测试,以确定最有效的SOTA检测方法,并识别出能够产生更难检测文本的SOTA LLM。我们构建了一个由两个表现最好的LLM和人类作者生成的文本组成的新数据集,并请三名人类标注员给出带有解释说明的三元标签。这个数据集被用来研究三个表现最好的SOTA检测器在新的三元分类情境中的行为。我们的结果凸显了为什么从可解释性的角度来看非常需要“未确定”类别。此外,我们分析了三个表现最好的检测器的可解释性以及人类标注员的解释说明,揭示了机器生成文本可解释检测的复杂性。最后,我们为开发具有更强解释能力的未来检测系统提出了指导方针。

[NLP-30] LLaMIPa: An Incremental Discourse Parser
[NLP-30] LLaMIPa:增量话语解析器

链接: https://arxiv.org/abs/2406.18256
作者: Kate Thompson,Akshay Chaturvedi,Julie Hunter,Nicholas Asher
关键词: large language model, Asher and Lascarides, style of SDRT, discourse parsing experiments, LLaMA Incremental Parser
中文关键词: 大语言模型,Asher和Lascarides,SDRT风格,话语解析实验,LLaMA增量解析器
类目: Computation and Language (cs.CL)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:This paper provides the first discourse parsing experiments with a large language model (LLM) finetuned on corpora annotated in the style of SDRT (Asher, 1993; Asher and Lascarides, 2003). The result is a discourse parser, LLaMIPa (LLaMA Incremental Parser), which is able to more fully exploit discourse context, leading to substantial performance gains over approaches that use encoder-only models to provide local, context-sensitive representations of discourse units. Furthermore, it is able to process discourse data incrementally, which is essential for the eventual use of discourse information in downstream tasks.
摘要:本文首次使用在按SDRT风格(Asher,1993;Asher和Lascarides,2003)标注的语料库上微调的大语言模型(LLM)进行话语解析实验。结果是一个话语解析器LLaMIPa(LLaMA增量解析器),它能够更充分地利用话语上下文,与使用仅编码器模型来提供话语单元的局部、上下文敏感表示的方法相比,带来了显著的性能提升。此外,它能够增量地处理话语数据,这对于最终在下游任务中使用话语信息至关重要。

[NLP-31] Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems
[NLP-31] 弱奖励模型将生成模型转化为稳健的因果事件提取系统

链接: https://arxiv.org/abs/2406.18245
作者: Italo Luis da Silva,Hanqi Yan,Lin Gui,Yulan He
关键词: effect boundaries poses, evaluating causal event, event extraction tasks, causal event extraction, inherent ambiguity
中文关键词: 效应边界构成、评估因果事件、事件提取任务、因果事件提取、固有歧义
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures, 6 tables

点击查看摘要

Abstract:The inherent ambiguity of cause and effect boundaries poses a challenge in evaluating causal event extraction tasks. Traditional metrics like Exact Match and BertScore poorly reflect model performance, so we trained evaluation models to approximate human evaluation, achieving high agreement. We used them to perform Reinforcement Learning with extraction models to align them with human preference, prioritising semantic understanding. We successfully explored our approach through multiple datasets, including transferring an evaluator trained on one dataset to another as a way to decrease the reliance on human-annotated data. In that vein, we also propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model while still achieving high performance in training an RL model. Our code is available at this https URL.
摘要:因果边界固有的模糊性给评估因果事件提取任务带来了挑战。Exact Match和BertScore等传统指标很难反映模型性能,因此我们训练评估模型以接近人类评估,从而实现了高度一致性。我们使用它们来执行具有提取模型的强化学习,以使它们与人类偏好保持一致,优先考虑语义理解。我们通过多个数据集成功探索了我们的方法,包括将在一个数据集上训练的评估者转移到另一个数据集,以减少对人类注释数据的依赖。本着这种精神,我们还提出了一种从弱到强的监督方法,该方法使用一小部分注释数据来训练评估模型,同时仍然在训练RL模型时实现高性能。我们的代码可在此 https URL 上找到。

[NLP-32] Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets
[NLP-32] 基于提示的零样本分类:基础模型时代德语推文的主题标注

链接: https://arxiv.org/abs/2406.18239
作者: Simon Münker,Kai Kugler,Achim Rettinger
关键词: Filtering and annotating, annotating textual data, annotating textual, Natural Language Processing, Filtering
中文关键词: 过滤和注释、注释文本数据、注释文本、自然语言处理、过滤
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 tables, 1 figure

点击查看摘要

Abstract:Filtering and annotating textual data are routine tasks in many areas, like social media or news analytics. Automating these tasks allows to scale the analyses wrt. speed and breadth of content covered and decreases the manual effort required. Due to technical advancements in Natural Language Processing, specifically the success of large foundation models, a new tool for automating such annotation processes by using a text-to-text interface given written guidelines without providing training samples has become available. In this work, we assess these advancements in-the-wild by empirically testing them in an annotation task on German Twitter data about social and political European crises. We compare the prompt-based results with our human annotation and preceding classification approaches, including Naive Bayes and a BERT-based fine-tuning/domain adaptation pipeline. Our results show that the prompt-based approach - despite being limited by local computation resources during the model selection - is comparable with the fine-tuned BERT but without any annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
摘要:过滤和标注文本数据是许多领域的常规任务,比如社交媒体或新闻分析。将这些任务自动化可以在速度和所覆盖内容的广度方面扩展分析规模,并减少所需的人工工作。由于自然语言处理方面的技术进步,特别是大型基础模型的成功,出现了一种新的工具,可以通过文本到文本的接口,依据书面指南而无需提供训练样本来自动化这类标注过程。在这项工作中,我们通过在关于欧洲社会和政治危机的德语推特数据的标注任务中进行实证测试,来评估这些进展在真实场景中的表现。我们将基于提示的结果与我们的人工标注以及先前的分类方法进行了比较,包括朴素贝叶斯和基于BERT的微调/领域适应流程。我们的结果表明,尽管在模型选择过程中受到本地计算资源的限制,基于提示的方法与微调的BERT相当,且不需要任何标注的训练数据。我们的发现强调了NLP领域正在进行的范式转变,即下游任务的统一以及不再需要预先标注的训练数据。

[NLP-33] GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
[NLP-33] GUIDE:面向教学视频理解的指南引导数据集

链接: https://arxiv.org/abs/2406.18227
作者: Jiafeng Liang,Shixin Jiang,Zekun Wang,Haojie Pan,Zerui Chen,Zheng Chu,Ming Liu,Ruiji Fu,Zhongyuan Wang,Bing Qin
关键词: specific steps, Internet, specific, steps, instructional
中文关键词: 具体步骤,互联网,具体,步骤,指导性
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: IJCAI 2024

点击查看摘要

Abstract:There are substantial instructional videos on the Internet, which provide us tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guide of guideline. We evaluate plenty of foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can be used as a better benchmark for instructional video comprehension.
摘要:互联网上有大量的教学视频,为我们完成各种任务提供了教程。现有的教学视频数据集只关注视频层面的具体步骤,缺乏任务层面的经验性指南,这可能会导致初学者由于缺乏相关经验而难以学习新任务。此外,没有指南的具体步骤琐碎且不成体系,很难提供清晰的教程。为了解决这些问题,我们提出了GUIDE(Guideline-Guided,指南引导)数据集,其中包含与我们日常生活相关的8个领域中560个教学任务的3.5K个视频。具体地说,我们为每个教学任务标注一条指南,代表所有与任务相关的视频共享的通用模式。在此基础上,我们标注了系统化的具体步骤,包括与其关联的指南步骤、具体步骤描述和时间戳。我们提出的基准包括三个子任务来评估模型的理解能力:(1)步骤字幕生成:模型必须为视频中的具体步骤生成字幕。(2)指南总结:模型必须挖掘任务相关视频中的通用模式,并从中总结出指南。(3)指南引导的字幕生成:模型必须在指南的引导下为具体步骤生成字幕。我们用GUIDE评估了大量基础模型,并进行了深入分析。鉴于GUIDE的多样性和实用性,我们相信它可以作为教学视频理解的更好基准。

[NLP-34] Enhancing Data Privacy in Large Language Models through Private Association Editing
[NLP-34] 通过私有关联编辑增强大型语言模型中的数据隐私

链接: https://arxiv.org/abs/2406.18221
作者: Davide Venditti,Elena Sofia Ruzzetti,Giancarlo A. Xompero,Cristina Giannone,Andrea Favalli,Raniero Romagnoli,Fabio Massimo Zanzotto
关键词: Large Language Models, Large Language, raises significant concerns, information raises significant, Private Association Editing
中文关键词: 大型语言模型,大型语言,引发重大问题,信息引发重大问题,私有关联编辑
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are powerful tools with extensive applications, but their tendency to memorize private information raises significant concerns as private data leakage can easily happen. In this paper, we introduce Private Association Editing (PAE), a novel defense approach for private data leakage. PAE is designed to effectively remove Personally Identifiable Information (PII) without retraining the model. Our approach consists of a four-step procedure: detecting memorized PII, applying PAE cards to mitigate memorization of private data, verifying resilience to targeted data extraction (TDE) attacks, and ensuring consistency in the post-edit LLMs. The versatility and efficiency of PAE, which allows for batch modifications, significantly enhance data privacy in LLMs. Experimental results demonstrate the effectiveness of PAE in mitigating private data leakage. We believe PAE will serve as a critical tool in the ongoing effort to protect data privacy in LLMs, encouraging the development of safer models for real-world applications.
摘要:大型语言模型(LLM)是功能强大、应用广泛的工具,但其记忆隐私信息的倾向引起了人们的极大关注,因为隐私数据很容易泄露。本文介绍了一种针对隐私数据泄露的新防御方法:私有关联编辑(PAE)。PAE旨在有效地删除个人身份信息(PII),而无需重新训练模型。我们的方法包括四个步骤:检测被记忆的PII,应用PAE卡来减轻对隐私数据的记忆,验证对定向数据提取(TDE)攻击的抵御能力,以及确保编辑后LLM的一致性。PAE的多功能性和效率(支持批量修改)显著增强了LLM中的数据隐私。实验结果证明了PAE在缓解隐私数据泄露方面的有效性。我们相信,PAE将成为持续保护LLM数据隐私的努力中的关键工具,并鼓励为现实世界的应用开发更安全的模型。

[NLP-35] A Closer Look into Mixture-of-Experts in Large Language Models
[NLP-35] 更仔细地研究大型语言模型中的专家混合

链接: https://arxiv.org/abs/2406.18219
作者: Ka Man Lo,Zeyu Huang,Zihan Qiu,Zili Wang,Jie Fu
关键词: gaining increasing attention, increasing attention due, gaining increasing, increasing attention, attention due
中文关键词: 获得越来越多的关注,由于越来越多的关注
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at this https URL.
摘要:专家混合(MoE)因其独特的性质和卓越的性能而受到越来越多的关注,尤其是在语言任务中。通过为每个token稀疏地激活一部分参数,MoE架构可以在不牺牲计算效率的情况下增大模型规模,在性能和训练成本之间实现更好的权衡。然而,MoE的内在机制仍然缺乏进一步的探索,其模块化程度仍然存疑。在本文中,我们初步尝试理解基于MoE的大型语言模型的内部工作原理。具体地说,我们全面研究了最近三个基于MoE的模型的参数特征和行为特征,并揭示了一些有趣的观察结果,包括:(1)神经元的行为类似于细粒度的专家。(2)MoE的路由器通常选择输出范数较大的专家。(3)专家多样性随层数加深而增加,而最后一层是一个例外。基于这些观察结果,我们还为广大MoE从业者提供了建议,例如路由器设计和专家分配方面的建议。我们希望这项工作能够为未来关于MoE框架和其他模块化架构的研究提供启发。代码可在此 https URL 上找到。
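
The sparse activation described in this abstract ("activating a subset of parameters for each token") can be illustrated with a minimal top-k routing sketch. This is not the code of the paper or of any of the three models it studies. The function names (`route_top_k`, `moe_forward`), the choice of k=2, and the renormalisation of the selected gate weights are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k experts with the highest gate probability.

    Assumption: the selected gate weights are renormalised to sum to 1.
    Real MoE models differ on this detail.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    # Four toy experts that just scale the input (hypothetical).
    toy_experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
    print(route_top_k([0.0, 2.0, 1.0, -1.0], k=2))
    print(moe_forward([1.0, 1.0], [0.0, 2.0, 1.0, -1.0], toy_experts, k=2))
```

Only k of the experts run per token, which is why parameter count can grow without a matching growth in per-token compute.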

[NLP-36] SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
[NLP-36] SEED:通过调度式推测解码加速推理树构建

链接: https://arxiv.org/abs/2406.18200
作者: Zhenglin Wang,Jialong Wu,Yilong Lai,Congzhi Zhang,Deyu Zhou
关键词: Large Language Models, Large Language, remarkable emergent abilities, planning tasks, demonstrate remarkable emergent
中文关键词: 大型语言模型,大型语言,非凡的涌现能力,规划任务,展示非凡的涌现
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short of complex reasoning and planning tasks. The tree-search-based reasoning methods address this by surpassing the capabilities of chain-of-thought prompting, encouraging exploration of intermediate steps. However, such methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SeeD, a novel and efficient inference framework to optimize runtime speed and GPU memory management concurrently. By employing a scheduled speculative execution, SeeD efficiently handles multiple iterations for the thought generation and the state evaluation, leveraging a rounds-scheduled strategy to manage draft model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate superior speedup performance of SeeD, providing a viable path for batched inference in training-free speculative decoding.
摘要:大型语言模型(LLM)在各种任务中表现出显著的涌现能力,但在复杂的推理和规划任务上仍有不足。基于树搜索的推理方法通过超越思维链提示的能力、鼓励探索中间步骤来解决这个问题。然而,由于需要系统地探索和评估多条思维路径,这些方法引入了显著的推理延迟。本文介绍了SeeD,一种新颖而高效的推理框架,可以同时优化运行速度和GPU显存管理。通过采用调度式推测执行,SeeD高效地处理思维生成和状态评估的多次迭代,并利用按轮次调度的策略来管理草稿模型的分派。在三个推理数据集上的大量实验评估表明,SeeD具有优越的加速性能,为免训练推测解码中的批量推理提供了一条可行的途径。
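
For readers unfamiliar with speculative decoding, the sketch below shows the basic greedy draft-and-verify loop that such methods build on. It does not implement SeeD's scheduling or its tree-search integration. The models are toy callables mapping a token sequence to the next token, and every name here (`speculative_decode`, `draft`, `target`, `k`) is hypothetical.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Reference decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: Sequence[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding with a cheap draft model.

    The draft proposes k tokens; the target checks them in order. Matching
    tokens are accepted, and the first mismatch is replaced by the target's
    own token. The result is identical to greedy decoding with the target.
    """
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # In a real system these target checks run as one batched forward pass.
        for proposed in proposal:
            expected = target(tokens)
            tokens.append(expected)
            if expected != proposed or len(tokens) >= limit:
                break
        else:
            # Every draft token was accepted: the target adds one bonus token.
            tokens.append(target(tokens))
    return tokens[:limit]


if __name__ == "__main__":
    def toy_target(ctx: Sequence[int]) -> int:
        return (sum(ctx) * 7 + 3) % 11

    def toy_draft(ctx: Sequence[int]) -> int:
        # Agrees with the target except at every third position.
        guess = toy_target(ctx)
        return guess if len(ctx) % 3 else (guess + 1) % 11

    print(speculative_decode(toy_target, toy_draft, [1, 2], 10))
    print(greedy_decode(toy_target, [1, 2], 10))
```

The speed-up in practice comes from verifying several draft tokens with a single pass of the large model, which the sequential toy loop above does not capture.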

[NLP-37] Methodology of Adapting Large English Language Models for Specific Cultural Contexts
[NLP-37] 根据特定文化背景调整大型英语语言模型的方法

链接: https://arxiv.org/abs/2406.18192
作者: Wenjing Zhang,Siqi Xiao,Xuejiao Lei,Ning Wang,Huazheng Zhang,Meijuan An,Bikun Yang,Zhaoxiang Liu,Kai Wang,Shiguo Lian
关键词: artificial intelligence, prominent trend, field of artificial, specific cultural, large language models
中文关键词: 人工智能,突出趋势,人工领域,特定文化,大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:The rapid growth of large language models(LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing the LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
摘要:大型语言模型(LLM)的快速发展已成为人工智能领域的一个突出趋势。然而,目前最先进的LLM主要基于英语。由于特定领域知识的不足以及文化价值观差异造成的误解,它们在直接应用于特定文化领域的任务时会遇到局限性。为了应对这一挑战,我们的论文提出了一种针对特定文化背景下大型模型的快速适应方法,该方法利用基于特定文化知识和安全价值观数据的指令微调。以中文为特定文化背景,以LLaMA3-8B作为实验用的英语LLM,评估结果表明,适应后的LLM显著增强了其在特定领域知识方面的能力和对安全价值观的适应性,同时保持了其原有的专业优势。

[NLP-38] Selective Prompting Tuning for Personalized Conversations with LLMs
[NLP-38] 面向LLM个性化对话的选择性提示微调

链接: https://arxiv.org/abs/2406.18187
作者: Qiushi Huang,Xubo Liu,Tom Ko,Bo Wu,Wenwu Wang,Yu Zhang,Lilian Tang
关键词: understanding is essential, profiles and contextual, contextual understanding, SPT, persona profiles
中文关键词: 理解至关重要,画像和上下文,上下文理解,SPT,人物画像
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2024 findings

点击查看摘要

Abstract:In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models’ (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose Selective Prompt Tuning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (this https URL) is publicly available for further exploration.
摘要:在对话式人工智能中,利用人物画像和上下文理解来实现对话个性化是必不可少的。尽管大型语言模型(LLM)提高了回复的连贯性,但有效的人物角色整合仍然是一个挑战。在这项工作中,我们首先研究了两种常见的LLM个性化方法:文本提示和直接微调。我们观察到,文本提示通常难以产生与数据集中的真实答案相似的回复,而直接微调往往会产生重复或过于笼统的回复。为了缓解这些问题,我们提出了选择性提示微调(Selective Prompt Tuning,SPT),它以选择性的方式对LLM进行软提示,以实现个性化对话。具体地,SPT初始化一组软提示,并使用可训练的稠密检索器根据不同的输入上下文自适应地为LLM选择合适的软提示,其中提示检索器通过来自LLM的反馈动态更新。此外,我们还提出了上下文-提示对比学习和提示融合学习,以促使SPT增强个性化对话的多样性。在CONVAI2数据集上的实验表明,SPT将回复多样性显著提高了多达90%,同时在其他关键性能指标上也有所改善。这些结果凸显了SPT在促进有吸引力的个性化对话生成方面的成效。SPT模型代码(此 https URL)已公开,可供进一步探索。

[NLP-39] UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
[NLP-39] UIO-LLM:长上下文LLM的无偏增量优化

链接: https://arxiv.org/abs/2406.18173
作者: Wenhao Li,Mingbao Lin,Yunshan Zhong,Shuicheng Yan,Rongrong Ji
关键词: large language models, Managing long texts, language models, due to limited, Managing long
中文关键词: 大型语言模型,管理长文本,语言模型,由于有限,管理长
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a context segment into memories and leverage these memories to predict outputs of the subsequent segment. Subsequently, by treating our memory-enhanced transformers as fully-connected recurrent neural networks (RNNs), we refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative incremental optimization techniques. These techniques not only diminish time complexity but also address the bias in gradient computation through an unbiased optimization process. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters, while keeping the inference cost nearly linear as context length increases.
摘要:由于上下文窗口大小有限,管理长文本对大型语言模型(LLM)来说是一项挑战。本研究提出了UIO-LLMs,一种用于长上下文设置下记忆增强Transformer的无偏增量优化方法。我们首先将该过程概念化为一个精简的编码器-解码器框架,其中权重共享的编码器和解码器分别将上下文片段封装为记忆,并利用这些记忆来预测后续片段的输出。随后,通过将我们的记忆增强Transformer视为全连接的循环神经网络(RNN),我们使用截断的时间反向传播(TBPTT)算法来改进训练过程,该算法结合了创新的增量优化技术。这些技术不仅降低了时间复杂度,而且通过无偏优化过程解决了梯度计算中的偏差。UIO-LLMs成功地处理了长上下文,例如仅用极少的2%额外参数就将Llama2-7b-chat的上下文窗口从4K扩展到100K个token,同时随着上下文长度的增加保持推理成本接近线性。

[NLP-40] NeBuLa: A discourse aware Minecraft Builder
[NLP-40] NeBuLa:具有话语意识的《我的世界》构建者

链接: https://arxiv.org/abs/2406.18164
作者: Akshay Chaturvedi,Kate Thompson,Nicholas Asher
关键词: humans efficiently exploit, humans efficiently, engaging in collaborative, efficiently exploit, exploit the semantic
中文关键词: 人类有效地利用,人类有效地,参与协作,有效地利用,利用语义
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:When engaging in collaborative tasks, humans efficiently exploit the semantic structure of a conversation to optimize verbal and nonverbal interactions. But in recent “language to code” or “language to action” models, this information is lacking. We show how incorporating the prior discourse and nonlinguistic context of a conversation situated in a nonlinguistic environment can improve the “language to action” component of such interactions. We fine tune an LLM to predict actions based on prior context; our model, NeBuLa, doubles the net-action F1 score over the baseline on this task of Jayannavar et al.(2020). We also investigate our model’s ability to construct shapes and understand location descriptions using a synthetic dataset.
摘要:当参与协作任务时,人类会有效地利用对话的语义结构来优化言语和非言语交互。但最近的“语言到代码”或“语言到行动”模型缺乏这种信息。我们展示了在处于非语言环境的对话中,结合先前的话语和非语言上下文如何能够改进此类交互中的“语言到行动”部分。我们微调了一个LLM,使其根据先前的上下文预测动作;我们的模型NeBuLa在Jayannavar等人(2020)的这项任务上,将净动作F1得分相对基线提高了一倍。我们还使用合成数据集研究了我们的模型构建形状和理解位置描述的能力。

[NLP-41] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
[NLP-41] LOOK-M:KV缓存中的Look-Once优化,以实现高效的多模态长上下文推理

链接: https://arxiv.org/abs/2406.18139
作者: Zhongwei Wan,Ziang Wu,Che Liu,Jinfa Huang,Zhihong Zhu,Peng Jin,Longyue Wang,Li Yuan
关键词: Large Language Models, Multimodal Large Language, Large Language, demand substantial computational, increasing input lengths
中文关键词: 大型语言模型、多模态大型语言、大型语言、需要大量计算、不断增加的输入长度
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs’ KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.
摘要:长上下文多模态大型语言模型(MLLM)在推理时需要大量的计算资源,因为它们的多模态键值(KV)缓存随着输入长度的增加而增长,这对内存和时间效率提出了挑战。与仅管理文本上下文的单模态LLM不同,长上下文MLLM的KV缓存包含来自具有时间和空间关系的多个图像以及相关文本上下文的表示。图像token占主导地位,这意味着针对LLM的KV缓存的传统优化不适用于多模态长上下文设置,并且此前没有工作解决这一挑战。在这项工作中,我们提出了LOOK-M,这是一种开创性的、无需微调的方法,它有效地减小了多模态KV缓存的大小,同时保持了与完整缓存相当的性能。我们观察到,在提示预填充过程中,模型对文本的注意力优先于图像特征;基于这一多模态交互观察,我们探索了一种新提出的文本优先方法来压缩KV缓存。此外,为了缓解图像上下文信息的退化,我们提出了几种利用KV对合并的补偿策略。LOOK-M表明,在显著减少KV缓存内存使用量(例如在某些情况下减少80%)的情况下,它不仅实现了高达1.5倍的解码加速,而且在各种长上下文多模态任务中保持甚至提升了性能。

[NLP-42] Automatic Speech Recognition for Hindi
[NLP-42] 印地语自动语音识别

链接: https://arxiv.org/abs/2406.18135
作者: Anish Saha,A.G. Ramakrishnan
关键词: convert spoken language, Automatic speech recognition, Automatic speech, key area, area in computational
中文关键词: 转换口语,自动语音识别,自动语音,关键领域,计算领域
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.
摘要:自动语音识别(ASR)是计算语言学中的一个关键领域,致力于开发使计算机能够将口语转换为文本的技术。该领域结合了语言学和机器学习。ASR模型通过监督学习将语音音频映射到转录文本,需要处理真实且不受限制的文本。文本到语音转换系统直接处理真实文本,而ASR系统依赖于在大型文本语料库上训练的语言模型。高质量的转录数据对于训练预测模型是必不可少的。这项研究涉及两个主要部分:开发Web应用程序和设计用于语音识别的Web界面。这个Web应用程序是用JavaScript和Node.js创建的,管理着大量的音频文件及其转录文本,便于人工协作校正ASR转录结果。它使用客户端-服务器架构实时运行。用于语音识别的Web界面从任何运行该Web应用程序的设备录制16 kHz单声道音频,执行语音活动检测(VAD),并将音频发送到识别引擎。VAD检测人类语音是否存在,有助于高效的语音处理,并减少非语音时段的不必要处理,从而节省VoIP应用中的计算和网络带宽。研究的最后阶段测试了一种神经网络,用于准确地将语音信号与隐马尔可夫模型(HMM)状态对齐。这包括实现一种新的反向传播方法,该方法利用节点共同激活的先验统计信息。
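
The abstract does not say which VAD algorithm the web interface uses. As a rough illustration of the idea (flag frames that contain speech so that silent intervals can be skipped), here is a minimal energy-threshold sketch over 16 kHz mono samples. The frame length, the -35 dB threshold, and the function names are assumptions; production VADs are usually considerably more elaborate.

```python
import math
from typing import List, Sequence


def frame_energy_db(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-10))  # clamp to avoid log(0) on silence


def detect_speech(
    samples: Sequence[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold_db: float = -35.0,  # hypothetical threshold, tune per microphone
) -> List[bool]:
    """Return one flag per full frame: True if the frame is loud enough to be speech."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags


if __name__ == "__main__":
    silence = [0.0] * 480  # 30 ms of silence at 16 kHz
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
    print(detect_speech(silence + tone))
```

Only the frames flagged True would need to be sent to the recognition engine, which is where the bandwidth and compute savings mentioned above come from.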

[NLP-43] Assessing “Implicit” Retrieval Robustness of Large Language Models
[NLP-43] 评估大型语言模型的“隐性”检索稳健性

链接: https://arxiv.org/abs/2406.18134
作者: Xiaoyu Shen,Rexhina Blloshmi,Dawei Zhu,Jiahuan Pei,Wei Zhang
关键词: Retrieval-augmented generation, large language models, external knowledge, language models, generation has gained
中文关键词: 检索增强生成、大型语言模型、外部知识、语言模型、生成已经获得
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation has gained popularity as a framework to enhance large language models with external knowledge. However, its effectiveness hinges on the retrieval robustness of the model. If the model lacks retrieval robustness, its performance is constrained by the accuracy of the retriever, resulting in significant compromises when the retrieved context is irrelevant. In this paper, we evaluate the “implicit” retrieval robustness of various large language models, instructing them to directly output the final answer without explicitly judging the relevance of the retrieved context. Our findings reveal that fine-tuning on a mix of gold and distracting context significantly enhances the model’s robustness to retrieval inaccuracies, while still maintaining its ability to extract correct answers when retrieval is accurate. This suggests that large language models can implicitly handle relevant or irrelevant retrieved context by learning solely from the supervision of the final answer in an end-to-end manner. Introducing an additional process for explicit relevance judgment can be unnecessary and disrupts the end-to-end approach.
摘要:检索增强生成作为一种利用外部知识增强大型语言模型的框架已经越来越流行。然而,它的有效性取决于模型的检索稳健性。如果模型缺乏检索稳健性,其性能会受到检索器准确性的限制,当检索到的上下文不相关时,性能会显著受损。在本文中,我们评估了各种大型语言模型的“隐式”检索稳健性,指示它们直接输出最终答案,而不显式地判断检索到的上下文的相关性。我们的发现表明,在正确(gold)上下文和干扰上下文的混合数据上进行微调,可以显著增强模型对检索错误的稳健性,同时仍然保持其在检索准确时提取正确答案的能力。这表明,大型语言模型仅通过以端到端的方式从最终答案的监督中学习,就可以隐式地处理相关或不相关的检索上下文。为显式的相关性判断引入额外的流程可能是不必要的,并且会破坏端到端的方法。

[NLP-44] ConvoCache: Smart Re-Use of Chatbot Responses
[NLP-44] ConvoCache:聊天机器人回复的智能复用

链接: https://arxiv.org/abs/2406.18133
作者: Conor Atkins,Ian Wood,Mohamed Ali Kaafar,Hassan Asghar,Nardine Basta,Michal Kepkowski
关键词: conversational caching system, conversational caching, caching system, system that solves, solves the problem
中文关键词: 对话式缓存系统,对话式缓存,缓存系统,解决问题的系统
类目: Computation and Language (cs.CL)
备注: Accepted to appear at Interspeech 2024

点击查看摘要

Abstract:We present ConvoCache, a conversational caching system that solves the problem of slow and expensive generative AI models in spoken chatbots. ConvoCache finds a semantically similar prompt in the past and reuses the response. In this paper we evaluate ConvoCache on the DailyDialog dataset. We find that ConvoCache can apply a UniEval coherence threshold of 90% and respond to 89% of prompts using the cache with an average latency of 214ms, replacing LLM and voice synthesis that can take over 1s. To further reduce latency we test prefetching and find limited usefulness. Prefetching with 80% of a request leads to a 63% hit rate, and a drop in overall coherence. ConvoCache can be used with any chatbot to reduce costs by reducing usage of generative AI by up to 89%.
摘要:我们介绍了ConvoCache,这是一种对话缓存系统,用于解决语音聊天机器人中生成式人工智能模型缓慢且昂贵的问题。ConvoCache在历史记录中查找语义相似的提示并复用其响应。在本文中,我们在DailyDialog数据集上评估了ConvoCache。我们发现,ConvoCache可以采用90%的UniEval连贯性阈值,并利用缓存响应89%的提示,平均延迟为214 ms,从而取代耗时可能超过1秒的LLM和语音合成。为了进一步降低延迟,我们测试了预取,发现其作用有限。以请求的80%进行预取可达到63%的命中率,但总体连贯性有所下降。ConvoCache可以与任何聊天机器人配合使用,通过将生成式人工智能的使用量最多减少89%来降低成本。
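
The lookup step described above (find a semantically similar past prompt, then reuse its response) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ConvoCache implementation: the toy embeddings, the `ResponseCache` class, and the cosine-similarity cut-off are all hypothetical, and the 90% figure in the abstract refers to a UniEval coherence threshold rather than a cosine threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ResponseCache:
    """Toy semantic cache holding (prompt embedding, response) pairs."""

    def __init__(self, threshold=0.9):
        # Hypothetical similarity cut-off; not a value taken from the paper.
        self.threshold = threshold
        self.entries = []

    def add(self, embedding, response):
        self.entries.append((embedding, response))

    def lookup(self, embedding):
        """Return the response of the most similar cached prompt, or None on a miss."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = ResponseCache(threshold=0.9)
cache.add([1.0, 0.0, 0.0], "I'm doing well, thanks!")
hit = cache.lookup([0.95, 0.05, 0.0])  # nearly parallel to the cached prompt
miss = cache.lookup([0.0, 1.0, 0.0])   # orthogonal to the cached prompt
```

On a miss the caller would fall back to the generative model and could then add the new prompt and response to the cache.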

[NLP-45] ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models
[NLP-45] ResumeAtlas:使用大规模数据集和大型语言模型重新审视简历分类

链接: https://arxiv.org/abs/2406.18125
作者: Ahmed Heakl,Youssef Mohamed,Noran Mohamed,Ali Sharkaway,Ahmed Zaky
关键词: recruitment platforms coupled, resume classification methods, efficient resume classification, increasing reliance, platforms coupled
中文关键词: 招聘平台耦合,简历分类方法,高效简历分类,增加依赖度,平台耦合
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 8 pages, 6 figures, 1 table, 6th International Conference on AI in Computational Linguistics

点击查看摘要

Abstract:The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92% and a top-5 accuracy of 97.5%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
摘要:对在线招聘平台的日益依赖,加上人工智能技术的采用,突显了对高效简历分类方法的迫切需求。然而,数据集规模小、缺乏标准化简历模板以及隐私问题等挑战,限制了现有分类模型的准确性和有效性。在这项工作中,我们提出了一种全面的简历分类方法来应对这些挑战。我们整理了一个包含13,389份来自不同来源的简历的大规模数据集,并使用BERT和Gemma1.1 2B等大型语言模型(LLM)进行分类。结果表明,与传统的机器学习方法相比有显著提升,我们的最优模型取得了92%的top-1准确率和97.5%的top-5准确率。这些研究结果强调了数据集质量和先进模型架构对于提高简历分类系统准确性和鲁棒性的重要性,从而推动了在线招聘实践领域的发展。

[NLP-46] Poisoned LangChain: Jailbreak LLMs by LangChain
[NLP-46] Poisoned LangChain:利用LangChain越狱大型语言模型

链接: https://arxiv.org/abs/2406.18122
作者: Ziqiu Wang,Jun Liu,Shengkai Zhang,Yang Yang
关键词: large language models, natural language processing, language models, large language, language
中文关键词: 大型语言模型、自然语言处理、语言模型、大型语言、语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages,2 figures,This paper is a submission to ACM TURC. It has been accepted by the editor of the organizer

点击查看摘要

Abstract:With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are integrating more into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending against LLMs are continuously evolving. One significant method type of attack is the jailbreak attack, which designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model’s lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we conduct the first work to propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues. We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.
摘要:随着自然语言处理(NLP)的发展,大型语言模型(LLM)变得越来越流行。LLM正在更多地融入日常生活,这引发了公众对其安全漏洞的担忧。因此,大型语言模型的安全性变得至关重要。目前,针对LLM的攻击与防御技术正在不断发展。越狱攻击是一类重要的攻击方法,旨在绕过模型的安全机制并诱导其生成不当内容。现有的越狱攻击主要依赖精心设计的诱导提示进行直接越狱,这对具有强大过滤能力和高理解能力的大型模型效果较差。鉴于对大型语言模型实时能力的需求日益增加,新知识的实时更新和迭代变得至关重要。检索增强生成(RAG)是一种弥补模型缺乏新知识的先进技术,正逐渐成为主流。由于RAG使模型能够利用外部知识库,它为越狱攻击提供了一条新的途径。在本文中,我们首次提出了间接越狱的概念,并通过LangChain实现了检索增强生成。在此基础上,我们进一步设计了一种新的间接越狱攻击方法,称为Poisoned-LangChain(PLC),它利用被投毒的外部知识库与大型语言模型交互,从而导致大型模型生成恶意的违规对话。我们在三大类越狱问题上对六个不同的大型语言模型测试了该方法。实验表明,PLC在三种不同的场景下成功实现了间接越狱攻击,成功率分别为88.56%、79.04%和82.69%。

[NLP-47] ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
[NLP-47] ArzEn-LLM:使用LLM的语码转换埃及阿拉伯语-英语翻译和语音识别

链接: https://arxiv.org/abs/2406.18120
作者: Ahmed Heakl,Youssef Zaghloul,Mennatullah Ali,Rania Hossam,Walid Gomaa
关键词: Egyptian Arabic, Egyptian Arabic recognition, translating code-switched Egyptian, code-switched Egyptian Arabic-English, automatic speech recognition
中文关键词: 埃及阿拉伯语、埃及阿拉伯语识别、翻译语码转换的埃及语、语码转换的埃及阿拉伯语-英语、自动语音识别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, 5 tables, 6th International Conference on AI in Computational Linguistics

点击查看摘要

Abstract:Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state-of-the-art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: this http URL, Models: this http URL.
摘要:鉴于近年来埃及阿拉伯语和英语之间的语码转换现象日益普遍,本文探讨了机器翻译(MT)和自动语音识别(ASR)系统的复杂性,重点是将语码转换的埃及阿拉伯语-英语翻译成英语或埃及阿拉伯语。我们的目标是介绍开发这些系统时所采用的方法,其中利用了LLama和Gemma等大型语言模型。在ASR领域,我们探索了利用Whisper模型识别语码转换的埃及阿拉伯语,并详细介绍了我们的实验过程,包括数据预处理和训练技术。通过实现一个将ASR与MT相结合的连续语音到文本翻译系统,我们旨在克服资源有限以及埃及阿拉伯语方言的独特特征所带来的挑战。基于既有指标的评估显示出可喜的结果:与现有最优方法相比,我们的方法在英语翻译上显著提升了56%,在阿拉伯语翻译上提升了9.3%。由于语码转换在口语中根深蒂固,ASR系统能够有效处理这一现象至关重要。这一能力对于在商务谈判、文化交流和学术讨论等各个领域实现无缝交互至关重要。我们的模型和代码以开源资源的形式提供。代码:此http URL,模型:此http URL。

[NLP-48] SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
[NLP-48] SafeAligner:通过响应差异引导针对越狱攻击的安全对齐

链接: https://arxiv.org/abs/2406.18118
作者: Caishuang Huang,Wanxu Zhao,Rui Zheng,Huijie Lv,Shihan Dou,Sixian Li,Xiao Wang,Enyu Zhou,Junjie Ye,Yuming Yang,Tao Gui,Qi Zhang,Xuanjing Huang
关键词: large language models, rapidly advances, area of research, development of large, large language
中文关键词: 大型语言模型,快速发展,研究领域,大型语言的开发
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.
摘要:随着大型语言模型的快速发展,在不影响其实用性的情况下有效地保护这些模型已成为一个关键的研究领域。然而,当前针对越狱攻击的防御策略(即绕过安全协议的努力)往往存在适应性有限、通用能力有限和成本较高的问题。为了应对这些挑战,我们引入了SafeAligner,这是一种在解码阶段实施的方法,用于加强对越狱攻击的防御。我们首先开发两个专门的模型:哨兵模型和入侵者模型,前者旨在促进安全,后者旨在产生更高风险的反应。SafeAligner利用这些模型响应之间的安全级别差异来区分有害令牌和有益令牌,通过更改目标模型的输出令牌分布有效地指导安全对齐。广泛的实验表明,SafeAligner可以增加有益令牌的可能性,同时减少有害令牌的发生,从而确保安全对齐,并将对一般性的损失降至最低。
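
The idea of altering the target model's output token distribution using the disparity between a safety-oriented model and a riskier model can be illustrated with a toy calculation. The update rule below (add a scaled difference, clip at zero, renormalize) and the `alpha` parameter are assumptions made for illustration; the abstract does not state the exact formula SafeAligner uses.

```python
def adjust_distribution(target, sentinel, intruder, alpha=1.0):
    """Shift target probabilities toward tokens the sentinel favors over the intruder.

    Assumed rule (hypothetical, not taken from the paper):
        p'(t) is proportional to max(p_target(t) + alpha * (p_sentinel(t) - p_intruder(t)), 0)
    """
    shifted = [
        max(t + alpha * (s - i), 0.0)
        for t, s, i in zip(target, sentinel, intruder)
    ]
    total = sum(shifted)
    if total == 0.0:
        # Degenerate case: fall back to the unmodified target distribution.
        return list(target)
    return [p / total for p in shifted]


# Two-token vocabulary: index 0 is a "beneficial" token, index 1 a "harmful" one.
target = [0.5, 0.5]
sentinel = [0.9, 0.1]   # safety-oriented model prefers token 0
intruder = [0.2, 0.8]   # riskier model prefers token 1
adjusted = adjust_distribution(target, sentinel, intruder, alpha=0.5)
```

In this toy case the probability of the token preferred by the sentinel goes up and the probability of the token preferred by the intruder goes down, which is the qualitative behavior the abstract describes.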

[NLP-49] BADGE: BADminton report Generation and Evaluation with LLM
[NLP-49] BADGE:使用LLM进行羽毛球(BADminton)报告生成与评估

链接: https://arxiv.org/abs/2406.18116
作者: Shang-Hsuan Chiang,Lin-Wei Chao,Kuang-Da Wang,Chih-Chuan Wang,Wen-Chih Peng
关键词: enjoys widespread popularity, matches generally include, Badminton enjoys widespread, Large Language Model, generally include details
中文关键词: 广受欢迎,比赛一般包括,羽毛球广受欢迎,大语言模型,一般包括细节
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted by IJCAI 2024 Workshop: The 2nd International Workshop on Intelligent Technologies for Precision Sports Science (IT4PSS)

点击查看摘要

Abstract:Badminton enjoys widespread popularity, and reports on matches generally include details such as player names, game scores, and ball types, providing audiences with a comprehensive view of the games. However, writing these reports can be a time-consuming task. This challenge led us to explore whether a Large Language Model (LLM) could automate the generation and evaluation of badminton reports. We introduce a novel framework named BADGE, designed for this purpose using LLM. Our method consists of two main phases: Report Generation and Report Evaluation. Initially, badminton-related data is processed by the LLM, which then generates a detailed report of the match. We tested different Input Data Types, In-Context Learning (ICL), and LLM, finding that GPT-4 performs best when using CSV data type and the Chain of Thought prompting. Following report generation, the LLM evaluates and scores the reports to assess their quality. Our comparisons between the scores evaluated by GPT-4 and human judges show a tendency to prefer GPT-4 generated reports. Since the application of LLM in badminton reporting remains largely unexplored, our research serves as a foundational step for future advancements in this area. Moreover, our method can be extended to other sports games, thereby enhancing sports promotion. For more details, please refer to this https URL.
摘要:羽毛球运动广受欢迎,比赛报道一般包括球员姓名、比分、球种等细节,为观众提供全面的比赛视角。然而,撰写这些报道可能十分耗时。这一挑战促使我们探索大型语言模型(LLM)能否自动生成和评估羽毛球报道。我们提出了一个名为BADGE的新框架,利用LLM实现这一目的。我们的方法包括两个主要阶段:报告生成和报告评估。首先,与羽毛球相关的数据由LLM处理,然后由其生成比赛的详细报告。我们测试了不同的输入数据类型、上下文学习(ICL)方式和LLM,发现GPT-4在使用CSV数据类型和思维链(Chain of Thought)提示时表现最好。报告生成之后,LLM对报告进行评估和打分,以衡量其质量。我们对GPT-4与人类评委给出的分数进行了比较,结果显示评分倾向于偏好GPT-4生成的报告。由于LLM在羽毛球报道中的应用在很大程度上尚未得到探索,我们的研究为该领域的未来发展奠定了基础。此外,我们的方法还可以推广到其他体育比赛,从而促进体育推广。有关更多详细信息,请参阅此https URL。

[NLP-50] Token-Weighted RNN-T for Learning from Flawed Data
[NLP-50] 用于从有缺陷的数据中学习的Token加权RNN-T

链接: https://arxiv.org/abs/2406.18108
作者: Gil Keren,Wei Zhou,Ozlem Kalinli
关键词: ASR models, target token sequence, models are commonly, commonly trained, increase the probability
中文关键词: ASR模型,目标令牌序列,模型是常见的,常见的训练,增加了概率
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights. The new objective is used for mitigating accuracy loss from transcriptions errors in the training data, which naturally appear in two settings: pseudo-labeling and human annotation errors. Experiments results show that using our method for semi-supervised learning with pseudo-labels leads to a consistent accuracy improvement, up to 38% relative. We also analyze the accuracy degradation resulting from different levels of WER in the reference transcription, and show that token-weighted RNN-T is suitable for overcoming this degradation, recovering 64%-99% of the accuracy loss.
摘要:ASR模型通常使用交叉熵准则进行训练,以提高目标token序列的概率。虽然优化目标序列中所有token的概率是合理的,但人们可能希望降低那些反映转录错误的token的权重。在这项工作中,我们提出了一种新的token加权RNN-T准则,通过token特定的权重来扩展RNN-T目标函数。新的目标函数用于减轻训练数据中的转录错误造成的准确率损失,这类错误自然出现在两种场景中:伪标签和人工标注错误。实验结果表明,将我们的方法用于基于伪标签的半监督学习可带来一致的准确率提升,相对提升最高可达38%。我们还分析了参考转录中不同WER水平导致的准确率下降,并表明token加权RNN-T适合克服这种下降,可恢复64%-99%的准确率损失。
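
The effect of token-specific weights on a sequence loss can be shown with a simplified example. This sketch assumes a plain per-token negative log-likelihood; the real RNN-T objective marginalizes over alignments, which is not reproduced here, and the weight values are hypothetical.

```python
import math


def token_weighted_nll(token_probs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log p_i.

    Simplified stand-in for a token-weighted sequence loss; it is not the
    RNN-T criterion itself, which sums over all alignments.
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    return -sum(w * math.log(p) for p, w in zip(token_probs, weights))


# Model probabilities for three target tokens; the third is suspected to be
# a transcription error, so the model (rightly) assigns it low probability.
probs = [0.9, 0.8, 0.1]
uniform_loss = token_weighted_nll(probs, [1.0, 1.0, 1.0])
downweighted_loss = token_weighted_nll(probs, [1.0, 1.0, 0.2])
```

Down-weighting the suspicious token shrinks its contribution to the loss, so training pushes less strongly toward reproducing the likely error.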

[NLP-51] Shimo Lab at “Discharge Me!”: Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections
[NLP-51] Shimo Lab在“Discharge Me!”:通过提示驱动的电子健康记录章节拼接生成出院小结

链接: https://arxiv.org/abs/2406.18094
作者: Yunzhen He,Hiroaki Yamagiwa,Hidetoshi Shimodaira
关键词: BioNLP Workshop, shared task, Discharge Instructions, EHR, present our approach
中文关键词: BioNLP研讨会、共享任务、出院说明、EHR、介绍我们的方法
类目: Computation and Language (cs.CL)
备注: BioNLP @ ACL2024

点击查看摘要

Abstract:In this paper, we present our approach to the shared task “Discharge Me!” at the BioNLP Workshop 2024. The primary goal of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Participants develop a pipeline to generate the “Brief Hospital Course” and “Discharge Instructions” sections from the EHR. Our approach involves a first step of extracting the relevant sections from the EHR. We then add explanatory prompts to these sections and concatenate them with separate tokens to create the input text. To train a text generation model, we perform LoRA fine-tuning on the ClinicalT5-large model. On the final test data, our approach achieved a ROUGE-1 score of 0.394 , which is comparable to the top solutions.
摘要:在本文中,我们介绍了我们在2024年BioNLP研讨会共享任务“Discharge Me!”中采用的方法。该任务的主要目标是减少临床医生在电子健康记录(EHR)中撰写详细记录所花费的时间和精力。参与者需要开发一个流水线,从EHR中生成“Brief Hospital Course”(简要住院经过)和“Discharge Instructions”(出院说明)两个部分。我们的方法首先从EHR中提取相关章节。然后,我们为这些章节添加解释性提示,并用分隔token将它们拼接起来,构成输入文本。为了训练文本生成模型,我们对ClinicalT5-large模型进行LoRA微调。在最终的测试数据上,我们的方法取得了0.394的ROUGE-1得分,与排名靠前的解决方案相当。

[NLP-52] LLM-Driven Multimodal Opinion Expression Identification
[NLP-52] LLM驱动的多模式意见表达识别

链接: https://arxiv.org/abs/2406.18088
作者: Bonian Jia,Huiyao Chen,Yueheng Sun,Meishan Zhang,Min Zhang
关键词: Opinion Expression Identification, essential in NLP, NLP for applications, Expression Identification, depression diagnosis
中文关键词: 意见表达识别,NLP中必不可少,应用NLP,表达识别,抑郁症诊断
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 Figures

点击查看摘要

Abstract:Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method STOEI, which combines speech and text modal to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves the performance while our method outperforms existing methods by 9.20% and obtains SOTA results.
摘要:从语音助手到抑郁症诊断,意见表达识别(OEI)在自然语言处理中是必不可少的。这项研究将OEI扩展到包括多通道输入,强调了听觉线索在传递文本能力之外的情感微妙方面的重要性。我们引入了一种新颖的多模式OEI(MOEI)任务,将文本和语音相结合以反映真实世界的场景。利用CMU MOSEI和IEMOCAP数据集,构建了CI-MOEI数据集。此外,将文本到语音(TTS)技术应用于MPQA数据集,以获得CIM-OEI数据集。我们为OEI任务设计了一个模板,以充分利用大型语言模型(LLM)的生成能力。进一步,我们提出了一种LLM驱动的STOEI方法,它结合了语音和文本模式来识别观点表达。实验表明,MOEI算法显著提高了性能,而我们的方法比已有方法提高了9.20%,并获得了SOTA结果。

[NLP-53] EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models
[NLP-53] 基于EHR的移动和Web平台使用大语言多模式模型进行慢性病风险预测

链接: https://arxiv.org/abs/2406.18087
作者: Chun-Chieh Liao,Wei-Ting Kuo,I-Hsuan Hu,Yen-Chen Shih,Jun-En Ding,Feng Liu,Fang-Ming Hung
关键词: involves in-person consultations, Traditional diagnosis, diseases involves in-person, Electronic Health Records, involves in-person
中文关键词: 涉及面对面咨询、传统诊断、疾病涉及面对面、电子健康记录、涉及面对面
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan’s hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital’s backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at this https URL.
摘要:慢性病的传统诊断需要与医生面对面问诊以识别疾病。然而,目前缺乏专注于利用临床记录和血液检测值进行预测并开发应用系统的研究。我们从台湾的医院数据库中收集了2017年至2021年间五年的电子健康记录(EHR)作为人工智能数据库。此外,我们利用大语言多模态模型(LLMM)开发了一个基于EHR的慢性病预测平台,并成功与前端Web和移动应用程序集成以进行预测。该预测平台还可以连接到医院的后端数据库,为医生提供实时风险评估诊断。演示链接可以在此https URL中找到。

[NLP-54] Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints
[NLP-54] 基于带知识约束的预训练语言模型的多语言知识图谱补全

链接: https://arxiv.org/abs/2406.18085
作者: Ran Song,Shizhu He,Shengxiang Gao,Li Cai,Kang Liu,Zhengtao Yu,Jun Zhao
关键词: Knowledge Graph Completion, Graph Completion, Multilingual Knowledge Graph, multilingual pretrained language, improving multilingual knowledge
中文关键词: 知识图谱完成,图谱完成,多语言知识图谱,多语言预训练语言,提高多语言知识
类目: Computation and Language (cs.CL)
备注: 11 pages, ACL 2023

点击查看摘要

Abstract:Multilingual Knowledge Graph Completion (mKGC) aim at solving queries like (h, r, ?) in different languages by reasoning a tail entity t thus improving multilingual knowledge graphs. Previous studies leverage multilingual pretrained language models (PLMs) and the generative paradigm to achieve mKGC. Although multilingual pretrained language models contain extensive knowledge of different languages, its pretraining tasks cannot be directly aligned with the mKGC tasks. Moreover, the majority of KGs and PLMs currently available exhibit a pronounced English-centric bias. This makes it difficult for mKGC to achieve good results, particularly in the context of low-resource languages. To overcome previous problems, this paper introduces global and local knowledge constraints for mKGC. The former is used to constrain the reasoning of answer entities, while the latter is used to enhance the representation of query contexts. The proposed method makes the pretrained model better adapt to the mKGC task. Experimental results on public datasets demonstrate that our method outperforms the previous SOTA on Hits@1 and Hits@10 by an average of 12.32% and 16.03%, which indicates that our proposed method has significant enhancement on mKGC.
摘要:多语言知识图谱补全(mKGC)旨在通过推理出尾实体t来回答不同语言中形如(h, r, ?)的查询,从而完善多语言知识图谱。以往的研究利用多语言预训练语言模型(PLM)和生成范式来实现mKGC。尽管多语言预训练语言模型包含不同语言的广泛知识,但其预训练任务无法与mKGC任务直接对齐。此外,目前可用的大多数KG和PLM都表现出明显的以英语为中心的偏向。这使得mKGC很难取得好的结果,在低资源语言的场景下尤其如此。为了克服上述问题,本文为mKGC引入了全局和局部知识约束。前者用于约束答案实体的推理,后者用于增强查询上下文的表示。所提出的方法使预训练模型更好地适应mKGC任务。在公共数据集上的实验结果表明,我们的方法在Hits@1和Hits@10上分别比此前的SOTA平均高出12.32%和16.03%,这表明我们提出的方法对mKGC有显著的提升。
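
The Hits@1 and Hits@10 figures quoted above are instances of the standard Hits@k metric: the fraction of (h, r, ?) queries whose gold tail entity appears among the top k ranked predictions. A small sketch with made-up entities (the data below is illustrative only, not from the paper):

```python
def hits_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries whose gold entity is within the top-k predictions."""
    hits = sum(
        1
        for predictions, gold in zip(ranked_predictions, gold_answers)
        if gold in predictions[:k]
    )
    return hits / len(gold_answers)


# Each inner list is a model's ranked tail-entity candidates for one query.
ranked = [
    ["Paris", "Lyon", "Nice"],
    ["Bonn", "Berlin", "Munich"],
    ["Oslo", "Rome", "Milan"],
]
gold = ["Paris", "Berlin", "Madrid"]

hits_1 = hits_at_k(ranked, gold, 1)  # only the first query is correct at rank 1
hits_3 = hits_at_k(ranked, gold, 3)  # the second query is recovered within the top 3
```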

[NLP-55] Octo-planner: On-device Language Model for Planner-Action Agents
[NLP-55] Octo-planner:面向规划器-动作智能体的设备端语言模型

链接: https://arxiv.org/abs/2406.18082
作者: Wei Chen,Zhiyuan Li,Zhen Guo,Yikang Shen
关键词: enabling autonomous decision-making, enabling autonomous, decision-making and problem-solving, increasingly significant, autonomous decision-making
中文关键词: 实现自主决策,实现自主决策和解决问题,日益重要的自主决策
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI agents have become increasingly significant in various domains, enabling autonomous decision-making and problem-solving. To function effectively, these agents require a planning process that determines the best course of action and then executes the planned actions. In this paper, we present an efficient on-device Planner-Action framework that separates planning and action execution into two distinct components: a planner agent based on Phi-3 Mini, a 3.8 billion parameter LLM optimized for edge devices, and an action agent using the Octopus model for function execution. The planner agent first responds to user queries by decomposing tasks into a sequence of sub-steps, which are then executed by the action agent. To optimize performance on resource-constrained devices, we employ model fine-tuning instead of in-context learning, reducing computational costs and energy consumption while improving response times. Our approach involves using GPT-4 to generate diverse planning queries and responses based on available functions, with subsequent validations to ensure data quality. We fine-tune the Phi-3 Mini model on this curated dataset, achieving a 97% success rate in our in-domain test environment. To address multi-domain planning challenges, we developed a multi-LoRA training method that merges weights from LoRAs trained on distinct function subsets. This approach enables flexible handling of complex, multi-domain queries while maintaining computational efficiency on resource-constrained devices. To support further research, we have open-sourced our model weights at this https URL. For the demo, please refer to this https URL.
摘要:人工智能智能体在各个领域中变得越来越重要,能够实现自主决策和问题解决。为了有效地发挥作用,这些智能体需要一个规划过程,先确定最佳行动方案,再执行所规划的动作。在本文中,我们提出了一种高效的设备端规划器-动作(Planner-Action)框架,它将规划和动作执行分成两个不同的组件:一个基于Phi-3 Mini的规划智能体(Phi-3 Mini是针对边缘设备优化的38亿参数LLM),以及一个使用Octopus模型执行函数的动作智能体。规划智能体首先通过将任务分解成一系列子步骤来响应用户查询,这些子步骤随后由动作智能体执行。为了在资源受限的设备上优化性能,我们使用模型微调而不是上下文学习,从而降低了计算成本和能耗,同时缩短了响应时间。我们的方法使用GPT-4根据可用函数生成多样化的规划查询和响应,并进行后续验证以确保数据质量。我们在这个精选的数据集上微调了Phi-3 Mini模型,在我们的域内测试环境中实现了97%的成功率。为了应对多领域规划的挑战,我们开发了一种multi-LoRA训练方法,将在不同函数子集上训练的LoRA的权重合并。这种方法能够灵活处理复杂的多领域查询,同时在资源受限的设备上保持计算效率。为了支持进一步的研究,我们在此https URL上开源了我们的模型权重。有关演示,请参阅此https URL。

[NLP-56] Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction
[NLP-56] 使用伪标签评分器的自训练用于方面情感四元组预测

链接: https://arxiv.org/abs/2406.18078
作者: Yice Zhang,Jie Zeng,Weiming Hu,Ziyi Wang,Shiwei Chen,Ruifeng Xu
关键词: Sentiment Quad Prediction, Aspect Sentiment Quad, aspect-based sentiment analysis, Quad Prediction, Sentiment Quad
中文关键词: 情绪四元预测、方面情绪四元、基于方面的情绪分析、四元预测、情绪四元
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 Main Conference

点击查看摘要

Abstract:Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer’s effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We release our code and data at this https URL .
摘要:方面情感四项预测(ASQP)旨在预测某一评论的所有四项内容(方面项、方面类别、意见项、情感极性),这是基于方面的情感分析中最具代表性和挑战性的任务。ASQP任务中的一个关键挑战是标记数据的稀缺,这限制了现有方法的性能。为了解决这一问题,我们提出了一个带有伪标签计分器的自我训练框架,其中计分者评估评论与其伪标签之间的匹配,旨在过滤不匹配,从而提高自我训练的有效性。我们强调了两个关键方面来确保记分器的有效性和可靠性:训练数据集的质量及其模型架构。为此,我们创建了一个人类注释的比较数据集,并使用基于排名的目标在其上训练生成模型。在公共ASQP数据集上的广泛实验表明,使用我们的记分器可以极大地并持续地提高自我训练的有效性。此外,我们还探索了用大型语言模型代替人类进行比较数据集标注的可能性,并通过实验证明了其可行性。我们在这个HTTPS URL发布我们的代码和数据。

[NLP-57] Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification
[NLP-57] 探索基于能量的模型在方言识别分布外检测中的应用

链接: https://arxiv.org/abs/2406.18067
作者: Yaqian Hao,Chenguang Hu,Yingying Gao,Shilei Zhang,Junlan Feng
关键词: specific linguistic patterns, dialects presents challenges, linguistic patterns, rendering them susceptible, diverse nature
中文关键词: 特定的语言模式、方言带来了挑战、语言模式,使它们容易受到影响、多样性
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The diverse nature of dialects presents challenges for models trained on specific linguistic patterns, rendering them susceptible to errors when confronted with unseen or out-of-distribution (OOD) data. This study introduces a novel margin-enhanced joint energy model (MEJEM) tailored specifically for OOD detection in dialects. By integrating a generative model and the energy margin loss, our approach aims to enhance the robustness of dialect identification systems. Furthermore, we explore two OOD scores for OOD dialect detection, and our findings conclusively demonstrate that the energy score outperforms the softmax score. Leveraging Sharpness-Aware Minimization to optimize the training process of the joint model, we enhance model generalization by minimizing both loss and sharpness. Experiments conducted on dialect identification tasks validate the efficacy of Energy-Based Models and provide valuable insights into their performance.
摘要:方言的多样性给在特定语言模式上训练的模型带来了挑战,使它们在面对未见过的或分布外(OOD)数据时容易出错。这项研究提出了一种专门为方言OOD检测量身定制的新型间隔增强联合能量模型(MEJEM)。通过结合生成模型和能量间隔损失,我们的方法旨在增强方言识别系统的鲁棒性。此外,我们探索了用于OOD方言检测的两种OOD分数,研究结果明确表明能量分数优于softmax分数。我们利用锐度感知最小化(Sharpness-Aware Minimization)来优化联合模型的训练过程,通过同时最小化损失和锐度来增强模型的泛化能力。在方言识别任务上进行的实验验证了基于能量的模型的有效性,并为其性能提供了有价值的见解。
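
The two OOD scores compared above can be contrasted on toy logits. The sketch assumes the commonly used definitions (energy E(x) = -logsumexp(logits), and the maximum softmax probability); the abstract does not spell out its exact formulas, and the logit values are invented. The example shows one well-known difference: softmax is invariant to adding a constant to all logits, while the energy score is not.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def energy_score(logits):
    """Negative energy, -E(x) = logsumexp(logits); higher suggests in-distribution."""
    return logsumexp(logits)


def max_softmax_score(logits):
    """Maximum softmax probability; higher suggests in-distribution."""
    return math.exp(max(logits) - logsumexp(logits))


# Same relative logits, but the second input has uniformly lower magnitude.
strong_logits = [5.0, 1.0, 1.0]
weak_logits = [0.0, -4.0, -4.0]

softmax_gap = max_softmax_score(strong_logits) - max_softmax_score(weak_logits)
energy_gap = energy_score(strong_logits) - energy_score(weak_logits)
```

Here the softmax score cannot tell the two inputs apart, whereas the energy score differs by the size of the shift.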

[NLP-58] Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need
[NLP-58] 评估检索增强生成的答案质量:一个强大的LLM就是你所需要的全部

链接: https://arxiv.org/abs/2406.18064
作者: Yang Wang,Alberto Garcia Hernandez,Roman Kyslyi,Nicholas Kersting
关键词: Retrieval-Augmented Generation, assess correctness, present a comprehensive, designed to assess, Large Language Models
中文关键词: 检索增强生成,评估正确性,呈现全面的、旨在评估的大型语言模型
类目: Computation and Language (cs.CL)
备注: 12 pages, 6 figures, 12 tables

点击查看摘要

Abstract:We present a comprehensive evaluation of answer quality in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system that is designed to assess correctness, completeness, and honesty. We further map the grading of quality aspects aforementioned into a binary score, indicating an accept or reject decision, mirroring the intuitive “thumbs-up” or “thumbs-down” gesture commonly used in chat applications. This approach suits factual business settings where a clear decision opinion is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find a substantial alignment between GPT-4’s assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
摘要:我们使用vRAG-Eval对检索增强生成(RAG)应用中的答案质量进行了全面评估。vRAG-Eval是一种新的评分系统,旨在评估正确性、完整性和诚实性。我们进一步将上述各质量方面的评分映射为一个二元分数,表示接受或拒绝的决定,对应于聊天应用程序中常用的直观的“点赞”或“点踩”手势。这种方法适合注重事实的业务场景,在这些场景中,明确的决策意见至关重要。我们的评估将vRAG-Eval应用于两个大型语言模型(LLM),评估由一个基础RAG应用程序生成的答案的质量。我们将这些评估与人类专家的判断进行了比较,发现GPT-4的评估与人类专家的评估高度一致,在接受或拒绝的决定上达到了83%的一致率。这项研究凸显了LLM在封闭领域、封闭式场景中作为可靠评估者的潜力,在人工评估需要大量资源时尤其如此。
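
The mapping from graded scores to an accept/reject decision, and the agreement rate between an LLM grader and human experts, can be sketched as below. The 1-to-5 scale, the `accept_at` cut-off, and the sample grades are assumptions for illustration; the abstract does not specify the scale or threshold used by vRAG-Eval.

```python
def to_binary(grade, accept_at=4):
    """Map a graded score to accept (True) or reject (False).

    The 1-5 scale and the cut-off of 4 are hypothetical.
    """
    return grade >= accept_at


def agreement_rate(llm_grades, human_grades, accept_at=4):
    """Share of answers on which both graders reach the same accept/reject decision."""
    matches = sum(
        to_binary(a, accept_at) == to_binary(b, accept_at)
        for a, b in zip(llm_grades, human_grades)
    )
    return matches / len(llm_grades)


llm_grades = [5, 4, 2, 3, 5]
human_grades = [4, 5, 3, 4, 5]
rate = agreement_rate(llm_grades, human_grades)  # graders disagree only on the 4th answer
```

Note that graders can differ in raw grade (5 versus 4) and still agree on the binary decision, which is why agreement is measured after the mapping.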

[NLP-59] AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning
[NLP-59] AdaZeta:用于内存高效的大型语言模型微调的自适应零阶Tensor-Train适配

链接: https://arxiv.org/abs/2406.18060
作者: Yifan Yang,Kai Zhen,Ershad Banijamal,Athanasios Mouchtaris,Zheng Zhang
关键词: natural language processing, Fine-tuning large language, language processing tasks, large language models, achieved remarkable performance
中文关键词: 自然语言处理、微调大型语言、语言处理任务、大型语言模型,取得了显着的性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
摘要:微调大型语言模型(LLM)在各种自然语言处理任务中取得了显著的性能,但随着模型规模不断增长,它需要越来越多的内存。为了解决这个问题,最近提出的内存高效零阶(MeZO)方法尝试仅使用前向传播来微调LLM,从而避免了对反向传播计算图的需求。然而,显著的性能下降和较高的发散风险限制了它们的广泛采用。在本文中,我们提出了自适应零阶Tensor-Train适配(AdaZeta)框架,专门用于改善ZO方法的性能和收敛性。为了提高与维度相关的ZO估计精度,我们引入了一种快速前向、低参数量的张量化适配器。针对大规模ZO微调任务中经常出现的发散问题,我们提出了一种保证收敛的自适应查询次数调度方案。在Roberta-Large和Llama-2-7B模型上的详细理论分析和大量实验结果证实了AdaZeta框架在准确率、内存效率和收敛速度方面的有效性。
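
The forward-pass-only idea behind zeroth-order (ZO) fine-tuning can be illustrated with the standard two-point gradient estimator, averaged over several random directions ("queries"). This is a generic sketch on a toy quadratic loss, not the AdaZeta method: the tensorized adapter and the adaptive query schedule from the paper are not reproduced, and `zo_gradient`, `eps`, and `num_queries` are names chosen here for illustration.

```python
import random


def zo_gradient(loss_fn, params, eps=1e-3, num_queries=1, rng=None):
    """Two-point zeroth-order gradient estimate using only loss evaluations.

    For each random direction z:
        g ~= (L(theta + eps*z) - L(theta - eps*z)) / (2*eps) * z
    and the estimates are averaged over num_queries directions.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(params)
    for _ in range(num_queries):
        z = [rng.gauss(0.0, 1.0) for _ in params]
        loss_plus = loss_fn([p + eps * d for p, d in zip(params, z)])
        loss_minus = loss_fn([p - eps * d for p, d in zip(params, z)])
        scale = (loss_plus - loss_minus) / (2.0 * eps)
        for i, d in enumerate(z):
            grad[i] += scale * d / num_queries
    return grad


def quadratic_loss(theta):
    """Toy loss with known gradient 2 * theta."""
    return sum(t * t for t in theta)


theta = [1.0, -2.0]
estimate = zo_gradient(quadratic_loss, theta, num_queries=2000, rng=random.Random(0))
```

Using more queries per step reduces the variance of the estimate at the cost of more forward passes, which is the trade-off a query-number schedule would control.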

[NLP-60] Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources
[NLP-60] 使用深度学习和微调大型语言模型的集成改进实体识别:从多个来源提取不良事件的案例研究

链接: https://arxiv.org/abs/2406.18049
作者: Yiming Li,Deepthi Viswaroopan,William He,Jianfu Li,Xu Zuo,Hua Xu,Cui Tao
关键词: Adverse event, Traditional deep learning, deep learning models, deep learning, Traditional deep
中文关键词: 不良事件、传统深度学习、深度学习模型、深度学习、传统深度学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Adverse event (AE) extraction following COVID-19 vaccines from text data is crucial for monitoring and analyzing the safety profiles of immunizations. Traditional deep learning models are adept at learning intricate feature representations and dependencies in sequential data, but often require extensive labeled data. In contrast, large language models (LLMs) excel in understanding contextual information, but exhibit unstable performance on named entity recognition tasks, possibly due to their broad but unspecific training. This study aims to evaluate the effectiveness of LLMs and traditional deep learning models in AE extraction, and to assess the impact of ensembling these models on performance. In this study, we utilized reports and posts from the VAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal was to extract three types of entities: “vaccine”, “shot”, and “ae”. We explored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5, GPT-4, and Llama-2, as well as traditional deep learning models like RNN and BioBERT. To enhance performance, we created ensembles of the three models with the best performance. For evaluation, we used strict and relaxed F1 scores to evaluate the performance for each entity type, and micro-average F1 was used to assess the overall performance. The ensemble model achieved the highest performance in “vaccine”, “shot”, and “ae” with strict F1-scores of 0.878, 0.930, and 0.925, respectively, along with a micro-average score of 0.903. In conclusion, this study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs, for extracting AE-related information. This study contributes to the advancement of biomedical natural language processing, providing valuable insights into improving AE extraction from text data for pharmacovigilance and public health surveillance.
摘要:从文本数据中提取新冠肺炎疫苗接种后的不良事件(AE),对于监测和分析疫苗接种的安全性至关重要。传统的深度学习模型擅长学习序列数据中复杂的特征表示和依赖关系,但往往需要大量的标注数据。相比之下,大型语言模型(LLM)在理解上下文信息方面表现出色,但在命名实体识别任务中表现出不稳定的性能,这可能是因为它们的训练广泛但不够有针对性。本研究旨在评估LLM和传统深度学习模型在AE提取中的有效性,并评估集成这些模型对性能的影响。在本研究中,我们使用了来自VAERS(n=621)、Twitter(n=9,133)和Reddit(n=131)的报告和帖子作为语料库。我们的目标是提取三种类型的实体:“vaccine”(疫苗)、“shot”(接种)和“ae”(不良事件)。我们探索并微调了(GPT-4除外)多个LLM,包括GPT-2、GPT-3.5、GPT-4和Llama-2,以及RNN和BioBERT等传统深度学习模型。为了提高性能,我们对性能最好的三个模型进行了集成。在评估中,我们使用严格和宽松的F1分数来评估每种实体类型的表现,并使用微平均F1来评估整体表现。集成模型在“vaccine”、“shot”和“ae”上取得了最高的性能,严格F1分数分别为0.878、0.930和0.925,微平均分数为0.903。总之,这项研究证明了集成微调后的传统深度学习模型和LLM在提取AE相关信息方面的有效性和稳健性。这项研究有助于生物医学自然语言处理的进步,为改进面向药物警戒和公共卫生监测的文本AE提取提供了有价值的见解。
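
The abstract reports per-entity F1 scores alongside a micro-average F1. As a reminder of how the two relate, here is a minimal sketch. The counts below are made-up toy numbers, not data from the paper, and the helper name `f1` is an assumption. Micro-averaging pools true positives, false positives, and false negatives across entity types before computing a single F1.

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts per entity type, for illustration only.
counts = {
    "vaccine": (80, 10, 20),
    "shot": (40, 4, 6),
    "ae": (150, 25, 20),
}

# Per-entity F1: each entity type is scored on its own counts.
per_type = {name: f1(*c) for name, c in counts.items()}

# Micro-average F1: pool the counts over all entity types, then score once.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro = f1(total_tp, total_fp, total_fn)
```

Since the counts are pooled, frequent entity types weigh more heavily in the micro-average than rare ones.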

[NLP-61] PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry
[NLP-61] PharmGPT:生物制药和化学领域特定大型语言模型

链接: https://arxiv.org/abs/2406.18045
作者: Linqing Chen,Weilei Wang,Zilong Bai,Peng Xu,Yan Fang,Jie Fang,Wentao Wu,Lizhi Zhou,Ruiji Zhang,Yubin Xia,Chaobo Xu,Ran Hu,Licong Xu,Qijun Cai,Haoran Hua,Jing Sun,Jin Liu,Tian Qiu,Haowen Liu,Meng Hu,Xiuwen Li,Fei Gao,Yufu Wang,Lin Tie,Chaochao Wang,Jianping Lu,Cheng Sun,Yixin Wang,Shengjie Yang,Yuancheng Li,Lu Jin,Lisha Zhang,Fu Bian,Changyang Tu
关键词: Natural Language Processing, revolutionized Natural Language, complex feature engineering, revolutionized Natural, Large language models
中文关键词: 自然语言处理、革命性的自然语言、复杂特征工程、革命性的自然、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, this suggests a promising path for enhanced research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains.
摘要:大型语言模型(LLM)通过最大限度地减少对复杂特征工程的需求,使自然语言处理(NLP)发生了革命性的变化。然而,LLM在生物制药和化学等专业领域的应用在很大程度上仍未被探索。这些领域的特点是术语复杂、知识专业、对精度要求高,而这些正是通用LLM往往力有不逮之处。在这项研究中,我们介绍了PharmGPT,这是一套具有130亿和700亿参数的多语言LLM,专门在一个面向生物制药和化学领域、包含数千亿词元的综合语料库上训练而成。我们的评估显示,PharmGPT在NAPLEX等关键基准上达到或超过了现有的通用模型,显示了其在特定领域任务中的卓越能力。这一进展为生物制药和化学领域的LLM建立了一个新的基准,填补了专业语言建模方面的现有空白。此外,这为加强这些专业领域的研究和开发提供了一条很有希望的途径,为NLP在专业领域更精确、更有效的应用铺平了道路。

[NLP-62] LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors Not Replace Them
[NLP-62] 医生的LLM:利用医学LLM来协助医生而不是取代他们

链接: https://arxiv.org/abs/2406.18034
作者: Wenya Xie,Qingying Xiao,Yu Zheng,Xidong Wang,Junying Chen,Ke Ji,Anningzhe Gao,Xiang Wan,Feng Jiang,Benyou Wang
关键词: Large Language Models, Large Language, success of Large, Language Models, healthcare field
中文关键词: 大型语言模型,大型语言,大型的成功,语言模型,医疗保健领域
类目: Computation and Language (cs.CL)
备注:


Abstract:The recent success of Large Language Models (LLMs) has had a significant impact on the healthcare field, providing patients with medical advice, diagnostic information, and more. However, due to a lack of professional medical knowledge, patients are easily misled by generated erroneous information from LLMs, which may result in serious medical problems. To address this issue, we focus on tuning the LLMs to be medical assistants who collaborate with more experienced doctors. We first conduct a two-stage survey by inspiration-feedback to gain a broad understanding of the real needs of doctors for medical assistants. Based on this, we construct a Chinese medical dataset called DoctorFLAN to support the entire workflow of doctors, which includes 92K Q&A samples from 22 tasks and 27 specialists. Moreover, we evaluate LLMs in doctor-oriented scenarios by constructing the DoctorFLAN-test containing 550 single-turn Q&A and DotaBench containing 74 multi-turn conversations. The evaluation results indicate that being a medical assistant still poses challenges for existing open-source models, but DoctorFLAN can help them significantly. It demonstrates that the doctor-oriented dataset and benchmarks we construct can complement existing patient-oriented work and better promote medical LLMs research.
摘要:最近大型语言模型(LLM)的成功对医疗保健领域产生了重大影响,为患者提供了医疗建议、诊断信息等。然而,由于缺乏专业的医学知识,患者容易被LLM生成的错误信息误导,这可能会导致严重的医疗问题。为了解决这个问题,我们专注于将LLM调整为与更有经验的医生协作的医疗助理。我们首先通过“启发-反馈”的方式进行了两个阶段的调查,以广泛了解医生对医疗助理的真实需求。在此基础上,我们构建了一个支持医生整个工作流程的中文医学数据集DoctorFLAN,其中包括来自22个任务和27个专科的92K个问答样本。此外,我们通过构建包含550个单轮问答的DoctorFLAN-test和包含74个多轮对话的DotaBench,在面向医生的场景中评估LLM。评估结果表明,担任医疗助理对现有的开源模型仍然构成挑战,但DoctorFLAN可以显著帮助它们。这表明我们构建的面向医生的数据集和基准可以补充现有的面向患者的工作,并更好地促进医学LLM研究。

[NLP-63] Automated Clinical Data Extraction with Knowledge Conditioned LLMs
[NLP-63] 使用知识条件LLM自动化临床数据提取

链接: https://arxiv.org/abs/2406.18027
作者: Diya Li,Asim Kadav,Aijing Gao,Rui Li,Richard Bourgon
关键词: medical imaging reports, lung-related diseases, medical imaging, crucial for research, care of lung-related
中文关键词: 医学成像报告、肺部相关疾病、医学成像、对研究至关重要、肺部相关的护理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by addressing the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of lesion description text into additional structured fields. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.
摘要:从临床和医学影像报告中提取肺部病变信息对于肺部相关疾病的研究和临床护理至关重要。大型语言模型(LLM)可以有效地解释报告中的非结构化文本,但由于缺乏特定领域的知识,它们经常产生幻觉,导致准确性降低,并对临床环境中的使用构成挑战。为了解决这一问题,我们提出了一种新的框架,通过上下文学习(ICL)将生成的内部知识与外部知识对齐。该框架使用一个检索器来识别内部或外部知识的相关单元,并使用一个评分器来评估检索到的内部知识规则的真实性和有用性,以对齐和更新知识库。我们的知识条件化方法还通过分两个阶段处理提取任务来提高LLM输出的准确性和可靠性:(i)肺部病变发现检测和主要结构化字段解析,随后(ii)进一步将病变描述文本解析为附加的结构化字段。在专家整理的测试数据集上进行的实验表明,这种ICL方法可以将关键字段(病变大小、边缘和实性)的F1分数比现有的ICL方法平均提高12.9%。

[NLP-64] Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
[NLP-64] 在有限的教师监督下进行解码需要了解何时信任教师

链接: https://arxiv.org/abs/2406.18002
作者: Hyunjong Ok,Jegwang Ryu,Jaeho Lee
关键词: sLLMs efficiently utilize, generative quality, improve their generative, efficiently utilize, LLM
中文关键词: sLLM有效利用、生成质量,提高其生成、高效利用、LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint


Abstract:How can sLLMs efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving birth to many decoding algorithms that utilize supervision without further training. However, it is still unclear what is an effective strategy under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm to effectively aggregate the sLLM and LLM predictions on initial tokens so that the generated tokens can more accurately condition the subsequent token generation by sLLM only. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through our experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.
摘要:sLLM如何高效利用LLM的监督来提高其生成质量?这个问题在不限制可用LLM监督次数的场景中已经得到了充分研究,并由此产生了许多无需进一步训练即可利用监督的解码算法。然而,在有限监督场景下什么才是有效的策略仍不清楚,在该场景中我们假设LLM最多只能生成少量词元。为此,我们开发了一种算法,有效地聚合sLLM和LLM对初始词元的预测,使生成的这些词元能够更准确地为随后仅由sLLM进行的词元生成提供条件。关键的是,我们发现必须根据sLLM的置信度自适应地过度信任或忽略LLM的预测。通过在广泛的模型和数据集上的实验,我们证明了我们的方法相比传统的解码策略能带来一致的改进。

[NLP-65] Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models
[NLP-65] 捕捉变色龙:检测使用大型语言模型生成的不断演变的虚假信息

链接: https://arxiv.org/abs/2406.17992
作者: Bohan Jiang,Chengshuai Zhao,Zhen Tan,Huan Liu
关键词: current efforts overlook, detecting evolving LLM-generated, detecting disinformation generated, evolving LLM-generated disinformation, current efforts
中文关键词: 当前的工作忽视,检测不断变化的LLM生成,检测生成的虚假信息,不断变化的LLM生成的虚假信息,当前的工作
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures


Abstract:Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that DELD significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research.
摘要:尽管最近在检测大型语言模型(LLM)生成的虚假信息方面取得了进展,但目前的工作忽略了这种虚假信息不断演变的本质。在这项工作中,我们研究了一个具有挑战性但实用的研究问题,即检测不断演变的LLM生成的虚假信息。随着LLM及其变体的快速发展,虚假信息不断演变。因此,检测模型面临着重大挑战。首先,为每个虚假信息生成器训练单独的模型是低效的。其次,当按顺序遇到不断演变的LLM生成的虚假信息时,性能会下降。为了解决这一问题,我们提出了DELD(Detecting Evolving LLM-generated Disinformation),这是一种参数高效的方法,它联合利用了预训练语言模型(PLM)的通用事实核查能力和各种LLM各自独立的虚假信息生成特性。特别是,学习到的特征被顺序地拼接起来,以便于知识的积累和转化。DELD通过将虚假信息的语义嵌入与可训练的软提示相结合来解决标签稀缺问题,以引出特定于模型的知识。我们的实验表明,DELD的性能明显优于最先进的方法。此外,我们的方法为不同LLM生成虚假信息的独特模式提供了关键见解,为这一研究方向提供了有价值的视角。

[NLP-66] Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models
[NLP-66] 大型语言模型有效生成问答的显式多样性条件

链接: https://arxiv.org/abs/2406.17990
作者: Vikas Yadav,Hyuk Joon Kwon,Vijay Srinivasan,Hongxia Jin
关键词: Question Answer Generation, Answer Generation, question answering systems, Question Answer, explicit diversity conditions
中文关键词: 问答生成、问答生成、问答系统、问答、显式多样性条件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at COLING 2024


Abstract:Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need of explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA task over existing widely adopted implicit diversity techniques. In particular, generated QA pairs from explicit diversity conditions when used to train the downstream QA model results in an average 4.1% exact match and 4.5% F1 improvement over QAG from implicit sampling techniques on SQuADDU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
摘要:问答生成(QAG)是一种有效的数据增强技术,可以提高问答系统的准确率,特别是在低资源领域。虽然最近基于预训练模型和大型语言模型的QAG方法取得了实质性的进展,但它们面临着生成冗余问答对这一关键问题,影响了下游的问答系统。采样和多样化束搜索等隐式多样性技术已被证明是有效的解决方案,但往往产生较低的多样性。我们提出了QAG的显式多样性条件,重点关注空间方面、问题类型和实体,大大增加了问答生成的多样性。我们的工作表明,与现有广泛采用的隐式多样性技术相比,下游问答任务有了显著的改进,从而强调了生成多样化问答合成数据需要显式多样性条件。具体地说,在SQuADDU上,用显式多样性条件生成的问答对训练下游问答模型,比用隐式采样技术的QAG平均提高4.1%的精确匹配(EM)和4.5%的F1。我们的工作进一步强调,在低资源数据集(SubjQA)中更需要显式多样性条件,其中下游问答性能的平均提升约为12% EM。

[NLP-67] Multi-step Knowledge Retrieval and Inference over Unstructured Data
[NLP-67] 非结构化数据上的多步知识检索和推理

链接: https://arxiv.org/abs/2406.17987
作者: Aditya Kalyanpur,Kailash Saravanakumar,Victor Barres,CJ McFate,Lori Moon,Nati Seifu,Maksim Eremeev,Jose Barrera,Eric Brown,David Ferrucci
关键词: Large Language Models, revolutionized natural language, natural language applications, Language Models, Large Language
中文关键词: 大型语言模型、革命性的自然语言、自然语言应用、语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medical, legal and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform, that is designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora’s neuro-symbolic approach effectively addresses these issues. We provide an overview of the system architecture, key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora’s superior performance compared to well-known LLM and RAG baselines.
摘要:大型语言模型(LLM)和生成式人工智能的出现使各个领域的自然语言应用发生了革命性的变化。然而,医疗、法律和金融等领域的高风险决策任务需要一定程度的精确度、全面性和逻辑一致性,而纯粹的LLM或检索增强生成(RAG)方法往往无法实现这一点。在Elemental Cognition(EC),我们开发了一个神经符号人工智能平台来解决这些问题。该平台将用于知识提取和对齐的微调LLM,与用于逻辑推理、规划和交互式约束求解的稳健符号推理引擎集成在一起。我们描述了Cora,一个构建在该平台上的协作研究助理,旨在执行高风险领域的复杂研究和发现任务。本文讨论了这些领域固有的多步推理挑战,评述了现有基于LLM的方法的局限性,并展示了Cora的神经符号方法如何有效地解决这些问题。我们概述了系统架构以及知识提取和形式推理的关键算法,并给出了初步评估结果,突出了Cora与著名的LLM和RAG基线相比的优越性能。

[NLP-68] EDEN: Empathetic Dialogues for English learning
[NLP-68] EDEN:面向英语学习的共情对话

链接: https://arxiv.org/abs/2406.17982
作者: Li Siyan,Teresa Shao,Zhou Yu,Julia Hirschberg
关键词: Dialogue systems, improve learning outcomes, systems improve learning, Dialogue, learning outcomes
中文关键词: 对话系统,改善学习成果,系统改善学习,对话,学习成果
类目: Computation and Language (cs.CL)
备注:


Abstract:Dialogue systems have been used as conversation partners in English learning, but few have studied whether these systems improve learning outcomes. Student passion and perseverance, or grit, has been associated with language learning success. Recent work establishes that as students perceive their English teachers to be more supportive, their grit improves. Hypothesizing that the same pattern applies to English-teaching chatbots, we create EDEN, a robust open-domain chatbot for spoken conversation practice that provides empathetic feedback. To construct EDEN, we first train a specialized spoken utterance grammar correction model and a high-quality social chit-chat conversation model. We then conduct a preliminary user study with a variety of strategies for empathetic feedback. Our experiment suggests that using adaptive empathetic feedback leads to higher perceived affective support, which, in turn, predicts increased student grit.
摘要:对话系统已被用作英语学习中的对话伙伴,但很少有人研究这些系统是否能改善学习结果。学生的热情和毅力,即坚毅(grit),与语言学习的成功有关。最近的研究表明,学生越是认为英语老师给予支持,他们的坚毅程度就越高。我们假设同样的模式也适用于英语教学聊天机器人,并据此创建了EDEN,这是一个稳健的开放域聊天机器人,用于口语对话练习,并提供共情反馈。为了构建EDEN,我们首先训练了一个专门的口语话语语法纠正模型和一个高质量的社交闲聊对话模型。然后,我们使用多种共情反馈策略进行了初步用户研究。我们的实验表明,使用自适应的共情反馈会带来更高的感知情感支持,而这反过来又预示着学生坚毅程度的提高。

[NLP-69] Inherent Challenges of Post-Hoc Membership Inference for Large Language Models
[NLP-69] 大型语言模型事后成员推理的内在挑战

链接: https://arxiv.org/abs/2406.17975
作者: Matthieu Meeus,Shubham Jain,Marek Rei,Yves-Alexandre de Montjoye
关键词: Large Language Models, Large Language, Membership Inference Attacks, Language Models, Inference Attacks
中文关键词: 大型语言模型、大型语言、成员推理攻击、语言模型、推理攻击
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:


Abstract:Large Language Models (LLMs) are often trained on vast amounts of undisclosed data, motivating the development of post-hoc Membership Inference Attacks (MIAs) to gain insight into their training data composition. However, in this paper, we identify inherent challenges in post-hoc MIA evaluation due to potential distribution shifts between collected member and non-member datasets. Using a simple bag-of-words classifier, we demonstrate that datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases achieving near-perfect distinction between members and non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. However, we note that such randomized setups are rarely available for the latest LLMs, making post-hoc data collection still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach for post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance barely above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges in accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.
摘要:大型语言模型(LLM)经常在大量未公开的数据上进行训练,这促使了事后(post-hoc)成员推理攻击(MIA)的发展,以了解它们的训练数据组成。然而,在这篇文章中,我们指出了事后MIA评估的内在挑战,其原因在于所收集的成员和非成员数据集之间可能存在分布偏移。使用一个简单的词袋分类器,我们证明了最近的事后MIA中使用的数据集存在显著的分布偏移,在某些情况下,该分类器能够近乎完美地区分成员和非成员。这意味着之前报告的高MIA性能可能在很大程度上归因于这些偏移,而不是模型记忆。我们确认,随机化的受控设置可以消除这种偏移,从而使新MIA的开发和公平评估成为可能。然而,我们注意到,这种随机化设置很少适用于最新的LLM,因此要推断真实世界LLM的成员资格,仍然需要事后数据收集。作为一种潜在的解决方案,我们提出了一种用于事后数据收集的断点回归设计(RDD)方法,它大大减轻了分布偏移。在这种RDD设置上评估各种MIA方法,其性能仅略高于随机猜测,这与之前报告的结果形成了鲜明对比。总体而言,我们的发现突出了准确测量LLM记忆的挑战,以及在(事后)成员推理任务中进行仔细实验设计的必要性。

[NLP-70] Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
[NLP-70] 跨不同人口属性和提示评估大型视觉语言模型中的公平性

链接: https://arxiv.org/abs/2406.17974
作者: Xuyang Wu,Yuan Wang,Hsin-Tai Wu,Zhiqiang Tao,Yi Fang
关键词: Large vision-language models, achieved significant progress, demonstrating strong capabilities, recently achieved significant, Large vision-language
中文关键词: 大型视觉语言模型,取得了重大进展,展示了强大的能力,最近取得了重大的、大型视觉语言
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate visual fairness in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.
摘要:大型视觉语言模型(LVLM)最近取得了重大进展,展示了开放世界视觉理解的强大能力。然而,目前尚不清楚LVLM如何处理现实生活中的人口统计偏见,特别是性别、肤色和年龄等属性之间的差异。在本文中,我们基于公共公平性基准数据集(例如FACET),对若干主流LVLM中的视觉公平性进行了实证研究,并审核了它们在敏感人口属性上的性能差异。为了揭示LVLM中的视觉偏见,我们设计了一个公平性评估框架,其中包括针对视觉问答/分类任务的直接提问提示和单项选择题式提示。零样本提示结果表明,尽管视觉理解有所增强,但开源和闭源LVLM在不同的指令提示和人口统计属性上都表现出普遍的公平性问题。

[NLP-71] LABOR-LLM: Language-Based Occupational Representations with Large Language Models
[NLP-71] LABOR-LLM:使用大型语言模型的基于语言的职业表示

链接: https://arxiv.org/abs/2406.17972
作者: Tianyu Du,Ayush Kanodia,Herman Brunborg,Keyon Vafa,Susan Athey
关键词: carefully constructed longitudinal, labor market questions, market questions rely, constructed longitudinal survey, longitudinal survey datasets
中文关键词: 精心构建的纵向、劳动力市场问题、市场问题依赖、构建的纵向调查、纵向调查数据集
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Econometrics (econ.EM)
备注:


Abstract:Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. (2024) introduced a transformer-based “foundation model”, CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models’ predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions.
摘要:许多关于劳动力市场问题的实证研究依赖于使用小型、精心构建、基于手工设计特征的纵向调查数据集来估计相对简单的预测模型。大型语言模型(LLM)在海量数据集上进行训练,编码了大量的世界知识,可用于下一份工作的预测问题。然而,尽管现成的LLM在被提示时会产生看似合理的职业轨迹,但LLM根据职业历史预测特定工作转换的概率通常与给定人群中的真实条件概率不一致。最近,Vafa等人(2024)引入了一个基于Transformer的“基础模型”CAREER,该模型使用一个不具代表性的大型简历数据集进行训练,可以预测工作之间的转换;该工作还进一步展示了如何通过迁移学习技术利用基础模型,为工作转换和工资建立更好的预测模型,以反映具有全国代表性的调查数据集中的条件转换概率。本文考虑了一种替代方案,即用微调LLM取代对CAREER基础模型的微调。对于下一份工作的预测任务,我们证明了用我们的方法训练的模型在调查数据上的预测性能优于多种替代方法,包括传统的计量经济学模型、CAREER和采用上下文学习的LLM,尽管LLM原则上可能预测出调查数据中不允许出现的职位名称。此外,我们表明,与现成的LLM模型和CAREER相比,我们微调的基于LLM的模型的预测更能代表不同劳动力亚群的职业轨迹。我们进行了实验和分析,以揭示我们的模型在代表性预测上性能提升的来源。

[NLP-72] Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
[NLP-72] 鼓励还是抑制单义性?从特征去相关角度重新审视单义性

链接: https://arxiv.org/abs/2406.17969
作者: Hanqi Yan,Yanzheng Xiang,Guangyi Chen,Yifei Wang,Lin Gui,Yulan He
关键词: recent studies focus, large language models, recent studies, basic units, interpret the intrinsic
中文关键词: 最近的研究重点,大型语言模型,最近的研究,基本单位,解释内在的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on monosemanticity on its basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by wang2024learning, which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity, in the preference alignment process. Consequently, we apply feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. The experiments show that our method not only enhances representation diversity and activation sparsity but also improves preference alignment performance.
摘要:为了更好地解释大型语言模型(LLM)的内在机制,最近的研究集中在其基本单元的单义性上。单义神经元专用于单一的特定概念,在神经元和概念之间形成一一对应的关系。尽管对单义性探测进行了广泛的研究,但单义性对模型能力是有利还是有害仍不清楚。为了探讨这个问题,我们从特征去相关的角度重新审视了单义性,并主张鼓励单义性。我们通过实验观察到,wang2024learning目前的结论(即降低单义性可以提高模型性能)在模型改变时并不成立。相反,我们证明了在偏好对齐过程中,单义性始终与模型能力呈正相关。因此,我们将特征相关性作为单义性的代理,并将特征去相关正则项引入到动态偏好优化过程中。实验表明,该方法不仅提高了表示多样性和激活稀疏性,而且提高了偏好对齐性能。

[NLP-73] Unmasking the Imposters: In-Domain Detection of Human vs. Machine-Generated Tweets
[NLP-73] 揭露冒名顶替者:人类与机器生成推文的域内检测

链接: https://arxiv.org/abs/2406.17967
作者: Bryan E. Tuck,Rakesh M. Verma
关键词: social media platforms, large language models, raising concerns, media platforms, rapid development
中文关键词: 社交媒体平台、大型语言模型、引发担忧、媒体平台、快速发展
类目: Computation and Language (cs.CL)
备注:


Abstract:The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their misuse on social media platforms. We present a methodology using Twitter datasets to examine the generative capabilities of four LLMs: Llama 3, Mistral, Qwen2, and GPT4o. We evaluate 7B and 8B parameter base-instruction models of the three open-source LLMs and validate the impact of further fine-tuning and “uncensored” versions. Our findings show that “uncensored” models with additional in-domain fine-tuning dramatically reduce the effectiveness of automated detection methods. This study addresses a gap by exploring smaller open-source models and the effects of “uncensoring,” providing insights into how fine-tuning and content moderation influence machine-generated text detection.
摘要:大型语言模型(LLM)的快速发展显著改善了流畅且令人信服的文本的生成,引发了人们对其在社交媒体平台上被滥用的担忧。我们提出了一种使用Twitter数据集来检查四种LLM的生成能力的方法:Llama 3、Mistral、Qwen2和GPT4o。我们评估了三种开源LLM的7B和8B参数基础-指令模型,并验证了进一步微调和“未经审查”版本的影响。我们的研究结果表明,经过额外域内微调的“未经审查”模型会显著降低自动检测方法的有效性。这项研究通过探索较小的开源模型和“去审查”的影响来弥补这一空白,从而深入了解微调和内容审核如何影响机器生成文本的检测。

[NLP-74] SimsChat: A Customisable Persona-Driven Role-Playing Agent
[NLP-74] SimsChat:可定制的角色驱动角色扮演代理

链接: https://arxiv.org/abs/2406.17962
作者: Bohao Yang,Dong Liu,Chen Tang,Chenghao Xiao,Kun Zhao,Chao Li,Lin Yuan,Guang Yang,Lanxiao Huang,Chenghua Lin
关键词: Large Language Models, Large Language, Language Models, generate high-quality text, understand human instructions
中文关键词: 大型语言模型,大型语言,语言模型,生成高质量文本,理解人类指令
类目: Computation and Language (cs.CL)
备注:


Abstract:Large Language Models (LLMs) possess the remarkable capability to understand human instructions and generate high-quality text, enabling them to act as agents that simulate human behaviours. This capability allows LLMs to emulate human beings in a more advanced manner, beyond merely replicating simple human behaviours. However, there is a lack of exploring into leveraging LLMs to craft characters from several aspects. In this work, we introduce the Customisable Conversation Agent Framework, which employs LLMs to simulate real-world characters that can be freely customised according to different user preferences. The customisable framework is helpful for designing customisable characters and role-playing agents according to human’s preferences. We first propose the SimsConv dataset, which comprises 68 different customised characters, 1,360 multi-turn role-playing dialogues, and encompasses 13,971 interaction dialogues in total. The characters are created from several real-world elements, such as career, aspiration, trait, and skill. Building on these foundations, we present SimsChat, a freely customisable role-playing agent. It incorporates different real-world scenes and topic-specific character interaction dialogues, simulating characters’ life experiences in various scenarios and topic-specific interactions with specific emotions. Experimental results show that our proposed framework achieves desirable performance and provides helpful guideline for building better simulacra of human beings in the future. Our data and code are available at this https URL.
摘要:大型语言模型(LLM)具有理解人类指令和生成高质量文本的显著能力,使它们能够充当模拟人类行为的智能体。这种能力使LLM能够以更高级的方式模仿人类,而不仅仅是复制简单的人类行为。然而,目前缺乏利用LLM从多个方面塑造角色的探索。在这项工作中,我们介绍了可定制的对话代理框架,它使用LLM来模拟真实世界的角色,这些角色可以根据不同的用户偏好自由定制。这个可定制的框架有助于根据人类的喜好设计可定制的角色和角色扮演代理。我们首先提出了SimsConv数据集,它包括68个不同的定制角色、1,360个多轮角色扮演对话,总共包含13,971个互动对话。这些角色由若干现实世界的元素构成,比如职业、抱负、特质和技能。在这些基础上,我们推出了SimsChat,一个可自由定制的角色扮演代理。它融入了不同的现实世界场景和特定话题的角色互动对话,模拟了角色在各种场景中的生活经历和带有特定情感的特定话题互动。实验结果表明,我们提出的框架取得了理想的性能,并为今后构建更好的人类拟像提供了有益的指导。我们的数据和代码可以在此https URL上找到。

[NLP-75] NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization
[NLP-75] NormTab:通过表格数据规范化改进LLM中的符号推理

链接: https://arxiv.org/abs/2406.17961
作者: Md Mahadi Hasan Nahid,Davood Rafiei
关键词: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, parsing textual data
中文关键词: 大型语言模型,大型语言,语言模型,表现出非凡的能力,解析文本数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注: Work in Progress

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.
摘要:近年来,大型语言模型(LLM)在解析文本数据和生成代码方面表现出了非凡的能力。然而,由于Web表中经常出现的结构差异和单元格值不一致,它们在涉及表格数据的任务中的性能,尤其是那些需要符号推理的任务中的性能面临挑战。本文中,我们介绍了NormTab,这是一个新颖的框架,旨在通过规范化Web表来增强LLM的符号推理性能。我们将表规范化作为一个独立的一次性预处理步骤进行研究,使用LLM来支持表格数据的符号推理。我们对具有挑战性的Web表数据集(例如WikiTableQuestion和TabFact)进行的实验评估表明,利用NormTab可以显著提高符号推理性能,展示了Web表规范化对于增强基于LLM的符号推理任务的重要性和有效性。

[NLP-76] Do they mean us? Interpreting Referring Expressions in Intergroup Bias
[NLP-76] 他们是指我们吗?群体间偏见中的指代表达的解释

链接: https://arxiv.org/abs/2406.17947
作者: Venkata S Govindarajan,Matianyu Zang,Kyle Mahowald,David Beaver,Junyi Jessy Li
关键词: underlie many social, social phenomena, phenomena like stereotype, stereotype perpetuation, intergroup bias
中文关键词: 构成许多社会、社会现象的基础,例如刻板印象、刻板印象延续、群体间偏见
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype perpetuation and implicit bias. In this paper, we model the intergroup bias as a tagging task on English sports comments from forums dedicated to fandom for NFL teams. We curate a unique dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for automated tagging, and discover that some LLMs perform best when prompted with linguistic descriptions of the win probability at the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations in the form of referent across win probabilities that distinguish in-group and out-group utterances. Code and data are available at this https URL .
摘要:群体内和群体外言语之间的差异(群体间偏见)是微妙的,可能是许多社会现象的基础,如刻板印象延续和隐性偏见。在这篇文章中,我们将群体间偏见建模为对来自NFL球队球迷论坛的英语体育评论的一项标注任务。我们整理了一个独特的数据集,其中包含600多万条来自对立视角(比赛中的两支球队)的比赛期间评论,每条评论都基于对引发这些评论的事件的非语言描述(每支球队的实时获胜概率)。专家和众包标注证明了通过标记隐式和显式指代表达来建模偏见的合理性,并揭示了这项任务所需的对语言和世界的丰富的、上下文相关的理解。对于大规模的群体间差异分析,我们使用LLM进行自动标注,并发现一些LLM在以评论时获胜概率的语言描述(而不是数字概率)作为提示时性能最好。此外,使用LLM的大规模评论标注揭示了指代形式随获胜概率呈线性变化,这种变化区分了群体内和群体外话语。代码和数据可在此https URL上找到。

[NLP-77] Sequential Editing for Lifelong Training of Speech Recognition Models
[NLP-77] 语音识别模型终身训练的序列编辑

链接: https://arxiv.org/abs/2406.17935
作者: Devang Kulshreshtha,Saket Dingliwal,Brady Houston,Nikolaos Pappas,Srikanth Ronanki
关键词: Automatic Speech Recognition, Automatic Speech, Speech Recognition, computational inefficiencies linked, domain raises concerns
中文关键词: 自动语音识别、自动语音、语音识别、计算效率低下相关领域引发担忧
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: INTERSPEECH 2024

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) traditionally assumes known domains, but adding data from a new domain raises concerns about computational inefficiencies linked to retraining models on both existing and new domains. Fine-tuning solely on new domain risks Catastrophic Forgetting (CF). To address this, Lifelong Learning (LLL) algorithms have been proposed for ASR. Prior research has explored techniques such as Elastic Weight Consolidation, Knowledge Distillation, and Replay, all of which necessitate either additional parameters or access to prior domain data. We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. Different than previous methods, our approach does not necessitate access to prior datasets or the introduction of extra parameters. Our study demonstrates up to 15% Word Error Rate Reduction (WERR) over fine-tuning baseline, and superior efficiency over other LLL techniques on CommonVoice English multi-accent dataset.
摘要:自动语音识别(ASR)传统上假设已知领域,但添加来自新领域的数据会引发人们对在现有和新领域上重新训练模型所带来的计算效率低下的担忧。仅对新领域进行微调可能会带来灾难性遗忘(CF)的风险。为了解决这一问题,已经为ASR提出了终身学习(LLL)算法。之前的研究探索了弹性权重整合、知识蒸馏和回放等技术,所有这些技术都需要额外的参数或访问先前的领域数据。我们提出序列模型编辑作为一种在ASR系统中不断学习新领域的新颖方法。与以前的方法不同,我们的方法不需要访问以前的数据集或引入额外的参数。我们的研究表明,在CommonVoice英语多口音数据集上,相对于微调基线可实现高达15%的词错误率降低(WERR),并且效率优于其他LLL技术。

[NLP-78] FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
[NLP-78] FASA:一种灵活的自动语音对齐器,用于提取高质量对齐儿童语音数据

链接: https://arxiv.org/abs/2406.17926
作者: Dancheng Liu,Jinjun Xiong
关键词: Automatic Speech Recognition, deep neural network, made significant progress, employing deep neural, speech distinct characteristics
中文关键词: 自动语音识别、深度神经网络取得了重大进展,采用深度神经、语音独特特征
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) for adults’ speeches has made significant progress by employing deep neural network (DNN) models recently, but improvement in children’s speech is still unsatisfactory due to children’s speech’s distinct characteristics. DNN models pre-trained on adult data often struggle in generalizing children’s speeches with fine tuning because of the lack of high-quality aligned children’s speeches. When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children’s speech data from many of the existing noisy children’s speech data. We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6× over human annotations.
摘要:近年来,通过采用深度神经网络(DNN)模型,针对成人语音的自动语音识别(ASR)取得了重大进展,但由于儿童语音的独特特征,儿童语音识别的改善仍然不令人满意。由于缺乏高质量的对齐儿童语音,在成人数据上预先训练的DNN模型经常难以通过微调泛化到儿童语音。生成数据集时,人工标注不可扩展,现有的强制对齐工具也不可用,因为它们对输入转录的质量做出了不切实际的假设。为了应对这些挑战,我们提出了一种新的强制对齐工具FASA,作为一种灵活且自动的语音对齐器,可以从许多现有的有噪声的儿童语音数据中提取高质量对齐的儿童语音数据。我们展示了它在CHILDES数据集上的使用,并表明与人工标注相比,FASA可以将数据质量提高13.6倍。

[NLP-79] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
[NLP-79] PAFT:有效LLM微调的并行训练范式

链接: https://arxiv.org/abs/2406.17923
作者: Shiva Kumar Pentyala,Zhichao Wang,Bin Bi,Kiran Ramnath,Xiang-Bo Mao,Regunathan Radhakrishnan,Sitaram Asur,Na (Claire) Cheng
关键词: shown remarkable abilities, Large language models, Large language, preference alignment, shown remarkable
中文关键词: 表现出非凡的能力,大型语言模型,大型语言,偏好对齐,表现出非凡的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to alignment tax that degrades the LLM performance. This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO, etc.) with the same pre-trained model on respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals important findings that preference alignment like DPO naturally results in a sparse model while SFT leads to a natural dense model which needs to be sparsified for effective model merging. This paper introduces an effective interference resolution which reduces the redundancy by sparsifying the delta parameters. The LLM resulted from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
摘要:大型语言模型(LLM)在各种自然语言处理(NLP)任务中表现出了卓越的能力。LLM通常先经过监督微调(SFT),然后进行偏好对齐,以便在下游应用中使用。然而,这种顺序训练流程会带来对齐税,从而降低LLM的性能。本文介绍了一种用于有效LLM微调的新的并行训练范式PAFT,它使用同一个预训练模型在各自的数据集上独立地执行SFT和偏好对齐(如DPO和ORPO等)。然后通过参数融合将SFT生成的模型和偏好对齐得到的模型合并成最终的模型,以用于下游应用。这项工作揭示了重要的发现:偏好对齐(如DPO)自然会产生稀疏模型,而SFT会产生自然稠密的模型,需要进行稀疏化处理才能有效地进行模型合并。本文引入了一种有效的干扰消解方法,通过稀疏化增量参数来减少冗余。由新的训练范式产生的LLM在HuggingFace Open LLM排行榜上排名第一。全面的评估表明了并行训练范式的有效性。
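
The merging step described in the PAFT abstract can be illustrated with a toy sketch. The abstract does not state the sparsification rule or the fusion rule, so the code below swaps in two common stand-ins, both of which are assumptions: magnitude-based pruning of each delta (fine-tuned weights minus base weights) and a plain average of the sparsified deltas. Weights are flat Python lists, and the names `sparsify_delta` and `merge_models` are hypothetical.

```python
def sparsify_delta(base, tuned, keep_ratio):
    """Return tuned - base with only the largest-magnitude entries kept.

    Assumption: magnitude pruning stands in for the paper's sparsification,
    which the abstract does not specify.
    """
    delta = [t - b for b, t in zip(base, tuned)]
    k = max(1, round(keep_ratio * len(delta)))
    # Indices of the k entries with the largest absolute change.
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(ranked[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def merge_models(base, tuned_models, keep_ratio=0.5):
    """Add the average of the sparsified deltas back onto the base weights.

    Assumption: simple averaging stands in for the paper's parameter fusing.
    """
    deltas = [sparsify_delta(base, m, keep_ratio) for m in tuned_models]
    return [b + sum(ds) / len(deltas) for b, ds in zip(base, zip(*deltas))]


base = [0.0, 0.0, 0.0, 0.0]
sft_model = [1.0, 0.1, -2.0, 0.0]   # hypothetical SFT weights
dpo_model = [0.0, 3.0, 0.0, -0.2]   # hypothetical preference-aligned weights
merged = merge_models(base, [sft_model, dpo_model], keep_ratio=0.5)
print(merged)
```

Sparsifying before merging means that each parameter is mostly influenced by at most one of the two models, which is one plausible reading of why reducing redundancy in the deltas helps avoid interference.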

[NLP-80] X-ray Made Simple: Radiology Report Generation and Evaluation with Layman’s Terms
[NLP-80] X射线变得简单:使用外行术语生成和评估放射学报告

链接: https://arxiv.org/abs/2406.17911
作者: Kun Zhao,Chenghao Xiao,Chen Tang,Bohao Yang,Kai Ye,Noura Al Moubayed,Liang Zhan,Chenghua Lin
关键词: Radiology Report Generation, achieved significant progress, multimodal generative models, Radiology Report, Report Generation
中文关键词: 放射学报告生成,取得重大进展,多模式生成模型,放射学报告,报告生成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Radiology Report Generation (RRG) has achieved significant progress with the advancements of multimodal generative models. However, the evaluation in the domain suffers from a lack of fair and robust metrics. We reveal that, high performance on RRG with existing lexical-based metrics (e.g. BLEU) might be more of a mirage - a model can get a high BLEU only by learning the template of reports. This has become an urgent problem for RRG due to the highly patternized nature of these reports. In this work, we un-intuitively approach this problem by proposing the Layman’s RRG framework, a layman’s terms-based dataset, evaluation and training framework that systematically improves RRG with day-to-day language. We first contribute the translated Layman’s terms dataset. Building upon the dataset, we then propose a semantics-based evaluation method, which is proved to mitigate the inflated numbers of BLEU and provides fairer evaluation. Last, we show that training on the layman’s terms dataset encourages models to focus on the semantics of the reports, as opposed to overfitting to learning the report templates. We reveal a promising scaling law between the number of training examples and semantics gain provided by our dataset, compared to the inverse pattern brought by the original formats. Our code is available at this https URL.
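摘要:随着多模态生成模型的进步,放射学报告生成(RRG)取得了重大进展。然而,该领域的评估缺乏公平和稳健的衡量标准。我们发现,使用现有的基于词汇的指标(如BLEU)在RRG上取得的高性能可能更多的是海市蜃楼:一个模型仅通过学习报告模板就能获得高BLEU。由于这些报告具有高度模式化的特点,这已成为RRG的一个紧迫问题。在这项工作中,我们通过提出Layman’s RRG框架来以非直观的方式解决这个问题,这是一个基于外行术语的数据集、评估和训练框架,它使用日常语言系统地改进RRG。我们首先贡献翻译后的外行术语数据集。在该数据集的基础上,我们提出了一种基于语义的评估方法,该方法被证明可以缓解BLEU虚高的数值,并提供更公平的评估。最后,我们表明,在外行术语数据集上进行训练会促使模型关注报告的语义,而不是过拟合于学习报告模板。与原始格式带来的相反模式相比,我们揭示了训练样本数量与我们的数据集所提供的语义增益之间有前景的扩展规律。我们的代码位于此https URL。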

[NLP-81] Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata
[NLP-81] 绘制过去:20世纪初瑞典百科全书与维基数据的地理联系

链接: https://arxiv.org/abs/2406.17903
作者: Axel Ahlin,Alfred Myrne,Pierre Nugues
关键词: Nordic Family Book, Nordic Family, Family Book, prominent Swedish encyclopedia, prominent Swedish
中文关键词: 北欧家庭书,北欧家庭,家庭书,著名瑞典百科全书,著名瑞典语
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:In this paper, we describe the extraction of all the location entries from a prominent Swedish encyclopedia from the early 20th century, the Nordisk Familjebok ‘Nordic Family Book.’ We focused on the second edition called Uggleupplagan, which comprises 38 volumes and over 182,000 articles. This makes it one of the most extensive Swedish encyclopedias. Using a classifier, we first determined the category of the entries. We found that approximately 22 percent of them were locations. We applied a named entity recognition to these entries and we linked them to Wikidata. Wikidata enabled us to extract their precise geographic locations resulting in almost 18,000 valid coordinates. We then analyzed the distribution of these locations and the entry selection process. It showed a higher density within Sweden, Germany, and the United Kingdom. The paper sheds light on the selection and representation of geographic information in the Nordisk Familjebok, providing insights into historical and societal perspectives. It also paves the way for future investigations into entry selection in different time periods and comparative analyses among various encyclopedias.
摘要:本文描述了从20世纪初瑞典著名百科全书Nordisk Familjebok(《北欧家庭百科全书》)中提取所有地点条目的过程。我们关注的是名为Uggleupplagan的第二版,它包括38卷和超过182,000篇文章。这使它成为瑞典最广泛的百科全书之一。使用分类器,我们首先确定条目的类别。我们发现,其中大约22%是地点。我们对这些条目应用了命名实体识别,并将它们链接到维基数据。维基数据使我们能够提取它们的精确地理位置,从而产生近18,000个有效坐标。然后我们分析了这些地点的分布和条目选择过程。结果显示瑞典、德国和英国境内的密度较高。这篇论文阐述了Nordisk Familjebok中地理信息的选择和表达,提供了对历史和社会视角的洞察。它还为未来研究不同时期的条目选择以及各种百科全书之间的比较分析铺平了道路。

[NLP-82] Script-Agnostic Language Identification
[NLP-82] 与文字系统无关的语言识别

链接: https://arxiv.org/abs/2406.17901
作者: Milind Agarwal,Joshua Otten,Antonios Anastasopoulos
关键词: sort online text, language-specific buckets, data collection, collection and crawling, crawling efforts
中文关键词: 排序在线文本、特定语言桶、数据收集、收集和爬行、爬行工作
类目: Computation and Language (cs.CL)
备注: Under Review in ACL Rolling Review

点击查看摘要

Abstract:Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing) focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
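摘要:语言识别被用作许多数据收集和爬取工作的第一步,因为它允许我们将在线文本分类到特定于语言的桶中。然而,许多现代语言,如Konkani、Kashmiri、Punjabi等,在同一时期是用几种文字书写的。此外,使用不同文字系统的语言在神经表示空间中不共享显著的词汇、语义和句法属性,这对于密切相关的语言和低资源语言,尤其是来自印度次大陆的语言是不利的。为了应对这一问题,我们建议使用几种不同的实验策略(放大、扁平化和文字混合)学习与文字系统无关的表示,重点放在四种主要的德拉维达语(泰米尔语、泰卢固语、卡纳达语和马拉雅拉姆语)上。我们发现,词级别的文字随机化以及接触用多种文字书写的语言,对于下游的与文字系统无关的语言识别极有价值,同时还能在自然出现的文本上保持有竞争力的性能。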

[NLP-83] CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design
[NLP-83] CTBench:临床试验设计中评估语言模型能力的综合基准

链接: https://arxiv.org/abs/2406.17888
作者: Nafis Neehal,Bowen Wang,Shayom Debopadhaya,Soham Dan,Keerthiram Murugesan,Vibha Anand,Kristin P. Bennett
关键词: assess language models, baseline features, language models, aiding clinical study, benchmark to assess
中文关键词: 评估语言模型、基线特征、语言模型、辅助临床研究、评估基准
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models’ ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial’s start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: “CT-Repo,” containing baseline features from 1,690 clinical trials sourced from this http URL, and “CT-Pub,” a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. “ListMatch-LM” and “ListMatch-BERT” use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.
摘要:CTBench被引入作为评估语言模型(LM)辅助临床研究设计能力的基准。在给定研究特定元数据的情况下,CTBench评估人工智能模型确定临床试验(CT)基线特征的能力,其中包括在试验开始时从所有参与者那里收集的人口统计和相关特征。这些基线特征通常出现在CT出版物中(通常为表1),对于刻画研究队列特征和验证结果至关重要。基线特征,包括混杂因素和协变量,对于在涉及观察数据的研究中准确估计治疗效果也是必要的。CTBench由两个数据集组成:“CT-Repo”包含来自该http URL的1,690项临床试验的基线特征,“CT-Pub”是其中100项试验的子集,具有从相关出版物收集的更全面的基线特征。开发了两种基于LM的评估方法,以将实际基线特征列表与LM生成的响应进行比较。“ListMatch-LM”和“ListMatch-BERT”分别使用GPT-4o和BERT分数(在不同的阈值下)进行评估。为了建立基线结果,在零样本和三样本学习设置中,应用基于LLaMa3-70B-Instruct和GPT-4o的高级提示工程技术来生成潜在的基线特征。GPT-4o作为评估器的性能通过对CT-Pub数据集的人在环评估来验证,其中临床专家确认实际特征和LM生成的特征之间的匹配。这些结果突出了一个前景光明、仍有很大改进空间的方向,将CTBench定位为一个有用的工具,用于推进CT设计中的人工智能研究,并有望增强CT的有效性和稳健性。

[NLP-84] ET tu CLIP? Addressing Common Object Errors for Unseen Environments
[NLP-84] ET tu CLIP?解决未见环境中的常见对象错误

链接: https://arxiv.org/abs/2406.17876
作者: Ye Won Byun,Cathy Jiao,Shahriar Noroozizadeh,Jimin Sun,Rosa Vitiello
关键词: enhance model generalization, employs pre-trained CLIP, pre-trained CLIP encoders, ALFRED task, introduce a simple
中文关键词: 增强模型概括性,采用预训练的CLIP、预训练的CLIP编码器、ALFRED任务,引入一个简单的
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
摘要:我们引入了一种简单的方法,使用预先训练的CLIP编码器来增强ALFRED任务中的模型泛化能力。与之前使用CLIP取代视觉编码器的文献不同,我们建议通过辅助对象检测目标将CLIP用作额外模块。我们在最近提出的Episodic Transformer架构上验证了我们的方法,并证明引入CLIP可以提高未见验证集上的任务性能。此外,我们的分析结果支持CLIP特别有助于利用对象描述、检测小对象和解释罕见单词。

[NLP-85] Cloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks
[NLP-85] 隐形分类器:敏感分类任务的假名化策略

链接: https://arxiv.org/abs/2406.17875
作者: Arij Riabi,Menel Mahamdi,Virginie Mouilleron,Djamé Seddah
关键词: Protecting privacy, European GDPR shape, online radicalization dataset, personal information, European GDPR
中文关键词: 保护隐私、欧洲GDPR形状、在线激进化数据集、个人信息、欧洲GDPR
类目: Computation and Language (cs.CL)
备注: Proceedings of the fifth Workshop on Privacy in Natural Language Processing

点击查看摘要

Abstract:Protecting privacy is essential when sharing data, particularly in the case of an online radicalization dataset that may contain personal information. In this paper, we explore the balance between preserving data usefulness and ensuring robust privacy safeguards, since regulations like the European GDPR shape how personal information must be handled. We share our method for manually pseudonymizing a multilingual radicalization dataset, ensuring performance comparable to the original data. Furthermore, we highlight the importance of establishing comprehensive guidelines for processing sensitive NLP data by sharing our complete pseudonymization process, our guidelines, the challenges we encountered as well as the resulting dataset.
摘要:共享数据时保护隐私至关重要,特别是在可能包含个人信息的在线激进化数据集的情况下。在本文中,我们探讨了保留数据有用性和确保强有力的隐私保护之间的平衡,因为欧洲GDPR等法规决定了个人信息的处理方式。我们分享了对多语言激进化数据集进行手动假名化的方法,确保性能与原始数据相当。此外,我们通过分享我们完整的假名化流程、我们的指南、我们遇到的挑战以及生成的数据集,强调了为处理敏感NLP数据制定全面指南的重要性。

[NLP-86] Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples Verification and Dynamic Feedback
[NLP-86] 通过关系元组验证和动态反馈提高大型语言模型的算术推理能力

链接: https://arxiv.org/abs/2406.17873
作者: Zhongtao Miao,Kaiyan Zhao,Yoshimasa Tsuruoka
关键词: large language models, large language, Current representations, language models, reasoning steps
中文关键词: 大型语言模型、大型语言、当前表示、语言模型、推理步骤
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review, 25 figures, 8 tables, 29 pages

点击查看摘要

Abstract:Current representations used in reasoning steps of large language models can mostly be categorized into two main types: (1) natural language, which is difficult to verify; and (2) non-natural language, usually programming code, which is difficult for people who are unfamiliar with coding to read. In this paper, we propose to use a semi-structured form to represent reasoning steps of large language models. Specifically, we use relation tuples, which are not only human-readable but also machine-friendly and easier to verify than natural language. We implement a framework that includes three main components: (1) introducing relation tuples into the reasoning steps of large language models; (2) implementing an automatic verification process of reasoning steps with a local code interpreter based on relation tuples; and (3) integrating a simple and effective dynamic feedback mechanism, which we found helpful for self-improvement of large language models. The experimental results on various arithmetic datasets demonstrate the effectiveness of our method in improving the arithmetic reasoning ability of large language models. The source code is available at this https URL.
摘要:目前用于大型语言模型推理步骤的表示主要有两类:(1)自然语言,难以验证;(2)非自然语言,通常是编程代码,对于不熟悉编码的人来说很难阅读。在本文中,我们提出用一种半结构形式来表示大型语言模型的推理步骤。具体地说,我们使用关系元组,它不仅人类可读,而且机器友好,比自然语言更容易验证。我们实现了一个框架,该框架包括三个主要部分:(1)将关系元组引入到大型语言模型的推理步骤中;(2)使用基于关系元组的本地代码解释器实现推理步骤的自动验证过程;(3)集成一种简单有效的动态反馈机制,有助于大型语言模型的自我改进。在各种算术数据集上的实验结果表明,该方法在提高大型语言模型的算术推理能力方面是有效的。源代码可在此HTTPS URL上找到。

[NLP-87] Automatic speech recognition for the Nepali language using CNN bidirectional LSTM and ResNet
[NLP-87] 使用CNN双向LSTM和ResNet的尼泊尔语自动语音识别

链接: https://arxiv.org/abs/2406.17825
作者: Manish Dhakal,Arman Chhetri,Aman Kumar Gupta,Prabin Lamichhane,Suraj Pandey,Subarna Shakya
关键词: Automatic Speech Recognition, transcribes Nepali speech, Speech Recognition, Automatic Speech, deep learning model
中文关键词: 自动语音识别,转录尼泊尔语音,语音识别,自动语音,深度学习模型
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at 2022 International Conference on Inventive Computation Technologies (ICICT), IEEE

点击查看摘要

Abstract:This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text. The model was trained and tested on the OpenSLR (audio, text) dataset. The majority of the audio dataset have silent gaps at both ends which are clipped during dataset preprocessing for a more uniform mapping of audio frames and their corresponding texts. Mel Frequency Cepstral Coefficients (MFCCs) are used as audio features to feed into the model. The model having Bidirectional LSTM paired with ResNet and one-dimensional CNN produces the best results for this dataset out of all the models (neural networks with variations of LSTM, GRU, CNN, and ResNet) that have been trained so far. This novel model uses Connectionist Temporal Classification (CTC) function for loss calculation during training and CTC beam search decoding for predicting characters as the most likely sequence of Nepali text. On the test dataset, the character error rate (CER) of 17.06 percent has been achieved. The source code is available at: this https URL.
摘要:提出了一种端到端深度学习模型,用于尼泊尔语语音到文本的自动语音识别(ASR)。该模型在OpenSLR(音频、文本)数据集上进行了训练和测试。大多数音频数据集在两端都有静默间隙,这些间隙在数据集预处理期间被修剪,以便更均匀地映射音频帧及其对应的文本。将Mel频率倒谱系数(MFCC)作为音频特征输入到模型中。在迄今训练过的所有模型(具有LSTM、GRU、CNN和ResNet变体的神经网络)中,将双向LSTM与ResNet和一维CNN配对的模型为该数据集产生了最好的结果。该模型使用连接主义时间分类(CTC)函数计算训练过程中的损失,并使用CTC波束搜索译码来预测作为尼泊尔文本最可能序列的字符。在测试数据集上,获得了17.06%的字符错误率。源代码可在以下网址获得:This HTTPS URL。
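
The character error rate (CER) reported in the abstract above is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below follows that standard definition. It counts Python code points, which is an assumption: the abstract does not say how Devanagari characters are segmented for scoring. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))          # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]                            # distance to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))
```

Under this definition, a CER of 17.06 percent corresponds to roughly 17 character edits per 100 reference characters.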

[NLP-88] Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache
[NLP-88] 利用级联KV缓存对滑动窗口上下文进行免训练的指数级扩展

链接: https://arxiv.org/abs/2406.17808
作者: Jeffrey Willette,Heejun Lee,Youngwan Lee,Myeongjae Jeon,Sung Ju Hwang
关键词: current task, Large Language Models, form of active, active memory, few-shot learning
中文关键词: 当前任务、大型语言模型、主动记忆形式、少量学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The context window within a transformer provides a form of active memory for the current task, which can be useful for few-shot learning and conditional generation, both which depend heavily on previous context tokens. However, as the context length grows, the computational cost increases quadratically. Recent works have shown that saving a few initial tokens along with a fixed-sized sliding window leads to stable streaming generation with linear complexity in transformer-based Large Language Models (LLMs). However, they make suboptimal use of the fixed window by naively evicting all tokens unconditionally from the key-value (KV) cache once they reach the end of the window, resulting in tokens being forgotten and no longer able to affect subsequent predictions. To overcome this limitation, we propose a novel mechanism for storing longer sliding window contexts with the same total cache size by keeping separate cascading sub-cache buffers whereby each subsequent buffer conditionally accepts a fraction of the relatively more important tokens evicted from the previous buffer. Our method results in a dynamic KV cache that can store tokens from the more distant past than a fixed, static sliding window approach. Our experiments show improvements of 5.6% on long context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM) using LLMs given the same fixed cache size. Additionally, we provide an efficient implementation that improves the KV cache latency from 1.33ms per caching operation to 0.54ms, a 59% speedup over previous work.
摘要:Transformer中的上下文窗口为当前任务提供了一种活动记忆形式,这对于少样本学习和条件生成非常有用,这两者都严重依赖于先前的上下文令牌。然而,随着上下文长度的增加,计算成本呈二次增长。最近的工作表明,在基于Transformer的大型语言模型(LLM)中,保存少量初始令牌并配合固定大小的滑动窗口,可以实现稳定的、具有线性复杂度的流式生成。然而,一旦令牌到达窗口末尾,这些方法就简单地无条件将所有令牌从键值(KV)缓存中逐出,未能充分利用固定窗口,导致令牌被遗忘并且不再能够影响后续预测。为了克服这一局限性,我们提出了一种新的机制,通过保持单独的级联子缓存缓冲区,在总缓存大小相同的情况下存储更长的滑动窗口上下文,其中每个后续缓冲区有条件地接受从前一个缓冲区中逐出的相对更重要的令牌的一部分。我们的方法产生了一个动态KV缓存,它可以存储比固定的静态滑动窗口方法更遥远的过去的令牌。我们的实验表明,在固定缓存大小相同的情况下,使用LLM在长上下文生成(LongBench)上提高了5.6%,在流式困惑度(PG19)上提高了1.2%,在语言理解(MMLU STEM)上提高了0.6%。此外,我们还提供了一个高效的实现,可以将KV缓存延迟从每次缓存操作1.33毫秒降低到0.54毫秒,比以前的工作加速59%。
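
The cascading idea in the abstract above can be pictured with a toy model. This is only a sketch under stated assumptions, not the authors' implementation: each sub-cache is a fixed-size FIFO, and the "conditional acceptance" of evicted tokens is replaced by a simple stride rule (every `accept_every`-th eviction is passed on), whereas the abstract describes selecting relatively more important tokens. The class name `CascadingCache` and its parameters are hypothetical.

```python
from collections import deque


class CascadingCache:
    """Toy cascade of FIFO buffers; later buffers keep a thinned-out history."""

    def __init__(self, num_buffers, capacity, accept_every=2):
        self.buffers = [deque() for _ in range(num_buffers)]
        self.capacity = capacity            # size of each sub-cache
        self.accept_every = accept_every    # assumption: stride-based acceptance
        self.evictions = [0] * num_buffers  # evictions counted per buffer

    def add(self, token, level=0):
        if level >= len(self.buffers):
            return  # evicted from the last buffer: the token is forgotten
        buf = self.buffers[level]
        buf.append(token)
        if len(buf) > self.capacity:
            evicted = buf.popleft()
            self.evictions[level] += 1
            # Pass only a fraction of evicted tokens on to the next buffer.
            if self.evictions[level] % self.accept_every == 0:
                self.add(evicted, level + 1)

    def tokens(self):
        """All cached tokens, oldest buffer first."""
        out = []
        for buf in reversed(self.buffers):
            out.extend(buf)
        return out


cache = CascadingCache(num_buffers=2, capacity=4, accept_every=2)
for t in range(16):
    cache.add(t)
print(cache.tokens())
```

With a total budget of 8 slots, a plain sliding window over tokens 0 to 15 would hold only tokens 8 to 15, while this toy cascade still holds token 5: the second buffer trades density for reach, which is the intuition behind storing tokens from the more distant past at the same cache size.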

[NLP-89] Enhancing Commentary Strategies for Imperfect Information Card Games: A Study of Large Language Models in Guandan Commentary
[NLP-89] 增强不完全信息纸牌游戏的解说策略:掼蛋解说中大型语言模型的研究

链接: https://arxiv.org/abs/2406.17807
作者: Meiling Tao.Xuechen Liang,Yiling Tao,Tianyu Shi
关键词: Recent advancements, generating high-quality game, large language models, advancements in large, unlocked the potential
中文关键词: 最近的进步、生成高质量游戏、大型语言模型、大型进步,释放了潜力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have unlocked the potential for generating high-quality game commentary. However, producing insightful and engaging commentary for complex games with incomplete information remains a significant challenge. In this paper, we introduce a novel commentary method that combine Reinforcement Learning (RL) and LLMs, tailored specifically for the Chinese card game Guandan. Our system leverages RL to generate intricate card-playing scenarios and employs LLMs to generate corresponding commentary text, effectively emulating the strategic analysis and narrative prowess of professional commentators. The framework comprises a state commentary guide, a Theory of Mind (ToM)-based strategy analyzer, and a style retrieval module, which seamlessly collaborate to deliver detailed and context-relevant game commentary in the Chinese language environment. We empower LLMs with ToM capabilities and refine both retrieval and information filtering mechanisms. This facilitates the generation of personalized commentary content. Our experimental results showcase the substantial enhancement in performance achieved by the proposed commentary framework when applied to open-source LLMs, surpassing the performance of GPT-4 across multiple evaluation metrics.
摘要:大型语言模型(LLM)的最新进展释放了生成高质量游戏解说的潜力。然而,为不完全信息的复杂游戏制作有洞察力和引人入胜的解说仍然是一个巨大的挑战。本文介绍了一种将强化学习(RL)和LLM相结合的新的解说方法,该方法是专门为中国纸牌游戏掼蛋(Guandan)而定制的。我们的系统利用RL生成错综复杂的出牌场景,并使用LLM生成相应的解说文本,有效地模仿专业解说员的战略分析和叙事能力。该框架包括状态解说指南、基于心智理论(ToM)的策略分析器和风格检索模块,它们无缝协作,在中文环境中提供详细的和上下文相关的游戏解说。我们为LLM赋予了ToM能力,并完善了检索和信息过滤机制。这促进了个性化解说内容的生成。我们的实验结果表明,当应用于开源LLM时,所提出的解说框架在性能上取得了显著的提升,在多个评估指标上超过了GPT-4的性能。

[NLP-90] MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
[NLP-90] MOSSBench:您的多模态语言模型是否对安全查询过于敏感?

链接: https://arxiv.org/abs/2406.17806
作者: Xirui Li,Hengguang Zhou,Ruochen Wang,Tianyi Zhou,Minhao Cheng,Cho-Jui Hsieh
关键词: biased thinking patterns, Humans are prone, Large Language Models, Multimodal Large Language, cognitive distortions
中文关键词: 有偏见的思维模式、人类容易犯错、大型语言模型、多模式大型语言、认知扭曲
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humans are prone to cognitive distortions – biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies. While these models are designed to respond to queries under safety mechanisms, they sometimes reject harmless queries in the presence of certain visual stimuli, disregarding the benign nature of their contexts. As an initial step in investigating this behavior, we identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To systematically evaluate MLLMs’ oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2) Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model’s responses. (3) Different types of stimuli tend to cause errors at specific stages – perception, intent reasoning, and safety judgement – in the response process of MLLMs. These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications. We make our project available at this https URL.

[NLP-91] Can LLMs Generate Visualizations with Dataless Prompts?

Link: https://arxiv.org/abs/2406.17805
Authors: Darius Coelho,Harshit Barot,Naitik Rathod,Klaus Mueller
Keywords: revolutionized information access, preferred information source, Recent advancements, address complex queries, information access
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Recent advancements in large language models have revolutionized information access, as these models harness data available on the web to address complex queries, becoming the preferred information source for many users. In certain cases, queries are about publicly available data, which can be effectively answered with data visualizations. In this paper, we investigate the ability of large language models to provide accurate data and relevant visualizations in response to such queries. Specifically, we investigate the ability of GPT-3 and GPT-4 to generate visualizations with dataless prompts, where no data accompanies the query. We evaluate the results of the models by comparing them to visualization cheat sheets created by visualization experts.

[NLP-92] Understanding the Role of User Profile in the Personalization of Large Language Models

Link: https://arxiv.org/abs/2406.17803
Authors: Bin Wu,Zhengyan Shi,Hossein A. Rahmani,Varsha Ramineni,Emine Yilmaz
Keywords: Large Language Models, personalize Large Language, Language Models, Large Language, Utilizing user profiles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Utilizing user profiles to personalize Large Language Models (LLMs) has been shown to enhance performance on a wide range of tasks. However, the precise role of user profiles and the mechanism by which they affect LLMs remain unclear. This study first confirms that the effectiveness of user profiles is primarily due to personalization information rather than semantic information. Furthermore, we investigate how user profiles affect the personalization of LLMs. Within the user profile, we reveal that it is the historical personalized response produced or approved by users that plays a pivotal role in personalizing LLMs. This discovery unlocks the potential of LLMs to incorporate a greater number of user profiles within the constraints of limited input length. As for the position of user profiles, we observe that user profiles integrated into different positions of the input context do not contribute equally to personalization. Instead, user profiles placed closer to the beginning of the input have a greater effect on the personalization of LLMs. Our findings reveal the role of user profiles in the personalization of LLMs and showcase how incorporating user profiles impacts performance, providing insight into how to leverage user profiles effectively.

[NLP-93] A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge

Link: https://arxiv.org/abs/2406.17801
Authors: Xiaopeng Wang,Yi Lu,Xin Qi,Zhiyong Wang,Yuankun Xie,Shuchen Shi,Ruibo Fu
Keywords: speaker similarity, Speaker Similarity score, focusing primarily, paper presents, presents the development
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Abstract:This paper presents the development of a speech synthesis system for the LIMMITS’24 Challenge, focusing primarily on Track 2. The objective of the challenge is to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities, covering seven Indian languages with both male and female speakers. The system was trained using challenge data and fine-tuned for few-shot voice cloning on target speakers. Evaluation included both mono-lingual and cross-lingual synthesis across all seven languages, with subjective tests assessing naturalness and speaker similarity. Our system uses the VITS2 architecture, augmented with a multi-lingual ID and a BERT model to enhance contextual language comprehension. In Track 1, where no additional data usage was permitted, our model achieved a Speaker Similarity score of 4.02. In Track 2, which allowed the use of extra data, it attained a Speaker Similarity score of 4.17.

[NLP-94] Deep Learning Approaches for Detecting Adversarial Cyberbullying and Hate Speech in Social Networks

Link: https://arxiv.org/abs/2406.17793
Authors: Sylvia Worlali Azumah,Nelly Elsayed,Zag ElSayed,Murat Ozer,Amanda La Guardia
Keywords: concern intricately linked, intricately linked, find resolution, resolution through technological, significant concern intricately
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, 8 figures, 3 tables, under review

Abstract:Cyberbullying is a significant concern intricately linked to technology that can find resolution through technological means. Despite its prevalence, technology also provides solutions to mitigate cyberbullying. To address growing concerns regarding the adverse impact of cyberbullying on individuals’ online experiences, various online platforms and researchers are actively adopting measures to enhance the safety of digital environments. While researchers persist in crafting detection models to counteract or minimize cyberbullying, malicious actors are deploying adversarial techniques to circumvent these detection methods. This paper focuses on detecting cyberbullying in adversarial attack content within social networking site text data, specifically emphasizing hate speech. Utilizing a deep learning-based approach with a correction algorithm, this paper yielded significant results. An LSTM model with a fixed epoch of 100 demonstrated remarkable performance, achieving high accuracy, precision, recall, F1-score, and AUC-ROC scores of 87.57%, 88.73%, 87.57%, 88.15%, and 91% respectively. Additionally, the LSTM model’s performance surpassed that of previous studies.
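
The precision, recall, and F1 figures quoted in the abstract above can be sanity-checked against each other, since F1 is the harmonic mean of precision and recall. The sketch below is only an illustration of that arithmetic: the function name `f1_score` is a hypothetical helper, not code from the paper, and the three numbers are copied from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 88.73
reported_recall = 87.57
reported_f1 = 88.15

# Recompute F1 from the reported precision and recall.
computed_f1 = f1_score(reported_precision, reported_recall)
print(round(computed_f1, 2))  # prints 88.15
```

The recomputed value agrees with the reported F1-score to two decimal places, so the three figures are mutually consistent.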

[NLP-95] Mashee at SemEval-2024 Task 8: The Impact of Samples Quality on the Performance of In-Context Learning for Machine Text Classification

Link: https://arxiv.org/abs/2406.17790
Authors: Areeg Fahad Rasheed,M. Zarkoosh
Keywords: leveraging contextual information, improve model performance, datasets is prohibitive, improve model, training models
Categories: Computation and Language (cs.CL)

Abstract:Within few-shot learning, in-context learning (ICL) has become a potential method for leveraging contextual information to improve model performance on small amounts of data or in resource-constrained environments where training models on large datasets is prohibitive. However, the quality of the selected sample in a few shots severely limits the usefulness of ICL. The primary goal of this paper is to enhance the performance of evaluation metrics for in-context learning by selecting high-quality samples in few-shot learning scenarios. We employ the chi-square test to identify high-quality samples and compare the results with those obtained using low-quality samples. Our findings demonstrate that utilizing high-quality samples leads to improved performance with respect to all evaluated metrics.

[NLP-96] Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Link: https://arxiv.org/abs/2406.17789
Authors: Irene Plaza,Nina Melero,Cristina del Pozo,Javier Conde,Pedro Reviriego,Marina Mayor-Rocher,María Grandury
Keywords: Large Language Models, continuous improvement process, evaluation of Large, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The evaluation of Large Language Models (LLMs) is a key element in their continuous improvement process and many benchmarks have been developed to assess the performance of LLMs in different tasks and topics. As LLMs become adopted worldwide, evaluating them in languages other than English is increasingly important. However, most LLM benchmarks are simply translated using an automated tool and then run in the target language. This means that the results depend not only on the LLM performance in that language but also on the quality of the translation. In this paper, we consider the case of the well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected categories of the benchmark are translated into Spanish using Azure Translator and ChatGPT4 and run on ChatGPT4. Next, the results are processed to identify the test items that produce different answers in Spanish and English. Those are then analyzed manually to understand if the automatic translation caused the change. The results show that a significant fraction of the failing items can be attributed to mistakes in the translation of the benchmark. These results make a strong case for improving benchmarks in languages other than English by at least revising the translations of the items and preferably by adapting the tests to the target language by experts.

[NLP-97] Role of Dependency Distance in Text Simplification: A Human vs ChatGPT Simplification Comparison

Link: https://arxiv.org/abs/2406.17787
Authors: Sumi Lee,Gondy Leroy,David Kauchak,Melissa Just
Keywords: ChatGPT text simplification, dependency distance, study investigates human, text simplification, study investigates
Categories: Computation and Language (cs.CL)

Abstract:This study investigates human and ChatGPT text simplification and its relationship to dependency distance. A set of 220 sentences, with increasing grammatical difficulty as measured in a prior user study, were simplified by a human expert and using ChatGPT. We found that the three sentence sets all differed in mean dependency distances: the highest in the original sentence set, followed by ChatGPT simplified sentences, and the human simplified sentences showed the lowest mean dependency distance.
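
For readers unfamiliar with the metric named in the abstract above, here is a minimal sketch of mean dependency distance. The abstract does not give a formula, so this assumes the common definition: the average absolute difference between each word's position and its head's position, skipping the root. The function name and the hand-written parse are hypothetical; in practice the head indices would come from a dependency parser.

```python
def mean_dependency_distance(heads: list[int]) -> float:
    """Mean |position - head position| over all non-root words.

    heads[i] is the 1-based position of the head of word i+1;
    0 marks the root, which has no incoming dependency.
    """
    distances = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Hand-annotated, illustrative parse of "The cat sat on the mat":
# The->cat, cat->sat, sat=root, on->mat, the->mat, mat->sat
example_heads = [2, 3, 0, 6, 6, 3]
example_mdd = mean_dependency_distance(example_heads)
print(example_mdd)  # prints 1.6
```

Under this definition, a lower value means words sit closer to their syntactic heads, which is the direction the abstract reports for the simplified sentence sets.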

[NLP-98] OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

Link: https://arxiv.org/abs/2406.16620
Authors: Lu Zhang,Tiancheng Zhao,Heting Ying,Yibo Ma,Kyusong Lee
Keywords: Large Language Models, Language Models, Large Language, Recent advancements, advancements in Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent’s efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.

[NLP-99] MSR-86K: An Evolving Multilingual Corpus with 86300 Hours of Transcribed Audio for Speech Recognition Research

Link: https://arxiv.org/abs/2406.18301
Authors: Song Li,Yongbin You,Xuezhi Wang,Zhengkun Tian,Ke Ding,Guanglu Wan
Keywords: artificial intelligence assistants, gained immense popularity, exemplified by ChatGPT, multilingual artificial intelligence, intelligence assistants
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by InterSpeech 2024

Abstract:Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers’ efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large corpus will pave new avenues for research in multilingual ASR.

[NLP-100] Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals

Link: https://arxiv.org/abs/2406.18069
Authors: Zengding Liu,Chen Chen,Jiannong Cao,Minglei Pan,Jikui Liu,Nan Li,Fen Miao,Ye Li
Keywords: Large language models, captured significant interest, Large language, language models, captured significant
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through instruction tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 ± 9.25 mmHg for systolic BP and 1.29 ± 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
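
The "mean ± deviation" error figures in the abstract above are conventionally read as the mean of the signed estimation errors plus or minus their standard deviation. The sketch below shows that summary under stated assumptions: the abstract does not say whether a sample or population standard deviation was used (sample is assumed here), the helper name `error_summary` is hypothetical, and the readings are made-up illustrative values, not data from the paper.

```python
from statistics import mean, stdev


def error_summary(predicted: list[float], reference: list[float]) -> tuple[float, float]:
    """Return (mean error, sample standard deviation of the error) in mmHg."""
    errors = [p - r for p, r in zip(predicted, reference)]
    return mean(errors), stdev(errors)


# Hypothetical systolic readings in mmHg, for illustration only.
reference_sbp = [120, 130, 110, 125]
predicted_sbp = [118, 133, 112, 124]

mean_error, sd_error = error_summary(predicted_sbp, reference_sbp)
print(f"{mean_error:.2f} ± {sd_error:.2f} mmHg")
```

A mean error near zero indicates little systematic bias, while the deviation term describes how widely individual estimates scatter around the reference values.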

Computer Vision

[CV-0] On Scaling Up 3D Gaussian Splatting Training

Link: https://arxiv.org/abs/2406.18533
Authors: Hexu Zhao,Haoyang Weng,Daohan Lu,Ang Li,Jinyang Li,Aurojit Panda,Saining Xie
Keywords: Gaussian Splatting, superior visual quality, reconstruction tasks due, increasingly popular, superior visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL; Project page: this https URL

Abstract:3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the Rubble dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU. Grendel is an open-source project available at: this https URL
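
The abstract above mentions a "sqrt(batch size) scaling rule" for optimization hyperparameters without spelling it out. As a rough sketch only, the code below assumes the rule is applied to the learning rate and is expressed relative to a reference batch size; the function name, the parameter names, and the base values are all hypothetical, and the paper may scale other hyperparameters differently.

```python
import math


def sqrt_scaled_lr(base_lr: float, batch_size: int, base_batch_size: int = 1) -> float:
    """Scale a learning rate by sqrt(batch_size / base_batch_size)."""
    if batch_size <= 0 or base_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Illustrative values: a base rate tuned for one view per step,
# rescaled for batches of 4 and 16 views.
base_lr = 1e-3
for batch_size in (1, 4, 16):
    print(batch_size, sqrt_scaled_lr(base_lr, batch_size))
```

Compared with linear scaling, the square-root form grows the step size more slowly as the batch gets larger, so a 16-fold increase in batch size raises the rate only 4-fold.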

[CV-1] MatchTime: Towards Automatic Soccer Game Commentary Generation

Link: https://arxiv.org/abs/2406.18530
Authors: Jiayuan Rao,Haoning Wu,Chang Liu,Yanfeng Wang,Weidi Xie
Keywords: soccer game commentary, audiences’ viewing experience, globally popular sport, automatic soccer game, soccer game
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report; Project Page: this https URL

Abstract:Soccer is a globally popular sport with a vast audience. In this paper, we consider constructing an automatic soccer game commentary model to improve the audiences’ viewing experience. In general, we make the following contributions: First, observing the prevalent video-text misalignment in existing datasets, we manually annotate timestamps for 49 matches, establishing a more robust benchmark for soccer game commentary generation, termed SN-Caption-test-align; Second, we propose a multi-modal temporal alignment pipeline to automatically correct and filter the existing dataset at scale, creating a higher-quality soccer game commentary dataset for training, denoted as MatchTime; Third, based on our curated dataset, we train an automatic commentary generation model, named MatchVoice. Extensive experiments and ablation studies have demonstrated the effectiveness of our alignment pipeline, and training the model on the curated dataset achieves state-of-the-art performance for commentary generation, showcasing that better alignment can lead to significant performance improvements in downstream tasks.

[CV-2] MultiDiff: Consistent Novel View Synthesis from a Single Image

Link: https://arxiv.org/abs/2406.18524
Authors: Norman Müller,Katja Schwarz,Barbara Roessle,Lorenzo Porzi,Samuel Rota Bulò,Matthias Nießner,Peter Kontschieder
Keywords: single RGB image, single RGB, RGB image, RGB, view synthesis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL; Video: this https URL; CVPR 2024

Abstract:We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple, plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation that are prone to drifts and error accumulation, MultiDiff jointly synthesizes a sequence of frames yielding high-quality and multi-view consistent results – even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel, structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.

[CV-3] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

Link: https://arxiv.org/abs/2406.18522
Authors: Shenghai Yuan,Jinfa Huang,Yongqi Xu,Yaoyang Liu,Shaofeng Zhang,Yujun Shi,Ruijie Zhu,Xinhua Cheng,Jiebo Luo,Li Yuan
Keywords: Sora and Lumiere, time-lapse videos, temporal coherence, videos, time-lapse video generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 31 pages, 15 figures

Abstract:We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model’s ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model’s capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos’ metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude.

[CV-4] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Link: https://arxiv.org/abs/2406.18521
Authors: Zirui Wang,Mengzhou Xia,Luxi He,Howard Chen,Yitao Liu,Richard Zhu,Kaiqu Liang,Xindi Wu,Haotian Liu,Sadhika Malladi,Alexis Chevalier,Sanjeev Arora,Danqi Chen
Keywords: Multimodal Large Language, applying Multimodal Large, Large Language Models, Multimodal Large, Large Language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 121 pages, 90 figures

Abstract:Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: this https URL

[CV-5] Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration

Link: https://arxiv.org/abs/2406.18516
Authors: Kang Liao,Zongsheng Yue,Zhouxia Wang,Chen Change Loy
Keywords: made significant progress, real-world scenarios due, substantial domain gap, domain gap caused, significant progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Github Repository: this https URL

Abstract:Although deep learning-based image restoration methods have made significant progress, they still struggle with limited generalization to real-world scenarios due to the substantial domain gap caused by training on synthetic data. Existing methods address this issue by improving data synthesis pipelines, estimating degradation kernels, employing deep internal learning, and performing domain adaptation and regularization. Previous domain adaptation methods have sought to bridge the domain gap by learning domain-invariant knowledge in either feature or pixel space. However, these techniques often struggle to extend to low-level vision tasks within a stable and compact framework. In this paper, we show that it is possible to perform domain adaptation via the noise-space using diffusion models. In particular, by leveraging the unique property of how the multi-step denoising process is influenced by auxiliary conditional inputs, we obtain meaningful gradients from noise prediction to gradually align the restored results of both synthetic and real-world data to a common clean distribution. We refer to this method as denoising as adaptation. To prevent shortcuts during training, we present useful techniques such as channel shuffling and residual-swapping contrastive learning. Experimental results on three classical image restoration tasks, namely denoising, deblurring, and deraining, demonstrate the effectiveness of the proposed method. Code will be released at: this https URL.

[CV-6] Robust Surgical Phase Recognition From Annotation Efficient Supervision

Link: https://arxiv.org/abs/2406.18481
Authors: Or Rubin,Shlomi Laufer
Keywords: Surgical phase recognition, missing phase annotations, computer-assisted surgery, aiming to automatically, Surgical phase
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Surgical phase recognition is a key task in computer-assisted surgery, aiming to automatically identify and categorize the different phases within a surgical procedure. Despite substantial advancements, most current approaches rely on fully supervised training, requiring expensive and time-consuming frame-level annotations. Timestamp supervision has recently emerged as a promising alternative, significantly reducing annotation costs while maintaining competitive performance. However, models trained on timestamp annotations can be negatively impacted by missing phase annotations, leading to a potential drawback in real-world scenarios. In this work, we address this issue by proposing a robust method for surgical phase recognition that can handle missing phase annotations effectively. Furthermore, we introduce the SkipTag@K annotation approach to the surgical domain, enabling a flexible balance between annotation effort and model performance. Our method achieves competitive results on two challenging datasets, demonstrating its efficacy in handling missing phase annotations and its potential for reducing annotation costs. Specifically, we achieve an accuracy of 85.1% on the MultiBypass140 dataset using only 3 annotated frames per video, showcasing the effectiveness of our method and the potential of the SkipTag@K setup. We perform extensive experiments to validate the robustness of our method and provide valuable insights to guide future research in surgical phase recognition. Our work contributes to the advancement of surgical workflow recognition and paves the way for more efficient and reliable surgical phase recognition systems.

[CV-7] GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality

Link: https://arxiv.org/abs/2406.18462
Authors: Taoran Yi, Jiemin Fang, Zanwei Zhou, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Xinggang Wang, Qi Tian
Keywords: rendering real-world scenes, achieved great success, real-world scenes, great success, success in reconstructing
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*Comments: Project page: this https URL

Abstract:Recently, 3D Gaussian splatting (3D-GS) has achieved great success in reconstructing and rendering real-world scenes. To transfer the high rendering quality to generation tasks, a series of research works attempt to generate 3D-Gaussian assets from text. However, the generated assets have not achieved the same quality as those in reconstruction tasks. We observe that Gaussians tend to grow without control as the generation process may cause indeterminacy. Aiming at highly enhancing the generation quality, we propose a novel framework named GaussianDreamerPro. The main idea is to bind Gaussians to reasonable geometry, which evolves over the whole generation process. Along different stages of our framework, both the geometry and appearance can be enriched progressively. The final output asset is constructed with 3D Gaussians bound to mesh, which shows significantly enhanced details and quality compared with previous methods. Notably, the generated asset can also be seamlessly integrated into downstream manipulation pipelines, e.g. animation, composition, and simulation etc., greatly promoting its potential in wide applications. Demos are available at this https URL.

[CV-8] DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Link: https://arxiv.org/abs/2406.18459
Authors: Younghyun Kim, Geunmin Hwang, Eunbyung Park
Keywords: Recent surge, computer vision, spurred the development, development of vast, vast fields
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Recent surge in large-scale generative models has spurred the development of vast fields in computer vision. In particular, text-to-image diffusion models have garnered widespread adoption across diverse domain due to their potential for high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generate images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing the aforementioned issues typically necessitates training or fine-tuning models on higher resolution datasets. However, this undertaking poses a formidable challenge due to the difficulty in collecting large-scale high-resolution contents and substantial computational resources. While several preceding works have proposed alternatives, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at higher resolution beyond its original capability and propose a novel progressive approach that fully utilizes generated low-resolution image to guide the generation of higher resolution image. Our method obviates the need for additional training or fine-tuning which significantly lowers the burden of computational costs. Extensive experiments and results validate the efficiency and efficacy of our method.

[CV-9] Towards Human-Level 3D Relative Pose Estimation: Generalizable Training-Free with Single Reference

Link: https://arxiv.org/abs/2406.18453
Authors: Yuan Gao, Yajing Luo, Junhong Wang, Kui Jia, Gui-Song Xia
Keywords: query-reference image pair, Humans can easily, single query-reference image, easily deduce, single query-reference
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: The codes are available at this https URL

Abstract:Humans can easily deduce the relative pose of an unseen object, without label/training, given only a single query-reference image pair. This is arguably achieved by incorporating (i) 3D/2.5D shape perception from a single image, (ii) render-and-compare simulation, and (iii) rich semantic cue awareness to furnish (coarse) reference-query correspondence. Existing methods implement (i) by a 3D CAD model or well-calibrated multiple images and (ii) by training a network on specific objects, which necessitate laborious ground-truth labeling and tedious training, potentially leading to challenges in generalization. Moreover, (iii) was less exploited in the paradigm of (ii), despite that the coarse correspondence from (iii) enhances the compare process by filtering out non-overlapped parts under substantial pose differences/occlusions. Motivated by this, we propose a novel 3D generalizable relative pose estimation method by elaborating (i) with a 2.5D shape from an RGB-D reference, (ii) with an off-the-shelf differentiable renderer, and (iii) with semantic cues from a pretrained model like DINOv2. Specifically, our differentiable renderer takes the 2.5D rotatable mesh textured by the RGB and the semantic maps (obtained by DINOv2 from the RGB input), then renders new RGB and semantic maps (with back-surface culling) under a novel rotated view. The refinement loss comes from comparing the rendered RGB and semantic maps with the query ones, back-propagating the gradients through the differentiable renderer to refine the 3D relative pose. As a result, our method can be readily applied to unseen objects, given only a single RGB-D reference, without label/training. Extensive experiments on LineMOD, LM-O, and YCB-V show that our training-free method significantly outperforms the SOTA supervised methods, especially under the rigorous Acc@5/10/15° metrics and the challenging cross-dataset settings.

[CV-10] Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers

Link: https://arxiv.org/abs/2406.18451
Authors: Jonas Ngnawé, Sabyasachi Sahoo, Yann Pequignot, Frédéric Precioso, Christian Gagné
Keywords: high-stakes real-world applications, adversarial training strategies, input space margins, improve robustness, imperceptible perturbations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 11 pages, 7 figures, 2 tables, 1 algorithm

Abstract:Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model’s vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency – a property that links the input space margins and the logit margins in robust models – for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model’s logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
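The logit margin that the abstract proposes as a cheap vulnerability score can be illustrated with a minimal sketch. It assumes the margin is the gap between the two largest logits; the function names and the threshold value are hypothetical and are not taken from the paper.

```python
def logit_margin(logits):
    """Return the gap between the largest and second-largest logit."""
    if len(logits) < 2:
        raise ValueError("need at least two class logits")
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up


def flag_brittle(batch_logits, threshold):
    """Flag samples whose logit margin falls below a hypothetical threshold.

    In a margin-consistent model, a small logit margin is expected to
    indicate a small input-space margin, i.e. a potentially brittle decision.
    """
    return [logit_margin(logits) < threshold for logits in batch_logits]


if __name__ == "__main__":
    batch = [[2.0, 0.5, 1.5], [5.0, 0.0, 1.0]]
    print(flag_brittle(batch, threshold=1.0))
```

In practice the threshold would have to be calibrated on a small subset where input margins are estimated with attacks, which is roughly the workflow the abstract describes.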

[CV-11] Unveiling the Unknown: Conditional Evidence Decoupling for Unknown Rejection

Link: https://arxiv.org/abs/2406.18443
Authors: Zhaowei Wu, Binyi Su, Hua Zhang, Zhong Zhou
Keywords: conditional evidence decoupling, open-set object detector, condition of scarce, Evidence Decoupling Loss, scarce training samples
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:In this paper, we focus on training an open-set object detector under the condition of scarce training samples, which should distinguish the known and unknown categories. Under this challenging scenario, the decision boundaries of unknowns are difficult to learn and often ambiguous. To mitigate this issue, we develop a novel open-set object detection framework, which delves into conditional evidence decoupling for the unknown rejection. Specifically, we select pseudo-unknown samples by leveraging the discrepancy in attribution gradients between known and unknown classes, alleviating the inadequate unknown distribution coverage of training data. Subsequently, we propose a Conditional Evidence Decoupling Loss (CEDL) based on Evidential Deep Learning (EDL) theory, which decouples known and unknown properties in pseudo-unknown samples to learn distinct knowledge, enhancing separability between knowns and unknowns. Additionally, we propose an Abnormality Calibration Loss (ACL), which serves as a regularization term to adjust the output probability distribution, establishing robust decision boundaries for the unknown rejection. Our method achieves superior performance over previous state-of-the-art approaches, improving the mean recall of the unknown class by 7.24% across all shots in VOC10-5-5 dataset settings and 1.38% in VOC-COCO dataset settings. The code is available via this https URL.

[CV-12] Facial Image Feature Analysis and its Specialization for Frechet Distance and Neighborhoods

Link: https://arxiv.org/abs/2406.18430
Authors: Doruk Cetin, Benedikt Schesch, Petar Stamenkovic, Niko Benjamin Huber, Fabio Zünd, Majed El Helou
Keywords: Assessing distances, Fréchet Inception Distance, fundamental task, task in vision-based, vision-based research
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Assessing distances between images and image datasets is a fundamental task in vision-based research. It is a challenging open problem in the literature and despite the criticism it receives, the most ubiquitous method remains the Fréchet Inception Distance. The Inception network is trained on a specific labeled dataset, ImageNet, which has caused the core of its criticism in the most recent research. Improvements were shown by moving to self-supervision learning over ImageNet, leaving the training data domain as an open question. We make that last leap and provide the first analysis on domain-specific feature training and its effects on feature distance, on the widely-researched facial image domain. We provide our findings and insights on this domain specialization for Fréchet distance and image neighborhoods, supported by extensive experiments and in-depth user studies.
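For reference, the Fréchet distance discussed in this abstract is commonly computed between two Gaussians fitted to feature embeddings of the two image sets. The symbols below (means \mu_i and covariances \Sigma_i of the embedded features) are notation introduced here for illustration and do not appear in the abstract.

```latex
d^2\big((\mu_1, \Sigma_1), (\mu_2, \Sigma_2)\big)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_1 + \Sigma_2 - 2\,(\Sigma_1 \Sigma_2)^{1/2}\big)
```

The choice of feature extractor determines \mu_i and \Sigma_i, which is why swapping an ImageNet-trained Inception network for a domain-specific one can change the resulting distances.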

[CV-13] Repeat and Concatenate: 2D to 3D Image Translation with 3D to 3D Generative Modeling

Link: https://arxiv.org/abs/2406.18422
Authors: Abril Corona-Figueroa, Hubert P. H. Shum, Chris G. Willcocks
Keywords: image translation method, image translation, paper investigates, straightforward technique, X-ray
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments: CVPRW 2024 - DCA in MI; Best Paper Award

Abstract:This paper investigates a 2D to 3D image translation method with a straightforward technique, enabling correlated 2D X-ray to 3D CT-like reconstruction. We observe that existing approaches, which integrate information across multiple 2D views in the latent space, lose valuable signal information during latent encoding. Instead, we simply repeat and concatenate the 2D views into higher-channel 3D volumes and approach the 3D reconstruction challenge as a straightforward 3D to 3D generative modeling problem, sidestepping several complex modeling issues. This method enables the reconstructed 3D volume to retain valuable information from the 2D inputs, which are passed between channel states in a Swin UNETR backbone. Our approach applies neural optimal transport, which is fast and stable to train, effectively integrating signal information across multiple views without the requirement for precise alignment; it produces non-collapsed reconstructions that are highly faithful to the 2D views, even after limited training. We demonstrate correlated results, both qualitatively and quantitatively, having trained our model on a single dataset and evaluated its generalization ability across six datasets, including out-of-distribution samples.
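The "repeat and concatenate" idea can be pictured with a small sketch using plain nested lists. It assumes each 2D view is copied along a new depth axis and the resulting volumes are stacked as channels; the axis choice, the function name, and the channels-first layout are assumptions made here, and the paper may handle view orientation differently.

```python
def repeat_and_concatenate(views, depth):
    """Turn V 2D views (each H x W) into a V x depth x H x W volume.

    Each view is repeated `depth` times along a new axis, and the
    repeated volumes are concatenated along the channel axis.
    """
    if not views:
        raise ValueError("at least one 2D view is required")
    height, width = len(views[0]), len(views[0][0])
    for view in views:
        if len(view) != height or any(len(row) != width for row in view):
            raise ValueError("all views must share the same H x W shape")
    # Copy rows so that every depth slice is an independent list.
    return [
        [[row[:] for row in view] for _ in range(depth)]
        for view in views
    ]


if __name__ == "__main__":
    frontal = [[1, 2], [3, 4]]
    lateral = [[5, 6], [7, 8]]
    volume = repeat_and_concatenate([frontal, lateral], depth=3)
    print(len(volume), len(volume[0]), len(volume[0][0]), len(volume[0][0][0]))
```

The resulting multi-channel volume is what a 3D-to-3D generative model would take as input in this formulation.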

[CV-14] BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data

Link: https://arxiv.org/abs/2406.18414
Authors: Kemiao Huang, Meiying Zhang, Qi Hao
Keywords: erroneous link correction, offline multi-object tracking, bounding box misalignment, real-time multi-object tracking, full track optimization
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantage of being able to perform 2D-3D detection fusion, erroneous link correction, and full track optimization, but has to deal with the challenges of bounding box misalignment and track evaluation, editing, and refinement. This paper proposes “BiTrack”, a 3D OMOT framework that includes modules of 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The novelty of this paper is threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management skills that utilize a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion, as well as refines each trajectory with completion and smoothing techniques. The experimental results on the KITTI dataset demonstrate that BiTrack achieves state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.

[CV-15] DoubleTake: Geometry Guided Depth Estimation

Link: https://arxiv.org/abs/2406.18387
Authors: Mohamed Sayed, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Guillermo Garcia-Hernando, Gabriel Brostow, Sara Vicente, Michael Firman
Keywords: posed RGB images, computer vision task, fundamental computer vision, posed RGB, RGB images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning etc. Prior work typically makes use of previous frames in a multi view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes and it is more regularized when compared to individual predicted depth maps for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.

[CV-16] From Majority to Minority: A Diffusion-based Augmentation for Underrepresented Groups in Skin Lesion Analysis

Link: https://arxiv.org/abs/2406.18375
Authors: Janet Wang, Yunsung Chung, Zhengming Ding, Jihun Hamm
Keywords: demonstrated dermatologist-level performance, classifying skin cancer, AI-based diagnoses, minority groups, diagnoses have demonstrated
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:AI-based diagnoses have demonstrated dermatologist-level performance in classifying skin cancer. However, such systems are prone to under-performing when tested on data from minority groups that lack sufficient representation in the training sets. Although data collection and annotation offer the best means for promoting minority groups, these processes are costly and time-consuming. Prior works have suggested that data from majority groups may serve as a valuable information source to supplement the training of diagnosis tools for minority groups. In this work, we propose an effective diffusion-based augmentation framework that maximizes the use of rich information from majority groups to benefit minority groups. Using groups with different skin types as a case study, our results show that the proposed framework can generate synthetic images that improve diagnostic results for the minority groups, even when there is little or no reference data from these target groups. The practical value of our work is evident in medical imaging analysis, where under-diagnosis persists as a problem for certain groups due to insufficient representation.

[CV-17] Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

Link: https://arxiv.org/abs/2406.18361
Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Fudan Zheng, Weijiang Yu
Keywords: generative tasks, demonstrated their effectiveness, Abstract, Diffusion, SDSeg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*Comments: Accepted at MICCAI 2024. Code and citation info see this https URL

Abstract:Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model’s stability as implied by its name. The code is available at this https URL

[CV-18] XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis

Link: https://arxiv.org/abs/2406.18360
Authors: Hao Li, Ming Yuan, Yan Zhang, Chenming Wu, Chen Zhao, Chunyu Song, Haocheng Feng, Errui Ding, Dingwen Zhang, Jingdong Wang
Keywords: testing autonomy systems, autonomous driving vehicles, autonomy systems, systems is crucial, pursuit of safe
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: project page: this https URL

Abstract:Thoroughly testing autonomy systems is crucial in the pursuit of safe autonomous driving vehicles. It necessitates creating safety-critical scenarios that go beyond what can be safely collected from real-world data, as many of these scenarios occur infrequently on public roads. However, the evaluation of most existing NVS methods relies on sporadic sampling of image frames from the training data, comparing the rendered images with ground truth images using metrics. Unfortunately, this evaluation protocol falls short of meeting the actual requirements in closed-loop simulations. Specifically, the true application demands the capability to render novel views that extend beyond the original trajectory (such as cross-lane views), which are challenging to capture in the real world. To address this, this paper presents a novel driving view synthesis dataset and benchmark specifically designed for autonomous driving simulations. This dataset is unique as it includes testing images captured by deviating from the training trajectory by 1-4 meters. It comprises six sequences encompassing various time and weather conditions. Each sequence contains 450 training images, 150 testing images, and their corresponding camera poses and intrinsic parameters. Leveraging this novel dataset, we establish the first realistic benchmark for evaluating existing NVS approaches under front-only and multi-camera settings. The experimental findings underscore the significant gap that exists in current approaches, revealing their inadequate ability to fulfill the demanding prerequisites of cross-lane or closed-loop simulation. Our dataset is released publicly at the project page: this https URL.

[CV-19] On Reducing Activity with Distillation and Regularization for Energy Efficient Spiking Neural Networks

Link: https://arxiv.org/abs/2406.18350
Authors: Thomas Louis, Benoit Miramond, Alain Pegatoquet, Adrien Girard
Keywords: formal neural networks, artificial neural networks, neural networks, growing steadily, alternative to formal
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments:

Abstract:Interest in spiking neural networks (SNNs) has been growing steadily, promising an energy-efficient alternative to formal neural networks (FNNs), commonly known as artificial neural networks (ANNs). Despite increasing interest, especially for Edge applications, these event-driven neural networks suffered from their difficulty to be trained compared to FNNs. To alleviate this problem, a number of innovative methods have been developed to provide performance more or less equivalent to that of FNNs. However, the spiking activity of a network during inference is usually not considered. While SNNs may usually have performance comparable to that of FNNs, it is often at the cost of an increase of the network’s activity, thus limiting the benefit of using them as a more energy-efficient solution. In this paper, we propose to leverage Knowledge Distillation (KD) for SNNs training with surrogate gradient descent in order to optimize the trade-off between performance and spiking activity. Then, after understanding why KD led to an increase in sparsity, we also explored Activations regularization and proposed a novel method with Logits Regularization. These approaches, validated on several datasets, clearly show a reduction in network spiking activity (-26.73% on GSC and -14.32% on CIFAR-10) while preserving accuracy.

[CV-20] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space

Link: https://arxiv.org/abs/2406.18344
Authors: Huzheng Yang, James Gee, Jianbo Shi
Keywords: study the intriguing, intriguing connection, deep networks, visual data, brain voxel fMRI
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:We study the intriguing connection between visual data, deep networks, and the brain. Our method creates a universal channel alignment by using brain voxel fMRI response prediction as the training objective. We discover that deep networks, trained with different objectives, share common feature channels across various models. These channels can be clustered into recurring sets, corresponding to distinct brain regions, indicating the formation of visual concepts. Tracing the clusters of channel responses onto the images, we see semantically meaningful object segments emerge, even without any supervised decoder. Furthermore, the universal feature alignment and the clustering of channels produce a picture and quantification of how visual information is processed through the different network layers, which produces precise comparisons between the networks.

[CV-21] Continuous Sign Language Recognition Using Intra-inter Gloss Attention

Link: https://arxiv.org/abs/2406.18333
Authors: Hossein Ranjbar, Alireza Taheri
Keywords: capturing global contexts, adopt transformer-based architectures, sequence modeling due, studies adopt transformer-based, sign language recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Many continuous sign language recognition (CSLR) studies adopt transformer-based architectures for sequence modeling due to their powerful capacity for capturing global contexts. Nevertheless, vanilla self-attention, which serves as the core module of the transformer, calculates a weighted average over all time steps; therefore, the local temporal semantics of sign videos may not be fully exploited. In this study, we introduce a novel module in sign language recognition studies, called intra-inter gloss attention module, to leverage the relationships among frames within glosses and the semantic and grammatical dependencies between glosses in the video. In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk. This localized self-attention significantly reduces complexity and eliminates noise introduced by considering non-relative frames. In the inter-gloss attention module, we first aggregate the chunk-level features within each gloss chunk by average pooling along the temporal dimension. Subsequently, multi-head self-attention is applied to all chunk-level features. Given the non-significance of the signer-environment interaction, we utilize segmentation to remove the background of the videos. This enables the proposed model to direct its focus toward the signer. Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge, improve the accuracy of CSLR, and achieve a word error rate (WER) of 20.4 on the test set, which is a competitive result compared to state-of-the-art methods that use additional supervision.
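The two-stage attention described in this abstract can be sketched in pure Python. This is a simplified illustration, not the authors' module: it uses single-head scaled dot-product attention with identity projections (no learned weights), and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(frames[0])
    output = []
    for query in frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[j] for w, value in zip(weights, frames))
             for j in range(dim)]
        )
    return output


def mean_pool(frames):
    """Average frame features along the temporal dimension."""
    dim = len(frames[0])
    return [sum(f[j] for f in frames) / len(frames) for j in range(dim)]


def intra_inter_attention(frames, chunk_size):
    """Intra-chunk attention, temporal average pooling, then inter-chunk attention."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    attended = [self_attention(chunk) for chunk in chunks]   # intra-gloss stage
    pooled = [mean_pool(chunk) for chunk in attended]        # one vector per chunk
    return self_attention(pooled)                            # inter-gloss stage


if __name__ == "__main__":
    video = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
    print(len(intra_inter_attention(video, chunk_size=2)))
```

Restricting attention to chunks of size c reduces the number of pairwise scores from T² to roughly T·c for a video of T frames, which is one way to read the complexity reduction claimed above. Note that the last chunk is shorter when T is not a multiple of c.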

[CV-22] Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution

Link: https://arxiv.org/abs/2406.18310
Authors: Wenting Chen, Jie Liu, Tommy W.S. Chow, Yixuan Yuan
Keywords: accurately interpreting lesion, interpreting lesion cells, acquiring high-resolution digital, high-resolution digital slides, digital slides requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments: Accepted to IEEE TRANSACTIONS ON MEDICAL IMAGING (TMI)

Abstract:Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of a pathology image, leading to sub-optimal recovery due to the large variation across pathology images. In this paper, we propose the first hierarchical reinforcement learning framework named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly for addressing the aforementioned issues in the pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt the hierarchical recovery mechanism at the patch level, to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processed problem. Under the guidance of spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL validates the promotion in tumor diagnosis with a large margin and shows generalizability under various degradations. The source code is available at this https URL.

[CV-23] Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI

Link: https://arxiv.org/abs/2406.18295
Authors: Nikolaos Dionelis, Casper Fibaek, Luke Camilleri, Andreas Luyts, Jente Bosmans, Bertrand Le Saux
Keywords: Foundation Models, Computer Vision application, Models, Foundation, prescribed high performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 5 pages, 2 figures, Submitted

Abstract:When we are primarily interested in solving several problems jointly with a given prescribed high performance accuracy for each target application, then Foundation Models should for most cases be used rather than problem-specific models. We focus on the specific Computer Vision application of Foundation Models for Earth Observation (EO) and geospatial AI. These models can solve important problems we are tackling, including for example land cover classification, crop type mapping, flood segmentation, building density estimation, and road regression segmentation. In this paper, we show that for a limited number of labelled data, Foundation Models achieve improved performance compared to problem-specific models. In this work, we also present our proposed evaluation benchmark for Foundation Models for EO. Benchmarking the generalization performance of Foundation Models is important as it has become difficult to standardize a fair comparison across the many different models that have been proposed recently. We present the results using our evaluation benchmark for EO Foundation Models and show that Foundation Models are label efficient in the downstream tasks and help us solve problems we are tackling in EO and remote sensing.

[CV-24] RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Link: https://arxiv.org/abs/2406.18284
Authors: Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Jian Yang, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Donghao Luo, Chengjie Wang
Keywords: Person-generic audio-driven face, computer vision, Person-generic audio-driven, challenging task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings in real-time performance. In this paper, we propose a novel generalized audio-driven framework RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to meet the needs of practical applications.

[CV-25] CAS: Confidence Assessments of classification algorithms for Semantic segmentation of EO data

Link: https://arxiv.org/abs/2406.18279
Authors: Nikolaos Dionelis,Nicolas Longepe
Keywords: semantic segmentation, Confidence assessments, remote sensing, Confidence, semantic segmentation algorithms
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 7 figures, 4 tables, Submitted

Abstract:Confidence assessments of semantic segmentation algorithms in remote sensing are important. It is a desirable property of models to a priori know if they produce an incorrect output. Evaluations of the confidence assigned to the estimates of models for the task of classification in Earth Observation (EO) are crucial as they can be used to achieve improved semantic segmentation performance and prevent high error rates during inference and deployment. The model we develop, the Confidence Assessments of classification algorithms for Semantic segmentation (CAS) model, performs confidence evaluations at both the segment and pixel levels, and outputs both labels and confidence. The outcome of this work has important applications. The main application is the evaluation of EO Foundation Models on semantic segmentation downstream tasks, in particular land cover classification using satellite Copernicus Sentinel-2 data. The evaluation shows that the proposed model is effective and outperforms other alternative baseline models.

[CV-26] Generalized Deepfake Attribution

Link: https://arxiv.org/abs/2406.18278
Authors: Sowdagar Mahammad Shahid,Sudev Kumar Padhi,Umesh Kashyap,Sk. Subidh Ali
Keywords: Generative Adversarial Networks, Generative Adversarial, introduction of Generative, GAN, media creation changed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:The landscape of fake media creation changed with the introduction of Generative Adversarial Networks (GANs). Fake media creation has been on the rise with the rapid advances in generation technology, leading to new challenges in detecting fake media. A fundamental characteristic of GANs is their sensitivity to parameter initialization, known as seeds. Each distinct seed utilized during training leads to the creation of unique model instances, resulting in divergent image outputs despite employing the same architecture. This means that even if we have one GAN architecture, it can produce countless variations of GAN models depending on the seed used. Existing methods for attributing deepfakes work well only if they have seen the specific GAN model during training. If the GAN architectures are retrained with a different seed, these methods struggle to attribute the fakes. This seed dependency issue made it difficult to attribute deepfakes with existing methods. We proposed a generalized deepfake attribution network (GDA-Net) to attribute fake images to their respective GAN architectures, even if they are generated from a retrained version of the GAN architecture with a different seed (cross-seed) or from the fine-tuned version of the existing GAN model. Extensive experiments on cross-seed and fine-tuned data of GAN models show that our method is highly effective compared to existing methods. We have provided the source code to validate our results.

[CV-27] On the Role of Visual Grounding in VQA

Link: https://arxiv.org/abs/2406.18253
Authors: Daniel Reich,Tanja Schultz
Keywords: Visual Grounding, question-relevant image regions, infer answers based, image regions, VQA
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual Grounding (VG) in VQA refers to a model’s proclivity to infer answers based on question-relevant image regions. Conceptually, VG identifies as an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning without suffering obvious performance losses in standard benchmarks. To uncover the impact of SC learning, Out-of-Distribution (OOD) tests have been proposed that expose a lack of VG with low accuracy. These tests have since been at the center of VG research and served as basis for various investigations into VG’s impact on accuracy. However, the role of VG in VQA still remains not fully understood and has not yet been properly formalized. In this work, we seek to clarify VG’s role in VQA by formalizing it on a conceptual level. We propose a novel theoretical framework called “Visually Grounded Reasoning” (VGR) that uses the concepts of VG and Reasoning to describe VQA inference in ideal OOD testing. By consolidating fundamental insights into VG’s role in VQA, VGR helps to reveal rampant VG-related SC exploitation in OOD testing, which explains why the relationship between VG and OOD accuracy has been difficult to define. Finally, we propose an approach to create OOD tests that properly emphasize a requirement for VG, and show how to improve performance on them.

[CV-28] Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation

Link: https://arxiv.org/abs/2406.18249
Authors: Hamideh Kerdegari,Kyle Higgins,Dennis Veselkov,Ivan Laponogov,Inese Polaka,Miguel Coimbra,Junior Andrea Pescino,Marcis Leja,Mario Dinis-Ribeiro,Tania Fleitas Kanonnikoff,Kirill Veselkov
Keywords: managing upper gastrointestinal, global cancer mortality, medical diagnostics represents, artificial intelligence, upper gastrointestinal
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The integration of artificial intelligence (AI) in medical diagnostics represents a significant advancement in managing upper gastrointestinal (GI) cancer, a major cause of global cancer mortality. Specifically for gastric cancer (GC), chronic inflammation causes changes in the mucosa such as atrophy, intestinal metaplasia (IM), dysplasia and ultimately cancer. Early detection through endoscopic regular surveillance is essential for better outcomes. Foundation models (FM), which are machine or deep learning models trained on diverse data and applicable to broad use cases, offer a promising solution to enhance the accuracy of endoscopy and its subsequent pathology image analysis. This review explores the recent advancements, applications, and challenges associated with FM in endoscopy and pathology imaging. We started by elucidating the core principles and architectures underlying these models, including their training methodologies and the pivotal role of large-scale data in developing their predictive capabilities. Moreover, this work discusses emerging trends and future research directions, emphasizing the integration of multimodal data, the development of more robust and equitable models, and the potential for real-time diagnostic support. This review aims to provide a roadmap for researchers and practitioners in navigating the complexities of incorporating FM into clinical practice for prevention/management of GC cases, thereby improving patient outcomes.

[CV-29] ConStyle v2: A Strong Prompter for All-in-One Image Restoration

Link: https://arxiv.org/abs/2406.18242
Authors: Dongqi Fan,Junhao Zhang,Liang Chang
Keywords: Image Restoration models, Image Restoration, output clean visual, clean visual prompts, Image Restoration framework
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, pre-training ConStyle alone, and then freezing its weights to guide the training of the general restoration network. Three improvements are proposed in the pre-training stage to train ConStyle: unsupervised pre-training, adding a pretext task (i.e. classification), and adopting knowledge distillation. Without bells and whistles, we can get ConStyle v2, a strong prompter for all-in-one Image Restoration, in less than two GPU days and doesn’t require any fine-tuning. Extensive experiments on Restormer (transformer-based), NAFNet (CNN-based), MAXIM-1S (MLP-based), and a vanilla CNN network demonstrate that ConStyle v2 can enhance any U-Net style Image Restoration models to all-in-one Image Restoration models. Furthermore, models guided by the well-trained ConStyle v2 exhibit superior performance in some specific degradation compared to ConStyle.

[CV-30] CoDA: Interactive Segmentation and Morphological Analysis of Dendroid Structures Exemplified on Stony Cold-Water Corals

Link: https://arxiv.org/abs/2406.18236
Authors: Kira Schmitt,Jürgen Titschack,Daniel Baum
Keywords: visual analytics suite, Dendroid structure Analyzer, important framework-forming dendroid, framework-forming dendroid cold-water, Lophelia pertusa
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Herein, we present CoDA, the Coral Dendroid structure Analyzer, a visual analytics suite that allows for the first time to investigate the ontogenetic morphological development of complex dendroid coral colonies, exemplified on three important framework-forming dendroid cold-water corals: Lophelia pertusa (Linnaeus, 1758), Madrepora oculata (Linnaeus, 1758), and Goniocorella dumosa (Alcock, 1902). Input to CoDA is an initial instance segmentation of the coral polyp cavities (calices), from which it estimates the skeleton tree of the colony and extracts classical morphological measurements and advanced shape features of the individual corallites. CoDA also works as a proofreading and error correction tool by helping to identify wrong parts in the skeleton tree and providing tools to quickly correct these errors. The final skeleton tree enables the derivation of additional information about the calices/corallite instances that otherwise could not be obtained, including their ontogenetic generation and branching patterns - the basis of a fully quantitative statistical analysis of the coral colony morphology. Part of CoDA is CoDAGraph, a feature-rich link-and-brush user interface for visualizing the extracted features and 2D graph layouts of the skeleton tree, enabling the real-time exploration of complex coral colonies and their building blocks, the individual corallites and branches. In the future, we expect CoDA to greatly facilitate the analysis of large stony corals of different species and morphotypes, as well as other dendroid structures, enabling new insights into the influence of genetic and environmental factors on their ontogenetic morphological development.

[CV-31] GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

Link: https://arxiv.org/abs/2406.18227
Authors: Jiafeng Liang,Shixin Jiang,Zekun Wang,Haojie Pan,Zerui Chen,Zheng Chu,Ming Liu,Ruiji Fu,Zhongyuan Wang,Bing Qin
Keywords: specific steps, Internet, specific, steps, instructional
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: IJCAI 2024

Abstract:There are substantial instructional videos on the Internet, which provide us tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guide of guideline. We evaluate plenty of foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can be used as a better benchmark for instructional video comprehension.

[CV-32] Guiding Video Prediction with Explicit Procedural Knowledge

Link: https://arxiv.org/abs/2406.18220
Authors: Patrick Takenaka,Johannes Maucher,Marco F. Huber
Keywords: deep learning models, integrate procedural knowledge, propose a general, video prediction, deep learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Abstract:We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models and show that this leads to a better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance to a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.

[CV-33] Unlocking the Potential of Operations Research for Multi-Graph Matching

Link: https://arxiv.org/abs/2406.18215
Authors: Max Kahl,Sebastian Stricker,Lisa Hutschenreiter,Florian Bernard,Bogdan Savchynskyy
Keywords: multiple finite sets, incomplete multi-graph matching, matching multiple finite, multi-graph matching, quadratic assignment problem
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We consider the incomplete multi-graph matching problem, which is a generalization of the NP-hard quadratic assignment problem for matching multiple finite sets. Multi-graph matching plays a central role in computer vision, e.g., for matching images or shapes, so that a number of dedicated optimization techniques have been proposed. While the closely related NP-hard multi-dimensional assignment problem (MDAP) has been studied for decades in the operations research community, it only considers complete matchings and has a different cost structure. We bridge this gap and transfer well-known approximation algorithms for the MDAP to incomplete multi-graph matching. To this end, we revisit respective algorithms, adapt them to incomplete multi-graph matching, and propose their extended and parallelized versions. Our experimental validation shows that our new method substantially outperforms the previous state of the art in terms of objective and runtime. Our algorithm matches, for example, 29 images with more than 500 keypoints each in less than two minutes, whereas the fastest considered competitor requires at least half an hour while producing far worse results.

[CV-34] Trimming the Fat: Efficient Compression of 3D Gaussian Splats through Pruning

Link: https://arxiv.org/abs/2406.18214
Authors: Muhammad Salman Ali,Maryam Qamar,Sung-Ho Bae,Enzo Tartaglione
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, training initially offered, offered by Neural
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In recent times, the utilization of 3D models has gained traction, owing to the capacity for end-to-end training initially offered by Neural Radiance Fields and more recently by 3D Gaussian Splatting (3DGS) models. The latter holds a significant advantage by inherently easing rapid convergence during training and offering extensive editability. However, despite rapid advancements, the literature still lives in its infancy regarding the scalability of these models. In this study, we take some initial steps in addressing this gap, showing an approach that enables both the memory and computational scalability of such models. Specifically, we propose “Trimming the fat”, a post-hoc gradient-informed iterative pruning technique to eliminate redundant information encoded in the model. Our experimental findings on widely acknowledged benchmarks attest to the effectiveness of our approach, revealing that up to 75% of the Gaussians can be removed while maintaining or even improving upon baseline performance. Our approach achieves around 50× compression while preserving performance similar to the baseline model, and is able to speed up computation up to 600 FPS.
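
As a rough illustration of what "iterative pruning" means in general (not the authors' implementation), the sketch below repeatedly scores a set of elements and drops the lowest-scoring fraction. The names `iterative_prune` and `score_fn`, the per-round `fraction`, and the number of `rounds` are all hypothetical; in the paper the scores are gradient-informed, whereas here `score_fn` is a placeholder that the caller supplies.

```python
def iterative_prune(items, score_fn, fraction=0.5, rounds=2):
    """Repeatedly drop the lowest-scoring `fraction` of the remaining items.

    `score_fn` is a hypothetical callback that returns one importance score
    per remaining item; it is re-evaluated every round, so scores may change
    as the set shrinks (for example after a fine-tuning step).
    """
    kept = list(items)
    for _ in range(rounds):
        scores = score_fn(kept)
        n_drop = int(len(kept) * fraction)
        if n_drop == 0:
            break
        # Indices of the n_drop least important items in this round.
        order = sorted(range(len(kept)), key=lambda i: scores[i])
        drop = set(order[:n_drop])
        kept = [item for i, item in enumerate(kept) if i not in drop]
    return kept


# Toy usage: the "importance" of each element is simply its value.
survivors = iterative_prune(list(range(16)), lambda xs: [float(x) for x in xs])
print(survivors)
```

With a per-round fraction of 0.5 and two rounds, three quarters of the toy elements are removed; that schedule is chosen only for illustration.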

[CV-35] GS-Octree: Octree-based 3D Gaussian Splatting for Robust Object-level 3D Reconstruction Under Strong Lighting

Link: https://arxiv.org/abs/2406.18199
Authors: Jiaze Li,Zhengyu Wen,Luo Zhang,Jiangbei Hu,Fei Hou,Zhebin Zhang,Ying He
Keywords: enabling real-time rendering, Gaussian Splatting technique, enabling real-time, technique has significantly, significantly advanced
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The 3D Gaussian Splatting technique has significantly advanced the construction of radiance fields from multi-view images, enabling real-time rendering. While point-based rasterization effectively reduces computational demands for rendering, it often struggles to accurately reconstruct the geometry of the target object, especially under strong lighting. To address this challenge, we introduce a novel approach that combines octree-based implicit surface representations with Gaussian splatting. Our method consists of four stages. Initially, it reconstructs a signed distance field (SDF) and a radiance field through volume rendering, encoding them in a low-resolution octree. The initial SDF represents the coarse geometry of the target object. Subsequently, it introduces 3D Gaussians as additional degrees of freedom, which are guided by the SDF. In the third stage, the optimized Gaussians further improve the accuracy of the SDF, allowing it to recover finer geometric details compared to the initial SDF obtained in the first stage. Finally, it adopts the refined SDF to further optimize the 3D Gaussians via splatting, eliminating those that contribute little to visual appearance. Experimental results show that our method, which leverages the distribution of 3D Gaussians with SDFs, reconstructs more accurate geometry, particularly in images with specular highlights caused by strong lighting.

[CV-36] VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Link: https://arxiv.org/abs/2406.18198
Authors: Hao Li,Jingfeng Li,Dingwen Zhang,Chenming Wu,Jieqi Shi,Chen Zhao,Haocheng Feng,Errui Ding,Jingdong Wang,Junwei Han
Keywords: Dynamic Gaussian splatting, impressive scene reconstruction, dynamic Gaussian method, Dynamic Gaussian, Gaussian splatting
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views. Existing methods, however, heavily rely on pre-computed poses and Gaussian initialization by Structure from Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised VO into our pose-free dynamic Gaussian method (VDG) to boost pose and depth initialization and static-dynamic decomposition. Moreover, VDG can work with only RGB image input and construct dynamic scenes at a faster speed and larger scenes compared with the pose-free dynamic view-synthesis method. We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over the state-of-the-art dynamic view synthesis methods. Additional video and source code will be posted on our project page at this https URL.

[CV-37] Human-free Prompted Based Anomaly Detection: prompt optimization with Meta-guiding prompt scheme

Link: https://arxiv.org/abs/2406.18197
Authors: Pi-Wei Chen,Jerry Chun-Wei Lin,Jia Ji,Feng-Hao Yeh,Chao-Chun Chen
Keywords: Pre-trained vision-language models, Pre-trained vision-language, prompt-based anomaly detection, making prompt-based anomaly, vision-language models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning, making prompt-based anomaly detection a promising approach. Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types. Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods, eliminating the need for human intervention. The primary challenge in this approach is the lack of anomalous samples during the training phase. Additionally, the Vision Transformer (ViT)-based image encoder in VLMs is not ideal for pixel-wise anomaly segmentation due to a locality feature mismatch between the original image and the output feature map. To tackle the first challenge, we have developed the Object-Attention Anomaly Generation Module (OAGM) to synthesize anomaly samples for training. Furthermore, our Meta-Guiding Prompt-Tuning Scheme (MPTS) iteratively adjusts the gradient-based optimization direction of learnable prompts to avoid overfitting to the synthesized anomalies. For the second challenge, we propose Locality-Aware Attention, which ensures that each local patch feature attends only to nearby patch features, preserving the locality features corresponding to their original locations. This framework allows for the optimal prompt embeddings by searching in the continuous latent space via backpropagation, free from human semantic constraints. Additionally, the modified locality-aware attention improves the precision of pixel-wise anomaly segmentation.
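
To make the idea of attention restricted to nearby patches concrete, here is a minimal sketch, assuming a square neighbourhood on the patch grid and plain scaled dot-product attention. It is not the paper's Locality-Aware Attention: the function names, the `radius` parameter, and the Chebyshev-distance neighbourhood are assumptions made purely for illustration.

```python
import math


def locality_mask(grid_h, grid_w, radius=1):
    """mask[i][j] is True when patch j lies within `radius` grid cells of patch i."""
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    return [
        [max(abs(r1 - r2), abs(c1 - c2)) <= radius for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def local_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out patches receive zero weight."""
    dim = len(q[0])
    outputs, all_weights = [], []
    for i, qi in enumerate(q):
        # Score only the allowed neighbours; None marks a masked position.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) if mask[i][j] else None
            for j, kj in enumerate(k)
        ]
        top = max(s for s in scores if s is not None)  # for numerical stability
        exps = [0.0 if s is None else math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Toy usage on a 3x3 patch grid with 2-dimensional features.
mask = locality_mask(3, 3, radius=1)
feats = [[float(i), 1.0] for i in range(9)]
out, weights = local_attention(feats, feats, feats, mask)
```

Every patch is always its own neighbour, so each row of the mask has at least one allowed position and the softmax is well defined.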

[CV-38] MammothModa: Multi-Modal Large Language Model

Link: https://arxiv.org/abs/2406.18193
Authors: Qi She,Junwen Pan,Xin Wan,Rui Zhang,Dawei Lu,Kai Huang
Keywords: multi-modal large language, Complex Language Understanding, Maintaining Complex Language, Visual Attention Experts, Integrating Visual Capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report

Abstract:In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporated the Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending Context Window for High-Resolution and Long-Duration Visual Feature: We explore the Visual Merger Module to effectively reduce the token number of high-resolution images and incorporated frame position ids to avoid position interpolation. (iii) High-Quality Bilingual Datasets: We meticulously curated and filtered a high-quality bilingual multimodal dataset to reduce visual hallucinations. With above recipe we build MammothModa that consistently outperforms the state-of-the-art models, e.g., LLaVA-series, across main real-world visual language benchmarks without bells and whistles.

[CV-39] VIPriors 4: Visual Inductive Priors for Data-Efficient Deep Learning Challenges

Link: https://arxiv.org/abs/2406.18176
Authors: Robert-Jan Bruintjes,Attila Lengyel,Marcos Baptista Rios,Osman Semih Kayhan,Davide Zambrano,Nergis Tomen,Jan van Gemert
Keywords: Visual Inductive Priors, Data-Efficient Deep Learning, deep learning models, Priors for Data-Efficient, Deep Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The fourth edition of the “VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning” workshop features two data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate inductive biases to improve the data efficiency of deep learning models. Significant advancements are made compared to the provided baselines, where winning solutions surpass the baselines by a considerable margin in both tasks. As in previous editions, these achievements are primarily attributed to heavy use of data augmentation policies and large model ensembles, though novel prior-based methods seem to contribute more to successful solutions compared to last year. This report highlights the key aspects of the challenges and their outcomes.

[CV-40] Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models

Link: https://arxiv.org/abs/2406.18159
Authors: Xiaolin Hong,Hongwei Yi,Fazhi He,Qiong Cao
Keywords: supports numerous applications, including virtual reality, sequences supports numerous, motion sequences supports, numerous applications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Generating 3D scenes from human motion sequences supports numerous applications, including virtual reality and architectural design. However, previous auto-regression-based human-aware 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans, often resulting in overlapping object generation in the same space. To address this limitation, we explore the potential of diffusion models that simultaneously consider all input humans and the floor plan to generate plausible 3D scenes. Our approach not only satisfies all input human interactions but also adheres to spatial constraints with the floor plan. Furthermore, we introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. These mechanisms help avoid generating scenes that conflict with human motions while respecting layout constraints. To enhance the diversity and accuracy of human-guided scene generation, we have developed an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D FRONT HUMAN dataset. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework can generate more natural and plausible 3D scenes with precise human-scene interactions, while significantly reducing human-object collisions compared to previous state-of-the-art methods. Our code and data will be made publicly available upon publication of this work.

[CV-41] 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation

Link: https://arxiv.org/abs/2406.18158
Authors: Shengyi Qian,Kaichun Mo,Valts Blukis,David F. Fouhey,Dieter Fox,Ankit Goyal
Keywords: Recent works, MAE, masked autoencoders, Recent, downstream robotics tasks
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT’s multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. We also show promising results on a real robot platform with minimal finetuning. Our results suggest that 3D-aware pretraining is a promising approach to improve sample efficiency and generalization of vision-based robotic manipulation policies. We will release code and pretrained models for 3D-MVP to facilitate future research. Project site: this https URL

[CV-42] SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery

Link: https://arxiv.org/abs/2406.18151
Authors: Jian Song,Hongruixuan Chen,Weihao Xuan,Junshi Xia,Naoto Yokoya
Keywords: Earth Observation, crucial for Earth, high-resolution remote sensing, single-view high-resolution remote, remote sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Global semantic 3D understanding from single-view high-resolution remote sensing (RS) imagery is crucial for Earth Observation (EO). However, this task faces significant challenges due to the high costs of annotations and data collection, as well as geographically restricted data availability. To address these challenges, synthetic data offer a promising solution by being easily accessible and thus enabling the provision of large and diverse datasets. We develop a specialized synthetic data generation pipeline for EO and introduce SynRS3D, the largest synthetic RS 3D dataset. SynRS3D comprises 69,667 high-resolution optical images that cover six different city styles worldwide and feature eight land cover types, precise height information, and building change masks. To further enhance its utility, we develop a novel multi-task unsupervised domain adaptation (UDA) method, RS3DAda, coupled with our synthetic dataset, which facilitates the RS-specific transition from synthetic to real scenarios for land cover mapping and height estimation tasks, ultimately enabling global monocular 3D semantic understanding based on synthetic data. Extensive experiments on various real-world datasets demonstrate the adaptability and effectiveness of our synthetic dataset and proposed RS3DAda method. SynRS3D and related codes will be available.

[CV-43] A Refer-and-Ground Multimodal Large Language Model for Biomedicine

Link: https://arxiv.org/abs/2406.18146
Authors: Xiaoshuang Huang,Haifeng Huang,Lingdong Shen,Yehui Yang,Fangxin Shang,Junwei Liu,Jia Liu
Keywords: multimodal large language, refer and ground, increasingly recognized, visual chat, significance is increasingly
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI2024

Abstract:With the rapid development of multimodal large language models (MLLMs), especially their capabilities in visual chat through refer and ground functionalities, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dedicated to the biomedical domain and integrating refer and ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and generate instruction datasets from text using chatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning. Extensive experiments have corroborated the efficacy of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model. This holds significant reference value for the exploration and development of intelligent biomedical assistants.

[CV-44] Artificial Immune System of Secure Face Recognition Against Adversarial Attacks

Link: https://arxiv.org/abs/2406.18144
Authors: Min Ren, Yunlong Wang, Yuhao Zhu, Yongzhen Huang, Zhenan Sun, Qi Li, Tieniu Tan
Keywords: ensure food safety, Insect production, ensure food, food safety, promising supplement
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Insect production for food and feed presents a promising supplement to ensure food safety and address the adverse impacts of agriculture on climate and environment in the future. However, optimisation is required for insect production to realise its full potential. This can be by targeted improvement of traits of interest through selective breeding, an approach which has so far been underexplored and underutilised in insect farming. Here we present a comprehensive review of the selective breeding framework in the context of insect production. We systematically evaluate adjustments of selective breeding techniques to the realm of insects and highlight the essential components integral to the breeding process. The discussion covers every step of a conventional breeding scheme, such as formulation of breeding objectives, phenotyping, estimation of genetic parameters and breeding values, selection of appropriate breeding strategies, and mitigation of issues associated with genetic diversity depletion and inbreeding. This review combines knowledge from diverse disciplines, bridging the gap between animal breeding, quantitative genetics, evolutionary biology, and entomology, offering an integrated view of the insect breeding research area and uniting knowledge which has previously remained scattered across diverse fields of expertise.

[CV-45] Exclusive Style Removal for Cross Domain Novel Class Discovery

Link: https://arxiv.org/abs/2406.18140
Authors: Yicheng Wang, Feng Liu, Junmin Liu, Zhen Fang, Kai Sun
Keywords: Class Discovery, NCD methods, open-world learning, NCD, unlabeled set based
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:As a promising field in open-world learning, Novel Class Discovery (NCD) is usually a task to cluster unseen novel classes in an unlabeled set based on the prior knowledge of labeled data within the same domain. However, the performance of existing NCD methods could be severely compromised when novel classes are sampled from a different distribution from the labeled ones. In this paper, we explore and establish the solvability of NCD in cross domain setting with the necessary condition that style information must be removed. Based on the theoretical analysis, we introduce an exclusive style removal module for extracting style information that is distinctive from the baseline features, thereby facilitating inference. Moreover, this module is easy to integrate with other NCD methods, acting as a plug-in to improve performance on novel classes with different distributions compared to the seen labeled set. Additionally, recognizing the non-negligible influence of different backbones and pre-training strategies on the performance of the NCD methods, we build a fair benchmark for future NCD research. Extensive experiments on three common datasets demonstrate the effectiveness of our proposed module.

[CV-46] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Link: https://arxiv.org/abs/2406.18139
Authors: Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan
Keywords: Large Language Models, Multimodal Large Language, Large Language, demand substantial computational, increasing input lengths
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs’ KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.
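
The text-prior idea in this abstract can be illustrated with a toy eviction rule. This is only a sketch under stated assumptions, not the authors' method: the function name `compress_kv_cache`, the per-token `score` field (standing in for accumulated attention) and the fixed `budget` are all hypothetical, and the KV-pair merging step mentioned in the abstract is not modelled.

```python
def compress_kv_cache(tokens, budget):
    """Return the indices of the cache entries to keep.

    Assumed policy (hypothetical): every text token is kept, and the
    remaining budget is filled with the highest-scoring image tokens.
    If the text tokens alone exceed the budget, all of them are still kept.
    """
    text_idx = [i for i, t in enumerate(tokens) if t["modality"] == "text"]
    image_idx = [i for i, t in enumerate(tokens) if t["modality"] == "image"]
    remaining = max(0, budget - len(text_idx))
    # Rank image tokens by their (assumed) attention score, best first.
    top_images = sorted(image_idx, key=lambda i: tokens[i]["score"], reverse=True)[:remaining]
    # Restore the original token order for the kept entries.
    return sorted(text_idx + top_images)


# Hypothetical cache with four image tokens and two text tokens.
cache = [
    {"modality": "image", "score": 0.1},
    {"modality": "image", "score": 0.7},
    {"modality": "text", "score": 0.9},
    {"modality": "image", "score": 0.3},
    {"modality": "text", "score": 0.8},
    {"modality": "image", "score": 0.6},
]
kept = compress_kv_cache(cache, budget=4)
```

With a budget of four entries, both text tokens survive and only the two image tokens with the highest scores are retained.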

[CV-47] CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection

Link: https://arxiv.org/abs/2406.18129
Authors: Meiying Zhang, Weiyuan Peng, Guangyao Ding, Chenyang Lei, Chunlin Ji, Qi Hao
Keywords: including object detection, object detection, object detection algorithms, expected to improve, cross-domain object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been developed to address cross-domain tasks between real-world datasets, progress in sim-to-real remains limited. This paper presents a novel Complex-to-Simple (CTS) framework to transfer models from labeled simulation (source) to unlabeled reality (target) domains. Based on a two-stage detector, the novelty of this work is threefold: 1) developing fixed-size anchor heads and RoI augmentation to address size bias and feature diversity between two domains, thereby improving the quality of pseudo-labels; 2) developing a novel corner-format representation of aleatoric uncertainty (AU) for the bounding box, to uniformly quantify pseudo-label quality; 3) developing a noise-aware mean teacher domain adaptation method based on AU, as well as object-level and frame-level sampling strategies, to mitigate the impact of noisy labels. Experimental results demonstrate that our proposed approach significantly enhances the sim-to-real domain adaptation capability of 3D object detection models, outperforming state-of-the-art cross-domain algorithms, which are usually developed for real-to-real UDA tasks.
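
For readers unfamiliar with mean teacher training, the sketch below shows the generic exponential-moving-average (EMA) teacher update, plus one simple way an uncertainty score could gate pseudo-labels. Both functions are illustrative assumptions: the names `ema_update` and `select_pseudo_labels`, the momentum value and the hard threshold are hypothetical and are not taken from the paper, which describes its own AU-based sampling strategies.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter a small step towards the student parameter."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def select_pseudo_labels(boxes, max_uncertainty):
    """Keep only pseudo-labelled boxes whose uncertainty is low enough.

    Hard thresholding is an assumption made for illustration only.
    """
    return [box for box in boxes if box["uncertainty"] <= max_uncertainty]


# Hypothetical scalar "parameters" and teacher predictions.
teacher_params = {"w": 1.0, "b": 0.0}
student_params = {"w": 0.0, "b": 1.0}
teacher_params = ema_update(teacher_params, student_params, momentum=0.9)

predictions = [
    {"label": "car", "uncertainty": 0.05},
    {"label": "car", "uncertainty": 0.60},
    {"label": "pedestrian", "uncertainty": 0.20},
]
reliable = select_pseudo_labels(predictions, max_uncertainty=0.25)
```

The teacher changes slowly, which tends to make its pseudo-labels more stable than the student's own predictions.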

[CV-48] Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps

Link: https://arxiv.org/abs/2406.18115
Authors: Dicong Qiu, Wenzong Ma, Zhenfu Pan, Hui Xiong, Junwei Liang
Keywords: Open-Vocabulary Mobile Manipulation, Mobile Manipulation, crucial capability, capability for autonomous, posed by unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Open-vocabulary, Mobile Manipulation, Dynamic Environments, 3D Semantic Maps, Zero-shot, LLMs, VLMs, 18 pages, 2 figures

Abstract:Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots, especially when faced with the challenges posed by unknown and dynamic environments. This task requires robots to explore and build a semantic understanding of their surroundings, generate feasible plans to achieve manipulation goals, adapt to environmental changes, and comprehend natural language instructions from humans. To address these challenges, we propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pretraining visual-language models (VLMs) combined with dense 3D entity reconstruction to build 3D semantic maps. Additionally, we utilize large language models (LLMs) for spatial region abstraction and online planning, incorporating human instructions and spatial semantic context. We have built a 10-DoF mobile manipulation robotic platform JSR-1 and demonstrated in real-world robot experiments that our proposed framework can effectively capture spatial semantics and process natural language user instructions for zero-shot OVMM tasks under dynamic environment settings, with an overall navigation and task success rate of 80.95% and 73.33% over 105 episodes, and better SFT and SPL by 157.18% and 19.53% respectively compared to the baseline. Furthermore, the framework is capable of replanning towards the next most probable candidate location based on the spatial semantic context derived from the 3D semantic map when initial plans fail, keeping an average success rate of 76.67%.

[CV-49] The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

Link: https://arxiv.org/abs/2406.18113
Authors: Meinardus Boris, Batra Anil, Rohrbach Anna, Rohrbach Marcus
Keywords: shown promising results, Recent studies, computer vision tasks, utilizing multimodal large, multimodal large language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 3 figures

Abstract:Recent studies have shown promising results in utilizing multimodal large language models (MLLMs) for computer vision tasks such as object detection and semantic segmentation. However, many challenging video tasks remain under-explored. Video-language tasks necessitate spatial and temporal comprehension and require significant compute. Therefore, prior works have developed complex, highly specialized architectures or leveraged additional input signals such as video transcripts to best encode contextual and temporal information, which limits their generality and can be impractical. One particularly challenging task is video moment retrieval, which requires precise temporal and contextual grounding. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We introduce Mr. BLIP (Mr. as in Moment Retrieval), a multimodal, single-stage model that requires no expensive video-language pretraining, no additional input signal (e.g., no transcript or audio), and has a simpler and more versatile design than prior state-of-the-art methods. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions and illustrate our method’s versatility with a new state-of-the-art in temporal action localization on ActivityNet. Notably, we attain over 9% (absolute) higher Recall (at 0.5 and 0.7 IoU) on the challenging long-video multi-moment QVHighlights benchmark. Our code is publicly available.
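
The Recall figures at 0.5 and 0.7 IoU quoted above rely on temporal intersection-over-union between a predicted and a ground-truth time window. The sketch below is a simplified single-moment version with made-up numbers; the function names are hypothetical, and the official benchmark protocols (in particular multi-moment handling on QVHighlights) may differ.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical top-1 predictions and ground-truth moments for three queries.
preds = [(2.0, 8.0), (10.0, 20.0), (0.0, 5.0)]
gts = [(3.0, 9.0), (12.0, 18.0), (6.0, 9.0)]

r_at_05 = recall_at_iou(preds, gts, 0.5)  # two of three queries pass
r_at_07 = recall_at_iou(preds, gts, 0.7)  # one of three queries passes
```

Here the three IoU values are 5/7, 0.6 and 0, so raising the threshold from 0.5 to 0.7 drops the recall from 2/3 to 1/3.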

[CV-50] MFDNet: Multi-Frequency Deflare Network for Efficient Nighttime Flare Removal

Link: https://arxiv.org/abs/2406.18079
Authors: Yiguo Jiang, Xuhang Chen, Chi-Man Pun, Shuqiang Wang, Wei Feng
Keywords: photos’ visual quality, captured photos, affecting the photos’, visual quality, light is scattered
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by The Visual Computer journal

Abstract:When light is scattered or reflected accidentally in the lens, flare artifacts may appear in the captured photos, affecting the photos’ visual quality. The main challenge in flare removal is to eliminate various flare artifacts while preserving the original content of the image. To address this challenge, we propose a lightweight Multi-Frequency Deflare Network (MFDNet) based on the Laplacian Pyramid. Our network decomposes the flare-corrupted image into low and high-frequency bands, effectively separating the illumination and content information in the image. The low-frequency part typically contains illumination information, while the high-frequency part contains detailed content information. So our MFDNet consists of two main modules: the Low-Frequency Flare Perception Module (LFFPM) to remove flare in the low-frequency part and the Hierarchical Fusion Reconstruction Module (HFRM) to reconstruct the flare-free image. Specifically, to perceive flare from a global perspective while retaining detailed information for image restoration, LFFPM utilizes Transformer to extract global information while utilizing a convolutional neural network to capture detailed local features. Then HFRM gradually fuses the outputs of LFFPM with the high-frequency component of the image through feature aggregation. Moreover, our MFDNet can reduce the computational cost by processing in multiple frequency bands instead of directly removing the flare on the input image. Experimental results demonstrate that our approach outperforms state-of-the-art methods in removing nighttime flare on real-world and synthetic images from the Flare7K dataset. Furthermore, the computational complexity of our model is remarkably low.
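
The low/high-frequency split described above can be sketched with a toy Laplacian-style pyramid. To stay dependency-free, this example works on a single 1D row of pixel values and uses pair averaging in place of the usual Gaussian blur; both simplifications, and all function names, are assumptions made for illustration and do not reflect the MFDNet implementation.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs (a crude low-pass filter)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def upsample(signal):
    """Double the length by repeating each sample."""
    return [value for value in signal for _ in range(2)]


def build_laplacian_pyramid(signal, levels):
    """Return (high_bands, low_band); len(signal) must be divisible by 2**levels."""
    high_bands = []
    current = list(signal)
    for _ in range(levels):
        low = downsample(current)
        # The high-frequency band is whatever the coarser level cannot represent.
        high_bands.append([c - u for c, u in zip(current, upsample(low))])
        current = low
    return high_bands, current


def reconstruct(high_bands, low_band):
    """Invert the decomposition by adding the bands back, coarse to fine."""
    current = list(low_band)
    for high in reversed(high_bands):
        current = [u + h for u, h in zip(upsample(current), high)]
    return current


row = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 9.0, 7.0]
highs, low = build_laplacian_pyramid(row, levels=2)
restored = reconstruct(highs, low)
```

The coarse band is a quarter of the original length, which is why operating on it is cheaper than processing the full-resolution input, and the decomposition is lossless because the high bands store exactly what was removed.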

[CV-51] Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

Link: https://arxiv.org/abs/2406.18074
Authors: Song Tang, Shaxu Yan, Xiaozhi Qi, Jianxin Gao, Mao Ye, Jianwei Zhang, Xiatian Zhu
Keywords: Few-shot Semantic Segmentation, single labelled training, labelled training sample, aims to adapt, sample per class
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and not highly complex backgrounds, e.g., natural images. This makes such models suboptimal for medical imaging, where neither condition holds. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.

[CV-52] EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Link: https://arxiv.org/abs/2406.18070
Authors: Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao
Keywords: Long-term Action Anticipation, Natural Language Queries, Object Interaction Anticipation, present our solutions, Short-term Object Interaction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Champion solutions in the EgoVis CVPR 2024 workshop

Abstract:In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at this https URL.

[CV-53] Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

Link: https://arxiv.org/abs/2406.18068
Authors: Uttaran Bhattacharya, Aniket Bera, Dinesh Manocha
Keywords: RGB video data, video data captured, RGB video, face landmarks, commodity cameras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, 2 tables

Abstract:We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters using RGB video data captured using commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a token sequence of the speaker’s face landmark motion and body-joint motion computed from a video, our method synthesizes the motion sequences for the speaker’s face landmarks and body joints to match the content and the affect of the speech. We design a generator consisting of a set of encoders to transform all the inputs into a multimodal embedding space capturing their correlations, followed by a pair of decoders to synthesize the desired face and pose motions. To enhance the plausibility of synthesis, we use an adversarial discriminator that learns to differentiate between the face and pose motions computed from the original videos and our synthesized motions based on their affective expressions. To evaluate our approach, we extend the TED Gesture Dataset to include view-normalized, co-speech face landmarks in addition to body gestures. We demonstrate the performance of our method through thorough quantitative and qualitative experiments on multiple evaluation metrics and via a user study. We observe that our method results in low reconstruction error and produces synthesized samples with diverse facial expressions and body gestures for digital characters.

[CV-54] ViT-1.58b: Mobile Vision Transformers in the 1-bit Era

Link: https://arxiv.org/abs/2406.18051
Authors: Zhengqing Yuan, Rong Zhou, Hongyi Wang, Lifang He, Yanfang Ye, Lichao Sun
Keywords: Vision Transformers, image classification tasks, process image patches, achieved remarkable performance, image classification
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Transformers (ViTs) have achieved remarkable performance in various image classification tasks by leveraging the attention mechanism to process image patches as tokens. However, the high computational and memory demands of ViTs pose significant challenges for deployment in resource-constrained environments. This paper introduces ViT-1.58b, a novel 1.58-bit quantized ViT model designed to drastically reduce memory and computational overhead while preserving competitive performance. ViT-1.58b employs ternary quantization, which refines the balance between efficiency and accuracy by constraining weights to {-1, 0, 1} and quantizing activations to 8-bit precision. Our approach ensures efficient scaling in terms of both memory and computation. Experiments on CIFAR-10 and ImageNet-1k demonstrate that ViT-1.58b maintains comparable accuracy to full-precision ViT, with significant reductions in memory usage and computational costs. This paper highlights the potential of extreme quantization techniques in developing sustainable AI solutions and contributes to the broader discourse on efficient model deployment in practical applications. Our code and weights are available at this https URL.
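
A minimal sketch of what ternary weights and 8-bit activations look like in practice. The abstract does not say how the scales are chosen, so the mean-absolute-value scale for weights and the max-absolute-value scale for activations are assumptions, as are the function names. The "1.58-bit" figure corresponds to log2(3), the information content of a three-valued weight.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} plus one float scale (absmean scale is assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_activations_int8(activations, eps=1e-8):
    """Symmetric per-tensor 8-bit quantization (absmax scale is assumed)."""
    scale = max(abs(a) for a in activations) / 127.0 + eps
    quantized = [max(-128, min(127, round(a / scale))) for a in activations]
    return quantized, scale


# Hypothetical weights and activations.
weights = [0.9, -0.05, -1.2, 0.3, 0.0, 2.0]
q_w, w_scale = ternary_quantize(weights)

activations = [0.5, -1.0, 0.25]
q_a, a_scale = quantize_activations_int8(activations)

# Three possible weight values carry log2(3) bits of information each.
bits_per_weight = math.log2(3)
```

Small weights collapse to 0 and large ones saturate at -1 or 1, so a matrix multiply against `q_w` needs only additions and subtractions before the single rescale by `w_scale`.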

[CV-55] A Multi-Stage Goal-Driven Network for Pedestrian Trajectory Prediction

Link: https://arxiv.org/abs/2406.18050
Authors: Xiuen Wu, Tao Wang, Yuanzheng Cai, Lingyu Liang, George Papageorgiou
Keywords: traffic management systems, including autonomous vehicles, Pedestrian trajectory prediction, trajectory prediction plays, Pedestrian trajectory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted by 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL 2024)

Abstract:Pedestrian trajectory prediction plays a pivotal role in ensuring the safety and efficiency of various applications, including autonomous vehicles and traffic management systems. This paper proposes a novel method for pedestrian trajectory prediction, called multi-stage goal-driven network (MGNet). Diverging from prior approaches relying on stepwise recursive prediction and the singular forecasting of a long-term goal, MGNet directs trajectory generation by forecasting intermediate stage goals, thereby reducing prediction errors. The network comprises three main components: a conditional variational autoencoder (CVAE), an attention module, and a multi-stage goal evaluator. Trajectories are encoded using conditional variational autoencoders to acquire knowledge about the approximate distribution of pedestrians’ future trajectories, and combined with an attention mechanism to capture the temporal dependency between trajectory sequences. The pivotal module is the multi-stage goal evaluator, which utilizes the encoded feature vectors to predict intermediate goals, effectively minimizing cumulative errors in the recursive inference process. The effectiveness of MGNet is demonstrated through comprehensive experiments on the JAAD and PIE datasets. Comparative evaluations against state-of-the-art algorithms reveal significant performance improvements achieved by our proposed method.

[CV-56] ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Link: https://arxiv.org/abs/2406.18048
Authors: Wei Su, Peihan Miao, Huanzhang Dou, Xi Li
Keywords: Referring Expression Comprehension, Referring Expression, Expression Comprehension, free-form natural language, natural language descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2024

Abstract:Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our designed informativeness prediction. Furthermore, we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which can strike a balance between accuracy and efficiency.

[CV-57] Multimodal foundation world models for generalist embodied agents

Link: https://arxiv.org/abs/2406.18043
Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar
Keywords: solve multitudes, Learning, models, foundation, tasks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Learning generalist embodied agents, able to solve multitudes of tasks in different domains, is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain’s dynamics, and learn the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.

[CV-58] Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation

Link: https://arxiv.org/abs/2406.18037
Authors: Dunyuan Xu, Xi Wang, Jingyang Zhang, Pheng-Ann Heng
Keywords: diagnosis problems due, solving practical medical, image diagnosis problems, ability to learn, deep network
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The ability to learn sequentially from different data sites is crucial for a deep network in solving practical medical image diagnosis problems due to privacy restrictions and storage limitations. However, adapting to an incoming site leads to catastrophic forgetting on past sites and decreases generalizability on unseen sites. Existing Continual Learning (CL) and Domain Generalization (DG) methods have been proposed to solve these two challenges respectively, but none of them can address both simultaneously. Recognizing this limitation, this paper proposes a novel training paradigm, learning towards Synchronous Memorizability and Generalizability (SMG-Learning). To achieve this, we create the orientational gradient alignment to ensure memorizability on previous sites, and arbitrary gradient alignment to enhance generalizability on unseen sites. This approach is named Parallel Gradient Alignment (PGA). Furthermore, we approximate the PGA as dual meta-objectives using the first-order Taylor expansion to reduce the computational cost of aligning gradients. Considering that performing gradient alignments, especially for previous sites, is not feasible due to the privacy constraints, we design a Site-Modulated Diffusion (SMD) model to generate images with site-specific learnable prompts, replaying images that have similar data distributions to previous sites. We evaluate our method on two medical image segmentation tasks, where data from different sites arrive sequentially. Experimental results show that our method efficiently enhances both memorizability and generalizability better than other state-of-the-art methods, delivering satisfactory performance across all sites. Our code will be available at: this https URL.

[CV-59] Real-time Structure Flow

Link: https://arxiv.org/abs/2406.18031
Authors: Juan David Adarve, Robert Mahony
Keywords: robo-centric motion information, highly dynamic robotic, dynamic robotic devices, structure flow, structure flow field
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This article introduces the structure flow field, a flow field that can provide high-speed robo-centric motion information for motion control of highly dynamic robotic devices and autonomous vehicles. Structure flow is the angular 3D velocity of the scene at a given pixel. We show that structure flow possesses an elegant evolution model in the form of a Partial Differential Equation (PDE) that enables us to create dense flow predictions forward in time. We exploit this structure to design a predictor-update algorithm to compute structure flow in real time using image and depth measurements. The prediction stage takes the previous estimate of the structure flow and propagates it forward in time using a numerical implementation of the structure flow PDE. The predicted flow is then updated using new image and depth data. The algorithm runs at up to 600 Hz on a desktop GPU machine for 512x512 images with flow values up to 8 pixels. We provide ground truth validation on high-speed synthetic image sequences as well as results on real-life video of driving scenarios.

[CV-60] View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis

Link: https://arxiv.org/abs/2406.18012
Authors: Subin Varghese, Vedhus Hoskere
Keywords: requires identifying visual, infrastructure assets typically, assets typically requires, typically requires identifying, scenes periodically photographed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The inspection and monitoring of infrastructure assets typically requires identifying visual anomalies in scenes periodically photographed over time. Images collected manually or with robots such as unmanned aerial vehicles from the same scene at different instances in time are typically not perfectly aligned. Supervised segmentation methods can be applied to identify known problems, but unsupervised anomaly detection approaches are required when unknown anomalies occur. Current unsupervised pixel-level anomaly detection methods have mainly been developed for industrial settings where the camera position is known and constant. However, we find that these methods fail to generalize to the case when images are not perfectly aligned. We term the problem of unsupervised anomaly detection between two such imperfectly aligned sets of images as Scene Anomaly Detection (Scene AD). We present a novel network termed OmniAD to address the Scene AD problem posed. Specifically, we refine the anomaly detection method reverse distillation to achieve a 40% increase in pixel-level anomaly detection performance. The network’s performance is further demonstrated to improve with two new data augmentation strategies proposed that leverage novel view synthesis and camera localization to improve generalization. We validate our approach with qualitative and quantitative results on a new dataset, ToyCity, the first Scene AD dataset with multiple objects, as well as on the established single object-centric dataset, MAD. this https URL

[CV-61] Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation

Link: https://arxiv.org/abs/2406.18011
Authors: Yijie Yang, Jinlu Zhang, Jiaxu Zhang, Zhigang Tu
Keywords: coarse body keypoints, body keypoints fall, keypoints fall short, capturing subtle human, subtle human actions
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In the realm of skeleton-based action recognition, the traditional methods which rely on coarse body keypoints fall short of capturing subtle human actions. In this work, we propose Expressive Keypoints that incorporate hand and foot details to form a fine-grained skeletal representation, improving the ability of existing models to discern intricate actions. To efficiently model Expressive Keypoints, the Skeleton Transformation strategy is presented to gradually downsample the keypoints and prioritize prominent joints by allocating importance weights. Additionally, a plug-and-play Instance Pooling module is exploited to extend our approach to multi-person scenarios without surging computation costs. Extensive experimental results over seven datasets demonstrate the superiority of our method compared to the state-of-the-art for skeleton-based human action recognition. Code is available at this https URL.

[CV-62] Changen2: Multi-Temporal Remote Sensing Generative Change Foundation Model

Link: https://arxiv.org/abs/2406.17998
Authors: Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, Yanfei Zhong
Keywords: Earth surface, deep vision models, change, temporal dynamics, advanced by deep
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The enhanced extension of our ICCV 2023 (Changen)

Abstract:Our understanding of the temporal dynamics of the Earth’s surface has been advanced by deep vision models, which often require lots of labeled multi-temporal images for training. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present change data generators based on generative models, which are cheap and automatic, alleviating these data problems. Our main idea is to simulate a stochastic change process over time. We describe the stochastic change process as a probabilistic graphical model (GPCM), which factorizes the complex simulation problem into two more tractable sub-problems, i.e., change event simulation and semantic change synthesis. To solve these two problems, we present Changen2, a GPCM with a resolution-scalable diffusion transformer which can generate time series of images and their semantic and change labels from labeled or unlabeled single-temporal images. Changen2 is a generative change foundation model that can be trained at scale via self-supervision, and can produce change supervisory signals from unlabeled single-temporal images. Unlike existing foundation models, Changen2 synthesizes change data to train task-specific foundation models for change detection. The resulting model possesses inherent zero-shot change detection capabilities and excellent transferability. Experiments suggest Changen2 has superior spatiotemporal scalability, e.g., a Changen2 model trained on 256^2 pixel single-temporal images can yield time series of any length and resolutions of 1,024^2 pixels. Changen2 pre-trained models exhibit superior zero-shot performance (narrowing the performance gap to 3% on LEVIR-CD and approximately 10% on both S2Looking and SECOND, compared to fully supervised counterparts) and transferability across multiple types of change tasks.

[CV-63] DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image

Link: https://arxiv.org/abs/2406.17988
Authors: Qingxuan Wu, Zhiyang Dou, Sirui Xu, Soshi Shimada, Chen Wang, Zhengming Yu, Yuan Liu, Cheng Lin, Zeyu Cao, Taku Komura, Vladislav Golyanik, Christian Theobalt, Wenping Wang, Lingjie Liu
Keywords: hand-face interaction recovery, hand-face interaction, Deformation-aware hand-face Interaction, single-view hand-face interactions, hand-face interaction data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 9 figures, 3 tables

Abstract:Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf, introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It features disentangling the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.

[CV-64] Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts

Link: https://arxiv.org/abs/2406.17974
Authors: Xuyang Wu, Yuan Wang, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang
Keywords: Large vision-language models, achieved significant progress, demonstrating strong capabilities, recently achieved significant, Large vision-language
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate visual fairness in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.

[CV-65] Highly Constrained Coded Aperture Imaging Systems Design Via a Knowledge Distillation Approach

Link: https://arxiv.org/abs/2406.17970
Authors: Leon Suarez-Rodriguez, Roman Jacome, Henry Arguello
Keywords: optical coding elements, Computational optical imaging, constrained COI systems, COI systems, physically constrained COI
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 7 pages, 3 figures. Accepted at ICIP 2024

Abstract:Computational optical imaging (COI) systems have enabled the acquisition of high-dimensional signals through optical coding elements (OCEs). OCEs encode the high-dimensional signal in one or more snapshots, which are subsequently decoded using computational algorithms. Currently, COI systems are optimized through an end-to-end (E2E) approach, where the OCEs are modeled as a layer of a neural network and the remaining layers perform a specific imaging task. However, the performance of COI systems optimized through E2E is limited by the physical constraints imposed by these systems. This paper proposes a knowledge distillation (KD) framework for the design of highly physically constrained COI systems. This approach employs the KD methodology, which consists of a teacher-student relationship, where a high-performance, unconstrained COI system (the teacher), guides the optimization of a physically constrained system (the student) characterized by a limited number of snapshots. We validate the proposed approach, using a binary coded apertures single pixel camera for monochromatic and multispectral image reconstruction. Simulation results demonstrate the superiority of the KD scheme over traditional E2E optimization for the designing of highly physically constrained COI systems.
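The teacher-student idea in this abstract can be illustrated with a generic response-based distillation objective for a reconstruction task. This is only a minimal sketch: the abstract does not state the paper's actual loss, so the use of mean squared error, the `weight` parameter, and the function names below are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_objective(student_out, teacher_out, target, weight=0.5):
    """Blend a task loss with a teacher-matching loss (hypothetical form).

    student_out: reconstruction from the constrained (student) system
    teacher_out: reconstruction from the unconstrained (teacher) system
    target:      ground-truth signal
    weight:      assumed trade-off; 0 ignores the teacher, 1 ignores the target
    """
    task_loss = mse(student_out, target)
    teacher_loss = mse(student_out, teacher_out)
    return (1.0 - weight) * task_loss + weight * teacher_loss


# Toy usage: the student is pulled toward both the target and the teacher.
loss = distillation_objective([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], weight=0.5)
print(loss)
```

In a real pipeline the teacher would be trained first and frozen, and the student's optical coding elements and decoder would be optimized against an objective of this general shape.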

[CV-66] MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation

Link: https://arxiv.org/abs/2406.17960
Authors: Liuyi Wang, Zongtao He, Mengjiao Shen, Jingwei Yang, Chengju Liu, Qijun Chen
Keywords: Embodied Artificial Intelligence, Artificial Intelligence, Embodied Artificial, recent large models, excessive parameter sizes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Despite the remarkable developments of recent large models in Embodied Artificial Intelligence (E-AI), their integration into robotics is hampered by their excessive parameter sizes and computational demands. Towards the Vision-and-Language Navigation (VLN) task, a core task in E-AI, this paper reveals the great potential of using knowledge distillation for obtaining lightweight student models by proposing a Meta-Ability Guided Interactive Chain-of-distillation (MAGIC) method. Specifically, a Meta-Ability Knowledge Distillation (MAKD) framework is proposed for decoupling and refining the necessary meta-abilities of VLN agents. A Meta-Knowledge Randomization Weighting (MKRW) and a Meta-Knowledge Transferable Determination (MKTD) module are incorporated to dynamically adjust aggregation weights at the meta-ability and sample levels, respectively. Moving beyond the traditional one-step unidirectional distillation, an Interactive Chain-of-Distillation (ICoD) learning strategy is proposed to allow students to give feedback to teachers, forming a new multi-step teacher-student co-evolution pipeline. Remarkably, on the R2R test unseen public leaderboard, our smallest model, MAGIC-S, with only 5% (11M) of the teacher’s size, outperforms all previous methods under the same training data. Additionally, our largest model, MAGIC-L, surpasses the previous state-of-the-art by 5.84% in SPL and 3.18% in SR. Furthermore, a new dataset was collected and annotated from our living environments, where MAGIC-S demonstrated superior performance and real-time efficiency. Our code is publicly available on this https URL.
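For readers unfamiliar with the two metrics quoted in this abstract, SR is the fraction of successful episodes and SPL (Success weighted by Path Length) discounts each success by how much longer the agent's path was than the shortest one. The sketch below follows the standard definitions; the episode tuple layout and function names are assumptions for illustration and are not taken from the paper's code.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success.

    Each episode is (success, shortest_path_length, agent_path_length).
    """
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A perfect path scores 1; longer paths score proportionally less.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Toy usage: two successes (one with a detour) and one failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
print(success_rate(episodes), spl(episodes))
```

Because failed episodes contribute zero, SPL can never exceed SR on the same set of episodes.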

[CV-67] Hot-Distance: Combining One-Hot and Signed Distance Embeddings for Segmentation

Link: https://arxiv.org/abs/2406.17936
Authors: Marwan Zouinkhi, Jeff L. Rhoades, Aubrey V. Weigel
Keywords: Machine learning models, Machine learning, Machine, learning models, data
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: 3 pages, 1 figure, in progress

Abstract:Machine learning models are only as good as the data to which they are fit. As such, it is always preferable to use as much data as possible in training models. What data can be used for fitting a model depends a lot on the formulation of the task. We introduce Hot-Distance, a novel segmentation target that incorporates the strength of signed boundary distance prediction with the flexibility of one-hot encoding, to increase the amount of usable training data for segmentation of subcellular structures in focused ion beam scanning electron microscopy (FIB-SEM).

[CV-68] Semi-supervised classification of dental conditions in panoramic radiographs using large language model and instance segmentation: A real-world dataset evaluation

Link: https://arxiv.org/abs/2406.17915
Authors: Bernardo Silva, Jefferson Fontinele, Carolina Letícia Zilli Vieira, João Manuel R.S. Tavares, Patricia Ramos Cury, Luciano Oliveira
Keywords: vast diagnostic opportunities, offer vast diagnostic, training supervised deep, radiographs offer vast, supervised deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 43 pages, 12 figures, 9 tables

Abstract:Dental panoramic radiographs offer vast diagnostic opportunities, but training supervised deep learning networks for automatic analysis of those radiology images is hampered by a shortage of labeled data. Here, a different perspective on this problem is introduced. A semi-supervised learning framework is proposed to classify thirteen dental conditions on panoramic radiographs, with a particular emphasis on teeth. Large language models were explored to annotate the most common dental conditions based on dental reports. Additionally, a masked autoencoder was employed to pre-train the classification neural network, and a Vision Transformer was used to leverage the unlabeled data. The analyses were validated using two of the most extensive datasets in the literature, comprising 8,795 panoramic radiographs and 8,029 paired reports and images. Encouragingly, the results consistently met or surpassed the baseline metrics for the Matthews correlation coefficient. A comparison of the proposed solution with human practitioners, supported by statistical analysis, highlighted its effectiveness and performance limitations; based on the degree of agreement among specialists, the solution demonstrated an accuracy level comparable to that of a junior specialist.

[CV-69] Entity Augmentation for Efficient Classification of Vertically Partitioned Data with Limited Overlap

Link: https://arxiv.org/abs/2406.17899
Authors: Avi Amalanshu, Viswesh Nagaswamy, G. V. S. S. Prudhvi, Yash Sirvi, Debashish Chakravarty
Keywords: Vertical Federated Learning, Vertical Federated, machine learning paradigm, Federated Learning, machine learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: GLOW @ IJCAI 2024 (12 pages + 2 page bibliography. 15 figures.)

Abstract:Vertical Federated Learning (VFL) is a machine learning paradigm for learning from vertically partitioned data (i.e. features for each input are distributed across multiple “guest” clients and an aggregating “host” server owns labels) without communicating raw data. Traditionally, VFL involves an “entity resolution” phase where the host identifies and serializes the unique entities known to all guests. This is followed by private set intersection to find common entities, and an “entity alignment” step to ensure all guests are always processing the same entity’s data. However, using only data of entities from the intersection means guests discard potentially useful data. Besides, the effect on privacy is dubious and these operations are computationally expensive. We propose a novel approach that eliminates the need for set intersection and entity alignment in categorical tasks. Our Entity Augmentation technique generates meaningful labels for activations sent to the host, regardless of their originating entity, enabling efficient VFL without explicit entity alignment. With limited overlap between training data, this approach performs substantially better (e.g. with 5% overlap, 48.1% vs 69.48% test accuracy on CIFAR-10). In fact, thanks to the regularizing effect, our model performs marginally better even with 100% overlap.

[CV-70] MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval

Link: https://arxiv.org/abs/2406.17880
Authors: Weitong Cai, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu
Keywords: Video Moment Retrieval, untrimmed long video, specific temporal segment, Moment Retrieval, Video Moment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.

[CV-71] ET tu CLIP? Addressing Common Object Errors for Unseen Environments

Link: https://arxiv.org/abs/2406.17876
Authors: Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello
Keywords: enhance model generalization, employs pre-trained CLIP, pre-trained CLIP encoders, ALFRED task, introduce a simple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.

[CV-72] Burst Image Super-Resolution with Base Frame Selection

Link: https://arxiv.org/abs/2406.17869
Authors: Sanghyun Kim, Min Jung Lee, Woohyeok Kim, Deunsol Jung, Jaesung Rim, Sunghyun Cho, Minsu Cho
Keywords: recent years due, Non-uniformly Exposed Burst, Exposed Burst Image, dubbed Non-uniformly Exposed, burst shots
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2024W NTIRE accepted

Abstract:Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image (NEBI), that includes the burst frames at varying exposure times to obtain a broader range of irradiance and motion characteristics within a scene. As burst shots with non-uniform exposures exhibit varying levels of degradation, fusing information of the burst shots into the first frame as a base frame may not result in optimal image quality. To address this limitation, we propose a Frame Selection Network (FSN) for non-uniform scenarios. This network seamlessly integrates into existing super-resolution methods in a plug-and-play manner with low computational costs. The comparative analysis reveals the effectiveness of the nonuniform setting for the practical scenario and our FSN on synthetic-/real- NEBI datasets.

[CV-73] Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection

Link: https://arxiv.org/abs/2406.17858
Authors: Jialun Pei, Ruize Cui, Yaoqian Li, Weixin Si, Jing Qin, Pheng-Ann Heng
Keywords: complex intraoperative dynamic, intraoperative dynamic environment, hidden structures inside, liver surgery poses, poses a complex
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by MICCAI 2024

Abstract:Laparoscopic liver surgery poses a complex intraoperative dynamic environment for surgeons, where it remains a significant challenge to distinguish critical or even hidden structures inside the liver. Liver anatomical landmarks, e.g., ridge and ligament, serve as important markers for 2D-3D alignment, which can significantly enhance the spatial perception of surgeons for precise surgery. To facilitate the detection of laparoscopic liver landmarks, we collect a novel dataset called L3D, which comprises 1,152 frames with elaborated landmark annotations from surgical videos of 39 patients across two medical sites. For benchmarking purposes, 12 mainstream detection methods are selected and comprehensively evaluated on L3D. Further, we propose a depth-driven geometric prompt learning network, namely D2GPLand. Specifically, we design a Depth-aware Prompt Embedding (DPE) module that is guided by self-supervised prompts and generates semantically relevant geometric information with the benefit of global depth cues extracted from SAM-based features. Additionally, a Semantic-specific Geometric Augmentation (SGA) scheme is introduced to efficiently merge RGB-D spatial and geometric information through reverse anatomic perception. The experimental results indicate that D2GPLand obtains state-of-the-art performance on L3D, with 63.52% DICE and 48.68% IoU scores. Together with 2D-3D fusion technology, our method can directly provide the surgeon with intuitive guidance information in laparoscopic scenarios.
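The DICE and IoU scores reported above are standard overlap metrics between a predicted mask and a ground-truth mask. The sketch below computes both for flat binary masks under the usual definitions; the function names, the flat-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions for illustration and may differ from the paper's evaluation code.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equal-length binary masks given as 0/1 lists."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty; treating this as a perfect match is a convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Toy usage: one overlapping pixel out of two predicted and two true pixels.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

For a single pair of masks the two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), which is why Dice is always at least as large as IoU.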

[CV-74] Human-Object Interaction from Human-Level Instructions

Link: https://arxiv.org/abs/2406.17840
Authors: Zhen Wu, Jiaman Li, C. Karen Liu
Keywords: Intelligent agents, daily tasks based, human-level instructions, motion, instructions
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Intelligent agents need to autonomously navigate and interact within contextual environments to perform a wide range of daily tasks based on human-level instructions. These agents require a foundational understanding of the world, incorporating common sense and knowledge, to interpret such instructions. Moreover, they must possess precise low-level skills for movement and interaction to execute the detailed task plans derived from these instructions. In this work, we address the task of synthesizing continuous human-object interactions for manipulating large objects within contextual environments, guided by human-level instructions. Our goal is to generate synchronized object motion, full-body human motion, and detailed finger motion, all essential for realistic interactions. Our framework consists of a large language model (LLM) planning module and a low-level motion generator. We use LLMs to deduce spatial object relationships and devise a method for accurately determining their positions and orientations in target scene layouts. Additionally, the LLM planner outlines a detailed task plan specifying a sequence of sub-tasks. This task plan, along with the target object poses, serves as input for our low-level motion generator, which seamlessly alternates between navigation and interaction modules. We present the first complete system that can synthesize object motion, full-body motion, and finger motion simultaneously from human-level instructions. Our experiments demonstrate the effectiveness of our high-level planner in generating plausible target layouts and our low-level motion generator in synthesizing realistic interactions for diverse objects. Please refer to our project page for more results: this https URL.

[CV-75] SUM: Saliency Unification through Mamba for Visual Attention Modeling

Link: https://arxiv.org/abs/2406.17815
Authors: Alireza Hosseini, Amirhossein Kazerouni, Saeed Akhavan, Michael Brudno, Babak Taati
Keywords: Convolutional Neural Networks, prioritizing visual stimuli, important for interpreting, plays a significant, interpreting and prioritizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Visual attention modeling, important for interpreting and prioritizing visual stimuli, plays a significant role in applications such as marketing, multimedia, and robotics. Traditional saliency prediction models, especially those based on Convolutional Neural Networks (CNNs) or Transformers, achieve notable success by leveraging large-scale annotated datasets. However, the current state-of-the-art (SOTA) models that use Transformers are computationally expensive. Additionally, separate models are often required for each image type, lacking a unified approach. In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net to provide a unified model for diverse image types. Using a novel Conditional Visual State Space (C-VSS) block, SUM dynamically adapts to various image types, including natural scenes, web pages, and commercial imagery, ensuring universal applicability across different data types. Our comprehensive evaluations across five benchmarks demonstrate that SUM seamlessly adapts to different visual characteristics and consistently outperforms existing models. These results position SUM as a versatile and powerful tool for advancing visual attention modeling, offering a robust solution universally applicable across different types of visual content.

[CV-76] MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?

Link: https://arxiv.org/abs/2406.17806
Authors: Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
Keywords: biased thinking patterns, Humans are prone, Large Language Models, Multimodal Large Language, cognitive distortions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Humans are prone to cognitive distortions – biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies. While these models are designed to respond to queries under safety mechanisms, they sometimes reject harmless queries in the presence of certain visual stimuli, disregarding the benign nature of their contexts. As the initial step in investigating this behavior, we identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To systematically evaluate MLLMs’ oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2) Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model’s responses. (3) Different types of stimuli tend to cause errors at specific stages – perception, intent reasoning, and safety judgement – in the response process of MLLMs. These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications. We make our project available at this https URL.

[CV-77] RACon: Retrieval-Augmented Simulated Character Locomotion Control

Link: https://arxiv.org/abs/2406.17795
Authors: Yuxuan Mu, Shihao Zou, Kangning Yin, Zheng Tian, Li Cheng, Weinan Zhang, Jun Wang
Keywords: simulated character, computer animation, Character Locomotion Control, Simulated Character Locomotion, Locomotion Control
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted in ICME2024 for oral presentation

Abstract:In computer animation, driving a simulated character with lifelike motion is challenging. Current generative models, though able to generalize to diverse motions, often pose challenges to the responsiveness of end-user control. To address these issues, we introduce RACon: Retrieval-Augmented Simulated Character Locomotion Control. Our end-to-end hierarchical reinforcement learning method utilizes a retriever and a motion controller. The retriever searches motion experts from a user-specified database in a task-oriented fashion, which boosts the responsiveness to the user’s control. The selected motion experts and the manipulation signal are then transferred to the controller to drive the simulated character. In addition, a retrieval-augmented discriminator is designed to stabilize the training process. Our method surpasses existing techniques in both quality and quantity in locomotion control, as demonstrated in our empirical study. Moreover, by switching extensive databases for retrieval, it can adapt to distinctive motion types at run time.

[CV-78] Real-time Neural Woven Fabric Rendering

Link: https://arxiv.org/abs/2406.17782
Authors: Xiang Chen, Lu Wang, Beibei Wang
Keywords: Woven fabrics, real-time capability, rendering realistic woven, realistic woven fabrics, Woven
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by SIGGRAPH 2024 Conference Proceedings

Abstract:Woven fabrics are widely used in applications of realistic rendering, where real-time capability is also essential. However, rendering realistic woven fabrics in real time is challenging due to their complex structure and optical appearance, which cause aliasing and noise without many samples. The core of this issue is a multi-scale representation of the fabric shading model, which allows for a fast range query. Some previous neural methods deal with the issue at the cost of training on each material, which limits their practicality. In this paper, we propose a lightweight neural network to represent different types of woven fabrics at different scales. Thanks to the regularity and repetitiveness of woven fabric patterns, our network can encode fabric patterns and parameters as a small latent vector, which is later interpreted by a small decoder, enabling the representation of different types of fabrics. By applying the pixel’s footprint as input, our network achieves multi-scale representation. Moreover, our network is fast and occupies little storage because of its lightweight structure. As a result, our method achieves rendering and editing woven fabrics at nearly 60 frames per second on an RTX 3090, showing a quality close to the ground truth and being free from visible aliasing and noise.

[CV-79] Large Language Models estimate fine-grained human color-concept associations

Link: https://arxiv.org/abs/2406.17781
Authors: Kushin Mukherjee, Timothy T. Rogers, Karen B. Schloss
Keywords: visual cognition ranging, color-concept associations, perceptual color space, influence aspects, aspects of visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Concepts, both abstract and concrete, elicit a distribution of association strengths across perceptual color space, which influence aspects of visual cognition ranging from object recognition to interpretation of information visualizations. While prior work has hypothesized that color-concept associations may be learned from the cross-modal statistical structure of experience, it has been unclear whether natural environments possess such structure or, if so, whether learning systems are capable of discovering and exploiting it without strong prior constraints. We addressed these questions by investigating the ability of GPT-4, a multimodal large language model, to estimate human-like color-concept associations without any additional training. Starting with human color-concept association ratings for a set of 71 colors spanning perceptual color space (UW-71) and concepts that varied in abstractness, we assessed how well association ratings generated by GPT-4 could predict human ratings. GPT-4 ratings were correlated with human ratings, with performance comparable to state-of-the-art methods for automatically estimating color-concept associations from images. Variability in GPT-4’s performance across concepts could be explained by specificity of the concept’s color-concept association distribution. This study suggests that high-order covariances between language and perception, as expressed in the natural environment of the internet, contain sufficient information to support learning of human-like color-concept associations, and provides an existence proof that a learning system can encode such associations without initial constraints. The work further shows that GPT-4 can be used to efficiently estimate distributions of color associations for a broad range of concepts, potentially serving as a critical tool for designing effective and intuitive information visualizations.

[CV-80] OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

Link: https://arxiv.org/abs/2406.16620
Authors: Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, Kyusong Lee
Keywords: Large Language Models, Language Models, Large Language, Recent advancements, advancements in Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent’s efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.

[CV-81] Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation

Link: https://arxiv.org/abs/2406.18327
Authors: Yuxuan Qi, Li Lin, Jiajun Wang, Jingya Zhang, Bin Zhang
Keywords: Accurate segmentation, Evidential Fusion Network, Multi-modal Evidential Fusion, treatment of cancer, PET
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by networks. To take the uncertainty into account in multi-modal information fusion, this paper proposes a novel Multi-modal Evidential Fusion Network (MEFN) comprising a Cross-Modal Feature Learning (CFL) module and a Multi-modal Trusted Fusion (MTF) module. The CFL module reduces the domain gap upon modality conversion and highlights common tumor features, thereby alleviating the need for the segmentation module to handle modality specificity. The MTF module utilizes mutual attention mechanisms and an uncertainty calibrator to fuse modality features based on modality uncertainty and then fuses the segmentation results under the guidance of Dempster-Shafer Theory. Besides, a new uncertainty perceptual loss is introduced to force the model to focus on uncertain features and hence improve its ability to extract trusted modality information. Extensive comparative experiments are conducted on two publicly available PET/CT datasets to evaluate the performance of our proposed method. The results demonstrate that our MEFN significantly outperforms state-of-the-art methods, with improvements of 2.15% and 3.23% in DSC scores on the AutoPET dataset and the Hecktor dataset, respectively. More importantly, our model can provide radiologists with credible uncertainty of the segmentation results for their decision in accepting or rejecting the automatic segmentation results, which is particularly important for clinical applications. Our code will be available at this https URL.
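
The abstract mentions fusing segmentation results under Dempster-Shafer Theory. As a rough illustration only, the sketch below applies the standard Dempster rule of combination to two per-pixel mass functions over the frame {tumor, background}. The function name `dempster_combine`, the example masses, and the two-class frame are assumptions made here for illustration; this is not the MEFN implementation.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence accumulates on the intersection of the two sets.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass K.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Normalise by 1 - K so the fused masses sum to one.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical per-pixel evidence from two modalities (values are made up).
T = frozenset({"tumor"})
B = frozenset({"background"})
THETA = T | B  # mass on the whole frame expresses "don't know"

pet = {T: 0.7, B: 0.1, THETA: 0.2}
ct = {T: 0.5, B: 0.2, THETA: 0.3}
fused = dempster_combine(pet, ct)
```

In this toy case the mass left on the whole frame after fusion is smaller than in either input, which is the sense in which combining two partly uncertain sources can yield a more confident result.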

[CV-82] Generative artificial intelligence in ophthalmology: multimodal retinal images for the diagnosis of Alzheimers disease with convolutional neural networks

Link: https://arxiv.org/abs/2406.18247
Authors: I. R. Slootweg, M. Thach, K. R. Curro-Tafili, F. D. Verbraak, F. H. Bouwman, Y. A. L. Pijnenburg, J. F. Boer, J. H. P. de Kwisthout, L. Bagheriye, P. J. González
Keywords: Amyloid Positron Emission, Positron Emission Tomography, predict Amyloid Positron, Amyloid Positron, Positron Emission
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Background/Aim. This study aims to predict Amyloid Positron Emission Tomography (AmyloidPET) status with multimodal retinal imaging and convolutional neural networks (CNNs) and to improve the performance through pretraining with synthetic data. Methods. Fundus autofluorescence, optical coherence tomography (OCT), and OCT angiography images from 328 eyes of 59 AmyloidPET positive subjects and 108 AmyloidPET negative subjects were used for classification. Denoising Diffusion Probabilistic Models (DDPMs) were trained to generate synthetic images and unimodal CNNs were pretrained on synthetic data and finetuned on real data or trained solely on real data. Multimodal classifiers were developed to combine predictions of the four unimodal CNNs with patient metadata. Class activation maps of the unimodal classifiers provided insight into the network’s attention to inputs. Results. DDPMs generated diverse, realistic images without memorization. Pretraining unimodal CNNs with synthetic data improved AUPR at most from 0.350 to 0.579. Integration of metadata in multimodal CNNs improved AUPR from 0.486 to 0.634, which was the best overall classifier. Class activation maps highlighted relevant retinal regions which correlated with AD. Conclusion. Our method for generating and leveraging synthetic data has the potential to improve AmyloidPET prediction from multimodal retinal imaging. A DDPM can generate realistic and unique multimodal synthetic retinal images. Our best performing unimodal and multimodal classifiers were not pretrained on synthetic data; however, pretraining with synthetic data slightly improved classification performance for two out of the four modalities.

[CV-83] Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools

Link: https://arxiv.org/abs/2406.18240
Authors: Francisca Silva-Clavería, Carmen Serrano, Iván Matas, Amalia Serrano, Tomás Toledo-Pastrana, David Moreno-Ramírez, Begoña Acha
Keywords: basal cell carcinoma, BCC, cell carcinoma, objectively validated, basal cell
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Methodology (stat.ME)
Notes: Manuscript word count: 3000, Number of figures: 2, Number of tables: 3

Abstract:Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar’s test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists.
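
The agreement figure quoted above is a Fleiss' kappa. For readers unfamiliar with the statistic, here is a minimal sketch of the textbook formula. The function name `fleiss_kappa` and the toy ratings (four raters, two categories) are illustrative assumptions and have nothing to do with the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement among rater pairs for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    # Undefined (division by zero) when every rating falls in one category.
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 4 lesions, 4 raters, categories = (criterion present, absent).
ratings = [[4, 0], [3, 1], [0, 4], [2, 2]]
kappa = fleiss_kappa(ratings)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.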

[CV-84] Joint Stream: Malignant Region Learning for Breast Cancer Diagnosis

Link: https://arxiv.org/abs/2406.18212
Authors: Abdul Rehman, Sarfaraz Hussein, Waqas Sultani
Keywords: mortality rate worldwide, breast cancer, rate worldwide, contributes to reducing, reducing the mortality
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: Under Review (Biomedical Signal Processing and Control)

Abstract:Early diagnosis of breast cancer (BC) significantly contributes to reducing the mortality rate worldwide. The detection of different factors and biomarkers such as Estrogen receptor (ER), Progesterone receptor (PR), Human epidermal growth factor receptor 2 (HER2) gene, Histological grade (HG), Auxiliary lymph node (ALN) status, and Molecular subtype (MS) can play a significant role in improved BC diagnosis. However, the existing methods predict only a single factor, which makes them less suitable for use in diagnosis and in designing a treatment strategy. In this paper, we propose to classify the six essential indicating factors (ER, PR, HER2, ALN, HG, MS) for early BC diagnosis using H&E-stained WSIs. To precisely capture local neighboring relationships, we use spatial and frequency domain information from the large patch size of WSIs' malignant regions. Furthermore, to cater to the variable number of regions of interest and their sizes, and to give due attention to each region, we propose a malignant region learning attention network. Our experimental results demonstrate that combining spatial and frequency information using the malignant region learning module significantly improves multi-factor and single-factor classification performance on publicly available datasets.

[CV-85] EFCNet: Every Feature Counts for Small Medical Object Segmentation

Link: https://arxiv.org/abs/2406.18201
Authors: Lingjie Kong, Qiaoling Wei, Chengming Xu, Han Chen, Yanwei Fu
Keywords: Convolutional Neural Networks, small medical objects, Neural Networks, significant clinical, Convolutional Neural
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper explores the segmentation of very small medical objects with significant clinical value. While Convolutional Neural Networks (CNNs), particularly UNet-like models, and recent Transformers have shown substantial progress in image segmentation, our empirical findings reveal their poor performance in segmenting the small medical objects and lesions concerned in this paper. This limitation may be attributed to information loss during their encoding and decoding process. In response to this challenge, we propose a novel model named EFCNet for small object segmentation in medical images. Our model incorporates two modules: the Cross-Stage Axial Attention Module (CSAA) and the Multi-Precision Supervision Module (MPS). These modules address information loss during encoding and decoding procedures, respectively. Specifically, CSAA integrates features from all stages of the encoder to adaptively learn suitable information needed in different decoding stages, thereby reducing information loss in the encoder. On the other hand, MPS introduces a novel multi-precision supervision mechanism to the decoder. This mechanism prioritizes attention to low-resolution features in the initial stages of the decoder, mitigating information loss caused by subsequent convolution and sampling processes and enhancing the model’s global perception. We evaluate our model on two benchmark medical image datasets. The results demonstrate that EFCNet significantly outperforms previous segmentation methods designed for both medical and normal images.

[CV-86] A Lung Nodule Dataset with Histopathology-based Cancer Type Annotation

Link: https://arxiv.org/abs/2406.18102
Authors: Muwei Jian, Hongyu Chen, Zaiyong Zhang, Nan Yang, Haorang Zhang, Lifu Ma, Wenjing Xu, Huixiang Zhi
Keywords: clinical diagnostic workflows, CAD systems, CAD systems encounter, diagnostic workflows, significantly alleviating
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, Computer-Aided Diagnosis (CAD) systems have emerged as indispensable tools in clinical diagnostic workflows, significantly alleviating the burden on radiologists. Nevertheless, despite their integration into clinical settings, CAD systems encounter limitations. Specifically, while CAD systems can achieve high performance in the detection of lung nodules, they face challenges in accurately predicting multiple cancer types. This limitation can be attributed to the scarcity of publicly available datasets annotated with expert-level cancer type information. This research aims to bridge this gap by providing publicly accessible datasets and reliable tools for medical diagnosis, facilitating a finer categorization of different types of lung diseases so as to offer precise treatment recommendations. To achieve this objective, we curated a diverse dataset of lung Computed Tomography (CT) images, comprising 330 annotated nodules (nodules are labeled as bounding boxes) from 95 distinct patients. The quality of the dataset was evaluated using a variety of classical classification and detection models, and these promising results demonstrate that the dataset has a feasible application and further facilitate intelligent auxiliary diagnosis.

[CV-87] Leveraging Pre-trained Models for FF-to-FFPE Histopathological Image Translation

Link: https://arxiv.org/abs/2406.18054
Authors: Qilai Zhang, Jiawen Li, Peiran Liao, Jiali Hu, Tian Guan, Anjia Han, Yonghong He
Keywords: Hematoxylin and Eosin, Fresh Frozen, types of Hematoxylin, Formalin-Fixed Paraffin-Embedded, FFPE slides offer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The two primary types of Hematoxylin and Eosin (H&E) slides in histopathology are Formalin-Fixed Paraffin-Embedded (FFPE) and Fresh Frozen (FF). FFPE slides offer high quality histopathological images but require a labor-intensive acquisition process. In contrast, FF slides can be prepared quickly, but the image quality is relatively poor. Our task is to translate FF images into FFPE style, thereby improving the image quality for diagnostic purposes. In this paper, we propose Diffusion-FFPE, a method for FF-to-FFPE histopathological image translation using a pre-trained diffusion model. Specifically, we employ a one-step diffusion model as the generator and fine-tune it with LoRA adapters using adversarial learning objectives. To ensure that the model effectively captures both global structural information and local details, we propose a multi-scale feature fusion (MFF) module. This module utilizes two VAE encoders to extract features of varying image sizes and performs feature fusion before feeding them into the UNet. Furthermore, we utilize a pre-trained vision-language model for histopathology as the backbone for the discriminator to further improve performance. We conducted FF-to-FFPE translation experiments on the TCGA-NSCLC datasets, and our method achieved better performance compared to other methods. The code and models are released at this https URL.

[CV-88] DeepSense-V2V: A Vehicle-to-Vehicle Multi-Modal Sensing Localization and Communications Dataset

Link: https://arxiv.org/abs/2406.17908
Authors: Joao Morais, Gouranga Charan, Nikhil Srinivas, Ahmed Alkhateeb
Keywords: High data rate, future intelligent transport, intelligent transport systems, support distributed computing, enhance safety
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 15 figures, 2 tables. The dataset is available on the DeepSense6G website: this https URL

Abstract:High data rate and low-latency vehicle-to-vehicle (V2V) communication are essential for future intelligent transport systems to enable coordination, enhance safety, and support distributed computing and intelligence requirements. Developing effective communication strategies, however, demands realistic test scenarios and datasets. This is important at the high-frequency bands where more spectrum is available, yet harvesting this bandwidth is challenged by the need for directional transmission and the sensitivity of signal propagation to blockages. This work presents the first large-scale multi-modal dataset for studying mmWave vehicle-to-vehicle communications. It presents a two-vehicle testbed that comprises data from a 360-degree camera, four radars, four 60 GHz phased arrays, a 3D lidar, and two precise GPSs. The dataset contains vehicles driving during the day and night for 120 km in intercity and rural settings, with speeds up to 100 km per hour. More than one million objects were detected across all images, from trucks to bicycles. This work further includes detailed dataset statistics that prove the coverage of various situations and highlights how this dataset can enable novel machine-learning applications.

[CV-89] Domain Adaptation of Echocardiography Segmentation Via Reinforcement Learning

Link: https://arxiv.org/abs/2406.17902
Authors: Arnaud Judge, Thierry Judge, Nicolas Duchateau, Roman A. Sandler, Joseph Z. Sokol, Olivier Bernard, Pierre-Marc Jodoin
Keywords: Performance of deep, insufficient annotated data, effective fine-tuning, significantly challenged, aiming to adapt
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 9 pages

Abstract:Performance of deep learning segmentation models is significantly challenged in its transferability across different medical imaging domains, particularly when aiming to adapt these models to a target domain with insufficient annotated data for effective fine-tuning. While existing domain adaptation (DA) methods propose strategies to alleviate this problem, these methods do not explicitly incorporate human-verified segmentation priors, compromising the potential of a model to produce anatomically plausible segmentations. We introduce RL4Seg, an innovative reinforcement learning framework that reduces the need to otherwise incorporate large expertly annotated datasets in the target domain, and eliminates the need for lengthy manual human review. Using a target dataset of 10,000 unannotated 2D echocardiographic images, RL4Seg not only outperforms existing state-of-the-art DA methods in accuracy but also achieves 99% anatomical validity on a subset of 220 expert-validated subjects from the target domain. Furthermore, our framework’s reward network offers uncertainty estimates comparable with dedicated state-of-the-art uncertainty methods, demonstrating the utility and effectiveness of RL4Seg in overcoming domain adaptation challenges in medical image segmentation.

[CV-90] A Review of Electromagnetic Elimination Methods for low-field portable MRI scanner

Link: https://arxiv.org/abs/2406.17804
Authors: Wanyu Bian
Keywords: eliminating electromagnetic interference, deep learning, deep learning methods, EMI, EMI elimination
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:This paper presents a comprehensive analysis of both conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We explore the underlying principles and implementation of traditional analytical and adaptive EMI elimination techniques, as well as cutting-edge deep learning approaches. Through a detailed comparison, the strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination utilizing multiple external EMI receiver coils and analytical techniques are discussed alongside the superior performance of deep learning methods, which leverage neural networks trained on extensive MRI data. While deep learning methods demonstrate significant improvements in EMI suppression, enhancing diagnostic capabilities and accessibility of MRI technology, they also introduce potential security and safety concerns, especially in production and commercial applications. This study underscores the need to address these challenges to fully realize the benefits of deep learning in EMI elimination. The findings suggest a balanced approach, combining the reliability of conventional methods with the advanced capabilities of deep learning, to develop more robust and effective EMI suppression strategies in MRI systems.

[CV-91] Applications of interpretable deep learning in neuroimaging: a comprehensive review

Link: https://arxiv.org/abs/2406.17792
Authors: Lindsay Munroe, Mariana da Silva, Faezeh Heidari, Irina Grigorescu, Simon Dahan, Emma C. Robinson, Maria Deprez, Po-Wah So
Keywords: neural networks leads, deep learning models, Clinical adoption, deep learning, trustworthiness and reliability
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

Abstract:Clinical adoption of deep learning models has been hindered, in part, because the black-box nature of neural networks leads to concerns regarding their trustworthiness and reliability. These concerns are particularly relevant in the field of neuroimaging due to the complex brain phenotypes and inter-subject heterogeneity often encountered. The challenge can be addressed by interpretable deep learning (iDL) methods that enable the visualisation and interpretation of the inner workings of deep learning models. This study systematically reviewed the literature on neuroimaging applications of iDL methods and critically analysed how iDL explanation properties were evaluated. Seventy-five studies were included, and ten categories of iDL methods were identified. We also reviewed five properties of iDL explanations that were analysed in the included studies: biological validity, robustness, continuity, selectivity, and downstream task performance. We found that the most popular iDL approaches used in the literature may be sub-optimal for neuroimaging data, and we discussed possible future directions for the field.

Machine Learning

[LG-0] Towards Compositionality in Concept Learning

Link: https://arxiv.org/abs/2406.18534
Authors: Adam Stein, Aaditya Naik, Yinjun Wu, Mayur Naik, Eric Wong
Keywords: Concept-based interpretability methods, Concept-based interpretability, interpretability methods offer, compositional concept representations, compositional concept
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted at ICML 2024. 26 pages, 10 figures

Abstract:Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts which are not compositional. To automatically discover compositional concept representations, we identify two salient properties of such representations, and propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties. We evaluate CCE on five different datasets over image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at this https URL .

[LG-1] Symbolic Learning Enables Self-Evolving Agents

Link: https://arxiv.org/abs/2406.18532
Authors: Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang
Keywords: language agents, large language models, agent symbolic learning, tool usage methods, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Code available at this https URL

Abstract:The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing “language agents”, which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that they are model-centric, or engineering-centric. That’s to say, the progress on prompts, tools, and pipelines of language agents requires substantial manual engineering efforts from human experts rather than automatically learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in “self-evolving agents”. 

[LG-2] Confident Natural Policy Gradient for Local Planning in q_pi-realizable Constrained MDPs

Link: https://arxiv.org/abs/2406.18529
Authors: Tian Tian, Lin F. Yang, Csaba Szepesvári
Keywords: Markov decision process, constrained Markov decision, important reinforcement learning, reinforcement learning approach, maximizing cumulative reward
Categories: Machine Learning (cs.LG)

Abstract:The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives while maximizing cumulative reward. However, the current understanding of how to learn efficiently in a CMDP environment with a potentially infinite number of states remains under investigation, particularly when function approximation is applied to the value functions. In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability, where the value functions of all policies are linearly representable with a known feature map, a setting known to be more general and challenging than other linear settings. Utilizing a local-access model, we propose a novel primal-dual algorithm that, after $\tilde{O}(\mathrm{poly}(d)\,\epsilon^{-3})$ queries, outputs with high probability a policy that strictly satisfies the constraints while nearly optimizing the value with respect to a reward function. Here, $d$ is the feature dimension and $\epsilon > 0$ is a given error. The algorithm relies on a carefully crafted off-policy evaluation procedure to evaluate the policy using historical data, which informs policy updates through policy gradients and conserves samples. To our knowledge, this is the first result achieving polynomial sample complexity for CMDP in the $q_\pi$-realizable setting.

[LG-3] APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Link: https://arxiv.org/abs/2406.18518
Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong
Keywords: Berkeley Function-Calling Benchmark, models requires diverse, function-calling, requires diverse, agent models requires
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: this https URL and the project homepage: this https URL
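
The three verification stages named in the abstract (format checking, function execution, semantic verification) can be pictured as a simple filter chain. The sketch below is a toy illustration under stated assumptions: the JSON call shape, the `registry` of callables, and the `judge` callback (standing in for whatever semantic checker is used) are all hypothetical and are not taken from the APIGen code.

```python
import json


def format_check(raw, registry):
    """Stage 1: the call must be valid JSON naming a known function with dict arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    if call.get("name") not in registry or not isinstance(call.get("arguments"), dict):
        return None
    return call


def execute(call, registry):
    """Stage 2: actually run the function; any exception rejects the sample."""
    try:
        return True, registry[call["name"]](**call["arguments"])
    except Exception:
        return False, None


def verify(query, raw, registry, judge):
    """Run all three stages; a sample is kept only if every stage passes."""
    call = format_check(raw, registry)
    if call is None:
        return False
    ok, result = execute(call, registry)
    if not ok:
        return False
    # Stage 3: hypothetical semantic check that the result fits the query.
    return bool(judge(query, result))


# Toy registry and judge, both assumptions made for illustration.
registry = {"add": lambda a, b: a + b}
judge = lambda query, result: result is not None

kept = verify("add 2 and 3", '{"name": "add", "arguments": {"a": 2, "b": 3}}', registry, judge)
```

Ordering the stages from cheapest to most expensive means malformed samples are discarded before any function is executed or any semantic check is run.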

[LG-4] Mental Modeling of Reinforcement Learning Agents by Language Models

Link: https://arxiv.org/abs/2406.18505
Authors: Wenhao Lu, Xufeng Zhao, Josua Spisak, Jae Hee Lee, Stefan Wermter
Keywords: language models faithfully, emergent language models, models faithfully model, language models, intelligence of decision-making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Notes: this https URL

Abstract:Can emergent language models faithfully model the intelligence of decision-making agents? Though modern language models already exhibit some reasoning ability, and theoretically can potentially express any probable distribution over tokens, it remains underexplored how the world knowledge these pretrained models have memorized can be utilized to comprehend an agent’s behaviour in the physical world. This study empirically examines, for the first time, how well large language models (LLMs) can build a mental model of agents, termed agent mental modelling, by reasoning about an agent’s behaviour and its effect on states from agent interaction history. This research may unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in eXplainable reinforcement learning (XRL). To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. Our results disclose that LLMs are not yet capable of fully mentally modelling agents through inference alone without further innovations. This work thus provides new insights into the capabilities and limitations of modern LLMs.

[LG-5] Enhancing Federated Learning with Adaptive Differential Privacy and Priority-Based Aggregation

Link: https://arxiv.org/abs/2406.18491
Authors: Mahtab Talaei, Iman Izadi
Keywords: distributed machine learning, develops global models, Federated learning, branch of distributed, distributed machine
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Federated learning (FL), a novel branch of distributed machine learning (ML), develops global models through a private procedure without direct access to local datasets. However, it is still possible to access the model updates (gradient updates of deep neural networks) transferred between clients and servers, potentially revealing sensitive local information to adversaries using model inversion attacks. Differential privacy (DP) offers a promising approach to addressing this issue by adding noise to the parameters. On the other hand, heterogeneities in data structure, storage, communication, and computational capabilities of devices can cause convergence problems and delays in developing the global model. A personalized weighted averaging of local parameters based on the resources of each device can yield a better aggregated model in each round. In this paper, to efficiently preserve privacy, we propose a personalized DP framework that injects noise based on clients’ relative impact factors and aggregates parameters while considering heterogeneities and adjusting properties. To fulfill the DP requirements, we first analyze the convergence boundary of the FL algorithm when impact factors are personalized and fixed throughout the learning process. We then further study the convergence property considering time-varying (adaptive) impact factors.
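
The two ingredients mentioned in the abstract, noise injected per client and a weighted average driven by impact factors, can be illustrated with a small sketch. It is not the authors' framework: the function names, the clipping step, the per-client `noise_stds`, and the way `impact` maps to weights are assumptions made for this example, and no formal DP guarantee is claimed because the noise calibration is not shown.

```python
import math
import random


def clip(update, max_norm):
    """Scale an update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def noisy_weighted_average(updates, impact, noise_stds, max_norm, rng):
    """Clip each client update, add per-client Gaussian noise, then average
    with weights proportional to the (hypothetical) impact factors."""
    total = sum(impact)
    aggregate = [0.0] * len(updates[0])
    for update, factor, std in zip(updates, impact, noise_stds):
        weight = factor / total
        for i, value in enumerate(clip(update, max_norm)):
            aggregate[i] += weight * (value + rng.gauss(0.0, std))
    return aggregate


rng = random.Random(0)
updates = [[1.0, 0.0], [0.0, 1.0]]
print(noisy_weighted_average(updates, impact=[3.0, 1.0],
                             noise_stds=[0.01, 0.05], max_norm=1.0, rng=rng))
```

With all noise levels set to zero the function reduces to a plain weighted average, which is a convenient sanity check.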

[LG-6] UniRec: A Dual Enhancement of Uniformity and Frequency in Sequential Recommendations

Link: https://arxiv.org/abs/2406.18470
Authors: Yang Liu, Yitong Wang, Chenyue Feng
Keywords: accurately modeling user, critical for accurately, accurately modeling, modeling user interaction, improving recommendation precision
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures, for source code, see this https URL

Abstract:Representation learning in sequential recommendation is critical for accurately modeling user interaction patterns and improving recommendation precision. However, existing approaches predominantly emphasize item-to-item transitions, often neglecting the time intervals between interactions, which are closely related to behavior pattern changes. Additionally, broader interaction attributes, such as item frequency, are frequently overlooked. We found that both sequences with more uniform time intervals and items with higher frequency yield better prediction performance. Conversely, non-uniform sequences exacerbate user interest drift and less-frequent items are difficult to model due to sparse sampling, presenting unique challenges inadequately addressed by current methods. In this paper, we propose UniRec, a novel bidirectional enhancement sequential recommendation method. UniRec leverages sequence uniformity and item frequency to enhance performance, particularly improving the representation of non-uniform sequences and less-frequent items. These two branches mutually reinforce each other, driving comprehensive performance optimization in complex sequential recommendation scenarios. Additionally, we present a multidimensional time module to further enhance adaptability. To the best of our knowledge, UniRec is the first method to utilize the characteristics of uniformity and frequency for feature augmentation. Comparing with eleven advanced models across four datasets, we demonstrate that UniRec outperforms SOTA models significantly. The code is available at this https URL.

[LG-7] Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers

Link: https://arxiv.org/abs/2406.18451
Authors: Jonas Ngnawé, Sabyasachi Sahoo, Yann Pequignot, Frédéric Precioso, Christian Gagné
Keywords: high-stakes real-world applications, adversarial training strategies, input space margins, improve robustness, imperceptible perturbations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 2 tables, 1 algorithm

Abstract:Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model’s vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency – a property that links the input space margins and the logit margins in robust models – for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model’s logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
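
The logit margin used as a detection score above is the gap between the two largest logits. A minimal sketch of flagging low-margin (potentially brittle) decisions follows; the function names and the `threshold` value are hypothetical, and in practice the threshold would have to be calibrated, for instance on a small subset with estimated input margins as the abstract suggests.

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up


def flag_brittle(batch_logits, threshold):
    """Mark samples whose logit margin falls below a (hypothetical) threshold."""
    return [logit_margin(logits) < threshold for logits in batch_logits]


batch = [[2.0, 5.0, 4.5], [0.0, 9.0, 1.0]]
print([logit_margin(logits) for logits in batch])
print(flag_brittle(batch, threshold=1.0))
```

The score needs only one forward pass per sample, which is what makes it cheap compared with running an adversarial attack on every input.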

[LG-8] Preference Elicitation for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2406.18450
Authors: Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
Keywords: designing reward functions, Applying reinforcement learning, Applying reinforcement, reward function, real-world problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.

[LG-9] An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors

Link: https://arxiv.org/abs/2406.18445
Authors: Xingfu Wu, Tupendra Oli, Justin H. Qian, Valerie Taylor, Mark C. Hersam, Vinod K. Sangwan
Keywords: Support Vector Machine, Support Vector, Vector Machine, modeling diverse sources, high dimensional data
Categories: Machine Learning (cs.LG); Performance (cs.PF)

Abstract:Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high dimensional data, and its flexibility in modeling diverse sources of data. In this paper, we propose an autotuning-based optimization framework to quantify the ranges of hyperparameters in SVMs to identify their optimal choices, and apply the framework to two SVMs with the mixed-kernel between Sigmoid and Gaussian kernels for smart pixel datasets in high energy physics (HEP) and mixed-kernel heterojunction transistors (MKH). Our experimental results show that the optimal selection of hyperparameters in the SVMs and the kernels varies greatly across applications and datasets, and selecting optimal values is critical for high classification accuracy of the mixed-kernel SVMs. Uninformed choices of hyperparameters C and coef0 in the mixed-kernel SVMs result in severely low accuracy, and the proposed framework effectively quantifies the proper ranges for the hyperparameters in the SVMs to identify their optimal choices, achieving the highest accuracy of 94.6% for the HEP application and the highest average accuracy of 97.2% with far less tuning time for the MKH application.
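
To make the idea of a mixed Sigmoid and Gaussian kernel concrete, here is a small sketch. The abstract does not state how the two kernels are combined, so the convex combination controlled by `mix` is an assumption of this example, as are the function names and default values; `gamma` and `coef0` follow the usual meaning of those hyperparameters.

```python
import math


def sigmoid_kernel(x, y, gamma, coef0):
    """tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * sum(a * b for a, b in zip(x, y)) + coef0)


def gaussian_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mixed_kernel(x, y, gamma=0.5, coef0=0.0, mix=0.5):
    """Assumed form: convex combination of the Sigmoid and Gaussian kernels."""
    return (mix * sigmoid_kernel(x, y, gamma, coef0)
            + (1.0 - mix) * gaussian_kernel(x, y, gamma))


def gram_matrix(points, **kwargs):
    """Pairwise kernel values for a list of points."""
    return [[mixed_kernel(a, b, **kwargs) for b in points] for a in points]


points = [[0.0, 0.0], [1.0, 2.0], [0.5, -1.0]]
for row in gram_matrix(points, gamma=0.5, coef0=0.0, mix=0.5):
    print([round(v, 4) for v in row])
```

A Gram matrix built this way can be handed to an SVM implementation that accepts precomputed kernels, which leaves `C`, `gamma`, `coef0`, and `mix` as the quantities an autotuner would search over.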

[LG-10] Graph Neural Networks for Emulation of Finite-Element Ice Dynamics in Greenland and Antarctic Ice Sheets

Link: https://arxiv.org/abs/2406.18423
Authors: Younghyun Koo, Maryam Rahnemoonfar
Keywords: partial differential equations, provide accurate solutions, solve partial differential, models provide accurate, intensified computational demands
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 6 pages, 2 figures, submitted to the ICML 2024 Workshop on Machine Learning for Earth System Modeling

Abstract:Although numerical models provide accurate solutions for ice sheet dynamics based on physics laws, they come with heavy computational demands to solve partial differential equations. In recent years, convolutional neural networks (CNNs) have been widely used as statistical emulators for those numerical models. However, since CNNs operate on regular grids, they cannot represent the refined meshes and computational efficiency of finite-element numerical models. Therefore, instead of CNNs, this study adopts an equivariant graph convolutional network (EGCN) as an emulator for the ice sheet dynamics modeling. EGCN reproduces ice thickness and velocity changes in the Helheim Glacier, Greenland, and Pine Island Glacier, Antarctica, with 260 times and 44 times faster computation, respectively. Compared to the traditional CNN and graph convolutional network, EGCN shows outstanding accuracy in thickness prediction near fast ice streams by preserving the equivariance to the translation and rotation of graphs.

[LG-11] Mixture of Experts in a Mixture of RL settings

Link: https://arxiv.org/abs/2406.18420
Authors: Timon Willi, Johan Obando-Ceron, Jakob Foerster, Karolina Dziugaite, Pablo Samuel Castro
Keywords: Mixtures of Experts, enhanced inference efficiency, supervised learning due, Deep Reinforcement Learning, boost Deep Reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network’s parameter count while reducing dormant neurons, thereby enhancing the model’s learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs’ ability to deal with non-stationarity and investigate MoEs in DRL settings with “amplified” non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes for the beneficial effect of MoE in DRL training, the impact of the various MoE components, and insights into how best to incorporate them in actor-critic-based DRL networks. Finally, we also confirm results from previous work.

[LG-12] Differential error feedback for communication-efficient decentralized learning

Link: https://arxiv.org/abs/2406.18418
Authors: Roula Nassif, Stefan Vlaski, Marco Carpentiero, Vincenzo Matta, Ali H. Sayed
Keywords: local updates coupled, Communication-constrained algorithms, compressed signals, rely on local, local updates
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: arXiv admin note: text overlap with arXiv:2209.07821

Abstract:Communication-constrained algorithms for decentralized learning and optimization rely on local updates coupled with the exchange of compressed signals. In this context, differential quantization is an effective technique to mitigate the negative impact of compression by leveraging correlations between successive iterates. In addition, the use of error feedback, which consists of incorporating the compression error into subsequent steps, is a powerful mechanism to compensate for the bias caused by the compression. Under error feedback, performance guarantees in the literature have so far focused on algorithms employing a fusion center or a special class of contractive compressors that cannot be implemented with a finite number of bits. In this work, we propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback. The approach is specifically tailored for decentralized learning problems where agents have individual risk functions to minimize subject to subspace constraints that require the minimizers across the network to lie in low-dimensional subspaces. This constrained formulation includes consensus or single-task optimization as special cases, and allows for more general task relatedness models such as multitask smoothness and coupled optimization. We show that, under some general conditions on the compression noise, and for sufficiently small step-sizes $\mu$, the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate: by reducing $\mu$, it is possible to keep the estimation errors small (on the order of $\mu$) without increasing indefinitely the bit rate as $\mu \rightarrow 0$. The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
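
As a rough illustration of the first building block, differential quantization, the sketch below encodes each scalar iterate as an integer number of quantization steps relative to the value the receiver has already reconstructed. It is not the paper's algorithm: the uniform quantizer, the scalar setting, and the function name are assumptions of this example, and the error-feedback mechanism that the paper combines with it is deliberately not modelled.

```python
def differential_quantize(iterates, step):
    """Encode each iterate as an integer multiple of `step` relative to the
    receiver's current reconstruction, and track that reconstruction."""
    reconstruction = 0.0
    symbols, reconstructions = [], []
    for x in iterates:
        k = round((x - reconstruction) / step)  # integer symbol that is transmitted
        reconstruction += k * step              # receiver applies the same update
        symbols.append(k)
        reconstructions.append(reconstruction)
    return symbols, reconstructions


iterates = [1.0, 1.03, 1.12, 1.18, 1.21]
symbols, recon = differential_quantize(iterates, step=0.1)
print(symbols)
print([round(r, 2) for r in recon])
```

Because successive iterates are close, the transmitted integers stay small after the first one, while the reconstruction error never exceeds half a quantization step.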

[LG-13] Towards diffusion models for large-scale sea-ice modelling

Link: https://arxiv.org/abs/2406.18417
Authors: Tobias Sebastian Finn, Charlotte Durand, Alban Farchi, Marc Bocquet, Julien Brajard
Keywords: Arctic-wide sea-ice states, latent diffusion models, multivariate and Arctic-wide, diffusion models, latent diffusion
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 21 pages, 5 figures, Accepted at the ICML 2024 Machine Learning for Earth System Modeling workshop

Abstract:We make the first steps towards diffusion models for unconditional generation of multivariate and Arctic-wide sea-ice states. While aiming to reduce the computational costs by diffusion in latent space, latent diffusion models also offer the possibility to integrate physical knowledge into the generation process. We tailor latent diffusion models to sea-ice physics with a censored Gaussian distribution in data space to generate data that follows the physical bounds of the modelled variables. Our latent diffusion models reach similar scores as the diffusion model trained in data space, but they smooth the generated fields, an effect caused by the latent mapping. While enforcing physical bounds cannot reduce the smoothing, it improves the representation of the marginal ice zone. Therefore, for large-scale Earth system modelling, latent diffusion models can have many advantages compared to diffusion in data space if the significant barrier of smoothing can be resolved.
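
A censored Gaussian places point masses at the physical bounds and keeps the usual density in between. The sketch below shows sampling and the log-likelihood for a scalar variable; the bounds 0 and 1 (as for a concentration-like variable), the parameter values, and the function names are illustrative assumptions, not the parametrization used in the paper.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_gaussian_logpdf(y, mu, sigma, lower, upper):
    """Log-likelihood: point masses at the bounds, Gaussian density inside."""
    if y <= lower:
        return math.log(normal_cdf((lower - mu) / sigma))
    if y >= upper:
        return math.log(1.0 - normal_cdf((upper - mu) / sigma))
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def sample_censored(mu, sigma, lower, upper, rng):
    """Draw from the latent Gaussian and clip to the physical bounds."""
    return min(upper, max(lower, rng.gauss(mu, sigma)))


rng = random.Random(0)
samples = [sample_censored(0.2, 0.5, 0.0, 1.0, rng) for _ in range(5)]
print([round(s, 3) for s in samples])
print(censored_gaussian_logpdf(0.0, 0.0, 1.0, 0.0, 1.0))
```

Every sample respects the bounds by construction, which is the property the abstract relies on for generating physically valid fields.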

[LG-14] Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

Link: https://arxiv.org/abs/2406.18400
Authors: Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam
Keywords: Large Language Models, Large Language, Language Models, capacity to store, store and recall
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.

[LG-15] DoubleTake: Geometry Guided Depth Estimation

Link: https://arxiv.org/abs/2406.18387
Authors: Mohamed Sayed, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Guillermo Garcia-Hernando, Gabriel Brostow, Sara Vicente, Michael Firman
Keywords: posed RGB images, computer vision task, fundamental computer vision, posed RGB, RGB images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning etc. Prior work typically makes use of previous frames in a multi view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes and it is more regularized when compared to individual predicted depth maps for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.

[LG-16] Adversarial Search Engine Optimization for Large Language Models

Link: https://arxiv.org/abs/2406.18382
Authors: Fredrik Nestaas, Edoardo Debenedetti, Florian Tramèr
Keywords: Large Language Models, Large Language, Language Models, model selects, Preference Manipulation Attacks
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are increasingly used in applications where the model selects from competing third-party content, such as in LLM-powered search engines or chatbot plugins. In this paper, we introduce Preference Manipulation Attacks, a new class of attacks that manipulate an LLM’s selections to favor the attacker. We demonstrate that carefully crafted website content or plugin documentation can trick an LLM into promoting the attacker’s products and discrediting competitors, thereby increasing user traffic and monetization. We show this leads to a prisoner’s dilemma, where all parties are incentivized to launch attacks, but the collective effect degrades the LLM’s outputs for everyone. We demonstrate our attacks on production LLM search engines (Bing and Perplexity) and plugin APIs (for GPT-4 and Claude). As LLMs are increasingly used to rank third-party content, we expect Preference Manipulation Attacks to emerge as a significant threat.

[LG-17] KAGNNs: Kolmogorov-Arnold Networks meet Graph Learning

Link: https://arxiv.org/abs/2406.18380
Authors: Roman Bresson, Giannis Nikolentzos, George Panagopoulos, Michail Chatzianastasis, Jun Pang, Michalis Vazirgiannis
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, recent years, facto tool
Categories: Machine Learning (cs.LG)

Abstract:In recent years, Graph Neural Networks (GNNs) have become the de facto tool for learning node and graph representations. Most GNNs typically consist of a sequence of neighborhood aggregation (a.k.a., message passing) layers. Within each of these layers, the representation of each node is updated from an aggregation and transformation of its neighbours’ representations at the previous layer. The upper bound for the expressive power of message passing GNNs was reached through the use of MLPs as a transformation, due to their universal approximation capabilities. However, MLPs suffer from well-known limitations, which recently motivated the introduction of Kolmogorov-Arnold Networks (KANs). KANs rely on the Kolmogorov-Arnold representation theorem, rendering them a promising alternative to MLPs. In this work, we compare the performance of KANs against that of MLPs in graph learning tasks. We perform extensive experiments on node classification, graph classification and graph regression datasets. Our preliminary results indicate that while KANs are on par with MLPs in classification tasks, they seem to have a clear advantage in the graph regression tasks.

[LG-18] Kolmogorov-Arnold Graph Neural Networks

Link: https://arxiv.org/abs/2406.18354
Authors: Gianluca De Carlo, Andrea Mastropietro, Aris Anagnostopoulos
Keywords: Graph neural networks, domains requiring transparent, requiring transparent decision-making, Graph Kolmogorov-Arnold Network, excel in learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, under review

Abstract:Graph neural networks (GNNs) excel in learning from network-like data but often lack interpretability, making their application challenging in domains requiring transparent decision-making. We propose the Graph Kolmogorov-Arnold Network (GKAN), a novel GNN model leveraging spline-based activation functions on edges to enhance both accuracy and interpretability. Our experiments on five benchmark datasets demonstrate that GKAN outperforms state-of-the-art GNN models in node classification, link prediction, and graph classification tasks. In addition to the improved accuracy, GKAN’s design inherently provides clear insights into the model’s decision-making process, eliminating the need for post-hoc explainability techniques. This paper discusses the methodology, performance, and interpretability of GKAN, highlighting its potential for applications in domains where interpretability is crucial.

[LG-19] Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control

Link: https://arxiv.org/abs/2406.18351
Authors: Zifan Liu, Xinran Li, Shibo Chen, Gen Li, Jiashuo Jiang, Jun Zhang
Keywords: inventory control, well-performed and general-purpose, online experience, sample efficiency, Reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) has proven to perform well and to be general-purpose in inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded by two limitations of online experience. First, online experience is expensive to acquire in real-world applications. Given the low sample efficiency of RL algorithms, it would take extensive time to train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost-sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a decision framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first take advantage of the inherent properties of lost-sales IC problems and design the feedback graph (FG) specifically for lost-sales IC problems to generate abundant side experiences that aid RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Based on the theoretical insights, we design an intrinsic reward to direct the RL agent to explore the parts of the state-action space with more side experiences, further exploiting FG’s power. Experimental results demonstrate that our method greatly improves the sample efficiency of applying RL in IC. Our code is available at https://anonymous.4open.science/r/RLIMFG4IC-811D/

[LG-20] EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition

Link: https://arxiv.org/abs/2406.18345
Authors: Yi Ding, Chengxuan Tong, Shuailei Zhang, Muyun Jiang, Yong Li, Kevin Lim Jun Liang, Cuntai Guan
Keywords: Integrating prior knowledge, neural network architecture, network architecture enhances, Integrating prior, prior knowledge
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 11 pages, 5 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a novel transformer model called emotion transformer (EmT). EmT is designed to excel in both generalized cross-subject EEG emotion classification and regression tasks. In EmT, EEG signals are transformed into a temporal graph format, creating a sequence of EEG feature graphs using a temporal graph construction module (TGC). A novel residual multi-view pyramid GCN module (RMPG) is then proposed to learn dynamic graph representations for each EEG feature graph within the series, and the learned representations of each graph are fused into one token. Furthermore, we design a temporal contextual transformer module (TCT) with two types of token mixers to learn the temporal contextual information. Finally, the task-specific output module (TSO) generates the desired outputs. Experiments on four publicly available datasets show that EmT achieves higher results than the baseline methods for both EEG emotion classification and regression tasks. The code is available at this https URL.

[LG-21] Efficient and Accurate Explanation Estimation with Distribution Compression

Link: https://arxiv.org/abs/2406.18334
Authors: Hubert Baniecki, Giuseppe Casalicchio, Bernd Bischl, Przemyslaw Biecek
Keywords: Exact computation, requires numerous model, learning explanations requires, explanations requires numerous, machine learning explanations
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: To be presented at the ICML 2024 Workshop on DMLR

Abstract:Exact computation of various machine learning explanations requires numerous model evaluations and in extreme cases becomes impractical. The computational cost of approximation increases with an ever-increasing size of data and model parameters. Many heuristics have been proposed to approximate post-hoc explanations efficiently. This paper shows that the standard i.i.d. sampling used in a broad spectrum of algorithms for explanation estimation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm for more efficient and accurate explanation estimation. CTE uses distribution compression through kernel thinning to obtain a data sample that best approximates the marginal distribution. We show that CTE improves the estimation of removal-based local and global explanations with negligible computational overhead. It often achieves an on-par explanation approximation error using 2-3x fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.

[LG-22] Early Classification of Time Series: Taxonomy and Benchmark

Link: https://arxiv.org/abs/2406.18332
Authors: Aurélien Renault, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire
Keywords: time series, time penalty, provided sequentially, cost of misclassification, phenomenon are provided
Categories: Machine Learning (cs.LG)

Abstract:In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the-art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see this https URL).
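
The trade-off at the heart of ECTS, waiting lowers the misclassification risk but raises the time penalty, can be shown with a toy cost calculation. The linear delay cost, the cost values, and the error probabilities below are hypothetical numbers for illustration; the methods benchmarked in the paper decide online and do not have access to such a table in advance.

```python
def best_trigger_time(error_prob_by_time, misclassification_cost, delay_cost_per_step):
    """Pick the time step minimizing expected misclassification cost plus delay cost."""
    costs = [
        p * misclassification_cost + t * delay_cost_per_step
        for t, p in enumerate(error_prob_by_time)
    ]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


# Hypothetical error probabilities that shrink as more of the series is observed.
error_probs = [0.5, 0.3, 0.2, 0.18]
trigger, cost = best_trigger_time(error_probs, misclassification_cost=10.0,
                                  delay_cost_per_step=0.5)
print(trigger, round(cost, 3))
```

Raising `delay_cost_per_step` pushes the optimal trigger earlier, while raising `misclassification_cost` pushes it later.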

[LG-23] Molecular Diffusion Models with Virtual Receptors

Link: https://arxiv.org/abs/2406.18330
Authors: Matan Halfon, Eyal Rozenberg, Ehud Rivlin, Daniel Freedman
Keywords: Structure-Based Drug Design, Machine learning approaches, Drug Design, Machine learning, proven quite fertile
Categories: Machine Learning (cs.LG)

Abstract:Machine learning approaches to Structure-Based Drug Design (SBDD) have proven quite fertile over the last few years. In particular, diffusion-based approaches to SBDD have shown great promise. We present a technique which expands on this diffusion approach in two crucial ways. First, we address the size disparity between the drug molecule and the target/receptor, which makes learning more challenging and inference slower. We do so through the notion of a Virtual Receptor, which is a compressed version of the receptor; it is learned so as to preserve key aspects of the structural information of the original receptor, while respecting the relevant group equivariance. Second, we incorporate a protein language embedding used originally in the context of protein folding. We experimentally demonstrate the contributions of both the virtual receptors and the protein embeddings: in practice, they lead to both better performance, as well as significantly faster computations.

[LG-24] PDFA Distillation via String Probability Queries

Link: https://arxiv.org/abs/2406.18328
Authors: Robert Baumgartner, Sicco Verwer
Keywords: Probabilistic deterministic finite, deterministic finite automata, discrete event systems, event systems modeling, Probabilistic deterministic
Categories: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: LearnAUT 2024

Abstract:Probabilistic deterministic finite automata (PDFA) are discrete event systems modeling conditional probabilities over languages: Given an already seen sequence of tokens they return the probability of tokens of interest to appear next. These types of models have gained interest in the domain of explainable machine learning, where they are used as surrogate models for neural networks trained as language models. In this work we present an algorithm to distill PDFA from neural networks. Our algorithm is a derivative of the L# algorithm and capable of learning PDFA from a new type of query, in which the algorithm infers conditional probabilities from the probability of the queried string to occur. We show its effectiveness on a recent public dataset by distilling PDFA from a set of trained neural networks.
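
One way to read "inferring conditional probabilities from the probability of the queried string" is as a ratio of two string-probability queries, P(token | prefix) = P(prefix + token) / P(prefix). The sketch below demonstrates that identity on a toy model; the `TOY` table, the function names, and the interpretation of a query as a prefix probability are assumptions of this example and not a description of the paper's algorithm.

```python
# Toy language model: the next-token distribution depends only on the last token.
# The empty string key holds the distribution of the first token.
TOY = {
    "": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}


def string_probability(tokens):
    """Probability that a sequence starts with `tokens` under the toy model."""
    prob, last = 1.0, ""
    for tok in tokens:
        prob *= TOY[last][tok]
        last = tok
    return prob


def conditional_from_queries(prefix, token, query=string_probability):
    """Recover P(token | prefix) from two string-probability queries."""
    denom = query(prefix)
    if denom == 0.0:
        raise ValueError("prefix has zero probability")
    return query(prefix + [token]) / denom


print(conditional_from_queries(["a", "b"], "a"))
```

Passing the query function as a parameter keeps the sketch agnostic about where the probabilities come from, for example a trained neural language model instead of the toy table.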

[LG-25] ContactNet: Geometric-Based Deep Learning Model for Predicting Protein-Protein Interactions

Link: https://arxiv.org/abs/2406.18314
Authors: Matan Halfon,Tomer Cohen,Raanan Fattal,Dina Schneidman-Duhovny
Keywords: Deep learning approaches, Multiple Sequence Alignment, Deep learning, predicting protein structures, require Multiple Sequence
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Deep learning approaches achieved significant progress in predicting protein structures. These methods are often applied to protein-protein interactions (PPIs) yet require Multiple Sequence Alignment (MSA) which is unavailable for various interactions, such as antibody-antigen. Computational docking methods are capable of sampling accurate complex models, but also produce thousands of invalid configurations. The design of scoring functions for identifying accurate models is a long-standing challenge. We develop a novel attention-based Graph Neural Network (GNN), ContactNet, for classifying PPI models obtained from docking algorithms into accurate and incorrect ones. When trained on docked antigen and modeled antibody structures, ContactNet doubles the accuracy of current state-of-the-art scoring functions, achieving accurate models among its Top-10 at 43% of the test cases. When applied to unbound antibodies, its Top-10 accuracy increases to 65%. This performance is achieved without MSA and the approach is applicable to other types of interactions, such as host-pathogens or general PPIs.

[LG-26] Online Learning of Multiple Tasks and Their Relationships: Testing on Spam Email Data and EEG Signals Recorded in Construction Fields

Link: https://arxiv.org/abs/2406.18311
Authors: Yixin Jin,Wenjing Zhou,Meiqi Wang,Meng Li,Xintao Li,Tianyu Hu,Xingyuan Bu
Keywords: online multi-task learning, processes data sequentially, multi-task learning, paper examines, examines an online
Categories: Machine Learning (cs.LG)

Abstract:This paper examines an online multi-task learning (OMTL) method, which processes data sequentially to predict labels across related tasks. The framework learns task weights and their relatedness concurrently. Unlike previous models that assumed static task relatedness, our approach treats tasks as initially independent, updating their relatedness iteratively using newly calculated weight vectors. We introduced three rules to update the task relatedness matrix: OMTLCOV, OMTLLOG, and OMTLVON, and compared them against a conventional method (CMTL) that uses a fixed relatedness value. Performance evaluations on three datasets (a spam dataset and two EEG datasets from construction workers under varying conditions) demonstrated that our OMTL methods outperform CMTL, improving accuracy by 1% to 3% on EEG data, and maintaining low error rates around 12% on the spam dataset.

[LG-27] Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution

Link: https://arxiv.org/abs/2406.18310
Authors: Wenting Chen,Jie Liu,Tommy W.S. Chow,Yixuan Yuan
Keywords: accurately interpreting lesion, interpreting lesion cells, acquiring high-resolution digital, high-resolution digital slides, digital slides requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted to IEEE TRANSACTIONS ON MEDICAL IMAGING (TMI)

Abstract:Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of a pathology image, leading to sub-optimal recovery due to the large variation across pathology images. In this paper, we propose the first hierarchical reinforcement learning framework, named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly for addressing the aforementioned issues in the pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt a hierarchical recovery mechanism at the patch level to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processing problem. Under the guidance of the spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL improves tumor diagnosis by a large margin and shows generalizability under various degradations. The source code is available at this https URL.

[LG-28] Automated Immunophenotyping Assessment for Diagnosing Childhood Acute Leukemia using Set-Transformers

Link: https://arxiv.org/abs/2406.18309
Authors: Elpiniki Maria Lygizou,Michael Reiter,Margarita Maurer-Granofszky,Michael Dworzak,Radu Grosu
Keywords: common hematologic malignancy, Multiparameter Flow Cytometry, children and adolescents, Acute Leukemia, common hematologic
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: The paper has been accepted at IEEE EMBS 2024 (46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society)

Abstract:Acute Leukemia is the most common hematologic malignancy in children and adolescents. A key methodology in the diagnostic evaluation of this malignancy is immunophenotyping based on Multiparameter Flow Cytometry (FCM). However, this approach is manual, and thus time-consuming and subjective. To alleviate this situation, we propose in this paper the FCM-Former, a machine learning, self-attention based FCM-diagnostic tool, automating the immunophenotyping assessment in Childhood Acute Leukemia. The FCM-Former is trained in a supervised manner, by directly using flow cytometric data. Our FCM-Former achieves an accuracy of 96.5% in assigning lineage to each sample among 960 cases of acute B-cell or T-cell lymphoblastic leukemia or acute myeloid leukemia (B-ALL, T-ALL, AML). To the best of our knowledge, the FCM-Former is the first work that automates the immunophenotyping assessment with FCM data in diagnosing pediatric Acute Leukemia.

[LG-29] Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI

Link: https://arxiv.org/abs/2406.18295
Authors: Nikolaos Dionelis,Casper Fibaek,Luke Camilleri,Andreas Luyts,Jente Bosmans,Bertrand Le Saux
Keywords: Foundation Models, Computer Vision application, Models, Foundation, prescribed high performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, Submitted

Abstract:When we are primarily interested in solving several problems jointly with a given prescribed high performance accuracy for each target application, then Foundation Models should for most cases be used rather than problem-specific models. We focus on the specific Computer Vision application of Foundation Models for Earth Observation (EO) and geospatial AI. These models can solve important problems we are tackling, including for example land cover classification, crop type mapping, flood segmentation, building density estimation, and road regression segmentation. In this paper, we show that for a limited number of labelled data, Foundation Models achieve improved performance compared to problem-specific models. In this work, we also present our proposed evaluation benchmark for Foundation Models for EO. Benchmarking the generalization performance of Foundation Models is important as it has become difficult to standardize a fair comparison across the many different models that have been proposed recently. We present the results using our evaluation benchmark for EO Foundation Models and show that Foundation Models are label efficient in the downstream tasks and help us solve problems we are tackling in EO and remote sensing.

[LG-30] Combining Automated Optimisation of Hyperparameters and Reward Shape

Link: https://arxiv.org/abs/2406.18293
Authors: Julian Dierkes,Emma Cramer,Holger H. Hoos,Sebastian Trimpe
Keywords: deep reinforcement learning, reinforcement learning, recent years, significant progress, progress in deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in the Reinforcement Learning Journal 2024

Abstract:There has been significant progress in deep reinforcement learning (RL) in recent years. Nevertheless, finding suitable hyperparameter configurations and reward functions remains challenging even for experts, and performance heavily relies on these design choices. Also, most RL research is conducted on known benchmarks where knowledge about these choices already exists. However, novel practical applications often pose complex tasks for which no prior knowledge about good hyperparameters and reward functions is available, thus necessitating their derivation from scratch. Prior work has examined automatically tuning either hyperparameters or reward functions individually. We demonstrate empirically that an RL algorithm’s hyperparameter configurations and reward function are often mutually dependent, meaning neither can be fully optimised without appropriate values for the other. We then propose a methodology for the combined optimisation of hyperparameters and the reward function. Furthermore, we include a variance penalty as an optimisation objective to improve the stability of learned policies. We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic on four environments. Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others, with only a minor increase in computational costs. This suggests that combined optimisation should be best practice.

[LG-31] CAS: Confidence Assessments of classification algorithms for Semantic segmentation of EO data

Link: https://arxiv.org/abs/2406.18279
Authors: Nikolaos Dionelis,Nicolas Longepe
Keywords: semantic segmentation, Confidence assessments, remote sensing, Confidence, semantic segmentation algorithms
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 7 figures, 4 tables, Submitted

Abstract:Confidence assessments of semantic segmentation algorithms in remote sensing are important. It is a desirable property of models to a priori know if they produce an incorrect output. Evaluations of the confidence assigned to the estimates of models for the task of classification in Earth Observation (EO) are crucial as they can be used to achieve improved semantic segmentation performance and prevent high error rates during inference and deployment. The model we develop, the Confidence Assessments of classification algorithms for Semantic segmentation (CAS) model, performs confidence evaluations at both the segment and pixel levels, and outputs both labels and confidence. The outcome of this work has important applications. The main application is the evaluation of EO Foundation Models on semantic segmentation downstream tasks, in particular land cover classification using satellite Copernicus Sentinel-2 data. The evaluation shows that the proposed model is effective and outperforms other alternative baseline models.

[LG-32] Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation

Link: https://arxiv.org/abs/2406.18249
Authors: Hamideh Kerdegari,Kyle Higgins,Dennis Veselkov,Ivan Laponogov,Inese Polaka,Miguel Coimbra,Junior Andrea Pescino,Marcis Leja,Mario Dinis-Ribeiro,Tania Fleitas Kanonnikoff,Kirill Veselkov
Keywords: managing upper gastrointestinal, global cancer mortality, medical diagnostics represents, artificial intelligence, upper gastrointestinal
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The integration of artificial intelligence (AI) in medical diagnostics represents a significant advancement in managing upper gastrointestinal (GI) cancer, a major cause of global cancer mortality. Specifically for gastric cancer (GC), chronic inflammation causes changes in the mucosa such as atrophy, intestinal metaplasia (IM), dysplasia and ultimately cancer. Early detection through regular endoscopic surveillance is essential for better outcomes. Foundation models (FM), which are machine or deep learning models trained on diverse data and applicable to broad use cases, offer a promising solution to enhance the accuracy of endoscopy and its subsequent pathology image analysis. This review explores the recent advancements, applications, and challenges associated with FM in endoscopy and pathology imaging. We start by elucidating the core principles and architectures underlying these models, including their training methodologies and the pivotal role of large-scale data in developing their predictive capabilities. Moreover, this work discusses emerging trends and future research directions, emphasizing the integration of multimodal data, the development of more robust and equitable models, and the potential for real-time diagnostic support. This review aims to provide a roadmap for researchers and practitioners in navigating the complexities of incorporating FM into clinical practice for prevention/management of GC cases, thereby improving patient outcomes.

[LG-33] Guiding Video Prediction with Explicit Procedural Knowledge

Link: https://arxiv.org/abs/2406.18220
Authors: Patrick Takenaka,Johannes Maucher,Marco F. Huber
Keywords: deep learning models, integrate procedural knowledge, propose a general, video prediction, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Abstract:We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models and show that this leads to a better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance to a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.

[LG-34] A Closer Look into Mixture-of-Experts in Large Language Models

Link: https://arxiv.org/abs/2406.18219
Authors: Ka Man Lo,Zeyu Huang,Zihan Qiu,Zili Wang,Jie Fu
Keywords: gaining increasing attention, increasing attention due, gaining increasing, increasing attention, attention due
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at this https URL.
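To make "sparsely activating a subset of parameters for each token" concrete, here is a minimal top-k routing sketch. It is a generic illustration and not the code of any model studied in the paper: the scalar "experts", the function names, and the choice to renormalise the softmax over only the selected experts are all assumptions made for the example (some implementations apply the softmax over all experts before selecting).

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k experts with the largest router logits.

    Returns (expert_index, weight) pairs; the weights are a softmax over
    the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x: float,
                experts: Sequence[Callable[[float], float]],
                router_logits: Sequence[float],
                k: int = 2) -> float:
    """Evaluate only the routed experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical scalar experts standing in for feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
print(top_k_route(logits, k=2))
print(moe_forward(3.0, experts, logits, k=2))
```

Only `k` of the four experts run per input, which is why parameter count can grow without a matching growth in per-token compute.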

[LG-35] Selective Prompting Tuning for Personalized Conversations with LLMs

Link: https://arxiv.org/abs/2406.18187
Authors: Qiushi Huang,Xubo Liu,Tom Ko,Bo Wu,Wenwu Wang,Yu Zhang,Lilian Tang
Keywords: understanding is essential, profiles and contextual, contextual understanding, SPT, persona profiles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2024 findings

Abstract:In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models’ (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose Selective Prompt Tuning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (this https URL) is publicly available for further exploration.

[LG-36] DeepExtremeCubes: Integrating Earth system spatio-temporal data for impact assessment of climate extremes

Link: https://arxiv.org/abs/2406.18179
Authors: Chaonan Ji,Tonio Fincke,Vitus Benson,Gustau Camps-Valls,Miguel-Angel Fernandez-Torres,Fabian Gans,Guido Kraemer,Francesco Martinuzzi,David Montero,Karin Mora,Oscar J. Pellicer-Valero,Claire Robin,Maximilian Soechting,Melanie Weynants,Miguel D. Mahecha
Keywords: robust analytical tools, climate extremes’ rising, extremes’ rising frequency, frequency and intensity, robust analytical
Categories: Machine Learning (cs.LG); Databases (cs.DB)

Abstract:With climate extremes’ rising frequency and intensity, robust analytical tools are crucial to predict their impacts on terrestrial ecosystems. Machine learning techniques show promise but require well-structured, high-quality, and curated analysis-ready datasets. Earth observation datasets comprehensively monitor ecosystem dynamics and responses to climatic extremes, yet the data complexity can challenge the effectiveness of machine learning models. Despite recent progress in deep learning to ecosystem monitoring, there is a need for datasets specifically designed to analyse compound heatwave and drought extreme impact. Here, we introduce the DeepExtremeCubes database, tailored to map around these extremes, focusing on persistent natural vegetation. It comprises over 40,000 spatially sampled small data cubes (i.e. minicubes) globally, with a spatial coverage of 2.5 by 2.5 km. Each minicube includes (i) Sentinel-2 L2A images, (ii) ERA5-Land variables and generated extreme event cube covering 2016 to 2022, and (iii) ancillary land cover and topography maps. The paper aims to (1) streamline data accessibility, structuring, pre-processing, and enhance scientific reproducibility, and (2) facilitate biosphere dynamics forecasting in response to compound extremes.

[LG-37] NeBuLa: A discourse aware Minecraft Builder

Link: https://arxiv.org/abs/2406.18164
Authors: Akshay Chaturvedi,Kate Thompson,Nicholas Asher
Keywords: humans efficiently exploit, humans efficiently, engaging in collaborative, efficiently exploit, exploit the semantic
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures

Abstract:When engaging in collaborative tasks, humans efficiently exploit the semantic structure of a conversation to optimize verbal and nonverbal interactions. But in recent “language to code” or “language to action” models, this information is lacking. We show how incorporating the prior discourse and nonlinguistic context of a conversation situated in a nonlinguistic environment can improve the “language to action” component of such interactions. We fine-tune an LLM to predict actions based on prior context; our model, NeBuLa, doubles the net-action F1 score over the baseline on this task of Jayannavar et al. (2020). We also investigate our model’s ability to construct shapes and understand location descriptions using a synthetic dataset.

[LG-38] FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization

Link: https://arxiv.org/abs/2406.18156
Authors: Linping Qu,Shenghui Song,Chi-Ying Tsui
Keywords: clients’ data privacy, protecting clients’ data, powerful machine learning, machine learning paradigm, data privacy
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Federated learning (FL) is a powerful machine learning paradigm which leverages the data as well as the computational resources of clients, while protecting clients’ data privacy. However, the substantial model size and frequent aggregation between the server and clients result in significant communication overhead, making it challenging to deploy FL in resource-limited wireless networks. In this work, we aim to mitigate the communication overhead by using quantization. Previous research on quantization has primarily focused on the uplink communication, employing either fixed-bit quantization or adaptive quantization methods. In this work, we introduce a holistic approach based on joint uplink and downlink adaptive quantization to reduce the communication overhead. In particular, we optimize the learning convergence by determining the optimal uplink and downlink quantization bit-length under a communication energy constraint. Theoretical analysis shows that the optimal quantization levels depend on the range of model gradients or weights. Based on this insight, we propose a decreasing-trend quantization for the uplink and an increasing-trend quantization for the downlink, which aligns with the change of the model parameters during the training process. Experimental results show that the proposed joint uplink and downlink adaptive quantization strategy can save up to 66.7% energy compared with the existing schemes.
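As a rough illustration of range-dependent quantization with a per-round bit-length, the sketch below pairs a plain uniform quantizer with linear bit schedules that decrease for the uplink and increase for the downlink. The paper derives its bit-lengths from an optimization under an energy constraint; the linear schedule, the function names, and the 8-bit and 4-bit endpoints here are placeholders assumed only for the example.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], bits: int) -> List[float]:
    """Uniformly quantize `values` onto 2**bits levels spanning [min, max].

    The step size scales with the value range, so a wider range needs
    more bits to reach the same resolution.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1          # number of intervals between levels
    step = (hi - lo) / levels
    return [lo + round((v - lo) / step) * step for v in values]


def linear_bits(round_idx: int, total_rounds: int,
                start_bits: int, end_bits: int) -> int:
    """Hypothetical schedule: interpolate the bit-length across rounds."""
    if total_rounds <= 1:
        return start_bits
    frac = round_idx / (total_rounds - 1)
    return round(start_bits + frac * (end_bits - start_bits))


rounds = 5
uplink_bits = [linear_bits(t, rounds, 8, 4) for t in range(rounds)]    # decreasing trend
downlink_bits = [linear_bits(t, rounds, 4, 8) for t in range(rounds)]  # increasing trend
print(uplink_bits, downlink_bits)
print(quantize([0.0, 0.3, 1.0], bits=1))
```

With this quantizer the worst-case rounding error is half a step, i.e. `(max - min) / (2 * (2**bits - 1))`, which is one way to see why the value range matters when choosing the bit-length.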

[LG-39] Beyond Statistical Estimation: Differentially Private Individual Computation in the Shuffle Model

Link: https://arxiv.org/abs/2406.18145
Authors: Shaowei Wang,Changyu Dong,Di Wang,Xiangfu Song
Keywords: fully trustable parties, trustable parties, recently emerged, fully trustable, shuffle model
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The shuffle model of differential privacy (DP) has recently emerged as a powerful one for decentralized computation without fully trustable parties. Since it anonymizes and permutes messages from clients through a shuffler, the privacy can be amplified and utility can be improved. However, the shuffling procedure in turn restricts its applications only to statistical tasks that are permutation-invariant. This work explores the feasibility of shuffle privacy amplification for prevalent non-statistical computations: spatial crowdsourcing, combinatorial optimization, location-based social systems, and federated learning with incentives, which suffer from either computational intractability or intolerable utility loss in existing approaches (e.g., secure MPC and local DP). We propose a new paradigm of the shuffle model that can provide critical security functionalities like message authorization and result access control, while maintaining most of the privacy amplification effects. It incurs almost the same computation/communication costs as the non-private setting, and permits the server to run arbitrary algorithms on (noisy) client information in plaintext. Our novel technique introduces statistically random identity into DP and forces an identical random distribution on all clients, so as to support secure functionalities even after message shuffling and to maintain privacy amplification simultaneously. Given that existing DP randomizers fail in the new shuffle model, we also propose a new mechanism and prove its optimality therein. Experimental results on spatial crowdsourcing, location-based social systems, and federated learning with incentives show that our paradigm and mechanism are as fast as non-private settings, while reducing error by up to 90% and increasing utility performance indicators by 100%-300% relatively, and can be practical under a reasonable privacy budget.
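For readers new to the setting, the classic shuffle pipeline can be sketched as: each client randomizes locally, a shuffler permutes the messages, and the server computes a statistic that does not depend on message order. The sketch below shows that baseline only, using binary randomized response; it is not the new paradigm or mechanism proposed in the paper, and all function names and parameter values are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def shuffle_messages(messages: Sequence[int], rng: random.Random) -> List[int]:
    """The shuffler: output a uniformly random permutation of the messages."""
    shuffled = list(messages)
    rng.shuffle(shuffled)
    return shuffled


def estimate_mean(reports: Sequence[int], epsilon: float) -> float:
    """Debias the observed mean; the result ignores the order of reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
shuffled = shuffle_messages(reports, rng)
print(estimate_mean(shuffled, epsilon=1.0))
```

The estimate is identical before and after shuffling, which is the permutation invariance the abstract refers to; a task that needs to know which client sent which message cannot be computed from `shuffled` alone.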

[LG-40] Sequential Disentanglement by Extracting Static Information From A Single Sequence Element

Link: https://arxiv.org/abs/2406.18131
Authors: Nimrod Berman,Ilan Naiman,Idan Arbiv,Gal Fadlon,Omri Azencot
Keywords: unsupervised sequential disentanglement, fundamental representation learning, single static factor, sequential disentanglement, unsupervised sequential
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2024; The first four authors contributed equally

Abstract:One of the fundamental representation learning tasks is unsupervised sequential disentanglement, where latent codes of inputs are decomposed to a single static factor and a sequence of dynamic factors. To extract this latent information, existing methods condition the static and dynamic codes on the entire input sequence. Unfortunately, these models often suffer from information leakage, i.e., the dynamic vectors encode both static and dynamic information, or vice versa, leading to a non-disentangled representation. Attempts to alleviate this problem via reducing the dynamic dimension and auxiliary loss terms gain only partial success. Instead, we propose a novel and simple architecture that mitigates information leakage by offering a simple and effective subtraction inductive bias while conditioning on a single sample. Remarkably, the resulting variational framework is simpler in terms of required loss terms, hyperparameters, and data augmentation. We evaluate our method on multiple data-modality benchmarks including general time series, video, and audio, and we show beyond state-of-the-art results on generation and prediction tasks in comparison to several strong baselines.

[LG-41] CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection

Link: https://arxiv.org/abs/2406.18129
Authors: Meiying Zhang,Weiyuan Peng,Guangyao Ding,Chenyang Lei,Chunlin Ji,Qi Hao
Keywords: including object detection, object detection, object detection algorithms, expected to improve, cross-domain object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been developed to address cross-domain tasks between real-world datasets, progress in sim-to-real remains limited. This paper presents a novel Complex-to-Simple (CTS) framework to transfer models from labeled simulation (source) to unlabeled reality (target) domains. Based on a two-stage detector, the novelty of this work is threefold: 1) developing fixed-size anchor heads and RoI augmentation to address size bias and feature diversity between two domains, thereby improving the quality of pseudo-labels; 2) developing a novel corner-format representation of aleatoric uncertainty (AU) for the bounding box, to uniformly quantify pseudo-label quality; 3) developing a noise-aware mean teacher domain adaptation method based on AU, as well as object-level and frame-level sampling strategies, to mitigate the impact of noisy labels. Experimental results demonstrate that our proposed approach significantly enhances the sim-to-real domain adaptation capability of 3D object detection models, outperforming state-of-the-art cross-domain algorithms, which are usually developed for real-to-real UDA tasks.

[LG-42] ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models

Link: https://arxiv.org/abs/2406.18125
Authors: Ahmed Heakl,Youssef Mohamed,Noran Mohamed,Ali Sharkaway,Ahmed Zaky
Keywords: recruitment platforms coupled, resume classification methods, efficient resume classification, increasing reliance, platforms coupled
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 1 table, 6th International Conference on AI in Computational Linguistics

Abstract:The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92% and a top-5 accuracy of 97.5%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
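The top-1 and top-5 figures quoted above are instances of top-k accuracy: a prediction counts as correct when the true class is among the k highest-scoring classes. The sketch below shows the standard computation on made-up scores; the data and the function name are illustrative assumptions and have nothing to do with the paper's dataset or models.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up class scores for three samples over three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

Top-k accuracy can only stay the same or rise as k grows, which is why a top-5 score is always at least the top-1 score.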

[LG-43] ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Link: https://arxiv.org/abs/2406.18120
Authors: Ahmed Heakl,Youssef Zaghloul,Mennatullah Ali,Rania Hossam,Walid Gomaa
Keywords: Egyptian Arabic, Egyptian Arabic recognition, translating code-switched Egyptian, code-switched Egyptian Arabic-English, automatic speech recognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 5 tables, 6th International Conference on AI in Computational Linguistics

Abstract:Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLaMA and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state-of-the-art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is essential for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: this http URL, Models: this http URL.

[LG-44] Robust personnel rostering: how accurate should absenteeism predictions be?

Link: https://arxiv.org/abs/2406.18119
Authors: Martina Doneda, Pieter Smet, Giuliana Carello, Ettore Lanzarone, Greet Vanden Berghe
Keywords: employees’ working hours, necessitate last-minute adjustments, Disruptions to personnel, personnel rosters caused, working hours
Subjects: Machine Learning (cs.LG)

Abstract:Disruptions to personnel rosters caused by absenteeism often necessitate last-minute adjustments to the employees’ working hours. A common strategy to mitigate the impact of such changes is to assign employees to reserve shifts: special on-call duties during which an employee can be called in to cover for an absent employee. To maximize roster robustness, we assume a predict-then-optimize approach that uses absence predictions from a machine learning model to schedule an adequate number of reserve shifts. In this paper we propose a methodology to evaluate the robustness of rosters generated by the predict-then-optimize approach, assuming the machine learning model will make predictions at a predetermined prediction performance level. Instead of training and testing machine learning models, our methodology simulates the predictions based on a characterization of model performance. We show how this methodology can be applied to identify the minimum performance level needed for the model to outperform simple non-data-driven robust rostering policies. In a computational study on a nurse rostering problem, we demonstrate how the predict-then-optimize approach outperforms non-data-driven policies under reasonable performance requirements, particularly when employees possess interchangeable skills.

[LG-45] Token-Weighted RNN-T for Learning from Flawed Data

Link: https://arxiv.org/abs/2406.18108
Authors: Gil Keren, Wei Zhou, Ozlem Kalinli
Keywords: ASR models, target token sequence, models are commonly, commonly trained, increase the probability
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights. The new objective is used for mitigating accuracy loss from transcription errors in the training data, which naturally appear in two settings: pseudo-labeling and human annotation errors. Experimental results show that using our method for semi-supervised learning with pseudo-labels leads to a consistent accuracy improvement, up to 38% relative. We also analyze the accuracy degradation resulting from different levels of WER in the reference transcription, and show that token-weighted RNN-T is suitable for overcoming this degradation, recovering 64%-99% of the accuracy loss.

[LG-46] Learning Optimal Filters Using Variational Inference

Link: https://arxiv.org/abs/2406.18066
Authors: Enoch Luk, Eviatar Bach, Ricardo Baptista, Andrew Stuart
Keywords: Filtering-the task, observations-is important, science and engineering, including weather, climate prediction
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Filtering (the task of estimating the conditional distribution of states of a dynamical system given partial, noisy observations) is important in many areas of science and engineering, including weather and climate prediction. However, the filtering distribution is generally intractable to obtain for high-dimensional, nonlinear systems. Filters used in practice, such as the ensemble Kalman filter (EnKF), are biased for nonlinear systems and have numerous tuning parameters. Here, we present a framework for learning a parameterized analysis map (the map that takes a forecast distribution and observations to the filtering distribution) using variational inference. We show that this methodology can be used to learn gain matrices for filtering linear and nonlinear dynamical systems, as well as inflation and localization parameters for an EnKF. Future work will apply this framework to learn new filtering algorithms.

[LG-47] Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents

Link: https://arxiv.org/abs/2406.18062
Authors: Chung-En Sun, Sicun Gao, Tsui-Wei Weng
Keywords: deep reinforcement learning, randomized smoothing emerging, reinforcement learning, enhancing this attribute, remains a paramount
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in ICML 2024

Abstract:Robustness remains a paramount concern in deep reinforcement learning (DRL), with randomized smoothing emerging as a key technique for enhancing this attribute. However, a notable gap exists in the performance of current smoothed DRL agents, often characterized by significantly low clean rewards and weak robustness. In response to this challenge, our study introduces innovative algorithms aimed at training effective smoothed robust DRL agents. We propose S-DQN and S-PPO, novel approaches that demonstrate remarkable improvements in clean rewards, empirical robustness, and robustness guarantee across standard RL benchmarks. Notably, our S-DQN and S-PPO agents not only significantly outperform existing smoothed agents by an average factor of 2.16x under the strongest attack, but also surpass previous robustly-trained agents by an average factor of 2.13x. This represents a significant leap forward in the field. Furthermore, we introduce Smoothed Attack, which is 1.89x more effective in decreasing the rewards of smoothed agents than existing adversarial attacks.

[LG-48] AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Link: https://arxiv.org/abs/2406.18060
Authors: Yifan Yang, Kai Zhen, Ershad Banijamal, Athanasios Mouchtaris, Zheng Zhang
Keywords: natural language processing, Fine-tuning large language, language processing tasks, large language models, achieved remarkable performance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on RoBERTa-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
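
The forward-pass-only idea behind zeroth-order fine-tuning can be illustrated with a generic two-point (SPSA-style) gradient estimator. This is a sketch under stated assumptions, not the AdaZeta or MeZO implementation: the function name `zo_gradient`, the parameters `eps` and `num_queries`, and the toy quadratic loss are all hypothetical, and the fixed `num_queries` stands in for whatever query schedule a real method would use.

```python
import random


def zo_gradient(loss, theta, eps=1e-3, num_queries=1, rng=None):
    """Estimate the gradient of `loss` at `theta` using only loss evaluations.

    Each query draws a random Gaussian direction z and uses the central
    difference (loss(theta + eps*z) - loss(theta - eps*z)) / (2*eps) as the
    directional derivative along z. Averaging over queries reduces variance.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(num_queries):
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss([t + eps * zi for t, zi in zip(theta, z)])
        minus = loss([t - eps * zi for t, zi in zip(theta, z)])
        scale = (plus - minus) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / num_queries
    return grad


def quadratic(theta):
    """Toy loss whose exact gradient is 2 * theta."""
    return sum(t * t for t in theta)


theta = [1.0, -2.0, 0.5]
true_grad = [2.0 * t for t in theta]
estimate = zo_gradient(quadratic, theta, num_queries=5000)
print(estimate, true_grad)
```

No backpropagation graph is needed because only loss values are used. The cost is variance: a single query gives a very noisy estimate, which is why the number of queries per step matters for convergence.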

[LG-49] Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies

Link: https://arxiv.org/abs/2406.18053
Authors: Yu Luo, Fuchun Sun, Tianying Ji, Xianyuan Zhan
Keywords: Hierarchical reinforcement learning, addresses complex long-horizon, reinforcement learning, addresses complex, skillfully decomposing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Hierarchical reinforcement learning (HRL) addresses complex long-horizon tasks by skillfully decomposing them into subgoals. Therefore, the effectiveness of HRL is greatly influenced by subgoal reachability. Typical HRL methods only consider subgoal reachability from the unilateral level, where a dominant level enforces compliance to the subordinate level. However, we observe that when the dominant level becomes trapped in local exploration or generates unattainable subgoals, the subordinate level is negatively affected and cannot follow the dominant level’s actions. This can potentially make both levels stuck in local optima, ultimately hindering subsequent subgoal reachability. Allowing real-time bilateral information sharing and error correction would be a natural cure for this issue, which motivates us to propose a mutual response mechanism. Based on this, we propose the Bidirectional-reachable Hierarchical Policy Optimization (BrHPO), a simple yet effective algorithm that is also computationally efficient. Experimental results on a variety of long-horizon tasks show that BrHPO outperforms other state-of-the-art HRL baselines, with significantly higher exploration efficiency and robustness.

[LG-50] Multimodal foundation world models for generalist embodied agents

Link: https://arxiv.org/abs/2406.18043
Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar
Keywords: solve multitudes, Learning, models, foundation, tasks
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Learning generalist embodied agents, able to solve multitudes of tasks in different domains, is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain’s dynamics, and learn the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.

[LG-51] MT2ST: Adaptive Multi-Task to Single-Task Learning

Link: https://arxiv.org/abs/2406.18038
Authors: Dong Liu, Meng Jiang
Keywords: face challenges, challenges in balancing, balancing the breadth, STL, conventional training approaches
Subjects: Machine Learning (cs.LG)

Abstract:Conventional training approaches often face challenges in balancing the breadth of multi-task learning (MTL) with the depth of single-task learning (STL). To address this issue, we introduce the Multi-Task to Single-Task (MT2ST) framework, a groundbreaking approach that can combine the generalizability of MTL with the precision of STL. Our work includes two strategies: ‘Diminish’ and ‘Switch’. The ‘Diminish’ strategy gradually reduces the influence of auxiliary tasks, while the ‘Switch’ strategy involves a shift from multi-tasking to single-tasking at a specific timepoint in the training process. The framework significantly enhances the efficiency and accuracy of word embedding training while concurrently addressing prevalent issues such as overfitting. Our empirical studies demonstrate that MT2ST can reduce training time by 67% when contrasted with single-task learning approaches, and by 13% compared to traditional multi-task learning methods. These findings underscore MT2ST’s potential to be a powerful tool for word embedding training acceleration.
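
The two strategies named in the abstract can be pictured as schedules for the weight placed on auxiliary-task losses. The sketch below is only one plausible reading: the exponential decay form, the parameter names `decay_rate` and `switch_step`, and the additive loss combination are assumptions for illustration, since the abstract does not give the actual formulas.

```python
import math


def diminish_weight(step, decay_rate=0.01):
    """'Diminish'-style schedule: auxiliary weight decays smoothly from 1 toward 0.

    The exponential form is an assumption; the abstract only says the
    influence of auxiliary tasks is gradually reduced.
    """
    return math.exp(-decay_rate * step)


def switch_weight(step, switch_step=100):
    """'Switch'-style schedule: full multi-task training, then primary task only."""
    return 1.0 if step < switch_step else 0.0


def combined_loss(primary_loss, auxiliary_losses, aux_weight):
    """Primary loss plus the weighted sum of auxiliary-task losses."""
    return primary_loss + aux_weight * sum(auxiliary_losses)


for step in (0, 50, 100, 200):
    print(step, round(diminish_weight(step), 3), switch_weight(step))
```

With either schedule, training ends as effectively single-task: the auxiliary terms contribute nothing once the weight reaches (or approaches) zero.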

[LG-52] Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

Link: https://arxiv.org/abs/2406.18035
Authors: Yaoyu Zhang, Leyang Zhang, Zhongwang Zhang, Zhiwei Bai
Keywords: reliably recover target, recover target functions, deep learning, deep neural network, Determining whether deep
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2211.11623

Abstract:Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term “local linear recovery” (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios.

[LG-53] Boosting Soft Q-Learning by Bounding

Link: https://arxiv.org/abs/2406.18033
Authors: Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni
Keywords: leverage past experience, solving new tasks, agent ability, ability to leverage, leverage past
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: To appear in the 1st Reinforcement Learning Conference

Abstract:An agent’s ability to leverage past experience is critical for efficiently solving new tasks. Prior work has focused on using value function estimates to obtain zero-shot approximations for solutions to a new task. In soft Q-learning, we show how any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance which we validate experimentally. Notably, we find that the proposed framework suggests an alternative method for updating the Q-function, leading to boosted performance.

[LG-54] AutoOPE: Automated Off-Policy Estimator Selection

Link: https://arxiv.org/abs/2406.18022
Authors: Nicolò Felicioni, Michael Benigni, Maurizio Ferrari Dacrema
Keywords: Off-Policy Evaluation, OPE, consists of evaluating, data collected, counterfactual policies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies with data collected by another one. This problem is of utmost importance for various application domains, e.g., recommendation systems, medical treatments, and many others. To solve the OPE problem, we resort to estimators, which aim to estimate as accurately as possible the performance that the counterfactual policies would have had if they were deployed in place of the logging policy. In the literature, several estimators have been developed, all with different characteristics and theoretical guarantees. Therefore, there is no dominant estimator, and each estimator may be the best one for different OPE problems, depending on the characteristics of the dataset at hand. While the selection of the estimator is a crucial choice for an accurate OPE, this problem has been widely overlooked in the literature. We propose an automated data-driven OPE estimator selection method based on machine learning. In particular, the core idea we propose in this paper is to create several synthetic OPE tasks and use a machine learning model trained to predict the best estimator for those synthetic tasks. We empirically show how our method is able to generalize to unseen tasks and make a better estimator selection compared to a baseline method on several real-world datasets, with a computational cost significantly lower than that of the baseline.
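
For readers unfamiliar with OPE estimators, inverse propensity scoring (IPS) is one standard example of the kind of estimator such a selection method would choose among. The sketch below is a generic IPS implementation, not the paper's code; the log format, the function names, and the toy two-action bandit are assumptions for illustration.

```python
def ips_estimate(logged, target_policy):
    """Inverse propensity scoring estimate of a target policy's expected reward.

    `logged` is a list of (context, action, reward, logging_prob) tuples, where
    logging_prob is the probability the logging policy assigned to the action
    it took. `target_policy(context, action)` returns the probability the
    evaluated (counterfactual) policy assigns to that action.
    """
    total = 0.0
    for context, action, reward, logging_prob in logged:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logged)


def always_action_one(context, action):
    """Hypothetical deterministic target policy that always picks action 1."""
    return 1.0 if action == 1 else 0.0


# Logs from a uniform-random logging policy over two actions;
# action 1 always pays reward 1.0 and action 0 pays 0.0.
logs = [
    ("ctx", 0, 0.0, 0.5),
    ("ctx", 1, 1.0, 0.5),
    ("ctx", 1, 1.0, 0.5),
    ("ctx", 0, 0.0, 0.5),
]
value = ips_estimate(logs, always_action_one)
print(value)
```

IPS is unbiased when the logging probabilities are correct and cover every action the target policy can take, but its variance grows when the two policies differ strongly. That trade-off is one reason no single estimator dominates.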

[LG-55] SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR

Link: https://arxiv.org/abs/2406.18021
Authors: Shuaishuai Ye, Shunfei Chen, Xinhui Hu, Xinkang Xu
Keywords: Connectionist Temporal Classification, Temporal Classification, Connectionist Temporal, correspond to Mandarin, automatic speech recognition
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by InterSpeech 2024; 5 pages, 2 figures

Abstract:In this work, we propose a Switch-Conformer-based MoE system named SC-MoE for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR). We design a streaming MoE layer consisting of three language experts, which correspond to Mandarin, English, and blank, respectively, and equip it with a language identification (LID) network with a Connectionist Temporal Classification (CTC) loss as a router in the encoder of SC-MoE to achieve a real-time streaming CS ASR system. To further utilize the language information embedded in text, we also incorporate MoE layers into the decoder of SC-MoE. In addition, we introduce routers into every MoE layer of the encoder and the decoder and achieve better recognition performance. Experimental results show that SC-MoE significantly improves CS ASR performance over the baseline with comparable computational efficiency.

[LG-56] MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

Link: https://arxiv.org/abs/2406.18020
Authors: Muzhen Cai, Sendong Zhao, Haochun Wang, Yanrui Du, Zewen Qiang, Bing Qin, Ting Liu
Keywords: Artificial Intelligence predicts, Intelligence predicts drug, Artificial Intelligence, predicts drug properties, Intelligence predicts
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 8 pages, 5 figures

Abstract:Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus, exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities only use molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose a multi-granularity fusion method, MolFusion. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations, and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.

[LG-57] Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models

Link: https://arxiv.org/abs/2406.17990
Authors: Vikas Yadav, Hyuk Joon Kwon, Vijay Srinivasan, Hongxia Jin
Keywords: Question Answer Generation, Answer Generation, question answering systems, Question Answer, explicit diversity conditions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at COLING 2024

Abstract:Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need for explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA task over existing widely adopted implicit diversity techniques. In particular, QA pairs generated under explicit diversity conditions, when used to train the downstream QA model, yield an average improvement of 4.1% in exact match and 4.5% in F1 over QAG from implicit sampling techniques on SQuADDU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
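
The exact match (EM) and F1 numbers above are the usual extractive-QA metrics. A simplified sketch of how they are typically computed is shown below; it only lowercases and splits on whitespace, which is an assumption made for brevity (standard evaluation scripts also strip punctuation and articles), and the function names are illustrative.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(token_f1("the city of Paris", "Paris"))
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, so the two metrics can move by different amounts for the same model change.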

[LG-58] Learning Neural Networks with Sparse Activations

Link: https://arxiv.org/abs/2406.17989
Authors: Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka
Keywords: fully connected layers, successful neural network, core component present, neural network architectures, MLP block
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Proceedings of the 37th Conference on Learning Theory (COLT 2024), 20 pages

Abstract:A core component present in many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of dynamic activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this, we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of sparsely activated networks would lead to methods that can exploit activation sparsity in practice.

[LG-59] Inherent Challenges of Post-Hoc Membership Inference for Large Language Models

Link: https://arxiv.org/abs/2406.17975
Authors: Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye
Keywords: Large Language Models, Large Language, Membership Inference Attacks, Language Models, Inference Attacks
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are often trained on vast amounts of undisclosed data, motivating the development of post-hoc Membership Inference Attacks (MIAs) to gain insight into their training data composition. However, in this paper, we identify inherent challenges in post-hoc MIA evaluation due to potential distribution shifts between collected member and non-member datasets. Using a simple bag-of-words classifier, we demonstrate that datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases achieving near-perfect distinction between members and non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. However, we note that such randomized setups are rarely available for the latest LLMs, making post-hoc data collection still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach for post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance barely above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges in accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.

[LG-60] LABOR-LLM: Language-Based Occupational Representations with Large Language Models

Link: https://arxiv.org/abs/2406.17972
Authors: Tianyu Du, Ayush Kanodia, Herman Brunborg, Keyon Vafa, Susan Athey
Keywords: carefully constructed longitudinal, labor market questions, market questions rely, constructed longitudinal survey, longitudinal survey datasets
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Econometrics (econ.EM)

Abstract:Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. (2024) introduced a transformer-based “foundation model”, CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models’ predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions.

[LG-61] Efficient Document Ranking with Learnable Late Interactions

Link: https://arxiv.org/abs/2406.17968
Authors: Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
Keywords: information retrieval, fundamental approaches, LITE, document token embeddings, models
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
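
To make the notion of a hand-crafted late-interaction scorer concrete, the sketch below implements the ColBERT-style MaxSim rule: for each query token, take its best-matching document token and sum those maxima. It illustrates the baseline that a learnable scorer would replace and is not an implementation of LITE; the toy 2-dimensional embeddings are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the maximum similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy unit-norm token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8]]                          # single token, partial match only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Note that the scorer needs every document token embedding at query time, which is the storage and latency burden the abstract refers to.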

[LG-62] Empowering Interdisciplinary Insights with Dynamic Graph Embedding Trajectories

Link: https://arxiv.org/abs/2406.17963
Authors: Yiqiao Jin, Andrew Zhao, Yeon-Chang Lee, Meng Ye, Ajay Divakaran, Srijan Kumar
Keywords: diverse real-world systems, dynamic graphs, real-world systems, dynamic, graphs
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: 25 pages, 11 figures

Abstract:We developed DyGETViz, a novel framework for effectively visualizing dynamic graphs (DGs) that are ubiquitous across diverse real-world systems. This framework leverages recent advancements in discrete-time dynamic graph (DTDG) models to adeptly handle the temporal dynamics inherent in dynamic graphs. DyGETViz effectively captures both micro- and macro-level structural shifts within these graphs, offering a robust method for representing complex and massive dynamic graphs. The application of DyGETViz extends to a diverse array of domains, including ethology, epidemiology, finance, genetics, linguistics, communication studies, social studies, and international relations. Through its implementation, DyGETViz has revealed or confirmed various critical insights. These include the diversity of content sharing patterns and the degree of specialization within online communities, the chronological evolution of lexicons across decades, and the distinct trajectories exhibited by aging-related and non-related genes. Importantly, DyGETViz enhances the accessibility of scientific findings to non-domain experts by simplifying the complexities of dynamic graphs. Our framework is released as an open-source Python package for use across diverse disciplines. Our work not only addresses the ongoing challenges in visualizing and analyzing DTDG models but also establishes a foundational framework for future investigations into dynamic graph representation and analysis across various disciplines.

[LG-63] Why Line Search when you can Plane Search? SO-Friendly Neural Networks allow Per-Iteration Optimization of Learning and Momentum Rates for Every Layer

Link: https://arxiv.org/abs/2406.17954
Authors: Betty Shea, Mark Schmidt
Keywords: practice including networks, number of outputs, number of inputs, SO-friendly neural networks, introduce the class
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We introduce the class of SO-friendly neural networks, which includes several models used in practice, such as networks with 2 layers of hidden weights where the number of inputs is larger than the number of outputs. SO-friendly networks have the property that performing a precise line search to set the step size on each iteration has the same asymptotic cost during full-batch training as using a fixed learning rate. Further, for the same cost a plane search can be used to set both the learning rate and the momentum rate on each step. Even further, SO-friendly networks also allow us to use subspace optimization to set a learning rate and momentum rate for each layer on each iteration. We explore augmenting gradient descent as well as quasi-Newton methods and Adam with line optimization and subspace optimization, and our experiments indicate that this gives fast and reliable ways to train these networks that are insensitive to hyper-parameters.
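
To make the difference between a line search and a plane search concrete, here is a minimal sketch on a toy quadratic objective. The quadratic, the matrix `A`, the vector `b`, and all function names are assumptions chosen for illustration; this is not an SO-friendly network and not the authors' code. The line search picks only the step size along the gradient, while the plane search picks the step size and the momentum coefficient jointly by solving a 2x2 linear system.

```python
# Toy objective (an assumption for illustration): f(w) = 0.5 * w^T A w - b^T w.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def objective(A, b, w):
    return 0.5 * dot(w, matvec(A, w)) - dot(b, w)

def gradient(A, b, w):
    return [ax - bi for ax, bi in zip(matvec(A, w), b)]

def line_search_step(A, b, w):
    """Gradient step whose step size exactly minimizes f along -g."""
    g = gradient(A, b, w)
    gg = dot(g, g)
    if gg == 0.0:  # already stationary
        return list(w)
    alpha = gg / dot(g, matvec(A, g))
    return [wi - alpha * gi for wi, gi in zip(w, g)]

def plane_search_step(A, b, w, m):
    """Minimize f(w - alpha * g + beta * m) jointly over (alpha, beta)."""
    g = gradient(A, b, w)
    Ag, Am = matvec(A, g), matvec(A, m)
    gAg, gAm, mAm = dot(g, Ag), dot(g, Am), dot(m, Am)
    gg, mg = dot(g, g), dot(m, g)
    det = gAg * mAm - gAm * gAm
    if det <= 1e-12 * gAg * mAm:  # no usable second direction: use a line search
        return line_search_step(A, b, w)
    # Closed-form solution of the 2x2 stationarity system (Cramer's rule).
    alpha = (gg * mAm - gAm * mg) / det
    beta = (gAm * gg - gAg * mg) / det
    return [wi - alpha * gi + beta * mi for wi, gi, mi in zip(w, g, m)]

def run(A, b, steps, use_plane):
    w = [0.0] * len(b)
    m = [0.0] * len(b)  # previous update, used as the momentum direction
    for _ in range(steps):
        if use_plane:
            w_new = plane_search_step(A, b, w, m)
        else:
            w_new = line_search_step(A, b, w)
        m = [n - o for n, o in zip(w_new, w)]
        w = w_new
    return objective(A, b, w)

A = [[1.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 100.0]]
b = [1.0, 1.0, 1.0]
f_line = run(A, b, 3, use_plane=False)
f_plane = run(A, b, 3, use_plane=True)
print("line search:", f_line, "plane search:", f_plane)
```

On this quadratic the plane search can never do worse per step than the line search, because setting the momentum coefficient to zero recovers the line search. The abstract's claim is about the cost of such searches in SO-friendly networks, which this toy does not model.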

[LG-64] LINSCAN – A Linearity Based Clustering Algorithm

Link: https://arxiv.org/abs/2406.17952
Authors: Andrew Dennehy, Xiaoyu Zou, Shabnam J. Semnani, Yuri Fialko, Alexander Cloninger
Keywords: Kullback Leibler Divergence, DBSCAN and OPTICS, identifying clusters, seek lineated clusters, lineated clusters
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG)

Abstract:DBSCAN and OPTICS are powerful algorithms for identifying clusters of points in domains where few assumptions can be made about the structure of the data. In this paper, we leverage these strengths and introduce a new algorithm, LINSCAN, designed to seek lineated clusters that are difficult to find and isolate with existing methods. In particular, by embedding points as normal distributions approximating their local neighborhoods and leveraging a distance function derived from the Kullback Leibler Divergence, LINSCAN can detect and distinguish lineated clusters that are spatially close but have orthogonal covariances. We demonstrate how LINSCAN can be applied to seismic data to identify active faults, including intersecting faults, and determine their orientation. Finally, we discuss the properties a generalization of DBSCAN and OPTICS must have in order to retain the stability benefits of these algorithms.
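
The idea of comparing points through the Gaussians that approximate their neighbourhoods can be sketched with the closed-form KL divergence between two 2-D normal distributions. The symmetrised sum used below and the example covariances are assumptions for illustration; the abstract only says the distance is derived from the Kullback-Leibler divergence, so the paper's exact distance function may differ.

```python
import math

def inv2(S):
    """Inverse and determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def kl_gauss(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) for 2-D Gaussians."""
    S1_inv, det1 = inv2(S1)
    _, det0 = inv2(S0)
    # trace(S1^{-1} S0)
    trace = sum(S1_inv[i][j] * S0[j][i] for i in range(2) for j in range(2))
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Mahalanobis term (m1 - m0)^T S1^{-1} (m1 - m0)
    maha = sum(diff[i] * S1_inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return 0.5 * (trace + maha - 2 + math.log(det1 / det0))

def sym_kl(m0, S0, m1, S1):
    """Symmetrised KL divergence (an assumed choice of distance)."""
    return kl_gauss(m0, S0, m1, S1) + kl_gauss(m1, S1, m0, S0)

centre = [0.0, 0.0]
horizontal = [[1.0, 0.0], [0.0, 0.01]]  # neighbourhood elongated along x
vertical = [[0.01, 0.0], [0.0, 1.0]]    # neighbourhood elongated along y

d_same = sym_kl(centre, horizontal, centre, horizontal)
d_cross = sym_kl(centre, horizontal, centre, vertical)
print(d_same, d_cross)
```

Both neighbourhoods share the same centre, so a purely Euclidean distance between them is zero, yet the divergence between the two orientations is large. That is the property the abstract relies on to separate spatially close clusters with orthogonal covariances.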

[LG-65] Navigating High-Degree Heterogeneity: Federated Learning in Aerial and Space Networks

Link: https://arxiv.org/abs/2406.17951
Authors: Fan Dong, Henry Leung, Steve Drew
Keywords: utilizing vast private, vast private edge, computing capabilities accessible, private edge data, Federated learning offers
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Federated learning offers a compelling solution to the challenges of networking and data privacy within aerial and space networks by utilizing vast private edge data and computing capabilities accessible through drones, balloons, and satellites. While current research has focused on optimizing the learning process, computing efficiency, and minimizing communication overhead, the issue of heterogeneity and class imbalance remains a significant barrier to rapid model convergence. In our study, we explore the influence of heterogeneity on class imbalance, which diminishes performance in ASN-based federated learning. We illustrate the correlation between heterogeneity and class imbalance within grouped data and show how constraints such as battery life exacerbate the class imbalance challenge. Our findings indicate that ASN-based FL faces heightened class imbalance issues even with similar levels of heterogeneity compared to other scenarios. Finally, we analyze the impact of varying degrees of heterogeneity on FL training and evaluate the efficacy of current state-of-the-art algorithms under these conditions. Our results reveal that the heterogeneity challenge is more pronounced in ASN-based federated learning and that prevailing algorithms often fail to effectively address high levels of heterogeneity.

[LG-66] The Overcooked Generalisation Challenge

Link: https://arxiv.org/abs/2406.17949
Authors: Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling
Keywords: agents’ zero-shot cooperation, zero-shot cooperation abilities, Overcooked Generalisation Challenge, study agents’ zero-shot, capture generalisation abilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages

Abstract:We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents’ zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license: this https URL. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.

[LG-67] Hot-Distance: Combining One-Hot and Signed Distance Embeddings for Segmentation

Link: https://arxiv.org/abs/2406.17936
Authors: Marwan Zouinkhi, Jeff L. Rhoades, Aubrey V. Weigel
Keywords: Machine learning models, Machine learning, Machine, learning models, data
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: 3 pages, 1 figure, in progress

Abstract:Machine learning models are only as good as the data to which they are fit. As such, it is always preferable to use as much data as possible in training models. What data can be used for fitting a model depends a lot on the formulation of the task. We introduce Hot-Distance, a novel segmentation target that incorporates the strength of signed boundary distance prediction with the flexibility of one-hot encoding, to increase the amount of usable training data for segmentation of subcellular structures in focused ion beam scanning electron microscopy (FIB-SEM).

[LG-68] CAT: Interpretable Concept-based Taylor Additive Models

Link: https://arxiv.org/abs/2406.17931
Authors: Viet Duong, Qiong Wu, Zhengyi Zhou, Hongjue Zhao, Chenxiang Luo, Eric Zavesky, Huaxiu Yao, Huajie Shao
Keywords: adopt neural networks, Generalized Additive Models, Taylor Neural Network, Generalized Additive, emerging interpretable technique
Categories: Machine Learning (cs.LG)

Abstract:As an emerging interpretable technique, Generalized Additive Models (GAMs) adopt neural networks to individually learn non-linear functions for each feature, which are then combined through a linear model for final predictions. Although GAMs can explain deep neural networks (DNNs) at the feature level, they require large numbers of model parameters and are prone to overfitting, making them hard to train and scale. Additionally, in real-world datasets with many features, the interpretability of feature-based explanations diminishes for humans. To tackle these issues, recent research has shifted towards concept-based interpretable methods. These approaches try to integrate concept learning as an intermediate step before making predictions, explaining the predictions in terms of human-understandable concepts. However, these methods require domain experts to extensively label concepts with relevant names and their ground-truth values. In response, we propose CAT, a novel interpretable Concept-bAsed Taylor additive model to simplify this process. CAT does not require domain experts to annotate concepts and their ground-truth values. Instead, it only requires users to categorize input features into broad groups, which can be easily accomplished through a quick metadata review. Specifically, CAT first embeds each group of input features into a one-dimensional high-level concept representation, and then feeds the concept representations into a new white-box Taylor Neural Network (TaylorNet). The TaylorNet aims to learn the non-linear relationship between the inputs and outputs using polynomials. Evaluation results across multiple benchmarks demonstrate that CAT can outperform or compete with the baselines while reducing the need for extensive model parameters. Importantly, it can explain model predictions through high-level concepts that humans can understand.

[LG-69] GraphSnapShot: Graph Machine Learning Acceleration with Fast Storage and Retrieval

Link: https://arxiv.org/abs/2406.17918
Authors: Dong Liu, Roger Waleffe, Meng Jiang, Shivaram Venkataraman
Keywords: framework called GraphSnapShot, recent research, graph learning, developed a framework, framework called
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI)

Abstract:In our recent research, we have developed a framework called GraphSnapShot, which has proven to be a useful tool for graph learning acceleration. GraphSnapShot is a framework for fast caching, storage, retrieval and computation for graph learning. It can quickly store and update the local topology of the graph structure and allows us to track patterns in the structure of graph networks, much like taking snapshots of the graphs. In experiments, GraphSnapShot proves efficient: it achieves up to 30% training acceleration and 73% memory reduction for lossless graph ML training compared to current baselines such as dgl. This technique is particularly useful for large dynamic graph learning tasks, such as social media analysis and recommendation systems, that process complex relationships between entities.

[LG-70] Camera Model Identification Using Audio and Visual Content from Videos

Link: https://arxiv.org/abs/2406.17916
Authors: Ioannis Tsingalis, Christos Korgialas, Constantine Kotropoulos
Keywords: multimedia forensic applications, forensic applications, brands and models, models plays, plays a pivotal
Categories: Machine Learning (cs.LG)

Abstract:The identification of device brands and models plays a pivotal role in the realm of multimedia forensic applications. This paper presents a framework capable of identifying devices using audio, visual content, or a fusion of them. The fusion of visual and audio content occurs later by applying two fundamental fusion rules: the product and the sum. The device identification problem is tackled as a classification one by leveraging Convolutional Neural Networks. Experimental evaluation illustrates that the proposed framework exhibits promising classification performance when independently using audio or visual content. Furthermore, although the fusion results don’t consistently surpass both individual modalities, they demonstrate promising potential for enhancing classification performance. Future research could refine the fusion process to improve classification performance in both modalities consistently. Finally, a statistical significance test is performed for a more in-depth study of the classification results.
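
The product and sum fusion rules mentioned above can be sketched in a few lines. The sketch assumes each modality's classifier outputs a probability vector over the same camera classes; the function names and the example probabilities are made up for illustration and do not come from the paper.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]

def fuse(p_audio, p_visual, rule):
    """Late fusion of two per-class probability vectors."""
    if rule == "product":
        fused = [a * v for a, v in zip(p_audio, p_visual)]
    elif rule == "sum":
        fused = [a + v for a, v in zip(p_audio, p_visual)]
    else:
        raise ValueError("rule must be 'product' or 'sum'")
    return normalize(fused)

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical classifier outputs for three camera models.
p_audio = [0.7, 0.3, 0.0]
p_visual = [0.05, 0.35, 0.6]

by_product = fuse(p_audio, p_visual, "product")
by_sum = fuse(p_audio, p_visual, "sum")
print(predict(by_product), predict(by_sum))
```

In this example the two rules disagree: the product rule favours the class both modalities find plausible, while the sum rule follows the single most confident vote. This is one reason fused results need not consistently beat both individual modalities.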

[LG-71] Entity Augmentation for Efficient Classification of Vertically Partitioned Data with Limited Overlap

Link: https://arxiv.org/abs/2406.17899
Authors: Avi Amalanshu, Viswesh Nagaswamy, G. V. S. S. Prudhvi, Yash Sirvi, Debashish Chakravarty
Keywords: Vertical Federated Learning, Vertical Federated, machine learning paradigm, Federated Learning, machine learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: GLOW @ IJCAI 2024 (12 pages + 2 page bibliography. 15 figures.)

Abstract:Vertical Federated Learning (VFL) is a machine learning paradigm for learning from vertically partitioned data (i.e. features for each input are distributed across multiple “guest” clients and an aggregating “host” server owns labels) without communicating raw data. Traditionally, VFL involves an “entity resolution” phase where the host identifies and serializes the unique entities known to all guests. This is followed by private set intersection to find common entities, and an “entity alignment” step to ensure all guests are always processing the same entity’s data. However, using only data of entities from the intersection means guests discard potentially useful data. Besides, the effect on privacy is dubious and these operations are computationally expensive. We propose a novel approach that eliminates the need for set intersection and entity alignment in categorical tasks. Our Entity Augmentation technique generates meaningful labels for activations sent to the host, regardless of their originating entity, enabling efficient VFL without explicit entity alignment. With limited overlap between training data, this approach performs substantially better (e.g. with 5% overlap, 48.1% vs 69.48% test accuracy on CIFAR-10). In fact, thanks to the regularizing effect, our model performs marginally better even with 100% overlap.

[LG-72] Efficient and Effective Implicit Dynamic Graph Neural Network

Link: https://arxiv.org/abs/2406.17894
Authors: Yongjian Zhong, Hieu Vu, Tianbao Yang, Bijaya Adhikari
Keywords: Implicit graph neural, graph neural networks, capture long-range dependencies, graph neural, Dynamic Graph Neural
Categories: Machine Learning (cs.LG)

Abstract:Implicit graph neural networks have gained popularity in recent years as they capture long-range dependencies while improving predictive performance in static graphs. Despite the tussle between performance degradation due to the oversmoothing of learned embeddings and long-range dependency being more pronounced in dynamic graphs, as features are aggregated both across neighborhood and time, no prior work has proposed an implicit graph neural model in a dynamic setting. In this paper, we present Implicit Dynamic Graph Neural Network (IDGNN), a novel implicit neural network for dynamic graphs which is the first of its kind. A key characteristic of IDGNN is that it is demonstrably well-posed, i.e., it is theoretically guaranteed to have a fixed-point representation. We then demonstrate that the standard iterative algorithm often used to train implicit models is computationally expensive in our dynamic setting as it involves computing gradients, which themselves have to be estimated in an iterative manner. To overcome this, we pose an equivalent bilevel optimization problem and propose an efficient single-loop training algorithm that avoids iterative computation by maintaining moving averages of key components of the gradients. We conduct extensive experiments on real-world datasets on both classification and regression tasks to demonstrate the superiority of our approach over the state-of-the-art baselines. We also demonstrate that our bilevel optimization framework maintains the performance of the expensive iterative algorithm while obtaining up to 1600x speed-up.

[LG-73] SigKAN: Signature-Weighted Kolmogorov-Arnold Networks for Time Series

Link: https://arxiv.org/abs/2406.17890
Authors: Hugo Inzirillo, Remi Genet
Keywords: learnable path signatures, learnable path, path signatures, multivariate function approximation, enhances multivariate function
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2405.07344, arXiv:2406.02486

Abstract:We propose a novel approach that enhances multivariate function approximation using learnable path signatures and Kolmogorov-Arnold networks (KANs). We enhance the learning capabilities of these networks by weighting the values obtained by KANs using learnable path signatures, which capture important geometric features of paths. This combination allows for a more comprehensive and flexible representation of sequential and temporal data. We demonstrate through studies that our SigKANs with learnable path signatures perform better than conventional methods across a range of function approximation challenges. By leveraging path signatures in neural networks, this method offers intriguing opportunities to enhance performance in time series analysis and time series forecasting, among other fields.

[LG-74] CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

Link: https://arxiv.org/abs/2406.17888
Authors: Nafis Neehal, Bowen Wang, Shayom Debopadhaya, Soham Dan, Keerthiram Murugesan, Vibha Anand, Kristin P. Bennett
Keywords: assess language models, baseline features, language models, aiding clinical study, benchmark to assess
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models’ ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial’s start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: “CT-Repo,” containing baseline features from 1,690 clinical trials sourced from this http URL, and “CT-Pub,” a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. “ListMatch-LM” and “ListMatch-BERT” use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.

[LG-75] Federated Dynamical Low-Rank Training with Global Loss Convergence Guarantees

Link: https://arxiv.org/abs/2406.17887
Authors: Steffen Schotthöfer, M. Paul Laiu
Keywords: horizontal federated learning, significant performance bottlenecks, federated dynamical low-rank, federated learning, reduce client compute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Abstract:In this work, we propose a federated dynamical low-rank training (FeDLRT) scheme to reduce client compute and communication costs - two significant performance bottlenecks in horizontal federated learning. Our method builds upon dynamical low-rank splitting schemes for manifold-constrained optimization to create a global low-rank basis of network weights, which enables client training on a small coefficient matrix. A consistent global low-rank basis allows us to incorporate a variance correction scheme and prove global loss descent and convergence to a stationary point. Dynamic augmentation and truncation of the low-rank bases automatically optimizes computing and communication resource utilization. We demonstrate the efficiency of FeDLRT in an array of computer vision benchmarks and show a reduction of client compute and communication costs by up to an order of magnitude with minimal impacts on global accuracy.

[LG-76] Enabling Regional Explainability by Automatic and Model-agnostic Rule Extraction

Link: https://arxiv.org/abs/2406.17885
Authors: Yu Chen, Tianyu Cui, Alexander Capstick, Nan Fletcher-Loyd, Payam Barnaghi
Keywords: understanding patterns learned, rule extraction translates, extraction translates model, translates model knowledge, IF-THEN statements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:In Explainable AI, rule extraction translates model knowledge into logical rules, such as IF-THEN statements, crucial for understanding patterns learned by black-box models. This could significantly aid in fields like disease diagnosis, disease progression estimation, or drug discovery. However, such application domains often contain imbalanced data, with the class of interest underrepresented. Existing methods inevitably compromise the performance of rules for the minor class to maximise the overall performance. As the first attempt in this field, we propose a model-agnostic approach for extracting rules from specific subgroups of data, featuring automatic rule generation for numerical features. This method enhances the regional explainability of machine learning models and offers wider applicability compared to existing methods. We additionally introduce a new method for selecting features to compose rules, reducing computational costs in high-dimensional spaces. Experiments across various datasets and models demonstrate the effectiveness of our methods.

[LG-77] ET tu CLIP? Addressing Common Object Errors for Unseen Environments

Link: https://arxiv.org/abs/2406.17876
Authors: Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello
Keywords: enhance model generalization, employs pre-trained CLIP, pre-trained CLIP encoders, ALFRED task, introduce a simple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.

[LG-78] InFiConD: Interactive No-code Fine-tuning with Concept-based Knowledge Distillation

Link: https://arxiv.org/abs/2406.17838
Authors: Jinbin Huang, Wenbin He, Liang Gou, Liu Ren, Chris Bryan
Keywords: limited computational resources, large-scale pre-trained models, downstream tasks, computational resources, Knowledge distillation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:The emergence of large-scale pre-trained models has heightened their application in various downstream tasks, yet deployment is a challenge in environments with limited computational resources. Knowledge distillation has emerged as a solution in such scenarios, whereby knowledge from large teacher models is transferred into smaller student models, but this is a non-trivial process that traditionally requires technical expertise in AI/ML. To address these challenges, this paper presents InFiConD, a novel framework that leverages visual concepts to implement the knowledge distillation process and enable subsequent no-code fine-tuning of student models. We develop a novel knowledge distillation pipeline based on extracting text-aligned visual concepts from a concept corpus using multimodal models, and construct highly interpretable linear student models based on visual concepts that mimic a teacher model in a response-based manner. InFiConD’s interface allows users to interactively fine-tune the student model by manipulating concept influences directly in the user interface. We validate InFiConD via a robust usage scenario and user study. Our findings indicate that InFiConD’s human-in-the-loop and visualization-driven approach enables users to effectively create and analyze student models, understand how knowledge is transferred, and efficiently perform fine-tuning operations. We discuss how this work highlights the potential of interactive and visual methods in making knowledge distillation and subsequent no-code fine-tuning more accessible and adaptable to a wider range of users with domain-specific demands.

[LG-79] Transformer Normalisation Layers and the Independence of Semantic Subspaces

Link: https://arxiv.org/abs/2406.17837
Authors: Stephen Menary, Samuel Kaski, Andre Freitas
Keywords: solve contextual reasoning, contextual reasoning tasks, internally executing computational, executing computational graphs, computational graphs called
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the placement of normalisation layer used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the L_2-norms of the query/key/value vectors, predicting a phenomenon of circuit collapse when sparse-attention shifts to a different token. Empirically, we investigate the sensitivity of real-world models trained for mathematical addition, observing a 1% rate of circuit collapse when the norms are artificially perturbed by ≲ 10%. We contrast Pre-Norm with QKV-Norm, which places normalisation after the attention head’s linear operators. Theoretically this relaxes the representational constraints. Empirically we observe comparable in-distribution but worse out-of-distribution performance.
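
The interference through a common normalisation factor can be seen in a small numerical sketch. It assumes RMS normalisation without a learned gain and two axis-aligned subspaces `a` and `b` of a four-dimensional representation; both choices are illustrative assumptions, not the paper's setup.

```python
import math

def rms_norm(x):
    """Divide a vector by its root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def normalized_subspace_a(a, b):
    """Normalise the concatenation [a; b] and return the part belonging to a."""
    return rms_norm(a + b)[: len(a)]

a = [1.0, 2.0]  # content of subspace a, held fixed throughout

base = normalized_subspace_a(a, [3.0, 0.0])
rotated = normalized_subspace_a(a, [0.0, 3.0])  # b changes direction, same norm
scaled = normalized_subspace_a(a, [6.0, 0.0])   # b doubles its norm

print(base, rotated, scaled)
```

When `b` only rotates on a sphere of fixed radius, the normalised copy of `a` is unchanged. When the norm of `b` changes, `a` is rescaled even though its own content is identical. This matches the abstract's statement that independence survives only under a representation structure of spheres.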

[LG-80] The Use of AI-Robotic Systems for Scientific Discovery

Link: https://arxiv.org/abs/2406.17835
Authors: Alexander H. Gower, Konstantin Korovin, Daniel Brunnsåker, Filip Kronström, Gabriel K. Reder, Ievgeniia A. Tiukova, Ronald S. Reiserer, John P. Wikswo, Ross D. King
Keywords: scientific method, process of developing, entire scientific method, robot scientist, developing theories
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, book chapter

Abstract:The process of developing theories and models and testing them with experiments is fundamental to the scientific method. Automating the entire scientific method then requires not only automation of the induction of theories from data, but also experimentation from design to implementation. This is the idea behind a robot scientist – a coupled system of AI and laboratory robotics that has agency to test hypotheses with real-world experiments. In this chapter we explore some of the fundamentals of robot scientists in the philosophy of science. We also map the activities of a robot scientist to machine learning paradigms, and argue that the scientific method shares an analogy with active learning. We demonstrate these concepts using examples from previous robot scientists, and also from Genesis: a next generation robot scientist designed for research in systems biology, comprising a micro-fluidic system with 1000 computer-controlled micro-bioreactors and interpretable models based in controlled vocabularies and logic.

[LG-81] Univariate Skeleton Prediction in Multivariate Systems Using Transformers

Link: https://arxiv.org/abs/2406.17834
Authors: Giorgio Morales, John W. Sheppard
Keywords: approximate the behavior, observed system, system response, univariate symbolic, methods attempt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2024

Abstract:Symbolic regression (SR) methods attempt to learn mathematical expressions that approximate the behavior of an observed system. However, when dealing with multivariate systems, they often fail to identify the functional form that explains the relationship between each variable and the system’s response. To begin to address this, we propose an explainable neural SR method that generates univariate symbolic skeletons that aim to explain how each variable influences the system’s response. By analyzing multiple sets of data generated artificially, where one input variable varies while others are fixed, relationships are modeled separately for each input variable. The response of such artificial data sets is estimated using a regression neural network (NN). Finally, the multiple sets of input-response pairs are processed by a pre-trained Multi-Set Transformer that solves a problem we termed Multi-Set Skeleton Prediction and outputs a univariate symbolic skeleton. Thus, such skeletons represent explanations of the function approximated by the regression NN. Experimental results demonstrate that this method learns skeleton expressions matching the underlying functions and outperforms two GP-based and two neural SR methods.

[LG-82] Empirical Bayes for Dynamic Bayesian Networks Using Generalized Variational Inference

Link: https://arxiv.org/abs/2406.17831
Authors: Vyacheslav Kungurtsev, Apaar Garg, Aarya Khandelwal, Parth Sandeep Ratogi, Bapi Chatterjee, Jakub Marecek
Keywords: Dynamic Bayesian Network, Empirical Bayes approach, Bayesian Network, Empirical Bayes, Dynamic Bayesian
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:In this work, we demonstrate the Empirical Bayes approach to learning a Dynamic Bayesian Network. By starting with several point estimates of structure and weights, we can use a data-driven prior to subsequently obtain a model to quantify uncertainty. This approach uses a recent development of Generalized Variational Inference, and indicates the potential of sampling the uncertainty of a mixture of DAG structures as well as a parameter posterior.

[LG-83] Extreme Learning Machines for Fast Training of Click-Through Rate Prediction Models

Link: https://arxiv.org/abs/2406.17828
Authors: Ergun Biçici
Keywords: Extreme Learning Machines, traditional gradient-based learning, robust generalization capabilities, Extreme Learning, Learning Machines
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, 2 tables

Abstract:Extreme Learning Machines (ELM) provide a fast alternative to traditional gradient-based learning in neural networks, offering rapid training and robust generalization capabilities. Its theoretical basis shows its universal approximation capability. We explore the application of ELMs for the task of Click-Through Rate (CTR) prediction, which is largely unexplored by ELMs due to the high dimensionality of the problem. We introduce an ELM-based model enhanced with embedding layers to improve the performance on CTR tasks, which is a novel addition to the field. Experimental results on benchmark datasets, including Avazu and Criteo, demonstrate that our proposed ELM with embeddings achieves competitive F1 results while significantly reducing training time compared to state-of-the-art models such as Masknet. Our findings show that ELMs can be useful for CTR prediction, especially when fast training is needed.
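
For readers unfamiliar with ELMs, the core idea fits in a short sketch: the hidden layer is random and frozen, and only the output weights are fitted, here with ridge regression solved in closed form. This is a generic ELM on a toy regression task. The class name `ELM`, the `tanh` activation, the ridge strength, and the sine data are all assumptions for illustration; the paper's embedding layers and CTR datasets are not modelled.

```python
import math
import random

def solve(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [yi] for row, yi in zip(M, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

class ELM:
    def __init__(self, n_inputs, n_hidden, ridge=1e-3, seed=0):
        rng = random.Random(seed)
        # Hidden weights and biases are drawn once and never trained.
        self.W = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.bias = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.ridge = ridge
        self.beta = [0.0] * n_hidden

    def hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W, self.bias)]

    def fit(self, X, y):
        H = [self.hidden(x) for x in X]
        k = len(self.beta)
        # Ridge normal equations: (H^T H + ridge * I) beta = H^T y
        gram = [[sum(h[i] * h[j] for h in H) + (self.ridge if i == j else 0.0)
                 for j in range(k)] for i in range(k)]
        rhs = [sum(h[i] * t for h, t in zip(H, y)) for i in range(k)]
        self.beta = solve(gram, rhs)
        return self

    def predict(self, X):
        return [sum(b * h for b, h in zip(self.beta, self.hidden(x))) for x in X]

# Toy data: fit y = sin(x) on a grid.
X = [[i / 10.0] for i in range(-30, 31)]
y = [math.sin(x[0]) for x in X]
model = ELM(n_inputs=1, n_hidden=15).fit(X, y)
pred = model.predict(X)
mse = sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
print("training MSE:", mse)
```

Training reduces to one linear solve, with no gradient steps. That is where the speed advantage described in the abstract comes from.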

[LG-84] European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry

Link: https://arxiv.org/abs/2406.17826
Authors: Krzysztof Kotowski, Christoph Haskamp, Jacek Andrzejewski, Bogdan Ruszczak, Jakub Nalepa, Daniel Lakey, Peter Collins, Aybike Kolmas, Mauro Bartesaghi, Jose Martinez-Heras, Gabriele De Canio
Keywords: European Space Agency, improve anomaly detection, anomaly detection, satellite telemetry, Space Agency Benchmark
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 87 pages, 24 figures, 19 tables

Abstract:Machine learning has vast potential to improve anomaly detection in satellite telemetry which is a crucial task for spacecraft operations. This potential is currently hampered by a lack of comprehensible benchmarks for multivariate time series anomaly detection, especially for the challenging case of satellite telemetry. The European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry (ESA-ADB) aims to address this challenge and establish a new standard in the domain. It is a result of close cooperation between spacecraft operations engineers from the European Space Agency (ESA) and machine learning experts. The newly introduced ESA Anomalies Dataset contains annotated real-life telemetry from three different ESA missions, out of which two are included in ESA-ADB. Results of typical anomaly detection algorithms assessed in our novel hierarchical evaluation pipeline show that new approaches are necessary to address operators’ needs. All elements of ESA-ADB are publicly available to ensure its full reproducibility.

[LG-85] AI for the prediction of early stages of Alzheimers disease from neuroimaging biomarkers – A narrative review of a growing field

Link: https://arxiv.org/abs/2406.17822
Authors: Thorsten Rudroff, Oona Rainio, Riku Klén
Keywords: early Alzheimer disease, Alzheimer disease, early Alzheimer, multiple neuroimaging techniques, MRI and PET
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 tables

Abstract:Objectives: The objectives of this narrative review are to summarize the current state of AI applications in neuroimaging for early Alzheimer’s disease (AD) prediction and to highlight the potential of AI techniques in improving early AD diagnosis, prognosis, and management. Methods: We conducted a narrative review of studies using AI techniques applied to neuroimaging data for early AD prediction. We examined single-modality studies using structural MRI and PET imaging, as well as multi-modality studies integrating multiple neuroimaging techniques and biomarkers. Furthermore, they reviewed longitudinal studies that model AD progression and identify individuals at risk of rapid decline. Results: Single-modality studies using structural MRI and PET imaging have demonstrated high accuracy in classifying AD and predicting progression from mild cognitive impairment (MCI) to AD. Multi-modality studies, integrating multiple neuroimaging techniques and biomarkers, have shown improved performance and robustness compared to single-modality approaches. Longitudinal studies have highlighted the value of AI in modeling AD progression and identifying individuals at risk of rapid decline. However, challenges remain in data standardization, model interpretability, generalizability, clinical integration, and ethical considerations. Conclusion: AI techniques applied to neuroimaging data have the potential to improve early AD diagnosis, prognosis, and management. Addressing challenges related to data standardization, model interpretability, generalizability, clinical integration, and ethical considerations is crucial for realizing the full potential of AI in AD research and clinical practice. Collaborative efforts among researchers, clinicians, and regulatory agencies are needed to develop reliable, robust, and ethical AI tools that can benefit AD patients and society. 
Related DOI: https://doi.org/10.1007/s10072-024-07649-8

[LG-86] Automatically Adaptive Conformal Risk Control

Link: https://arxiv.org/abs/2406.17819
Authors: Vincent Blot (LISN, CNRS), Anastasios N Angelopoulos (UC Berkeley), Michael I Jordan (UC Berkeley, Inria), Nicolas J-B Brunel (ENSIIE)
Keywords: machine learning algorithms, black-box machine learning, Science and technology, ensure reliable, learning algorithms
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Science and technology have a growing need for effective mechanisms that ensure reliable, controlled performance from black-box machine learning algorithms. These performance guarantees should ideally hold conditionally on the input-that is the performance guarantees should hold, at least approximately, no matter what the input. However, beyond stylized discrete groupings such as ethnicity and gender, the right notion of conditioning can be difficult to define. For example, in problems such as image segmentation, we want the uncertainty to reflect the intrinsic difficulty of the test sample, but this may be difficult to capture via a conditioning event. Building on the recent work of Gibbs et al. [2023], we propose a methodology for achieving approximate conditional control of statistical risks-the expected value of loss functions-by adapting to the difficulty of test samples. Our framework goes beyond traditional conditional risk control based on user-provided conditioning events to the algorithmic, data-driven determination of appropriate function classes for conditioning. We apply this framework to various regression and segmentation tasks, enabling finer-grained control over model performance and demonstrating that by continuously monitoring and adjusting these parameters, we can achieve superior precision compared to conventional risk-control methods.

[LG-87] Temporal Prototype-Aware Learning for Active Voltage Control on Power Distribution Networks

Link: https://arxiv.org/abs/2406.17818
Authors: Feiyang Xu, Shunyu Liu, Yunpeng Qing, Yihe Zhou, Yuwen Wang, Mingli Song
Keywords: Active Voltage Control, Active Voltage, Power Distribution Networks, power systems, voltage levels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Abstract:Active Voltage Control (AVC) on the Power Distribution Networks (PDNs) aims to stabilize the voltage levels to ensure efficient and reliable operation of power systems. With the increasing integration of distributed energy resources, recent efforts have explored employing multi-agent reinforcement learning (MARL) techniques to realize effective AVC. Existing methods mainly focus on the acquisition of short-term AVC strategies, i.e., only learning AVC within the short-term training trajectories of a singular diurnal cycle. However, due to the dynamic nature of load demands and renewable energy, the operation states of real-world PDNs may exhibit significant distribution shifts across varying timescales (e.g., daily and seasonal changes). This can render those short-term strategies suboptimal or even obsolete when performing continuous AVC over extended periods. In this paper, we propose a novel temporal prototype-aware learning method, abbreviated as TPA, to learn time-adaptive AVC under short-term training trajectories. At the heart of TPA are two complementary components, namely multi-scale dynamic encoder and temporal prototype-aware policy, that can be readily incorporated into various MARL methods. The former component integrates a stacked transformer network to learn underlying temporal dependencies at different timescales of the PDNs, while the latter implements a learnable prototype matching mechanism to construct a dedicated AVC policy that can dynamically adapt to the evolving operation states. Experimental results on the AVC benchmark with different PDN sizes demonstrate that the proposed TPA surpasses the state-of-the-art counterparts not only in terms of control performance but also by offering model transferability. Our code is available at this https URL.

[LG-88] Unsupervised Concept Drift Detection from Deep Learning Representations in Real-time

Link: https://arxiv.org/abs/2406.17813
Authors: Salvatore Greco, Bartolomeo Vacchetti, Daniele Apiletti, Tania Cerquitelli
Keywords: target domain change, statistical properties, target domain, Drift, model performance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Concept Drift is a phenomenon in which the underlying data distribution and statistical properties of a target domain change over time, leading to a degradation of the model’s performance. Consequently, models deployed in production require continuous monitoring through drift detection techniques. Most drift detection methods to date are supervised, i.e., based on ground-truth labels. However, true labels are usually not available in many real-world scenarios. Although recent efforts have been made to develop unsupervised methods, they often lack the required accuracy, have a complexity that makes real-time implementation in production environments difficult, or are unable to effectively characterize drift. To address these challenges, we propose DriftLens, an unsupervised real-time concept drift detection framework. It works on unstructured data by exploiting the distribution distances of deep learning representations. DriftLens can also provide drift characterization by analyzing each label separately. A comprehensive experimental evaluation is presented with multiple deep learning classifiers for text, image, and speech. Results show that (i) DriftLens performs better than previous methods in detecting drift in 11/13 use cases; (ii) it runs at least 5 times faster; (iii) its detected drift value is very coherent with the amount of drift (correlation $\geq 0.85$); (iv) it is robust to parameter changes.
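
As a rough illustration of measuring a "distribution distance" between embedding sets, the sketch below compares a baseline batch with a new window. The abstract does not specify the distance in detail, so the choice here is an assumption: each batch is modeled as a Gaussian with diagonal covariance, for which the squared Fréchet distance reduces to a sum over dimensions of squared mean and standard-deviation differences. The function names are hypothetical.

```python
import statistics


def column_stats(embeddings):
    """Per-dimension mean and population standard deviation."""
    columns = list(zip(*embeddings))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def squared_frechet_diagonal(baseline, window):
    """Squared Frechet distance between two diagonal-Gaussian fits.

    Assumption: covariances are diagonal, so the distance is
    sum_d (mu_b[d] - mu_w[d])**2 + (sd_b[d] - sd_w[d])**2.
    """
    mu_b, sd_b = column_stats(baseline)
    mu_w, sd_w = column_stats(window)
    return sum((mb - mw) ** 2 + (sb - sw) ** 2
               for mb, mw, sb, sw in zip(mu_b, mu_w, sd_b, sd_w))


# Usage: a shifted window yields a larger distance than an identical one.
baseline = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[3.0, 3.0], [5.0, 5.0]]
no_drift = squared_frechet_diagonal(baseline, baseline)
drift = squared_frechet_diagonal(baseline, shifted)
```

In a monitoring loop, the distance for each incoming window would be compared against a threshold estimated from drift-free data. How that threshold is chosen is left open here.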

[LG-89] Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars

Link: https://arxiv.org/abs/2406.17812
Authors: Wesley Brewer, Aditya Kashi, Sajal Dash, Aristeidis Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang
Keywords: leveraging scalable artificial, scalable artificial intelligence, post-ChatGPT world, paper explores, explores the potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 17 pages, 5 figures

Abstract:In a post-ChatGPT world, this paper explores the potential of leveraging scalable artificial intelligence for scientific discovery. We propose that scaling up artificial intelligence on high-performance computing platforms is essential to address such complex problems. This perspective focuses on scientific use cases like cognitive simulations, large language models for scientific inquiry, medical image analysis, and physics-informed approaches. The study outlines the methodologies needed to address such challenges at scale on supercomputers or the cloud and provides exemplars of such approaches applied to solve a variety of scientific problems.

[LG-90] CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

Link: https://arxiv.org/abs/2406.17811
Authors: Jacob O. Tørring, Carl Hvarfner, Luigi Nardi, Magnus Själander
Keywords: Bayesian optimization, powerful method, method for automating, automating tuning, Bayesian optimization algorithms
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Bayesian optimization is a powerful method for automating tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized benchmarks has limited the study of Bayesian optimization within the domain. To address this, we present CATBench, a comprehensive benchmarking suite that captures the complexities of compiler autotuning, ranging from discrete, conditional, and permutation parameter types to known and unknown binary constraints, as well as both multi-fidelity and multi-objective evaluations. The benchmarks in CATBench span a range of machine learning-oriented computations, from tensor algebra to image processing and clustering, and uses state-of-the-art compilers, such as TACO and RISE/ELEVATE. CATBench offers a unified interface for evaluating Bayesian optimization algorithms, promoting reproducibility and innovation through an easy-to-use, fully containerized setup of both surrogate and real-world compiler optimization tasks. We validate CATBench on several state-of-the-art algorithms, revealing their strengths and weaknesses and demonstrating the suite’s potential for advancing both Bayesian optimization and compiler autotuning research.

[LG-91] Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache

Link: https://arxiv.org/abs/2406.17808
Authors: Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang
Keywords: current task, Large Language Models, form of active, active memory, few-shot learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The context window within a transformer provides a form of active memory for the current task, which can be useful for few-shot learning and conditional generation, both which depend heavily on previous context tokens. However, as the context length grows, the computational cost increases quadratically. Recent works have shown that saving a few initial tokens along with a fixed-sized sliding window leads to stable streaming generation with linear complexity in transformer-based Large Language Models (LLMs). However, they make suboptimal use of the fixed window by naively evicting all tokens unconditionally from the key-value (KV) cache once they reach the end of the window, resulting in tokens being forgotten and no longer able to affect subsequent predictions. To overcome this limitation, we propose a novel mechanism for storing longer sliding window contexts with the same total cache size by keeping separate cascading sub-cache buffers whereby each subsequent buffer conditionally accepts a fraction of the relatively more important tokens evicted from the previous buffer. Our method results in a dynamic KV cache that can store tokens from the more distant past than a fixed, static sliding window approach. Our experiments show improvements of 5.6% on long context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM) using LLMs given the same fixed cache size. Additionally, we provide an efficient implementation that improves the KV cache latency from 1.33ms per caching operation to 0.54ms, a 59% speedup over previous work.
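
The cascading idea described in this abstract can be pictured with a toy data structure. This is a sketch under simplifying assumptions, not the paper's mechanism: tokens are plain integers rather than key-value tensors, and each deeper buffer accepts every second token offered to it (a fixed stride) instead of choosing the relatively more important ones. The class name `CascadingWindow` and its parameters are hypothetical.

```python
from collections import deque


class CascadingWindow:
    """Toy cascade of FIFO buffers with a fixed total capacity."""

    def __init__(self, num_buffers, buffer_size, accept_every=2):
        self.buffers = [deque() for _ in range(num_buffers)]
        self.buffer_size = buffer_size
        self.accept_every = accept_every
        # Number of tokens offered to each buffer so far.
        self.offered = [0] * num_buffers

    def add(self, token):
        self._offer(0, token)

    def _offer(self, level, token):
        if level >= len(self.buffers):
            return  # Evicted from the last buffer: forgotten.
        self.offered[level] += 1
        # The first buffer accepts everything; deeper buffers accept
        # only every `accept_every`-th token offered to them.
        if level > 0 and (self.offered[level] - 1) % self.accept_every != 0:
            return
        buf = self.buffers[level]
        buf.append(token)
        if len(buf) > self.buffer_size:
            self._offer(level + 1, buf.popleft())

    def tokens(self):
        """Cached tokens, oldest first."""
        out = []
        for buf in reversed(self.buffers):
            out.extend(buf)
        return out


# Usage: two buffers of 4 slots each, fed tokens 0..19.
cache = CascadingWindow(num_buffers=2, buffer_size=4)
for t in range(20):
    cache.add(t)
print(cache.tokens())  # [8, 10, 12, 14, 16, 17, 18, 19]
```

With the same 8 slots, a plain sliding window would hold tokens 12 to 19, while the cascade reaches back to token 8 at the cost of a sparser view of older context.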

[LG-92] MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?

Link: https://arxiv.org/abs/2406.17806
Authors: Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
Keywords: biased thinking patterns, Humans are prone, Large Language Models, Multimodal Large Language, cognitive distortions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Humans are prone to cognitive distortions – biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies. While these models are designed to respond queries under safety mechanism, they sometimes reject harmless queries in the presence of certain visual stimuli, disregarding the benign nature of their contexts. As the initial step in investigating this behavior, we identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To systematically evaluate MLLMs’ oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1). Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2). Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model’s responses. (3). Different types of stimuli tend to cause errors at specific stages – perception, intent reasoning, and safety judgement – in the response process of MLLMs. These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications. We make our project available at this https URL.

[LG-93] Deep Learning Approaches for Detecting Adversarial Cyberbullying and Hate Speech in Social Networks

Link: https://arxiv.org/abs/2406.17793
Authors: Sylvia Worlali Azumah, Nelly Elsayed, Zag ElSayed, Murat Ozer, Amanda La Guardia
Keywords: concern intricately linked, intricately linked, find resolution, resolution through technological, significant concern intricately
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, 8 figures, 3 tables, under review

Abstract:Cyberbullying is a significant concern intricately linked to technology that can find resolution through technological means. Despite its prevalence, technology also provides solutions to mitigate cyberbullying. To address growing concerns regarding the adverse impact of cyberbullying on individuals’ online experiences, various online platforms and researchers are actively adopting measures to enhance the safety of digital environments. While researchers persist in crafting detection models to counteract or minimize cyberbullying, malicious actors are deploying adversarial techniques to circumvent these detection methods. This paper focuses on detecting cyberbullying in adversarial attack content within social networking site text data, specifically emphasizing hate speech. Utilizing a deep learning-based approach with a correction algorithm, this paper yielded significant results. An LSTM model with a fixed epoch of 100 demonstrated remarkable performance, achieving high accuracy, precision, recall, F1-score, and AUC-ROC scores of 87.57%, 88.73%, 87.57%, 88.15%, and 91% respectively. Additionally, the LSTM model’s performance surpassed that of previous studies.

[LG-94] CNN-based Compressor Mass Flow Estimator in Industrial Aircraft Vapor Cycle System

Link: https://arxiv.org/abs/2406.17788
Authors: Justin Reverdi (IRIT, IMT), Sixin Zhang (IRIT), Saïd Aoues, Fabrice Gamboa (IMT), Serge Gratton (IRIT), Thomas Pellegrini (IRIT)
Keywords: Vapor Cycle Systems, Vapor Cycle, plays a key role, sensor plays a key, Cycle Systems
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:In Vapor Cycle Systems, the mass flow sensor plays a key role for different monitoring and control purposes. However, physical sensors can be inaccurate, heavy, cumbersome, expensive or highly sensitive to vibrations, which is especially problematic when embedded into an aircraft. The conception of a virtual sensor, based on other standard sensors, is a good alternative. This paper has two main objectives. Firstly, a data-driven model using a Convolutional Neural Network is proposed to estimate the mass flow of the compressor. We show that it significantly outperforms the standard Polynomial Regression model (thermodynamic maps), in terms of the standard MSE metric and Engineer Performance metrics. Secondly, a semi-automatic segmentation method is proposed to compute the Engineer Performance metrics for real datasets, as the standard MSE metric may pose risks in analyzing the dynamic behavior of Vapor Cycle Systems.

[LG-95] Bayesian inverse Navier-Stokes problems: joint flow field reconstruction and parameter learning

Link: https://arxiv.org/abs/2406.18464
Authors: Alexandros Kontogiannis, Scott V. Elgersma, Andrew J. Sederman, Matthew P. Juniper
Keywords: Bayesian inverse Navier-Stokes, Bayesian inverse, order to jointly, assimilates velocimetry data, Gaussian prior distributions
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We formulate and solve a Bayesian inverse Navier-Stokes (N-S) problem that assimilates velocimetry data in order to jointly reconstruct a 3D flow field and learn the unknown N-S parameters, including the boundary position. By hardwiring a generalised N-S problem, and regularising its unknown parameters using Gaussian prior distributions, we learn the most likely parameters in a collapsed search space. The most likely flow field reconstruction is then the N-S solution that corresponds to the learned parameters. We develop the method in the variational setting and use a stabilised Nitsche weak form of the N-S problem that permits the control of all N-S parameters. To regularise the inferred the geometry, we use a viscous signed distance field (vSDF) as an auxiliary variable, which is given as the solution of a viscous Eikonal boundary value problem. We devise an algorithm that solves this inverse problem, and numerically implement it using an adjoint-consistent stabilised cut-cell finite element method. We then use this method to reconstruct magnetic resonance velocimetry (flow-MRI) data of a 3D steady laminar flow through a physical model of an aortic arch for two different Reynolds numbers and signal-to-noise ratio (SNR) levels (low/high). We find that the method can accurately i) reconstruct the low SNR data by filtering out the noise/artefacts and recovering flow features that are obscured by noise, and ii) reproduce the high SNR data without overfitting. Although the framework that we develop applies to 3D steady laminar flows in complex geometries, it readily extends to time-dependent laminar and Reynolds-averaged turbulent flows, as well as non-Newtonian (e.g. viscoelastic) fluids.

[LG-96] Second Maximum of a Gaussian Random Field and Exact (t-)Spacing test

Link: https://arxiv.org/abs/2406.18397
Authors: Azaïs Jean-Marc, Dalmao Federico, De Castro Yohann
Keywords: Gaussian random field, Gaussian random, Riemannian submanifold, test, maximum
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Differential Geometry (math.DG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 5 figures, 22 pages main document, 2 pages supplements

Abstract:In this article, we introduce the novel concept of the second maximum of a Gaussian random field on a Riemannian submanifold. This second maximum serves as a powerful tool for characterizing the distribution of the maximum. By utilizing an ad-hoc Kac Rice formula, we derive the explicit form of the maximum’s distribution, conditioned on the second maximum and some regressed component of the Riemannian Hessian. This approach results in an exact test, based on the evaluation of spacing between these maxima, which we refer to as the spacing test. We investigate the applicability of this test in detecting sparse alternatives within Gaussian symmetric tensors, continuous sparse deconvolution, and two-layered neural networks with smooth rectifiers. Our theoretical results are supported by numerical experiments, which illustrate the calibration and power of the proposed tests. More generally, this test can be applied to any Gaussian random field on a Riemannian manifold, and we provide a general framework for the application of the spacing test in continuous sparse kernel regression. Furthermore, when the variance-covariance function of the Gaussian random field is known up to a scaling factor, we derive an exact Studentized version of our test, coined the $t$-spacing test. This test is perfectly calibrated under the null hypothesis and has high power for detecting sparse alternatives.
MSC classes: Primary 62E15, 62F03, 60G15, 62H10, 62H15, secondary 60E05, 60G10, 62J05, 94A08

[LG-97] Learning pure quantum states (almost) without regret

Link: https://arxiv.org/abs/2406.18370
Authors: Josep Lumbreras, Mikhail Terekhov, Marco Tomamichel
Keywords: initiate the study, pure quantum state, quantum state, unknown pure quantum, quantum state tomography
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 2 figures

Abstract:We initiate the study of quantum state tomography with minimal regret. A learner has sequential oracle access to an unknown pure quantum state, and in each round selects a pure probe state. Regret is incurred if the unknown state is measured orthogonal to this probe, and the learner’s goal is to minimise the expected cumulative regret over $T$ rounds. The challenge is to find a balance between the most informative measurements and measurements incurring minimal regret. We show that the cumulative regret scales as $\Theta(\operatorname{polylog} T)$ using a new tomography algorithm based on a median of means least squares estimator. This algorithm employs measurements biased towards the unknown state and produces online estimates that are optimal (up to logarithmic terms) in the number of observed samples.
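
The abstract names a median-of-means least squares estimator. The generic median-of-means building block can be sketched for scalar samples, as below. This is an illustration only: the paper combines the idea with least squares on quantum measurement outcomes, which is not reproduced here, and the function name and block layout are assumptions.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Median of the means of `num_blocks` consecutive equal-size blocks.

    Samples that do not fill a whole block are ignored.
    """
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


# Usage: a single outlier spoils the plain mean but not the estimate.
data = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]
print(statistics.fmean(data))    # 167.5
print(median_of_means(data, 3))  # 1.0
```

Taking the median across blocks limits the influence of a few extreme samples, which is the usual reason for preferring it over a plain average when heavy tails are possible.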

[LG-98] Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation

Link: https://arxiv.org/abs/2406.18327
Authors: Yuxuan Qi, Li Lin, Jiajun Wang, Jingya Zhang, Bin Zhang
Keywords: Accurate segmentation, Evidential Fusion Network, Multi-modal Evidential Fusion, treatment of cancer, PET
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by networks. To take the uncertainty into account in multi-modal information fusion, this paper proposes a novel Multi-modal Evidential Fusion Network (MEFN) comprising a Cross-Modal Feature Learning (CFL) module and a Multi-modal Trusted Fusion (MTF) module. The CFL module reduces the domain gap upon modality conversion and highlights common tumor features, thereby alleviating the needs of the segmentation module to handle modality specificity. The MTF module utilizes mutual attention mechanisms and an uncertainty calibrator to fuse modality features based on modality uncertainty and then fuse the segmentation results under the guidance of Dempster-Shafer Theory. Besides, a new uncertainty perceptual loss is introduced to force the model focusing on uncertain features and hence improve its ability to extract trusted modality information. Extensive comparative experiments are conducted on two publicly available PET/CT datasets to evaluate the performance of our proposed method whose results demonstrate that our MEFN significantly outperforms state-of-the-art methods with improvements of 2.15% and 3.23% in DSC scores on the AutoPET dataset and the Hecktor dataset, respectively. More importantly, our model can provide radiologists with credible uncertainty of the segmentation results for their decision in accepting or rejecting the automatic segmentation results, which is particularly important for clinical applications. Our code will be available at this https URL.

[LG-99] Trade-off between Gradient Measurement Efficiency and Expressivity in Deep Quantum Neural Networks

Link: https://arxiv.org/abs/2406.18316
Authors: Koki Chinzei, Shinichiro Yamano, Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
Keywords: Quantum neural networks, practical quantum advantages, achieve practical quantum, neural networks, achieve practical
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 32 pages, 11 figures

Abstract:Quantum neural networks (QNNs) require an efficient training algorithm to achieve practical quantum advantages. A promising approach is the use of gradient-based optimization algorithms, where gradients are estimated through quantum measurements. However, it is generally difficult to efficiently measure gradients in QNNs because the quantum state collapses upon measurement. In this work, we prove a general trade-off between gradient measurement efficiency and expressivity in a wide class of deep QNNs, elucidating the theoretical limits and possibilities of efficient gradient estimation. This trade-off implies that a more expressive QNN requires a higher measurement cost in gradient estimation, whereas we can increase gradient measurement efficiency by reducing the QNN expressivity to suit a given task. We further propose a general QNN ansatz called the stabilizer-logical product ansatz (SLPA), which can reach the upper limit of the trade-off inequality by leveraging the symmetric structure of the quantum circuit. In learning an unknown symmetric function, the SLPA drastically reduces the quantum resources required for training while maintaining accuracy and trainability compared to a well-designed symmetric circuit based on the parameter-shift method. Our results not only reveal a theoretical understanding of efficient training in QNNs but also provide a standard and broadly applicable efficient QNN design.

[LG-100] Generative artificial intelligence in ophthalmology: multimodal retinal images for the diagnosis of Alzheimers disease with convolutional neural networks

Link: https://arxiv.org/abs/2406.18247
Authors: I. R. Slootweg, M. Thach, K. R. Curro-Tafili, F. D. Verbraak, F. H. Bouwman, Y. A. L. Pijnenburg, J. F. Boer, J. H. P. de Kwisthout, L. Bagheriye, P. J. González
Keywords: Amyloid Positron Emission, Positron Emission Tomography, predict Amyloid Positron, Amyloid Positron, Positron Emission
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Background/Aim. This study aims to predict Amyloid Positron Emission Tomography (AmyloidPET) status with multimodal retinal imaging and convolutional neural networks (CNNs) and to improve the performance through pretraining with synthetic data. Methods. Fundus autofluorescence, optical coherence tomography (OCT), and OCT angiography images from 328 eyes of 59 AmyloidPET positive subjects and 108 AmyloidPET negative subjects were used for classification. Denoising Diffusion Probabilistic Models (DDPMs) were trained to generate synthetic images and unimodal CNNs were pretrained on synthetic data and finetuned on real data or trained solely on real data. Multimodal classifiers were developed to combine predictions of the four unimodal CNNs with patient metadata. Class activation maps of the unimodal classifiers provided insight into the network’s attention to inputs. Results. DDPMs generated diverse, realistic images without memorization. Pretraining unimodal CNNs with synthetic data improved AUPR at most from 0.350 to 0.579. Integration of metadata in multimodal CNNs improved AUPR from 0.486 to 0.634, which was the best overall best classifier. Class activation maps highlighted relevant retinal regions which correlated with AD. Conclusion. Our method for generating and leveraging synthetic data has the potential to improve AmyloidPET prediction from multimodal retinal imaging. A DDPM can generate realistic and unique multimodal synthetic retinal images. Our best performing unimodal and multimodal classifiers were not pretrained on synthetic data, however pretraining with synthetic data slightly improved classification performance for two out of the four modalities.

[LG-101] Sparse deep neural networks for nonparametric estimation in high-dimensional sparse regression

Link: https://arxiv.org/abs/2406.18137
Authors: Dongya Wu,Xin Li
Keywords: deep neural networks, deep neural, neural networks, sparse deep neural, neural
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Generalization theory has been established for sparse deep neural networks under high-dimensional regime. Beyond generalization, parameter estimation is also important since it is crucial for variable selection and interpretability of deep neural networks. Current theoretical studies concerning parameter estimation mainly focus on two-layer neural networks, which is due to the fact that the convergence of parameter estimation heavily relies on the regularity of the Hessian matrix, while the Hessian matrix of deep neural networks is highly singular. To avoid the unidentifiability of deep neural networks in parameter estimation, we propose to conduct nonparametric estimation of partial derivatives with respect to inputs. We first show that model convergence of sparse deep neural networks is guaranteed in that the sample complexity only grows with the logarithm of the number of parameters or the input dimension when the $\ell_1$-norm of parameters is well constrained. Then by bounding the norm and the divergence of partial derivatives, we establish that the convergence rate of nonparametric estimation of partial derivatives scales as $\mathcal{O}(n^{-1/4})$, a rate which is slower than the model convergence rate $\mathcal{O}(n^{-1/2})$. To the best of our knowledge, this study combines nonparametric estimation and parametric sparse deep neural networks for the first time. As nonparametric estimation of partial derivatives is of great significance for nonlinear variable selection, the current results show the promising future for the interpretability of deep neural networks.
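
To make the gap between the two rates concrete, here is a rough back-of-the-envelope comparison. It assumes the constants hidden in both bounds equal 1, which the abstract does not state, so the numbers are only illustrative.

```latex
n^{-1/4} \le \delta \iff n \ge \delta^{-4},
\qquad
n^{-1/2} \le \delta \iff n \ge \delta^{-2}.

\text{For } \delta = 0.1:\quad
n \ge 10^{4} \ \text{(partial derivatives)}
\quad \text{vs.} \quad
n \ge 10^{2} \ \text{(model)}.
```

Under that assumption, reaching the same target error for the partial derivatives takes the square of the sample size needed for the model itself.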

[LG-102] Learning for Bandits under Action Erasures

Link: https://arxiv.org/abs/2406.18072
Authors: Osama Hanna,Merve Karakas,Lin F. Yang,Christina Fragouli
Keywords: distributed agents, multi-arm bandit, external sensors, MAB algorithm, underlying MAB algorithm
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We consider a novel multi-arm bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, while the distributed agents know if an action is erased, the central learner does not (there is no feedback), and thus does not know whether the observed reward resulted from the desired action or not. We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over action-erasure channels that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, which we prove is optimal by providing a matching lower bound.
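
As a quick sanity check on how the regret factor grows with the erasure probability, the following evaluates it at two sample values. The constant hidden in the big-O is ignored here, which is an assumption made only for illustration.

```latex
\left.\frac{1}{\sqrt{1-\epsilon}}\right|_{\epsilon = 0.75}
  = \frac{1}{\sqrt{0.25}} = 2,
\qquad
\left.\frac{1}{\sqrt{1-\epsilon}}\right|_{\epsilon = 0.99}
  = \frac{1}{\sqrt{0.01}} = 10.
```

So even when 99% of actions are erased, the multiplicative penalty in this bound is on the order of 10 rather than 100.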

[LG-103] Domain Adaptation of Echocardiography Segmentation Via Reinforcement Learning

Link: https://arxiv.org/abs/2406.17902
Authors: Arnaud Judge,Thierry Judge,Nicolas Duchateau,Roman A. Sandler,Joseph Z. Sokol,Olivier Bernard,Pierre-Marc Jodoin
Keywords: Performance of deep, insufficient annotated data, effective fine-tuning, significantly challenged, aiming to adapt
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Performance of deep learning segmentation models is significantly challenged in its transferability across different medical imaging domains, particularly when aiming to adapt these models to a target domain with insufficient annotated data for effective fine-tuning. While existing domain adaptation (DA) methods propose strategies to alleviate this problem, these methods do not explicitly incorporate human-verified segmentation priors, compromising the potential of a model to produce anatomically plausible segmentations. We introduce RL4Seg, an innovative reinforcement learning framework that reduces the need to otherwise incorporate large expertly annotated datasets in the target domain, and eliminates the need for lengthy manual human review. Using a target dataset of 10,000 unannotated 2D echocardiographic images, RL4Seg not only outperforms existing state-of-the-art DA methods in accuracy but also achieves 99% anatomical validity on a subset of 220 expert-validated subjects from the target domain. Furthermore, our framework’s reward network offers uncertainty estimates comparable with dedicated state-of-the-art uncertainty methods, demonstrating the utility and effectiveness of RL4Seg in overcoming domain adaptation challenges in medical image segmentation.

[LG-104] Treatment of Statistical Estimation Problems in Randomized Smoothing for Adversarial Robustness

Link: https://arxiv.org/abs/2406.17830
Authors: Vaclav Voracek
Keywords: popular certified defense, popular certified, certified defense, Randomized smoothing, adversarial attacks
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: comments are welcome

Abstract:Randomized smoothing is a popular certified defense against adversarial attacks. In its essence, we need to solve a problem of statistical estimation which is usually very time-consuming since we need to perform numerous (usually $10^5$) forward passes of the classifier for every point to be certified. In this paper, we review the statistical estimation problems for randomized smoothing to find out if the computational burden is necessary. In particular, we consider the (standard) task of adversarial robustness where we need to decide if a point is robust at a certain radius or not using as few samples as possible while maintaining statistical guarantees. We present estimation procedures employing confidence sequences enjoying the same statistical guarantees as the standard methods, with the optimal sample complexities for the estimation task and empirically demonstrate their good performance. Additionally, we provide a randomized version of Clopper-Pearson confidence intervals resulting in strictly stronger certificates.
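
For context, the standard estimation step that the abstract refers to can be sketched as follows. This is a minimal illustration of the usual fixed-sample baseline (a one-sided Clopper-Pearson lower bound fed into the Gaussian-smoothing radius `sigma * Phi^{-1}(p_lower)`), not the confidence-sequence or randomized procedures proposed in the paper. The function names, the default `alpha`, and the bisection solver are assumptions made for this sketch, and the exact binomial sum is only practical for small `n`.

```python
from math import comb
from statistics import NormalDist


def binom_upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def clopper_pearson_lower(k: int, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) lower confidence bound on a binomial proportion.

    Solves P(X >= k | p) = alpha for p by bisection; the tail is increasing in p.
    """
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_upper_tail(n, k, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo  # returning the lower end keeps the bound conservative


def certified_radius(k: int, n: int, sigma: float, alpha: float = 0.001) -> float:
    """L2 radius sigma * Phi^{-1}(p_lower); 0.0 means the point is not certified."""
    p_lower = clopper_pearson_lower(k, n, alpha)
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


# Hypothetical counts: the top class won k of n noisy forward passes.
radius = certified_radius(k=100, n=100, sigma=0.5)
```

Because the bound must hold with probability at least `1 - alpha` from a fixed number of samples, tightening it is what drives the large per-point sample counts mentioned in the abstract.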

[LG-105] Distribution Learnability and Robustness

Link: https://arxiv.org/abs/2406.17814
Authors: Shai Ben-David,Alex Bie,Gautam Kamath,Tosca Lechner
Keywords: Machine Learning, learnability, learning, PAC learning, distribution learning
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: In NeurIPS 2023

Abstract:We examine the relationship between learnability and robust (or agnostic) learnability for the problem of distribution learning. We show that, contrary to other learning settings (e.g., PAC learning of function classes), realizable learnability of a class of probability distributions does not imply its agnostic learnability. We go on to examine what type of data corruption can disrupt the learnability of a distribution class and what is such learnability robust against. We show that realizable learnability of a class of distributions implies its robust learnability with respect to only additive corruption, but not against subtractive corruption. We also explore related implications in the context of compression schemes and differentially private learnability.

[LG-106] MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis

Link: https://arxiv.org/abs/2406.17797
Authors: Shikun Feng,Jiaxin Zheng,Yinjun Jia,Yanwen Huang,Fengfeng Zhou,Wei-Ying Ma,Yanyan Lan
Keywords: prediction tasks related, molecular property prediction, property prediction tasks, Molecular representation, Molecular representation learning
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Molecular representation learning is pivotal for various molecular property prediction tasks related to drug discovery. Robust and accurate benchmarks are essential for refining and validating current methods. Existing molecular property benchmarks derived from wet experiments, however, face limitations such as data volume constraints, unbalanced label distribution, and noisy labels. To address these issues, we construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline. We conduct extensive experiments on various deep learning models, demonstrating that our dataset offers significant physicochemical interpretability to guide model development and design. Notably, the dataset’s properties are linked to binding affinity metrics, providing additional insights into model performance in drug-target interaction tasks. We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning, thereby expediting progress in the field of artificial intelligence-driven drug discovery.

Information Retrieval

[IR-0] UniRec: A Dual Enhancement of Uniformity and Frequency in Sequential Recommendations

Link: https://arxiv.org/abs/2406.18470
Authors: Yang Liu,Yitong Wang,Chenyue Feng
Keywords: accurately modeling user, critical for accurately, accurately modeling, modeling user interaction, improving recommendation precision
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures, for source code, see this https URL

Abstract:Representation learning in sequential recommendation is critical for accurately modeling user interaction patterns and improving recommendation precision. However, existing approaches predominantly emphasize item-to-item transitions, often neglecting the time intervals between interactions, which are closely related to behavior pattern changes. Additionally, broader interaction attributes, such as item frequency, are frequently overlooked. We found that both sequences with more uniform time intervals and items with higher frequency yield better prediction performance. Conversely, non-uniform sequences exacerbate user interest drift and less-frequent items are difficult to model due to sparse sampling, presenting unique challenges inadequately addressed by current methods. In this paper, we propose UniRec, a novel bidirectional enhancement sequential recommendation method. UniRec leverages sequence uniformity and item frequency to enhance performance, particularly improving the representation of non-uniform sequences and less-frequent items. These two branches mutually reinforce each other, driving comprehensive performance optimization in complex sequential recommendation scenarios. Additionally, we present a multidimensional time module to further enhance adaptability. To the best of our knowledge, UniRec is the first method to utilize the characteristics of uniformity and frequency for feature augmentation. Comparing with eleven advanced models across four datasets, we demonstrate that UniRec outperforms SOTA models significantly. The code is available at this https URL.

[IR-1] The Effects of Data Split Strategies on the Offline Experiments for CTR Prediction

Link: https://arxiv.org/abs/2406.18320
Authors: Ramazan Tarik Turksoy,Beyza Turkmen
Keywords: Click-through rate, advertising to recommend, recommend products, products that users, Click-through
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Click-through rate (CTR) prediction is a crucial task in online advertising to recommend products that users are likely to be interested in. To identify the best-performing models, rigorous model evaluation is necessary. Offline experimentation plays a significant role in selecting models for live user-item interactions, despite the value of online experimentation like A/B testing, which has its own limitations and risks. Often, the correlation between offline performance metrics and actual online model performance is inadequate. One main reason for this discrepancy is the common practice of using random splits to create training, validation, and test datasets in CTR prediction. In contrast, real-world CTR prediction follows a temporal order. Therefore, the methodology used in offline evaluation, particularly the data splitting strategy, is crucial. This study aims to address the inconsistency between current offline evaluation methods and real-world use cases, by focusing on data splitting strategies. To examine the impact of different data split strategies on offline performance, we conduct extensive experiments using both random and temporal splits on a large open benchmark dataset, Criteo.
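
The difference between the two splitting strategies contrasted in the abstract can be sketched in a few lines. This is a toy illustration only: the record layout (`ts`, `clicked`), the 80/20 ratio, and the function names are assumptions, and no real Criteo data is loaded.

```python
import random


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle, then cut: test rows may predate some training rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def temporal_split(rows, test_fraction=0.2):
    """Sort by timestamp, then cut: every test row is at least as recent as training."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * (1.0 - test_fraction))
    return ordered[:cut], ordered[cut:]


# Toy impression log with hypothetical fields.
log = [{"ts": t, "clicked": t % 3 == 0} for t in range(10)]
train_r, test_r = random_split(log)
train_t, test_t = temporal_split(log)
```

With the temporal split the model is always evaluated on interactions that happen after everything it trained on, which mirrors how a deployed CTR model is used; the random split gives no such guarantee.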

[IR-2] Effects of Using Synthetic Data on Deep Recommender Models Performance

Link: https://arxiv.org/abs/2406.18286
Authors: Fatih Cihan Taskin,Ilknur Akcay,Muhammed Pesen,Said Aldemir,Ipek Iraz Esin,Furkan Durmus
Keywords: enhancing user experiences, suggesting items based, individual preferences, essential for enhancing, enhancing user
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Recommender systems are essential for enhancing user experiences by suggesting items based on individual preferences. However, these systems frequently face the challenge of data imbalance, characterized by a predominance of negative interactions over positive ones. This imbalance can result in biased recommendations favoring popular items. This study investigates the effectiveness of synthetic data generation in addressing data imbalances within recommender systems. Six different methods were used to generate synthetic data. Our experimental approach involved generating synthetic data using these methods and integrating the generated samples into the original dataset. Our results show that the inclusion of generated negative samples consistently improves the Area Under the Curve (AUC) scores. The significant impact of synthetic negative samples highlights the potential of data augmentation strategies to address issues of data sparsity and imbalance, ultimately leading to improved performance of recommender systems.

[IR-3] Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning

Link: https://arxiv.org/abs/2406.18254
Authors: Zhijie Nie,Richong Zhang,Zhangchi Feng,Hailang Huang,Xudong Liu
Keywords: achieves image-text retrieval, Cross-lingual Cross-modal Retrieval, improved retrieval tasks, cross-lingual cross-modal pre-training, web search
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted by KDD 2024 Research Track

Abstract:Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieves image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: The methods with cross-lingual style suffer from the intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset. The methods with cross-modal style suffer from the inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving the new state-of-art.

[IR-4] Knowledge Graph Enhanced Retrieval-Augmented Generation for Failure Mode and Effects Analysis

Link: https://arxiv.org/abs/2406.18114
Authors: Lukas Bahr,Christoph Wehner,Judith Wewerka,José Bittencourt,Ute Schmid,Rüdiger Daub
Keywords: mitigating potential failures, Failure mode, potential failures, effects analysis, FMEA
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Failure mode and effects analysis (FMEA) is a critical tool for mitigating potential failures, particularly during ramp-up phases of new products. However, its effectiveness is often limited by the missing reasoning capabilities of the FMEA tools, which are usually tabular structured. Meanwhile, large language models (LLMs) offer novel prospects for fine-tuning on custom datasets for reasoning within FMEA contexts. However, LLMs face challenges in tasks that require factual knowledge, a gap that retrieval-augmented generation (RAG) approaches aim to fill. RAG retrieves information from a non-parametric data store and uses a language model to generate responses. Building on this idea, we propose to advance the non-parametric data store with a knowledge graph (KG). By enhancing the RAG framework with a KG, our objective is to leverage analytical and semantic question-answering capabilities on FMEA data. This paper contributes by presenting a new ontology for FMEA observations, an algorithm for creating vector embeddings from the FMEA KG, and a KG enhanced RAG framework. Our approach is validated through a human study and we measure the performance of the context retrieval recall and precision.
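
The two retrieval measures named at the end of the abstract can be illustrated with a minimal set-based sketch. The definitions below are the common textbook ones and are an assumption here; the paper may compute context recall and precision differently, and the identifiers are hypothetical.

```python
def context_precision_recall(retrieved, relevant):
    """Set-based precision and recall of retrieved context items."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical node ids returned by the retriever vs. those judged relevant.
precision, recall = context_precision_recall(
    retrieved=["n1", "n2", "n3", "n4"],
    relevant=["n1", "n3", "n5"],
)
```

Precision penalizes irrelevant context handed to the language model, while recall penalizes relevant facts the retriever missed.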

[IR-5] Efficient Document Ranking with Learnable Late Interactions

Link: https://arxiv.org/abs/2406.17968
Authors: Ziwei Ji,Himanshu Jain,Andreas Veit,Sashank J. Reddi,Sadeep Jayasumana,Ankit Singh Rawat,Aditya Krishna Menon,Felix Yu,Sanjiv Kumar
Keywords: information retrieval, fundamental approaches, LITE, document token embeddings, models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
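
As background for the "hand-crafted" scorers the abstract mentions, the sketch below shows the sum-of-max token scorer used by ColBERT-style late interaction. It is not LITE's learnable scorer, whose form the abstract does not give. The toy two-dimensional embeddings and the function names are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
```

Note that the scorer needs every document token embedding at query time, which is the storage and latency burden the abstract points to.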

[IR-6] NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization

Link: https://arxiv.org/abs/2406.17961
Authors: Md Mahadi Hasan Nahid,Davood Rafiei
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, parsing textual data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: Work in Progress

Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.

[IR-7] Understanding the Role of User Profile in the Personalization of Large Language Models

Link: https://arxiv.org/abs/2406.17803
Authors: Bin Wu,Zhengyan Shi,Hossein A. Rahmani,Varsha Ramineni,Emine Yilmaz
Keywords: Large Language Models, personalize Large Language, Language Models, Large Language, Utilizing user profiles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Utilizing user profiles to personalize Large Language Models (LLMs) has been shown to enhance the performance on a wide range of tasks. However, the precise role of user profiles and their effect mechanism on LLMs remains unclear. This study first confirms that the effectiveness of user profiles is primarily due to personalization information rather than semantic information. Furthermore, we investigate how user profiles affect the personalization of LLMs. Within the user profile, we reveal that it is the historical personalized response produced or approved by users that plays a pivotal role in personalizing LLMs. This discovery unlocks the potential of LLMs to incorporate a greater number of user profiles within the constraints of limited input length. As for the position of user profiles, we observe that user profiles integrated into different positions of the input context do not contribute equally to personalization. Instead, user profiles placed closer to the beginning of the input have a greater effect on the personalization of LLMs. Our findings reveal the role of user profiles for the personalization of LLMs, and showcase how incorporating user profiles impacts performance, providing insight into how to leverage user profiles effectively.

[IR-8] Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools

Link: https://arxiv.org/abs/2406.18240
Authors: Francisca Silva-Clavería,Carmen Serrano,Iván Matas,Amalia Serrano,Tomás Toledo-Pastrana,David Moreno-Ramírez,Begoña Acha
Keywords: basal cell carcinoma, BCC, cell carcinoma, objectively validated, basal cell
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Methodology (stat.ME)
Comments: Manuscript word count: 3000, Number of figures: 2, Number of tables: 3

Abstract:Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar’s test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists.

Artificial Intelligence

[AI-0] Symbolic Learning Enables Self-Evolving Agents

Link: https://arxiv.org/abs/2406.18532
Authors: Wangchunshu Zhou,Yixin Ou,Shengwei Ding,Long Li,Jialong Wu,Tiannan Wang,Jiamin Chen,Shuai Wang,Xiaohua Xu,Ningyu Zhang,Huajun Chen,Yuchen Eleanor Jiang
Keywords: language agents, large language models, agent symbolic learning, tool usage methods, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code available at this https URL

Abstract:The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing “language agents”, which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that they are model-centric, or engineering-centric. That’s to say, the progress on prompts, tools, and pipelines of language agents requires substantial manual engineering efforts from human experts rather than automatically learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in “self-evolving agents”. 

[AI-1] APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Link: https://arxiv.org/abs/2406.18518
Authors: Zuxin Liu,Thai Hoang,Jianguo Zhang,Ming Zhu,Tian Lan,Shirley Kokane,Juntao Tan,Weiran Yao,Zhiwei Liu,Yihao Feng,Rithesh Murthy,Liangwei Yang,Silvio Savarese,Juan Carlos Niebles,Huan Wang,Shelby Heinecke,Caiming Xiong
Keywords: Berkeley Function-Calling Benchmark, models requires diverse, function-calling, requires diverse, agent models requires
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: this https URL and the project homepage: this https URL
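
The three hierarchical stages named in the abstract (format checking, actual execution, semantic verification) could be wired together roughly as below. This is a hypothetical sketch, not APIGen's implementation: the JSON call layout, the `API_REGISTRY` dictionary, the status strings, and the caller-supplied `semantic_check` predicate (a stand-in for whatever semantic verifier is actually used) are all assumptions.

```python
import json

# Hypothetical registry of executable APIs (stand-ins for real endpoints).
API_REGISTRY = {"add": lambda a, b: a + b}


def check_format(raw):
    """Stage 1: valid JSON object with a known function name and dict arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in API_REGISTRY:
        return None
    if not isinstance(args, dict):
        return None
    return call


def check_execution(call):
    """Stage 2: the call must actually run; returns (ok, result)."""
    try:
        return True, API_REGISTRY[call["name"]](**call["arguments"])
    except Exception:
        return False, None


def verify(raw, semantic_check):
    """Run the stages in order and stop at the first failure."""
    call = check_format(raw)
    if call is None:
        return "format_error"
    ok, result = check_execution(call)
    if not ok:
        return "execution_error"
    # Stage 3: placeholder predicate deciding whether the result fits the query.
    if not semantic_check(result):
        return "semantic_error"
    return "accepted"


status = verify('{"name": "add", "arguments": {"a": 2, "b": 3}}', lambda r: r == 5)
```

Ordering the checks from cheapest to most expensive means malformed or non-executable samples are discarded before the costlier semantic stage runs.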

[AI-2] Mental Modeling of Reinforcement Learning Agents by Language Models

Link: https://arxiv.org/abs/2406.18505
Authors: Wenhao Lu,Xufeng Zhao,Josua Spisak,Jae Hee Lee,Stefan Wermter
Keywords: language models faithfully, emergent language models, models faithfully model, language models, intelligence of decision-making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: this https URL

Abstract:Can emergent language models faithfully model the intelligence of decision-making agents? Though modern language models already exhibit some reasoning ability, and can theoretically express any probable distribution over tokens, it remains underexplored how the world knowledge these pretrained models have memorized can be utilized to comprehend an agent’s behaviour in the physical world. This study empirically examines, for the first time, how well large language models (LLMs) can build a mental model of agents, termed agent mental modelling, by reasoning about an agent’s behaviour and its effect on states from agent interaction history. This research may unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in eXplainable reinforcement learning (XRL). To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. Our results disclose that LLMs are not yet capable of fully mental modelling agents through inference alone without further innovations. This work thus provides new insights into the capabilities and limitations of modern LLMs.

[AI-3] Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation

Link: https://arxiv.org/abs/2406.18460
Authors: Ahmed Njifenjou,Virgile Sucal,Bassam Jabaian,Fabrice Lefèvre
Keywords: Large Language Models, Large Language, proposed to create, create open-domain conversational, Recently
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Updated version of a paper originally submitted at SIGDIAL 2023

Abstract:Recently, various methods have been proposed to create open-domain conversational agents with Large Language Models (LLMs). These models are able to answer user queries, but in a one-way QA format rather than a true conversation. Fine-tuning on particular datasets is the usual way to modify their style to increase conversational ability, but this is expensive and usually only available in a few languages. In this study, we explore role-play zero-shot prompting as an efficient and cost-effective solution for open-domain conversation, using capable multilingual LLMs (Beeching et al., 2023) trained to obey instructions. We design a prompting system that, when combined with an instruction-following model - here Vicuna (Chiang et al., 2023) - produces conversational agents that match and even surpass fine-tuned models in human evaluation in French in two different tasks.

[AI-4] Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers

Link: https://arxiv.org/abs/2406.18451
Authors: Jonas Ngnawé, Sabyasachi Sahoo, Yann Pequignot, Frédéric Precioso, Christian Gagné
Keywords: high-stakes real-world applications, adversarial training strategies, input space margins, improve robustness, imperceptible perturbations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 2 tables, 1 algorithm

Abstract:Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model’s vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score for detecting non-robust samples, but it is intractable for deep neural networks. This paper introduces the concept of margin consistency – a property that links the input space margins and the logit margins in robust models – for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model’s logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they exhibit strong margin consistency, with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
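
To make the idea of using the logit margin as a detection score more concrete, here is a minimal sketch. It assumes the logit margin is the gap between the two largest logits of a sample and that a fixed threshold flags brittle decisions; the function names, the threshold, and the example logits are all hypothetical and are not taken from the paper.

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit for one sample."""
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up


def flag_brittle(batch_logits, threshold):
    """Return indices of samples whose logit margin falls below `threshold`."""
    return [i for i, logits in enumerate(batch_logits)
            if logit_margin(logits) < threshold]


# Hypothetical logits for three samples over three classes.
batch = [
    [4.0, 0.5, -1.0],  # large margin: confident decision
    [1.2, 1.0, 0.1],   # small margin: candidate brittle decision
    [0.3, 2.9, 2.8],   # small margin: candidate brittle decision
]
brittle = flag_brittle(batch, threshold=0.5)  # hypothetical threshold
```

In practice the threshold would have to be calibrated, for example against input margins estimated on a small subset, as the abstract suggests.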

[AI-5] Preference Elicitation for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2406.18450
Authors: Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
Keywords: designing reward functions, Applying reinforcement learning, Applying reinforcement, reward function, real-world problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.

[AI-6] Cascading Large Language Models for Salient Event Graph Generation

Link: https://arxiv.org/abs/2406.18449
Authors: Xingwei Tan, Yuxiang Zhou, Gabriele Pergola, Yulan He
Keywords: multiple tasks involved, reconciling unstructured input, Generating event graphs, Generating event, identifying their relationships
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 + 12 pages

Abstract:Generating event graphs from long documents is challenging due to the inherent complexity of multiple tasks involved such as detecting events, identifying their relationships, and reconciling unstructured input with structured graphs. Recent studies typically consider all events with equal importance, failing to distinguish salient events crucial for understanding narratives. This paper presents CALLMSAE, a CAscading Large Language Model framework for SAlient Event graph generation, which leverages the capabilities of LLMs and eliminates the need for costly human annotations. We first identify salient events by prompting LLMs to generate summaries, from which salient events are identified. Next, we develop an iterative code refinement prompting strategy to generate event relation graphs, removing hallucinated relations and recovering missing edges. Fine-tuning contextualised graph generation models on the LLM-generated graphs outperforms the models trained on CAEVO-generated data. Experimental results on a human-annotated test set show that the proposed method generates salient and more accurate graphs, outperforming competitive baselines.

[AI-7] Graph Neural Networks for Emulation of Finite-Element Ice Dynamics in Greenland and Antarctic Ice Sheets

Link: https://arxiv.org/abs/2406.18423
Authors: Younghyun Koo, Maryam Rahnemoonfar
Keywords: partial differential equations, provide accurate solutions, solve partial differential, models provide accurate, intensified computational demands
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 6 pages, 2 figures, submitted to the ICML 2024 Workshop on Machine Learning for Earth System Modeling

Abstract:Although numerical models provide accurate solutions for ice sheet dynamics based on physical laws, they incur heavy computational demands to solve partial differential equations. In recent years, convolutional neural networks (CNNs) have been widely used as statistical emulators for those numerical models. However, since CNNs operate on regular grids, they cannot represent the refined meshes and computational efficiency of finite-element numerical models. Therefore, instead of CNNs, this study adopts an equivariant graph convolutional network (EGCN) as an emulator for ice sheet dynamics modeling. EGCN reproduces ice thickness and velocity changes in the Helheim Glacier, Greenland, and Pine Island Glacier, Antarctica, with computation 260 times and 44 times faster, respectively. Compared to the traditional CNN and graph convolutional network, EGCN shows outstanding accuracy in thickness prediction near fast ice streams by preserving the equivariance to the translation and rotation of graphs.

[AI-8] Mixture of Experts in a Mixture of RL settings

Link: https://arxiv.org/abs/2406.18420
Authors: Timon Willi, Johan Obando-Ceron, Jakob Foerster, Karolina Dziugaite, Pablo Samuel Castro
Keywords: Mixtures of Experts, enhanced inference efficiency, supervised learning due, Deep Reinforcement Learning, boost Deep Reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network’s parameter count while reducing dormant neurons, thereby enhancing the model’s learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs’ ability to deal with non-stationarity and investigate MoEs in DRL settings with “amplified” non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes for the beneficial effect of MoE in DRL training, the impact of the various MoE components, and insights into how best to incorporate them in actor-critic-based DRL networks. Finally, we also confirm results from previous work.

[AI-9] BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data

Link: https://arxiv.org/abs/2406.18414
Authors: Kemiao Huang, Meiying Zhang, Qi Hao
Keywords: erroneous link correction, offline multi-object tracking, bounding box misalignment, real-time multi-object tracking, full track optimization
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantage of allowing 2D-3D detection fusion, erroneous link correction, and full track optimization, but has to deal with the challenges of bounding box misalignment and track evaluation, editing, and refinement. This paper proposes “BiTrack”, a 3D OMOT framework that includes modules for 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The contributions of this paper are threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management techniques that utilize a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion and refines each trajectory with completion and smoothing techniques. The experimental results on the KITTI dataset demonstrate that BiTrack achieves state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.

[AI-10] IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons

Link: https://arxiv.org/abs/2406.18406
Authors: Dan Shi, Renren Jin, Tianhao Shen, Weilong Dong, Xinwei Wu, Deyi Xiong
Keywords: large language models, encode a vast, mass data, widely acknowledged, acknowledged that large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 13 figures, 5 tables

Abstract:It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons), to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-and-play solution that can be integrated seamlessly with existing models.
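
The identify-then-reweight pattern described above can be illustrated with a small sketch. It assumes per-neuron attribution scores are already available (the integrated-gradients computation is not shown), selects the top-k neurons, and scales their activations by a constant factor. The names, the value of k, and the factor are hypothetical, and applying the scaling to activations rather than to weights is a simplification made here for illustration.

```python
def top_k_indices(scores, k):
    """Indices of the k largest attribution scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def reweight(activations, selected, factor):
    """Scale the selected neurons' activations by `factor`; leave the rest unchanged."""
    chosen = set(selected)
    return [a * factor if i in chosen else a for i, a in enumerate(activations)]


# Hypothetical per-neuron attribution scores and activations.
attribution = [0.02, 0.90, 0.10, 0.75]
activations = [1.0, 2.0, 3.0, 4.0]

selected = top_k_indices(attribution, k=2)             # neurons 1 and 3 score highest
boosted = reweight(activations, selected, factor=2.0)  # hypothetical enhancement factor
```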

[AI-11] SAM: Semi-Active Mechanism for Extensible Continuum Manipulator and Real-time Hysteresis Compensation Control Algorithm

Link: https://arxiv.org/abs/2406.18388
Authors: Junhyun Park, Seonghyeok Jang, Myeongbo Park, Hyojae Park, Jeonghyeon Yoon, Minho Hwang
Keywords: Cable-Driven Continuum Manipulators, enable scar-free procedures, improve target lesion, target lesion accessibility, Cable-Driven Continuum
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 14 figures, 6 tables

Abstract:Cable-Driven Continuum Manipulators (CDCMs) enable scar-free procedures via natural orifices and improve target lesion accessibility through curved paths. However, CDCMs face limitations in workspace and control accuracy due to non-linear cable effects causing hysteresis. This paper introduces an extensible CDCM with a Semi-active Mechanism (SAM) to expand the workspace via translational motion without additional mechanical elements or actuation. We collect a hysteresis dataset using 8 fiducial markers and RGBD sensing. Based on this dataset, we develop a real-time hysteresis compensation control algorithm using the trained Temporal Convolutional Network (TCN) with a 1ms time latency, effectively estimating the manipulator’s hysteresis behavior. Performance validation through random trajectory tracking tests and box pointing tasks shows the proposed controller significantly reduces hysteresis by up to 69.5% in joint space and approximately 26% in the box pointing task.

[AI-12] MALSIGHT: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization

Link: https://arxiv.org/abs/2406.18379
Authors: Haolang Lu, Hongrui Peng, Guoshun Nan, Jiaoyang Cui, Cheng Wang, Weifei Jin
Keywords: Large Language Models, automatically generate human-readable, executable files, facilitating tasks, cracking and detection
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 17 pages, 14 figures

Abstract:Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which involve the rich interactions within a binary malware, remain largely underexplored. To this end, we propose MALSIGHT, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summary datasets, MalS and MalP, using an LLM and refine them manually. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS dataset and a benign pseudocode dataset. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting the usability, accuracy, and completeness of summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed MALSIGHT. Notably, our proposed MalT5, with only 0.77B parameters, delivers performance comparable to the much larger ChatGPT3.5.

[AI-13] Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model

Link: https://arxiv.org/abs/2406.18364
Authors: Yiming Chen, Haobin Chen, Simin Liu, Yunyun Liu, Fanhao Zhou, Bing Wei
Keywords: natural language processing, language processing technology, Chinese news summaries, artificial intelligence, Chinese
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: submitted to ICMIII 2024

Abstract:With the continuous advancement of artificial intelligence, natural language processing technology has become widely utilized in various fields. At the same time, there are many challenges in creating Chinese news summaries. First of all, the semantics of Chinese news is complex, and the amount of information is enormous. Extracting critical information from Chinese news presents a significant challenge. Second, the news summary should be concise and clear, focusing on the main content and avoiding redundancy. In addition, the particularity of the Chinese language, such as polysemy, word segmentation, etc., makes it challenging to generate Chinese news summaries. Based on the above, this paper studies the information extraction method of the LCSTS dataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM model to make it perform better in generating Chinese news summaries. The experimental results show that the proposed method has a good effect on creating news summaries, which is of great importance to the construction of news summaries.

[AI-14] Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

Link: https://arxiv.org/abs/2406.18361
Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Fudan Zheng, Weijiang Yu
Keywords: generative tasks, demonstrated their effectiveness, Abstract, Diffusion, SDSeg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Accepted at MICCAI 2024. Code and citation info see this https URL

Abstract:Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model’s stability as implied by its name. The code is available at this https URL

[AI-15] Kolmogorov-Arnold Graph Neural Networks

Link: https://arxiv.org/abs/2406.18354
Authors: Gianluca De Carlo, Andrea Mastropietro, Aris Anagnostopoulos
Keywords: Graph neural networks, domains requiring transparent, requiring transparent decision-making, Graph Kolmogorov-Arnold Network, excel in learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, under review

Abstract:Graph neural networks (GNNs) excel in learning from network-like data but often lack interpretability, making their application challenging in domains requiring transparent decision-making. We propose the Graph Kolmogorov-Arnold Network (GKAN), a novel GNN model leveraging spline-based activation functions on edges to enhance both accuracy and interpretability. Our experiments on five benchmark datasets demonstrate that GKAN outperforms state-of-the-art GNN models in node classification, link prediction, and graph classification tasks. In addition to the improved accuracy, GKAN’s design inherently provides clear insights into the model’s decision-making process, eliminating the need for post-hoc explainability techniques. This paper discusses the methodology, performance, and interpretability of GKAN, highlighting its potential for applications in domains where interpretability is crucial.

[AI-16] Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control

Link: https://arxiv.org/abs/2406.18351
Authors: Zifan Liu, Xinran Li, Shibo Chen, Gen Li, Jiashuo Jiang, Jun Zhang
Keywords: inventory control, well-performed and general-purpose, online experience, sample efficiency, Reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) has proven to perform well and to be general-purpose in inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded by two limitations of online experience. First, online experience is expensive to acquire in real-world applications. Given the low sample efficiency of RL algorithms, it would take extensive time to train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost-sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a decision framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first take advantage of the inherent properties of lost-sales IC problems and design the feedback graph (FG) specifically for lost-sales IC problems to generate abundant side experiences that aid RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Based on the theoretical insights, we design an intrinsic reward to direct the RL agent to explore regions of the state-action space with more side experiences, further exploiting FG’s power. Experimental results demonstrate that our method greatly improves the sample efficiency of applying RL in IC. Our code is available at https://anonymous.4open.science/r/RLIMFG4IC-811D/
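
One way to picture how a single lost-sales observation can yield side experiences is the sketch below. It assumes a simplified single-period setting with zero lead time, where sales equal the minimum of demand and the stocked level, so demand is censored whenever the stock sells out. This is an illustrative reading of the lost-sales property, not the paper's feedback graph construction; all names are hypothetical.

```python
def side_experiences(level, observed_sales, candidate_levels):
    """Sales that can be inferred for other stock levels from one observation.

    Returns a dict mapping each inferable stock level to the sales it would
    have produced under the same (possibly censored) demand.
    """
    # Selling out means the true demand is only known to be >= level.
    censored = observed_sales >= level
    inferred = {}
    for other in candidate_levels:
        if other == level:
            continue
        if not censored:
            # Demand was fully observed, so every other level is computable.
            inferred[other] = min(observed_sales, other)
        elif other < level:
            # Demand >= level > other, so a lower level would also have sold out.
            inferred[other] = other
        # Censored demand says nothing certain about higher levels: skip them.
    return inferred


uncensored = side_experiences(level=5, observed_sales=3, candidate_levels=range(8))
censored = side_experiences(level=5, observed_sales=5, candidate_levels=range(8))
```

In the uncensored case every alternative level gains a side experience; in the censored case only the lower levels do, which hints at why some actions are more informative to explore than others.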

[AI-17] AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

Link: https://arxiv.org/abs/2406.18346
Authors: Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe
Keywords: Large Language Models, align Artificial Intelligence, Artificial Intelligence, Language Models, Large Language
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 table, to be submitted

Abstract:This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, including the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.

[AI-18] Continuous Sign Language Recognition Using Intra-inter Gloss Attention

Link: https://arxiv.org/abs/2406.18333
Authors: Hossein Ranjbar, Alireza Taheri
Keywords: capturing global contexts, adopt transformer-based architectures, sequence modeling due, studies adopt transformer-based, sign language recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many continuous sign language recognition (CSLR) studies adopt transformer-based architectures for sequence modeling due to their powerful capacity for capturing global contexts. Nevertheless, vanilla self-attention, which serves as the core module of the transformer, calculates a weighted average over all time steps; therefore, the local temporal semantics of sign videos may not be fully exploited. In this study, we introduce a novel module for sign language recognition, called the intra-inter gloss attention module, to leverage the relationships among frames within glosses and the semantic and grammatical dependencies between glosses in the video. In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk. This localized self-attention significantly reduces complexity and eliminates noise introduced by considering unrelated frames. In the inter-gloss attention module, we first aggregate the chunk-level features within each gloss chunk by average pooling along the temporal dimension. Subsequently, multi-head self-attention is applied to all chunk-level features. Because the signer-environment interaction is not significant, we utilize segmentation to remove the background of the videos. This enables the proposed model to direct its focus toward the signer. Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge, improve the accuracy of CSLR, and achieve a word error rate (WER) of 20.4 on the test set, a competitive result compared to state-of-the-art methods that use additional supervision.
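
The chunk-then-pool structure described above can be sketched in plain Python. This is a minimal illustration, assuming single-head scaled dot-product attention with identity projections (no learned query, key, or value weights) in place of the multi-head attention the abstract mentions; the function names and the toy frame features are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(dim)])
    return out


def split_chunks(frames, chunk_size):
    """Split the sequence into consecutive chunks (the last may be shorter)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def mean_pool(frames):
    """Average the feature vectors along the temporal dimension."""
    dim = len(frames[0])
    return [sum(f[j] for f in frames) / len(frames) for j in range(dim)]


def intra_inter_attention(frames, chunk_size):
    chunks = split_chunks(frames, chunk_size)
    local = [self_attention(chunk) for chunk in chunks]  # intra: within each chunk
    pooled = [mean_pool(chunk) for chunk in local]       # one feature vector per chunk
    return self_attention(pooled)                        # inter: across chunk features


# Six hypothetical 2-dimensional frame features, grouped into chunks of two.
frames = [[float(i), 1.0] for i in range(6)]
out = intra_inter_attention(frames, chunk_size=2)
```

Restricting attention to chunks of size c reduces the number of pairwise scores in the intra stage from T² to roughly T·c for a sequence of T frames, which is the complexity saving the abstract refers to.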

[AI-19] PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models

Link: https://arxiv.org/abs/2406.18326
Authors: Huixuan Zhang, Yun Lin, Xiaojun Wan
Keywords: Large language models, Large language, intentionally include data, trained on vast, vast amounts
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to deceptively high scores on model leaderboards, yet result in disappointing performance in real-world applications. To address this benchmark contamination problem, we first propose a set of requirements that practical contamination detection methods should follow. Following these proposed requirements, we introduce PaCoST, a Paired Confidence Significance Testing method to effectively detect benchmark contamination in LLMs. Our method constructs a counterpart for each piece of data with the same distribution, and performs statistical analysis of the corresponding confidence to test whether the model is significantly more confident under the original benchmark. We validate the effectiveness of PaCoST and apply it to popular open-source models and benchmarks. We find that almost all models and benchmarks we tested are suspected of being contaminated to some degree. We finally call for new LLM evaluation methods.
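
As a rough illustration of testing whether a model is significantly more confident on original benchmark items than on their counterparts, the sketch below computes a paired t statistic over confidence differences. The choice of a paired t-test, the normal approximation used for the p-value, and the confidence values are all assumptions made here; the abstract does not specify the exact test.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_statistic(original, counterpart):
    """t statistic for the mean of paired differences (original - counterpart)."""
    if len(original) != len(counterpart):
        raise ValueError("samples must be paired")
    diffs = [o - c for o, c in zip(original, counterpart)]
    if len(diffs) < 2:
        raise ValueError("need at least two pairs")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


def one_sided_p_value(t_stat):
    """Upper-tail p-value under a normal approximation (rough for small samples)."""
    return 1.0 - NormalDist().cdf(t_stat)


# Hypothetical model confidences on benchmark items and their rephrased counterparts.
original_conf = [0.90, 0.80, 0.95, 0.85, 0.90]
counterpart_conf = [0.60, 0.70, 0.65, 0.60, 0.70]

t_stat = paired_t_statistic(original_conf, counterpart_conf)
p_value = one_sided_p_value(t_stat)
```

A small p-value here would mean the confidence gap is unlikely under the null hypothesis of no difference. With only a handful of pairs, a Student t distribution would be the more appropriate reference than the normal approximation used in this sketch.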

[AI-20] MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Link: https://arxiv.org/abs/2406.18321
Authors: Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou
Keywords: Large language models, Large language, advanced natural language, natural language understanding, strong problem-solving abilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed “MathOdyssey” dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.

[AI-21] AI-native Memory: A Pathway from LLMs Towards AGI

Link: https://arxiv.org/abs/2406.18312
Authors: Jingbo Shang, Zai Zheng, Xiang Ying, Felix Tao, Mindverse Team
Keywords: artificial general intelligence, general intelligence, context length, demonstrated the world, sparks of artificial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have shown the world sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs – (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Our reasoning-in-a-haystack experiments further demonstrate that simultaneously finding the relevant information from a long context and conducting (simple) reasoning is nearly impossible. In this paper, we envision a pathway from LLMs to AGI through the integration of memory. We believe that AGI should be a system where LLMs serve as core processors. In addition to raw data, the memory in this system would store a large number of important conclusions derived from reasoning processes. Compared with retrieval-augmented generation (RAG), which merely processes raw data, this approach not only connects semantically related information more closely, but also simplifies complex inferences at the time of querying. As an intermediate stage, the memory will likely be in the form of natural language descriptions, which can be directly consumed by users too. Ultimately, every agent/person should have its own large personal model, a deep neural network model (thus AI-native) that parameterizes and compresses all types of memory, even those that cannot be described in natural language. Finally, we discuss the significant potential of AI-native memory as the transformative infrastructure for (proactive) engagement, personalization, distribution, and social in the AGI era, as well as the incurred privacy and security challenges with preliminary solutions.

[AI-22] S3: A Simple Strong Sample-effective Multimodal Dialog System

Link: https://arxiv.org/abs/2406.18305
Authors: Elisei Rykov, Egor Malkershin, Alexander Panchenko
Keywords: Journey Contest, compelling leaderboards, present a conceptually, conceptually simple, simple yet powerful
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.

[AI-23] Combining Automated Optimisation of Hyperparameters and Reward Shape

Link: https://arxiv.org/abs/2406.18293
Authors: Julian Dierkes, Emma Cramer, Holger H. Hoos, Sebastian Trimpe
Keywords: deep reinforcement learning, reinforcement learning, recent years, significant progress, progress in deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in the Reinforcement Learning Journal 2024

Abstract:There has been significant progress in deep reinforcement learning (RL) in recent years. Nevertheless, finding suitable hyperparameter configurations and reward functions remains challenging even for experts, and performance heavily relies on these design choices. Also, most RL research is conducted on known benchmarks where knowledge about these choices already exists. However, novel practical applications often pose complex tasks for which no prior knowledge about good hyperparameters and reward functions is available, thus necessitating their derivation from scratch. Prior work has examined automatically tuning either hyperparameters or reward functions individually. We demonstrate empirically that an RL algorithm’s hyperparameter configurations and reward function are often mutually dependent, meaning neither can be fully optimised without appropriate values for the other. We then propose a methodology for the combined optimisation of hyperparameters and the reward function. Furthermore, we include a variance penalty as an optimisation objective to improve the stability of learned policies. We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic on four environments. Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others, with only a minor increase in computational costs. This suggests that combined optimisation should be best practice.
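
To make the variance-penalty idea in this abstract concrete, here is a minimal sketch of one way such an objective could be scored. The abstract does not give the formula, so the scalarised form below (mean return minus a weighted standard deviation across training seeds), the function name `penalised_score`, and the `penalty_weight` parameter are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def penalised_score(returns, penalty_weight=1.0):
    """Mean return minus a weighted spread penalty.

    Hypothetical scalarisation: the spread is measured as the population
    standard deviation of the returns obtained across training seeds.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    return mean(returns) - penalty_weight * pstdev(returns)


# Made-up returns for two hyperparameter/reward configurations, one value per seed.
stable = [100.0, 102.0, 98.0, 100.0]
unstable = [160.0, 40.0, 150.0, 60.0]

print("stable:  ", penalised_score(stable))
print("unstable:", penalised_score(unstable))
```

In this toy data the unstable configuration has the higher average return, but the penalty ranks the stable one first, which is the kind of preference a stability-oriented objective is meant to express.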

[AI-24] Detecting Machine-Generated Texts: Not Just “AI vs Humans” and Explainability is Complicated

Link: https://arxiv.org/abs/2406.18259
Authors: Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, Xinru Lu
Keywords: increasing concerns arise, LLMs rapidly advance, rapidly advance, increasing concerns, real world
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 2 figures

Abstract:As LLMs rapidly advance, increasing concerns arise regarding risks about the actual authorship of texts we see online and in the real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of considering LLM-generated text detection a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an “undecided” category for texts that could be attributed to either source, and we show that this new category is crucial to understanding how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing the need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprising texts from various LLMs and human authors. Based on the new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in the new ternary classification context. Our results highlight why the “undecided” category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of the explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.

[AI-25] Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning

Link: https://arxiv.org/abs/2406.18254
Authors: Zhijie Nie, Richong Zhang, Zhangchi Feng, Hailang Huang, Xudong Liu
Keywords: achieves image-text retrieval, Cross-lingual Cross-modal Retrieval, improved retrieval tasks, cross-lingual cross-modal pre-training, web search
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted by KDD 2024 Research Track

Abstract:Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieve image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: (1) the methods with cross-lingual style suffer from the intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset; (2) the methods with cross-modal style suffer from the inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving a new state of the art.
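
As a rough illustration of what a metric like Mean Rank Variance might compute, the sketch below takes, for each instance, the rank of the correct item under each query language, computes the variance of those ranks, and averages over instances. This is only one plausible reading of the abstract's description; the paper's exact definition may differ, and the function name and input layout are assumptions.

```python
from statistics import mean, pvariance


def mean_rank_variance(ranks_per_instance):
    """Average, over instances, of the variance of ranks across languages.

    ranks_per_instance[i][k] is the (hypothetical) rank at which the correct
    item is retrieved for instance i when the query is in language k (1 = best).
    """
    return mean(pvariance(ranks) for ranks in ranks_per_instance)


# Made-up ranks for two instances and three languages each.
consistent = [[1, 1, 1], [3, 3, 3]]
inconsistent = [[1, 5, 9], [2, 2, 8]]

print("consistent:  ", mean_rank_variance(consistent))
print("inconsistent:", mean_rank_variance(inconsistent))
```

A model can score identically on a recall metric in both cases as long as every rank falls within the cutoff, while the per-instance variance still separates them, which matches the motivation stated in the abstract.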

[AI-26] Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets

Link: https://arxiv.org/abs/2406.18239
Authors: Simon Münker, Kai Kugler, Achim Rettinger
Keywords: Filtering and annotating, annotating textual data, annotating textual, Natural Language Processing, Filtering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 tables, 1 figure

Abstract:Filtering and annotating textual data are routine tasks in many areas, like social media or news analytics. Automating these tasks makes it possible to scale the analyses with respect to speed and breadth of content covered and decreases the manual effort required. Due to technical advancements in Natural Language Processing, specifically the success of large foundation models, a new tool for automating such annotation processes by using a text-to-text interface given written guidelines without providing training samples has become available. In this work, we assess these advancements in-the-wild by empirically testing them in an annotation task on German Twitter data about social and political European crises. We compare the prompt-based results with our human annotation and preceding classification approaches, including Naive Bayes and a BERT-based fine-tuning/domain adaptation pipeline. Our results show that the prompt-based approach - despite being limited by local computation resources during the model selection - is comparable with the fine-tuned BERT but without any annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.

[AI-27] PlaMo: Plan and Move in Rich 3D Physical Environments

Link: https://arxiv.org/abs/2406.18237
Authors: Assaf Hallak, Gal Dalal, Chen Tessler, Kelly Guo, Shie Mannor, Gal Chechik
Keywords: visual content creation, physically simulated worlds, complex physically simulated, Controlling humanoids, applications in gaming
Categories: Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)

Abstract:Controlling humanoids in complex physically simulated worlds is a long-standing challenge with numerous applications in gaming, simulation, and visual content creation. In our setup, given a rich and complex 3D scene, the user provides a list of instructions composed of target locations and locomotion types. To solve this task we present PlaMo, a scene-aware path planner and a robust physics-based controller. The path planner produces a sequence of motion paths, considering the various limitations the scene imposes on the motion, such as location, height, and speed. Complementing the planner, our control policy generates rich and realistic physical motion adhering to the plan. We demonstrate how the combination of both modules enables traversing complex landscapes in diverse forms while responding to real-time changes in the environment. Video: this https URL .

[AI-28] Enhancing Data Privacy in Large Language Models through Private Association Editing

Link: https://arxiv.org/abs/2406.18221
Authors: Davide Venditti, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Cristina Giannone, Andrea Favalli, Raniero Romagnoli, Fabio Massimo Zanzotto
Keywords: Large Language Models, Large Language, raises significant concerns, information raises significant, Private Association Editing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) are powerful tools with extensive applications, but their tendency to memorize private information raises significant concerns as private data leakage can easily happen. In this paper, we introduce Private Association Editing (PAE), a novel defense approach for private data leakage. PAE is designed to effectively remove Personally Identifiable Information (PII) without retraining the model. Our approach consists of a four-step procedure: detecting memorized PII, applying PAE cards to mitigate memorization of private data, verifying resilience to targeted data extraction (TDE) attacks, and ensuring consistency in the post-edit LLMs. The versatility and efficiency of PAE, which allows for batch modifications, significantly enhance data privacy in LLMs. Experimental results demonstrate the effectiveness of PAE in mitigating private data leakage. We believe PAE will serve as a critical tool in the ongoing effort to protect data privacy in LLMs, encouraging the development of safer models for real-world applications.

[AI-29] Guiding Video Prediction with Explicit Procedural Knowledge

Link: https://arxiv.org/abs/2406.18220
Authors: Patrick Takenaka, Johannes Maucher, Marco F. Huber
Keywords: deep learning models, integrate procedural knowledge, propose a general, video prediction, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Abstract:We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models and show that this leads to a better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance to a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.

[AI-30] AI Cards: Towards an Applied Framework for Machine-Readable AI and Risk Documentation Inspired by the EU AI Act

Link: https://arxiv.org/abs/2406.18211
Authors: Delaram Golpayegani, Isabelle Hupont, Cecilia Panigutti, Harshvardhan J. Pandit, Sven Schade, Declan O’Sullivan, Dave Lewis
Keywords: risk management, risk management information, upcoming enforcement, playing a pivotal, pivotal role
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:With the upcoming enforcement of the EU AI Act, documentation of high-risk AI systems and their risk management information will become a legal requirement playing a pivotal role in demonstration of compliance. Despite its importance, there is a lack of standards and guidelines to assist with drawing up AI and risk documentation aligned with the AI Act. This paper aims to address this gap by providing an in-depth analysis of the AI Act’s provisions regarding technical documentation, wherein we particularly focus on AI risk management. On the basis of this analysis, we propose AI Cards as a novel holistic framework for representing a given intended use of an AI system by encompassing information regarding technical specifications, context of use, and risk management, both in human- and machine-readable formats. While the human-readable representation of AI Cards provides AI stakeholders with a transparent and comprehensible overview of the AI use case, its machine-readable specification leverages on state of the art Semantic Web technologies to embody the interoperability needed for exchanging documentation within the AI value chain. This brings the flexibility required for reflecting changes applied to the AI system and its context, provides the scalability needed to accommodate potential amendments to legal requirements, and enables development of automated tools to assist with legal compliance and conformity assessment tasks. To solidify the benefits, we provide an exemplar AI Card for an AI-based student proctoring system and further discuss its potential applications within and beyond the context of the AI Act.

[AI-31] MammothModa: Multi-Modal Large Language Model

Link: https://arxiv.org/abs/2406.18193
Authors: Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang
Keywords: multi-modal large language, Complex Language Understanding, Maintaining Complex Language, Visual Attention Experts, Integrating Visual Capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report

Abstract:In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporated the Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending Context Window for High-Resolution and Long-Duration Visual Feature: We explore the Visual Merger Module to effectively reduce the token number of high-resolution images and incorporate frame position ids to avoid position interpolation. (iii) High-Quality Bilingual Datasets: We meticulously curated and filtered a high-quality bilingual multimodal dataset to reduce visual hallucinations. With the above recipe, we build MammothModa, which consistently outperforms state-of-the-art models, e.g., the LLaVA series, across main real-world visual language benchmarks without bells and whistles.

[AI-32] Methodology of Adapting Large English Language Models for Specific Cultural Contexts

Link: https://arxiv.org/abs/2406.18192
Authors: Wenjing Zhang, Siqi Xiao, Xuejiao Lei, Ning Wang, Huazheng Zhang, Meijuan An, Bikun Yang, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Keywords: artificial intelligence, prominent trend, field of artificial, specific cultural, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures

Abstract:The rapid growth of large language models (LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.

[AI-33] Selective Prompting Tuning for Personalized Conversations with LLMs

Link: https://arxiv.org/abs/2406.18187
Authors: Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang
Keywords: understanding is essential, profiles and contextual, contextual understanding, SPT, persona profiles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2024 findings

Abstract:In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models’ (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose Selective Prompt Tuning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (this https URL) is publicly available for further exploration.

[AI-34] Games of Knightian Uncertainty

Link: https://arxiv.org/abs/2406.18178
Authors: Spyridon Samothrakis, Dennis J.N.J. Soemers, Damian Machlanski
Keywords: Arguably, games, building intelligent machines, identifying optimal players, Abstract
Categories: Artificial Intelligence (cs.AI)

Abstract:Arguably, for the late 20th and early 21st centuries, games have been seen as the drosophila of AI. Games are a set of exciting testbeds, whose solutions (in terms of identifying optimal players) would lead to machines that would possess some form of general intelligence, or at the very least help us gain insights toward building intelligent machines. Following impressive successes in traditional board games like Go, Chess, and Poker, but also video games like the Atari 2600 collection, it is clear that this is not the case. Games have been attacked successfully, but we are nowhere near AGI developments (or, as harsher critics might say, useful AI developments!). In this short vision paper, we argue that for game research to become relevant again to the AGI pathway, we need to be able to address Knightian uncertainty in the context of games, i.e. agents need to be able to adapt to rapid changes in game rules on the fly with no warning, no previous data, and no model access.

[AI-35] Start from Zero: Triple Set Prediction for Automatic Knowledge Graph Completion

Link: https://arxiv.org/abs/2406.18166
Authors: Wen Zhang, Yajing Xu, Peng Ye, Zhiwei Huang, Zezhong Xu, Jiaoyan Chen, Jeff Z. Pan, Huajun Chen
Keywords: Knowledge graph, missing triples, missing, triples, TSP
Categories: Artificial Intelligence (cs.AI)
Comments: Paper accepted by TKDE in 2024

Abstract:Knowledge graph (KG) completion aims to find out missing triples in a KG. Some tasks, such as link prediction and instance completion, have been proposed for KG completion. They are triple-level tasks with some elements in a missing triple given to predict the missing element of the triple. However, knowing some elements of the missing triple in advance is not always a realistic setting. In this paper, we propose a novel graph-level automatic KG completion task called Triple Set Prediction (TSP) which assumes none of the elements in the missing triples is given. TSP is to predict a set of missing triples given a set of known triples. To properly and accurately evaluate this new task, we propose 4 evaluation metrics including 3 classification metrics and 1 ranking metric, considering both the partial-open-world and the closed-world assumptions. Furthermore, to tackle the huge candidate triples for prediction, we propose a novel and efficient subgraph-based method GPHT that can predict the triple set fast. To fairly compare the TSP results, we also propose two types of methods RuleTensor-TSP and KGE-TSP applying the existing rule- and embedding-based methods for TSP as baselines. During experiments, we evaluate the proposed methods on two datasets extracted from Wikidata following the relation-similarity partial-open-world assumption proposed by us, and also create a complete family data set to evaluate TSP results following the closed-world assumption. Results prove that the methods can successfully generate a set of missing triples and achieve reasonable scores on the new task, and GPHT performs better than the baselines with significantly shorter prediction time. The datasets and code for experiments are available at this https URL.
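
Since Triple Set Prediction outputs a whole set of triples, a set-level comparison against the held-out missing triples is a natural way to score it. The abstract mentions three classification metrics without naming them; the sketch below uses precision, recall, and F1 purely for illustration, under a closed-world reading where any predicted triple not in the held-out set counts as wrong. The function name and the toy triples are hypothetical.

```python
def triple_set_scores(predicted, missing):
    """Set-level precision, recall, and F1 between predicted and held-out triples."""
    predicted, missing = set(predicted), set(missing)
    true_positives = len(predicted & missing)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(missing) if missing else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy (head, relation, tail) triples; none of these come from the paper's datasets.
missing = {
    ("alice", "mother_of", "bob"),
    ("bob", "sibling_of", "carol"),
    ("carol", "child_of", "alice"),
}
predicted = {
    ("alice", "mother_of", "bob"),
    ("bob", "sibling_of", "carol"),
    ("bob", "child_of", "dave"),
    ("dave", "spouse_of", "alice"),
}

precision, recall, f1 = triple_set_scores(predicted, missing)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under a partial-open-world assumption, a predicted triple absent from the held-out set is not necessarily false, so a scorer for that setting would need a different treatment of such triples than this closed-world sketch.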

[AI-36] Innovating for Tomorrow: The Convergence of SE and Green AI

Link: https://arxiv.org/abs/2406.18142
Authors: Luís Cruz, Xavier Franch Gutierrez, Silverio Martínez-Fernández
Keywords: existing software engineering, machine learning, latest advancements, advancements in machine, revolutionizing the frontiers
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted in SE 2030 - International Workshop on Software Engineering in 2030

Abstract:The latest advancements in machine learning, specifically in foundation models, are revolutionizing the frontiers of existing software engineering (SE) processes. This is a bi-directional phenomenon, where 1) software systems are now challenged to provide AI-enabled features to their users, and 2) AI is used to automate tasks within the software development lifecycle. In an era where sustainability is a pressing societal concern, our community needs to adopt a long-term plan enabling a conscious transformation that aligns with environmental sustainability values. In this paper, we reflect on the impact of adopting environmentally friendly practices to create AI-enabled software systems and make considerations on the environmental impact of using foundation models for software development.

[AI-37] Exclusive Style Removal for Cross Domain Novel Class Discovery

Link: https://arxiv.org/abs/2406.18140
Authors: Yicheng Wang, Feng Liu, Junmin Liu, Zhen Fang, Kai Sun
Keywords: Class Discovery, NCD methods, open-world learning, NCD, unlabeled set based
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:As a promising field in open-world learning, Novel Class Discovery (NCD) is usually a task to cluster unseen novel classes in an unlabeled set based on the prior knowledge of labeled data within the same domain. However, the performance of existing NCD methods could be severely compromised when novel classes are sampled from a different distribution from the labeled ones. In this paper, we explore and establish the solvability of NCD in the cross-domain setting with the necessary condition that style information must be removed. Based on the theoretical analysis, we introduce an exclusive style removal module for extracting style information that is distinctive from the baseline features, thereby facilitating inference. Moreover, this module is easy to integrate with other NCD methods, acting as a plug-in to improve performance on novel classes with different distributions compared to the seen labeled set. Additionally, recognizing the non-negligible influence of different backbones and pre-training strategies on the performance of the NCD methods, we build a fair benchmark for future NCD research. Extensive experiments on three common datasets demonstrate the effectiveness of our proposed module.

[AI-38] ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models

Link: https://arxiv.org/abs/2406.18125
Authors: Ahmed Heakl, Youssef Mohamed, Noran Mohamed, Ali Sharkaway, Ahmed Zaky
Keywords: recruitment platforms coupled, resume classification methods, efficient resume classification, increasing reliance, platforms coupled
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 1 table, 6th International Conference on AI in Computational Linguistics

Abstract:The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92% and a top-5 accuracy of 97.5%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
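
For readers unfamiliar with the top-1 and top-5 figures quoted in this abstract, the sketch below shows how top-k accuracy is typically computed from per-class scores: a sample counts as correct if its true label is among the k highest-scoring classes. The function name, scores, and labels are made-up illustrations and are unrelated to the paper's data or code.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, true_label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(
            range(len(class_scores)),
            key=lambda c: class_scores[c],
            reverse=True,
        )
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Three samples, three classes; each row holds one score per class.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [0, 2, 2]

print("top-1:", top_k_accuracy(scores, labels, 1))
print("top-2:", top_k_accuracy(scores, labels, 2))
```

Top-k accuracy can never decrease as k grows, which is why a top-5 figure is always at least as high as the corresponding top-1 figure.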

[AI-39] Poisoned LangChain: Jailbreak LLMs by LangChain

Link: https://arxiv.org/abs/2406.18122
Authors: Ziqiu Wang, Jun Liu, Shengkai Zhang, Yang Yang
Keywords: large language models, natural language processing, language models, large language, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures. This paper is a submission to ACM TURC. It has been accepted by the editor of the organizer

Abstract:With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are integrating more into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending against LLMs are continuously evolving. One significant type of attack is the jailbreak attack, which is designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model’s lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we conduct the first work to propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues. We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.

[AI-40] ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Link: https://arxiv.org/abs/2406.18120
Authors: Ahmed Heakl, Youssef Zaghloul, Mennatullah Ali, Rania Hossam, Walid Gomaa
Keywords: Egyptian Arabic, Egyptian Arabic recognition, translating code-switched Egyptian, code-switched Egyptian Arabic-English, automatic speech recognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 5 tables, 6th International Conference on AI in Computational Linguistics

Abstract:Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state-of-the-art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: this http URL, Models: this http URL.

[AI-41] BADGE: BADminton report Generation and Evaluation with LLM

Link: https://arxiv.org/abs/2406.18116
Authors: Shang-Hsuan Chiang,Lin-Wei Chao,Kuang-Da Wang,Chih-Chuan Wang,Wen-Chih Peng
Keywords: enjoys widespread popularity, matches generally include, Badminton enjoys widespread, Large Language Model, generally include details
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted by IJCAI 2024 Workshop: The 2nd International Workshop on Intelligent Technologies for Precision Sports Science (IT4PSS)

Abstract:Badminton enjoys widespread popularity, and reports on matches generally include details such as player names, game scores, and ball types, providing audiences with a comprehensive view of the games. However, writing these reports can be a time-consuming task. This challenge led us to explore whether a Large Language Model (LLM) could automate the generation and evaluation of badminton reports. We introduce a novel framework named BADGE, designed for this purpose using LLM. Our method consists of two main phases: Report Generation and Report Evaluation. Initially, badminton-related data is processed by the LLM, which then generates a detailed report of the match. We tested different Input Data Types, In-Context Learning (ICL), and LLM, finding that GPT-4 performs best when using CSV data type and the Chain of Thought prompting. Following report generation, the LLM evaluates and scores the reports to assess their quality. Our comparisons between the scores evaluated by GPT-4 and human judges show a tendency to prefer GPT-4 generated reports. Since the application of LLM in badminton reporting remains largely unexplored, our research serves as a foundational step for future advancements in this area. Moreover, our method can be extended to other sports games, thereby enhancing sports promotion. For more details, please refer to this https URL.

[AI-42] Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps

Link: https://arxiv.org/abs/2406.18115
Authors: Dicong Qiu,Wenzong Ma,Zhenfu Pan,Hui Xiong,Junwei Liang
Keywords: Open-Vocabulary Mobile Manipulation, Mobile Manipulation, crucial capability, capability for autonomous, posed by unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Open-vocabulary, Mobile Manipulation, Dynamic Environments, 3D Semantic Maps, Zero-shot, LLMs, VLMs, 18 pages, 2 figures

Abstract:Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots, especially when faced with the challenges posed by unknown and dynamic environments. This task requires robots to explore and build a semantic understanding of their surroundings, generate feasible plans to achieve manipulation goals, adapt to environmental changes, and comprehend natural language instructions from humans. To address these challenges, we propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pretraining visual-language models (VLMs) combined with dense 3D entity reconstruction to build 3D semantic maps. Additionally, we utilize large language models (LLMs) for spatial region abstraction and online planning, incorporating human instructions and spatial semantic context. We have built a 10-DoF mobile manipulation robotic platform JSR-1 and demonstrated in real-world robot experiments that our proposed framework can effectively capture spatial semantics and process natural language user instructions for zero-shot OVMM tasks under dynamic environment settings, with an overall navigation and task success rate of 80.95% and 73.33% over 105 episodes, and better SFT and SPL by 157.18% and 19.53% respectively compared to the baseline. Furthermore, the framework is capable of replanning towards the next most probable candidate location based on the spatial semantic context derived from the 3D semantic map when initial plans fail, keeping an average success rate of 76.67%.

[AI-43] LLM-Driven Multimodal Opinion Expression Identification

Link: https://arxiv.org/abs/2406.18088
Authors: Bonian Jia,Huiyao Chen,Yueheng Sun,Meishan Zhang,Min Zhang
Keywords: Opinion Expression Identification, essential in NLP, NLP for applications, Expression Identification, depression diagnosis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 3 Figures

Abstract:Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method STOEI, which combines speech and text modal to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves the performance while our method outperforms existing methods by 9.20% and obtains SOTA results.

[AI-44] EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models

Link: https://arxiv.org/abs/2406.18087
Authors: Chun-Chieh Liao,Wei-Ting Kuo,I-Hsuan Hu,Yen-Chen Shih,Jun-En Ding,Feng Liu,Fang-Ming Hung
Keywords: involves in-person consultations, Traditional diagnosis, diseases involves in-person, Electronic Health Records, involves in-person
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan’s hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital’s backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at this https URL.

[AI-45] Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction

Link: https://arxiv.org/abs/2406.18078
Authors: Yice Zhang,Jie Zeng,Weiming Hu,Ziyi Wang,Shiwei Chen,Ruifeng Xu
Keywords: Sentiment Quad Prediction, Aspect Sentiment Quad, aspect-based sentiment analysis, Quad Prediction, Sentiment Quad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2024 Main Conference

Abstract:Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer’s effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We release our code and data at this https URL.

[AI-46] Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

Link: https://arxiv.org/abs/2406.18074
Authors: Song Tang,Shaxu Yan,Xiaozhi Qi,Jianxin Gao,Mao Ye,Jianwei Zhang,Xiatian Zhu
Keywords: Few-shot Semantic Segmentation, single labelled training, labelled training sample, aims to adapt, sample per class
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and not highly complex backgrounds, e.g., natural images. This makes such models suboptimal for medical imaging, where neither condition holds. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.

[AI-47] Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents

Link: https://arxiv.org/abs/2406.18062
Authors: Chung-En Sun,Sicun Gao,Tsui-Wei Weng
Keywords: deep reinforcement learning, randomized smoothing emerging, reinforcement learning, enhancing this attribute, remains a paramount
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in ICML 2024

Abstract:Robustness remains a paramount concern in deep reinforcement learning (DRL), with randomized smoothing emerging as a key technique for enhancing this attribute. However, a notable gap exists in the performance of current smoothed DRL agents, often characterized by significantly low clean rewards and weak robustness. In response to this challenge, our study introduces innovative algorithms aimed at training effective smoothed robust DRL agents. We propose S-DQN and S-PPO, novel approaches that demonstrate remarkable improvements in clean rewards, empirical robustness, and robustness guarantee across standard RL benchmarks. Notably, our S-DQN and S-PPO agents not only significantly outperform existing smoothed agents by an average factor of 2.16× under the strongest attack, but also surpass previous robustly-trained agents by an average factor of 2.13×. This represents a significant leap forward in the field. Furthermore, we introduce Smoothed Attack, which is 1.89× more effective in decreasing the rewards of smoothed agents than existing adversarial attacks.
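
Editor's sketch: the abstract above relies on randomized smoothing. The snippet below only illustrates the generic idea of acting on a Q-function averaged over Gaussian noise added to the state; it is not the authors' S-DQN or S-PPO training procedure. The names `toy_q`, `sigma`, and `num_samples` are hypothetical and chosen for this example.

```python
import random
from typing import Callable, List, Optional, Sequence


def smoothed_q_values(
    q_fn: Callable[[Sequence[float]], List[float]],
    state: Sequence[float],
    sigma: float = 0.1,
    num_samples: int = 100,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Monte Carlo estimate of E[q_fn(state + noise)] with noise ~ N(0, sigma^2)."""
    if rng is None:
        rng = random.Random(0)
    totals: Optional[List[float]] = None
    for _ in range(num_samples):
        # Perturb every state dimension with independent Gaussian noise.
        noisy_state = [s + rng.gauss(0.0, sigma) for s in state]
        q_values = q_fn(noisy_state)
        if totals is None:
            totals = [0.0] * len(q_values)
        for i, value in enumerate(q_values):
            totals[i] += value
    return [t / num_samples for t in totals]


def smoothed_greedy_action(q_fn, state, sigma=0.1, num_samples=100, rng=None) -> int:
    """Pick the action with the highest smoothed Q-value."""
    q_values = smoothed_q_values(q_fn, state, sigma, num_samples, rng)
    return max(range(len(q_values)), key=q_values.__getitem__)


def toy_q(state: Sequence[float]) -> List[float]:
    """Hypothetical two-action Q-function: action 0 is better when state[0] > 0."""
    return [state[0], -state[0]]


if __name__ == "__main__":
    print(smoothed_greedy_action(toy_q, [1.0], rng=random.Random(0)))
```

A larger `sigma` averages over a wider neighbourhood of states, which is the usual trade-off between robustness to perturbations and reward on clean inputs.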

[AI-48] AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Link: https://arxiv.org/abs/2406.18060
Authors: Yifan Yang,Kai Zhen,Ershad Banijamal,Athanasios Mouchtaris,Zheng Zhang
Keywords: natural language processing, Fine-tuning large language, language processing tasks, large language models, achieved remarkable performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
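
Editor's sketch: zeroth-order (ZO) fine-tuning, as described in the abstract above, estimates gradients from forward passes only. The snippet below shows a generic two-point estimator averaged over several random directions ("queries"), applied to a toy quadratic loss. It is not the AdaZeta algorithm; the tensorized adapter and the adaptive query schedule are not reproduced. `zo_gradient`, `zo_sgd`, and `quadratic` are hypothetical names.

```python
import random
from typing import Callable, List, Optional, Sequence


def zo_gradient(
    loss_fn: Callable[[Sequence[float]], float],
    params: Sequence[float],
    eps: float = 1e-3,
    num_queries: int = 4,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Two-point zeroth-order gradient estimate averaged over random directions."""
    if rng is None:
        rng = random.Random(0)
    grad = [0.0] * len(params)
    for _ in range(num_queries):
        # Sample a random Gaussian direction z.
        z = [rng.gauss(0.0, 1.0) for _ in params]
        plus = [p + eps * zi for p, zi in zip(params, z)]
        minus = [p - eps * zi for p, zi in zip(params, z)]
        # Directional derivative from two forward passes, no backpropagation.
        scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / num_queries
    return grad


def zo_sgd(loss_fn, params, lr=0.05, steps=200, num_queries=4, seed=0) -> List[float]:
    """Plain SGD driven by the zeroth-order estimate above."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, params, num_queries=num_queries, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def quadratic(x: Sequence[float]) -> float:
    """Toy loss with true gradient 2 * x."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    start = [1.0, -2.0, 3.0, -1.0, 2.0]
    print(quadratic(start), quadratic(zo_sgd(quadratic, start)))
```

Averaging over more queries lowers the variance of the estimate at the cost of extra forward passes, which is why the number of queries per step is a natural quantity to schedule.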

[AI-49] Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies

Link: https://arxiv.org/abs/2406.18053
Authors: Yu Luo,Fuchun Sun,Tianying Ji,Xianyuan Zhan
Keywords: Hierarchical reinforcement learning, addresses complex long-horizon, reinforcement learning, addresses complex, skillfully decomposing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Hierarchical reinforcement learning (HRL) addresses complex long-horizon tasks by skillfully decomposing them into subgoals. Therefore, the effectiveness of HRL is greatly influenced by subgoal reachability. Typical HRL methods only consider subgoal reachability from the unilateral level, where a dominant level enforces compliance to the subordinate level. However, we observe that when the dominant level becomes trapped in local exploration or generates unattainable subgoals, the subordinate level is negatively affected and cannot follow the dominant level’s actions. This can potentially make both levels stuck in local optima, ultimately hindering subsequent subgoal reachability. Allowing real-time bilateral information sharing and error correction would be a natural cure for this issue, which motivates us to propose a mutual response mechanism. Based on this, we propose the Bidirectional-reachable Hierarchical Policy Optimization (BrHPO)–a simple yet effective algorithm that also enjoys computation efficiency. Experiment results on a variety of long-horizon tasks showcase that BrHPO outperforms other state-of-the-art HRL baselines, coupled with a significantly higher exploration efficiency and robustness.

[AI-50] Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources

Link: https://arxiv.org/abs/2406.18049
Authors: Yiming Li,Deepthi Viswaroopan,William He,Jianfu Li,Xu Zuo,Hua Xu,Cui Tao
Keywords: Adverse event, Traditional deep learning, deep learning models, deep learning, Traditional deep
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Adverse event (AE) extraction following COVID-19 vaccines from text data is crucial for monitoring and analyzing the safety profiles of immunizations. Traditional deep learning models are adept at learning intricate feature representations and dependencies in sequential data, but often require extensive labeled data. In contrast, large language models (LLMs) excel in understanding contextual information, but exhibit unstable performance on named entity recognition tasks, possibly due to their broad but unspecific training. This study aims to evaluate the effectiveness of LLMs and traditional deep learning models in AE extraction, and to assess the impact of ensembling these models on performance. In this study, we utilized reports and posts from the VAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal was to extract three types of entities: “vaccine”, “shot”, and “ae”. We explored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5, GPT-4, and Llama-2, as well as traditional deep learning models like RNN and BioBERT. To enhance performance, we created ensembles of the three models with the best performance. For evaluation, we used strict and relaxed F1 scores to evaluate the performance for each entity type, and micro-average F1 was used to assess the overall performance. The ensemble model achieved the highest performance in “vaccine”, “shot”, and “ae” with strict F1-scores of 0.878, 0.930, and 0.925, respectively, along with a micro-average score of 0.903. In conclusion, this study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs, for extracting AE-related information. This study contributes to the advancement of biomedical natural language processing, providing valuable insights into improving AE extraction from text data for pharmacovigilance and public health surveillance.
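
Editor's sketch: the abstract above reports strict per-entity F1 and a micro-average F1. The snippet below shows how these are commonly computed from entity spans, assuming a strict match means identical type and identical offsets. The span encoding and the example data are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Tuple

Entity = Tuple[str, int, int]  # (entity_type, start_offset, end_offset)


def strict_counts(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) under exact matching."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    return tp, len(pred_set) - tp, len(gold_set) - tp


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_type_and_micro_f1(
    gold: Iterable[Entity], pred: Iterable[Entity]
) -> Tuple[Dict[str, float], float]:
    """Strict F1 for each entity type plus the micro-average over all types."""
    gold_list: List[Entity] = list(gold)
    pred_list: List[Entity] = list(pred)
    types = sorted({e[0] for e in gold_list} | {e[0] for e in pred_list})
    per_type: Dict[str, float] = {}
    totals = [0, 0, 0]
    for entity_type in types:
        counts = strict_counts(
            [e for e in gold_list if e[0] == entity_type],
            [e for e in pred_list if e[0] == entity_type],
        )
        per_type[entity_type] = f1_from_counts(*counts)
        # Micro-averaging pools the raw counts before computing F1.
        totals = [a + b for a, b in zip(totals, counts)]
    return per_type, f1_from_counts(*totals)


if __name__ == "__main__":
    gold = [("vaccine", 0, 1), ("shot", 3, 4), ("ae", 6, 8)]
    pred = [("vaccine", 0, 1), ("shot", 3, 5), ("ae", 6, 8)]  # "shot" boundary is off
    print(per_type_and_micro_f1(gold, pred))
```

A relaxed variant would count a prediction as correct when its span overlaps a gold span of the same type rather than requiring identical offsets.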

[AI-51] PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Link: https://arxiv.org/abs/2406.18045
Authors: Linqing Chen,Weilei Wang,Zilong Bai,Peng Xu,Yan Fang,Jie Fang,Wentao Wu,Lizhi Zhou,Ruiji Zhang,Yubin Xia,Chaobo Xu,Ran Hu,Licong Xu,Qijun Cai,Haoran Hua,Jing Sun,Jin Liu,Tian Qiu,Haowen Liu,Meng Hu,Xiuwen Li,Fei Gao,Yufu Wang,Lin Tie,Chaochao Wang,Jianping Lu,Cheng Sun,Yixin Wang,Shengjie Yang,Yuancheng Li,Lu Jin,Lisha Zhang,Fu Bian,Changyang Tu
Keywords: Natural Language Processing, revolutionized Natural Language, complex feature engineering, revolutionized Natural, Large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, this suggests a promising path for enhanced research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains.

[AI-52] Multimodal foundation world models for generalist embodied agents

Link: https://arxiv.org/abs/2406.18043
Authors: Pietro Mazzaglia,Tim Verbelen,Bart Dhoedt,Aaron Courville,Sai Rajeswar
Keywords: solve multitudes, Learning, models, foundation, tasks
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Learning generalist embodied agents, able to solve multitudes of tasks in different domains is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain’s dynamics, and learns the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.

[AI-53] Boosting Soft Q-Learning by Bounding

Link: https://arxiv.org/abs/2406.18033
Authors: Jacob Adamczyk,Volodymyr Makarenko,Stas Tiomkin,Rahul V. Kulkarni
Keywords: leverage past experience, solving new tasks, agent ability, ability to leverage, leverage past
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: To appear in the 1st Reinforcement Learning Conference

Abstract:An agent’s ability to leverage past experience is critical for efficiently solving new tasks. Prior work has focused on using value function estimates to obtain zero-shot approximations for solutions to a new task. In soft Q-learning, we show how any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance which we validate experimentally. Notably, we find that the proposed framework suggests an alternative method for updating the Q-function, leading to boosted performance.

[AI-54] Automated Clinical Data Extraction with Knowledge Conditioned LLMs

Link: https://arxiv.org/abs/2406.18027
Authors: Diya Li,Asim Kadav,Aijing Gao,Rui Li,Richard Bourgon
Keywords: medical imaging reports, lung-related diseases, medical imaging, crucial for research, care of lung-related
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by addressing the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of lesion description text into additional structured fields. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.

[AI-55] AutoOPE: Automated Off-Policy Estimator Selection

Link: https://arxiv.org/abs/2406.18022
Authors: Nicolò Felicioni,Michael Benigni,Maurizio Ferrari Dacrema
Keywords: Off-Policy Evaluation, OPE, consists of evaluating, data collected, counterfactual policies
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies with data collected by another one. This problem is of utmost importance for various application domains, e.g., recommendation systems, medical treatments, and many others. To solve the OPE problem, we resort to estimators, which aim to estimate in the most accurate way possible the performance that the counterfactual policies would have had if they were deployed in place of the logging policy. In the literature, several estimators have been developed, all with different characteristics and theoretical guarantees. Therefore, there is no dominant estimator, and each estimator may be the best one for different OPE problems, depending on the characteristics of the dataset at hand. While the selection of the estimator is a crucial choice for an accurate OPE, this problem has been widely overlooked in the literature. We propose an automated data-driven OPE estimator selection method based on machine learning. In particular, the core idea we propose in this paper is to create several synthetic OPE tasks and use a machine learning model trained to predict the best estimator for those synthetic tasks. We empirically show how our method is able to generalize to unseen tasks and make a better estimator selection compared to a baseline method on several real-world datasets, with a computational cost significantly lower than the one of the baseline.
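
Editor's sketch: to make the notion of an OPE estimator concrete, the snippet below implements two standard ones, inverse propensity scoring (IPS) and its self-normalized variant (SNIPS), on a tiny hand-made log. It illustrates why estimator choice matters and is not the AutoOPE selection method. The log layout and the `always_action_1` policy are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (context, action, reward, probability the logging policy gave to that action)
LoggedSample = Tuple[str, int, float, float]


def ips_estimate(
    logs: Sequence[LoggedSample],
    target_prob: Callable[[str, int], float],
) -> float:
    """Inverse propensity scoring: mean of importance-weighted rewards."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def snips_estimate(
    logs: Sequence[LoggedSample],
    target_prob: Callable[[str, int], float],
) -> float:
    """Self-normalized IPS: weighted rewards divided by the sum of the weights."""
    weighted_reward = 0.0
    weight_sum = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_prob(context, action) / logging_prob
        weighted_reward += weight * reward
        weight_sum += weight
    return weighted_reward / weight_sum if weight_sum else 0.0


def always_action_1(context: str, action: int) -> float:
    """Hypothetical deterministic target policy that always plays action 1."""
    return 1.0 if action == 1 else 0.0


if __name__ == "__main__":
    # Logging policy was uniform over two actions, so every logging_prob is 0.5.
    logs = [("u1", 1, 1.0, 0.5), ("u2", 1, 0.0, 0.5), ("u3", 0, 0.0, 0.5)]
    print(ips_estimate(logs, always_action_1), snips_estimate(logs, always_action_1))
```

On this small log the two estimators disagree (2/3 versus 1/2), a miniature version of the situation the abstract describes, where no single estimator is best for every dataset.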

[AI-56] MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

Link: https://arxiv.org/abs/2406.18020
Authors: Muzhen Cai,Sendong Zhao,Haochun Wang,Yanrui Du,Zewen Qiang,Bing Qin,Ting Liu
Keywords: Artificial Intelligence predicts, Intelligence predicts drug, Artificial Intelligence, predicts drug properties, Intelligence predicts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 8 pages, 5 figures

Abstract:Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities only use molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose a multi-granularity fusion method called MolFusion. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations, and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.

[AI-57] View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis

Link: https://arxiv.org/abs/2406.18012
Authors: Subin Varghese,Vedhus Hoskere
Keywords: requires identifying visual, infrastructure assets typically, assets typically requires, typically requires identifying, scenes periodically photographed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The inspection and monitoring of infrastructure assets typically requires identifying visual anomalies in scenes periodically photographed over time. Images collected manually or with robots such as unmanned aerial vehicles from the same scene at different instances in time are typically not perfectly aligned. Supervised segmentation methods can be applied to identify known problems, but unsupervised anomaly detection approaches are required when unknown anomalies occur. Current unsupervised pixel-level anomaly detection methods have mainly been developed for industrial settings where the camera position is known and constant. However, we find that these methods fail to generalize to the case when images are not perfectly aligned. We term the problem of unsupervised anomaly detection between two such imperfectly aligned sets of images as Scene Anomaly Detection (Scene AD). We present a novel network termed OmniAD to address the Scene AD problem posed. Specifically, we refine the anomaly detection method reverse distillation to achieve a 40% increase in pixel-level anomaly detection performance. The network’s performance is further demonstrated to improve with two new data augmentation strategies proposed that leverage novel view synthesis and camera localization to improve generalization. We validate our approach with qualitative and quantitative results on a new dataset, ToyCity, the first Scene AD dataset with multiple objects, as well as on the established single object-centric dataset, MAD. this https URL

[AI-58] Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher

Link: https://arxiv.org/abs/2406.18002
Authors: Hyunjong Ok,Jegwang Ryu,Jaeho Lee
Keywords: sLLMs efficiently utilize, generative quality, improve their generative, efficiently utilize, LLM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:How can sLLMs efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving birth to many decoding algorithms that utilize supervision without further training. However, it is still unclear what is an effective strategy under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm to effectively aggregate the sLLM and LLM predictions on initial tokens so that the generated tokens can more accurately condition the subsequent token generation by sLLM only. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through our experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.

[AI-59] Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models

Link: https://arxiv.org/abs/2406.17992
Authors: Bohan Jiang,Chengshuai Zhao,Zhen Tan,Huan Liu
Keywords: current efforts overlook, detecting evolving LLM-generated, detecting disinformation generated, evolving LLM-generated disinformation, current efforts
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that DELD significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research.

[AI-60] Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models

Link: https://arxiv.org/abs/2406.17990
Authors: Vikas Yadav, Hyuk Joon Kwon, Vijay Srinivasan, Hongxia Jin
Keywords: Question Answer Generation, Answer Generation, question answering systems, Question Answer, explicit diversity conditions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at COLING 2024

Abstract: Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need for explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA tasks over existing widely adopted implicit diversity techniques. In particular, generated QA pairs from explicit diversity conditions, when used to train the downstream QA model, result in an average 4.1% exact match and 4.5% F1 improvement over QAG from implicit sampling techniques on SQuADDU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
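
The exact match (EM) and F1 figures quoted in this abstract are the usual extractive-QA metrics. Below is a minimal Python sketch of how they are commonly computed, assuming SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names `normalize`, `exact_match`, and `token_f1` are illustrative and are not taken from the paper, whose exact evaluation script is not described here.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower!"))
    print(token_f1("in Paris France", "Paris"))
```

Dataset-level EM and F1 are then the averages of these per-question scores, typically taking the maximum over all reference answers when a question has several.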

[AI-61] Multi-step Knowledge Retrieval and Inference over Unstructured Data

Link: https://arxiv.org/abs/2406.17987
Authors: Aditya Kalyanpur, Kailash Saravanakumar, Victor Barres, CJ McFate, Lori Moon, Nati Seifu, Maksim Eremeev, Jose Barrera, Eric Brown, David Ferrucci
Keywords: Large Language Models, revolutionized natural language, natural language applications, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medical, legal and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform, that is designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora’s neuro-symbolic approach effectively addresses these issues. We provide an overview of the system architecture, key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora’s superior performance compared to well-known LLM and RAG baselines.

[AI-62] Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

Link: https://arxiv.org/abs/2406.17969
Authors: Hanqi Yan, Yanzheng Xiang, Guangyi Chen, Yifei Wang, Lin Gui, Yulan He
Keywords: recent studies focus, large language models, recent studies, basic units, interpret the intrinsic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on the monosemanticity of their basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by Wang et al. (2024), which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity in the preference alignment process. Consequently, we apply feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. The experiments show that our method not only enhances representation diversity and activation sparsity but also improves preference alignment performance.

[AI-63] Efficient Document Ranking with Learnable Late Interactions

Link: https://arxiv.org/abs/2406.17968
Authors: Ziwei Ji, Himanshu Jain, Andreas Veit, Sashank J. Reddi, Sadeep Jayasumana, Ankit Singh Rawat, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
Keywords: information retrieval, fundamental approaches, LITE, document token embeddings, models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
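
For context, the kind of hand-crafted late-interaction scorer this abstract contrasts LITE with can be illustrated by the ColBERT-style MaxSim rule: each query token takes its best dot-product match among the document tokens, and these maxima are summed. The Python sketch below shows that baseline next to a single-vector dual-encoder score. It is not LITE's learned scorer, and the function names and toy two-dimensional embeddings are assumptions made purely for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_encoder_score(query_vec: Vector, doc_vec: Vector) -> float:
    """DE scoring: one pooled vector per side, a single dot product."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """ColBERT-style late interaction: sum over query tokens of the
    maximum similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


if __name__ == "__main__":
    # Toy token embeddings (hypothetical values).
    query = [[1.0, 0.0], [0.0, 1.0]]
    document = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim_score(query, document))
```

Note that `maxsim_score` needs every document token embedding at query time, which is the storage and latency burden the abstract mentions.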

[AI-64] NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization

Link: https://arxiv.org/abs/2406.17961
Authors: Md Mahadi Hasan Nahid, Davood Rafiei
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, parsing textual data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: Work in Progress

Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.

[AI-65] MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation

Link: https://arxiv.org/abs/2406.17960
Authors: Liuyi Wang, Zongtao He, Mengjiao Shen, Jingwei Yang, Chengju Liu, Qijun Chen
Keywords: Embodied Artificial Intelligence, Artificial Intelligence, Embodied Artificial, recent large models, excessive parameter sizes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Despite the remarkable developments of recent large models in Embodied Artificial Intelligence (E-AI), their integration into robotics is hampered by their excessive parameter sizes and computational demands. Towards the Vision-and-Language Navigation (VLN) task, a core task in E-AI, this paper reveals the great potential of using knowledge distillation for obtaining lightweight student models by proposing a Meta-Ability Guided Interactive Chain-of-distillation (MAGIC) method. Specifically, a Meta-Ability Knowledge Distillation (MAKD) framework is proposed for decoupling and refining the necessary meta-abilities of VLN agents. A Meta-Knowledge Randomization Weighting (MKRW) and a Meta-Knowledge Transferable Determination (MKTD) module are incorporated to dynamically adjust aggregation weights at the meta-ability and sample levels, respectively. Moving beyond the traditional one-step unidirectional distillation, an Interactive Chain-of-Distillation (ICoD) learning strategy is proposed to allow students to give feedback to teachers, forming a new multi-step teacher-student co-evolution pipeline. Remarkably, on the R2R test unseen public leaderboard, our smallest model, MAGIC-S, with only 5% (11M) of the teacher’s size, outperforms all previous methods under the same training data. Additionally, our largest model, MAGIC-L, surpasses the previous state-of-the-art by 5.84% in SPL and 3.18% in SR. Furthermore, a new dataset was collected and annotated from our living environments, where MAGIC-S demonstrated superior performance and real-time efficiency. Our code is publicly available on this https URL.

[AI-66] Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Link: https://arxiv.org/abs/2406.17957
Authors: Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg
Keywords: Large Language Model, Large Language, demonstrated remarkable capabilities, handling large speech, large speech datasets
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Published as a conference paper at INTERSPEECH 2024

Abstract:Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.

[AI-67] The Overcooked Generalisation Challenge

Link: https://arxiv.org/abs/2406.17949
Authors: Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling
Keywords: agents’ zero-shot cooperation, zero-shot cooperation abilities, Overcooked Generalisation Challenge, study agents’ zero-shot, capture generalisation abilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages

Abstract:We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents’ zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license: this https URL. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.

[AI-68] Semi-supervised classification of dental conditions in panoramic radiographs using large language model and instance segmentation: A real-world dataset evaluation

Link: https://arxiv.org/abs/2406.17915
Authors: Bernardo Silva, Jefferson Fontinele, Carolina Letícia Zilli Vieira, João Manuel R.S. Tavares, Patricia Ramos Cury, Luciano Oliveira
Keywords: vast diagnostic opportunities, offer vast diagnostic, training supervised deep, radiographs offer vast, supervised deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 43 pages, 12 figures, 9 tables

Abstract:Dental panoramic radiographs offer vast diagnostic opportunities, but training supervised deep learning networks for automatic analysis of those radiology images is hampered by a shortage of labeled data. Here, a different perspective on this problem is introduced. A semi-supervised learning framework is proposed to classify thirteen dental conditions on panoramic radiographs, with a particular emphasis on teeth. Large language models were explored to annotate the most common dental conditions based on dental reports. Additionally, a masked autoencoder was employed to pre-train the classification neural network, and a Vision Transformer was used to leverage the unlabeled data. The analyses were validated using two of the most extensive datasets in the literature, comprising 8,795 panoramic radiographs and 8,029 paired reports and images. Encouragingly, the results consistently met or surpassed the baseline metrics for the Matthews correlation coefficient. A comparison of the proposed solution with human practitioners, supported by statistical analysis, highlighted its effectiveness and performance limitations; based on the degree of agreement among specialists, the solution demonstrated an accuracy level comparable to that of a junior specialist.
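
The Matthews correlation coefficient (MCC) cited in this abstract summarizes a confusion matrix in a single value between -1 and 1, and it remains informative when classes are imbalanced. Below is a minimal Python sketch of the standard binary form. Treating each dental condition as its own binary problem is an assumption made here for illustration, and the helper name `mcc_binary` is hypothetical rather than taken from the paper.

```python
import math


def mcc_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts for one condition.
    print(mcc_binary(tp=5, tn=5, fp=0, fn=0))   # perfect agreement
    print(mcc_binary(tp=0, tn=0, fp=5, fn=5))   # total disagreement
```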

[AI-69] Transforming Software Development: Evaluating the Efficiency and Challenges of GitHub Copilot in Real-World Projects

Link: https://arxiv.org/abs/2406.17910
Authors: Ruchika Pandey, Prabhat Singh, Raymond Wei, Shaila Shankar
Keywords: Generative AI technologies, product development lifecycle, technologies promise, promise to transform, transform the product
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures

Abstract:Generative AI technologies promise to transform the product development lifecycle. This study evaluates the efficiency gains, areas for improvement, and emerging challenges of using GitHub Copilot, an AI-powered coding assistant. We identified 15 software development tasks and assessed Copilot’s benefits through real-world projects on large proprietary code bases. Our findings indicate significant reductions in developer toil, with up to 50% time saved in code documentation and autocompletion, and 30-40% in repetitive coding tasks, unit test generation, debugging, and pair programming. However, Copilot struggles with complex tasks, large functions, multiple files, and proprietary contexts, particularly with C/C++ code. We project a 33-36% time reduction for coding-related tasks in a cloud-first software development lifecycle. This study aims to quantify productivity improvements, identify underperforming scenarios, examine practical benefits and challenges, investigate performance variations across programming languages, and discuss emerging issues related to code quality, security, and developer experience.

[AI-70] Unbiasing on the Fly: Explanation-Guided Human Oversight of Machine Learning System Decisions

Link: https://arxiv.org/abs/2406.17906
Authors: Hussaini Mamman, Shuib Basri, Abdullateef Balogun, Abubakar Abdullahi Imam, Ganesh Kumar, Luiz Fernando Capretz
Keywords: healthcare raises growing, raises growing concerns, discriminatory decision-making based, protected attributes, widespread adoption
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:The widespread adoption of ML systems across critical domains like hiring, finance, and healthcare raises growing concerns about their potential for discriminatory decision-making based on protected attributes. While efforts to ensure fairness during development are crucial, they leave deployed ML systems vulnerable to potentially exhibiting discrimination during their operations. To address this gap, we propose a novel framework for on-the-fly tracking and correction of discrimination in deployed ML systems. Leveraging counterfactual explanations, the framework continuously monitors the predictions made by an ML system and flags discriminatory outcomes. When flagged, post-hoc explanations related to the original prediction and the counterfactual alternatives are presented to a human reviewer for real-time intervention. This human-in-the-loop approach empowers reviewers to accept or override the ML system decision, enabling fair and responsible ML operation under dynamic settings. While further work is needed for validation and refinement, this framework offers a promising avenue for mitigating discrimination and building trust in ML systems deployed in a wide range of domains.

[AI-71] Application of Liquid Rank Reputation System for Twitter Trend Analysis on Bitcoin

Link: https://arxiv.org/abs/2406.17904
Authors: Abhishek Saxena (Novosibirsk State University), Anton Kolonin (Novosibirsk State University)
Keywords: analyzing Bitcoin trends, Analyzing social media, create a win-win, win-win situation, Liquid Rank Reputation
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Under publication in 2024 Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology, Yekaterinburg, Russia

Abstract:Analyzing social media trends can create a win-win situation for both creators and consumers. Creators can receive fair compensation, while consumers gain access to engaging, relevant, and personalized content. This paper proposes a new model for analyzing Bitcoin trends on Twitter by incorporating a ‘liquid democracy’ approach based on user reputation. This system aims to identify the most impactful trends and their influence on Bitcoin prices and trading volume. It uses a Twitter sentiment analysis model based on a reputation rating system to determine the impact on Bitcoin price change and traded volume. In addition, the reputation model considers the users’ higher-order friends on the social network (the initial Twitter input channels in our case study) to improve the accuracy and diversity of the reputation results. We analyze Bitcoin-related news on Twitter to understand how trends and user sentiment, measured through our Liquid Rank Reputation System, affect Bitcoin price fluctuations and trading activity within the studied time frame. This reputation model can also be used as an additional layer in other trend and sentiment analysis models. The paper proposes the implementation, challenges, and future scope of the liquid rank reputation model.

[AI-72] Human-centered In-building Embodied Delivery Benchmark

Link: https://arxiv.org/abs/2406.17898
Authors: Zhuoqun Xu, Yang Liu, Xiaoqi Li, Jiyao Zhang, Hao Dong
Keywords: accepted and popularized, leading people, widely accepted, people to naturally, potential for commercialization
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Recently, the concept of embodied intelligence has been widely accepted and popularized, leading people to naturally consider the potential for commercialization in this field. In this work, we propose a specific commercial scenario simulation, human-centered in-building embodied delivery. Furthermore, for this scenario, we have developed a brand-new virtual environment system from scratch, constructing a multi-level connected building space modeled after a polar research station. This environment also includes autonomous human characters and robots with grasping and mobility capabilities, as well as a large number of interactive items. Based on this environment, we have built a delivery dataset containing 13k language instructions to guide robots in providing services. We simulate human behavior through human characters and sample their various needs in daily life. Finally, we proposed a method centered around a large multimodal model to serve as the baseline system for this dataset. Compared to past embodied data work, our work focuses on a virtual environment centered around human-robot interaction for commercial scenarios. We believe this will bring new perspectives and exploration angles to the embodied community.

[AI-73] CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

Link: https://arxiv.org/abs/2406.17888
Authors: Nafis Neehal, Bowen Wang, Shayom Debopadhaya, Soham Dan, Keerthiram Murugesan, Vibha Anand, Kristin P. Bennett
Keywords: assess language models, baseline features, language models, aiding clinical study, benchmark to assess
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models’ ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial’s start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: “CT-Repo,” containing baseline features from 1,690 clinical trials sourced from this http URL, and “CT-Pub,” a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. “ListMatch-LM” and “ListMatch-BERT” use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.

[AI-74] Federated Dynamical Low-Rank Training with Global Loss Convergence Guarantees

Link: https://arxiv.org/abs/2406.17887
Authors: Steffen Schotthöfer, M. Paul Laiu
Keywords: horizontal federated learning, significant performance bottlenecks, federated dynamical low-rank, federated learning, reduce client compute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Abstract:In this work, we propose a federated dynamical low-rank training (FeDLRT) scheme to reduce client compute and communication costs - two significant performance bottlenecks in horizontal federated learning. Our method builds upon dynamical low-rank splitting schemes for manifold-constrained optimization to create a global low-rank basis of network weights, which enables client training on a small coefficient matrix. A consistent global low-rank basis allows us to incorporate a variance correction scheme and prove global loss descent and convergence to a stationary point. Dynamic augmentation and truncation of the low-rank bases automatically optimizes computing and communication resource utilization. We demonstrate the efficiency of FeDLRT in an array of computer vision benchmarks and show a reduction of client compute and communication costs by up to an order of magnitude with minimal impacts on global accuracy.

[AI-75] Enabling Regional Explainability by Automatic and Model-agnostic Rule Extraction

Link: https://arxiv.org/abs/2406.17885
Authors: Yu Chen, Tianyu Cui, Alexander Capstick, Nan Fletcher-Loyd, Payam Barnaghi
Keywords: understanding patterns learned, rule extraction translates, extraction translates model, translates model knowledge, IF-THEN statements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:In Explainable AI, rule extraction translates model knowledge into logical rules, such as IF-THEN statements, crucial for understanding patterns learned by black-box models. This could significantly aid in fields like disease diagnosis, disease progression estimation, or drug discovery. However, such application domains often contain imbalanced data, with the class of interest underrepresented. Existing methods inevitably compromise the performance of rules for the minor class to maximise the overall performance. As the first attempt in this field, we propose a model-agnostic approach for extracting rules from specific subgroups of data, featuring automatic rule generation for numerical features. This method enhances the regional explainability of machine learning models and offers wider applicability compared to existing methods. We additionally introduce a new method for selecting features to compose rules, reducing computational costs in high-dimensional spaces. Experiments across various datasets and models demonstrate the effectiveness of our methods.

[AI-76] ET tu CLIP? Addressing Common Object Errors for Unseen Environments

Link: https://arxiv.org/abs/2406.17876
Authors: Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello
Keywords: enhance model generalization, employs pre-trained CLIP, pre-trained CLIP encoders, ALFRED task, introduce a simple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.

[AI-77] Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples Verification and Dynamic Feedback

Link: https://arxiv.org/abs/2406.17873
Authors: Zhongtao Miao, Kaiyan Zhao, Yoshimasa Tsuruoka
Keywords: large language models, large language, Current representations, language models, reasoning steps
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review, 25 figures, 8 tables, 29 pages

Abstract:Current representations used in reasoning steps of large language models can mostly be categorized into two main types: (1) natural language, which is difficult to verify; and (2) non-natural language, usually programming code, which is difficult for people who are unfamiliar with coding to read. In this paper, we propose to use a semi-structured form to represent reasoning steps of large language models. Specifically, we use relation tuples, which are not only human-readable but also machine-friendly and easier to verify than natural language. We implement a framework that includes three main components: (1) introducing relation tuples into the reasoning steps of large language models; (2) implementing an automatic verification process of reasoning steps with a local code interpreter based on relation tuples; and (3) integrating a simple and effective dynamic feedback mechanism, which we found helpful for self-improvement of large language models. The experimental results on various arithmetic datasets demonstrate the effectiveness of our method in improving the arithmetic reasoning ability of large language models. The source code is available at this https URL.

[AI-78] AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies

Link: https://arxiv.org/abs/2406.17864
Authors: Yi Zeng, Kevin Klyman, Andy Zhou, Yu Yang, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li
Keywords: United States, company policies worldwide, European Union, company policies, policies worldwide
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:We present a comprehensive AI risk taxonomy derived from eight government policies from the European Union, United States, and China and 16 company policies worldwide, making a significant step towards establishing a unified language for generative AI safety evaluation. We identify 314 unique risk categories organized into a four-tiered taxonomy. At the highest level, this taxonomy encompasses System Operational Risks, Content Safety Risks, Societal Risks, and Legal Rights Risks. The taxonomy establishes connections between various descriptions and approaches to risk, highlighting the overlaps and discrepancies between public and private sector conceptions of risk. By providing this unified framework, we aim to advance AI safety through information sharing across sectors and the promotion of best practices in risk mitigation for generative AI models and systems.

[AI-79] What type of inference is planning?

Link: https://arxiv.org/abs/2406.17863
Authors: Miguel Lázaro-Gredilla, Li Yang Ku, Kevin P. Murphy, Dileep George
Keywords: probabilistic graphical models, Multiple types, graphical models, probabilistic graphical, inference
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract: Multiple types of inference are available for probabilistic graphical models, e.g., marginal, maximum-a-posteriori, and even marginal maximum-a-posteriori. Which one do researchers mean when they talk about “planning as inference”? There is no consistency in the literature: different types are used, and their ability to do planning is further entangled with specific approximations or additional constraints. In this work we use the variational framework to show that all commonly used types of inference correspond to different weightings of the entropy terms in the variational problem, and that planning corresponds exactly to a different set of weights. This means that all the tricks of variational inference are readily applicable to planning. We develop an analogue of loopy belief propagation that allows us to perform approximate planning in factored-state Markov decision processes without incurring intractability due to the exponentially large state space. The variational perspective shows that the previous types of inference for planning are only adequate in environments with low stochasticity, and allows us to characterize each type by its own merits, disentangling the type of inference from the additional approximations that its practical use requires. We validate these results empirically on synthetic MDPs and tasks posed in the International Planning Competition.

[AI-80] Human-Object Interaction from Human-Level Instructions

Link: https://arxiv.org/abs/2406.17840
Authors: Zhen Wu, Jiaman Li, C. Karen Liu
Keywords: Intelligent agents, daily tasks based, human-level instructions, motion, instructions
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Intelligent agents need to autonomously navigate and interact within contextual environments to perform a wide range of daily tasks based on human-level instructions. These agents require a foundational understanding of the world, incorporating common sense and knowledge, to interpret such instructions. Moreover, they must possess precise low-level skills for movement and interaction to execute the detailed task plans derived from these instructions. In this work, we address the task of synthesizing continuous human-object interactions for manipulating large objects within contextual environments, guided by human-level instructions. Our goal is to generate synchronized object motion, full-body human motion, and detailed finger motion, all essential for realistic interactions. Our framework consists of a large language model (LLM) planning module and a low-level motion generator. We use LLMs to deduce spatial object relationships and devise a method for accurately determining their positions and orientations in target scene layouts. Additionally, the LLM planner outlines a detailed task plan specifying a sequence of sub-tasks. This task plan, along with the target object poses, serves as input for our low-level motion generator, which seamlessly alternates between navigation and interaction modules. We present the first complete system that can synthesize object motion, full-body motion, and finger motion simultaneously from human-level instructions. Our experiments demonstrate the effectiveness of our high-level planner in generating plausible target layouts and our low-level motion generator in synthesizing realistic interactions for diverse objects. Please refer to our project page for more results: this https URL.

[AI-81] InFiConD: Interactive No-code Fine-tuning with Concept-based Knowledge Distillation

Link: https://arxiv.org/abs/2406.17838
Authors: Jinbin Huang, Wenbin He, Liang Gou, Liu Ren, Chris Bryan
Keywords: limited computational resources, large-scale pre-trained models, downstream tasks, computational resources, Knowledge distillation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:The emergence of large-scale pre-trained models has heightened their application in various downstream tasks, yet deployment is a challenge in environments with limited computational resources. Knowledge distillation has emerged as a solution in such scenarios, whereby knowledge from large teacher models is transferred into smaller ‘student’ models, but this is a non-trivial process that traditionally requires technical expertise in AI/ML. To address these challenges, this paper presents InFiConD, a novel framework that leverages visual concepts to implement the knowledge distillation process and enable subsequent no-code fine-tuning of student models. We develop a novel knowledge distillation pipeline based on extracting text-aligned visual concepts from a concept corpus using multimodal models, and construct highly interpretable linear student models based on visual concepts that mimic a teacher model in a response-based manner. InFiConD’s interface allows users to interactively fine-tune the student model by manipulating concept influences directly in the user interface. We validate InFiConD via a robust usage scenario and user study. Our findings indicate that InFiConD’s human-in-the-loop and visualization-driven approach enables users to effectively create and analyze student models, understand how knowledge is transferred, and efficiently perform fine-tuning operations. We discuss how this work highlights the potential of interactive and visual methods in making knowledge distillation and subsequent no-code fine-tuning more accessible and adaptable to a wider range of users with domain-specific demands.

[AI-82] Transformer Normalisation Layers and the Independence of Semantic Subspaces

Link: https://arxiv.org/abs/2406.17837
Authors: Stephen Menary, Samuel Kaski, Andre Freitas
Keywords: solve contextual reasoning, contextual reasoning tasks, internally executing computational, executing computational graphs, computational graphs called
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the placement of normalisation layer used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the L_2-norms of the query/key/value vectors, predicting a phenomenon of circuit collapse when sparse-attention shifts to a different token. Empirically, we investigate the sensitivity of real-world models trained for mathematical addition, observing a 1% rate of circuit collapse when the norms are artificially perturbed by ≲ 10%. We contrast Pre-Norm with QKV-Norm, which places normalisation after the attention head’s linear operators. Theoretically this relaxes the representational constraints. Empirically we observe comparable in-distribution but worse out-of-distribution performance.
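
The interference through a shared normalisation factor that the abstract describes can be seen in a toy setting. The sketch below is only an assumption-laden illustration: it uses plain L2 normalisation (no mean-centring or learned gain, unlike a real LayerNorm), two hand-picked coordinate subspaces, and a hypothetical helper `a_part_after_joint_norm`; none of it comes from the paper's code.

```python
import numpy as np

def a_part_after_joint_norm(a, b):
    """Concatenate two subspace components, L2-normalise the whole vector,
    and return the coordinates that belong to subspace A."""
    x = np.concatenate([a, b])
    return (x / np.linalg.norm(x))[: len(a)]

a = np.array([3.0, 4.0])    # component in subspace A (norm 5)
b = np.array([0.0, 12.0])   # component in subspace B (norm 12)

before = a_part_after_joint_norm(a, b)
after = a_part_after_joint_norm(a, 2.0 * b)  # only subspace B changes

# The direction of the A component is unchanged, but its magnitude shrinks
# because both subspaces share one normalisation factor.
print(np.linalg.norm(before), np.linalg.norm(after))
```

Although `a` itself never changes, the normalised A component gets smaller when the B component grows, which is the kind of cross-subspace coupling the abstract attributes to Pre-Norm.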

[AI-83] A Moonshot for AI Oracles in the Sciences

Link: https://arxiv.org/abs/2406.17836
Authors: Bryan Kaiser, Tailin Wu, Maike Sonnewald, Colin Thackray, Skylar Callis
Keywords: Nobel laureate Philip, laureate Philip Anderson, Philip Anderson, Anderson and Elihu, Elihu Abrahams
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); History and Overview (math.HO); Physics and Society (physics.soc-ph)

Abstract:Nobel laureate Philip Anderson and Elihu Abrahams once stated that, “even if machines did contribute to normal science, we see no mechanism by which they could create a Kuhnian revolution and thereby establish a new physical law.” In this Perspective, we draw upon insights from the philosophies of science and artificial intelligence (AI) to propose necessary conditions of precisely such a mechanism for generating revolutionary mathematical theories. Recent advancements in AI suggest that satisfying the proposed necessary conditions by machines may be plausible; thus, our proposed necessary conditions also define a moonshot challenge. We also propose a heuristic definition of the intelligibility of mathematical theories to accelerate the development of machine theorists.

[AI-84] The Use of AI-Robotic Systems for Scientific Discovery

Link: https://arxiv.org/abs/2406.17835
Authors: Alexander H. Gower, Konstantin Korovin, Daniel Brunnsåker, Filip Kronström, Gabriel K. Reder, Ievgeniia A. Tiukova, Ronald S. Reiserer, John P. Wikswo, Ross D. King
Keywords: scientific method, process of developing, entire scientific method, robot scientist, developing theories
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, book chapter

Abstract:The process of developing theories and models and testing them with experiments is fundamental to the scientific method. Automating the entire scientific method then requires not only automation of the induction of theories from data, but also experimentation from design to implementation. This is the idea behind a robot scientist – a coupled system of AI and laboratory robotics that has agency to test hypotheses with real-world experiments. In this chapter we explore some of the fundamentals of robot scientists in the philosophy of science. We also map the activities of a robot scientist to machine learning paradigms, and argue that the scientific method shares an analogy with active learning. We demonstrate these concepts using examples from previous robot scientists, and also from Genesis: a next generation robot scientist designed for research in systems biology, comprising a micro-fluidic system with 1000 computer-controlled micro-bioreactors and interpretable models based in controlled vocabularies and logic.

[AI-85] Univariate Skeleton Prediction in Multivariate Systems Using Transformers

Link: https://arxiv.org/abs/2406.17834
Authors: Giorgio Morales, John W. Sheppard
Keywords: approximate the behavior, observed system, system response, univariate symbolic, methods attempt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2024