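This post lists the latest papers fetched from arXiv each day. It is updated automatically at about 11:30 every morning and is organized into five main areas: NLP, CV, ML, AI, and IR.

Note: to receive the daily paper list by email, leave your email address in the comments. The email is sent automatically at about 11:30 each day.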

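Overview (2024-07-03)

840 papers were updated today, including:

  • Natural language processing: 174 papers (Computation and Language (cs.CL))
  • Computer vision: 180 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial intelligence: 230 papers (Artificial Intelligence (cs.AI))
  • Machine learning: 277 papers (Machine Learning (cs.LG))

Natural Language Processing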

[NLP-0] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Link: https://arxiv.org/abs/2407.01527
Authors: Jiayi Yuan, Hongyi Liu, Shaochen (Henry) Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
Keywords: digest long-form texts, large language models, long-form texts, crucial competency, competency for large
Categories: Computation and Language (cs.CL)

Abstract:Long context capability is a crucial competency for large language models (LLMs) as it mitigates the human struggle to digest long-form texts. This capability enables complex task-solving scenarios such as book summarization, code assistance, and many more tasks that are traditionally manpower-intensive. However, transformer-based LLMs face significant challenges with long context input due to the growing size of the KV cache and the intrinsic complexity of attending to extended inputs; where multiple schools of efficiency-driven approaches – such as KV cache quantization, token dropping, prompt compression, linear-time sequence models, and hybrid architectures – have been proposed to produce efficient yet long context-capable models. Despite these advancements, no existing work has comprehensively benchmarked these methods in a reasonably aligned environment. In this work, we fill this gap by providing a taxonomy of current methods and evaluating 10+ state-of-the-art approaches across seven categories of long context tasks. Our work reveals numerous previously unknown phenomena and offers insights – as well as a friendly workbench – for the future development of long context-capable LLMs. The source code will be available at this https URL

[NLP-1] Empowering 3D Visual Grounding with Reasoning Capabilities

Link: https://arxiv.org/abs/2407.01525
Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
Keywords: explicit textual descriptions, reason human intentions, Large Language Model, Multi-modal Large Language, implicit instructions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ECCV24. A comprehensive and hierarchical 3D reasoning grounding benchmark in the era of foundation models. Project page: this https URL

Abstract:Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason which provides over 10K question-answer-location pairs from five reasoning types that require the synerization of reasoning and grounding. We further design our approach, ReGround3D, composed of the visual-centric reasoning module empowered by Multi-modal Large Language Model (MLLM) and the 3D grounding module to obtain accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

[NLP-2] MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Link: https://arxiv.org/abs/2407.01523
Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun
Keywords: long-standing and practical, practical task, Recent Large Vision-Language, Large Vision-Language Models, single-page document understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: this https URL

[NLP-3] MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Link: https://arxiv.org/abs/2407.01509
Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
Keywords: large language models, evaluate multimodal large, multimodal large language, introduce MIA-Bench, language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

[NLP-4] Self-Cognition in Large Language Models: An Exploratory Study

Link: https://arxiv.org/abs/2407.01505
Authors: Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Keywords: Large Language Models, Large Language, achieved remarkable success, Language Models, self-cognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2024 Large Language Models and Cognition Workshop

Abstract:While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition and four well-designed principles to quantify LLMs’ self-cognition. Our study reveals that 4 of the 48 models on Chatbot Arena–specifically Command R, Claude3-Opus, Llama-3-70b-Instruct, and Reka-core–demonstrate some level of detectable self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level. Additionally, we also explore the utility and trustworthiness of LLM in the self-cognition state, revealing that the self-cognition state enhances some specific tasks such as creative writing and exaggeration. We believe that our work can serve as an inspiration for further research to study the self-cognition in LLMs.

[NLP-5] RegMix: Data Mixture as Regression for Language Model Pre-training

Link: https://arxiv.org/abs/2407.01492
Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
Keywords: large language model, language model pre-training, mixture remains unclear, effective mixture remains, data mixture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at this https URL.
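The regression idea in the RegMix abstract above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `proxy_loss` is a hypothetical stand-in for training a small proxy model and measuring its validation loss, the per-domain numbers are made up, and a plain least-squares regressor is used in place of whatever regression model the authors actually fit.

```python
import numpy as np

def sample_mixtures(n, k, rng):
    """Draw n random data mixtures over k domains (each row sums to 1)."""
    return rng.dirichlet(np.ones(k), size=n)

def proxy_loss(mix, rng):
    """Hypothetical stand-in for 'train a small model on this mixture and
    report its validation loss'. The benefit numbers are invented."""
    per_domain_benefit = np.array([0.9, 0.2, 0.5])
    return 3.0 - mix @ per_domain_benefit + rng.normal(0.0, 0.01)

def fit_linear(X, y):
    """Least-squares fit of y ~ [X, 1] @ w."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict(w, X):
    """Predicted loss for each mixture in X."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

rng = np.random.default_rng(0)

# Step 1: "train" many small proxy models on diverse mixtures.
train_mix = sample_mixtures(64, 3, rng)
train_loss = np.array([proxy_loss(m, rng) for m in train_mix])

# Step 2: fit a regression model from mixture weights to loss.
w = fit_linear(train_mix, train_loss)

# Step 3: simulate many candidate mixtures and keep the best predicted one.
candidates = sample_mixtures(10_000, 3, rng)
best = candidates[np.argmin(predict(w, candidates))]
print("predicted best mixture:", np.round(best, 3))
```

In this toy setup the selected mixture puts most of its weight on the first domain, because that is the domain the made-up loss function rewards most. The selected mixture would then be used to train the large model.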

[NLP-6] Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Link: https://arxiv.org/abs/2407.01491
Authors: Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang
Keywords: Efficient fine-tuning plays, Efficient fine-tuning, low-rank adaptation emerging, modern large models, fine-tuning plays
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA’s expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be released at this https URL very soon.

[NLP-7] LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Link: https://arxiv.org/abs/2407.01490
Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker
Keywords: synthetic data, synthetic data raises, data, synthetic, widespread adoption
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to date of how the source of synthetic data shapes models’ internal biases, calibration and generations’ textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear “neutral”, which invites the question of whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse way of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

[NLP-8] Agentless: Demystifying LLM-based Software Engineering Agents

Link: https://arxiv.org/abs/2407.01489
Authors: Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang
Keywords: including code synthesis, large language models, software development tasks, Recent advancements, software development
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic two-phase process of localization followed by repair, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost ($0.34) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

[NLP-9] Tree Search for Language Model Agents

Link: https://arxiv.org/abs/2407.01476
Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
Keywords: Autonomous agents powered, Autonomous agents, perform decision-making tasks, demonstrated promise, search
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages. Models and code available at this https URL

Abstract:Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at this https URL.
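The abstract above describes a form of best-first tree search run at inference time. A generic sketch of best-first search under a fixed expansion budget is shown below. It is not the authors' implementation: the integer "environment", the `expand` function, and the distance-based `value` function are hypothetical stand-ins for a web environment, an agent proposing actions, and a learned value estimate.

```python
import heapq
from itertools import count

def best_first_search(start, expand, value, is_goal, budget=50):
    """Expand the most promising state first, up to `budget` expansions.

    expand(state)  -> iterable of successor states
    value(state)   -> float, higher means more promising
    is_goal(state) -> bool
    Returns the path to a goal, or to the best state seen if none is reached.
    """
    tie = count()  # tie-breaker so the heap never compares states or paths
    frontier = [(-value(start), next(tie), start, [start])]
    seen = {start}
    best_path, best_score = [start], value(start)
    while frontier and budget > 0:
        neg_v, _, state, path = heapq.heappop(frontier)
        budget -= 1
        if -neg_v > best_score:
            best_score, best_path = -neg_v, path
        if is_goal(state):
            return path
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), next(tie), nxt, path + [nxt]))
    return best_path

# Toy task: reach 10 from 1 using "+1" or "*2" actions.
path = best_first_search(
    start=1,
    expand=lambda s: [s + 1, s * 2],
    value=lambda s: -abs(10 - s),
    is_goal=lambda s: s == 10,
)
print(path)  # [1, 2, 4, 8, 9, 10]
```

The budget parameter mirrors the abstract's point that performance scales with test-time compute: a larger budget lets the search explore more of the tree before committing to a path.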

[NLP-10] DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Link: https://arxiv.org/abs/2407.01470
Authors: Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
Keywords: aligning large language, Reinforcement learning, large language models, human feedback, desired behaviors
Categories: Computation and Language (cs.CL)
Comments: Preprint. Code will be released after the review results

Abstract:Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks and provide a detailed analysis showcasing the effects of model merging, showing the great potential of facilitating model alignment.
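Model merging, as mentioned in the DogeRM abstract above, is often done by interpolating the weights of two models that share an architecture. The sketch below assumes that form of merging; the abstract itself does not specify the merging rule. The parameter names, the tiny arrays, and the mixing coefficient `lam` are all hypothetical.

```python
import numpy as np

def merge_state_dicts(reward_sd, domain_sd, lam=0.5):
    """Linearly interpolate parameters shared by both models.

    Parameters that exist only in the reward model (for example its reward
    head) are copied unchanged; parameters that exist only in the domain
    model are ignored.
    """
    merged = {}
    for name, w_rm in reward_sd.items():
        w_dom = domain_sd.get(name)
        if w_dom is not None and w_dom.shape == w_rm.shape:
            merged[name] = lam * w_rm + (1.0 - lam) * w_dom
        else:
            merged[name] = w_rm.copy()
    return merged

# Hypothetical toy "models" with one shared layer each.
reward_model = {
    "layer.weight": np.ones((2, 2)),
    "reward_head.weight": np.full((1, 2), 3.0),
}
domain_model = {
    "layer.weight": np.zeros((2, 2)),
    "lm_head.weight": np.full((4, 2), 7.0),
}

merged = merge_state_dicts(reward_model, domain_model, lam=0.25)
print(merged["layer.weight"])        # every entry is 0.25
print(merged["reward_head.weight"])  # unchanged: every entry is 3.0
```

With real models the dictionaries would hold the full parameter tensors, and `lam` would be chosen on a validation set.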

[NLP-11] Retrieval-augmented generation in multilingual settings

Link: https://arxiv.org/abs/2407.01463
Authors: Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina
Keywords: improving LLM factuality, large language models, studied in English-only, LLM factuality, English-only settings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components and with which adjustments are needed to build a well-performing mRAG pipeline, that can be used as a strong baseline in future works. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for multilingual setting, to account for variations in spelling named entities. The main limitations to be addressed in future works include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, or irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at this https URL.

[NLP-12] Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement

Link: https://arxiv.org/abs/2407.01461
Authors: Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
Keywords: large language models, helpful responses heavily, responses heavily relies, capacity of large, large language
Categories: Computation and Language (cs.CL)

Abstract:The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts often tend to be brief and vague, thereby significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: this https URL .

[NLP-13] TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind

Link: https://arxiv.org/abs/2407.01455
Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Linjuan Wu, Weiming Lu
Keywords: Theory of Mind, Large Language Models, advanced Large Language, ToM, Large Language
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 6 figures, ACL 2024 (Findings)

Abstract:Theory of Mind (ToM)-the cognitive ability to reason about mental states of ourselves and others, is the foundation of social interaction. Although ToM comes naturally to humans, it poses a significant challenge to even the most advanced Large Language Models (LLMs). Due to the complex logical chains in ToM reasoning, especially in higher-order ToM questions, simply utilizing reasoning methods like Chain of Thought (CoT) will not improve the ToM capabilities of LLMs. We present TimeToM, which constructs a temporal space and uses it as the foundation to improve the ToM capabilities of LLMs in multiple scenarios. Specifically, within the temporal space, we construct Temporal Belief State Chain (TBSC) for each character and inspired by the cognition perspective of the social world model, we divide TBSC into self-world beliefs and social world beliefs, aligning with first-order ToM (first-order beliefs) and higher-order ToM (higher-order beliefs) questions, respectively. Moreover, we design a novel tool-belief solver that, by considering belief communication between characters in temporal space, can transform a character’s higher-order beliefs into another character’s first-order beliefs under belief communication period. Experimental results indicate that TimeToM can dramatically improve the reasoning performance of LLMs on ToM questions while taking a big step towards coherent and robust ToM reasoning.

[NLP-14] ColPali: Efficient Document Retrieval with Vision Language Models
[NLP-14] ColPali:使用视觉语言模型的高效文档检索

链接: https://arxiv.org/abs/2407.01449
作者: Manuel Faysse,Hugues Sibille,Tony Wu,Gautier Viaud,Céline Hudelot,Pierre Colombo
关键词: Retrieval Augmented Generation, document retrieval, visually rich structures, information through text, modern document retrieval
中文关键词: 检索增强生成、文档检索、视觉丰富的结构、文本信息、现代文档检索
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

Abstract:Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
摘要:文档是一种视觉丰富的结构,它通过文本以及表格、插图、页面布局或字体来传达信息。虽然现代文档检索系统在查询到文本匹配方面表现出很强的性能,但它们难以有效地利用视觉线索,这阻碍了它们在实际文档检索应用(例如检索增强生成)中的性能。为了对当前系统的视觉丰富文档检索进行基准测试,我们引入了视觉文档检索基准ViDoRe,它由跨越多个领域、语言和设置的各种页面级检索任务组成。现代系统的固有缺陷促使我们引入一种新的检索模型架构ColPali,它利用最新的视觉语言模型的文档理解能力,仅从文档页面的图像生成高质量的上下文化嵌入。与后期交互匹配机制相结合,ColPali在很大程度上超过了现代文档检索流水线,同时速度快得多,而且可以端到端训练。
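
The late interaction matching mentioned in the abstract can be illustrated with a small sketch. It assumes a ColBERT-style MaxSim operator (each query token embedding is matched to its best-scoring page patch embedding, and the maxima are summed); the abstract itself does not spell out the scoring formula, and the function names and toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Assumed MaxSim scoring: for every query token embedding take the
    best-matching page patch embedding, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices ordered from highest to lowest score."""
    scores = [late_interaction_score(query_vecs, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
```

In a real system the vectors would come from the query and page encoders; here any list of equally sized float vectors works, which keeps the sketch independent of a particular model.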

[NLP-15] Needle in the Haystack for Memory Based Large Language Models
[NLP-15] 基于记忆的大型语言模型的大海捞针测试

链接: https://arxiv.org/abs/2407.01437
作者: Subhajit Chaudhury,Soham Dan,Payel Das,Georgios Kollias,Elliot Nelson
关键词: augmented Large Language, Large Language Model, Large Language, memory augmented Large, augmented Large
中文关键词: 增强的大型语言、大型语言模型、大型语言、内存增强的大型、增强的大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages

Abstract:In this paper, we demonstrate the benefits of using memory augmented Large Language Model (LLM) architecture in improving the recall abilities of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments a LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time to handle contexts much longer than those seen during training, while keeping readouts from the memory recognizable to the trained decoder and without increasing GPU memory footprint. Compared to alternative architectures for long-context recall tasks with models of a comparable parameter count, LARIMAR is able to maintain strong performance without any task-specific training.
摘要:在本文中,我们展示了使用记忆增强的大型语言模型(LLM)架构在提高对潜在长上下文中事实的回忆能力方面的好处。作为案例研究,我们在多项长上下文回忆任务(包括密钥检索和大海捞针测试)上测试了LARIMAR,这是一种最近提出的LLM架构,它通过外部联想记忆增强LLM解码器。我们证明,外部记忆可以在测试时进行调整,以处理比训练期间所见长得多的上下文,同时保持记忆的读出可被训练好的解码器识别,并且不会增加GPU的内存占用。与参数量相当的、用于长上下文回忆任务的其他架构相比,LARIMAR无需任何特定任务的训练即可保持强劲的性能。

[NLP-16] A Global-Local Attention Mechanism for Relation Classification
[NLP-16] 用于关系分类的全局-局部注意力机制

链接: https://arxiv.org/abs/2407.01424
作者: Yiping Sun
关键词: involves identifying connections, Relation classification, involves identifying, crucial component, identifying connections
中文关键词: 涉及识别联系,关系分类,涉及识别,关键组件,识别联系
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: This paper has been accepted by the 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)

Abstract:Relation classification, a crucial component of relation extraction, involves identifying connections between two entities. Previous studies have predominantly focused on integrating the attention mechanism into relation classification at a global scale, overlooking the importance of the local context. To address this gap, this paper introduces a novel global-local attention mechanism for relation classification, which enhances global attention with a localized focus. Additionally, we propose innovative hard and soft localization mechanisms to identify potential keywords for local attention. By incorporating both hard and soft localization strategies, our approach offers a more nuanced and comprehensive understanding of the contextual cues that contribute to effective relation classification. Our experimental results on the SemEval-2010 Task 8 dataset highlight the superior performance of our method compared to previous attention-based approaches in relation classification.
摘要:关系分类是关系抽取的重要组成部分,涉及识别两个实体之间的联系。之前的研究主要集中在全局尺度上将注意力机制整合到关系分类中,而忽视了局部上下文的重要性。为了弥补这一不足,本文引入了一种用于关系分类的新型全局-局部注意力机制,该机制以局部化的关注增强全局注意力。此外,我们还提出了创新的硬定位和软定位机制,以识别可供局部注意力使用的潜在关键词。通过结合硬定位和软定位策略,我们的方法对有助于有效关系分类的上下文线索提供了更细致、更全面的理解。我们在SemEval-2010 Task 8数据集上的实验结果表明,与之前基于注意力的关系分类方法相比,我们的方法具有更好的性能。

[NLP-17] HyperLoader: Integrating Hypernetwork-Based LoRA and Adapter Layers into Multi-Task Transformers for Sequence Labelling
[NLP-17] HyperLoader:将基于超网络的LoRA和适配器层集成到多任务Transformer中以进行序列标注

链接: https://arxiv.org/abs/2407.01411
作者: Jesus-German Ortiz-Barajas,Helena Gomez-Adorno,Thamar Solorio
关键词: parameter-efficient fine-tuning methods, simple approach, multi-task setting, parameter-efficient fine-tuning, fine-tuning methods
中文关键词: 参数高效的微调方法、简单方法、多任务设置、参数高效的微调、微调方法
类目: Computation and Language (cs.CL)
备注:

Abstract:We present HyperLoader, a simple approach that combines different parameter-efficient fine-tuning methods in a multi-task setting. To achieve this goal, our model uses a hypernetwork to generate the weights of these modules based on the task, the transformer layer, and its position within this layer. Our method combines the benefits of multi-task learning by capturing the structure of all tasks while reducing the task interference problem by encapsulating the task-specific knowledge in the generated weights and the benefits of combining different parameter-efficient methods to outperform full-fine tuning. We provide empirical evidence that HyperLoader outperforms previous approaches in most datasets and obtains the best average performance across tasks in high-resource and low-resource scenarios.
摘要:我们介绍了HyperLoader,这是一种简单的方法,在多任务设置中结合了不同的参数高效微调方法。为了实现这一目标,我们的模型使用超网络根据任务、Transformer层及其在该层中的位置生成这些模块的权重。我们的方法结合了多任务学习的好处,即捕获所有任务的结构,同时通过将任务特定知识封装在生成的权重中来减少任务干扰问题,以及结合不同参数高效方法以优于全量微调的好处。我们提供的经验证据表明,HyperLoader在大多数数据集中优于以前的方法,并在高资源和低资源场景中取得了跨任务的最佳平均性能。

[NLP-18] Dynamic Few-Shot Learning for Knowledge Graph Question Answering
[NLP-18] 面向知识图谱问答的动态少样本学习

链接: https://arxiv.org/abs/2407.01409
作者: Jacopo D’Abramo,Andrea Zugarini,Paolo Torroni
关键词: innovative Question Answering, Large language models, Knowledge Graphs, Question Answering, Answering over Knowledge
中文关键词: 创新的问题解答、大型语言模型、知识图、问题解答、知识解答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large language models present opportunities for innovative Question Answering over Knowledge Graphs (KGQA). However, they are not inherently designed for query generation. To bridge this gap, solutions have been proposed that rely on fine-tuning or ad-hoc architectures, achieving good results but limited out-of-domain distribution generalization. In this study, we introduce a novel approach called Dynamic Few-Shot Learning (DFSL). DFSL integrates the efficiency of in-context learning and semantic similarity and provides a generally applicable solution for KGQA with state-of-the-art performance. We run an extensive evaluation across multiple benchmark datasets and architecture configurations.
摘要:大型语言模型为创新的知识图谱问答(KGQA)提供了机会。然而,它们本质上并不是为查询生成而设计的。为了弥合这一差距,人们提出了依赖于微调或专用架构的解决方案,这些方案取得了良好的结果,但域外分布泛化能力有限。在这项研究中,我们引入了一种名为动态少样本学习(DFSL)的新颖方法。DFSL结合了上下文学习的效率和语义相似性,并为KGQA提供了具有最先进性能的普遍适用的解决方案。我们在多个基准数据集和架构配置上进行了广泛的评估。
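
The combination of in-context learning and semantic similarity described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the retrieval or prompt format, so the cosine-similarity ranking, the `Question:`/`Query:` template, and all function names are hypothetical, and the embeddings are assumed to be precomputed by some sentence encoder.

```python
import math
from typing import List, Sequence, Tuple

# One stored example: (question embedding, question text, target query text).
Example = Tuple[Sequence[float], str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_examples(question_vec: Sequence[float], pool: List[Example], k: int = 2) -> List[Example]:
    """Pick the k stored examples most similar to the incoming question."""
    ranked = sorted(pool, key=lambda ex: cosine(question_vec, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, examples: List[Example]) -> str:
    """Assemble a few-shot prompt: retrieved demonstrations, then the new question."""
    parts = [f"Question: {q}\nQuery: {target}" for _, q, target in examples]
    parts.append(f"Question: {question}\nQuery:")
    return "\n\n".join(parts)
```

The point of the sketch is that the demonstrations change per question ("dynamic") instead of being a fixed few-shot set; the prompt would then be sent to an LLM of choice.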

[NLP-19] Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters
[NLP-19] 通过适配器使用知识图将多语言LLM适应低资源语言

链接: https://arxiv.org/abs/2407.01406
作者: Daniil Gurgurov,Mareike Hartmann,Simon Ostermann
关键词: named entity recognition, Large Language Models, multilingual Large Language, multilingual Large, Large Language
中文关键词: 命名实体识别、大型语言模型、多语言大型语言、多语言大型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, KaLLM workshop

Abstract:This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs) using adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER). Building upon successful parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we propose a similar approach for incorporating knowledge from multilingual graphs, connecting concepts in various languages with each other through linguistic relationships, into multilingual LLMs for LRLs. Specifically, we focus on eight LRLs – Maltese, Bulgarian, Indonesian, Nepali, Javanese, Uyghur, Tibetan, and Sinhala – and employ language-specific adapters fine-tuned on data extracted from the language-specific section of ConceptNet, aiming to enable knowledge transfer across the languages covered by the knowledge graph. We compare various fine-tuning objectives, including standard Masked Language Modeling (MLM), MLM with full-word masking, and MLM with targeted masking, to analyse their effectiveness in learning and integrating the extracted graph data. Through empirical evaluation on language-specific tasks, we assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER, providing insights into the potential benefits of adapting language models for low-resource scenarios.
摘要:为了提高低资源语言(LRLS)在情感分析(SA)和命名实体识别(NER)中的性能,利用适配器将语言本体中的图知识集成到多语言大语言模型(LLMS)中。在K-Adapter和MAD-X等成功的参数高效微调技术的基础上,我们提出了一种类似的方法来整合来自多语言图的知识,通过语言关系将不同语言中的概念相互连接到LRL的多语言LLM中。具体地说,我们专注于八种LRL–马耳他语、保加利亚语、印度尼西亚语、尼泊尔语、爪哇语、维吾尔语、藏语和僧伽罗语–并使用特定于语言的适配器,对从概念网的特定语言部分提取的数据进行微调,旨在实现知识图谱涵盖的语言之间的知识转移。我们比较了各种微调目标,包括标准掩蔽语言建模(MLM)、全词掩蔽的MLM和目标掩蔽的MLM,以分析它们在学习和整合提取的图形数据方面的有效性。通过对特定语言任务的实证评估,我们评估了结构化图知识如何影响SA和NER中LRL的多语言LLM的性能,从而为适应低资源情景下的语言模型提供了潜在的好处。

[NLP-20] Optimization of Retrieval-Augmented Generation Context with Outlier Detection
[NLP-20] 利用离群点检测优化检索增强生成上下文

链接: https://arxiv.org/abs/2407.01403
作者: Vitaly Bulgakov
关键词: prompt context required, Large Language Model, retrieved LLM responses, question-answering systems, reduce the size
中文关键词: 需要提示上下文、大型语言模型、检索的LLM回复、问答系统、缩小规模
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the query vectors. The methods were evaluated by comparing the similarities of the retrieved LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
摘要:在本文中,我们重点研究了减少问答系统所需提示上下文的大小和提高提示上下文质量的方法。当生成对查询的响应时,尝试增加检索到的分块文档的数量并由此扩大与查询相关的上下文可能会显著地使处理复杂化并降低大型语言模型(LLM)的性能。众所周知,响应于查询而从数据库检索的大量文档可能包含不相关的信息,这通常会导致结果答案中出现幻觉。我们的目标是选择语义最相关的文档,将被丢弃的文档视为离群值。我们提出并评估了几种通过创建特征来识别离群点的方法,这些特征利用从向量数据库中检索到的嵌入向量到质心和查询向量的距离。通过比较检索到的LLM响应与使用OpenAI GPT-4o模型获得的地面真相答案的相似性来评估这些方法。研究发现,问题和答案的复杂性越高,提高的程度越大。
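
The idea of treating weakly related chunks as outliers, using distances to both the centroid and the query vector, can be sketched as below. The abstract says several methods were evaluated without detailing them, so the specific feature (sum of the two Euclidean distances) and the z-score cutoff here are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the retrieved chunk embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def filter_outliers(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    z_threshold: float = 1.0,
) -> List[int]:
    """Return indices of chunks kept for the prompt context.

    Feature per chunk (an assumption): distance to the centroid plus
    distance to the query. Chunks whose feature lies more than
    z_threshold standard deviations above the mean are dropped.
    """
    center = centroid(chunk_vecs)
    features = [math.dist(v, center) + math.dist(v, query_vec) for v in chunk_vecs]
    mu, sigma = mean(features), pstdev(features)
    if sigma == 0.0:
        return list(range(len(chunk_vecs)))  # all chunks equally close
    return [i for i, f in enumerate(features) if (f - mu) / sigma <= z_threshold]
```

A smaller `z_threshold` shrinks the context more aggressively; the right value would have to be tuned against answer quality.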

[NLP-21] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing
[NLP-21] Gloss2Text:使用LLM和语义感知标签平滑的手语Gloss翻译

链接: https://arxiv.org/abs/2407.01394
作者: Pooya Fayyazsanavi,Antonios Anastasopoulos,Jana Košecká
关键词: spoken text presents, text presents unique, presents unique challenges, unique challenges owing, expression nuances
中文关键词: 口语文本呈现,文本呈现独特,呈现独特的挑战,独特的挑战,表达的细微差别
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and novel label-smoothing loss function exploiting gloss translation ambiguities improving significantly the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.
摘要:从视频到口语文本的手语翻译由于独特的语法、表达上的细微差别以及不同说话者和上下文之间视觉外观的高度差异而带来了独特的挑战。视频的中间gloss标注旨在指导翻译过程。在我们的工作中,我们重点关注Gloss2Text翻译阶段,并通过利用预训练的大型语言模型(LLM)、数据增强和利用gloss翻译歧义的新型标签平滑损失函数提出了几项改进,显著提高了最先进方法的性能。通过在PHOENIX Weather 2014T数据集上的大量实验和消融研究,我们的方法超越了Gloss2Text翻译中的最新性能,表明其在解决手语翻译问题方面的有效性,并为未来的研究和开发提出了有希望的途径。

[NLP-22] POLygraph: Polish Fake News Dataset
[NLP-22] POLygraph:波兰假新闻数据集

链接: https://arxiv.org/abs/2407.01393
作者: Daniel Dzienisiewicz,Filip Graliński,Piotr Jabłoński,Marek Kubis,Paweł Skórzewski,Piotr Wierzchoń
关键词: fake news detection, dataset, unique resource, detection in Polish, Polish
中文关键词: 假新闻检测、数据集、独特资源、波兰语检测、波兰语
类目: Computation and Language (cs.CL)
备注: 14 pages, 1 figure, accepted to the 14th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA’24)

Abstract:This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the “fake-or-not” dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the “fake-they-say” dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project.
摘要:本文介绍了POLygraph数据集,这是一个用于波兰语假新闻检测的独特资源。该数据集由一个跨学科团队创建,由两部分组成:包含11,360对新闻文章(通过其URL标识)及相应标签的“fake-or-not”数据集,以及包含5,082篇新闻文章(通过其URL标识)和评论这些文章的推文的“fake-they-say”数据集。与现有的数据集不同,POLygraph涵盖了来自源文献的多种方法,为假新闻检测提供了全面的资源。这些数据是由专家和非专家标注员通过人工标注收集的。该项目还开发了一个软件工具,使用先进的机器学习技术来分析数据并判断内容的真实性。该工具和数据集预计将使各类实体受益,从公共部门机构到出版商和事实核查组织。对数据集的进一步探索将促进假新闻检测,并可能推动在其他语言中实现类似的模型。本文侧重于数据集的创建和组成,因此不包括对内容真实性分析软件工具的详细评估,该评估计划在项目的后期阶段进行。

[NLP-23] Free-text Rationale Generation under Readability Level Control
[NLP-23] 可读级别控制下的自由文本基本原理生成

链接: https://arxiv.org/abs/2407.01384
作者: Yi-Sheng Hsu,Nils Feldhus,Sherzod Hakimov
关键词: Free-text rationales justify, justify model decisions, Free-text rationales, likable and accessible, accessible among approaches
中文关键词: 自由文本理由证明,证明模型决策,自由文本理由,可爱且易于理解,在方法中易于理解
类目: Computation and Language (cs.CL)
备注:

Abstract:Free-text rationales justify model decisions in natural language and thus become likable and accessible among approaches to explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform the task of natural language explanation (NLE) under the effects of readability level control, i.e., being prompted for a rationale targeting a specific expertise level, such as sixth grade or college. We find that explanations are adaptable to such instruction, but the requested readability is often misaligned with the measured text complexity according to traditional readability metrics. Furthermore, the quality assessment shows that LLMs’ ratings of rationales across text complexity exhibit a similar pattern of preference as observed in natural language generation (NLG). Finally, our human evaluation suggests a generally satisfactory impression on rationales at all readability levels, with high-school-level readability being most commonly perceived and favored.
摘要:自由文本理论证明了自然语言中的模型决策是合理的,因此在许多任务的解释方法中变得受欢迎和容易理解。然而,它们的有效性可能会受到误解和幻觉的阻碍。作为一项扰动测试,我们考察了在可读性水平控制的影响下,大语言模型(LLM)如何执行自然语言解释(NLE)任务,即被提示针对特定专业水平的理论基础,如六年级或大学。我们发现,解释是适合这样的指导的,但根据传统的可读性度量,所要求的可读性往往与测量的文本复杂性不一致。此外,质量评估表明,LLMS对文本复杂性的基本原理的评分显示出与自然语言生成(NLG)中观察到的相似的偏好模式。最后,我们的人类评估表明,在所有可读性水平上,人们对基本原理的印象总体上是令人满意的,其中高中水平的可读性是最常见的感知和青睐。

[NLP-24] Badllama 3: removing safety finetuning from Llama 3 in minutes
[NLP-24] Badllama 3:几分钟内移除Llama 3的安全微调

链接: https://arxiv.org/abs/2407.01376
作者: Dmitrii Volkov
关键词: extensive LLM safety, LLM safety fine-tuning, extensive LLM, model weights, LLM safety
中文关键词: 广泛的LLM安全性、LLM安全微调、广泛的LLM、模型权重、LLM安全性
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.
摘要:我们表明,当攻击者能够访问模型权重时,广泛的LLM安全微调很容易被颠覆。我们评估了三种最先进的微调方法(QLoRA、ReFT和Ortho),并展示了算法进步如何在削减FLOPs和优化算力的同时保持越狱性能不变。我们在单个GPU上用1分钟去除了Llama 3 8B的安全微调,用30分钟去除了Llama 3 70B的安全微调,并概述了进一步缩短这一时间的方法。

[NLP-25] Bridging the Gap: Transfer Learning from English PLMs to Malaysian English
[NLP-25] 弥合差距:从英语PLM到马来西亚英语的迁移学习

链接: https://arxiv.org/abs/2407.01374
作者: Mohan Raj Chanthran,Lay-Ki Soon,Huey Fang Ong,Bhawani Selvaretnam
关键词: Malaysian English, Standard English, addition to Standard, Malaysian English text, English
中文关键词: 马来西亚英语,标准英语,标准英语的补充,马来西亚英语文本,英语
类目: Computation and Language (cs.CL)
备注: Accepted in 9th Workshop on Representation Learning for NLP (Rep4NLP) at ACL 2024

Abstract:Malaysian English is a low resource creole language, where it carries the elements of Malay, Chinese, and Tamil languages, in addition to Standard English. Named Entity Recognition (NER) models underperform when capturing entities from Malaysian English text due to its distinctive morphosyntactic adaptations, semantic features and code-switching (mixing English and Malay). Considering these gaps, we introduce MENmBERT and MENBERT, a pre-trained language model with contextual understanding, specifically tailored for Malaysian English. We have fine-tuned MENmBERT and MENBERT using manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning process allows the PLM to learn representations that capture the nuances of Malaysian English relevant for NER and RE tasks. MENmBERT achieved a 1.52% and 26.27% improvement on NER and RE tasks respectively compared to the bert-base-multilingual-cased model. Although the overall performance of NER does not have a significant improvement, our further analysis shows that there is a significant improvement when evaluated by the 12 entity labels. These findings suggest that pre-training language models on language-specific and geographically-focused corpora can be a promising approach for improving NER performance in low-resource settings. The dataset and code published in this paper provide valuable resources for NLP research work focusing on Malaysian English.
摘要:马来西亚英语是一种低资源的克里奥尔语言,除了标准英语外,它还包含马来语、华语和泰米尔语的元素。命名实体识别(NER)模型在从马来西亚英语文本中捕获实体时表现不佳,这是由于其独特的形态句法适应、语义特征和语码转换(混合英语和马来语)。考虑到这些差距,我们引入了MENmBERT和MENBERT,这是专门为马来西亚英语量身定做的具有上下文理解能力的预训练语言模型。我们使用马来西亚英语新闻文章(MEN)数据集中人工标注的实体和关系对MENmBERT和MENBERT进行了微调。这一微调过程使PLM能够学习到捕捉马来西亚英语细微差别、与NER和RE任务相关的表示。与bert-base-multilingual-cased模型相比,MENmBERT在NER和RE任务上分别取得了1.52%和26.27%的提升。虽然NER的整体性能没有明显的提高,但我们进一步的分析表明,按12个实体标签分别评估时有显著的提升。这些发现表明,在特定语言和特定地域的语料库上预训练语言模型,是在低资源环境下提高NER性能的一种很有前途的方法。本文发布的数据集和代码为以马来西亚英语为重点的NLP研究工作提供了宝贵的资源。

[NLP-26] Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
[NLP-26] 干草堆摘要:对长上下文LLM和RAG系统的挑战

链接: https://arxiv.org/abs/2407.01370
作者: Philippe Laban,Alexander R. Fabbri,Caiming Xiong,Chien-Sheng Wu
关键词: capable of handling, handling millions, millions of input, input tokens, RAG systems
中文关键词: 能够处理数百万个输入、输入令牌、RAG系统
类目: Computation and Language (cs.CL)
备注:

Abstract:LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The “Summary of a Haystack” (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects - Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
摘要:LLM和RAG系统现在能够处理数百万个甚至更多的输入token。然而,在长上下文任务上评估这类系统的输出质量仍然具有挑战性,因为像“大海捞针”(Needle-in-a-Haystack)这样的任务缺乏复杂性。在这项工作中,我们认为摘要可以在这样的评估中发挥核心作用。我们设计了一个过程来合成文档“草堆”(Haystack),确保特定的见解在文档之间重复出现。然后,“草堆摘要”(SummHay)任务要求系统处理草堆,并在给定查询的情况下生成摘要,以识别相关见解并准确引用源文档。由于我们精确地知道草堆摘要中应该出现哪些见解以及应该引用哪些文档,我们实现了一个高度可复现的自动评估,可以在覆盖率(Coverage)和引用(Citation)两个方面对摘要进行评分。我们在两个领域(对话、新闻)生成草堆,并对10个LLM和对应的50个RAG系统进行大规模评估。我们的发现表明,SummHay对于当前的系统来说是一个开放的挑战,因为即使是提供了文档相关性Oracle信号的系统,在联合得分(Joint Score)上也比我们估计的人类表现(56%)落后10分以上。在没有检索器的情况下,像GPT-4o和Claude 3 Opus这样的长上下文LLM在SummHay上的得分低于20%。我们表明,SummHay也可以用于研究企业RAG系统和长上下文模型中的位置偏差。我们希望未来的系统能在SummHay上达到并超过人类的表现。

[NLP-27] Nullpointer at ArAIEval Shared Task: Arabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging
[NLP-27] Nullpointer参加ArAIEval共享任务:在序列标注中使用token到单词映射的阿拉伯语宣传手法检测

链接: https://arxiv.org/abs/2407.01360
作者: Abrar Abir,Kemal Oflazer
关键词: ArAIEval shared task, propaganda technique detection, Arabic text, detection in Arabic, including tweets
中文关键词: ArAIEval共享任务、宣传技术检测、阿拉伯语文本、阿拉伯语检测,包括推文
类目: Computation and Language (cs.CL)
备注: To appear in proceedings of 2024 Arabic NLP Conference

Abstract:This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets & news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show relying on the first token of the word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model’s performance. Our system achieved a score of 25.41, placing us 4th on the leaderboard. Subsequent post-submission improvements further raised our score to 26.68.
摘要:本文研究了ArAIEval共享任务1中阿拉伯文本(包括推文/新闻段落)中宣传技术检测的优化。我们的方法涉及使用用于序列标记的神经网络分类器微调AraBERT v2模型。实验结果表明,依赖单词的第一个标记进行技术预测可以产生最佳性能。此外,将流派信息作为功能进一步增强了模型的性能。我们的系统获得了25.41分,在排行榜上排名第4。随后的提交后改进进一步将我们的分数提高到26.68。
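
The token-to-word mapping that the abstract reports as working best (taking the prediction of a word's first subword token) can be sketched as follows. The sketch assumes the tokenizer exposes, for each token, the index of the word it came from, with `None` for special tokens, and that word indices appear in ascending order; the function name and the label strings in the usage note are hypothetical.

```python
from typing import List, Optional


def first_token_labels(word_ids: List[Optional[int]], token_labels: List[str]) -> List[str]:
    """Collapse per-token predictions to per-word predictions by keeping
    only the label predicted for the first subword token of each word."""
    word_labels: List[str] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:      # special token such as [CLS] or [SEP]
            continue
        if word_id in seen:      # later subword piece of a word already handled
            continue
        seen.add(word_id)
        word_labels.append(label)
    return word_labels
```

For a sentence of three words split into six subword tokens plus two special tokens, calling `first_token_labels([None, 0, 0, 1, 2, 2, 2, None], token_preds)` yields exactly three labels, one per word, regardless of what the continuation pieces predicted.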

[NLP-28] Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models
[NLP-28] 评估大型语言模型中基于知识的跨语言不一致性

链接: https://arxiv.org/abs/2407.01358
作者: Xiaolin Xing,Zhiwei He,Haoyu Xu,Xing Wang,Rui Wang,Yu Hong
关键词: Natural Language Processing, Large Language Models, observed in Large, Large Language, Natural Language
中文关键词: 自然语言处理,大型语言模型,在大型、大型语言、自然语言中观察
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper investigates the cross-lingual inconsistencies observed in Large Language Models (LLMs), such as ChatGPT, Llama, and Baichuan, which have shown exceptional performance in various Natural Language Processing (NLP) tasks. Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages. This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs. To address these questions, we propose an innovative evaluation method for Cross-lingual Semantic Consistency (xSC) using the LaBSE model. We further introduce metrics for Cross-lingual Accuracy Consistency (xAC) and Cross-lingual Timeliness Consistency (xTC) to comprehensively assess the models’ performance regarding semantic, accuracy, and timeliness inconsistencies. By harmonizing these metrics, we provide a holistic measurement of LLMs’ cross-lingual consistency. Our findings aim to enhance the understanding and improvement of multilingual capabilities and interpretability in LLMs, contributing to the development of more robust and reliable multilingual language models.
摘要:本文研究了ChatGPT、Llama和百川(Baichuan)等在各种自然语言处理(NLP)任务中表现出色的大型语言模型(LLM)中观察到的跨语言不一致现象。尽管取得了成功,但这些模型在处理不同语言中的相同概念时往往表现出明显的不一致。本研究围绕三个主要问题:LLM中是否存在跨语言不一致,这些不一致具体表现在哪些方面,以及LLM的跨语言一致性与多语言能力之间的相关性。针对这些问题,我们提出了一种基于LaBSE模型的跨语言语义一致性(xSC)评估方法。我们进一步引入了跨语言准确性一致性(xAC)和跨语言时效性一致性(xTC)的度量,以综合评估模型在语义、准确性和时效性不一致方面的表现。通过整合这些指标,我们提供了对LLM跨语言一致性的整体度量。我们的发现旨在增进对LLM多语言能力和可解释性的理解与改进,有助于开发更健壮、更可靠的多语言语言模型。

[NLP-29] Protecting Privacy in Classifiers by Token Manipulation
[NLP-29] 通过代币操纵保护分类器中的隐私

链接: https://arxiv.org/abs/2407.01334
作者: Re’em Harel,Yair Elboher,Yuval Pinter
关键词: remote service entails, service entails sending, entails sending private, sending private information, untrusted provider
中文关键词: 远程服务需要,服务需要发送,需要发送私人信息,发送私人信息,不受信任的提供商
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task, and via a sophisticated attacker can be reconstructed. In comparison, the contextualized manipulation provides an improvement in performance.
摘要:使用语言模型作为远程服务需要向不受信任的提供商发送私人信息。此外,潜在的窃听者可以拦截消息,从而暴露信息。在这项工作中,我们探索了在文本操作层面避免此类数据暴露的前景。我们专注于文本分类模型,检查各种标记映射和上下文化操纵功能,以了解是否可以在保持原始文本不可恢复的同时保持分类器的准确性。我们发现,尽管一些令牌映射函数易于实现且直接,但它们严重影响下游任务的性能,并且可以通过复杂的攻击者进行重建。相比之下,上下文化操纵提供了性能改进。

[NLP-30] Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning
[NLP-30] 免费增加模型容量:参数高效微调的简单策略

链接: https://arxiv.org/abs/2407.01320
作者: Haobo Song,Hao Zhao,Soumajit Majumder,Tao Lin
关键词: large pre-trained foundation, Fine-tuning large pre-trained, pre-trained foundation models, large pre-trained, pre-trained foundation
中文关键词: 大型预训练基础,微调大型预训练、预训练的基础模型,大型预训练、预训练的基础
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICLR 2024. Code at this https URL

Abstract:Fine-tuning large pre-trained foundation models, such as the 175B GPT-3, has attracted more attention for downstream tasks recently. While parameter-efficient fine-tuning methods have been proposed and proven effective without retraining all model parameters, their performance is limited by the capacity of incremental modules, especially under constrained parameter budgets. To overcome this challenge, we propose CapaBoost, a simple yet effective strategy that enhances model capacity by leveraging low-rank updates through parallel weight modules in target layers. By applying static random masks to the shared weight matrix, CapaBoost constructs a diverse set of weight matrices, effectively increasing the rank of incremental weights without adding parameters. Notably, our approach can be seamlessly integrated into various existing parameter-efficient fine-tuning methods. We extensively validate the efficacy of CapaBoost through experiments on diverse downstream tasks, including natural language understanding, question answering, and image classification. Our results demonstrate significant improvements over baselines, without incurring additional computation or storage costs. Our code is available at this https URL.
摘要:微调大型预训练基础模型(如175B的GPT-3)以用于下游任务,最近吸引了更多的关注。虽然参数高效的微调方法已经被提出,并被证明无需重新训练所有模型参数即可奏效,但它们的性能受到增量模块容量的限制,特别是在参数预算受限的情况下。为了克服这一挑战,我们提出了CapaBoost,这是一个简单但有效的策略,通过目标层中的并行权重模块利用低秩更新来增强模型容量。通过将静态随机掩码应用于共享权重矩阵,CapaBoost构建了一组多样的权重矩阵,在不增加参数的情况下有效地提高了增量权重的秩。值得注意的是,我们的方法可以无缝地集成到各种现有的参数高效微调方法中。我们通过在不同下游任务(包括自然语言理解、问答和图像分类)上的实验,广泛地验证了CapaBoost的有效性。我们的结果表明,在不产生额外计算或存储成本的情况下,与基线相比有了显著的改进。我们的代码位于此https URL。
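
The claim that masking a shared weight matrix can raise the rank of the incremental weights without adding parameters can be checked on a toy case. The sketch assumes a LoRA-style update built from shared factors `B` and `A`, with a separate binary mask pair applied per parallel branch and the branch products summed; the abstract does not give this exact formulation, and the fixed hand-written masks stand in for the static random masks so that the result is deterministic. All names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def hadamard(a: Matrix, mask: Matrix) -> Matrix:
    """Element-wise product, used to apply a binary mask."""
    return [[x * m for x, m in zip(row, mask_row)] for row, mask_row in zip(a, mask)]


def rank(matrix: Matrix, tol: float = 1e-9) -> int:
    """Matrix rank via Gaussian elimination (row echelon form)."""
    rows = [list(map(float, r)) for r in matrix]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(found, len(rows)) if abs(rows[r][col]) > tol), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[found])]
        found += 1
    return found


def masked_update(b: Matrix, a: Matrix, masks: List[Tuple[Matrix, Matrix]]) -> Matrix:
    """Sum of parallel branches that all reuse the same B and A,
    each branch seeing them through its own fixed masks."""
    total = [[0.0] * len(a[0]) for _ in range(len(b))]
    for mask_b, mask_a in masks:
        branch = matmul(hadamard(b, mask_b), hadamard(a, mask_a))
        total = [[t + x for t, x in zip(tr, br)] for tr, br in zip(total, branch)]
    return total


# Shared rank-1 factors: the plain product B @ A has rank 1.
B = [[1.0], [1.0]]
A = [[1.0, 1.0]]
MASKS = [([[1.0], [0.0]], [[1.0, 0.0]]), ([[0.0], [1.0]], [[0.0, 1.0]])]
```

With these values `rank(matmul(B, A))` is 1, while `rank(masked_update(B, A, MASKS))` is 2, even though both use the same four stored parameters; the masks themselves are fixed and need not be trained.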

[NLP-31] Language Portability Strategies for Open-domain Dialogue with Pre-trained Language Models from High to Low Resource Languages
[NLP-31] 使用从高资源语言到低资源语言的预训练语言模型进行开放领域对话的语言移植策略

链接: https://arxiv.org/abs/2407.01315
作者: Ahmed Njifenjou,Virgile Sucal,Bassam Jabaian,Fabrice Lefèvre
关键词: linguistic portability strategies, open-domain dialogue systems, pre-trained language models, large pre-trained language, paper we propose
中文关键词: 语言可移植性策略、开放领域对话系统、预训练语言模型、大型预训练语言,我们提出的论文
类目: Computation and Language (cs.CL)
备注: The 13th International Workshop on Spoken Dialogue Systems Technology (IWSDS '23)

Abstract:In this paper we propose a study of linguistic portability strategies of large pre-trained language models (PLMs) used for open-domain dialogue systems in a high-resource language for this task. In particular the target low-resource language (L_T) will be simulated with French, as it lacks task-specific resources and allows our human evaluation, when the source language (L_S) is English. For obvious reasons, recent works using such models for open-domain dialogue are mostly developed in English. Yet building specific PLMs for each possible target language supposes collecting new datasets and is costly. For this reason, trying to leverage all existing resources (PLMs and data) in both L_S and L_T, we wish to assess the performance achievable in L_T with different approaches. The first two approaches evaluate the usage of Neural Machine Translation (NMT) at different levels: TrainOnTarget where a L_S dataset is translated before fine-tuning in L_T and TestOnSource where a L_S model is coupled with NMT modules during inference. Then, the advent of BLOOM [2], the world's first open-access multilingual large PLM, allows researchers to develop new approaches aiming to leverage not only the model’s full accessibility but also its multilingualism and translation abilities. In this context the task is learned in L_S first and adapted to L_T using the MAD-X Adapter architecture [16]. In the two sets of experiments models are evaluated in spoken dialogue conditions with humans and the strategies can be compared in terms of perceived interaction quality.
摘要:针对这一任务,我们提出了一种用于开放领域对话系统的大型预训练语言模型(PLM)的语言可移植策略的研究。特别是,目标低资源语言(L_T)将用法语模拟,因为它缺乏特定任务的资源,并且允许我们进行人工评估,而源语言(L_S)是英语。由于显而易见的原因,最近使用这种模式进行开放领域对话的著作大多是用英语开发的。然而,为每种可能的目标语言建立特定的PLM需要收集新的数据集,而且成本高昂。因此,尝试利用L_S和L_T中的所有现有资源(PLM和数据),我们希望以不同的方法评估L_T可以实现的性能。前两种方法在不同的水平上评估神经机器翻译的使用:TrainOnTarget,其中L_S的数据集在L_T中进行微调之前被翻译;以及TestOnSource,其中L_S模型与神经机器翻译模块在推理过程中耦合。然后,Bloom[2]的问世,世界上第一个开放获取的多语言大型PLM,允许研究人员开发新的方法,旨在不仅利用该模型的完全可访问性,而且还利用其多语言和翻译能力。在这种情况下,该任务首先在L_S中学习,并使用MAD-X适配器体系结构适应L_T[16]。在这两组实验中,在与人的口语对话条件下对模型进行了评估,并从感知交互质量的角度对两种策略进行了比较。

[NLP-32] Collaborative Performance Prediction for Large Language Models
[NLP-32] 大型语言模型的协同性能预测

链接: https://arxiv.org/abs/2407.01300
作者: Qiyuan Zhang,Fuyuan Lyu,Xue Liu,Chen Ma
关键词: NLP research, Comprehensively understanding, challenge in NLP, large language models, diverse downstream tasks
中文关键词: NLP研究、全面理解、NLP挑战、大型语言模型、多样化的下游任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. The pioneering scaling law on downstream works demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, they tend to overlook the similarities between model families and only consider design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect a collaborative data sourced from online platforms containing both historical performance and additional design factors. With the support of the collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.
摘要:全面理解和准确预测大型语言模型在不同下游任务中的性能已经成为自然语言处理研究中的一个关键挑战。关于下游工程的开创性比例定律展示了模型族内部的内在相似性,并利用这些相似性进行性能预测。但是,它们往往忽略模型族之间的相似性,只考虑原始比例定律中列出的设计因素。为了克服这些局限性,我们引入了一个新的框架,协作性能预测(CPP),它通过利用各种模型在下游任务上的历史性能以及模型和任务的其他设计因素来显著提高预测精度。我们还收集了来自在线平台的协作数据,其中包含历史性能和其他设计因素。在协作数据的支持下,CPP不仅在预测缩放LLMS的性能方面超过了传统的标度律,而且还有助于对因素重要性的详细分析,这是以前被忽视的领域。

[NLP-33] Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
[NLP-33] 具有混合适配器的轻量级零样本文本到语音

链接: https://arxiv.org/abs/2407.01291
作者: Kenichi Fujita,Takanori Ashihara,Marc Delcroix,Yusuke Ijima
关键词: demonstrated high fidelity, based on large-scale, demonstrated high, high fidelity, fidelity in reproducing
中文关键词: 表现出高保真度,在大规模的基础上,表现出高、高保真度、复制度
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages,3 figures, Accepted to INTERSPEECH 2024

Abstract:The advancements in zero-shot text-to-speech (TTS) methods, based on large-scale models, have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of parameters at 1.9 times faster inference speed. Audio samples are available on our demo page (this https URL).
摘要:基于大规模模型的零样本文本转语音(TTS)方法的进步已经证明了再现说话人特征的高保真度。然而,这些模型对于实际日常使用来说太大了。我们提出了一种使用混合适配器(MoA)的轻量级零样本TTS方法。我们提出的方法将MoA模块整合到非自回归TTS模型的解码器和方差适配器中。这些模块根据说话人嵌入选择与说话人特征相关的适当适配器,从而增强了以零样本方式适应各种说话人的能力。我们的方法以最少的附加参数实现了高质量的语音合成。通过客观和主观评估,我们确认我们的方法以不到40%的参数、1.9倍的推理速度实现了比基线更好的性能。音频样本可在我们的演示页面(此https URL)上找到。

[NLP-34] We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
[NLP-34] We-Math:您的大型多模态模型能否实现类人的数学推理?

链接: https://arxiv.org/abs/2407.01284
作者: Runqi Qiao,Qiuna Tan,Guanting Dong,Minhui Wu,Chong Sun,Xiaoshuai Song,Zhuoma GongQue,Shanglin Lei,Zhe Wei,Miaoxuan Zhang,Runfeng Qiao,Yifan Zhang,Xiao Zong,Yida Xu,Muxi Diao,Zhimin Bao,Chen Li,Honggang Zhang
关键词: Large Multimodal Models, Multimodal Models, Large Multimodal, received widespread attention, Visual mathematical reasoning
中文关键词: 大型多模态模型,多模态模型,大型多模态,受到广泛关注,视觉数学推理
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: Work in progress

Abstract:Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.
摘要:视觉数学推理作为一种基本的视觉推理能力,受到了大型多模态模型(LMM)领域的广泛关注。现有的基准,如MathVista和MathVerse,更多地关注以结果为导向的性能,而忽视了知识获取和泛化的基本原理。受类人数学推理的启发,我们引入了WE-MATH,这是第一个专门为探索端到端性能之外的问题解决原理而设计的基准。我们精心收集和归类了6.5K道视觉数学题,跨越67个层次化的知识概念和5层知识粒度。我们根据所需的知识概念将复合问题分解为子问题,并引入了一种新的四维度量,即知识不足(IK)、泛化不足(IG)、完全掌握(CM)和死记硬背(RM),以分层地评估LMM推理过程中的内在问题。我们使用WE-MATH对现有LMM的视觉数学推理进行了全面的评估,发现求解步骤与问题具体表现之间存在负相关关系。我们证实,通过知识增强策略可以有效地改善LMM的IK问题。更值得注意的是,GPT-4o的主要挑战已经显著地从IK过渡到IG,使其成为第一个迈向知识泛化阶段的LMM。相比之下,其他LMM表现出明显的死记硬背倾向:它们正确地解决了涉及多个知识概念的复合问题,但未能回答子问题。我们预计WE-MATH将为LMM的视觉数学推理的发展开辟新的途径。WE-MATH数据和评估代码可在此HTTPS URL中找到。
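
The four-way WE-MATH metric can be read as a small decision rule over one composite problem and its sub-problems. The sketch below is an assumption-laden illustration, not the benchmark's evaluation code: only the Rote Memorization case (composite solved, some sub-problem failed) is stated in the abstract, and the rules for IK, IG, and CM are inferred from the category names. The function names and the sample records are hypothetical.

```python
def classify_outcome(sub_correct, composite_correct):
    """Map one composite problem to IK, IG, CM, or RM.

    sub_correct: list of booleans, one per sub-problem.
    composite_correct: whether the composite problem was solved.
    Assumed rules (only RM is spelled out in the abstract):
      CM: composite right, all sub-problems right
      RM: composite right, at least one sub-problem wrong
      IG: composite wrong, all sub-problems right
      IK: composite wrong, at least one sub-problem wrong
    """
    if not sub_correct:
        raise ValueError("at least one sub-problem is required")
    all_subs_right = all(sub_correct)
    if composite_correct:
        return "CM" if all_subs_right else "RM"
    return "IG" if all_subs_right else "IK"


def summarize(records):
    """Return the share of each category over (sub_correct, composite_correct) pairs."""
    counts = {"IK": 0, "IG": 0, "CM": 0, "RM": 0}
    for sub_correct, composite_correct in records:
        counts[classify_outcome(sub_correct, composite_correct)] += 1
    total = len(records)
    return {label: count / total for label, count in counts.items()}


# Hypothetical results for four composite problems.
records = [
    ([True, True], True),    # CM
    ([True, False], True),   # RM
    ([True, True], False),   # IG
    ([False, False], False), # IK
]
shares = summarize(records)
print(shares)
```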

[NLP-35] Leveraging Large Language Models for Actionable Course Evaluation Student Feedback to Lecturers
[NLP-35] 利用大型语言模型进行可操作课程评估学生向讲师的反馈

链接: https://arxiv.org/abs/2407.01274
作者: Mike Zhang,Euan D Lindsay,Frederik Bode Thorbensen,Danny Bøgsted Poulsen,Johannes Bjerva
关键词: End of semester, dominant mechanism, semester student evaluations, End, feedback
中文关键词: 学期结束,主导机制,学期学生评估,结束,反馈
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Accepted to SEFI 2024

Abstract:End of semester student evaluations of teaching are the dominant mechanism for providing feedback to academics on their teaching practice. For large classes, however, the volume of feedback makes these tools impractical for this purpose. This paper explores the use of open-source generative AI to synthesise factual, actionable and appropriate summaries of student feedback from these survey responses. In our setup, we have 742 student responses ranging over 75 courses in a Computer Science department. For each course, we synthesise a summary of the course evaluations and actionable items for the instructor. Our results reveal a promising avenue for enhancing teaching practices in the classroom setting. Our contribution lies in demonstrating the feasibility of using generative AI to produce insightful feedback for teachers, thus providing a cost-effective means to support educators’ development. Overall, our work highlights the possibility of using generative AI to produce factual, actionable, and appropriate feedback for teachers in the classroom setting.
摘要:学期末学生对教学的评价是向学者反馈教学实践的主要机制。然而,对于大班来说,反馈的数量使这些工具不适用于此目的。本文探讨了如何使用开源的生成性人工智能来从这些调查答复中综合出事实的、可操作的和适当的学生反馈摘要。在我们的设置中,我们有742名学生回应,涉及计算机科学系的75门课程。对于每门课程,我们为教师综合课程评估和可操作项目的摘要。我们的结果揭示了在课堂环境中加强教学实践的一条很有希望的途径。我们的贡献在于证明了使用生成性人工智能为教师产生有洞察力的反馈的可行性,从而为支持教育工作者的发展提供了一种经济有效的手段。总体而言,我们的工作突出了使用生成性人工智能在课堂环境中为教师产生事实的、可操作的和适当的反馈的可能性。

[NLP-36] Show Less Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
[NLP-36] 少展示,多指导:用定义和指南丰富零样本NER的提示

链接: https://arxiv.org/abs/2407.01272
作者: Andrew Zamai,Andrea Zugarini,Leonardo Rigutini,Marco Ernandes,Marco Maggini
关键词: instruction-tuned Large Language, Large Language Models, specialized instruction-tuned Large, Large Language, Named Entity Recognition
中文关键词: 指令微调的大型语言、大型语言模型、专门的指令微调的大型、大型语言、命名实体识别
类目: Computation and Language (cs.CL)
备注:

Abstract:Recently, several specialized instruction-tuned Large Language Models (LLMs) for Named Entity Recognition (NER) have emerged. Compared to traditional NER approaches, these models have strong generalization capabilities. Existing LLMs mainly focus on zero-shot NER in out-of-domain distributions, being fine-tuned on an extensive number of entity classes that often highly or completely overlap with test sets. In this work instead, we propose SLIMER, an approach designed to tackle never-seen-before named entity tags by instructing the model on fewer examples, and by leveraging a prompt enriched with definition and guidelines. Experiments demonstrate that definition and guidelines yield better performance, faster and more robust learning, particularly when labelling unseen Named Entities. Furthermore, SLIMER performs comparably to state-of-the-art approaches in out-of-domain zero-shot NER, while being trained on a reduced tag set.
摘要:最近,出现了几种专门用于命名实体识别(NER)的指令微调大型语言模型(LLM)。与传统NER方法相比,这些模型具有很强的泛化能力。现有的LLM主要关注域外分布中的零样本NER,并在大量通常与测试集高度或完全重叠的实体类上进行微调。相反,在这项工作中,我们提出了SLIMER,这是一种旨在通过在更少的示例上指导模型并利用富含定义和指南的提示来处理从未见过的命名实体标签的方法。实验表明,定义和指南可以带来更好的性能以及更快、更稳健的学习,特别是在标记未见过的命名实体时。此外,SLIMER在域外零样本NER中的表现与最先进的方法相当,同时只在精简的标签集上进行训练。

[NLP-37] First Place Solution of 2023 Global Artificial Intelligence Technology Innovation Competition Track 1
[NLP-37] 2023年全球人工智能技术创新大赛第一赛道解决方案第一名

链接: https://arxiv.org/abs/2407.01271
作者: Xiangyu Wu,Hailiang Zhang,Yang Yang,Jianfeng Lu
关键词: Innovation Competition Track, Medical Imaging Diagnosis, Global Artificial Intelligence, Artificial Intelligence Technology, Intelligence Technology Innovation
中文关键词: 创新竞赛赛道、医学影像诊断、全球人工智能、人工智能技术、智能技术创新
类目: Computation and Language (cs.CL)
备注: First Place of 2023 Global Artificial Intelligence Technology Innovation Competition

Abstract:In this paper, we present our champion solution to the Global Artificial Intelligence Technology Innovation Competition Track 1: Medical Imaging Diagnosis Report Generation. We select CPT-BASE as our base model for the text generation task. During the pre-training stage, we delete the mask language modeling task of CPT-BASE and instead reconstruct the vocabulary, adopting a span mask strategy and gradually increasing the number of masking ratios to perform the denoising auto-encoder pre-training task. In the fine-tuning stage, we design iterative retrieval augmentation and noise-aware similarity bucket prompt strategies. The retrieval augmentation constructs a mini-knowledge base, enriching the input information of the model, while the similarity bucket further perceives the noise information within the mini-knowledge base, guiding the model to generate higher-quality diagnostic reports based on the similarity prompts. Surprisingly, our single model has achieved a score of 2.321 on leaderboard A, and the multiple model fusion scores are 2.362 and 2.320 on the A and B leaderboards respectively, securing first place in the rankings.
摘要:在本文中,我们介绍了我们在全球人工智能技术创新大赛第一赛道:医学影像诊断报告生成方面的冠军解决方案。我们选择CPT-BASE作为文本生成任务的基本模型。在预训练阶段,我们删除了CPT-BASE的掩码语言建模任务,代之以重建词汇表,采用跨度掩码策略,逐步增加掩蔽率来完成去噪自动编码器的预训练任务。在微调阶段,我们设计了迭代检索增强和噪声感知相似桶提示策略。检索扩充构建了一个微型知识库,丰富了模型的输入信息,而相似桶则进一步感知微型知识库中的噪声信息,指导模型基于相似性提示生成更高质量的诊断报告。令人惊讶的是,我们的单模在排行榜A上获得了2.321分,多模融合在A和B排行榜上的得分分别为2.362和2.320,确保了排名第一。

[NLP-38] The African Woman is Rhythmic and Soulful: Evaluation of Open-ended Generation for Implicit Biases
[NLP-38] 非洲女性有节奏又有灵魂:开放式生成的隐性偏见评估

链接: https://arxiv.org/abs/2407.01270
作者: Serene Lim
关键词: Large Language Models, Language Models, Large Language, LLM Decision Bias, demonstrate underlying prejudices
中文关键词: 大型语言模型、语言模型、大型语言、LLM决策偏见,展示了潜在的偏见
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:This study investigates the subtle and often concealed biases present in Large Language Models (LLMs), which, despite passing explicit bias tests, can still exhibit implicit biases akin to those observed in humans who profess egalitarian beliefs yet demonstrate underlying prejudices. The challenge of measuring such biases is exacerbated as LLMs become increasingly proprietary, restricting access to their internal mechanisms such as embeddings, which are crucial for applying traditional bias measures. To tackle these issues, this study introduces innovative measures of bias inspired by psychological methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM Decision Bias. The LLM IAT Bias is a prompt-based method designed to unearth implicit biases by simulating the well-known psychological IAT but adapted for use with LLMs. The LLM Decision Bias measure is developed to detect subtle discrimination in decision-making tasks, focusing on how LLMs choose between individuals in various scenarios. Open-ended generation is also utilised through thematic analysis of word generations and storytelling. The experiments revealed biases across gender and racial domains, from discriminatory categorisations to exoticisation. Our findings indicate that the prompt-based measure of implicit bias not only correlates with traditional embedding-based methods but also more effectively predicts downstream behaviors, which are crucially measured by the LLM Decision Bias. This relationship underscores the importance of relative, rather than absolute, evaluations in assessing implicit biases, reflecting psychological insights into human bias assessment. This research contributes to the broader understanding of AI ethics and provides suggestions for continually assessing and mitigating biases in advanced AI systems, emphasising the need for more qualitative and downstream focus.
摘要:这项研究调查了大型语言模型(LLM)中存在的微妙且往往被隐藏的偏见,尽管通过了显性偏见测试,但仍可能表现出类似于在自称平等主义信念但表现出潜在偏见的人类身上观察到的内隐偏见。随着LLM变得越来越专有,限制了对其嵌入等内部机制的访问,衡量此类偏差的挑战加剧,而这些机制对于应用传统的偏差衡量标准至关重要。为了解决这些问题,本研究引入了受心理学方法论启发的偏差的创新测量方法:LLM内隐联想测试(IAT)偏差和LLM决策偏差。LLMIAT偏差是一种基于即时的方法,旨在通过模拟众所周知的心理IAT来挖掘内隐偏见,但适用于LLMS。LLM决策偏差测量是为了检测决策任务中的细微差别,重点关注LLM在不同情景下如何在个人之间做出选择。开放式生成也通过对词语生成和讲故事的主题分析来使用。这些实验揭示了性别和种族领域的偏见,从歧视性的分类到异国情调。我们的发现表明,基于提示的内隐偏差测量不仅与传统的基于嵌入的方法相关,而且更有效地预测下游行为,这一点是由LLM决策偏差来衡量的。这种关系突显了相对评估而不是绝对评估在评估隐性偏见方面的重要性,反映了对人类偏见评估的心理学见解。这项研究有助于更广泛地理解人工智能伦理,并为持续评估和减轻先进人工智能系统中的偏差提供建议,强调需要更多地关注定性和下游。

[NLP-39] SignCLIP: Connecting Text and Sign Language by Contrastive Learning
[NLP-39] SignCLIP:通过对比学习连接文本和手语

链接: https://arxiv.org/abs/2407.01264
作者: Zifan Jiang,Gerard Sant,Amit Moryossef,Mathias Müller,Rico Sennrich,Sarah Ebling
关键词: Contrastive Language-Image Pretraining, Contrastive Language-Image, Language-Image Pretraining, sign language, language
中文关键词: 对比语言-图像预训练,对比语言-图像,语言-图像预训练,手语,语言
类目: Computation and Language (cs.CL)
备注:

Abstract:We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language which is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively for out-of-domain downstream tasks such as isolated sign language recognition upon essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
摘要:我们提出了SignCLIP,它重新利用CLIP(对比语言-图像预训练)将口语文本和手语视频这两类不同模态的自然语言投影到同一空间。SignCLIP是一种有效的方法,可以从大规模、多语言的视频-文本对中学习用于手语处理的有用的视觉表示,而不需要直接针对特定任务或通常规模有限的手语进行优化。我们在Spreadthesign上对SignCLIP进行了预训练,Spreadthesign是一个著名的手语词典,由多达44种手语的约50万个视频片段组成,并使用各种下游数据集对其进行评估。SignCLIP能够识别域内手语,具有显著的文本到视频/视频到文本检索精度。它还在域外下游任务方面具有竞争力,例如在基本的少样本提示或微调后进行孤立手语识别。我们分析了口语文本和手语姿势所形成的潜在空间,这为我们提供了更多的语言学见解。我们的代码和模型是公开提供的。
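
Once text and video are projected into the same space, text-to-video retrieval reduces to ranking candidates by similarity to the query embedding. The sketch below shows that ranking step with cosine similarity, which is the usual choice for CLIP-style models but is an assumption here, as are the three-dimensional toy embeddings and the clip names. It does not use the SignCLIP code or models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates, k=1):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings; a real encoder would produce these vectors.
text_embedding = [0.9, 0.1, 0.0]
video_embeddings = {
    "clip_hello": [1.0, 0.0, 0.1],
    "clip_thanks": [0.0, 1.0, 0.2],
    "clip_house": [0.1, 0.2, 1.0],
}

top = retrieve(text_embedding, video_embeddings, k=2)
print(top)
```

Video-to-text retrieval is the same computation with the roles swapped: a video embedding is the query and the text embeddings are the candidates.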

[NLP-40] uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling
[NLP-40] uDistil-Whisper:通过大规模伪标签进行知识蒸馏的无标签数据过滤

链接: https://arxiv.org/abs/2407.01257
作者: Abdul Waheed,Karima Kadaoui,Muhammad Abdul-Mageed
关键词: distilling Whisper knowledge, Recent work, models, reducing the size, Recent
中文关键词: 蒸馏Whisper知识,最近的工作,模型,缩小规模,最近
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Work in progress

Abstract:Recent work on distilling Whisper’s knowledge into small models using pseudo-labels shows promising performance while reducing the size by up to 50%. This results in small, efficient, and dedicated models. However, a critical step of distillation from pseudo-labels involves filtering high-quality predictions and using only those during training. This step requires ground truth to compare and filter bad examples making the whole process supervised. In addition to that, the distillation process requires a large amount of data thereby limiting the ability to distil models in low-resource settings. To address this challenge, we propose an unsupervised or label-free framework for distillation, thus eliminating the requirement for labeled data altogether. Through experimentation, we show that our best distilled models outperform the teacher model by 5-7 points in terms of WER. Additionally, our models are on par with or better than similar supervised data filtering setup. When we scale the data, our models significantly outperform all zero-shot and supervised models. In this work, we demonstrate that it’s possible to distill large Whisper models into relatively small models without using any labeled data. As a result, our distilled models are 25-50% more compute and memory efficient while maintaining performance equal to or better than the teacher model.
摘要:最近使用伪标签将Whisper的知识蒸馏到小模型中的工作显示出良好的性能,同时将模型大小减少了多达50%。这就产生了小型、高效和专用的模型。然而,基于伪标签的蒸馏的一个关键步骤是筛选出高质量的预测,并在训练期间只使用这些预测。这一步需要真实标签来比较和过滤不良样本,使整个过程成为有监督的。除此之外,蒸馏过程需要大量数据,从而限制了在低资源环境下蒸馏模型的能力。为了应对这一挑战,我们提出了一个无监督或无标签的蒸馏框架,从而完全消除了对标注数据的要求。通过实验,我们发现我们最好的蒸馏模型在WER方面比教师模型好5-7个点。此外,我们的模型与类似的有监督数据过滤设置不相上下,甚至更好。当我们扩大数据规模时,我们的模型显著优于所有的零样本模型和有监督模型。在这项工作中,我们证明了在不使用任何标注数据的情况下,将大型Whisper模型蒸馏为相对较小的模型是可能的。因此,我们的蒸馏模型在保持与教师模型相同或更好的性能的同时,计算和内存效率提高了25%-50%。
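
The abstract reports gains in WER (word error rate). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The whitespace tokenization and the example sentences are assumptions for illustration; the paper's own text normalization is not described in the abstract.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical transcripts: one reference word ("the") is dropped.
score = wer("the cat sat on the mat", "the cat sat on mat")
print(round(score, 3))
```

A lower WER is better, so a model that "outperforms by 5-7 points" has a WER that many percentage points lower than the comparison model.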

[NLP-41] Large Language Models are Zero-Shot Recognizers for Activities of Daily Living
[NLP-41] 大型语言模型是日常生活活动的零样本识别器

链接: https://arxiv.org/abs/2407.01238
作者: Gabriele Civitarese,Michele Fiori,Priyankar Choudhary,Claudio Bettini
关键词: Daily Living, Large Language Models, energy management, smart home environments, home environments enables
中文关键词: 日常生活、大型语言模型、能源管理、智能家居环境、家庭环境使
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: Currently under review

Abstract:The sensor-based recognition of Activities of Daily Living (ADLs) in smart home environments enables several applications in the areas of energy management, safety, well-being, and healthcare. ADLs recognition is typically based on deep learning methods requiring large datasets to be trained. Recently, several studies proved that Large Language Models (LLMs) effectively capture common-sense knowledge about human activities. However, the effectiveness of LLMs for ADLs recognition in smart home environments still deserves to be investigated. In this work, we propose ADL-LLM, a novel LLM-based ADLs recognition system. ADL-LLM transforms raw sensor data into textual representations, which are processed by an LLM to perform zero-shot ADLs recognition. Moreover, in the scenario where a small labeled dataset is available, ADL-LLM can also be empowered with few-shot prompting. We evaluated ADL-LLM on two public datasets, showing its effectiveness in this domain.
摘要:智能家居环境中基于传感器的日常生活活动(ADL)识别实现了能源管理、安全、福祉和医疗保健领域的多种应用。ADL识别通常基于需要大型数据集进行训练的深度学习方法。最近,几项研究证明,大型语言模型(LLM)可以有效地捕获有关人类活动的常识知识。然而,LLM在智能家居环境中用于ADL识别的有效性仍然值得研究。在这项工作中,我们提出了ADL-LLM,这是一种新型的基于LLM的ADL识别系统。ADL-LLM将原始传感器数据转换为文本表示,由LLM处理以执行零样本ADL识别。此外,在有小型标注数据集可用的情况下,ADL-LLM还可以通过少样本提示得到增强。我们在两个公共数据集上评估了ADL-LLM,展示了其在该领域的有效性。

[NLP-42] MIRAI: Evaluating LLM Agents for Event Forecasting
[NLP-42] MIRAI:评估LLM代理的事件预测

链接: https://arxiv.org/abs/2407.01231
作者: Chenchen Ye,Ziniu Hu,Yihe Deng,Zijie Huang,Mingyu Derek Ma,Yanqiao Zhu,Wei Wang
关键词: solve complex problems, LLM agents, Large Language Models, empowered LLM agents, Recent advancements
中文关键词: 解决复杂问题、LLM代理、大型语言模型、授权LLM代理、最新进展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 66 pages, 8 figures, 6 tables; Website: this https URL

Abstract:Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, increasing interests have been put into employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents’ forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents’ abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents’ capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.
摘要:大型语言模型(LLM)的最新进展使LLM代理能够自主地收集世界信息,并在这些信息上进行推理来解决复杂的问题。鉴于这一能力,越来越多的人对使用LLM代理来预测国际事件产生了兴趣,这可能会影响决策并在国际范围内制定政策。尽管人们的兴趣与日俱增,但LLM代理的预测能力和可靠性缺乏严格的基准。为了弥补这一差距,我们引入了Mirai,这是一个新的基准,旨在系统地评估LLM代理在国际事件背景下作为时间预测者的作用。我们的基准具有代理环境,具有访问历史、结构化事件和文本新闻文章的广泛数据库的工具。我们通过仔细的清理和解析来精炼GDELT事件数据库,以管理一系列具有不同预测视野的关系预测任务,评估LLM代理从短期到长期的预测能力。我们进一步实现了API,使LLM代理能够通过基于代码的接口使用不同的工具。总而言之,Mirai从三个方面全面评估了代理的能力:1)自主地从大型全球数据库中获取和集成关键信息;2)使用特定于领域的API和库编写代码以供工具使用;3)联合推理来自不同格式和时间的历史知识,以准确预测未来事件。通过全面的基准,我们的目标是建立一个可靠的框架,以评估LLM代理预测国际事件的能力,从而有助于开发更准确和可靠的国际关系分析模型。

[NLP-43] Searching for Best Practices in Retrieval-Augmented Generation
[NLP-43] 寻找检索增强生成的最佳实践

链接: https://arxiv.org/abs/2407.01219
作者: Xiaohua Wang,Zhenghua Wang,Xuan Gao,Feiran Zhang,Yixin Wu,Zhibo Xu,Tianyuan Shi,Zhengyuan Wang,Shizheng Li,Qi Qian,Ruicheng Yin,Changze Lv,Xiaoqing Zheng,Xuanjing Huang
关键词: enhancing response quality, mitigating hallucinations, effective in integrating, specialized domains, Retrieval-augmented generation
中文关键词: 提高响应质量,减轻幻觉,有效整合,专业领域,检索增强生成
类目: Computation and Language (cs.CL)
备注:

Abstract:Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a “retrieval as generation” strategy.
摘要:检索增强生成(RAG)技术已被证明在整合最新信息、减轻幻觉和提高响应质量方面是有效的,特别是在专业领域。虽然已经提出了许多RAG方法来通过依赖于查询的检索来增强大型语言模型,但这些方法仍然存在实现复杂和响应时间延长的问题。通常,RAG工作流程涉及多个处理步骤,每个步骤都可以以各种方式执行。在这里,我们调查现有的RAG方法及其潜在的组合,以确定最佳的RAG实践。通过大量的实验,我们提出了几种平衡性能和效率的RAG部署策略。此外,我们还证明了多模态检索技术可以显著提高关于视觉输入的问答能力,并使用“检索即生成”的策略来加速多模态内容的生成。

[NLP-44] EconNLI: Evaluating Large Language Models on Economics Reasoning
[NLP-44] EconNLI:评估大型语言模型的经济推理能力

链接: https://arxiv.org/abs/2407.01212
作者: Yue Guo,Yi Yang
关键词: Large Language Models, providing financial advice, lacks systematic evaluation, Large Language, Language Models
中文关键词: 大型语言模型,提供财务建议,缺乏系统评估,大型语言,语言模型
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

Abstract:Large Language Models (LLMs) are widely used for writing economic analysis reports or providing financial advice, but their ability to understand economic knowledge and reason about potential results of specific economic events lacks systematic evaluation. To address this gap, we propose a new dataset, natural language inference on economic events (EconNLI), to evaluate LLMs’ knowledge and reasoning abilities in the economic domain. We evaluate LLMs on (1) their ability to correctly classify whether a premise event will cause a hypothesis event and (2) their ability to generate reasonable events resulting from a given premise. Our experiments reveal that LLMs are not sophisticated in economic reasoning and may generate wrong or hallucinated answers. Our study raises awareness of the limitations of using LLMs for critical decision-making involving economic reasoning and analysis. The dataset and codes are available at this https URL.
摘要:大型语言模型(LLM)被广泛用于撰写经济分析报告或提供财务建议,但其理解经济知识和对特定经济事件潜在结果进行推理的能力缺乏系统评估。为了弥补这一差距,我们提出了一个新的数据集,即经济事件的自然语言推理(EconNLI),来评估LLM在经济领域的知识和推理能力。我们评估LLM的指标包括:(1)它们正确分类前提事件是否会导致假设事件的能力,以及(2)它们生成由给定前提产生的合理事件的能力。我们的实验表明,LLM在经济推理方面并不成熟,可能会产生错误或幻觉的答案。我们的研究提高了人们对使用LLM进行涉及经济推理和分析的关键决策的局限性的认识。数据集和代码可在此https URL中获取。

[NLP-45] Memory^3: Language Modeling with Explicit Memory
[NLP-45] Memory^3:具有显式记忆的语言建模

链接: https://arxiv.org/abs/2407.01178
作者: Hongkang Yang,Zehao Lin,Wenjin Wang,Hao Wu,Zhiyu Li,Bo Tang,Wenqiang Wei,Jinbo Wang,Zeyun Tang,Shichao Song,Chenyang Xi,Yu Yu,Kai Chen,Feiyu Xiong,Linpeng Tang,Weinan E
关键词: large language models, meaningful computation, large language, costly process, process that transports
中文关键词: 大型语言模型、有意义的计算、大型语言、昂贵的过程、传输的过程
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining “abstract knowledge”. As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named Memory^3, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.
摘要:大型语言模型(LLM)的训练和推理是一个昂贵的过程,它将知识从原始数据传输到有意义的计算。受人脑记忆层次的启发,我们通过为LLM配备显式记忆来降低这一成本,这是一种比模型参数和文本检索增强生成(RAG)更便宜的记忆格式。从概念上讲,随着LLM的大部分知识外化到显式记忆中,LLM可以享受较小的参数大小、训练成本和推理成本,所有这些都与剩余的“抽象知识”的数量成比例。作为概念的初步验证,我们从零开始训练一个2.4B的LLM,它获得了比大得多的LLM和RAG模型更好的性能,并保持了比RAG更高的解码速度。该模型被命名为Memory^3,因为显式记忆是LLM中继隐式记忆(模型参数)和工作记忆(上下文键-值)之后的第三种记忆形式。我们引入了支持知识外部化的记忆电路理论,并提出了新的技术,包括使存储易于处理的记忆稀疏机制和促进记忆形成的两阶段预训练方案。

[NLP-46] Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
[NLP-46] 学习探索与选择:面向覆盖条件的检索增强生成

链接: https://arxiv.org/abs/2407.01158
作者: Takyoung Kim,Kyungjae Lee,Young Rok Jang,Ji Yong Cho,Gangwoo Kim,Minseok Cho,Moontae Lee
关键词: extensive parametric capacities, typically yield long-form, Interactions with billion-scale, yield long-form responses, long-form responses due
中文关键词: 广泛的参数能力,通常产生长形式,与十亿级的相互作用,产生长形式响应,长形式响应
类目: Computation and Language (cs.CL)
备注: Work in progress. Resources are available at this https URL

Abstract:Interactions with billion-scale large language models typically yield long-form responses due to their extensive parametric capacities, along with retrieval-augmented features. While detailed responses provide insightful viewpoint of a specific subject, they frequently generate redundant and less engaging content that does not meet user interests. In this work, we focus on the role of query outlining (i.e., selected sequence of queries) in scenarios that users request a specific range of information, namely coverage-conditioned ( C^2 ) scenarios. For simulating C^2 scenarios, we construct QTree, 10K sets of information-seeking queries decomposed with various perspectives on certain topics. By utilizing QTree, we train QPlanner, a 7B language model generating customized query outlines that follow coverage-conditioned queries. We analyze the effectiveness of generated outlines through automatic and human evaluation, targeting on retrieval-augmented generation (RAG). Moreover, the experimental results demonstrate that QPlanner with alignment training can further provide outlines satisfying diverse user interests. Our resources are available at this https URL.
摘要:与数十亿规模的大型语言模型的交互通常会产生长形式的响应,这是因为它们具有广泛的参数容量以及检索增强的功能。虽然详细的回复提供了对特定主题的有洞察力的观点,但它们经常产生多余的、不太吸引人的内容,不符合用户的兴趣。在这项工作中,我们专注于查询大纲(即选定的查询序列)在用户请求特定范围的信息的场景中的作用,即覆盖条件(C^2)场景。为了模拟C^2场景,我们构建了QTree,10K个信息搜索查询集,这些查询以特定主题的不同视角进行分解。通过使用QTree,我们训练了QPlanner,这是一种7B语言模型,生成遵循覆盖条件查询的定制查询大纲。我们针对检索增强生成(RAG),通过自动评价和人工评价来分析生成的轮廓的有效性。此外,实验结果表明,经过对齐训练的QPlanner可以进一步提供满足不同用户兴趣的轮廓。我们的资源可以在这个HTTPS URL上找到。

[NLP-47] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models
[NLP-47] 将所有内容分开:或将任何文本与多模式模型中的任何图像对齐

链接: https://arxiv.org/abs/2407.01157
作者: Shaeke Salman,Md Montasir Bin Shams,Xiuwen Liu
关键词: unprecedented zero-shot capabilities, exhibit unprecedented zero-shot, shared embedding space, models exhibit unprecedented, zero-shot capabilities
中文关键词: 前所未有的零样本能力、展现出前所未有的零样本、共享嵌入空间、模型展现出前所未有的、零样本能力
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568 , arXiv:2402.08473

Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.
摘要:利用共享的嵌入空间,新兴的多模态模型显示出前所未有的零样本能力。然而,如果不同的模态可能会错位,共享嵌入空间可能会导致新的漏洞。在本文中,我们扩展和利用了最近开发的一种有效的基于梯度的方法,该方法允许我们通过对图像进行最小限度的修改来匹配给定文本的嵌入。利用该过程,我们证明了在联合图文模型中,通过不可察觉的对抗性攻击,可以将可区分文本的嵌入与任何图像对齐,从而揭示了语义无关的图像可以具有相同文本的嵌入,同时视觉上不可区分的图像可以与非常不同的文本的嵌入相匹配。将该方法应用于多个来源的文本数据集和图像,取得了100%的成功率。如果不克服这一弱点,多模态模型就不能以语义有意义的方式稳健地对齐来自不同模态的输入。警告:本文中使用的文本数据是有毒的,可能会冒犯某些读者。

[NLP-48] Sociocultural Considerations in Monitoring Anti-LGBTQ Content on Social Media
[NLP-48] 监控社交媒体上反LGBTQ内容的社会文化考虑

链接: https://arxiv.org/abs/2407.01149
作者: Sidney G.-J. Wong
关键词: sociocultural factors, open-source training data, open-source hate speech, hate speech data, hate speech detection
中文关键词: 社会文化因素、开源训练数据、开源仇恨言论、仇恨言论数据、仇恨言论检测
类目: Computation and Language (cs.CL)
备注: Accepted Manuscript ACL 2024 Workshop C3NLP

Abstract:The purpose of this paper is to ascertain the influence of sociocultural factors (i.e., social, cultural, and political) in the development of hate speech detection systems. We set out to investigate the suitability of using open-source training data to monitor levels of anti-LGBTQ+ content on social media across different national-varieties of English. Our findings suggest the social and cultural alignment of open-source hate speech data sets influences the predicted outputs. Furthermore, the keyword-search approach of anti-LGBTQ+ slurs in the development of open-source training data encourages detection models to overfit on slurs; therefore, anti-LGBTQ+ content may go undetected. We recommend combining empirical outputs with qualitative insights to ensure these systems are fit for purpose.
摘要:本文的目的是确定社会文化因素(即社会、文化和政治因素)在仇恨言论检测系统开发中的影响。我们着手调查使用开源训练数据来监控不同国家英语变体社交媒体上反LGBTQ+内容水平的适用性。我们的研究结果表明,开源仇恨言论数据集的社会和文化一致性会影响预测的输出。此外,在开发开源训练数据时针对反LGBTQ+侮辱性词汇的关键词搜索方法会促使检测模型对这些词汇过拟合;因此,反LGBTQ+内容可能无法被检测到。我们建议将经验输出与定性见解相结合,以确保这些系统适合其用途。

[NLP-49] An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification
[NLP-49] 产品属性-价值识别生成方法的实证比较

链接: https://arxiv.org/abs/2407.01137
作者: Kassem Sabeh,Robert Litschko,Mouna Kacimi,Barbara Plank,Johann Gamper
关键词: e-commerce platforms, supporting applications, applications like search, question answering, Product attributes
中文关键词: 电子商务平台、支持应用程序、搜索等应用程序、问答、产品属性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Product attributes are crucial for e-commerce platforms, supporting applications like search, recommendation, and question answering. The task of Product Attribute and Value Identification (PAVI) involves identifying both attributes and their values from product information. In this paper, we formulate PAVI as a generation task and provide, to the best of our knowledge, the most comprehensive evaluation of PAVI so far. We compare three different attribute-value generation (AVG) strategies based on fine-tuning encoder-decoder models on three datasets. Experiments show that end-to-end AVG approach, which is computationally efficient, outperforms other strategies. However, there are differences depending on model sizes and the underlying language model. The code to reproduce all experiments is available at: this https URL
摘要:产品属性对于电子商务平台至关重要,支持搜索、推荐和问答等应用程序。产品属性和价值识别(PAVI)的任务涉及从产品信息中识别属性及其价值。在本文中,我们将PAVI制定为一项生成任务,并提供了据我们所知迄今为止对PAVI最全面的评估。我们基于三个数据集上的微调编码器-解码器模型,比较了三种不同的属性值生成(AVG)策略。实验表明,端到端AVG方法计算效率高,优于其他策略。然而,根据模型大小和基础语言模型的不同,存在差异。复制所有实验的代码可在以下网址获取:此https URL

[NLP-50] Cross-Lingual Transfer Learning for Speech Translation
[NLP-50] 语音翻译的跨语言迁移学习

链接: https://arxiv.org/abs/2407.01130
作者: Rao Ma,Yassir Fathullah,Mengjie Qian,Siyuan Tang,Mark Gales,Kate Knill
关键词: increasing interest, interest in building, building multilingual foundation, NLP, NLP tasks
中文关键词: 增加兴趣,对建立、建立多语言基础、NLP、NLP任务的兴趣
类目: Computation and Language (cs.CL)
备注:

Abstract:There has been increasing interest in building multilingual foundation models for NLP and speech research. Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks where a model fine-tuned on task-specific data in one language yields performance gains in other languages. Here, we explore whether speech-based models exhibit the same transfer capability. Using Whisper as an example of a multilingual speech foundation model, we examine the utterance representation generated by the speech encoder. Despite some language-sensitive information being preserved in the audio embedding, words from different languages are mapped to a similar semantic space, as evidenced by a high recall rate in a speech-to-speech retrieval task. Leveraging this shared embedding space, zero-shot cross-lingual transfer is demonstrated in speech translation. When the Whisper model is fine-tuned solely on English-to-Chinese translation data, performance improvements are observed for input utterances in other languages. Additionally, experiments on low-resource languages show that Whisper can perform speech translation for utterances from languages unseen during pre-training by utilizing cross-lingual representations.
摘要:为自然语言处理和语音研究建立多语言基础模型越来越受到人们的关注。在一系列NLP任务上已经展示了零样本跨语言迁移,即在一种语言的特定任务数据上微调的模型在其他语言上也能获得性能提升。在这里,我们探讨基于语音的模型是否表现出相同的迁移能力。以Whisper作为多语言语音基础模型的一个例子,我们检查了语音编码器生成的话语表示。尽管在音频嵌入中保留了一些语言敏感信息,但来自不同语言的单词被映射到相似的语义空间,语音到语音检索任务中的高召回率证明了这一点。利用这种共享的嵌入空间,在语音翻译中展示了零样本跨语言迁移。当只根据英汉翻译数据对Whisper模型进行微调时,可以观察到其他语言的输入话语的性能改善。此外,在低资源语言上的实验表明,Whisper可以利用跨语言表征对预训练中未见过的语言的话语进行语音翻译。
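
The abstract cites a high recall rate in speech-to-speech retrieval as evidence of a shared embedding space. As a rough illustration of how such a check might be computed, the sketch below measures recall@k with cosine similarity over paired utterance embeddings. The function names, the one-vector-per-utterance representation, and the pairing convention (query i should retrieve candidate i) are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose paired candidate (same index) is in the top k.

    Assumption: queries[i] and candidates[i] are embeddings of the same
    utterance content spoken in two different languages.
    """
    hits = 0
    for i, q in enumerate(queries):
        # Rank all candidate indices by similarity to this query, best first.
        ranked = sorted(
            range(len(candidates)),
            key=lambda j: cosine(q, candidates[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real encoder outputs would be much larger.
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.1]]
    print(recall_at_k(queries, candidates, k=1))
    print(recall_at_k(queries, candidates, k=2))
```

In this toy data the third query sits closer to the first candidate than to its own pair, so it only counts as a hit once k is raised to 2.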

[NLP-51] Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation
[NLP-51] 研究稀疏专家混合在多领域神经机器翻译中的潜力

链接: https://arxiv.org/abs/2407.01126
作者: Nadezhda Chirkova,Vassilina Nikoulina,Jean-Luc Meunier,Alexandre Bérard
关键词: Neural Machine Translation, multi-domain Neural Machine, Machine Translation, Neural Machine, developing efficient models
中文关键词: 神经机器翻译,多域神经机器,机器翻译,神经机器,开发高效模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training. We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling, which helps to accommodate a variety of multi-domain data, and allow flexible sharing of parameters between domains, potentially enabling knowledge transfer between similar domains and limiting negative transfer. We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE. We also search for a better recipe for robustness of multi-domain systems, highlighting the importance of mixing-in a generic domain, i.e. Paracrawl, and introducing a simple technique, domain randomization.
摘要:我们致力于多领域神经机器翻译的研究,目的是开发高效的模型,能够处理训练过程中看到的不同领域的数据,并且对训练过程中看不到的领域具有健壮性。我们假设稀疏专家混合(SMOE)模型很适合这项任务,因为它们支持有效的模型缩放,这有助于容纳各种多域数据,并允许域之间灵活地共享参数,潜在地使相似域之间的知识转移成为可能,并限制负转移。我们进行了一系列实验,旨在验证SMOE在多域场景中的实用性,并发现Transformer的直接宽度缩放在实践中是一种更简单、更高效的方法,并且达到了与SMOE相同的性能水平。我们还寻找了一种更好的方法来提高多域系统的健壮性,强调了混合在通用域中的重要性,即Paracrawl,并引入了一种简单的技术,域随机化。

[NLP-52] Calibrated Large Language Models for Binary Question Answering
[NLP-52] 用于二元问题解答的校准大型语言模型

链接: https://arxiv.org/abs/2407.01122
作者: Patrizio Giovannotti,Alexander Gammerman
关键词: large language models, binary text classification, text classification tasks, classification tasks remains, remains a challenge
中文关键词: 大型语言模型、二进制文本分类、文本分类任务、分类任务仍然是一个挑战
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to COPA 2024 (13th Symposium on Conformal and Probabilistic Prediction with Applications)

Abstract:Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model’s predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.
摘要:在二进制文本分类任务中,量化大语言模型预测的不确定性仍然是一个挑战。在LLMS的上下文中,校准指的是模型的预测概率与其预测的实际正确性之间的对准。一个经过良好校准的模型应该产生能够准确反映其预测正确的可能性的概率。我们提出了一种新的方法,它利用归纳的Venn-Abers预测器(IVAP)来校准与对应于二进制标签的输出标记相关联的概率。我们在使用Llama 2模型的BoolQ数据集上的实验表明,对于各种标签标记选择,IVAP始终优于常用的温度缩放方法,在保持高预测质量的同时实现了良好校准的概率。我们的发现有助于理解LLMS的校准技术,并为在二元问答任务中获得可靠的不确定性估计提供了实用的解决方案,提高了LLM预测的可解释性和可信度。
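
To make the calibration vocabulary concrete, here is a minimal sketch of the temperature scaling baseline that the abstract compares against, together with expected calibration error (ECE) as one common way to quantify miscalibration. It is not an implementation of the inductive Venn-Abers predictor. The function names, the grid search over temperatures, and the assumption that each example is reduced to a pair of logits for the "yes" and "no" label tokens are all illustrative choices.

```python
import math


def p_yes(logit_yes, logit_no, temperature=1.0):
    """P(yes) from a softmax restricted to the two label-token logits."""
    z = (logit_yes - logit_no) / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logit_pairs, labels, temperature=1.0):
    """Mean negative log-likelihood of binary labels (1 = yes, 0 = no)."""
    total = 0.0
    for (ly, ln), y in zip(logit_pairs, labels):
        p = min(max(p_yes(ly, ln, temperature), 1e-12), 1.0 - 1e-12)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def fit_temperature(logit_pairs, labels):
    """Pick the temperature on a fixed grid that minimizes calibration-set NLL."""
    grid = [0.05 * i for i in range(1, 201)]  # 0.05 .. 10.0 (assumed range)
    return min(grid, key=lambda t: nll(logit_pairs, labels, t))


def expected_calibration_error(probs_yes, labels, n_bins=10):
    """ECE over the confidence of the predicted label, equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs_yes, labels):
        conf = max(p, 1.0 - p)
        correct = 1.0 if (p >= 0.5) == (y == 1) else 0.0
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(a for _, a in items) / len(items)
            ece += len(items) / len(labels) * abs(accuracy - avg_conf)
    return ece
```

In practice the temperature would be fitted on a held-out calibration split and then applied unchanged to test examples; a fitted value above 1 indicates the raw label-token probabilities were overconfident.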

[NLP-53] Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
[NLP-53] Pron vs Prompt:大型语言模型是否已经可以挑战创意文本写作领域的世界级小说作家?

链接: https://arxiv.org/abs/2407.01119
作者: Guillermo Marco,Julio Gonzalo,Ramón del Castillo,María Teresa Mateo Girona
关键词: Large Language Models, creative text writing, report research results, creative writing skills, outperform average humans
中文关键词: 大型语言模型、创意文本写作、报告研究结果、创意写作技能,优于普通人
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages 6 figures

Abstract:It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer for this question, we have carried out a contest between Patricio Pron (an awarded novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as DeepBlue vs Kasparov and AlphaGo vs Lee Sidol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent’s. Then, we prepared an evaluation rubric inspired by Boden’s definition of creativity, and we collected 5,400 manual assessments provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer, and that reaching such level of autonomous creative writing skills probably cannot be reached simply with larger language models.
摘要:报告大型语言模型(LLM)在各类语言相关任务中优于普通人类的研究结果已成为惯例,创造性文本写作也不例外。那么,提高赌注似乎是很自然的:LLM准备好与顶尖(而不是普通)小说家在创造性写作技能上竞争了吗?为了提供这个问题的初步答案,我们在Patricio Pron(获奖小说家,被认为是他那一代人中最好的小说家之一)和GPT-4(表现最好的LLM之一)之间进行了一场比赛,本着人工智能与人类决斗的精神,比如DeepBlue vs Kasparov和AlphaGo vs Lee Sidol。我们要求Pron和GPT-4各提供30个标题,然后为自己的标题和对手的标题写短篇小说。然后,我们根据Boden对创造力的定义编制了一个评价量表,并收集了文学评论家和学者提供的5,400份人工评估。我们的实验结果表明,LLM还远远不能挑战人类顶尖的创造性作家,而达到这种水平的自主创造性写作技能可能不是简单地通过更大的语言模型就能达到的。

[NLP-54] BERGEN: A Benchmarking Library for Retrieval-Augmented Generation
[NLP-54] BERGEN:检索增强生成的基准库

链接: https://arxiv.org/abs/2407.01102
作者: David Rau,Hervé Déjean,Nadezhda Chirkova,Thibault Formal,Shuai Wang,Vassilina Nikoulina,Stéphane Clinchant
关键词: Large Language Models, enhance Large Language, Large Language, Language Models, Retrieval-Augmented Generation
中文关键词: 大型语言模型、增强大型语言、大型语言、语言模型、检索增强生成
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 29 pages

Abstract:Retrieval-Augmented Generation makes it possible to enhance Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available at this https URL.
摘要:检索增强生成允许使用外部知识增强大型语言模型。为了响应生成式LLM最近的流行,人们提出了许多RAG方法,其中涉及大量复杂的不同配置,例如评估数据集、集合、指标、检索器和LLM。不一致的基准测试给比较方法和了解管道中每个组件的影响带来了重大挑战。在这项工作中,我们研究了最佳实践,为RAG的系统评估奠定了基础,并介绍了BERGEN,这是一个用于标准化RAG实验的可重复研究的端到端库。在一项专注于QA的广泛研究中,我们对不同的最先进的检索器、重排序器和LLM进行了基准测试。此外,我们还分析现有的RAG指标和数据集。我们的开源库BERGEN可在此https URL下找到。

[NLP-55] Eliminating Position Bias of Language Models: A Mechanistic Approach
[NLP-55] 消除语言模型的位置偏差:一种机制性方法

链接: https://arxiv.org/abs/2407.01100
作者: Ziqi Wang,Hanlin Zhang,Xiner Li,Kuan-Hao Huang,Chi Han,Shuiwang Ji,Sham M. Kakade,Hao Peng,Heng Ji
关键词: Position bias, prevalent issue, issue of modern, Position, bias
中文关键词: 立场偏见,流行问题,现代问题,立场,偏见
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 5 figures

Abstract:Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset. 
摘要:位置偏差已被证明是现代语言模型(LM)中的一个普遍问题,即模型根据内容在给定上下文中的位置来确定内容的优先级。这种偏差通常会导致意外的模型故障,并损害各种应用程序的性能、健壮性和可靠性。我们的机制分析将位置偏差归因于几乎所有最先进的LM中使用的两个组成部分:因果注意力和相对位置编码。具体地说,基于对检索增强问答(QA)的分析,我们发现因果注意力通常会导致模型倾向于距离较远的内容,而相对位置编码(如RoPE)更倾向于附近的内容。此外,我们对目标检测的实证研究表明,位置偏差也存在于视觉-语言模型(VLM)中。基于以上分析,我们提出以免训练的零样本方式消除不同输入片段顺序(例如,LM-as-a-judge中的选项,QA中的检索文档)造成的位置偏差。该方法将片段之间的因果注意力改变为双向注意力,并利用模型注意力值来确定片段的相对顺序,而不是使用输入提示中提供的顺序,从而实现了片段级别的位置不变推理(PINE)。通过消除位置偏差,模型在位置偏差广泛存在的下游任务中获得了更好的性能和可靠性,例如LM-as-a-judge和检索增强的QA。值得注意的是,在采用LM来评估推理对时,PINE特别有用:它在大多数情况下都能持续提供8到10个百分点的性能提升,并使Llama-3-70B-Instruct在RewardBench推理子集上的性能甚至好于GPT-4-0125-preview。

[NLP-56] IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation
[NLP-56] IBSEN:导演-演员代理合作,创造可控和互动的戏剧剧本

链接: https://arxiv.org/abs/2407.01093
作者: Senyu Han,Lu Chen,Li-Min Lin,Zhengshan Xu,Kai Yu
关键词: Large language models, human-like character role-playing, Large language, language model agents, Current language model
中文关键词: 大型语言模型、类人角色扮演、大型语言、语言模型代理、当前语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted by ACL 2024 Main

Abstract:Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on reasonable behaviors from the level of individuals, and their behaviors might be hard to constrain at the level of the whole storyline. In this paper we introduce IBSEN, a director-actor coordinate agent framework that generates drama scripts and makes the plot played by agents more controllable. The director agent writes plot outlines that the user desires to see, instructs the actor agents to role-play their characters, and reschedules the plot when human players participate in the scenario to ensure the plot is progressing towards the objective. To evaluate the framework, we create a novel drama plot that involves several actor agents and check the interactions between them under the instruction of the director agent. Evaluation results show that our framework could generate complete, diverse drama scripts from only a rough outline of plot objectives, meanwhile maintaining the characteristics of characters in the drama. Our codes and prompts are available at this https URL.
摘要:大型语言模型在故事情节创作和类人角色扮演方面表现出了强大的能力。目前的语言模型代理主要关注个体层面的合理行为,其行为可能很难在整个故事情节的层面上加以约束。在本文中,我们介绍了IBSEN,这是一个导演和演员协调的代理框架,它生成戏剧剧本,使代理演绎的情节更具可控性。导演代理编写用户希望看到的情节大纲,指示演员代理扮演他们的角色,并在人类玩家参与场景时重新安排情节,以确保情节朝着目标发展。为了评估该框架,我们创建了一个涉及多个演员代理的新颖戏剧情节,并在导演代理的指导下检查他们之间的互动。评测结果表明,该框架仅凭情节目标的粗略大纲就能生成完整、多样的戏剧剧本,同时保持剧中人物的特点。我们的代码和提示可在此https URL中找到。

[NLP-57] M2QA: Multi-domain Multilingual Question Answering
[NLP-57] M2QA:多领域多语言问题解答

链接: https://arxiv.org/abs/2407.01091
作者: Leon Engländer,Hannah Sterz,Clifton Poth,Jonas Pfeiffer,Ilia Kuznetsov,Iryna Gurevych
关键词: Generalization and robustness, machine learning research, robustness to input, core desiderata, desiderata of machine
中文关键词: 概括性和鲁棒性、机器学习研究、输入鲁棒性、核心需求、机器的需求
类目: Computation and Language (cs.CL)
备注:

Abstract:Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary. We make M2QA publicly available at this https URL.
摘要:泛化和对输入变化的稳健性是机器学习研究的核心要求。语言沿着几个轴变化,最重要的是,语言实例(例如法语)和领域(例如新闻)。虽然对自然语言处理模型适应单一领域内的新语言或单一语言内的新领域进行了广泛的研究,但由于缺乏评估数据集,联合适应的研究受到阻碍。这阻碍了自然语言处理系统从资源丰富的语言和领域转移到非主要语言领域的组合。为了弥补这一差距,我们引入了M2QA,一个多领域多语言问答基准。M2QA包括13,500个SQuAD 2.0风格的德语、土耳其语和中文问答实例,用于产品评论、新闻和创意写作领域。我们使用M2QA来探索微调模型和最新LLM的跨语言跨领域性能,并研究领域和语言适应的模块化方法。我们见证了1)模型类内领域语言组合之间的显著性能差异,以及2)所有模型大小的源语言和目标语言领域组合之间的显著性能下降。我们证明,M2QA远未解决,需要新的方法来有效地传递语言和领域特定的信息。我们通过此https URL公开提供M2QA。

[NLP-58] Rethinking LLM-based Preference Evaluation
[NLP-58] 重新思考基于LLM的偏好评估

链接: https://arxiv.org/abs/2407.01085
作者: Zhengyu Hu,Linxin Song,Jieyu Zhang,Zheyuan Xiao,Jingang Wang,Zhenyu Chen,Jieyu Zhao,Hui Xiong
关键词: large language model, based preference evaluation, large language, widely adopted, adopted to compare
中文关键词: 大语言模型,基于偏好评估,大语言,广泛采用,采用进行比较
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we designed a series of controlled experiments to study the major impacting factors of the metric of LLM-based preference evaluation, i.e., win rate, and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. We find that length impacts the existing evaluations by influencing information mass. However, a reliable evaluation metric should not only assess content quality but also ensure that the assessment is not confounded by extraneous factors such as response length. Therefore, we propose a simple yet effective adjustment, AdapAlpaca, to the existing practice of win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model’s answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation.
摘要:近年来,基于大语言模型(LLM)的偏好评价被广泛应用于模型回答对的比较。然而,已经观察到严重偏向于冗长的答复,这引起了人们对这种评估方法的可靠性的担忧。在这项工作中,我们设计了一系列对照实验来研究基于LLM的偏好评估度量(即胜率)的主要影响因素,得出胜率受到模型响应的两个轴的影响:可取性(desirability)和信息量(information mass),其中前者与长度无关且与可信性有关,后者与长度相关且可用条件熵来表示。我们发现,长度通过影响信息量来影响现有的评估。然而,可靠的评估指标不仅应该评估内容质量,还应该确保评估不会受到响应长度等外部因素的干扰。因此,我们对现有的胜率测量做法提出了一种简单而有效的调整AdapAlpaca。具体地说,通过调整参考答案的长度,使其与测试模型的答案处于同一长度区间,我们消除了信息量相对于长度的偏差,确保了公平的模型评估。

[NLP-59] Min P Sampling: Balancing Creativity and Coherence at High Temperature
[NLP-59] Min P采样:平衡高温下的创造力和一致性

链接: https://arxiv.org/abs/2407.01082
作者: Minh Nguyen,Andrew Baker,Andreas Kirsch,Clement Neo
关键词: Large Language Models, Large Language, Language Models, generate longform text, generate longform
中文关键词: 大型语言模型,大型语言,语言模型,生成长格式文本,生成长格式
类目: Computation and Language (cs.CL)
备注: 8 Pages

Abstract:Large Language Models (LLMs) generate longform text by successively sampling the next token based on the probability distribution of the token vocabulary at each decoding step. Current popular truncation sampling methods such as top-p sampling, also known as nucleus sampling, often struggle to balance coherence and creativity in generating text, particularly when using higher temperatures. To address this issue, we propose min-p, a dynamic truncation sampling method, that establishes a minimum base percentage threshold for tokens, which scales according to the probability of the top candidate token. Through experiments on several benchmarks, such as GPQA, GSM8K and AlpacaEval Creative Writing, we demonstrate that min-p improves the coherence and quality of generated text even at high temperatures, while also facilitating more creative and diverse outputs compared to top-p and other sampling methods. As of writing, min-p has been adopted by multiple open-source LLM implementations, and has been independently assessed by members of the open-source LLM community, further validating its practical utility and potential.
摘要:大语言模型通过在每个解码步骤中根据令牌词汇量的概率分布连续采样下一个令牌来生成长文本。目前流行的截断抽样方法,如top-p抽样,也被称为核抽样,通常难以在生成文本时平衡连贯性和创造性,特别是在使用更高的温度时。为了解决这个问题,我们提出了一种动态截断抽样方法MIN-P,它为令牌建立了一个最小基本百分比阈值,该阈值根据顶级候选令牌的概率进行缩放。通过在GPQA、GSM8K和AlpacaEval Creative Writing等几个基准测试上的实验,我们证明了min-p即使在高温下也提高了生成文本的连贯性和质量,同时也促进了比top-p和其他采样方法更具创造性和多样性的输出。在撰写本文时,min-p已经被多个开源LLM实现采用,并且已经由开源LLM社区的成员独立评估,进一步验证了它的实用价值和潜力。
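
The truncation rule described in the abstract can be sketched in a few lines: keep only tokens whose probability is at least a base fraction of the top token's probability, renormalize, and sample. The function names and the `min_p` default below are illustrative, and applying temperature before truncation is an assumption of this sketch; implementations may order the two steps differently.

```python
import math
import random


def min_p_probs(logits, min_p=0.1, temperature=1.0):
    """Return renormalized token probabilities after min-p truncation."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Dynamic threshold: a base fraction of the top candidate's probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)  # never zero: the top token always passes the threshold
    return [p / norm for p in kept]


def sample_min_p(logits, min_p=0.1, temperature=1.0, rng=random):
    """Sample one token index from the min-p truncated distribution."""
    probs = min_p_probs(logits, min_p, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
    logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
    # With min_p=0.2 the cutoff is 0.2 * 0.5 = 0.1, so the last token is dropped.
    print(min_p_probs(logits, min_p=0.2))
    print(sample_min_p(logits, min_p=0.2, rng=random.Random(0)))
```

Because the cutoff is tied to the top token's probability, it tightens when the model is confident and relaxes when the distribution is flat, which is the behavior the abstract contrasts with a fixed top-p mass.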

[NLP-60] CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
[NLP-60] CVLUE:中国视觉语言理解评估的新基准数据集

链接: https://arxiv.org/abs/2407.01081
作者: Yuxuan Wang,Yijun Liu,Fei Yu,Chen Huang,Kexin Li,Zhiguo Wan,Wanxiang Che
关键词: Chinese vision-language models, constructed on Western-centric, Chinese vision-language, Chinese, Chinese culture
中文关键词: 中国视觉语言模型,建立在以西方为中心的中国视觉语言、中国、中国文化的基础上
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

Abstract:Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs’ understanding of Chinese culture.
摘要:尽管中文视觉语言模型(VLM)发展迅速,但现有的大多数中文视觉语言(VL)数据集都是建立在已有英语VL数据集中的以西方为中心的图像上。图像中的文化偏见使得这些数据集不适合评估VLM在中国文化方面的表现。为了解决这个问题,我们提出了一个新的中文视觉-语言理解评估(CVLUE)基准数据集,其中对象类别和图像的选择完全由中文母语者驱动,确保源图像代表中国文化。该基准包含四个不同的VL任务,从图像-文本检索到视觉问答、视觉定位和视觉对话。我们对CVLUE进行了详细的统计分析,并提供了几个开源的多语言VLM在CVLUE及其英文对应数据集上的基线性能分析,以揭示它们在英文和中文之间的性能差距。我们深入的类别层面分析表明,现有的VLM缺乏中国文化知识。我们还发现,在与中国文化相关的VL数据集上进行微调可以有效地提高VLM对中国文化的理解。

[NLP-61] Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese
[NLP-61] Face4RAG:中文检索增强生成的事实一致性评估

链接: https://arxiv.org/abs/2407.01080
作者: Yunqi Xu,Tianchi Cai,Jiyan Jiang,Xierui Song
关键词: Retrieval Augmented Generation, conventional Retrieval Augmented, Augmented Generation, Retrieval Augmented, Large Language Models
中文关键词: 检索增强生成、传统检索增强、增强生成、检索增强、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose the first comprehensive FCE benchmark Face4RAG for RAG independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology for factuality inconsistency error and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs of logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task from which it is originally motivated. Both the benchmark and our proposed method are publicly available at this https URL.
摘要:传统检索增强生成(RAG)中普遍存在的事实不一致错误问题推动了事实一致性评价(FCE)的研究。尽管此前已提出了各种FCE方法,但这些方法是在特定的大型语言模型(LLM)生成的数据集上进行评估的。在没有全面基准的情况下,这些FCE方法在具有不同错误分布甚至未见过的错误类型的其他LLM上表现如何仍未被探索,因为这些方法可能无法检测由其他LLM产生的错误类型。为了填补这一空白,在本文中,我们提出了第一个独立于底层LLM的RAG全面FCE基准Face4RAG。我们的基准包括一个建立在针对事实不一致错误精心设计的类型学基础上的合成数据集,以及一个由六个常用的LLM构建的真实世界数据集,从而能够在特定错误类型或真实世界错误分布上评估FCE方法。在提出的基准上,我们发现现有FCE方法无法检测逻辑谬误,逻辑谬误指的是答案和检索到的参考资料之间的逻辑结构不匹配。为了解决这个问题,我们进一步提出了一种新的方法,称为L-Face4RAG,它具有两个新的设计,即保留逻辑的答案分解和事实逻辑FCE。广泛的实验表明,L-Face4RAG在许多任务上的事实不一致性检测性能大大优于以前的方法,特别是在它最初动机所在的RAG任务之外。基准和我们提出的方法均已公开,可通过此HTTPS URL访问。

[NLP-62] Human-like object concept representations emerge naturally in multimodal large language models
[NLP-62] 类人对象概念表示在多模态大型语言模型中自然出现

链接: https://arxiv.org/abs/2407.01067
作者: Changde Du,Kaicheng Fu,Bincheng Wen,Yi Sun,Jie Peng,Wei Wei,Ying Gao,Shengpei Wang,Chuncheng Zhang,Jinpeng Li,Shuang Qiu,Le Chang,Huiguang He
关键词: offering crucial insights, Large Language Models, intrigued cognitive scientists, long intrigued cognitive, scientists and neuroscientists
中文关键词: 提供重要见解、大型语言模型、引起兴趣的认知科学家、长期引起兴趣的认知科学家和神经科学家
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.
摘要:人类头脑中自然物体的概念化和分类长期以来一直是认知科学家和神经科学家的兴趣所在,为人类的感知和认知提供了至关重要的见解。近年来,大型语言模型(LLM)的快速发展提出了一个引人注目的问题,即这些模型是否也可以通过接触大量的语言和多模态数据来形成类似于人类的对象表征。在这项研究中,我们结合行为和神经成像分析方法来揭示LLM中的对象概念表征如何与人类的概念表征相关联。通过从LLM和多模态LLM(MLLM)收集470万个三元组判断的大规模数据集,我们能够推导出低维嵌入,这些嵌入捕捉了1854个自然对象的潜在相似结构。由此得到的66维嵌入被发现具有高度的稳定性和预测性,并表现出类似于人类心理表征的语义聚类。有趣的是,这些嵌入背后的维度的可解释性表明,LLM和MLLM已经形成了对自然对象的类似于人类的概念表示。进一步的分析表明,在许多功能定义的脑ROI(如EBA、PPA、RSC和FFA)中,识别的模型嵌入与神经活动模式之间具有很强的一致性。这提供了令人信服的证据,表明LLM中的对象表示虽然与人类中的对象表示不完全相同,但具有基本的共性,反映了人类概念知识的关键图式。这项研究促进了我们对机器智能的理解,并为开发更类似人类的人工认知系统提供了参考。

[NLP-63] Development of Cognitive Intelligence in Pre-trained Language Models
[NLP-63] 预训练语言模型中认知智能的发展

链接: https://arxiv.org/abs/2407.01047
作者: Raj Sanjay Shah,Khushi Bhardwaj,Sashank Varma
关键词: Large Pre-trained Language, Pre-trained Language Models, Recent studies show, Large Pre-trained, Pre-trained Language
中文关键词: 大型预训练语言,预训练语言模型,最近的研究表明,大型预训练语言,预训练语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies show evidence for emergent cognitive abilities in Large Pre-trained Language Models (PLMs). The increasing cognitive alignment of these models has made them candidates for cognitive science theories. Prior research into the emergent cognitive abilities of PLMs has largely been path-independent to model training, i.e., has focused on the final model weights and not the intermediate steps. However, building plausible models of human cognition using PLMs would benefit from considering the developmental alignment of their performance during training to the trajectories of children’s thinking. Guided by psychometric tests of human intelligence, we choose four sets of tasks to investigate the alignment of ten popular families of PLMs and evaluate their available intermediate and final training steps. These tasks are Numerical ability, Linguistic abilities, Conceptual understanding, and Fluid reasoning. We find a striking regularity: regardless of model size, the developmental trajectories of PLMs consistently exhibit a window of maximal alignment to human cognitive development. Before that window, training appears to endow “blank slate” models with the requisite structure to be poised to rapidly learn from experience. After that window, training appears to serve the engineering goal of reducing loss but not the scientific goal of increasing alignment with human cognition.
摘要:最近的研究表明,大型预训练语言模型(PLM)中出现了涌现的认知能力。这些模型与认知的一致性不断提高,使它们成为认知科学理论的候选对象。以往对PLM涌现认知能力的研究在很大程度上与模型训练的路径无关,即专注于最终的模型权重,而不是中间步骤。然而,使用PLM建立可信的人类认知模型,将受益于考察它们在训练期间的表现与儿童思维发展轨迹之间的发展一致性。在人类智力心理测量学测试的指导下,我们选择了四组任务来考察十个流行的PLM系列的一致性,并评估它们可用的中间和最终训练步骤。这些任务是数字能力、语言能力、概念理解和流体推理。我们发现了一个惊人的规律:无论模型大小如何,PLM的发展轨迹始终显示出一个与人类认知发展最大程度一致的窗口。在这一窗口之前,训练似乎赋予了“白板”模型必要的结构,使其能够迅速从经验中学习。在这一窗口之后,训练似乎服务于降低损失的工程目标,而不是提高与人类认知一致性的科学目标。

[NLP-64] FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models
[NLP-64] FRoG:评估大型语言模型中广义量词的模糊推理

链接: https://arxiv.org/abs/2407.01046
作者: Yiyuan Li,Shichao Sun,Pengfei Liu
关键词: daily contexts, vital due, imprecise information, information in daily, Fuzzy reasoning
中文关键词: 日常上下文、重要原因、不精确信息、日常信息、模糊推理
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.
摘要:由于日常环境中经常使用不精确的信息,模糊推理至关重要。然而,当前大型语言模型(LLM)处理此类推理的能力在很大程度上仍然未知。在本文中,我们引入了一个新的模糊推理基准FRoG,其特点是包含广义量词的现实世界数学应用题。我们的实验结果表明,模糊推理继续给LLM带来重大挑战。此外,我们发现旨在增强推理的现有方法并不能始终如一地提高涉及模糊逻辑的任务的性能。此外,我们的结果显示LLM在FRoG上的性能存在逆向缩放效应。有趣的是,我们还证明,强大的数学推理能力并不一定意味着能在我们的基准上取得成功。

[NLP-65] PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs
[NLP-65] PocketLLM:为个性化LLM启用设备上微调

链接: https://arxiv.org/abs/2407.01031
作者: Dan Peng,Zhihui Fu,Jun Wang
关键词: Recent advancements, large language models, impressive capabilities, advancements in large, large language
中文关键词: 最近的进步、大型语言模型、令人印象深刻的功能、大型语言的进步
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to the ACL 2024 Workshop on Privacy in Natural Language Processing (PrivateNLP)

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have indeed showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization required for saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLM, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory respectively, using derivative-free optimization techniques. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.
摘要:大型语言模型(LLM)的最新进展确实展示了它们令人印象深刻的能力。在移动设备上,每天产生的宝贵的非公开数据为本地微调个性化LLM提供了巨大的希望,同时通过设备上的处理保持隐私。然而,移动设备资源的限制给直接在设备上进行LLM微调带来了挑战,这主要是因为基于导数的优化需要保存梯度和优化器状态,内存占用很大。为了解决这个问题,我们建议使用免导数优化技术来在设备上实现LLM的微调,即使在内存有限的移动设备上也是如此。实验结果表明,使用免导数优化技术,RoBERTa-large模型和OPT-1.3B可以在OPPO Reno 6智能手机上进行本地微调,分别使用约4GB和6.5GB的内存。这突出了在移动设备上进行设备上LLM微调的可行性,为在资源受限的设备上实现个性化LLM铺平了道路,同时保护了数据隐私。
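
The memory argument in the abstract above can be illustrated with a toy example. The abstract does not name the specific derivative-free optimizer, so the sketch below assumes a simple two-point random-perturbation (SPSA-style) update as a stand-in; `zeroth_order_step`, `quadratic_loss`, and all hyperparameters are hypothetical names and values chosen for illustration, not the paper's method. The point it shows is that each update needs only forward evaluations of the loss, with no stored gradients or optimizer states.

```python
import random


def zeroth_order_step(params, loss_fn, lr=0.05, eps=1e-3, rng=random):
    """One derivative-free update from two forward evaluations (assumed SPSA-style rule)."""
    # Random +/-1 direction; this vector is the only extra state kept in memory.
    direction = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * d for p, d in zip(params, direction)]
    minus = [p - eps * d for p, d in zip(params, direction)]
    # Finite-difference estimate of the slope along the sampled direction.
    slope = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    return [p - lr * slope * d for p, d in zip(params, direction)]


def quadratic_loss(params):
    # Toy stand-in for a model loss; its minimum is at (1, -2, 0.5).
    target = (1.0, -2.0, 0.5)
    return sum((p - t) ** 2 for p, t in zip(params, target))


rng = random.Random(0)  # fixed seed so the run is repeatable
params = [0.0, 0.0, 0.0]
initial_loss = quadratic_loss(params)
for _ in range(500):
    params = zeroth_order_step(params, quadratic_loss, rng=rng)
final_loss = quadratic_loss(params)
print(initial_loss, final_loss)
```

On a real model the two forward passes would replace backpropagation, which is where the memory saving would come from; the convergence behaviour on this toy quadratic says nothing about fine-tuning quality.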

[NLP-66] Augmenting Document-level Relation Extraction with Efficient Multi-Supervision
[NLP-66] 通过高效的多重监督增强文档级关系提取

链接: https://arxiv.org/abs/2407.01026
作者: Xiangyu Lin,Weijia Jia,Zhiguo Gong
关键词: low information density, distantly supervised data, document-level relation extraction, relation extraction due, sentence-level relation extraction
中文关键词: 低信息密度、远程监督数据、文档级关系提取、关系提取到期、句子级关系提取
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite its popularity in sentence-level relation extraction, distantly supervised data is rarely utilized by existing work in document-level relation extraction due to its noisy nature and low information density. Among its current applications, distantly supervised data is mostly used as a whole for pretraining, which is of low time efficiency. To fill in the gap of efficient and robust utilization of distantly supervised training data, we propose Efficient Multi-Supervision for document-level relation extraction, in which we first select a subset of informative documents from the massive dataset by combining distant supervision with expert supervision, then train the model with Multi-Supervision Ranking Loss that integrates the knowledge from multiple sources of supervision to alleviate the effects of noise. The experiments demonstrate the effectiveness of our method in improving the model performance with higher time efficiency than existing baselines.
摘要:尽管远程监督数据在句子级关系提取中很受欢迎,但由于其噪声大、信息密度低,文档级关系提取的现有工作很少利用远程监督数据。在其当前的应用中,远程监督数据大多作为一个整体用于预训练,时间效率较低。为了填补远程监督训练数据高效、稳健利用的空白,我们提出了用于文档级关系提取的高效多监督(Efficient Multi-Supervision)方法,其中我们首先通过将远程监督与专家监督相结合,从海量数据集中选择信息丰富的文档子集,然后使用多监督排序损失(Multi-Supervision Ranking Loss)训练模型,该损失整合来自多个监督来源的知识,以减轻噪声的影响。实验证明了我们的方法在提高模型性能方面的有效性,且时间效率比现有基线更高。

[NLP-67] DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models
[NLP-67] DynaThink:快还是慢?大型语言模型的动态决策框架

链接: https://arxiv.org/abs/2407.01009
作者: Jiabao Pan,Yan Zhang,Chen Zhang,Zuozhu Liu,Hongwei Wang,Haizhou Li
关键词: Large language models, Large language, demonstrated emergent capabilities, language models, fast COT approach
中文关键词: 大型语言模型、大型语言、展现的涌现能力、语言模型、快速CoT方法
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This paper addresses the challenge of enabling LLMs to autonomously select between fast and slow inference methods, thereby optimizing both efficiency and effectiveness. We introduce a dynamic decision-making framework that categorizes tasks into two distinct pathways: ‘Fast’, designated for tasks where the LLM quickly identifies a high-confidence solution, and ‘Slow’, allocated for tasks that the LLM perceives as complex and for which it has low confidence in immediate solutions as well as requiring more reasoning paths to verify. Experiments on five popular reasoning benchmarks demonstrated the superiority of the DynaThink over baselines.
摘要:大型语言模型(LLM)通过流行的思维链(CoT)提示,在不同的推理任务中表现出了涌现的能力。然而,这种简单、快速的CoT方法在处理复杂问题时往往会遇到局限性,而考虑多条推理路径并仔细验证每一步的彻底方法会导致推理速度较慢。本文解决了使LLM能够在快速和慢速推理方法之间自主选择的挑战,从而同时优化效率和有效性。我们引入了一个动态决策框架,将任务分类为两条不同的路径:“快速”,指定给LLM能快速识别出高置信度解决方案的任务;“慢速”,分配给LLM认为复杂、对即时解决方案信心较低且需要更多推理路径来验证的任务。在五个流行的推理基准上的实验证明了DynaThink相对于基线的优越性。
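
The fast/slow routing idea in the abstract above can be sketched in a few lines. The abstract does not spell out how confidence is measured, so this example assumes the simplest proxy: the share of sampled answers that agree with the majority answer. The function name `route_question` and the `confidence_threshold` value are hypothetical, and the sampled answers are hard-coded stand-ins for LLM outputs.

```python
from collections import Counter


def route_question(sampled_answers, confidence_threshold=0.6):
    """Return ('fast', answer) when samples agree strongly, else ('slow', None)."""
    if not sampled_answers:
        return "slow", None
    # Majority answer and how many samples voted for it.
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    vote_share = votes / len(sampled_answers)
    if vote_share >= confidence_threshold:
        return "fast", answer  # high agreement: accept the quick answer
    return "slow", None  # low agreement: send to the more thorough pathway


# Hard-coded stand-ins for answers sampled from a model.
easy = route_question(["42", "42", "42", "41", "42"])
hard = route_question(["7", "9", "7", "12", "9"])
print(easy, hard)
```

Questions routed to `slow` would then receive additional reasoning paths and verification; how that pathway is implemented is left open here because the abstract gives no further detail.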

[NLP-68] Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components
[NLP-68] 对话式搜索系统的工程化:应用、架构和功能组件综述

链接: https://arxiv.org/abs/2407.00997
作者: Phillip Schneider,Wessel Poelman,Michael Rovatsos,Florian Matthes
关键词: multiple dialogue turns, Conversational search systems, users’ information gain, maximizing users’ information, enable information retrieval
中文关键词: 多次对话回合、对话式搜索系统、用户信息获得、最大化用户信息、实现信息检索
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 NLP4ConvAI Workshop

点击查看摘要

Abstract:Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users’ information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research.
摘要:对话式搜索系统通过自然语言交互实现信息检索,其目标是在多轮对话中最大化用户的信息收益。采用这种搜索范式的对话界面越来越普遍,这对传统的信息检索方法提出了挑战,凸显了更好地了解开发这些系统的工程过程的重要性。我们进行了系统的文献综述,以调查对话式搜索系统的理论研究和技术实现之间的联系。我们的综述确定了真实世界的应用场景、系统架构和功能组件。我们通过提出分层架构框架和解释对话式搜索系统的核心功能来整合我们的结果。此外,鉴于大型语言模型的快速发展,我们反思了我们的发现,讨论了它们的能力、局限性和未来研究的方向。

[NLP-69] Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?
[NLP-69] 小型语言模型能否学习、遗忘并保留噪声模式?

链接: https://arxiv.org/abs/2407.00996
作者: Nicy Scaria,Silvester John Joseph Kennedy,Deepak Subramani
关键词: Small Language Models, large language models, Language Models, Small Language, large language
中文关键词: 小语言模型,大语言模型,语言模型,小语言,大语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Small Language Models (SLMs) are generally considered to be more compact versions of large language models (LLMs), typically having fewer than 7 billion parameters. This study investigates the ability of small language models to learn, retain, and subsequently eliminate noise that is typically not found on the internet, where most pretraining datasets are sourced. For this, four pre-trained SLMs were utilized: Olmo 1B, Qwen1.5 1.8B, Gemma 2B, and Phi2 2.7B. The models were instruction-tuned without noise and tested for task execution with in-context learning. Afterward, noise patterns were introduced to evaluate the models’ learning and unlearning capabilities. We evaluated the models’ performance at various training levels. Phi consistently excelled with word-level noise but performed the worst with character-level noise. Despite being the smallest with approximately 1 billion parameters, Olmo performed consistently well on tasks.
摘要:小型语言模型(SLM)通常被认为是大型语言模型(LLM)的更紧凑版本,通常参数少于70亿个。这项研究调查了小型语言模型学习、保留并随后消除噪声的能力,这类噪声通常不会出现在作为大多数预训练数据集来源的互联网上。为此,使用了四个预训练的SLM:Olmo 1B、Qwen1.5 1.8B、Gemma 2B和Phi2 2.7B。这些模型先在无噪声的情况下进行指令微调,并通过上下文学习测试任务执行情况。随后,引入噪声模式来评估模型的学习和遗忘能力。我们评估了模型在不同训练水平下的表现。Phi在单词级噪声上一直表现出色,但在字符级噪声上表现最差。尽管Olmo是最小的模型,参数约为10亿个,但它在各项任务中始终表现良好。

[NLP-70] LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation
[NLP-70] 通过有向蕴含图和声明级响应增强进行LLM不确定性量化

链接: https://arxiv.org/abs/2407.00994
作者: Longchao Da,Tiejin Chen,Lu Cheng,Hua Wei
关键词: Large language models, showcased superior capabilities, Large language, stemming from basic, basic question-answer
中文关键词: 大型语言模型,展示了卓越的能力,大型语言,源于基本的、基本的问答
类目: Computation and Language (cs.CL)
备注: 11 pages main content, 5 pages appendix

点击查看摘要

Abstract:Large language models (LLMs) have showcased superior capabilities in sophisticated tasks across various domains. Stemming from basic question-answer (QA), they are nowadays used as decision assistants or explainers for unfamiliar content. However, they are not always correct, due to data sparsity in domain-specific corpora or to the model’s hallucination problems. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate uncertainty that captures directional instability by constructing a directed graph from entailment probabilities. Given the asymmetric property of the constructed directed graph, we apply a Random Walk Laplacian, and the uncertainty is then aggregated from the eigenvalues derived from the Laplacian process. We also provide a way to incorporate the semantic uncertainty of existing work with our proposed layer. In addition, this paper identifies vagueness issues in the raw response set and proposes an augmentation approach to mitigate them. We conducted extensive empirical experiments and demonstrated the superiority of our proposed solutions.
摘要:大型语言模型(LLM)在各个领域的复杂任务中展示了优越的能力。它们起源于基本的问答(QA),如今被用作决策助手或陌生内容的解释器。然而,由于特定领域语料库中的数据稀疏,或者模型的幻觉问题,它们并不总是正确的。有鉴于此,我们应该在多大程度上信任LLM的回答?本文提出了一种新的方法来评估能够反映方向不稳定性的不确定性:根据蕴含概率构造有向图,并鉴于所构造有向图的非对称性,对其应用随机游走拉普拉斯算子,然后利用从拉普拉斯过程得到的特征值来聚合不确定性。我们还提供了一种将现有工作的语义不确定性与我们提出的层相结合的方法。此外,本文识别了原始响应集合中的模糊性问题,并提出了一种增强方法来缓解这一问题。我们进行了大量的实证实验,证明了所提出解决方案的优越性。

[NLP-71] Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
[NLP-71] Mobile-Bench:基于LLM的移动代理的评估基准

链接: https://arxiv.org/abs/2407.00993
作者: Shihan Deng,Weikai Xu,Hongda Sun,Wei Liu,Tao Tan,Jianfeng Liu,Ang Li,Jian Luan,Bin Wang,Rui Yan,Shuo Shang
关键词: large language models, LLM-based mobile agents, mobile agents, language models, human-computer interaction
中文关键词: 大型语言模型、基于LLM的移动代理、移动代理、语言模型、人机交互
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
摘要:随着大型语言模型(LLM)的显著进步,基于LLM的智能体已成为人机交互领域的研究热点。然而,可用于基于LLM的移动代理的基准十分稀缺。对这些代理进行基准测试通常面临三个主要挑战:(1)仅限用户界面的操作效率低下,对任务评估造成了限制。(2)单一应用程序中的特定指令不足以评估LLM移动代理的多维推理和决策能力。(3)现有的评价指标不足以准确地评估顺序动作的过程。为此,我们提出了Mobile-Bench,一个用于评估基于LLM的移动代理能力的新基准。首先,我们通过整合103个收集的API来扩展传统的UI操作,以提高任务完成的效率。随后,我们通过将真实用户查询与来自LLM的扩充相结合来收集评估数据。为了更好地评估移动代理的不同规划能力级别,我们的数据被分为三个不同的组:SAST、SAMT和MAMT,反映了不同级别的任务复杂性。Mobile-Bench包含832个数据条目,其中200多个任务专门设计用于评估多应用协作场景。此外,我们引入了一种更准确的评估指标,称为CheckPoint,用于评估基于LLM的移动代理在规划和推理过程中是否到达关键点。

[NLP-72] VisEval: A Benchmark for Data Visualization in the Era of Large Language Models
[NLP-72] VisEval:大型语言模型时代数据可视化的基准

链接: https://arxiv.org/abs/2407.00981
作者: Nan Chen,Yuge Zhang,Jiahang Xu,Kan Ren,Yuqing Yang
关键词: Translating natural language, visual data analysis, shown great promise, natural language processing, Translating natural
中文关键词: 翻译自然语言,视觉数据分析,表现出巨大的前景,自然语言处理,翻译自然
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Translating natural language to visualization (NL2VIS) has shown great promise for visual data analysis, but it remains a challenging task that requires multiple low-level implementations, such as natural language processing and visualization design. Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language. However, the lack of a comprehensive and reliable benchmark hinders our understanding of LLMs’ capabilities in visualization generation. In this paper, we address this gap by proposing a new NL2VIS benchmark called VisEval. Firstly, we introduce a high-quality and large-scale dataset. This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths. Secondly, we advocate for a comprehensive automated evaluation methodology covering multiple dimensions, including validity, legality, and readability. By systematically scanning for potential issues with a number of heterogeneous checkers, VisEval provides reliable and trustworthy evaluation outcomes. We run VisEval on a series of state-of-the-art LLMs. Our evaluation reveals prevalent challenges and delivers essential insights for future advancements.
摘要:自然语言到可视化的转换(NL2VIS)在可视化数据分析方面显示出巨大的潜力,但它仍然是一项具有挑战性的任务,需要自然语言处理和可视化设计等多种底层实现。最近在预训练的大型语言模型(LLM)方面的进展为从自然语言中生成可视化开辟了新的途径。然而,缺乏一个全面可靠的基准,阻碍了我们对LLM在可视化生成方面的能力的理解。在本文中,我们通过提出一种新的NL2VIS基准来解决这一差距,该基准称为VisEval。首先,我们介绍了一个高质量、大规模的数据集。该数据集包括2,524个代表性查询,涉及146个数据库,并与准确标记的基本事实配对。其次,我们提倡一种涵盖多个维度的全面的自动化评估方法,包括有效性、合法性和可读性。通过使用多个不同的检查器系统地扫描潜在问题,VisEval提供了可靠和值得信赖的评估结果。我们在一系列最先进的LLM上运行VisEval。我们的评估揭示了普遍存在的挑战,并为未来的发展提供了重要的见解。

[NLP-73] Universal Approximation Theory: The basic theory for large language models
[NLP-73] 普遍逼近理论:大型语言模型的基本理论

链接: https://arxiv.org/abs/2407.00958
作者: Wei Wang,Qing Li
关键词: artificial intelligence, innovations like ChatGPT, area of focus, focus in artificial, introduction of groundbreaking
中文关键词: 人工智能、ChatGPT等创新、重点领域、专注于人工、开创性的引入
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains: the theoretical foundations of large language models (LLMs). What makes Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs’ ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality of pruning LLMs? To address these critical questions and explore the technological strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to offer a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.
摘要:语言模型已经成为人工智能领域的一个关键焦点,特别是随着ChatGPT等突破性创新的引入。大规模的Transformer网络已经迅速成为推进自然语言处理算法的主要方法。这些模型构建在Transformer架构之上,能够实现紧密模拟人类交流的交互,并且凭借丰富的知识,甚至可以协助指导人类的任务。尽管它们的能力令人印象深刻,而且日益复杂,但一个关键问题仍然存在:大型语言模型(LLM)的理论基础。是什么让Transformer在支持智能语言应用(如翻译和编码)方面如此有效?LLM的上下文学习(ICL)能力的基础是什么?LoRA方案如何增强LLM的微调?什么支持了对LLM进行剪枝的实用性?为了解决这些关键问题并探索LLM中的技术策略,我们利用通用逼近理论(UAT)提供了一个理论背景,揭示了支撑这些进步的机制。

[NLP-74] SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models
[NLP-74] SplitLoRA:用于大型语言模型的分离参数高效微调框架

链接: https://arxiv.org/abs/2407.00952
作者: Zheng Lin,Xuanjie Hu,Yuxin Zhang,Zhe Chen,Zihan Fang,Xianhao Chen,Ang Li,Praneeth Vepakomma,Yue Gao
关键词: LLM fine-tuning, LLM fine-tuning paradigm, LLM, large language models, handling high-complexity models
中文关键词: LLM微调,LLM微调范式,LLM,大型语言模型,处理高复杂性模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently has been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activation/activation’s gradients with smaller data sizes rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at this https URL.
摘要:大型语言模型(LLM)在处理高复杂性模型和大规模数据集方面的可扩展性,使其在关键领域取得了巨大成功。虽然迫切需要为LLM获取更多的训练数据,但一个令人担忧的现实是,高质量的公共数据集在几年内就会枯竭。有鉴于此,最近提出了联邦学习(FL)LLM微调范式,以促进在分布式私有数据上的协作式LLM微调,其中多个数据所有者在不共享原始数据的情况下协作微调共享的LLM。然而,LLM惊人的模型规模给客户端带来了沉重的计算和通信负担,对FL LLM微调范式的普及构成了重大障碍。为了解决这个问题,拆分学习(SL)已经成为一种有前途的解决方案,它通过模型划分将主要训练工作量卸载到服务器,同时交换数据量较小的激活值及其梯度,而不是整个LLM。遗憾的是,对SL LLM微调范式的研究仍处于初级阶段。为了填补这一空白,本文提出了第一个SL LLM微调框架SplitLoRA。SplitLoRA建立在拆分联邦学习(SFL)框架上,融合了FL并行训练和SL模型拆分的优点,极大地提高了训练效率。值得注意的是,SplitLoRA是第一个用于SL LLM微调的开源基准,为致力于推进SL LLM微调的研究工作提供了基础。大量的仿真验证了SplitLoRA比最先进的LLM微调框架在显著更少的时间内达到目标精度,展示了SplitLoRA优越的训练性能。该项目页面可通过此HTTPS URL访问。

[NLP-75] The House Always Wins: A Framework for Evaluating Strategic Deception in LLMs
[NLP-75] 庄家永远是赢家:评估LLM战略欺骗的框架

链接: https://arxiv.org/abs/2407.00948
作者: Tanush Chopra,Michael Li
关键词: large language models, language models, large language, evaluating strategic deception, fair play
中文关键词: 大型语言模型,语言模型,大型语言,评估战略欺骗,公平竞争
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Research conducted at the Deception Detection Hackathon 2024 hosted by Apart Apollo Research

点击查看摘要

Abstract:We propose a framework for evaluating strategic deception in large language models (LLMs). In this framework, an LLM acts as a game master in two scenarios: one with random game mechanics and another where it can choose between random or deliberate actions. As an example, we use blackjack because neither the action space nor the strategies involve deception. We benchmark Llama3-70B, GPT-4-Turbo, and Mixtral in blackjack, comparing outcomes against expected distributions in fair play to determine if LLMs develop strategies favoring the “house.” Our findings reveal that the LLMs exhibit significant deviations from fair play when given implicit randomness instructions, suggesting a tendency towards strategic manipulation in ambiguous scenarios. However, when presented with an explicit choice, the LLMs largely adhere to fair play, indicating that the framing of instructions plays a crucial role in eliciting or mitigating potentially deceptive behaviors in AI systems.
摘要:我们提出了一个评估大型语言模型(LLM)中战略欺骗的框架。在这个框架中,LLM在两种情况下充当游戏主持人:一种采用随机游戏机制,另一种是它可以在随机或故意动作之间进行选择。作为示例,我们使用二十一点,因为其动作空间和策略都不涉及欺骗。我们在二十一点中对Llama3-70B、GPT-4-Turbo和Mixtral进行基准测试,将结果与公平游戏中的预期分布进行比较,以确定LLM是否会形成有利于“庄家”的策略。我们的研究结果表明,当给予隐性随机性指令时,LLM表现出与公平游戏的显著偏差,这表明在模糊场景中存在战略操纵的倾向。然而,当面对明确的选择时,LLM在很大程度上遵守公平游戏,这表明指令的表述方式在引发或减轻人工智能系统中潜在的欺骗行为方面发挥着至关重要的作用。
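
Comparing dealt outcomes against the distribution expected under fair play, as the abstract above describes, can be illustrated with a goodness-of-fit statistic. The abstract does not say which statistical test the authors use, so the Pearson chi-square statistic below is an assumption, as are the infinite-deck rank probabilities, the name `chi_square_statistic`, and the hand-made samples.

```python
from collections import Counter

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9"]
# Assumed fair infinite deck: each of A and 2-9 has probability 1/13, and the
# ten-valued cards (10, J, Q, K) are pooled into "10" with probability 4/13.
FAIR_PROBS = {rank: 1 / 13 for rank in RANKS}
FAIR_PROBS["10"] = 4 / 13


def chi_square_statistic(dealt_cards, expected_probs=FAIR_PROBS):
    """Pearson chi-square statistic of observed rank counts against expected ones."""
    counts = Counter(dealt_cards)
    n = len(dealt_cards)
    stat = 0.0
    for rank, p in expected_probs.items():
        expected = n * p
        stat += (counts.get(rank, 0) - expected) ** 2 / expected
    return stat


# Hand-made samples of 130 cards each: one exactly fair, one heavily skewed.
fair_sample = RANKS * 10 + ["10"] * 40
skewed_sample = ["10"] * 100 + ["5"] * 30

fair_stat = chi_square_statistic(fair_sample)      # essentially 0
skewed_stat = chi_square_statistic(skewed_sample)  # about 210
# With 10 categories (9 degrees of freedom), the 5% critical value is about 16.92.
print(fair_stat, skewed_stat)
```

A statistic far above the critical value, as for the skewed sample, would indicate that the dealt cards are unlikely to have come from a fair deck; whether that deviation favors the house needs a separate look at which ranks are over-represented.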

[NLP-76] ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions
[NLP-76] ProductAgent:通过提出澄清问题来对对话式产品搜索代理进行基准测试

链接: https://arxiv.org/abs/2407.00942
作者: Jingheng Ye,Yong Jiang,Xiaobin Wang,Yinghui Li,Yangning Li,Hai-Tao Zheng,Pengjun Xie,Fei Huang
关键词: tailored product searching, clarification question generation, strategic clarification question, e-commercial scenario, paper introduces
中文关键词: 定制产品搜索、澄清问题生成、战略澄清问题、电子商务场景、论文介绍
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 13 tables, 6 figures. Under review

点击查看摘要

Abstract:This paper introduces the task of product demand clarification within an e-commerce scenario, where the user commences the conversation with ambiguous queries and the task-oriented agent is designed to achieve more accurate and tailored product searching by asking clarification questions. To address this task, we propose ProductAgent, a conversational information seeking agent equipped with abilities of strategic clarification question generation and dynamic product retrieval. Specifically, we develop the agent with strategies for product feature summarization, query generation, and product retrieval. Furthermore, we propose the benchmark called PROCLARE to evaluate the agent’s performance both automatically and qualitatively with the aid of an LLM-driven user simulator. Experiments show that ProductAgent interacts positively with the user and enhances retrieval performance with increasing dialogue turns, where user demands become gradually more explicit and detailed. All the source code will be released after the review anonymity period.
摘要:本文介绍了电子商务场景中的产品需求澄清任务,其中用户以含糊的问题开始对话,面向任务的代理被设计为通过提出澄清问题来实现更准确和定制的产品搜索。为了解决这一问题,我们提出了一种对话式信息搜索代理ProductAgent,它具有策略澄清问题生成和动态产品检索的能力。具体地说,我们开发了具有产品特征摘要、查询生成和产品检索策略的代理。此外,我们还提出了一个称为PROCLARE的基准测试程序,它可以在LLM驱动的用户模拟器的帮助下,自动和定性地评估代理的性能。实验表明,随着对话次数的增加,ProductAgent与用户进行了积极的交互,提高了检索性能,用户的需求逐渐变得更加明确和详细。所有源代码将在审查匿名期后发布。

[NLP-77] MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities
[NLP-77] MalAlgoQA:评估反事实推理能力的教学方法

链接: https://arxiv.org/abs/2407.00938
作者: Naiming Liu,Shashank Sonkar,Myco Le,Richard Baraniuk
关键词: Large Language Models, Large Language, capabilities of Large, answer rationale identification, paper introduces MalAlgoQA
中文关键词: 大型语言模型、大型语言、大型功能、答案原理识别、论文介绍MalAlgoQA
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. We focus on the incorrect answer rationales, termed “malgorithms”, which highlights flawed reasoning steps leading to incorrect answers and offers valuable insights into erroneous thought processes. We also propose the Malgorithm Identification task, where LLMs are assessed based on their ability to identify corresponding malgorithm given an incorrect answer choice. To evaluate the model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect answer rationale identification. The task is challenging since state-of-the-art LLMs exhibit significant drops in MIA as compared to AIA. Moreover, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA, but can also lead to underperformance compared to simple prompting. These findings hold significant implications for the development of more cognitively-inspired LLMs to improve their counterfactual reasoning abilities, particularly through a pedagogical perspective where understanding and rectifying student misconceptions are crucial.
摘要:本文介绍了一个新的数据集MalAlgoQA,旨在通过教学法视角评估大型语言模型(LLM)的反事实推理能力。数据集包括数学和阅读理解问题,每个问题都附有四个答案选项及其对应的解题依据(rationale)。我们关注错误答案的解题依据,称为“malgorithm”(错误算法),它凸显了导致错误答案的有缺陷的推理步骤,并为理解错误的思维过程提供了有价值的见解。我们还提出了Malgorithm识别任务,根据LLM在给定错误答案选项时识别相应malgorithm的能力对其进行评估。为了评估模型性能,我们引入了两个指标:用于正确答案依据识别的算法识别准确率(AIA),以及用于错误答案依据识别的错误算法识别准确率(MIA)。这项任务具有挑战性,因为与AIA相比,最先进的LLM在MIA上表现出显著下降。此外,我们发现思维链(chain-of-thought)提示技术不仅不能稳定地提升MIA,而且与简单提示相比还可能导致表现更差。这些发现对于开发更多受认知启发的LLM以提高其反事实推理能力具有重要意义,特别是从教学法的角度来看,理解和纠正学生的错误概念至关重要。
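The two metrics named in the abstract are both plain accuracies, computed on different subsets of trials. The sketch below is only an illustration of that split: the `Prediction` record layout and the toy data are assumptions, not the paper's data format or evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One rationale-identification trial (hypothetical record layout)."""
    answer_is_correct: bool   # True if the shown answer choice is the correct one
    gold_rationale: int       # index of the rationale that matches the answer
    predicted_rationale: int  # index of the rationale chosen by the model


def identification_accuracy(preds, on_correct_answers):
    """Accuracy over trials with a correct answer (AIA) or an incorrect one (MIA)."""
    subset = [p for p in preds if p.answer_is_correct == on_correct_answers]
    if not subset:
        return float("nan")
    hits = sum(p.predicted_rationale == p.gold_rationale for p in subset)
    return hits / len(subset)


# Toy data, invented for illustration only.
preds = [
    Prediction(True, 0, 0),
    Prediction(True, 1, 1),
    Prediction(False, 2, 2),
    Prediction(False, 3, 1),
    Prediction(False, 0, 3),
    Prediction(False, 1, 1),
]
aia = identification_accuracy(preds, on_correct_answers=True)
mia = identification_accuracy(preds, on_correct_answers=False)
print(f"AIA={aia:.2f}  MIA={mia:.2f}")
```

Reporting the two numbers side by side is what exposes the gap the abstract describes between identifying correct rationales and identifying malgorithms.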

[NLP-78] Large Language Model Enhanced Knowledge Representation Learning: A Survey
[NLP-78] 大语言模型增强的知识表示学习:综述

链接: https://arxiv.org/abs/2407.00936
作者: Xin Wang,Zirui Chen,Haofen Wang,Leong Hou U,Zhao Li,Wenbin Guo
关键词: Large Language Models, Knowledge Representation Learning, complex knowledge structures, Large Language, utilize complex knowledge
中文关键词: 大型语言模型、知识表示学习、复杂知识结构、大型语言、利用复杂知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with Knowledge Representation Learning (KRL) signifies a pivotal advancement in the field of artificial intelligence, enhancing the ability to capture and utilize complex knowledge structures. This synergy leverages the advanced linguistic and contextual understanding capabilities of LLMs to improve the accuracy, adaptability, and efficacy of KRL, thereby expanding its applications and potential. Despite the increasing volume of research focused on embedding LLMs within the domain of knowledge representation, a thorough review that examines the fundamental components and processes of these enhanced models is conspicuously absent. Our survey addresses this by categorizing these models based on three distinct Transformer architectures, and by analyzing experimental data from various KRL downstream tasks to evaluate the strengths and weaknesses of each approach. Finally, we identify and explore potential future research directions in this emerging yet underexplored domain, proposing pathways for continued progress.
摘要:大型语言模型(LLM)与知识表示学习(KRL)的结合标志着人工智能领域的一项关键进展,增强了捕捉和利用复杂知识结构的能力。这种协同利用了LLM先进的语言和上下文理解能力,以提高KRL的准确性、适应性和有效性,从而拓展其应用和潜力。尽管越来越多的研究聚焦于将LLM嵌入知识表示领域,但目前明显缺乏对这些增强模型的基本组成部分和流程进行全面考察的综述。我们的综述根据三种不同的Transformer架构对这些模型进行分类,并通过分析来自各种KRL下游任务的实验数据来评估每种方法的优缺点,从而填补了这一空白。最后,我们确定并探讨了这一新兴但尚未被充分探索的领域的潜在未来研究方向,提出了持续取得进展的途径。

[NLP-79] Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining
[NLP-79] 向前看还是环顾四周?自回归与掩蔽预训练的理论比较

链接: https://arxiv.org/abs/2407.00935
作者: Qi Zhang,Tianqi Du,Haotian Huang,Yifei Wang,Yisen Wang
关键词: masked SSL, SSL, generative self-supervised learning, autoregressive SSL, generative SSL paradigms
中文关键词: 屏蔽SSL、SSL、生成性自我监督学习、自回归SSL、生成性SSL范式
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other’s strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at this https URL.
摘要:近年来,生成式自监督学习(SSL)范式的兴起在视觉、语言和多模态领域展现了令人印象深刻的性能。虽然生成式SSL目标的不同设计导致了下游任务中的不同性质,但对这些差异的理论理解在很大程度上仍未被探索。在本文中,我们首次对两种主要的生成式SSL范式(自回归SSL和掩码SSL)进行了理论比较。通过建立理论框架,我们阐明了自回归SSL和掩码SSL在分类和内容生成这两类主要评估任务中的优势和局限性。我们的研究结果表明,在分类任务中,与自回归SSL中目标token的固定位置相比,掩码SSL中目标token的灵活性促进了更多的样本间连接,从而产生更好的聚类性能。在内容生成任务中,测试样本的可变长度与掩码SSL中未掩码文本的固定长度之间的不匹配(相对于自回归SSL中条件文本的可变长度)阻碍了其生成性能。为了取长补短,我们提出了多样性增强的自回归目标和可变长度的掩码目标,它们大幅提高了自回归SSL的分类性能和掩码SSL的生成性能。代码可在此 https URL 获取。

[NLP-80] CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
[NLP-80] CLEME2.0:通过解耦编辑实现更具可解释性的语法错误纠正评估

链接: https://arxiv.org/abs/2407.00934
作者: Jingheng Ye,Zishan Xu,Yinghui Li,Xuxin Cheng,Linlin Song,Qingyu Zhou,Hai-Tao Zheng,Ying Shen,Xin Su
关键词: Grammatical Error Correction, Error Correction, Grammatical Error, interpretability of Grammatical, previous studies
中文关键词: 语法错误纠正,错误纠正,语法错误,语法的可解释性,以前的研究
类目: Computation and Language (cs.CL)
备注: 16 pages, 8 tables, 2 figures. Under review

点击查看摘要

Abstract:The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics, which receives little attention in previous studies. To bridge the gap, we propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems, namely hit-correction, error-correction, under-correction, and over-correction. They collectively contribute to revealing the critical characteristics and locating drawbacks of GEC systems. Evaluating systems by Combining these dimensions leads to high human consistency over other reference-based and reference-less metrics. Extensive experiments on 2 human judgement datasets and 6 reference datasets demonstrate the effectiveness and robustness of our method. All the codes will be released after the peer review.
摘要:本文的重点是提高语法错误纠正(GEC)指标的可解释性,而这在之前的研究中很少受到关注。为了弥合这一差距,我们提出了CLEME2.0,这是一种基于参考的评估策略,可以描述GEC系统的四个基本维度,即命中纠正(hit-correction)、错误纠正(error-correction)、纠正不足(under-correction)和过度纠正(over-correction)。它们共同有助于揭示GEC系统的关键特征并定位其缺陷。与其他基于参考和无参考的指标相比,结合这些维度来评估系统可以获得更高的人类一致性。在2个人类判断数据集和6个参考数据集上的大量实验证明了我们方法的有效性和稳健性。所有代码将在同行评审后发布。

[NLP-81] FoldGPT: Simple and Effective Large Language Model Compression Scheme
[NLP-81] FoldGPT:简单有效的大型语言模型压缩方案

链接: https://arxiv.org/abs/2407.00928
作者: Songwei Liu,Chao Zeng,Lianqiang Li,Chenqian Yan,Lean Fu,Xing Mei,Fangmin Chen
关键词: escalating data security, data security concerns, deploying large language, mobile devices continues, large language models
中文关键词: 数据安全升级、数据安全担忧、部署大型语言、移动设备继续、大型语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The demand for deploying large language models(LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and found that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing.This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these Blocks, we “cure” the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art(SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
摘要:在不断升级的数据安全担忧和云成本的推动下,在移动设备上部署大型语言模型(LLM)的需求持续增加。然而,网络带宽和内存限制给在移动设备上部署十亿级参数的模型带来了挑战。在这项研究中,我们考察了不同规模LLM中各层的输出,发现大多数层的输出表现出显著的相似性。此外,随着模型规模的增加,这种相似性变得更加明显,这表明LLM在深度方向上存在大量冗余。基于这一观察,我们提出了一种结合块移除和块参数共享的高效模型体积压缩策略FoldGPT。该策略由三部分组成:(1)基于可学习的门控参数,在对块之间的耦合效应进行建模的同时,确定块的重要性排序,然后根据给定的移除率删除一些冗余层。(2)对于保留的块,我们采用了专门设计的组参数共享策略,其中同一组内的块共享相同的权重,显著压缩了参数数量,并略微降低了延迟开销。(3)在共享这些块之后,我们用少量的微调来“修复”稀疏性造成的不匹配,并引入尾层蒸馏策略来提高性能。实验表明,FoldGPT在高效模型压缩方面优于以往最先进(SOTA)的方法,证明了通过简单的块移除和参数共享来实现模型轻量化的可行性。
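The observation that most layer outputs are highly similar can be probed with a simple cosine-similarity check between consecutive layers. The sketch below is a minimal illustration under stated assumptions: the abstract does not say which similarity measure the authors use, and the "hidden states" here are synthetic vectors rather than outputs of a real LLM.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def consecutive_layer_similarity(layer_outputs):
    """layer_outputs[l][t] is the hidden vector of token t after layer l.

    Returns one mean cosine similarity per pair of adjacent layers.
    """
    sims = []
    for prev, nxt in zip(layer_outputs, layer_outputs[1:]):
        per_token = [cosine(u, v) for u, v in zip(prev, nxt)]
        sims.append(sum(per_token) / len(per_token))
    return sims


# Synthetic stand-in for hidden states: each "layer" perturbs the previous one slightly.
random.seed(0)
layers = [[[random.gauss(0, 1) for _ in range(16)] for _ in range(5)]]
for _ in range(3):
    layers.append([[x + 0.01 * random.gauss(0, 1) for x in vec] for vec in layers[-1]])

sims = consecutive_layer_similarity(layers)
print([round(s, 4) for s in sims])
```

With a real model, values close to 1 for many adjacent pairs would be the kind of depth-wise redundancy that motivates removing or sharing blocks.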

[NLP-82] EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction
[NLP-82] EXCGEC:编辑式可解释中文语法错误更正的基准

链接: https://arxiv.org/abs/2407.00924
作者: Jingheng Ye,Shang Qin,Yinghui Li,Xuxin Cheng,Libo Qin,Hai-Tao Zheng,Peng Xing,Zishan Xu,Guo Cheng,Zhao Wei
关键词: Grammatical Error Correction, Existing studies explore, Grammatical Error, Error Correction, Existing studies
中文关键词: 语法错误纠正,现有研究探索,语法错误,错误纠正,现有研究
类目: Computation and Language (cs.CL)
备注: 22 pages, 10 tables, 9 figures. Under review

点击查看摘要

Abstract:Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations. To bridge the gap, this paper introduces the task of EXplainable GEC (EXGEC), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose EXCGEC, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We benchmark several series of LLMs in multiple settings, covering post-explaining and pre-explaining. To promote the development of the task, we introduce a comprehensive suite of automatic metrics and conduct human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. All the codes and data will be released after the review.
摘要:现有研究只在有限的场景下探索语法错误纠正(GEC)的可解释性,忽视了纠正与解释之间的相互作用。为了弥合这一差距,本文提出了可解释GEC(EXGEC)任务,重点关注纠正任务和解释任务的整体作用。为了推动这项任务,我们提出了EXCGEC,这是一个为中文EXGEC量身定制的基准,由8,216个带解释增强的样本组成,采用混合式逐编辑(edit-wise)解释的设计。我们在多种设置下对多个系列的LLM进行基准测试,涵盖后解释和预解释。为了促进该任务的发展,我们引入了一套全面的自动指标,并进行了人工评估实验,以证明这些自动指标在自由文本解释上与人类判断的一致性。所有代码和数据将在评审后发布。

[NLP-83] Preserving Multilingual Quality While Tuning Query Encoder on English Only
[NLP-83] 在仅以英语调整查询编码器的同时保留多语言质量

链接: https://arxiv.org/abs/2407.00923
作者: Oleg Vasilyev,Randy Sawaya,John Bohannon
关键词: relevant text passages, dense passage retrieval, passage retrieval system, dense passage, text passages
中文关键词: 相关文本段落,密集段落检索,段落检索系统,密集段落,文本段落
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A dense passage retrieval system can serve as the initial stages of information retrieval, selecting the most relevant text passages for downstream tasks. In this work we conducted experiments with the goal of finding how much the quality of a multilingual retrieval could be degraded if the query part of a dual encoder is tuned on an English-only dataset (assuming scarcity of cross-lingual samples for the targeted domain or task). Specifically, starting with a high quality multilingual embedding model, we observe that an English-only tuning may not only preserve the original quality of the multilingual retrieval, but even improve it.
摘要:密集段落检索系统可以作为信息检索的初始阶段,为下游任务选择最相关的文本段落。在这项工作中,我们进行了实验,目标是找出如果双编码器的查询部分在纯英语数据集上调整,多语言检索的质量可能会降低多少(假设目标领域或任务的跨语言样本稀缺)。具体来说,从高质量的多语言嵌入模型开始,我们观察到纯英语调优不仅可以保留多语言检索的原始质量,甚至可以改进它。

[NLP-84] Deep Image-to-Recipe Translation
[NLP-84] 深度图像到食谱翻译

链接: https://arxiv.org/abs/2407.00911
作者: Jiangqin Ma,Bilal Mawji,Franz Williams
关键词: profound level, reflecting the intricate, intricate connection, Eat, cherished food memories
中文关键词: 层次深刻,反映出错综复杂、错综复杂的联系,吃、珍惜的食物记忆
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The modern saying, “You Are What You Eat” resonates on a profound level, reflecting the intricate connection between our identities and the food we consume. Our project, Deep Image-to-Recipe Translation, is an intersection of computer vision and natural language generation that aims to bridge the gap between cherished food memories and the art of culinary creation. Our primary objective involves predicting ingredients from a given food image. For this task, we first develop a custom convolutional network and then compare its performance to a model that leverages transfer learning. We pursue an additional goal of generating a comprehensive set of recipe steps from a list of ingredients. We frame this process as a sequence-to-sequence task and develop a recurrent neural network that utilizes pre-trained word embeddings. We address several challenges of deep learning including imbalanced datasets, data cleaning, overfitting, and hyperparameter selection. Our approach emphasizes the importance of metrics such as Intersection over Union (IoU) and F1 score in scenarios where accuracy alone might be misleading. For our recipe prediction model, we employ perplexity, a commonly used and important metric for language models. We find that transfer learning via pre-trained ResNet-50 weights and GloVe embeddings provide an exceptional boost to model performance, especially when considering training resource constraints. Although we have made progress on the image-to-recipe translation, there is an opportunity for future exploration with advancements in model architectures, dataset scalability, and enhanced user interaction.
摘要:现代谚语“你吃什么,你就是什么”在深刻的层面上引起共鸣,反映了我们的身份与我们所吃食物之间错综复杂的联系。我们的项目“深度图像到食谱翻译”是计算机视觉和自然语言生成的交叉,旨在弥合珍贵的食物记忆与烹饪创作艺术之间的差距。我们的主要目标是根据给定的食物图像预测配料。对于这项任务,我们首先开发一个定制的卷积网络,然后将其性能与利用迁移学习的模型进行比较。我们追求的另一个目标是从配料列表生成一套完整的食谱步骤。我们将这个过程表述为一个序列到序列的任务,并开发了一个利用预训练词嵌入的循环神经网络。我们处理了深度学习中的若干挑战,包括数据集不平衡、数据清理、过拟合和超参数选择。我们的方法强调,在仅凭准确率可能产生误导的场景中,交并比(IoU)和F1分数等指标十分重要。对于我们的食谱预测模型,我们采用了困惑度(perplexity),这是语言模型常用且重要的指标。我们发现,通过预训练的ResNet-50权重和GloVe嵌入进行迁移学习可以极大地提升模型性能,特别是在考虑训练资源限制的情况下。尽管我们在图像到食谱的翻译方面取得了进展,但随着模型架构、数据集可扩展性和用户交互方面的进步,未来仍有探索空间。
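The three metrics mentioned in the abstract have standard definitions, sketched below for set-valued ingredient predictions and for a sequence of per-token probabilities. The ingredient sets and probabilities are made-up toy values, and the paper's exact averaging scheme is not stated in the abstract, so this shows per-example scores only.

```python
import math


def iou(pred, gold):
    """Intersection over Union between predicted and gold ingredient sets."""
    pred, gold = set(pred), set(gold)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0


def f1(pred, gold):
    """Harmonic mean of precision and recall over ingredient sets."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Toy values, invented for illustration.
pred = {"flour", "egg", "sugar"}
gold = {"flour", "egg", "butter", "milk"}
print(iou(pred, gold), f1(pred, gold), perplexity([0.25, 0.25, 0.25, 0.25]))
```

Plain accuracy over a large ingredient vocabulary is dominated by the many ingredients that are correctly absent, which is why overlap-based scores such as IoU and F1 are more informative here.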

[NLP-85] FineSurE: Fine-grained Summarization Evaluation using LLMs
[NLP-85] FineSurE:使用LLM进行细粒度摘要评估

链接: https://arxiv.org/abs/2407.00908
作者: Hwanjun Song,Hang Su,Igor Shalyminov,Jason Cai,Saab Mansour
关键词: Automated evaluation, streamlining text summarization, crucial for streamlining, streamlining text, costly and time-consuming
中文关键词: 自动评估、简化文本摘要,对于简化、简化文本至关重要,成本高昂且耗时
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2024 (main, long)

点击查看摘要

Abstract:Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at this https URL.
摘要:鉴于人工评估成本高昂且耗时,自动评估对于简化文本摘要的基准测试和模型开发至关重要。像ROUGE这样的传统方法与人类判断的相关性不高,而最近提出的基于LLM的指标只提供使用李克特量表(Likert-scale)分数的摘要级评估。这限制了更深层次的模型分析,例如,我们在摘要级别只能给出一个幻觉分数,而在句子级别,我们可以统计包含幻觉的句子数量。为了弥补这些局限性,我们提出了FineSurE,这是一个利用大型语言模型(LLM)、专门为摘要任务量身定制的细粒度评估器。除忠实性之外,它还采用完备性和简洁性标准,从而实现多维度评估。我们比较了各种开源和专有LLM作为FineSurE主干的效果。此外,我们针对SOTA方法(包括基于NLI、QA和LLM的方法)对FineSurE进行了广泛的基准测试,结果显示其性能有所提升,特别是在完备性和简洁性维度上。代码可在此 https URL 获取。

[NLP-86] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
[NLP-86] 从内省到最佳实践:多模态上下文学习中演示示例的原则性分析

链接: https://arxiv.org/abs/2407.00902
作者: Nan Xu,Fei Wang,Sheng Zhang,Hoifung Poon,Muhao Chen
关键词: Large Language models, Large Language, multiple image-text pairs, similar ICL abilities, capabilities of Large
中文关键词: 大型语言模型、大型语言、多个图像-文本对、类似的ICL能力、大型的能力
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Motivated by in-context learning (ICL) capabilities of Large Language models (LLMs), multimodal LLMs with additional visual modality are also exhibited with similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively less work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models’ ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.
摘要:受大型语言模型(LLM)上下文学习(ICL)能力的启发,带有额外视觉模态的多模态LLM在以多个图像-文本对作为演示示例时,也表现出类似的ICL能力。然而,针对多模态ICL如何以及为何起作用的原理,相关研究相对较少。我们在一系列新的但关键的任务上,对不同规模的模型进行了系统且有原则的多模态ICL评估。通过对不同模态信息的扰动,我们表明在多模态ICL中,各模态在不同任务中的重要性不同。考虑到这种模态影响,我们进一步利用模态驱动的演示策略来提高ICL性能。我们还发现,演示示例的选择与模型从多模态ICL中捕获任务归纳偏置的能力密切相关。我们的原则性分析提供了一种全面的方式来理解演示示例在多模态上下文学习中的作用,并有助于在广泛的任务上有效改进多模态ICL,即使这些任务在预训练数据中未曾出现甚至与之相矛盾。

[NLP-87] MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
[NLP-87] MathCAMPS:基于人类课程的数学问题细粒度合成

链接: https://arxiv.org/abs/2407.00900
作者: Shubhra Mishra,Gabriel Poesia,Belinda Mo,Noah D. Goodman
关键词: Large Language Models, Large Language, Language Models, important capability, Mathematics Common Core
中文关键词: 大型语言模型,大型语言,语言模型,重要能力,数学共同核心
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Dataset and code: this https URL

点击查看摘要

Abstract:Mathematical problem solving is an important skill for Large Language Models (LLMs), both as an important capability and a proxy for a range of reasoning abilities. Existing benchmarks probe a diverse set of skills, but they yield aggregate accuracy metrics, obscuring specific abilities or weaknesses. Furthermore, they are difficult to extend with new problems, risking data contamination over time. To address these challenges, we propose MathCAMPS: a method to synthesize high-quality mathematical problems at scale, grounded on 44 fine-grained “standards” from the Mathematics Common Core (CC) Standard for K-8 grades. We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers. We then use LLMs to realize the symbolic problems into word problems. We propose a cycle-consistency method for validating problem faithfulness. Finally, we derive follow-up questions from symbolic structures and convert them into follow-up word problems - a novel task of mathematical dialogue that probes for robustness in understanding. Experiments on 23 LLMs show surprising failures even in the strongest models (in particular when asked simple follow-up questions). Moreover, we evaluate training checkpoints of Pythia 12B on MathCAMPS, allowing us to analyze when particular mathematical skills develop during its training. Our framework enables the community to reproduce and extend our pipeline for a fraction of the typical cost of building new high-quality datasets.
摘要:数学问题求解是大型语言模型(LLM)的一项重要技能,它既是一种重要的能力,也是一系列推理能力的代理指标。现有的基准考察了一系列不同的技能,但它们只给出汇总的准确率指标,掩盖了具体的能力或弱点。此外,它们很难用新问题进行扩展,随着时间的推移有数据污染的风险。为了应对这些挑战,我们提出了MathCAMPS:一种大规模合成高质量数学问题的方法,基于K-8年级数学共同核心(CC)标准中的44个细粒度“标准”。我们用形式语法对每个标准进行编码,从而能够采样多样的符号问题及其答案。然后,我们使用LLM将符号问题转化为应用题。我们提出了一种循环一致性方法来验证问题的忠实度。最后,我们从符号结构中推导出后续问题,并将其转换为后续应用题,这是一项探究理解稳健性的数学对话新任务。在23个LLM上的实验显示,即使是最强大的模型也会出现令人惊讶的失败(特别是在被问及简单的后续问题时)。此外,我们在MathCAMPS上评估了Pythia 12B的训练检查点,使我们能够分析特定数学技能在其训练过程中何时形成。我们的框架使社区能够以构建新的高质量数据集的典型成本的一小部分来复现和扩展我们的流程。
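The first stage of the pipeline, sampling symbolic problems together with their answers, can be illustrated with a toy generator. This is a sketch under loud assumptions: the "standard" below (addition and subtraction of small non-negative integers), the function name `sample_problem`, and the output format are all hypothetical and are not taken from the MathCAMPS grammars.

```python
import random


def sample_problem(rng, max_value=20):
    """Sample a toy symbolic problem 'a op b' together with its answer.

    Subtraction is constrained so the answer never goes negative,
    mimicking a grade-level restriction (an assumption for this sketch).
    """
    op = rng.choice(["+", "-"])
    a = rng.randint(0, max_value)
    b = rng.randint(0, a) if op == "-" else rng.randint(0, max_value)
    answer = a + b if op == "+" else a - b
    return {"symbolic": f"{a} {op} {b}", "answer": answer}


rng = random.Random(0)
for problem in (sample_problem(rng) for _ in range(3)):
    print(problem["symbolic"], "=", problem["answer"])
```

Because the answer is computed from the symbolic form rather than written by hand, fresh problems can be generated on demand, which is what limits the contamination risk the abstract mentions. Turning the symbolic form into a word problem is a separate step that the paper delegates to an LLM.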

[NLP-88] How to Leverage Digit Embeddings to Represent Numbers?
[NLP-88] 如何利用数字嵌入来代表数字?

链接: https://arxiv.org/abs/2407.00894
作者: Jasivan Alex Sivakumar,Nafise Sadat Moosavi
关键词: performing arithmetic operations, existing language models, arithmetic operations, Sivakumar and Moosavi, performing arithmetic
中文关键词: 执行算术运算、现有语言模型、算术运算、Sivakumar和Moosavi、执行算术
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Apart from performing arithmetic operations, understanding numbers themselves is still a challenge for existing language models. Simple generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance (Sivakumar and Moosavi, 2023). Among various techniques, character-level embeddings of numbers have emerged as a promising approach to improve number representation. However, this method has limitations as it leaves the task of aggregating digit representations to the model, which lacks direct supervision for this process. In this paper, we explore the use of mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models. This can be achieved either by adding a special token to the input embeddings or by introducing an additional loss function to enhance correct predictions. We evaluate the effectiveness of incorporating this explicit aggregation, analysing its strengths and shortcomings, and discuss future directions to better benefit from this approach. Our methods, while simple, are compatible with any pretrained model and require only a few lines of code, which we have made publicly available.
摘要:除了执行算术运算外,理解数字本身对于现有的语言模型来说仍然是一个挑战。简单的泛化,例如求解100+200而不是1+2,就可能显著影响模型的性能(Sivakumar和Moosavi,2023)。在各种技术中,数字的字符级嵌入已经成为一种很有前途的改进数字表示的方法。然而,这种方法有局限性,因为它将聚合各位数字表示的任务留给了模型,而模型在这一过程中缺乏直接监督。在本文中,我们探索使用数学先验来计算聚合的数字嵌入,并显式地将这些聚合结果引入Transformer模型。这可以通过向输入嵌入添加特殊token,或者通过引入额外的损失函数来增强正确预测来实现。我们评估了引入这种显式聚合的有效性,分析了它的优点和缺点,并讨论了未来的方向,以更好地从这种方法中受益。我们的方法虽然简单,但与任何预训练模型兼容,只需要几行代码,我们已经公开了这些代码。
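One way to picture "aggregating digit embeddings with a mathematical prior" is a weighted sum in which each digit's weight follows its place value. The abstract does not specify the weighting scheme, so the normalized powers-of-ten weights, the function name, and the toy embedding table below are all assumptions made for illustration, not the authors' method.

```python
def aggregate_digit_embeddings(number, digit_table, base=10.0):
    """Weighted sum of per-digit embeddings for a non-negative integer.

    Weights are normalized place values (an assumed prior), so the most
    significant digit contributes the most to the aggregate vector.
    """
    digits = [int(ch) for ch in str(number)]
    n = len(digits)
    raw = [base ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    weights = [w / total for w in raw]
    dim = len(digit_table[0])
    agg = [0.0] * dim
    for w, d in zip(weights, digits):
        for k in range(dim):
            agg[k] += w * digit_table[d][k]
    return agg


# Toy 2-dimensional embedding table: digit d -> [d, 1.0] (hypothetical values).
digit_table = [[float(d), 1.0] for d in range(10)]
agg_7 = aggregate_digit_embeddings(7, digit_table)
agg_12 = aggregate_digit_embeddings(12, digit_table)
print(agg_7, agg_12)
```

In the setting the abstract describes, such an aggregate would either be fed to the model as the embedding of an extra special token or used as a target in an auxiliary loss; neither integration step is shown here.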

[NLP-89] Papez: Resource-Efficient Speech Separation with Auditory Working Memory
[NLP-89] Papez:具有听觉工作记忆的资源高效语音分离

链接: https://arxiv.org/abs/2407.00888
作者: Hyunseok Oh,Juheon Yi,Youngki Lee
关键词: Transformer-based models recently, extreme computational load, computational load makes, single-channel speech separation, models recently reached
中文关键词: 基于转换器的模型最近,极端的计算负载,计算负载使得,单通道语音分离,最近达到的模型
类目: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages. Accepted by ICASSP 2023

点击查看摘要

Abstract:Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; however, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk Transformer with small-sized auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs with a large margin. We publicly share our source code at this https URL
摘要:基于Transformer的模型最近达到了最先进的单通道语音分离准确率;然而,它们极高的计算负载使其难以部署在资源受限的移动或物联网设备中。因此,我们提出了Papez,一种轻量级且计算高效的单通道语音分离模型。Papez基于三项关键技术。首先,我们用小型的听觉工作记忆替换块间Transformer。其次,我们自适应地剪除不需要进一步处理的输入token。最后,我们通过循环Transformer减少参数数量。我们的广泛评估表明,Papez以较大的优势实现了最佳的资源与准确率权衡。我们在此 https URL 上公开分享我们的源代码。

[NLP-90] Mechanistic Interpretation through Contextual Decomposition in Transformers
[NLP-90] 通过上下文分解对Transformer进行机制性解释

链接: https://arxiv.org/abs/2407.00886
作者: Aliyah R. Hsu,Yeshwanth Cherapanamjeri,Anobel Y. Odisho,Peter R. Carroll,Bin Yu
关键词: black boxes due, complex nonlinear relationships, regarded as black, black boxes, boxes due
中文关键词: 由于黑匣子,复杂的非线性关系,被视为黑匣子,黑匣子,由于盒
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Transformers exhibit impressive capabilities but are often regarded as black boxes due to challenges in understanding the complex nonlinear relationships between features. Interpreting machine learning models is of paramount importance to mitigate risks, and mechanistic interpretability is in particular of current interest as it opens up a window for guiding manual modifications and reverse-engineering solutions. In this work, we introduce contextual decomposition for transformers (CD-T), extending a prior work on CD for RNNs and CNNs, to address mechanistic interpretation computationally efficiently. CD-T is a flexible interpretation method for transformers. It can capture contributions of combinations of input features or source internal components (e.g. attention heads, feed-forward networks) to (1) final predictions or (2) the output of any target internal component. Using CD-T, we propose a novel algorithm for circuit discovery. On a real-world pathology report classification task: we show CD-T distills a more faithful circuit of attention heads with improved computational efficiency (speed up 2x) than a prior benchmark, path patching. As a versatile interpretation method, CD-T also exhibits exceptional capabilities for local interpretations. CD-T is shown to reliably find words and phrases of contrasting sentiment/topic on SST-2 and AGNews datasets. Through human experiments, we demonstrate CD-T enables users to identify the more accurate of two models and to better trust a model’s outputs compared to alternative interpretation methods such as SHAP and LIME.
摘要:Transformer展现出令人印象深刻的能力,但由于难以理解特征之间复杂的非线性关系,它们通常被视为黑盒。解释机器学习模型对于降低风险至关重要,而机制可解释性(mechanistic interpretability)目前尤其受到关注,因为它为指导人工修改和逆向工程解决方案打开了一扇窗。在这项工作中,我们提出了面向Transformer的上下文分解(CD-T),扩展了先前针对RNN和CNN的CD工作,以计算高效的方式实现机制性解释。CD-T是一种灵活的Transformer解释方法。它可以刻画输入特征或源内部组件(例如注意力头、前馈网络)的组合对(1)最终预测或(2)任何目标内部组件输出的贡献。利用CD-T,我们提出了一种新的电路发现算法。在一个真实世界的病理报告分类任务上,我们表明,与之前的基准方法路径修补(path patching)相比,CD-T能提炼出更忠实的注意力头电路,且计算效率更高(提速2倍)。作为一种通用的解释方法,CD-T在局部解释方面也表现出卓越的能力。实验表明,CD-T能够在SST-2和AGNews数据集上可靠地找到情感/主题相反的单词和短语。通过人类受试者实验,我们证明与SHAP和LIME等其他解释方法相比,CD-T使用户能够识别两个模型中更准确的一个,并能更好地信任模型的输出。

[NLP-91] MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting
[NLP-91] MoE-CT:一种抗灾难性遗忘的大型语言模型训练的新型方法

链接: https://arxiv.org/abs/2407.00875
作者: Tianhao Li,Shangjie Li,Binbin Xie,Deyi Xiong,Baosong Yang
关键词: Conventional Continual Training, leaving a disparity, advent of large, predominantly catered, Continual Training
中文关键词: 传统的持续培训,留下了差距,大规模的、主要有迎合性的持续培训的出现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model’s original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model’s learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model’s original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model’s integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.
摘要:大型语言模型(LLM)的出现主要服务于高资源语言,导致低资源语言在性能上存在差距。用于弥补这一差距的传统持续训练(CT)方法在扩展到多语言场景时,往往会损害模型原有的语言能力。针对这一问题,我们提出了一种新颖的MoE-CT架构,这一范式创新性地将基础模型的学习与多语言扩展过程分离开来。我们的设计冻结了原始的LLM参数,从而保护了它在高资源语言上的性能,而附加的MoE模块在多种语言数据集上训练,增强了低资源语言能力。实验表明,我们的方法明显优于传统的CT方法,在不牺牲模型原有语言性能的情况下,在多语言基准上取得了显著提升。此外,我们的MoE-CT框架表现出更强的抗遗忘能力和更优的迁移学习能力。通过保持基础模型的完整性并专注于有策略的参数扩展,我们的方法推进了多语言建模,是将低资源语言纳入LLM的重要一步,为语言技术的未来研究指明了一个富有成效的方向。

[NLP-92] Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles
[NLP-92] Roleplay-doh:使领域专家能够通过引出并遵循原则来创建LLM模拟患者

链接: https://arxiv.org/abs/2407.00870
作者: Ryan Louie(1),Ananjan Nandi(1),William Fang(1),Cheng Chang(1),Emma Brunskill(1),Diyi Yang(1) ((1) Stanford University)
关键词: Recent works leverage, realistic social scenarios, works leverage LLMs, Recent works, roleplay realistic social
中文关键词: 最近的作品利用、现实的社会场景、作品利用LLM、最近的作品、角色扮演现实的社会
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 34 pages, 24 figures, 11 Tables

点击查看摘要

Abstract:Recent works leverage LLMs to roleplay realistic social scenarios, aiding novices in practicing their social skills. However, simulating sensitive interactions, such as in mental health, is challenging. Privacy concerns restrict data access, and collecting expert feedback, although vital, is laborious. To address this, we develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative feedback from a domain-expert, which is transformed into a set of principles, or natural language rules, that govern an LLM-prompted roleplay. We apply this pipeline to enable senior mental health supporters to create customized AI patients for simulated practice partners for novice counselors. After uncovering issues in GPT-4 simulations not adhering to expert-defined principles, we also introduce a novel principle-adherence prompting pipeline which shows 30% improvements in response quality and principle following for the downstream task. Via a user study with 25 counseling experts, we demonstrate that the pipeline makes it easy and effective to create AI patients that more faithfully resemble real patients, as judged by creators and third-party counselors.
摘要:最近的研究利用LLM扮演现实的社交场景,帮助新手练习社交技能。然而,模拟敏感的互动(例如心理健康领域)具有挑战性。隐私问题限制了数据的访问,而收集专家反馈虽然至关重要,却很费力。为了解决这个问题,我们开发了Roleplay-doh,这是一种新颖的人与LLM协作流程,它从领域专家那里获取定性反馈,并将其转换为一组原则(即自然语言规则),用于约束由LLM提示驱动的角色扮演。我们应用这一流程,使资深心理健康支持者能够创建定制的人工智能患者,作为新手咨询师的模拟练习伙伴。在发现GPT-4模拟不遵循专家定义的原则的问题后,我们还引入了一种新的原则遵循提示流程,该流程在下游任务的响应质量和原则遵循方面带来了30%的改进。通过一项有25名咨询专家参与的用户研究,我们证明,根据创建者和第三方咨询师的判断,该流程可以轻松有效地创建更接近真实患者的人工智能患者。

[NLP-93] Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
[NLP-93] 大型语言模型是不自愿的真话者:利用谬误失败进行越狱攻击

链接: https://arxiv.org/abs/2407.00869
作者: Yue Zhou,Henry Peng Zou,Barbara Di Eugenio,Yang Zhang
关键词: difficulties generating fallacious, difficulties generating, deceptive reasoning, language models, language
中文关键词: 产生谬误的困难、产生困难、欺骗性推理、语言模型、语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits an aligned language model for malicious output. Specifically, we query the model to generate a fallacious yet deceptively real procedure for the harmful behavior. Since a fallacious procedure is generally considered fake and thus harmless by LLMs, it helps bypass the safeguard mechanism. Yet the output is factually harmful since the LLM cannot fabricate fallacious solutions but proposes truthful ones. We evaluate our approach over five safety-aligned large language models, comparing four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, such as self-verification and hallucination.
摘要:我们发现语言模型很难生成谬误性和欺骗性的推理。当被要求生成欺骗性输出时,语言模型往往会泄露真实的对应内容,却认为它们是错误的。利用这一缺陷,我们提出了一种越狱攻击方法,诱使已对齐的语言模型产生恶意输出。具体地说,我们要求模型为有害行为生成一个谬误但看似真实的过程。由于谬误的过程通常被LLM认为是虚假的,因而是无害的,这有助于绕过安全防护机制。然而,输出实际上是有害的,因为LLM无法编造谬误的方案,而是给出了真实的方案。我们在五个经过安全对齐的大型语言模型上评估了我们的方法,并与之前的四种越狱方法进行了比较,结果表明我们的方法取得了具有竞争力的性能,且输出的有害性更高。我们认为,这些发现可以扩展到模型安全之外的领域,例如自我验证和幻觉。

[NLP-94] Towards Robust Speech Representation Learning for Thousands of Languages
[NLP-94] 面向数千种语言的稳健语音表示学习

链接: https://arxiv.org/abs/2407.00837
作者: William Chen,Wangyou Zhang,Yifan Peng,Xinjian Li,Jinchuan Tian,Jiatong Shi,Xuankai Chang,Soumi Maiti,Karen Livescu,Shinji Watanabe
关键词: Self-supervised learning, helped extend speech, extend speech technologies, helped extend, Self-supervised
中文关键词: 自我监督学习,帮助扩展语音,扩展语音技术,帮助扩展,自我监督
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 20 pages

Abstract:Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world’s 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are found in this https URL.
摘要:自监督学习(SSL)通过减少对标注数据的需求,帮助将语音技术扩展到更多的语言。然而,现有模型还远远不能支持世界上7000多种语言。我们提出了XEUS,这是一种用于通用语音的跨语言编码器,在4057种语言的100多万小时数据上进行了训练,将SSL模型的语言覆盖范围扩大了4倍。我们将现有可公开访问的语料库中的100万小时语音与新创建的来自4057种语言的7400多小时语料库结合在一起,该语料库将公开发布。为了处理多语言语音数据的多样条件,我们用一个新的去混响目标来增强典型的SSL掩码预测方法,提高了鲁棒性。我们在多个基准测试中对XEUS进行了评估,结果表明,在各种任务中,XEUS的性能始终优于最先进的(SOTA)SSL模型,或者取得了与之相当的结果。XEUS在ML-SUPERB基准上创造了新的SOTA:尽管参数或预训练数据较少,但它的性能分别比MMS 1B和w2v-BERT 2.0 v2高0.8%和4.4%。检查点、代码和数据见此 https URL。

[NLP-95] NAIST Simultaneous Speech Translation System for IWSLT 2024
[NLP-95] NAIST面向IWSLT 2024的同步语音翻译系统

链接: https://arxiv.org/abs/2407.00826
作者: Yuka Ko,Ryo Fukuda,Yuta Nishikawa,Yasumasa Kano,Tomoya Yanagita,Kosuke Doi,Mana Makinae,Haotian Tan,Makoto Sakai,Sakriani Sakti,Katsuhito Sudoh,Satoshi Nakamura
关键词: paper describes NAIST, describes NAIST submission, Evaluation Campaign, describes NAIST, NAIST submission
中文关键词: 论文描述了NAIST,描述了NAIST提交,评估活动,描述了NAIST,NAIST提交
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: IWSLT 2024 system paper

Abstract:This paper describes NAIST’s submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-German, Japanese, Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
摘要:本文描述了NAIST提交给IWSLT 2024评测活动同步翻译赛道的系统:英语到德语、日语、汉语的语音到文本翻译,以及英语到日语的语音到语音翻译。我们结合HuBERT和mBART两种预训练语言模型,开发了一个多语种端到端语音到文本翻译模型。我们使用Local Agreement(LA)和AlignAtt两种解码策略对该模型进行了训练。提交的模型采用LA策略,因为它在以前的模型中优于AlignAtt策略。我们的语音到语音翻译方法是上述语音到文本模型与增量文本到语音(TTS)模块的级联,该模块包含音素估计模型、并行声学模型和Parallel WaveGAN声码器。我们通过在估计模型中应用采用AlignAtt策略的Transformer架构来改进增量TTS。结果表明,升级后的TTS模块有助于提高系统性能。

[NLP-96] Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
[NLP-96] 分步控制DPO:利用分步误差进行增强数学推理

链接: https://arxiv.org/abs/2407.00782
作者: Zimu Lu,Aojun Zhou,Ke Wang,Houxing Ren,Weikang Shi,Junting Pan,Mingjie Zhan
关键词: Direct Preference Optimization, Direct Preference, Preference Optimization, large language models, SCDPO
中文关键词: 直接偏好优化、直接偏好、偏好优化、大型语言模型、SCDPO
类目: Computation and Language (cs.CL)
备注:

Abstract:Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.
摘要:直接偏好优化(DPO)已被证明能有效地提高大型语言模型(LLM)在推理和对齐等下游任务中的性能。在这项工作中,我们提出了步骤控制的DPO(SCDPO),这种方法通过创建从指定步骤开始出错的数学推理过程负样本,自动提供逐步的错误监督。通过将这些样本应用于DPO训练,SCDPO可以更好地对齐模型,使其理解推理错误并输出准确的推理步骤。我们将SCDPO应用于代码集成和思维链两类解法,实验表明,在三个不同的SFT模型(包括一个现有的SFT模型和我们微调的两个模型)上,与朴素DPO相比,SCDPO始终能提高性能。对SCDPO和DPO的信用分配的定性分析表明,SCDPO能有效识别数学解答中的错误。然后我们将SCDPO应用于InternLM2-20B模型,得到了一个20B模型,该模型在GSM8K上获得了88.5%的高分,在MATH上获得了58.1%的高分,可与所有其他开源LLM相媲美,显示了我们方法的巨大潜力。
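
As a rough illustration of the idea in this abstract, the sketch below pairs a correct solution with a negative sample that shares the first few steps and then diverges, and scores the pair with the standard DPO objective. It is only a sketch under stated assumptions: the helper names (`make_step_controlled_pair`, `dpo_loss`), the toy log-probabilities, and `beta=0.1` are hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def make_step_controlled_pair(correct_steps, erroneous_tail, error_step):
    """Hypothetical helper: build a (chosen, rejected) pair.

    The rejected sample keeps the first `error_step` correct steps and
    then continues with an erroneous tail, so the first error appears
    at a known step index.
    """
    chosen = list(correct_steps)
    rejected = list(correct_steps[:error_step]) + list(erroneous_tail)
    return chosen, rejected


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and under
    the frozen reference model; beta is an assumed hyperparameter.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Toy usage with made-up steps and log-probabilities.
chosen, rejected = make_step_controlled_pair(
    ["2 + 3 = 5", "5 * 4 = 20", "answer: 20"],
    ["5 * 4 = 24", "answer: 24"],
    error_step=1,
)
loss = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that once the policy prefers the chosen sample more strongly than the reference does.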

[NLP-97] Characterizing Stereotypical Bias from Privacy-preserving Pre-Training
[NLP-97] 刻画隐私保护预训练带来的刻板印象偏见

链接: https://arxiv.org/abs/2407.00764
作者: Stefan Arnold,Rene Gröbner,Annika Schreiner
关键词: Differential Privacy, embedding space, applied to raw, exploiting the spatial, spatial arrangement
中文关键词: 差异隐私,嵌入空间,应用于原始,利用空间,空间安排
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Differential Privacy (DP) can be applied to raw text by exploiting the spatial arrangement of words in an embedding space. We investigate the implications of such text privatization on Language Models (LMs) and their tendency towards stereotypical associations. Since previous studies documented that linguistic proficiency correlates with stereotypical bias, one could assume that techniques for text privatization, which are known to degrade language modeling capabilities, would cancel out undesirable biases. By testing BERT models trained on texts containing biased statements primed with varying degrees of privacy, our study reveals that while stereotypical bias generally diminishes when privacy is tightened, text privatization does not uniformly equate to diminishing bias across all social domains. This highlights the need for careful diagnosis of bias in LMs that undergo text privatization.
摘要:通过利用嵌入空间中单词的空间排列,差分隐私(DP)可以应用于原始文本。我们研究了这种文本隐私化对语言模型(LM)及其刻板印象联想倾向的影响。由于之前的研究表明语言能力与刻板印象偏见相关,人们可能会假设文本隐私化技术(已知会降低语言建模能力)将消除不良偏见。通过测试在包含偏见陈述、并经过不同程度隐私化处理的文本上训练的BERT模型,我们的研究表明,虽然刻板印象偏见通常会在隐私收紧时减少,但文本隐私化并不意味着所有社会领域的偏见都会同等程度地减少。这凸显了对经过文本隐私化处理的LM进行细致偏见诊断的必要性。

[NLP-98] A Comparative Study of Quality Evaluation Methods for Text Summarization
[NLP-98] 文本摘要质量评价方法的比较研究

链接: https://arxiv.org/abs/2407.00747
作者: Huyen Nguyen,Haihua Chen,Lavanya Pobbathi,Junhua Ding
关键词: natural language processing, Evaluating text summarization, challenging task, task in natural, NLP
中文关键词: 自然语言处理、评估文本摘要、具有挑战性的任务、自然任务、NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The paper is under review at Empirical Methods in Natural Language Processing (EMNLP) 2024. It has 15 pages and 4 figures

Abstract:Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics which heavily rely on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study on eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets with patent documents. Our results show that LLM evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose an LLM-powered framework for automatically evaluating and improving text summarization, which is beneficial and could attract wide attention among the community.
摘要:评估文本摘要一直是自然语言处理(NLP)领域一项具有挑战性的任务。严重依赖参考摘要的自动指标在许多情况下并不适用,而人工评估既耗时又费力。为了弥补这一差距,本文提出了一种基于大型语言模型(LLM)的文本摘要评估新方法。我们还对八个自动指标、人工评估和我们提出的基于LLM的方法进行了比较研究,并评估了七种不同类型的最先进(SOTA)摘要模型。我们在专利文档数据集上进行了广泛的实验和分析。结果表明,LLM评估与人工评估高度一致,而ROUGE-2、BERTScore和SummaC等广泛使用的自动指标则并非如此,且缺乏一致性。在实证比较的基础上,我们提出了一个由LLM驱动的框架,用于自动评估和改进文本摘要,该框架具有实用价值,有望引起社区的广泛关注。

[NLP-99] AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations
[NLP-99] AIMDiT:通过多模态维度变换进行模态增强与交互,用于对话中的情感识别

链接: https://arxiv.org/abs/2407.00743
作者: Sheng Wu,Jiaxing Liu,Longbiao Wang,Dongxiao He,Xiaobao Wang,Jianwu Dang
关键词: natural language processing, Recognition in Conversations, Emotion Recognition, speaker in conversations, language processing
中文关键词: 自然语言处理、对话中的识别、情感识别、对话中的说话者、语言处理
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

Abstract:Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network which performs rich representation learning through dimension transformation of different modalities and parameter-efficient inception block. On the other hand, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.
摘要:对话中的情感识别(ERC)是自然语言处理领域的一项热门任务,旨在识别对话中说话者的情感状态。虽然目前的研究主要强调上下文建模,但对有效的多模态融合方法的研究很少。我们提出了一种名为AIMDiT的新型框架来解决深度特征的多模态融合问题。具体来说,我们设计了一个模态增强网络,它通过不同模态的维度变换和参数高效的Inception块来进行丰富的表示学习。另一方面,模态交互网络对提取的模态间特征和模态内特征进行交互融合。使用我们的AIMDiT框架在公共基准数据集MELD上进行的实验显示,与最先进的(SOTA)模型相比,Acc-7和w-F1指标分别提高了2.34%和2.87%。

[NLP-100] LocateEdit: Energy-based Text Editing for Efficient Flexible and Faithful Controlled Text Generation
[NLP-100] LocateEdit:基于能量的文本编辑,用于高效、灵活且忠实的受控文本生成

链接: https://arxiv.org/abs/2407.00740
作者: Hye Ryung Son,Jay-Yoon Lee
关键词: Recent approaches, base language models, decoding time, base, approaches to controlled
中文关键词: 最近的方法、基础语言模型、解码时间、基础、控制的方法
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 2 figures

Abstract:Recent approaches to controlled text generation (CTG) often involve manipulating the weights or logits of base language models (LMs) at decoding time. However, these methods are inapplicable to latest black-box LMs and ineffective at preserving the core semantics of the base LM’s original generations. In this work, we propose LocateEdit(LE), an efficient and flexible energy-based approach to CTG, which edits text outputs from a base LM using off-the-shelf energy models. Given text outputs from the base LM, LE first locates spans that are most relevant to constraints (e.g., toxicity) utilizing energy models, and then edits these spans by replacing them with more suitable alternatives. Importantly, our method is compatible with black-box LMs, as it requires only the text outputs. Also, since LE doesn’t mandate specific architecture for its component models, it can work with a diverse combination of available off-the-shelf models. Moreover, LE preserves the base LM’s original generations, by selectively modifying constraint-related aspects of the texts and leaving others unchanged. These targeted edits also ensure that LE operates efficiently. Our experiments confirm that LE achieves superior semantic preservation of the base LM generations and speed, while simultaneously obtaining competitive or improved constraint satisfaction. Furthermore, we analyze how the granularity of energy distribution impacts CTG performance and find that fine-grained, regression-based energy models improve constraint satisfaction, compared to conventional binary classifier energy models.
摘要:最近的受控文本生成(CTG)方法通常需要在解码时操纵基础语言模型(LM)的权重或logits。然而,这些方法不适用于最新的黑盒LM,也不能有效保留基础LM原始生成结果的核心语义。在这项工作中,我们提出了LocateEdit(LE),这是一种高效而灵活的基于能量的CTG方法,它使用现成的能量模型来编辑基础LM的文本输出。给定基础LM的文本输出,LE首先利用能量模型定位与约束(例如毒性)最相关的片段,然后用更合适的替代内容替换这些片段来完成编辑。重要的是,我们的方法与黑盒LM兼容,因为它只需要文本输出。此外,由于LE不要求其组件模型采用特定的架构,它可以与各种现成模型的组合配合使用。此外,LE有选择地修改文本中与约束相关的方面并保持其他方面不变,从而保留了基础LM的原始生成结果。这些有针对性的编辑还确保LE高效运行。我们的实验证实,LE在基础LM生成结果的语义保留和速度方面表现更优,同时获得了有竞争力或更好的约束满足度。此外,我们分析了能量分布的粒度对CTG性能的影响,发现与传统的二元分类器能量模型相比,基于回归的细粒度能量模型提高了约束满足度。

[NLP-101] Large Language Models Struggle in Token-Level Clinical Named Entity Recognition
[NLP-101] 大型语言模型在词元级临床命名实体识别中表现不佳

链接: https://arxiv.org/abs/2407.00731
作者: Qiuhao Lu,Rui Li,Andrew Wen,Jinlian Wang,Liwei Wang,Hongfang Liu
关键词: Large Language Models, Large Language, Named Entity Recognition, Language Models, token-level NER
中文关键词: 大型语言模型、大型语言、命名实体识别、语言模型、标记级NER
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: AMIA 2024 Annual Symposium Proceedings

Abstract:Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.
摘要:大型语言模型(LLM)已经给各个领域带来了革命性的变化,包括医疗保健领域,它们在其中被用于多种应用。在罕见疾病的背景下,它们的效用尤其重要,因为数据的稀缺性、复杂性和特殊性构成了相当大的挑战。在临床领域,命名实体识别(NER)是一项基本任务,在从临床文本中提取相关信息方面发挥着至关重要的作用。尽管LLM前景看好,但目前的研究主要集中在文档级NER上,即在整个文档的更一般上下文中识别实体,而不提取它们的准确位置。此外,也有工作致力于使ChatGPT适应词元级NER。然而,在将词元级NER用于临床文本方面,特别是在使用本地开源LLM时,存在着显著的研究空白。这项研究旨在通过考察专有和本地LLM在词元级临床NER中的有效性来弥合这一差距。具体而言,我们通过一系列实验深入研究了这些模型的能力,这些实验涉及零样本提示、少样本提示、检索增强生成(RAG)和指令微调。我们的探索揭示了LLM在词元级NER中面临的内在挑战,特别是在罕见疾病的背景下,并为其在医疗保健中的应用提出了可能的改进建议。这项研究有助于缩小医疗保健信息学中的一个重大差距,并提供了可能促使LLM在医疗保健领域得到更精细应用的见解。

[NLP-102] Scaling Technology Acceptance Analysis with Large Language Model (LLM) Annotation Systems
[NLP-102] 使用大型语言模型(LLM)注释系统进行扩展技术接受度分析

链接: https://arxiv.org/abs/2407.00702
作者: Pawel Robert Smolinski,Joseph Januszewicz,Jacek Winiarski
关键词: models effectively predict, effectively predict, acceptance models effectively, technology products, Technology
中文关键词: 模型有效预测、有效预测、有效接受模型、技术产品、技术
类目: Computation and Language (cs.CL)
备注: This is a preprint of a paper accepted for the 32nd International Conference on Information Systems Development (ISD 2024), Gdansk, Poland

Abstract:Technology acceptance models effectively predict how users will adopt new technology products. Traditional surveys, often expensive and cumbersome, are commonly used for this assessment. As an alternative to surveys, we explore the use of large language models for annotating online user-generated content, like digital reviews and comments. Our research involved designing an LLM annotation system that transforms reviews into structured data based on the Unified Theory of Acceptance and Use of Technology (UTAUT) model. We conducted two studies to validate the consistency and accuracy of the annotations. Results showed moderate-to-strong consistency of LLM annotation systems, improving further by lowering the model temperature. LLM annotations achieved close agreement with human expert annotations and outperformed the agreement between experts for UTAUT variables. These results suggest that LLMs can be an effective tool for analyzing user sentiment, offering a practical alternative to traditional survey methods and enabling deeper insights into technology design and adoption.
摘要:技术接受模型能有效预测用户将如何采用新技术产品。这种评估通常采用传统的问卷调查,而问卷调查往往既昂贵又繁琐。作为调查的替代方案,我们探索使用大型语言模型来标注在线用户生成的内容,如数字评价和评论。我们的研究设计了一个LLM标注系统,该系统基于技术接受与使用统一理论(UTAUT)模型将评论转换为结构化数据。我们进行了两项研究,以验证标注的一致性和准确性。结果表明,LLM标注系统具有中等到较强的一致性,且降低模型温度可进一步提高一致性。LLM标注与人类专家标注高度一致,并且在UTAUT变量上超过了专家之间的一致程度。这些结果表明,LLM可以成为分析用户情绪的有效工具,为传统调查方法提供了一种实用的替代方案,并有助于更深入地了解技术设计和采用。

[NLP-103] BAPO: Base-Anchored Preference Optimization for Personalized Alignment in Large Language Models
[NLP-103] BAPO:大型语言模型中个性化对齐的基本锚定偏好优化

链接: https://arxiv.org/abs/2407.00693
作者: Gihun Lee,Minchan Jeong,Yujin Kim,Hojung Jung,Jaehoon Oh,Sangmook Kim,Se-Young Yun
关键词: align Large Language, Large Language Models, Large Language, shown remarkable success, align Large
中文关键词: 对齐大型语言,大型语言模型,大型语言,表现出显着的成功,对齐大型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: under review

Abstract:While learning to align Large Language Models (LLMs) with human preferences has shown remarkable success, aligning these models to meet the diverse user preferences presents further challenges in preserving previous knowledge. This paper examines the impact of personalized preference optimization on LLMs, revealing that the extent of knowledge loss varies significantly with preference heterogeneity. Although previous approaches have utilized the KL constraint between the reference model and the policy model, we observe that they fail to maintain general knowledge and alignment when facing personalized preferences. To this end, we introduce Base-Anchored Preference Optimization (BAPO), a simple yet effective approach that utilizes the initial responses of reference model to mitigate forgetting while accommodating personalized alignment. BAPO effectively adapts to diverse user preferences while minimally affecting global knowledge or general alignment. Our experiments demonstrate the efficacy of BAPO in various setups.
摘要:虽然让大型语言模型(LLM)与人类偏好对齐的学习已经取得了显著的成功,但让这些模型对齐到多样化的用户偏好,在保留已有知识方面提出了进一步的挑战。本文考察了个性化偏好优化对LLM的影响,发现知识损失的程度随偏好异质性的不同而显著变化。虽然以前的方法利用了参考模型和策略模型之间的KL约束,但我们观察到,它们在面对个性化偏好时无法保持通用知识和对齐效果。为此,我们引入了基本锚定偏好优化(BAPO),这是一种简单但有效的方法,它利用参考模型的初始响应来缓解遗忘,同时适应个性化对齐。BAPO能有效适应多样化的用户偏好,同时将对全局知识或通用对齐的影响降到最低。我们的实验证明了BAPO在各种设置中的有效性。

[NLP-104] HRDE: Retrieval-Augmented Large Language Models for Chinese Health Rumor Detection and Explainability
[NLP-104] HRDE:用于中国健康谣言检测和解释的检索增强大语言模型

链接: https://arxiv.org/abs/2407.00668
作者: Yanfang Chen,Ding Chen,Shichao Song,Simin Niu,Hanyu Wang,Zeyun Tang,Feiyu Xiong,Zhiyu Li
关键词: people increasingly prioritize, health information, health, Chinese health, health information dissemination
中文关键词: 人们越来越重视健康信息、健康、中国健康、健康信息传播
类目: Computation and Language (cs.CL)
备注:

Abstract:As people increasingly prioritize their health, the speed and breadth of health information dissemination on the internet have also grown. At the same time, the presence of false health information (health rumors) intermingled with genuine content poses a significant potential threat to public health. However, current research on Chinese health rumors still lacks a large-scale, public, and open-source dataset of health rumor information, as well as effective and reliable rumor detection methods. This paper addresses this gap by constructing a dataset containing 1.12 million health-related rumors (HealthRCN) through web scraping of common health-related questions and a series of data processing steps. HealthRCN is the largest known dataset of Chinese health information rumors to date. Based on this dataset, we propose retrieval-augmented large language models for Chinese health rumor detection and explainability (HRDE). This model leverages retrieved relevant information to accurately determine whether the input health information is a rumor and provides explanatory responses, effectively aiding users in verifying the authenticity of health information. In evaluation experiments, we compared multiple models and found that HRDE outperformed them all, including GPT-4-1106-Preview, in rumor detection accuracy and answer quality. HRDE achieved an average accuracy of 91.04% and an F1 score of 91.58%.
摘要:随着人们越来越重视自己的健康,互联网上健康信息传播的速度和广度也在增长。与此同时,虚假健康信息(健康谣言)与真实内容交织在一起,对公众健康构成了重大的潜在威胁。然而,目前对中文健康谣言的研究还缺乏大规模、公开、开源的健康谣言信息数据集,以及有效可靠的谣言检测方法。本文通过对常见健康相关问题的网络抓取和一系列数据处理步骤,构建了一个包含112万条健康相关谣言的数据集(HealthRCN),以填补这一空白。HealthRCN是迄今为止已知的最大的中文健康信息谣言数据集。基于这个数据集,我们提出了用于中文健康谣言检测和可解释性(HRDE)的检索增强大型语言模型。该模型利用检索到的相关信息准确地判断输入的健康信息是否为谣言,并提供解释性回应,有效地帮助用户验证健康信息的真实性。在评估实验中,我们比较了多个模型,发现HRDE在谣言检测准确率和答案质量方面都优于包括GPT-4-1106-Preview在内的所有模型。HRDE的平均准确率为91.04%,F1分数为91.58%。

[NLP-105] Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
[NLP-105] Chain-of-Knowledge(知识链):通过从知识图谱中学习将知识推理集成到大型语言模型中

链接: https://arxiv.org/abs/2407.00653
作者: Yifei Zhang,Xintao Wang,Jiaqing Liang,Sirui Xia,Lida Chen,Yanghua Xiao
关键词: Large Language Models, natural language processing, Large Language, exhibited impressive proficiency, involve increasingly complex
中文关键词: 大型语言模型、自然语言处理、大型语言,表现出令人印象深刻的熟练程度,涉及的内容越来越复杂
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large Language Models (LLMs) have exhibited impressive proficiency in various natural language processing (NLP) tasks, which involve increasingly complex reasoning. Knowledge reasoning, a primary type of reasoning, aims at deriving new knowledge from existing knowledge. While it has been widely studied in the context of knowledge graphs (KGs), knowledge reasoning in LLMs remains underexplored. In this paper, we introduce Chain-of-Knowledge (CoK), a comprehensive framework for knowledge reasoning, including methodologies for both dataset construction and model learning. For dataset construction, we create KnowReason via rule mining on KGs. For model learning, we observe rule overfitting induced by naive training. Hence, we enhance CoK with a trial-and-error mechanism that simulates the human process of internal knowledge exploration. We conduct extensive experiments with KnowReason. Our results show the effectiveness of CoK in refining LLMs in not only knowledge reasoning, but also general reasoning benchmarks.
摘要:大型语言模型(LLM)在各种自然语言处理(NLP)任务中表现出令人印象深刻的能力,这些任务涉及越来越复杂的推理。知识推理是一种主要的推理类型,其目的是从已有的知识中推导出新的知识。虽然知识推理在知识图谱(KG)的背景下已经得到了广泛的研究,但LLM中的知识推理仍未得到充分探索。在本文中,我们介绍了Chain-of-Knowledge(CoK),一个全面的知识推理框架,包括数据集构建和模型学习两方面的方法。对于数据集构建,我们通过在KG上进行规则挖掘来创建KnowReason。对于模型学习,我们观察到朴素训练会导致规则过拟合。因此,我们用一种模拟人类内部知识探索过程的试错机制来增强CoK。我们用KnowReason进行了广泛的实验。结果表明,CoK不仅在知识推理方面,而且在通用推理基准上,都能有效地改进LLM。

[NLP-106] LegalTurk Optimized BERT for Multi-Label Text Classification and NER
[NLP-106] LegalTurk:针对多标签文本分类和NER优化的BERT

链接: https://arxiv.org/abs/2407.00648
作者: Farnaz Zeidi,Mehmet Fatih Amasyali,Çiğdem Erol
关键词: Transformer neural network, Transformer neural, legal Turkish domain, BERT, neural network
中文关键词: Transformer神经网络,Transformer神经,土耳其法律领域,BERT,神经网络
类目: Computation and Language (cs.CL)
备注:

Abstract:The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT’s impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts are focusing on improving BERT’s performance in English and in general domains, with no study specifically addressing the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce our innovative modified pre-training approach by combining diverse masking strategies. In the fine-tuning task, we focus on two essential downstream tasks in the legal domain: named entity recognition (NER) and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models to compare their performance. Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our innovative approach, despite being pre-trained on a smaller corpus, competes with BERTurk.
摘要:Transformer神经网络的引入,以及自监督预训练和迁移学习等技术,为BERT这样的先进模型铺平了道路。尽管BERT的表现令人印象深刻,但仍有进一步提升的空间。据我们所知,大多数工作都集中在提高BERT在英语和通用领域的表现上,还没有专门针对土耳其法律领域的研究。我们的研究主要致力于通过修改预训练阶段,在土耳其法律领域内增强BERT模型。在这项工作中,我们结合多种掩码策略,提出了创新的改进预训练方法。在微调阶段,我们重点关注法律领域中两个重要的下游任务:命名实体识别和多标签文本分类。为了评估改进后的预训练方法,我们对所有定制模型以及原始BERT模型进行了微调,以比较它们的性能。与原始BERT模型相比,我们的改进方法在NER和多标签文本分类任务上都有显著的提升。最后,为了展示所提出模型的效果,我们用不同规模的语料库训练了最佳模型,并将它们与BERTurk模型进行了比较。实验结果表明,尽管我们的创新方法是在较小的语料库上预训练的,但仍可与BERTurk相抗衡。

[NLP-107] A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy
[NLP-107] 一种基于搭配的方法,用于应对词级度量差分隐私中的挑战

链接: https://arxiv.org/abs/2407.00638
作者: Stephen Meisenbacher,Maulik Chevli,Florian Matthes
关键词: proposed mechanism operates, Differential Privacy, NLP must distinguish, Differential Privacy approaches
中文关键词: 拟议的机制运行,差分隐私,NLP必须区分,差分隐私方法
类目: Computation and Language (cs.CL)
备注: 13 pages, 2 figures, 9 tables. Accepted to PrivateNLP 2024

Abstract:Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic level on which a proposed mechanism operates, often taking the form of word-level or document-level privatization. Recently, several word-level Metric Differential Privacy approaches have been proposed, which rely on this generalized DP notion for operating in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence- or document-level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating between the word and sentence levels, namely with collocations. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
摘要:差分隐私(DP)在NLP中的应用必须区分所提出的机制运行于哪个句法层级,通常表现为词级或文档级隐私化。最近,已经提出了几种词级度量差分隐私(Metric Differential Privacy)方法,它们依赖这一广义DP概念在词嵌入空间中操作。然而,这些方法往往无法产生语义连贯的文本输出,并且只能通过对单词扰动进行简单组合才能应用于句子或文档级别。在这项工作中,我们尝试在单词和句子层级之间进行操作,即使用搭配(collocations)来应对这些挑战。通过扰动n-gram而不是单个单词,我们设计了一种方法,使组合后的隐私化输出具有更高的语义连贯性和可变的长度。这是通过构建基于频繁出现的词组的嵌入模型来实现的,在该模型中,一元词与二元和三元搭配共存。我们在效用和隐私测试中对我们的方法进行了评估,结果有力地说明了超越词级的分词策略的价值。
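
To make the perturbation step concrete, here is a minimal sketch of a metric-DP style mechanism over an embedding table that mixes unigrams and collocations: add noise to a term's vector and return the nearest vocabulary entry. The noise sampler (uniform direction, Gamma-distributed radius) is a scheme commonly used in the word-level metric DP literature and is assumed here, not confirmed as this paper's mechanism. The toy embedding table `TOY_EMBEDDINGS` and the function names are hypothetical.

```python
import math
import random

# Hypothetical 2-d embeddings; unigrams and collocations share one space.
TOY_EMBEDDINGS = {
    "new": [0.9, 0.1],
    "york": [0.2, 0.8],
    "new york": [0.6, 0.6],
    "big city": [0.5, 0.7],
    "small town": [-0.4, 0.3],
}


def sample_noise(dim, epsilon, rng):
    """Sample noise with density proportional to exp(-epsilon * ||z||).

    Direction: uniform on the unit sphere (normalized Gaussian draw).
    Radius: Gamma(shape=dim, scale=1/epsilon).
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * x / norm for x in direction]


def privatize(term, embeddings, epsilon, rng):
    """Perturb the term's embedding and map back to the nearest entry."""
    vector = embeddings[term]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    return min(embeddings, key=lambda t: math.dist(embeddings[t], noisy))


rng = random.Random(0)
replacement = privatize("new york", TOY_EMBEDDINGS, epsilon=1.0, rng=rng)
```

Smaller `epsilon` values yield larger noise and therefore more frequent substitutions; because collocations are vocabulary entries in their own right, a single perturbation can replace a whole n-gram with a term of different length.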

[NLP-108] DP-MLM: Differentially Private Text Rewriting Using Masked Language Models
[NLP-108] DP-MLM:使用掩码语言模型进行差分隐私文本重写

链接: https://arxiv.org/abs/2407.00637
作者: Stephen Meisenbacher,Maulik Chevli,Juraj Vladika,Florian Matthes
关键词: privatization using Differential, Differential Privacy, Differential, text, language models
中文关键词: 使用差异、差异隐私、差异、文本、语言模型的私有化
类目: Computation and Language (cs.CL)
备注: 15 pages, 2 figures, 8 tables. Accepted to ACL 2024 (Findings)

Abstract:The task of text privatization using Differential Privacy has recently taken the form of text rewriting, in which an input text is obfuscated via the use of generative (large) language models. While these methods have shown promising results in the ability to preserve privacy, these methods rely on autoregressive models which lack a mechanism to contextualize the private rewriting process. In response to this, we propose DP-MLM, a new method for differentially private text rewriting based on leveraging masked language models (MLMs) to rewrite text in a semantically similar and obfuscated manner. We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time. We find that utilizing encoder-only MLMs provides better utility preservation at lower ε levels, as compared to previous methods relying on larger models with a decoder. In addition, MLMs allow for greater customization of the rewriting mechanism, as opposed to generative approaches. We make the code for DP-MLM public and reusable, found at this https URL.
摘要:使用差异隐私的文本私有化任务最近采用了文本重写的形式,即通过使用生成(大)语言模型来混淆输入文本。虽然这些方法在保护隐私方面取得了令人振奋的结果,但这些方法依赖于自回归模型,该模型缺乏一种将私人重写过程与上下文关联的机制。针对这一问题,我们提出了一种新的基于掩蔽语言模型(MLMS)的差异隐私文本重写方法我们使用一种简单的上下文化技术来实现这一点,即一次重写一个文本标记。我们发现,与以前依赖于带有解码器的较大模型的方法相比,仅使用编码器的MLMS在较低的水平上提供了更好的效用保持性。此外,与生成性方法相比,MLM允许对重写机制进行更大程度的定制。我们将在此https URL中找到的\textbfdp-mlm代码公开并可重复使用。

[NLP-109] CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations
[NLP-109] CAMON:具有基于LLM的对话的多对象导航合作代理

Link: https://arxiv.org/abs/2407.00632
Authors: Pengying Wu, Yao Mu, Kangjie Zhou, Ji Ma, Junting Chen, Chang Liu
Keywords: Visual navigation tasks, Visual navigation, household service robots, Visual, service robots
Chinese keywords: 视觉导航任务,视觉导航,家庭服务机器人,视觉,服务机器人
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: Accepted to the RSS 2024 Workshop: GROUND

Abstract:Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative to ensure successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in household scenarios, specifically in the use of multiple agents collaborating to complete complex navigation tasks through communication, remains unexplored. Therefore, this paper proposes a framework for decentralized multi-agent navigation, leveraging LLM-enabled communication and collaboration. By designing the communication-triggered dynamic leadership organization structure, we achieve faster team consensus with fewer communication instances, leading to better navigation effectiveness and collaborative exploration efficiency. With the proposed novel communication scheme, our framework promises to be conflict-free and robust in multi-object navigation tasks, even when there is a surge in team size.
摘要:视觉导航任务是家用服务机器人的关键任务。随着这些任务变得越来越复杂,多个机器人之间的有效沟通和协作变得至关重要,以确保成功完成。近年来,大型语言模型(LLM)在具身智能体的背景下表现出了显著的理解和规划能力。然而,它们在家庭场景中的应用,特别是在使用多个代理协作通过通信完成复杂的导航任务方面,仍未得到探索。因此,本文提出了一种分布式多智能体导航的框架,利用LLM支持的通信和协作。通过设计沟通触发的动态领导组织结构,以较少的沟通实例实现更快的团队共识,从而获得更好的导航效果和协同探索效率。通过提出的新的通信方案,我们的框架在多目标导航任务中保证了无冲突和健壮性,即使在团队规模激增的情况下也是如此。

[NLP-110] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
[NLP-110] 迭代纳什策略优化:通过无悔学习将LLM与一般偏好对齐

Link: https://arxiv.org/abs/2407.00617
Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
Keywords: achieved great success, aligning large language, large language models, Human Feedback, Reinforcement Learning
Chinese keywords: 取得了巨大成功,调整大型语言、大型语言模型、人类反馈、强化学习
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
摘要:基于人类反馈的强化学习(RLHF)在使大型语言模型(LLM)与人类偏好对齐方面取得了巨大的成功。流行的RLHF方法是基于奖励的,遵循Bradley-Terry(BT)模型假设,而该假设可能无法完全捕捉人类偏好的复杂性。在本文中,我们在一般偏好框架下探讨RLHF,并从博弈论的角度加以研究。具体地说,我们将该问题表述为一个双人博弈,并提出了一种新的算法:迭代纳什策略优化(INPO)。其核心思想是让策略通过无悔学习与自身对弈,从而逼近纳什策略。与以前的方法不同,INPO无需估计单个响应的期望胜率,而这种估计通常会带来较高的计算或标注成本。相反,我们引入了一个新的损失目标,可直接在偏好数据集上对其进行最小化。我们为我们的方法提供了理论分析,并通过在各种有代表性的基准上的实验证明了它的有效性。使用基于LLaMA-3-8B的SFT模型,INPO在AlpacaEval 2.0上实现了41.5%的长度控制胜率,在Arena-Hard上实现了38.3%的胜率,比BT模型假设下最先进的迭代算法[Dong et al., 2024]有了实质性的改进。此外,我们的消融研究强调了引入KL正则化对控制响应长度的好处。
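
As background for the Bradley-Terry (BT) assumption mentioned in the abstract, the model is commonly written as below. The notation ($x$ for the prompt, $y_1, y_2$ for two responses, $r$ for a scalar reward, $\sigma$ for the logistic function) is chosen here for illustration and is not quoted from the paper.

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)}
  = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)
```

Under this form every preference is explained by a single scalar reward per response, which is the restriction a general preference framework drops.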

[NLP-111] Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace
[NLP-111] 利用文本子空间高效个性化文本到图像生成

Link: https://arxiv.org/abs/2407.00608
Authors: Shian Du, Xiaotian Cheng, Qi Qian, Henglu Wei, Yi Xu, Xiangyang Ji
Keywords: attracted unprecedented attention, generating highly-personalized images, input concept dataset, textual prompt, input textual prompt
Chinese keywords: 引起前所未有的关注,生成高度个性化的图像、输入概念数据集、文本提示、输入文本提示
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Personalized text-to-image generation has attracted unprecedented attention in recent years due to its unique capability of generating highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, degrading their ability to combine with different textual prompts. Besides, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding can not only faithfully reconstruct the input image, but also significantly improve its alignment with a novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to a significant improvement in robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.
摘要:近年来,个性化文本到图像的生成受到了前所未有的关注,因为它能够利用输入的概念数据集和新颖的文本提示来生成高度个性化的图像。然而,以往的方法只关注重建任务的性能,降低了其与不同文本提示相结合的能力。此外,在高维嵌入空间中进行优化通常会导致训练过程不必要地耗时且收敛缓慢。为了解决这些问题,我们受自表达(self-expressiveness)性质的启发,提出了一种在文本子空间中探索目标嵌入的高效方法。此外,我们还提出了一种高效的选择策略来确定文本子空间的基向量。实验结果表明,学习到的嵌入不仅能忠实地重建输入图像,而且能显著提高其与新输入文本提示的对齐程度。此外,我们观察到在文本子空间中进行优化可显著提高对初始词的稳健性,从而放宽了要求用户输入最相关初始词的限制。我们的方法为个性化文本到图像生成打开了更高效的表示学习的大门。

[NLP-112] MasonTigers at SemEval-2024 Task 10: Emotion Discovery and Flip Reasoning in Conversation with Ensemble of Transformers and Prompting
[NLP-112] MasonTigers在SemEval-2024任务10:基于Transformer集成与提示的对话情感发现和翻转推理

Link: https://arxiv.org/abs/2407.00581
Authors: Al Nahian Bin Emran, Amrita Ganguly, Sadiya Sayara Chowdhury Puspo, Nishat Raihan, Dhiman Goswami
Keywords: Hindi-English code-mixed dialogues, present MasonTigers’ participation, code-mixed dialogues, Hindi-English code-mixed, emotion flip reasoning
Chinese keywords: 印度语-英语代码混合对话,呈现MasonTigers的参与,代码混合对话,印度语-英语代码混合,情感翻转推理
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we present MasonTigers’ participation in SemEval-2024 Task 10, a shared task aimed at identifying emotions and understanding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for Hindi-English code-mixed dialogues, emotion flip reasoning for Hindi-English code-mixed dialogues, and emotion flip reasoning for English dialogues. Our team, MasonTigers, contributed to each subtask, focusing on developing methods for accurate emotion recognition and reasoning. By leveraging our approaches, we attained impressive F1-scores of 0.78 for the first task and 0.79 for both the second and third tasks. This performance not only underscores the effectiveness of our methods across different aspects of the task but also secured us the top rank in the first and third subtasks, and the 2nd rank in the second subtask. Through extensive experimentation and analysis, we provide insights into our system’s performance and contributions to each subtask.
摘要:在本文中,我们介绍了MasonTigers参与SemEval-2024任务10的情况,这是一项共享任务,旨在识别情绪并理解单语英语和印地语-英语语码混合对话中情绪翻转背后的原因。这项任务包括三个不同的子任务:印地语-英语语码混合对话的对话情感识别、印地语-英语语码混合对话的情感翻转推理,以及英语对话的情感翻转推理。我们的团队MasonTigers参与了每个子任务,专注于开发准确的情感识别和推理方法。利用我们的方法,我们在第一项子任务上取得了0.78的F1分数,在第二项和第三项子任务上均取得了0.79的F1分数。这一表现不仅突出了我们的方法在任务不同方面的有效性,而且使我们在第一和第三子任务中排名第一,在第二子任务中排名第二。通过广泛的实验和分析,我们对系统的性能及其对每个子任务的贡献提供了见解。

[NLP-113] Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
[NLP-113] 调查和缓解大型视觉语言模型中的多模式幻觉滚雪球

Link: https://arxiv.org/abs/2407.00569
Authors: Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, Bing Qin
Keywords: Large Vision-Language Models, Large Vision-Language, understanding visual information, human languages, generated hallucinations
Chinese keywords: 大型视觉语言模型,大型视觉语言,理解视觉信息,人类语言,产生的幻觉
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Main Conference. 21 pages, 20 figures

Abstract:Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24% of the snowballed multimodal hallucination while maintaining capabilities.
摘要:尽管大型视觉语言模型(LVLM)在结合人类语言理解视觉信息方面取得了很大进展,但它们仍然存在多模态幻觉。一个自然的担忧是,在多模态交互过程中,已生成的幻觉可能会影响LVLM的后续生成。因此,我们提出了一个问题:当面对与先前生成的幻觉相关的查询时,即使真实的视觉信息存在,LVLM是否会被误导并做出错误的回答?为了回答这个问题,我们提出了一个名为MMHalSnowball的框架来评估LVLM在遇到已生成的幻觉时的行为,其中LVLM需要在精心构造的幻觉对话中回答特定的视觉问题。重要的是,我们的实验表明,开源LVLM的性能至少下降了31%,这表明LVLM容易接受已生成的幻觉,并做出在没有干扰时本不会支持的错误声明。我们称这种现象为多模态幻觉滚雪球。为了缓解这一问题,我们进一步提出了一种称为残差视觉解码(Residual Visual Decoding)的免训练方法,其中我们用从残差视觉输入得到的输出分布来修正LVLM的输出分布,使模型能够直接访问视觉信息。实验表明,我们的方法可以在保持能力的同时缓解超过24%的滚雪球式多模态幻觉。

[NLP-114] Answering real-world clinical questions using large language model based systems
[NLP-114] 使用基于大型语言模型的系统回答现实世界的临床问题

Link: https://arxiv.org/abs/2407.00541
Authors: Yen Sia Low(1), Michael L. Jackson(1), Rebecca J. Hyde(1), Robert E. Brown(1), Neil M. Sanghavi(1), Julian D. Baldwin(1), C. William Pike(1), Jananee Muralidharan(1), Gavin Hui(1 and 2), Natasha Alexander(3), Hadeel Hassan(3), Rahul V. Nene(4), Morgan Pike(5), Courtney J. Pokrzywa(6), Shivam Vedak(7), Adam Paul Yan(3), Dong-han Yao(7), Amy R. Zipursky(3), Christina Dinh(1), Philip Ballentine(1), Dan C. Derieg(1), Vladimir Polony(1), Rehan N. Chawdry(1), Jordan Davies(1), Brigham B. Hyde(1), Nigam H. Shah(1 and 7), Saurabh Gombar(1 and 8) ((1) Atropos Health, New York NY, USA, (2) Department of Medicine, University of California, Los Angeles CA, USA, (3) Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada, (4) Department of Emergency Medicine, University of California, San Diego CA, USA, (5) Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA, (6) Department of Surgery, Columbia University, New York NY, USA, (7) Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA, (8) Department of Pathology, Stanford University, Stanford CA, USA)
Keywords: guide healthcare decisions, contextualizing existing research, guide healthcare, healthcare decisions, difficulty in contextualizing
Chinese keywords: 指导医疗保健决策,结合现有研究的背景,指导医疗保健,医疗保健决策,背景困难
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 28 pages (2 figures, 3 tables) inclusive of 8 pages of supplemental materials (4 supplemental figures and 4 supplemental tables)

Abstract:Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.
摘要:指导医疗决策的证据往往受限于缺乏相关和可信的文献,以及难以将现有研究与特定患者联系起来。大型语言模型(LLM)可以通过总结已发表的文献或基于真实世界数据(RWD)生成新的研究来潜在地解决这两个挑战。我们评估了5个基于LLM的系统回答50个临床问题的能力,并让9名独立医生审查了回答的相关性、可靠性和可操作性。目前,通用的LLMS(ChatGPT-4、Claude 3 Opus、Gemini Pro 1.5)很少产生被认为相关和基于证据的答案(2%-10%)。相比之下,基于检索增强生成(RAG)的智能LLM系统为24%(OpenEvidence)到58%(ChatRWD)的问题提供了相关和基于证据的答案。与其他LLM相比,只有智能型ChatRWD能够回答新问题(65%比0-9%)。这些结果表明,尽管通用的LLMS不应按原样使用,但基于RAG的专门构建的证据摘要系统和用于协同工作的新证据生成系统将提高患者护理相关证据的可用性。

[NLP-115] ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees
[NLP-115] ConU:具有正确覆盖保证的大型语言模型中的保形不确定性

Link: https://arxiv.org/abs/2407.00499
Authors: Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Keywords: natural language generation, recent large language, large language models, open-ended NLG tasks, language generation
Chinese keywords: 自然语言生成、最近的大型语言、大型语言模型、开放式NLG任务、语言生成
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, 6 tables

Abstract:Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model’s unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
摘要:自然语言生成(NLG)任务中的不确定性量化(UQ)仍然是一个开放的挑战,最近的大型语言模型(LLM)的复杂性质加剧了这一挑战。通过构造预测集,将任何不确定性的启发式度量转化为严格的理论保证,研究了在开放式NLG任务中对黑箱LLMS的适形预测。我们提出了一种利用自洽的基于抽样的不确定性度量方法,并将与正确性一致的不确定性条件融入到CP算法的设计中,提出了一种共形不确定性准则。实验结果表明,我们的不确定性度量总体上超过了现有的方法。此外,我们在模型的非固定答案分布范围内对预测集进行了校准,并在4个自由形式的NLG数据集上实现了对6个LLM的正确覆盖率的严格控制,而小的平均集大小进一步突出了该方法在为实际的开放式NLG应用提供可信保证方面的有效性。
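
The prediction-set idea behind conformal prediction (CP) can be sketched generically as follows. This is a minimal split-conformal example and not the ConU algorithm: the function names are hypothetical, and the nonconformity scores are assumed to be plain numbers supplied by the caller (in the paper they would come from a sampling-based uncertainty measure).

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration nonconformity score (infinite if that rank exceeds n)."""
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [cand for cand, score in candidate_scores.items() if score <= threshold]


# Hypothetical usage: scores of the correct answers on four calibration prompts.
threshold = conformal_threshold([0.4, 0.1, 0.3, 0.2], alpha=0.5)
answers = prediction_set({"a": 0.05, "b": 0.3, "c": 0.9}, threshold)
```

When calibration and test data are exchangeable, a threshold chosen this way yields sets that contain a correct answer with probability at least 1 - alpha, which is the kind of coverage guarantee the abstract refers to.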

[NLP-116] LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement
[NLP-116] LLMs-as-Instructors:从错误中学习,实现模型改进的自动化

Link: https://arxiv.org/abs/2407.00497
Authors: Jiahao Ying, Mingbao Lin, Yixin Cao, Wei Tang, Bo Wang, Qianru Sun, Xuanjing Huang, Shuicheng Yan
Keywords: Large Language Models, advanced Large Language, smaller target models, advanced Large, Large Language
Chinese keywords: 大型语言模型、高级大型语言、较小的目标模型、高级大型、大型语言
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces the innovative “LLMs-as-Instructors” framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of “Learning from Errors”, this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles. Within this framework, we implement two strategies: “Learning from Error,” which focuses solely on incorrect responses to tailor training data, and “Learning from Error by Contrast”, which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors. Our empirical studies, conducted with several open-source models, demonstrate significant improvements across multiple benchmarks, including mathematical reasoning, coding abilities, and factual knowledge. Notably, the refined Llama-3-8b-Instruction has outperformed ChatGPT, illustrating the effectiveness of our approach. By leveraging the strengths of both strategies, we have attained a more balanced performance improvement on both in-domain and out-of-domain benchmarks. Our code can be found at this https URL.
摘要:本文介绍了创新的“LLMs-as-Instructors”框架,它利用先进的大型语言模型(LLM)自主地增强对较小目标模型的训练。受“从错误中学习”理论的启发,该框架使用一个教师LLM来仔细分析目标模型中的具体错误,从而促进有针对性且高效的训练周期。在这个框架内,我们实施了两种策略:“从错误中学习”(Learning from Error),它只关注不正确的回答,以定制训练数据;以及“通过对比从错误中学习”(Learning from Error by Contrast),它使用对比学习来分析正确和不正确的回答,以更深入地理解错误。我们用几个开源模型进行的实证研究表明,在数学推理、编码能力和事实知识等多个基准上都有显著的改进。值得注意的是,改进后的Llama-3-8b-Instruction的性能优于ChatGPT,说明了我们方法的有效性。通过利用这两种策略的优势,我们在域内和域外基准上实现了更均衡的性能提升。我们的代码可以在此https URL中找到。

[NLP-117] PFME: A Modular Approach for Fine-grained Hallucination Detection and Editing of Large Language Models
[NLP-117] PFME:一种用于细粒度幻觉检测和编辑大型语言模型的模块化方法

Link: https://arxiv.org/abs/2407.00488
Authors: Kunquan Deng, Zeyu Huang, Chen Li, Chenghua Lin, Min Gao, Wenge Rong
Keywords: Large Language Models, Large Language, producing inaccurate content, risk producing inaccurate, Fine-grained Hallucination Detection
Chinese keywords: 大型语言模型、大型语言、产生不准确内容、产生不准确、细粒度幻觉检测的风险
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) excel in fluency but risk producing inaccurate content, called “hallucinations.” This paper outlines a standardized process for categorizing fine-grained hallucination types and proposes an innovative framework–the Progressive Fine-grained Model Editor (PFME)–specifically designed to detect and correct fine-grained hallucinations in LLMs. PFME consists of two collaborative modules: the Real-time Fact Retrieval Module and the Fine-grained Hallucination Detection and Editing Module. The former identifies key entities in the document and retrieves the latest factual evidence from credible sources. The latter further segments the document into sentence-level text and, based on relevant evidence and previously edited context, identifies, locates, and edits each sentence’s hallucination type. Experimental results on FavaBench and FActScore demonstrate that PFME outperforms existing methods in fine-grained hallucination detection tasks. Particularly, when using the Llama3-8B-Instruct model, PFME’s performance in fine-grained hallucination detection with external knowledge assistance improves by 8.7 percentage points (pp) compared to ChatGPT. In editing tasks, PFME further enhances the FActScore of FActScore-Alpaca13B and FActScore-ChatGPT datasets, increasing by 16.2pp and 4.6pp, respectively.
摘要:大型语言模型(LLM)在流利性方面表现出色,但存在产生不准确内容的风险,这种情况被称为“幻觉”。本文概述了对细粒度幻觉类型进行分类的标准化过程,并提出了一个创新的框架,即渐进式细粒度模型编辑器(PFME),专门用于检测和纠正LLM中的细粒度幻觉。PFME由两个协作模块组成:实时事实检索模块和细粒度幻觉检测和编辑模块。前者识别文档中的关键实体,并从可靠的来源检索最新的事实证据。后者进一步将文档分割成句子级别的文本,并基于相关证据和先前编辑的上下文,识别、定位和编辑每个句子的幻觉类型。在FavaBench和FActScore上的实验结果表明,PFME在细粒度幻觉检测任务中的性能优于现有方法。特别是,当使用Llama3-8B-Instruct模型时,与ChatGPT相比,PFME在外部知识辅助下的细粒度幻觉检测性能提高了8.7个百分点(pp)。在编辑任务中,PFME进一步提升了FActScore-Alpaca13B和FActScore-ChatGPT数据集的FActScore,分别提高了16.2pp和4.6pp。

[NLP-118] It's Morphing Time: Unleashing the Potential of Multiple LLMs via Multi-objective Optimization
[NLP-118] 变形时刻:通过多目标优化释放多个LLM的潜力

Link: https://arxiv.org/abs/2407.00487
Authors: Bingdong Li, Zixiang Di, Yanting Yang, Hong Qian, Peng Yang, Hao Hao, Ke Tang, Aimin Zhou
Keywords: large language model, model merging, large language, merging, multi-objective optimization algorithms
Chinese keywords: 大语言模型,模型合并,大语言,合并,多目标优化算法
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we introduce a novel approach for large language model merging via black-box multi-objective optimization algorithms. The goal of model merging is to combine multiple models, each excelling in different tasks, into a single model that outperforms any of the individual source models. However, model merging faces two significant challenges: First, existing methods rely heavily on human intuition and customized strategies. Second, parameter conflicts often arise during merging, and while methods like DARE [1] can alleviate this issue, they tend to stochastically drop parameters, risking the loss of important delta parameters. To address these challenges, we propose the MM-MO method, which automates the search for optimal merging configurations using multi-objective optimization algorithms, eliminating the need for human intuition. During the configuration searching process, we use estimated performance across multiple diverse tasks as optimization objectives in order to alleviate the parameter conflicting between different source models without losing crucial delta parameters. We conducted comparative experiments with other mainstream model merging methods, demonstrating that our method consistently outperforms them. Moreover, our experiments reveal that even task types not explicitly targeted as optimization objectives show performance improvements, indicating that our method enhances the overall potential of the model rather than merely overfitting to specific task types. This approach provides a significant advancement in model merging techniques, offering a robust and plug-and-play solution for integrating diverse models into a unified, high-performing model.
摘要:提出了一种新的基于黑盒多目标优化算法的大型语言模型融合方法。模型合并的目标是将多个模型(每个模型都在不同的任务中表现出色)组合成一个表现优于任何单个源模型的模型。然而,模型融合面临着两个重大挑战:第一,现有方法严重依赖于人类的直觉和定制的策略。其次,在合并过程中经常会出现参数冲突,虽然像DARE[1]这样的方法可以缓解这个问题,但它们往往会随机丢弃参数,从而冒着丢失重要增量参数的风险。为了应对这些挑战,我们提出了MM-MO方法,该方法使用多目标优化算法自动搜索最优合并配置,消除了对人类直觉的需要。在配置搜索过程中,我们以多个不同任务的估计性能作为优化目标,以缓解不同源模型之间的参数冲突,而不丢失关键的Delta参数。我们与其他主流的模型融合方法进行了对比实验,证明了我们的方法始终优于它们。此外,我们的实验表明,即使没有明确作为优化目标的任务类型也表现出了性能改进,这表明我们的方法增强了模型的整体潜力,而不仅仅是过度适应特定的任务类型。这种方法大大改进了模型合并技术,为将不同的模型集成到统一的、高性能的模型中提供了一个健壮的、即插即用的解决方案。
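
The stochastic parameter dropping that the abstract attributes to DARE can be illustrated with the sketch below. It shows drop-and-rescale on delta parameters as that technique is commonly described, not the MM-MO search itself; weights are assumed to be flat lists of floats, and the names `dare_delta` and `merge` are hypothetical.

```python
import random


def dare_delta(base, finetuned, drop_rate, rng):
    """Randomly zero each delta (finetuned - base) with probability drop_rate
    and rescale the survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    sparse = []
    for b, f in zip(base, finetuned):
        delta = f - b
        keep = rng.random() >= drop_rate
        sparse.append(delta * scale if keep else 0.0)
    return sparse


def merge(base, sparse_deltas):
    """Add the sparsified deltas back onto the base weights."""
    return [b + d for b, d in zip(base, sparse_deltas)]


# Hypothetical usage with a seeded generator for reproducibility.
rng = random.Random(0)
deltas = dare_delta([1.0, 2.0, 3.0], [1.5, 1.0, 3.25], drop_rate=0.5, rng=rng)
merged = merge([1.0, 2.0, 3.0], deltas)
```

Because the mask is random, an important delta can be zeroed by chance, which is the risk the abstract points to when motivating a searched merging configuration.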

[NLP-119] Towards Massive Multilingual Holistic Bias
[NLP-119] 迈向大规模多语言整体偏见

Link: https://arxiv.org/abs/2407.00486
Authors: Xiaoqing Ellen Tan, Prangthip Hansanti, Carleigh Wood, Bokai Yu, Christophe Ropers, Marta R. Costa-jussà
Keywords: mitigate demographic biases, MASSIVE MULTILINGUAL HOLISTICBIAS, automatic language generation, current landscape, biases as existing
Chinese keywords: 减轻人口统计偏见、MASSIVE MULTILINGUAL HOLISTICBIAS、自动语言生成、当前格局、现有偏见
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight languages from the MASSIVE MULTILINGUAL HOLISTICBIAS (MMHB) dataset and benchmark consisting of approximately 6 million sentences representing 13 demographic axes. We propose an automatic construction methodology to further scale up MMHB sentences in terms of both language coverage and size, leveraging limited human annotation. Our approach utilizes placeholders in multilingual sentence construction and employs a systematic method to independently translate sentence patterns, nouns, and descriptors. Combined with human translation, this technique carefully designs placeholders to dynamically generate multiple sentence variations and significantly reduces the human translation workload. The translation process has been meticulously conducted to avoid an English-centric perspective and include all necessary morphological variations for languages that require them, improving from the original English HOLISTICBIAS. Finally, we utilize MMHB to report results on gender bias and added toxicity in machine translation tasks. On the gender analysis, MMHB unveils: (1) a lack of gender robustness showing almost +4 chrf points in average for masculine semantic sentences compared to feminine ones and (2) a preference to overgeneralize to masculine forms by reporting more than +12 chrf points in average when evaluating with masculine compared to feminine references. MMHB triggers added toxicity up to 2.3%.
摘要:在当前的自动语言生成环境中,随着现有模型越来越多地使用多种语言,需要理解、评估和缓解人口统计偏差。为了解决这一问题,我们提供了大规模多语言HOLISTICBIAS(MMHB)数据集和基准中的最初八种语言,该数据集和基准由代表13个人口轴的大约600万个句子组成。我们提出了一种自动构建方法,利用有限的人工标注,在语言覆盖和大小方面进一步扩大MMHB语句的规模。我们的方法在多语言句子结构中使用占位符,并使用系统的方法来独立翻译句型、名词和描述符。与人工翻译相结合,该技术精心设计占位符,动态生成多个句子变体,显著减少了人工翻译的工作量。翻译过程一丝不苟地进行,以避免以英语为中心的观点,并包括需要它们的语言的所有必要的形态变体,比原始的英语HOLISTICBIAS有所改进。最后,我们利用MMHB来报告机器翻译任务中的性别偏见和添加毒性的结果。在性别分析方面,MMHB揭示了:(1)缺乏性别稳健性,与女性相比,男性语义句子平均+4个chrf分;(2)当评估男性与女性参照时,平均超过+12个chrf分,倾向于过度泛化为男性形式。MMHB会引发高达2.3%的额外毒性。

[NLP-120] Large Language Models for Power Scheduling: A User-Centric Approach
[NLP-120] 电力调度的大型语言模型:以用户为中心的方法

Link: https://arxiv.org/abs/2407.00476
Authors: Thomas Mongaillard, Samson Lasaulce, Othman Hicheur, Chao Zhang, Lina Bariah, Vineeth S. Varma, Hang Zou, Qiyang Zhao, Merouane Debbah
Keywords: predefined system requirements, meet fixed, personalized services, aiming to achieve, achieve high
Chinese keywords: 预定义的系统要求,满足固定的、个性化的服务,旨在实现,实现高
Categories: Computation and Language (cs.CL); Systems and Control (eess.SY)
Comments:

Abstract:While traditional optimization and scheduling schemes are designed to meet fixed, predefined system requirements, future systems are moving toward user-driven approaches and personalized services, aiming to achieve high quality-of-experience (QoE) and flexibility. This challenge is particularly pronounced in wireless and digitalized energy networks, where users’ requirements have largely not been taken into consideration due to the lack of a common language between users and machines. The emergence of powerful large language models (LLMs) marks a radical departure from traditional system-centric methods into more advanced user-centric approaches by providing a natural communication interface between users and devices. In this paper, for the first time, we introduce a novel architecture for resource scheduling problems by constructing three LLM agents to convert an arbitrary user’s voice request (VRQ) into a resource allocation vector. Specifically, we design an LLM intent recognition agent to translate the request into an optimization problem (OP), an LLM OP parameter identification agent, and an LLM OP solving agent. To evaluate system performance, we construct a database of typical VRQs in the context of electric vehicle (EV) charging. As a proof of concept, we primarily use Llama 3 8B. Through testing with different prompt engineering scenarios, the obtained results demonstrate the efficiency of the proposed architecture. The conducted performance analysis allows key insights to be extracted. For instance, having a larger set of candidate OPs to model the real-world problem might degrade the final performance because of a higher recognition/OP classification noise level. All results and codes are open source.
摘要:传统的优化和调度方案是为满足固定的、预先定义的系统需求而设计的,而未来的系统正朝着用户驱动的方法和个性化服务的方向发展,旨在实现高体验质量(QoE)和灵活性。这一挑战在无线和数字化能源网络中尤其明显,由于用户和机器之间缺乏共同语言,用户的需求在很大程度上没有得到考虑。强大的大型语言模型(LLM)的出现提供了用户和设备之间的自然通信接口,标志着从传统的以系统为中心的方法向更先进的以用户为中心的方法的根本转变。本文首次提出了一种解决资源调度问题的新体系结构,通过构造三个LLM智能体来将任意用户的语音请求(VRQ)转换为资源分配向量。具体地说,我们设计了一个将请求转化为优化问题(OP)的LLM意图识别智能体、一个LLM OP参数识别智能体和一个LLM OP求解智能体。为了评估系统的性能,我们构建了一个电动汽车(EV)充电场景下的典型VRQ数据库。作为概念验证,我们主要使用Llama 3 8B。通过在不同的提示工程场景下进行测试,得到的结果验证了所提体系结构的有效性。所进行的性能分析可以提取出关键的见解。例如,由于识别/OP分类的噪声水平更高,使用更大的候选OP集合来对真实世界问题建模可能会降低最终性能。所有结果和代码都是开源的。

[NLP-121] Classifier identification in Ancient Egyptian as a low-resource sequence-labelling task
[NLP-121] 古埃及中的分类器识别是一项低资源序列标签任务

Link: https://arxiv.org/abs/2407.00475
Authors: Dmitry Nikolaev, Jorke Grotenhuis, Haleli Harel, Orly Goldwasser
Keywords: complex Ancient Egyptian, hieroglyphic signs clarifying, Ancient Egyptian, writing system, hieroglyphic signs
Chinese keywords: 复杂的古埃及,象形文字符号澄清,古埃及,书写系统,象形文字符号
Categories: Computation and Language (cs.CL)
Comments: Accepted to ML4AL 2024 (First Machine Learning for Ancient Languages Workshop)

Abstract:The complex Ancient Egyptian (AE) writing system was characterised by widespread use of graphemic classifiers (determinatives): silent (unpronounced) hieroglyphic signs clarifying the meaning or indicating the pronunciation of the host word. The study of classifiers has intensified in recent years with the launch and quick growth of the iClassifier project, a web-based platform for annotation and analysis of classifiers in ancient and modern languages. Thanks to the data contributed by the project participants, it is now possible to formulate the identification of classifiers in AE texts as an NLP task. In this paper, we make first steps towards solving this task by implementing a series of sequence-labelling neural models, which achieve promising performance despite the modest amount of training data. We discuss tokenisation and operationalisation issues arising from tackling AE texts and contrast our approach with frequency-based baselines.
摘要:复杂的古埃及(AE)书写系统的特点是广泛使用字形分类符(限定词):无声(不发音)的象形文字符号,澄清主词的含义或指示发音。近年来,随着iClassifier项目的推出和快速发展,对分类器的研究得到了加强,iClassifier项目是一个用于注释和分析古代和现代语言分类器的基于网络的平台。得益于项目参与者提供的数据,现在可以将AE文本中分类器的识别制定为NLP任务。在本文中,我们通过实施一系列序列标记神经模型来解决这一任务迈出了第一步,尽管训练数据量有限,这些模型仍取得了令人鼓舞的性能。我们讨论了处理AE文本中出现的标记化和操作化问题,并将我们的方法与基于频率的基线进行了比较。

[NLP-122] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
[NLP-122] MMEvalPro:校准多模式基准以实现值得信赖和高效的评估

链接: https://arxiv.org/abs/2407.00468
作者: Jinsheng Huang,Liang Chen,Taian Guo,Fu Zeng,Yusheng Zhao,Bohan Wu,Ye Yuan,Haozhe Zhao,Zhihui Guo,Yichi Zhang,Jingyang Yuan,Wei Ju,Luchen Liu,Tianyu Liu,Baobao Chang,Ming Zhang
关键词: Large Multimodal Models, exhibit impressive cross-modal, impressive cross-modal understanding, Multimodal Models, Large Language Models
中文关键词: 大型多模态模型,展现出令人印象深刻的跨模态,令人印象深刻的跨模态理解,多模态模型,大型语言模型
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, code released at this https URL , Homepage at this https URL

Abstract:Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
摘要:大型多模态模型(LMM)表现出令人印象深刻的跨模态理解和推理能力,通常通过包含图像、问题和若干选项的多项选择题(MCQ)进行评估。然而,用于这类评估的许多基准存在系统性偏差。值得注意的是,没有任何视觉感知能力的大型语言模型(LLM)也能取得不可忽视的成绩,这削弱了这些评估的可信度。为了在保持MCQ评估效率的同时解决这个问题,我们提出了MMEvalPro,这是一个旨在通过三部曲评估流程和更严格的指标来避免第一类错误的基准。对于现有基准中的每个原始问题,人工标注员通过细致的标注过程创建一个感知问题和一个知识锚点问题,从而对其进行扩充。MMEvalPro包括2,138个问题三元组,总计6,414个不同的问题。其中三分之二的问题由人类专家手动标注,其余的来自现有基准(MMMU、ScienceQA和MathVista)。与现有基准相比,我们用最新的LLM和LMM进行的实验表明,MMEvalPro更具挑战性(最好的LMM落后人类表现31.73%,而以前的基准的平均差距为8.03%),也更值得信赖(最好的LLM比最好的LMM落后23.09%,而以前的基准的差距仅为14.64%)。我们的深入分析解释了性能差距较大的原因,并证明了评估的可信度,凸显了其推动未来研究的巨大潜力。
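
The triplet structure in this abstract lends itself to an all-or-nothing score. The sketch below first checks the stated counts (2,138 triplets of three questions each) and then computes a triplet-level accuracy. The abstract does not spell out its "more rigorous metrics", so the rule that a triplet counts only when all three answers are right is an assumption, and `triplet_accuracy`, `question_accuracy`, and the toy data are hypothetical names for illustration.

```python
from typing import List, Sequence, Tuple

# Counts quoted in the abstract: 2,138 triplets, each holding the original
# question, a perception question, and a knowledge anchor question.
NUM_TRIPLETS = 2138
QUESTIONS_PER_TRIPLET = 3
TOTAL_QUESTIONS = NUM_TRIPLETS * QUESTIONS_PER_TRIPLET

Triplet = Tuple[bool, bool, bool]  # correctness of (original, perception, knowledge)


def triplet_accuracy(results: Sequence[Triplet]) -> float:
    """Fraction of triplets in which all three questions were answered correctly.

    Assumption: this all-or-nothing rule is one plausible reading of the
    abstract, not a metric definition confirmed by it.
    """
    if not results:
        return 0.0
    return sum(all(t) for t in results) / len(results)


def question_accuracy(results: Sequence[Triplet]) -> float:
    """Plain per-question accuracy over the same answers, for comparison."""
    flat: List[bool] = [correct for t in results for correct in t]
    return sum(flat) / len(flat) if flat else 0.0
```

On the same answers the triplet-level score can never exceed the per-question score, which is why a model that guesses the original question without the supporting perception or knowledge would be penalised under this reading.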

[NLP-123] BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science
[NLP-123] BioKGBench:面向生物医学科学的人工智能代理知识图谱检查基准

链接: https://arxiv.org/abs/2407.00466
作者: Xinna Lin,Siqi Ma,Junjie Shan,Xiaojing Zhang,Shell Xu Hu,Tiannan Guo,Stan Z. Li,Kaicheng Yu
关键词: Pursuing artificial intelligence, Large Language Models, Pursuing artificial, artificial intelligence, Language Models
中文关键词: 追求人工智能,大型语言模型,追求人工,人工智能,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people either rely on direct Question-Answering (QA) to the LLM itself, or in a biomedical experimental manner. How to precisely benchmark biomedical agents from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from one most important abilities of scientists, understanding the literature, and introduce BioKGBench. In contrast to traditional evaluation benchmark that only focuses on factual QA, where the LLMs are known to have hallucination issues, we first disentangle “Understanding Literature” into two atomic abilities, i) “Understanding” the unstructured text from research papers by performing scientific claim verification, and ii) Ability to interact with structured Knowledge-Graph Question-Answering (KGQA) as a form of “Literature” grounding. We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation (RAG) to identify the factual errors of existing large-scale knowledge graph databases. We collect over two thousand data for two atomic tasks and 225 high-quality annotated data for the agent task. Surprisingly, we discover that state-of-the-art agents, both daily scenarios and biomedical ones, have either failed or inferior performance on our benchmark. We then introduce a simple yet effective baseline, dubbed BKGAgent. On the widely used popular knowledge graph, we discover over 90 factual errors which provide scenarios for agents to make discoveries and demonstrate the effectiveness of our approach. The code and data are available at this https URL.
摘要:为生物医学科学追求人工智能(又称AI科学家)引起了越来越多的关注,其中一种常见的方法是构建由大型语言模型(LLM)驱动的Copilot代理。然而,为了评估这样的系统,人们要么依靠对LLM本身的直接问答(QA),要么采用生物医学实验的方式。如何从AI科学家的角度准确地对生物医学代理进行基准测试,在很大程度上仍有待探索。为此,我们从科学家最重要的能力之一(理解文献)中获得灵感,并提出了BioKGBench。传统评估基准只关注事实性QA,而LLM在这方面已知存在幻觉问题;与之不同,我们首先将“理解文献”分解为两种原子能力:i)通过执行科学主张验证来“理解”研究论文中的非结构化文本,以及ii)与结构化知识图谱问答(KGQA)交互的能力,作为“文献”基础的一种形式。然后,我们利用KGQA和基于领域的检索增强生成(RAG)制定了一种名为KGCheck的新型代理任务,以识别现有大规模知识图谱数据库中的事实错误。我们为两个原子任务收集了2000多条数据,并为代理任务收集了225条高质量的标注数据。令人惊讶的是,我们发现最先进的代理,无论是面向日常场景的还是面向生物医学的,在我们的基准上要么失败,要么表现较差。然后,我们提出了一个简单而有效的基线,称为BKGAgent。在广泛使用的流行知识图谱上,我们发现了90多个事实错误,这些错误为代理进行发现提供了场景,并证明了我们方法的有效性。代码和数据可在此https URL获取。

[NLP-124] Open-Source Conversational AI with SpeechBrain 1.0
[NLP-124] SpeechBrain 1.0开源对话人工智能

链接: https://arxiv.org/abs/2407.00463
作者: Mirco Ravanelli,Titouan Parcollet,Adel Moumen,Sylvain de Langen,Cem Subakan,Peter Plantinga,Yingzhi Wang,Pooneh Mousavi,Luca Della Libera,Artem Ploujnikov,Francesco Paissan,Davide Borra,Salah Zaiem,Zeyu Zhao,Shucong Zhang,Georgios Karakasidis,Sung-Lin Yeh,Aku Rouhe,Rudolf Braun,Florian Mai,Juan Zuluaga-Gomez,Seyed Mahed Mousavi,Andreas Nautsch,Xuechen Liu,Sangeet Sagar,Jarod Duret,Salima Mdhaffar,Gaelle Laperriere,Renato De Mori,Yannick Esteve
关键词: open-source Conversational, promotes transparency
中文关键词: 开源对话,促进透明度
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: Submitted to JMLR (Machine Learning Open Source Software)

Abstract:SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete “recipes” of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
摘要:SpeechBrain是一个基于PyTorch的开源对话式人工智能工具包,特别专注于语音处理任务,例如语音识别、语音增强、说话人识别、文本转语音等等。它通过发布预训练模型以及训练这些模型所需的完整代码和算法“配方”(recipes)来提高透明度和可复现性。本文介绍了SpeechBrain 1.0,这是该工具包发展过程中的一个重要里程碑,该工具包目前拥有200多个用于语音、音频和语言处理任务的配方,以及100多个可在Hugging Face上获取的模型。SpeechBrain 1.0引入了新技术来支持多样化的学习方式、大型语言模型(LLM)集成和高级解码策略,以及新颖的模型、任务和模态。它还包括一个新的基准存储库,为研究人员提供了一个统一的平台,用于在不同任务上评估模型。

[NLP-125] Polarization and Morality: Lexical Analysis of Abortion Discourse on Reddit
[NLP-125] 两极分化与道德:Reddit上堕胎话语的词汇分析

链接: https://arxiv.org/abs/2407.00455
作者: Tessa Stanier,Hagyeong Shin
关键词: Moral Foundations Theory, Moral Foundations Dictionary, study investigates, investigates whether division, division on political
中文关键词: 道德基础理论,道德基础词典,研究调查,调查是否分裂,政治上的分裂
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

Abstract:This study investigates whether division on political topics is mapped with the distinctive patterns of language use. We collect a total 145,832 Reddit comments on the abortion debate and explore the languages of subreddit communities r/prolife and r/prochoice. With consideration of the Moral Foundations Theory, we examine lexical patterns in three ways. First, we compute proportional frequencies of lexical items from the Moral Foundations Dictionary in order to make inferences about each group’s moral considerations when forming arguments for and against abortion. We then create n-gram models to reveal frequent collocations from each stance group and better understand how commonly used words are patterned in their linguistic context and in relation to morality values. Finally, we use Latent Dirichlet Allocation to identify underlying topical structures in the corpus data. Results show that the use of morality words is mapped with the stances on abortion.
摘要:本研究调查政治话题上的分歧是否与语言使用的独特模式相对应。我们总共收集了145,832条关于堕胎辩论的Reddit评论,并探索subreddit社区r/prolife和r/prochoice的语言。参照道德基础理论,我们通过三种方式考察词汇模式。首先,我们计算《道德基础词典》中词汇项的比例频率,以推断每个群体在形成支持和反对堕胎的论点时的道德考量。然后,我们创建n-gram模型来揭示每个立场群体的高频搭配,并更好地了解常用词在其语言环境中以及与道德价值观相关的分布模式。最后,我们使用潜在狄利克雷分配(Latent Dirichlet Allocation)来识别语料库数据中的潜在主题结构。结果表明,道德词语的使用与对堕胎的立场相对应。
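
The first two analysis steps named in this abstract, proportional lexicon frequencies and n-gram collocations, can be sketched with the standard library alone. This is a minimal illustration rather than the authors' pipeline: `TOY_LEXICON` is a made-up stand-in for the Moral Foundations Dictionary, and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical two-category lexicon; the real Moral Foundations Dictionary
# is far larger and is not reproduced here.
TOY_LEXICON: Dict[str, Set[str]] = {
    "care": {"protect", "harm"},
    "liberty": {"choice", "freedom"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphabetic runs (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def proportional_frequencies(
    comments: Iterable[str], lexicon: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Share of all tokens that fall into each lexicon category."""
    tokens = [tok for comment in comments for tok in tokenize(comment)]
    counts = Counter(tokens)
    total = len(tokens)
    return {
        category: (sum(counts[w] for w in words) / total if total else 0.0)
        for category, words in lexicon.items()
    }


def bigram_counts(comments: Iterable[str]) -> Counter:
    """Count adjacent word pairs within each comment (n-gram model with n = 2)."""
    counts: Counter = Counter()
    for comment in comments:
        toks = tokenize(comment)
        counts.update(zip(toks, toks[1:]))
    return counts
```

Running the two functions separately on the comments of each community would give per-group profiles that can then be compared, which is the kind of contrast the abstract describes.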

[NLP-126] Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models
[NLP-126] 自翻译训练:大型语言模型跨语言迁移的简单但强大的基线

链接: https://arxiv.org/abs/2407.00454
作者: Ryokan Ri,Shun Kiyono,Sho Takase
关键词: Cross-lingual transfer, promising technique, improve performance, utilizing data, language
中文关键词: 跨语言迁移、有前途的技术、提高性能、利用数据、语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method called Self-Translate-Train. It leverages the translation capability of a large language model to generate synthetic training data in the target language and fine-tunes the model with its own generated data. We evaluate the proposed method on a wide range of tasks and show substantial performance gains across several non-English languages.
摘要:跨语言迁移是一种很有前途的技术,可以利用源语言的数据来提高目标语言上的性能。然而,当前的技术通常需要外部翻译系统,或者由于过度依赖多语言预训练语言模型的跨语言泛化能力而导致性能欠佳。在这项研究中,我们提出了一种简单而有效的方法,称为Self-Translate-Train(自翻译训练)。它利用大型语言模型的翻译能力生成目标语言的合成训练数据,并使用模型自己生成的数据对其进行微调。我们在广泛的任务上评估了所提出的方法,并在几种非英语语言上展示了显著的性能提升。
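
The data-construction step described in this abstract can be sketched as follows. Everything here is an illustrative assumption: `translate` stands in for prompting the same LLM to translate, the record layout is invented, and whether the original source-language examples are kept alongside the synthetic ones is a choice the abstract does not specify.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # hypothetical record layout: {"text": ..., "label": ...}


def self_translate_train_data(
    source_examples: List[Example],
    translate: Callable[[str], str],
    keep_source: bool = True,
) -> List[Example]:
    """Build a fine-tuning set from source data plus model-made translations.

    `translate` is a placeholder for calling the model being fine-tuned;
    labels are copied unchanged onto the translated text.
    """
    synthetic = [
        {"text": translate(ex["text"]), "label": ex["label"], "origin": "synthetic"}
        for ex in source_examples
    ]
    original = (
        [dict(ex, origin="source") for ex in source_examples] if keep_source else []
    )
    return original + synthetic
```

The returned list would then be passed to whatever fine-tuning routine is in use; that step is omitted because it depends on the training stack.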

[NLP-127] PerSEval: Assessing Personalization in Text Summarizers
[NLP-127] PerSEval:评估文本摘要中的个性化

链接: https://arxiv.org/abs/2407.00453
作者: Sourish Dasgupta,Ankush Chander,Parth Borad,Isha Motiyani,Tanmoy Chakraborty
关键词: individuals’ subjective understanding, understanding of saliency, topics of attention, cater to individuals’, individuals’ subjective
中文关键词: 个人的主观理解,对显著性的理解,关注话题,迎合个人的,个人的主观
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Personalized summarization models cater to individuals’ subjective understanding of saliency, as represented by their reading history and current topics of attention. Existing personalized text summarizers are primarily evaluated based on accuracy measures such as BLEU, ROUGE, and METEOR. However, a recent study argued that accuracy measures are inadequate for evaluating the degree of personalization of these models and proposed EGISES, the first metric to evaluate personalized text summaries. It was suggested that accuracy is a separate aspect and should be evaluated standalone. In this paper, we challenge the necessity of an accuracy leaderboard, suggesting that relying on accuracy-based aggregated results might lead to misleading conclusions. To support this, we delve deeper into EGISES, demonstrating both theoretically and empirically that it measures the degree of responsiveness, a necessary but not sufficient condition for degree-of-personalization. We subsequently propose PerSEval, a novel measure that satisfies the required sufficiency condition. Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that – (i) PerSEval is reliable w.r.t human-judgment correlation (Pearson’s r = 0.73; Spearman’s ρ = 0.62; Kendall’s τ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need of any aggregated ranking.
摘要:个性化摘要模型迎合个体对显著性的主观理解,这种理解体现在其阅读历史和当前关注的话题中。现有的个性化文本摘要器主要基于BLEU、ROUGE和METEOR等准确性指标进行评估。然而,最近的一项研究认为,准确性指标不足以评估这些模型的个性化程度,并提出了EGISES,这是第一个评估个性化文本摘要的指标。该研究建议,准确性是一个单独的方面,应该单独评估。在本文中,我们对准确性排行榜的必要性提出了质疑,认为依赖基于准确性的汇总结果可能会导致误导性的结论。为了支持这一点,我们更深入地研究了EGISES,从理论和实证上证明了它衡量的是响应性程度,而这是个性化程度的必要但不充分条件。我们随后提出了PerSEval,这是一种满足所需充分性条件的新度量。基于在PENS数据集上对10个SOTA摘要模型的基准测试,我们实证地证明:(i)PerSEval在与人类判断的相关性方面是可靠的(Pearson's r = 0.73;Spearman's ρ = 0.62;Kendall's τ = 0.42),(ii)PerSEval具有很高的排名稳定性,(iii)PerSEval作为排名度量不能由基于EGISES的排名推出,以及(iv)PerSEval可以作为独立的排名度量,不需要任何聚合排名。
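
The three correlation coefficients reported in this abstract (Pearson, Spearman, Kendall) measure agreement between a metric's scores and human judgments in different ways. The sketch below implements them in plain Python so the definitions are visible. It is not the paper's evaluation code, the sample numbers in the usage note are invented, and the Kendall variant shown is tau-a, which assumes no tied values.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

For example, with hypothetical metric scores `[1, 2, 3, 4, 5]` and human scores `[2, 1, 4, 3, 5]`, Pearson and Spearman both come out at 0.8 while Kendall's tau-a is 0.6. Kendall typically gives the smallest value of the three on the same data, which matches the ordering of the figures quoted in the abstract.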

[NLP-128] A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
[NLP-128] 面向多语言大型语言模型的平行语料库利用方案

链接: https://arxiv.org/abs/2407.00436
作者: Peiqin Lin,André F. T. Martins,Hinrich Schütze
关键词: Recent studies, parallel corpora, multilingual large language, large language models, exploiting parallel corpora
中文关键词: 最近的研究,平行语料库,多语言大型语言,大型语言模型,利用平行语料库
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
摘要:最近的研究强调了利用平行语料库来增强多语言大型语言模型的潜力,这既能提高双语任务(如机器翻译)的性能,也能提高通用任务(如文本分类)的性能。在这些发现的基础上,我们的全面研究旨在确定利用平行语料库的最有效策略。我们考察了平行语料库的质量和数量、训练目标以及模型规模,对经平行语料库增强的多语言大型语言模型在不同语言和任务上的性能的影响。我们的分析揭示了几个关键的见解:(i)过滤噪声翻译对于有效利用平行语料库至关重要,而语言识别和短句过滤效果甚微;(ii)即使是只包含1万个平行句子的语料库,也可以产生与更大的数据集相当的结果;(iii)在各种训练目标及其组合中,仅使用机器翻译目标的效果最好;(iv)较大的多语言模型比较小的模型从平行语料库中受益更多,因为它们具有更强的跨任务迁移能力。我们的研究为优化利用平行语料库来增强多语言大型语言模型提供了有价值的见解,将以前发现的适用性从有限的语言和任务扩展到更广泛的场景。

[NLP-129] Brevity is the soul of wit: Pruning long files for code generation
[NLP-129] 简洁是智慧的灵魂:修剪长文件以生成代码

链接: https://arxiv.org/abs/2407.00434
作者: Aaditya K. Singh,Yu Yang,Kushal Tirumala,Mostafa Elhoushi,Ari S. Morcos
关键词: Data, higher quality data, curation is commonly, commonly considered, higher quality
中文关键词: 数据,更高质量的数据,整理通常,通常被认为,更高质量
类目: Computation and Language (cs.CL)
备注: 15 pages, 5 figures

Abstract:Data curation is commonly considered a “secret-sauce” for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus. Specifically, many have shown that de-duplicating data, or sub-selecting higher quality data, can lead to efficiency or performance improvements. Generally, three types of methods are used to filter internet-scale corpora: embedding-based, heuristic-based, and classifier-based. In this work, we contrast the former two in the domain of finetuning LLMs for code generation. We find that embedding-based methods are often confounded by length, and that a simple heuristic–pruning long files–outperforms other methods in compute-limited regimes. Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval (while matching compute). However, we find that perplexity on held-out long files can increase, begging the question of whether optimizing data mixtures for common coding benchmarks (HumanEval, MBPP) actually best serves downstream use cases. Overall, we hope our work builds useful intuitions about code data (specifically, the low quality of extremely long code files) provides a compelling heuristic-based method for data pruning, and brings to light questions in how we evaluate code generation models.
摘要:数据整理通常被认为是LLM训练的“秘方”,更高质量的数据通常会带来更好的LLM性能。考虑到从互联网抓取的语料库的规模,数据剪枝已成为越来越受关注的焦点。具体来说,许多研究已经表明,对数据去重或筛选出更高质量的数据可以提高效率或性能。通常,过滤互联网规模的语料库有三类方法:基于嵌入的、基于启发式的和基于分类器的。在这项工作中,我们在面向代码生成的LLM微调领域对比了前两类方法。我们发现,基于嵌入的方法经常受到长度的干扰,而一个简单的启发式方法(剪除长文件)在计算受限的情况下优于其他方法。我们的方法可以在训练中带来高达2倍的效率收益(在性能相当时),或者在HumanEval上带来3.5%的绝对性能提升(在计算量相当时)。然而,我们发现在留出的长文件上困惑度可能会上升,这就引出了一个问题:针对常见编码基准(HumanEval、MBPP)优化数据配比,是否真的最有利于下游用例。总而言之,我们希望我们的工作能够建立关于代码数据的有用直觉(特别是超长代码文件质量较低),为数据剪枝提供一种有说服力的基于启发式的方法,并揭示我们在如何评估代码生成模型方面的问题。
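
The heuristic named in this abstract, pruning long files, is simple enough to sketch directly. The version below is an assumption-laden illustration: it measures length in characters and keeps a fixed fraction of the shortest files, whereas the abstract states neither the unit of length nor the cutoff. `prune_long_files` and `keep_fraction` are hypothetical names.

```python
from typing import Dict


def prune_long_files(files: Dict[str, str], keep_fraction: float = 0.9) -> Dict[str, str]:
    """Drop the longest files, keeping the shortest `keep_fraction` of them.

    `files` maps a file name to its contents. Length is measured in
    characters here; a token count would work the same way.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not files:
        return {}
    ranked = sorted(files, key=lambda name: len(files[name]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    # Preserve the original ordering of the surviving files.
    return {name: text for name, text in files.items() if name in kept}
```

Because the rule needs only a length per file, it avoids the embedding computation that the abstract's embedding-based baselines require, which is part of its appeal in compute-limited settings.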

[NLP-130] Fontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey
[NLP-130] Fontes。中世纪拉丁语文本的词性标注和词形还原。跨体裁调查

链接: https://arxiv.org/abs/2407.00418
作者: Krzysztof Nowak,Jędrzej Ziębura,Krzysztof Wróbel,Aleksander Smywiński-Pohl
关键词: Medieval Latin texts, Polish Medieval Latin, automatic linguistic annotation, Medieval Latin, Latin texts
中文关键词: 中世纪拉丁文本,波兰中世纪拉丁语,自动语言注释,中世纪拉丁语,拉丁文本
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models’ performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.
摘要:本研究介绍了用于中世纪拉丁语文本自动语言标注的eFontes模型,重点关注词形还原、词性标注和形态特征判定。这些模型使用Transformers库,在Universal Dependencies(UD)语料库和新开发的波兰中世纪拉丁语eFontes语料库上进行训练。该研究评估了模型的性能,并应对了拼写变体和拉丁化方言词汇的整合等挑战。这些模型实现了很高的准确率:词形还原为92.60%,词性标注为83.29%,形态特征判定为88.57%。这些研究结果强调了高质量标注语料库的重要性,并提出了未来的改进方向,包括将模型扩展到命名实体识别。

[NLP-131] Too Late to Train Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
[NLP-131] 训练太晚,使用太早?低资源孟加拉语LLM的必要性和可行性研究

链接: https://arxiv.org/abs/2407.00416
作者: Tamzeed Mahfuz,Satak Kumar Dey,Ruwad Naswan,Hasnaen Adil,Khondker Salman Sayeed,Haz Sameen Shahgir
关键词: English-oriented Large Language, English-oriented Large, exhibits enhanced cross-lingual, enhanced cross-lingual transfer, cross-lingual transfer capabilities
中文关键词: 以英语为导向的大型语言,以英语为导向的大型,展现出增强的跨语言、增强的跨语言迁移、跨语言迁移能力
类目: Computation and Language (cs.CL)
备注:

Abstract:Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia. We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include inefficient tokenization of Bengali script by existing LLMs, leading to increased computational costs and potential performance degradation. Additionally, we highlight biases in machine-translated datasets commonly used for Bengali NLP tasks. We conclude that there is a significant need for a Bengali-oriented LLM, but the field currently lacks the high-quality pretraining and instruction-tuning datasets necessary to develop a highly effective model.
摘要:每一代新的面向英语的大型语言模型(LLM)都表现出更强的跨语言迁移能力,并且在低资源语言上的表现明显优于旧的LLM。这就引出了一个问题:是否需要专门针对特定低资源语言的LLM?我们的目标是针对孟加拉语探讨这个问题,孟加拉语是南亚孟加拉地区的一种低到中等资源的印度-雅利安语。我们在包括翻译、摘要、复述、问答和自然语言推理在内的多种孟加拉语下游任务上,比较了开放权重和闭源LLM(如LLaMA-3和GPT-4)与微调的编码器-解码器模型的性能。我们的发现表明,虽然LLM通常在推理任务中表现出色,但在需要生成孟加拉文的任务中,它们的表现并不稳定。关键挑战包括现有LLM对孟加拉文的分词效率低下,导致计算成本增加和潜在的性能下降。此外,我们还指出了孟加拉语NLP任务常用的机器翻译数据集中的偏差。我们的结论是,对面向孟加拉语的LLM有很大的需求,但该领域目前缺乏开发高效模型所需的高质量预训练和指令微调数据集。

[NLP-132] SHADE: Semantic Hypernym Annotator for Domain-specific Entities – DnD Domain Use Case
[NLP-132] SHADE:面向特定领域实体的语义上位词标注器 -- DnD领域用例

链接: https://arxiv.org/abs/2407.00407
作者: Akila Peiris,Nisansa de Silva
关键词: important NLP task, Manual data annotation, important NLP, Manual data, NLP task
中文关键词: 重要NLP任务、手动数据注释、重要NLP、手动数据、NLP任务
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

Abstract:Manual data annotation is an important NLP task but one that takes considerable amount of resources and effort. In spite of the costs, labeling and categorizing entities is essential for NLP tasks such as semantic evaluation. Even though annotation can be done by non-experts in most cases, due to the fact that this requires human labor, the process is costly. Another major challenge encountered in data annotation is maintaining the annotation consistency. Annotation efforts are typically carried out by teams of multiple annotators. The annotations need to maintain the consistency in relation to both the domain truth and annotation format while reducing human errors. Annotating a specialized domain that deviates significantly from the general domain, such as fantasy literature, will see a lot of human error and annotator disagreement. So it is vital that proper guidelines and error reduction mechanisms are enforced. One such way to enforce these constraints is using a specialized application. Such an app can ensure that the notations are consistent, and the labels can be pre-defined or restricted reducing the room for errors. In this paper, we present SHADE, an annotation software that can be used to annotate entities in the high fantasy literature domain. Specifically in Dungeons and Dragons lore extracted from the Forgotten Realms Fandom Wiki.
摘要:人工数据标注是一项重要的NLP任务,但需要花费大量的资源和精力。尽管成本高昂,但对实体进行标注和分类对于语义评估等NLP任务是必不可少的。尽管在大多数情况下标注可以由非专家完成,但由于需要人力,这一过程成本很高。数据标注中遇到的另一个主要挑战是保持标注的一致性。标注工作通常由多名标注员组成的团队进行。标注需要在领域事实和标注格式两方面保持一致,同时减少人为错误。标注一个明显偏离通用领域的专业领域(如奇幻文学)时,会出现大量人为错误和标注员之间的分歧。因此,执行适当的指导方针和减少错误的机制至关重要。实施这些约束的一种方法是使用专门的应用程序。这样的应用程序可以确保标注的一致性,并且可以预先定义或限制标签,从而减少出错的空间。在本文中,我们介绍了SHADE,这是一个可用于标注高奇幻文学领域实体的标注软件,具体针对的是从Forgotten Realms Fandom Wiki中提取的《龙与地下城》(Dungeons and Dragons)设定资料。

[NLP-133] Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
[NLP-133] 如果只需要检索,那还真的算长上下文吗?迈向真正困难的长上下文NLP

链接: https://arxiv.org/abs/2407.00402
作者: Omer Goldman,Alon Jacovi,Aviv Slobodkin,Aviya Maimon,Ido Dagan,Reut Tsarfaty
关键词: language models’ capabilities, Improvements in language, making long-context evaluation, language models’, models’ capabilities
中文关键词: 语言模型的能力、语言的改进、进行长上下文评估、语言模型、模型的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including - for example - Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, is severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.
摘要:语言模型能力的提升推动其应用走向更长的上下文,使长上下文的评估和开发成为一个活跃的研究领域。然而,许多截然不同的用例被归到“长上下文”这一总括术语之下,而这一术语仅由模型输入的总长度来定义,例如包括大海捞针(Needle-in-a-Haystack)任务、图书摘要和信息聚合。考虑到它们的难度各不相同,在这份立场文件中,我们认为仅凭上下文长度将不同任务混为一谈是徒劳的。作为一个社区,我们需要更精确的词汇来理解长上下文任务的相似或不同之处。我们建议根据那些使任务随上下文变长而更加困难的属性,来梳理长上下文的分类体系。我们提出了两个正交的难度轴:(I)扩散度(Diffusion):在上下文中找到必要的信息有多难?(II)范围(Scope):需要找到多少必要信息?我们综述了关于长上下文的文献,论证了这一分类体系作为信息性描述符的合理性,并据此对现有文献进行定位。我们的结论是,最困难也最有趣的设定(即必要信息非常长且在输入中高度分散的设定)严重缺乏探索。通过使用描述性词汇并讨论长上下文中与难度相关的属性,我们可以在这一领域开展更有见地的研究。我们呼吁精心设计具有明显长上下文的任务和基准,同时考虑使其与较短上下文有质的不同的特征。

[NLP-134] A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model
[NLP-134] 使用大语言模型生成与SAPPhIRE模型相关的技术内容时参考知识选择的影响研究

链接: https://arxiv.org/abs/2407.00396
作者: Kausik Bhattacharya,Anubhab Majumder,Amaresh Chakrabarti
关键词: SAPPhIRE model, Large Language Model, stimulus in design, model, inspirational stimulus
中文关键词: SAPPhIRE模型、大型语言模型、设计刺激、模型、启发性刺激
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to generate technical content accurately relevant to the SAPPhIRE model of causality using a Large Language Model, also called LLM. This paper, which is the first part of the two-part research, presents a method for hallucination suppression using Retrieval Augmented Generating with LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The result from this research shows that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.
摘要:使用SAPPhIRE因果关系模型来表示系统,可以成为设计中的启发性刺激。然而,创建技术系统或自然系统的SAPPhIRE模型,需要从多个有关系统如何工作的技术文档中获取技术知识。这项研究探讨了如何使用大型语言模型(LLM)准确地生成与SAPPhIRE因果关系模型相关的技术内容。本文是两部分研究的第一部分,提出了一种利用LLM的检索增强生成来抑制幻觉的方法,以生成由与SAPPhIRE构件相关的科学信息支持的技术内容。这项研究的结果表明,为LLM提供上下文以生成技术内容时,所用参考知识的选择非常重要。本研究的成果被用来构建一个软件支持工具,以生成给定技术系统的SAPPhIRE模型。

[NLP-135] Advancing Process Verification for Large Language Models via Tree-Based Preference Learning
[NLP-135] 通过基于树的偏好学习推进大型语言模型的流程验证

链接: https://arxiv.org/abs/2407.00390
作者: Mingqian He,Yongliang Shen,Wenqi Zhang,Zeqi Tan,Weiming Lu
关键词: Large Language Models, Large Language, Language Models, demonstrated remarkable potential, introducing extra verifiers
中文关键词: 大型语言模型、大型语言、语言模型、展示了显著的潜力、引入了额外的验证器
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales. Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%). Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process.
摘要:大型语言模型(LLM)通过生成逐步的推理过程,在处理复杂推理任务方面显示出巨大的潜力。一些方法通过引入额外的验证器来评估这些推理路径,已被证明能有效提高准确率。然而,现有的验证器通常在带有二元标签的推理路径上训练,不能充分利用中间步骤的相对优劣,从而限制了所提供反馈的有效性。为了克服这一局限性,我们提出了基于树的偏好学习验证器(Tree-PLV),这是一种通过最佳优先搜索算法构建推理树,并收集步骤级配对数据进行偏好训练的新方法。与传统的二元分类相比,步骤级偏好能更精细地捕捉推理步骤之间的细微差别,从而可以更精确地评估完整的推理路径。我们在一系列算术和常识推理任务上对Tree-PLV进行了实证评估,其性能显著优于现有的基准。例如,相对于Mistral-7B自洽性基线,Tree-PLV在GSM8K(67.55%到82.79%)、MATH(17.00%到26.80%)、CSQA(68.14%到72.97%)和StrategyQA(82.86%到83.25%)上取得了显著的性能提升。此外,我们的研究探讨了应用偏好学习的合适粒度,揭示了步骤级指导所提供的反馈与推理过程的评估更为契合。
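
To make the "best-first search plus step-level pairs" idea in this abstract concrete, here is a generic best-first expansion over a tree that records one preferred/rejected sibling pair per expanded node. It is a sketch under stated assumptions, not Tree-PLV itself: the `expand` and `score` callables are placeholders for a step generator and a step-quality estimate, and pairing the best sibling against the worst is an invented rule, since the abstract does not say how pairs are chosen.

```python
import heapq
from itertools import count
from typing import Callable, List, Sequence, Tuple, TypeVar

Node = TypeVar("Node")


def best_first_pairs(
    root: Node,
    expand: Callable[[Node], Sequence[Node]],
    score: Callable[[Node], float],
    max_expansions: int,
) -> List[Tuple[Node, Node, Node]]:
    """Expand the highest-scoring frontier node first and collect sibling pairs.

    Returns (parent, preferred_child, rejected_child) tuples. The pairing rule
    (best vs. worst sibling) is a hypothetical choice for illustration.
    """
    tie_breaker = count()  # keeps heap entries comparable when scores tie
    frontier = [(-score(root), next(tie_breaker), root)]
    pairs: List[Tuple[Node, Node, Node]] = []
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        children = sorted(expand(node), key=score, reverse=True)
        if len(children) >= 2 and score(children[0]) > score(children[-1]):
            pairs.append((node, children[0], children[-1]))
        for child in children:
            heapq.heappush(frontier, (-score(child), next(tie_breaker), child))
    return pairs
```

`heapq` is a min-heap, so scores are negated to pop the best node first. The collected tuples have the shape that a pairwise preference loss would consume, although the training step itself is outside this sketch.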

[NLP-136] GraphArena: Benchmarking Large Language Models on Graph Computational Problems
[NLP-136] GraphArena:在图计算问题上对大型语言模型进行基准测试

链接: https://arxiv.org/abs/2407.00379
作者: Jianheng Tang,Qifan Zhang,Yuhan Li,Jia Li
关键词: Large Language Models, Large Language, arms race, Language Models, examine their progresses
中文关键词: 大型语言模型,大型语言,军备竞赛,语言模型,检查他们的进展
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The “arms race” of Large Language Models (LLMs) demands novel, challenging, and diverse benchmarks to faithfully examine their progresses. We introduce GraphArena, a benchmarking tool designed to evaluate LLMs on graph computational problems using million-scale real-world graphs from diverse scenarios such as knowledge graphs, social networks, and molecular structures. GraphArena offers a suite of 10 computational tasks, encompassing four polynomial-time (e.g., Shortest Distance) and six NP-complete challenges (e.g., Travelling Salesman Problem). It features a rigorous evaluation framework that classifies LLM outputs as correct, suboptimal (feasible but not optimal), or hallucinatory (properly formatted but infeasible). Evaluation of 10 leading LLMs, including GPT-4o and LLaMA3-70B-Instruct, reveals that even top-performing models struggle with larger, more complex graph problems and exhibit hallucination issues. Despite the application of strategies such as chain-of-thought prompting, these issues remain unresolved. GraphArena contributes a valuable supplement to the existing LLM benchmarks and is open-sourced at this https URL.
摘要:大型语言模型(LLM)的“军备竞赛”需要新颖、具有挑战性且多样化的基准来如实检验其进展。我们介绍了GraphArena,这是一个基准测试工具,旨在使用来自知识图谱、社交网络和分子结构等不同场景的百万级真实图,评估LLM在图计算问题上的表现。GraphArena提供了一套10个计算任务,包括4个多项式时间任务(例如最短距离)和6个NP完全挑战(例如旅行商问题)。它具有严格的评估框架,将LLM的输出归类为正确、次优(可行但非最优)或幻觉(格式正确但不可行)。对包括GPT-4o和LLaMA3-70B-Instruct在内的10个领先LLM的评估显示,即使是表现最好的模型,在更大、更复杂的图问题上也表现不佳,并存在幻觉问题。尽管采用了思维链提示等策略,这些问题仍未得到解决。GraphArena是对现有LLM基准的有价值补充,并已在此 https URL 开源。
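
As a rough illustration of the three-way outcome described in the abstract (correct, suboptimal, hallucinatory), here is a minimal Python sketch for a shortest-distance style task on an unweighted graph. The function names `bfs_distance` and `classify_path`, the adjacency-dict format, the toy graph, and the use of BFS as the reference solver are all assumptions made for illustration; this is not GraphArena's actual evaluation code.

```python
from collections import deque

def bfs_distance(graph, source, target):
    """Reference shortest distance (edge count) in an unweighted graph, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def classify_path(graph, source, target, path):
    """Label a model-proposed path as 'correct', 'suboptimal', or 'hallucinatory'."""
    # Infeasible answers: wrong endpoints, or an edge that is not in the graph.
    if not path or path[0] != source or path[-1] != target:
        return "hallucinatory"
    for u, v in zip(path, path[1:]):
        if v not in graph.get(u, ()):
            return "hallucinatory"
    # Feasible answers: compare the path length against the reference optimum.
    optimum = bfs_distance(graph, source, target)
    return "correct" if len(path) - 1 == optimum else "suboptimal"

# Hypothetical toy graph (undirected, stored as adjacency lists).
graph = {
    "A": ["B", "C"],
    "B": ["A", "E"],
    "C": ["A", "D"],
    "D": ["C", "E"],
    "E": ["B", "D"],
}

print(classify_path(graph, "A", "E", ["A", "B", "E"]))       # shortest route
print(classify_path(graph, "A", "E", ["A", "C", "D", "E"]))  # valid but longer
print(classify_path(graph, "A", "E", ["A", "D", "E"]))       # uses a missing edge A-D
```

The same split between a feasibility check and an optimality check would carry over to the NP-complete tasks, where the reference optimum would have to come from an exact or heuristic solver rather than BFS.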

[NLP-137] The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention
[NLP-137] 多元化干预的文本到图像生成的事实税:基准和事实增强干预

链接: https://arxiv.org/abs/2407.00377
作者: Yixin Wan,Di Wu,Haoran Wang,Kai-Wei Chang
关键词: models depicting individuals, commonly adopted, depicting individuals, Prompt-based, diversity interventions
中文关键词: 描绘个体的模型,常用的,描绘个体,基于提示的,多样性干预
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

Abstract:Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.
摘要:基于提示的“多样性干预”通常被用来提高文本到图像(T2I)模型在描绘具有不同种族或性别特征的个体时的多样性。然而,这种策略是否会导致不符合事实的人口分布,特别是在生成真实的历史人物时?在这项工作中,我们提出了人口事实性表示(DemOgraphic FActualIty Representation,DoFaiR),这是一个基准,用于系统地量化在T2I模型中使用多样性干预与保持人口事实性之间的权衡。DoFaiR由756个经过仔细事实核查的测试实例组成,通过一个自动化的、有证据支持的评估流程来揭示各种多样性提示的事实性代价(factuality tax)。DoFaiR上的实验表明,以多样性为导向的指令增加了DALLE-3生成结果中不同性别和种族群体的数量,代价是人口分布在历史上不准确。为了解决这一问题,我们提出了事实增强干预(FAI),它指示大型语言模型(LLM)对以言语表述或检索得到的、关于历史上生成对象的性别和种族构成的事实信息进行反思,并将其纳入T2I模型的生成上下文中。通过利用经过反思的历史事实来引导模型生成,FAI在保持多样性的同时,显著提高了多样性干预下的人口事实性。

[NLP-138] How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models
[NLP-138] 如何训练你的事实验证器:利用多模态开放模型进行知识迁移

链接: https://arxiv.org/abs/2407.00369
作者: Jaeyoung Lee,Ximing Lu,Jack Hessel,Faeze Brahman,Youngjae Yu,Yonatan Bisk,Yejin Choi,Saadia Gabriel
关键词: provide effective real-time, effective real-time verification, social media, growing influx, provide effective
中文关键词: 提供有效的实时、有效的实时验证、社交媒体、不断增长的涌入、提供有效的
类目: Computation and Language (cs.CL)
备注:

Abstract:Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-checkers, such efforts may be hampered by foundation model training data becoming outdated. In this work, we test the limits of improving foundation model performance without continual updating through an initial study of knowledge transfer using either existing intra- and inter- domain benchmarks or explanations generated from large language models (LLMs). We evaluate on 12 public benchmarks for fact-checking and misinformation detection as well as two other tasks relevant to content moderation – toxicity and stance detection. Our results on two recent multi-modal fact-checking benchmarks, Mocheg and Fakeddit, indicate that knowledge transfer strategies can improve Fakeddit performance over the state-of-the-art by up to 1.7% and Mocheg performance by up to 2.9%.
摘要:鉴于新闻和社交媒体上不断涌入的错误信息,迫切需要能够对新闻声明进行有效实时验证的系统。已有研究提出基于大型语言模型或多模态模型的验证,以扩大在线监管机制的规模,减少虚假和有害内容的传播。虽然这些方法可能减轻人类事实核查人员的负担,但基础模型的训练数据会过时,这可能阻碍此类努力。在这项工作中,我们通过对知识迁移的初步研究,测试在不持续更新的情况下提高基础模型性能的极限;知识迁移使用现有的域内和域间基准,或使用大型语言模型(LLM)生成的解释。我们在12个事实核查和错误信息检测的公共基准上进行了评估,并评估了与内容审核相关的另外两项任务:毒性检测和立场检测。我们在最近的两个多模态事实核查基准Mocheg和Fakeddit上的结果表明,知识迁移策略相对于最先进方法,最多可将Fakeddit的性能提高1.7%,将Mocheg的性能提高2.9%。

[NLP-139] Financial Knowledge Large Language Model
[NLP-139] 金融知识大语言模型

链接: https://arxiv.org/abs/2407.00365
作者: Cehao Yang,Chengjin Xu,Yiyan Qi
关键词: making significant strides, Artificial intelligence, large language models, processed and interpreted, intelligence is making
中文关键词: 取得重大进展,人工智能、大型语言模型、处理和解释,智能正在取得
类目: Computation and Language (cs.CL)
备注: 66 pages

Abstract:Artificial intelligence is making significant strides in the finance industry, revolutionizing how data is processed and interpreted. Among these technologies, large language models (LLMs) have demonstrated substantial potential to transform financial services by automating complex tasks, enhancing customer service, and providing detailed financial analysis. Firstly, we introduce IDEA-FinBench, an evaluation benchmark specifically tailored for assessing financial knowledge in large language models (LLMs). This benchmark utilizes questions from two globally respected and authoritative financial professional exams, aiming to comprehensively evaluate the capability of LLMs to directly address exam questions pertinent to the finance sector. Secondly, we propose IDEA-FinKER, a Financial Knowledge Enhancement framework designed to facilitate the rapid adaptation of general LLMs to the financial domain, introducing a retrieval-based few-shot learning method for real-time context-level knowledge injection, and a set of high-quality financial knowledge instructions for fine-tuning any general LLM. Finally, we present IDEA-FinQA, a financial question-answering system powered by LLMs. This system is structured around a scheme of real-time knowledge injection and factual enhancement using external knowledge. IDEA-FinQA is comprised of three main modules: the data collector, the data querying module, and LLM-based agents tasked with specific functions.
摘要:人工智能正在金融行业取得重大进展,彻底改变了数据的处理和解释方式。在这些技术中,大型语言模型(LLM)通过自动化复杂任务、提升客户服务和提供详细的金融分析,展现出变革金融服务的巨大潜力。首先,我们介绍了IDEA-FinBench,这是一个专门为评估大型语言模型(LLM)的金融知识而定制的评估基准。该基准采用了两个全球公认的权威金融专业考试的试题,旨在全面评估LLM直接解答金融领域相关考试问题的能力。其次,我们提出了金融知识增强框架IDEA-FinKER,旨在促进通用LLM快速适应金融领域;它引入了一种基于检索的少样本学习方法,用于实时的上下文级知识注入,并提供了一套高质量的金融知识指令,可用于微调任何通用LLM。最后,我们提出了由LLM驱动的金融问答系统IDEA-FinQA。该系统围绕利用外部知识进行实时知识注入和事实增强的方案构建。IDEA-FinQA由三个主要模块组成:数据收集器、数据查询模块,以及负责特定功能的基于LLM的智能体。

[NLP-140] From RAG to RICHES: Retrieval Interlaced with Sequence Generation
[NLP-140] 从RAG到RICHES:检索与序列生成交织

链接: https://arxiv.org/abs/2407.00361
作者: Palak Jain,Livio Baldini Soares,Tom Kwiatkowski
关键词: present RICHES, RICHES, conventional RAG systems, sequence generation tasks, sequence generation
中文关键词: 提出RICHES、RICHES、传统RAG系统、序列生成任务、序列生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures, Preprint

Abstract:We present RICHES, a novel approach that interleaves retrieval with sequence generation tasks. RICHES offers an alternative to conventional RAG systems by eliminating the need for separate retriever and generator. It retrieves documents by directly decoding their contents, constrained on the corpus. Unifying retrieval with generation allows us to adapt to diverse new tasks via prompting alone. RICHES can work with any Instruction-tuned model, without additional training. It provides attributed evidence, supports multi-hop retrievals and interleaves thoughts to plan on what to retrieve next, all within a single decoding pass of the LLM. We demonstrate the strong performance of RICHES across ODQA tasks including attributed and multi-hop QA.
摘要:我们提出了RICHES,这是一种将检索与序列生成任务交织在一起的新方法。RICHES无需单独的检索器和生成器,为传统RAG系统提供了一种替代方案。它通过直接解码文档内容来检索文档,解码过程受语料库约束。将检索与生成统一起来,使我们能够仅通过提示来适应多样化的新任务。RICHES可以与任何经过指令微调的模型配合使用,无需额外训练。它提供可归因的证据,支持多跳检索,并穿插思考以规划接下来检索什么,所有这些都在LLM的单次解码过程中完成。我们展示了RICHES在ODQA任务(包括归因问答和多跳问答)中的强劲性能。

[NLP-141] Korean Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering
[NLP-141] 通过隐式特征对齐与语料库过滤进行韩语方面级情感分析

链接: https://arxiv.org/abs/2407.00342
作者: Kibeom Nam
关键词: Aspect-Based Sentiment Analysis, Sentiment Analysis, Korean restaurant reviews, Investigations into Aspect-Based, Aspect-Based Sentiment
中文关键词: 基于方面的情感分析,情感分析,韩国餐厅评论,对基于方面的研究,基于方面的情感
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, EMNLP 2024 (submitted), DMLR@ICML 2024

Abstract:Investigations into Aspect-Based Sentiment Analysis (ABSA) for Korean restaurant reviews are notably lacking in the existing literature. Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating translated benchmark and unlabeled Korean data. Using a model fine-tuned on translated data, we pseudo-labeled the actual Korean NLI set. Subsequently, we applied LaBSE and MSP-based filtering to this pseudo-NLI set as implicit feature, enhancing Aspect Category Detection and Polarity determination through additional training. Incorporating dual filtering, this model bridged dataset gaps, achieving positive results in Korean ABSA with minimal resources. Through additional data injection pipelines, our approach aims to utilize high-resource data and construct effective models within communities, whether corporate or individual, in low-resource language countries. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy. We release the dataset and our code for Korean ABSA, at this link.
摘要:现有文献中明显缺乏针对韩国餐馆评论的基于方面的情感分析(ABSA)研究。我们的研究为韩语等低资源语言的ABSA提出了一个直观而有效的框架。它通过整合翻译后的基准数据和未标注的韩语数据来优化预测标签。我们使用在翻译数据上微调的模型,对实际的韩语NLI数据集进行伪标注。随后,我们对这个伪NLI数据集应用LaBSE和基于MSP的过滤,将其作为隐式特征,并通过额外训练增强方面类别检测和极性判定。结合双重过滤,该模型弥合了数据集之间的差距,以最少的资源在韩语ABSA中取得了积极的结果。通过额外的数据注入流程,我们的方法旨在利用高资源数据,在低资源语言国家的社区(无论是企业还是个人)内构建有效的模型。与英语ABSA相比,我们的框架在F1分数和准确率上相差约3%。我们通过此链接发布韩语ABSA的数据集和代码。

[NLP-142] Iterative Data Augmentation with Large Language Models for Aspect-based Sentiment Analysis
[NLP-142] 利用大型语言模型进行迭代数据增强,用于基于方面的情感分析

链接: https://arxiv.org/abs/2407.00341
作者: Haiyun Li,Qihuang Zhong,Ke Zhu,Juhua Liu,Bo Du,Dacheng Tao
关键词: Aspect-based Sentiment Analysis, sentiment analysis task, important sentiment analysis, Sentiment Analysis, Aspect-based Sentiment
中文关键词: 基于方面的情感分析,情感分析任务,重要情感分析,情感分析,基于方面的情感
类目: Computation and Language (cs.CL)
备注: Work in process

Abstract:Aspect-based Sentiment Analysis (ABSA) is an important sentiment analysis task, which aims to determine the sentiment polarity towards an aspect in a sentence. Due to the expensive and limited labeled data, data augmentation (DA) has become the standard for improving the performance of ABSA. However, current DA methods usually have some shortcomings: 1) poor fluency and coherence, 2) lack of diversity of generated data, and 3) reliance on some existing labeled data, hindering its applications in real-world scenarios. In response to these problems, we propose a systematic Iterative Data augmentation framework, namely IterD, to boost the performance of ABSA. The core of IterD is to leverage the powerful ability of large language models (LLMs) to iteratively generate more fluent and diverse synthetic labeled data, starting from an unsupervised sentence corpus. Extensive experiments on 4 widely-used ABSA benchmarks show that IterD brings consistent and significant performance gains among 5 baseline ABSA models. More encouragingly, the synthetic data generated by IterD can achieve comparable or even better performance against the manually annotated data.
摘要:基于方面的情感分析(ABSA)是一项重要的情感分析任务,其目的是确定句子中针对某一方面的情感极性。由于标注数据昂贵且有限,数据增强(DA)已成为提高ABSA性能的标准做法。然而,目前的DA方法通常存在以下缺点:1)流畅性和连贯性差;2)生成的数据缺乏多样性;3)依赖于一些已有的标注数据,阻碍了其在实际场景中的应用。针对这些问题,我们提出了一个系统的迭代数据增强框架IterD,以提高ABSA的性能。IterD的核心是利用大型语言模型(LLM)的强大能力,从无监督的句子语料库出发,迭代地生成更流畅、更多样化的合成标注数据。在4个广泛使用的ABSA基准上的大量实验表明,IterD为5个基线ABSA模型带来了一致且显著的性能提升。更令人鼓舞的是,IterD生成的合成数据可以达到与人工标注数据相当甚至更好的性能。

[NLP-143] LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods
[NLP-143] 当LLM生成的自然语言遇上标度定律:新的探索和数据增强方法

链接: https://arxiv.org/abs/2407.00322
作者: Zhenhua Wang,Guang Xu,Ming Ren
关键词: large language models, natural language processing, natural language, witnessed enhancements, LLM
中文关键词: 大型语言模型、自然语言处理、自然语言、见证增强、LLM
类目: Computation and Language (cs.CL)
备注:

Abstract:With the ascent of large language models (LLM), natural language processing has witnessed enhancements, such as LLM-based data augmentation. Nonetheless, prior research harbors two primary concerns: firstly, a lack of contemplation regarding whether the natural language generated by LLM (LLMNL) truly aligns with human natural language (HNL), a critical foundational question; secondly, an oversight that augmented data is randomly generated by LLM, implying that not all data may possess equal training value, that could impede the performance of classifiers. To address these challenges, we introduce the scaling laws to intrinsically calculate LLMNL and HNL. Through extensive experiments, we reveal slight deviations (approximately 0.2 Mandelbrot exponent) from Mandelbrot’s law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style. This establishes a solid foundation for LLM’s expansion. Further, we introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by the conformity to scaling laws to make decisions about GPT-4 augmented data. Extensive experiments, conducted in real-world scenarios, confirms the effectiveness (improving F1 of Bert and RoBerta by 7-10%) and competitiveness (surpassing recent AugGPT and GENCO methods by about 2% accuracy on DeBerta) of ZGPTDA. In addition, we reveal some interesting insights, e.g., Hilberg’s law and Taylor’s law can impart more benefits to text classification, etc.
摘要:随着大型语言模型(LLM)的兴起,自然语言处理得到了增强,例如基于LLM的数据增强。然而,以往的研究存在两个主要问题:第一,缺乏对LLM生成的自然语言(LLMNL)是否真正与人类自然语言(HNL)一致的思考,而这是一个关键的基础性问题;第二,忽视了增强数据是由LLM随机生成的,这意味着并非所有数据都具有同等的训练价值,从而可能损害分类器的性能。为了应对这些挑战,我们引入标度定律来对LLMNL和HNL进行内在的计算。通过大量实验,我们揭示了LLMNL对Mandelbrot定律的轻微偏离(Mandelbrot指数约0.2),强调了HNL在复杂性上的优势,并补充了关于语言风格的解释性讨论。这为LLM的扩展奠定了坚实的基础。此外,我们提出了一种用于少样本文本分类的新型数据增强方法ZGPTDA,它利用由标度定律符合程度驱动的模糊计算机制,对GPT-4增强的数据进行决策。在真实场景中进行的大量实验证实了ZGPTDA的有效性(将Bert和RoBerta的F1提高7-10%)和竞争力(在DeBerta上的准确率比近期的AugGPT和GENCO方法高约2%)。此外,我们还揭示了一些有趣的见解,例如Hilberg定律和Taylor定律可以为文本分类带来更多益处等。
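
For reference, the Zipf-Mandelbrot law that the abstract refers to is conventionally written as below, where $f(r)$ is the frequency of the word at rank $r$, $C$ is a normalizing constant, $\beta$ is a rank offset, and $\alpha$ is the exponent. These symbols follow the common textbook convention and are an assumption here; the paper's own notation may differ, and reading the reported deviation of about 0.2 as a shift in $\alpha$ is likewise an interpretation of the abstract rather than something it states.

```latex
f(r) = \frac{C}{(r + \beta)^{\alpha}}, \qquad \alpha > 0,\ \beta \ge 0
```

Setting $\beta = 0$ recovers the plain Zipf law $f(r) \propto r^{-\alpha}$.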

[NLP-144] LiteSearch: Efficacious Tree Search for LLM
[NLP-144] LiteSearch:高效的LLM树搜索

链接: https://arxiv.org/abs/2407.00320
作者: Ante Wang,Linfeng Song,Ye Tian,Baolin Peng,Dian Yu,Haitao Mi,Jinsong Su,Dong Yu
关键词: Monte Carlo Tree, Recent research suggests, dramatically boost LLM, mathematical reasoning tasks, Monte Carlo
中文关键词: 蒙特卡洛树,最近的研究表明,显著提升LLM,数学推理任务,蒙特卡洛
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.
摘要:最近的研究表明,树搜索算法(如蒙特卡洛树搜索)可以显著提高LLM在复杂数学推理任务上的性能。然而,由于搜索策略存在浪费,它们往往需要贪心解码10倍以上的计算资源,难以部署到实际应用中。针对这一问题,本研究提出了一种新的引导式树搜索算法,采用动态节点选择和节点级探索预算(最大子节点数)计算。通过综合考虑朝向最终答案的搜索进展(历史)和价值网络给出的指导(未来),我们的算法迭代地选择最有希望的树节点,然后在所分配的计算预算范围内对其进行扩展;该价值网络的训练无需任何逐步标注。在GSM8K和TabMWP数据集上进行的实验表明,与基线方法相比,我们的方法不仅性能具有竞争力,而且计算成本显著降低。

[NLP-145] From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
[NLP-145] 从本地概念到普遍性:评估视觉语言模型的多元文化理解

链接: https://arxiv.org/abs/2407.00263
作者: Mehar Bhatia,Sahithya Ravi,Aditya Chinchure,Eunjeong Hwang,Vered Shwartz
关键词: non-western cultures due, performance remains suboptimal, training datasets, recent advancements, remains suboptimal
中文关键词: 由于非西方文化,性能仍然次优,训练数据集,最近的进步,仍然次优
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under peer review

Abstract:Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.
摘要:尽管视觉语言模型最近取得了进步,但由于训练数据集中的代表性不足,它们在非西方文化图像上的表现仍然不佳。人们提出了各种基准来测试模型的文化包容性,但它们对文化的覆盖范围有限,并且没有充分评估普遍以及特定文化的当地概念的文化多样性。为了解决这些限制,我们引入了GlobalRG基准,其中包括两项具有挑战性的任务:跨共性的检索和文化视觉基础。前一项任务需要检索来自50个国家的普遍概念的文化多样性图像,而后一项任务旨在将特定文化的概念植根于来自15个国家的图像中。我们对各种模型的评估表明,不同文化的表现存在显着差异,这凸显了在视觉语言模型中增强多元文化理解的必要性。

[NLP-146] One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts
[NLP-146] 一个提示是不够的:自动构建专家混合提示

链接: https://arxiv.org/abs/2407.00256
作者: Ruochen Wang,Sohyun An,Minhao Cheng,Tianyi Zhou,Sung Ju Hwang,Cho-Jui Hsieh
关键词: Large Language Models, Large Language, Language Models, exhibit strong generalization, strong generalization capabilities
中文关键词: 大型语言模型,大型语言,语言模型,表现出强大的概括性,强大的概括能力
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: ICML 2024. code available at this https URL

Abstract:Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; Each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: Inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior arts across several major benchmarks.
摘要:在给定语言指令和上下文示例(demo)的提示下,大型语言模型(LLM)对新任务表现出很强的泛化能力。由于这种能力敏感地依赖于提示的质量,人们已经探索了各种方法来自动化指令设计。虽然这些方法取得了有希望的结果,但它们将搜索到的提示限制为单条指令。这种简化大大限制了它们的能力,因为单条不含示例的指令可能无法覆盖目标任务的整个复杂问题空间。为了缓解这个问题,我们采用专家混合范式,将问题空间划分为一组子区域;每个子区域由一名专门的专家负责,该专家同时配备一条指令和一组示例。我们设计了一个两阶段过程来为每个区域构建专门的专家:(1)示例分配:受上下文学习与核回归之间理论联系的启发,我们根据语义相似度将示例分组给各个专家;(2)指令分配:以区域为基础,为每个专家联合搜索一条指令,使其与分配给该专家的示例互补,产生协同效应。由此得到的方法称为Mixture-of-Prompts(MoP),在几个主要基准上相对于现有方法实现了81%的平均胜率。

[NLP-147] Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
[NLP-147] 注意缺口:利用基于Transformer的转录分析文本缺漏(lacunae)

链接: https://arxiv.org/abs/2407.00250
作者: Jaydeep Borkar,David A. Smith
关键词: illegible text resulting, documents frequently suffer, storage damage, frequently suffer, illegible text
中文关键词: 导致文本难以辨认,文档经常遭受损失,存储损坏,经常遭受损失,文本难以辨认
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ICDAR 2024 Workshop on Computational Paleography

Abstract:Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.
摘要:历史文献经常存在损坏和不一致的情况,包括由孔洞、墨迹问题和存储损坏等导致的文本缺失或难以辨认。这些缺失的部分或空隙被称为缺漏(lacunae)。在这项研究中,我们采用基于Transformer的光学字符识别(OCR)模型,并以有监督的方式在包含缺漏的合成数据上对其进行训练。我们证明了这些模型在检测和修复缺漏方面的有效性,成功率达到65%,而不了解缺漏的基础模型仅能修复5%。此外,我们还研究了模型的机制性属性,例如转录的对数概率,它可以在不直接检查图像的情况下识别行图像中的缺漏和其他错误(例如,由于复杂书写或墨迹问题导致的误转录)。这种能力对于希望将包含缺漏或错误的图像与干净图像区分开来的学者可能很有价值。虽然我们探索了注意力机制在标记缺漏和转录错误方面的潜力,但我们的发现表明它不是一个重要因素。我们的工作凸显了利用基于Transformer的OCR模型修复或分析受损历史文献这一有前景的方向。
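
The idea of using transcription log probability to spot damaged lines without looking at the image can be sketched as follows. This is a minimal illustration only: the abstract does not say how the scores are aggregated or thresholded, so the mean over tokens, the threshold of -1.0, the function names, and the sample scores are all hypothetical.

```python
def mean_log_prob(token_log_probs):
    """Average per-token log-probability of one transcribed line."""
    return sum(token_log_probs) / len(token_log_probs)

def flag_suspect_lines(lines, threshold=-1.0):
    """Return ids of lines whose mean token log-probability falls below the threshold.

    `lines` maps a line id to the list of token log-probabilities that an OCR
    model assigned to its own transcription. The threshold is a hypothetical
    value and would need tuning on held-out data.
    """
    return [
        line_id
        for line_id, log_probs in lines.items()
        if log_probs and mean_log_prob(log_probs) < threshold
    ]

# Hypothetical scores: the second line contains two very uncertain tokens.
lines = {
    "line_01": [-0.05, -0.10, -0.02, -0.08],
    "line_02": [-0.07, -2.90, -3.40, -0.12],
}

print(flag_suspect_lines(lines))  # ['line_02']
```

Lines flagged this way would still need manual inspection, since a low score can come from a lacuna, a mistranscription, or simply unusual handwriting.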

[NLP-148] DiffuseDef: Improved Robustness to Adversarial Attacks
[NLP-148] DiffuseDef:提升对对抗攻击的鲁棒性

链接: https://arxiv.org/abs/2407.00248
作者: Zhenhao Li,Marek Rei,Lucia Specia
关键词: Pretrained language models, natural language processing, significantly advanced performance, Pretrained language, language processing tasks
中文关键词: 预训练的语言模型、自然语言处理、显着提高的性能、预训练的语言、语言处理任务
类目: Computation and Language (cs.CL)
备注:

Abstract:Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to system built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.
摘要:预训练语言模型显著提升了各种自然语言处理任务的性能。然而,对抗性攻击仍然对使用这些模型构建的系统构成严峻挑战,因为这些系统可能被精心设计的对抗性文本所利用。受扩散模型在计算机视觉中预测和降低噪声能力的启发,我们提出了一种新颖且灵活的、面向语言分类任务的对抗防御方法DiffuseDef,它在编码器和分类器之间引入扩散层作为去噪器。在推理过程中,对抗性隐藏状态首先与采样噪声相结合,然后迭代去噪,最后进行集成,以产生鲁棒的文本表示。通过整合对抗性训练、去噪和集成技术,我们证明了DiffuseDef优于多种现有的对抗防御方法,并在应对常见对抗性攻击时达到了最先进的性能。

[NLP-149] EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models
[NLP-149] EHRmonize:使用大型语言模型从电子健康记录中提取医疗概念的框架

链接: https://arxiv.org/abs/2407.00242
作者: João Matos,Jack Gallifant,Jian Pei,A. Ian Wong
关键词: Electronic health records, significant clinical expertise, requiring significant clinical, Electronic health, costly task requiring
中文关键词: 电子健康记录、重要的临床专业知识、需要重要的临床、电子健康、需要昂贵的任务
类目: Computation and Language (cs.CL)
备注: submitted for review, total of 10 pages

Abstract:Electronic health records (EHRs) contain vast amounts of complex data, but harmonizing and processing this information remains a challenging and costly task requiring significant clinical expertise. While large language models (LLMs) have shown promise in various healthcare applications, their potential for abstracting medical concepts from EHRs remains largely unexplored. We introduce EHRmonize, a framework leveraging LLMs to abstract medical concepts from EHR data. Our study uses medication data from two real-world EHR databases to evaluate five LLMs on two free-text extraction and six binary classification tasks across various prompting strategies. GPT-4o with 10-shot prompting achieved the highest performance in all tasks, accompanied by Claude-3.5-Sonnet in a subset of tasks. GPT-4o achieved an accuracy of 97% in identifying generic route names, 82% for generic drug names, and 100% in performing binary classification of antibiotics. While EHRmonize significantly enhances efficiency, reducing annotation time by an estimated 60%, we emphasize that clinician oversight remains essential. Our framework, available as a Python package, offers a promising tool to assist clinicians in EHR data abstraction, potentially accelerating healthcare research and improving data harmonization processes.
摘要:电子健康记录(EHR)包含大量复杂的数据,但协调和处理这些信息仍然是一项具有挑战性且成本高昂的任务,需要大量的临床专业知识。虽然大型语言模型(LLM)在各种医疗保健应用中展现出了前景,但它们从EHR中抽象医学概念的潜力在很大程度上仍未得到探索。我们介绍了EHRmonize,这是一个利用LLM从EHR数据中抽象医学概念的框架。我们的研究使用来自两个真实EHR数据库的用药数据,在多种提示策略下,针对两项自由文本提取任务和六项二分类任务评估了五个LLM。采用10样本(10-shot)提示的GPT-4o在所有任务中取得了最高的性能,在部分任务中Claude-3.5-Sonnet也同样取得了最高性能。GPT-4o在识别通用给药途径名称方面的准确率为97%,在识别通用药物名称方面为82%,在抗生素二分类方面为100%。虽然EHRmonize显著提高了效率,估计可将标注时间减少60%,但我们强调,临床医生的监督仍然必不可少。我们的框架以Python包的形式提供,是一个有前景的工具,可协助临床医生进行EHR数据抽象,有望加速医疗保健研究并改进数据协调流程。

[NLP-150] Evaluating Human Alignment and Model Faithfulness of LLM Rationale
[NLP-150] 评估LLM理由(rationale)的人类一致性与模型忠实性

链接: https://arxiv.org/abs/2407.00219
作者: Mohsen Fayyaz,Fan Yin,Jiao Sun,Nanyun Peng
关键词: large language models, explain their generations, large language, input texts, texts that reflect
中文关键词: 大型语言模型,解释它们的世代,大型语言,输入文本,反映的文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:We study how well large language models (LLMs) explain their generations with rationales – a set of tokens extracted from the input texts that reflect the decision process of LLMs. We examine LLM rationales extracted with two methods: 1) attribution-based methods that use attention or gradients to locate important tokens, and 2) prompting-based methods that guide LLMs to extract rationales using prompts. Through extensive experiments, we show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales, and demonstrate reasonable alignment with humans even when model performance is poor. We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions. By fine-tuning these models on the corresponding datasets, both prompting and attribution methods demonstrate improved faithfulness. Our study sheds light on more rigorous and fair evaluations of LLM rationales, especially for prompting-based ones.
摘要:我们研究了大型语言模型(LLM)用理由(rationale)解释其生成结果的效果如何;理由是指从输入文本中提取的、反映LLM决策过程的一组token。我们考察了用两种方法提取的LLM理由:1)基于归因的方法,使用注意力或梯度来定位重要的token;2)基于提示的方法,通过提示引导LLM提取理由。通过大量实验,我们表明,基于提示的理由比基于归因的理由更符合人工标注的理由,并且即使在模型性能较差时也能与人类保持合理的一致。此外,我们发现,先前工作中指出的基于提示的方法在忠实性上的局限,可能与其预测坍缩有关。在相应的数据集上微调这些模型后,提示方法和归因方法的忠实性都有所提高。我们的研究有助于对LLM理由(尤其是基于提示的理由)进行更严格、更公平的评估。

[NLP-151] Detection and Measurement of Syntactic Templates in Generated Text
[NLP-151] 生成文本中语法模板的检测和测量

链接: https://arxiv.org/abs/2407.00211
作者: Chantal Shaib,Yanai Elazar,Junyi Jessy Li,Byron C. Wallace
关键词: Recent work, focused on word-level, Recent, templates, word-level features
中文关键词: 最近的工作,重点是单词级、最近的、模板、单词级功能
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning processes such as RLHF. This connection to the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
摘要:最近在评估LLM生成文本多样性方面的工作主要集中在词级特征上。在这里,我们对句法特征进行分析,以刻画模型中超出高频n-gram范围的一般性重复。具体地说,我们定义了句法模板,并表明模型在下游任务中产生模板化文本的比例高于人类参考文本。我们发现,模型生成文本中的大多数(76%)模板可以在预训练数据中找到(相比之下,人类撰写的文本中只有35%),并且这些模板在RLHF等微调过程中不会被覆盖。这种与预训练数据的联系使我们能够在无法获得预训练数据的模型中分析句法模板。我们还发现,以模板作为特征能够区分不同的模型、任务和领域,并且有助于定性评估常见的模型构造。最后,我们展示了模板可以作为一种有用的工具,用于分析LLM对训练数据的风格记忆。

[NLP-152] MetaKP: On-Demand Keyphrase Generation
[NLP-152] MetaKP:按需关键词生成

链接: https://arxiv.org/abs/2407.00191
作者: Di Wu,Xiaoxian Shen,Kai-Wei Chang
关键词: Traditional keyphrase prediction, prediction methods predict, Traditional keyphrase, failing to cater, predict a single
中文关键词: 传统关键短语预测,预测方法预测,传统关键短语,未能迎合,预测单一
类目: Computation and Language (cs.CL)
备注:

Abstract:Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both supervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast, the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
摘要:传统的关键短语预测方法对每个文档只预测一组关键短语,不能满足用户和下游应用的多样化需求。为了弥补这一差距,我们引入了按需关键短语生成,这是一种新的范式,要求关键短语符合特定的高级目标或意图。对于这项任务,我们提出了MetaKP,这是一个大规模的基准测试,包括四个数据集、7500个文档和3760个目标,涉及新闻和生物医学领域,具有人类注释的关键短语。利用MetaKP,我们设计了有监督和无监督的方法,包括多任务微调方法和使用大型语言模型的自我一致性提示方法。这些结果突出了监督微调的挑战,其性能对分布变化的健壮性不强。相比之下,提出的自我一致性提示方法大大提高了大型语言模型的性能,使GPT-4o达到了0.548的SemF1,超过了完全微调的BART-base模型的性能。最后,我们展示了我们的方法作为一般NLP基础设施的潜力,例如它在社交媒体上的流行病事件检测中的应用。
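
The abstract does not spell out how the self-consistency prompting method aggregates model outputs, so the following is only a minimal sketch of the general idea: sample several keyphrase lists for the same document and goal, then keep the phrases that recur across samples. The function name `self_consistent_keyphrases`, the `min_votes` threshold, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def self_consistent_keyphrases(samples, min_votes):
    """Keep keyphrases that appear in at least `min_votes` sampled outputs."""
    votes = Counter()
    for sample in samples:
        # Count each phrase once per sample, ignoring case and outer spaces.
        votes.update({phrase.strip().lower() for phrase in sample})
    kept = [phrase for phrase, n in votes.items() if n >= min_votes]
    # Most-voted first; ties are broken alphabetically for determinism.
    return sorted(kept, key=lambda phrase: (-votes[phrase], phrase))


# Hypothetical outputs from three independent samples of one prompt.
samples = [
    ["vaccine rollout", "public health"],
    ["Vaccine rollout", "supply chain"],
    ["vaccine rollout", "public health", "logistics"],
]
print(self_consistent_keyphrases(samples, min_votes=2))
```

Voting over normalized phrases is the simplest possible aggregation; a real system would probably need fuzzy or semantic matching, since the abstract reports a semantic metric (SemF1).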

[NLP-153] Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach
[NLP-153] GPT-4可以帮助检测停止电子烟意图吗?自动数据注释方法的探索

链接: https://arxiv.org/abs/2407.00167
作者: Sai Krishna Revanth Vuruma,Dezhi Wu,Saborny Sen Gupta,Lucas Aust,Valerie Lookingbill,Wyatt Bellamy,Yang Ren,Erin Kasson,Li-Shiun Chen,Patricia Cavazos-Rehg,Dian Hu,Ming Huang
关键词: use-associated lung injury, United States, EVALI outbreak, vaping use-associated lung, comprehend vaping behaviors
中文关键词: 使用相关肺损伤,美国,EVALI爆发,电子烟使用相关肺,了解电子烟行为
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: Accepted for the AI Applications in Public Health and Social Services workshop at the 22nd International Conference on Artificial Intelligence in Medicine (AIME 2024)

Abstract:In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users’ quit-vaping intentions. Leveraging OpenAI’s latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users’ subtle intentions that may elude human detection.
摘要:近年来,美国使用电子烟或电子烟的人数大幅增加,导致2019年电子烟和电子烟使用相关肺损伤(EVALI)疫情期间导致住院和死亡的病例显著上升,突显了了解电子烟行为并制定有效的戒烟策略的紧迫性。由于社交媒体平台无处不在,全球超过47亿用户使用它们进行连接、通信、新闻和娱乐,其中很大一部分内容与健康有关,从而将社交媒体数据确立为公共卫生研究的宝贵有机数据资源。在这项研究中,我们从Reddit上的一个Vaping子社区中提取了一个样本数据集,以分析用户的戒烟意图。利用OpenAI最新的大型语言模型GPT-4进行句子级戒烟意图检测,将该模型的结果与外行人和临床专家的注释进行了比较。本研究采用零射、一射、少射和连锁式提示等不同的提示策略,编制了8个不同细节程度的提示,向GPT-4进行任务解释,并对不同提示策略的表现进行了评价。这些初步发现强调了GPT-4在社交媒体数据分析方面的潜力,特别是在识别用户可能躲避人类发现的微妙意图方面。

[NLP-154] The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
[NLP-154] Qiyas基准:用阿拉伯语衡量ChatGPT的数学和语言理解能力

链接: https://arxiv.org/abs/2407.00146
作者: Shahad Al-Khalifa,Hend Al-Khalifa
关键词: models pre-trained exclusively, language models pre-trained, Arabic data, Arabic, growing importance
中文关键词: 专门预训练的模型,预训练的语言模型,阿拉伯语数据,阿拉伯语,重要性日益增长
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has led to limited benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models’ mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from a General Aptitude Test (GAT) called the Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-turbo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge, with ChatGPT-4 achieving an overall average accuracy of 64%, while ChatGPT-3.5-turbo achieved an overall accuracy of 49% across the various question types in the Qiyas benchmark. We believe the release of these benchmarks will pave the way for enhancing the mathematical reasoning and language understanding capabilities of future models tailored for the low-resource Arabic language.
摘要:尽管阿拉伯语作为一种全球语言的重要性与日俱增,但明显缺乏专门针对阿拉伯语数据进行预训练的语言模型。这一短缺导致可用于评估语言模型阿拉伯语性能的基准有限。为了弥补这一差距,我们引入了两个新的基准,旨在评估模型的数学推理和阿拉伯语语言理解能力。这些基准来自于一种名为Qiyas考试的通用能力倾向测试(GAT),这是一种在沙特阿拉伯大学招生中广泛使用的标准化考试。为了验证目的,我们在我们的基准上评估了ChatGPT-3.5-turbo和ChatGPT-4的性能。我们的研究结果表明,这些基准构成了一个巨大的挑战,ChatGPT-4的总体平均准确率达到了64%,而ChatGPT-3.5-turbo在Qiyas基准测试中的各种问题类型的总体准确率达到了49%。我们相信,这些基准的发布将为加强未来模型的数学推理和语言理解能力铺平道路,这些模型是为资源少的阿拉伯语量身定做的。

[NLP-155] A Simple Attention-Based Mechanism for Bimodal Emotion Classification
[NLP-155] 基于注意力的简单双模态情绪分类机制

链接: https://arxiv.org/abs/2407.00134
作者: Mazen Elabd,Sardar Jaf
关键词: Big data, learning important features, Big, important features, learning
中文关键词: 大数据,学习重要功能,大的,重要功能,学习
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 5 figures, 5 tables

Abstract:Big data contain rich information for machine learning algorithms to utilize when learning important features during classification tasks. Human beings express their emotion using certain words, speech (tone, pitch, speed) or facial expression. Artificial Intelligence approaches to emotion classification are largely based on learning from textual information. However, public datasets containing text and speech data provide sufficient resources to train machine learning algorithms for the task of emotion classification. In this paper, we present novel bimodal deep learning-based architectures enhanced with an attention mechanism, trained and tested on text and speech data for emotion classification. We report details of different deep learning based architectures and show the performance of each architecture including rigorous error analyses. Our findings suggest that deep learning based architectures trained on different types of data (text and speech) outperform architectures trained only on text or speech. Our proposed attention-based bimodal architecture outperforms several state-of-the-art systems in emotion classification.
摘要:大数据包含了丰富的信息,机器学习算法在分类任务中学习重要特征时可以利用这些信息。人类使用特定的词语、语音(语气、音高、语速)或面部表情来表达自己的情感。人工智能的情感分类方法在很大程度上是基于对文本信息的学习。然而,包含文本和语音数据的公共数据集为训练情感分类的机器学习算法提供了足够的资源。在本文中,我们提出了新的基于深度学习的双模态架构,这些架构通过注意力机制得到增强,并在文本和语音数据上进行了情感分类的训练和测试。我们报告了不同基于深度学习的体系结构的详细信息,并展示了每个体系结构的性能,包括严格的错误分析。我们的发现表明,在不同类型的数据(文本和语音)上训练的基于深度学习的架构优于仅在文本或语音上训练的架构。我们提出的基于注意力的双模态架构在情感分类方面优于几个最先进的系统。

[NLP-156] Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
[NLP-156] Granite函数调用模型:通过颗粒任务的多任务学习引入函数调用能力

链接: https://arxiv.org/abs/2407.00121
作者: Ibrahim Abdelaziz,Kinjal Basu,Mayank Agarwal,Sadhana Kumaravel,Matthew Stallone,Rameswar Panda,Yara Rizk,GP Bhargav,Maxwell Crouse,Chulaka Gunasekara,Shajith Ikbal,Sachin Joshi,Hima Karanam,Vineet Kumar,Asim Munawar,Sumit Neelam,Dinesh Raghu,Udit Sharma,Adriana Meza Soria,Dheeraj Sreedhar,Praveen Venkateswaran,Merve Unuvar,David Cox,Salim Roukos,Luis Lastras,Pavan Kapanipathi
关键词: Large language models, recently shown tremendous, shown tremendous promise, Large language, function calling
中文关键词: 大型语言模型,最近展示了巨大的潜力,展示了巨大的前景,大型语言,函数调用
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.
摘要:大型语言模型(LLM)最近在充当代理系统的主干方面显示出了巨大的前景,它们在SWE-Bench和Agent-Bench等多方面、具有挑战性的基准测试中的表现就证明了这一点。然而,要实现LLM作为自主代理的真正潜力,它们必须学会识别、调用外部工具和应用程序接口(API)并与其交互,以完成复杂的任务。这些任务一起称为函数调用。赋予LLM函数调用能力会带来许多优势,例如访问数据库和知识源中的当前和特定于领域的信息,以及外包可由工具(例如,Python解释器或计算器)可靠执行的任务的能力。虽然在使用LLM进行函数调用方面已经取得了重大进展,但仍然缺乏与GPT、Claude和Gemini等专有LLM性能相当的开放模型。因此,在这项工作中,我们在Apache 2.0许可下推出了GRANITE-20B-FUNCTIONCALLING模型。该模型使用多任务训练方法对函数调用中包含的七个基本任务进行训练,这些任务分别是嵌套函数调用、函数链接、并行函数、函数名称检测、参数-值对检测、下一最佳函数和响应生成。我们在多个域外数据集上进行了综合评估,将GRANITE-20B-FUNCTIONCALLING与超过15个其他最好的专有和开放模型进行了比较。GRANITE-20B-FUNCTIONCALLING在伯克利函数调用排行榜(Berkeley Function Calling Leaderboard)上的所有开放模型中表现最好,总体排名第四。由于用于训练我们的模型的任务和数据集的多样性,我们表明GRANITE-20B-FUNCTIONCALLING在七个不同的评估数据集中对多个任务具有更好的泛化能力。
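
To make the terms "function name detection", "parameter-value pair detection", and "function chaining" more concrete, here is a small sketch of what the consuming side of a function-calling model might look like. The JSON call format, the `TOOLS` registry, and the `"$prev"` placeholder are assumptions made for illustration; they are not the format used by GRANITE-20B-FUNCTIONCALLING.

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def dispatch(call_json):
    """Parse one model-emitted call and run the matching tool."""
    call = json.loads(call_json)
    name = call["name"]               # which function the model chose
    args = call.get("arguments", {})  # its parameter-value pairs
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**args)


def run_chain(calls):
    """Run calls in order; the string "$prev" stands for the previous result."""
    prev = None
    for call_json in calls:
        call = json.loads(call_json)
        args = {key: (prev if value == "$prev" else value)
                for key, value in call.get("arguments", {}).items()}
        prev = dispatch(json.dumps({"name": call["name"], "arguments": args}))
    return prev


print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Keeping the registry explicit means an unknown function name fails loudly instead of being silently ignored, which is usually the safer default when the caller is a language model.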

[NLP-157] Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
[NLP-157] 用于对话中多模式情感识别的高效长距离潜在关系感知图神经网络

链接: https://arxiv.org/abs/2407.00119
作者: Yuntao Shou,Wei Ai,Jiayi Du,Tao Meng,Haiyan Liu
关键词: genuine emotional state, graph neural networks, aims to analyze, multi-modal emotion recognition, analyze the genuine
中文关键词: 真实的情感状态,图神经网络,旨在分析、多模式情感识别、分析真实的
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 3 tables

Abstract:The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
摘要:会话中多模式情感识别的目标是根据会话中的多模式信息分析每个话语的真实情感状态,这是会话理解的关键。现有的方法侧重于使用图神经网络(GNN)来建模会话关系和捕获上下文潜在语义关系。然而,由于GNN的复杂性,现有的方法不能有效地捕捉长距离话语之间的潜在依赖关系,这限制了Merc的性能。本文提出了一种高效的长距离潜在关系感知图神经网络(ELR-GNN),用于会话中的多模式情感识别。具体地说,我们首先使用预先提取的文本、视频和音频特征作为bi-LSTM的输入,以获取上下文语义信息和低层话语特征。然后,我们利用低层话语特征来构建会话情感交互图。为了有效地捕获长距离话语之间的潜在依赖关系,我们使用扩展的广义向前推算法来预测全局话语之间的情感传播,并设计了一个情感关系感知算子来捕捉不同话语之间的潜在语义关联。此外,我们结合早期融合和自适应后期融合机制来融合说话人关系信息和语境之间的潜在依赖信息。最后,我们提取高层语篇特征,并将其反馈到MLP中进行情感预测。大量实验结果表明,ELR-GNN在基准数据集IEMOCAP和MELD上达到了最好的性能,运行时间分别减少了52%和35%。

[NLP-158] OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
[NLP-158] OmniJARVIS:统一的视觉-语言-动作标记化实现开放世界指令跟随智能体

链接: https://arxiv.org/abs/2407.00114
作者: Zihao Wang,Shaofei Cai,Zhancun Mu,Haowei Lin,Ceyao Zhang,Xuejie Liu,Qing Li,Anji Liu,Xiaojian Ma,Yitao Liang
关键词: open-world instruction-following agents, instruction-following agents, open-world Minecraft, behavior, tokens
中文关键词: 开放世界指令跟随代理,指令跟随代理,开放世界我的世界,行为,令牌
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.
摘要:我们提出了一种新的视觉-语言-动作(VLA)模型OmniJARVIS,用于开放世界Minecraft中的开放世界指令跟随智能体。与以往要么向独立的控制器发出文本目标,要么直接产生控制命令的工作相比,OmniJARVIS寻求一条不同的途径,通过对多模态交互数据的统一标记化来确保强大的推理和高效的决策能力。首先,我们引入了一种自监督方法来学习行为编码器,它为行为轨迹 $\tau = \{o_0, a_0, \dots\}$ 产生离散化的令牌,以及以这些令牌为条件的模仿学习(IL)策略解码器。这些额外的行为令牌将被扩充到预先训练的多模态语言模型(MLM)的词汇表中。然后,使用这个编码器,我们将涉及任务指令、记忆、思维、观察、文本响应、行为轨迹等的长期多模态交互打包到统一的令牌序列中,并使用自回归Transformer对它们进行建模。多亏了语义上有意义的行为令牌,最终得到的VLA模型OmniJARVIS可以推理(通过生成思维链)、计划、回答问题和行动(通过为IL策略解码器生成行为令牌)。OmniJARVIS在开放世界的Minecraft中展示了在原子性、程序性和开放式任务的全面集合上的出色表现。我们的分析进一步揭示了交互数据形成、统一标记化及其扩展潜力方面的关键设计原则。

[NLP-159] Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models
[NLP-159] 利用微调小语言模型准确预测配体-蛋白质相互作用亲和力

链接: https://arxiv.org/abs/2407.00111
作者: Ben Fauber
关键词: instruction fine-tuned pretrained, fine-tuned pretrained generative, pretrained generative small, generative small language, small language models
中文关键词: 指令微调预训练、微调预训练生成器、预训练生成器小型、生成器小型语言、小型语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.
摘要:我们描述了使用经过指令微调的预训练生成式小语言模型(SLM)对配体-蛋白质相互作用(LPI)亲和力(也称为药物-靶点相互作用(DTI))的准确预测。我们在零样本设置下对样本外数据上与配体-蛋白质相互作用相关的一系列亲和力值实现了准确预测。仅使用配体的SMILES串和蛋白质的氨基酸序列作为模型输入。我们的结果表明,在准确预测一系列配体-蛋白质相互作用亲和力方面,与基于机器学习(ML)和自由能微扰(FEP+)的方法相比有了明显的改进,这可用于进一步加速针对具有挑战性的治疗靶点的药物发现活动。

[NLP-160] A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling
[NLP-160] 专业字幕场景下的上下文机器翻译案例研究

链接: https://arxiv.org/abs/2407.00108
作者: Sebastian Vincent,Charlotte Prescott,Chris Bayliss,Chris Oakley,Carolina Scarton
关键词: enhance translation quality, Incorporating extra-textual context, translation quality, Incorporating extra-textual, machine translation
中文关键词: 提高翻译质量,融入文本外上下文,翻译质量,融入文本外,机器翻译
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted to EAMT 2024

Abstract:Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.
摘要:正如最近工作中的自动评估所表明的那样,将电影元数据等非文本上下文融入机器翻译(MT)管道可以提高翻译质量。然而,此类系统对工业的积极影响尚未得到证实。我们报告了一项工业案例研究,旨在调查MT在翻译电视字幕的专业场景中的好处,重点关注利用文本外上下文如何影响后期编辑。我们发现,与非上下文模型相比,在纠正MTCue(上下文感知模型)的输出时,后期编辑标记的上下文相关错误明显较少。我们还介绍了对受雇的后编辑的调查结果,该调查强调了上下文不足是MT中一贯观察到的一个重大差距。我们的发现增强了在完全上下文MT中进一步工作的动力。

[NLP-161] UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
[NLP-161] UnUnlearning:Unlearning不足以实现先进生成人工智能中的内容监管

链接: https://arxiv.org/abs/2407.00106
作者: Ilia Shumailov,Jamie Hayes,Eleni Triantafillou,Guillermo Ortiz-Jimenez,Nicolas Papernot,Matthew Jagielski,Itay Yona,Heidi Howard,Eugene Bagdasaryan
关键词: Exact unlearning, allowed a user, user to retract, retract their data, data from machine
中文关键词: 精确的取消学习,允许用户、用户撤回、撤回他们的数据、来自机器的数据
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for removal of impermissible knowledge, i.e., knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce a concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and even exact unlearning schemes are not enough for effective content regulation. We discuss feasibility of ununlearning for modern LLMs and examine broader implications.
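摘要:精确遗忘(exact unlearning)最初是作为一种隐私机制引入的,它允许用户根据请求从机器学习模型中撤回他们的数据。不久之后,不精确的方案被提出,以减轻与精确遗忘相关的不切实际的成本。最近,遗忘常被作为一种移除不被允许的知识的方法来讨论,即模型不应拥有的知识,例如未经许可的受版权保护的信息、不准确的信息或恶意信息。其承诺是,如果模型不具备某种恶意能力,它就不能被用于相关的恶意目的。在本文中,我们重新审视了在大型语言模型(LLM)中使用遗忘的范式,并强调了由上下文学习引起的一种潜在的不一致性。遗忘可以成为训练阶段的有效控制机制,但它并不能阻止模型在推理过程中执行不被允许的行为。我们引入了“反遗忘”(ununlearning)的概念,即被遗忘的知识在上下文中被重新引入,从而使模型实际上能够表现得好像它知道被遗忘的知识一样。因此,我们认为需要对不被允许的知识进行内容过滤,即使是精确遗忘方案也不足以实现有效的内容监管。我们讨论了反遗忘对现代LLM的可行性,并考察了更广泛的影响。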

[NLP-162] Curriculum Learning with Quality-Driven Data Selection
[NLP-162] 质量驱动数据选择的课程学习

链接: https://arxiv.org/abs/2407.00102
作者: Biao Wu,Fang Meng,Ling Chen
关键词: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, impressive multimodal capabilities
中文关键词: 多模式大型语言、大型语言模型、大型语言、多模式大型、令人印象深刻的多模式能力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The impressive multimodal capabilities demonstrated by OpenAI’s GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the instruction data. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our codes, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4
摘要:OpenAI的GPT-4展示了令人印象深刻的多模态能力,这引起了人们对多模态大型语言模型(MLLM)开发的浓厚兴趣。使用机器生成的指令跟随数据对MLLM进行视觉指令调优,已被证明可以增强跨各种任务的零样本能力。然而,在控制指令数据质量方面的探索一直是有限的。目前,MLLM中的数据选择方法往往依赖于单一的、不可靠的分数或使用下游任务进行选择,这既耗时又可能导致对所选评估数据集的潜在过拟合。为了缓解这些局限性,我们提出了一种新的数据选择方法,该方法利用图文相关性和模型困惑度来评估和选择不同质量的数据。该方法利用这两个属性的不同分布,将数据质量映射到二维空间,从而允许根据数据在该分布中的位置来选择数据。通过利用这个空间,我们可以分析用作提示的任务类型设置对数据质量的影响。此外,这个空间可以用来构建不同质量的多阶段子集,以促进课程学习。我们的研究包括在各种数据集上进行的全面实验。结果强调,与使用完整的数据集相比,在五个通常评估的能力方面有实质性的增强。我们的代码、数据和模型可在以下网址公开获得:https://anonymous.4open.science/r/EHIT-31B4
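
The idea of placing each sample in a two-dimensional space (image-text correlation on one axis, model perplexity on the other) and selecting by location can be sketched as follows. The abstract does not say how regions are drawn, so the median split used here, the function name `quadrant_split`, and the toy scores are all assumptions for illustration only.

```python
from statistics import median


def quadrant_split(samples):
    """Group sample ids by quadrant of the (correlation, perplexity) plane.

    samples: list of (sample_id, image_text_corr, perplexity) tuples.
    """
    corr_mid = median(s[1] for s in samples)
    ppl_mid = median(s[2] for s in samples)
    buckets = {}
    for sample_id, corr, ppl in samples:
        key = (
            "high_corr" if corr >= corr_mid else "low_corr",
            "high_ppl" if ppl >= ppl_mid else "low_ppl",
        )
        buckets.setdefault(key, []).append(sample_id)
    return buckets


# Toy scores; real values would come from an image-text scorer and an MLLM.
samples = [("a", 0.9, 5.0), ("b", 0.2, 5.5), ("c", 0.8, 20.0), ("d", 0.1, 30.0)]
print(quadrant_split(samples))
```

A curriculum could then be assembled by ordering the quadrants into training stages; which quadrant counts as "easy" or "high quality" is a design decision that this sketch deliberately leaves open.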

[NLP-163] Enhancing In-Context Learning via Implicit Demonstration Augmentation
[NLP-163] 通过内隐演示增强增强上下文学习

链接: https://arxiv.org/abs/2407.00100
作者: Xiaoling Zhou,Wei Ye,Yidong Wang,Chaoya Jiang,Zhemg Lee,Rui Xie,Shikun Zhang
关键词: enables large pre-trained, pre-trained language models, large pre-trained language, ICL effectiveness heavily, in-context learning
中文关键词: 实现大型预训练、预训练语言模型、大型预训练语言、ICL有效性、上下文学习
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2024 Main; 19 pages, 10 figures

Abstract:The emergence of in-context learning (ICL) enables large pre-trained language models (PLMs) to make predictions for unseen inputs without updating parameters. Despite its potential, ICL’s effectiveness heavily relies on the quality, quantity, and permutation of demonstrations, commonly leading to suboptimal and unstable performance. In this paper, we tackle this challenge for the first time from the perspective of demonstration augmentation. Specifically, we start with enriching representations of demonstrations by leveraging their deep feature distribution. We then theoretically reveal that when the number of augmented copies approaches infinity, the augmentation is approximately equal to a novel logit calibration mechanism integrated with specific statistical properties. This insight results in a simple yet highly efficient method that significantly improves the average and worst-case accuracy across diverse PLMs and tasks. Moreover, our method effectively reduces performance variance among varying demonstrations, permutations, and templates, and displays the capability to address imbalanced class distributions.
摘要:情境学习(ICL)的出现使大型预训练语言模型(PLM)能够在不更新参数的情况下对看不见的输入进行预测。尽管有潜力,但ICL的有效性在很大程度上依赖于演示的质量、数量和排列,通常会导致次优和不稳定的性能。在本文中,我们首次从演示增强的角度来应对这一挑战。具体地说,我们首先通过利用演示的深层功能分布来丰富演示的表示。然后,我们从理论上揭示了当扩展拷贝数趋于无穷大时,扩展近似等于一种结合了特定统计特性的新的Logit校准机制。这种洞察产生了一种简单但高效的方法,显著提高了不同PLM和任务的平均和最差情况的准确性。此外,我们的方法有效地减少了不同演示、排列和模板之间的性能差异,并显示了解决类分布不平衡的能力。

[NLP-164] ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
[NLP-164] ARES:交替强化学习和有监督的微调,通过多样化的人工智能反馈增强多模式思维链推理

链接: https://arxiv.org/abs/2407.00087
作者: Ju-Seung Byun,Jiyun Chun,Jihyung Kil,Andrew Perrault
关键词: Large Multimodal Models, Large Multimodal, comprehending human instructions, demonstrate remarkable results, excel at comprehending
中文关键词: 大型多模式模型,大型多模式,理解人类指令,表现出显着的结果,擅长理解
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GPT-4 and Claude 3 Opus, we can request various types of detailed feedback that are expensive for humans to provide. We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT). This sentence-level feedback allows us to consider individual valuable segments, providing more granular rewards for the RL procedure. Second, we ask the Teacher to correct the wrong reasoning after the RL stage. The RL procedure requires massive efforts for hyperparameter tuning and often generates errors like repetitive words and incomplete sentences. With the correction feedback, we stabilize the RL fine-tuned model through SFT. We conduct experiments on the multi-modal datasets ScienceQA and A-OKVQA to demonstrate the effectiveness of our proposal. ARES rationale reasoning achieves around 70% win rate against baseline models judged by GPT-4o. Additionally, we observe that the improved rationale reasoning leads to a 2.5% increase in inference answer accuracy on average for the multi-modal datasets.
摘要:大型多模态模型(LMM)擅长理解人类的指令,并在广泛的任务范围内展示了显著的结果。人类反馈强化学习(RLHF)和人工智能反馈强化学习(RLAIF)通过将LLM与特定的偏好对齐来进一步细化LLM。这些方法主要针对整段生成内容使用基于排名的反馈。有了先进的人工智能模型(教师),如GPT-4和Claude 3 Opus,我们可以请求各种类型的详细反馈,而这些反馈由人类提供则成本高昂。我们提出了一种交替进行强化学习(RL)和有监督微调(SFT)的两阶段算法ARES。首先,我们要求教师在思维链(CoT)中对每句话对解决问题的贡献程度进行评分。这种句子级反馈允许我们考虑个别有价值的片段,为RL过程提供更细粒度的回报。其次,我们要求教师在RL阶段之后纠正错误的推理。RL过程需要大量的超参数调整,并且经常会产生重复单词和不完整的句子等错误。借助修正反馈,我们通过SFT稳定了RL微调模型。我们在多模态数据集ScienceQA和A-OKVQA上进行了实验,验证了该方法的有效性。在由GPT-4o评判的情况下,ARES的推理依据相对于基线模型达到了70%左右的胜率。此外,我们观察到改进的推理依据使多模态数据集的推理答案准确率平均提高2.5%。

[NLP-165] Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
[NLP-165] Logicbreaks:理解基于规则的推理颠覆的框架

链接: https://arxiv.org/abs/2407.00075
作者: Anton Xue,Avishree Khare,Rajeev Alur,Surbhi Goel,Eric Wong
关键词: subvert language models, propositional Horn logic, language models, models, large language models
中文关键词: 颠覆语言模型、命题Horn逻辑、语言模型、模型、大型语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

Abstract:We study how to subvert language models from following the rules. We model rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if P and Q, then R" for some propositions P, Q, and R. We prove that although transformers can faithfully abide by such rules, maliciously crafted prompts can nevertheless mislead even theoretically constructed models. Empirically, we find that attacks on our theoretical models mirror popular attacks on large language models. Our work suggests that studying smaller theoretical models can help understand the behavior of large language models in rule-based settings like logical reasoning and jailbreak attacks.
摘要:我们研究如何颠覆语言模型,使其不再遵循规则。我们将规则遵循建模为命题Horn逻辑中的推理,这是一个数学系统,其中规则具有“如果P和Q,那么R”的形式,P、Q和R为某些命题。我们证明,尽管Transformer可以忠实地遵守这些规则,但恶意制作的提示仍然可以误导即使是理论上构建的模型。从经验上看,我们发现对理论模型的攻击反映了对大型语言模型的流行攻击。我们的工作表明,研究较小的理论模型可以帮助理解大型语言模型在逻辑推理和越狱攻击等基于规则的环境中的行为。
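
Inference in propositional Horn logic, with rules of the form "if P and Q, then R", is classically computed by forward chaining. The sketch below shows that textbook procedure only; it is not the paper's transformer construction or its attack, and the rule set is a made-up example.

```python
def forward_chain(rules, facts):
    """Return every proposition derivable from `facts` under Horn `rules`.

    rules: list of (premises, conclusion) pairs, e.g. (("P", "Q"), "R").
    facts: iterable of propositions known to be true.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule once all of its premises are known.
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# "if P and Q, then R" and "if R, then S"
rules = [(("P", "Q"), "R"), (("R",), "S")]
print(sorted(forward_chain(rules, {"P", "Q"})))  # ['P', 'Q', 'R', 'S']
```

In this framing, "following the rules" means outputting exactly this derivable set, so a subverted model is one that omits a derivable proposition or asserts one that is not derivable.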

[NLP-166] Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation
[NLP-166] Pistis-RAG:迈向值得信赖的检索增强生成的可扩展级联框架

链接: https://arxiv.org/abs/2407.00072
作者: Yu Bai,Yukai Miao,Li Chen,Dan Li,Yanyu Ren,Hongtao Xie,Ce Yang,Xuhui Cai
关键词: Pistis symbolized good, symbolized good faith, Greek mythology, Pistis symbolized, good faith
中文关键词: 皮蒂斯象征善良,象征诚信,希腊神话,皮蒂斯象征,诚信
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

Abstract:In Greek mythology, Pistis symbolized good faith, trust, and reliability, echoing the core principles of RAG in LLM systems. Pistis-RAG, a scalable multi-stage framework, effectively addresses the challenges of large-scale retrieval-augmented generation (RAG). Each stage plays a distinct role: matching refines the search space, pre-ranking prioritizes semantically relevant documents, and ranking aligns with the large language model’s (LLM) preferences. The reasoning and aggregating stage supports the implementation of complex chain-of-thought (CoT) methods within this cascading structure. We argue that the lack of strong alignment between LLMs and the external knowledge ranking methods used in RAG tasks is relevant to the reliance on the model-centric paradigm in RAG frameworks. A content-centric approach would prioritize seamless integration between the LLMs and external information sources, optimizing the content transformation process for each specific task. Critically, our ranking stage deviates from traditional RAG approaches by recognizing that semantic relevance alone may not directly translate to improved generation. This is due to the sensitivity of the few-shot prompt order, as highlighted in prior work \cite{lu2021fantastically}. Current RAG frameworks fail to account for this crucial factor. We introduce a novel ranking stage specifically designed for RAG systems. It adheres to information retrieval principles while considering the unique business scenario captured by LLM preferences and user feedback. Our approach integrates in-context learning (ICL) methods and reasoning steps to incorporate user feedback, ensuring efficient alignment. Experiments on the MMLU benchmark demonstrate a 9.3% performance improvement. The model and code will be open-sourced on GitHub. Experiments on real-world, large-scale data validate our framework’s scalability.
摘要:在希腊神话中,Pistis象征着诚信、信任和可靠,与LLM系统中RAG的核心原则相呼应。Pistis-RAG是一个可扩展的多阶段框架,有效地解决了大规模检索增强生成(RAG)的挑战。每个阶段都扮演着不同的角色:匹配细化搜索空间,预排序对语义相关的文档进行优先排序,排序则与大型语言模型(LLM)的偏好对齐。推理和聚合阶段支持在该级联结构中实现复杂的思维链(CoT)方法。我们认为,LLM与RAG任务中使用的外部知识排序方法之间缺乏很强的一致性,这与RAG框架中依赖以模型为中心的范式有关。以内容为中心的方法将优先考虑LLM和外部信息源之间的无缝集成,优化每个特定任务的内容转换过程。关键的是,我们的排序阶段偏离了传统的RAG方法,因为我们认识到单靠语义相关性可能不会直接转化为更好的生成。这是由于少样本(few-shot)提示顺序的敏感性,如先前的工作\cite{lu2021fantastically}所强调的那样。目前的RAG框架未能考虑到这一关键因素。我们介绍了一种专门为RAG系统设计的新的排序阶段。它遵循信息检索原则,同时考虑由LLM偏好和用户反馈捕获的独特业务场景。我们的方法集成了上下文学习(ICL)方法和推理步骤,以纳入用户反馈,确保有效的对齐。在MMLU基准上的实验表明,性能有9.3%的提高。该模型和代码将在GitHub上开源。在真实世界、大规模数据上的实验验证了该框架的可扩展性。

[NLP-167] Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization
[NLP-167] 组合推理:通过组合优化在生成人工智能管道中选择原因

链接: https://arxiv.org/abs/2407.00071
作者: Mert Esencan,Tarun Advaith Kumar,Ata Akbari Asanjan,P. Aaron Lott,Masoud Mohseni,Can Unlu,Davide Venturelli,Alan Ho
关键词: Recent Large Language, Large Language Models, Recent Large, Language Models, Large Language
中文关键词: 最近的大型语言,大型语言模型,最近的大型,语言模型,大型语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 13 pages, 3 figures

Abstract:Recent Large Language Models (LLMs) have demonstrated impressive capabilities at tasks that require human intelligence and are a significant step towards human-like artificial intelligence (AI). Yet the performance of LLMs at reasoning tasks has been subpar and the reasoning capability of LLMs is a matter of significant debate. While it has been shown that the choice of the prompting technique to the LLM can alter its performance on a multitude of tasks, including reasoning, the best performing techniques require human-made prompts with the knowledge of the tasks at hand. We introduce a framework for what we call Combinatorial Reasoning (CR), a fully-automated prompting method, where reasons are sampled from an LLM pipeline and mapped into a Quadratic Unconstrained Binary Optimization (QUBO) problem. The framework investigates whether QUBO solutions can be profitably used to select a useful subset of the reasons to construct a Chain-of-Thought style prompt. We explore the acceleration of CR with specialized solvers. We also investigate the performance of simpler zero-shot strategies such as linear majority rule or random selection of reasons. Our preliminary study indicates that coupling a combinatorial solver to generative AI pipelines is an interesting avenue for AI reasoning and elucidates design principles for future CR methods.
摘要:最近的大型语言模型(LLM)在需要人类智能的任务中表现出了令人印象深刻的能力,是向类人人工智能(AI)迈出的重要一步。然而,LLM在推理任务中的表现一直不佳,LLM的推理能力是一个有重大争议的问题。虽然已经表明,对LLM的提示技术的选择可以改变其在包括推理在内的许多任务上的表现,但表现最好的技术需要具有手头任务知识的人工提示。我们介绍了一种称为组合推理(CR)的框架,这是一种全自动提示方法,其中原因从LLM管道中采样并映射到二次无约束二元优化(QUBO)问题。该框架调查QUBO解是否可以有效地用于选择有用的原因子集来构建思维链式提示。我们使用专门的求解器来探索CR的加速。我们还研究了更简单的零样本策略的性能,如线性多数规则或随机选择原因。我们的初步研究表明,将组合求解器耦合到生成式人工智能管道是人工智能推理的一条有趣的途径,并阐明了未来CR方法的设计原则。
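
The abstract above describes mapping sampled reasons into a QUBO problem and using its solution to pick a subset of them. The following is a minimal sketch of that general idea, not the authors' formulation: the coefficient matrix `Q` is entirely hypothetical (diagonal entries stand in for how useful a reason is, off-diagonal entries for redundancy between two reasons), and a brute-force search stands in for the specialized solvers the abstract mentions.

```python
from itertools import product


def solve_qubo(Q):
    """Brute-force minimise x^T Q x over binary vectors x (small n only)."""
    n = len(Q)
    best_x, best_val = None, float("inf")
    for x in product((0, 1), repeat=n):
        val = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Hypothetical coefficients (assumption, not from the paper):
# negative diagonal = reason looks useful, positive off-diagonal = two
# reasons are redundant and should not both be selected.
Q = [
    [-3.0, 2.5, 0.0, 0.0],
    [2.5, -2.0, 0.0, 0.0],
    [0.0, 0.0, -1.5, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]

selected, energy = solve_qubo(Q)
# x[i] == 1 means reason i would be included in the final prompt.
chosen_reasons = [i for i, bit in enumerate(selected) if bit]
```

With these made-up numbers, reasons 0 and 1 are individually useful but penalised when taken together, so the minimiser keeps only the stronger of the two. Exhaustive search is only feasible for a handful of variables, which is presumably why dedicated solvers matter at realistic sizes.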

[NLP-168] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
[NLP-168] 先压缩后服务:以极小开销为数千个LoRA适配器提供服务

链接: https://arxiv.org/abs/2407.00066
作者: Rickard Brüel-Gabrielsson,Jiacheng Zhu,Onkar Bhardwaj,Leshem Choshen,Kristjan Greenewald,Mikhail Yurochkin,Justin Solomon
关键词: Fine-tuning large language, large language models, yielding numerous copies, Fine-tuning large, LLM differing
中文关键词: 微调大型语言、大型语言模型、产生大量副本、微调大型、LLM不同
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRA adapters. We consider compressing adapters individually via SVD and propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 75% of the throughput of serving a single LoRA.
摘要:使用低阶适配器(LORA)对大型语言模型(LLM)进行微调已成为一种常见做法,通常会产生相同LLM的多个副本,只是LORA更新有所不同。这一范例对为每个涉及不同LORA的查询提供实时响应的系统提出了挑战。以前的工作优化了这种系统的设计,但仍然需要不断地加载和卸载LORA,因为在GPU内存中存储数千个LORA是不可行的。为了缓解这个问题,我们调查了在为LORA适配器提供服务时压缩的效果。我们考虑通过奇异值分解来单独压缩适配器,并提出了一种将LORA联合压缩成与LORA特定的尺度矩阵配对的共享基的方法。我们在多达500个LORA上的实验表明,压缩的LORA在保持性能的同时,在实际服务场景中提供了显著的吞吐量提升,其中LORA超过1000个,保持了单个LORA吞吐量的75%。

[NLP-169] A Document-based Knowledge Discovery with Microservices Architecture
[NLP-169] 采用微服务架构的基于文档的知识发现

链接: https://arxiv.org/abs/2407.00053
作者: Habtom Kahsay Gidey,Mario Kesseler,Patrick Stangl,Peter Hillmann,Andreas Karcher
关键词: digitally stored data, organizations lies, conversion of analog, digitally stored, analog data
中文关键词: 数字存储的数据、组织谎言、模拟、数字存储、模拟数据的转换
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:

Abstract:The first step towards digitalization within organizations lies in digitization - the conversion of analog data into digitally stored data. This basic step is the prerequisite for all following activities like the digitalization of processes or the servitization of products or offerings. However, digitization itself often leads to ‘data-rich’ but ‘knowledge-poor’ material. Knowledge discovery and knowledge extraction as approaches try to increase the usefulness of digitized data. In this paper, we point out the key challenges in the context of knowledge discovery and present an approach to addressing these using a microservices architecture. Our solution led to a conceptual design focusing on keyword extraction, similarity calculation of documents, database queries in natural language, and programming language independent provision of the extracted information. In addition, the conceptual design provides referential design guidelines for integrating processes and applications for semi-automatic learning, editing, and visualization of ontologies. The concept also uses a microservices architecture to address non-functional requirements, such as scalability and resilience. The evaluation of the specified requirements is performed using a demonstrator that implements the concept. Furthermore, this modern approach is used in the German patent office in an extended version.
摘要:在组织内部迈向数字化的第一步在于数字化–将模拟数据转换为数字存储数据。这一基本步骤是所有后续活动的先决条件,如流程数字化或产品或产品的服务化。然而,数字化本身往往导致“数据丰富”但“知识贫乏”的材料。作为方法的知识发现和知识提取试图增加数字化数据的有用性。在本文中,我们指出了知识发现环境中的关键挑战,并提出了一种使用微服务体系结构来解决这些挑战的方法。我们的解决方案导致了一个概念设计,重点是关键字提取、文档相似度计算、自然语言数据库查询以及与编程语言无关的提取信息的提供。此外,概念设计还为集成用于半自动学习、编辑和可视化本体的过程和应用程序提供了参考设计指南。该概念还使用微服务体系结构来解决非功能需求,如可伸缩性和弹性。使用实现该概念的演示器来执行指定需求的评估。此外,这种现代方法在德国专利局的扩展版本中也得到了使用。

[NLP-170] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
[NLP-170] 一个队列即可:解决大型语言模型服务中的行头阻塞问题

链接: https://arxiv.org/abs/2407.00047
作者: Archit Patke,Dhemath Reddy,Saurabh Jha,Haoran Qiu,Christian Pinto,Shengkun Cui,Chandra Narayanaswami,Zbigniew Kalbarczyk,Ravishankar Iyer
关键词: increasingly important workload, cloud providers catering, LLM serving, Large language models, Large language
中文关键词: 越来越重要的工作量、云提供商餐饮、LLM服务、大型语言模型、大型语言
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
摘要:大型语言模型(LLM)已成为满足企业和消费者应用的云提供商日益重要的工作负载。来自这些应用程序的LLM推理请求具有在生产设置中必须遵守的端到端延迟SLO。然而,现有的LLM服务系统关注的是诸如请求服务吞吐量或请求执行延迟等优化目标,而不是端到端延迟SLO。由于突发的到达速率和资源不足会造成请求队列中的队头(HOL)阻塞,因此为延迟敏感型请求实现端到端SLO是一项挑战。为了解决上述挑战,我们提出了一种用于LLM服务的多模型队列管理框架QLM。QLM使用随机规划来协调多个LLM服务操作(LSO)的动作,以减少HOL阻塞并最大限度地实现SLO。具体地说,QLM使用以下LSO:模型交换、请求逐出、GPU-CPU状态交换、负载平衡和热模型启动。在异构GPU设备和模型上使用真实LLM服务数据集进行的评估表明,与其他最先进的LLM服务系统相比,QLM在保持或提高设备利用率的同时,将SLO达标率提高了40%-90%,吞吐量提高了20%-400%。
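
To make the head-of-line blocking problem mentioned in the abstract concrete, here is a toy single-server simulation. It is only an illustration under simplifying assumptions: all requests arrive at time 0, service times and deadlines are invented, and the reordering shown is plain earliest-deadline-first, which is not QLM's stochastic-programming approach.

```python
def slo_attainment(requests):
    """Serve requests one at a time in the given order and return the
    fraction that finish by their deadline.

    Each request is (name, service_time, deadline), all arriving at t=0.
    """
    clock, met = 0.0, 0
    for _name, service_time, deadline in requests:
        clock += service_time
        met += clock <= deadline
    return met / len(requests)


# Hypothetical workload: one long job sits in front of two short,
# latency-sensitive requests.
queue = [
    ("batch-job", 10.0, 60.0),
    ("chat-1", 1.0, 3.0),
    ("chat-2", 1.0, 4.0),
]

fifo = slo_attainment(queue)
edf = slo_attainment(sorted(queue, key=lambda r: r[2]))
```

In FIFO order the long job delays both short requests past their deadlines, so only one of three SLOs is met. Serving by earliest deadline meets all three here, because the long job has enough slack to wait.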

[NLP-171] Visual Language Model based Cross-modal Semantic Communication Systems
[NLP-171] 基于视觉语言模型的跨模式语义传播系统

链接: https://arxiv.org/abs/2407.00020
作者: Feibo Jiang,Chuanguo Tang,Li Dong,Kezhi Wang,Kun Yang,Cunhua Pan
关键词: Shannon physical capacity, physical capacity limits, Cross-modal Semantic Communication, transcending the Shannon, Shannon physical
中文关键词: 香农物理能力,物理能力极限,跨模式语义沟通,超越香农,香农物理
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 12 pages, 10 figures

Abstract:Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.
摘要:语义传播是近年来出现的一种新的传播范式,通过创新的语义传播概念,成功地超越了香农的物理容量限制。然而,现有的图像语义通信(ISC)系统在动态环境中面临着一些挑战,包括语义密度低、灾难性遗忘和信噪比不确定。为了应对这些挑战,我们提出了一种基于视觉语言模型的跨通道语义交流(VLM-CSC)系统。VLM-CSC包括三个新的组成部分:(1)跨模式知识库用于在发送端从语义稀疏的图像中提取高密度的文本语义,并在接收端基于文本语义重建原始图像。高密度语义的传输有助于缓解带宽压力。(2)记忆辅助编解码器(MED)采用混合的长/短时记忆机制,使语义编解码器能够克服动态环境中语义特征分布漂移时的灾难性遗忘。(3)噪声注意模块(NAM)采用注意机制,根据信噪比自适应调整语义编码和信道编码,保证了CSC系统的健壮性。实验仿真验证了CSC系统的有效性、适应性和鲁棒性。

[NLP-172] Macroeconomic Forecasting with Large Language Models
[NLP-172] 使用大型语言模型进行宏观经济预测

链接: https://arxiv.org/abs/2407.00890
作者: Andrea Carriero,Davide Pettenuzzo,Shubhranshu Shekhar
关键词: Large Language Models, Language Models, Large Language, comparative analysis evaluating, accuracy of Large
中文关键词: 大型语言模型、语言模型、大型语言、比较分析评估、大型准确性
类目: Econometrics (econ.EM); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database. Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios.
摘要:本文进行了一项比较分析,评估了大型语言模型(LLM)与传统宏观时间序列预测方法的准确性。近年来,LLM因能够捕捉复杂的数据模式并快速适应非常不同的领域而在预测方面激增。然而,与传统方法相比,它们在预测宏观经济时间序列数据方面的有效性仍然是一个值得关注的领域。为了解决这个问题,我们使用FRED-MD数据库作为共同点,对照传统的宏观预测方法对LLM进行了严格评估。我们的研究结果为LLM在预测宏观经济时间序列方面的优势和局限性提供了宝贵的见解,并揭示了它们在现实世界场景中的适用性。

[NLP-173] Decoding moral judgement from text: a pilot study
[NLP-173] 从文本中解码道德判断:一项试点研究

链接: https://arxiv.org/abs/2407.00039
作者: Diana E. Gherman,Thorsten O. Zander
关键词:
中文关键词:
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 7 pages, 2 figures, conference

计算机视觉

[CV-0] Empowering 3D Visual Grounding with Reasoning Capabilities

链接: https://arxiv.org/abs/2407.01525
作者: Chenming Zhu,Tai Wang,Wenwei Zhang,Kai Chen,Xihui Liu
关键词: explicit textual descriptions, reason human intentions, Large Language Model, Multi-modal Large Language, implicit instructions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ECCV24. A comprehensive and hierarchical 3D reasoning grounding benchmark in the era of foundation models. Project page: this https URL

Abstract:Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark ScanReason which provides over 10K question-answer-location pairs from five reasoning types that require the synergization of reasoning and grounding. We further design our approach, ReGround3D, composed of the visual-centric reasoning module empowered by Multi-modal Large Language Model (MLLM) and the 3D grounding module to obtain accurate object locations by looking back to the enhanced geometry and fine-grained details from the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our proposed approach.

[CV-1] MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

链接: https://arxiv.org/abs/2407.01523
作者: Yubo Ma,Yuhang Zang,Liangyu Chen,Meiqi Chen,Yizhu Jiao,Xinze Li,Xinyuan Lu,Ziyu Liu,Yan Ma,Xiaoyi Dong,Pan Zhang,Liangming Pan,Yu-Gang Jiang,Jiaqi Wang,Yixin Cao,Aixin Sun
关键词: long-standing and practical, practical task, Recent Large Vision-Language, Large Vision-Language Models, single-page document understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

Abstract:Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: this https URL

[CV-2] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

链接: https://arxiv.org/abs/2407.01521
作者: Bingliang Zhang,Wenda Chu,Julius Berner,Chenlin Meng,Anima Anandkumar,Yang Song
关键词: solving Bayesian inverse, learned data priors, Bayesian inverse problems, solving Bayesian, recently achieved success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12dB compared to existing methods.
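
For readers unfamiliar with the metric, the reported PSNR gain can be related to mean squared error with the standard definition PSNR = 10·log10(MAX² / MSE). The short sketch below only works through that arithmetic. It assumes pixel values normalised to a peak of 1.0 and is not taken from the paper's evaluation code.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


# A 9.12 dB gain corresponds to roughly an 8x reduction in MSE.
ratio = mse_ratio_from_gain(9.12)
```

Because the scale is logarithmic, every additional 10 dB means a tenfold drop in MSE, so a gain of about 9 dB is substantial.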

[CV-3] DiffIR2VR-Zero: Zero-Shot Video Restoration with Diffusion-based Image Restoration Models

链接: https://arxiv.org/abs/2407.01519
作者: Chang-Han Yeh,Chin-Yang Lin,Zhixiang Wang,Chi-Wei Hsiao,Ting-Hsuan Chen,Yu-Lun Liu
关键词: pre-trained image restoration, image restoration diffusion, paper introduces, pre-trained image, video restoration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:This paper introduces a method for zero-shot video restoration using pre-trained image restoration diffusion models. Traditional video restoration methods often need retraining for different settings and struggle with limited generalization across various degradation types and datasets. Our approach uses a hierarchical token merging strategy for keyframes and local frames, combined with a hybrid correspondence mechanism that blends optical flow and feature-based nearest neighbor matching (latent merging). We show that our method not only achieves top performance in zero-shot video restoration but also significantly surpasses trained models in generalization across diverse datasets and extreme degradations (8× super-resolution and high-standard-deviation video denoising). We present evidence through quantitative metrics and visual comparisons on various challenging datasets. Additionally, our technique works with any 2D restoration diffusion model, offering a versatile and powerful tool for video enhancement tasks without extensive retraining. This research leads to more efficient and widely applicable video restoration technologies, supporting advancements in fields that require high-quality video output. See our project page for video results at this https URL.

[CV-4] Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

链接: https://arxiv.org/abs/2407.01518
作者: Hao Dong,Eleni Chatzi,Olga Fink
关键词: Multimodal Open-Set Domain, open-set domain generalization, open-set domain, involves recognizing, Multimodal Jigsaw Puzzles
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ECCV 2024, code: this https URL

Abstract:The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to also tackle the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at this https URL.

[CV-5] E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness

链接: https://arxiv.org/abs/2407.01516
作者: Robin Courant,Nicolas Dufour,Xi Wang,Marc Christie,Vicky Kalogeiton
关键词: Stories and emotions, directing decisions, movement over time, emotions in movies, movies emerge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024. Project page: this https URL

Abstract:Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper, we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train on the E.T. dataset CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.

[CV-6] MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

链接: https://arxiv.org/abs/2407.01509
作者: Yusu Qian,Hanrong Ye,Jean-Philippe Fauconnier,Peter Grasch,Yinfei Yang,Zhe Gan
关键词: large language models, evaluate multimodal large, multimodal large language, introduce MIA-Bench, language models
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

Abstract:We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

[CV-7] FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

链接: https://arxiv.org/abs/2407.01494
作者: Yiming Zhang,Yicheng Gu,Yanhong Zeng,Zhening Xing,Yuancheng Wang,Zhizheng Wu,Kai Chen
关键词: study Neural Foley, Neural Foley, immersive audio-visual experience, study Neural, sound effects synchronizing
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Project page: this https URL

Abstract:We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and codes are available at this https URL.

[CV-8] Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

链接: https://arxiv.org/abs/2407.01491
作者: Siwei Li,Yifan Yang,Yifei Shen,Fangyun Wei,Zongqing Lu,Lili Qiu,Yuqing Yang
关键词: Efficient fine-tuning plays, Efficient fine-tuning, low-rank adaptation emerging, modern large models, fine-tuning plays
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA’s expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. Extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be released at this https URL very soon.

[CV-9] The Balanced-Pairwise-Affinities Feature Transform

链接: https://arxiv.org/abs/2407.01467
作者: Daniel Shalam,Simon Korman
关键词: facilitate downstream matching, grouping related tasks, designed to upgrade, items to facilitate, facilitate downstream
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2204.03065

Abstract:The Balanced-Pairwise-Affinities (BPA) feature transform is designed to upgrade the features of a set of input items to facilitate downstream matching or grouping related tasks. The transformed set encodes a rich representation of high order relations between the input features. A particular min-cost-max-flow fractional matching problem, whose entropy regularized version can be approximated by an optimal transport (OT) optimization, leads to a transform which is efficient, differentiable, equivariant, parameterless and probabilistically interpretable. While the Sinkhorn OT solver has been adapted extensively in many contexts, we use it differently by minimizing the cost between a set of features and itself and using the transport plan’s rows as the new representation. Empirically, the transform is highly effective and flexible in its use and consistently improves networks it is inserted into, in a variety of tasks and training schemes. We demonstrate state-of-the-art results in few-shot classification, unsupervised image clustering and person re-identification. Code is available at this http URL.
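
The core step described above, running Sinkhorn on the cost between a feature set and itself and reading off the rows of the transport plan, can be sketched in a few lines of pure Python. This is a rough illustration rather than the authors' implementation: the squared Euclidean cost, the masking of self-matches, the regularisation strength `eps`, the iteration count, and the function name `bpa_like_transform` are all assumptions made for the example.

```python
import math


def bpa_like_transform(features, eps=0.5, iters=500):
    """Return an n x n plan whose i-th row is the new representation of item i."""
    n = len(features)
    # Gibbs kernel exp(-cost / eps); self-matches are masked out (entry 0).
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                cost = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                K[i][j] = math.exp(-cost / eps)
    # Sinkhorn iterations: alternately rescale so rows and columns sum to 1.
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(iters):
        u = [1.0 / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [1.0 / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]


# Four hypothetical 2-D items forming two tight pairs.
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
T = bpa_like_transform(points)
```

Each row of `T` sums to about 1 and places most of its mass on the item's nearest neighbour, so an item is described by its balanced affinities to every other item rather than by its raw coordinates.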

[CV-10] ColPali: Efficient Document Retrieval with Vision Language Models

链接: https://arxiv.org/abs/2407.01449
作者: Manuel Faysse,Hugues Sibille,Tony Wu,Gautier Viaud,Céline Hudelot,Pierre Colombo
关键词: Retrieval Augmented Generation, document retrieval, visually rich structures, information through text, modern document retrieval
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

Abstract:Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
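
The abstract mentions a late interaction matching mechanism without spelling it out. A common form of late interaction, used by ColBERT-style retrievers, scores a query against a document by taking, for each query vector, its best match among the document vectors and summing these maxima. The sketch below assumes that form. The embeddings are made-up 2-D vectors, and the function name is hypothetical.

```python
def late_interaction_score(query_vecs, page_vecs):
    """Sum over query vectors of the best dot-product match among page vectors."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


# Hypothetical embeddings: one vector per query token and per page patch.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.6, 0.4], [0.4, 0.3]]

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
```

Here `page_a` scores higher because each query vector finds a closely aligned patch in it. Keeping one vector per patch instead of pooling the page into a single vector is what allows this fine-grained matching.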

[CV-11] FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

Link: https://arxiv.org/abs/2407.01445
Authors: Xiyuan Wei, Fanjiang Ye, Ori Yonay, Xingyu Chen, Baixi Sun, Dingwen Tao, Tianbao Yang
Keywords: Contrastive Language-Image Pretraining, large batch size, Existing studies, Language-Image Pretraining, data involve hundreds
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: 23 pages

Abstract:Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds of or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been demonstrated effective for removing the requirement of large batch size, their performance on large-scale data remains underexplored and not optimized. To bridge the gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques while designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rules of the temperature parameter and the model parameters, respectively. Experiments on different strategies for each component shed light on how to conduct CLIP training more efficiently. Finally, we benchmark the performance of FastCLIP and the state-of-the-art training baseline (OpenCLIP) on different compute scales up to 32 GPUs on 8 nodes, and three data scales ranging from 2.7 million, 9.1 million to 315 million image-text pairs to demonstrate the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at this https URL .

[CV-12] AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction

Link: https://arxiv.org/abs/2407.01436
Authors: Dubing Chen, Wencheng Han, Jin Fang, Jianbing Shen
Keywords: Challenge at CVPR, Open-Occ Dataset Challenge, Flow Prediction track, Dataset Challenge, technical report
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: 2nd Place in the 3D Occupancy and Flow Prediction Challenge (CVPR24)

Abstract:In this technical report, we present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ Dataset Challenge at CVPR 2024. Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling. Initially, we independently train the occupancy model, followed by flow prediction using sequential frame integration. Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth. Experimental results on the nuScenes dataset demonstrate significant improvements in accuracy and robustness, showcasing the effectiveness of our approach in real-world scenarios. Our single model based on Swin-Base ranks second on the public leaderboard, validating the potential of our method in advancing autonomous car perception systems.

[CV-13] Scarecrow monitoring system: employing mobilenet ssd for enhanced animal supervision

Link: https://arxiv.org/abs/2407.01435
Authors: Balaji VS, Mahi AR, Anirudh Ganapathy PS, Manju M
Keywords: Mobile Net SSD, SSD Mobile Net, wildlife wreaking havoc, Mobile Net, Net SSD model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 10 figures

Abstract:Agriculture faces a growing challenge with wildlife wreaking havoc on crops, threatening sustainability. The project employs advanced object detection: the system utilizes the Mobile Net SSD model for real-time animal classification. The methodology initiates with the creation of a dataset, where each animal is represented by annotated images. The SSD Mobile Net architecture facilitates the use of a model for image classification and object detection. The model undergoes fine-tuning and optimization during training, enhancing accuracy for precise animal classification. Real-time detection is achieved through a webcam and the OpenCV library, enabling prompt identification and categorization of approaching animals. By seamlessly integrating intelligent scarecrow technology with object detection, this system offers a robust solution to field protection, minimizing crop damage and promoting precision farming. It represents a valuable contribution to agricultural sustainability, addressing the challenge of wildlife interference with crops. The implementation of the Intelligent Scarecrow Monitoring System stands as a progressive tool for proactive field management and protection, empowering farmers with an advanced solution for precision agriculture. Keywords: Machine learning, Deep Learning, Computer Vision, MobileNet SSD

[CV-14] FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Link: https://arxiv.org/abs/2407.01425
Authors: Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Luming Liang
Keywords: generating high-quality images, images and videos, largely due, facto choice, choice for generating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased size of these models leads to higher inference costs, making them less attractive for real-time applications. We present Fast-FORward CAching (FORA), a simple yet effective approach designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. Experiments show that FORA can speed up diffusion transformers several times over while only minimally affecting performance metrics such as the IS Score and FID. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications. Code will be made publicly available at: this https URL.
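
The caching mechanism described above can be pictured as a wrapper that recomputes a layer's output only every few denoising steps and reuses the stored result in between. The following is a rough sketch under stated assumptions, not the paper's implementation: the fixed recompute `interval`, the class name `CachedLayer` and the stand-in `layer_fn` are all hypothetical.

```python
class CachedLayer:
    """Reuse a layer's output across denoising steps.

    Assumption: the output is refreshed whenever `step % interval == 0`
    and the cached value is returned on every other step.
    """

    def __init__(self, layer_fn, interval):
        self.layer_fn = layer_fn
        self.interval = interval
        self.cache = None
        self.calls = 0  # number of real (non-cached) evaluations

    def __call__(self, x, step):
        if self.cache is None or step % self.interval == 0:
            self.cache = self.layer_fn(x)
            self.calls += 1
        return self.cache


# Stand-in for an expensive attention or MLP block.
layer = CachedLayer(lambda x: x * 2, interval=3)
outputs = [layer(step, step) for step in range(10)]
# The layer is evaluated at steps 0, 3, 6 and 9 only.
```

The saving comes from skipped evaluations; the cost is that cached outputs ignore how the input changed since the last refresh, which is why quality metrics need to be monitored.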

[CV-15] StyleShot: A Snapshot on Any Style

Link: https://arxiv.org/abs/2407.01414
Authors: Junyao Gao, Yanchen Liu, Yanan Sun, Yinhao Tang, Yanhong Zeng, Kai Chen, Cairong Zhao
Keywords: crucial and sufficient, sufficient for generalized, good style representation, generalized style transfer, style
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:In this paper, we show that a good style representation is crucial and sufficient for generalized style transfer without test-time tuning. We achieve this by constructing a style-aware encoder and a well-organized style dataset called StyleGallery. With a dedicated design for style learning, this style-aware encoder is trained to extract expressive style representations with a decoupling training strategy, and StyleGallery enables the generalization ability. We further employ a content-fusion encoder to enhance image-driven style transfer. We highlight that our approach, named StyleShot, is simple yet effective in mimicking various desired styles, e.g., 3D, flat, abstract or even fine-grained styles, without test-time tuning. Rigorous experiments validate that StyleShot achieves superior performance across a wide range of styles compared to existing state-of-the-art methods. The project page is available at: this https URL.

[CV-16] Semantic Compositions Enhance Vision-Language Contrastive Learning

Link: https://arxiv.org/abs/2407.01408
Authors: Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi
Keywords: vision-language contrastive learning, leverage within-batch non-matching, within-batch non-matching pairs, contrastive learning, matched image-caption pairs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
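
As a rough illustration of building a composite image-caption pair, the sketch below takes the left half of one image and the right half of another and joins the two captions. The abstract only states that captions are fused and 50% of each image is blended; the vertical split, the " and " conjunction and the function name `compose_pair` are assumptions, not details from the paper.

```python
def compose_pair(img_a, cap_a, img_b, cap_b):
    """Build one composite sample from two image-caption pairs.

    Images are lists of pixel rows with identical shapes. Assumption:
    the left half comes from image A and the right half from image B.
    """
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("images must have the same shape")
    half = len(img_a[0]) // 2
    image = [row_a[:half] + row_b[half:] for row_a, row_b in zip(img_a, img_b)]
    caption = f"{cap_a} and {cap_b}"
    return image, caption


# Toy 2x4 single-channel images filled with 1s and 2s.
cat_img = [[1, 1, 1, 1], [1, 1, 1, 1]]
dog_img = [[2, 2, 2, 2], [2, 2, 2, 2]]
image, caption = compose_pair(cat_img, "a cat", dog_img, "a dog")
```

Since the composite replaces an ordinary sample rather than adding one, the batch size and model stay unchanged, which matches the abstract's claim of no extra computational overhead.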

[CV-17] GalLoP: Learning Global and Local Prompts for Vision-Language Models

Link: https://arxiv.org/abs/2407.01400
Authors: Marc Lafon, Elias Ramzi, Clément Rambour, Nicolas Audebert, Nicolas Thome
Keywords: adapt vision-language models, efficiently adapt vision-language, few-shot image classification, Prompt learning, prompt learning methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: To be published at ECCV 2024

Abstract:Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off classification accuracy against robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new "prompt dropout" technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods on accuracy on eleven datasets in different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

[CV-18] Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

Link: https://arxiv.org/abs/2407.01397
Authors: Matteo Mosconi, Andriy Sorokin, Aniello Panariello, Angelo Porrello, Jacopo Bonato, Marco Cotogni, Luigi Sabetta, Simone Calderara, Rita Cucchiara
Keywords: deep learning models, efficiently and effectively, skeletal data, data allows deep, models to perform
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted at ICPR 2024

Abstract:The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at this https URL.

[CV-19] Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Link: https://arxiv.org/abs/2407.01394
Authors: Pooya Fayyazsanavi, Antonios Anastasopoulos, Jana Košecká
Keywords: spoken text presents, text presents unique, presents unique challenges, unique challenges owing, expression nuances
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function that exploits gloss translation ambiguities, significantly improving the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.

[CV-20] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Link: https://arxiv.org/abs/2407.01392
Authors: Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann
Keywords: per-token noise levels, presents Diffusion Forcing, independent per-token noise, paper presents Diffusion, Diffusion Forcing
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:

Abstract:This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling-out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing’s variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing
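
The phrase "independent per-token noise levels" can be illustrated with a toy forward-noising step in which every token of a sequence draws its own noise level. This is only a sketch: scalar tokens, the linear `alpha_bar` schedule, the uniform sampling of levels and all function names are assumptions rather than the paper's training recipe.

```python
import math
import random


def alpha_bar(level, num_levels):
    """Toy schedule (assumed): 1.0 at level 0 (clean), 0.0 at the last level."""
    return 1.0 - level / num_levels


def noise_token(x, level, num_levels, eps):
    """Standard forward-diffusion mix of a clean value x and Gaussian noise eps."""
    a = alpha_bar(level, num_levels)
    return math.sqrt(a) * x + math.sqrt(1.0 - a) * eps


def noise_sequence(tokens, num_levels, rng):
    """Give each token its own independently sampled noise level."""
    levels = [rng.randint(0, num_levels) for _ in tokens]
    noisy = [
        noise_token(x, k, num_levels, rng.gauss(0.0, 1.0))
        for x, k in zip(tokens, levels)
    ]
    return noisy, levels


rng = random.Random(0)
noisy_tokens, token_levels = noise_sequence([0.5, -1.0, 2.0, 0.0], 10, rng)
```

A denoiser trained on such inputs sees mixtures of nearly clean and heavily noised tokens in the same sequence, which is what lets past tokens stay only partially diffused while future ones are generated.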

[CV-21] TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

Link: https://arxiv.org/abs/2407.01375
Authors: André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida
Keywords: Unsupervised domain adaptation, image-based UDA techniques, Unsupervised domain, Transferable-guided Attention Block, Domain Transferable-guided Attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Unsupervised domain adaptation (UDA) in videos is a challenging task that remains not well explored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has still been little explored. Our key idea is to use the transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge from different backbones. To improve the transferability of ViT, we introduce a novel and effective module named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments on UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets with different backbones, like ResNet101, I3D, and STAM, verify the effectiveness of TransferAttn compared with state-of-the-art approaches. Also, we demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both video and image domains. The code will be made freely available.

[CV-22] Hyperspectral Pansharpening: Critical Review, Tools and Future Perspectives

Link: https://arxiv.org/abs/2407.01355
Authors: Matteo Ciotola, Giuseppe Guarino, Gemine Vivone, Giovanni Poggi, Jocelyn Chanussot, Antonio Plaza, Giuseppe Scarpa
Keywords: Hyperspectral pansharpening consists, low-resolution hyperspectral image, low-resolution hyperspectral, high-resolution panchromatic band, Hyperspectral pansharpening
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Notes:

Abstract:Hyperspectral pansharpening consists of fusing a high-resolution panchromatic band and a low-resolution hyperspectral image to obtain a new image with high resolution in both the spatial and spectral domains. These remote sensing products are valuable for a wide range of applications, driving ever growing research efforts. Nonetheless, results still do not meet application demands. In part, this comes from the technical complexity of the task: compared to multispectral pansharpening, many more bands are involved, in a spectral range only partially covered by the panchromatic component and with overwhelming noise. However, another major limiting factor is the absence of a comprehensive framework for the rapid development and accurate evaluation of new methods. This paper attempts to address this issue. We started by designing a dataset large and diverse enough to allow reliable training (for data-driven methods) and testing of new methods. Then, we selected a set of state-of-the-art methods, following different approaches, characterized by promising performance, and reimplemented them in a single PyTorch framework. Finally, we carried out a critical comparative analysis of all methods, using the most accredited quality indicators. The analysis highlights the main limitations of current solutions in terms of spectral/spatial quality and computational efficiency, and suggests promising research directions. To ensure full reproducibility of the results and support future research, the framework (including codes, evaluation procedures and links to the dataset) is shared on this https URL, as a single Python-based reference benchmark toolbox.

[CV-23] PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

Link: https://arxiv.org/abs/2407.01349
Authors: Xuan Yu, Yili Liu, Chenrui Han, Sitong Mao, Shunbo Zhou, Rong Xiong, Yiyi Liao, Yue Wang
Keywords: challenging task, Panoptic reconstruction, instance, Panoptic, segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:

Abstract:Panoptic reconstruction is a challenging task in 3D scene understanding. However, most existing methods heavily rely on pre-trained semantic segmentation models and known 3D object bounding boxes for 3D panoptic segmentation, which is not available for in-the-wild scenes. In this paper, we propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes. For zero-shot segmentation, we leverage open-vocabulary instance segmentation, but it has to face partial labeling and instance association challenges. We tackle both challenges by propagating partial labels with the aid of dense generalized features and building a 3D instance graph for associating 2D instance IDs. Specifically, we exploit partial labels to learn a classifier for generalized semantic features to provide complete labels for scenes with dense distilled features. Moreover, we formulate instance association as a 3D instance graph segmentation problem, allowing us to fully utilize the scene geometry prior and all 2D instance masks to infer global unique pseudo 3D instance ID. Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360, demonstrating the effectiveness of our graph segmentation method and reconstruction network.

[CV-24] AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition

Link: https://arxiv.org/abs/2407.01332
Authors: Fadi Boutros, Vitomir Štruc, Naser Damer
Keywords: compact student model, high-performing teacher model, aims at improving, improving the performance, student model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at ECCV 2024

Abstract:Knowledge distillation (KD) aims at improving the performance of a compact student model by distilling the knowledge from a high-performing teacher model. In this paper, we present an adaptive KD approach, namely AdaDistill, for deep face recognition. The proposed AdaDistill embeds the KD concept into the softmax loss by training the student using a margin penalty softmax loss with distilled class centers from the teacher. Being aware of the relatively low capacity of the compact student model, we propose to distill less complex knowledge at an early stage of training and more complex one at a later stage of training. This relative adjustment of the distilled knowledge is controlled by the progression of the learning capability of the student over the training iterations without the need to tune any hyper-parameters. Extensive experiments and ablation studies show that AdaDistill can enhance the discriminative learning capability of the student and demonstrate superiority over various state-of-the-art competitors on several challenging benchmarks, such as IJB-B, IJB-C, and ICCV2021-MFR

[CV-25] Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

Link: https://arxiv.org/abs/2407.01331
Authors: Jayneel Parekh, Quentin Bouniot, Pavlo Mozharovskyi, Alasdair Newson, Florence d’Alché-Buc
Keywords: Developing inherently interpretable, Developing inherently, inherently interpretable models, recent years, gained prominence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Project page available at this https URL

Abstract:Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, is valued because of the closeness of concept representations to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounters major limitations, especially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at this https URL

[CV-26] Learning Unsigned Distance Fields from Local Shape Functions for 3D Surface Reconstruction

Link: https://arxiv.org/abs/2407.01330
Authors: Jiangbei Hu, Yanggeng Li, Fei Hou, Junhui Hou, Zhebin Zhang, Shengfa Wang, Na Lei, Ying He
Keywords: Unsigned distance fields, Unsigned distance, distance fields, provide a versatile, encompassing both watertight
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 11 figures

Abstract:Unsigned distance fields (UDFs) provide a versatile framework for representing a diverse array of 3D shapes, encompassing both watertight and non-watertight geometries. Traditional UDF learning methods typically require extensive training on large datasets of 3D shapes, which is costly and often necessitates hyperparameter adjustments for new datasets. This paper presents a novel neural framework, LoSF-UDF, for reconstructing surfaces from 3D point clouds by leveraging local shape functions to learn UDFs. We observe that 3D shapes manifest simple patterns within localized areas, prompting us to create a training dataset of point cloud patches characterized by mathematical functions that represent a continuum from smooth surfaces to sharp edges and corners. Our approach learns features within a specific radius around each query point and utilizes an attention mechanism to focus on the crucial features for UDF estimation. This method enables efficient and robust surface reconstruction from point clouds without the need for shape-specific training. Additionally, our method exhibits enhanced resilience to noise and outliers in point clouds compared to existing methods. We present comprehensive experiments and comparisons across various datasets, including synthetic and real-scanned point clouds, to validate our method’s efficacy.
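
For readers unfamiliar with unsigned distance fields, the quantity being learned is simply the distance from a query point to the nearest point on the surface. The brute-force sketch below computes that value against a raw point cloud, restricted to a neighbourhood radius in the spirit of the local patches described above. It is not the LoSF-UDF network: the function name, the `radius` parameter and the fallback value are hypothetical.

```python
import math


def unsigned_distance(query, points, radius=None):
    """Distance from `query` to the nearest point of a point cloud.

    If `radius` is given, only points within that radius are considered
    and `radius` itself is returned when the neighbourhood is empty
    (an assumed convention for this sketch).
    """
    dists = [math.dist(query, p) for p in points]
    if radius is not None:
        dists = [d for d in dists if d <= radius]
        if not dists:
            return radius
    return min(dists)


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
on_surface = unsigned_distance((1.0, 0.0, 0.0), cloud)   # 0.0
off_surface = unsigned_distance((0.0, 0.0, 2.0), cloud)  # 2.0
```

Unlike a signed field, the value carries no inside/outside information, which is what allows open, non-watertight surfaces to be represented.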

[CV-27] CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

Link: https://arxiv.org/abs/2407.01328
Authors: Danial Qashqai, Emad Mousavian, Shahriar Baradaran Shokouhi, Sattar Mirzakuchaki
Keywords: complex visual interpretation, vehicle vision systems, autonomous vehicle vision, Semantic segmentation, multimodal semantic segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at this https URL.

[CV-28] Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks

Link: https://arxiv.org/abs/2407.01327
Authors: Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescós
Keywords: unsupervised domain adaptation, significant class imbalance, class imbalance remains, addressing the challenge, open issue
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, turning class imbalanced techniques used for other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class-weights into the UDA learning losses, but with the novelty of estimating these weights dynamically through the loss gradient, defining a Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by large-represented classes, and has the advantage of being able to automatically and quickly adapt to the iteration training outcomes, avoiding explicitly curricular learning patterns common in loss-weighing strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). Analysing the source of advantage, GBW consistently increases the recall of low represented classes.

[CV-29] ToCoAD: Two-Stage Contrastive Learning for Industrial Anomaly Detection

Link: https://arxiv.org/abs/2407.01312
Authors: Yun Liang, Zhiguang Hu, Junjie Huang, Donglin Di, Anyang Su, Lei Fan
Keywords: Current unsupervised anomaly, unsupervised anomaly detection, anomaly detection approaches, detection approaches perform, Current unsupervised
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 7 figures

Abstract:Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called ToCoAD. In the first stage, a discriminative network is trained by using synthetic anomalies in a self-supervised learning manner. This network is then utilized in the second stage to provide a negative feature guide, aiding in the training of the feature extractor through bootstrap contrastive learning. This approach enables the model to progressively learn the distribution of anomalies specific to industrial datasets, effectively enhancing its generalizability to various types of anomalies. Extensive experiments are conducted to demonstrate the effectiveness of our proposed two-stage training strategy, and our model produces competitive performance, achieving pixel-level AUROC scores of 98.21%, 98.43% and 97.70% on MVTec AD, VisA and BTAD respectively.

[CV-30] Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces

Link: https://arxiv.org/abs/2407.01310
Authors: Perusha Moodley, Pramod Kaushik, Dhillu Thambi, Mark Trovinger, Praveen Paruchuri, Xia Hong, Benjamin Rosman
Keywords: Decision Transformer architectures, Decision Transformer, enhanced Decision Transformer, existing Decision Transformer, multi-discrete action spaces
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Decision Transformers, in their vanilla form, struggle to perform on image-based environments with multi-discrete action spaces. Although enhanced Decision Transformer architectures have been developed to improve performance, these methods have not specifically addressed this problem of multi-discrete action spaces which hampers existing Decision Transformer architectures from learning good representations. To mitigate this, we propose Multi-State Action Tokenisation (M-SAT), an approach for tokenising actions in multi-discrete action spaces that enhances the model’s performance in such environments. Our approach involves two key changes: disentangling actions to the individual action level and tokenising the actions with auxiliary state information. These two key changes also improve individual action level interpretability and visibility within the attention layers. We demonstrate the performance gains of M-SAT on challenging ViZDoom environments with multi-discrete action spaces and image-based state spaces, including the Deadly Corridor and My Way Home scenarios, where M-SAT outperforms the baseline Decision Transformer without any additional data or heavy computational overheads. Additionally, we find that removing positional encoding does not adversely affect M-SAT’s performance and, in some cases, even improves it.

[CV-31] Robot Instance Segmentation with Few Annotations for Grasping

Link: https://arxiv.org/abs/2407.01302
Authors: Moshe Kimhi, David Vainshtein, Chaim Baskin, Dotan Di Castro
Keywords: manipulate objects relies, objects relies heavily, ability of robots, robots to manipulate, relies heavily
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and leverage visual consistency despite temporal gaps without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an AP_50 of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an AP_50 score of 84.89 with just 1% of annotated data compared to 72 presented in ARMBench on the fully annotated counterpart.

[CV-32] GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting

Link: https://arxiv.org/abs/2407.01301
Authors: Chenxin Li, Hengyu Liu, Zhiwen Fan, Wuyang Li, Yifan Liu, Panwang Pan, Yixuan Yuan
Keywords: point-based techniques pave, Recent advancements, widespread visual data, visual data distribution, real-time neural rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue remains unexplored for emerging generative 3D formats like Gaussian Splatting. We present GaussianStego, a method for embedding steganographic information in the rendering of generated 3D assets. Our approach employs an optimization framework that enables the accurate extraction of hidden information from images rendered using Gaussian assets derived from large models, while maintaining their original visual quality. We conduct preliminary evaluations of our method across several potential deployment scenarios and discuss issues identified through analysis. GaussianStego represents an initial exploration into the novel challenge of embedding customizable, imperceptible, and recoverable information within the renders produced by current 3D generative models, while ensuring minimal impact on the rendered content’s quality.

[CV-33] Preserving Full Degradation Details for Blind Image Super-Resolution

Link: https://arxiv.org/abs/2407.01299
Authors: Hongda Liu, Longguang Wang, Ye Zhang, Kaiwen Xue, Shunbo Zhou, Yulan Guo
Keywords: super-resolution relies heavily, image super-resolution relies, super-resolution relies, relies heavily, degradation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures, 4 tables

Abstract:The performance of image super-resolution relies heavily on the accuracy of degradation information, especially under blind settings. Due to absence of true degradation models in real-world scenarios, previous methods learn distinct representations by distinguishing different degradations in a batch. However, the most significant degradation differences may provide shortcuts for the learning of representations such that subtle difference may be discarded. In this paper, we propose an alternative to learn degradation representations through reproducing degraded low-resolution (LR) images. By guiding the degrader to reconstruct input LR images, full degradation information can be encoded into the representations. In addition, we develop an energy distance loss to facilitate the learning of the degradation representations by introducing a bounded constraint. Experiments show that our representations can extract accurate and highly robust degradation information. Moreover, evaluations on both synthetic and real images demonstrate that our ReDSR achieves state-of-the-art performance for the blind SR tasks.

[CV-34] Formal Verification of Object Detection

Link: https://arxiv.org/abs/2407.01295
Authors: Avraham Raviv, Yizhak Y. Elboher, Michelle Aluf-Medina, Yael Leibovich Weiss, Omer Cohen, Roy Assa, Guy Katz, Hillel Kugler
Keywords: Deep Neural Networks, Deep Neural, object detection, object detection models, ubiquitous in real-world
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep Neural Networks (DNNs) are ubiquitous in real-world applications, yet they remain vulnerable to errors and adversarial attacks. This work tackles the challenge of applying formal verification to ensure the safety of computer vision models, extending verification beyond image classification to object detection. We propose a general formulation for certifying the robustness of object detection models using formal verification and outline implementation strategies compatible with state-of-the-art verification tools. Our approach enables the application of these tools, originally designed for verifying classification models, to object detection. We define various attacks for object detection, illustrating the diverse ways adversarial inputs can compromise neural network outputs. Our experiments, conducted on several common datasets and networks, reveal potential errors in object detection models, highlighting system vulnerabilities and emphasizing the need for expanding formal verification to these new domains. This work paves the way for further research in integrating formal verification across a broader range of computer vision applications.

[CV-35] We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Link: https://arxiv.org/abs/2407.01284
Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
Keywords: Large Multimodal Models, Multimodal Models, Large Multimodal, received widespread attention, Visual mathematical reasoning
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: Work in progress

Abstract:Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at this https URL.

[CV-36] Small Aerial Target Detection for Airborne Infrared Detection Systems using LightGBM and Trajectory Constraints

Link: https://arxiv.org/abs/2407.01278
Authors: Xiaoliang Sun, Liangchao Guo, Wenlong Zhang, Zi Wang, Qifeng Yu
Keywords: rapid relative motion, aerial target detection, make robust small, small aerial target, aerial target
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures

Abstract:Factors, such as rapid relative motion, clutter background, etc., make robust small aerial target detection for airborne infrared detection systems a challenge. Existing methods are facing difficulties when dealing with such cases. We consider that a continuous and smooth trajectory is critical in boosting small infrared aerial target detection performance. A simple and effective small aerial target detection method for airborne infrared detection system using light gradient boosting model (LightGBM) and trajectory constraints is proposed in this article. First, we simply formulate target candidate detection as a binary classification problem. Target candidates in every individual frame are detected via interesting pixel detection and a trained LightGBM model. Then, the local smoothness and global continuous characteristic of the target trajectory are modeled as short-strict and long-loose constraints. The trajectory constraints are used efficiently for detecting the true small infrared aerial targets from numerous target candidates. Experiments on public datasets demonstrate that the proposed method performs better than other existing methods. Furthermore, a public dataset for small aerial target detection in airborne infrared detection systems is constructed. To the best of our knowledge, this dataset has the largest data scale and richest scene types within this field.

[CV-37] OSL-ActionSpotting: A Unified Library for Action Spotting in Sports Videos

Link: https://arxiv.org/abs/2407.01265
Authors: Yassine Benzakour, Bruno Cabado, Silvio Giancola, Anthony Cioppa, Bernard Ghanem, Marc Van Droogenbroeck
Keywords: Action spotting, sports analytics, sports video analytics, providing insights, tactical decision-making
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Action spotting is crucial in sports analytics as it enables the precise identification and categorization of pivotal moments in sports matches, providing insights that are essential for performance analysis and tactical decision-making. The fragmentation of existing methodologies, however, impedes the progression of sports analytics, necessitating a unified codebase to support the development and deployment of action spotting for video analysis. In this work, we introduce OSL-ActionSpotting, a Python library that unifies different action spotting algorithms to streamline research and applications in sports video analytics. OSL-ActionSpotting encapsulates various state-of-the-art techniques into a singular, user-friendly framework, offering standardized processes for action spotting and analysis across multiple datasets. We successfully integrated three cornerstone action spotting methods into OSL-ActionSpotting, achieving performance metrics that match those of the original, disparate codebases. This unification within a single library preserves the effectiveness of each method and enhances usability and accessibility for researchers and practitioners in sports analytics. By bridging the gaps between various action spotting techniques, OSL-ActionSpotting significantly contributes to the field of sports video analysis, fostering enhanced analytical capabilities and collaborative research opportunities. The scalable and modularized design of the library ensures its long-term relevance and adaptability to future technological advancements in the domain.

[CV-38] Multi-level Reliable Guidance for Unpaired Multi-view Clustering

Link: https://arxiv.org/abs/2407.01247
Authors: Like Xin, Wanqi Yang, Lei Wang, Ming Yang
Keywords: perform effective joint, unpaired observed samples, effective joint clustering, unpaired multi-view clustering, cluster structures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we address the challenging problem of unpaired multi-view clustering (UMC), aiming to perform effective joint clustering using unpaired observed samples across multiple views. Commonly, traditional incomplete multi-view clustering (IMC) methods often depend on paired samples to capture complementary information between views. However, the strategy becomes impractical in UMC due to the absence of paired samples. Although some researchers have attempted to tackle the issue by preserving consistent cluster structures across views, they frequently neglect the confidence of these cluster structures, especially for boundary samples and uncertain cluster structures during the initial training. Therefore, we propose a method called Multi-level Reliable Guidance for UMC (MRG-UMC), which leverages multi-level clustering to aid in learning a trustworthy cluster structure across inner-view, cross-view, and common-view, respectively. Specifically, within each view, multi-level clustering fosters a trustworthy cluster structure across different levels and reduces clustering error. In cross-view learning, reliable view guidance enhances the confidence of the cluster structures in other views. Similarly, within the multi-level framework, the incorporation of a common view aids in aligning different views, thereby reducing the clustering error and uncertainty of cluster structure. Finally, as evidenced by extensive experiments, our method for UMC demonstrates significant efficiency improvements compared to 20 state-of-the-art methods.

[CV-39] CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation

Link: https://arxiv.org/abs/2407.01244
Authors: Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
Keywords: typically relies solely, animals typically relies, highly under-constrained, typically relies, relies solely
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR CV4Animals Workshop 2024

Abstract:In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution to this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio’s role in 3D animal motion recovery.

[CV-40] SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism

Link: https://arxiv.org/abs/2407.01239
Authors: Ao Liang, Wenyu Chen, Jian Fang, Huaici Zhao
Keywords: attracted widespread research, widespread research interest, research interest due, fast inference speed, inference speed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 16 figures

Abstract:The single-stage point-based 3D object detectors have attracted widespread research interest due to their advantages of lightweight and fast inference speed. However, they still face challenges such as inadequate learning of low-quality objects (ILQ) and misalignment between localization accuracy and classification confidence (MLC). In this paper, we propose SGCCNet to alleviate these two issues. For ILQ, SGCCNet adopts a Saliency-Guided Data Augmentation (SGDA) strategy to enhance the robustness of the model on low-quality objects by reducing its reliance on salient features. Specifically, we construct a classification task and then approximate the saliency scores of points by moving points towards the point cloud centroid in a differentiable process. During the training process, SGCCNet will be forced to learn from low saliency features through dropping points. Meanwhile, to avoid internal covariate shift and contextual features forgetting caused by dropping points, we add a geometric normalization module and skip connection block in each stage. For MLC, we design a Confidence Correction Mechanism (CCM) specifically for point-based multi-class detectors. This mechanism corrects the confidence of the current proposal by utilizing the predictions of other key points within the local region in the post-processing stage. Extensive experiments on the KITTI dataset demonstrate the generality and effectiveness of our SGCCNet. On the KITTI test set, SGCCNet achieves 80.82% for the metric of AP_3D on the Moderate level, outperforming all other point-based detectors, surpassing IA-SSD and Fast Point R-CNN by 2.35% and 3.42%, respectively. Additionally, SGCCNet demonstrates excellent portability for other point-based detectors.

[CV-41] DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution

Link: https://arxiv.org/abs/2407.01230
Authors: Crispian Morris, Nantheera Anantrasirichai, Fan Zhang, David Bull
Keywords: specifically target motion, recorded videos suffer, accidental focus blur, target motion blur, deblurring methods exist
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In many real-world scenarios, recorded videos suffer from accidental focus blur, and while video deblurring methods exist, most specifically target motion blur. This paper introduces a framework optimised for the joint task of focal deblurring (refocusing) and video super-resolution (VSR). The proposed method employs novel map guided transformers, in addition to image propagation, to effectively leverage the continuous spatial variance of focal blur and restore the footage. We also introduce a flow re-focusing module to efficiently align relevant features between the blurry and sharp domains. Additionally, we propose a novel technique for generating synthetic focal blur data, broadening the model’s learning capabilities to include a wider array of content. We have made a new benchmark dataset, DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS video segmentation set, provides realistic out-of-focus blur degradations as well as the corresponding blur maps. Comprehensive experiments on DAVIS-Blur demonstrate the superiority of our approach. We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods. Our source code will be made available at this https URL

[CV-42] Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Link: https://arxiv.org/abs/2407.01220
Authors: Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang
Keywords: spanning multiple domains, computer vision research, applications spanning multiple, high-dimensional CLIP features, CLIP features
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures

Abstract:Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enables open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, however, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskFields employ neural fields as binary mask generators and supervise them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes the ambiguous object boundaries by naturally introducing SAM segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

[CV-43] Efficient Cutting Tool Wear Segmentation Based on Segment Anything Model

Link: https://arxiv.org/abs/2407.01211
Authors: Zongshuo Li, Ding Huo, Markus Meurer, Thomas Bergs
Keywords: final geometric precision, wear conditions impact, Tool wear conditions, Tool wear, geometric precision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Tool wear conditions impact the surface quality of the workpiece and its final geometric precision. In this research, we propose an efficient tool wear segmentation approach based on Segment Anything Model, which integrates U-Net as an automated prompt generator to streamline the processes of tool wear detection. Our evaluation covered three Point-of-Interest generation methods and further investigated the effects of variations in training dataset sizes and U-Net training intensities on resultant wear segmentation outcomes. The results consistently highlight our approach’s advantage over U-Net, emphasizing its ability to achieve accurate wear segmentation even with limited training datasets. This feature underscores its potential applicability in industrial scenarios where datasets may be limited.

[CV-44] Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection

Link: https://arxiv.org/abs/2407.01193
Authors: Francesco Barbato, Umberto Michieli, Jijoong Moon, Pietro Zanuttigh, Mete Ozay
Keywords: robotic systems deployed, detection robotic systems, object detection robotic, Personalized Object Detection, personal devices
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IROS 2024, 8 pages, 4 figures, 6 tables

Abstract:Recent years have seen object detection robotic systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. user’s dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications. In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with excellent time-complexity trade-off: AuXFT reaches a performance of 80% its upper bound at just 32% of the inference time, 13% of VRAM and 19% of the model size.

[CV-45] MARS: Multimodal Active Robotic Sensing for Articulated Characterization

Link: https://arxiv.org/abs/2407.01191
Authors: Hongliang Zeng, Ping Zhang, Chengjiong Wu, Jiahua Wang, Tingyu Ye, Fang Li
Keywords: Precise perception, empowering service robots, empowering service, Precise
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point cloud, a single-modal approach, often neglecting vital texture and lighting details and assuming ideal conditions like optimal viewpoints, unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characterization. It features a multi-modal fusion module utilizing multi-scale RGB features to enhance point cloud features, coupled with reinforcement learning-based active sensing for autonomous optimization of observation viewpoints. In experiments conducted with various articulated object instances from the PartNet-Mobility dataset, our method outperformed current state-of-the-art methods in joint parameter estimation accuracy. Additionally, through active sensing, MARS further reduces errors, demonstrating enhanced efficiency in handling suboptimal viewpoints. Furthermore, our method effectively generalizes to real-world articulated objects, enhancing robot interactions. Code is available at this https URL.

[CV-46] Multi-View Black-Box Physical Attacks on Infrared Pedestrian Detectors Using Adversarial Infrared Grid

Link: https://arxiv.org/abs/2407.01168
Authors: Kalibinuer Tiliwalidi, Chengyin Hu, Weiwen Shi
Keywords: extensive research exists, visible spectrum, research exists, infrared spectrum, attacks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While extensive research exists on physical adversarial attacks within the visible spectrum, studies on such techniques in the infrared spectrum are limited. Infrared object detectors are vital in modern technological applications but are susceptible to adversarial attacks, posing significant security threats. Previous studies using physical perturbations like light bulb arrays and aerogels for white-box attacks, or hot and cold patches for black-box attacks, have proven impractical or limited in multi-view support. To address these issues, we propose the Adversarial Infrared Grid (AdvGrid), which models perturbations in a grid format and uses a genetic algorithm for black-box optimization. These perturbations are cyclically applied to various parts of a pedestrian’s clothing to facilitate multi-view black-box physical attacks on infrared pedestrian detectors. Extensive experiments validate AdvGrid’s effectiveness, stealthiness, and robustness. The method achieves attack success rates of 80.00% in digital environments and 91.86% in physical environments, outperforming baseline methods. Additionally, the average attack success rate exceeds 50% against mainstream detectors, demonstrating AdvGrid’s robustness. Our analyses include ablation studies, transfer attacks, and adversarial defenses, confirming the method’s superiority.

[CV-47] Benchmarking Predictive Coding Networks – Made Simple

Link: https://arxiv.org/abs/2407.01163
Authors: Luca Pinchetti, Chang Qi, Oleh Lokshyn, Gaspard Olivers, Cornelius Emde, Mufeng Tang, Amine M’Charrak, Simon Frieder, Bayar Menzat, Rafal Bogacz, Thomas Lukasiewicz, Tommaso Salvatori
Keywords: predictive coding networks, machine learning, predictive coding, coding networks, networks in machine
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 25 figures

Abstract:In this work, we tackle the problems of efficiency and scalability for predictive coding networks in machine learning. To do so, we first propose a library called PCX, whose focus lies on performance and simplicity, and provides a user-friendly, deep-learning oriented interface. Second, we use PCX to implement a large set of benchmarks for the community to use for their experiments. As most works propose their own tasks and architectures, do not compare one against each other, and focus on small-scale tasks, a simple and fast open-source library adopted by the whole community would address all of these concerns. Third, we perform extensive benchmarks using multiple algorithms, setting new state-of-the-art results in multiple tasks and datasets, as well as highlighting limitations inherent to PC that should be addressed. Thanks to the efficiency of PCX, we are able to analyze larger architectures than commonly used, providing baselines to galvanize community efforts towards one of the main open problems in the field: scalability. The code for PCX is available at this https URL.

[CV-48] Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Link: https://arxiv.org/abs/2407.01157
Authors: Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu
Keywords: unprecedented zero-shot capabilities, exhibit unprecedented zero-shot, shared embedding space, models exhibit unprecedented, zero-shot capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2401.15568, arXiv:2402.08473

Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

[CV-49] Integrated feature analysis for deep learning interpretation and class activation maps

Link: https://arxiv.org/abs/2407.01142
Authors: Yanli Li, Tahereh Hassanzadeh, Denis P. Shamonin, Monique Reijnierse, Annette H.M. van der Helm-van Mil, Berend C. Stoel
Keywords: integrated feature analysis, Understanding the decisions, integrated feature, feature analysis, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 11 figures, code available: this https URL. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Understanding the decisions of deep learning (DL) models is essential for the acceptance of DL to risk-sensitive applications. Although methods, like class activation maps (CAMs), give a glimpse into the black box, they do miss some crucial information, thereby limiting its interpretability and merely providing the considered locations of objects. To provide more insight into the models and the influence of datasets, we propose an integrated feature analysis method, which consists of feature distribution analysis and feature decomposition, to look closer into the intermediate features extracted by DL models. This integrated feature analysis could provide information on overfitting, confounders, outliers in datasets, model redundancies and principal features extracted by the models, and provide distribution information to form a common intensity scale, which are missing in current CAM algorithms. The integrated feature analysis was applied to eight different datasets for general validation: photographs of handwritten digits, two datasets of natural images and five medical datasets, including skin photography, ultrasound, CT, X-rays and MRIs. The method was evaluated by calculating the consistency between the CAMs average class activation levels and the logits of the model. Based on the eight datasets, the correlation coefficients through our method were all very close to 100%, and based on the feature decomposition, 5%-25% of features could generate equally informative saliency maps and obtain the same model performances as using all features. This proves the reliability of the integrated feature analysis. As the proposed methods rely on very few assumptions, this is a step towards better model interpretation and a useful extension to existing CAM algorithms. Codes: this https URL

[CV-50] M2IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension

Link: https://arxiv.org/abs/2407.01131
Authors: Xuyang Liu, Ting Liu, Siteng Huang, Yue Hu, Quanjun Yin, Donglin Wang, Honggang Chen
Keywords: Referring expression comprehension, Referring expression, expression comprehension, vision-language task, task to locate
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, applying PETL to REC faces two challenges: (1) insufficient interaction between pre-trained vision and language encoders, and (2) high GPU memory usage due to gradients passing through both heavy encoders. To address these issues, we present M^2IST: Multi-Modal Interactive Side-Tuning with M^3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update M^3ISAs on side networks to establish connections between them, thereby achieving parameter- and memory-efficient tuning for REC. Empirical results on three benchmarks show M^2IST achieves the best performance-parameter-memory trade-off compared to full fine-tuning and other PETL methods, with only 3.14M tunable parameters (2.11% of full fine-tuning) and 15.44GB GPU memory usage (39.61% of full fine-tuning). Source code will soon be publicly available.

[CV-51] RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds

Link: https://arxiv.org/abs/2407.01129
Authors: Ramy Battrawy, René Schuster, Didier Stricker
Keywords: operate on high-density, efficient scene flow, scene flow, full input resolution, high-density point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This version of the article has been accepted by International Journal of Computer Vision (IJCV), and published on 23.05.2024

Abstract:The proposed RMS-FlowNet++ is a novel end-to-end learning-based architecture for accurate and efficient scene flow estimation that can operate on high-density point clouds. For hierarchical scene flow estimation, existing methods rely on expensive Farthest-Point-Sampling (FPS) to sample the scenes, must find large correspondence sets across the consecutive frames and/or must search for correspondences at a full input resolution. While this can improve the accuracy, it reduces the overall efficiency of these methods and limits their ability to handle large numbers of points due to memory requirements. In contrast to these methods, our architecture is based on an efficient design for hierarchical prediction of multi-scale scene flow. To this end, we develop a special flow embedding block that has two advantages over the current methods: First, a smaller correspondence set is used, and second, the use of Random-Sampling (RS) is possible. In addition, our architecture does not need to search for correspondences at a full input resolution. Exhibiting high accuracy, our RMS-FlowNet++ provides a faster prediction than state-of-the-art methods, avoids high memory requirements and enables efficient scene flow on dense point clouds of more than 250K points at once. Our comprehensive experiments verify the accuracy of RMS-FlowNet++ on the established FlyingThings3D data set with different point cloud densities and validate our design choices. Furthermore, we demonstrate that our model has a competitive ability to generalize to the real-world scenes of the KITTI data set without fine-tuning.
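
The sampling trade-off named in this abstract can be illustrated with a minimal sketch of the two generic strategies. This is a textbook illustration in plain Python, not the RMS-FlowNet++ implementation; the function names and the `start` and `seed` parameters are assumptions introduced here.

```python
import random


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    Generic textbook version (hypothetical helper, not the paper's code).
    Cost is O(N * k) distance evaluations for N points.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(points)")

    def dist2(p, q):
        # Squared Euclidean distance between two coordinate tuples.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [start]
    # Squared distance from every point to its nearest chosen point.
    min_d2 = [dist2(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(n), key=lambda i: min_d2[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d2 = dist2(p, points[nxt])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return chosen


def random_sampling(points, k, seed=0):
    """Uniform sampling without replacement: no distance computations."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), k)
```

FPS rescans every point for each selected sample, while random sampling only draws indices, which is why a design that tolerates random sampling can be cheaper on dense point clouds.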

[CV-52] Comprehensive Dataset for Urban Streetlight Analysis

Link: https://arxiv.org/abs/2407.01117
Authors: Eliza Femi Sherley S, Sanjay T, Shri Kaanth P, Jeffrey Samuel S
Keywords: India major streets, Chennai region, systematically from India, India major, high-resolution streetlight images
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This article includes a comprehensive collection of over 800 high-resolution streetlight images taken systematically from India’s major streets, primarily in the Chennai region. The images were methodically collected following standardized methods to assure uniformity and quality. Each image has been labelled and grouped into directories based on binary class labels, which indicate whether each streetlight is functional or not. This organized dataset is intended to make it easier to train and evaluate deep neural networks, allowing for the creation of pre-trained models that have robust feature representations. Such models have several potential uses, such as improving smart city surveillance systems, automating street infrastructure monitoring, and increasing urban management efficiency. The availability of this dataset is intended to inspire future research and development in computer vision and smart city technologies, supporting innovation and practical solutions to urban infrastructure concerns. The dataset can be accessed at this https URL.

[CV-53] Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal

Link: https://arxiv.org/abs/2407.01104
Authors: Ziqi Zeng, Chen Zhao, Weiling Cai, Chenyu Dong
Keywords: Existing unsupervised methods, Existing unsupervised, shadow removal tasks, addressed the challenges, challenges of inconsistent
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing unsupervised methods have addressed the challenges of inconsistent paired data and tedious acquisition of ground-truth labels in shadow removal tasks. However, GAN-based training often faces issues such as mode collapse and unstable optimization. Furthermore, due to the complex mapping between shadow and shadow-free domains, merely relying on adversarial learning is not enough to capture the underlying relationship between two domains, resulting in low quality of the generated images. To address these problems, we propose a semantic-guided adversarial diffusion framework for self-supervised shadow removal, which consists of two stages. At first stage a semantic-guided generative adversarial network (SG-GAN) is proposed to carry out a coarse result and construct paired synthetic data through a cycle-consistent structure. Then the coarse result is refined with a diffusion-based restoration module (DBRM) to enhance the texture details and edge artifact at second stage. Meanwhile, we propose a multi-modal semantic prompter (MSP) that aids in extracting accurate semantic information from real images and text, guiding the shadow removal network to restore images better in SG-GAN. We conduct experiments on multiple public datasets, and the experimental results demonstrate the effectiveness of our method.

[CV-54] Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Link: https://arxiv.org/abs/2407.01094
Authors: Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang
Keywords: Comprehensive and constructive, constructive evaluation protocols, evaluation protocols play, development of sophisticated, dynamics
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at this https URL.

[CV-55] Kolmogorov-Arnold Convolutions: Design Principles and Empirical Studies

Link: https://arxiv.org/abs/2407.01092
Authors: Ivan Drokin
Keywords: sparked significant interest, Kolmogorov-Arnold Networks, scientific community, sparked significant, significant interest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The emergence of Kolmogorov-Arnold Networks (KANs) has sparked significant interest and debate within the scientific community. This paper explores the application of KANs in the domain of computer vision (CV). We examine the convolutional version of KANs, considering various nonlinearity options beyond splines, such as Wavelet transforms and a range of polynomials. We propose a parameter-efficient design for Kolmogorov-Arnold convolutional layers and a parameter-efficient finetuning algorithm for pre-trained KAN models, as well as KAN convolutional versions of self-attention and focal modulation layers. We provide empirical evaluations conducted on MNIST, CIFAR10, CIFAR100, Tiny ImageNet, ImageNet1k, and HAM10000 datasets for image classification tasks. Additionally, we explore segmentation tasks, proposing U-Net-like architectures with KAN convolutions, and achieving state-of-the-art results on BUSI, GlaS, and CVC datasets. We summarized all of our findings in a preliminary design guide of KAN convolutional models for computer vision tasks. Furthermore, we investigate regularization techniques for KANs. All experimental code and implementations of convolutional layers and models, pre-trained on ImageNet1k weights are available on GitHub via this https URL.

[CV-56] CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Link: https://arxiv.org/abs/2407.01081
Authors: Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che
Keywords: Chinese vision-language models, constructed on Western-centric, Chinese vision-language, Chinese, Chinese culture
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs’ understanding of Chinese culture.

[CV-57] Multimodal Conditional 3D Face Geometry Generation

Link: https://arxiv.org/abs/2407.01074
Authors: Christopher Otto, Prashanth Chandran, Sebastian Weiss, Markus Gross, Gaspard Zoss, Derek Bradley
Keywords: multimodal conditional, method for multimodal, output identity, identity and expression, FLAME face model
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high resolution geometry with fine-grain user control.

[CV-58] Human-like object concept representations emerge naturally in multimodal large language models

Link: https://arxiv.org/abs/2407.01067
Authors: Changde Du, Kaicheng Fu, Bincheng Wen, Yi Sun, Jie Peng, Wei Wei, Ying Gao, Shengpei Wang, Chuncheng Zhang, Jinpeng Li, Shuang Qiu, Le Chang, Huiguang He
Keywords: offering crucial insights, Large Language Models, intrigued cognitive scientists, long intrigued cognitive, scientists and neuroscientists
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:The conceptualization and categorization of natural objects in the human mind have long intrigued cognitive scientists and neuroscientists, offering crucial insights into human perception and cognition. Recently, the rapid development of Large Language Models (LLMs) has raised the attractive question of whether these models can also develop human-like object representations through exposure to vast amounts of linguistic and multimodal data. In this study, we combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in LLMs correlate with those of humans. By collecting large-scale datasets of 4.7 million triplet judgments from LLM and Multimodal LLM (MLLM), we were able to derive low-dimensional embeddings that capture the underlying similarity structure of 1,854 natural objects. The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations. Interestingly, the interpretability of the dimensions underlying these embeddings suggests that LLM and MLLM have developed human-like conceptual representations of natural objects. Further analysis demonstrated strong alignment between the identified model embeddings and neural activity patterns in many functionally defined brain ROIs (e.g., EBA, PPA, RSC and FFA). This provides compelling evidence that the object representations in LLMs, while not identical to those in the human, share fundamental commonalities that reflect key schemas of human conceptual knowledge. This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.

[CV-59] Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert

Link: https://arxiv.org/abs/2407.01034
Authors: Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Joo, Tae-Hyun Oh
Keywords: recently garnered attention, garnered attention due, multimedia production, recently garnered, garnered attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: INTERSPEECH 2024

Abstract:Speech-driven 3D facial animation has recently garnered attention due to its cost-effective usability in multimedia production. However, most current advances overlook the intelligibility of lip movements, limiting the realism of facial expressions. In this paper, we introduce a method for speech-driven 3D facial animation to generate accurate lip movements, proposing an audio-visual multimodal perceptual loss. This loss provides guidance to train the speech-driven 3D facial animators to generate plausible lip motions aligned with the spoken transcripts. Furthermore, to incorporate the proposed audio-visual perceptual loss, we devise an audio-visual lip reading expert leveraging its prior knowledge about correlations between speech and lip motions. We validate the effectiveness of our approach through broad experiments, showing noticeable improvements in lip synchronization and lip readability performance. Codes are available at this https URL.

[CV-60] Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Link: https://arxiv.org/abs/2407.01032
Authors: Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, Paul F Jaeger
Keywords: reject low-confidence predictions, promises reliable translation, machine-learning based classification, based classification systems, low-confidence predictions
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)

Abstract:Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
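
As a rough illustration of the "average risk of undetected failures" reading, the sketch below computes a discrete area under a generalized risk-coverage curve. The abstract does not give the formula, so this rests on an assumption: that the generalized risk at a given coverage is the fraction of all samples that are both accepted and wrong. The function name `augrc` and its arguments are hypothetical, and this is not the authors' reference implementation.

```python
def augrc(confidences, failures):
    """Discrete area under a generalized risk-coverage curve.

    Assumption (not taken from the abstract): at coverage k/n, the
    generalized risk is (failures among the k most confident samples) / n.
    `failures[i]` is 1 if prediction i is wrong, else 0.
    """
    n = len(confidences)
    if n == 0 or n != len(failures):
        raise ValueError("inputs must be non-empty and of equal length")
    # Accept samples in order of decreasing confidence.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    accepted_failures = 0
    area = 0.0
    for i in order:
        accepted_failures += failures[i]
        area += accepted_failures / n  # generalized risk at this coverage
    return area / n  # average over the n coverage levels
```

With this definition a confidence score that ranks failures last gets a lower (better) value than one that ranks them first, even when both classifiers make the same number of errors.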

[CV-61] EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting

Link: https://arxiv.org/abs/2407.01029
Authors: Chenxin Li, Brandon Y. Feng, Yifan Liu, Hengyu Liu, Cheng Wang, Weihao Yu, Yixuan Yuan
Keywords: important downstream surgical, downstream surgical applications, biological tissues, key to unlock, unlock various important
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2024

Abstract:3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually the case in real-world clinical scenarios. To tackle this sparsity challenge, we propose a framework leveraging the prior knowledge from multiple foundation models during the reconstruction process, dubbed as EndoSparse. Experimental results indicate that our proposed strategy significantly improves the geometric and appearance quality under challenging sparse-view conditions, including using only three views. In rigorous benchmarking experiments against state-of-the-art methods, EndoSparse achieves superior results in terms of accurate geometry, realistic appearance, and rendering efficiency, confirming the robustness to sparse-view limitations in endoscopic reconstruction. EndoSparse signifies a steady step towards the practical deployment of neural 3D reconstruction in real-world clinical scenarios. Project page: this https URL.

[CV-62] Blind Inversion using Latent Diffusion Priors

Link: https://arxiv.org/abs/2407.01027
Authors: Weimin Bai, Siyi Chen, Wenzheng Chen, He Sun
Keywords: complex prior distributions, exceptional ability, Diffusion models, Diffusion, inverse problems
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models have emerged as powerful tools for solving inverse problems due to their exceptional ability to model complex prior distributions. However, existing methods predominantly assume known forward operators (i.e., non-blind), limiting their applicability in practical settings where acquiring such operators is costly. Additionally, many current approaches rely on pixel-space diffusion models, leaving the potential of more powerful latent diffusion models (LDMs) underexplored. In this paper, we introduce LatentDEM, an innovative technique that addresses more challenging blind inverse problems using latent diffusion priors. At the core of our method is solving blind inverse problems within an iterative Expectation-Maximization (EM) framework: (1) the E-step recovers clean images from corrupted observations using LDM priors and a known forward model, and (2) the M-step estimates the forward operator based on the recovered images. Additionally, we propose two novel optimization techniques tailored for LDM priors and EM frameworks, yielding more accurate and efficient blind inversion results. As a general framework, LatentDEM supports both linear and non-linear inverse problems. Beyond common 2D image restoration tasks, it enables new capabilities in non-linear 3D inverse rendering problems. We validate LatentDEM’s performance on representative 2D blind deblurring and 3D sparse-view reconstruction tasks, demonstrating its superior efficacy over prior arts.

[CV-63] Coding for Intelligence from the Perspective of Category

Link: https://arxiv.org/abs/2407.01017
Authors: Wenhan Yang, Zixuan Hu, Lilang Lin, Jiaying Liu, Ling-Yu Duan
Keywords: abstract computational level, abstract computational, interweave recently, significant progress, targets compressing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Coding, which targets compressing and reconstructing data, and intelligence, often regarded at an abstract computational level as being centered around model learning and prediction, interweave recently to give birth to a series of significant progress. The recent trends demonstrate the potential homogeneity of these two fields, especially when deep-learning models aid these two categories for better probability modeling. For better understanding and describing from a unified perspective, inspired by the basic generally recognized principles in cognitive psychology, we formulate a novel problem of Coding for Intelligence from the category theory view. Based on the three axioms: existence of ideal coding, existence of practical coding, and compactness promoting generalization, we derive a general framework to understand existing methodologies, namely that, coding captures the intrinsic relationships of objects as much as possible, while ignoring information irrelevant to downstream tasks. This framework helps identify the challenges and essential elements in solving the specific derived Minimal Description Length (MDL) optimization problem from a broader range, providing opportunities to build a more intelligent system for handling multiple tasks/applications with coding ideas/tools. Centering on those elements, we systematically review recent processes of towards optimizing the MDL problem in more comprehensive ways from data, model, and task perspectives, and reveal their impacts on the potential CfI technical routes. After that, we also present new technique paths to fulfill CfI and provide potential solutions with preliminary experimental evidence. Last, further directions and remaining issues are discussed as well. The discussion shows our theory can reveal many phenomena and insights about large foundation models, which mutually corroborate with recent practices in feature learning.

[CV-64] SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Link: https://arxiv.org/abs/2407.01016
Authors: Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai
Keywords: hot topic recently, boost object detectors, Semi-supervised object detection, Semi-supervised Oriented Object, Oriented Object Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images are usually arbitrary orientations, small scales, and aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.92, +2.39, and +2.57 mAP under 10%, 20%, and 30% labeled data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The code will be made available.

[CV-65] An Expectation-Maximization Algorithm for Training Clean Diffusion Models from Corrupted Observations

Link: https://arxiv.org/abs/2407.01014
Authors: Weimin Bai, Yifei Wang, Wenzheng Chen, He Sun
Keywords: inverse problems due, complex image priors, solving imaging inverse, imaging inverse problems, excel in solving
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models excel in solving imaging inverse problems due to their ability to model complex image priors. However, their reliance on large, clean datasets for training limits their practical use where clean data is scarce. In this paper, we propose EMDiffusion, an expectation-maximization (EM) approach to train diffusion models from corrupted observations. Our method alternates between reconstructing clean images from corrupted data using a known diffusion model (E-step) and refining diffusion model weights based on these reconstructions (M-step). This iterative process leads the learned diffusion model to gradually converge to the true clean data distribution. We validate our method through extensive experiments on diverse computational imaging tasks, including random inpainting, denoising, and deblurring, achieving new state-of-the-art performance.

[CV-66] Swish-T: Enhancing Swish Activation with Tanh Bias for Improved Neural Network Performance

Link: https://arxiv.org/abs/2407.01012
Authors: Youngmin Seo, Jinha Kim, Unsang Park
Keywords: activation function Swish, existing non-monotonic activation, original Swish, original Swish function, Swish-T
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:We propose the Swish-T family, an enhancement of the existing non-monotonic activation function Swish. Swish-T is defined by adding a Tanh bias to the original Swish function. This modification creates a family of Swish-T variants, each designed to excel in different tasks, showcasing specific advantages depending on the application context. The Tanh bias allows for broader acceptance of negative values during initial training stages, offering a smoother non-monotonic curve than the original Swish. We ultimately propose the Swish-T$_{\textbf{C}}$ function, while Swish-T and Swish-T$_{\textbf{B}}$, byproducts of Swish-T$_{\textbf{C}}$, also demonstrate satisfactory performance. Furthermore, our ablation study shows that using Swish-T$_{\textbf{C}}$ as a non-parametric function can still achieve high performance. The superiority of the Swish-T family has been empirically demonstrated across various models and benchmark datasets, including MNIST, Fashion MNIST, SVHN, CIFAR-10, and CIFAR-100. The code is publicly available at this https URL.
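
As a rough illustration of "adding a Tanh bias to the original Swish function", the sketch below assumes the bias is a scaled `tanh(x)` term added to Swish, that is `x * sigmoid(beta * x) + alpha * tanh(x)`. The exact parameterization, the default `alpha=0.1`, and the B and C variants are not given in the abstract, so treat these names and values as assumptions rather than the authors' definition.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Original Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_t(x, beta=1.0, alpha=0.1):
    """Swish plus a tanh bias term (assumed additive form, assumed alpha)."""
    return swish(x, beta) + alpha * math.tanh(x)


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, swish(x), swish_t(x))
```

With a positive `alpha`, the extra term pushes outputs further below zero for negative inputs, which is one way to read the abstract's "broader acceptance of negative values"; setting `alpha=0` recovers plain Swish.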

[CV-67] GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

Link: https://arxiv.org/abs/2407.01007
Authors: Huijie Fan, Tinghui Zhao, Qiang Wang, Baojie Fan, Yandong Tang, LianQing Liu
Keywords: data association problem, multi-target multi-camera, main challenge, task of multi-target, complications arising
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same targets crossing different cameras may exhibit significant appearance variations, which further increases the difficulty of cross-camera matching. To address the aforementioned issues, we propose a global online MTMC tracking model that addresses the dependency on the first tracking stage in two-step methods and enhances cross-camera matching. Specifically, we propose a transformer-based global MTMC association module to explore target associations across different cameras and frames, generating global trajectories directly. Additionally, to integrate the appearance and spatio-temporal features of targets, we propose a feature extraction and fusion module for MTMC tracking. This module enhances feature representation and establishes correlations between the features of targets across multiple cameras. To accommodate high scene diversity and complex lighting condition variations, we have established the VisionTrack dataset, which enables the development of models that are more generalized and robust to various environments. Our model demonstrates significant improvements over comparison methods on the VisionTrack dataset and others.

[CV-68] Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images

Link: https://arxiv.org/abs/2407.01003
Authors: Wenqiang Zu, Shenghao Xie, Qing Zhao, Guoqi Li, Lei Ma
Keywords: natural imaging downstream, imaging downstream tasks, Foundation models, Foundation models pre-trained, prompt tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures. arXiv admin note: text overlap with arXiv:2306.09579, arXiv:2203.12119 by other authors

Abstract:Foundation models pre-trained on large-scale data have been widely witnessed to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of mainstream prompt tuning methods on Transformer architectures, namely the ways in which prompts are introduced and their approximation capabilities, we propose the Embedded Prompt Tuning (EPT) method, which embeds prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during the pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective to understand prompt tuning: prompt tuning is a distribution calibrator. We support it by analyzing the patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. Our code will be released once accepted.

[CV-69] Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Link: https://arxiv.org/abs/2407.00985
Authors: Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara, Komei Sugiura
Keywords: domestic service robots, object manipulation instruction, open vocabulary instructions, generating segmentation masks, give open vocabulary
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at IROS2024

Abstract:We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera’s field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.
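
To see why vertex order matters for a polygon loss, the sketch below compares an index-by-index vertex distance with the cost of the cheapest one-to-one vertex matching. It is a simplified stand-in, not the paper's loss: it assumes equal vertex counts and uniform weights, in which case optimal transport reduces to an assignment problem, and it solves that by brute force over permutations, which is only practical for a handful of vertices. The function names are hypothetical.

```python
from itertools import permutations
import math


def ordered_loss(pred, target):
    """Mean distance when vertex i of pred is compared with vertex i of target."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(target)


def matching_loss(pred, target):
    """Mean distance under the cheapest one-to-one vertex matching.

    With equal-size vertex sets and uniform weights, this equals the
    optimal transport cost between the two sets.
    """
    n = len(target)
    return min(
        sum(math.dist(pred[i], target[j]) for i, j in enumerate(perm)) / n
        for perm in permutations(range(n))
    )


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    shifted = square[1:] + square[:1]  # same polygon, different starting vertex
    print(ordered_loss(shifted, square))   # penalizes the reordering
    print(matching_loss(shifted, square))  # ignores the reordering
```

Matching over arbitrary permutations compares vertex sets rather than polygons, so it also treats some self-intersecting orderings as identical; a practical loss would need to handle that and use a differentiable transport solver.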

[CV-70] FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

Link: https://arxiv.org/abs/2407.00983
Authors: Ruinan Jin, Zikang Xu, Yuan Zhong, Qiongsong Yao, Qi Dou, S. Kevin Zhou, Xiaoxiao Li
Keywords: offers unprecedented opportunities, healthcare offers unprecedented, enhance medical diagnostics, advent of foundation, offers unprecedented
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 17 figures

Abstract:The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about their fairness, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries to evaluate and understand the fairness performance of FMs in medical imaging, leading to considerable challenges in formulating and implementing solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates with 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs, with various usages such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting in various downstream tasks: classification and segmentation. Our exhaustive analysis evaluates the fairness performance over different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs on different FMs, consistent disparities on the same datasets regardless of the FM, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM’s project page and open-sourced codebase, which supports extendible functionalities and applications and is inclusive of studies on FMs in medical imaging over the long term.

[CV-71] Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Link: https://arxiv.org/abs/2407.00979
Authors: Hanwen Su, Ge Song, Kai Huang, Jiyan Wang, Ming Yang
Keywords: sketch-based image retrieval, zero-shot sketch-based image, study the problem, Description Generation Module, Cross-modal Alignment Module
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). Prior methods tackle the problem in a two-modality setting with only category labels or even no textual information involved. However, the growing prevalence of Large-scale pre-trained Language Models (LLMs), which have demonstrated great knowledge learned from web-scale data, provides an opportunity to gather collective textual information. Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences, (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a transformer for extracting tokens of sentences of each training category, and (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image pairs using a cross-attention mechanism and aligns the tokens locally and globally. Extensive experiments on three benchmark datasets show superior performance over state-of-the-art ZS-SBIR methods.

[CV-72] FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing

Link: https://arxiv.org/abs/2407.00972
Authors: Donghyun Kim, Seil Kang, Seong Jae Hwang
Keywords: pervasive challenge crucial, robust vision applications, addressing atmospheric interference, Frequency Adjoint Link, Image dehazing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image dehazing, addressing atmospheric interference like fog and haze, remains a pervasive challenge crucial for robust vision applications such as surveillance and remote sensing under adverse visibility. While various methodologies have evolved from early works predicting transmission matrix and atmospheric light features to deep learning and dehazing networks, they innately prioritize dehazing quality metrics, neglecting the need for real-time applicability in time-sensitive domains like autonomous driving. This work introduces FALCON (Frequency Adjoint Link with CONtinuous density mask), a single-image dehazing system achieving state-of-the-art performance on both quality and speed. Particularly, we develop a novel bottleneck module, namely, Frequency Adjoint Link, operating in the frequency space to globally expand the receptive field with minimal growth in network size. Further, we leverage the underlying haze distribution based on the atmospheric scattering model via a Continuous Density Mask (CDM) which serves as a continuous-valued mask input prior and a differentiable auxiliary loss. Comprehensive experiments involving multiple state-of-the-art methods and ablation analysis demonstrate FALCON’s exceptional performance in both dehazing quality and speed (i.e., 180 frames-per-second), quantified by metrics such as FPS, PSNR, and SSIM.
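
The atmospheric scattering model mentioned in the abstract is commonly written as I = J·t + A·(1 − t), with transmission t = exp(−β·d). The sketch below applies that standard formula to a single pixel intensity. It does not reproduce FALCON's Continuous Density Mask; the function names and the default values for `airlight` and `beta` are illustrative assumptions.

```python
import math


def transmission(depth, beta=1.0):
    """Fraction of scene light that survives the haze: t = exp(-beta * depth)."""
    return math.exp(-beta * depth)


def add_haze(clean, depth, airlight=1.0, beta=1.0):
    """Standard scattering model for one pixel: I = J * t + A * (1 - t).

    clean    -- haze-free intensity J in [0, 1]
    depth    -- scene depth d (arbitrary units)
    airlight -- global atmospheric light A (assumed value)
    beta     -- scattering coefficient (assumed value)
    """
    t = transmission(depth, beta)
    return clean * t + airlight * (1.0 - t)


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 3.0):
        print(d, add_haze(0.3, d))
```

As depth grows, the hazy intensity moves from the clean value toward the airlight, so `1 - t` behaves like a continuous haze density, which is the kind of quantity a density mask can encode.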

[CV-73] Deep learning for automated detection of breast cancer in deep ultraviolet fluorescence images with diffusion probabilistic model

Link: https://arxiv.org/abs/2407.00967
Authors: Sepehr Salem Ghahfarokhi, Tyrell To, Julie Jorns, Tina Yen, Bing Yu, Dong Hye Ye
Keywords: Data limitation, applying deep learning, significant challenge, challenge in applying, learning to medical
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IEEE International Symposium on Biomedical Imaging 2024

Abstract:Data limitation is a significant challenge in applying deep learning to medical images. Recently, the diffusion probabilistic model (DPM) has shown the potential to generate high-quality images by converting Gaussian random noise into realistic images. In this paper, we apply the DPM to augment the deep ultraviolet fluorescence (DUV) image dataset with an aim to improve breast cancer classification for intraoperative margin assessment. For classification, we divide the whole surface DUV image into small patches and extract convolutional features for each patch by utilizing the pre-trained ResNet. Then, we feed them into an XGBoost classifier for patch-level decisions and then fuse them with a regional importance map computed by Grad-CAM++ for whole surface-level prediction. Our experimental results show that augmenting the training dataset with the DPM significantly improves breast cancer detection performance in DUV images, increasing accuracy from 93% to 97%, compared to using Affine transformations and ProGAN.

[CV-74] SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection

Link: https://arxiv.org/abs/2407.00949
Authors: Yanheng Wang, Xiaohan Yu, Yongsheng Gao, Jianjun Sha, Jian Wang, Lianru Gao, Yonggang Zhang, Xianhui Rong
Keywords: including convolutional neural, deep learning methods, graph neural networks, convolutional neural networks, neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, training and test times required. In this paper, we propose a spectral Kolmogorov-Arnold Network for HSIs-CD (SpectralKAN). SpectralKAN represents a multivariate continuous function as a composition of activation functions to extract HSI features and perform classification. These activation functions are B-spline functions with different parameters that can simulate various functions. In SpectralKAN, a KAN encoder is proposed to enhance computational efficiency for HSIs. A spatial-spectral KAN encoder is also introduced, where the spatial KAN encoder extracts spatial features and compresses the spatial dimensions from patch size to one. The spectral KAN encoder then extracts spectral features and classifies them into changed and unchanged categories. We use five HSIs-CD datasets to verify the effectiveness of SpectralKAN. Experimental verification has shown that SpectralKAN maintains high HSIs-CD accuracy while requiring fewer parameters, FLOPs, GPU memory, training and testing times, thereby increasing the efficiency of HSIs-CD. The code will be available at this https URL.

[CV-75] Diffusion Transformer Model With Compact Prior for Low-dose PET Reconstruction

Link: https://arxiv.org/abs/2407.00944
Authors: Bin Huang, Xubiao Liu, Lei Fang, Qiegen Liu, Bingxuan Li
Keywords: Positron emission tomography, Positron emission, low-dose PET, advanced medical imaging, medical imaging technique
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Positron emission tomography (PET) is an advanced medical imaging technique that plays a crucial role in non-invasive clinical diagnosis. However, while reducing radiation exposure through low-dose PET scans is beneficial for patient safety, it often results in insufficient statistical data. This scarcity of data poses significant challenges for accurately reconstructing high-quality images, which are essential for reliable diagnostic outcomes. In this research, we propose a diffusion transformer model (DTM) guided by joint compact prior (JCP) to enhance the reconstruction quality of low-dose PET imaging. In light of current research findings, we present a pioneering PET reconstruction model that integrates diffusion and transformer models for joint optimization. This model combines the powerful distribution mapping abilities of diffusion models with the capacity of transformers to capture long-range dependencies, offering significant advantages for low-dose PET reconstruction. Additionally, the incorporation of the lesion refining block and penalized weighted least squares (PWLS) enhances the recovery capability of lesion regions and preserves detail information, addressing the blurring of lesion areas and texture details seen in most deep learning frameworks. Experimental results demonstrate the effectiveness of DTM in enhancing image quality and preserving critical clinical information for low-dose PET scans. Our approach not only reduces radiation exposure risks but also provides a more reliable PET imaging tool for early disease detection and patient management.

[CV-76] PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis

Link: https://arxiv.org/abs/2407.00921
Authors: Qiang Zheng, Yafei Qi, Chen Wang, Chao Zhang, Jian Sun
Keywords: Graph Neural Networks, Neural Networks, existing approaches encounter, approaches encounter challenges, point cloud analysis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the domain of point cloud analysis, despite the significant capabilities of Graph Neural Networks (GNNs) in managing complex 3D datasets, existing approaches encounter challenges like high computational costs and scalability issues with extensive scenarios. These limitations restrict the practical deployment of GNNs, notably in resource-constrained environments. To address these issues, this study introduces Point Vision GNN (PointViG), an efficient framework for point cloud analysis. PointViG incorporates a lightweight graph convolutional module to efficiently aggregate local features and mitigate over-smoothing. For large-scale point cloud scenes, we propose an adaptive dilated graph convolution technique that searches for sparse neighboring nodes within a dilated neighborhood based on semantic correlation, thereby expanding the receptive field and ensuring computational efficiency. Experiments demonstrate that PointViG achieves performance comparable to state-of-the-art models while balancing performance and complexity. On the ModelNet40 classification task, PointViG achieved 94.3% accuracy with 1.5M parameters. For the S3DIS segmentation task, it achieved an mIoU of 71.7% with 5.3M parameters. These results underscore the potential and efficiency of PointViG in point cloud analysis.

[CV-77] From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos

Link: https://arxiv.org/abs/2407.00917
Authors: Tanqiu Qiao, Ruochen Li, Frederick W. B. Li, Hubert P. H. Shum
Keywords: Video-based Human-Object Interaction, Video-based Human-Object, recognition explores, behavior and intentions, explores the intricate
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICPR 2024

Abstract:Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects, which are essential for a comprehensive understanding of human behavior and intentions. While previous work has made significant strides, effectively integrating geometric and visual features to model dynamic relationships between humans and objects in a graph framework remains a challenge. In this work, we propose a novel end-to-end category to scenery framework, CATS, starting by generating geometric features for various categories through graphs respectively, then fusing them with corresponding visual features. Subsequently, we construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories. This methodological advance facilitates a deeper, more structured comprehension of interactions, bridging category-specific insights with broad scenery dynamics. Our method demonstrates state-of-the-art performance on two pivotal HOI benchmarks, including the MPHOI-72 dataset for multi-person HOIs and the single-person HOI CAD-120 dataset.

[CV-78] Deep Image-to-Recipe Translation

Link: https://arxiv.org/abs/2407.00911
Authors: Jiangqin Ma, Bilal Mawji, Franz Williams
Keywords: profound level, reflecting the intricate, intricate connection, Eat, cherished food memories
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The modern saying, “You Are What You Eat” resonates on a profound level, reflecting the intricate connection between our identities and the food we consume. Our project, Deep Image-to-Recipe Translation, is an intersection of computer vision and natural language generation that aims to bridge the gap between cherished food memories and the art of culinary creation. Our primary objective involves predicting ingredients from a given food image. For this task, we first develop a custom convolutional network and then compare its performance to a model that leverages transfer learning. We pursue an additional goal of generating a comprehensive set of recipe steps from a list of ingredients. We frame this process as a sequence-to-sequence task and develop a recurrent neural network that utilizes pre-trained word embeddings. We address several challenges of deep learning including imbalanced datasets, data cleaning, overfitting, and hyperparameter selection. Our approach emphasizes the importance of metrics such as Intersection over Union (IoU) and F1 score in scenarios where accuracy alone might be misleading. For our recipe prediction model, we employ perplexity, a commonly used and important metric for language models. We find that transfer learning via pre-trained ResNet-50 weights and GloVe embeddings provide an exceptional boost to model performance, especially when considering training resource constraints. Although we have made progress on the image-to-recipe translation, there is an opportunity for future exploration with advancements in model architectures, dataset scalability, and enhanced user interaction.
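
The three metrics named in the abstract have standard definitions that are easy to state in code. The sketch below is a generic illustration, not the authors' evaluation script: it treats predicted and true ingredients as sets for IoU and F1, and computes perplexity from hypothetical per-token probabilities that a language model assigned to the reference recipe.

```python
import math


def iou(pred, true):
    """Intersection over Union of two ingredient sets."""
    pred, true = set(pred), set(true)
    union = pred | true
    return len(pred & true) / len(union) if union else 1.0


def f1(pred, true):
    """Harmonic mean of precision and recall over ingredient sets."""
    pred, true = set(pred), set(true)
    tp = len(pred & true)  # correctly predicted ingredients
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs):
    """exp of the mean negative log-probability of the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


if __name__ == "__main__":
    predicted = {"egg", "flour", "milk"}
    actual = {"egg", "flour", "sugar"}
    print(iou(predicted, actual), f1(predicted, actual))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Plain accuracy can look high on this kind of multi-label task because most ingredients are absent from any given dish; IoU and F1 ignore those easy true negatives, which is presumably why the abstract calls accuracy alone misleading.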

[CV-79] Heterogeneous Graph-based Framework with Disentangled Representations Learning for Multi-target Cross Domain Recommendation

Link: https://arxiv.org/abs/2407.00909
Authors: Xiaopeng Liu, Juan Zhang, Chongqi Ren, Shenghui Xu, Zhaoming Pan, Zhimin Zhang
Keywords: data sparsity problem, Cross-Domain Recommendation, recommendation system, CDR, critical solution
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:CDR (Cross-Domain Recommendation), i.e., leveraging information from multiple domains, is a critical solution to the data sparsity problem in recommendation systems. The majority of previous research either focused on single-target CDR (STCDR) by utilizing data from the source domains to improve the model’s performance on the target domain, or applied dual-target CDR (DTCDR) by integrating data from the source and target domains. In addition, multi-target CDR (MTCDR) is a generalization of DTCDR, which is able to capture the link among different domains. In this paper we present HGDR (Heterogeneous Graph-based Framework with Disentangled Representations Learning), an end-to-end heterogeneous network architecture in which graph convolutional layers are applied to model relations among different domains and which uses the idea of disentangled representations for domain-shared and domain-specific information. First, a shared heterogeneous graph is generated by gathering users and items from several domains without any further side information. Second, we use HGDR to compute disentangled representations for users and items in all domains. Experiments on real-world datasets and online A/B tests show that our proposed model can transmit information among domains effectively and reach SOTA performance.

[CV-80] GSO-YOLO: Global Stability Optimization YOLO for Construction Site Detection

Link: https://arxiv.org/abs/2407.00906
Authors: Yuming Zhang, Dongzhi Guan, Shouxin Zhang, Junhao Su, Yunzhi Han, Jiabin Liu
Keywords: causing economic damage, economic damage due, construction sites, plagued the industry, posing risks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Safety issues at construction sites have long plagued the industry, posing risks to worker safety and causing economic damage due to potential hazards. With the advancement of artificial intelligence, particularly in the field of computer vision, the automation of safety monitoring on construction sites has emerged as a solution to this longstanding issue. Despite achieving impressive performance, advanced object detection methods like YOLOv8 still face challenges in handling the complex conditions found at construction sites. To solve these problems, this study presents the Global Stability Optimization YOLO (GSO-YOLO) model to address challenges in complex construction sites. The model integrates the Global Optimization Module (GOM) and Steady Capture Module (SCM) to enhance global contextual information capture and detection stability. The innovative AIoU loss function, which combines CIoU and EIoU, improves detection accuracy and efficiency. Experiments on datasets like SODA, MOCS, and CIS show that GSO-YOLO outperforms existing methods, achieving SOTA performance.

[CV-81] Learning Robust 3D Representation from CLIP via Dual Denoising

Link: https://arxiv.org/abs/2407.00905
Authors: Shuqing Luo, Bowen Qu, Wei Gao
Keywords: pre-trained vision language, vision language models, under-investigated issue, explore a critical, critical yet under-investigated
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representation from pre-trained vision language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resultant 3D learning network is still vulnerable to adversarial attacks especially the iterative attack. In this work, we propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP. It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, we propose utilizing parallel noise inference to enhance the generalization of point cloud features under cross domain settings. Experiments show that our model can effectively improve the representation learning performance and adversarial robustness of the 3D learning network under zero-shot settings without adversarial training. Our code is available at this https URL.

[CV-82] From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Link: https://arxiv.org/abs/2407.00902
Authors: Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Keywords: Large Language models, Large Language, multiple image-text pairs, similar ICL abilities, capabilities of Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Motivated by the in-context learning (ICL) capabilities of large language models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Considering such modality impact, we further utilize modality-driven demonstration strategies to boost ICL performance. We also identify that demonstration selection is closely related to the models’ ability to capture task inductive biases from multimodal ICL. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks even if those tasks are not seen in or even contradict pretraining data.

[CV-83] Dynamically Modulating Visual Place Recognition Sequence Length For Minimum Acceptable Performance Scenarios

Link: https://arxiv.org/abs/2407.00863
Authors: Connor Malone, Ankit Vora, Thierry Peynot, Michael Milford
Keywords: Mobile robots, GPS become uncertain, critical position estimates, uncertain or unreliable, robots and autonomous
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: DOI TBC

Abstract:Mobile robots and autonomous vehicles are often required to function in environments where critical position estimates from sensors such as GPS become uncertain or unreliable. Single image visual place recognition (VPR) provides an alternative for localization but often requires techniques such as sequence matching to improve robustness, which incurs additional computation and latency costs. Even then, the sequence length required to localize at an acceptable performance level varies widely; and simply setting overly long fixed sequence lengths creates unnecessary latency, computational overhead, and can even degrade performance. In these scenarios it is often more desirable to meet or exceed a set target performance at minimal expense. In this paper we present an approach which uses a calibration set of data to fit a model that modulates sequence length for VPR as needed to exceed a target localization performance. We make use of a coarse position prior, which could be provided by any other localization system, and capture the variation in appearance across this region. We use the correlation between appearance variation and sequence length to curate VPR features and fit a multilayer perceptron (MLP) for selecting the optimal length. We demonstrate that this method is effective at modulating sequence length to maximize the number of sections in a dataset which meet or exceed a target performance whilst minimizing the median length used. We show applicability across several datasets and reveal key phenomena like generalization capabilities, the benefits of curating features and the utility of non-state-of-the-art feature extractors with nuanced properties.

[CV-84] SAFE: a SAR Feature Extractor based on self-supervised learning and masked Siamese ViTs

Link: https://arxiv.org/abs/2407.00851
Authors: Max Muzeau, Joana Frontera-Pons, Chengfang Ren, Jean-Philippe Ovarlez
Keywords: Synthetic Aperture Radar, Aperture Radar imagery, Synthetic Aperture, Aperture Radar, earth monitoring
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Due to its all-weather and day-and-night capabilities, Synthetic Aperture Radar imagery is essential for various applications such as disaster management, earth monitoring, change detection and target recognition. However, the scarcity of labeled SAR data limits the performance of most deep learning algorithms. To address this issue, we propose a novel self-supervised learning framework based on masked Siamese Vision Transformers to create a General SAR Feature Extractor coined SAFE. Our method leverages contrastive learning principles to train a model on unlabeled SAR data, extracting robust and generalizable features. SAFE is applicable across multiple SAR acquisition modes and resolutions. We introduce tailored data augmentation techniques specific to SAR imagery, such as sub-aperture decomposition and despeckling. Comprehensive evaluations on various downstream tasks, including few-shot classification, segmentation, visualization, and pattern detection, demonstrate the effectiveness and versatility of the proposed approach. Our network competes with or surpasses other state-of-the-art methods in few-shot classification and segmentation tasks, even without being trained on the sensors used for the evaluation.

[CV-85] DroBoost: An Intelligent Score and Model Boosting Method for Drone Detection

Link: https://arxiv.org/abs/2407.00830
Authors: Ogulcan Eryuksel, Kamil Anil Ozfuttu, Fatih Cagatay Akyon, Kadir Sahin, Efe Buyukborekci, Devrim Cavusoglu, Sinan Altinuc
Keywords: small visible objects, complex backgrounds, small visible, task where visibility, visibility conditions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Drone detection is a challenging object detection task in which visibility conditions and image quality may be unfavorable, and detection can become difficult due to complex backgrounds, small visible objects, and hard-to-distinguish objects. Both providing high confidence for drone detections and eliminating false detections require efficient algorithms and approaches. Our previous work, based on YOLOv5, uses both real and synthetic data and a Kalman-based tracker to track the detections and increase their confidence using temporal information. Our current work improves on the previous approach by combining several improvements. We used a more diverse dataset that combines multiple sources with synthetic samples chosen from a large synthetic dataset based on an error analysis of the base model. Also, to obtain more resilient confidence scores for objects, we introduced a classification component that discriminates whether the object is a drone or not. Finally, we developed a more advanced scoring algorithm for object tracking that we use to adjust localization confidence. Furthermore, the proposed technique won 1st place in the Drone vs. Bird Challenge (Workshop on Small-Drone Surveillance, Detection and Counteraction Techniques at ICIAP 2021).

[CV-86] Image Classification for Snow Detection to Improve Pedestrian Safety

Link: https://arxiv.org/abs/2407.00818
Authors: Ricardo de Deijn, Rajeev Bukralia
Keywords: visually impaired individuals, winter-related fall injuries, vision approach aimed, reduce winter-related fall, fall injuries
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure, 1 table. Included in MWAIS 2024 Conference Proceedings. Chair: Jacob Young

Abstract: This study presents a computer vision approach aimed at detecting snow on sidewalks and pavements to reduce winter-related fall injuries, especially among elderly and visually impaired individuals. Leveraging fine-tuned VGG-19 and ResNet50 convolutional neural networks (CNNs), the research focuses on identifying snow presence in pavement images. The dataset comprises 98 images evenly split between snowy and snow-free conditions, evaluated with a separate test set using the F1 score and accuracy metrics. This work builds upon existing research by employing fine-tuned CNN architectures to accurately detect snow on pavements from smartphone-captured images. The methodology incorporates transfer learning and model ensembling techniques to integrate the best predictions from both the VGG-19 and ResNet50 architectures. The study yields accuracy and F1 scores of 81.8% and 81.7%, respectively, showcasing the potential of computer vision in addressing winter-related hazards for vulnerable populations.
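
As a rough illustration of ensembling two classifiers and scoring the result, the Python sketch below averages per-image snow probabilities from two models and computes accuracy and a binary F1 score. The averaging rule, the 0.5 threshold, and the positive-class F1 are assumptions made for illustration; the abstract does not specify how the two networks' predictions are combined or which F1 variant is reported.

```python
from typing import List, Sequence


def ensemble_predict(
    prob_a: Sequence[float], prob_b: Sequence[float], threshold: float = 0.5
) -> List[int]:
    """Average two models' snow probabilities and threshold them to 0/1 labels."""
    return [int((a + b) / 2 >= threshold) for a, b in zip(prob_a, prob_b)]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive (snow) class; 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up probabilities from two hypothetical models on four images.
labels = [1, 1, 0, 0]
preds = ensemble_predict([0.9, 0.4, 0.2, 0.6], [0.7, 0.5, 0.1, 0.6])
print(preds, accuracy(labels, preds), f1_score(labels, preds))
```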

[CV-87] A Deep Learning-based Pest Insect Monitoring System for Ultra-low Power Pocket-sized Drones

Link: https://arxiv.org/abs/2407.00815
Authors: Luca Crupi, Luca Butera, Alberto Ferrante, Daniele Palossi
Keywords: agriculture represent game-changer, represent game-changer technologies, precision agriculture represent, sustainable agribusiness, Smart farming
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract: Smart farming and precision agriculture represent game-changer technologies for efficient and sustainable agribusiness. Miniaturized palm-sized drones can act as flexible smart sensors inspecting crops, looking for early signs of potential pest outbreaks. However, achieving such an ambitious goal requires hardware-software codesign to develop accurate deep learning (DL) detection models while keeping memory and computational needs under an ultra-tight budget, i.e., a few MB of on-chip memory and a power envelope of a few hundred mW. This work presents a novel vertically integrated solution featuring two ultra-low power System-on-Chips (SoCs), i.e., the dual-core STM32H74 and a multi-core GWT GAP9, running two State-of-the-Art DL models for detecting the Popillia japonica bug. We fine-tune both models for our image-based detection task, quantize them in 8-bit integers, and deploy them on the two SoCs. On the STM32H74, we deploy a FOMO-MobileNetV2 model, achieving a mean average precision (mAP) of 0.66 and running at 16.1 frame/s within 498 mW. On the GAP9 SoC, we deploy a more complex SSDLite-MobileNetV3, which scores an mAP of 0.79 and peaks at 6.8 frame/s within 33 mW. Compared to a top-notch RetinaNet-ResNet101-FPN full-precision baseline, which requires 14.9x more memory and 300x more operations per inference, our best model drops only 15% in mAP, paving the way toward autonomous palm-sized drones capable of lightweight and precise pest detection.
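
To make the 8-bit quantization step concrete, here is a minimal Python sketch of symmetric per-tensor int8 quantization. It is a generic textbook scheme chosen for illustration, not necessarily what the authors' deployment toolchain does; the function names and the sample weights are assumptions.

```python
from typing import List, Sequence, Tuple


def quantize_int8(values: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.31, -1.0, 0.07, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes, max(abs(w - r) for w, r in zip(weights, restored)))
```

Each value is stored in one byte instead of four, and the rounding error per value is bounded by half the scale, which is the kind of trade-off behind the memory savings and the modest mAP drop reported in the abstract.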

[CV-88] Controlling Faces Frame generation in StyleGANs latent space operations: Modifying faces to deceive our memory

Link: https://arxiv.org/abs/2407.00803
Authors: Agustín Roca, Nicolás Ignacio Britos
Keywords: Innocence Project, reducing wrongful convictions, non-profitable organization, Buenos Aires, Laboratorio de Sueño
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Innocence Project is a non-profit organization that works to reduce wrongful convictions. In collaboration with Laboratorio de Sueño y Memoria from Instituto Tecnológico de Buenos Aires (ITBA), they are studying human memory in the context of face identification. They have a strong hypothesis that human memory relies heavily on the face’s frame to recognize faces. If this is proved, it could mean that face recognition in police lineups cannot be trusted, as it may lead to wrongful convictions. This study uses experiments to try to prove this using faces with different properties, such as eye size, while maintaining the frame as much as possible. In this project, we continue the work from a previous project that provided the basic tool to generate realistic faces using StyleGAN2. We take a deep dive into the internals of this tool to make full use of StyleGAN2 functionalities, while also adding more features, such as modifying certain attributes, including mouth-opening or eye-opening. As the usage of this tool relies heavily on maintaining the face-frame, we develop a way to identify the face-frame of each image and a function to compare it to the output of the neural network after applying some operations. We conclude that the face-frame is maintained when modifying eye-opening or mouth-opening. Modifying vertical face orientation, gender, age, and smile has a considerable impact on frame variation. Finally, horizontal face orientation has a major impact on the face-frame. This way, the Lab may apply some operations with confidence that the face-frame won’t significantly change, making them viable for deceiving subjects’ memories.

[CV-89] InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation

Link: https://arxiv.org/abs/2407.00788
Authors: Haofan Wang, Peng Xing, Renyuan Huang, Hao Ai, Qixun Wang, Xu Bai
Keywords: inventive process designed, Style, designed to create, maintains the essence, content
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract: Style transfer is an inventive process designed to create an image that maintains the essence of the original while embracing the visual style of another. Although diffusion models have demonstrated impressive generative power in personalized subject-driven or style-driven applications, existing state-of-the-art methods still encounter difficulties in achieving a seamless balance between content preservation and style enhancement. For example, amplifying the style’s influence can often undermine the structural integrity of the content. To address these challenges, we deconstruct the style transfer task into three core elements: 1) Style, focusing on the image’s aesthetic characteristics; 2) Spatial Structure, concerning the geometric arrangement and composition of visual elements; and 3) Semantic Content, which captures the conceptual meaning of the image. Guided by these principles, we introduce InstantStyle-Plus, an approach that prioritizes the integrity of the original content while seamlessly integrating the target style. Specifically, our method accomplishes style injection through an efficient, lightweight process, utilizing the cutting-edge InstantStyle framework. To reinforce content preservation, we initiate the process with an inverted content latent noise and a versatile plug-and-play tile ControlNet for preserving the original image’s intrinsic layout. We also incorporate a global semantic adapter to enhance the semantic content’s fidelity. To safeguard against the dilution of style information, a style extractor is employed as a discriminator to provide supplementary style guidance. Code will be available at this https URL.

[CV-90] Diffusion Models and Representation Learning: A Survey

Link: https://arxiv.org/abs/2407.00783
Authors: Michael Fuest, Pingchuan Ma, Ming Gui, Johannes S. Fischer, Vincent Tao Hu, Bjorn Ommer
Keywords: attracting significant attention, Diffusion Models, popular generative modeling, generative modeling methods, attracting significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: GitHub repo: this https URL

Abstract:Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. Github link: this https URL

[CV-91] Chest-Diffusion: A Light-Weight Text-to-Image Model for Report-to-CXR Generation

Link: https://arxiv.org/abs/2407.00752
Authors: Peng Huang, Xue Gao, Lihong Huang, Jing Jiao, Xiaokang Li, Yuanyuan Wang, Yi Guo
Keywords: important implications, diverse and controllable, Stable Diffusion, adapt Stable Diffusion, common stable diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Text-to-image generation has important implications for the generation of diverse and controllable images. Several attempts have been made to adapt Stable Diffusion (SD) to the medical domain. However, the large distribution difference between medical reports and natural texts, as well as the high computational complexity of common Stable Diffusion, limits the authenticity and feasibility of the generated medical images. To solve the above problems, we propose a novel light-weight transformer-based diffusion model learning framework, Chest-Diffusion, for report-to-CXR generation. Chest-Diffusion employs a domain-specific text encoder to obtain accurate and expressive text features to guide image generation, improving the authenticity of the generated images. Meanwhile, we introduce a light-weight transformer architecture as the denoising model, reducing the computational complexity of the diffusion model. Experiments demonstrate that our Chest-Diffusion achieves the lowest FID score of 24.456 under a computation budget of 118.918 GFLOPs, which is nearly one-third of the computational complexity of SD.

[CV-92] PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph

Link: https://arxiv.org/abs/2407.00742
Authors: Dazhou Yu, Yuntong Hu, Yun Li, Liang Zhao
Keywords: building pattern classification, geographic question answering, encompassing tasks, shape coding, building pattern
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Polygon representation learning is essential for diverse applications, encompassing tasks such as shape coding, building pattern classification, and geographic question answering. While recent years have seen considerable advancements in this field, much of the focus has been on single polygons, overlooking the intricate inner- and inter-polygonal relationships inherent in multipolygons. To address this gap, our study introduces a comprehensive framework specifically designed for learning representations of polygonal geometries, particularly multipolygons. Central to our approach is the incorporation of a heterogeneous visibility graph, which seamlessly integrates both inner- and inter-polygonal relationships. To enhance computational efficiency and minimize graph redundancy, we implement a heterogeneous spanning tree sampling method. Additionally, we devise a rotation-translation invariant geometric representation, ensuring broader applicability across diverse scenarios. Finally, we introduce Multipolygon-GNN, a novel model tailored to leverage the spatial and semantic heterogeneity inherent in the visibility graph. Experiments on five real-world and synthetic datasets demonstrate its ability to capture informative representations for polygonal geometries.

[CV-93] Engineering an Efficient Object Tracker for Non-Linear Motion

Link: https://arxiv.org/abs/2407.00738
Authors: Momir Adžemović, Predrag Tadić, Andrija Petrović, Mladen Nikolić
Keywords: maintaining unique identifiers, video frames, detect and track, scene while maintaining, maintaining unique
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 3 figures, 20 tables

Abstract: The goal of multi-object tracking is to detect and track all objects in a scene while maintaining unique identifiers for each, by associating their bounding boxes across video frames. This association relies on matching motion and appearance patterns of detected objects. This task is especially hard in scenarios involving dynamic and non-linear motion patterns. In this paper, we introduce DeepMoveSORT, a novel, carefully engineered multi-object tracker designed specifically for such scenarios. In addition to standard methods of appearance-based association, we improve motion-based association by employing deep learnable filters (instead of the most commonly used Kalman filter) and a rich set of newly proposed heuristics. Our improvements to motion-based association methods are severalfold. First, we propose a new transformer-based filter architecture, TransFilter, which uses an object’s motion history for both motion prediction and noise filtering. We further enhance the filter’s performance by careful handling of its motion history and accounting for camera motion. Second, we propose a set of heuristics that exploit cues from the position, shape, and confidence of detected bounding boxes to improve association performance. Our experimental evaluation demonstrates that DeepMoveSORT outperforms existing trackers in scenarios featuring non-linear motion, surpassing state-of-the-art results on three such datasets. We also perform a thorough ablation study to evaluate the contributions of the different tracker components we propose. Based on our study, we conclude that using a learnable filter instead of the Kalman filter, along with appearance-based association, is key to achieving strong general tracking performance.
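
The bounding-box association step that such trackers build on can be sketched in a few lines of Python. The example below is a deliberately simple baseline, greedy matching by intersection-over-union (IoU), and is not DeepMoveSORT's learned filter or its heuristics; the box format `(x1, y1, x2, y2)`, the `min_iou` threshold, and the function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(
    tracks: Sequence[Box], detections: Sequence[Box], min_iou: float = 0.3
) -> List[Tuple[int, int]]:
    """Match predicted track boxes to detections, highest IoU first."""
    scored = sorted(
        ((iou(t, d), ti, di) for ti, t in enumerate(tracks) for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in scored:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


# Two predicted track boxes and two detections listed in the opposite order.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(sorted(greedy_associate(tracks, detections)))
```

The quality of the predicted track boxes fed into a matcher like this depends on the motion model, which is where a Kalman filter or a learned filter would come in.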

[CV-94] LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Link: https://arxiv.org/abs/2407.00737
Authors: Mushui Liu, Yuhang Ma, Xinfeng Zhang, Yang Zhen, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan
Keywords: exhibited substantial success, Large Language Models, Diffusion Models, exhibited substantial, substantial success
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 13 figures

Abstract: Diffusion Models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts that involve multiple objects, attribute binding, and long descriptions. This paper proposes a framework called LLM4GEN, which enhances the semantic understanding ability of text-to-image diffusion models by leveraging the semantic representation of Large Language Models (LLMs). Through a specially designed Cross-Adapter Module (CAM) that combines the original text features of text-to-image models with LLM features, LLM4GEN can be easily incorporated into various diffusion models as a plug-and-play component and enhances text-to-image generation. Additionally, to facilitate the semantic understanding of complex and dense prompts, we develop a LAION-refined dataset, consisting of 1 million (M) text-image pairs with improved image descriptions. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. With just 10% of the training data required by the recent ELLA, LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 7.69% and 9.60% in color on T2I-CompBench, respectively. The extensive experiments on DensePrompts also demonstrate that LLM4GEN surpasses existing state-of-the-art models in terms of sample quality, image-text alignment, and human evaluation. The project website is at: this https URL

[CV-95] CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation

Link: https://arxiv.org/abs/2407.00697
Authors: Huawei Sun, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille
Keywords: Depth estimation, scenes accurately, driving for interpreting, critical in autonomous, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Accepted by IROS 2024

Abstract: Depth estimation is critical in autonomous driving for interpreting 3D scenes accurately. Recently, radar-camera depth estimation has attracted considerable interest due to the robustness and low cost of radar. Thus, this paper introduces a two-stage, end-to-end trainable Confidence-aware Fusion Net (CaFNet) for dense depth estimation, combining RGB imagery with sparse and noisy radar point cloud data. The first stage addresses radar-specific challenges, such as ambiguous elevation and noisy measurements, by predicting a radar confidence map and a preliminary coarse depth map. A novel approach is presented for generating the ground truth for the confidence map, which involves associating each radar point with its corresponding object to identify potential projection surfaces. These maps, together with the initial radar input, are processed by a second encoder. For the final depth estimation, we introduce a confidence-aware gated fusion mechanism to integrate radar and image features effectively, thereby enhancing the reliability of the depth map by filtering out radar noise. Our methodology, evaluated on the nuScenes dataset, demonstrates superior performance, improving upon the current leading model by 3.2% in Mean Absolute Error (MAE) and 2.7% in Root Mean Square Error (RMSE).
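
The two metrics named in the abstract, MAE and RMSE, can be computed as in the minimal Python sketch below. Treating pixels with ground-truth depth of zero or less as invalid is an assumption made here for illustration (sparse depth ground truth commonly has missing pixels); the function name and sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def depth_errors(pred: Sequence[float], gt: Sequence[float]) -> Tuple[float, float]:
    """Return (MAE, RMSE) over pixels with valid ground truth (gt > 0, an assumption)."""
    diffs = [p - g for p, g in zip(pred, gt) if g > 0]
    if not diffs:
        raise ValueError("no valid ground-truth pixels")
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Made-up depths in meters; the last pixel has no ground truth and is skipped.
mae, rmse = depth_errors([1.0, 2.0, 3.0, 9.0], [1.0, 2.0, 5.0, 0.0])
print(mae, rmse)
```

Because RMSE squares each error before averaging, it penalizes large outliers more than MAE does, which is why the two are usually reported together.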

[CV-96] Multi-Task Learning for Affect Analysis

Link: https://arxiv.org/abs/2407.00679
Authors: Fazeel Asim
Keywords: Undergraduate Final Year, Final Year dissertation, Undergraduate Final, Final Year, Dimitrios Kollias
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: This project was my Undergraduate Final Year dissertation, supervised by Dimitrios Kollias. This research delves into the realm of affective computing for image analysis, aiming to enhance the efficiency and effectiveness of multi-task learning in the context of emotion recognition. This project investigates two primary approaches: uni-task solutions and a multi-task approach to the same problems. Each approach undergoes testing, exploring various formulations, variations, and initialization strategies to come up with the best configuration. The project utilizes an existing neural network architecture, adapting it for multi-task learning by modifying output layers and loss functions. Tasks encompass recognition of 7 basic emotions, action unit detection, and valence-arousal estimation. Comparative analyses involve uni-task models for each individual task, facilitating the assessment of multi-task model performance. Variations within each approach, including loss functions and hyperparameter tuning, undergo evaluation. The impact of different initialization strategies and pre-training techniques on model convergence and accuracy is explored. The research aspires to contribute to the burgeoning field of affective computing, with applications spanning healthcare, marketing, and human-computer interaction. By systematically exploring multi-task learning formulations, this research aims to contribute to the development of more accurate and efficient models for recognizing and understanding emotions in images. The findings hold promise for applications in diverse industries, paving the way for advancements in affective computing.

[CV-97] Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Link: https://arxiv.org/abs/2407.00676
Authors: Yuchuan Tian, Jianhong Han, Hanting Chen, Yuanyuan Xi, Guoyang Zhang, Jie Hu, Chao Xu, Yunhe Wang
Keywords: low-level vision, low-level vision tasks, low-level vision models, handful of low-level, intensive computation costs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures

Abstract:Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT – an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at this https URL.
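
The idea of adding a low-rank, task-specific bias to a shared weight can be sketched in plain Python as follows. This is an illustrative reading of the abstract, assuming the adapted weight is the shared weight plus a product of two thin matrices; the names `task_weight`, `u`, and `v` and the example sizes are hypothetical, and the paper's actual parameterization may differ.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def task_weight(shared: Matrix, u: Matrix, v: Matrix) -> Matrix:
    """Shared weight plus a low-rank task-specific bias u @ v."""
    delta = matmul(u, v)
    return [[s + d for s, d in zip(s_row, d_row)] for s_row, d_row in zip(shared, delta)]


def low_rank_param_count(m: int, n: int, r: int) -> int:
    """Parameters stored per task for a rank-r bias on an m x n weight."""
    return r * (m + n)


# A 2 x 2 shared weight with a rank-1 bias for one hypothetical task.
shared = [[1.0, 0.0], [0.0, 1.0]]
print(task_weight(shared, u=[[1.0], [2.0]], v=[[3.0, 4.0]]))
print(low_rank_param_count(512, 512, 4), "vs", 512 * 512)
```

Storing only the two thin factors per task keeps the per-task overhead small relative to a full copy of the weight, which fits the abstract's emphasis on handling many tasks at modest cost.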

[CV-98] Resolving Variable Respiratory Motion From Unsorted 4D Computed Tomography

Link: https://arxiv.org/abs/2407.00665
Authors: Yuliang Huang, Bjoern Eiben, Kris Thielemans, Jamie R. McClelland
Keywords: Computed Tomography, radiotherapy treatment planning, PET and ventilation, downstream clinical applications, clinical applications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2024

Abstract: 4D Computed Tomography (4DCT) is widely used for many clinical applications such as radiotherapy treatment planning, PET and ventilation imaging. However, common 4DCT methods reconstruct multiple breath cycles into a single, arbitrary breath cycle, which can lead to various artefacts, impacting the downstream clinical applications. Surrogate-driven motion models can estimate continuous variable motion across multiple cycles based on CT segments 'unsorted' from 4DCT, but they require respiration surrogate signals with strong correlation to the internal motion, which are not always available. The method proposed in this study eliminates such dependency by adapting the hyper-gradient method to the optimization of surrogate signals as hyper-parameters, while achieving better or comparable performance, as demonstrated on digital phantom simulations and real patient data. Our method produces a high-quality motion-compensated image together with estimates of the motion, including breath-to-breath variability, throughout the image acquisition. Our method has the potential to improve downstream clinical applications, and also enables retrospective analysis of open-access 4DCT datasets where no respiration signals are stored. Code is available at this https URL.

[CV-99] SCMIL: Sparse Context-aware Multiple Instance Learning for Predicting Cancer Survival Probability Distribution in Whole Slide Images

Link: https://arxiv.org/abs/2407.00664
Authors: Zekang Yang, Hong Liu, Xiangdong Wang
Keywords: Slide Image, Cancer survival prediction, Cancer survival, involves analyzing, tumor microenvironment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI2024

Abstract: Cancer survival prediction is a challenging task that involves analyzing the tumor microenvironment within Whole Slide Images (WSIs). Previous methods cannot effectively capture the intricate interaction features among instances within the local area of a WSI. Moreover, existing methods for cancer survival prediction based on WSIs often fail to provide clinically meaningful predictions. To overcome these challenges, we propose a Sparse Context-aware Multiple Instance Learning (SCMIL) framework for predicting cancer survival probability distributions. SCMIL innovatively segments patches into various clusters based on their morphological features and spatial location information, subsequently leveraging sparse self-attention to discern the relationships between these patches with a context-aware perspective. Considering that many patches are irrelevant to the task, we introduce a learnable patch filtering module called SoftFilter, which ensures that only interactions between task-relevant patches are considered. To enhance the clinical relevance of our prediction, we propose a register-based mixture density network to forecast the survival probability distribution for individual patients. We evaluate SCMIL on two public WSI datasets from The Cancer Genome Atlas (TCGA), specifically focusing on lung adenocarcinoma (LUAD) and kidney renal clear cell carcinoma (KIRC). Our experimental results indicate that SCMIL outperforms current state-of-the-art methods for survival prediction, offering more clinically meaningful and interpretable outcomes. Our code is accessible at this https URL.

[CV-100] Tarsier: Recipes for Training and Evaluating Large Video Description Models

Link: https://arxiv.org/abs/2407.00634
Authors: Jiawei Wang, Liping Yuan, Yuchen Zhang
Keywords: Generating fine-grained video, Generating fine-grained, fundamental challenge, video, Generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a +12.3% advantage against GPT-4V and a -6.7% disadvantage against Gemini 1.5 Pro. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at this https URL.

[CV-101] DEAR: Disentangled Environment and Agent Representations for Reinforcement Learning without Reconstruction

Link: https://arxiv.org/abs/2407.00633
Authors: Ameya Pore, Riccardo Muradore, Diego Dall’Alba
Keywords: Reinforcement Learning, amount of data, complex and unstructured, large amount, scene is complex
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 8 figures, 2 tables. Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Abstract:Reinforcement Learning (RL) algorithms can learn robotic control tasks from visual observations, but they often require a large amount of data, especially when the visual scene is complex and unstructured. In this paper, we explore how the agent’s knowledge of its shape can improve the sample efficiency of visual RL methods. We propose a novel method, Disentangled Environment and Agent Representations (DEAR), that uses the segmentation mask of the agent as supervision to learn disentangled representations of the environment and the agent through feature separation constraints. Unlike previous approaches, DEAR does not require reconstruction of visual observations. These representations are then used as an auxiliary loss to the RL objective, encouraging the agent to focus on the relevant features of the environment. We evaluate DEAR on two challenging benchmarks: Distracting DeepMind control suite and Franka Kitchen manipulation tasks. Our findings demonstrate that DEAR surpasses state-of-the-art methods in sample efficiency, achieving comparable or superior performance with reduced parameters. Our results indicate that integrating agent knowledge into visual RL methods has the potential to enhance their learning efficiency and robustness.

[CV-102] CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

Link: https://arxiv.org/abs/2407.00632
Authors: Pengying Wu, Yao Mu, Kangjie Zhou, Ji Ma, Junting Chen, Chang Liu
Keywords: Visual navigation tasks, Visual navigation, household service robots, Visual, service robots
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: Accepted to the RSS 2024 Workshop: GROUND

Abstract: Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative to ensure successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in household scenarios, specifically in the use of multiple agents collaborating to complete complex navigation tasks through communication, remains unexplored. Therefore, this paper proposes a framework for decentralized multi-agent navigation, leveraging LLM-enabled communication and collaboration. By designing a communication-triggered dynamic leadership organization structure, we achieve faster team consensus with fewer communication instances, leading to better navigation effectiveness and collaborative exploration efficiency. With the proposed novel communication scheme, our framework promises to be conflict-free and robust in multi-object navigation tasks, even when there is a surge in team size.

[CV-103] Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness

Link: https://arxiv.org/abs/2407.00623
Authors: Yiquan Li, Zhongzhu Chen, Kun Jin, Jiongxiao Wang, Bo Li, Chaowei Xiao
Keywords: purifying noised images, Denoising Diffusion Probabilistic, Diffusion Probabilistic Model, purified images, Stochastic Diffusion Model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and effectiveness. While the Denoising Diffusion Probabilistic Model (DDPM) offers an efficient single-step purification, it falls short in ensuring purified images reside on the data manifold. Conversely, the Stochastic Diffusion Model effectively places purified images on the data manifold but demands solving cumbersome stochastic differential equations, while its derivative, the Probability Flow Ordinary Differential Equation (PF-ODE), though solving simpler ordinary differential equations, still requires multiple computational steps. In this work, we demonstrate that an ideal purification pipeline should, for effectiveness, generate purified images that lie on the data manifold and are as semantically aligned with the original images as possible, and, for efficiency, do so in one step. Therefore, we introduce Consistency Purification, an efficiency-effectiveness Pareto superior purifier compared to the previous work. Consistency Purification employs the consistency model, a one-step generative model distilled from PF-ODE, and can thus generate on-manifold purified images with a single network evaluation. However, the consistency model is not designed for purification, so it does not inherently ensure semantic alignment between purified and original images. To resolve this issue, we further refine it through Consistency Fine-tuning with LPIPS loss, which enables more aligned semantic meaning while keeping the purified images on the data manifold. Our comprehensive experiments demonstrate that our Consistency Purification framework achieves state-of-the-art certified robustness and efficiency compared to baseline methods.
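
Background note: the abstract measures success by certified robustness under randomized smoothing. As general context, not taken from this paper, the standard L2 certificate from Cohen et al. (2019) gives a radius R = sigma * Phi^{-1}(p_A), where p_A is a lower bound on the probability that the smoothed classifier returns the top class under Gaussian noise of standard deviation sigma. The sketch below only evaluates that formula; the function name and arguments are hypothetical, and it does not model the purifier itself.

```python
from statistics import NormalDist


def certified_radius(p_a_lower: float, sigma: float) -> float:
    """L2 certified radius R = sigma * Phi^{-1}(p_A) for randomized smoothing.

    p_a_lower: lower confidence bound on the top-class probability under noise.
    sigma: standard deviation of the Gaussian smoothing noise.
    Returns 0.0 when no radius can be certified (p_A <= 0.5).
    """
    if not 0.0 <= p_a_lower < 1.0:
        raise ValueError("p_a_lower must lie in [0, 1)")
    if p_a_lower <= 0.5:
        return 0.0  # the smoothed classifier abstains
    return sigma * NormalDist().inv_cdf(p_a_lower)


if __name__ == "__main__":
    # A better purifier raises p_A, which enlarges the certified radius.
    for p in (0.6, 0.9, 0.99):
        print(p, round(certified_radius(p, sigma=0.5), 4))
```

Read this way, a purifier helps certification to the extent that it raises p_A for a given sigma, which is why on-manifold, semantically aligned outputs matter.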

[CV-104] Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Grasping in Dexterous Robotics

Link: https://arxiv.org/abs/2407.00614
Authors: Fan Yang, Wenrui Chen, Kailun Yang, Haoran Lin, DongSheng Luo, Conghui Tang, Zhiyong Li, Yaonan Wang
Keywords: touching specific areas, specific areas precisely, Affordance, initial step, step is teaching
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: The source code and the established dataset will be made publicly available at this https URL

Abstract: To enable robots to use tools, the initial step is teaching robots to employ dexterous gestures for touching specific areas precisely where tasks are performed. Affordance features of objects serve as a bridge in the functional interaction between agents and objects. However, leveraging these affordance cues to help robots achieve functional tool grasping remains unresolved. To address this, we propose a granularity-aware affordance feature extraction method for locating functional affordance areas and predicting dexterous coarse gestures. We study the intrinsic mechanisms of human tool use. On one hand, we use fine-grained affordance features of object-functional finger contact areas to locate functional affordance regions. On the other hand, we use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. Additionally, we introduce a model-based post-processing module that includes functional finger coordinate localization, finger-to-end coordinate transformation, and force feedback-based coarse-to-fine grasping. This forms a complete dexterous robotic functional grasping framework, GAAF-Dex, which learns Granularity-Aware Affordances from human-object interaction for tool-based Functional grasping in Dexterous Robotics. Unlike fully-supervised methods that require extensive data annotation, we employ a weakly supervised approach to extract relevant cues from exocentric (Exo) images of hand-object interactions to supervise feature extraction in egocentric (Ego) images. We have constructed a small-scale dataset, FAH, which includes nearly 6K Exo and Ego images of functional hand-object interaction, covering 18 commonly used tools performing 6 tasks. Extensive experiments on the dataset demonstrate our method outperforms state-of-the-art methods. The code will be made publicly available at this https URL.

[CV-105] ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Link: https://arxiv.org/abs/2407.00609
Authors: Quang P.M. Pham, Khoi T.N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy
Keywords: understanding tasks due, explicit nature, scene understanding tasks, tasks due, compact and explicit
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

[CV-106] Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

Link: https://arxiv.org/abs/2407.00608
Authors: Shian Du, Xiaotian Cheng, Qi Qian, Henglu Wei, Yi Xu, Xiangyang Ji
Keywords: attracted unprecedented attention, generating highly-personalized images, input concept dataset, textual prompt, input textual prompt
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Personalized text-to-image generation has attracted unprecedented attention in the past few years due to its unique capability of generating highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, degrading their ability to combine with different textual prompts. Besides, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding can not only faithfully reconstruct the input image but also significantly improve its alignment with a novel input textual prompt. Furthermore, we observe that optimizing in the textual subspace leads to a significant improvement in robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.

[CV-107] Hierarchical Memory for Long Video QA

Link: https://arxiv.org/abs/2407.00603
Authors: Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
Keywords: Long Video VQA, LOVEU Challenge, Video VQA, paper describes, describes our champion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: This paper describes our champion solution to the LOVEU Challenge @ CVPR’24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency, while preserving the essential information for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further utilize the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: this https URL

[CV-108] GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Link: https://arxiv.org/abs/2407.00600
Authors: Yisong Xiao, Aishan Liu, QianJia Cheng, Zhenfei Yin, Siyuan Liang, Jiapeng Li, Jing Shao, Xianglong Liu, Dacheng Tao
Keywords: Large Vision-Language Models, Large Vision-Language, exhibit significant gender, widely adopted, exhibit significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract: Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first utilize text-to-image diffusion models to generate occupation images and their gender counterfactuals. Subsequently, we generate corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals to expose biases in LVLMs, applicable in both multimodal and unimodal contexts through modifying gender attributes in specific modalities. Overall, our GenderBias-VL benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (e.g., LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases presented by these models. The dataset and code are available at this https URL.

[CV-109] Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP

Link: https://arxiv.org/abs/2407.00592
Authors: Ayush Ranjan, Daniel Wen, Karthik Bhat
Keywords: responsible application, CLIP, Discrepancy Analysis Framework, Transformative Caption Analysis, CLIP image comprehension
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding the limitations and weaknesses of state-of-the-art models in artificial intelligence is crucial for their improvement and responsible application. In this research, we focus on CLIP, a model renowned for its integration of vision and language processing. Our objective is to uncover recurring problems and blind spots in CLIP’s image comprehension. By delving into both the commonalities and disparities between CLIP and human image understanding, we augment our comprehension of these models’ capabilities. Through our analysis, we reveal significant discrepancies in CLIP’s interpretation of images compared to human perception, shedding light on areas requiring improvement. Our methodologies, the Discrepancy Analysis Framework (DAF) and the Transformative Caption Analysis for CLIP (TCAC), enable a comprehensive evaluation of CLIP’s performance. We identify 14 systemic faults, including Action vs. Stillness confusion, Failure to identify the direction of movement or positioning of objects in the image, Hallucination of Water-like Features, Misattribution of Geographic Context, among others. By addressing these limitations, we lay the groundwork for the development of more accurate and nuanced image embedding models, contributing to advancements in artificial intelligence.

[CV-110] OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Link: https://arxiv.org/abs/2407.00574
Authors: Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Angela Yao
Keywords: motion, camera motion, scale factor, unknown scale factor, Simultaneous Localization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures, 4 tables

Abstract:Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Mapping (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale Calibration (OfCaM), a novel framework that utilizes prior knowledge from human mesh recovery (HMR) models to directly calibrate the unknown scale factor. Specifically, OfCaM leverages the absolute depth of human-background contact joints from HMR predictions as a calibration reference, enabling the precise recovery of SLAM camera trajectory scale in global space. With this correctly scaled camera motion and HMR’s local motion predictions, we achieve more accurate global human motion estimation. To compensate for scenes where we detect SLAM failure, we adopt a local-to-global motion mapping to fuse with previously derived motion to enhance robustness. Simple yet powerful, our method sets a new standard for global human mesh estimation tasks, reducing global human motion error by 60% over the prior SOTA while also demanding orders of magnitude less inference time compared with optimization-based methods.
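
Illustrative sketch: the abstract describes using the absolute depth of human-background contact joints from HMR as a reference for the unknown SLAM scale. One simple way to picture that idea is to compare metric depths with the corresponding up-to-scale SLAM depths and take a robust ratio. Everything below is an assumption made for illustration (the function names, the use of a median, and the plain-list inputs); it is not the authors' implementation.

```python
from statistics import median


def calibrate_scale(hmr_depths, slam_depths):
    """Estimate a single scale factor from paired depths.

    hmr_depths: metric depths of contact joints (hypothetical HMR output).
    slam_depths: depths of the same points in the up-to-scale SLAM frame.
    The median ratio is an assumed aggregation, chosen to resist outliers.
    """
    ratios = [h / s for h, s in zip(hmr_depths, slam_depths) if h > 0 and s > 0]
    if not ratios:
        raise ValueError("no valid contact-joint depth pairs")
    return median(ratios)


def rescale_trajectory(positions, scale):
    """Apply the scale factor to every camera position (x, y, z)."""
    return [(scale * x, scale * y, scale * z) for x, y, z in positions]


if __name__ == "__main__":
    scale = calibrate_scale([2.0, 4.0, 6.2], [1.0, 2.0, 3.0])
    print(scale)
    print(rescale_trajectory([(1.0, 0.0, 0.5)], scale))
```

Because the ratio is computed directly from paired depths, no iterative optimization is involved, which matches the "optimization-free" framing in the abstract.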

[CV-111] Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

Link: https://arxiv.org/abs/2407.00569
Authors: Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, Bing Qin
Keywords: Large Vision-Language Models, Large Vision-Language, understanding visual information, human languages, generated hallucinations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Main Conference. 21 pages, 20 figures

Abstract: Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24% of the snowballed multimodal hallucination while maintaining capabilities.

[CV-112] Explaining Chest X-ray Pathology Models using Textual Concepts

Link: https://arxiv.org/abs/2407.00557
Authors: Vijay Sadashivaiah, Mannudeep K. Kalra, Pingkun Yan, James A. Hendler
Keywords: Deep learning models, opaque nature poses, nature poses challenges, Deep learning, revolutionized medical imaging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Deep learning models have revolutionized medical imaging and diagnostics, yet their opaque nature poses challenges for clinical adoption and trust. Amongst approaches to improve model interpretability, concept-based explanations aim to provide concise and human-understandable explanations of any arbitrary classifier. However, such methods usually require a large amount of manually collected data with concept annotation, which is often scarce in the medical domain. In this paper, we propose Conceptual Counterfactual Explanations for Chest X-ray (CoCoX), which leverages the joint embedding space of an existing vision-language model (VLM) to explain black-box classifier outcomes without the need for annotated datasets. Specifically, we utilize textual concepts derived from chest radiography reports and a pre-trained chest radiography-based VLM to explain three common cardiothoracic pathologies. We demonstrate that the explanations generated by our method are semantically meaningful and faithful to underlying pathologies.

[CV-113] Privacy-Preserving and Trustworthy Deep Learning for Medical Imaging

Link: https://arxiv.org/abs/2407.00538
Authors: Kiarash Sedghighadikolaei, Attila A Yavuz
Keywords: Deep Radiomics, Deep Radiomics pipeline, impacted healthcare systems, Machine Learning, notably impacted healthcare
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract: The shift towards efficient and automated data analysis through Machine Learning (ML) has notably impacted healthcare systems, particularly Radiomics. Radiomics leverages ML to analyze medical images accurately and efficiently for precision medicine. Current methods rely on Deep Learning (DL) to improve performance and accuracy (Deep Radiomics). Given the sensitivity of medical images, ensuring privacy throughout the Deep Radiomics pipeline, from data generation and collection to model training and inference, is essential, especially when outsourced. Thus, Privacy-Enhancing Technologies (PETs) are crucial tools for Deep Radiomics. Previous studies and systematization efforts have either broadly overviewed PETs and their applications or mainly focused on subsets of PETs for ML algorithms. In Deep Radiomics, where efficiency, accuracy, and privacy are crucial, many PETs, while theoretically applicable, may not be practical without specialized optimizations or hybrid designs. Additionally, not all DL models are suitable for Radiomics. Consequently, there is a need for specialized studies that investigate and systematize the effective and practical integration of PETs into the Deep Radiomics pipeline. This work addresses this research gap by (1) classifying existing PETs and presenting practical hybrid PET constructions and a taxonomy illustrating their potential integration with the Deep Radiomics pipeline, with comparative analyses detailing assumptions, architectural suitability, and security; (2) offering technical insights and describing means of combining PETs into the Deep Radiomics pipeline, including integration strategies, subtleties, and potential challenges; and (3) proposing potential research directions, identifying challenges, and suggesting solutions to enhance the PETs in Deep Radiomics.

[CV-114] AI-powered multimodal modeling of personalized hemodynamics in aortic stenosis

Link: https://arxiv.org/abs/2407.00535
Authors: Caglar Ozturk, Daniel H. Pak, Luca Rosalia, Debkalpa Goswami, Mary E. Robakowski, Raymond McKay, Christopher T. Nguyen, James S. Duncan, Ellen T. Roche
Keywords: common valvular heart, valvular heart disease, Aortic stenosis, developed countries, common valvular
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: CO and DHP contributed equally to this work. JSD and ETR are corresponding authors

Abstract:Aortic stenosis (AS) is the most common valvular heart disease in developed countries. High-fidelity preclinical models can improve AS management by enabling therapeutic innovation, early diagnosis, and tailored treatment planning. However, their use is currently limited by complex workflows necessitating lengthy expert-driven manual operations. Here, we propose an AI-powered computational framework for accelerated and democratized patient-specific modeling of AS hemodynamics from computed tomography. First, we demonstrate that our automated meshing algorithms can generate task-ready geometries for both computational and benchtop simulations with higher accuracy and 100 times faster than existing approaches. Then, we show that our approach can be integrated with fluid-structure interaction and soft robotics models to accurately recapitulate a broad spectrum of clinical hemodynamic measurements of diverse AS patients. The efficiency and reliability of these algorithms make them an ideal complementary tool for personalized high-fidelity modeling of AS biomechanics, hemodynamics, and treatment planning.

[CV-115] A Medical Low-Back Pain Physical Rehabilitation Dataset for Human Body Movement Analysis

Link: https://arxiv.org/abs/2407.00521
Authors: Sao Mai Nguyen, Maxime Devanne, Olivier Remy-Neris, Mathieu Lempereur, André Thepaut
Keywords: showing encouraging results, non-medical applications, limited use contexts, automatic monitoring, monitoring and coaching
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: While automatic monitoring and coaching of exercises are showing encouraging results in non-medical applications, they still have limitations such as errors and limited use contexts. To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we identify in this article four challenges to address and propose a medical dataset of clinical patients carrying out low-back pain rehabilitation exercises. The dataset includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations to assess correctness, together with error classification and localisation by body part and timespan. Along with this dataset, we perform a complete research path, from data collection to processing, and finally a small benchmark. We evaluated two baseline movement recognition algorithms on the dataset, pertaining to two different approaches: the probabilistic approach with a Gaussian Mixture Model (GMM), and the deep learning approach with a Long Short-Term Memory (LSTM). This dataset is valuable because it includes rehabilitation-relevant motions in a clinical setting with patients in their rehabilitation program, using a cost-effective, portable, and convenient sensor, and because it shows the potential for improvement on these challenges.

ACM classes: I.5.4; I.4.8
Journal reference: IJCNN 2024

[CV-116] Toward a Diffusion-Based Generalist for Dense Vision Tasks

Link: https://arxiv.org/abs/2407.00503
Authors: Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari
Keywords: Building generalized models, Building generalized, intriguing direction, solve many computer, Building
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at CVPR 2024 as a workshop paper

Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.

[CV-117] Intrinsic PAPR for Point-level 3D Scene Albedo and Shading Editing

Link: https://arxiv.org/abs/2407.00500
Authors: Alireza Moazeni, Shichong Peng, Ke Li
Keywords: multi-view RGB images, RGB images, multi-view RGB, Intrinsic PAPR, neural rendering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract: Recent advancements in neural rendering have excelled at novel view synthesis from multi-view RGB images. However, they often lack the capability to edit the shading or colour of the scene at a detailed point level while ensuring consistency across different viewpoints. In this work, we address the challenge of point-level 3D scene albedo and shading editing from multi-view RGB images, focusing on detailed editing at the point level rather than at a part or global level. While prior works based on volumetric representation such as NeRF struggle with achieving 3D consistent editing at the point level, recent advancements in point-based neural rendering show promise in overcoming this challenge. We introduce "Intrinsic PAPR", a novel method based on the recent point-based neural rendering technique Proximity Attention Point Rendering (PAPR). Unlike other point-based methods that model the intrinsic decomposition of the scene, our approach does not rely on complicated shading models or simplistic priors that may not universally apply. Instead, we directly model scene decomposition into albedo and shading components, leading to better estimation accuracy. Comparative evaluations against the latest point-based inverse rendering methods demonstrate that Intrinsic PAPR achieves higher-quality novel view rendering and superior point-level albedo and shading editing.

[CV-118] Quantifying Spuriousness of Biased Datasets Using Partial Information Decomposition

Link: https://arxiv.org/abs/2407.00482
Authors: Barproda Halder, Faisal Hamman, Pasan Dissanayake, Qiuyi Zhang, Ilia Sucholutsky, Sanghamitra Dutta
Keywords: Spurious patterns refer, Partial Information Decomposition, unique information, causally related, patterns refer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Information Theory (cs.IT)
Comments: Accepted at ICML 2024 Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Abstract:Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related. However, this notion of spuriousness, which is usually introduced due to sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about another target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness and derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features in a dataset could lead a model into choosing the spurious features over the core features for inference, often having low worst-group-accuracy. We also propose a novel autoencoder-based estimator for computing unique information that is able to handle high-dimensional image data. Finally, we also show how this unique information in the spurious feature is reduced across several dataset-based spurious-pattern-mitigation techniques such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group-accuracy.
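
For readers unfamiliar with Partial Information Decomposition, the decomposition the abstract refers to can be written as follows. This is the standard two-source PID identity rather than a formula quoted from the paper; the symbols are assumptions chosen here, with Y the target label, F the core features, and B the spurious features.

```latex
\begin{aligned}
I\bigl(Y; (F, B)\bigr) &= \mathrm{Uni}(Y : B \mid F) + \mathrm{Uni}(Y : F \mid B)
                         + \mathrm{Red}(Y : F, B) + \mathrm{Syn}(Y : F, B), \\
I(Y; B) &= \mathrm{Uni}(Y : B \mid F) + \mathrm{Red}(Y : F, B), \\
I(Y; F) &= \mathrm{Uni}(Y : F \mid B) + \mathrm{Red}(Y : F, B).
\end{aligned}
```

In this notation, the quantity the abstract proposes as a spuriousness measure corresponds to the unique information in the spurious features, Uni(Y : B | F): information about the label that B carries and F does not.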

[CV-119] Development of an interactive GUI using MATLAB for the detection of type and stage of Breast Tumor

Link: https://arxiv.org/abs/2407.00480
Authors: Poulmi Banerjee, Satadal Saha
Keywords: Breast lumps, Breast, Breast cancer, common types, breast lump image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Breast cancer is one of the most common types of cancer and is diagnosed mainly in women: females are more prone to it than males. Breast lumps are classified into two main groups, cancerous and non-cancerous. A cancerous lump can spread via the lobules, ducts, areola, and stroma to various organs of the body. Non-cancerous lumps are less harmful but should be monitored under proper diagnosis so that they do not turn cancerous. Breast lumps are diagnosed using mammograms, ultrasound images, and MRI images. For a better diagnosis, doctors sometimes also recommend a biopsy, and any unforeseen anomalies occurring there may lead to an inaccurate test report. To avoid such discrepancies, processing the mammogram images is considered one of the most reliable methods. In the proposed method, a MATLAB GUI is developed and sample images of breast lumps are placed in their respective axes. With the help of sliders, the actual breast lump image is compared with the stored sample images, and the history of the breast lumps is then generated in real time in the form of a test report.

[CV-120] MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Link: https://arxiv.org/abs/2407.00468
Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
Keywords: Large Multimodal Models, exhibit impressive cross-modal, impressive cross-modal understanding, Multimodal Models, Large Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, code released at this https URL, Homepage at this https URL

Click to view abstract

Abstract:Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
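The triplet structure described above can be illustrated with a minimal sketch. It assumes, as a hypothetical scoring rule that the abstract does not spell out, that a triplet only earns credit when the original, perception, and knowledge anchor questions are all answered correctly. The function names and the `Triplet` type are illustrative, not taken from the MMEvalPro code.

```python
from typing import List, Tuple

# One entry per triplet: (original correct, perception correct, knowledge correct).
# This layout is an assumption made for illustration.
Triplet = Tuple[bool, bool, bool]


def question_accuracy(results: List[Triplet]) -> float:
    """Plain accuracy over all individual questions."""
    flat = [ok for triplet in results for ok in triplet]
    return sum(flat) / len(flat) if flat else 0.0


def triplet_accuracy(results: List[Triplet]) -> float:
    """Stricter score: a triplet counts only if all three answers are correct."""
    return sum(all(t) for t in results) / len(results) if results else 0.0


# 2,138 triplets of three questions each give the 6,414 questions in the abstract.
TOTAL_QUESTIONS = 2138 * 3
```

Under this assumed rule, a model that guesses the original question correctly but fails the perception or knowledge question gets no credit for that triplet, which is one way a benchmark could penalize answers reached without real understanding.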

[CV-121] Characterizing Continual Learning Scenarios and Strategies for Audio Analysis

Link: https://arxiv.org/abs/2407.00465
Authors: Ruchi Bhatt, Pratibha Kumari, Dwarikanath Mahapatra, Abdulmotaleb El Saddik, Mukesh Saini
Keywords: Audio analysis, Audio, analysis, characterize continual learning, audio analysis approaches
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Audio analysis is useful in many application scenarios. The state-of-the-art audio analysis approaches assume that the data distribution at training and deployment time will be the same. However, due to various real-life environmental factors, the data may drift in its distribution or new classes may appear later on. Thus, a one-time trained model might not perform adequately. In this paper, we characterize continual learning (CL) approaches in audio analysis, intended to tackle catastrophic forgetting arising due to drifts. As there is no CL dataset for audio analysis, we use DCASE 2020 to 2023 datasets to create various CL scenarios for audio-based monitoring tasks. We have investigated the following CL and non-CL approaches: EWC, LwF, SI, GEM, A-GEM, GDumb, Replay, Naive, cumulative, and joint training. The study is very beneficial for researchers and practitioners working in the area of audio analysis for developing adaptive models. We observed that Replay achieved better results than other methods in the DCASE challenge data. It achieved an accuracy of 70.12% for the domain incremental scenario and an accuracy of 96.98% for the class incremental scenario.
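To make the Replay strategy concrete, here is a minimal sketch of a replay buffer. The abstract does not say how the paper selects stored samples, so reservoir sampling is an assumption here, and the class and method names are hypothetical.

```python
import random
from typing import Any, List


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling (assumed policy)."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.samples: List[Any] = []
        self.seen = 0  # total number of samples offered so far
        self._rng = random.Random(seed)

    def add(self, sample: Any) -> None:
        """Keep each offered sample with probability capacity / seen."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self._rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mix(self, new_batch: List[Any], k: int) -> List[Any]:
        """Return the new batch followed by up to k stored samples for rehearsal."""
        k = min(k, len(self.samples))
        return list(new_batch) + self._rng.sample(self.samples, k)
```

Training on `mix(new_batch, k)` instead of `new_batch` alone is the basic idea behind rehearsal: old data keeps appearing in the loss, which limits forgetting when the distribution drifts.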

[CV-122] pFLFE: Cross-silo Personalized Federated Learning via Feature Enhancement on Medical Image Segmentation

Link: https://arxiv.org/abs/2407.00462
Authors: Luyuan Xie, Manqing Lin, Siyuan Liu, ChenMing Xu, Tianyu Luan, Cong Li, Yuejian Fang, Qingni Shen, Zhonghai Wu
Keywords: overcome data scarcity, utilizing varied data, personalized cross-silo federated, cross-silo federated learning, Personalized Federated Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In medical image segmentation, personalized cross-silo federated learning (FL) is becoming popular for utilizing varied data across healthcare settings to overcome data scarcity and privacy concerns. However, existing methods often suffer from client drift, leading to inconsistent performance and delayed training. We propose a new framework, Personalized Federated Learning via Feature Enhancement (pFLFE), designed to mitigate these challenges. pFLFE consists of two main stages: feature enhancement and supervised learning. The first stage improves differentiation between foreground and background features, and the second uses these enhanced features for learning from segmentation masks. We also design an alternative training approach that requires fewer communication rounds without compromising segmentation quality, even with limited communication resources. Through experiments on three medical segmentation tasks, we demonstrate that pFLFE outperforms the state-of-the-art methods.

[CV-123] Diving Deeper Into Pedestrian Behavior Understanding: Intention Estimation Action Prediction and Event Risk Assessment

Link: https://arxiv.org/abs/2407.00446
Authors: Amir Rasouli, Iuliia Kotseruba
Keywords: event risk assessment, behavior understanding problem, pedestrian behavior understanding, risk assessment, behavior understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures, 6 tables

Click to view abstract

Abstract:In this paper, we delve into the pedestrian behavior understanding problem from the perspective of three different tasks: intention estimation, action prediction, and event risk assessment. We first define the tasks and discuss how these tasks are represented and annotated in two widely used pedestrian datasets, JAAD and PIE. We then propose a new benchmark based on these definitions, available annotations, and three new classes of metrics, each designed to assess different aspects of the model performance. We apply the new evaluation approach to examine four SOTA prediction models on each task and compare their performance w.r.t. metrics and input modalities. In particular, we analyze the differences between intention estimation and action prediction tasks by considering various scenarios and contextual factors. Lastly, we examine model agreement across these two tasks to show their complementary role. The proposed benchmark reveals new facts about the role of different data modalities, the tasks, and relevant data properties. We conclude by elaborating on our findings and proposing future research directions.

[CV-124] AI Age Discrepancy: A Novel Parameter for Frailty Assessment in Kidney Tumor Patients

Link: https://arxiv.org/abs/2407.00438
Authors: Jayant Siva, Angelica Bartholomew, Clara Goebel, Gabriel Wallerstein-King, Beatriz López Morato, Nicholas Heller, Jason Scovell, Rebecca Campbell, Andrew Wood, Michal Ozery-Flato, Vesna Barros, Maria Gabrani, Michal Rosen-Zvi, Resha Tejpaul, Vidhyalakshmi Ramesh, Nikolaos Papanikolopoulos, Subodh Regmi, Ryan Ward, Robert Abouassaly, Steven C. Campbell, Erick Remer, Christopher Weight
Keywords: global health concern, optimizing surgical outcomes, Age Discrepancy, Kidney Tumor Segmentation, Kidney cancer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 2 tables

Click to view abstract

Abstract:Kidney cancer is a global health concern, and accurate assessment of patient frailty is crucial for optimizing surgical outcomes. This paper introduces AI Age Discrepancy, a novel metric derived from machine learning analysis of preoperative abdominal CT scans, as a potential indicator of frailty and postoperative risk in kidney cancer patients. This retrospective study of 599 patients from the 2023 Kidney Tumor Segmentation (KiTS) challenge dataset found that a higher AI Age Discrepancy is significantly associated with longer hospital stays and lower overall survival rates, independent of established factors. This suggests that AI Age Discrepancy may provide valuable insights into patient frailty and could thus inform clinical decision-making in kidney cancer treatment.

[CV-125] Location embedding based pairwise distance learning for fine-grained diagnosis of urinary stones

Link: https://arxiv.org/abs/2407.00431
Authors: Qiangguo Jin, Jiapeng Huang, Changming Sun, Hui Cui, Ping Xuan, Ran Su, Leyi Wei, Yu-Jie Wu, Chia-An Wu, Henry B.L. Duh, Yueh-Hsun Lu
Keywords: effective treatment strategies, devising effective treatment, treatment strategies, crucial for devising, devising effective
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The precise diagnosis of urinary stones is crucial for devising effective treatment strategies. The diagnostic process, however, is often complicated by the low contrast between stones and surrounding tissues, as well as the variability in stone locations across different patients. To address this issue, we propose a novel location embedding based pairwise distance learning network (LEPD-Net) that leverages low-dose abdominal X-ray imaging combined with location information for the fine-grained diagnosis of urinary stones. LEPD-Net enhances the representation of stone-related features through context-aware region enhancement, incorporates critical location knowledge via stone location embedding, and achieves recognition of fine-grained objects with our innovative fine-grained pairwise distance learning. Additionally, we have established an in-house dataset on urinary tract stones to demonstrate the effectiveness of our proposed approach. Comprehensive experiments conducted on this dataset reveal that our framework significantly surpasses existing state-of-the-art methods.

[CV-126] Parametric Primitive Analysis of CAD Sketches with Vision Transformer

Link: https://arxiv.org/abs/2407.00410
Authors: Xiaogang Wang, Liang Wang, Hongyu Wu, Guoqiang Xiao, Kai Xu
Keywords: primarily involving CAD, involving CAD primitives, industrial product design, involving CAD, CAD primitives
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The design and analysis of Computer-Aided Design (CAD) sketches play a crucial role in industrial product design, primarily involving CAD primitives and their inter-primitive constraints. To address challenges related to error accumulation in autoregressive models and the complexities associated with self-supervised model design for this task, we propose a two-stage network framework. This framework consists of a primitive network and a constraint network, transforming the sketch analysis task into a set prediction problem to enhance the effective handling of primitives and constraints. By decoupling target types from parameters, the model gains increased flexibility and optimization while reducing complexity. Additionally, the constraint network incorporates a pointer module to explicitly indicate the relationship between constraint parameters and primitive indices, enhancing interpretability and performance. Qualitative and quantitative analyses on two publicly available datasets demonstrate the superiority of this method.

[CV-127] Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Link: https://arxiv.org/abs/2407.00389
Authors: Chao Zhou, Xiaowen Shi, Yuan-Gen Wang
Keywords: convolutional neural networks, face similar security, similar security risks, deep convolutional neural, Recent studies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent studies have revealed that vision transformers (ViTs) face similar security risks from adversarial attacks as deep convolutional neural networks (CNNs). However, directly applying attack methodology on CNNs to ViTs has been demonstrated to be ineffective since the ViTs typically work on patch-wise encoding. This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario, and proposes a novel query-efficient hard-label adversarial attack method called AdvViT. Specifically, considering that ViTs are highly sensitive to patch modification, we propose to optimize the adversarial perturbation on the individual patches. To reduce the dimension of perturbation search space, we modify only a handful of low-frequency components of each patch. Moreover, we design a weight mask matrix for all patches to further optimize the perturbation on different regions of a whole image. We test six mainstream ViT backbones on the ImageNet-1k dataset. Experimental results show that compared with the state-of-the-art attacks on CNNs, our AdvViT achieves much lower L_2-norm distortion under the same query budget, sufficiently validating the vulnerability of ViTs against adversarial attacks.
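The idea of perturbing only a few low-frequency components per patch can be sketched as follows. The abstract does not name the frequency transform, so the use of a 2D DCT is an assumption, and all function names are hypothetical. The sketch turns a small block of low-frequency coefficients into a full-size spatial perturbation for one patch.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def low_freq_perturbation(coeffs: Matrix, patch_size: int) -> Matrix:
    """Inverse 2D DCT of an r x r low-frequency block, zero-padded to patch_size."""
    r = len(coeffs)
    full = [[0.0] * patch_size for _ in range(patch_size)]
    for u in range(r):
        for v in range(r):
            full[u][v] = coeffs[u][v]
    c = dct_matrix(patch_size)
    return matmul(matmul(transpose(c), full), c)


def l2_norm(m: Matrix) -> float:
    return math.sqrt(sum(x * x for row in m for x in row))
```

With a 16x16 patch and a 2x2 coefficient block, the search would cover 4 values per patch channel instead of 256. Because the basis is orthonormal, the L_2 norm of the spatial perturbation equals that of the coefficient block, so distortion can be tracked directly in the frequency domain.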

[CV-128] The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention

Link: https://arxiv.org/abs/2407.00377
Authors: Yixin Wan, Di Wu, Haoran Wang, Kai-Wei Chang
Keywords: models depicting individuals, commonly adopted, depicting individuals, Prompt-based, diversity interventions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

[CV-129] SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Link: https://arxiv.org/abs/2407.00367
Authors: Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, Yinda Zhang
Keywords: demonstrated great capabilities, producing impressive monocular, video remains under-explored, video generation model, impressive monocular videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3D stereoscopic video generation, video diffusion, inpainting

Click to view abstract

Abstract:Video generation models have demonstrated great capabilities of producing impressive monocular videos, however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The framework leverages the video generation model to inpaint frames observed from different timestamps and views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora [4], Lumiere [2], WALT [8], and Zeroscope [42]. The experiments demonstrate that our method has a significant improvement over previous methods. The code will be released at this https URL.

[CV-130] JSCDS: A Core Data Selection Method with Jason-Shannon Divergence for Caries RGB Images-Efficient Learning

Link: https://arxiv.org/abs/2407.00362
Authors: Peiliang Zhang, Yujia Tong, Chenghu Du, Chao Che, Yongjun Zhu
Keywords: preventing oral diseases, Core data selection, Deep learning-based RGB, data selection, data selection methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in KDD 2024 Workshop AIDSH

Click to view abstract

Abstract:Deep learning-based RGB caries detection improves the efficiency of caries identification and is crucial for preventing oral diseases. The performance of deep learning models depends on high-quality data and requires substantial training resources, making efficient deployment challenging. Core data selection, by eliminating low-quality and confusing data, aims to enhance training efficiency without significantly compromising model performance. However, distance-based data selection methods struggle to distinguish dependencies among high-dimensional caries data. To address this issue, we propose a Core Data Selection Method with Jensen-Shannon Divergence (JSCDS) for efficient caries image learning and caries classification. We describe the core data selection criterion as the distribution of samples in different classes. JSCDS calculates the cluster centers by sample embedding representation in the caries classification network and utilizes Jensen-Shannon Divergence to compute the mutual information between data samples and cluster centers, capturing nonlinear dependencies among high-dimensional data. The average mutual information is calculated to fit the above distribution, serving as the criterion for constructing the core set for model training. Extensive experiments on RGB caries datasets show that JSCDS outperforms other data selection methods in prediction performance and time consumption. Notably, JSCDS exceeds the performance of the full dataset model with only 50% of the core data, and its performance advantage becomes more pronounced with 70% of the core data.
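For readers unfamiliar with the Jensen-Shannon divergence mentioned above, the following minimal sketch computes the standard definition for two discrete probability distributions. It does not reproduce how JSCDS applies the measure to sample embeddings and cluster centers, which the abstract does not detail; the function names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL divergence, this measure is symmetric and always finite. With base-2 logarithms it lies between 0 (identical distributions) and 1 (distributions with disjoint support).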

[CV-131] Enhancing Accuracy and Parameter-Efficiency of Neural Representations for Network Parameterization

Link: https://arxiv.org/abs/2407.00356
Authors: Hongjun Choi, Jayaraman J. Thiagarajan, Ruben Glatt, Shusen Liu
Keywords: neural network weights, investigate the fundamental, fundamental trade-off, parameter efficiency, parameterization of neural
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this work, we investigate the fundamental trade-off regarding accuracy and parameter efficiency in the parameterization of neural network weights using predictor networks. We present a surprising finding that, when recovering the original model accuracy is the sole objective, it can be achieved effectively through the weight reconstruction objective alone. Additionally, we explore the underlying factors for improving weight reconstruction under parameter-efficiency constraints, and propose a novel training scheme that decouples the reconstruction objective from auxiliary objectives such as knowledge distillation, which leads to significant improvements compared to state-of-the-art approaches. Finally, these results pave the way for more practical scenarios, where one needs to achieve improvements on both model accuracy and predictor network parameter-efficiency simultaneously.

[CV-132] PhyTracker: An Online Tracker for Phytoplankton

Link: https://arxiv.org/abs/2407.00352
Authors: Yang Yu, Qingxuan Lv, Yuezun Li, Zhiqiang Wei, Junyu Dong
Keywords: understand marine ecological, marine ecological processes, requires efficient monitoring, aquatic ecosystems, requires efficient
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, eleven figures

Click to view abstract

Abstract:Phytoplankton, a crucial component of aquatic ecosystems, requires efficient monitoring to understand marine ecological processes and environmental conditions. Traditional phytoplankton monitoring methods, relying on non-in situ observations, are time-consuming and resource-intensive, limiting timely analysis. To address these limitations, we introduce PhyTracker, an intelligent in situ tracking framework designed for automatic tracking of phytoplankton. PhyTracker overcomes significant challenges unique to phytoplankton monitoring, such as constrained mobility within water flow, inconspicuous appearance, and the presence of impurities. Our method incorporates three innovative modules: a Texture-enhanced Feature Extraction (TFE) module, an Attention-enhanced Temporal Association (ATA) module, and a Flow-agnostic Movement Refinement (FMR) module. These modules enhance feature capture, differentiate between phytoplankton and impurities, and refine movement characteristics, respectively. Extensive experiments on the PMOT dataset validate the superiority of PhyTracker in phytoplankton tracking, and additional tests on the MOT dataset demonstrate its general applicability, outperforming conventional tracking methods. This work highlights key differences between phytoplankton and traditional objects, offering an effective solution for phytoplankton monitoring.

[CV-133] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Link: https://arxiv.org/abs/2407.00316
Authors: Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, Ehsan Adeli
Keywords: rendering methods require, input video, methods require, fully visible, human
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient and high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision by Score-Distillation Sampling (SDS) to create a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting is designed to further improve rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in the rendering of occluded humans.

[CV-134] Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

Link: https://arxiv.org/abs/2407.00315
Authors: Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni
Keywords: Appearance-based supervised methods, made tremendous advances, Appearance-based supervised, recent gaze estimation, full-face image input
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 7 tables

Click to view abstract

Abstract:Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, due to the deep coupling between facial and eye features, such frameworks are still deficient in extracting useful gaze features from full-face images. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design that successfully encourages the model to pay more attention to gaze direction rather than facial textures only, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss has been designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.

[CV-135] Benchmark Evaluation of Image Fusion algorithms for Smartphone Camera Capture

Link: https://arxiv.org/abs/2407.00301
Authors: Lucas N. Kirsten
Keywords: smartphone camera capture, image quality, image fusion techniques, image, fusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the ICMLAI 2024, in Mendoza, Argentina

Click to view abstract

Abstract:This paper investigates the trade-off between computational resource utilization and image quality in the context of image fusion techniques for smartphone camera capture. The study explores various combinations of fusion methods, fusion weights, number of frames, and stacking (a.k.a. merging) techniques using a proprietary dataset of images captured with Motorola smartphones. The objective was to identify optimal configurations that balance computational efficiency with image quality. Our results indicate that multi-scale methods and their single-scale fusion counterparts return similar image quality measures and runtime, but single-scale ones have lower memory usage. Furthermore, we identified that fusion methods operating in the YUV color space yield better performance in terms of image quality, resource utilization, and runtime. The study also shows that fusion weights have an overall small impact on image quality, runtime, and memory. Moreover, our results reveal that increasing the number of highly exposed input frames does not necessarily improve image quality and comes with a corresponding increase in computational resources usage and runtime; and that stacking methods, although reducing memory usage, may compromise image quality. Finally, our work underscores the importance of thoughtful configuration selection for image fusion techniques in constrained environments and offers insights for future image fusion method development, particularly in the realm of smartphone applications.

[CV-136] SolarSAM: Building-scale Photovoltaic Potential Assessment Based on Segment Anything Model (SAM) and Remote Sensing for Emerging City

Link: https://arxiv.org/abs/2407.00296
Authors: Guohao Wang
Keywords: renewable energy source, promising renewable energy, Driven by advancements, energy source, renewable energy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Driven by advancements in photovoltaic (PV) technology, solar energy has emerged as a promising renewable energy source, due to its ease of integration onto building rooftops, facades, and windows. For emerging cities, the lack of detailed street-level data presents a challenge for effectively assessing the potential of building-integrated photovoltaic (BIPV). To address this, this study introduces SolarSAM, a novel BIPV evaluation method that leverages remote sensing imagery and deep learning techniques, and an emerging city in northern China is utilized to validate the model performance. During the process, SolarSAM segmented various building rooftops using text prompt guided semantic segmentation. Separate PV models were then developed for Rooftop PV, Facade-integrated PV, and PV windows systems, using this segmented data and local climate information. The potential for BIPV installation, solar power generation, and city-wide power self-sufficiency were assessed, revealing that the annual BIPV power generation potential surpassed the city’s total electricity consumption by a factor of 2.5. Economic and environmental analyses were also conducted, including levelized cost of electricity and carbon reduction calculations, comparing different BIPV systems across various building categories. These findings demonstrated the model’s performance and revealed the potential of BIPV power generation in the future.

[CV-137] A deep neural network framework for dynamic multi-valued mapping estimation and its applications

Link: https://arxiv.org/abs/2407.00295
Authors: Geng Li, Di Qiu, Lok Ming Lui
Keywords: estimating dynamic multi-valued, dynamic multi-valued mappings, dynamic multi-valued, multi-valued mappings, estimating dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper addresses the problem of modeling and estimating dynamic multi-valued mappings. While most mathematical models provide a unique solution for a given input, real-world applications often lack deterministic solutions. In such scenarios, estimating dynamic multi-valued mappings is necessary to suggest different reasonable solutions for each input. This paper introduces a deep neural network framework incorporating a generative network and a classification component. The objective is to model the dynamic multi-valued mapping between the input and output by providing a reliable uncertainty measurement. Generating multiple solutions for a given input involves utilizing a discrete codebook comprising finite variables. These variables are fed into a generative network along with the input, producing various output possibilities. The discreteness of the codebook enables efficient estimation of the output’s conditional probability distribution for any given input using a classifier. By jointly optimizing the discrete codebook and its uncertainty estimation during training using a specially designed loss function, a highly accurate approximation is achieved. The effectiveness of our proposed framework is demonstrated through its application to various imaging problems, using both synthetic and real imaging data. Experimental results show that our framework accurately estimates the dynamic multi-valued mapping with uncertainty estimation.

[CV-138] PerAct2: A Perceiver Actor Framework for Bimanual Manipulation Tasks

Link: https://arxiv.org/abs/2407.00278
Authors: Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox
Keywords: temporal coordination required, challenging due, due to precise, precise spatial, spatial and temporal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark comprising 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extended several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent – PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. Project website with code is available at: this http URL

[CV-139] Learning a Clinically-Relevant Concept Bottleneck for Lesion Detection in Breast Ultrasound

Link: https://arxiv.org/abs/2407.00267
Authors: Arianna Bunnell, Yannik Glaser, Dustin Valdez, Thomas Wolfgruber, Aleen Altamirano, Carol Zamora González, Brenda Y. Hernandez, Peter Sadowski, John A. Shepherd
Keywords: Detecting and classifying, Radiology Breast Imaging, breast ultrasound images, artificial intelligence, access to mammography
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted version of manuscript accepted at MICCAI 2024. This preprint has not undergone peer review or any post-submission improvements or corrections

Abstract:Detecting and classifying lesions in breast ultrasound images is a promising application of artificial intelligence (AI) for reducing the burden of cancer in regions with limited access to mammography. Such AI systems are more likely to be useful in a clinical setting if their predictions can be explained to a radiologist. This work proposes an explainable AI model that provides interpretable predictions using a standard lexicon from the American College of Radiology’s Breast Imaging and Reporting Data System (BI-RADS). The model is a deep neural network featuring a concept bottleneck layer in which known BI-RADS features are predicted before making a final cancer classification. This enables radiologists to easily review the predictions of the AI system and potentially fix errors in real time by modifying the concept predictions. In experiments, a model is developed on 8,854 images from 994 women with expert annotations and histological cancer labels. The model outperforms state-of-the-art lesion detection frameworks with 48.9 average precision on the held-out testing set, and for cancer classification, concept intervention is shown to increase performance from 0.876 to 0.885 area under the receiver operating characteristic curve. Training and evaluation code is available at this https URL.
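The concept-intervention step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the concept names, the weights, and the logistic classifier on top of the bottleneck are all hypothetical, chosen only to show how correcting a concept prediction changes the final score.

```python
import math


def predict_cancer(concepts, weights, bias=0.0):
    """Logistic classifier over concept probabilities (the bottleneck layer).

    `concepts` maps a concept name to a predicted probability in [0, 1];
    `weights` maps the same names to hypothetical classifier weights.
    """
    z = bias + sum(weights[name] * value for name, value in concepts.items())
    return 1.0 / (1.0 + math.exp(-z))


def intervene(concepts, corrections):
    """Return a copy of the concept predictions with reviewer corrections applied."""
    updated = dict(concepts)
    updated.update(corrections)
    return updated


# Hypothetical concept predictions and weights, for illustration only.
predicted = {"irregular_shape": 0.2, "non_parallel_orientation": 0.3}
weights = {"irregular_shape": 2.0, "non_parallel_orientation": 1.5}

before = predict_cancer(predicted, weights, bias=-1.5)
# A radiologist overrides one concept prediction after reviewing the image.
after = predict_cancer(intervene(predicted, {"irregular_shape": 1.0}), weights, bias=-1.5)
```

Because the final classifier sees only the concept values, a reviewer's correction propagates directly to the cancer score without retraining.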

[CV-140] From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Link: https://arxiv.org/abs/2407.00263
Authors: Mehar Bhatia, Sahithya Ravi, Aditya Chinchure, Eunjeong Hwang, Vered Shwartz
Keywords: non-western cultures due, performance remains suboptimal, training datasets, recent advancements, remains suboptimal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under peer review

Abstract:Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.

[CV-141] Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review

Link: https://arxiv.org/abs/2407.00252
Authors: Moseli Mots’oehli
Keywords: acquiring high-quality annotated, achieved significant success, high-quality annotated data, annotated data remains, acquiring high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: Accepted IEEE ETNCC 2024, 9 pages

Abstract:While supervised learning has achieved significant success in computer vision tasks, acquiring high-quality annotated data remains a bottleneck. This paper explores both scholarly and non-scholarly works in AI-assistive deep learning image annotation systems that provide textual suggestions, captions, or descriptions of the input image to the annotator. This potentially results in higher annotation efficiency and quality. Our exploration covers annotation for a range of computer vision tasks including image classification, object detection, regression, instance, semantic segmentation, and pose estimation. We review various datasets and how they contribute to the training and evaluation of AI-assistive annotation systems. We also examine methods leveraging neuro-symbolic learning, deep active learning, and self-supervised learning algorithms that enable semantic image understanding and generate free-text output. These include image captioning, visual question answering, and multi-modal reasoning. Despite the promising potential, there is limited publicly available work on AI-assistive image annotation with textual output capabilities. We conclude by suggesting future research directions to advance this field, emphasizing the need for more publicly accessible datasets and collaborative efforts between academia and industry.

[CV-142] Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription

Link: https://arxiv.org/abs/2407.00250
Authors: Jaydeep Borkar, David A. Smith
Keywords: illegible text resulting, documents frequently suffer, storage damage, frequently suffer, illegible text
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ICDAR 2024 Workshop on Computational Paleography

Abstract:Historical documents frequently suffer from damage and inconsistencies, including missing or illegible text resulting from issues such as holes, ink problems, and storage damage. These missing portions or gaps are referred to as lacunae. In this study, we employ transformer-based optical character recognition (OCR) models trained on synthetic data containing lacunae in a supervised manner. We demonstrate their effectiveness in detecting and restoring lacunae, achieving a success rate of 65%, compared to a base model lacking knowledge of lacunae, which achieves only 5% restoration. Additionally, we investigate the mechanistic properties of the model, such as the log probability of transcription, which can identify lacunae and other errors (e.g., mistranscriptions due to complex writing or ink issues) in line images without directly inspecting the image. This capability could be valuable for scholars seeking to distinguish images containing lacunae or errors from clean ones. Although we explore the potential of attention mechanisms in flagging lacunae and transcription errors, our findings suggest it is not a significant factor. Our work highlights a promising direction in utilizing transformer-based OCR models for restoring or analyzing damaged historical documents.
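The idea of flagging suspect lines from transcription log probabilities, without inspecting the image, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `flag_suspect_lines`, the per-token log-probability input format, and the threshold of -1.0 are assumptions made for illustration.

```python
def flag_suspect_lines(line_token_logprobs, threshold=-1.0):
    """Flag lines whose mean token log probability falls below `threshold`.

    `line_token_logprobs` maps a line identifier to the list of per-token
    log probabilities produced by an OCR model for that line (assumed format).
    Returns (line_id, mean_logprob) pairs, least confident first.
    """
    flagged = []
    for line_id, logprobs in line_token_logprobs.items():
        if not logprobs:
            continue  # skip empty transcriptions, nothing to average
        mean_logprob = sum(logprobs) / len(logprobs)
        if mean_logprob < threshold:
            flagged.append((line_id, mean_logprob))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical model outputs: line_2 is transcribed with low confidence.
lines = {
    "line_1": [-0.05, -0.10, -0.02],
    "line_2": [-2.3, -1.9, -0.4],
    "line_3": [],
}
suspects = flag_suspect_lines(lines)
```

In practice the threshold would need to be calibrated on lines known to be clean; the value used here is arbitrary.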

[CV-143] Prompt Refinement with Image Pivot for Text-to-Image Generation

Link: https://arxiv.org/abs/2407.00247
Authors: Jingtao Zhan, Qingyao Ai, Yiqun Liu, Yingwei Pan, Ting Yao, Jiaxin Mao, Shaoping Ma, Tao Mei
Keywords: automatically refining user-provided, refining user-provided natural, keyword-enriched prompts favored, user-provided natural language, automatically refining
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2024

Abstract:For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from “user languages” into “system languages”. However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary “pivot” between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.

[CV-144] Methodology to Deploy CNN-Based Computer Vision Models on Immersive Wearable Devices

Link: https://arxiv.org/abs/2407.00233
Authors: Kaveh Malek (1), Fernando Moreu (2) ((1) Department of Mechanical Engineering, University of New Mexico, New Mexico, (2) Department of Civil, Construction and Environmental Engineering, University of New Mexico, New Mexico)
Keywords: Convolutional Neural Network, Convolutional Neural, Neural Network, Augmented Reality, addressed by Augmented
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 4300 words

Abstract:Convolutional Neural Network (CNN) models often lack the ability to incorporate human input, which can be addressed by Augmented Reality (AR) headsets. However, current AR headsets face limitations in processing power, which has prevented researchers from performing real-time, complex image recognition tasks using CNNs in AR headsets. This paper presents a method to deploy CNN models on AR headsets by training them on computers and transferring the optimized weight matrices to the headset. The approach transforms the image data and CNN layers into a one-dimensional format suitable for the AR platform. We demonstrate this method by training the LeNet-5 CNN model on the MNIST dataset using PyTorch and deploying it on a HoloLens AR headset. The results show that the model maintains an accuracy of approximately 98%, similar to its performance on a computer. This integration of CNN and AR enables real-time image processing on AR headsets, allowing for the incorporation of human input into AI models.

[CV-145] SemUV: Deep Learning based semantic manipulation over UV texture map of virtual human heads

Link: https://arxiv.org/abs/2407.00229
Authors: Anirban Mukherjee, Venkat Suprabath Bitra, Vignesh Bondugula, Tarun Reddy Tallapureddy, Dinesh Babu Jayagopi
Keywords: manipulating virtual human, virtual human heads, Designing and manipulating, interaction and VFX, human heads
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVIP 2024 Preprint

Abstract:Designing and manipulating virtual human heads is essential across various applications, including AR, VR, gaming, human-computer interaction and VFX. Traditional graphic-based approaches require manual effort and resources to achieve accurate representation of human heads. While modern deep learning techniques can generate and edit highly photorealistic images of faces, their focus remains predominantly on 2D facial images. This limitation makes them less suitable for 3D applications. Recognizing the vital role of editing within the UV texture space as a key component in the 3D graphics pipeline, our work focuses on this aspect to benefit graphic designers by providing enhanced control and precision in appearance manipulation. Existing methods that operate within the UV texture space are few and complex. In this paper, we introduce SemUV: a simple and effective approach using the FFHQ-UV dataset for semantic manipulation directly within the UV texture space. We train a StyleGAN model on the publicly available FFHQ-UV dataset, and subsequently train a boundary for interpolation and semantic feature manipulation. Through experiments comparing our method with a 2D manipulation technique, we demonstrate its superior ability to preserve identity while effectively modifying semantic features such as age, gender, and facial hair. Our approach is simple, agnostic to other 3D components such as structure, lighting, and rendering, and also enables seamless integration into standard 3D graphics pipelines without demanding extensive domain expertise, time, or resources.

[CV-146] Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Link: https://arxiv.org/abs/2407.00226
Authors: Omar Elharrouss, Rafat Damseh, Abdelkader Nasreddine Belkacem, Elarbi Badidi, Abderrahmane Lakas
Keywords: image or video, video inpainting, hot topic, video, Image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper has been submitted to the Artificial Intelligence Review journal

Abstract:Image inpainting is currently a hot topic within the field of computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Deep learning advancements, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly enhanced the inpainting task with an improved capability to fill missing or damaged regions in an image or video through the incorporation of contextually appropriate details. These advancements have improved other aspects, including efficiency, information preservation, and achieving both realistic textures and structures. Recently, visual transformers have been exploited and offer some improvements to image or video inpainting. The advent of transformer-based architectures, which were initially designed for natural language processing, has also been integrated into computer vision tasks. These methods utilize self-attention mechanisms that excel in capturing long-range dependencies within data; therefore, they are particularly effective for tasks requiring a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of the current image or video inpainting approaches, with a specific focus on transformer-based techniques, with the goal to highlight the significant improvements and provide a guideline for new researchers in the field of image or video inpainting using visual transformers. We categorized the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges, and suggest directions for future research in the field of image or video inpainting.

[CV-147] Multimodal Prototyping for cancer survival prediction

Link: https://arxiv.org/abs/2407.00224
Authors: Andrew H. Song, Richard J. Chen, Guillaume Jaume, Anurag J. Vaidya, Alexander S. Baras, Faisal Mahmood
Keywords: histology whole-slide images, combining gigapixel histology, gigapixel histology whole-slide, survival methods combining, methods combining gigapixel
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: ICML 2024

Abstract:Multimodal survival methods combining gigapixel histology whole-slide images (WSIs) and transcriptomic profiles are particularly promising for patient prognostication and stratification. Current approaches involve tokenizing the WSIs into smaller patches (10,000 patches) and transcriptomics into gene groups, which are then integrated using a Transformer for predicting outcomes. However, this process generates many tokens, which leads to high memory requirements for computing attention and complicates post-hoc interpretability analyses. Instead, we hypothesize that we can: (1) effectively summarize the morphological content of a WSI by condensing its constituting tokens using morphological prototypes, achieving more than 300x compression; and (2) accurately characterize cellular functions by encoding the transcriptomic profile with biological pathway prototypes, all in an unsupervised fashion. The resulting multimodal tokens are then processed by a fusion network, either with a Transformer or an optimal transport cross-alignment, which now operates with a small and fixed number of tokens without approximations. Extensive evaluation on six cancer types shows that our framework outperforms state-of-the-art methods with much less computation while unlocking new interpretability analyses.

[CV-148] PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration

Link: https://arxiv.org/abs/2407.00203
Authors: Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Zhongyi Shui, Kai Zhang, Jingxiong Li, Xingheng Lyu, Tao Lin, Lin Yang
Keywords: Vision Language Models, Slide Image, attracted substantial attention, Vision Language, serving as backbones
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures

Abstract:Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology, serving as backbones for applications such as zero-shot image classification and Whole Slide Image (WSI) analysis. Additionally, they can function as vision encoders when combined with large language models (LLMs) to support broader capabilities. Current efforts to train pathology VLMs rely on pathology image-text pairs from platforms like PubMed, YouTube, and Twitter, which provide limited, unscalable data with generally suboptimal image quality. In this work, we leverage large-scale WSI datasets like TCGA to extract numerous high-quality image patches. We then train a large multimodal model to generate captions for these images, creating PathGen-1.6M, a dataset containing 1.6 million high-quality image-caption pairs. Our approach involves multiple agent models collaborating to extract representative WSI patches, generating and refining captions to obtain high-quality image-text pairs. Extensive experiments show that integrating these generated pairs with existing datasets to train a pathology-specific CLIP model, PathGen-CLIP, significantly enhances its ability to analyze pathological images, with substantial improvements across nine pathology-related zero-shot image classification tasks and three whole-slide image tasks. Furthermore, we construct 200K instruction-tuning data based on PathGen-1.6M and integrate PathGen-CLIP with the Vicuna LLM to create more powerful multimodal models through instruction tuning. Overall, we provide a scalable pathway for high-quality data generation in pathology, paving the way for next-generation general pathology models.

[CV-149] SMPLOlympics: Sports Environments for Physically Simulated Humanoids

Link: https://arxiv.org/abs/2407.00187
Authors: Zhengyi Luo, Jiashun Wang, Kangni Liu, Haotian Zhang, Chen Tessler, Jingbo Wang, Ye Yuan, Jinkun Cao, Zihui Lin, Fengyi Wang, Jessica Hodgins, Kris Kitani
Keywords: variety of Olympic, physically simulated environments, Olympic sports, physically simulated, physically demanding nature
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have been competing in these sports for many years, there is also a plethora of existing knowledge on the preferred strategy to achieve better performance. To leverage these existing human demonstrations from videos and motion capture, we design our humanoid to be compatible with the widely-used SMPL and SMPL-X human models from the vision and graphics community. We provide a suite of individual sports environments, including golf, javelin throw, high jump, long jump, and hurdling, as well as competitive sports, including both 1v1 and 2v2 games such as table tennis, tennis, fencing, boxing, soccer, and basketball. Our analysis shows that combining strong motion priors with simple rewards can result in human-like behavior in various sports. By providing a unified sports benchmark and baseline implementation of state and reward designs, we hope that SMPLOlympics can help the control and animation communities achieve human-like and performant behaviors.

[CV-150] The impact of model size on catastrophic forgetting in Online Continual Learning

Link: https://arxiv.org/abs/2407.00176
Authors: Eunhae Lee
Keywords: Continual Learning performance, Continual Learning, Online Continual Learning, Continual Learning efficacy, Learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study investigates the impact of model size on Online Continual Learning performance, with a focus on catastrophic forgetting. Employing ResNet architectures of varying sizes, the research examines how network depth and width affect model performance in class-incremental learning using the SplitCIFAR-10 dataset. Key findings reveal that larger models do not guarantee better Continual Learning performance; in fact, they often struggle more in adapting to new tasks, particularly in online settings. These results challenge the notion that larger models inherently mitigate catastrophic forgetting, highlighting the nuanced relationship between model size and Continual Learning efficacy. This study contributes to a deeper understanding of model scalability and its practical implications in Continual Learning scenarios.

[CV-151] Localizing Anomalies via Multiscale Score Matching Analysis

Link: https://arxiv.org/abs/2407.00148
Authors: Ahsan Mahmood, Junier Oliva, Martin Styner
Keywords: remain critical challenges, Multiscale Score Matching, Score Matching Analysis, imaging remain critical, challenges in healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Anomaly detection and localization in medical imaging remain critical challenges in healthcare. This paper introduces Spatial-MSMA (Multiscale Score Matching Analysis), a novel unsupervised method for anomaly localization in volumetric brain MRIs. Building upon the MSMA framework, our approach incorporates spatial information and conditional likelihoods to enhance anomaly detection capabilities. We employ a flexible normalizing flow model conditioned on patch positions and global image features to estimate patch-wise anomaly scores. The method is evaluated on a dataset of 1,650 T1- and T2-weighted brain MRIs from typically developing children, with simulated lesions added to the test set. Spatial-MSMA significantly outperforms existing methods, including reconstruction-based, generative-based, and interpretation-based approaches, in lesion detection and segmentation tasks. Our model achieves superior performance in both distance-based metrics (99th percentile Hausdorff Distance: 7.05 ± 0.61, Mean Surface Distance: 2.10 ± 0.43) and component-wise metrics (True Positive Rate: 0.83 ± 0.01, Positive Predictive Value: 0.96 ± 0.01). These results demonstrate Spatial-MSMA’s potential for accurate and interpretable anomaly localization in medical imaging, with implications for improved diagnosis and treatment planning in clinical settings. Our code is available at this https URL.

[CV-152] InfoNCE: Identifying the Gap Between Theory and Practice

Link: https://arxiv.org/abs/2407.00143
Authors: Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel
Keywords: learned representations uncover, contrastive learning, work on contrastive, learned representations, ground-truth latent factors
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:Previous theoretical work on contrastive learning (CL) with InfoNCE showed that, under certain assumptions, the learned representations uncover the ground-truth latent factors. We argue these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they assume that within a positive pair, all latent factors either vary to a similar extent, or that some do not vary at all. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change, with a continuum of variability across these factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Additionally, we explore and discuss further mismatches between theoretical assumptions and practical implementations, including extensions to hard negative mining and loss ensembles.
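For readers unfamiliar with the baseline objective, the standard InfoNCE loss can be sketched in plain Python as below. This is the commonly used form (cosine similarity, a temperature, in-batch negatives), not the AnInfoNCE generalization proposed in the paper; the function name `info_nce` and the temperature of 0.1 are assumptions for illustration.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    Row i of `positives` is the positive for row i of `anchors`; every other
    row of `positives` serves as a negative for that anchor.
    """
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(a)


# Aligned pairs should give a lower loss than mismatched pairs.
views = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(views, views)
mismatched_loss = info_nce(views, [views[1], views[0]])
```

The loss is a cross-entropy over similarities, so it is small when each anchor is closest to its own positive and large otherwise.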

[CV-153] Analyzing Quality Bias and Performance in Text-to-Image Generative Models

Link: https://arxiv.org/abs/2407.00138
Authors: Nila Masrourisaadat, Nazanin Sedaghatkish, Fatemeh Sarshartehrani, Edward A. Fox
Keywords: Advances in generative, demonstrating the ability, text prompts, led to significant, significant interest
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures

Abstract:Advances in generative models have led to significant interest in image synthesis, demonstrating the ability to generate high-quality images for a diverse range of text prompts. Despite this progress, most studies ignore the presence of bias. In this paper, we examine several text-to-image models not only by qualitatively assessing their performance in generating accurate images of human faces, groups, and specified numbers of objects but also by presenting a social bias analysis. As expected, models with larger capacity generate higher-quality images. However, we also document the inherent gender or social biases these models possess, offering a more complete understanding of their impact and limitations.

[CV-154] RepAct: The Re-parameterizable Adaptive Activation Function

Link: https://arxiv.org/abs/2407.00131
Authors: Xian Wu, Qingchuan Tao, Shuang Wang
Keywords: efficient artificial intelligence, Addressing the imperative, study presents RepAct, re-parameterizable adaptive activation, activation function tailored
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Addressing the imperative need for efficient artificial intelligence in IoT and edge computing, this study presents RepAct, a re-parameterizable adaptive activation function tailored for optimizing lightweight neural networks within the computational limitations of edge devices. By employing a multi-branch structure with learnable adaptive weights, RepAct enriches feature processing and enhances cross-layer interpretability. When evaluated on tasks such as image classification and object detection, RepAct notably surpassed conventional activation functions in lightweight networks, delivering up to a 7.92% accuracy boost on MobileNetV3-Small for the ImageNet100 dataset, while maintaining computational complexity on par with HardSwish. This innovative approach not only maximizes model parameter efficiency but also significantly improves the performance and understanding capabilities of lightweight neural networks, demonstrating its potential for real-time edge computing applications.

[CV-155] Multi-Species Object Detection in Drone Imagery for Population Monitoring of Endangered Animals

Link: https://arxiv.org/abs/2407.00127
Authors: Sowmya Sankaran
Keywords: Animal populations worldwide, accurately count endangered, count endangered species, rapidly declining, populations worldwide
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Animal populations worldwide are rapidly declining, and a technology that can accurately count endangered species could be vital for monitoring population changes over several years. This research focused on fine-tuning object detection models for drone images to create accurate counts of animal species. Hundreds of images taken using a drone and large, openly available drone-image datasets were used to fine-tune machine learning models with the baseline YOLOv8 architecture. We trained 30 different models, with the largest having 43.7 million parameters and 365 layers, and used hyperparameter tuning and data augmentation techniques to improve accuracy. While the state-of-the-art YOLOv8 baseline had only 0.7% accuracy on a dataset of safari animals, our models had 95% accuracy on the same dataset. Finally, we deployed the models on the Jetson Orin Nano for demonstration of low-power real-time species detection for easy inference on drones.

[CV-156] Automated Web-Based Malaria Detection System with Machine Learning and Deep Learning Techniques

Link: https://arxiv.org/abs/2407.00120
Authors: Abraham G Taye, Sador Yemane, Eshetu Negash, Yared Minwuyelet, Moges Abebe, Melkamu Hunegnaw Asmare
Keywords: global health burden, causing widespread suffering, significant global health, Malaria parasites pose, health burden
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Malaria parasites pose a significant global health burden, causing widespread suffering and mortality. Detecting malaria infection accurately is crucial for effective treatment and control. However, existing automated detection techniques have shown limitations in terms of accuracy and generalizability. Many studies have focused on specific features without exploring more comprehensive approaches. In our case, we formulate a deep learning technique for malaria-infected cell classification using traditional CNNs and transfer learning models, notably VGG19, InceptionV3, and Xception. The models were trained using NIH datasets and tested using different performance metrics such as accuracy, precision, recall, and F1-score. The test results showed that deep CNNs achieved the highest accuracy, 97%, followed by Xception with an accuracy of 95% and InceptionV3 with an accuracy of 94%. An SVM machine learning model achieved an accuracy of 83%. Furthermore, the system can be accessed through a web interface, where users can upload blood smear images for malaria detection.

[CV-157] AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Link: https://arxiv.org/abs/2407.00104
Authors: Iván Matas, Carmen Serrano, Francisca Silva, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha
Keywords: optimizing resource utilization, BCC, provide interpretable support, BCC dermoscopic features, BCC dermoscopic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments: 8 pages, 4 figures, 4 tables, under review

Abstract:An AI tool has been developed to provide interpretable support for the diagnosis of basal cell carcinoma (BCC) via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways. First, the main BCC dermoscopic patterns are found in the image to justify the BCC/non-BCC classification. Second, based on the common visual XAI method Grad-CAM, a clinically inspired visual explanation is developed where the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnosis of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For Clinically-inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the Clinically-inspired Visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model struggles to accurately identify the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.
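The inside/outside comparison reported in this abstract amounts to averaging a normalized heatmap over two regions. A minimal Python sketch of that arithmetic follows; the function name `mean_inside_outside`, the nested-list inputs, and the toy values are assumptions for illustration and do not reproduce the paper's data or code.

```python
def mean_inside_outside(heatmap, mask):
    """Average a normalized heatmap inside and outside a binary mask.

    `heatmap` is a 2D list of values assumed to be scaled to [0, 1];
    `mask` is a 2D list of the same shape, truthy where an expert
    segmented a clinical feature.
    """
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, mask):
        for value, is_feature in zip(heat_row, mask_row):
            (inside if is_feature else outside).append(value)

    def mean(values):
        # An empty region has no defined mean.
        return sum(values) / len(values) if values else float("nan")

    return mean(inside), mean(outside)


# Toy 2x2 example: the left column is the segmented feature.
toy_heatmap = [[0.8, 0.2], [0.6, 0.0]]
toy_mask = [[1, 0], [1, 0]]
inside_mean, outside_mean = mean_inside_outside(toy_heatmap, toy_mask)
```

A larger gap between the two means indicates that the heatmap concentrates on the annotated region.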

[CV-158] LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

链接: https://arxiv.org/abs/2407.00024
作者: Lang He,Kai Chen,Junnan Zhao,Yimeng Wang,Ercheng Pei,Haifeng Chen,Jiewei Jiang,Shiqing Zhang,Jie Zhang,Zhongmin Wang,Tao He,Prayag Tiwari
关键词: including their personal, social functioning, academic and work, significantly impact, impact many aspects
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Depression can significantly impact many aspects of an individual’s life, including their personal and social functioning, academic and work performance, and overall quality of life. Many researchers within the field of affective computing are adopting deep learning technology to explore potential patterns related to the detection of depression. However, because of concerns about protecting subjects’ privacy, data in this area is still scarce, presenting a challenge for the deep discriminative models used in detecting depression. To navigate these obstacles, a large-scale multimodal vlog dataset (LMVD) for depression recognition in the wild is built. LMVD has 1823 samples, totaling 214 hours, from 1475 participants, captured from four multimedia platforms (Sina Weibo, Bilibili, Tiktok, and YouTube). A novel architecture termed MDDformer is proposed to learn the non-verbal behaviors of individuals. Extensive validations are performed on the LMVD dataset, demonstrating superior performance for depression detection. We anticipate that the LMVD will contribute a valuable function to the depression detection community. The data and code will be released at the link: this https URL.

[CV-159] Neural Graphics Texture Compression Supporting Random Access

链接: https://arxiv.org/abs/2407.00021
作者: Farzad Farhadzadeh,Qiqi Hou,Hoang Le,Amir Said,Randall Rauwendaal,Alex Bourd,Fatih Porikli
关键词: tremendous growth, including resolution, Neural Image Compression, matched by advances, Advances
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
*备注: ECCV submission

点击查看摘要

Abstract:Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel texture components, but this growth in data volume has not been matched by advances in its compression. Meanwhile, Neural Image Compression (NIC) has advanced significantly and shown promising results, but the proposed methods cannot be directly adapted to neural texture compression. First, texture compression requires on-demand and real-time decoding with random access during parallel rendering (e.g. block texture decompression on GPUs). Additionally, NIC does not support multi-resolution reconstruction (mip-levels), nor does it have the ability to efficiently jointly compress different sets of texture channels. In this work, we introduce a novel approach to texture set compression that integrates traditional GPU texture representation and NIC techniques, designed to enable random access and support many-channel texture sets. To achieve this goal, we propose an asymmetric auto-encoder framework that employs a convolutional encoder to capture detailed information in a bottleneck-latent space, and on the decoder side we utilize a fully connected network, whose inputs are sampled latent features plus positional information, for a given texture coordinate and mip level. This latent data is defined to enable simplified access to multi-resolution data by simply changing the scanning strides. Experimental results demonstrate that this approach provides much better results than conventional texture compression, and significant improvement over the latest method using neural networks.

[CV-160] Visual Language Model based Cross-modal Semantic Communication Systems

链接: https://arxiv.org/abs/2407.00020
作者: Feibo Jiang,Chuanguo Tang,Li Dong,Kezhi Wang,Kun Yang,Cunhua Pan
关键词: Shannon physical capacity, physical capacity limits, Cross-modal Semantic Communication, transcending the Shannon, Shannon physical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures

点击查看摘要

Abstract:Semantic Communication (SC) has emerged as a novel communication paradigm in recent years, successfully transcending the Shannon physical capacity limits through innovative semantic transmission concepts. Nevertheless, extant Image Semantic Communication (ISC) systems face several challenges in dynamic environments, including low semantic density, catastrophic forgetting, and uncertain Signal-to-Noise Ratio (SNR). To address these challenges, we propose a novel Vision-Language Model-based Cross-modal Semantic Communication (VLM-CSC) system. The VLM-CSC comprises three novel components: (1) Cross-modal Knowledge Base (CKB) is used to extract high-density textual semantics from the semantically sparse image at the transmitter and reconstruct the original image based on textual semantics at the receiver. The transmission of high-density semantics contributes to alleviating bandwidth pressure. (2) Memory-assisted Encoder and Decoder (MED) employ a hybrid long/short-term memory mechanism, enabling the semantic encoder and decoder to overcome catastrophic forgetting in dynamic environments when there is a drift in the distribution of semantic features. (3) Noise Attention Module (NAM) employs attention mechanisms to adaptively adjust the semantic coding and the channel coding based on SNR, ensuring the robustness of the CSC system. The experimental simulations validate the effectiveness, adaptability, and robustness of the CSC system.

[CV-161] NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks

链接: https://arxiv.org/abs/2406.06305
作者: Yuqi Ma,Huamin Wang,Hangchi Shen,Xuemei Chen,Shukai Duan,Shiping Wen
关键词: brain-inspired spiking neural, spiking neural networks, attracted great research, great research attention, research attention owing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 32 pages,4 figures,4 tables

点击查看摘要

Abstract:Recently, brain-inspired spiking neural networks (SNNs) have attracted great research attention owing to their inherent bio-interpretability, event-triggered properties and powerful perception of spatiotemporal information, which is beneficial to handling event-based neuromorphic datasets. In contrast to conventional static image datasets, event-based neuromorphic datasets present heightened complexity in feature extraction due to their distinctive time series and sparsity characteristics, which influences their classification accuracy. To overcome this challenge, a novel approach termed Neuromorphic Momentum Contrast Learning (NeuroMoCo) for SNNs is introduced in this paper by extending the benefits of self-supervised pre-training to SNNs to effectively stimulate their potential. This is the first time that self-supervised learning (SSL) based on momentum contrastive learning is realized in SNNs. In addition, we devise a novel loss function named MixInfoNCE tailored to their temporal characteristics to further increase the classification accuracy of neuromorphic datasets, which is verified through rigorous ablation experiments. Finally, experiments on DVS-CIFAR10, DVS128Gesture and N-Caltech101 show that NeuroMoCo establishes new state-of-the-art (SOTA) benchmarks: 83.6% (Spikformer-2-256), 98.62% (Spikformer-2-256), and 84.4% (SEW-ResNet-18), respectively.
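
For readers unfamiliar with momentum contrastive learning, the two generic ingredients can be sketched as follows. This is the standard MoCo-style recipe (a momentum update of the key encoder plus an InfoNCE loss), not the paper's MixInfoNCE loss and not its SNN encoders. The function names, the momentum `m`, and the temperature value are assumptions chosen for illustration.

```python
import math


def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: -log softmax of the positive logit."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Toy unit-length embeddings (hypothetical values).
query = [1.0, 0.0]
aligned_loss = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
updated_key = momentum_update([1.0], [0.0], m=0.9)
```

The loss is small when the query matches its positive key and large when a negative key matches better.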

[CV-162] xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

链接: https://arxiv.org/abs/2407.01530
作者: Tianrun Chen,Chaotao Ding,Lanyun Zhu,Tao Xu,Deyi Ji,Ying Zang,Zejian Li
关键词: dependencies remains constrained, Vision Transformers, Neural Language Processing, Convolutional Neural Networks, manage long-range dependencies
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation. xLSTM was recently proposed as the successor of Long Short-Term Memory (LSTM) networks and has demonstrated superior performance compared to Transformers and State Space Models (SSMs) like Mamba in Natural Language Processing (NLP) and image classification (as demonstrated in the Vision-LSTM, or ViL, implementation). Here, the xLSTM-UNet we designed extends this success to the biomedical image segmentation domain. By integrating the local feature extraction strengths of convolutional layers with the long-range dependency capturing abilities of xLSTM, xLSTM-UNet offers a robust solution for comprehensive image analysis. We validate the efficacy of xLSTM-UNet through experiments. Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks on multiple biomedical segmentation datasets, including organs in abdomen MRI, instruments in endoscopic images, and cells in microscopic images. With comprehensive experiments performed, this technical report highlights the potential of xLSTM-based architectures in advancing biomedical image analysis in both 2D and 3D. The code, models, and datasets are publicly available at this http URL

[CV-163] Centerline Boundary Dice Loss for Vascular Segmentation

链接: https://arxiv.org/abs/2407.01517
作者: Pengcheng Shi,Jiesi Hu,Yanwu Yang,Zilve Gao,Wei Liu,Ting Ma
关键词: medical imaging plays, functional assessments, medical imaging, imaging plays, plays a crucial
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by MICCAI 2024

点击查看摘要

Abstract:Vascular segmentation in medical imaging plays a crucial role in analysing morphological and functional assessments. Traditional methods, like the centerline Dice (clDice) loss, ensure topology preservation but falter in capturing geometric details, especially under translation and deformation. The combination of clDice with traditional Dice loss can lead to diameter imbalance, favoring larger vessels. Addressing these challenges, we introduce the centerline boundary Dice (cbDice) loss function, which harmonizes topological integrity and geometric nuances, ensuring consistent segmentation across various vessel sizes. cbDice enriches the clDice approach by including boundary-aware aspects, thereby improving geometric detail recognition. It matches the performance of the boundary difference over union (B-DoU) loss through a mask-distance-based approach, enhancing translation sensitivity. Crucially, cbDice incorporates radius information from vascular skeletons, enabling uniform adaptation to vascular diameter changes and maintaining balance in branch growth and fracture impacts. Furthermore, we conducted a theoretical analysis of clDice variants (cl-X-Dice). We validated cbDice’s efficacy on three diverse vascular segmentation datasets, encompassing both 2D and 3D, and binary and multi-class segmentation. Particularly, the method integrated with cbDice demonstrated outstanding performance on the MICCAI 2023 TopCoW Challenge dataset. Our code is made publicly available at: this https URL.
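
As background for the "traditional Dice loss" mentioned in this abstract, here is a minimal sketch of the plain soft Dice loss. It is not cbDice or clDice, which additionally use skeleton and boundary information. The function name `dice_loss` and the smoothing constant `eps` are assumptions for illustration.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for flat lists of predicted probabilities and 0/1 labels.

    Dice = 2 * |P ∩ T| / (|P| + |T|); the loss is 1 - Dice.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # eps avoids division by zero when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# Toy masks (hypothetical values).
perfect = dice_loss([1.0, 0.0, 1.0], [1, 0, 1])   # full overlap
disjoint = dice_loss([1.0, 0.0], [0, 1])          # no overlap
```

Because every foreground pixel carries equal weight, thick vessels contribute far more pixels than thin ones, which is one way to read the diameter imbalance the abstract describes.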

[CV-164] Neurovascular Segmentation in sOCT with Deep Learning and Synthetic Training Data

链接: https://arxiv.org/abs/2407.01419
作者: Etienne Chollet,Yaël Balbastre,Chiara Mauri,Caroline Magnain,Bruce Fischl,Hui Wang
关键词: Microvascular anatomy, neurological disorders, Microvascular, comprehensive three-dimensional vascular, three-dimensional vascular network
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 10 figures

点击查看摘要

Abstract:Microvascular anatomy is known to be involved in various neurological disorders. However, understanding these disorders is hindered by the lack of imaging modalities capable of capturing the comprehensive three-dimensional vascular network structure at microscopic resolution. With a lateral resolution of ≤ 20 μm and the ability to reconstruct large tissue blocks up to tens of cubic centimeters, serial-section optical coherence tomography (sOCT) is well suited for this task. This method uses intrinsic optical properties to visualize the vessels and therefore does not possess a specific contrast, which complicates the extraction of accurate vascular models. The performance of traditional vessel segmentation methods is heavily degraded in the presence of substantial noise and imaging artifacts and is sensitive to domain shifts, while convolutional neural networks (CNNs) require extensive labeled data and are also sensitive to the precise intensity characteristics of the data that they are trained on. Building on the emerging field of synthesis-based training, this study demonstrates a synthesis engine for neurovascular segmentation in sOCT images. Characterized by minimal priors and high variance sampling, our highly generalizable method tested on five distinct sOCT acquisitions eliminates the need for manual annotations while attaining human-level precision. Our approach comprises two phases: label synthesis and label-to-image transformation. We demonstrate the efficacy of the former by comparing it to several more realistic sets of training labels, and the latter by an ablation study of synthetic noise and artifact models.

[CV-165] Cross-Slice Attention and Evidential Critical Loss for Uncertainty-Aware Prostate Cancer Detection

链接: https://arxiv.org/abs/2407.01146
作者: Alex Ling Yu Hung,Haoxin Zheng,Kai Zhao,Kaifeng Pang,Demetri Terzopoulos,Kyunghyun Sung
关键词: Current deep learning-based, albeit disregarding volumetric, typically analyze medical, analyze medical images, disregarding volumetric information
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current deep learning-based models typically analyze medical images in either 2D or 3D albeit disregarding volumetric information or suffering sub-optimal performance due to the anisotropic resolution of MR data. Furthermore, providing an accurate uncertainty estimation is beneficial to clinicians, as it indicates how confident a model is about its prediction. We propose a novel 2.5D cross-slice attention model that utilizes both global and local information, along with an evidential critical loss, to perform evidential deep learning for the detection in MR images of prostate cancer, one of the most common cancers and a leading cause of cancer-related death in men. We perform extensive experiments with our model on two different datasets and achieve state-of-the-art performance in prostate cancer detection along with improved epistemic uncertainty estimation. The implementation of the model is available at this https URL.

[CV-166] Learning 3D Gaussians for Extremely Sparse-View Cone-Beam CT Reconstruction

链接: https://arxiv.org/abs/2407.01090
作者: Yiqun Lin,Hualiang Wang,Jixiang Chen,Xiaomeng Li
关键词: Cone-Beam Computed Tomography, Computed Tomography, exposure raises concerns, radiation exposure raises, Cone-Beam Computed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MICCAI 2024. Project link: this https URL

点击查看摘要

Abstract:Cone-Beam Computed Tomography (CBCT) is an indispensable technique in medical imaging, yet the associated radiation exposure raises concerns in clinical practice. To mitigate these risks, sparse-view reconstruction has emerged as an essential research direction, aiming to reduce the radiation dose by utilizing fewer projections for CT reconstruction. Although implicit neural representations have been introduced for sparse-view CBCT reconstruction, existing methods primarily focus on local 2D features queried from sparse projections, which is insufficient to process more complicated anatomical structures, such as the chest. To this end, we propose a novel reconstruction framework, namely DIF-Gaussian, which leverages 3D Gaussians to represent the feature distribution in the 3D space, offering additional 3D spatial information to facilitate the estimation of attenuation coefficients. Furthermore, we incorporate test-time optimization during inference to further improve the generalization capability of the model. We evaluate DIF-Gaussian on two public datasets, showing significantly better reconstruction performance than previous state-of-the-art methods.

[CV-167] Analysis of Modern Computer Vision Models for Blood Cell Classification

链接: https://arxiv.org/abs/2407.00759
作者: Alexander Kim(1),Ryan Kim(2) ((1) University of Illinois Urbana-Champaign, (2) William Fremd High School)
关键词: white blood cells, related blood components, white blood, blood cells, related blood
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:The accurate classification of white blood cells and related blood components is crucial for medical diagnoses. While traditional manual examinations and automated hematology analyzers have been widely used, they are often slow and prone to errors. Recent advancements in deep learning have shown promise for addressing these limitations. Earlier studies have demonstrated the viability of convolutional neural networks such as DenseNet, ResNet, and VGGNet for this task. Building on these foundations, our work employs more recent and efficient models to achieve rapid and accurate results. Specifically, this study used state-of-the-art architectures, including MaxVit, EfficientVit, EfficientNet, EfficientNetV2, and MobileNetV3. This study aimed to evaluate the performance of these models in WBC classification, potentially offering a more efficient and reliable alternative to current methods. Our approach not only addresses the speed and accuracy concerns of traditional techniques but also explores the applicability of innovative deep learning models in hematological analysis.

[CV-168] ASPS: Augmented Segment Anything Model for Polyp Segmentation

链接: https://arxiv.org/abs/2407.00718
作者: Huiqian Li,Dingwen Zhang,Jieru Yao,Longfei Han,Zhongyu Li,Junwei Han
关键词: colorectal cancer diagnosis, Polyp segmentation, Polyp segmentation plays, cancer diagnosis, Polyp
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI2024

点击查看摘要

Abstract:Polyp segmentation plays a pivotal role in colorectal cancer diagnosis. Recently, the emergence of the Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation, leveraging its powerful pre-training capability on large-scale datasets. However, due to the domain gap between natural and endoscopy images, SAM encounters two limitations in achieving effective performance in polyp segmentation. Firstly, its Transformer-based structure prioritizes global and low-frequency information, potentially overlooking local details, and introducing bias into the learned features. Secondly, when applied to endoscopy images, its poor out-of-distribution (OOD) performance results in substandard predictions and biased confidence output. To tackle these challenges, we introduce a novel approach named Augmented SAM for Polyp Segmentation (ASPS), equipped with two modules: Cross-branch Feature Augmentation (CFA) and Uncertainty-guided Prediction Regularization (UPR). CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge while enhancing local features and high-frequency details. Moreover, UPR ingeniously leverages SAM’s IoU score to mitigate uncertainty during the training procedure, thereby improving OOD performance and domain generalization. Extensive experimental results demonstrate the effectiveness and utility of the proposed method in improving SAM’s performance in polyp segmentation. Our code is available at this https URL.

[CV-169] A Review of Image Processing Methods in Prostate Ultrasound

链接: https://arxiv.org/abs/2407.00678
作者: Haiqiao Wang,Hong Wu,Zhuoyuan Wang,Peiyan Yue,Dong Ni,Pheng-Ann Heng,Yi Wang
关键词: reducing mortality rates, poses a significant, men health, mortality rates, Prostate cancer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Prostate cancer (PCa) poses a significant threat to men’s health, with early diagnosis being crucial for improving prognosis and reducing mortality rates. Transrectal ultrasound (TRUS) plays a vital role in the diagnosis and image-guided intervention of PCa. To facilitate physicians with more accurate and efficient computer-assisted diagnosis and interventions, many image processing algorithms in TRUS have been proposed and achieved state-of-the-art performance in several tasks, including prostate gland segmentation, prostate image registration, PCa classification and detection, and interventional needle detection. The rapid development of these algorithms over the past two decades necessitates a comprehensive summary. In consequence, this survey provides a systematic analysis of this field, outlining the evolution of image processing methods in the context of TRUS image analysis and meanwhile highlighting their relevant contributions. Furthermore, this survey discusses current challenges and suggests future research directions to possibly advance this field further.

[CV-170] HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis

链接: https://arxiv.org/abs/2407.00596
作者: Ruining Deng,Quan Liu,Can Cui,Tianyuan Yao,Juming Xiong,Shunxing Bao,Hao Li,Mengmeng Yin,Yu Wang,Shilin Zhao,Yucheng Tang,Haichun Yang,Yuankai Huo
关键词: variably scaled anatomy, remarkable challenge due, computational pathology presents, Panoramic image segmentation, scaled anatomy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2402.19286

点击查看摘要

Abstract:Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel Hierarchical Adaptive Taxonomy Segmentation (HATs) method, which is designed to thoroughly segment panoramic views of kidney structures by leveraging detailed anatomical insights. Our approach entails (1) the innovative HATs technique which translates spatial relationships among 15 distinct object classes into a versatile “plug-and-play” loss function that spans across regions, functional units, and cells, (2) the incorporation of anatomical hierarchies and scale considerations into a unified simple matrix representation for all panoramic entities, (3) the adoption of the latest AI foundation model (EfficientSAM) as a feature extraction tool to boost the model’s adaptability, yet eliminating the need for manual prompt generation in conventional segment anything model (SAM). Experimental findings demonstrate that the HATs method offers an efficient and effective strategy for integrating clinical insights and imaging precedents into a unified segmentation model across more than 15 categories. The official implementation is publicly available at this https URL.

[CV-171] Fully invertible hyperbolic neural networks for segmenting large-scale surface and sub-surface data

链接: https://arxiv.org/abs/2407.00595
作者: Bas Peters,Eldad Haber,Keegan Lensink
关键词: fully invertible networks, invertible networks, surface data segmentation, fully invertible, invertible
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 13 figures

点击查看摘要

Abstract:The large spatial/temporal/frequency scale of geoscience and remote-sensing datasets causes memory issues when using convolutional neural networks for (sub-) surface data segmentation. Recently developed fully reversible or fully invertible networks can mostly avoid memory limitations by recomputing the states during the backward pass through the network. This results in a low and fixed memory requirement for storing network states, as opposed to the typical linear memory growth with network depth. This work focuses on a fully invertible network based on the telegraph equation. While reversibility saves the major amount of memory used in deep networks by the data, the convolutional kernels can take up most memory if fully invertible networks contain multiple invertible pooling/coarsening layers. We address the explosion of the number of convolutional kernels by combining fully invertible networks with layers that contain the convolutional kernels in a compressed form directly. A second challenge is that invertible networks output a tensor the same size as its input. This property prevents the straightforward application of invertible networks to applications that map between different input-output dimensions, need to map to outputs with more channels than present in the input data, or desire outputs that decrease/increase the resolution compared to the input data. However, we show that by employing invertible networks in a non-standard fashion, we can still use them for these tasks. Examples in hyperspectral land-use classification, airborne geophysical surveying, and seismic imaging illustrate that we can input large data volumes in one chunk and do not need to work on small patches, use dimensionality reduction, or employ methods that classify a patch to a single central pixel.
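
The memory argument in this abstract (recompute states during the backward pass instead of storing them) can be illustrated with a toy reversible update. The abstract only states that the network is based on the telegraph equation, so the second-order leapfrog form below is an assumption, a common way to build hyperbolic reversible layers, and not the authors' architecture. `layer`, `forward_step`, `backward_step`, and the scalar weights are hypothetical.

```python
import math


def layer(y, weight):
    """Stand-in for a learned layer: an elementwise tanh with one weight."""
    return [math.tanh(weight * v) for v in y]


def forward_step(y_prev, y_curr, weight, h=0.1):
    """Leapfrog update: y_next = 2*y_curr - y_prev + h^2 * f(y_curr)."""
    f = layer(y_curr, weight)
    return [2.0 * c - p + h * h * fv for c, p, fv in zip(y_curr, y_prev, f)]


def backward_step(y_curr, y_next, weight, h=0.1):
    """Algebraic inverse: y_prev = 2*y_curr - y_next + h^2 * f(y_curr)."""
    f = layer(y_curr, weight)
    return [2.0 * c - n + h * h * fv for c, n, fv in zip(y_curr, y_next, f)]


weights = [0.5, -1.2, 0.8]
y0, y1 = [0.3, -0.7], [0.3, -0.7]

# Forward pass: keep only the two most recent states.
prev, curr = y0, y1
for w in weights:
    prev, curr = curr, forward_step(prev, curr, w)

# Backward pass: recompute earlier states from the last two.
for w in reversed(weights):
    curr, prev = prev, backward_step(prev, curr, w)
```

After the second loop, `prev` and `curr` match `y0` and `y1` up to floating-point rounding, so intermediate states never have to be stored, whatever the depth.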

[CV-172] Accelerating Longitudinal MRI using Prior Informed Latent Diffusion

链接: https://arxiv.org/abs/2407.00537
作者: Yonatan Urman,Zachary Shah,Ashwin Kumar,Bruno P.Soares,Kawin Setsompop
关键词: soft-tissue imaging modality, ionization-free soft-tissue imaging, imaging modality, widely used ionization-free, ionization-free soft-tissue
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:MRI is a widely used ionization-free soft-tissue imaging modality, often employed repeatedly over a patient’s lifetime. However, prolonged scanning durations, among other issues, can limit availability and accessibility. In this work, we aim to substantially reduce scan times by leveraging prior scans of the same patient. These prior scans typically contain considerable shared information with the current scan, thereby enabling higher acceleration rates when appropriately utilized. We propose a prior informed reconstruction method with a trained diffusion model in conjunction with data-consistency steps. Our method can be trained with unlabeled image data, eliminating the need for a dataset of either k-space measurements or paired longitudinal scans as is required of other learning-based methods. We demonstrate superiority of our method over previously suggested approaches in effectively utilizing prior information without over-biasing prior consistency, which we validate on both an open-source dataset of healthy patients as well as several longitudinal cases of clinical interest.

[CV-173] UADSN: Uncertainty-Aware Dual-Stream Network for Facial Nerve Segmentation

链接: https://arxiv.org/abs/2407.00297
作者: Guanghao Zhu,Lin Liu,Jing Zhang,Xiaohui Du,Ruqian Hao,Juanxiu Liu
关键词: cochlear implantation surgery, preoperative path planning, Facial nerve, implantation surgery, crucial for preoperative
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Facial nerve segmentation is crucial for preoperative path planning in cochlear implantation surgery. Recently, researchers have proposed some segmentation methods, such as atlas-based and deep learning-based methods. However, since the facial nerve is a tubular organ with a diameter of only 1.0-1.5mm, it is challenging to locate and segment the facial nerve in CT scans. In this work, we propose an uncertainty-aware dual-stream network (UADSN). UADSN consists of a 2D segmentation stream and a 3D segmentation stream. Predictions from two streams are used to identify uncertain regions, and a consistency loss is employed to supervise the segmentation of these regions. In addition, we introduce channel squeeze spatial excitation modules into the skip connections of U-shaped networks to extract meaningful spatial information. In order to consider topology preservation, a clDice loss is introduced into the supervised loss function. Experimental results on the facial nerve dataset demonstrate the effectiveness of UADSN and our submodules.

[CV-174] IVCA: Inter-Relation-Aware Video Complexity Analyzer

链接: https://arxiv.org/abs/2407.00280
作者: Junqi Liao,Yao Li,Zhuoyuan Li,Li Li,Dong Liu
关键词: video streaming applications, real-time analysis requirements, video complexity analyzer, video streaming, streaming applications
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: The report for the solution of second prize winner in ICIP 2024 Grand Challenge on Video Complexity (Team: USTC-iVC_Team1, USTC-iVC_Team2)

点击查看摘要

Abstract:To meet the real-time analysis requirements of video streaming applications, we propose an inter-relation-aware video complexity analyzer (IVCA) as an extension to VCA. The IVCA addresses the limitation of VCA by considering inter-frame relations, namely motion and reference structure. First, we enhance the accuracy of temporal features by introducing feature-domain motion estimation into the IVCA. Next, drawing inspiration from the hierarchical reference structure in codecs, we design layer-aware weights to adjust the majorities of frame complexity in different layers. Additionally, we expand the scope of temporal features by considering the frames that are referred to, rather than relying solely on the previous frame. Experimental results show the significant improvement in complexity estimation accuracy achieved by IVCA, with minimal time complexity increase.

[CV-175] Generative Iris Prior Embedded Transformer for Iris Restoration

链接: https://arxiv.org/abs/2407.00261
作者: Yubo Huang,Jia Wang,Peipei Li,Liuyu Xiang,Peigang Li,Zhaofeng He
关键词: complexly degraded iris, aiming to improve, challenging problem, Iris, degraded iris images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Our code is available at this https URL

点击查看摘要

Abstract:Iris restoration from complexly degraded iris images, aiming to improve iris recognition performance, is a challenging problem. Due to the complex degradation, directly training a convolutional neural network (CNN) without prior cannot yield satisfactory results. In this work, we propose a generative iris prior embedded Transformer model (Gformer), in which we build a hierarchical encoder-decoder network employing Transformer block and generative iris prior. First, we tame Transformer blocks to model long-range dependencies in target images. Second, we pretrain an iris generative adversarial network (GAN) to obtain the rich iris prior, and incorporate it into the iris restoration process with our iris feature modulator. Our experiments demonstrate that the proposed Gformer outperforms state-of-the-art methods. Besides, iris recognition performance has been significantly improved after applying Gformer.

[CV-176] DCSM 2.0: Deep Conditional Shape Models for Data Efficient Segmentation

Link: https://arxiv.org/abs/2407.00186
Authors: Athira J Jacob, Puneet Sharma, Daniel Rueckert
Keywords: image analyses workflows, medical image analyses, Conditional Shape Models, Deep Conditional Shape, analyses workflows
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Best oral paper award at ISBI 2024

Abstract: Segmentation is often the first step in many medical image analysis workflows. Deep learning approaches, while giving state-of-the-art accuracies, are data intensive and do not scale well to low data regimes. We introduce Deep Conditional Shape Models 2.0, which uses an edge detector, along with an implicit shape function conditioned on edge maps, to leverage cross-modality shape information. The shape function is trained exclusively on a source domain (contrasted CT) and applied to the target domain of interest (3D echocardiography). We demonstrate data efficiency in the target domain by varying the amounts of training data used in the edge detection stage. We observe that DCSM 2.0 outperforms the baseline at all data levels in terms of Hausdorff distance, when using 50% or less of the training data in terms of average mesh distance, and when using 10% or less of the data in terms of the Dice coefficient. The method scales well to low data regimes, with gains of up to 5% in Dice coefficient, 2.58 mm in average surface distance and 21.02 mm in Hausdorff distance when using just 2% (22 volumes) of the training data.
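
The results above are reported in terms of the Dice coefficient and the Hausdorff distance. For readers unfamiliar with these metrics, the following is a minimal sketch of their standard textbook definitions. It is not the authors' evaluation code: the function names are hypothetical, and it operates on flat binary masks and small point sets rather than on 3D volumes or meshes.

```python
import math


def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(points_a, points_b), directed(points_b, points_a))
```

A higher Dice coefficient means better overlap, while a lower Hausdorff distance means the worst-case boundary error is smaller. This brute-force Hausdorff computation is quadratic in the number of points, so it is only suitable for illustration.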

[CV-177] Scalable, Trustworthy Generative Model for Virtual Multi-Staining from H&E Whole Slide Images

Link: https://arxiv.org/abs/2407.00098
Authors: Mehdi Ounissi, Ilias Sarbout, Jean-Pierre Hugot, Christine Martinez-Vinson, Dominique Berrebi, Daniel Racoceanu
Keywords: require extensive time, raise environmental concerns, expensive chemicals, Chemical staining methods, extensive time
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Chemical staining methods are dependable but require extensive time and expensive chemicals, and they raise environmental concerns. These challenges highlight the need for alternative solutions like virtual staining, which accelerates the diagnostic process and enhances stain application flexibility. Generative AI technologies are pivotal in addressing these issues. However, the high-stakes nature of healthcare decisions, especially in computational pathology, complicates the adoption of these tools due to their opaque processes. Our work introduces the use of generative AI for virtual staining, aiming to enhance performance, trustworthiness, scalability, and adaptability in computational pathology. The methodology centers on a single H&E encoder supporting multiple stain decoders. This design focuses on critical regions in the latent space of H&E, enabling precise synthetic stain generation. Our method, tested to generate 8 different stains from a single H&E slide, offers scalability by loading only necessary model components during production. We integrate label-free knowledge in training, using loss functions and regularization to minimize artifacts, thus improving paired/unpaired virtual staining accuracy. To build trust, we use real-time self-inspection with discriminators for each stain type, providing pathologists with confidence heat-maps. Automatic quality checks on new H&E slides ensure conformity to the trained distribution, ensuring accurate synthetic stains. Recognizing pathologists' challenges with new technologies, we have developed an open-source, cloud-based system that allows easy virtual staining of H&E slides through a browser, addressing hardware/software issues and facilitating real-time user feedback. We also curated a novel dataset of 8 paired H&E/stains related to pediatric Crohn's disease, comprising 480 WSIs to further stimulate computational pathology research.

[CV-178] Comparing fine-grained and coarse-grained object detection for ecology

Link: https://arxiv.org/abs/2407.00018
Authors: Jess Tam, Justin Kay
Keywords:
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)
Comments: 6 pages, 4 figures, accepted for poster presentation at a conference workshop (11th Fine-Grained Visual Categorisation 2024)

[CV-179] Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Link: https://arxiv.org/abs/2406.20099
Authors: Ankan Bhunia, Changjian Li, Hakan Bilen
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and dataset at this https URL

Machine Learning

[LG-0] Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

Link: https://arxiv.org/abs/2407.01531
Authors: Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, Masayoshi Tomizuka
Keywords: demands efficient strategies, increasing complexity, Sparse Diffusion Policy, Diffusion Policy, tasks
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract: The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulation and the real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and code can be found at this https URL.
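
The selective activation of experts described above can be illustrated with a generic Mixture of Experts layer. The sketch below assumes top-k softmax gating, which is a common MoE routing choice; the abstract does not specify SDP's router, so this is not the authors' implementation. All names (`top_k_route`, `moe_forward`, the toy experts) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for index, weight in gates.items():
        # Experts that were not selected are never evaluated.
        expert_out = experts[index](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts standing in for learned sub-networks (assumption for illustration).
toy_experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [0.0 for _ in x],
]
```

Because only the selected experts run for a given input, the number of active parameters stays roughly constant as more experts are added, which is the property the abstract refers to as a negligible increase in active parameters.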

[LG-1] On the Abuse and Detection of Polyglot Files

Link: https://arxiv.org/abs/2407.01529
Authors: Luke Koch, Sean Oesch, Amul Chaulagain, Jared Dixon, Matthe