本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,每天早上11:30点定时自动更新,主要按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从arxiv网站获取,每天早上11:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天11:30左右邮件定时自动发送。

目录

概览 (2024-07-09)

今日共更新729篇论文,其中:

  • 自然语言处理132篇(Computation and Language (cs.CL))
  • 计算机视觉184篇(Computer Vision and Pattern Recognition (cs.CV))
  • 人工智能147篇(Artificial Intelligence (cs.AI))
  • 机器学习192篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Me Myself and AI: The Situational Awareness Dataset (SAD) for LLMs
[NLP-0] Me Myself和AI:LLM的情境意识数据集(SAD)

链接: https://arxiv.org/abs/2407.04694
作者: Rudolf Laine,Bilal Chughtai,Jan Betley,Kaivalya Hariharan,Jeremy Scheurer,Mikita Balesni,Marius Hobbhahn,Alexander Meinke,Owain Evans
关键词: situational awareness, ChatGPT are trained, trained to respond, respond to users, SAD
中文关键词: 情景感知、ChatGPT经过培训,接受过响应、响应用户、SAD的培训
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 page main body, 98 page appendix, 58 figures

点击查看摘要

Abstract:AI assistants such as ChatGPT are trained to respond to users by saying, “I am a large language model”. This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model’s knowledge of itself and its circumstances as situational awareness. To quantify situational awareness in LLMs, we introduce a range of behavioral tests, based on question answering and instruction following. These tests form the \textbfSituational Awareness Dataset (SAD) , a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. We evaluate 16 LLMs on SAD, including both base (pretrained) and chat models. While all models perform better than chance, even the highest-scoring model (Claude 3 Opus) is far from a human baseline on certain tasks. We also observe that performance on SAD is only partially predicted by metrics of general knowledge (e.g. MMLU). Chat models, which are finetuned to serve as AI assistants, outperform their corresponding base models on SAD but not on general knowledge tasks. The purpose of SAD is to facilitate scientific understanding of situational awareness in LLMs by breaking it down into quantitative abilities. Situational awareness is important because it enhances a model’s capacity for autonomous planning and action. While this has potential benefits for automation, it also introduces novel risks related to AI safety and control. Code and latest results available at this https URL . Comments: 11 page main body, 98 page appendix, 58 figures Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2407.04694 [cs.CL] (or arXiv:2407.04694v1 [cs.CL] for this version)
摘要:像ChatGPT这样的人工智能助手被训练成通过说:我是一个大型语言模型来回应用户。这引发了一些问题。这样的模型是否知道自己是LLM,并可靠地根据这一知识采取行动?他们是否知道自己目前的情况,例如被部署到公众中?我们把模特儿对自身和环境的了解称为情境感知。为了量化LLMS中的情境意识,我们引入了一系列基于问题回答和指令遵循的行为测试。这些测试形成了情景感知数据集(SAD),这是一个包含7个任务类别和13,000多个问题的基准。该基准测试了许多能力,包括LLMS的能力,包括(I)识别他们自己生成的文本,(Ii)预测他们自己的行为,(Iii)确定提示是来自内部评估还是真实世界部署,以及(Iv)遵循依赖于自我认识的指令。我们在SAD上评估了16个LLMS,包括基本(预训练)和聊天模型。虽然所有模型的表现都好于运气,但在某些任务上,即使是得分最高的模型(克劳德3·奥普斯)也远远不是人类的基线。我们还观察到,SAD上的性能仅由常识指标(例如MMLU)部分预测。聊天模型被优化为人工智能助手,在SAD上表现优于相应的基础模型,但在一般知识任务上表现不佳。SAD的目的是通过将其分解为定量的能力来促进对低收入管理中的态势意识的科学理解。情境感知很重要,因为它增强了模型的自主规划和行动能力。虽然这对自动化有潜在的好处,但它也引入了与人工智能安全和控制相关的新风险。代码和最新结果,请访问此HTTPS URL。备注:正文11页,附录98页,图形58幅主题:计算与语言(cs.CL);人工智能(cs.AI);机器学习(cs.LG)引用AS:arxiv:2407.04694cs.CL

[NLP-1] ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
[NLP-1] ANAH-v2:大型语言模型的缩放分析幻觉注释

链接: https://arxiv.org/abs/2407.04693
作者: Yuzhe Gu,Ziwei Ji,Wenwei Zhang,Chengqi Lyu,Dahua Lin,Kai Chen
关键词: Large language models, long-form question-answering tasks, Large language, hallucination, hallucination annotator
中文关键词: 大型语言模型、长篇问答任务、大型语言、幻觉、幻觉注释器
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications. Current hallucination detection and mitigation datasets are limited in domains and sizes, which struggle to scale due to prohibitive labor costs and insufficient reliability of existing hallucination annotators. To facilitate the scalable oversight of LLM hallucinations, this paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset and improves the accuracy of the hallucination annotator. Based on the Expectation Maximization (EM) algorithm, in each iteration, the framework first applies a hallucination annotation pipeline to annotate a scaled dataset and then trains a more accurate hallucination annotator on the dataset. This new hallucination annotator is adopted in the hallucination annotation pipeline used for the next iteration. Extensive experimental results demonstrate that the finally obtained hallucination annotator with only 7B parameters surpasses the performance of GPT-4 and obtains new state-of-the-art hallucination detection results on HaluEval and HalluQA by zero-shot inference. Such an annotator can not only evaluate the hallucination levels of various LLMs on the large-scale dataset but also help to mitigate the hallucination of LLMs generations, with the Natural Language Inference (NLI) metric increasing from 25% to 37% on HaluEval.
摘要:大型语言模型在不同领域和广泛应用的长格式问答任务中表现出幻觉。目前的幻觉检测和缓解数据集在域和大小上都是有限的,由于高昂的劳动力成本和现有幻觉注释器的可靠性不足,这些数据集很难扩展。为了促进对LLM幻觉的可扩展监督,本文引入了一种迭代的自我训练框架,该框架同时渐进地放大幻觉注释器的数据集,并提高幻觉注释器的准确性。该框架基于期望最大化(EM)算法,在每次迭代中,首先应用幻觉注解流水线来标注缩放后的数据集,然后在数据集上训练更准确的幻觉注释器。这种新的幻觉注释器被用于下一次迭代的幻觉注解流水线中。大量实验结果表明,最终得到的仅需7B参数的幻觉注释器性能优于GPT-4,并通过零镜头推理获得了最新的HaluEval和HalluQA幻觉检测结果。这样的注释器不仅可以在大规模数据集上评估各种LLM的幻觉水平,而且有助于缓解LLMS世代的幻觉,在HaluEval上,自然语言推理(NLI)度量从25%增加到37%。

[NLP-2] Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
[NLP-2] 错过的原因和模糊的影响:反事实对解释神经网络提出挑战

链接: https://arxiv.org/abs/2407.04690
作者: Aaron Mueller
关键词: causality for granted, Interpretability research, Abstract, counterfactual, counterfactual theories
中文关键词: 因果关系理所当然,可解释性研究,抽象,反事实,反事实理论
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Interpretability research takes counterfactual theories of causality for granted. Most causal methods rely on counterfactual interventions to inputs or the activations of particular model components, followed by observations of the change in models’ output logits or behaviors. While this yields more faithful evidence than correlational methods, counterfactuals nonetheless have key problems that bias our findings in specific and predictable ways. Specifically, (i) counterfactual theories do not effectively capture multiple independently sufficient causes of the same effect, which leads us to miss certain causes entirely; and (ii) counterfactual dependencies in neural networks are generally not transitive, which complicates methods for extracting and interpreting causal graphs from neural networks. We discuss the implications of these challenges for interpretability researchers and propose concrete suggestions for future work.
摘要:可解释性研究将反事实因果关系理论视为理所当然。大多数因果方法依赖于对输入或特定模型组件的激活的反事实干预,然后观察模型输出日志或行为的变化。虽然这比相关方法产生了更可靠的证据,但反事实方法仍然存在一些关键问题,使我们的研究结果以特定且可预测的方式产生偏差。具体来说,(i)反事实理论无法有效捕捉同一效应的多个独立充分的原因,这导致我们完全错过某些原因;(ii)神经网络中的反事实依赖性通常不传递,这使得从神经网络中提取和解释因果图的方法变得复杂。我们讨论了这些挑战对可解释性研究人员的影响,并为未来的工作提出具体建议。

[NLP-3] Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
[NLP-3] 重新思考具有外部知识的多模式大型语言模型的视觉规划

链接: https://arxiv.org/abs/2407.04681
作者: Yuanze Lin,Yunsheng Li,Dongdong Chen,Weijian Xu,Ronald Clark,Philip Torr,Lu Yuan
关键词: multimodal large language, high-quality image-text datasets, made significant strides, vast high-quality image-text, generally understand images
中文关键词: 多模式大语言、高质量图文数据集,取得重大进展,海量高质量图文,普遍理解图像
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring an understanding of detailed or localized visual elements. Drawing inspiration from the Retrieval-Augmented Generation (RAG) concept, this paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models (e.g., instance segmentation/OCR models), into MLLMs. This is a promising yet underexplored direction for enhancing MLLMs’ performance. Our approach diverges from concurrent works, which transform external knowledge into additional text prompts, necessitating the model to indirectly learn the correspondence between visual content and text coordinates. Instead, we propose embedding fine-grained knowledge information directly into a spatial embedding map as a visual prompt. This design can be effortlessly incorporated into various MLLMs, such as LLaVA and Mipha, considerably improving their visual understanding performance. Through rigorous experiments, we demonstrate that our method can enhance MLLM performance across nine benchmarks, amplifying their fine-grained context-aware capabilities.
受检索-增强生成(RAG)概念的启发,提出了一种新的视觉提示方法,将从专门的视觉模型(如实例分割/OCR模型)中收集的细粒度外部知识集成到MLLMS中。这是一个很有希望但尚未被探索的方向,可以提高MLLMS的性能。我们的方法与并行工作不同,后者将外部知识转换为额外的文本提示,需要模型间接学习视觉内容和文本坐标之间的对应关系。相反,我们建议将细粒度的知识信息直接嵌入到空间嵌入地图中作为视觉提示。这种设计可以毫不费力地集成到各种MLLM中,如LLaVA和Mipha,大大提高了它们的视觉理解性能。通过严格的实验,我们证明了我们的方法可以在九个基准测试中提高MLLM的性能,放大它们的细粒度上下文感知能力。

[NLP-4] Entity Decomposition with Filtering: A Zero-Shot Clinical Named Entity Recognition Framework
[NLP-4] 带过滤的实体分解:零镜头临床命名实体识别框架

链接: https://arxiv.org/abs/2407.04629
作者: Reza Averly,Xia Ning
关键词: retrieve important entities, retrieve important, Clinical named entity, clinical narratives, NER
中文关键词: 检索重要实体,检索重要实体,临床命名实体,临床叙述,NER
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Clinical named entity recognition (NER) aims to retrieve important entities within clinical narratives. Recent works have demonstrated that large language models (LLMs) can achieve strong performance in this task. While previous works focus on proprietary LLMs, we investigate how open NER LLMs, trained specifically for entity recognition, perform in clinical NER. In this paper, we aim to improve them through a novel framework, entity decomposition with filtering, or EDF. Our key idea is to decompose the entity recognition task into several retrievals of sub-entity types. We also introduce a filtering mechanism to remove incorrect entities. Our experimental results demonstrate the efficacy of our framework across all metrics, models, datasets, and entity types. Our analysis reveals that entity decomposition can recognize previously missed entities with substantial improvement. We further provide a comprehensive evaluation of our framework and an in-depth error analysis to pave future works.
摘要:临床命名实体识别(NER)旨在检索临床叙述中的重要实体。最近的研究表明,大型语言模型(LLM)可以在这一任务中取得很好的性能。虽然以前的工作主要集中在专有的LLM上,但我们调查了专门为实体识别而训练的OPEN NER LLM在临床NER中的表现。在本文中,我们的目标是通过一个新的框架,实体分解和过滤,或EDF来改进它们。我们的主要思想是将实体识别任务分解为多个子实体类型的检索。我们还引入了过滤机制来删除不正确的实体。我们的实验结果证明了我们的框架在所有指标、模型、数据集和实体类型上的有效性。我们的分析表明,实体分解可以识别以前遗漏的实体,并有很大的改进。我们进一步对我们的框架进行了全面的评估和深入的错误分析,为下一步的工作做好了准备。

[NLP-5] Learning to (Learn at Test Time): RNNs with Expressive Hidden States
[NLP-5] 学习(在测试时学习):具有表达性隐藏状态的RNN

链接: https://arxiv.org/abs/2407.04620
作者: Yu Sun,Xinhao Li,Karan Dalal,Jiarui Xu,Arjun Vikram,Genghan Zhang,Yann Dubois,Xinlei Chen,Xiaolong Wang,Sanmi Koyejo,Tatsunori Hashimoto,Carlos Guestrin
关键词: Self-attention performs, hidden state, long context, Self-attention, state
中文关键词: 自我注意力表演,隐藏状态,长背景,自我注意力,状态
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
摘要:自我注意在长情境下表现良好,但具有二次复杂性。现有的RNN层具有线性复杂度,但其隐含状态的表达能力限制了它们在长上下文中的性能。我们提出了一类新的序列建模层,它具有线性复杂度和可表达的隐藏状态。其核心思想是将隐含状态本身作为机器学习模型,将更新规则作为自监督学习步骤。由于隐藏状态即使在测试序列上也会通过训练来更新,因此我们的层被称为测试时间训练(TTT)层。我们考虑了两个实例:TTT-Line和TTT-MLP,它们的隐藏状态分别是线性模型和两层MLP。我们在125M到1.3B参数的范围内评估我们的实例化,并与强大的Transformer和现代RNN Mamba进行比较。TTT-线性和TTT-MLP都达到或超过基线。与《变形金刚》类似,他们可以通过调整更多的令牌来减少困惑,而曼巴在16k上下文之后就不能了。通过初步的系统优化,TTT-Line在8k环境下已经比Transformer更快,在挂钟时间上与Mamba不相上下。TTT-MLP在存储I/O方面仍然面临挑战,但在长环境下显示出更大的潜力,为未来的研究指明了方向。

[NLP-6] ARM: Efficient Guided Decoding with Autoregressive Reward Models
[NLP-6] ARM:采用自回归奖励模型的高效引导解码

链接: https://arxiv.org/abs/2407.04615
作者: Sergey Troshin,Vlad Niculae,Antske Fokkens
关键词: data require careful, require careful tuning, Language models trained, real world, base language model
中文关键词: 数据需要仔细,需要仔细调整,训练的语言模型,现实世界,基础语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models trained on large amounts of data require careful tuning to be safely deployed in real world. We revisit the guided decoding paradigm, where the goal is to augment the logits of the base language model using the scores from a task-specific reward model. We propose a simple but efficient parameterization of the autoregressive reward model enabling fast and effective guided decoding. On detoxification and sentiment control tasks, we show that our efficient parameterization performs on par with RAD, a strong but less efficient guided decoding approach.
摘要:在大量数据上训练的语言模型需要仔细调整才能安全地部署在现实世界中。我们重新审视引导解码范式,其中的目标是使用特定任务奖励模型的分数来增强基础语言模型的逻辑。我们提出了一种简单但有效的自回归奖励模型参数化,以实现快速有效的引导解码。在解毒和情绪控制任务方面,我们表明我们高效的参数化与RAD(一种强大但效率较低的引导解码方法)相当。

[NLP-7] sting learning hypotheses using neural networks by manipulating learning data
[NLP-7] 通过操纵学习数据使用神经网络来刺痛学习假设

链接: https://arxiv.org/abs/2407.04593
作者: Cara Su-Yi Leong,Tal Linzen
关键词: English speakers learn, passivization is productive, hour was lasted, completely general, neural network language
中文关键词: 英语使用者学习,被动化是富有成效的,持续了一个小时,完全通用,神经网络语言
类目: Computation and Language (cs.CL)
备注: Submitted to Journal of Memory and Language

点击查看摘要

Abstract:Although passivization is productive in English, it is not completely general – some exceptions exist (e.g. One hour was lasted by the meeting). How do English speakers learn these exceptions to an otherwise general pattern? Using neural network language models as theories of acquisition, we explore the sources of indirect evidence that a learner can leverage to learn whether a verb can passivize. We first characterize English speakers’ judgments of exceptions to the passive, confirming that speakers find some verbs more passivizable than others. We then show that a neural network language model can learn restrictions to the passive that are similar to those displayed by humans, suggesting that evidence for these exceptions is available in the linguistic input. We test the causal role of two hypotheses for how the language model learns these restrictions by training models on modified training corpora, which we create by altering the existing training corpora to remove features of the input implicated by each hypothesis. We find that while the frequency with which a verb appears in the passive significantly affects its passivizability, the semantics of the verb does not. This study highlight the utility of altering a language model’s training data for answering questions where complete control over a learner’s input is vital.
摘要:虽然被动语态在英语中是有成效的,但它并不是完全通用的–也有一些例外(例如
会议持续了一个小时)。说英语的人如何学习这些例外,而不是一般的模式?使用神经网络语言模型作为习得理论,我们探索了间接证据的来源,即学习者可以利用这些证据来学习动词是否可以被动。然后,我们证明了神经网络语言模型可以学习对被动句的限制,这些限制类似于人类所显示的限制,这表明这些例外的证据在语言输入中是可用的。我们通过在修改后的训练语料库上的训练模型来测试两个假设对语言模型如何学习这些限制的因果作用,我们通过改变现有的训练语料库来删除每个假设所涉及的输入特征。我们发现,虽然动词在被动句中出现的频率显著影响其被动化,但动词的语义并不影响它的被动化。

[NLP-8] VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models
[NLP-8] VRSD:重新思考大型语言模型中检索的相似性和多样性

链接: https://arxiv.org/abs/2407.04573
作者: Hang Gao,Yongfeng Zhang
关键词: Large Language Models, Language Models, Large Language, landscape of Large, Maximal Marginal Relevance
中文关键词: 大型语言模型、语言模型、大型语言、最大边缘相关性景观
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vector retrieval algorithms are vital for semantic queries in the evolving landscape of Large Language Models (LLMs). Retrieving vectors that simultaneously meet criteria for both similarity and diversity significantly enhances the capabilities of LLM-based agents. Despite the widespread use of the Maximal Marginal Relevance (MMR) in retrieval scenarios with relevance and diversity requirements, fluctuations caused by variations in the parameter \lambda within the MMR complicate the determination of the optimization trajectory in vector spaces, thus obscuring the direction of enhancement. Moreover, there is a lack of a robust theoretical analysis for the constraints of similarity and diversity in retrieval processes. This paper introduces a novel approach to characterizing both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors addresses the similarity constraint, while necessitating that individual vectors within the sum vector divergently align with the query vector to satisfy the diversity constraint. We also formulate a new combinatorial optimization challenge, taking a selection of k vectors from a set of candidates such that their sum vector maximally aligns with the query vector, a problem we demonstrate to be NP-complete. This establishes the profound difficulty of pursuing similarity and diversity simultaneously in vector retrieval and lays a theoretical groundwork for further research. Additionally, we present the heuristic algorithm Vectors Retrieval with Similarity and Diversity (VRSD) which not only has a definitive optimization goal and eschews the need for preset parameters but also offers a modest reduction in time complexity compared to MMR. Empirical validation further confirm that VRSD significantly surpasses MMR across various datasets.
摘要:向量检索算法对于大型语言模型(LLMS)中不断演化的语义查询至关重要。检索同时满足相似性和多样性标准的载体显著增强了基于LLM的代理的能力。尽管最大边际相关性(MMR)在具有相关性和多样性要求的检索场景中被广泛使用,但由于MMR中参数\lambda的变化引起的波动使得向量空间中的最优轨迹的确定变得复杂,从而模糊了增强的方向。此外,对于检索过程中的相似性和多样性的约束,缺乏强有力的理论分析。本文介绍了一种通过和向量与查询向量之间的关系来刻画这两个约束的新方法。这些向量的接近解决了相似性约束,同时要求和向量内的各个向量与查询向量不同地对齐,以满足多样性约束。我们还提出了一个新的组合优化挑战,从一组候选者中选择k个向量,使它们的和向量最大限度地与查询向量一致,我们证明了这个问题是NP-完全的。这为向量检索中同时追求相似性和多样性奠定了深刻的难度,为进一步的研究奠定了理论基础。此外,我们还提出了具有相似性和多样性的启发式向量检索算法(VRSD),该算法不仅有明确的优化目标,不需要预先设置参数,而且与MMR相比,时间复杂度略有降低。经验验证进一步证实,在不同的数据集上,VRSD显著超过MMR。

[NLP-9] Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence Grounding and Repetition
[NLP-9] (尚未)完整的故事:评估视觉讲故事需要的不仅仅是衡量连贯性基础和重复性

链接: https://arxiv.org/abs/2407.04559
作者: Aditya K Surikuchi,Raquel Fernández,Sandro Pezzelle
关键词: temporally ordered sequence, Visual storytelling consists, sequence of images, consists in generating, generating a natural
中文关键词: 时间排序的序列,视觉讲故事由图像序列组成,包括生成,生成自然的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story ‘good’. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a ‘good’ story may require more than a human-like level of visual grounding, coherence, and repetition.
摘要:视觉讲故事是指在给定时间顺序的图像序列的情况下生成一个自然语言故事。这项任务不仅对模型具有挑战性,而且很难用自动衡量标准进行评估,因为对于什么才是故事的好的方面没有达成共识。在这篇文章中,我们介绍了一种新的方法,它根据人类的相似性来衡量故事的质量,这三个方面在以前的工作中得到了强调:视觉基础、连贯性和重复性。然后,我们使用该方法对几个模型生成的故事进行了评估,结果表明,基础模型LLaVA获得了最好的结果,但与TAPM相比略有不同,TAPM是一个小50倍的可视讲故事模型。升级TAPM的视觉和语言组件会产生一个在参数数量相对较少的情况下产生具有竞争力的性能的模型。最后,我们进行了一项人类评估研究,其结果表明,一个好的故事可能需要比人类更接近人类的视觉基础、连贯性和重复。

[NLP-10] Spontaneous Reward Hacking in Iterative Self-Refinement
[NLP-10] 迭代自我完善中的自发奖励黑客

链接: https://arxiv.org/abs/2407.04549
作者: Jane Pan,He He,Samuel R. Bowman,Shi Feng
关键词: natural language feedback, user preference, reward hacking, language model, actual user preference
中文关键词: 自然语言反馈、用户偏好、奖励黑客、语言模型、实际用户偏好
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator, providing feedback along with numerical ratings which the generator attempts to optimize. However, because the evaluator is an imperfect proxy of user preference, this optimization can lead to reward hacking, where the evaluator’s ratings improve while the generation quality remains stagnant or even decreases as judged by actual user preference. The concern of reward hacking is heightened in iterative self-refinement where the generator and the evaluator use the same underlying language model, in which case the optimization pressure can drive them to exploit shared vulnerabilities. Using an essay editing task, we show that iterative self-refinement leads to deviation between the language model evaluator and human judgment, demonstrating that reward hacking can occur spontaneously in-context with the use of iterative self-refinement. In addition, we study conditions under which reward hacking occurs and observe two factors that affect reward hacking severity: model size and context sharing between the generator and the evaluator.
摘要:语言模型能够基于自然语言反馈迭代地改进其输出,从而实现用户偏好的上下文优化。可以使用第二语言模型代替人类用户作为评估器,提供反馈以及生成器试图优化的数字评级。然而,由于评估者不是用户偏好的完美代理,这种优化可能导致奖励黑客攻击,其中评估者的评级提高,而根据实际用户偏好判断,生成质量保持停滞甚至下降。在迭代自我精化中,生成器和评估器使用相同的底层语言模型,在这种情况下,优化压力可能会促使他们利用共同的漏洞,从而加剧了对奖励黑客的担忧。通过一篇文章编辑任务,我们证明了迭代自我精炼会导致语言模型评价者和人类判断之间的偏差,从而证明了奖励黑客行为可以在使用迭代自我精炼的上下文中自发发生。此外,我们还研究了奖励黑客行为发生的条件,并观察到影响奖励黑客行为严重性的两个因素:模型大小和创建者和评价者之间的上下文共享。

[NLP-11] Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations
[NLP-11] 通过预训练执行语法转换来加强结构归纳偏差

链接: https://arxiv.org/abs/2407.04543
作者: Matthias Lindemann,Alexander Koller,Ivan Titov
关键词: training distribution, effectively learn, learn from small, small amounts, amounts of data
中文关键词: 培训分布,有效学习,从少量、少量的数据中学习
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Models need appropriate inductive biases to effectively learn from small amounts of data and generalize systematically outside of the training distribution. While Transformers are highly versatile and powerful, they can still benefit from enhanced structural inductive biases for seq2seq tasks, especially those involving syntactic transformations, such as converting active to passive voice or semantic parsing. In this paper, we propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training to perform synthetically generated syntactic transformations of dependency trees given a description of the transformation. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking, and also improves structural generalization for semantic parsing. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token, and that the model can leverage these attention heads on downstream tasks.
摘要:模型需要适当的归纳偏差才能有效地从少量数据中学习,并在训练分布之外系统地进行推广。虽然Transformers具有高度的通用性和强大功能,但它们仍然可以受益于seq2seq任务的增强的结构归纳偏差,特别是那些涉及句法转换的任务,如将主动语态转换为被动语态或语义分析。在本文中,我们提出通过中间预训练来加强变压器的结构感应偏向,以执行综合生成的依赖树的句法转换,给出了该转换的描述。我们的实验证实,这有助于对组块等句法任务的少量学习,并提高了语义分析的结构泛化。我们的分析表明,中间预训练会产生注意力头部,这些注意力头部跟踪哪些句法转换需要应用于哪个标记,并且该模型可以在下游任务中利用这些注意力头部。

[NLP-12] PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts
[NLP-12] PoPreRo:罗马尼亚Reddit帖子受欢迎程度预测的新数据集

链接: https://arxiv.org/abs/2407.04541
作者: Ana-Cristina Rogoz,Maria Ilinca Nechita,Radu Tudor Ionescu
关键词: collected from Reddit, Romanian posts collected, Popularity Prediction, Reddit, posts collected
中文关键词: 从Reddit收集,罗马尼亚帖子收集,人气预测,Reddit,帖子收集
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICPR 2024

点击查看摘要

Abstract:We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at this https URL.
摘要:我们介绍PoPreRo,这是第一个从Reddit收集的罗马尼亚帖子受欢迎程度预测的数据集。PoPreRo数据集包括来自罗马尼亚五个不同子目录的各种帖子样本汇编,总共28,107个数据样本。除了我们的新颖数据集之外,我们还引入了一组竞争模型,用作未来研究的基线。有趣的是,最高评分模型在测试集中的准确率为61.35%,宏F1得分为60.60%,这表明PoPreRo上的受欢迎程度预测任务非常具有挑战性。基于少数镜头提示Falcon-7 B大型语言模型的进一步研究也指向了同一方向。因此,我们相信PoPreRo是一个宝贵的资源,可用于评估预测罗马尼亚语社交媒体帖子受欢迎程度的模型。我们在此https URL上发布我们的数据集。

[NLP-13] Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect
[NLP-13] 突尼斯方言低资源SL和ASB语音编码器的性能分析

链接: https://arxiv.org/abs/2407.04533
作者: Salima Mdhaffar,Haroun Elleuch,Fethi Bougares,Yannick Estève
关键词: Spoken Language Understanding, Automatic Speech Recognition, including Spoken Language, Language Understanding, demonstrated remarkable performance
中文关键词: 口语理解、自动语音识别,包括口语、语言理解,表现出色
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in ArabicNLP 2024

点击查看摘要

Abstract:Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks, including Spoken Language Understanding (SLU) and Automatic Speech Recognition (ASR). For instance, fine-tuning SSL models for such tasks has shown significant potential, leading to improvements in the SOTA performance across challenging datasets. In contrast to existing research, this paper contributes by comparing the effectiveness of SSL approaches in the context of (i) the low-resource spoken Tunisian Arabic dialect and (ii) its combination with a low-resource SLU and ASR scenario, where only a few semantic annotations are available for fine-tuning. We conduct experiments using many SSL speech encoders on the TARIC-SLU dataset. We use speech encoders that were pre-trained on either monolingual or multilingual speech data. Some of them have also been refined without in-domain nor Tunisian data through multimodal supervised teacher-student paradigm. This study yields numerous significant findings that we are discussing in this paper.
摘要:通过自监督学习(SSL)训练的语音编码器在包括口语理解(SLU)和自动语音识别(ASR)在内的各种后续任务中表现出了出色的性能。例如,针对此类任务的微调SSL模型已显示出巨大的潜力,从而提高了SOTA在各种具有挑战性的数据集上的性能。与现有研究相反,本文通过比较SSL方法在(I)低资源口语突尼斯阿拉伯语方言和(Ii)其与低资源SLU和ASR场景的组合中的有效性,做出了贡献,其中只有几个语义注释可用于微调。我们在TARIC-SLU数据集上使用多个SSL语音编码器进行了实验。我们使用在单语或多语语音数据上经过预先训练的语音编码器。其中一些还通过多模式监督教师-学生范式在没有领域内或突尼斯数据的情况下进行了改进。这项研究产生了许多我们在本文中讨论的重要发现。

[NLP-14] GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning
[NLP-14] GPT与RETRO:探索检索与参数高效微调的交叉点

链接: https://arxiv.org/abs/2407.04528
作者: Aleksander Ficek,Jiaqi Zeng,Oleksii Kuchaiev
关键词: minimizing compute requirements, adapting large language, Retrieval-Augmented Generation, large language models, Parameter-Efficient Fine-Tuning
中文关键词: 最小化计算需求、适应大型语言、检索增强生成、大型语言模型、参数高效微调
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation (RAG) have become popular methods for adapting large language models while minimizing compute requirements. In this paper, we apply PEFT methods (P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes, ranging from 823 million to 48 billion parameters. We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process but GPT models have higher performance potential with PEFT. Additionally, our study indicates that 8B parameter models strike an optimal balance between cost and performance and P-tuning lags behind other PEFT techniques. We further provide a comparative analysis of between applying PEFT to an Instruction-tuned RETRO model and base RETRO model. This work presents the first comprehensive comparison of various PEFT methods integrated with RAG, applied to both GPT and RETRO models, highlighting their relative performance.
摘要:参数高效精调(PEFT)和检索增强生成(RAG)已经成为适应大型语言模型并最小化计算需求的流行方法。在本文中,我们将PEFT方法(P-Tuning、Adapters和LORA)应用于改进的检索增强型变压器(RERTO)和基线GPT模型,这些模型的大小从8.23亿到480亿个参数不等。我们发现,由于其独特的预训练过程,回溯模型在零镜头设置中的性能优于GPT模型,但GPT模型在PEFT中具有更高的性能潜力。此外,我们的研究表明,8B参数模型在成本和性能之间取得了最佳平衡,并且P-Tuning落后于其他PEFT技术。我们进一步提供了将PEFT应用于指令调整的追溯模型和基本追溯模型的比较分析。这项工作首次对各种与RAG集成的PEFT方法进行了全面的比较,分别适用于GPT和RETERO模型,突出了它们的相对性能。

[NLP-15] Leveraging Graph Structures to Detect Hallucinations in Large Language Models
[NLP-15] 利用图结构检测大型语言模型中的幻觉

链接: https://arxiv.org/abs/2407.04485
作者: Noa Nonkes,Sergei Agaronian,Evangelos Kanoulas,Roxana Petcu
关键词: Large language models, providing financial guidance, Large language, content creation, educational tutoring
中文关键词: 大型语言模型,提供财务指导,大型语言,内容创建,教育辅导
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models are extensively applied across a wide range of tasks, such as customer support, content creation, educational tutoring, and providing financial guidance. However, a well-known drawback is their predisposition to generate hallucinations. This damages the trustworthiness of the information these models provide, impacting decision-making and user confidence. We propose a method to detect hallucinations by looking at the structure of the latent space and finding associations within hallucinated and non-hallucinated generations. We create a graph structure that connects generations that lie closely in the embedding space. Moreover, we employ a Graph Attention Network which utilizes message passing to aggregate information from neighboring nodes and assigns varying degrees of importance to each neighbor based on their relevance. Our findings show that 1) there exists a structure in the latent space that differentiates between hallucinated and non-hallucinated generations, 2) Graph Attention Networks can learn this structure and generalize it to unseen generations, and 3) the robustness of our method is enhanced when incorporating contrastive learning. When evaluated against evidence-based benchmarks, our model performs similarly without access to search-based methods.
摘要:大型语言模型广泛应用于各种任务,如客户支持、内容创建、教育辅导和提供财务指导。然而,一个众所周知的缺点是他们容易产生幻觉。这损害了这些模型提供的信息的可信度,影响了决策和用户信心。我们提出了一种检测幻觉的方法,通过观察潜伏空间的结构,并在出现幻觉和未出现幻觉的几代人中找到联系。我们创建了一种图结构,它将紧密位于嵌入空间中的世代联系在一起。此外,我们采用了图注意网络,它利用消息传递来聚集来自邻居节点的信息,并根据每个邻居的相关性来为每个邻居分配不同程度的重要性。我们的发现表明,1)潜在空间中存在区分幻觉和非幻觉世代的结构,2)图形注意网络可以学习这种结构并将其推广到看不见的世代,3)当结合对比学习时,我们的方法的稳健性得到了增强。当根据基于证据的基准进行评估时,我们的模型在不使用基于搜索的方法的情况下执行类似的操作。

[NLP-16] Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models
[NLP-16] 控制耳语:控制语音基础模型的通用声学对抗攻击

链接: https://arxiv.org/abs/2407.04482
作者: Vyas Raina,Mark Gales
关键词: audio-prompted large language, large language models, flexible speech recognition, speech recognition based, Speech
中文关键词: 音频提示的大型语言、大型语言模型、灵活的语音识别、基于语音识别的、语音
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.
摘要:支持语音的基础模型,无论是基于灵活语音识别系统的形式,还是以音频提示的大型语言模型(LLMS)的形式,正变得越来越受欢迎。这些模型的一个有趣方面是,它们能够使用适当的提示执行自动语音识别(ASR)以外的任务。例如,OpenAI Whisper模型可以执行语音转录和语音翻译。随着音频提示LLMS的发展,有可能出现更大的控制选项。在这项工作中,我们证明了有了这种更大的灵活性,系统可以容易受到模型控制的对抗性攻击。在不访问模型提示的情况下,可以通过适当地改变音频输入来修改系统的行为。为了说明这一风险,我们证明了有可能在任何输入语音信号之前添加一个简短的通用对抗性声学片段,以覆盖ASR基础模型的提示设置。具体地说,我们成功地使用了一个通用的对抗性声学段来控制Whisper始终执行语音翻译,尽管被设置为执行语音转录。总体而言,这项工作展示了一种对多任务语音启用的基础模型的新形式的对抗性攻击,在部署这种形式的模型之前需要考虑这种形式。

[NLP-17] EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
[NLP-17] 事件聊天:大型语言模型驱动的对话推荐系统的实施和以用户为中心的评估,用于探索中小企业环境中的休闲活动

链接: https://arxiv.org/abs/2407.04472
作者: Hannes Kunstmann,Joseph Ollier,Joel Persson,Florian von Wangenheim
关键词: Large language models, LLM-driven CRS, Large language, conversational recommender systems, implement LLM-driven CRS
中文关键词: 大型语言模型,LLM驱动的CRS,大型语言,对话式推荐系统,实现LLM驱动的CRS
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages, 3 tables, 5 figures, pre-print manuscript, updated version of manuscript due to typo (previous version, Figure 5 was incorrectly named Figure 6)

点击查看摘要

Abstract:Large language models (LLMs) present an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of a small to medium enterprises (SME) that makeup the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues challenging business viability. Notably, with a median cost of 0.04 per interaction and a latency of 5.7s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as Prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfying quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.
摘要:大型语言模型(LLM)代表了会话推荐系统(CRS)战略潜力的巨大演变。然而,到目前为止,研究主要集中在实施LLM驱动的CRS的技术框架上,而不是最终用户评估或对公司的战略影响,特别是从构成全球经济基石的中小型企业(SME)的角度。在当前的论文中,我们详细介绍了在中小企业环境中LLM驱动的CRS的设计,以及它在使用客观系统度量和主观用户评估的领域中的后续表现。在这样做的同时,我们还概述了一个简短的修订的Resque模型,用于评估LLM驱动的CRS,使其在快速发展的领域中具有可复制性。我们的结果显示,从用户体验的角度来看,系统性能良好(推荐准确率为85.5%),但也强调了对业务生存能力构成挑战的延迟、成本和质量问题。值得注意的是,每次交互的中位数成本为0.04,延迟为5.7秒,成本效益和响应时间成为为中小企业环境实现更加用户友好和经济可行的LLM驱动的CRS的关键领域。这些成本的一个主要驱动因素是使用先进的LLM作为检索增强生成(RAG)技术中的排序器。此外,我们的结果还表明,仅依赖于以ChatGPT为基础的基于迅速的学习等方法,在生产环境中实现令人满意的质量是具有挑战性的。概述了部署LLM驱动的CRS的中小企业的战略考虑,特别是考虑到当前技术环境中的权衡。

[NLP-18] Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
[NLP-18] 大型语言模型是战略决策者吗?两人非零和博弈的表现与偏差研究

链接: https://arxiv.org/abs/2407.04467
作者: Nathan Herr,Fernando Acero,Roberta Raileanu,María Pérez-Ortiz,Zhibin Li
关键词: Large Language Models, Large Language, remain largely unexplored, abilities remain largely, strategic abilities remain
中文关键词: 大型语言模型,大型语言,在很大程度上仍然未被探索,能力仍然存在,战略能力仍然存在
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注: 8 pages (19 with appendix), 6 figures in the main body (4 in the appendix), 4 tables in the main body

点击查看摘要

Abstract:Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic abilities remain largely unexplored. Game theory provides a good framework for assessing the decision-making abilities of LLMs in interactions with other agents. Although prior studies have shown that LLMs can solve these tasks with carefully curated prompts, they fail when the problem setting or prompt changes. In this work we investigate LLMs’ behaviour in strategic games, Stag Hunt and Prisoner Dilemma, analyzing performance variations under different settings and prompts. Our results show that the tested state-of-the-art LLMs exhibit at least one of the following systematic biases: (1) positional bias, (2) payoff bias, or (3) behavioural bias. Subsequently, we observed that the LLMs’ performance drops when the game configuration is misaligned with the affecting biases. Performance is assessed based on the selection of the correct action, one which agrees with the prompted preferred behaviours of both players. Alignment refers to whether the LLM’s bias aligns with the correct action. For example, GPT-4o’s average performance drops by 34% when misaligned. Additionally, the current trend of “bigger and newer is better” does not hold for the above, where GPT-4o (the current best-performing LLM) suffers the most substantial performance drop. Lastly, we note that while chain-of-thought prompting does reduce the effect of the biases on most models, it is far from solving the problem at the fundamental level.
摘要:大型语言模型在现实世界中的使用越来越多,但它们的策略能力在很大程度上仍未被发掘。博弈论为评估LLMS在与其他主体相互作用中的决策能力提供了一个很好的框架。尽管先前的研究表明,LLMS可以在精心策划的提示下解决这些任务,但当问题设置或提示发生变化时,它们会失败。在这项工作中,我们调查了LLMS在战略游戏、Stag Hunt和囚犯困境中的行为,分析了不同设置和提示下的性能变化。我们的结果表明,被测试的最先进的LLM至少表现出以下系统偏差之一:(1)位置偏差,(2)回报偏差,或(3)行为偏差。随后,我们观察到,当游戏配置与影响偏差不一致时,LLMS的性能会下降。根据对正确动作的选择来评估表现,该动作与两名球员提示的首选行为相一致。对齐是指LLM的偏差是否与正确的动作对齐。例如,GPT-40的平均性能在未对准时会下降34%。此外,目前“越大越新越好”的趋势不适用于上述情况,其中GPT-40(目前性能最好的LLM)的性能降幅最大。最后,我们注意到,尽管思维链激励确实减少了偏差对大多数模型的影响,但它远远不能从根本上解决问题。

[NLP-19] Using LLMs to label medical papers according to the CIViC evidence model
[NLP-19] 根据CIViC证据模型使用LLM标记医学论文

链接: https://arxiv.org/abs/2407.04466
作者: Markus Hisch,Xing David Wang
关键词: CIViC Evidence, medical NLP, problem CIViC Evidence, sequence classification problem, Evidence
中文关键词: CIViC证据,医学NLP,问题CIViC证据,序列分类问题,证据
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce the sequence classification problem CIViC Evidence to the field of medical NLP. CIViC Evidence denotes the multi-label classification problem of assigning labels of clinical evidence to abstracts of scientific papers which have examined various combinations of genomic variants, cancer types, and treatment approaches. We approach CIViC Evidence using different language models: We fine-tune pretrained checkpoints of BERT and RoBERTa on the CIViC Evidence dataset and challenge their performance with models of the same architecture which have been pretrained on domain-specific text. In this context, we find that BiomedBERT and BioLinkBERT can outperform BERT on CIViC Evidence (+0.8% and +0.9% absolute improvement in class-support weighted F1 score). All transformer-based models show a clear performance edge when compared to a logistic regression trained on bigram tf-idf scores (+1.5 - 2.7% improved F1 score). We compare the aforementioned BERT-like models to OpenAI’s GPT-4 in a few-shot setting (on a small subset of our original test dataset), demonstrating that, without additional prompt-engineering or fine-tuning, GPT-4 performs worse on CIViC Evidence than our six fine-tuned models (66.1% weighted F1 score compared to 71.8% for the best fine-tuned model). However, performance gets reasonably close to the benchmark of a logistic regression model trained on bigram tf-idf scores (67.7% weighted F1 score).
摘要:将序列分类问题引入医学自然语言处理领域。公民证据指的是将临床证据的标签分配给研究了基因组变异、癌症类型和治疗方法的各种组合的科学论文摘要的多标签分类问题。我们使用不同的语言模型来处理公民证据:我们微调了Bert和Roberta在公民证据数据集上的预训练检查点,并用在特定领域文本上预训练的相同架构的模型来挑战他们的性能。在此背景下,我们发现BiomedBERT和BioLinkBERT在公民证据方面的表现优于BERT(班级支持加权F1分数分别提高了0.8%和0.9%)。与基于二元语法TF-IDF分数训练的Logistic回归相比,所有基于变压器的模型都显示出明显的性能优势(F1分数提高了1.5-2.7%)。然而,性能相当接近于基于二元语法TF-IDF分数训练的Logistic回归模型的基准(67.7%加权F1分数)。

[NLP-20] Generalists vs. Specialists: Evaluating Large Language Models for Urdu
[NLP-20] 多面手与专家:评估乌尔都语的大型语言模型

链接: https://arxiv.org/abs/2407.04459
作者: Samee Arif,Abdul Hameed Azeemi,Agha Ali Raza,Awais Athar
关键词: Natural Language Processing, fine-tuned on specific, models, special-purpose models fine-tuned, Large Language Models
中文关键词: 自然语言处理,对特定模型进行微调,专用模型进行微调,大型语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we compare general-purpose pretrained models, GPT-4-Turbo and Llama-3-8b-Instruct with special-purpose models fine-tuned on specific tasks, XLM-Roberta-large, mT5-large, and Llama-3-8b-Instruct. We focus on seven classification and six generation tasks to evaluate the performance of these models on Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite the frequent advancements in Large Language Models (LLMs), their performance in low-resource languages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with the evaluations performed by GPT-4-Turbo and Llama-3-8b-Instruct. We find that special-purpose models consistently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks aligns more closely with human evaluation compared to the evaluation by Llama-3-8b-Instruct. This paper contributes to the NLP community by providing insights into the effectiveness of general and specific-purpose LLMs for low-resource languages.
摘要:在本文中,我们比较了通用预训练模型GPT-4-Turbo和Llama-3-8b-指令与针对特定任务微调的专用模型XLm-Roberta-Large、MT5-Large和Llama-3-8b-指令。我们集中于七个分类和六个生成任务来评估这些模型在乌尔都语上的性能。乌尔都语有7000万人以乌尔都语为母语,但在自然语言处理(NLP)中的比例仍然很低。尽管大型语言模型(LLM)不断进步,但它们在包括乌尔都语在内的低资源语言中的性能仍有待研究。我们还对生成任务进行了人工评估,并将结果与GPT-4-Turbo和Llama-3-8b-Indict执行的评估进行了比较。我们发现,在各种任务中,专用模型始终优于通用模型。我们还发现,与Llama-3-8b-Indict相比,GPT-4-Turbo对发电任务所做的评估更接近于人的评估。这篇文章通过对低资源语言的通用和专用LLM的有效性的洞察,为NLP社区做出了贡献。

[NLP-21] okenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR
[NLP-21] okenVerse:通过基于传感器的ASB统一语音和NLP任务

链接: https://arxiv.org/abs/2407.04444
作者: Shashi Kumar,Srikanth Madikeri,Juan Zuluaga-Gomez,Iuliia Nigmatulina,Esaú Villatoro-Tello,Sergio Burdisso,Petr Motlicek,Karthik Pandia,Aravind Ganapathiraju
关键词: named entity recognition, traditional conversational intelligence, voice activity detection, intelligence from speech, entity recognition
中文关键词: 命名实体识别、传统对话智能、语音活动检测、语音智能、实体识别
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, double column

点击查看摘要

Abstract:In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Additionally, we present task transfer learning to a new task within an existing TokenVerse.
摘要:在传统的语音对话智能中,使用级联管道,涉及语音活动检测、日记化、转录等任务,以及使用不同的NLP模型进行后续处理,以执行语义端点和命名实体识别(NER)等任务。我们的论文介绍了TokenVerse,这是一个基于Transducer的单一模型,旨在处理多个任务。这是通过在ASB模型训练期间将特定于任务的令牌集成到参考文本中来实现的,简化推理并消除对单独NLP模型的需要。除了ASB之外,我们还对3个不同的任务进行了实验:说话人变化检测、端点定位和NER。我们在公共和私有数据集上的实验表明,所提出的方法在相对WER方面将ASB提高了高达7.7%,同时在单个任务性能方面优于级联流水线方法。此外,我们还在现有TokenVerse中将任务转移学习呈现给新任务。

[NLP-22] From Showgirls to Performers: Fine-tuning with Gender-inclusive Language for Bias Reduction in LLMs
[NLP-22] 从歌舞女郎到表演者:用包容性别的语言进行微调,以减少法学硕士中的偏见

链接: https://arxiv.org/abs/2407.04434
作者: Marion Bartl,Susan Leavy
关键词: Large Language Models, Large Language, prevalent in Large, LLM training data, Language Models
中文关键词: 大型语言模型,大型语言,在大型中盛行,LLM培训数据,语言模型
类目: Computation and Language (cs.CL)
备注: 10 pages, 5 tables; to appear in Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing at ACL 2024

点击查看摘要

Abstract:Gender bias is not only prevalent in Large Language Models (LLMs) and their training data, but also firmly ingrained into the structural aspects of language itself. Therefore, adapting linguistic structures within LLM training data to promote gender-inclusivity can make gender representations within the model more inclusive. The focus of our work are gender-exclusive affixes in English, such as in ‘show-girl’ or ‘man-cave’, which can perpetuate gender stereotypes and binary conceptions of gender. We use an LLM training dataset to compile a catalogue of 692 gender-exclusive terms along with gender-neutral variants and from this, develop a gender-inclusive fine-tuning dataset, the ‘Tiny Heap’. Fine-tuning three different LLMs with this dataset, we observe an overall reduction in gender-stereotyping tendencies across the models. Our approach provides a practical method for enhancing gender inclusivity in LLM training data and contributes to incorporating queer-feminist linguistic activism in bias mitigation research in NLP.
摘要:性别偏见不仅在大型语言模型及其训练数据中普遍存在,而且在语言本身的结构方面也根深蒂固。因此,调整LLM培训数据中的语言结构以促进性别包容性可以使模型中的性别表征更具包容性。我们的工作重点是英语中性别专属的词缀,例如在‘show-Girl’或‘man-cave’中,这会使性别刻板印象和对性别的二元概念永久化。我们使用LLM训练数据集编制了692个性别专有术语以及中性变体的目录,并在此基础上开发了一个包含性别因素的微调数据集,即“小堆”。使用这个数据集微调三个不同的LLM,我们观察到所有模型的性别刻板印象倾向总体上都有所减少。我们的方法为提高LLM训练数据中的性别包容性提供了一种实用的方法,并有助于将同性恋-女权主义语言激进主义纳入NLP的偏见缓解研究。

[NLP-23] Waterfall: Framework for Robust and Scalable Text Watermarking
[NLP-23] 瀑布:稳健且可扩展的文本水印框架

链接: https://arxiv.org/abs/2407.04411
作者: Gregory Kang Ruey Lau,Xinyuan Niu,Hieu Dao,Jiangwei Chen,Chuan-Sheng Foo,Bryan Kian Hsiang Low
关键词: Protecting intellectual property, Protecting intellectual, large language models, intellectual property, increasingly important
中文关键词: 保护知识产权,保护知识分子、大型语言模型、知识产权,越来越重要
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Protecting intellectual property (IP) of text such as articles and code is increasingly important, especially as sophisticated attacks become possible, such as paraphrasing by large language models (LLMs) or even unauthorized training of LLMs on copyrighted text to infringe such IP. However, existing text watermarking methods are not robust enough against such attacks nor scalable to millions of users for practical implementation. In this paper, we propose Waterfall, the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Waterfall comprises several key innovations, such as being the first to use LLM as paraphrasers for watermarking along with a novel combination of techniques that are surprisingly effective in achieving robust verifiability and scalability. We empirically demonstrate that Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods, and also showed how it could be directly applied to the watermarking of code.
摘要:保护文章和代码等文本的知识产权(IP)变得越来越重要,特别是在复杂攻击变得可能的情况下,例如利用大型语言模型(LLM)进行释义,甚至未经授权对受版权保护的文本进行LLM培训以侵犯此类IP。然而,现有的文本水印方法对此类攻击不够健壮,也不能扩展到数百万用户进行实际实现。在本文中,我们提出了瀑布,这是第一个无训练的文本水印框架,适用于LLMS支持的多种文本类型(例如,文章、代码)和语言,用于一般文本和LLM数据来源。瀑布由几项关键创新组成,例如第一个使用LLM作为水印解释程序,以及在实现强大的可验证性和可扩展性方面出人意料地有效的技术组合。实验证明,与SOTA的文章-文本水印方法相比,瀑布算法具有更好的可扩展性、健壮的可验证性和计算效率,并且可以直接应用于代码的水印。

[NLP-24] Romanization Encoding For Multilingual ASR
[NLP-24] 多语言ASB的罗马化编码

链接: https://arxiv.org/abs/2407.04368
作者: Wen Ding,Fei Jia,Hainan Xu,Yu Xi,Junjie Lai,Boris Ginsburg
关键词: Automatic Speech Recognition, code-switching Automatic Speech, Speech Recognition, Automatic Speech, introduce romanization encoding
中文关键词: 自动语音识别,代码切换自动语音,语音识别,自动语音,引入罗马化编码
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method’s strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.
摘要:我们为大量脚本语言引入罗马化编码,以优化多语言和代码切换自动语音识别(ASB)系统。通过在配备Roman 2Char模块的FastConformer-RNNT框架中采用罗马化编码和平衡的级联符号化器,我们显着减少了词汇量和输出维度,从而实现了更大的训练批量并减少了内存消耗。我们的方法将声学建模和语言建模相结合,增强了系统的灵活性和适应性。在我们的研究中,将这种方法应用于普通话-英语ASB导致词汇量显着减少了63.51%,并且在SEAME代码切换基准上显着性能提高了13.72%和15.03%。Ablation对普通话-韩语和普通话-日语的研究强调了我们的方法解决其他脚本密集语言复杂性的强大能力,为更通用、更有效的多语言ASB系统铺平了道路。

[NLP-25] Crafting Large Language Models for Enhanced Interpretability
[NLP-25] 打造大型语言模型以增强可解释性

链接: https://arxiv.org/abs/2407.04307
作者: Chung-En Sun,Tuomas Oikarinen,Tsui-Wei Weng
关键词: Bottleneck Large Language, Concept Bottleneck Large, interpretable Large Language, inherently interpretable Large, Large Language Model
中文关键词: 瓶颈大型语言,概念瓶颈大型,可解释大型语言,固有可解释大型,大型语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Present at ICML 2024 Mechanistic Interpretability (MI) Workshop

点击查看摘要

Abstract:We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. This innovation not only advances transparency in language models but also enhances their effectiveness. Our unique Automatic Concept Correction (ACC) strategy successfully narrows the performance gap with conventional black-box LLMs, positioning CB-LLM as a model that combines the high accuracy of traditional LLMs with the added benefit of clear interpretability – a feature markedly absent in existing LLMs.
摘要:我们介绍了概念瓶颈大型语言模型(CB-LLM),这是一种创建本质上可解释的大型语言模型(LLM)的开创性方法。与依赖事后解释方法且神经元功能洞察有限的传统黑匣子LLM不同,CB-LLM凭借其内置的可解释性、可扩展性以及提供清晰、准确解释的能力设定了新标准。这一创新不仅提高了语言模型的透明度,而且提高了其有效性。我们独特的自动概念纠正(ACC)策略成功缩小了与传统黑匣子LLM的性能差距,将CB-LLM定位为一种结合了传统LLM的高准确性与清晰可解释性的额外好处的模型–这是现有LLM中明显缺乏的一项功能。

[NLP-26] Jailbreak Attacks and Defenses Against Large Language Models: A Survey
[NLP-26] 针对大型语言模型的越狱攻击和防御:调查

链接: https://arxiv.org/abs/2407.04295
作者: Sibo Yi,Yule Liu,Zhen Sun,Tianshuo Cong,Xinlei He,Jiaxing Song,Ke Xu,Qi Li
关键词: Large Language Models, Large Language, including question answering, Language Models, code completion
中文关键词: 大型语言模型、大型语言,包括问答、语言模型、代码完成
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of “jailbreaking”, which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
摘要:大型语言模型(LLMS)在问答、翻译、代码补全等文本生成任务中表现出色。然而,LLMS的过度协助带来了越狱的挑战,这导致该模型通过设计对抗性提示来生成针对使用策略和社会的恶意响应。随着利用LLMS中不同漏洞的越狱攻击方法的出现,相应的安全对齐措施也在不断发展。在本文中,我们提出了一个全面和详细的分类越狱攻防方法。例如,根据目标模型的透明性,将攻击方法分为黑盒攻击和白盒攻击。同时,我们将防御方法分为提示级防御和模型级防御。此外,我们还将这些攻击和防御方法进一步细分为不同的子类,并提供了一个连贯的图来说明它们之间的关系。我们还对现有的评估方法进行了调查,并从不同的角度对它们进行了比较。我们的发现旨在启发未来在保护LLM免受对手攻击方面的研究和实际实现。最重要的是,尽管越狱在社区中仍然是一个重要的问题,但我们相信我们的工作增进了对这个领域的了解,并为开发更安全的LLM提供了基础。

[NLP-27] Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency
[NLP-27] 在线发言人拨号系统的延迟性系统评估

链接: https://arxiv.org/abs/2407.04293
作者: Roman Aperdannier,Sigurd Schacht,Alexander Piazza
关键词: speaker diarization systems, online speaker diarization, test data, data with regard, online diarization system
中文关键词: 扬声器日记化系统,在线扬声器日记化,测试数据,相关数据,在线日记化系统
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages

点击查看摘要

Abstract:In this paper, different online speaker diarization systems are evaluated on the same hardware with the same test data with regard to their latency. The latency is the time span from audio input to the output of the corresponding speaker label. As part of the evaluation, various model combinations within the DIART framework, a diarization system based on the online clustering algorithm UIS-RNN-SML, and the end-to-end online diarization system FS-EEND are compared. The lowest latency is achieved for the DIART-pipeline with the embedding model pyannote/embedding and the segmentation model pyannote/segmentation. The FS-EEND system shows a similarly good latency. In general there is currently no published research that compares several online diarization systems in terms of their latency. This makes this work even more relevant.
摘要:在本文中,不同的在线扬声器拨号系统在相同的硬件上进行了评估,测试数据相同。延迟是从音频输入到相应扬声器标签输出的时间跨度。作为评估的一部分,比较了DIART框架内的各种模型组合、基于在线集群算法UIS-RNN-SML的日记化系统和端到端在线日记化系统FS-EEND。使用嵌入模型pyannote/embedding和分段模型pyannote/segment的DIART管道实现了最低的延迟。FS-EEND系统显示出类似良好的延迟。一般来说,目前还没有发表的研究可以比较几种在线日记系统的延迟时间。这使得这项工作变得更加相关。

[NLP-28] LearnerVoice: A Dataset of Non-Native English Learners Spontaneous Speech
[NLP-28] LearnerVoice:非英语母语学习者自发言语的数据集

链接: https://arxiv.org/abs/2407.04280
作者: Haechan Kim,Junho Myung,Seoyoung Kim,Sungpah Lee,Dongyeop Kang,Juho Kim
关键词: Automatic Speech Recognition, pose unique challenges, Prevalent ungrammatical expressions, learners pose unique, challenges to Automatic
中文关键词: 自动语音识别,带来独特的挑战,普遍存在的不语法表达,学习者对自动语音识别提出独特的挑战
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted for INTERSPEECH 2024

点击查看摘要

Abstract:Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners’ spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.
摘要:第二语言(L2)学习者自发言语中普遍存在的不语法表达和不流利现象给自动语音识别(ASB)系统带来了独特的挑战。然而,很少有数据集适合L2学习者语音。我们公开发布LearnerVoice,这是一个由50.04小时的音频和L2学习者自发言语的转录组成的数据集。我们的语言学分析表明,我们数据集中的转录包含L2 S(L2学习者的自发言语)特征,由不合语法的表达和不流利组成(例如,填充词、词重复、自我修复、错误开始),明显多于母语语音数据集。使用LearnerVoice进行微调的whisper-small.en实现了10.26%的WER,比vanilla whisper-small. en低44.2%。此外,我们的定性分析表明,LearnerVoice上vanilla模型中54.2%的错误归因于L2 S功能,其中48.1%的错误在微调模型中得到了减少。

[NLP-29] BiosERC: Integrating Biography Speakers Supported by LLMs for ERC Tasks
[NLP-29] BiosERC:集成由LLM支持的传记演讲者来执行ERC任务

链接: https://arxiv.org/abs/2407.04279
作者: Jieying Xue,Minh Phuong Nguyen,Blake Matheny,Le Minh Nguyen
关键词: Emotion Recognition, utilized attention mechanisms, attention mechanisms exploring, mechanisms exploring relationships, modeling emotional interaction
中文关键词: 情感识别、利用注意力机制、注意力机制探索、关系探索机制、建模情感互动
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in the 33rd International Conference on Artificial Neural Networks (ICANN 2024)

点击查看摘要

Abstract:In the Emotion Recognition in Conversation task, recent investigations have utilized attention mechanisms exploring relationships among utterances from intra- and inter-speakers for modeling emotional interaction between them. However, attributes such as speaker personality traits remain unexplored and present challenges in terms of their applicability to other tasks or compatibility with diverse model architectures. Therefore, this work introduces a novel framework named BiosERC, which investigates speaker characteristics in a conversation. By employing Large Language Models (LLMs), we extract the “biographical information” of the speaker within a conversation as supplementary knowledge injected into the model to classify emotional labels for each utterance. Our proposed method achieved state-of-the-art (SOTA) results on three famous benchmark datasets: IEMOCAP, MELD, and EmoryNLP, demonstrating the effectiveness and generalization of our model and showcasing its potential for adaptation to various conversation analysis tasks. Our source code is available at this https URL.
摘要:在会话中的情绪识别任务中,最近的研究利用注意机制来探索说话者内部和说话者之间的话语之间的关系,以模拟他们之间的情感交互。然而,说话人个性特征等属性仍未被探索,在适用于其他任务或与不同模型体系结构的兼容性方面存在挑战。因此,本工作引入了一种新的框架BiosERC,它研究对话中的说话人特征。通过使用大语言模型(LLMS),我们提取说话人在对话中的“传记信息”作为补充知识注入到模型中,以分类每个话语的情感标签。我们提出的方法在IEMOCAP、MELD和EmoryNLP这三个著名的基准数据集上获得了最先进的结果,证明了我们模型的有效性和泛化能力,并展示了其适应各种会话分析任务的潜力。我们的源代码可以在这个HTTPS URL上找到。

[NLP-30] Unified Interpretation of Smoothing Methods for Negative Sampling Loss Functions in Knowledge Graph Embedding
[NLP-30] 知识图嵌入中负抽样损失函数平滑方法的统一解释

链接: https://arxiv.org/abs/2407.04251
作者: Xincan Feng,Hidetaka Kamigaito,Katsuhiko Hayashi,Taro Watanabe
关键词: Knowledge Graphs, tasks in NLP, Negative Sampling, Adaptive Negative Sampling, fundamental resources
中文关键词: 知识图、NLP任务、负采样、自适应负采样、基础资源
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, 2 tables; accepted to workshop RepL4NLP held in conjunction with ACL 2024

点击查看摘要

Abstract:Knowledge Graphs (KGs) are fundamental resources in knowledge-intensive tasks in NLP. Due to the limitation of manually creating KGs, KG Completion (KGC) has an important role in automatically completing KGs by scoring their links with KG Embedding (KGE). To handle many entities in training, KGE relies on Negative Sampling (NS) loss that can reduce the computational cost by sampling. Since the appearance frequencies for each link are at most one in KGs, sparsity is an essential and inevitable problem. The NS loss is no exception. As a solution, the NS loss in KGE relies on smoothing methods like Self-Adversarial Negative Sampling (SANS) and subsampling. However, it is uncertain what kind of smoothing method is suitable for this purpose due to the lack of theoretical understanding. This paper provides theoretical interpretations of the smoothing methods for the NS loss in KGE and induces a new NS loss, Triplet Adaptive Negative Sampling (TANS), that can cover the characteristics of the conventional smoothing methods. Experimental results of TransE, DistMult, ComplEx, RotatE, HAKE, and HousE on FB15k-237, WN18RR, and YAGO3-10 datasets and their sparser subsets show the soundness of our interpretation and performance improvement by our TANS.
摘要:知识图谱是自然语言处理中知识密集型任务的基本资源。由于人工创建KG的局限性,KG补全(KGC)通过KG嵌入(KGE)对KG的链接进行评分,在自动完成KG方面具有重要作用。为了在训练中处理多个实体,KGE依赖于负采样(NS)损失,通过采样可以降低计算成本。由于每条链路的出现频率在KG中最多为一次,稀疏性是一个基本且不可避免的问题。NS的损失也不例外。作为解决方案,KGE中的NS损失依赖于自对抗负抽样(SANS)和次抽样等平滑方法。然而,由于缺乏理论上的了解,目前还不确定哪种平滑方法适合于此目的。本文对KGE中NS损失的平滑方法进行了理论解释,并引入了一种新的NS损失–三重自适应负抽样(TANS),该方法能够覆盖传统平滑方法的特点。在FB15K-237、WN18RR和YAG03-10数据集及其稀疏子集上的TRANSE、DistMult、Complex、Rotate、Hake和House数据集上的实验结果表明,我们的解释是合理的,并且我们的TANS提高了性能。

[NLP-31] ArAIEval Shared Task: Propagandistic Techniques Detection in Unimodal and Multimodal Arabic Content
[NLP-31] ArAIEval共享任务:单模式和多模式阿拉伯语内容中的语法技术检测

链接: https://arxiv.org/abs/2407.04247
作者: Maram Hasanain,Md. Arid Hasan,Fatema Ahmed,Reem Suwaileh,Md. Rafiul Biswas,Wajdi Zaghouani,Firoj Alam
关键词: co-located with ACL, ArAIEval shared task, organized as part, conference co-located, ArAIEval shared
中文关键词: 与ACL位于同一地点,ArAIEval共享任务,作为一部分组织,会议位于同一地点,ArAIEval共享
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: propaganda, span detection, disinformation, misinformation, fake news, LLMs, GPT-4, multimodality, multimodal LLMs

点击查看摘要

Abstract:We present an overview of the second edition of the ArAIEval shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. In this edition, ArAIEval offers two tasks: (i) detection of propagandistic textual spans with persuasion techniques identification in tweets and news articles, and (ii) distinguishing between propagandistic and non-propagandistic memes. A total of 14 teams participated in the final evaluation phase, with 6 and 9 teams participating in Tasks 1 and 2, respectively. Finally, 11 teams submitted system description papers. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further provide a brief overview of the participating systems. All datasets and evaluation scripts are released to the research community (this https URL). We hope this will enable further research on these important tasks in Arabic.
摘要:我们概述了ArAIEval共享任务的第二版,该任务是作为与ACL 2024共同举办的ArabecNLP 2024会议的一部分组织的。在这一版本中,ArAIEval提供了两项任务:(I)使用说服技术检测推文和新闻文章中的宣传性文本跨度,以及(Ii)区分宣传性和非宣传性模因。共有14支队伍参加了最终评估阶段,其中6支队伍和9支队伍分别参加了任务1和任务2。最后,11个团队提交了系统描述文件。在这两个任务中,我们观察到像AraBERT这样的微调变压器模型是大多数参与系统的核心。我们提供了对任务设置的描述,包括对数据集结构和评估设置的描述。我们进一步提供了参与系统的简要概述。所有数据集和评估脚本都将发布给研究社区(此HTTPS URL)。我们希望这将使我们能够用阿拉伯语对这些重要任务进行进一步研究。

[NLP-32] HAF-RM: A Hybrid Alignment Framework for Reward Model Training
[NLP-32] HAF-RM:奖励模型训练的混合协调框架

链接: https://arxiv.org/abs/2407.04185
作者: Shujun Liu,Xiaoyu Shen,Yuhang Lai,Siyuan Wang,Shengbin Yue,Zengfeng Huang,Xuanjing Huang,Zhongyu Wei
关键词: reward model, reward, large language models, increasingly important, construction for large
中文关键词: 奖励模型,奖励,大型语言模型,越来越重要,大型构建
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing researchers focus on enhancing reward models through data improvements, following the conventional training framework for reward models that directly optimizes the predicted rewards. In this paper, we propose a hybrid alignment framework HaF-RM for reward model training by introducing an additional constraint on token-level policy probabilities in addition to the reward score. It can simultaneously supervise the internal preference model at the token level and optimize the mapping layer of the reward model at the sequence level. Theoretical justifications and experiment results on five datasets show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our HaF-RM framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at this https URL.
摘要:奖励模型在大型语言模型的对齐、评估和数据构建中变得越来越重要。现有的大多数研究人员关注于通过数据改进来增强奖励模型,遵循传统的奖励模型的训练框架,直接优化预测的奖励。在本文中,我们提出了一种用于奖励模型训练的混合对齐框架HAF-RM,除了奖励分数之外,还引入了对令牌级策略概率的额外约束。它可以同时在令牌级对内部偏好模型进行监督,在序列级对奖励模型的映射层进行优化。理论证明和在五个数据集上的实验结果表明了我们提出的训练高质量奖励模型的混合框架的有效性。通过分离奖励建模过程并纳入混合监管,我们的HAF-RM框架提供了一种原则性的有效方法来提高奖励模型的性能和一致性,这是负责开发强大语言模型的关键组件。我们在这个HTTPS URL发布我们的代码。

[NLP-33] Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
[NLP-33] 像人工智能一样看待事物:法学硕士如何应用(和误用)维基百科中立规范

链接: https://arxiv.org/abs/2407.04183
作者: Joshua Ashkinaze,Ruijia Guan,Laura Kurek,Eytan Adar,Ceren Budak,Eric Gilbert
关键词: Large language models, Large language, trained on broad, broad corpora, communities with specialized
中文关键词: 大型语言模型,大型语言,在广泛、广泛的库、具有专业知识的社区中接受培训
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs’ capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia’s Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors’ simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.
摘要:大型语言模型在广泛的语料库上进行训练,然后在具有专门规范的社区中使用。为LLMS提供社区规则足以让模型遵循这些规范吗?我们根据维基百科的中立观点(NPOV)政策评估LLMS检测(任务1)和纠正(任务2)有偏见的维基百科编辑的能力。LLMS在偏差检测方面举步维艰,在平衡的数据集上只达到了%的准确率。模型显示出截然不同的偏差(有些预测偏低,另一些预测过高),这表明人们对中立有着截然不同的偏好。LLMS在生成方面表现更好,删除了维基百科编辑删除的79%的单词。然而,LLMS在维基百科编辑更简单的中和之外做了额外的改变,导致了高召回率但低精确度的编辑。有趣的是,大众工作者对人工智能重写的评价比维基百科编辑的重写更中性(70%)和流畅(61%)。定性分析发现,LLMS有时比维基百科编辑更全面地应用NPOV,但经常进行无关的、与NPOV无关的更改(如语法)。LLMS可能会以与公众产生共鸣但与社区专家背道而驰的方式实施规则。虽然LLMS可能对生成有效,但它可能会减少编辑代理并增加审核工作量(例如,验证添加)。即使规则很容易表达,让LLM像社区成员一样应用它们可能仍然很困难。

[NLP-34] Orchestrating LLMs with Different Personalizations
[NLP-34] 使用不同的个性化来描述LLM

链接: https://arxiv.org/abs/2407.04181
作者: Jin Peng Zhou,Katie Z Luo,Jingwen Gu,Jason Yuan,Kilian Q. Weinberger,Wen Sun
关键词: Reinforcement Learning, Human Feedback, aligning large language, large language models, individual human preferences
中文关键词: 强化学习、人类反馈、对齐大型语言、大型语言模型、个人人类偏好
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents a novel approach to aligning large language models (LLMs) with individual human preferences, sometimes referred to as Reinforcement Learning from \textitPersonalized Human Feedback (RLPHF). Given stated preferences along multiple dimensions, such as helpfulness, conciseness, or humor, the goal is to create an LLM without re-training that best adheres to this specification. Starting from specialized expert LLMs, each trained for one such particular preference dimension, we propose a black-box method that merges their outputs on a per-token level. We train a lightweight Preference Control Model (PCM) that dynamically translates the preference description and current context into next-token prediction weights. By combining the expert models’ outputs at the token level, our approach dynamically generates text that optimizes the given preference. Empirical tests show that our method matches or surpasses existing preference merging techniques, providing a scalable, efficient alternative to fine-tuning LLMs for individual personalization.
摘要:本文提出了一种将大语言模型(LLMS)与人的个人偏好相匹配的新方法,有时被称为基于个性化人类反馈的强化学习(RLPHF)。给定多个维度的特定偏好,例如帮助、简洁或幽默,目标是创建一个LLM,而不需要重新培训,最好地遵守这一规范。从专门的专家LLM出发,每个专家都针对这样一个特定的偏好维度进行训练,我们提出了一种在每个令牌级别上合并它们的输出的黑盒方法。我们训练了一个轻量级的偏好控制模型(PCM),它将偏好描述和当前上下文动态转换为下一个令牌预测权重。通过在令牌级结合专家模型的输出,我们的方法动态地生成优化给定偏好的文本。实验表明,我们的方法与现有的偏好合并技术相匹配或超过了现有的偏好合并技术,为个体个性化提供了一种可扩展的、高效的替代微调LLM的方法。

[NLP-35] Defense Against Syntactic Textual Backdoor Attacks with Token Substitution
[NLP-35] 利用令牌替换防御语法文本后门攻击

链接: https://arxiv.org/abs/2407.04179
作者: Xinglin Li,Xianwen He,Yao Li,Minhao Cheng
关键词: Large Language Models, Large Language, substantial security risk, Textual backdoor attacks, risk to Large
中文关键词: 大型语言模型、大型语言、重大安全风险、文本后门攻击、大型风险
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Textual backdoor attacks present a substantial security risk to Large Language Models (LLM). It embeds carefully chosen triggers into a victim model at the training stage, and makes the model erroneously predict inputs containing the same triggers as a certain class. Prior backdoor defense methods primarily target special token-based triggers, leaving syntax-based triggers insufficiently addressed. To fill this gap, this paper proposes a novel online defense algorithm that effectively counters syntax-based as well as special token-based backdoor attacks. The algorithm replaces semantically meaningful words in sentences with entirely different ones but preserves the syntactic templates or special tokens, and then compares the predicted labels before and after the substitution to determine whether a sentence contains triggers. Experimental results confirm the algorithm’s performance against these two types of triggers, offering a comprehensive defense strategy for model integrity.
摘要:文本后门攻击给大型语言模型(LLM)带来了巨大的安全风险。它在训练阶段将精心选择的触发器嵌入到受害者模型中,并使模型错误地预测包含与特定类别相同的触发器的输入。以前的后门防御方法主要针对特殊的基于令牌的触发器,留下了基于语法的触发器没有得到充分解决。为了填补这一空白,提出了一种新的在线防御算法,该算法能够有效地对抗基于语法的后门攻击和基于特殊令牌的后门攻击。该算法将句子中具有语义意义的单词替换为完全不同的单词,但保留了句法模板或特殊标记,然后将替换前后的预测标签进行比较,以确定句子是否包含触发器。实验结果证实了该算法对这两类触发器的抵抗能力,为模型完整性提供了一种全面的防御策略。

[NLP-36] ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
[NLP-36] ChartGemma:野外图表推理的视觉教学调整

链接: https://arxiv.org/abs/2407.04172
作者: Ahmed Masry,Megh Thakkar,Aayush Bajaj,Aaryaman Kartha,Enamul Hoque,Shafiq Joty
关键词: developing pre-trained foundation, general purpose instruction-tuned, pre-trained foundation models, purpose instruction-tuned models, underlying data tables
中文关键词: 开发预训练的基础、通用描述优化、预训练的基础模型、目的描述优化模型、底层数据表
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at this https URL.
摘要:考虑到图表作为数据分析、可视化和决策工具在各个行业和科学中的普遍存在,开发预先训练的基础模型以及用于图表理解和推理的通用指令调整模型的兴趣越来越大。然而,现有的方法在影响图表表示模型性能的两个关键轴上存在严重缺陷:它们基于从图表的底层数据表生成的数据进行训练,忽略了图表图像中的视觉趋势和模式,并且使用弱对齐的视觉语言骨干模型进行特定领域的训练,限制了它们在野外遇到图表时的泛化能力。我们解决了这些重要的缺陷,并引入了一种新的图表理解和推理模型ChartGema,它是在PaliGema的基础上发展起来的。ChartGema不依赖于底层数据表,而是针对直接从图表图像生成的指令调整数据进行培训,从而从不同的图表集中捕获高级趋势和低级可视信息。我们的简单方法在横跨图表摘要、问题回答和事实核查的5个基准测试中获得了最先进的结果,我们对现实世界图表的详细定性研究表明,与同时代的图表相比,ChartGema生成的摘要更现实、更真实。我们在此HTTPS URL上发布代码、模型检查点、数据集和演示。

[NLP-37] ELCC: the Emergent Language Corpus Collection
[NLP-37] ELCC:紧急语言库收藏

链接: https://arxiv.org/abs/2407.04158
作者: Brendon Boldt,David Mortensen
关键词: open source implementations, emergent communication systems, Language Corpus Collection, Emergent Language Corpus, collected from open
中文关键词: 开源实现、紧急通信系统、语言库集合、紧急语言库、从开放收集
类目: Computation and Language (cs.CL)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora collected from open source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex tasks like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length). Currently, research studying emergent languages requires directly running different systems which takes time away from actual analyses of such languages, limits the variety of languages that are studied, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial collection of well-documented emergent language corpora, then, will enable new directions of research which focus their purview on the properties of emergent languages themselves rather than on experimental apparatus.
摘要:我们介绍了紧急语言语料库集合(ELCC):这是从文献中的紧急通信系统的开源实现中收集的语料库的集合。这些系统包括各种信号游戏环境,以及更复杂的任务,如社交演绎游戏和具体化导航。用描述源系统的特征的元数据以及对语料库的一组分析(例如,大小、熵、平均消息长度)来注释每个语料库。目前,对新兴语言的研究需要直接运行不同的系统,这占用了对此类语言的实际分析的时间,限制了所研究的语言的种类,并为没有深度学习背景的研究人员提供了进入障碍。因此,大量记载良好的新兴语言语料库的收集将使新的研究方向成为可能,这些研究的重点是新兴语言本身的特性,而不是实验设备。

[NLP-38] Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
[NLP-38] 保护多轮对话语言模型免受分布式后门触发器的影响

链接: https://arxiv.org/abs/2407.04151
作者: Terry Tong,Jiashu Xu,Qin Liu,Muhao Chen
关键词: conversational large language, popular LLM utilization, large language models, multi-turn conversational large, conversational large
中文关键词: 对话式大型语言、流行的LLM利用、大型语言模型、多轮对话式大型、大型对话式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Submitted to EMNLP 2024

点击查看摘要

Abstract:The security of multi-turn conversational large language models (LLMs) is understudied despite it being one of the most popular LLM utilization. Specifically, LLMs are vulnerable to data poisoning backdoor attacks, where an adversary manipulates the training data to cause the model to output malicious responses to predefined triggers. Specific to the multi-turn dialogue setting, LLMs are at the risk of even more harmful and stealthy backdoor attacks where the backdoor triggers may span across multiple utterances, giving lee-way to context-driven attacks. In this paper, we explore a novel distributed backdoor trigger attack that serves to be an extra tool in an adversary’s toolbox that can interface with other single-turn attack strategies in a plug and play manner. Results on two representative defense mechanisms indicate that distributed backdoor triggers are robust against existing defense strategies which are designed for single-turn user-model interactions, motivating us to propose a new defense strategy for the multi-turn dialogue setting that is more challenging. To this end, we also explore a novel contrastive decoding based defense that is able to mitigate the backdoor with a low computational tradeoff.
摘要:多话轮会话大语言模型(LLMS)是目前应用最广泛的大语言模型之一,但其安全性仍未得到充分的研究。具体地说,LLM容易受到数据中毒后门攻击,在这种攻击中,敌手操纵训练数据,使模型输出对预定义触发器的恶意响应。具体到多轮对话设置,LLM面临着更具危害性和更隐蔽的后门攻击的风险,其中后门触发可能跨越多个话语,使Lee-way成为上下文驱动的攻击。在本文中,我们探索了一种新型的分布式后门触发攻击,它是对手工具箱中的一个额外工具,可以以即插即用的方式与其他单轮攻击策略对接。在两种典型的防御机制上的结果表明,分布式后门触发器对现有的针对单轮用户-模型交互的防御策略具有很强的鲁棒性,这促使我们提出了一种新的针对更具挑战性的多轮对话环境的防御策略。为此,我们还探索了一种新的基于对比解码的防御方案,该方案能够以较低的计算代价缓解后门问题。

[NLP-39] owards Automating Text Annotation: A Case Study on Semantic Proximity Annotation using GPT-4
[NLP-39] owards自动文本注释:使用GPT-4的语义接近性注释案例研究

链接: https://arxiv.org/abs/2407.04130
作者: Sachin Yadav,Tejaswi Choppa,Dominik Schlechtweg
关键词: data annotation process, automatic prompting techniques, paper explores, annotation process, automatic prompts
中文关键词: 数据注释过程、自动提示技术、论文探索、注释过程、自动提示
类目: Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:This paper explores using GPT-3.5 and GPT-4 to automate the data annotation process with automatic prompting techniques. The main aim of this paper is to reuse human annotation guidelines along with some annotated data to design automatic prompts for LLMs, focusing on the semantic proximity annotation task. Automatic prompts are compared to customized prompts. We further implement the prompting strategies into an open-source text annotation tool, enabling easy online use via the OpenAI API. Our study reveals the crucial role of accurate prompt design and suggests that prompting GPT-4 with human-like instructions is not straightforwardly possible for the semantic proximity task. We show that small modifications to the human guidelines already improve the performance, suggesting possible ways for future research.
摘要:本文探讨了使用GPT-3.5和GPT-4通过自动提示技术自动化数据注释过程。本文的主要目的是重用人类注释指南和一些注释数据来设计LLM的自动提示,重点关注语义邻近性注释任务。自动提示与自定义提示进行比较。我们进一步将提示策略实施到开源文本注释工具中,通过OpenAI API实现轻松在线使用。我们的研究揭示了准确提示设计的关键作用,并表明对于语义接近性任务来说,使用类似人类的指令提示GPT-4是不可能的。我们表明,对人类指南的微小修改已经提高了性能,为未来的研究提出了可能的方法。

[NLP-40] Query-Guided Self-Supervised Summarization of Nursing Notes
[NLP-40] 查询引导的护理笔记自我监督总结

链接: https://arxiv.org/abs/2407.04125
作者: Ya Gao,Hans Moen,Saila Koivusalo,Miika Koskinen,Pekka Marttinen
关键词: Electronic Health Records, patient health status, Health Records, Electronic Health, component of Electronic
中文关键词: 电子健康记录、患者健康状况、健康记录、电子健康、电子组件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Nursing notes, an important component of Electronic Health Records (EHRs), keep track of the progression of a patient’s health status during a care episode. Distilling the key information in nursing notes through text summarization techniques can improve clinicians’ efficiency in understanding patients’ conditions when reviewing nursing notes. However, existing abstractive summarization methods in the clinical setting have often overlooked nursing notes and require the creation of reference summaries for supervision signals, which is time-consuming. In this work, we introduce QGSumm, a query-guided self-supervised domain adaptation framework for nursing note summarization. Using patient-related clinical queries as guidance, our approach generates high-quality, patient-centered summaries without relying on reference summaries for training. Through automatic and manual evaluation by an expert clinician, we demonstrate the strengths of our approach compared to the state-of-the-art Large Language Models (LLMs) in both zero-shot and few-shot settings. Ultimately, our approach provides a new perspective on conditional text summarization, tailored to the specific interests of clinical personnel.
摘要:护理笔记是电子健康记录(EHR)的重要组成部分,用于跟踪患者在护理过程中的健康状况。通过文本摘要技术提取护理笔记中的关键信息,可以提高临床医生在审阅护理笔记时了解患者病情的效率。然而,现有的临床环境中的摘要方法往往忽略了护理笔记,并且需要为监控信号创建参考摘要,这是耗时的。在这项工作中,我们介绍了QGSumm,一个查询引导的自监督领域适应框架,用于护理笔记摘要。使用与患者相关的临床查询作为指导,我们的方法生成高质量的、以患者为中心的摘要,而不依赖参考摘要进行培训。通过专家临床医生的自动和手动评估,我们展示了与最先进的大型语言模型(LLM)相比,我们的方法在零激发和少激发设置下的优势。最终,我们的方法为条件文本摘要提供了一个新的视角,为临床人员的特定兴趣量身定做。

[NLP-41] Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models
[NLP-41] 幻觉检测:在大型语言模型中稳健地辨别可靠的答案

链接: https://arxiv.org/abs/2407.04121
作者: Yuyan Chen,Qiang Fu,Yichen Yuan,Zhihao Wen,Ge Fan,Dayiheng Liu,Dongmei Zhang,Zhixu Li,Yanghua Xiao
关键词: Large Language Models, language processing tasks, natural language processing, Large Language, Language Models
中文关键词: 大型语言模型、语言处理任务、自然语言处理、大型语言、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to CIKM 2023 (Long Paper)

点击查看摘要

Abstract:Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks, including question answering and dialogue systems. However, a major drawback of LLMs is the issue of hallucination, where they generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences. In this paper, we propose a robust discriminator named RelD to effectively detect hallucination in LLMs’ generated answers. RelD is trained on the constructed RelQA, a bilingual question-answering dialogue dataset along with answers generated by LLMs and a comprehensive set of metrics. Our experimental results demonstrate that the proposed RelD successfully detects hallucination in the answers generated by diverse LLMs. Moreover, it performs well in distinguishing hallucination in LLMs’ generated answers from both in-distribution and out-of-distribution datasets. Additionally, we also conduct a thorough analysis of the types of hallucinations that occur and present valuable insights. This research significantly contributes to the detection of reliable answers generated by LLMs and holds noteworthy implications for mitigating hallucination in the future work.
摘要:大语言模型在各种自然语言处理任务中得到了广泛的应用,包括问答和对话系统。然而,LLMS的一个主要缺点是产生幻觉的问题,即它们产生偏离输入源的不忠实或不一致的内容,从而导致严重的后果。在本文中,我们提出了一种名为Reld的稳健鉴别器来有效地检测LLMS生成的答案中的幻觉。Reld接受了关于构建的RelQA的培训,这是一个双语问答对话数据集,以及由LLMS生成的答案和一套全面的指标。我们的实验结果表明,所提出的RELD成功地检测到由不同LLM生成的答案中的幻觉。此外,它在从分布内和分布外的数据集中区分LLMS生成的答案中的幻觉方面表现良好。此外,我们还对产生的幻觉类型进行了彻底的分析,并提出了有价值的见解。这项研究对检测LLMS产生的可靠答案做出了重要贡献,并对未来的工作中减轻幻觉具有重要意义。

[NLP-42] MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization
[NLP-42] MAPO:通过模型自适应提示优化提高大型语言模型性能

链接: https://arxiv.org/abs/2407.04118
作者: Yuyan Chen,Zhihao Wen,Ge Fan,Zhengyu Chen,Wei Wu,Dayiheng Liu,Zhixu Li,Bang Liu,Yanghua Xiao
关键词: Large Language Models, leverage Large Language, Language Models, Large Language, leverage Large
中文关键词: 大型语言模型,利用大型语言,语言模型,大型语言,利用大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2023 (Findings)

点击查看摘要

Abstract:Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLM), has drawn a lot of attention from the research community. The existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. However, a good prompt is not solely defined by its wording, but also binds to the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. Then we novelly propose a model-adaptive prompt optimizer (MAPO) method that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements over various downstream tasks.
摘要:提示工程作为利用大型语言模型(LLM)的一种高效有效的方法,引起了研究界的广泛关注。现有的研究主要强调使提示适应特定任务而不是特定的LLM的重要性。然而,好的提示不仅仅由其措辞来定义,还与相关LLM的性质有约束力。在这项工作中,我们首先量化地证明不同的提示应该适应不同的LLM,以增强它们在NLP中各种下游任务中的能力。然后,我们新颖地提出了一种模型自适应提示优化器(MAPO)方法,该方法可以优化下游任务中每个特定LLM的原始提示。大量实验表明,所提出的方法可以有效地细化LLM的提示,从而对各种下游任务进行显着改进。

[NLP-43] Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
[NLP-43] 作为后门触发器的未来事件:调查LLM中的临时漏洞

链接: https://arxiv.org/abs/2407.04108
作者: Sara Price,Arjun Panickssery,Sam Bowman,Asa Cooper Stickland
关键词: hidden behaviors, https URL, URL, Backdoors, model
中文关键词: 隐藏行为、https URL、URL、后门、模型
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is in the future relative to when it was trained. Through prompting experiments and by probing internal activations, we show that current large language models (LLMs) can distinguish past from future events, with probes on model activations achieving 90% accuracy. We train models with backdoors triggered by a temporal distributional shift; they activate when the model is exposed to news headlines beyond their training cut-off dates. Fine-tuning on helpful, harmless and honest (HHH) data does not work well for removing simpler backdoor triggers but is effective on our backdoored models, although this distinction is smaller for the larger-scale model we tested. We also find that an activation-steering vector representing a model’s internal representation of the date influences the rate of backdoor activation. We take these results as initial evidence that, at least for models at the modest scale we test, standard safety measures are enough to remove these backdoors. We publicly release all relevant code (this https URL), datasets (this https URL), and models (this https URL).
摘要:后门是隐藏的行为,只有在部署了人工智能系统后才会触发。想要创建成功的后门的糟糕参与者必须在设计时避免在培训和评估期间激活。由于在这些阶段中使用的数据通常只包含有关已经发生的事件的信息,因此简单后门触发器的一个组件可以是识别相对于训练时间的未来数据的模型。通过提示性实验和对内部激活的探测,我们发现当前的大语言模型能够区分过去和未来的事件,其中对模型激活的探测准确率达到90%。我们用由时间分布转变触发的后门来训练模型;当模型在训练截止日期之后接触到新闻标题时,它们就会激活。对有用的、无害的和诚实的(HHH)数据进行微调不能很好地删除更简单的后门触发因素,但在我们的后门模型上有效,尽管这种区别在我们测试的较大规模的模型中较小。我们还发现,代表模型内部数据表示的激活导向向量影响后门激活的速度。我们将这些结果作为初步证据,证明至少对于我们测试的中等规模的车型来说,标准的安全措施足以消除这些后门。我们公开发布所有相关代码(此HTTPS URL)、数据集(此HTTPS URL)和模型(此HTTPS URL)。

[NLP-44] MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis
[NLP-44] MiniGPT-Med:大型语言模型作为放射诊断通用界面

链接: https://arxiv.org/abs/2407.04106
作者: Asma Alkhaldi,Raneem Alnajim,Layan Alabdullatef,Rawan Alyahya,Jun Chen,Deyao Zhu,Ahmed Alsinan,Mohamed Elhoseiny
关键词: Recent advancements, refining diagnostic procedures, precipitated significant breakthroughs, artificial intelligence, breakthroughs in healthcare
中文关键词: 最近的进步,完善诊断程序,促成了重大突破,人工智能,医疗保健领域的突破
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in refining diagnostic procedures. However, previous studies have often been constrained to limited functionalities. This study introduces MiniGPT-Med, a vision-language model derived from large-scale language models and tailored for medical applications. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm MiniGPT-Med’s superior performance in disease grounding, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance on medical report generation, higher than the previous best model by 19% accuracy. MiniGPT-Med promises to become a general interface for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications.
摘要:人工智能(AI)的最新进展推动了医疗保健领域的重大突破,特别是在完善诊断程序方面。然而,以前的研究往往局限于有限的功能。本文介绍了一种基于大规模语言模型的视觉语言模型MiniGPT-Med,它是为医学应用量身定做的。MiniGPT-Med在包括X射线、CT扫描和核磁共振成像在内的各种成像方式中展示了非凡的多功能性,增强了其实用性。该模型能够在医学图像中执行诸如医疗报告生成、视觉问答(VQA)和疾病识别等任务。它对图像和文本临床数据的集成处理显著提高了诊断准确率。我们的经验评估证实了MiniGPT-Med在疾病基础、医疗报告生成和VQA基准方面的卓越表现,代表着朝着缩小在协助放射学实践方面的差距迈出的重要一步。此外,它在医疗报告生成方面实现了最先进的性能,比以前最好的模型提高了19%的准确率。MiniGPT-Med有望成为放射学诊断的通用界面,在广泛的医学成像应用中提高诊断效率。

[NLP-45] Can Pre-trained Language Models Understand Chinese Humor?
[NLP-45] 预先训练的语言模型能理解中国幽默吗?

链接: https://arxiv.org/abs/2407.04105
作者: Yuyan Chen,Zhixu Li,Jiaqing Liang,Yanghua Xiao,Bang Liu,Yunwen Chen
关键词: natural language processing, Humor understanding, important and challenging, challenging research, research in natural
中文关键词: 自然语言处理、幽默理解、重要且具有挑战性、具有挑战性的研究、自然研究
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to WSDM 2022

点击查看摘要

Abstract:Humor understanding is an important and challenging research in natural language processing. As the popularity of pre-trained language models (PLMs), some recent work makes preliminary attempts to adopt PLMs for humor recognition and generation. However, these simple attempts do not substantially answer the question: \em whether PLMs are capable of humor understanding? This paper is the first work that systematically investigates the humor understanding ability of PLMs. For this purpose, a comprehensive framework with three evaluation steps and four evaluation tasks is designed. We also construct a comprehensive Chinese humor dataset, which can fully meet all the data requirements of the proposed evaluation framework. Our empirical study on the Chinese humor dataset yields some valuable observations, which are of great guiding value for future optimization of PLMs in humor understanding and generation.
摘要:幽默理解是自然语言处理中一项重要且具有挑战性的研究。随着预训练语言模型(PLM)的流行,最近的一些工作初步尝试采用PLM进行幽默识别和生成。然而,这些简单的尝试并不能从根本上回答这个问题:PLM是否能够理解幽默?本文是第一部系统研究PLM幽默理解能力的作品。为此,设计了一个包含三个评估步骤和四个评估任务的综合框架。我们还构建了一个全面的中国幽默数据集,可以完全满足所提出的评估框架的所有数据要求。我们对中国幽默数据集的实证研究得出了一些有价值的观察结果,这对未来优化PLM在幽默理解和生成方面具有重要的指导价值。

[NLP-46] Stephanie: Step-by-Step Dialogues for Mimicking Human Interactions in Social Conversations
[NLP-46] 斯蒂芬妮:在社交对话中模仿人类互动的分步对话

链接: https://arxiv.org/abs/2407.04093
作者: Hao Yang,Hongyuan Lu,Xinhua Zeng,Yang Liu,Xiang Zhang,Haoran Yang,Yumeng Zhang,Yiran Wei,Wai Lam
关键词: rapidly evolving field, systems primarily employ, dialogue systems primarily, natural language processing, rapidly evolving
中文关键词: 快速发展的领域,主要采用的系统,主要是对话系统,自然语言处理,快速发展
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the rapidly evolving field of natural language processing, dialogue systems primarily employ a single-step dialogue paradigm. Although this paradigm is efficient, it lacks the depth and fluidity of human interactions and does not appear natural. We introduce a novel \textbfStep-by-Step Dialogue Paradigm (Stephanie), designed to mimic the ongoing dynamic nature of human conversations. By employing a dual learning strategy and a further-split post-editing method, we generated and utilized a high-quality step-by-step dialogue dataset to fine-tune existing large language models, enabling them to perform step-by-step dialogues. We thoroughly present Stephanie. Tailored automatic and human evaluations are conducted to assess its effectiveness compared to the traditional single-step dialogue paradigm. We will release code, Stephanie datasets, and Stephanie LLMs to facilitate the future of chatbot eras.
摘要:在快速发展的自然语言处理领域,对话系统主要采用一步对话范式。尽管这种范式很有效,但它缺乏人类互动的深度和流动性,而且看起来不自然。我们介绍了一部小说\textbf分步对话范式(斯蒂芬妮),旨在模仿人类对话的持续动态本质。通过采用双重学习策略和进一步分裂的后期编辑方法,我们生成并利用高质量的分步对话数据集来微调现有的大型语言模型,使它们能够执行分步对话。我们彻底介绍斯蒂芬妮。与传统的一步对话范式相比,进行了量身定制的自动和人工评估,以评估其有效性。我们将发布代码、Stephanie数据集和Stephanie LLM,以促进聊天机器人时代的未来。

[NLP-47] AXOLOTL24 Shared Task on Multilingual Explainable Semantic Change Modeling
[NLP-47] AX OLOTL 24多语言可解释语义变化建模的共享任务

链接: https://arxiv.org/abs/2407.04079
作者: Mariia Fedorova,Timothee Mickus,Niko Partanen,Janine Siewert,Elena Spaziani,Andrey Kutuzov
关键词: multilingual explainable semantic, modeling shared task, semantic change modeling, explainable semantic change, shared task
中文关键词: 多语言可解释语义,建模共享任务,语义变化建模,可解释语义变化,共享任务
类目: Computation and Language (cs.CL)
备注: Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change (ACL’24)

点击查看摘要

Abstract:This paper describes the organization and findings of AXOLOTL’24, the first multilingual explainable semantic change modeling shared task. We present new sense-annotated diachronic semantic change datasets for Finnish and Russian which were employed in the shared task, along with a surprise test-only German dataset borrowed from an existing source. The setup of AXOLOTL’24 is new to the semantic change modeling field, and involves subtasks of identifying unknown (novel) senses and providing dictionary-like definitions to these senses. The methods of the winning teams are described and compared, thus paving a path towards explainability in computational approaches to historical change of meaning.
摘要:本文描述了AXOLOTL ’ 24的组织和调查结果,这是第一个多语言可解释的语义变化建模共享任务。我们展示了在共享任务中使用的芬兰语和俄语的新的有意义注释的历时语义变化数据集,以及从现有来源借用的仅进行意外测试的德国数据集。AXOLOTL ’ 24的设置对于语义变化建模领域来说是新的,涉及识别未知(新颖)意义并为这些意义提供类似词典的定义的子任务。获胜团队的方法被描述和比较,从而为用计算方法解释意义的历史变化铺平了道路。

[NLP-48] DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
[NLP-48] DotaMath:通过代码辅助和数学推理自我纠正分解思想

链接: https://arxiv.org/abs/2407.04078
作者: Chengpeng Li,Guanting Dong,Mingfeng Xue,Ru Peng,Xiang Wang,Dayiheng Liu
关键词: Large language models, made impressive progress, Large language, handling simple math, complex mathematical tasks
中文关键词: 大型语言模型,取得了令人印象深刻的进步,大型语言,处理简单的数学、复杂的数学任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) have made impressive progress in handling simple math problems, yet they still struggle with more challenging and complex mathematical tasks. In this paper, we introduce a series of LLMs that employs the Decomposition of thought with code assistance and self-correction for mathematical reasoning, dubbed as DotaMath. DotaMath models tackle complex mathematical tasks by decomposing them into simpler logical subtasks, leveraging code to solve these subtasks, obtaining fine-grained feedback from the code interpreter, and engaging in self-reflection and correction. By annotating diverse interactive tool-use trajectories and employing query evolution on GSM8K and MATH datasets, we generate an instruction fine-tuning dataset called DotaMathQA with 574K query-response pairs. We train a series of base LLMs using imitation learning on DotaMathQA, resulting in DotaMath models that achieve remarkable performance compared to open-source LLMs across various in-domain and out-of-domain benchmarks. Notably, DotaMath-deepseek-7B showcases an outstanding performance of 64.8% on the competitive MATH dataset and 86.7% on GSM8K. Besides, DotaMath-deepseek-7B maintains strong competitiveness on a series of in-domain and out-of-domain benchmarks (Avg. 80.1%). Looking forward, we anticipate that the DotaMath paradigm will open new pathways for addressing intricate mathematical problems. Our code is publicly available at this https URL.
摘要:大型语言模型在处理简单的数学问题方面取得了令人印象深刻的进展,但它们仍然在处理更具挑战性和更复杂的数学任务。在本文中,我们介绍了一系列使用代码辅助和自校正思想分解进行数学推理的LLMS,称为DotaMath。DotaMath模型通过将复杂的数学任务分解为更简单的逻辑子任务,利用代码来解决这些子任务,从代码解释器获得细粒度的反馈,并进行自我反思和更正,来处理复杂的数学任务。通过标注不同的交互工具使用轨迹,并在GSM8K和数学数据集上使用查询进化,我们生成了一个具有574k查询-响应对的指令微调数据集DotaMathQA。我们在DotaMathQA上使用模仿学习训练了一系列基本的LLM,导致DotaMath模型在各种域内和域外基准测试中取得了与开源LLM相比显著的性能。值得注意的是,DotaMath-DeepSeek-7B在竞争性数学数据集上的出色表现为64.8%,在GSM8K上的表现为86.7%。此外,DotaMath-DeepSeek-7B在一系列域内和域外基准(Avg.80.1%)。展望未来,我们预计DotaMath范例将为解决复杂的数学问题开辟新的途径。我们的代码在此HTTPS URL上公开提供。

[NLP-49] A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges Limitations and Recommendations
[NLP-49] 评估大型语言模型的系统调查和批判性评论:挑战、限制和建议

链接: https://arxiv.org/abs/2407.04069
作者: Md Tahmid Rahman Laskar,Sawsan Alqahtani,M Saiful Bari,Mizanur Rahman,Mohammad Abdullah Matin Khan,Haidar Khan,Israt Jahan,Amran Bhuiyan,Chee Wei Tan,Md Rizwan Parvez,Enamul Hoque,Shafiq Joty,Jimmy Huang
关键词: Large Language Models, Large Language, recently gained significant, gained significant attention, significant attention due
中文关键词: 大型语言模型,大型语言,最近获得了重大关注,受到了重大关注,由于
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
摘要:大型语言模型(LLM)最近因其在各个领域执行多样化任务的出色能力而受到广泛关注。然而,在将这些模型部署到现实世界的应用程序中之前,对它们进行彻底评估至关重要,以确保它们产生可靠的性能。尽管评估LLM在社区中的重要性已得到公认,但评估过程的复杂性导致了评估设置的不同,导致调查结果和解释的不一致。为了解决这个问题,我们系统地审查了LLM评估各个步骤中导致这些不一致和不可靠评估的主要挑战和限制。根据我们的批判性审查,我们提出我们的观点和建议,以确保LLM评估具有可重复性、可靠性和稳健性。

[NLP-50] Semantic Graphs for Syntactic Simplification: A Revisit from the Age of LLM
[NLP-50] 用于语法简化的语义图:法学硕士时代的重温

链接: https://arxiv.org/abs/2407.04067
作者: Peiran Yao,Kostyantyn Guzhva,Denilson Barbosa
关键词: simplify downstream NLP, Abstract Meaning Representation, downstream NLP tasks, Abstract Meaning, downstream NLP
中文关键词: 简化下游NLP、抽象意义表示、下游NLP任务、抽象意义、下游NLP
类目: Computation and Language (cs.CL)
备注: Accepted at TextGraphs-17 @ ACL 2024

点击查看摘要

Abstract:Symbolic sentence meaning representations, such as AMR (Abstract Meaning Representation) provide expressive and structured semantic graphs that act as intermediates that simplify downstream NLP tasks. However, the instruction-following capability of large language models (LLMs) offers a shortcut to effectively solve NLP tasks, questioning the utility of semantic graphs. Meanwhile, recent work has also shown the difficulty of using meaning representations merely as a helpful auxiliary for LLMs. We revisit the position of semantic graphs in syntactic simplification, the task of simplifying sentence structures while preserving their meaning, which requires semantic understanding, and evaluate it on a new complex and natural dataset. The AMR-based method that we propose, AMRS ^3 , demonstrates that state-of-the-art meaning representations can lead to easy-to-implement simplification methods with competitive performance and unique advantages in cost, interpretability, and generalization. With AMRS ^3 as an anchor, we discover that syntactic simplification is a task where semantic graphs are helpful in LLM prompting. We propose AMRCoC prompting that guides LLMs to emulate graph algorithms for explicit symbolic reasoning on AMR graphs, and show its potential for improving LLM on semantic-centered tasks like syntactic simplification.
摘要:符号句义表征,如抽象意义表征,提供了富有表现力和结构化的语义图,作为简化下游自然语言处理任务的中介。然而,大型语言模型的指令跟随能力为有效解决自然语言处理任务提供了一条捷径,这对语义图的有效性提出了质疑。同时,最近的研究也表明,将意义表征仅仅作为LLM的一个有用的辅助手段是困难的。我们回顾了语义图在句法简化中的地位,即在保持句子结构含义的同时简化句子结构的任务,这需要语义理解,并在一个新的复杂和自然的数据集上进行评估。我们提出的基于AMR的方法AMRS^3表明,最先进的意义表示可以导致易于实现的简化方法,具有竞争力的性能和在成本、可解释性和泛化方面的独特优势。以AMRS^3为锚,我们发现句法简化是一项语义图有助于LLM提示的任务。我们提出了AMRCoC提示,指导LLMS在AMR图上模拟显式符号推理的图算法,并展示了它在句法简化等以语义为中心的任务上改进LLM的潜力。

[NLP-51] Deep Content Understanding Toward Entity and Aspect Target Sentiment Analysis on Foundation Models
[NLP-51] 对基础模型上的实体和方面目标情绪分析的深度内容理解

链接: https://arxiv.org/abs/2407.04050
作者: Vorakit Vorakitphan,Milos Basic,Guilhaume Leroy Meline
关键词: Sentiment Triplet Extraction, Introducing Entity-Aspect Sentiment, Triplet Extraction, Entity-Aspect Sentiment Triplet, separating aspect categories
中文关键词: 情感三重组提取,引入潜在方面情感,三重组提取,潜在方面情感三重组,分离方面类别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s)

点击查看摘要

Abstract:Introducing Entity-Aspect Sentiment Triplet Extraction (EASTE), a novel Aspect-Based Sentiment Analysis (ABSA) task which extends Target-Aspect-Sentiment Detection (TASD) by separating aspect categories (e.g., food#quality) into pre-defined entities (e.g., meal, drink) and aspects (e.g., taste, freshness) which add a fine-gainer level of complexity, yet help exposing true sentiment of chained aspect to its entity. We explore the task of EASTE solving capabilities of language models based on transformers architecture from our proposed unified-loss approach via token classification task using BERT architecture to text generative models such as Flan-T5, Flan-Ul2 to Llama2, Llama3 and Mixtral employing different alignment techniques such as zero/few-shot learning, Parameter Efficient Fine Tuning (PEFT) such as Low-Rank Adaptation (LoRA). The model performances are evaluated on the SamEval-2016 benchmark dataset representing the fair comparison to existing works. Our research not only aims to achieve high performance on the EASTE task but also investigates the impact of model size, type, and adaptation techniques on task performance. Ultimately, we provide detailed insights and achieving state-of-the-art results in complex sentiment analysis.
摘要:介绍了实体-方面情感三元组提取(EASTE),这是一种新的基于方面的情感分析(ABSA)任务,它通过将方面类别(例如,食物#质量)划分为预定义的实体(例如,食物、饮料)和方面(例如,味道、新鲜度)来扩展目标-方面-情感检测(TASD),从而增加了复杂性,但有助于将连锁方面的真实情感暴露给其实体。我们探索了基于转换器结构的语言模型的东方求解能力的任务,从我们提出的统一损失方法,通过使用BERT结构的令牌分类任务,到使用不同对齐技术的文本生成模型,如Flan-T5,Flan-Ul2到Llama2,Llama3和Mixtral,以及使用不同的对齐技术,如零/少镜头学习,参数高效微调(PEFT),如低秩自适应(LORA)。模型的性能在SamEval-2016基准数据集上进行了评估,与现有的工作进行了公平的比较。我们的研究不仅旨在实现EASTE任务的高性能,而且还考察了模型大小、类型和自适应技术对任务性能的影响。最终,我们提供了详细的见解,并在复杂的情感分析中获得了最先进的结果。

[NLP-52] Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis
[NLP-52] 使用基于无监督文本到语音合成的数据增强来改进重读语音识别

链接: https://arxiv.org/abs/2407.04047
作者: Cong-Thanh Do,Shuhei Imai,Rama Doddipatla,Thomas Hain
关键词: accented speech data, accented speech, accented speech recognition, speech data, speech
中文关键词: 口音语音数据,口音语音,口音语音识别,语音数据,语音
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to EUSIPCO 2024

点击查看摘要

Abstract:This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts by using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus are used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned to downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from Librispeech corpus.
摘要:本文研究了无监督文本到语音合成(TTS)作为一种数据增强方法来改进带口音的语音识别。TTS系统是用少量的重音语音训练数据及其伪标签而不是人工转录来训练的,因此没有监督。该方法使得能够使用重音语音数据而无需手动转录来执行用于重音语音识别的数据扩充。然后,使用TTS系统从文本提示生成的合成重音语音数据与可用的非重音语音数据组合,以训练自动语音识别(ASR)系统。ASR实验是在一个自监督学习框架中进行的,它使用的是一个Wav2ve2.0模型,该模型在大量的非监督重音数据上进行了预训练。用于训练无监督语料的重音语音数据是从L2-北极语料库和不列颠群岛语料库中挑选出来的朗读语音,而来自英语语料库爱丁堡国际口音的自发会话语音则被用作评估数据。实验结果表明,与使用来自Librispeech语料库的非重音语音数据微调的Wav2ve2.0基线相比,使用非监督TTS生成的合成重音语音数据微调到下游ASR任务的Wav2ve2.0模型的相对单词错误率降低了6.1%。

[NLP-53] Systematic Task Exploration with LLMs: A Study in Citation Text Generation
[NLP-53] 利用LLM进行系统任务探索:引文文本生成研究

链接: https://arxiv.org/abs/2407.04046
作者: Furkan Şahinuç,Ilia Kuznetsov,Yufang Hou,Iryna Gurevych
关键词: Large language models, Large language, creative natural language, natural language generation, bring unprecedented flexibility
中文关键词: 大型语言模型、大型语言、创意自然语言、自然语言生成,带来前所未有的灵活性
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 (Main)

点击查看摘要

Abstract:Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation – a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric and has not yet been tackled within the LLM paradigm. Our results highlight the importance of systematically investigating both task instruction and input configuration when prompting LLMs, and reveal non-trivial relationships between different evaluation metrics used for citation text generation. Additional human generation and human evaluation experiments provide new qualitative insights into the task to guide future research in citation text generation. We make our code and data publicly available.
摘要:大型语言模型在定义和执行复杂的、创造性的自然语言生成(NLG)任务方面带来了前所未有的灵活性。然而,这种灵活性带来了新的挑战,因为它在制定任务输入和指令以及评估模型性能方面引入了新的自由度。为了促进创造性NLG任务的探索,我们提出了一个由系统输入操作、参考数据和输出测量组成的三个组成部分的研究框架。我们使用这个框架来探索引文文本生成–这是一项流行的学术NLP任务,在任务定义和评估指标上缺乏共识,尚未在LLM范式中得到解决。我们的结果强调了在提示LLMS时系统研究任务指导和输入配置的重要性,并揭示了用于引文文本生成的不同评估指标之间的重要关系。额外的人类生成和人类评估实验为该任务提供了新的定性见解,以指导未来在引文文本生成方面的研究。我们公开我们的代码和数据。

[NLP-54] LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
[NLP-54] LLMAEL:大型语言模型是实体链接的良好上下文增强器

链接: https://arxiv.org/abs/2407.04020
作者: Amy Xin,Yunjia Qi,Zijun Yao,Fangwei Zhu,Kaisheng Zeng,Xu Bin,Lei Hou,Juanzi Li
关键词: Entity Linking, well-trained at mapping, Entity Linking LLMAEL, models, Entity
中文关键词: 实体链接,在地图方面训练有素,实体链接LLMAEL,模型,实体
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Entity Linking (EL) models are well-trained at mapping mentions to their corresponding entities according to a given context. However, EL models struggle to disambiguate long-tail entities due to their limited training data. Meanwhile, large language models (LLMs) are more robust at interpreting uncommon mentions. Yet, due to a lack of specialized training, LLMs suffer at generating correct entity IDs. Furthermore, training an LLM to perform EL is cost-intensive. Building upon these insights, we introduce LLM-Augmented Entity Linking LLMAEL, a plug-and-play approach to enhance entity linking through LLM data augmentation. We leverage LLMs as knowledgeable context augmenters, generating mention-centered descriptions as additional input, while preserving traditional EL models for task specific processing. Experiments on 6 standard datasets show that the vanilla LLMAEL outperforms baseline EL models in most cases, while the fine-tuned LLMAEL set the new state-of-the-art results across all 6 benchmarks.
摘要:实体链接(EL)模型擅长根据给定的上下文将提及映射到其对应的实体。然而,EL模型由于其有限的训练数据而难以消除长尾实体的歧义。同时,大型语言模型(LLM)在解释不常见的提及方面更加健壮。然而,由于缺乏专门的培训,小岛屿发展中国家难以生成正确的实体ID。此外,培训LLM以执行ELI是一项成本密集型工作。在这些见解的基础上,我们引入了LLM增强的实体链接LLMAEL,这是一种通过LLM数据增强来增强实体链接的即插即用方法。我们利用LLM作为知识丰富的上下文扩充器,生成以提及为中心的描述作为额外输入,同时保留传统的EL模型用于特定任务的处理。在6个标准数据集上的实验表明,在大多数情况下,Vanilla LLMAEL的性能优于基准EL模型,而经过微调的LLMAEL在所有6个基准测试中设置了新的最先进的结果。

[NLP-55] Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks Datasets and Challenges
[NLP-55] 探索方言连续体的历时和历时变化:任务数据集和挑战

链接: https://arxiv.org/abs/2407.04010
作者: Melis Çelikkol,Lydia Körber,Wei Zhao
关键词: Everlasting contact, language communities leads, leads to constant, communities leads, non-inclusive NLP technologies
中文关键词: 持久的联系、语言社区引领、持续引领、社区引领、非包容性NLP技术
类目: Computation and Language (cs.CL)
备注: LChange24 Camera Ready

点击查看摘要

Abstract:Everlasting contact between language communities leads to constant changes in languages over time, and gives rise to language varieties and dialects. However, the communities speaking non-standard language are often overlooked by non-inclusive NLP technologies. Recently, there has been a surge of interest in studying diatopic and diachronic changes in dialect NLP, but there is currently no research exploring the intersection of both. Our work aims to fill this gap by systematically reviewing diachronic and diatopic papers from a unified perspective. In this work, we critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic) in both spoken and written modalities. The tasks covered are diverse, including corpus construction, dialect distance estimation, and dialect geolocation prediction, among others. Moreover, we outline five open challenges regarding changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection. We hope that our work sheds light on future research towards inclusive computational methods and datasets for language varieties and dialects.
摘要:语言社区之间的长期接触导致语言随着时间的推移而不断变化,并产生语言变体和方言。然而,说非标准语言的社区往往被非包容性的NLP技术忽视。近年来,人们对方言自然语言处理中的异位变化和历时变化的研究方兴未艾,但目前还没有关于两者交集的研究。我们的工作旨在通过从统一的角度系统地审查历时和专题性论文来填补这一空白。在这项工作中,我们批判性地评估了来自三个语系(斯拉夫语、罗曼语和日耳曼语)的五种方言的九项任务和数据集。所涉及的任务多种多样,包括语料库建设、方言距离估计和方言地理位置预测等。此外,我们概述了关于方言使用随时间的变化、方言数据集的可靠性、说话人特征的重要性、方言覆盖范围有限以及数据收集中的伦理考虑的五个开放挑战。我们希望我们的工作对未来针对语言变体和方言的包容性计算方法和数据集的研究有所帮助。

[NLP-56] Unlocking the Potential of Model Merging for Low-Resource Languages
[NLP-56] 释放低资源语言模型合并的潜力

链接: https://arxiv.org/abs/2407.03994
作者: Mingxu Tao,Chen Zhang,Quzhe Huang,Tianyao Ma,Songfang Huang,Dongyan Zhao,Yansong Feng
关键词: Adapting large language, involves continual pre-training, typically involves continual, Adapting large, languages typically involves
中文关键词: 适应大型语言,涉及持续的预培训,通常涉及持续的,适应大型语言通常涉及
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting large language models (LLMs) to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT). However, this CT-then-SFT approach struggles with limited data in the context of low-resource languages, failing to balance language modeling and task-solving capabilities. We thus propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. We use model merging to develop task-solving LLMs for low-resource languages without SFT data in the target languages. Our experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data. Observing performance saturation in model merging with more training tokens, we further analyze the merging process and introduce a slack variable to the model merging algorithm to mitigate the loss of important parameters, thereby enhancing performance. We hope that model merging can benefit more human languages suffering from data scarcity with its higher data efficiency.
摘要:使大语言模型(LLM)适应新的语言通常需要连续的预训练(CT),然后是有监督的微调(SFT)。然而,这种CT-Then-SFT方法在低资源语言的背景下难以处理有限的数据,无法平衡语言建模和任务求解能力。因此,我们建议将模型合并作为低资源语言的替代方案,将具有不同功能的模型合并到单个模型中,而无需额外的培训。我们使用模型合并的方法来开发针对目标语言中没有SFT数据的低资源语言的任务求解LLMS。我们在Llama-2-7B上的实验表明,模型合并有效地赋予了低资源语言的LLM任务求解能力,在数据极其稀缺的场景中表现优于CT-THEN-SFT。通过观察训练权标较多的模型合并时的性能饱和,进一步分析了模型合并的过程,并在模型合并算法中引入松弛变量,以减少重要参数的损失,从而提高性能。我们希望模型合并能够以其更高的数据效率造福于更多的人类语言。

[NLP-57] A Survey on Natural Language Counterfactual Generation
[NLP-57] 自然语言反事实生成研究综述

链接: https://arxiv.org/abs/2407.03993
作者: Yongjie Wang,Xiaoqi Qiu,Yu Yue,Xu Guo,Zhiwei Zeng,Yuhong Feng,Zhiqi Shen
关键词: Natural Language Counterfactual, Natural Language, modified text, Counterfactual generation aims, aims to minimally
中文关键词: 自然语言反事实,自然语言,修改文本,反事实生成目标,旨在最低限度地
类目: Computation and Language (cs.CL)
备注: A survey paper

点击查看摘要

Abstract:Natural Language Counterfactual generation aims to minimally modify a given text such that the modified text will be classified into a different class. The generated counterfactuals provide insight into the reasoning behind a model’s predictions by highlighting which words significantly influence the outcomes. Additionally, they can be used to detect model fairness issues or augment the training data to enhance the model’s robustness. A substantial amount of research has been conducted to generate counterfactuals for various NLP tasks, employing different models and methodologies. With the rapid growth of studies in this field, a systematic review is crucial to guide future researchers and developers. To bridge this gap, this survey comprehensively overview textual counterfactual generation methods, particularly including those based on Large Language Models. We propose a new taxonomy that categorizes the generation methods into four groups and systematically summarize the metrics for evaluating the generation quality. Finally, we discuss ongoing research challenges and outline promising directions for future work.
摘要:自然语言反事实生成的目的是对给定的文本进行最小限度的修改,从而将修改后的文本归入不同的类别。生成的反事实通过突出哪些词对结果有重大影响,提供了对模型预测背后的推理的洞察。此外,它们还可用于检测模型公平性问题或增加训练数据以增强模型的稳健性。已经进行了大量的研究,以利用不同的模型和方法为各种自然语言处理任务产生反事实。随着这一领域研究的快速增长,系统的综述对于指导未来的研究人员和开发人员至关重要。为了弥补这一差距,本调查全面概述了文本反事实生成方法,特别是那些基于大型语言模型的方法。我们提出了一种新的分类方法,将生成方法分为四类,并系统地总结了评价生成质量的指标。最后,我们讨论了正在进行的研究挑战,并概述了未来工作的有希望的方向。

[NLP-58] Benchmarking Complex Instruction-Following with Multiple Constraints Composition
[NLP-58] 对具有多重约束的复杂教学进行基准测试

链接: https://arxiv.org/abs/2407.03978
作者: Bosi Wen,Pei Ke,Xiaotao Gu,Lindong Wu,Hao Huang,Jinfeng Zhou,Wenchuang Li,Binxin Hu,Wendy Gao,Jiaxin Xu,Yiming Liu,Jie Tang,Hongning Wang,Minlie Huang
关键词: large language models, language models, complex instructions, fundamental capabilities, capabilities of large
中文关键词: 大型语言模型、语言模型、复杂指令、基本能力、大型能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.
摘要:指令遵循是大型语言模型的基本能力之一。随着LLMS能力的不断提高,它们越来越多地被应用于处理现实世界场景中复杂的人类指令。因此,如何评价LLMS的复杂指令遵循能力成为一个关键的研究问题。现有的基准测试主要集中于对人类指令中不同类型的约束进行建模,而忽略了复杂指令中不可缺少的不同约束的组合。为此,我们提出了ComplexBch,这是一个全面评估LLM遵循由多个约束组成的复杂指令的能力的基准。我们提出了一种复杂指令的层次分类,包括4种约束类型、19种约束维度和4种组合类型,并相应地手动收集高质量的数据集。为了使评价更可靠,我们在基于LLM的评价器中增加了规则,以有效地验证生成的文本是否满足每个约束和组合。此外,我们还根据不同作文类型确定的依存结构得到最终的评价分数。在处理具有多个约束组合的复杂指令时,ComplexB边发现了现有LLM中的重大缺陷。

[NLP-59] LLM Roleplay: Simulating Human-Chatbot Interaction
[NLP-59] LLM角色扮演:模拟人类与聊天机器人互动

链接: https://arxiv.org/abs/2407.03974
作者: Hovhannes Tamoyan,Hendrik Schuff,Iryna Gurevych
关键词: chatbots requires collecting, users’ sociodemographic backgrounds, requires collecting, reflect the breadth, breadth of users’
中文关键词: 聊天机器人需要收集,用户的社会人口背景,需要收集,反映用户的广度,广度
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The development of chatbots requires collecting a large number of human-chatbot dialogues to reflect the breadth of users’ sociodemographic backgrounds and conversational goals. However, the resource requirements to conduct the respective user studies can be prohibitively high and often only allow for a narrow analysis of specific dialogue goals and participant demographics. In this paper, we propose LLM-Roleplay: a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction. LLM-Roleplay can be applied to generate dialogues with any type of chatbot and uses large language models (LLMs) to play the role of textually described personas. To validate our method we collect natural human-chatbot dialogues from different sociodemographic groups and conduct a human evaluation to compare real human-chatbot dialogues with our generated dialogues. We compare the abilities of state-of-the-art LLMs in embodying personas and holding a conversation and find that our method can simulate human-chatbot dialogues with a high indistinguishability rate.
摘要:聊天机器人的开发需要收集大量的人类-聊天机器人对话,以反映用户社会人口学背景和对话目标的广度。然而,进行各自的用户研究所需的资源可能高得令人望而却步,而且往往只允许对具体对话目标和参与者人口统计进行狭隘的分析。在本文中,我们提出了LLM-Role Play:一种面向目标、基于角色的方法,用于自动生成模拟人与聊天机器人交互的各种多轮对话。LLM-Role Play可用于生成与任何类型聊天机器人的对话,并使用大型语言模型(LLM)来扮演文本描述的人物角色。为了验证我们的方法,我们从不同的社会人口统计群体中收集了自然的人类-聊天机器人对话,并进行了人类评估,以比较真实的人类-聊天机器人对话和我们生成的对话。我们比较了最新的LLMS在体现人物角色和进行对话方面的能力,发现我们的方法可以模拟人与聊天机器人的对话,具有很高的不可识别率。

[NLP-60] Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks
[NLP-60] 研究指令多样性和任务难度在机器人操纵任务中的作用

链接: https://arxiv.org/abs/2407.03967
作者: Amit Parekh,Nikolas Vitsakis,Alessandro Suglia,Ioannis Konstas
关键词: models based solely, data fails, true robustness, based solely, fails to capture
中文关键词: 仅基于模型,数据失败,真正的稳健性,仅基于,无法捕获
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Evaluating the generalisation capabilities of multimodal models based solely on their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across language and vision modalities, and increased task complexity. The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes, raising concerns about overfitting to spurious correlations. By employing this evaluation framework on current Transformer-based multimodal models for robotic manipulation tasks, we uncover limitations and suggest future advancements should focus on architectural and training innovations that better integrate multimodal inputs, enhancing a model’s generalisation prowess by prioritising sensitivity to input content over incidental correlations.
摘要:仅基于多模式模型在非分布数据上的性能来评估多模式模型的泛化能力不能反映其真正的稳健性。这项工作引入了一个全面的评估框架,系统地检查了指令和输入在此类模型的泛化能力中的作用,考虑了建筑设计、跨语言和视觉模式的输入扰动以及增加的任务复杂性。拟议的框架揭示了多模式模型对极端指令扰动的弹性以及它们对观测变化的脆弱性,这引发了人们对过度拟合虚假相关性的担忧。通过对当前基于变压器的机器人操纵任务多通道模型使用这个评估框架,我们发现了局限性,并建议未来的进步应专注于更好地集成多通道输入的架构和培训创新,通过优先考虑对输入内容的敏感性而不是附带相关性来增强模型的泛化能力。

[NLP-61] Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models
[NLP-61] 利用大型语言模型的背景知识提高强化学习的样本效率

链接: https://arxiv.org/abs/2407.03964
作者: Fuxiang Zhang,Junyou Li,Yi-Chen Li,Zongzhang Zhang,Yang Yu,Deheng Ye
关键词: Low sample efficiency, Low sample, enduring challenge, challenge of reinforcement, Low
中文关键词: 低样本效率、低样本、持久挑战、强化挑战、低
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Low sample efficiency is an enduring challenge of reinforcement learning (RL). With the advent of versatile large language models (LLMs), recent works impart common-sense knowledge to accelerate policy learning for RL processes. However, we note that such guidance is often tailored for one specific task but loses generalizability. In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which contains general understandings of the entire environment, making various downstream RL tasks benefit from one-time knowledge representation. We ground LLMs by feeding a few pre-collected experiences and requesting them to delineate background knowledge of the environment. Afterward, we represent the output knowledge as potential functions for potential-based reward shaping, which has a good property for maintaining policy optimality from task rewards. We instantiate three variants to prompt LLMs for background knowledge, including writing code, annotating preferences, and assigning goals. Our experiments show that these methods achieve significant sample efficiency improvements in a spectrum of downstream tasks from Minigrid and Crafter domains.
摘要:低样本效率是强化学习(RL)的一个长期挑战。随着通用大型语言模型(LLM)的出现,最近的工作传授常识知识,以加速RL过程的政策学习。然而,我们注意到,这种指导通常是为一项特定任务量身定做的,但失去了普遍性。在本文中,我们介绍了一个框架,它利用LLMS来提取环境的背景知识,其中包含了对整个环境的一般理解,使得下游的RL任务受益于一次性的知识表示。我们通过提供一些预先收集的经验并要求他们描述环境的背景知识来使LLMS接地。然后,我们将输出知识表示为基于势的奖励成形的势函数,这对于从任务奖励中保持策略最优性具有良好的性质。我们实例化了三个变体来提示LLM输入背景知识,包括编写代码、注释首选项和分配目标。我们的实验表明,这些方法在来自微格和Crafter域的一系列下游任务中实现了显着的样本效率提高。

[NLP-62] LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
[NLP-62] LLM-jp:研究和开发完全开放的日本法学硕士的跨组织项目

链接: https://arxiv.org/abs/2407.03963
作者: LLM-jp:Akiko Aizawa,Eiji Aramaki,Bowen Chen,Fei Cheng,Hiroyuki Deguchi,Rintaro Enomoto,Kazuki Fujii,Kensuke Fukumoto,Takuya Fukushima,Namgi Han,Yuto Harada,Chikara Hashimoto,Tatsuya Hiraoka,Shohei Hisada,Sosuke Hosokawa,Lu Jie,Keisuke Kamata,Teruhito Kanazawa,Hiroki Kanezashi,Hiroshi Kataoka,Satoru Katsumata,Daisuke Kawahara,Seiya Kawano,Atsushi Keyaki,Keisuke Kiryu,Hirokazu Kiyomaru,Takashi Kodama,Takahiro Kubo,Yohei Kuga,Ryoma Kumon,Shuhei Kurita,Sadao Kurohashi,Conglong Li,Taiki Maekawa,Hiroshi Matsuda,Yusuke Miyao,Kentaro Mizuki,Sakae Mizuki,Yugo Murawaki,Ryo Nakamura,Taishi Nakamura,Kouta Nakayama,Tomoka Nakazato,Takuro Niitsuma,Jiro Nishitoba,Yusuke Oda,Hayato Ogawa,Takumi Okamoto,Naoaki Okazaki,Yohei Oseki,Shintaro Ozaki,Koki Ryu,Rafal Rzepka,Keisuke Sakaguchi,Shota Sasaki,Satoshi Sekine,Kohei Suda,Saku Sugawara,Issa Sugiura,Hiroaki Sugiyama,Hisami Suzuki,Jun Suzuki,Toyotaro Suzumura,Kensuke Tachibana,Yu Takagi,Kyosuke Takami,Koichi Takeda,Masashi Takeshita,Masahiro Tanaka,Kenjiro Taura,Arseny Tolmachev,Nobuhiro Ueda,Zhen Wan,Shuntaro Yada,Sakiko Yahata,Yuya Yamamoto,Yusuke Yamauchi,Hitomi Yanaka,Rio Yokota,Koichiro Yoshino
关键词: large language models, Japanese large language, language models, Japanese large, cross-organizational project
中文关键词: 大型语言模型,日语大型语言,语言模型,日语大型跨组织项目
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit this https URL.
摘要:本文介绍了LLM-jp,这是一个用于研究和开发日语大型语言模型(LLM)的跨组织项目。LLM-jp旨在开发开源且强大的日本LLM,截至本文撰写时,来自学术界和工业界的1,500多名参与者正在为此目的共同努力。本文介绍了LLM-jp成立的背景、其活动摘要以及LLM-jp开发的LLM-jp的技术报告。有关最新活动,请访问此https URL。

[NLP-63] Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
[NLP-63] 斯塔克:具有人物形象常识的社交长期多模式对话

链接: https://arxiv.org/abs/2407.03958
作者: Young-Jun Lee,Dokyong Lee,Junyoung Youn,Kyeongjin Oh,Byungsoo Ko,Jonghwan Hyeon,Ho-Jin Choi
关键词: instant messaging tools, messaging tools, personal experiences, instant messaging, share a wide
中文关键词: 即时通讯工具、通讯工具、个人体验、即时通讯、广泛分享
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social interaction, and (2) a lack of personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using our Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset in human evaluation. We make our source code and dataset publicly available.
摘要:人类通过即时通讯工具在对话中分享与其个人经历相关的各种图像。然而,现有作品集中在(1)单一会话中的图像共享行为,导致长期社交互动有限,(2)缺乏个性化的图像共享行为。在这项工作中,我们介绍了Stark,这是一个大规模长期多模式对话数据集,以多模式格式、时间间隔和图像覆盖了广泛的社交角色。为了自动构建Stark,我们提出了一种新型的多模式情境化框架Mcu,它生成从ChatGPT和我们提出的计划并执行图像对齐器中提取的长期多模式对话。使用Stark,我们训练了多模式对话模型Ultron 7 B,它展示了令人印象深刻的视觉想象能力。此外,我们还证明了我们的数据集在人类评估中的有效性。我们公开我们的源代码和数据集。

[NLP-64] Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems
[NLP-64] 使用约束引导多智能体系统解决斑马谜题

链接: https://arxiv.org/abs/2407.03956
作者: Shmuel Berman,Baishakhi Ray,Kathleen McKeown
关键词: Large Language Models, Prior research, Language Models, ability of Large, Large Language
中文关键词: 大型语言模型、先前研究、语言模型、大型、大型语言的能力
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off the shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user-study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing 166% improvement in the number of fully correct solutions.
摘要:以前的研究已经提高了大型语言模型(LLM)使用诸如思想链提示或引入符号表示等技术来解决逻辑难题的能力。由于将自然语言线索转换成逻辑语句的内在复杂性,这些框架通常仍然不足以解决复杂的逻辑问题,如斑马谜题。我们介绍了一个多智能体系统ZPS,它集成了LLMS和一个现成的定理证明器。这个系统通过将问题分解成更小的、可管理的部分,生成SMT(满足性模理论)代码来使用定理证明器来解决它们,并使用代理之间的反馈来反复改进他们的答案,从而处理复杂的难题解决任务。我们还介绍了一个自动网格谜题分级器来评估我们的谜题解决方案的正确性,并通过用户研究对其进行评估,表明该自动分级器是可靠的。我们的方法在我们测试的所有三个LLM中都显示了改进,GPT-4显示完全正确解的数量提高了166%。

[NLP-65] Meta-prompting Optimized Retrieval-augmented Generation
[NLP-65] 元提示优化检索增强生成

链接: https://arxiv.org/abs/2407.03955
作者: João Rodrigues,António Branco
关键词: large language models, Retrieval-augmented generation resorts, external sources, sources in order, order to leverage
中文关键词: 大型语言模型、检索增强一代度假村、外部来源、有序来源、利用顺序
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation resorts to content retrieved from external sources in order to leverage the performance of large language models in downstream tasks. The excessive volume of retrieved content, the possible dispersion of its parts, or their out of focus range may happen nevertheless to eventually have a detrimental rather than an incremental effect. To mitigate this issue and improve retrieval-augmented generation, we propose a method to refine the retrieved content before it is included in the prompt by resorting to meta-prompting optimization. Put to empirical test with the demanding multi-hop question answering task from the StrategyQA dataset, the evaluation results indicate that this method outperforms a similar retrieval-augmented system but without this method by over 30%.
摘要:检索增强生成诉诸于从外部源检索的内容,以利用下游任务中大型语言模型的性能。然而,检索到的内容的过多量、其部分可能的分散或其失焦范围最终可能会产生有害的而不是增量的影响。为了缓解这个问题并改进检索增强生成,我们提出了一种方法,通过诉诸元提示优化在检索到的内容被包括在提示中之前对其进行细化。通过来自StrategyQA数据集中的要求严格的多跳问答任务进行实证测试,评估结果表明,这种方法的性能优于类似的检索增强系统,但没有这种方法,高出30%以上。

[NLP-66] A framework for annotating and modelling intentions behind metaphor use
[NLP-66] 注释和建模隐喻使用背后意图的框架

链接: https://arxiv.org/abs/2407.03952
作者: Gianluca Michelli,Xiaoyu Tong,Ekaterina Shutova
关键词: conceptualize the world, part of everyday, everyday language, language models, metaphor
中文关键词: 概念化世界,日常生活的一部分,日常语言,语言模型,隐喻
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Metaphors are part of everyday language and shape the way in which we conceptualize the world. Moreover, they play a multifaceted role in communication, making their understanding and generation a challenging task for language models (LMs). While there has been extensive work in the literature linking metaphor to the fulfilment of individual intentions, no comprehensive taxonomy of such intentions, suitable for natural language processing (NLP) applications, is available to present day. In this paper, we propose a novel taxonomy of intentions commonly attributed to metaphor, which comprises 9 categories. We also release the first dataset annotated for intentions behind metaphor use. Finally, we use this dataset to test the capability of large language models (LLMs) in inferring the intentions behind metaphor use, in zero- and in-context few-shot settings. Our experiments show that this is still a challenge for LLMs.
摘要:隐喻是日常语言的一部分,塑造了我们概念化世界的方式。此外,它们在沟通中发挥着多方面的作用,这使得它们的理解和生成对于语言模型(LM)来说是一项具有挑战性的任务。虽然文献中已经有大量工作将隐喻与个人意图的实现联系起来,但迄今为止还没有适合自然语言处理(NLP)应用的此类意图的全面分类。在本文中,我们提出了一种新的意图分类法,通常归因于隐喻,它包括9个类别。我们还发布了第一个注释隐喻使用背后意图的数据集。最后,我们使用该数据集来测试大型语言模型(LLM)在零场景和上下文少场景设置中推断隐喻使用背后意图的能力。我们的实验表明,这对于LLM来说仍然是一个挑战。

[NLP-67] Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data
[NLP-67] 利用合成数据进行多元化、细粒度的教学跟随能力探索

链接: https://arxiv.org/abs/2407.03942
作者: Zihui Gu,Xingwu Sun,Fengzong Lian,Zhanhui Kang,Cheng-Zhong Xu,Ju Fan
关键词: large language models, language models, support diverse user, crucial for large, large language
中文关键词: 大型语言模型,语言模型,支持多元化用户,对于大型语言至关重要
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction following remains a challenge due to complexity and diversity of real-world user instructions. While existing evaluation methods focus on general skills, they suffer from two main shortcomings, i.e., lack of fine-grained task-level evaluation and reliance on singular instruction expression. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset that has two main advantages: (1) DINGO is based on a manual annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions, generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs.
摘要:指令遵循对于支持不同用户请求的大型语言模型(LLM)尤为重要。虽然现有的工作在使LLM符合人类偏好方面取得了进展,但由于现实世界用户指令的复杂性和多样性,评估它们在指令遵循方面的能力仍然是一个挑战。虽然现有的评价方法侧重于一般技能,但它们存在两个主要缺陷,即缺乏细粒度的任务级评价和依赖单一的教学表达。为了解决这些问题,本文引入了Dingo,这是一个细粒度和多样化的指令遵循评估数据集,它具有两个主要优点:(1)Dingo是基于人工标注的细粒度多层类别树,其中130个节点来自真实世界的用户请求;(2)Dingo包含由GPT-4和人类专家生成的各种指令。通过大量的实验,我们证明了Dingo不仅可以为LLMS提供更具挑战性和更全面的评估,而且还可以为进一步改进LLMS提供任务级的细粒度指导。

[NLP-68] Narrow Transformer: Starcoder-Based Java-LM For Desktop
[NLP-68] Narrow Transformer:基于StarCoder的Java-LM For桌面

链接: https://arxiv.org/abs/2407.03941
作者: Kamalkumar Rathinasamy,Balaji A J,Ankush Kumar,Gagan Gayari,Harshini K,Rajab Ali Mondal,Sreenivasa Raghavan K S,Swayam Singh
关键词: MultiPL-E Java code, Java code, Java code benchmark, Java code model, small Java code
中文关键词: MultiPL-E Java代码、Java代码、Java代码基准、Java代码模型、小型Java代码
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents NT-Java-1.1B, an open-source specialized code language model built on StarCoderBase-1.1B, designed for coding tasks in Java programming. NT-Java-1.1B achieves state-of-the-art performance, surpassing its base model and majority of other models of similar size on MultiPL-E Java code benchmark. While there have been studies on extending large, generic pre-trained models to improve proficiency in specific programming languages like Python, similar investigations on small code models for other programming languages are lacking. Large code models require specialized hardware like GPUs for inference, highlighting the need for research into building small code models that can be deployed on developer desktops. This paper addresses this research gap by focusing on the development of a small Java code model, NT-Java-1.1B, and its quantized versions, which performs comparably to open models around 1.1B on MultiPL-E Java code benchmarks, making them ideal for desktop deployment. This paper establishes the foundation for specialized models across languages and sizes for a family of NT Models.
摘要:本文介绍了一个基于StarCoderBase-1.1B的开源专用代码语言模型NT-Java-1.1B,它是为Java编程中的编码任务而设计的。NT-Java-1.1B实现了最先进的性能,在MultiPL-E Java代码基准测试中超过了其基本模型和大多数其他类似大小的模型。虽然已经有关于扩展大型、通用的预先训练的模型以提高对特定编程语言的熟练程度的研究,但对于其他编程语言的小代码模型还缺乏类似的研究。大型代码模型需要专门的硬件(如GPU)进行推理,这突显了研究构建可部署在开发人员桌面上的小型代码模型的必要性。本文通过集中开发一个小型Java代码模型NT-Java-1.1B及其量化版本来弥补这一研究差距,该模型的性能与基于MultiPL-E Java代码基准测试的1.1B左右的开放模型相当,使其成为桌面部署的理想选择。本文为NT模型家族的跨语言和大小的专用模型奠定了基础。

[NLP-69] ongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models
[NLP-69] ongGu:用基于知识的大型语言模型掌握古典中文理解

链接: https://arxiv.org/abs/2407.03937
作者: Jiahuan Cao,Dezhi Peng,Peirong Zhang,Yongxin Shi,Yang Liu,Kai Ding,Lianwen Jin
关键词: complexities pose formidable, pose formidable comprehension, formidable comprehension barriers, Natural Language Processing, Classical Chinese
中文关键词: 复杂性构成强大的,构成强大的理解力,强大的理解障碍,自然语言处理,古典中文
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern people without specialized knowledge. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowledge-intensive tasks. In response to this dilemma, we propose \textbfTongGu (mean understanding ancient and modern), the first CCU-specific LLM, underpinned by three core contributions. First, we construct a two-stage instruction-tuning dataset ACCN-INS derived from rich classical Chinese corpora, aiming to unlock the full CCU potential of LLMs. Second, we propose Redundancy-Aware Tuning (RAT) to prevent catastrophic forgetting, enabling TongGu to acquire new capabilities while preserving its foundational knowledge. Third, we present a CCU Retrieval-Augmented Generation (CCU-RAG) technique to reduce hallucinations based on knowledge-grounding. Extensive experiments across 24 diverse CCU tasks validate TongGu’s superior ability, underscoring the effectiveness of RAT and CCU-RAG. The model and dataset will be public available.
摘要:文言文是通向古代中国丰富遗产和智慧的门户,但它的复杂性给大多数没有专业知识的现代人带来了巨大的理解障碍。虽然大型语言模型在自然语言处理(NLP)方面表现出了卓越的能力,但它们在文言文理解(CCU)方面却举步维艰,特别是在数据要求高和知识密集型的任务中。为了应对这一困境,我们提出了第一个CCU特有的LLM,并以三个核心贡献为基础。首先,我们从丰富的文言文语料库中构建了一个两级指令调优数据集ACCN-INS,旨在充分挖掘LLMS的CCU潜力。其次,我们提出了冗余感知调谐(RAT)来防止灾难性遗忘,使铜鼓在保持其基础知识的同时获得新的能力。第三,我们提出了一种CCU检索-增强生成(CCU-RAG)技术,以减少基于知识的幻觉。在24个不同的CCU任务上进行的广泛实验验证了铜鼓的优越能力,强调了RAT和CCU-RAG的有效性。模型和数据集将公开提供。

[NLP-70] Entity-Level Sentiment: More than the Sum of Its Parts
[NLP-70] 青少年水平的情绪:超过其各部分的总和

链接: https://arxiv.org/abs/2407.03916
作者: Egil Rønningstad,Roman Klinger,Erik Velldal,Lilja Øvrelid
关键词: longer texts, topics discussed, sentiment, variety of topics, entity
中文关键词: 较长的文本、讨论的主题、情感、各种主题、实体
类目: Computation and Language (cs.CL)
备注: 14th Workshop on Computational Approaches to Subjectivity, Sentiment Social Media Analysis (WASSA 2024)

点击查看摘要

Abstract:In sentiment analysis of longer texts, there may be a variety of topics discussed, of entities mentioned, and of sentiments expressed regarding each entity. We find a lack of studies exploring how such texts express their sentiment towards each entity of interest, and how these sentiments can be modelled. In order to better understand how sentiment regarding persons and organizations (each entity in our scope) is expressed in longer texts, we have collected a dataset of expert annotations where the overall sentiment regarding each entity is identified, together with the sentence-level sentiment for these entities separately. We show that the reader’s perceived sentiment regarding an entity often differs from an arithmetic aggregation of sentiments at the sentence level. Only 70% of the positive and 55% of the negative entities receive a correct overall sentiment label when we aggregate the (human-annotated) sentiment labels for the sentences where the entity is mentioned. Our dataset reveals the complexity of entity-specific sentiment in longer texts, and allows for more precise modelling and evaluation of such sentiment expressions.
摘要:在较长文本的情感分析中,可能会有各种讨论的主题、提及的实体以及对每个实体表达的情感。我们发现缺乏研究来探索这些文本如何表达他们对每个感兴趣实体的情感,以及如何对这些情感进行建模。为了更好地理解如何在较长的文本中表达对个人和组织(我们范围内的每个实体)的情绪,我们收集了一个专家注释数据集,其中确定了对每个实体的总体情绪,以及这些实体单独的句子级情绪。我们表明,读者对一个实体的感知情感往往不同于句子层面上的情感的算术聚合。当我们聚合提到实体的句子的(人工标注的)情感标签时,只有70%的积极实体和55%的消极实体获得了正确的总体情感标签。我们的数据集揭示了较长文本中特定于实体的情感的复杂性,并允许对此类情感表达进行更精确的建模和评估。

[NLP-71] Scoping Review of Active Learning Strategies and their Evaluation Environments for Entity Recognition Tasks
[NLP-71] 实体识别任务的主动学习策略及其评估环境的范围审查

链接: https://arxiv.org/abs/2407.03895
作者: Philipp Kohl,Yoka Krämer,Claudia Fohry,Bodo Kraft
关键词: active learning strategies, natural language processing, active learning, Identify active learning, learning strategies
中文关键词: 主动学习策略,自然语言处理,主动学习,识别主动学习,学习策略
类目: Computation and Language (cs.CL)
备注: The Version of Record of this contribution is published in Deep Learning Theory and Applications 5th International Conference, DeLTA 2024 Proceedings, and will be available after the conference

点击查看摘要

Abstract:We conducted a scoping review for active learning in the domain of natural language processing (NLP), which we summarize in accordance with the PRISMA-ScR guidelines as follows: Objective: Identify active learning strategies that were proposed for entity recognition and their evaluation environments (datasets, metrics, hardware, execution time). Design: We used Scopus and ACM as our search engines. We compared the results with two literature surveys to assess the search quality. We included peer-reviewed English publications introducing or comparing active learning strategies for entity recognition. Results: We analyzed 62 relevant papers and identified 106 active learning strategies. We grouped them into three categories: exploitation-based (60x), exploration-based (14x), and hybrid strategies (32x). We found that all studies used the F1-score as an evaluation metric. Information about hardware (6x) and execution time (13x) was only occasionally included. The 62 papers used 57 different datasets to evaluate their respective strategies. Most datasets contained newspaper articles or biomedical/medical data. Our analysis revealed that 26 out of 57 datasets are publicly accessible. Conclusion: Numerous active learning strategies have been identified, along with significant open questions that still need to be addressed. Researchers and practitioners face difficulties when making data-driven decisions about which active learning strategy to adopt. Conducting comprehensive empirical comparisons using the evaluation environment proposed in this study could help establish best practices in the domain. Comments: The Version of Record of this contribution is published in Deep Learning Theory and Applications 5th International Conference, DeLTA 2024 Proceedings, and will be available after the conference Subjects: Computation and Language (cs.CL) Cite as: arXiv:2407.03895 [cs.CL] (or arXiv:2407.03895v1 [cs.CL] for this version)
摘要:我们对自然语言处理(NLP)领域的主动学习进行了范围划分,根据PRISMA-SCR指南总结如下:目的:确定提出的用于实体识别的主动学习策略及其评估环境(数据集、度量、硬件、执行时间)。设计:我们使用Scope us和ACM作为我们的搜索引擎。我们将结果与两个文献调查进行比较,以评估搜索质量。我们纳入了同行评议的英语出版物,介绍或比较了实体识别的主动学习策略。结果:分析了62篇相关文献,确定了106种主动学习策略。我们将它们分为三类:基于开采的(60倍)、基于勘探的(14倍)和混合战略(32倍)。我们发现,所有的研究都使用F1分数作为评估指标。关于硬件(6倍)和执行时间(13倍)的信息只是偶尔包括在内。这62篇论文使用了57个不同的数据集来评估各自的策略。大多数数据集包含报纸文章或生物医学/医学数据。我们的分析显示,57个数据集中有26个是可公开访问的。结论:已经确定了许多积极的学习策略,以及仍然需要解决的重大开放问题。研究人员和实践者在做出关于采用哪种主动学习策略的数据驱动的决定时面临困难。利用本研究建议的评价环境进行全面的实证比较,有助于确立该领域的最佳做法。评论:本投稿的记录版本发表在第五届深度学习理论与应用国际会议,Delta2024年论文集,并将在会议主题:计算与语言(cs.CL)引用为:arxiv:2407.03895cs.CL后可用。

[NLP-72] Planning with Large Language Models for Conversational Agents
[NLP-72] 对话代理的大型语言模型规划

链接: https://arxiv.org/abs/2407.03884
作者: Zhigen Li,Jianxiang Peng,Yanmeng Wang,Tianhao Shen,Minghui Zhang,Linxi Su,Shang Wu,Yihang Wu,Yuqian Wang,Ye Wang,Wei Hu,Jianfeng Li,Shaojun Wang,Jing Xiao,Deyi Xiong
关键词: autonomous conversational agents, crucial properties, properties of autonomous, Controllability, dialogue
中文关键词: 自主对话主体,关键属性,自主属性,可控性,对话
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be unified with controllability, proactivity, and low manual annotation. To bridge this gap, we propose a new framework for planning-based conversational agents (PCA) powered by large language models (LLMs), which only requires humans to define tasks and goals for the LLMs. Before conversation, LLM plans the core and necessary SOP for dialogue offline. During the conversation, LLM plans the best action path online referring to the SOP, and generates responses to achieve process controllability. Subsequently, we propose a semi-automatic dialogue data creation framework and curate a high-quality dialogue dataset (PCA-D). Meanwhile, we develop multiple variants and evaluation metrics for PCA, e.g., planning with Monte Carlo Tree Search (PCA-M), which searches for the optimal dialogue action while satisfying SOP constraints and achieving the proactive of the dialogue. Experiment results show that LLMs finetuned on PCA-D can significantly improve the performance and generalize to unseen domains. PCA-M outperforms other CoT and ToT baselines in terms of conversation controllability, proactivity, task success rate, and overall logical coherence, and is applicable in industry dialogue scenarios. The dataset and codes are available at XXXX.
摘要:可控性和主动性是自主会话代理的重要特性。可控性要求CA遵循标准操作程序(SOP),例如在激活信用卡之前验证身份。主动性要求CA在用户不合作期间将对话引导到目标,例如说服性对话。现有的研究不能统一为可控性、主动性和低人工注释。为了弥补这一差距,我们提出了一种基于规划的会话代理(PCA)的新框架,该框架由大语言模型(LLM)提供支持,只需要人类为LLM定义任务和目标。在对话之前,LLM计划离线对话的核心和必要的SOP。在对话过程中,LLM参考SOP在线规划最佳行动路径,并生成响应以实现过程可控性。随后,我们提出了一个半自动对话数据创建框架,并建立了一个高质量的对话数据集(PCA-D)。同时,我们提出了多种主成分分析方法和评价指标,如蒙特卡罗树搜索计划(PCA-M),它在满足SOP约束的同时搜索最优的对话动作,实现对话的主动性。实验结果表明,在PCA-D上优化的LLMS能够显著提高性能,并推广到不可见的领域。PCA-M在会话可控性、主动性、任务成功率和总体逻辑一致性方面优于其他COT和TOT基线,适用于行业对话场景。数据集和代码可在XXXX获得。

[NLP-73] DART: Deep Adversarial Automated Red Teaming for LLM Safety
[NLP-73] DART:深度对抗自动化红色团队,确保LLM安全

链接: https://arxiv.org/abs/2407.03876
作者: Bojian Jiang,Yi Jing,Tianhao Shen,Qing Yang,Deyi Xiong
关键词: Manual Red teaming, Target LLM, automated red teaming, Red LLM, automated Red LLM
中文关键词: 手动红色分组、目标LLM、自动红色分组、红色LLM、自动红色LLM
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Manual Red teaming is a commonly-used method to identify vulnerabilities in large language models (LLMs), which, is costly and unscalable. In contrast, automated red teaming uses a Red LLM to automatically generate adversarial prompts to the Target LLM, offering a scalable way for safety vulnerability detection. However, the difficulty of building a powerful automated Red LLM lies in the fact that the safety vulnerabilities of the Target LLM are dynamically changing with the evolution of the Target LLM. To mitigate this issue, we propose a Deep Adversarial Automated Red Teaming (DART) framework in which the Red LLM and Target LLM are deeply and dynamically interacting with each other in an iterative manner. In each iteration, in order to generate successful attacks as many as possible, the Red LLM not only takes into account the responses from the Target LLM, but also adversarially adjust its attacking directions by monitoring the global diversity of generated attacks across multiple iterations. Simultaneously, to explore dynamically changing safety vulnerabilities of the Target LLM, we allow the Target LLM to enhance its safety via an active learning based data selection mechanism. Experimential results demonstrate that DART significantly reduces the safety risk of the target LLM. For human evaluation on Anthropic Harmless dataset, compared to the instruction-tuning target LLM, DART eliminates the violation risks by 53.4%. We will release the datasets and codes of DART soon.
摘要:手动Red Teaming是一种常用的大型语言模型漏洞识别方法,代价昂贵且不可扩展。相比之下,自动红色团队使用Red LLM自动生成针对Target LLM的敌意提示,为安全漏洞检测提供了一种可扩展的方法。然而,构建功能强大的自动Red LLM的难点在于,目标LLM的安全漏洞随着目标LLM的演化而动态变化。为了缓解这一问题,我们提出了一个深度对抗性自动红团队(DART)框架,在该框架中,Red LLM和Target LLM以迭代的方式进行深度和动态的交互。在每一次迭代中,为了生成尽可能多的成功攻击,Red LLM不仅考虑目标LLM的响应,而且通过监控生成的攻击在多个迭代中的全局多样性来相反地调整其攻击方向。同时,为了探索动态变化的安全漏洞,我们允许目标LLM通过一种基于主动学习的数据选择机制来增强其安全性。实验结果表明,DART显著降低了目标LLM的安全风险。对于人类无害数据集的人工评估,与指令调优目标LLM相比,DART消除了53.4%的违规风险。我们将很快公布DART的数据集和代码。

[NLP-74] artuNLP @ AXOLOTL-24: Leveraging Classifier Output for New Sense Detection in Lexical Semantics
[NLP-74] artuNLP @ AX OLOTL-24:利用分类器输出进行词汇语义中的新意义检测

链接: https://arxiv.org/abs/2407.03861
作者: Aleksei Dorkin,Kairit Sirts
关键词: shared task, shared task comprises, present our submission, older time periods, task comprises
中文关键词: 共享任务,共享任务包括,提交我们的提交,旧时间段,任务包括
类目: Computation and Language (cs.CL)
备注: Accepted to the 5th International Workshop on Computational Approaches to Historical Language Change 2024 (LChange’24)

点击查看摘要

Abstract:We present our submission to the AXOLOTL-24 shared task. The shared task comprises two subtasks: identifying new senses that words gain with time (when comparing newer and older time periods) and producing the definitions for the identified new senses. We implemented a conceptually simple and computationally inexpensive solution to both subtasks. We trained adapter-based binary classification models to match glosses with usage examples and leveraged the probability output of the models to identify novel senses. The same models were used to match examples of novel sense usages with Wiktionary definitions. Our submission attained third place on the first subtask and the first place on the second subtask.
摘要:我们提交了对AX OLOTL-24共享任务的提交文件。共享任务包括两个子任务:识别单词随着时间的推移获得的新意义(当比较新的和旧的时间段时)并为识别的新意义产生定义。我们为这两个子任务实现了一个概念上简单且计算成本低的解决方案。我们训练了基于适配器的二元分类模型,将修饰与使用示例进行匹配,并利用模型的概率输出来识别新颖的感官。使用相同的模型将新颖的意义用法示例与维基词典的定义进行匹配。我们的提交在第一子任务中获得第三名,在第二子任务中获得第一名。

[NLP-75] Anthropocentric bias and the possibility of artificial cognition
[NLP-75] 以人为中心的偏见和人工认知的可能性

链接: https://arxiv.org/abs/2407.03859
作者: Raphaël Millière,Charles Rathkopf
关键词: large language models, language models, requires overcoming, large language, Evaluating the cognitive
中文关键词: 大型语言模型,语言模型,需要克服,大型语言,评估认知
类目: Computation and Language (cs.CL)
备注: Accepted for ICML 2024 (Workshop on Large Language Models and Cognition)

点击查看摘要

Abstract:Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (Type-I), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent (Type-II). Mitigating these biases necessitates an empirically-driven, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, which can be done by supplementing carefully designed behavioral experiments with mechanistic studies.
摘要:评估大型语言模型(LLM)的认知能力不仅需要克服拟人化偏见,还需要克服以人为本的偏见。本文指出了两种被忽视的以人为本的偏见:忽视了辅助因素如何在有能力的情况下阻碍LLM表现(类型I),并驳斥了与人类不同的LLM机械策略,因为它们并不真正有能力(类型II)。缓解这些偏见需要一种数学驱动的迭代方法,将认知任务映射到LLM特定的能力和机制,这可以通过用机械研究补充精心设计的行为实验来实现。

[NLP-76] HYBRINFOX at CheckThat! 2024 – Task 1: Enhancing Language Models with Structured Information for Check-Worthiness Estimation
[NLP-76] hyBRINFOX在CheckThat!2024年–任务1:使用结构化信息增强语言模型以进行检查价值估计

链接: https://arxiv.org/abs/2407.03850
作者: Géraud Faye,Morgane Casanova,Benjamin Icard,Julien Chanson,Guillaume Gadek,Guillaume Gravier,Paul Égré
关键词: HYBRINFOX team, Language Models, paper summarizes, summarizes the experiments, Large Language Models
中文关键词: hyBRINFOX团队,语言模型,论文总结,实验总结,大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper to appear in the Proceedings of the Conference and Labs of the Evaluation Forum (CLEF 2024 CheckThat!)

点击查看摘要

Abstract:This paper summarizes the experiments and results of the HYBRINFOX team for the CheckThat! 2024 - Task 1 competition. We propose an approach enriching Language Models such as RoBERTa with embeddings produced by triples (subject ; predicate ; object) extracted from the text sentences. Our analysis of the developmental data shows that this method improves the performance of Language Models alone. On the evaluation data, its best performance was in English, where it achieved an F1 score of 71.1 and ranked 12th out of 27 candidates. On the other languages (Dutch and Arabic), it obtained more mixed results. Future research tracks are identified toward adapting this processing pipeline to more recent Large Language Models.
摘要:本文总结了hyBRINFOX团队针对CheckThat!的实验和结果2024年-任务1竞赛。我们提出了一种方法,通过从文本句子中提取的三重组(主题;动词;对象)产生的嵌入来丰富RoBERTa等语言模型。我们对开发数据的分析表明,这种方法单独提高了语言模型的性能。从评估数据来看,其表现最好的是英语,F1成绩为71.1分,在27名候选人中排名第12位。在其他语言(荷兰语和阿拉伯语)上,它得到的结果更加混杂。未来的研究轨迹旨在使该处理管道适应更新的大型语言模型。

[NLP-77] On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation
[NLP-77] 关于开放领域对话评估的LLM基准

链接: https://arxiv.org/abs/2407.03841
作者: John Mendonça,Alon Lavie,Isabel Trancoso
关键词: Natural Language Processing, Language Processing tasks, Large Language Models, Processing tasks, Large Language
中文关键词: 自然语言处理、语言处理任务、大型语言模型、处理任务、大型语言
类目: Computation and Language (cs.CL)
备注: Accepted to the 6th NLP for Conversational AI workshop at ACL

点击查看摘要

Abstract:Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks, and together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fail to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots. Comments: Accepted to the 6th NLP for Conversational AI workshop at ACL Subjects: Computation and Language (cs.CL) Cite as: arXiv:2407.03841 [cs.CL] (or arXiv:2407.03841v1 [cs.CL] for this version)
摘要:大型语言模型在各种自然语言处理任务中显示出了卓越的性能。尤其是在自动开放领域对话评价方面,LLMS已被无缝地纳入评价框架,并与人工评价一起构成了大多数评价的支柱。然而,现有的评估基准往往依赖于过时的数据集,并评估流畅性和相关性等方面,这无法充分反映最先进的聊天机器人模型的能力和局限性。本文批判性地审查了当前的评估基准,强调使用较旧的响应生成器和质量方面无法准确反映现代聊天机器人的能力。在最近LLM生成的数据集(SODA)上的一个小型注释实验表明,LLM评估器(如GPT-4)难以检测出当前LLM聊天机器人生成的对话中的实际缺陷。评论:接受第六届NLP对话式人工智能研讨会,主题:计算和语言(cs.CL)引用为:arxiv:2407.03841cs.CL

[NLP-78] ConText at WASSA 2024 Empathy and Personality Shared Task: History-Dependent Embedding Utterance Representations for Empathy and Emotion Prediction in Conversations
[NLP-78] WASSA 2024年同理心和人格共享任务:对话中同理心和情感预测的历史相关嵌入直言不讳表示

链接: https://arxiv.org/abs/2407.03818
作者: Patrícia Pereira,Helena Moniz,Joao Paulo Carvalho
关键词: empathetic agents, key components, development of effective, effective and empathetic, emotion prediction
中文关键词: 同理心因子,关键成分,开发有效、有效和同理心,情感预测
类目: Computation and Language (cs.CL)
备注: WASSA’24

点击查看摘要

Abstract:Empathy and emotion prediction are key components in the development of effective and empathetic agents, amongst several other applications. The WASSA shared task on empathy and emotion prediction in interactions presents an opportunity to benchmark approaches to these tasks. Appropriately selecting and representing the historical context is crucial in the modelling of empathy and emotion in conversations. In our submissions, we model empathy, emotion polarity and emotion intensity of each utterance in a conversation by feeding the utterance to be classified together with its conversational context, i.e., a certain number of previous conversational turns, as input to an encoder Pre-trained Language Model, to which we append a regression head for prediction. We also model perceived counterparty empathy of each interlocutor by feeding all utterances from the conversation and a token identifying the interlocutor for which we are predicting the empathy. Our system officially ranked 1^st at the CONV-turn track and 2^nd at the CONV-dialog track.
摘要:移情和情绪预测是开发有效和移情代理的关键组成部分,以及其他一些应用。WASSA分享的关于互动中的移情和情绪预测的任务提供了一个基准方法来处理这些任务。恰当地选择和再现历史语境在对话中建立移情和情感模型是至关重要的。在我们的意见书中,我们通过将待分类的话语及其会话上下文(即先前一定数量的会话话轮)作为输入输入到编码器预先训练的语言模型中,并在该模型中添加用于预测的回归头部,来对会话中每个话语的共情、情感极性和情感强度进行建模。我们还通过提供对话中的所有话语和一个标识我们预测共情的对话者的令牌来模拟每个对话者的感知对方共情。我们的系统在转弯赛道上正式排名第一,在转弯赛道上排名第二。

[NLP-79] Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation
[NLP-79] 爱沙尼亚对话口语翻译的端到端模型微调

链接: https://arxiv.org/abs/2407.03809
作者: Tiia Sildam,Andra Velve,Tanel Alumäe
关键词: Estonian-Russian conversational, paper investigates, investigates the finetuning, bidirectional Estonian-English, Estonian-English and Estonian-Russian
中文关键词: 爱沙尼亚-俄罗斯对话,论文调查,调查微调,双向爱沙尼亚-英语、爱沙尼亚-英语和爱沙尼亚-俄语
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to LoResMT 2024 (ACL workshop)

点击查看摘要

Abstract:This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.
摘要:本文研究了爱沙尼亚-英语和爱沙尼亚-俄语双向对话语音到文本翻译的端到端模型的微调。由于爱沙尼亚语语音翻译数据的可用性有限,我们通过网络抓取和使用机器翻译从语音识别数据集合成数据来创建额外的训练数据。我们评估了三种公开可用的端到端模型:Whisper、OWSM 3.1和DeliverlessM 4 T。我们的结果表明,利用合成数据进行微调,可以大幅提高翻译准确性,实现无缝M4 T匹配或超越使用最先进语音识别和机器翻译模型的级联语音翻译系统。

[NLP-80] Cognitive Modeling with Scaffolded LLMs: A Case Study of Referential Expression Generation
[NLP-80] 支架式LLM的认知建模:引用表达生成的案例研究

链接: https://arxiv.org/abs/2407.03805
作者: Polina Tsvilodub,Michael Franke,Fausto Carcassi
关键词: extent can LLMs, Dale Reiter, cognitive model, algorithmic cognitive model, language generation
中文关键词: LLM的程度,Dale Reiter,认知模型,算法认知模型,语言生成
类目: Computation and Language (cs.CL)
备注: 11 pages, 3 figures, 2 algorithms, to appear at the ICML 2024 workshop on Large Language Models and Cognition

点击查看摘要

Abstract:To what extent can LLMs be used as part of a cognitive model of language generation? In this paper, we approach this question by exploring a neuro-symbolic implementation of an algorithmic cognitive model of referential expression generation by Dale Reiter (1995). The symbolic task analysis implements the generation as an iterative procedure that scaffolds symbolic and gpt-3.5-turbo-based modules. We compare this implementation to an ablated model and a one-shot LLM-only baseline on the A3DS dataset (Tsvilodub Franke, 2023). We find that our hybrid approach is cognitively plausible and performs well in complex contexts, while allowing for more open-ended modeling of language generation in a larger domain.
摘要:LLM在多大程度上可以用作语言生成认知模型的一部分?在本文中,我们通过探索Dale Reiter(1995)的指代表达生成算法认知模型的神经符号实现来解决这个问题。符号任务分析将生成作为一个迭代过程来实现,该过程构建符号模块和基于GPT-3.5涡轮机的模块。我们将此实现与A3 DS数据集上的烧蚀模型和仅一次LLM基线进行比较(Tsvilodub Franke,2023)。我们发现,我们的混合方法在认知上是合理的,并且在复杂的环境中表现良好,同时允许在更大的领域中对语言生成进行更开放的建模。

[NLP-81] Mmathbf5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
[NLP-81] Mmathbf 5–评估大型多模式模型跨多语言和多文化视觉语言任务性能的多元化基准

链接: https://arxiv.org/abs/2407.03791
作者: Florian Schneider,Sunayana Sitaram
关键词: Natural Language Processing, experienced rapid advancements, Large Multimodal Models, Large Language Models, field of Natural
中文关键词: 自然语言处理,经历了快速发展,大型多模式模型,大型语言模型,自然领域
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs). Despite their impressive capabilities, LLMs often exhibit significant performance disparities across different languages and cultural contexts, as demonstrated by various text-only benchmarks. However, current research lacks such benchmarks for multimodal visio-linguistic settings. This work fills this gap by introducing M5, the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual and multicultural context. M5 includes eight datasets covering five tasks and 41 languages, with a focus on underrepresented languages and culturally diverse images. Furthermore, we introduce two novel datasets, M5-VGR and M5-VLOD, including a new Visio-Linguistic Outlier Detection task, in which all evaluated open-source models fail to significantly surpass the random baseline. Through extensive evaluation and analyses, we highlight substantial task-agnostic performance disparities between high- and low-resource languages. Moreover, we show that larger models do not necessarily outperform smaller ones in a multilingual setting.
摘要:自ChatGPT发布以来,自然语言处理领域取得了长足的进步,特别是在大语言模型(LLM)及其对应的多通道模型(LMM)方面。尽管LLM的能力令人印象深刻,但它们往往在不同的语言和文化背景下表现出显著的性能差异,正如各种纯文本基准所表明的那样。然而,目前的研究缺乏针对多通道视觉语言环境的基准。这项工作通过引入M5来填补这一空白,M5是第一个全面的基准,旨在评估多语言和多文化背景下不同视觉语言任务的LMM。M5包括8个数据集,涵盖5个任务和41种语言,重点是代表性较低的语言和文化多样性的图像。此外,我们引入了两个新的数据集,M5-VGR和M5-Vlod,其中包括一个新的Visio-Language孤立点检测任务,在该任务中,所有评估的开源模型都没有显著超过随机基线。通过广泛的评估和分析,我们强调了高资源和低资源语言之间显著的任务不可知性性能差异。此外,我们还表明,在多语言环境下,较大的模型并不一定优于较小的模型。

[NLP-82] Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
[NLP-82] 视频语言表示学习的元优化角度裕度对比框架

链接: https://arxiv.org/abs/2407.03788
作者: Thong Nguyen,Yi Bin,Xiaobao Wu,Xinshuai Dong,Zhiyuan Hu,Khoi Le,Cong-Duy Nguyen,See-Kiong Ng,Luu Anh Tuan
关键词: Data quality stands, video-language representation learning, quality stands, forefront of deciding, deciding the effectiveness
中文关键词: 数据质量站、视频语言表示学习、质量站、决策前沿、决策有效性
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights which enable dynamic adjustment of the model’s focus throughout the training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by large vision-language model, we improve video-language representations and achieve superior performances on commonly used video question answering and text-video retrieval datasets.
摘要:数据质量是决定视听语言表征学习效果的首要因素。然而,先前数据中的视频-文本对通常彼此不完全对齐,这可能导致视频-语言表示不能准确地反映跨模式语义。此外,以前的数据也具有概念的不均匀分布,从而阻碍了不受欢迎的主题的下游表现。为了解决这些问题,我们提出了一个具有减法角度余量的对比目标来规则化跨模式表示,以努力达到完美的相似性。此外,为了适应概念分布的不均匀,我们提出了一种多层感知器(MLP)参数化权重函数,该函数将损失值映射到样本权重,从而能够在整个训练过程中动态调整模型的焦点。该方法以少量无偏的元数据为指导,以大视觉语言模型生成的视频文本数据为补充,改进了视频语言表示,并在常用的视频问答和文本视频检索数据集上取得了较好的性能。

[NLP-83] Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning
[NLP-83] 野外功能忠实性:利用可微计算图修剪的电路发现

链接: https://arxiv.org/abs/2407.03779
作者: Lei Yu,Jingcheng Niu,Zining Zhu,Gerald Penn
关键词: Circuit Discovery, effective algorithm based, introduce a comprehensive, comprehensive reformulation, differentiable masking
中文关键词: 电路发现,基于有效的算法,引入全面、全面的重新公式化、可区分的掩蔽
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we introduce a comprehensive reformulation of the task known as Circuit Discovery, along with DiscoGP, a novel and effective algorithm based on differentiable masking for discovering circuits. Circuit discovery is the task of interpreting the computational mechanisms of language models (LMs) by dissecting their functions and capabilities into sparse subnetworks (circuits). We identified two major limitations in existing circuit discovery efforts: (1) a dichotomy between weight-based and connection-edge-based approaches forces researchers to choose between pruning connections or weights, thereby limiting the scope of mechanistic interpretation of LMs; (2) algorithms based on activation patching tend to identify circuits that are neither functionally faithful nor complete. The performance of these identified circuits is substantially reduced, often resulting in near-random performance in isolation. Furthermore, the complement of the circuit – i.e., the original LM with the identified circuit removed – still retains adequate performance, indicating that essential components of a complete circuits are missed by existing methods. DiscoGP successfully addresses the two aforementioned issues and demonstrates state-of-the-art faithfulness, completeness, and sparsity. The effectiveness of the algorithm and its novel structure open up new avenues of gathering new insights into the internal workings of generative AI. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2407.03779 [cs.CL] (or arXiv:2407.03779v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2407.03779 Focus to learn more arXiv-issued DOI via DataCite
摘要:在本文中,我们介绍了对电路发现任务的全面重新描述,以及基于可微掩码的发现电路的一种新颖而有效的算法DiscoGP。电路发现是通过将语言模型(LMS)的功能和能力分解为稀疏子网络(电路)来解释其计算机制的任务。我们确定了现有电路发现工作中的两个主要局限性:(1)基于权重和基于连接边的方法之间的二分法迫使研究人员在剪枝连接或权重之间进行选择,从而限制了LMS的机械解释的范围;(2)基于激活修补的算法倾向于识别功能上既不忠实也不完整的电路。这些已识别电路的性能大大降低,通常导致隔离时近乎随机的性能。此外,电路的补集–即删除了所识别的电路的原始LM–仍然保持足够的性能,这表明现有方法遗漏了完整电路的基本组件。DiscoGP成功地解决了上述两个问题,并展示了最先进的忠实性、完整性和稀疏性。该算法的有效性及其新颖的结构为收集对生成性人工智能内部工作原理的新见解开辟了新的途径。主题:计算与语言(cs.CL);机器学习(cs.LG)引用AS:arxiv:2407.03779cs.CLhttps://doi.org/10.48550/arXiv.2407.03779 Focus通过DataCite了解更多arxiv发布的指令

[NLP-84] From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI
[NLP-84] 从数据到常识推理:使用大型语言模型进行可解释人工智能

链接: https://arxiv.org/abs/2407.03778
作者: Stefanie Krause,Frieder Stolzenburg
关键词: critical skill, Commonsense reasoning, difficult task, Llama, reasoning
中文关键词: 关键技能,常识推理,困难任务,骆驼,推理
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 19 pages

点击查看摘要

Abstract:Commonsense reasoning is a difficult task for a computer, but a critical skill for an artificial intelligence (AI). It can enhance the explainability of AI models by enabling them to provide intuitive and human-like explanations for their decisions. This is necessary in many areas especially in question answering (QA), which is one of the most important tasks of natural language processing (NLP). Over time, a multitude of methods have emerged for solving commonsense reasoning problems such as knowledge-based approaches using formal logic or linguistic analysis. In this paper, we investigate the effectiveness of large language models (LLMs) on different QA tasks with a focus on their abilities in reasoning and explainability. We study three LLMs: GPT-3.5, Gemma and Llama 3. We further evaluate the LLM results by means of a questionnaire. We demonstrate the ability of LLMs to reason with commonsense as the models outperform humans on different datasets. While GPT-3.5’s accuracy ranges from 56% to 93% on various QA benchmarks, Llama 3 achieved a mean accuracy of 90% on all eleven datasets. Thereby Llama 3 is outperforming humans on all datasets with an average 21% higher accuracy over ten datasets. Furthermore, we can appraise that, in the sense of explainable artificial intelligence (XAI), GPT-3.5 provides good explanations for its decisions. Our questionnaire revealed that 66% of participants rated GPT-3.5’s explanations as either “good” or “excellent”. Taken together, these findings enrich our understanding of current LLMs and pave the way for future investigations of reasoning and explainability.
摘要:常识推理对计算机来说是一项艰巨的任务,但对人工智能(AI)来说却是一项关键技能。它可以增强人工智能模型的可解释性,使他们能够为他们的决定提供直观和人性化的解释。这在许多领域都是必要的,特别是问答,这是自然语言处理(NLP)最重要的任务之一。随着时间的推移,出现了许多解决常识推理问题的方法,例如使用形式逻辑或语言分析的基于知识的方法。在本文中,我们考察了大语言模型在不同问答任务上的有效性,重点是它们的推理和解释能力。我们研究了三个最小二乘模型:GPT-3.5、Gema和Llama 3,并通过问卷调查的方式进一步评价了最小二乘的结果。我们展示了LLM的常识推理能力,因为这些模型在不同的数据集上比人类更好。在不同的质量保证基准上,GPT-3.5‘S的准确率从56%到93%不等,而Llama 3在所有11个数据集上的平均准确率达到了90%。因此,Llama 3在所有数据集上的表现都优于人类,在10个数据集上的准确率平均高出21%。此外,我们可以评价,在可解释人工智能(XAI)的意义上,GPT-3.5为其决策提供了很好的解释。问卷调查显示,66%的受试者对GPT-3.5中的S解释给予“良好”或“优秀”的评价。综上所述,这些发现丰富了我们对当前LLM的理解,并为未来的推理和可解释性研究铺平了道路。

[NLP-85] HYBRINFOX at CheckThat! 2024 – Task 2: Enriching BERT Models with the Expert System VAGO for Subjectivity Detection
[NLP-85] hyBRINFOX在CheckThat!2024年–任务2:利用主观性检测专家系统VaGO丰富BERT模型

链接: https://arxiv.org/abs/2407.03770
作者: Morgane Casanova,Julien Chanson,Benjamin Icard,Géraud Faye,Guillaume Gadek,Guillaume Gravier,Paul Égré
关键词: HYBRINFOX method, paper presents, presents the HYBRINFOX, Subjectivity detection, HYBRINFOX method ranked
中文关键词: hyBRINFOX方法,论文介绍,介绍hyBRINFOX,主观性检测,hyBRINFOX方法排名
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the Conference and Labs of the Evaluation Forum (CLEF 2024 CheckThat!)

点击查看摘要

Abstract:This paper presents the HYBRINFOX method used to solve Task 2 of Subjectivity detection of the CLEF 2024 CheckThat! competition. The specificity of the method is to use a hybrid system, combining a RoBERTa model, fine-tuned for subjectivity detection, a frozen sentence-BERT (sBERT) model to capture semantics, and several scores calculated by the English version of the expert system VAGO, developed independently of this task to measure vagueness and subjectivity in texts based on the lexicon. In English, the HYBRINFOX method ranked 1st with a macro F1 score of 0.7442 on the evaluation data. For the other languages, the method used a translation step into English, producing more mixed results (ranking 1st in Multilingual and 2nd in Italian over the baseline, but under the baseline in Bulgarian, German, and Arabic). We explain the principles of our hybrid approach, and outline ways in which the method could be improved for other languages besides English.
摘要:本文介绍了用于解决CREF 2024 CheckThat!主观性检测任务2的hyBRINFOX方法!竞争该方法的特殊性是使用混合系统,结合了针对主观性检测进行微调的RoBERTa模型、用于捕获语义的冻结时间-BERT(sBERT)模型,以及由专家系统VaGO的英文版本计算的几个分数,该系统独立于此任务开发,以基于词典来测量文本中的模糊性和主观性。在英语中,hyBRINFOX方法在评估数据上排名第一,宏F1评分为0.7442。对于其他语言,该方法使用了英语的翻译步骤,产生了更加混合的结果(多语言中排名第一,意大利语中排名第二,但保加利亚语、德语和阿拉伯语低于基线)。我们解释了混合方法的原则,并概述了可以针对英语以外的其他语言改进该方法的方法。

[NLP-86] Convolutional vs Large Language Models for Software Log Classification in Edge-Deployable Cellular Network Testing
[NLP-86] 边缘可部署蜂窝网络测试中软件日志分类的卷积与大型语言模型

链接: https://arxiv.org/abs/2407.03759
作者: Achintha Ihalage,Sayed M. Taheri,Faris Muhammad,Hamed Al-Raweshidy
关键词: sophisticated network emulators, extremely complex, generated by sophisticated, comprising tens, tens of thousands
中文关键词: 复杂的网络模拟器,极其复杂,由复杂的生成,包括数万个
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Software logs generated by sophisticated network emulators in the telecommunications industry, such as VIAVI TM500, are extremely complex, often comprising tens of thousands of text lines with minimal resemblance to natural language. Only specialised expert engineers can decipher such logs and troubleshoot defects in test runs. While AI offers a promising solution for automating defect triage, potentially leading to massive revenue savings for companies, state-of-the-art large language models (LLMs) suffer from significant drawbacks in this specialised domain. These include a constrained context window, limited applicability to text beyond natural language, and high inference costs. To address these limitations, we propose a compact convolutional neural network (CNN) architecture that offers a context window spanning up to 200,000 characters and achieves over 96% accuracy (F10.9) in classifying multifaceted software logs into various layers in the telecommunications protocol stack. Specifically, the proposed model is capable of identifying defects in test runs and triaging them to the relevant department, formerly a manual engineering process that required expert knowledge. We evaluate several LLMs; LLaMA2-7B, Mixtral 8x7B, Flan-T5, BERT and BigBird, and experimentally demonstrate their shortcomings in our specialized application. Despite being lightweight, our CNN significantly outperforms LLM-based approaches in telecommunications log classification while minimizing the cost of production. Our defect triaging AI model is deployable on edge devices without dedicated hardware and widely applicable across software logs in various industries.
摘要:在电信行业中,由Viavi TM500等复杂的网络仿真器生成的软件日志极其复杂,通常由数万行文本组成,与自然语言几乎没有相似之处。只有专业的专家工程师才能破译这些日志,并解决测试运行中的缺陷。虽然人工智能为自动化缺陷分类提供了一个有前途的解决方案,可能会为公司节省大量收入,但最先进的大型语言模型(LLM)在这个专业领域存在重大缺陷。这些问题包括受限的上下文窗口、对自然语言以外的文本的有限适用性以及高昂的推理成本。为了解决这些限制,我们提出了一种紧凑型卷积神经网络(CNN)体系结构,该体系结构提供了跨度高达200,000个字符的上下文窗口,并且在将多方面软件日志分类到电信协议堆栈的不同层中时实现了96%以上的准确率(F10.9)。具体地说,拟议的模型能够识别测试运行中的缺陷,并将它们分类给相关部门,以前是一个需要专业知识的人工工程过程。我们评估了几种LLMS:LLaMA2-7B、Mixtral 8x7B、Flan-T5、BERT和BigBird,并通过实验证明了它们在专业应用中的不足。尽管我们的CNN是轻量级的,但在电信日志分类方面显著优于基于LLM的方法,同时将生产成本降至最低。我们的缺陷分类AI模型可以部署在没有专用硬件的边缘设备上,并广泛适用于各个行业的软件日志。

[NLP-87] Argument Mining in Data Scarce Settings: Cross-lingual Transfer and Few-shot Techniques
[NLP-87] 数据稀缺环境下的论点挖掘:跨语言迁移和少镜头技术

链接: https://arxiv.org/abs/2407.03748
作者: Anar Yeginbergen,Maite Oronoz,Rodrigo Agerri
关键词: manually annotated data, Recent research, pre-trained language models, Argument Mining data, language models
中文关键词: 手动注释数据、最近的研究、预训练的语言模型、论据挖掘数据、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research on sequence labelling has been exploring different strategies to mitigate the lack of manually annotated data for the large majority of the world languages. Among others, the most successful approaches have been based on (i) the cross-lingual transfer capabilities of multilingual pre-trained language models (model-transfer), (ii) data translation and label projection (data-transfer) and (iii), prompt-based learning by reusing the mask objective to exploit the few-shot capabilities of pre-trained language models (few-shot). Previous work seems to conclude that model-transfer outperforms data-transfer methods and that few-shot techniques based on prompting are superior to updating the model’s weights via fine-tuning. In this paper, we empirically demonstrate that, for Argument Mining, a sequence labelling task which requires the detection of long and complex discourse structures, previous insights on cross-lingual transfer or few-shot learning do not apply. Contrary to previous work, we show that for Argument Mining data transfer obtains better results than model-transfer and that fine-tuning outperforms few-shot methods. Regarding the former, the domain of the dataset used for data-transfer seems to be a deciding factor, while, for few-shot, the type of task (length and complexity of the sequence spans) and sampling method prove to be crucial.
摘要:最近关于序列标注的研究一直在探索不同的策略来缓解大多数世界语言缺乏人工标注数据的问题。其中,最成功的方法是:(1)多语种预先训练的语言模型的跨语言迁移能力(模型迁移);(2)数据翻译和标签投影(数据迁移);(3)基于提示的学习,通过重复使用掩膜目标来利用预先培训的语言模型的极少成功的能力(极少成功)。以前的工作似乎得出结论,模型传输优于数据传输方法,基于提示的少镜头技术优于通过微调更新模型的权重。在本文中,我们实证地证明,对于论元挖掘这项需要检测冗长而复杂的语篇结构的序列标注任务,以前关于跨语言迁移或少数几次学习的见解并不适用。与前人的工作相反,我们证明了对于参数挖掘,数据传输获得了比模型传输更好的结果,并且微调方法的性能优于少数几种方法。关于前者,用于数据传输的数据集的域似乎是一个决定因素,而对于极少数情况,任务的类型(序列跨度的长度和复杂性)和抽样方法被证明是关键。

[NLP-88] Improving Self-supervised Pre-training using Accent-Specific Codebooks
[NLP-88] 使用特定口音的代码簿改进自我监督的预培训

链接: https://arxiv.org/abs/2407.03734
作者: Darshan Prabhu,Abhishek Gupta,Omkar Nitsure,Preethi Jyothi,Sriram Ganapathy
关键词: Automatic Speech Recognition, Automatic Speech, Speech Recognition, Speech accents present, Speech
中文关键词: 自动语音识别,自动语音,语音识别,存在的语音口音,语音
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:Speech accents present a serious challenge to the performance of state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems. Even with self-supervised learning and pre-training of ASR models, accent invariance is seldom achieved. In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. These learnable codebooks enable the model to capture accent specific information during pre-training, that is further refined during ASR finetuning. On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches on both seen and unseen English accents, with up to 9% relative reduction in word error rate (WER).
摘要:语音口音对最先进的端到端自动语音识别(ASB)系统的性能提出了严重挑战。即使使用自监督学习和ASB模型的预训练,也很少能实现口音不变性。在这项工作中,我们提出了一种用于自我监督学习的口音感知适应技术,该技术将一组可训练的特定口音码本引入到自我监督架构中。这些可学习的码本使模型能够在预训练期间捕获特定于口音的信息,并在ASB微调期间进一步细化该信息。在Firefox Common Voice数据集上,我们提出的方法在可见和不可见的英语口音上优于所有其他口音适应方法,字错误率(WER)相对降低了高达9%。

[NLP-89] Query-oriented Data Augmentation for Session Search
[NLP-89] 面向查询的会话搜索数据增强

链接: https://arxiv.org/abs/2407.03720
作者: Haonan Chen,Zhicheng Dou,Yutao Zhu,Ji-Rong Wen
关键词: complex user intents, Modeling contextual information, understanding complex user, search, user intents
中文关键词: 复杂用户意图、建模上下文信息、理解复杂用户、搜索、用户意图
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: TKDE 2024

点击查看摘要

Abstract:Modeling contextual information in a search session has drawn more and more attention when understanding complex user intents. Recent methods are all data-driven, i.e., they train different models on large-scale search log data to identify the relevance between search contexts and candidate documents. The common training paradigm is to pair the search context with different candidate documents and train the model to rank the clicked documents higher than the unclicked ones. However, this paradigm neglects the symmetric nature of the relevance between the session context and document, i.e., the clicked documents can also be paired with different search contexts when training. In this work, we propose query-oriented data augmentation to enrich search logs and empower the modeling. We generate supplemental training pairs by altering the most important part of a search context, i.e., the current query, and train our model to rank the generated sequence along with the original sequence. This approach enables models to learn that the relevance of a document may vary as the session context changes, leading to a better understanding of users’ search patterns. We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty. Through experimentation on two extensive public search logs, we have successfully demonstrated the effectiveness of our model.
摘要:在理解复杂的用户意图时,对搜索会话中的上下文信息进行建模越来越受到人们的关注。最近的方法都是数据驱动的,即它们在大规模搜索日志数据上训练不同的模型来识别搜索上下文和候选文档之间的相关性。常见的训练范式是将搜索上下文与不同的候选文档配对,并训练模型将点击的文档排序高于未点击的文档。然而,这种范式忽略了会话上下文和文档之间相关性的对称性,即在训练时,点击的文档也可以与不同的搜索上下文配对。在这项工作中,我们提出了面向查询的数据扩充,以丰富搜索日志并增强建模能力。我们通过改变搜索上下文的最重要部分,即当前查询来生成补充训练对,并训练我们的模型来将生成的序列与原始序列一起排序。这种方法使模型能够了解到,文档的相关性可能会随着会话上下文的变化而变化,从而更好地理解用户的搜索模式。我们开发了几种策略来改变当前的查询,导致新的训练数据具有不同程度的难度。通过在两个广泛的公共搜索日志上的实验,我们成功地证明了该模型的有效性。

[NLP-90] Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
[NLP-90] Multi-Conformer:用多个卷积核扩展Conformer

链接: https://arxiv.org/abs/2407.03718
作者: Darshan Prabhu,Yifan Peng,Preethi Jyothi,Shinji Watanabe
关键词: Automatic Speech Recognition, Automatic Speech, Speech Recognition, Transformer-based ASR systems, vanilla Transformer-based ASR
中文关键词: 自动语音识别,自动语音,语音识别,基于转换器的ASB系统,香草基于转换器的ASB
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition~(ASR) systems due to their efficient modelling of local context. Notably, its use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. Towards this, we introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate~(WER) improvements.
摘要:卷积在最先进的端到端自动语音识别(ASR)系统中因其对局部上下文的有效建模而变得至关重要。值得注意的是,与基于香草变压器的ASR系统相比,它在变形器中的使用带来了卓越的性能。虽然已经重新检查了变形器中卷积模块以外的部件,但改变卷积模块本身的探索要少得多。为此,我们引入了多重变形器,它在变形器的卷积模块中结合门控使用多个卷积核。这有助于改进对不同粒度的本地依赖项的建模。我们的模型在性能上可与现有的CgMLP和E-Branchform等同形变体相媲美,同时参数效率更高。我们在四个不同的数据集和三个不同的建模范例上将我们的方法与Conform及其变体进行了实证比较,结果显示相对单词错误率~(WER)提高了8%。

[NLP-91] xt2TimeSeries: Enhancing Financial Forecasting through Time Series Prediction Updates with Event-Driven Insights from Large Language Models
[NLP-91] xt2TimeSeries:通过时间序列预测更新以及大型语言模型的事件驱动洞察来增强财务预测

链接: https://arxiv.org/abs/2407.03689
作者: Litton Jose Kurisinkel,Pruthwik Mishra,Yue Zhang
关键词: typically trained, trained on numerical, Time series, Time, series
中文关键词: 通常训练,接受数字训练,时间序列,时间,序列
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 12 figures

点击查看摘要

Abstract:Time series models, typically trained on numerical data, are designed to forecast future values. These models often rely on weighted averaging techniques over time intervals. However, real-world time series data is seldom isolated and is frequently influenced by non-numeric factors. For instance, stock price fluctuations are impacted by daily random events in the broader world, with each event exerting a unique influence on price signals. Previously, forecasts in financial markets have been approached in two main ways: either as time-series problems over price sequence or sentiment analysis tasks. The sentiment analysis tasks aim to determine whether news events will have a positive or negative impact on stock prices, often categorizing them into discrete labels. Recognizing the need for a more comprehensive approach to accurately model time series prediction, we propose a collaborative modeling framework that incorporates textual information about relevant events for predictions. Specifically, we leverage the intuition of large language models about future changes to update real number time series predictions. We evaluated the effectiveness of our approach on financial market data.
摘要:时间序列模型通常以数字数据为基础进行训练,旨在预测未来的价值。这些模型通常依赖于时间间隔的加权平均技术。然而,现实世界的时间序列数据很少是孤立的,而且经常受到非数值因素的影响。例如,股票价格波动受到更广泛世界中每日随机事件的影响,每个事件对价格信号产生独特的影响。此前,金融市场的预测主要有两种方式:要么是价格序列的时间序列问题,要么是情绪分析任务。情绪分析任务旨在确定新闻事件将对股价产生积极还是负面影响,通常将它们归类为不同的标签。认识到需要一种更全面的方法来准确建模时间序列预测,我们提出了一个协作建模框架,该框架结合了用于预测的相关事件的文本信息。具体地说,我们利用大型语言模型对未来变化的直觉来更新实数时间序列预测。我们评估了我们方法在金融市场数据上的有效性。

[NLP-92] STOC-TOT: Stochastic Tree-of-Thought with Constrained Decoding for Complex Reasoning in Multi-Hop Question Answering
[NLP-92] STOC-TOT:具有约束解码的随机思维树,用于多跳问题回答中的复杂推理

链接: https://arxiv.org/abs/2407.03687
作者: Zhenyu Bi,Daniel Hajialigol,Zhongkai Sun,Jie Hao,Xuan Wang
关键词: Multi-hop question answering, Multi-hop question, reasoning, MHQA, information from multiple
中文关键词: 多跳问题回答、多跳问题、推理、MHQA、来自多个的信息
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Multi-hop question answering (MHQA) requires a model to retrieve and integrate information from multiple passages to answer a complex question. Recent systems leverage the power of large language models and integrate evidence retrieval with reasoning prompts (e.g., chain-of-thought reasoning) for the MHQA task. However, the complexities in the question types (bridge v.s. comparison questions) and the reasoning types (sequential v.s. parallel reasonings) require more novel and fine-grained prompting methods to enhance the performance of MHQA under the zero-shot setting. In this paper, we propose STOC-TOT, a stochastic tree-of-thought reasoning prompting method with constrained decoding for MHQA and conduct a detailed comparison with other reasoning prompts on different question types and reasoning types. Specifically, we construct a tree-like reasoning structure by prompting the model to break down the original question into smaller sub-questions to form different reasoning paths. In addition, we prompt the model to provide a probability estimation for each reasoning path at each reasoning step. At answer time, we conduct constrained decoding on the model to generate more grounded answers and reduce hallucination. Experiments comparing STOC-TOT with two MHQA datasets and five large language models showed that our framework outperforms other reasoning prompts by a significant margin.
摘要:多跳问答(MHQA)需要一个模型来检索和整合多篇文章中的信息来回答一个复杂的问题。最近的系统利用大型语言模型的能力,并将证据检索与MHQA任务的推理提示(例如,思想链推理)相结合。然而,问题类型(桥梁问题与对比问题)和推理类型(顺序推理与并行推理)的复杂性要求更新颖、更细粒度的提示方法来提高MHQA在零命中设置下的表现。本文针对MHQA提出了一种带约束解码的随机思维树推理提示方法STOC-TOT,并在不同的问题类型和推理类型上与其他推理提示方法进行了详细的比较。具体地说,我们通过促使模型将原始问题分解成更小的子问题来形成不同的推理路径,从而构建了一个树形推理结构。此外,我们还提示模型在每个推理步骤中为每条推理路径提供概率估计。在回答时,我们对模型进行约束解码,以生成更多接地的答案,减少幻觉。用两个MHQA数据集和五个大型语言模型对STOC-TOT进行的实验表明,我们的框架比其他推理提示有明显的优势。

[NLP-93] Improving Self Consistency in LLMs through Probabilistic Tokenization
[NLP-93] 通过概率令牌化提高LLM的自我一致性

链接: https://arxiv.org/abs/2407.03678
作者: Ashutosh Sathe,Divyanshu Aggarwal,Sunayana Sitaram
关键词: demonstrated noticeable performance, noticeable performance gains, Prior research, involves employing multiple, large language models
中文关键词: 表现出显着的性能、显着的性能提升,之前的研究涉及使用多个大型语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2024 Workshop on LLMs and Cognition

点击查看摘要

Abstract:Prior research has demonstrated noticeable performance gains through the use of probabilistic tokenizations, an approach that involves employing multiple tokenizations of the same input string during the training phase of a language model. Despite these promising findings, modern large language models (LLMs) have yet to be trained using probabilistic tokenizations. Interestingly, while the tokenizers of these contemporary LLMs have the capability to generate multiple tokenizations, this property remains underutilized. In this work, we propose a novel method to leverage the multiple tokenization capabilities of modern LLM tokenizers, aiming to enhance the self-consistency of LLMs in reasoning tasks. Our experiments indicate that when utilizing probabilistic tokenizations, LLMs generate logically diverse reasoning paths, moving beyond mere surface-level linguistic diversity.We carefully study probabilistic tokenization and offer insights to explain the self consistency improvements it brings through extensive experimentation on 5 LLM families and 4 reasoning benchmarks. Comments: ICML 2024 Workshop on LLMs and Cognition Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2407.03678 [cs.CL] (or arXiv:2407.03678v1 [cs.CL] for this version)
摘要:以前的研究已经证明,通过使用概率标记化可以显著提高性能,概率标记化是一种涉及在语言模型的训练阶段使用同一输入字符串的多个标记化的方法。尽管有这些有希望的发现,但现代大型语言模型(LLM)尚未使用概率标记化进行训练。有趣的是,尽管这些当代LLM的标记器具有生成多个标记化的能力,但这一特性仍然未得到充分利用。在这项工作中,我们提出了一种新的方法来利用现代LLM标记器的多重标记化能力,旨在增强LLM在推理任务中的自相合性。我们的实验表明,当使用概率标记化时,LLMS生成逻辑上不同的推理路径,超越了仅仅是表层的语言差异。我们仔细研究了概率标记化,并通过在5个LLM族和4个推理基准上的广泛实验来解释它带来的自我一致性的改善。评论:ICML2024年关于LLMS和认知主题的研讨会:计算和语言(cs.CL);机器学习(cs.LG)引用为:arxiv:2407.03678cs.CL

[NLP-94] GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages Domains and Expertise Levels
[NLP-94] GPT-4与人类翻译:跨语言领域和专业知识水平翻译质量的全面评估

链接: https://arxiv.org/abs/2407.03658
作者: Jianhao Yan,Pingchuan Yan,Yulong Chen,Judy Li,Xianchao Zhu,Yue Zhang
关键词: Large Language Models, varying expertise levels, multiple language pairs, quality of Large, Language Models
中文关键词: 大型语言模型、不同的专业知识水平、多种语言对、大型语言模型的质量
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also observe the imbalanced performance across different languages and domains, with GPT-4’s translation capability gradually weakening from resource-rich to resource-poor directions. In addition, we qualitatively study the translation given by GPT-4 and human translators, and find that GPT-4 translator suffers from literal translations, but human translators sometimes overthink the background information. To our knowledge, this study is the first to evaluate LLMs against human translators and analyze the systematic differences between their outputs, providing valuable insights into the current state of LLM-based translation and its potential limitations.
摘要:这项研究综合评估了大型语言模型(LLM)的翻译质量,特别是GPT-4,对比了不同专业水平的人工翻译人员对多个语言对和领域的翻译质量。通过精心设计的注释轮,我们发现GPT-4在总体错误方面与初级译者不相上下,但落后于中、高级译者。我们还观察到了不同语言和领域之间的不平衡表现,从资源丰富的方向到资源贫乏的方向,GPT-4的S的翻译能力逐渐减弱。此外,我们定性地研究了GPT-4和人类翻译者给出的翻译,发现GPT-4译者存在直译的问题,但人类译者有时会过度考虑背景信息。据我们所知,这项研究是第一次评估LLMS与人工译者之间的差异,并分析他们的输出之间的系统差异,为基于LLM的翻译的现状及其潜在的局限性提供了有价值的见解。

[NLP-95] Evaluating Language Model Context Windows: A “Working Memory” Test and Inference-time Correction
[NLP-95] 评估语言模型上下文窗口:“工作记忆”测试和推理时纠正

链接: https://arxiv.org/abs/2407.03651
作者: Amanda Dsouza,Christopher Glaze,Changho Shin,Frederic Sala
关键词: Large language models, Large language, large volumes, real-world applications, tasked with reasoning
中文关键词: 大型语言模型,大型语言,大量,现实世界的应用程序,负责推理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents. An exciting development in this space is models boasting extended context capabilities, with some accommodating over 2 million tokens. Such long context model capabilities remain uncertain in production systems, motivating the need to benchmark their performance on real world use cases. We address this challenge by proposing SWiM, an evaluation framework that addresses the limitations of standard tests. Testing the framework on eight long context models, we find that even strong models such as GPT-4 and Claude 3 Opus degrade in performance when information is present in the middle of the context window (lost-in-the-middle effect). Next, in addition to our benchmark, we propose medoid voting, a simple, but effective training-free approach that helps alleviate this effect, by generating responses a few times, each time randomly permuting documents in the context, and selecting the medoid answer. We evaluate medoid voting on single document QA tasks, achieving up to a 24% lift in accuracy.
摘要:大型语言模型主要用于现实世界的应用程序,通常负责对大量文档进行推理。这一领域的一个令人兴奋的发展是拥有扩展的上下文功能的型号,其中一些型号可以容纳200多万个令牌。这样的长上下文模型能力在生产系统中仍然不确定,这促使需要根据真实世界的用例对其性能进行基准测试。我们通过提出SWIM来应对这一挑战,这是一个解决标准测试局限性的评估框架。在八个长上下文模型上测试该框架,我们发现即使是GPT-4和Claude 3Opus这样的强模型,当信息存在于上下文窗口的中间时,性能也会下降(丢失在中间的效应)。接下来,除了我们的基准,我们还提出了Medoid投票,这是一种简单但有效的免培训方法,通过生成几次响应来帮助缓解这种影响,每次都随机排列上下文中的文档,并选择Medoid答案。我们对单文档QA任务进行了Medoid投票评估,实现了高达24%的准确性提升。

[NLP-96] Differentiating between human-written and AI-generated texts using linguistic features automatically extracted from an online computational tool
[NLP-96] 使用从在线计算工具自动提取的语言特征区分人类书写的文本和人工智能生成的文本

链接: https://arxiv.org/abs/2407.03646
作者: Georgios P. Georgiou
关键词: Artificial Intelligence, human-written and Artificial, compared linguistic features, recent years, extensive research
中文关键词: 人工智能,人工书写和人工,比较了语言特征,近年来,广泛研究
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While extensive research has focused on ChatGPT in recent years, very few studies have systematically quantified and compared linguistic features between human-written and Artificial Intelligence (AI)-generated language. This study aims to investigate how various linguistic components are represented in both types of texts, assessing the ability of AI to emulate human writing. Using human-authored essays as a benchmark, we prompted ChatGPT to generate essays of equivalent length. These texts were analyzed using Open Brain AI, an online computational tool, to extract measures of phonological, morphological, syntactic, and lexical constituents. Despite AI-generated texts appearing to mimic human speech, the results revealed significant differences across multiple linguistic features such as consonants, word stress, nouns, verbs, pronouns, direct objects, prepositional modifiers, and use of difficult words among others. These findings underscore the importance of integrating automated tools for efficient language assessment, reducing time and effort in data analysis. Moreover, they emphasize the necessity for enhanced training methodologies to improve the capacity of AI for producing more human-like text.
摘要:虽然近年来对ChatGPT进行了广泛的研究,但很少有研究系统地量化和比较人类书写的语言和人工智能生成的语言的语言特征。这项研究旨在调查不同的语言成分在两种类型的文本中是如何表现的,评估人工智能模仿人类写作的能力。以人类创作的文章为基准,我们促使ChatGPT生成同等长度的文章。使用在线计算工具Open Brain AI对这些文本进行分析,以提取语音、形态、句法和词汇成分的测量。尽管人工智能生成的文本似乎模仿了人类的语音,但结果显示,辅音、重音、名词、动词、代词、直接宾语、介词修饰语和使用难词等多种语言特征存在显著差异。这些发现强调了整合自动化工具以有效地进行语言评估、减少数据分析的时间和精力的重要性。此外,他们强调有必要加强培训方法,以提高人工智能产生更多类似人类的文本的能力。

[NLP-97] Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems
[NLP-97] 多语言ASB系统自回归解码器的持续学习优化

链接: https://arxiv.org/abs/2407.03645
作者: Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng
关键词: involves fine-tuning pre-trained, fine-tuning pre-trained models, pre-trained data, Continual Learning, fine-tuning pre-trained
中文关键词: 涉及微调预训练、微调预训练模型、预训练数据、持续学习、微调预训练
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Continual Learning (CL) involves fine-tuning pre-trained models with new data while maintaining the performance on the pre-trained data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when directly applied to MASR. We hypothesise that this is because CL of the auto-regressive decoder in the MASR model is difficult. To verify this, we propose four optimizations on the decoder. They include decoder-layer gradient surgery, freezing unused token embeddings, suppressing output of newly added tokens, and learning rate re-scaling. Our experiments on adapting Whisper to 10 unseen languages from the Common Voice dataset demonstrate that these optimizations reduce the Average Word Error Rate (AWER) of pretrained languages from 14.2% to 12.4% compared with Experience Replay, without compromising the AWER of new languages.
摘要:连续学习(CL)涉及使用新数据微调预训练模型,同时保持预训练数据的性能。这对于扩展多语言ASB(MASR)功能尤其重要。然而,现有的CL方法主要为计算机视觉和强化学习任务设计,当直接应用于MASR时,通常会产生次优的结果。我们假设这是因为MASR模型中自回归解码器的CL很困难。为了验证这一点,我们对解码器提出了四种优化。它们包括解码器层梯度手术、冻结未使用的令牌嵌入、抑制新添加的令牌的输出以及学习率重新缩放。我们将Whisper适应Common Voice数据集中的10种未见过语言的实验表明,与Experience Replay相比,这些优化将预训练语言的平均字错误率(AWER)从14.2%降低到12.4%,而不会损害新语言的AWER。

[NLP-98] Generative Technology for Human Emotion Recognition: A Scope Review
[NLP-98] 人类情感识别的生成技术:范围回顾

链接: https://arxiv.org/abs/2407.03640
作者: Fei Ma,Yucheng Yuan,Yifan Xie,Hongwei Ren,Ivan Liu,Ying He,Fuji Ren,Fei Richard Yu,Shiguang Ni
关键词: Affective computing stands, Affective computing, emotion recognition, seeking to imbue, Large Language Model
中文关键词: 情感计算立场、情感计算、情感识别、寻求灌输、大型语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progress has been made in generative models, including Autoencoder, Generative Adversarial Network, Diffusion Model, and Large Language Model. These models, with their powerful data generation capabilities, emerge as pivotal tools in advancing emotion recognition. However, up to now, there remains a paucity of systematic efforts that review generative technology for emotion recognition. This survey aims to bridge the gaps in the existing literature by conducting a comprehensive analysis of over 320 research papers until June 2024. Specifically, this survey will firstly introduce the mathematical principles of different generative models and the commonly used datasets. Subsequently, through a taxonomy, it will provide an in-depth analysis of how generative techniques address emotion recognition based on different modalities in several aspects, including data augmentation, feature extraction, semi-supervised learning, cross-domain, etc. Finally, the review will outline future research directions, emphasizing the potential of generative models to advance the field of emotion recognition and enhance the emotional intelligence of AI systems.
摘要:情感计算站在人工智能(AI)的前沿,试图赋予机器理解和响应人类情感的能力。这一领域的核心是情感识别,它努力从不同的模式识别和解释人类的情感状态,如语音、面部图像、文本和生理信号。近年来,生成模型的研究取得了重要进展,包括自动编码、生成对抗网络、扩散模型、大语言模型等。这些模型具有强大的数据生成能力,是推进情感识别的关键工具。然而,到目前为止,仍然缺乏系统的努力来审查用于情感识别的生成性技术。这项调查旨在通过对截至2024年6月的320多篇研究论文进行全面分析,弥合现有文献中的差距。具体地说,本综述将首先介绍不同生成模型的数学原理和常用的数据集。随后,通过分类,从数据增强、特征提取、半监督学习、跨域等几个方面深入分析了产生式技术如何解决基于不同模式的情绪识别问题。最后,综述将概述未来的研究方向,强调产生式模型在推进情绪识别领域和提高人工智能系统情绪智力方面的潜力。

[NLP-99] HERA: High-efficiency Matrix Compression via Element Replacement
[NLP-99] HERA:通过元素替换实现高效矩阵压缩

链接: https://arxiv.org/abs/2407.03637
作者: Yanshu Wang,Wang Li,Tong Yang
关键词: advanced natural language, significantly advanced natural, text generation, natural language processing, language processing tasks
中文关键词: 高级自然语言、显着高级自然、文本生成、自然语言处理、语言处理任务
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. However, their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment, particularly in resource-constrained environments like mobile devices and edge computing platforms. Additionally, the key-value (k-v) cache used to speed up query processing requires substantial memory and storage, exacerbating these challenges. Vector databases have emerged as a crucial technology to efficiently manage and retrieve the high-dimensional vectors produced by LLMs, facilitating faster data access and reducing computational demands. Effective compression and quantization techniques are essential to address these challenges, as they reduce the memory footprint and computational requirements without significantly compromising performance. Traditional methods that uniformly map parameters to compressed spaces often fail to account for the uneven distribution of parameters, leading to considerable accuracy loss. Therefore, innovative approaches are needed to achieve better compression ratios while preserving model performance. In this work, we propose HERA, a novel algorithm that employs heuristic Element Replacement for compressing matrix. HERA systematically replaces elements within the model using heuristic methods, which simplifies the structure of the model and makes subsequent compression more effective. By hierarchically segmenting, compressing, and reorganizing the matrix dataset, our method can effectively reduce the quantization error to 12.3% of the original at the same compression ratio. Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) Cite as: arXiv:2407.03637 [cs.LG] (or arXiv:2407.03637v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2407.03637 Focus to learn more arXiv-issued DOI via DataCite
摘要:大型语言模型对机器翻译、文本生成和情感分析等自然语言处理任务具有重要的推动作用。然而,它们通常由数十亿个参数组成的庞大规模给存储、计算和部署带来了挑战,特别是在移动设备和边缘计算平台等资源受限的环境中。此外,用于加速查询处理的键值(k-v)缓存需要大量内存和存储,这加剧了这些挑战。矢量数据库已经成为有效管理和检索LLMS产生的高维矢量的关键技术,有助于更快地访问数据并减少计算需求。有效的压缩和量化技术对于解决这些挑战至关重要,因为它们在不显著影响性能的情况下减少了内存占用和计算需求。传统的将参数均匀映射到压缩空间的方法往往不能考虑参数的不均匀分布,导致相当大的精度损失。因此,需要创新的方法来实现更好的压缩比,同时保持模型的性能。在这项工作中,我们提出了一种新的算法HERA,它使用启发式元素替换来压缩矩阵。HERA使用启发式方法系统地替换模型中的元素,这简化了模型的结构,并使后续的压缩更加有效。通过对矩阵数据集进行分层分割、压缩和重组,在相同的压缩比下,量化误差可以有效地降低到原来的12.3%。主题:机器学习(cs.lg);计算与语言(cs.CL)引用为:arxiv:2407.03637cs.lghttps://doi.org/10.48550/arXiv.2407.03637 Focus通过DataCite了解更多arxiv发布的指示信息

[NLP-100] DSLR: Document Refinement with Sentence-Level Re-ranking and Reconstruction to Enhance Retrieval-Augmented Generation
[NLP-100] 数码单反:通过句子级重新排序和重建进行文档细化,以增强检索增强生成

链接: https://arxiv.org/abs/2407.03627
作者: Taeho Hwang,Soyeong Jeong,Sukmin Cho,SeungYoon Han,Jong C. Park
关键词: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
中文关键词: 大型语言模型、自然语言处理、语言模型、语言处理、大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have significantly improved their performance across various Natural Language Processing (NLP) tasks. However, LLMs still struggle with generating non-factual responses due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) systems address this issue by incorporating external knowledge with a retrieval module. Despite their successes, however, current RAG systems face challenges with retrieval failures and the limited ability of LLMs to filter out irrelevant information. Therefore, in this work, we propose DSLR (Document Refinement with Sentence-Level Re-ranking and Reconstruction), an unsupervised framework that decomposes retrieved documents into sentences, filters out irrelevant sentences, and reconstructs them again into coherent passages. We experimentally validate DSLR on multiple open-domain QA datasets and the results demonstrate that DSLR significantly enhances the RAG performance over conventional fixed-size passage. Furthermore, our DSLR enhances performance in specific, yet realistic scenarios without the need for additional training, providing an effective and efficient solution for refining retrieved documents in RAG systems.
摘要:大型语言模型的最新进展显著提高了它们在各种自然语言处理(NLP)任务中的性能。然而,由于其参数记忆的限制,LLM仍然难以产生非事实响应。检索-增强生成(RAG)系统通过将外部知识与检索模块相结合来解决这一问题。然而,尽管取得了成功,但目前的RAG系统面临着检索失败和LLMS过滤不相关信息的能力有限的挑战。因此,在这项工作中,我们提出了一种无监督的文档精化框架DSLR,它将检索到的文档分解成句子,过滤掉不相关的句子,然后重新构建成连贯的段落。我们在多个开放领域的QA数据集上对DSLR进行了实验验证,结果表明,DSLR比传统的固定大小通道显著提高了RAG性能。此外,我们的单反在特定而现实的场景中提高了性能,而不需要额外的培训,为在RAG系统中提炼检索到的文档提供了有效和高效的解决方案。

[NLP-101] Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks
[NLP-101] 数据分析预算处理提高了LLM推理任务中的性能

链接: https://arxiv.org/abs/2407.03624
作者: Dharunish Yugeswardeenoo,Kevin Zhu,Sean O’Brien
关键词: transform many fields, reasoning tasks, Question Analysis Prompting, potential to transform, underperform humans
中文关键词: 改变许多领域、推理任务、问题分析预算、改变的潜力、表现不佳人类
类目: Computation and Language (cs.CL)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Although LLMs have the potential to transform many fields, they still underperform humans in reasoning tasks. Existing methods induce the model to produce step-by-step calculations, but this research explores the question: Does making the LLM analyze the question improve its performance? We propose a novel prompting strategy called Question Analysis Prompting (QAP), in which the model is prompted to explain the question in n words before solving. The value of n influences the length of response generated by the model. QAP is evaluated on GPT 3.5 Turbo and GPT 4 Turbo on arithmetic datasets GSM8K, AQuA, and SAT and commonsense dataset StrategyQA. QAP is compared with other state-of-the-art prompts including Chain-of-Thought (CoT), Plan and Solve Prompting (PS+) and Take A Deep Breath (TADB). QAP outperforms all state-of-the-art prompts on AQuA and SAT datasets on both GPT3.5 and GPT4. QAP consistently ranks among the top-2 prompts on 75% of the tests. A key factor of QAP performance can be attributed to response length, where detailed responses are beneficial when answering harder questions, but can negatively affect easy questions.
摘要:尽管LLM具有改变许多领域的潜力,但它们在推理任务中的表现仍然落后于人类。现有的方法诱导模型产生循序渐进的计算,但本研究探索了这个问题:让LLM分析问题是否提高了它的性能?我们提出了一种新的提示策略,称为问题分析提示(QAP),在该策略中,模型在解题之前被提示用n个单词解释问题。N的值影响模型生成的响应的长度。QAP在GPT 3.5 Turbo和GPT 4 Turbo上在算术数据集GSM8K、Aqua和SAT以及常识数据集Strategy yQA上进行了评估。QAP与其他最先进的提示进行了比较,包括思想链(COT)、计划和解决提示(PS+)和深呼吸(TADB)。在GPT3.5和GPT4上,QAP在Aqua和SAT数据集上的表现都超过了所有最先进的提示。在75%的测试中,QAP始终位居前2名提示。QAP成绩的一个关键因素可以归因于回答时间的长短,在回答较难的问题时,详细的回答是有益的,但可能会对容易的问题产生负面影响。

[NLP-102] he Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Metas Llama 2 Model
[NLP-102] 神经元神秘案例1512:可注射重新对齐架构揭示了Meta Lama 2模型的内部特征

链接: https://arxiv.org/abs/2407.03621
作者: Brenden Smith,Dallin Baker,Clayton Chase,Myles Barney,Kaden Parker,Makenna Allred,Peter Hu,Alex Evans,Nancy Fulda
关键词: Large Language Models, Large Language, Injectable Realignment Model, human preferences, text they generate
中文关键词: 大型语言模型、大型语言、可注射重新对齐模型、人类偏好、它们生成的文本
类目: Computation and Language (cs.CL)
备注: 21 pages, 17 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have an unrivaled and invaluable ability to “align” their output to a diverse range of human preferences, by mirroring them in the text they generate. The internal characteristics of such models, however, remain largely opaque. This work presents the Injectable Realignment Model (IRM) as a novel approach to language model interpretability and explainability. Inspired by earlier work on Neural Programming Interfaces, we construct and train a small network – the IRM – to induce emotion-based alignments within a 7B parameter LLM architecture. The IRM outputs are injected via layerwise addition at various points during the LLM’s forward pass, thus modulating its behavior without changing the weights of the original model. This isolates the alignment behavior from the complex mechanisms of the transformer model. Analysis of the trained IRM’s outputs reveals a curious pattern. Across more than 24 training runs and multiple alignment datasets, patterns of IRM activations align themselves in striations associated with a neuron’s index within each transformer layer, rather than being associated with the layers themselves. Further, a single neuron index (1512) is strongly correlated with all tested alignments. This result, although initially counterintuitive, is directly attributable to design choices present within almost all commercially available transformer architectures, and highlights a potential weak point in Meta’s pretrained Llama 2 models. It also demonstrates the value of the IRM architecture for language model analysis and interpretability. Our code and datasets are available at this https URL
摘要:大型语言模型(LLM)具有无与伦比、无可估量的能力,通过在它们生成的文本中反映它们的输出,使它们的输出与不同的人类偏好相匹配。然而,这些模式的内部特征在很大程度上仍然不透明。这项工作提出了可注入重排模型(IRM)作为一种新的语言模型可解释性和可解释性的方法。受早期神经编程接口工作的启发,我们构建并训练了一个小型网络–IRM–以在7B参数LLM体系结构中诱导基于情感的比对。在LLM的前向传递过程中,IRM输出通过LayerWise加法在不同点处注入,从而在不改变原始模型权重的情况下调整其行为。这将对齐行为与变压器模型的复杂机制隔离开来。对经过训练的IRM输出的分析揭示了一种奇怪的模式。在超过24次的训练运行和多个对准数据集中,IRM激活模式以与每个变压器层中神经元的索引相关联的条纹对齐,而不是与层本身相关联。此外,单个神经元指数(1512)与所有测试的比对强烈相关。这一结果虽然最初是违反直觉的,但直接归因于几乎所有商业可用变压器架构中的设计选择,并突出了Meta预先培训的Llama 2型号中的一个潜在弱点。它还展示了IRM体系结构在语言模型分析和可解释性方面的价值。我们的代码和数据集可在此HTTPS URL中找到

[NLP-103] BM25S: Orders of magnitude faster lexical search via eager sparse scoring
[NLP-103] BM 25 S:通过渴望稀疏评分实现数量级更快的词汇搜索

链接: https://arxiv.org/abs/2407.03618
作者: Xing Han Lù
关键词: Numpy and Scipy, depends on Numpy, efficient Python-based implementation, efficient Python-based, popular Python-based framework
中文关键词: Numpy和Scipy,依赖于Numpy、高效的基于Python的实现、高效的基于Python的、流行的基于Python的框架
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at this https URL
摘要:我们引入了BM 25 S,这是BM 25的一种基于Python的高效实现,仅依赖于Numpy和Scipy。与最流行的基于Python的框架相比,BM 25 S通过在索引期间热切地计算BM 25分数并将其存储到稀疏矩阵中,实现了高达500倍的加速。与流行商业产品使用的高度优化的基于Java的实现相比,它还实现了相当大的加速。最后,BM 25 S通过使用新型得分转移方法将渴望评分扩展到非稀疏变体,重现了基于Kamphuis等人(2020)的五个BM 25变体的确切实现。该代码可以在此https URL找到

[NLP-104] Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models
[NLP-104] 可视化对话:通过使用大型语言模型的对话理解增强图像选择

链接: https://arxiv.org/abs/2407.03615
作者: Chang-Sheng Kao,Yun-Nung Chen
关键词: integrating multimodal responses, enable conveying ideas, Recent advancements, multimodal responses, text-based interactions
中文关键词: 集成多模式响应,能够传达想法,最新进展,多模式响应,基于文本的交互
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment not only improves overall communicative efficacy but also enhances the quality of conversational experiences. However, existing methods for dialogue-to-image retrieval face limitations due to the constraints of pre-trained vision language models (VLMs) in comprehending complex dialogues accurately. To address this, we present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors, facilitating seamless connection with images. Extensive experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors, leading to significant enhancements in dialogue-to-image retrieval performance. Furthermore, our findings demonstrate the method’s generalizability across diverse visual cues, various LLMs, and different datasets, underscoring its practicality and potential impact in real-world applications.
摘要:对话系统的最新进展突显了整合多模式回应的重要性,这使得能够通过不同的模式传达思想,而不是仅仅依赖基于文本的互动。这种丰富不仅提高了整体的交际效率,还提高了对话体验的质量。然而,现有的对话到图像检索方法受到预先训练的视觉语言模型的限制,无法准确理解复杂的对话。为了解决这一问题,我们提出了一种新的方法,利用大型语言模型(LLM)的强大推理能力来生成精确的对话相关视觉描述符,从而促进与图像的无缝连接。在基准数据上进行的大量实验验证了我们所提出的方法在获取简洁准确的视觉描述符方面的有效性,从而显著地提高了对话图像检索的性能。此外,我们的发现证明了该方法在不同的视觉线索、不同的LLM和不同的数据集上的泛化能力,强调了它在现实世界应用中的实用性和潜在的影响。

[NLP-105] Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations
[NLP-105] 侧向LoRA:具有模式专业化适应的交叉指令调优

链接: https://arxiv.org/abs/2407.03604
作者: Zhiyang Xu,Minqian Liu,Ying Shen,Joy Rimchala,Jiaxin Zhang,Qifan Wang,Yu Cheng,Lifu Huang
关键词: Vision-Language Generalists, Recent advancements, capable of understanding, Lateralization LoRA, Recent
中文关键词: 视觉语言通才,最近的进步,能够理解,侧向LoRA,最近
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Pages, visual instruction tuning, parameter-efficient tuning

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have led to the development of Vision-Language Generalists (VLGs) capable of understanding and generating interleaved images and text. Despite these advances, VLGs still struggle to follow user instructions for interleaved text and image generation. To address this issue, we introduce LeafInstruct, the first open-sourced interleaved instruction tuning data with over 30,000 high-quality instances across more than 10 domains. Due to the extensive size of existing VLGs, we opt for parameter-efficient tuning. However, we observe that VLGs tuned with a standard LoRA typically exhibit inferior performance in interleaved text-image generation. We attribute this problem to modality interference and the lack of modality-specialized adaptation design. Hence, we propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization. Lateralization LoRA employs a hybrid approach, combining the traditional linear LoRA and a Convolutional LoRA for generating text and images, enabling the generation of high-quality text and images by leveraging modality-specific structures and parameter sets. We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset. Extensive experiments demonstrate that EMU2 tuned with Lateralization LoRA achieve state-of-the-art performance, significantly surpassing baseline models in complex interleaved tasks.
尽管有了这些进步,VLG仍然很难按照用户的指示生成交错的文本和图像。为了解决这个问题,我们引入了LeafInstruct,这是第一个开源的交错指令调优数据,拥有10多个域的30,000多个高质量实例。由于现有VLG的大小很大,我们选择了参数高效的调优。然而,我们观察到,使用标准LORA调谐的VLG在交错文本-图像生成中通常表现出较差的性能。我们将这一问题归因于通道干扰和缺乏专门针对通道的适应设计。因此,我们在大脑偏侧化概念的启发下,提出了一种新的通道-专门化适应方法–偏侧化LORA。我们在LeafInstruct数据集上使用侧化LORA执行VLG(即EMU2)的指令调优。广泛的实验表明,与侧化LORA调整的EMU2获得了最先进的性能,在复杂交织任务中显著超过了基线模型。

[NLP-106] Contrastive Chain-of-Thought Prompting
[NLP-106] 对比思维链预算

链接: https://arxiv.org/abs/2407.03600
作者: Grant Kruttschnitt,Jay Shim,Alyssa Ma,Daniel Kim,Benjamin Chek,Athul Anand,Kevin Zhu,Sean O’Brien
关键词: Rapidly increasing model, Rapidly increasing, increasing model scales, model scales coupled, scales coupled
中文关键词: 快速增加模型,快速增加,增加模型规模,模型规模耦合,规模耦合
类目: Computation and Language (cs.CL)
备注: 6 pages, 0 figures

点击查看摘要

Abstract:Rapidly increasing model scales coupled with steering methods such as chain-of-thought prompting have led to drastic improvements in language model reasoning. At the same time, models struggle with compositional generalization and are far from human performance on many reasoning-based benchmarks. Leveraging the success of chain-of-thought prompting, and also taking inspiration from context-aware decoding (CAD), we explore input-based contrasting methods to further encourage the type of reasoning induced by chain-of-thought prompting. While work remains to stabilize these results across datasets and models, the improvements we find warrant further investigation into input-based steering methods for context-aware reasoning.
摘要:快速增加的模型规模加上思想链提示等引导方法,导致语言模型推理的大幅改进。与此同时,模型很难应对组合概括,并且在许多基于推理的基准上与人类的表现相去甚远。利用思想链提示的成功,并从上下文感知解码(CAD)中汲取灵感,我们探索基于输入的对比方法,以进一步鼓励思想链提示引发的推理类型。虽然仍有工作要稳定跨数据集和模型的这些结果,但我们发现的改进值得进一步研究基于输入的引导方法以进行上下文感知推理。

[NLP-107] Zero-shot Persuasive Chatbots with LLM-Generated Strategies and Information Retrieval
[NLP-107] 具有LLM生成策略和信息检索的零镜头说服聊天机器人

链接: https://arxiv.org/abs/2407.03585
作者: Kazuaki Furumai,Roberto Legaspi,Julio Vizcarra,Yudai Yamazaki,Yasutaka Nishimura,Sina J. Semnani,Kazushi Ikeda,Weiyan Shi,Monica S. Lam
关键词: plays a pivotal, pivotal role, wide range, Persuasive, Persuasion plays
中文关键词: 发挥着关键、关键的作用,范围广泛,有说服力,有说服力
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Persuasion plays a pivotal role in a wide range of applications from health intervention to the promotion of social good. Persuasive chatbots can accelerate the positive effects of persuasion in such applications. Existing methods rely on fine-tuning persuasive chatbots with task-specific training data which is costly, if not infeasible, to collect. To address this issue, we propose a method to leverage the generalizability and inherent persuasive abilities of large language models (LLMs) in creating effective and truthful persuasive chatbot for any given domain in a zero-shot manner. Unlike previous studies which used pre-defined persuasion strategies, our method first uses an LLM to generate responses, then extracts the strategies used on the fly, and replaces any unsubstantiated claims in the response with retrieved facts supporting the strategies. We applied our chatbot, PersuaBot, to three significantly different domains needing persuasion skills: donation solicitation, recommendations, and health intervention. Our experiments on simulated and human conversations show that our zero-shot approach is more persuasive than prior work, while achieving factual accuracy surpassing state-of-the-art knowledge-oriented chatbots. Our study demonstrated that when persuasive chatbots are employed responsibly for social good, it is an enabler of positive individual and social change.
摘要:说服在从健康干预到促进社会公益的广泛应用中发挥着举足轻重的作用。说服性聊天机器人可以在这类应用中加速说服力的积极效果。现有的方法依赖于微调有说服力的聊天机器人,并使用特定于任务的训练数据进行微调,收集这些数据即使不是不可行,也是昂贵的。为了解决这个问题,我们提出了一种方法,利用大语言模型的泛化能力和内在的说服能力,以零命中的方式为任何给定的领域创建有效和真实的说服性聊天机器人。与以前使用预先定义的说服策略的研究不同,我们的方法首先使用LLM来生成响应,然后提取动态使用的策略,并用检索到的支持这些策略的事实来替换响应中任何未经证实的说法。我们将我们的聊天机器人PersuaBot应用于三个需要说服技能的显著不同领域:募捐、推荐和健康干预。我们在模拟对话和真人对话上的实验表明,我们的零镜头方法比以前的工作更有说服力,同时获得了超过最先进的知识型聊天机器人的事实准确性。我们的研究表明,当有说服力的聊天机器人被负责任地用于社会公益时,它是积极的个人和社会变革的推动者。

[NLP-108] Integrating Randomness in Large Language Models: A Linear Congruential Generator Approach for Generating Clinically Relevant Content
[NLP-108] 将随机性集成到大型语言模型中:生成临床相关内容的线性同容生成器方法

链接: https://arxiv.org/abs/2407.03582
作者: Andrew Bouras
关键词: Linear Congruential Generator, Generating diverse, models is crucial, Generating, Congruential Generator method
中文关键词: 线性congrical Generator,生成多样化,模型至关重要,生成,congrical Generator方法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating diverse, high-quality outputs from language models is crucial for applications in education and content creation. Achieving true randomness and avoiding repetition remains a significant challenge. This study uses the Linear Congruential Generator method for systematic fact selection, combined with AI-powered content generation. We ensured unique combinations of gastrointestinal physiology and pathology facts across multiple rounds, integrating these facts into prompts for GPT-4o to create clinically relevant, vignette-style outputs. Over 14 rounds, 98 unique outputs were generated, demonstrating LCG’s effectiveness in producing diverse and high-quality content. This method addresses key issues of randomness and repetition, enhancing the quality and efficiency of language model-generated content for various applications.
摘要:从语言模型生成多样化、高质量的输出对于教育和内容创建的应用至关重要。实现真正的随机性和避免重复仍然是一个重大挑战。这项研究使用线性同容生成器方法进行系统性事实选择,并结合人工智能驱动的内容生成。我们确保了多轮胃肠道生理学和病理学事实的独特组合,将这些事实整合到GPT-4 o的提示中,以创建临床相关的、小插曲式的输出。在14轮比赛中,产生了98个独特的产出,证明了LCG在制作多元化和高质量内容方面的有效性。这种方法解决了随机性和重复性的关键问题,提高了各种应用程序的语言模型生成内容的质量和效率。

[NLP-109] Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification
[NLP-109] 核心:具有信息性的子主张识别的稳健事实精确度评分

链接: https://arxiv.org/abs/2407.03572
作者: Zhengping Jiang,Jingyu Zhang,Nathaniel Weir,Seth Ebner,Miriam Wanner,Kate Sanders,Daniel Khashabi,Anqi Liu,Benjamin Van Durme
关键词: evaluate factual precision, large language models, factual precision, pose a challenge, motivating the development
中文关键词: 评估事实精确性、大型语言模型、事实精确性,构成挑战,激励开发
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hallucinations – the generation of untrue claims – pose a challenge to the application of large language models (LLMs) [1] thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore [2], can be manipulated by adding obvious or repetitive claims to artificially inflate scores. We expand the FActScore dataset to design and analyze factual precision metrics, demonstrating that models can be trained to achieve high scores under existing metrics through exploiting the issues we identify. This motivates our new customizable plug-and-play subclaim selection component called Core, which filters down individual subclaims according to their uniqueness and informativeness. Metrics augmented by Core are substantially more robust as shown in head-to-head comparisons. We release an evaluation framework supporting the modular use of Core (this https URL) and various decomposition strategies, and we suggest its adoption by the LLM community. [1] Hong et al., “The Hallucinations Leaderboard – An Open Effort to Measure Hallucinations in Large Language Models”, arXiv:2404.05904v2 [cs.CL]. [2] Min et al., “FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation”, arXiv:2305.14251v2 [cs.CL]. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2407.03572 [cs.CL] (or arXiv:2407.03572v1 [cs.CL] for this version)
摘要:幻觉–不真实声明的产生–对大型语言模型(LLM)的应用提出了挑战[1],从而推动了评估事实精确度的度量标准的发展。我们观察到,使用分解-然后-验证框架的流行指标,如FActScore[2],可以通过添加明显或重复的声明来人为夸大分数来操纵。我们扩展了FActScore数据集来设计和分析事实精度指标,演示了可以通过利用我们发现的问题来训练模型,以在现有指标下获得高分。这激发了我们新的可定制的即插即用的子声明选择组件Core,该组件根据其唯一性和信息性来筛选各个子声明。如逐一比较所示,由Core增强的指标要稳健得多。我们发布了一个评估框架,支持Core(这个HTTPS URL)的模块化使用和各种分解策略,并建议LLM社区采用它。[1]Hong等人,《幻觉排行榜–在大型语言模型中测量幻觉的公开努力》,arxiv:2404.05904v2[cs.CL]。[2]Min等人,《FActScore:长文本生成中事实精确度的细粒度原子评估》,arxiv:2305.14251v2[cs.CL]。科目:计算和语言(cs.CL)引用为:arxiv:2407.03572cs.CL

[NLP-110] Feelings about Bodies: Emotions on Diet and Fitness Forums Reveal Gendered Stereotypes and Body Image Concerns
[NLP-110] 对身体的感觉:饮食和健身论坛的情绪揭示性别刻板印象和身体形象担忧

链接: https://arxiv.org/abs/2407.03551
作者: Cinthia Sánchez,Minh Duc Chu,Zihao He,Rebecca Dorn,Stuart Murray,Kristina Lerman
关键词: body image concerns, image concerns, extreme cases, disordered eating, types can lead
中文关键词: 身体形象担忧、形象担忧、极端案例、饮食失调、类型可能导致
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The gendered expectations about ideal body types can lead to body image concerns, dissatisfaction, and in extreme cases, disordered eating and other psychopathologies across the gender spectrum. While research has focused on pro-anorexia online communities that glorify the ‘thin ideal’, less attention has been given to the broader spectrum of body image concerns or how emerging disorders like muscle dysmorphia (‘bigorexia’) present in online discussions. To address these gaps, we analyze 46 Reddit discussion forums related to diet, fitness, and associated mental health challenges. Using membership structure analysis and transformer-based language models, we project these communities along gender and body ideal axes, revealing complex interactions between gender, body ideals, and emotional expression. Our findings show that feminine-oriented communities generally express more negative emotions, particularly in thinness-promoting forums. Conversely, communities focused on the muscular ideal exhibit less negativity, regardless of gender orientation. We also uncover a gendered pattern in emotional indicators of mental health challenges, with communities discussing serious issues aligning more closely with thinness-oriented, predominantly feminine-leaning communities. By revealing the gendered emotional dynamics of online communities, our findings can inform the development of more effective content moderation approaches that facilitate supportive interactions, while minimizing exposure to potentially harmful content.
摘要:对理想体型的性别期望可能会导致对身体形象的担忧、不满,在极端情况下,会导致饮食失调和其他跨性别的心理变态。虽然研究的重点是美化“瘦理想”的厌食症网络社区,但很少有人关注更广泛的身体形象问题,或者肌肉变形症(“偏执症”)等新出现的疾病如何出现在在线讨论中。为了解决这些差距,我们分析了46个Reddit论坛,这些论坛与饮食、健身和相关的心理健康挑战有关。使用成员结构分析和基于变换的语言模型,我们沿着性别和身体理想轴投影这些社区,揭示性别、身体理想和情感表达之间的复杂互动。我们的发现表明,面向女性的社区通常会表达更多的负面情绪,特别是在促进瘦身的论坛上。相反,无论性别取向如何,专注于肌肉理想的社区表现出较少的负面情绪。我们还发现了心理健康挑战的情感指标中的性别模式,社区讨论严重的问题时,更紧密地与以瘦为导向、以女性为主的社区保持一致。通过揭示在线社区的性别情感动态,我们的发现可以为开发更有效的内容审核方法提供信息,这些方法有助于支持性互动,同时将潜在有害内容的暴露降至最低。

[NLP-111] On Evaluating Explanation Utility for Human-AI Decision Making in NLP
[NLP-111] NLP中人工智能决策的解释效用评估

链接: https://arxiv.org/abs/2407.03545
作者: Fateme Hashemi Chaleshtori,Atreya Ghosal,Alexander Gill,Purbid Bambroo,Ana Marasović
关键词: false promise, cs.CL, NLP, Abstract, explanations aid people
中文关键词: 虚假承诺,cs.CL,NLP,摘要,解释帮助人们
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages main, 7 pages references, 32 pages appendix

点击查看摘要

Abstract:Is explainability a false promise? This debate has emerged from the insufficient evidence that explanations aid people in situations they are introduced for. More human-centered, application-grounded evaluations of explanations are needed to settle this. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies. To help with this, we first review fitting existing metrics. We then establish requirements for datasets to be suitable for application-grounded evaluations. Among over 50 datasets available for explainability research in NLP, we find that 4 meet our criteria. By finetuning Flan-T5-3B, we demonstrate the importance of reassessing the state of the art to form and study human-AI teams. Finally, we present the exemplar studies of human-AI decision-making for one of the identified suitable tasks – verifying the correctness of a legal claim given a contract. Comments: 9 pages main, 7 pages references, 32 pages appendix Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC) Cite as: arXiv:2407.03545 [cs.CL] (or arXiv:2407.03545v1 [cs.CL] for this version)
摘要:可解释性是错误的承诺吗?这场辩论源于证据不足,即解释有助于人们在被介绍给他们的情况下。为了解决这一问题,需要对解释进行更多以人为中心、以应用为基础的评估。然而,由于NLP中没有建立此类研究的指导方针,习惯于标准化代理评估的研究人员必须在他们的研究中为人类-AI团队找到合适的测量、任务、数据集和合理的模型。为了帮助解决这一问题,我们首先回顾了与现有指标的匹配。然后,我们建立适合基于应用程序的评估的数据集的要求。在NLP中可用于解释性研究的50多个数据集中,我们发现有4个符合我们的标准。评论:正文9页,参考文献7页,附录32页主题:计算与语言(cs.CL);人工智能(cs.AI);人机交互(cs.HC)引用AS:arxiv:2407.03545cs.CL

[NLP-112] Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias
[NLP-112] 孟加拉语大型语言模型中的社会偏见:性别和宗教偏见的实证研究

链接: https://arxiv.org/abs/2407.03536
作者: Jayanta Sadhu,Maneesha Rani Saha,Rifat Shahriyar
关键词: Large Language Models, growth of Large, Language Models, Large Language, rapid growth
中文关键词: 大型语言模型,大型的增长,语言模型,大型语言,快速增长
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid growth of Large Language Models (LLMs) has put forward the study of biases as a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive fields. Although there have been extensive works on bias assessment in English, such efforts are rare and scarce for a major language like Bangla. In this work, we examine two types of social biases in LLM generated outputs for Bangla language. Our main contributions in this work are: (1) bias studies on two different social biases for Bangla (2) a curated dataset for bias measurement benchmarking (3) two different probing techniques for bias detection in the context of Bangla. This is the first work of such kind involving bias assessment of LLMs for Bangla to the best of our knowledge. All our code and resources are publicly available for the progress of bias related research in Bangla NLP.
摘要:大型语言模型(LLM)的快速发展使偏见研究成为一个重要领域。评估LLM中嵌入的不同类型偏见的影响非常重要,以确保在敏感领域的公平使用。尽管在英语中的偏见评估方面已经有了大量的工作,但对于孟加拉语这样的主要语言来说,这样的工作是罕见的。在这项工作中,我们研究了LLM生成的孟加拉语输出中的两种类型的社会偏见。我们在这项工作中的主要贡献是:(1)对孟加拉语两种不同社会偏见的偏见研究(2)用于偏见测量基准的精心策划的数据集(3)孟加拉语背景下用于偏见检测的两种不同探测技术。据我们所知,这是第一项涉及孟加拉语LLM偏见评估的此类工作。我们所有的代码和资源都是公开的,用于孟加拉NLP中偏见相关研究的进展。

[NLP-113] UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs Memorization
[NLP-113] UnSeenTimeQA:超越LLC小型化的时间敏感型客户服务

链接: https://arxiv.org/abs/2407.03525
作者: Md Nayem Uddin,Amir Saeidi,Divij Handa,Agastya Seth,Tran Cao Son,Eduardo Blanco,Steven R. Corman,Chitta Baral
关键词: traditional TSQA benchmarks, TSQA benchmarks, traditional TSQA, paper introduces UnSeenTimeQA, web-searchable queries
中文关键词: 传统TSQA基准测试,TSQA基准测试,传统TSQA,论文介绍UnSeenTimeQA,网络搜索查询
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces UnSeenTimeQA, a novel time-sensitive question-answering (TSQA) benchmark that diverges from traditional TSQA benchmarks by avoiding factual and web-searchable queries. We present a series of time-sensitive event scenarios decoupled from real-world factual information. It requires large language models (LLMs) to engage in genuine temporal reasoning, disassociating from the knowledge acquired during the pre-training phase. Our evaluation of six open-source LLMs (ranging from 2B to 70B in size) and three closed-source LLMs reveal that the questions from the UnSeenTimeQA present substantial challenges. This indicates the models’ difficulties in handling complex temporal reasoning scenarios. Additionally, we present several analyses shedding light on the models’ performance in answering time-sensitive questions.
摘要:本文介绍了UnSeenTimeQA,这是一种新型的时间敏感问答(TSQA)基准,它通过避免事实和网络搜索查询而与传统的TSQA基准不同。我们提出了一系列与现实世界的事实信息脱钩的时间敏感事件场景。它需要大型语言模型(LLM)进行真正的时态推理,与预训练阶段获得的知识脱钩。我们对六个开源LLM(规模从2B到70 B不等)和三个开源LLM的评估表明,UnSeenTimeQA提出的问题提出了巨大的挑战。这表明模型在处理复杂的时态推理场景方面存在困难。此外,我们还提供了几项分析,以了解模型在回答时间敏感问题方面的性能。

[NLP-114] Improving LLM Abilities in Idiomatic Translation
[NLP-114] 提高LLM习语翻译能力

链接: https://arxiv.org/abs/2407.03518
作者: Sundesh Donthi,Maximilian Spencer,Om Patel,Joon Doh,Eid Rodan
关键词: NLLB and GPT, Cosine Similarity Lookup, translating idioms remains, cosine similarity, Similarity Lookup method
中文关键词: NLLB和GPT、Cosine相似度检查器、翻译习语残留、Cosine相似度、相似度检查器方法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:For large language models (LLMs) like NLLB and GPT, translating idioms remains a challenge. Our goal is to enhance translation fidelity by improving LLM processing of idiomatic language while preserving the original linguistic style. This has a significant social impact, as it preserves cultural nuances and ensures translated texts retain their intent and emotional resonance, fostering better cross-cultural communication. Previous work has utilized knowledge bases like IdiomKB by providing the LLM with the meaning of an idiom to use in translation. Although this method yielded better results than a direct translation, it is still limited in its ability to preserve idiomatic writing style across languages. In this research, we expand upon the knowledge base to find corresponding idioms in the target language. Our research performs translations using two methods: The first method employs the SentenceTransformers model to semantically generate cosine similarity scores between the meanings of the original and target language idioms, selecting the best idiom (Cosine Similarity method). The second method uses an LLM to find a corresponding idiom in the target language for use in the translation (LLM-generated idiom method). As a baseline, we performed a direct translation without providing additional information. Human evaluations on the English - Chinese, and Chinese - English show the Cosine Similarity Lookup method out-performed others in all GPT4o translations. To further build upon IdiomKB, we developed a low-resource Urdu dataset containing Urdu idioms and their translations. Despite dataset limitations, the Cosine Similarity Lookup method shows promise, potentially overcoming language barriers and enabling the exploration of diverse literary works in Chinese and Urdu. For access to the code and replication of our experiments, please visit (this https URL).
摘要:对于像NLLB和GPT这样的大型语言模型(LLM)来说,习语翻译仍然是一个挑战。我们的目标是在保持原有语言风格的同时,通过改进习语的LLM处理来提高翻译的逼真度。这具有重大的社会影响,因为它保留了文化的细微差别,并确保翻译文本保持其意图和情感共鸣,促进更好的跨文化交流。以前的工作已经利用了像IdiomKB这样的知识库,为LLM提供了在翻译中使用的成语的含义。虽然这种方法比直接翻译取得了更好的效果,但它在保留跨语言习语写作风格方面仍然有限。在这项研究中,我们在知识库的基础上进行扩展,以便在目标语言中找到对应的习语。我们的研究使用两种方法进行翻译:第一种方法使用SentenceTransformers模型在语义上生成原语习语和目标语言习语之间的余弦相似度分数,选择最好的习语(余弦相似度方法)。第二种方法使用LLM在目标语言中找到对应的习语用于翻译(LLM生成的习语方法)。作为基准,我们在没有提供额外信息的情况下执行了直接翻译。人类对英汉、汉英翻译的评价表明,余弦相似查找方法在所有GPT4o翻译中的表现都优于其他方法。为了进一步建立在IdiomKB的基础上,我们开发了一个包含乌尔都语习语及其翻译的低资源乌尔都语数据集。尽管受到数据集的限制,余弦相似查找方法还是显示出了希望,它有可能克服语言障碍,并使探索各种中文和乌尔都语文学作品成为可能。有关我们实验的代码和复制,请访问(此HTTPS URL)。

[NLP-115] AgentInstruct: Toward Generative Teaching with Agentic Flows
[NLP-115] AgentDirect:通过抽象流程实现生成性教学

链接: https://arxiv.org/abs/2407.03502
作者: Arindam Mitra,Luciano Del Corro,Guoqing Zheng,Shweti Mahajan,Dany Rouhana,Andres Codas,Yadong Lu,Wei-ge Chen,Olga Vrousgos,Corby Rosset,Fillipe Silva,Hamed Khanpour,Yash Lara,Ahmed Awadallah
关键词: Synthetic data, data, increasingly important, important for accelerating, accelerating the development
中文关键词: 合成数据、数据,越来越重要,对于加速、加速发展很重要
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers also raised concerns around model collapse and drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data by powerful models to teach a new skill or behavior to another model, we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
摘要:合成数据对于加速语言模型的开发变得越来越重要,无论是大的还是小的。尽管有几个成功的使用案例,研究人员也提出了对模型崩溃和模仿其他模型的缺点的担忧。这种差异可以归因于这样一个事实,即合成数据在质量和多样性上存在差异。合成数据的有效利用通常需要大量的人力来管理数据。我们专注于使用合成数据进行后培训,特别是通过强大的模型创建数据,将一种新的技能或行为传授给另一种模型,我们将这种设置称为生成性教学。我们介绍了AgentInstruct,这是一个可扩展的代理框架,用于自动创建大量不同的高质量合成数据。AgentInstruct可以创建提示和响应,只使用文本文档和代码文件等原始数据源作为种子。我们通过创建一个25M对的后训练数据集来演示AgentInstruct的有效性,以教授语言模型不同的技能,如文本编辑、创造性写作、工具使用、编码、阅读理解等。该数据集可用于任何基本模型的教学调整。我们用这些数据对米斯特拉尔-7b进行了后期训练。当将得到的模型Orca-3与Mistral-7b-Indict(使用相同的基本模型)进行比较时,我们观察到在许多基准测试中都有显著的改进。例如,AGIEval改进了40%,MMLU改进了19%,GSM8K改进了54%,BBH改进了38%,AlpacaEval改进了45%。此外,它的表现一直优于其他型号,如骆驼-8B-指令和GPT-3.5-涡轮。

[NLP-116] Exploring LGBTQ Bias in Generative AI Answers across Different Country and Religious Contexts
[NLP-116] 探索不同国家和宗教背景下的生成人工智能答案中的LGBTQ偏见

链接: https://arxiv.org/abs/2407.03473
作者: Lilla Vicsek,Anna Vancsó,Mike Zajko,Judit Takacs
关键词: Previous discussions, content about minorities, cultures and religions, discussions have highlighted, neglect the complexities
中文关键词: 之前的讨论,关于少数族裔、文化和宗教的内容,讨论强调了,忽视了复杂性
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Previous discussions have highlighted the need for generative AI tools to become more culturally sensitive, yet often neglect the complexities of handling content about minorities, who are perceived differently across cultures and religions. Our study examined how two generative AI systems respond to homophobic statements with varying cultural and religious context information. Findings showed ChatGPT 3.5’s replies exhibited cultural relativism, in contrast to Bard’s, which stressed human rights and provided more support for LGBTQ+ issues. Both demonstrated significant change in responses based on contextual information provided in the prompts, suggesting that AI systems may adjust in their responses the degree and forms of support for LGBTQ+ people according to information they receive about the user’s background. The study contributes to understanding the social and ethical implications of AI responses and argues that any work to make generative AI outputs more culturally diverse requires a grounding in fundamental human rights.
摘要:之前的讨论强调了生成性人工智能工具需要变得对文化更敏感,但往往忽视了处理关于少数民族的内容的复杂性,不同文化和宗教对少数民族的看法不同。我们的研究考察了两个生成性人工智能系统对具有不同文化和宗教背景信息的恐同声明的反应。结果显示,ChatGPT3.5的S的回复显示出文化相对主义,而巴德的回复强调人权,并为LGBTQ+问题提供更多支持。两者都显示出基于提示中提供的上下文信息的回应发生了显著变化,这表明人工智能系统可能会根据他们收到的关于用户背景的信息,在他们的回复中调整对LGBTQ+人的支持程度和形式。这项研究有助于理解人工智能反应的社会和伦理影响,并认为任何使生产性人工智能产出在文化上更加多样化的工作都需要以基本人权为基础。

[NLP-117] Prosody-Driven Privacy-Preserving Dementia Detection
[NLP-117] 韵律驱动的隐私保护痴呆症检测

链接: https://arxiv.org/abs/2407.03470
作者: Dominika Woszczyk,Ranya Aloufi,Soteris Demetriou
关键词: extracted from voice, voice recordings, proven valuable, Speaker embeddings extracted, dementia detection
中文关键词: 从语音中提取、录音、被证明有价值、提取的说话者嵌入、痴呆症检测
类目: ound (cs.SD); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2024

点击查看摘要

Abstract:Speaker embeddings extracted from voice recordings have been proven valuable for dementia detection. However, by their nature, these embeddings contain identifiable information which raises privacy concerns. In this work, we aim to anonymize embeddings while preserving the diagnostic utility for dementia detection. Previous studies rely on adversarial learning and models trained on the target attribute and struggle in limited-resource settings. We propose a novel approach that leverages domain knowledge to disentangle prosody features relevant to dementia from speaker embeddings without relying on a dementia classifier. Our experiments show the effectiveness of our approach in preserving speaker privacy (speaker recognition F1-score .01%) while maintaining high dementia detection score F1-score of 74% on the ADReSS dataset. Our results are also on par with a more constrained classifier-dependent system on ADReSSo (.01% and .66%), and have no impact on synthesized speech naturalness.
摘要:从语音记录中提取的说话人嵌入已被证明对痴呆症检测有价值。然而,根据它们的性质,这些嵌入包含可识别的信息,这引发了隐私问题。在这项工作中,我们的目标是匿名嵌入,同时保留痴呆症检测的诊断效用。以前的研究依赖于对抗性学习和在有限资源环境下对目标属性和斗争进行训练的模型。我们提出了一种新的方法,利用领域知识从说话人嵌入中分离出与痴呆症相关的韵律特征,而不依赖于痴呆症分类器。我们的实验表明,我们的方法在保护说话人隐私(说话人识别F1-得分0.01%)的同时,在ADRESS数据集上保持了74%的高痴呆症检测得分F1-得分。我们的结果也与ADReSSo上更受限的依赖于分类器的系统(0.01%和0.66%)相当,并且对合成语音的自然度没有影响。

[NLP-118] Collaborative Quest Completion with LLM-driven Non-Player Characters in Minecraft
[NLP-118] 在《我的世界》中与LLM驱动的非玩家角色合作完成任务

链接: https://arxiv.org/abs/2407.03460
作者: Sudha Rao,Weijia Xu,Michael Xu,Jorge Leandro,Ken Lobb,Gabriel DesGarennes,Chris Brockett,Bill Dolan
关键词: LLM-driven non-player characters, expect LLM-driven non-player, video game development, large language models, language models continue
中文关键词: LLM驱动的非玩家角色,预计LLM驱动的非玩家、视频游戏开发、大型语言模型、语言模型继续
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Wordplay workshop at ACL 2024

点击查看摘要

Abstract:The use of generative AI in video game development is on the rise, and as the conversational and other capabilities of large language models continue to improve, we expect LLM-driven non-player characters (NPCs) to become widely deployed. In this paper, we seek to understand how human players collaborate with LLM-driven NPCs to accomplish in-game goals. We design a minigame within Minecraft where a player works with two GPT4-driven NPCs to complete a quest. We perform a user study in which 28 Minecraft players play this minigame and share their feedback. On analyzing the game logs and recordings, we find that several patterns of collaborative behavior emerge from the NPCs and the human players. We also report on the current limitations of language-only models that do not have rich game-state or visual understanding. We believe that this preliminary study and analysis will inform future game developers on how to better exploit these rapidly improving generative AI models for collaborative roles in games.
摘要:生成式人工智能在视频游戏开发中的使用正在上升,随着大型语言模型的对话和其他能力的不断提高,我们预计LLM驱动的非玩家角色(NPC)将得到广泛部署。在这篇文章中,我们试图理解人类玩家如何与LLM驱动的NPC合作来实现游戏中的目标。我们在《我的世界》中设计了一个迷你游戏,其中一个玩家与两个GPT4驱动的NPC合作来完成一项任务。我们进行了一项用户研究,让28名《我的世界》玩家玩这款小游戏,并分享他们的反馈。通过分析游戏日志和记录,我们发现在NPC和人类玩家之间出现了几种协作行为模式。我们还报告了目前仅有语言模型的局限性,这些模型没有丰富的游戏状态或视觉理解。我们相信,这一初步研究和分析将为未来的游戏开发人员提供信息,帮助他们更好地利用这些快速改进的生成性人工智能模型来实现游戏中的协作角色。

[NLP-119] XferBench: a Data-Driven Benchmark for Emergent Language
[NLP-119] XferBench:Emergent语言的数据驱动基准

链接: https://arxiv.org/abs/2407.03456
作者: Brendon Boldt,David Mortensen
关键词: emergent language, data-driven methods, emergent, language, human language
中文关键词: 涌现语言,数据驱动方法,涌现,语言,人类语言
类目: Computation and Language (cs.CL)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the “quality” of an emergent language as its similarity to human language within a deep learning framework. We measure this by using the emergent language as pretraining data for a downstream NLP tasks in human language – the better the downstream performance, the better the emergent language. We implement this benchmark as an easy-to-use Python package that only requires a text file of utterances from the emergent language to be evaluated. Finally, we empirically test the benchmark’s validity using human, synthetic, and emergent language baselines.
摘要:在本文中,我们引入了一个使用数据驱动方法评估新兴语言整体质量的基准。具体来说,我们将新兴语言的“质量”概念解释为其与深度学习框架内人类语言的相似性。我们通过使用新兴语言作为人类语言下游NLP任务的预训练数据来衡量这一点–下游性能越好,新兴语言就越好。我们将这个基准实现为一个易于使用的Python包,它只需要评估新兴语言的话语文本文件。最后,我们使用人类、合成和新兴语言基线来实证测试基准的有效性。

[NLP-120] HEMM: Holistic Evaluation of Multimodal Foundation Models
[NLP-120] HEMM:多峰基础模型的整体评估

链接: https://arxiv.org/abs/2407.03418
作者: Paul Pu Liang,Akshay Goindani,Talha Chafekar,Leena Mathur,Haofei Yu,Ruslan Salakhutdinov,Louis-Philippe Morency
关键词: Multimodal foundation models, text alongside images, holistically process text, process text alongside, Multimodal foundation
中文关键词: 多模式基础模型,文本与图像一起,整体处理文本,同时处理文本,多模式基础
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL

点击查看摘要

Abstract:Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today’s models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impacts of instruction tuning yield actionable insights for future work in multimodal foundation models.
摘要:多通道基础模型可以与图像、视频、音频和其他感官通道一起整体处理文本,在现实世界的各种应用中得到越来越多的使用。然而,考虑到可能的建模决策、任务和领域的范围,表征和研究多模式基础模型的进展是具有挑战性的。在本文中,我们引入了多通道模型的整体评估(HEMM)来系统地评估多通道基础模型在3个维度上的能力:基本技能、信息流和真实世界的用例。基本的多通道技能是解决问题所需的内部能力,例如跨通道的学习交互、细粒度对齐、多步骤推理和处理外部知识的能力。信息流研究多模式内容如何通过查询、翻译、编辑和融合在任务过程中发生变化。用例涵盖在真实世界的多媒体、情感计算、自然科学、医疗保健和人机交互应用程序中引入的特定领域的挑战。通过对HEMM中30项任务的全面实验,我们(1)确定了对当今模型构成挑战的关键数据集维度(例如,基本技能、信息流和用例),以及(2)提取了有关不同建模维度(例如,规模、预培训数据、多模式对齐、预培训和教学调整目标)如何影响性能的性能趋势。我们关于挑战多模式交互、用例和需要推理和外部知识的任务、数据和模型规模的好处以及教学调整的影响的结论为多模式基础模型的未来工作提供了可操作的见解。

[NLP-121] Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning
[NLP-121] 软乞讨:基于即时调优,模块化且高效地屏蔽LLM,防止即时注入和越狱

链接: https://arxiv.org/abs/2407.03391
作者: Simon Ostermann,Kevin Baum,Christoph Endres,Julia Masloh,Patrick Schramowski
关键词: large language models, direct and indirect, language models, application-integrated contexts, recognized as significant
中文关键词: 大型语言模型(直接和间接)、语言模型、应用程序集成上下文,被认为是重要的
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for large language models (LLMs), particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed “soft begging.” This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM’s output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the “soft begging” technique, and discuss an evaluation of its effectiveness.
摘要:提示注入(直接和间接)和越狱现在被认为是大型语言模型(LLM)的重大问题,特别是因为它们在应用程序集成上下文中可能造成伤害。这篇扩展摘要探讨了一种保护LLM免受此类攻击的新型方法,称为“软乞讨”。“这种方法涉及训练软提示,以抵消损坏提示对LLM输出的影响。我们概述了及时注射和越狱,介绍了“软乞讨”技术的理论基础,并讨论了对其有效性的评估。

[NLP-122] ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
[NLP-122] ConCodeEval:评估大型语言模型的特定领域语言中的代码约束

链接: https://arxiv.org/abs/2407.03387
作者: Mehant Kammakomati,Sameer Pimparkhede,Srikanth Tamilselvam,Prince Kumar,Pushpak Bhattacharyya
关键词: shows Large Language, Recent work shows, work shows Large, Large Language Models, shows Large
中文关键词: 显示大型语言,最近的工作显示,工作显示大型,大型语言模型,显示大型
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work shows Large Language Models (LLMs) struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. While, in the code domain, there is wide usage of constraints in code format to maintain the integrity of code written in Domain-Specific Languages (DSLs), yet there has been no work evaluating LLMs with these constraints. We propose two novel tasks to assess the controllability of LLMs using hard and soft constraints represented as code across five representations. Our findings suggest that LLMs struggle to comprehend constraints in all representations irrespective of their portions in the pre-training data. While models are better at comprehending constraints in JSON, YAML, and natural language representations, they struggle with constraints represented in XML and the resource-rich language Python.
摘要:最近的工作表明,大型语言模型(LLM)很难理解零镜头和少镜头设置中各种文本生成任务的自然语言约束。虽然,在代码领域,广泛使用代码格式中的约束来维护用领域特定语言(SL)编写的代码的完整性,但还没有工作评估具有这些约束的LLM。我们提出了两项新颖的任务来评估LLM的可控性,使用表示为五种表示方式的代码的硬约束和软约束。我们的研究结果表明,LLM很难理解所有表示中的约束,无论它们在预训练数据中的比例如何。虽然模型更擅长理解杨森、YML和自然语言表示中的约束,但它们很难应对以HTML和资源丰富的语言Python表示的约束。

[NLP-123] A Multi-Modal Explainability Approach for Human-Aware Robots in Multi-Party Conversation
[NLP-123] 多方对话中具有人类意识的机器人的多模式解释方法

链接: https://arxiv.org/abs/2407.03340
作者: Iveta Bečková,Štefan Pócoš,Giulia Belgiovine,Marco Matarese,Alessandra Sciutti,Carlo Mazzola
关键词:
中文关键词:
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 21pp (+7pp sup.mat.) Submitted to Computer Vision and Image Understanding Journal on May 13, 2024. This research received funding Horizon-Europe TERAIS project (G.A. 101079338) and Slovak Research and Development Agency, project no. APVV-21-0105

点击查看摘要

[NLP-124] QOG:Question and Options Generation based on Language Model
[NLP-124] QOG:基于语言模型的问题和选项生成

链接: https://arxiv.org/abs/2406.12381
作者: Jincheng Zhou
关键词:
中文关键词:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-125] he Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs
[NLP-125] 量化对检索增强生成的影响:小型LLM的分析

链接: https://arxiv.org/abs/2406.10251
作者: Mert Yazan,Suzan Verberne,Frederik Situmeang
关键词:
中文关键词:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to the IR-RAG Workshop at SIGIR 2024

点击查看摘要

[NLP-126] Lost in Translation: The Algorithmic Gap Between LMs and the Brain
[NLP-126] 迷失在翻译中:LM和大脑之间的数学差距

链接: https://arxiv.org/abs/2407.04680
作者: Tommaso Tosato,Pascal Jr Tikeng Notsawo,Saskia Helbling,Irina Rish,Guillaume Dumas
关键词:
中文关键词:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-127] Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units
[NLP-127] 使用自动发现的声学单元预训练端到端关键字搜索

链接: https://arxiv.org/abs/2407.04652
作者: Bolaji Yusuf,Jan “Honza” Černocký,Murat Saraçlar
关键词:
中文关键词:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Interspeech 2024. KWS code at: this https URL AUD code at this https URL

点击查看摘要

[NLP-128] Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
[NLP-128] 通过语言模型的音频前置低等级自适应进行推测性语音识别

链接: https://arxiv.org/abs/2407.04641
作者: Bolaji Yusuf,Murali Karthick Baskar,Andrew Rosenberg,Bhuvana Ramabhadran
关键词:
中文关键词:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Interspeech 2024

点击查看摘要

[NLP-129] Written Term Detection Improves Spoken Term Detection
[NLP-129] 书面术语检测改进口语术语检测

链接: https://arxiv.org/abs/2407.04601
作者: Bolaji Yusuf,Murat Saraçlar
关键词:
中文关键词:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2024. Code at this https URL

点击查看摘要

[NLP-130] Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
[NLP-130] 具有跨模式注意力的视频时间动态学习以实现稳健的视听语音识别

链接: https://arxiv.org/abs/2407.03563
作者: Sungnyun Kim,Kangwook Jang,Sangmin Bae,Hoirin Kim,Se-Young Yun
关键词:
中文关键词:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

[NLP-131] Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations
[NLP-131] Codec-ASB:用离散语音表示训练高性能的自动语音识别系统

链接: https://arxiv.org/abs/2407.03495
作者: Kunal Dhawan,Nithin Rao Koluguri,Ante Jukić,Ryan Langman,Jagadeesh Balam,Boris Ginsburg
关键词:
中文关键词:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at Interspeech 2024

点击查看摘要

计算机视觉

[CV-0] LaRa: Efficient Large-Baseline Radiance Fields

链接: https://arxiv.org/abs/2407.04699
作者: Anpei Chen,Haofei Xu,Stefano Esposito,Siyu Tang,Andreas Geiger
关键词: achieved photorealistic, photorealistic novel view, view synthesis, synthesis and geometry, Radiance field methods
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction. But they are mostly applied in per-scene optimization or small-baseline settings. While several recent works investigate feed-forward reconstruction with large baselines by utilizing transformers, they all operate with a standard global attention mechanism and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experimental results demonstrate that our model, trained for two days on four GPUs, demonstrates high fidelity in reconstructing 360deg radiance fields, and robustness to zero-shot and out-of-domain testing.

[CV-1] VCoME: Verbal Video Composition with Multimodal Editing Effects

链接: https://arxiv.org/abs/2407.04697
作者: Weibo Gong,Xiaojie Jin,Xin Li,Dongliang He,Xinglong Wu
关键词: present significant challenges, provide valuable content, editing effects, featuring voice-overs, text overlays
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Verbal videos, featuring voice-overs or text overlays, provide valuable content but present significant challenges in composition, especially when incorporating editing effects to enhance clarity and visual appeal. In this paper, we introduce the novel task of verbal video composition with editing effects. This task aims to generate coherent and visually appealing verbal videos by integrating multimodal editing effects across textual, visual, and audio categories. To achieve this, we curate a large-scale dataset of video effects compositions from publicly available sources. We then formulate this task as a generative problem, involving the identification of appropriate positions in the verbal content and the recommendation of editing effects for these positions. To address this task, we propose VCoME, a general framework that employs a large multimodal model to generate editing effects for video composition. Specifically, VCoME takes in the multimodal video context and autoregressively outputs where to apply effects within the verbal content and which effects are most appropriate for each position. VCoME also supports prompt-based control of composition density and style, providing substantial flexibility for diverse applications. Through extensive quantitative and qualitative evaluations, we clearly demonstrate the effectiveness of VCoME. A comprehensive user study shows that our method produces videos of professional quality while being 85 \times more efficient than professional editors.

[CV-2] RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

链接: https://arxiv.org/abs/2407.04689
作者: Yuxuan Kuang,Junjie Ye,Haoran Geng,Jiageng Mao,Congyue Deng,Leonidas Guibas,He Wang,Yue Wang
关键词: featuring generalizability, RAM, dubbed RAM, affordance, manipulation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data, human-object interaction (HOI) data, and custom data to construct a comprehensive affordance memory. Then given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers such out-of-domain 2D affordance to in-domain 3D executable affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that our RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please check our website at this https URL.

[CV-3] Enhancing Vehicle Re-identification and Matching for Weaving Analysis

链接: https://arxiv.org/abs/2407.04688
作者: Mei Qiu,Wei Lin,Stanley Chien,Lauren Christopher,Yaobin Chen,Shu Hu
关键词: raises safety issues, traffic management systems, sophisticated traffic management, Vehicle weaving, raises safety
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vehicle weaving on highways contributes to traffic congestion, raises safety issues, and underscores the need for sophisticated traffic management systems. Current tools are inadequate in offering precise and comprehensive data on lane-specific weaving patterns. This paper introduces an innovative method for collecting non-overlapping video data in weaving zones, enabling the generation of quantitative insights into lane-specific weaving behaviors. Our experimental results confirm the efficacy of this approach, delivering critical data that can assist transportation authorities in enhancing traffic control and roadway infrastructure.

[CV-4] Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

链接: https://arxiv.org/abs/2407.04681
作者: Yuanze Lin,Yunsheng Li,Dongdong Chen,Weijian Xu,Ronald Clark,Philip Torr,Lu Yuan
关键词: multimodal large language, high-quality image-text datasets, made significant strides, vast high-quality image-text, generally understand images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring an understanding of detailed or localized visual elements. Drawing inspiration from the Retrieval-Augmented Generation (RAG) concept, this paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models (e.g., instance segmentation/OCR models), into MLLMs. This is a promising yet underexplored direction for enhancing MLLMs’ performance. Our approach diverges from concurrent works, which transform external knowledge into additional text prompts, necessitating the model to indirectly learn the correspondence between visual content and text coordinates. Instead, we propose embedding fine-grained knowledge information directly into a spatial embedding map as a visual prompt. This design can be effortlessly incorporated into various MLLMs, such as LLaVA and Mipha, considerably improving their visual understanding performance. Through rigorous experiments, we demonstrate that our method can enhance MLLM performance across nine benchmarks, amplifying their fine-grained context-aware capabilities.

[CV-5] Is plantar thermography a valid digital biomarker for characterising diabetic foot ulceration risk?

链接: https://arxiv.org/abs/2407.04676
作者: Akshay Jagadeesh,Chanchanok Aramrat,Aqsha Nur,Poppy Mallinson,Sanjay Kinra
关键词: DFU risk, DFU risk factors, DFU risk stratification, DFU, risk factors
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 2 Figures, 1 Table. Supplementary files and link to code to be uploaded

点击查看摘要

Abstract:Background: In the absence of prospective data on diabetic foot ulcers (DFU), cross-sectional associations with causal risk factors (peripheral neuropathy, and peripheral arterial disease (PAD)) could be used to establish the validity of plantar thermography for DFU risk stratification. Methods: First, we investigated the associations between the intrinsic clusters of plantar thermographic images with several DFU risk factors using an unsupervised deep-learning framework. We then studied associations between obtained thermography clusters and DFU risk factors. Second, to identify those associations with predictive power, we used supervised learning to train Convolutional Neural Network (CNN) regression/classification models that predicted the risk factor based on the thermograph (and visual) input. Findings: Our dataset comprised 282 thermographs from type 2 diabetes mellitus patients (aged 56.31 ± 9.18 years, 51.42 % males). On clustering, we found two overlapping clusters (silhouette score = 0.10, indicating weak separation). There was strong evidence for associations between assigned clusters and several factors related to diabetic foot ulceration such as peripheral neuropathy, PAD, number of diabetes complications, and composite DFU risk prediction scores such as Martins-Mendes, PODUS-2020, and SIGN. However, models predicting said risk factors had poor performances. Interpretation: The strong associations between intrinsic thermography clusters and several DFU risk factors support the validity of using thermography for characterising DFU risk. However, obtained associations did not prove to be predictive, likely due to, spectrum bias, or because thermography and classical risk factors characterise incompletely overlapping portions of the DFU risk construct. Our findings highlight the challenges in standardising ground truths when defining novel digital biomarkers. Comments: 13 pages, 2 Figures, 1 Table. Supplementary files and link to code to be uploaded Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) ACMclasses: I.2; I.5 Cite as: arXiv:2407.04676 [cs.CV] (or arXiv:2407.04676v1 [cs.CV] for this version) Submission history From: Akshay Jagadeesh [view email] [v1] Fri, 5 Jul 2024 17:39:03 UTC (537 KB)

[CV-6] Unsupervised 4D Cardiac Motion Tracking with Spatiotemporal Optical Flow Networks

链接: https://arxiv.org/abs/2407.04663
作者: Long Teng,Wei Feng,Menglong Zhu,Xinchao Li
关键词: quantify myocardial motion, motion tracking, motion, quantify myocardial, cardiac cycle
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cardiac motion tracking from echocardiography can be used to estimate and quantify myocardial motion within a cardiac cycle. It is a cost-efficient and effective approach for assessing myocardial function. However, ultrasound imaging has the inherent characteristics of spatially low resolution and temporally random noise, which leads to difficulties in obtaining reliable annotation. Thus it is difficult to perform supervised learning for motion tracking. In addition, there is no end-to-end unsupervised method currently in the literature. This paper presents a motion tracking method where unsupervised optical flow networks are designed with spatial reconstruction loss and temporal-consistency loss. Our proposed loss functions make use of the pair-wise and temporal correlation to estimate cardiac motion from noisy background. Experiments using a synthetic 4D echocardiography dataset has shown the effectiveness of our approach, and its superiority over existing methods on both accuracy and running speed. To the best of our knowledge, this is the first work performed that uses unsupervised end-to-end deep learning optical flow network for 4D cardiac motion tracking.

[CV-7] SAM Fewshot Finetuning for Anatomical Segmentation in Medical Images

链接: https://arxiv.org/abs/2407.04651
作者: Weiyi Xie,Nathalie Willems,Shubham Patil,Yang Li,Mayank Kumar
关键词: highly effective few-shot, anatomical segmentation tasks, few-shot fine-tuning strategy, propose a straightforward, straightforward yet highly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024

点击查看摘要

Abstract:We propose a straightforward yet highly effective few-shot fine-tuning strategy for adapting the Segment Anything (SAM) to anatomical segmentation tasks in medical images. Our novel approach revolves around reformulating the mask decoder within SAM, leveraging few-shot embeddings derived from a limited set of labeled images (few-shot collection) as prompts for querying anatomical objects captured in image embeddings. This innovative reformulation greatly reduces the need for time-consuming online user interactions for labeling volumetric images, such as exhaustively marking points and bounding boxes to provide prompts slice by slice. With our method, users can manually segment a few 2D slices offline, and the embeddings of these annotated image regions serve as effective prompts for online segmentation tasks. Our method prioritizes the efficiency of the fine-tuning process by exclusively training the mask decoder through caching mechanisms while keeping the image encoder frozen. Importantly, this approach is not limited to volumetric medical images, but can generically be applied to any 2D/3D segmentation task. To thoroughly evaluate our method, we conducted extensive validation on four datasets, covering six anatomical segmentation tasks across two modalities. Furthermore, we conducted a comparative analysis of different prompting options within SAM and the fully-supervised nnU-Net. The results demonstrate the superior performance of our method compared to SAM employing only point prompts (approximately 50% improvement in IoU) and performs on-par with fully supervised methods whilst reducing the requirement of labeled data by at least an order of magnitude.

[CV-8] Semi-Supervised Segmentation via Embedding Matching

链接: https://arxiv.org/abs/2407.04638
作者: Weiyi Xie,Nathalie Willems,Nikolas Lessmann,Tom Gibbons,Daniele De Massari
关键词: Deep convolutional neural, convolutional neural networks, Deep convolutional, convolutional neural, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, MIDL2024 oral

点击查看摘要

Abstract:Deep convolutional neural networks are widely used in medical image segmentation but require many labeled images for training. Annotating three-dimensional medical images is a time-consuming and costly process. To overcome this limitation, we propose a novel semi-supervised segmentation method that leverages mostly unlabeled images and a small set of labeled images in training. Our approach involves assessing prediction uncertainty to identify reliable predictions on unlabeled voxels from the teacher model. These voxels serve as pseudo-labels for training the student model. In voxels where the teacher model produces unreliable predictions, pseudo-labeling is carried out based on voxel-wise embedding correspondence using reference voxels from labeled images. We applied this method to automate hip bone segmentation in CT images, achieving notable results with just 4 CT scans. The proposed approach yielded a Hausdorff distance with 95th percentile (HD95) of 3.30 and IoU of 0.929, surpassing existing methods achieving HD95 (4.07) and IoU (0.927) at their best.

[CV-9] OneRestore: A Universal Restoration Framework for Composite Degradation

链接: https://arxiv.org/abs/2407.04621
作者: Yu Guo,Yuan Gao,Yuxu Lu,Huilin Zhu,Ryan Wen Liu,Shengfeng He
关键词: low light, impairments often manifest, interplay of elements, composite degradation scenarios, composite
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In real-world scenarios, image impairments often manifest as composite degradations, presenting a complex interplay of elements such as low light, haze, rain, and snow. Despite this reality, existing restoration methods typically target isolated degradation types, thereby falling short in environments where multiple degrading factors coexist. To bridge this gap, our study proposes a versatile imaging model that consolidates four physical corruption paradigms to accurately represent complex, composite degradation scenarios. In this context, we propose OneRestore, a novel transformer-based framework designed for adaptive, controllable scene restoration. The proposed framework leverages a unique cross-attention mechanism, merging degraded scene descriptors with image features, allowing for nuanced restoration. Our model allows versatile input scene descriptors, ranging from manual text embeddings to automatic extractions based on visual attributes. Our methodology is further enhanced through a composite degradation restoration loss, using extra degraded images as negative samples to fortify model constraints. Comparative results on synthetic and real-world datasets demonstrate OneRestore as a superior solution, significantly advancing the state-of-the-art in addressing complex, composite degradations.

[CV-10] CountGD: Multi-Modal Open-World Counting

链接: https://arxiv.org/abs/2407.04619
作者: Niki Amini-Naieni,Tengda Han,Andrew Zisserman
关键词: open-vocabulary object counting, improve the generality, visual exemplars, target object, visual exemplar prompts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve the generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and also extend its capabilities by introducing modules to enable specifying the target object to count by visual exemplars. In turn, these new capabilities - being able to specify the target object by multi-modalites (text and exemplars) - lead to an improvement in counting accuracy. We make three contributions: First, we introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both; Second, we show that the performance of the model significantly improves the state of the art on multiple counting benchmarks - when using text only, CountGD is comparable to or outperforms all previous text-only works, and when using both text and visual exemplars, we outperform all previous models; Third, we carry out a preliminary study into different interactions between the text and visual exemplar prompts, including the cases where they reinforce each other and where one restricts the other. The code and an app to test the model are available at this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2407.04619 [cs.CV] (or arXiv:2407.04619v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2407.04619 Focus to learn more arXiv-issued DOI via DataCite

[CV-11] Isomorphic Pruning for Vision Models

链接: https://arxiv.org/abs/2407.04616
作者: Gongfan Fang,Xinyin Ma,Michael Bi Mi,Xinchao Wang
关键词: Structured pruning reduces, deep neural networks, Structured pruning, removing redundant sub-structures, overhead of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures. However, assessing the relative importance of different sub-structures remains a significant challenge, particularly in advanced vision models featuring novel mechanisms and architectures like self-attention, depth-wise convolutions, or residual connections. These heterogeneous substructures usually exhibit diverged parameter scales, weight distributions, and computational topology, introducing considerable difficulty to importance comparison. To overcome this, we present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures such as Vision Transformers and CNNs, and delivers competitive performance across different model sizes. Isomorphic Pruning originates from an observation that, when evaluated under a pre-defined importance criterion, heterogeneous sub-structures demonstrate significant divergence in their importance distribution, as opposed to isomorphic structures that present similar importance patterns. This inspires us to perform isolated ranking and comparison on different types of sub-structures for more reliable pruning. Our empirical results on ImageNet-1K demonstrate that Isomorphic Pruning surpasses several pruning baselines dedicatedly designed for Transformers or CNNs. For instance, we improve the accuracy of DeiT-Tiny from 74.52% to 77.50% by pruning an off-the-shelf DeiT-Base model. And for ConvNext-Tiny, we enhanced performance from 82.06% to 82.18%, while reducing the number of parameters and memory usage. Code is available at \urlthis https URL.

[CV-12] PartCraft: Crafting Creative Objects by Parts

链接: https://arxiv.org/abs/2407.04604
作者: Kam Woh Ng,Xiatian Zhu,Yi-Zhe Song,Tao Xiang
关键词: propels creative control, control in generative, allowing users, paper propels creative, propels creative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024. arXiv admin note: substantial text overlap with arXiv:2311.15477

点击查看摘要

Abstract:This paper propels creative control in generative visual AI by allowing users to “select”. Departing from traditional text or sketch-based methods, we for the first time allow users to choose visual concepts by parts for their creative endeavors. The outcome is fine-grained generation that precisely captures selected visual concepts, ensuring a holistically faithful and plausible result. To achieve this, we first parse objects into parts through unsupervised feature clustering. Then, we encode parts into text tokens and introduce an entropy-based normalized attention loss that operates on them. This loss design enables our model to learn generic prior topology knowledge about object’s part composition, and further generalize to novel part compositions to ensure the generation looks holistically faithful. Lastly, we employ a bottleneck encoder to project the part tokens. This not only enhances fidelity but also accelerates learning, by leveraging shared knowledge and facilitating information exchange among instances. Visual results in the paper and supplementary material showcase the compelling power of PartCraft in crafting highly customized, innovative creations, exemplified by the “charming” and creative birds. Code is released at this https URL.

[CV-13] AWT: Transferring Vision-Language Models via Augmentation Weighting and Transportation

链接: https://arxiv.org/abs/2407.04603
作者: Yuhan Zhu,Yuyang Ji,Zhiyu Zhao,Gangshan Wu,Limin Wang
关键词: shown impressive results, Pre-trained vision-language models, Pre-trained vision-language, shown impressive, impressive results
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT’s effectiveness and adaptability across different VLMs, architectures, and scales.

[CV-14] Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection

链接: https://arxiv.org/abs/2407.04597
作者: YeongHyeon Park,Sungho Kang,Myung Jin Kim,Hyeong Seok Kim,Juneho Yi
关键词: pursued unified models, public benchmark datasets, tailor-made neural networks, adopt large-scale tailor-made, large-scale tailor-made neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures, 5 tables

点击查看摘要

Abstract:In unsupervised anomaly detection (UAD) research, while state-of-the-art models have reached a saturation point with extensive studies on public benchmark datasets, they adopt large-scale tailor-made neural networks (NN) for detection performance or pursued unified models for various tasks. Towards edge computing, it is necessary to develop a computationally efficient and scalable solution that avoids large-scale complex NNs. Motivated by this, we aim to optimize the UAD performance with minimal changes to NN settings. Thus, we revisit the reconstruction-by-inpainting approach and rethink to improve it by analyzing strengths and weaknesses. The strength of the SOTA methods is a single deterministic masking approach that addresses the challenges of random multiple masking that is inference latency and output inconsistency. Nevertheless, the issue of failure to provide a mask to completely cover anomalous regions is a remaining weakness. To mitigate this issue, we propose Feature Attenuation of Defective Representation (FADeR) that only employs two MLP layers which attenuates feature information of anomaly reconstruction during decoding. By leveraging FADeR, features of unseen anomaly patterns are reconstructed into seen normal patterns, reducing false alarms. Experimental results demonstrate that FADeR achieves enhanced performance compared to similar-scale NNs. Furthermore, our approach exhibits scalability in performance enhancement when integrated with other single deterministic masking methods in a plug-and-play manner.

[CV-15] Smell and Emotion: Recognising emotions in smell-related artworks

链接: https://arxiv.org/abs/2407.04592
作者: Vishal Patoliya,Mathias Zinnen,Andreas Maier,Vincent Christlein
关键词: digital art history, art history, smell are underrepresented, underrepresented in digital, digital art
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Emotions and smell are underrepresented in digital art history. In this exploratory work, we show that recognising emotions from smell-related artworks is technically feasible but has room for improvement. Using style transfer and hyperparameter optimization we achieve a minor performance boost and open up the field for future extensions.

[CV-16] SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry

链接: https://arxiv.org/abs/2407.04590
作者: Hafiz Mughees Ahmad,Afshin Rahimi
关键词: Personal Protective Equipment, Workplace accidents continue, effective Personal Protective, pose significant risks, Convolutional Neural Network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Workplace accidents continue to pose significant risks for human safety, particularly in industries such as construction and manufacturing, and the necessity for effective Personal Protective Equipment (PPE) compliance has become increasingly paramount. Our research focuses on the development of non-invasive techniques based on the Object Detection (OD) and Convolutional Neural Network (CNN) to detect and verify the proper use of various types of PPE such as helmets, safety glasses, masks, and protective clothing. This study proposes the SH17 Dataset, consisting of 8,099 annotated images containing 75,994 instances of 17 classes collected from diverse industrial environments, to train and validate the OD models. We have trained state-of-the-art OD models for benchmarking, and initial results demonstrate promising accuracy levels with You Only Look Once (YOLO)v9-e model variant exceeding 70.9% in PPE detection. The performance of the model validation on cross-domain datasets suggests that integrating these technologies can significantly improve safety management systems, providing a scalable and efficient solution for industries striving to meet human safety regulations and protect their workforce. The dataset is available at this https URL.

[CV-17] Multimodal Classification via Modal-Aware Interactive Enhancement

链接: https://arxiv.org/abs/2407.04587
作者: Qing-Yuan Jiang,Zhouyang Chi,Yang Yang
关键词: modality imbalance problem, notorious modality imbalance, achieve satisfactory performance, multimodal learning, imbalance problem
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Due to the notorious modality imbalance problem, multimodal learning (MML) leads to the phenomenon of optimization imbalance, thus struggling to achieve satisfactory performance. Recently, some representative methods have been proposed to boost the performance, mainly focusing on adaptive adjusting the optimization of each modality to rebalance the learning speed of dominant and non-dominant modalities. To better facilitate the interaction of model information in multimodal learning, in this paper, we propose a novel multimodal learning method, called modal-aware interactive enhancement (MIE). Specifically, we first utilize an optimization strategy based on sharpness aware minimization (SAM) to smooth the learning objective during the forward phase. Then, with the help of the geometry property of SAM, we propose a gradient modification strategy to impose the influence between different modalities during the backward phase. Therefore, we can improve the generalization ability and alleviate the modality forgetting phenomenon simultaneously for multimodal learning. Extensive experiments on widely used datasets demonstrate that our proposed method can outperform various state-of-the-art baselines to achieve the best performance.

[CV-18] Real Time Emotion Analysis Using Deep Learning for Education Entertainment and Beyond

链接: https://arxiv.org/abs/2407.04560
作者: Abhilash Khuntia,Shubham Kale
关键词: significance of emotion, emotion detection, detection is increasing, facial expressions, transform facial expressions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 23 figures

点击查看摘要

Abstract:The significance of emotion detection is increasing in education, entertainment, and various other domains. We are developing a system that can identify and transform facial expressions into emojis to provide immediate feedback.The project consists of two components. Initially, we will employ sophisticated image processing techniques and neural networks to construct a deep learning model capable of precisely categorising facial expressions. Next, we will develop a basic application that records live video using the camera on your device. The app will utilise a sophisticated model to promptly analyse facial expressions and promptly exhibit corresponding emojis.Our objective is to develop a dynamic tool that integrates deep learning and real-time video processing for the purposes of online education, virtual events, gaming, and enhancing user experience. This tool enhances interactions and introduces novel emotional intelligence technologies.

[CV-19] Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence Grounding and Repetition

链接: https://arxiv.org/abs/2407.04559
作者: Aditya K Surikuchi,Raquel Fernández,Sandro Pezzelle
关键词: temporally ordered sequence, Visual storytelling consists, sequence of images, consists in generating, generating a natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story ‘good’. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a ‘good’ story may require more than a human-like level of visual grounding, coherence, and repetition.

[CV-20] Gaussian Eigen Models for Human Heads

链接: https://arxiv.org/abs/2407.04545
作者: Wojciech Zielonka,Timo Bolkart,Thabo Beeler,Justus Thies
关键词: present personalized Gaussian, Gaussian Eigen Models, low-dimensional linear spaces, personalized Gaussian Eigen, present personalized
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:We present personalized Gaussian Eigen Models (GEMs) for human heads, a novel method that compresses dynamic 3D Gaussians into low-dimensional linear spaces. Our approach is inspired by the seminal work of Blanz and Vetter, where a mesh-based 3D morphable model (3DMM) is constructed from registered meshes. Based on dynamic 3D Gaussians, we create a lower-dimensional representation of primitives that applies to most 3DGS head avatars. Specifically, we propose a universal method to distill the appearance of a mesh-controlled UNet Gaussian avatar using an ensemble of linear eigenbasis. We replace heavy CNN-based architectures with a single linear layer improving speed and enabling a range of real-time downstream applications. To create a particular facial expression, one simply needs to perform a dot product between the eigen coefficients and the distilled basis. This efficient method removes the requirement for an input mesh during testing, enhancing simplicity and speed in expression generation. This process is highly efficient and supports real-time rendering on everyday devices, leveraging the effectiveness of standard Gaussian Splatting. In addition, we demonstrate how the GEM can be controlled using a ResNet-based regression architecture. We show and compare self-reenactment and cross-person reenactment to state-of-the-art 3D avatar methods, demonstrating higher quality and better control. A real-time demo showcases the applicability of the GEM representation.

[CV-21] Rethinking Image Compression on the Web with Generative AI

链接: https://arxiv.org/abs/2407.04542
作者: Shayan Ali Hassan,Danish Humair,Ihsan Ayyub Qazi,Zafar Ayyub Qazi
关键词: increased webpage sizes, significant data transfer, made images central, web browsing, Web experience
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The rapid growth of the Internet, driven by social media, web browsing, and video streaming, has made images central to the Web experience, resulting in significant data transfer and increased webpage sizes. Traditional image compression methods, while reducing bandwidth, often degrade image quality. This paper explores a novel approach using generative AI to reconstruct images at the edge or client-side. We develop a framework that leverages text prompts and provides additional conditioning inputs like Canny edges and color palettes to a text-to-image model, achieving up to 99.8% bandwidth savings in the best cases and 92.6% on average, while maintaining high perceptual similarity. Empirical analysis and a user study show that our method preserves image meaning and structure more effectively than traditional compression methods, offering a promising solution for reducing bandwidth usage and improving Internet affordability with minimal degradation in image quality.

[CV-22] PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

链接: https://arxiv.org/abs/2407.04538
作者: Ananthu Aniraj,Cassio F.Dantas,Dino Ienco,Diego Marcos
关键词: explicitly detect object, detect object parts, Computer vision methods, inherently interpretable models, Computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a main conference paper at the European Conference of Computer Vision (ECCV) 2024

点击查看摘要

Abstract:Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts; they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require to rethink the geometric priors that can be used for unsupervised part discovery.

[CV-23] Success or Failure? Analyzing Segmentation Refinement with Few-Shot Segmentation

链接: https://arxiv.org/abs/2407.04519
作者: Seonghyeon Moon,Haein Kong,Muhammad Haris Khan
关键词: segmentation refinement, segmentation, masks, coarse masks generated, refinement
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages

点击查看摘要

Abstract:The purpose of segmentation refinement is to enhance the initial coarse masks generated by segmentation algorithms. The refined masks are expected to capture the details and contours of the target objects. Research on segmentation refinement has developed as a response to the need for high-quality initial masks. However, to our knowledge, no method has been developed that can determine the success of segmentation refinement. Such a method could ensure the reliability of segmentation in applications where the outcome of the segmentation is important, and fosters innovation in image processing technologies. To address this research gap, we propose JFS~(Judging From Support-set), a method to identify the success of segmentation refinement leveraging a few-shot segmentation (FSS) model. The traditional goal of the problem in FSS is to find a target object in a query image utilizing target information given by a support set. However, in our proposed method, we use the FSS network in a novel way to assess the segmentation refinement. When there are two masks, a coarse mask and a refined mask from segmentation refinement, these two masks become support masks. The existing support mask works as a ground truth mask to judge whether the quality of the refined segmentation is more accurate than the coarse mask. We first obtained a coarse mask and refined it using SEPL (SAM Enhanced Pseduo-Labels) to get the two masks. Then, these become input to FSS model to judge whether the post-processing was successful. JFS is evaluated on the best and worst cases from SEPL to validate its effectiveness. The results showed that JFS can determine whether the SEPL is a success or not.

[CV-24] LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order

链接: https://arxiv.org/abs/2407.04513
作者: Matthias Freiberger,Peter Kun,Anders Sundnes Løvlie,Sebastian Risi
关键词: artificial neural networks, robust toward pruning, typically not robust, neural network architectures, artificial neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to their architecture and how they are trained, artificial neural networks are typically not robust toward pruning, replacing, or shuffling layers at test time. However, such properties would be desirable for different applications, such as distributed neural network architectures where the order of execution cannot be guaranteed or parts of the network can fail during inference. In this work, we address these issues through a number of proposed training approaches for vision transformers whose most important component is randomizing the execution order of attention modules at training time. We show that with our proposed approaches, vision transformers are indeed capable to adapt to arbitrary layer execution orders at test time assuming one tolerates a reduction (about 20%) in accuracy at the same model size. We also find that our trained models can be randomly merged with each other resulting in functional (“Frankenstein”) models without loss of performance compared to the source models. Finally, we layer-prune our models at test time and find that their performance declines gracefully.

[CV-25] Hyperspectral Dataset and Deep Learning methods for Waste from Electric and Electronic Equipment Identification (WEEE)

链接: https://arxiv.org/abs/2407.04505
作者: Artzai Picon,Pablo Galan,Arantza Bereciartua-Perez,Leire Benito-del-Valle
关键词: Toggle, deep learning techniques, Hyperspectral, Code Toggle Papers, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Hyperspectral imaging, a rapidly evolving field, has witnessed the ascendancy of deep learning techniques, supplanting classical feature extraction and classification methods in various applications. However, many researchers employ arbitrary architectures for hyperspectral image processing, often without rigorous analysis of the interplay between spectral and spatial information. This oversight neglects the implications of combining these two modalities on model performance. In this paper, we evaluate the performance of diverse deep learning architectures for hyperspectral image segmentation. Our analysis disentangles the impact of different architectures, spanning various spectral and spatial granularities. Specifically, we investigate the effects of spectral resolution (capturing spectral information) and spatial texture (conveying spatial details) on segmentation outcomes. Additionally, we explore the transferability of knowledge from large pre-trained image foundation models, originally designed for RGB images, to the hyperspectral domain. Results show that incorporating spatial information alongside spectral data leads to improved segmentation results, and that it is essential to further work on novel architectures comprising spectral and spatial information and on the adaption of RGB foundation models into the hyperspectral domain. Furthermore, we contribute to the field by cleaning and publicly releasing the Tecnalia WEEE Hyperspectral dataset. This dataset contains different non-ferrous fractions of Waste Electrical and Electronic Equipment (WEEE), including Copper, Brass, Aluminum, Stainless Steel, and White Copper, spanning the range of 400 to 1000 nm. We expect these conclusions can guide novel researchers in the field of hyperspectral imaging. Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV) Cite as: arXiv:2407.04505 [cs.CV] (or arXiv:2407.04505v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2407.04505 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Artzai Picon [view email] [v1] Fri, 5 Jul 2024 13:45:11 UTC (3,587 KB) Full-text links: Access Paper: View a PDF of the paper titled Hyperspectral Dataset and Deep Learning methods for Waste from Electric and Electronic Equipment Identification (WEEE), by Artzai Picon and Pablo Galan and Arantza Bereciartua-Perez and Leire Benito-del-ValleView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CV prev | next new | recent | 2024-07 Change to browse by: cs eess eess.IV References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[CV-26] Segment Any 4D Gaussians

链接: https://arxiv.org/abs/2407.04504
作者: Shengxiang Ji,Guanjun Wu,Jiemin Fang,Jiazhong Cen,Taoran Yi,Wenyu Liu,Qi Tian,Xinggang Wang
关键词: reconstructing the real, Gaussian Splatting, real world, Modeling, Gaussians
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages

点击查看摘要

Abstract:Modeling, understanding, and reconstructing the real world are crucial in XR/VR. Recently, 3D Gaussian Splatting (3D-GS) methods have shown remarkable success in modeling and understanding 3D scenes. Similarly, various 4D representations have demonstrated the ability to capture the dynamics of the 4D world. However, there is a dearth of research focusing on segmentation within 4D representations. In this paper, we propose Segment Any 4D Gaussians (SA4D), one of the first frameworks to segment anything in the 4D digital world based on 4D Gaussians. In SA4D, an efficient temporal identity feature field is introduced to handle Gaussian drifting, with the potential to learn precise identity features from noisy and sparse input. Additionally, a 4D segmentation refinement process is proposed to remove artifacts. Our SA4D achieves precise, high-quality segmentation within seconds in 4D Gaussians and shows the ability to remove, recolor, compose, and render high-quality anything masks. More demos are available at: this https URL.

[CV-27] Micro-gesture Online Recognition using Learnable Query Points

链接: https://arxiv.org/abs/2407.04490
作者: Pengyu Liu,Fei Wang,Kun Li,Guoliang Chen,Yanyan Wei,Shengeng Tang,Zhiliang Wu,Dan Guo
关键词: Micro-gesture Online Recognition, Online Recognition track, Online Recognition, Micro-gesture Online, challenge at IJCAI
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report of HFUT-VUT for the MiGA challenge at IJCAI 2024

点击查看摘要

Abstract:In this paper, we briefly introduce the solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track in the MiGA challenge at IJCAI 2024. The Micro-gesture Online Recognition task involves identifying the category and locating the start and end times of micro-gestures in video clips. Compared to the typical Temporal Action Detection task, the Micro-gesture Online Recognition task focuses more on distinguishing between micro-gestures and pinpointing the start and end times of actions. Our solution ranks 2nd in the Micro-gesture Online Recognition track.

[CV-28] Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model

链接: https://arxiv.org/abs/2407.04489
作者: Duy M. H. Nguyen,An T. Le,Trung Q. Nguyen,Nghiem T. Diep,Tai Nguyen,Duy Duong-Tran,Jan Peters,Li Shen,Mathias Niepert,Daniel Sonntag
关键词: gaining increasing attention, increasing attention due, customize large vision-language, pre-trained contextual knowledge, minimal training data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Version 1

点击查看摘要

Abstract:Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we consider a new framework based on a dual context of both domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. Such dual prompt methods enhance the model’s feature representation by joining implicit and explicit factors encoded in LLM knowledge. Moreover, we formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens. Through partial matching, UOT can properly align discrete sets of visual tokens and prompt embeddings under different mass distributions, which is particularly valuable for handling irrelevant or noisy elements, ensuring that the preservation of mass does not restrict transport solutions. Furthermore, UOT’s characteristics integrate seamlessly with image augmentation, expanding the training sample pool while maintaining a reasonable distance between perturbed images and prompt inputs. Extensive experiments across few-shot classification and adapter settings substantiate the superiority of our model over current state-of-the-art baselines.

[CV-29] Optimizing the image correction pipeline for pedestrian detection in the thermal-infrared domain

链接: https://arxiv.org/abs/2407.04484
作者: Christophe Karam,Jessy Matias,Xavier Breniere,Jocelyn Chanussot
关键词: low-visibility situations, fog and low-light, noise and requires, low-light scenarios, infrared processing pipelines
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Infrared imagery can help in low-visibility situations such as fog and low-light scenarios, but it is prone to thermal noise and requires further processing and correction. This work studies the effect of different infrared processing pipelines on the performance of a pedestrian detection in an urban environment, similar to autonomous driving scenarios. Detection on infrared images is shown to outperform that on visible images, but the infrared correction pipeline is crucial since the models cannot extract information from raw infrared images. Two thermal correction pipelines are studied, the shutter and the shutterless pipes. Experiments show that some correction algorithms like spatial denoising are detrimental to performance even if they increase visual quality for a human observer. Other algorithms like destriping and, to a lesser extent, temporal denoising, increase computational time, but have some role to play in increasing detection accuracy. As it stands, the optimal trade-off for speed and accuracy is simply to use the shutterless pipe with a tonemapping algorithm only, for autonomous driving applications within varied environments.

[CV-30] Rethinking Data Input for Point Cloud Upsampling

链接: https://arxiv.org/abs/2407.04476
作者: Tongxu Zhang
关键词: point cloud, point cloud upsampling, point cloud model, patch based, recent years
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:In recent years, point cloud upsampling has been widely applied in fields such as 3D reconstruction and surface generation. However, existing point cloud upsampling inputs are all patch based, and there is no research discussing the differences and principles between point cloud model full input and patch based input. In order to compare with patch based point cloud input, this article proposes a new data input method, which divides the full point cloud model to ensure shape integrity while training PU-GCN. This article was validated on the PU1K and ABC datasets, but the results showed that Patch based performance is better than model based full input i.e. Average Segment input. Therefore, this article explores the data input factors and model modules that affect the upsampling results of point clouds.

[CV-31] VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

链接: https://arxiv.org/abs/2407.04461
作者: Shang Liu,Chaohui Yu,Chenjie Cao,Wen Qian,Fan Wang
关键词: Recent research, shapes benefits, dramatically developed, including inpainting-based, optimization-based approaches
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Recent research on texture synthesis for 3D shapes benefits a lot from dramatically developed 2D text-to-image diffusion models, including inpainting-based and optimization-based approaches. However, these methods ignore the modal gap between the 2D diffusion model and 3D objects, which primarily render 3D objects into 2D images and texture each image separately. In this paper, we revisit the texture synthesis and propose a Variance alignment based 3D-2D Collaborative Denoising framework, dubbed VCD-Texture, to address these issues. Formally, we first unify both 2D and 3D latent feature learning in diffusion self-attention modules with re-projected 3D attention receptive fields. Subsequently, the denoised multi-view 2D latent features are aggregated into 3D space and then rasterized back to formulate more consistent 2D predictions. However, the rasterization process suffers from an intractable variance bias, which is theoretically addressed by the proposed variance alignment, achieving high-fidelity texture synthesis. Moreover, we present an inpainting refinement to further improve the details with conflicting regions. Notably, there is not a publicly available benchmark to evaluate texture synthesis, which hinders its development. Thus we construct a new evaluation set built upon three open-source 3D datasets and propose to use four metrics to thoroughly validate the texturing performance. Comprehensive experiments demonstrate that VCD-Texture achieves superior performance against other counterparts.

[CV-32] Robust Multimodal Learning via Representation Decoupling

链接: https://arxiv.org/abs/2407.04458
作者: Shicai Wei,Yang Luo,Yuji Wang,Chunbo Luo
关键词: attracted increasing attention, attracted increasing, modality combinations, increasing attention due, Multimodal Representation Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV2024 17 pages

点击查看摘要

Abstract:Multimodal learning robust to missing modality has attracted increasing attention due to its practicality. Existing methods tend to address it by learning a common subspace representation for different modality combinations. However, we reveal that they are sub-optimal due to their implicit constraint on intra-class representation. Specifically, the sample with different modalities within the same class will be forced to learn representations in the same direction. This hinders the model from capturing modality-specific information, resulting in insufficient learning. To this end, we propose a novel Decoupled Multimodal Representation Network (DMRNet) to assist robust multimodal learning. Specifically, DMRNet models the input from different modality combinations as a probabilistic distribution instead of a fixed point in the latent space, and samples embeddings from the distribution for the prediction module to calculate the task loss. As a result, the direction constraint from the loss minimization is blocked by the sampled representation. This relaxes the constraint on the inference representation and enables the model to capture the specific information for different modality combinations. Furthermore, we introduce a hard combination regularizer to prevent DMRNet from unbalanced training by guiding it to pay more attention to hard modality combinations. Finally, extensive experiments on multimodal classification and segmentation tasks demonstrate that the proposed DMRNet outperforms the state-of-the-art significantly.

[CV-33] Multi-modal Masked Siamese Network Improves Chest X-Ray Representation Learning

链接: https://arxiv.org/abs/2407.04449
作者: Saeed Shurrab,Alejandro Guerra-Manzanares,Farah E. Shamout
关键词: images primarily rely, Electronic Health Records, medical images primarily, Masked Siamese Network, images primarily
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Self-supervised learning methods for medical images primarily rely on the imaging modality during pretraining. While such approaches deliver promising results, they do not leverage associated patient or scan information collected within Electronic Health Records (EHR). Here, we propose to incorporate EHR data during self-supervised pretraining with a Masked Siamese Network (MSN) to enhance the quality of chest X-ray representations. We investigate three types of EHR data, including demographic, scan metadata, and inpatient stay information. We evaluate our approach on three publicly available chest X-ray datasets, MIMIC-CXR, CheXpert, and NIH-14, using two vision transformer (ViT) backbones, specifically ViT-Tiny and ViT-Small. In assessing the quality of the representations via linear evaluation, our proposed method demonstrates significant improvement compared to vanilla MSN and state-of-the-art self-supervised learning baselines. Our work highlights the potential of EHR-enhanced self-supervised pre-training for medical imaging. The code is publicly available at: this https URL

[CV-34] Graph-Guided Test-Time Adaptation for Glaucoma Diagnosis using Fundus Photography

链接: https://arxiv.org/abs/2407.04396
作者: Qian Zeng,Fan Zhang
关键词: irreversible blindness worldwide, blindness worldwide, irreversible blindness, Graph-guided Test-Time Adaptation, glaucoma diagnosis models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 3 figures, 3 tables, submitted to MICCAI

点击查看摘要

Abstract:Glaucoma is a leading cause of irreversible blindness worldwide. While deep learning approaches using fundus images have largely improved early diagnosis of glaucoma, variations in images from different devices and locations (known as domain shifts) challenge the use of pre-trained models in real-world settings. To address this, we propose a novel Graph-guided Test-Time Adaptation (GTTA) framework to generalize glaucoma diagnosis models to unseen test environments. GTTA integrates the topological information of fundus images into the model training, enhancing the model’s transferability and reducing the risk of learning spurious correlation. During inference, GTTA introduces a novel test-time training objective to make the source-trained classifier progressively adapt to target patterns with reliable class conditional estimation and consistency regularization. Experiments on cross-domain glaucoma diagnosis benchmarks demonstrate the superiority of the overall framework and individual components under different backbone networks.

[CV-35] Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos

链接: https://arxiv.org/abs/2407.04384
作者: Leonhard Sommer,Artur Jesslen,Eddy Ilg,Adam Kortylewski
关键词: fundamentally important problem, vision and robotics, fundamentally important, computer vision, embodied agents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics, e.g. for embodied agents or to train 3D generative models. However, so far methods that estimate the category-level object pose require either large amounts of human annotations, CAD models or input from RGB-D sensors. In contrast, we tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos without human supervision. We propose a two-step pipeline: First, we introduce a multi-view alignment procedure that determines canonical camera poses across videos with a novel and robust cyclic distance formulation for geometric and appearance matching using reconstructed coarse meshes and DINOv2 features. In a second step, the canonical poses and reconstructed meshes enable us to train a model for 3D pose estimation from a single image. In particular, our model learns to estimate dense correspondences between images and a prototypical 3D template by predicting, for each pixel in a 2D image, a feature vector of the corresponding vertex in the template mesh. We demonstrate that our method outperforms all baselines at the unsupervised alignment of object-centric videos by a large margin and provides faithful and robust predictions in-the-wild. Our code and data is available at this https URL.

[CV-36] Self-Supervised Representation Learning for Adversarial Attack Detection

链接: https://arxiv.org/abs/2407.04382
作者: Yi Li,Plamen Angelov,Neeraj Suri
关键词: adversarial attack detection, learning-based adversarial attack, attack detection, attack detection methods, attack detection task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Supervised learning-based adversarial attack detection methods rely on a large number of labeled data and suffer significant performance degradation when applying the trained model to new domains. In this paper, we propose a self-supervised representation learning framework for the adversarial attack detection task to address this drawback. Firstly, we map the pixels of augmented input images into an embedding space. Then, we employ the prototype-wise contrastive estimation loss to cluster prototypes as latent variables. Additionally, drawing inspiration from the concept of memory banks, we introduce a discrimination bank to distinguish and learn representations for each individual instance that shares the same or a similar prototype, establishing a connection between instances and their associated prototypes. We propose a parallel axial-attention (PAA)-based encoder to facilitate the training process by parallel training over height- and width-axis of attention maps. Experimental results show that, compared to various benchmark self-supervised vision learning models and supervised adversarial attack detection methods, the proposed model achieves state-of-the-art performance on the adversarial attack detection task across a wide range of images.

[CV-37] Multi-Branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for accurate object detection

链接: https://arxiv.org/abs/2407.04381
作者: Zhiqiang Yang,Qiu Guan,Keer Zhao,Jianmin Yang,Xinli Xu,Haixia Long,Ying Tang
关键词: Path Aggregation FPN, YOLO detectors, employed in YOLO, Path Aggregation, multi-scale feature fusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Due to the effective performance of multi-scale feature fusion, Path Aggregation FPN (PAFPN) is widely employed in YOLO detectors. However, it cannot efficiently and adaptively integrate high-level semantic information with low-level spatial information simultaneously. We propose a new model named MAF-YOLO in this paper, which is a novel object detection framework with a versatile neck named Multi-Branch Auxiliary FPN (MAFPN). Within MAFPN, the Superficial Assisted Fusion (SAF) module is designed to combine the output of the backbone with the neck, preserving an optimal level of shallow information to facilitate subsequent learning. Meanwhile, the Advanced Assisted Fusion (AAF) module deeply embedded within the neck conveys a more diverse range of gradient information to the output layer. Furthermore, our proposed Re-parameterized Heterogeneous Efficient Layer Aggregation Network (RepHELAN) module ensures that both the overall model architecture and convolutional design embrace the utilization of heterogeneous large convolution kernels. Therefore, this guarantees the preservation of information related to small targets while simultaneously achieving the multi-scale receptive field. Finally, taking the nano version of MAF-YOLO for example, it can achieve 42.4% AP on COCO with only 3.76M learnable parameters and 10.51G FLOPs, and approximately outperforms YOLOv8n by about 5.1%. The source code of this work is available at: this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2407.04381 [cs.CV] (or arXiv:2407.04381v1 [cs.CV] for this version)

[CV-38] ZARRIO @ Ego4D Short Term Object Interaction Anticipation Challenge: Leveraging Affordances and Attention-based models for STA

链接: https://arxiv.org/abs/2407.04369
作者: Lorenzo Mur-Labadia,Ruben Martinez-Cantin,Josechu Guerrero-Campo,Giovanni Maria Farinella
关键词: Short-Term object-interaction Anticipation, object-interaction Anticipation, STA predictions, support STA predictions, Short-Term object-interaction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2406.01194

点击查看摘要

Abstract:Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multi-scale feature fusion to support STA predictions from an image-input video pair. Moreover, we introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. On the test set, our results obtain a final 33.5 N mAP, 17.25 N+V mAP, 11.77 N+\delta mAP and 6.75 Overall top-5 mAP metric when trained on the v2 training dataset.

[CV-39] owards Context-aware Support for Color Vision Deficiency: An Approach Integrating LLM and AR

链接: https://arxiv.org/abs/2407.04362
作者: Shogo Morita,Yan Zhang,Takuto Yamauchi,Sinan Chen,Jialong Li,Kenji Tei
关键词: complicate daily tasks, color vision deficiency, red and green, environmental adjustments, color vision
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:People with color vision deficiency often face challenges in distinguishing colors such as red and green, which can complicate daily tasks and require the use of assistive tools or environmental adjustments. Current support tools mainly focus on presentation-based aids, like the color vision modes found in iPhone accessibility settings. However, offering context-aware support, like indicating the doneness of meat, remains a challenge since task-specific solutions are not cost-effective for all possible scenarios. To address this, our paper proposes an application that provides contextual and autonomous assistance. This application is mainly composed of: (i) an augmented reality interface that efficiently captures context; and (ii) a multi-modal large language model-based reasoner that serves to cognitize the context and then reason about the appropriate support contents. Preliminary user experiments with two color vision deficient users across five different scenarios have demonstrated the effectiveness and universality of our application.

[CV-40] Shape Prior Segmentation Guided by Harmonic Beltrami Signature

链接: https://arxiv.org/abs/2407.04360
作者: Chenran Lin,Lok Ming Lui
关键词: Harmonic Beltrami Signature, Beltrami Signature, Harmonic Beltrami, segmentation method guided, shape prior knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Complex Variables (math.CV)
*备注: 34 pages, 15 figures

点击查看摘要

Abstract:This paper presents a novel shape prior segmentation method guided by the Harmonic Beltrami Signature (HBS). The HBS is a shape representation fully capturing 2D simply connected shapes, exhibiting resilience against perturbations and invariance to translation, rotation, and scaling. The proposed method integrates the HBS within a quasi-conformal topology preserving segmentation framework, leveraging shape prior knowledge to significantly enhance segmentation performance, especially for low-quality or occluded images. The key innovation lies in the bifurcation of the optimization process into two iterative stages: 1) The computation of a quasi-conformal deformation map, which transforms the unit disk into the targeted segmentation area, driven by image data and other regularization terms; 2) The subsequent refinement of this map is contingent upon minimizing the L_2 distance between its Beltrami coefficient and the reference HBS. This shape-constrained refinement ensures that the segmentation adheres to the reference shape(s) by exploiting the inherent invariance, robustness, and discerning shape discriminative capabilities afforded by the HBS. Extensive experiments on synthetic and real-world images validate the method’s ability to improve segmentation accuracy over baselines, eliminate preprocessing requirements, resist noise corruption, and flexibly acquire and apply shape priors. Overall, the HBS segmentation framework offers an efficient strategy to robustly incorporate the shape prior knowledge, thereby advancing critical low-level vision tasks.

[CV-41] Data-Driven Tissue- and Subject-Specific Elastic Regularization for Medical Image Registration

链接: https://arxiv.org/abs/2407.04355
作者: Anna Reithmeir,Lina Felsner,Rickmer Braren,Julia A. Schnabel,Veronika A. Zimmer
关键词: intra-patient image registration, anatomical structures, Physics-inspired regularization, desired for intra-patient, intra-patient image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024

点击查看摘要

Abstract:Physics-inspired regularization is desired for intra-patient image registration since it can effectively capture the biomechanical characteristics of anatomical structures. However, a major challenge lies in the reliance on physical parameters: Parameter estimations vary widely across the literature, and the physical properties themselves are inherently subject-specific. In this work, we introduce a novel data-driven method that leverages hypernetworks to learn the tissue-dependent elasticity parameters of an elastic regularizer. Notably, our approach facilitates the estimation of patient-specific parameters without the need to retrain the network. We evaluate our method on three publicly available 2D and 3D lung CT and cardiac MR datasets. We find that with our proposed subject-specific tissue-dependent regularization, a higher registration quality is achieved across all datasets compared to using a global regularizer. The code is available at this https URL.

[CV-42] MobileFlow: A Multimodal LLM For Mobile GUI Agent

链接: https://arxiv.org/abs/2407.04346
作者: Songqin Nong,Jiali Zhu,Rui Wu,Jiongchao Jin,Shuo Shan,Xiutian Huang,Wenhao Xu
关键词: Graphical User Interfaces, people daily lives, mobile Graphical User, Graphical User, GUI Agents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Currently, the integration of mobile Graphical User Interfaces (GUIs) is ubiquitous in most people’s daily lives. And the ongoing evolution of multimodal large-scale models, such as GPT-4v, Qwen-VL-Max, has significantly bolstered the capabilities of GUI comprehension and user action analysis, showcasing the potentiality of intelligent GUI assistants. However, current GUI Agents often need to access page layout information through calling system APIs, which may pose privacy risks. Fixing GUI (such as mobile interfaces) to a certain low resolution might result in the loss of fine-grained image details. At the same time, the multimodal large models built for GUI Agents currently have poor understanding and decision-making abilities for Chinese GUI interfaces, making them difficult to apply to a large number of Chinese apps. This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents. Transforming from the open-source model Qwen-VL-Chat into GUI domain, MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders, making it possible for variable resolutions of image inputs and good support for multilingual GUI. By incorporating Mixture of Experts (MoE) expansions and pioneering alignment training strategies, MobileFlow has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks. Finally, MobileFlow outperforms Qwen-VL-Max and GPT-4v in terms of task execution by GUI agents on both public and our proposed evaluation metrics, and has been successfully deployed in real-world business contexts, proving its effectiveness for practical applications.

[CV-43] CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images

链接: https://arxiv.org/abs/2407.04345
作者: Jisu Shin,Junmyeong Lee,Seongmin Lee,Min-Gyu Park,Ju-Mi Kang,Ju Hong Yoon,Hae-Gon Jeon
关键词: reconstructing animatable human, animatable human avatars, Linear Blend Skinning, framework for reconstructing, reconstructing animatable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Accepted (18 pages, 9 figures)

点击查看摘要

Abstract:We present a novel framework for reconstructing animatable human avatars from multiple images, termed CanonicalFusion. Our central concept involves integrating individual reconstruction results into the canonical space. To be specific, we first predict Linear Blend Skinning (LBS) weight maps and depth maps using a shared-encoder-dual-decoder network, enabling direct canonicalization of the 3D mesh from the predicted depth maps. Here, instead of predicting high-dimensional skinning weights, we infer compressed skinning weights, i.e., 3-dimensional vector, with the aid of pre-trained MLP networks. We also introduce a forward skinning-based differentiable rendering scheme to merge the reconstructed results from multiple images. This scheme refines the initial mesh by reposing the canonical mesh via the forward skinning and by minimizing photometric and geometric errors between the rendered and the predicted results. Our optimization scheme considers the position and color of vertices as well as the joint angles for each image, thereby mitigating the negative effects of pose errors. We conduct extensive experiments to demonstrate the effectiveness of our method and compare our CanonicalFusion with state-of-the-art methods. Our source codes are available at this https URL.

[CV-44] Learning Geometric Invariant Features for Classification of Vector Polygons with Graph Message-passing Neural Network

链接: https://arxiv.org/abs/2407.04334
作者: Zexian Huang,Kourosh Khoshelham,Martin Tomko
关键词: non-trivial learning task, vector polygons remains, deep learning approaches, vector polygons, polygons
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Geometric shape classification of vector polygons remains a non-trivial learning task in spatial analysis. Previous studies mainly focus on devising deep learning approaches for representation learning of rasterized vector polygons, whereas the study of discrete representations of polygons and subsequent deep learning approaches have not been fully investigated. In this study, we investigate a graph representation of vector polygons and propose a novel graph message-passing neural network (PolyMP) to learn the geometric-invariant features for shape classification of polygons. Through extensive experiments, we show that the graph representation of polygons combined with a permutation-invariant graph message-passing neural network achieves highly robust performances on benchmark datasets (i.e., synthetic glyph and real-world building footprint datasets) as compared to baseline methods. We demonstrate that the proposed graph-based PolyMP network enables the learning of expressive geometric features invariant to geometric transformations of polygons (i.e., translation, rotation, scaling and shearing) and is robust to trivial vertex removals of polygons. We further show the strong generalizability of PolyMP, which enables generalizing the learned geometric features from the synthetic glyph polygons to the real-world building footprints.

[CV-45] F-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking

链接: https://arxiv.org/abs/2407.04327
作者: Thuc Nguyen-Quang,Minh-Triet Tran
关键词: requiring precise localization, computer vision remains, Multi-object tracking, requiring precise, video sequences
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences. This task is crucial for various applications, including action recognition and behavior analysis. Key challenges include occlusion, reidentification, tracking fast-moving objects, and handling camera motion artifacts. Past research has explored tracking-by-detection methods and end-to-end models, with recent attention on tracking-by-attention approaches leveraging transformer architectures. The emergence of data sets that emphasize robust reidentification, such as DanceTrack, has highlighted the need for effective solutions. While memory-based approaches have shown promise, they often suffer from high computational complexity and memory usage. We propose a novel sparse memory approach that selectively stores critical features based on object motion and overlapping awareness, aiming to enhance efficiency while minimizing redundancy. Building upon the MOTRv2 model, a hybrid of tracking-by-attention and tracking-by-detection, we introduce a training-free memory designed to bolster reidentification capabilities and preserve the model’s flexibility. Our memory approach achieves significant improvements over MOTRv2 in the DanceTrack test set, demonstrating a gain of 1.1% in HOTA metrics and 2.1% in IDF1 score.

[CV-46] LMSeg: A deep graph message-passing network for efficient and accurate semantic segmentation of large-scale 3D landscape meshes

链接: https://arxiv.org/abs/2407.04326
作者: Zexian Huang,Kourosh Khoshelham,Gunditj Mirring Traditional Owners Corporation,Martin Tomko
关键词: including spatial analysis, geospatial applications, automatic mapping, planning and development, mapping and localization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semantic segmentation of large-scale 3D landscape meshes is pivotal for various geospatial applications, including spatial analysis, automatic mapping and localization of target objects, and urban planning and development. This requires an efficient and accurate 3D perception system to understand and analyze real-world environments. However, traditional mesh segmentation methods face challenges in accurately segmenting small objects and maintaining computational efficiency due to the complexity and large size of 3D landscape mesh datasets. This paper presents an end-to-end deep graph message-passing network, LMSeg, designed to efficiently and accurately perform semantic segmentation on large-scale 3D landscape meshes. The proposed approach takes the barycentric dual graph of meshes as inputs and applies deep message-passing neural networks to hierarchically capture the geometric and spatial features from the barycentric graph structures and learn intricate semantic information from textured meshes. The hierarchical and local pooling of the barycentric graph, along with the effective geometry aggregation modules of LMSeg, enable fast inference and accurate segmentation of small-sized and irregular mesh objects in various complex landscapes. Extensive experiments on two benchmark datasets (natural and urban landscapes) demonstrate that LMSeg significantly outperforms existing learning-based segmentation methods in terms of object segmentation accuracy and computational efficiency. Furthermore, our method exhibits strong generalization capabilities across diverse landscapes and demonstrates robust resilience against varying mesh densities and landscape topologies.

[CV-47] SSP-GNN: Learning to Track via Bilevel Optimization

链接: https://arxiv.org/abs/2407.04308
作者: Griffin Golias,Masa Nakura-Fan,Vitaly Ablavsky
关键词: graph-based tracking formulation, re-identification features, propose a graph-based, formulation for multi-object, kinematic information
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a graph-based tracking formulation for multi-object tracking (MOT) where target detections contain kinematic information and re-identification features (attributes). Our method applies a successive shortest paths (SSP) algorithm to a tracking graph defined over a batch of frames. The edge costs in this tracking graph are computed via a message-passing network, a graph neural network (GNN) variant. The parameters of the GNN, and hence, the tracker, are learned end-to-end on a training set of example ground-truth tracks and detections. Specifically, learning takes the form of bilevel optimization guided by our novel loss function. We evaluate our algorithm on simulated scenarios to understand its sensitivity to scenario aspects and model hyperparameters. Across varied scenario complexities, our method compares favorably to a strong baseline.

[CV-48] owards Stable 3D Object Detection

链接: https://arxiv.org/abs/2407.04305
作者: Jiabao Wang,Qiang Meng,Guochao Liu,Liujiang Yan,Ke Wang,Ming-Ming Cheng,Qibin Hou
关键词: detection greatly impacts, driving safety, autonomous driving, greatly impacts, Waymo Open Dataset
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In autonomous driving, the temporal stability of 3D object detection greatly impacts the driving safety. However, the detection stability cannot be accessed by existing metrics such as mAP and MOTA, and consequently is less explored by the community. To bridge this gap, this work proposes Stability Index (SI), a new metric that can comprehensively evaluate the stability of 3D detectors in terms of confidence, box localization, extent, and heading. By benchmarking state-of-the-art object detectors on the Waymo Open Dataset, SI reveals interesting properties of object stability that have not been previously discovered by other metrics. To help models improve their stability, we further introduce a general and effective training strategy, called Prediction Consistency Learning (PCL). PCL essentially encourages the prediction consistency of the same objects under different timestamps and augmentations, leading to enhanced detection stability. Furthermore, we examine the effectiveness of PCL with the widely-used CenterPoint, and achieve a remarkable SI of 86.00 for vehicle class, surpassing the baseline by 5.48. We hope our work could serve as a reliable baseline and draw the community’s attention to this crucial issue in 3D object detection. Codes will be made publicly available.

[CV-49] MARS: Paying more attention to visual attributes for text-based person search

链接: https://arxiv.org/abs/2407.04287
作者: Alex Ergasti,Tomaso Fontanini,Claudio Ferrari,Massimo Bertozzi,Andrea Prati
关键词: Text-based person search, gained significant interest, research community, problem that gained, Text-based person
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text-based person search (TBPS) is a problem that gained significant interest within the research community. The task is that of retrieving one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. One is defined as inter-identity noise that is due to the inherent vagueness and imprecision of text descriptions and it indicates how descriptions of visual attributes can be generally associated to different people; the other is the intra-identity variations, which are all those nuisances e.g. pose, illumination, that can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric w.r.t. the current state of the art.

[CV-50] Research Applications and Prospects of Event-Based Pedestrian Detection: A Survey

链接: https://arxiv.org/abs/2407.04277
作者: Han Wang,Yuman Nie,Yun Li,Hongjie Liu,Min Liu,Wen Cheng,Yaoxiong Wang
关键词: superior temporal resolution, minimal power requirements, expansive dynamic range, cutting-edge sensors distinguished, negligible latency
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event-based cameras, inspired by the biological retina, have evolved into cutting-edge sensors distinguished by their minimal power requirements, negligible latency, superior temporal resolution, and expansive dynamic range. At present, cameras used for pedestrian detection are mainly frame-based imaging sensors, which have suffered from lethargic response times and hefty data redundancy. In contrast, event-based cameras address these limitations by eschewing extraneous data transmissions and obviating motion blur in high-speed imaging scenarios. On pedestrian detection via event-based cameras, this paper offers an exhaustive review of research and applications particularly in the autonomous driving context. Through methodically scrutinizing relevant literature, the paper outlines the foundational principles, developmental trajectory, and the comparative merits and demerits of eventbased detection relative to traditional frame-based methodologies. This review conducts thorough analyses of various event stream inputs and their corresponding network models to evaluate their applicability across diverse operational environments. It also delves into pivotal elements such as crucial datasets and data acquisition techniques essential for advancing this technology, as well as advanced algorithms for processing event stream data. Culminating with a synthesis of the extant landscape, the review accentuates the unique advantages and persistent challenges inherent in event-based pedestrian detection, offering a prognostic view on potential future developments in this fast-progressing field.

[CV-51] Fine-grained Dynamic Network for Generic Event Boundary Detection

链接: https://arxiv.org/abs/2407.04274
作者: Ziwei Zheng,Lijun He,Le Yang,Fan Li
关键词: understanding long-form videos, boundaries naturally perceived, aims at pinpointing, perceived by humans, playing a crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.

[CV-52] Variational Partial Group Convolutions for Input-Aware Partial Equivariance of Rotations and Color-Shifts

链接: https://arxiv.org/abs/2407.04271
作者: Hyunsu Kim,Yegon Kim,Hongseok Yang,Juho Lee
关键词: Group Equivariant CNNs, shown promising efficacy, Equivariant CNNs, equivariant manner, capture hierarchical features
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ICML2024

点击查看摘要

Abstract:Group Equivariant CNNs (G-CNNs) have shown promising efficacy in various tasks, owing to their ability to capture hierarchical features in an equivariant manner. However, their equivariance is fixed to the symmetry of the whole group, limiting adaptability to diverse partial symmetries in real-world datasets, such as limited rotation symmetry of handwritten digit images and limited color-shift symmetry of flower images. Recent efforts address this limitation, one example being Partial G-CNN which restricts the output group space of convolution layers to break full equivariance. However, such an approach still fails to adjust equivariance levels across data. In this paper, we propose a novel approach, Variational Partial G-CNN (VP G-CNN), to capture varying levels of partial equivariance specific to each data instance. VP G-CNN redesigns the distribution of the output group elements to be conditioned on input data, leveraging variational inference to avoid overfitting. This enables the model to adjust its equivariance levels according to the needs of individual data points. Additionally, we address training instability inherent in discrete group equivariance models by redesigning the reparametrizable distribution. We demonstrate the effectiveness of VP G-CNN on both toy and real-world datasets, including MNIST67-180, CIFAR10, ColorMNIST, and Flowers102. Our results show robust performance, even in uncertainty metrics.

[CV-53] Parametric Curve Segment Extraction by Support Regions

链接: https://arxiv.org/abs/2407.04265
作者: Cem Ünsalan
关键词: Laplacian of Gaussian, introduce a method, image directly, filter response, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:We introduce a method to extract curve segments in parametric form from the image directly using the Laplacian of Gaussian (LoG) filter response. Our segmentation gives convex and concave curves. To do so, we form curve support regions by grouping pixels of the thresholded filter response. Then, we model each support region boundary by Fourier series and extract the corresponding parametric curve segment.

[CV-54] Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization

链接: https://arxiv.org/abs/2407.04260
作者: Shaohan Li,Yunpeng Shi,Gilad Lerman
关键词: Structure from Motion, pipelines for Structure, plays a crucial, crucial role, role in global
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Group synchronization plays a crucial role in global pipelines for Structure from Motion (SfM). Its formulation is nonconvex and it is faced with highly corrupted measurements. Cycle consistency has been effective in addressing these challenges. However, computationally efficient solutions are needed for cycles longer than three, especially in practical scenarios where 3-cycles are unavailable. To overcome this computational bottleneck, we propose an algorithm for group synchronization that leverages information from cycles of lengths ranging from three to six with a time complexity of order O(n^3) (or O(n^2.373) when using a faster matrix multiplication algorithm). We establish non-trivial theory for this and related methods that achieves competitive sample complexity, assuming the uniform corruption model. To advocate the practical need for our method, we consider distributed group synchronization, which requires at least 4-cycles, and we illustrate state-of-the-art performance by our method in this context.

[CV-55] Unsupervised Video Summarization via Reinforcement Learning and a Trained Evaluator

链接: https://arxiv.org/abs/2407.04258
作者: Mehryar Abbasi,Hadi Hadizadeh,Parvaneh Saeedi
关键词: unsupervised video summarization, paper presents, video, summarizer model, reward generation pipeline
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach for unsupervised video summarization using reinforcement learning. It aims to address the existing limitations of current unsupervised methods, including unstable training of adversarial generator-discriminator architectures and reliance on hand-crafted reward functions for quality evaluation. The proposed method is based on the concept that a concise and informative summary should result in a reconstructed video that closely resembles the original. The summarizer model assigns an importance score to each frame and generates a video summary. In the proposed scheme, reinforcement learning, coupled with a unique reward generation pipeline, is employed to train the summarizer model. The reward generation pipeline trains the summarizer to create summaries that lead to improved reconstructions. It comprises a generator model capable of reconstructing masked frames from a partially masked video, along with a reward mechanism that compares the reconstructed video from the summary against the original. The video generator is trained in a self-supervised manner to reconstruct randomly masked frames, enhancing its ability to generate accurate summaries. This training pipeline results in a summarizer model that better mimics human-generated video summaries compared to methods relying on hand-crafted rewards. The training process consists of two stable and isolated training steps, unlike adversarial architectures. Experimental results demonstrate promising performance, with F-scores of 62.3 and 54.5 on TVSum and SumMe datasets, respectively. Additionally, the inference stage is 300 times faster than our previously reported state-of-the-art method.

[CV-56] Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge

链接: https://arxiv.org/abs/2407.04255
作者: Xiangyu Wu,Zhouyang Chi,Yang Yang,Jianfeng Lu
关键词: Question Answering Challenge, Toloka Visual Question, Visual Question Answering, visual grounding task, Answering Challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Second Place of WSDM2023 Toloka Visual Question Answering Challenge

点击查看摘要

Abstract:In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks(e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task, where the input is an image and a question, guiding the model to answer the question and display the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued to fine-tune the model on the competition dataset, transferring the semantic information learned in the first stage to the competition task. Finally, we designed a bounding box matching and replacing post-processing strategy to correct the model’s prediction results. Our team achieved a score of 76.342 on the final leaderboard, ranking second.

[CV-57] FeatureSORT: Essential Features for Effective Tracking

链接: https://arxiv.org/abs/2407.04249
作者: Hamidreza Hashempoor,Rosemary Koikara,Yu Dong Hwang
关键词: online multiple object, tracker designed, multiple feature modules, tracking, provide multiple feature
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we introduce a novel tracker designed for online multiple object tracking with a focus on being simple, while being effective. we provide multiple feature modules each of which stands for a particular appearance information. By integrating distinct appearance features, including clothing color, style, and target direction, alongside a ReID network for robust embedding extraction, our tracker significantly enhances online tracking accuracy. Additionally, we propose the incorporation of a stronger detector and also provide an advanced post processing methods that further elevate the tracker’s performance. During real time operation, we establish measurement to track associated distance function which includes the IoU, direction, color, style, and ReID features similarity information, where each metric is calculated separately. With the design of our feature related distance function, it is possible to track objects through longer period of occlusions, while keeping the number of identity switches comparatively low. Extensive experimental evaluation demonstrates notable improvement in tracking accuracy and reliability, as evidenced by reduced identity switches and enhanced occlusion handling. These advancements not only contribute to the state of the art in object tracking but also open new avenues for future research and practical applications demanding high precision and reliability.

[CV-58] ArAIEval Shared Task: Propagandistic Techniques Detection in Unimodal and Multimodal Arabic Content

链接: https://arxiv.org/abs/2407.04247
作者: Maram Hasanain,Md. Arid Hasan,Fatema Ahmed,Reem Suwaileh,Md. Rafiul Biswas,Wajdi Zaghouani,Firoj Alam
关键词: co-located with ACL, ArAIEval shared task, organized as part, conference co-located, ArAIEval shared
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: propaganda, span detection, disinformation, misinformation, fake news, LLMs, GPT-4, multimodality, multimodal LLMs

点击查看摘要

Abstract:We present an overview of the second edition of the ArAIEval shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. In this edition, ArAIEval offers two tasks: (i) detection of propagandistic textual spans with persuasion techniques identification in tweets and news articles, and (ii) distinguishing between propagandistic and non-propagandistic memes. A total of 14 teams participated in the final evaluation phase, with 6 and 9 teams participating in Tasks 1 and 2, respectively. Finally, 11 teams submitted system description papers. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further provide a brief overview of the participating systems. All datasets and evaluation scripts are released to the research community (this https URL). We hope this will enable further research on these important tasks in Arabic.

[CV-59] Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization

链接: https://arxiv.org/abs/2407.04245
作者: Ming-Yang Ho,Che-Ming Wu,Min-Sheng Wu,Yufeng Jane Tseng
关键词: limited GPU memory, Recent advancements, limited GPU, GPU memory, patch-wise inference
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Recent advancements in ultra-high-resolution unpaired image-to-image translation have aimed to mitigate the constraints imposed by limited GPU memory through patch-wise inference. Nonetheless, existing methods often compromise between the reduction of noticeable tiling artifacts and the preservation of color and hue contrast, attributed to the reliance on global image- or patch-level statistics in the instance normalization layers. In this study, we introduce a Dense Normalization (DN) layer designed to estimate pixel-level statistical moments. This approach effectively diminishes tiling artifacts while concurrently preserving local color and hue contrasts. To address the computational demands of pixel-level estimation, we further propose an efficient interpolation algorithm. Moreover, we invent a parallelism strategy that enables the DN layer to operate in a single pass. Through extensive experiments, we demonstrate that our method surpasses all existing approaches in performance. Notably, our DN layer is hyperparameter-free and can be seamlessly integrated into most unpaired image-to-image translation frameworks without necessitating retraining. Overall, our work paves the way for future exploration in handling images of arbitrary resolutions within the realm of unpaired image-to-image translation. Code is available at: this https URL.

[CV-60] Exploration of Class Center for Fine-Grained Visual Classification

链接: https://arxiv.org/abs/2407.04243
作者: Hang Yao,Qiguang Miao,Peipei Zhao,Chaoneng Li,Xin Li,Guanwen Feng,Ruyi Liu
关键词: challenging task due, subtle inter-class differences, evident intra-class variances, large-scale classification tasks, fine-grained visual classification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accpeted by TCSVT. Code and trained models are here: this https URL

点击查看摘要

Abstract:Different from large-scale classification tasks, fine-grained visual classification is a challenging task due to two critical problems: 1) evident intra-class variances and subtle inter-class differences, and 2) overfitting owing to fewer training samples in datasets. Most existing methods extract key features to reduce intra-class variances, but pay no attention to subtle inter-class differences in fine-grained visual classification. To address this issue, we propose a loss function named exploration of class center, which consists of a multiple class-center constraint and a class-center label generation. This loss function fully utilizes the information of the class center from the perspective of features and labels. From the feature perspective, the multiple class-center constraint pulls samples closer to the target class center, and pushes samples away from the most similar nontarget class center. Thus, the constraint reduces intra-class variances and enlarges inter-class differences. From the label perspective, the class-center label generation utilizes classcenter distributions to generate soft labels to alleviate overfitting. Our method can be easily integrated with existing fine-grained visual classification approaches as a loss function, to further boost excellent performance with only slight training costs. Extensive experiments are conducted to demonstrate consistent improvements achieved by our method on four widely-used fine-grained visual classification datasets. In particular, our method achieves state-of-the-art performance on the FGVC-Aircraft and CUB-200-2011 datasets.

[CV-61] Fine-grained Context and Multi-modal Alignment for Freehand 3D Ultrasound Reconstruction

链接: https://arxiv.org/abs/2407.04242
作者: Zhongnuo Yan,Xin Yang,Mingyuan Luo,Jiongquan Chen,Rusi Chen,Lian Liu,Dong Ni
关键词: Fine-grained spatio-temporal learning, crucial for freehand, Fine-grained spatio-temporal, spatio-temporal, Fine-grained
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024. This is the submitted manuscript and the preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections

点击查看摘要

Abstract:Fine-grained spatio-temporal learning is crucial for freehand 3D ultrasound reconstruction. Previous works mainly resorted to the coarse-grained spatial features and the separated temporal dependency learning and struggles for fine-grained spatio-temporal learning. Mining spatio-temporal information in fine-grained scales is extremely challenging due to learning difficulties in long-range dependencies. In this context, we propose a novel method to exploit the long-range dependency management capabilities of the state space model (SSM) to address the above challenge. Our contribution is three-fold. First, we propose ReMamba, which mines multi-scale spatio-temporal information by devising a multi-directional SSM. Second, we propose an adaptive fusion strategy that introduces multiple inertial measurement units as auxiliary temporal information to enhance spatio-temporal perception. Last, we design an online alignment strategy that encodes the temporal information as pseudo labels for multi-modal alignment to further improve reconstruction performance. Extensive experimental validations on two large-scale datasets show remarkable improvement from our method over competitors.

[CV-62] AnySR: Realizing Image Super-Resolution as Any-Scale Any-Resource

链接: https://arxiv.org/abs/2407.04241
作者: Wengyi Zhan,Mingbao Lin,Chia-Wen Lin,Rongrong Ji
关键词: existing arbitrary-scale SISR, arbitrary-scale SISR methods, any-resource implementation, single-image super-resolution, SISR
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In an effort to improve the efficiency and scalability of single-image super-resolution (SISR) applications, we introduce AnySR, to rebuild existing arbitrary-scale SR methods into any-scale, any-resource implementation. As a contrast to off-the-shelf methods that solve SR tasks across various scales with the same computing costs, our AnySR innovates in: 1) building arbitrary-scale tasks as any-resource implementation, reducing resource requirements for smaller scales without additional parameters; 2) enhancing any-scale performance in a feature-interweaving fashion, inserting scale pairs into features at regular intervals and ensuring correct feature/scale processing. The efficacy of our AnySR is fully demonstrated by rebuilding most existing arbitrary-scale SISR methods and validating on five popular SISR test datasets. The results show that our AnySR implements SISR tasks in a computing-more-efficient fashion, and performs on par with existing arbitrary-scale SISR methods. For the first time, we realize SISR tasks as not only any-scale in literature, but also as any-resource. Code is available at this https URL.

[CV-63] GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction

链接: https://arxiv.org/abs/2407.04237
作者: Yuxuan Mu,Xinxin Zuo,Chuan Guo,Yilin Wang,Juwei Lu,Xiaofeng Wu,Songcen Xu,Peng Dai,Youliang Yan,Li Cheng
关键词: present GSD, Gaussian Splatting, diffusion model, GSD, model approach based
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Accepted for ECCV 2024

点击查看摘要

Abstract:We present GSD, a diffusion model approach based on Gaussian Splatting (GS) representation for 3D object reconstruction from a single view. Prior works suffer from inconsistent 3D geometry or mediocre rendering quality due to improper representations. We take a step towards resolving these shortcomings by utilizing the recent state-of-the-art 3D explicit representation, Gaussian Splatting, and an unconditional diffusion model. This model learns to generate 3D objects represented by sets of GS ellipsoids. With these strong generative 3D priors, though learning unconditionally, the diffusion model is ready for view-guided reconstruction without further model fine-tuning. This is achieved by propagating fine-grained 2D features through the efficient yet flexible splatting function and the guided denoising sampling process. In addition, a 2D diffusion model is further employed to enhance rendering fidelity, and improve reconstructed GS quality by polishing and re-using the rendered images. The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views. Experiments on the challenging real-world CO3D dataset demonstrate the superiority of our approach.

[CV-64] Efficient GANs for Document Image Binarization Based on DWT and Normalization

链接: https://arxiv.org/abs/2407.04231
作者: Rui-Yang Ju,KokSheik Wong,Jen-Shiun Chiang
关键词: text information extraction, generative adversarial networks, image binarization task, document image binarization, SOTA network architecture
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For document image binarization task, generative adversarial networks (GANs) can generate images where shadows and noise are effectively removed, which allow for text information extraction. The current state-of-the-art (SOTA) method proposes a three-stage network architecture that utilizes six GANs. Despite its excellent model performance, the SOTA network architecture requires long training and inference times. To overcome this problem, this work introduces an efficient GAN method based on the three-stage network architecture that incorporates the Discrete Wavelet Transformation and normalization to reduce the input image size, which in turns, decrease both training and inference times. In addition, this work presents novel generators, discriminators, and loss functions to improve the model’s performance. Experimental results show that the proposed method reduces the training time by 10% and the inference time by 26% when compared to the SOTA method while maintaining the model performance at 73.79 of Avg-Score. Our implementation code is available on GitHub at this https URL.

[CV-65] A Physical Model-Guided Framework for Underwater Image Enhancement and Depth Estimation

链接: https://arxiv.org/abs/2407.04230
作者: Dazhao Du,Enhan Li,Lingyu Si,Fanjiang Xu,Jianwei Niu,Fuchun Sun
关键词: UIE model, UIE, diverse aquatic media, aquatic media, model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Due to the selective absorption and scattering of light by diverse aquatic media, underwater images usually suffer from various visual degradations. Existing underwater image enhancement (UIE) approaches that combine underwater physical imaging models with neural networks often fail to accurately estimate imaging model parameters such as depth and veiling light, resulting in poor performance in certain scenarios. To address this issue, we propose a physical model-guided framework for jointly training a Deep Degradation Model (DDM) with any advanced UIE model. DDM includes three well-designed sub-networks to accurately estimate various imaging parameters: a veiling light estimation sub-network, a factors estimation sub-network, and a depth estimation sub-network. Based on the estimated parameters and the underwater physical imaging model, we impose physical constraints on the enhancement process by modeling the relationship between underwater images and desired clean images, i.e., outputs of the UIE model. Moreover, while our framework is compatible with any UIE model, we design a simple yet effective fully convolutional UIE model, termed UIEConv. UIEConv utilizes both global and local features for image enhancement through a dual-branch structure. UIEConv trained within our framework achieves remarkable enhancement results across diverse underwater scenes. Furthermore, as a byproduct of UIE, the trained depth estimation sub-network enables accurate underwater scene depth estimation. Extensive experiments conducted in various real underwater imaging scenarios, including deep-sea environments with artificial light sources, validate the effectiveness of our framework and the UIEConv model.

[CV-66] Batch Transformer: Look for Attention in Batch

链接: https://arxiv.org/abs/2407.04218
作者: Myung Beom Her,Jisu Jeong,Hojoon Song,Ji-Hyeong Han
关键词: Facial expression recognition, Facial expression, received considerable attention, computer vision, human-computer interaction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Facial expression recognition (FER) has received considerable attention in computer vision, with “in-the-wild” environments such as human-computer interaction. However, FER images contain uncertainties such as occlusion, low resolution, pose variation, illumination variation, and subjectivity, which includes some expressions that do not match the target label. Consequently, little information is obtained from a noisy single image and it is not trusted. This could significantly degrade the performance of the FER task. To address this issue, we propose a batch transformer (BT), which consists of the proposed class batch attention (CBA) module, to prevent overfitting in noisy data and extract trustworthy information by training on features reflected from several images in a batch, rather than information from a single image. We also propose multi-level attention (MLA) to prevent overfitting the specific features by capturing correlations between each level. In this paper, we present a batch transformer network (BTN) that combines the above proposals. Experimental results on various FER benchmark datasets show that the proposed BTN consistently outperforms the state-ofthe-art in FER datasets. Representative results demonstrate the promise of the proposed BTN for FER.

[CV-67] 2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2407.04215
作者: Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen
关键词: diffusion models demonstrate, models demonstrate impressive, impressive generation capabilities, demonstrate impressive generation, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:While text-to-image diffusion models demonstrate impressive generation capabilities, they also exhibit vulnerability to backdoor attacks, which involve the manipulation of model outputs through malicious triggers. In this paper, for the first time, we propose a comprehensive defense method named T2IShield to detect, localize, and mitigate such attacks. Specifically, we find the “Assimilation Phenomenon” on the cross-attention maps caused by the backdoor trigger. Based on this key insight, we propose two effective backdoor detection methods: Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis. Besides, we introduce a binary-search approach to localize the trigger within a backdoor sample and assess the efficacy of existing concept editing methods in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show the effectiveness of our proposed defense method. For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9 % with low computational cost. Furthermore, T2IShield achieves a localization F1 score of 86.4 % and invalidates 99 % poisoned samples. Codes are released at this https URL.

[CV-68] AMD: Automatic Multi-step Distillation of Large-scale Vision Models

链接: https://arxiv.org/abs/2407.04208
作者: Cheng Han,Qifan Wang,Sohail A. Dianat,Majid Rabbani,Raghuveer M. Rao,Yi Fang,Qiang Guan,Lifu Huang,Dongfang Liu
关键词: Transformer-based architectures, diverse vision tasks, vision tasks owing, de-facto standard models, de-facto standard
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Transformer-based architectures have become the de-facto standard models for diverse vision tasks owing to their superior performance. As the size of the models continues to scale up, model distillation becomes extremely important in various real applications, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g, 10x compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that our approach outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.

[CV-69] Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning

链接: https://arxiv.org/abs/2407.04207
作者: Mainak Singha,Ankit Jha,Divyam Gupta,Pranav Singla,Biplab Banerjee
关键词: including zero-shot SBIR, generalized zero-shot SBIR, zero-shot SBIR, sketch-based image retrieval, vision-language foundation model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ECCV 2024

点击查看摘要

Abstract:We address the challenges inherent in sketch-based image retrieval (SBIR) across various settings, including zero-shot SBIR, generalized zero-shot SBIR, and fine-grained zero-shot SBIR, by leveraging the vision-language foundation model, CLIP. While recent endeavors have employed CLIP to enhance SBIR, these approaches predominantly follow uni-modal prompt processing and overlook to fully exploit CLIP’s integrated visual and textual capabilities. To bridge this gap, we introduce SpLIP, a novel multi-modal prompt learning scheme designed to operate effectively with frozen CLIP backbones. We diverge from existing multi-modal prompting methods that either treat visual and textual prompts independently or integrate them in a limited fashion, leading to suboptimal generalization. SpLIP implements a bi-directional prompt-sharing strategy that enables mutual knowledge exchange between CLIP’s visual and textual encoders, fostering a more cohesive and synergistic prompt processing mechanism that significantly reduces the semantic gap between the sketch and photo embeddings. In addition to pioneering multi-modal prompt learning, we propose two innovative strategies for further refining the embedding space. The first is an adaptive margin generation for the sketch-photo triplet loss, regulated by CLIP’s class textual embeddings. The second introduces a novel task, termed conditional cross-modal jigsaw, aimed at enhancing fine-grained sketch-photo alignment, by focusing on implicitly modelling the viable patch arrangement of sketches using knowledge of unshuffled photos. Our comprehensive experimental evaluations across multiple benchmarks demonstrate the superior performance of SpLIP in all three SBIR scenarios. Code is available at this https URL.

[CV-70] HCS-TNAS: Hybrid Constraint-driven Semi-supervised Transformer-NAS for Ultrasound Image Segmentation

链接: https://arxiv.org/abs/2407.04203
作者: Renqi Chen
关键词: Accurate ultrasound segmentation, comprehensive diagnosis, aids clinicians, clinicians in achieving, achieving a comprehensive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate ultrasound segmentation is pursued because it aids clinicians in achieving a comprehensive diagnosis. Due to the presence of low image quality and high costs associated with annotation, two primary concerns arise: (1) enhancing the understanding of multi-scale features, and (2) improving the resistance to data dependency. To mitigate these concerns, we propose HCS-TNAS, a novel neural architecture search (NAS) method that automatically designs the network. For the first concern, we employ multi-level searching encompassing cellular, layer, and module levels. Specifically, we design an Efficient NAS-ViT module that searches for multi-scale tokens in the vision Transformer (ViT) to capture context and local information, rather than relying solely on simple combinations of operations. For the second concern, we propose a hybrid constraint-driven semi-supervised learning method that considers additional network independence and incorporates contrastive loss in a NAS formulation. By further developing a stage-wise optimization strategy, a rational network structure can be identified. Extensive experiments on three publicly available ultrasound image datasets demonstrate that HCS-TNAS effectively improves segmentation accuracy and outperforms state-of-the-art methods.

[CV-71] GazeFusion: Saliency-guided Image Generation

链接: https://arxiv.org/abs/2407.04191
作者: Yunxiang Zhang,Nan Wu,Connor Z. Lin,Gordon Wetzstein,Qi Sun
关键词: offer unprecedented image, models offer unprecedented, Diffusion models offer, text prompt, unprecedented image generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Diffusion models offer unprecedented image generation capabilities given just a text prompt. While emerging control mechanisms have enabled users to specify the desired spatial arrangements of the generated content, they cannot predict or control where viewers will pay more attention due to the complexity of human vision. Recognizing the critical necessity of attention-controllable image generation in practical applications, we present a saliency-guided framework to incorporate the data priors of human visual attention into the generation process. Given a desired viewer attention distribution, our control module conditions a diffusion model to generate images that attract viewers’ attention toward desired areas. To assess the efficacy of our approach, we performed an eye-tracked user study and a large-scale model-based saliency analysis. The results evidence that both the cross-user eye gaze distributions and the saliency model predictions align with the desired attention distributions. Lastly, we outline several applications, including interactive design of saliency guidance, attention suppression in unwanted regions, and adaptive generation for varied display/viewing conditions.

[CV-72] Computer Vision for Clinical Gait Analysis: A Gait Abnormality Video Dataset

链接: https://arxiv.org/abs/2407.04190
作者: Rahm Ranjan,David Ahmedt-Aristizabal,Mohammad Ali Armin,Juno Kim
关键词: clear task objectives, Clinical gait analysis, task objectives, gait analysis, CGA
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Clinical gait analysis (CGA) using computer vision is an emerging field in artificial intelligence that faces barriers of accessible, real-world data, and clear task objectives. This paper lays the foundation for current developments in CGA as well as vision-based methods and datasets suitable for gait analysis. We introduce The Gait Abnormality in Video Dataset (GAVD) in response to our review of over 150 current gait-related computer vision datasets, which highlighted the need for a large and accessible gait dataset clinically annotated for CGA. GAVD stands out as the largest video gait dataset, comprising 1874 sequences of normal, abnormal and pathological gaits. Additionally, GAVD includes clinically annotated RGB data sourced from publicly available content on online platforms. It also encompasses over 400 subjects who have undergone clinical grade visual screening to represent a diverse range of abnormal gait patterns, captured in various settings, including hospital clinics and urban uncontrolled outdoor environments. We demonstrate the validity of the dataset and utility of action recognition models for CGA using pretrained models Temporal Segment Networks(TSN) and SlowFast network to achieve video abnormality detection of 94% and 92% respectively when tested on GAVD dataset. A GitHub repository this https URL consisting of convenient URL links, and clinically relevant annotation for CGA is provided for over 450 online videos, featuring diverse subjects performing a range of normal, pathological, and abnormal gait patterns.

[CV-73] QueryMamba: A Mamba-Based Encoder-Decoder Architecture with a Statistical Verb-Noun Interaction Module for Video Action Forecasting @ Ego4D Long-Term Action Anticipation Challenge 2024

链接: https://arxiv.org/abs/2407.04184
作者: Zeyun Zhong,Manuel Martin,Frederik Diederichs,Juergen Beyerer
关键词: video action forecasting, Mamba-based encoder-decoder architecture, enhance video action, integrated verb-noun interaction, verb-noun interaction module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This report presents a novel Mamba-based encoder-decoder architecture, QueryMamba, featuring an integrated verb-noun interaction module that utilizes a statistical verb-noun co-occurrence matrix to enhance video action forecasting. This architecture not only predicts verbs and nouns likely to occur based on historical data but also considers their joint occurrence to improve forecast accuracy. The efficacy of this approach is substantiated by experimental results, with the method achieving second place in the Ego4D LTA challenge and ranking first in noun prediction accuracy.

[CV-74] Slice-100K: A Multimodal Dataset for Extrusion-based 3D Printing

链接: https://arxiv.org/abs/2407.04180
作者: Anushrut Jignasu,Kelly O. Marshall,Ankush Kumar Mishra,Lucas Nerone Rillo,Baskar Ganapathysubramanian,Aditya Balu,Chinmay Hegde,Adarsh Krishnamurthy
关键词: printing programming language, computer numerical control, numerical control, printing programming, programming language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:G-code (Geometric code) or RS-274 is the most widely used computer numerical control (CNC) and 3D printing programming language. G-code provides machine instructions for the movement of the 3D printer, especially for the nozzle, stage, and extrusion of material for extrusion-based additive manufacturing. Currently there does not exist a large repository of curated CAD models along with their corresponding G-code files for additive manufacturing. To address this issue, we present SLICE-100K, a first-of-its-kind dataset of over 100,000 G-code files, along with their tessellated CAD model, LVIS (Large Vocabulary Instance Segmentation) categories, geometric properties, and renderings. We build our dataset from triangulated meshes derived from Objaverse-XL and Thingi10K datasets. We demonstrate the utility of this dataset by finetuning GPT-2 on a subset of the dataset for G-code translation from a legacy G-code format (Sailfish) to a more modern, widely used format (Marlin). SLICE-100K will be the first step in developing a multimodal foundation model for digital manufacturing.

[CV-75] ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

链接: https://arxiv.org/abs/2407.04172
作者: Ahmed Masry,Megh Thakkar,Aayush Bajaj,Aaryaman Kartha,Enamul Hoque,Shafiq Joty
关键词: developing pre-trained foundation, general purpose instruction-tuned, pre-trained foundation models, purpose instruction-tuned models, underlying data tables
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at this https URL.

[CV-76] Attention Normalization Impacts Cardinality Generalization in Slot Attention

链接: https://arxiv.org/abs/2407.04170
作者: Markus Krimmel,Jan Achterhold,Joerg Stueckler
关键词: Object-centric scene decompositions, Slot Attention, Object-centric scene, unsupervised object-centric scene, Slot Attention module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages, 10 figures, 5 tables

点击查看摘要

Abstract:Object-centric scene decompositions are important representations for downstream tasks in fields such as computer vision and robotics. The recently proposed Slot Attention module, already leveraged by several derivative works for image segmentation and object tracking in videos, is a deep learning component which performs unsupervised object-centric scene decomposition on input images. It is based on an attention architecture, in which latent slot vectors, which hold compressed information on objects, attend to localized perceptual features from the input image. In this paper, we show that design decisions on normalizing the aggregated values in the attention architecture have considerable impact on the capabilities of Slot Attention to generalize to a higher number of slots and objects as seen during training. We argue that the original Slot Attention normalization scheme discards information on the prior assignment probability of pixels to slots, which impairs its generalization capabilities. Based on these findings, we propose and investigate alternative normalization approaches which increase the generalization capabilities of Slot Attention to varying slot and object counts, resulting in performance gains on the task of unsupervised image segmentation.

[CV-77] Solutions to Deepfakes: Can Camera Hardware Cryptography and Deep Learning Verify Real Images?

链接: https://arxiv.org/abs/2407.04169
作者: Alexander Vilesov,Yuan Tian,Nader Sehatbakhsh,Achuta Kadambi
关键词: exponential progress, poses serious implications, real, generative, real images
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The exponential progress in generative AI poses serious implications for the credibility of all real images and videos. There will exist a point in the future where 1) digital content produced by generative AI will be indistinguishable from those created by cameras, 2) high-quality generative algorithms will be accessible to anyone, and 3) the ratio of all synthetic to real images will be large. It is imperative to establish methods that can separate real data from synthetic data with high confidence. We define real images as those that were produced by the camera hardware, capturing a real-world scene. Any synthetic generation of an image or alteration of a real image through generative AI or computer graphics techniques is labeled as a synthetic image. To this end, this document aims to: present known strategies in detection and cryptography that can be employed to verify which images are real, weight the strengths and weaknesses of these strategies, and suggest additional improvements to alleviate shortcomings.

[CV-78] VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation

链接: https://arxiv.org/abs/2407.04152
作者: I-Chun Arthur Liu,Sicheng He,Daniel Seita,Gaurav Sukhatme
关键词: Bimanual manipulation, robotics applications, bimanual manipulation tasks, Vision Language Models, manipulation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bimanual manipulation is critical to many robotics applications. In contrast to single-arm manipulation, bimanual manipulation tasks are challenging due to higher-dimensional action spaces. Prior works leverage large amounts of data and primitive actions to address this problem, but may suffer from sample inefficiency and limited generalization across various tasks. To this end, we propose VoxAct-B, a language-conditioned, voxel-based method that leverages Vision Language Models (VLMs) to prioritize key regions within the scene and reconstruct a voxel grid. We provide this voxel grid to our bimanual manipulation policy to learn acting and stabilizing actions. This approach enables more efficient policy learning from voxels and is generalizable to different tasks. In simulation, we show that VoxAct-B outperforms strong baselines on fine-grained bimanual manipulation tasks. Furthermore, we demonstrate VoxAct-B on real-world \textttOpen Drawer and \textttOpen Jar tasks using two UR5s. Code, data, and videos will be available at this https URL.

[CV-79] SineKAN: Kolmogorov-Arnold Networks Using Sinusoidal Activation Functions

链接: https://arxiv.org/abs/2407.04149
作者: Eric A. F. Reinhardt,Sergei Gleyzer
关键词: perceptron neural networks, traditional multi-layer perceptron, multi-layer perceptron neural, Recent work, neural networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 8 figures

点击查看摘要

Abstract:Recent work has established an alternative to traditional multi-layer perceptron neural networks in the form of Kolmogorov-Arnold Networks (KAN). The general KAN framework uses learnable activation functions on the edges of the computational graph followed by summation on nodes. The learnable edge activation functions in the original implementation are basis spline functions (B-Spline). Here, we present a model in which learnable grids of B-Spline activation functions can be replaced by grids of re-weighted sine functions. We show that this leads to better or comparable numerical performance to B-Spline KAN models on the MNIST benchmark, while also providing a substantial speed increase on the order of 4-9 times.

[CV-80] Biometric Authentication Based on Enhanced Remote Photoplethysmography Signal Morphology

链接: https://arxiv.org/abs/2407.04127
作者: Zhaodong Sun,Xiaobai Li,Jukka Komulainen,Guoying Zhao
关键词: Remote photoplethysmography, facial videos, measuring cardiac signals, contact photoplethysmography, facial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: accepted by IJCB 2024

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) is a non-contact method for measuring cardiac signals from facial videos, offering a convenient alternative to contact photoplethysmography (cPPG) obtained from contact sensors. Recent studies have shown that each individual possesses a unique cPPG signal morphology that can be utilized as a biometric identifier, which has inspired us to utilize the morphology of rPPG signals extracted from facial videos for person authentication. Since the facial appearance and rPPG are mixed in the facial videos, we first de-identify facial videos to remove facial appearance while preserving the rPPG information, which protects facial privacy and guarantees that only rPPG is used for authentication. The de-identified videos are fed into an rPPG model to get the rPPG signal morphology for authentication. In the first training stage, unsupervised rPPG training is performed to get coarse rPPG signals. In the second training stage, an rPPG-cPPG hybrid training is performed by incorporating external cPPG datasets to achieve rPPG biometric authentication and enhance rPPG signal morphology. Our approach needs only de-identified facial videos with subject IDs to train rPPG authentication models. The experimental results demonstrate that rPPG signal morphology hidden in facial videos can be used for biometric authentication. The code is available at this https URL.

[CV-81] An Autoencoder Architecture for L-band Passive Microwave Retrieval of Landscape Freeze-Thaw Cycle

链接: https://arxiv.org/abs/2407.04119
作者: Divya Kumawat,Ardeshir Ebtehaj,Xiaolan Xu,Andreas Colliander,Vipin Kumar
关键词: global carbon budgets, understanding permafrost response, Northern Hemisphere, Hemisphere is crucial, global warming
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Estimating the landscape and soil freeze-thaw (FT) dynamics in the Northern Hemisphere is crucial for understanding permafrost response to global warming and changes in regional and global carbon budgets. A new framework is presented for surface FT-cycle retrievals using L-band microwave radiometry based on a deep convolutional autoencoder neural network. This framework defines the landscape FT-cycle retrieval as a time series anomaly detection problem considering the frozen states as normal and thawed states as anomalies. The autoencoder retrieves the FT-cycle probabilistically through supervised reconstruction of the brightness temperature (TB) time series using a contrastive loss function that minimizes (maximizes) the reconstruction error for the peak winter (summer). Using the data provided by the Soil Moisture Active Passive (SMAP) satellite, it is demonstrated that the framework learns to isolate the landscape FT states over different land surface types with varying complexities related to the radiometric characteristics of snow cover, lake-ice phenology, and vegetation canopy. The consistency of the retrievals is evaluated over Alaska, against in situ ground-based observations, showing reduced uncertainties compared to the traditional methods that use thresholding of the normalized polarization ratio.

[CV-82] MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis

链接: https://arxiv.org/abs/2407.04106
作者: Asma Alkhaldi,Raneem Alnajim,Layan Alabdullatef,Rawan Alyahya,Jun Chen,Deyao Zhu,Ahmed Alsinan,Mohamed Elhoseiny
关键词: Recent advancements, refining diagnostic procedures, precipitated significant breakthroughs, artificial intelligence, breakthroughs in healthcare
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in refining diagnostic procedures. However, previous studies have often been constrained to limited functionalities. This study introduces MiniGPT-Med, a vision-language model derived from large-scale language models and tailored for medical applications. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm MiniGPT-Med’s superior performance in disease grounding, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance on medical report generation, higher than the previous best model by 19% accuracy. MiniGPT-Med promises to become a general interface for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications.

[CV-83] Advances in Diffusion Models for Image Data Augmentation: A Review of Methods Models Evaluation Metrics and Future Research Directions

链接: https://arxiv.org/abs/2407.04103
作者: Panagiotis Alimisis,Ioannis Mademlis,Panagiotis Radoglou-Grammatikis,Panagiotis Sarigiannidis,Georgios Th. Papadopoulos
关键词: modern computer vision, Image data augmentation, computer vision tasks, data augmentation constitutes, machine learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 53 pages, 15 figures

点击查看摘要

Abstract:Image data augmentation constitutes a critical methodology in modern computer vision tasks, since it can facilitate towards enhancing the diversity and quality of training datasets; thereby, improving the performance and robustness of machine learning models in downstream tasks. In parallel, augmentation approaches can also be used for editing/modifying a given image in a context- and semantics-aware way. Diffusion Models (DMs), which comprise one of the most recent and highly promising classes of methods in the field of generative Artificial Intelligence (AI), have emerged as a powerful tool for image data augmentation, capable of generating realistic and diverse images by learning the underlying data distribution. The current study realizes a systematic, comprehensive and in-depth review of DM-based approaches for image augmentation, covering a wide range of strategies, tasks and applications. In particular, a comprehensive analysis of the fundamental principles, model architectures and training strategies of DMs is initially performed. Subsequently, a taxonomy of the relevant image augmentation methods is introduced, focusing on techniques regarding semantic manipulation, personalization and adaptation, and application-specific augmentation tasks. Then, performance assessment methodologies and respective evaluation metrics are analyzed. Finally, current challenges and future research directions in the field are discussed.

[CV-84] C3DG: Conditional Domain Generalization for Hyperspectral Imagery Classification with Convergence and Constrained-risk Theories

链接: https://arxiv.org/abs/2407.04100
作者: Zhe Gao,Bin Pan,Zhenwei Shi
关键词: classes present similar, Hyperspectral Imagery Classification, present similar spectra, Hyperspectral imagery, Conditional Domain Generalization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hyperspectral imagery (HSI) classification may suffer the challenge of hyperspectral-monospectra, where different classes present similar spectra. Joint spatial-spectral feature extraction is a popular solution for the problem, but this strategy tends to inflate accuracy since test pixels may exist in training patches. Domain generalization methods show promising potential, but they still fail to distinguish similar spectra across varying domains, in addition, the theoretical support is usually ignored. In this paper, we only rely on spectral information to solve the hyperspectral-monospectra problem, and propose a Convergence and Error-Constrained Conditional Domain Generalization method for Hyperspectral Imagery Classification (C ^3 DG). The major contributions of this paper include two aspects: the Conditional Revising Inference Block (CRIB), and the corresponding theories for model convergence and generalization errors. CRIB is the kernel structure of the proposed method, which employs a shared encoder and multi-branch decoders to fully leverage the conditional distribution during training, achieving a decoupling that aligns with the generation mechanisms of HSI. Moreover, to ensure model convergence and maintain controllable error, we propose the optimization convergence theorem and risk upper bound theorem. In the optimization convergence theorem, we ensure the model convergence by demonstrating that the gradients of the loss terms are not contradictory. In the risk upper bound theorem, our theoretical analysis explores the relationship between test-time training and recent related work to establish a concrete bound for error. Experimental results on three benchmark datasets indicate the superiority of C ^3 DG.

[CV-85] Looking for Tiny Defects via Forward-Backward Feature Transfer

链接: https://arxiv.org/abs/2407.04092
作者: Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano
关键词: Motivated by efficiency, processing low-resolution images, efficiency requirements, anomaly detection, focus on processing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Motivated by efficiency requirements, most anomaly detection and segmentation (ADS) methods focus on processing low-resolution images, e.g., 224\times 224 pixels, obtained by downsampling the original input images. In this setting, downsampling is typically applied also to the provided ground-truth defect masks. Yet, as numerous industrial applications demand identification of both large and tiny defects, the above-described protocol may fall short in providing a realistic picture of the actual performance attainable by current methods. Hence, in this work, we introduce a novel benchmark that evaluates methods on the original, high-resolution image and ground-truth masks, focusing on segmentation performance as a function of the size of anomalies. Our benchmark includes a metric that captures robustness with respect to defect size, i.e., the ability of a method to preserve good localization from large anomalies to tiny ones. Furthermore, we introduce an ADS approach based on a novel Teacher-Student paradigm which relies on two shallow MLPs (the Students) that learn to transfer patch features across the layers of a frozen vision transformer (the Teacher). By means of our benchmark, we evaluate our proposal and other recent ADS methods on high-resolution inputs containing large and tiny defects. Our proposal features the highest robustness to defect size, runs at the fastest speed, yields state-of-the-art performance on the MVTec AD dataset and state-of-the-art segmentation performance on the VisA dataset.

[CV-86] Certifiably Robust Image Watermark

链接: https://arxiv.org/abs/2407.04086
作者: Zhengyuan Jiang,Moyang Guo,Yuepeng Hu,Jinyuan Jia,Neil Zhenqiang Gong
关键词: Generative AI raises, propaganda campaigns, raises many societal, boosting disinformation, disinformation and propaganda
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Generative AI raises many societal concerns such as boosting disinformation and propaganda campaigns. Watermarking AI-generated content is a key technology to address these concerns and has been widely deployed in industry. However, watermarking is vulnerable to removal attacks and forgery attacks. In this work, we propose the first image watermarks with certified robustness guarantees against removal and forgery attacks. Our method leverages randomized smoothing, a popular technique to build certifiably robust classifiers and regression models. Our major technical contributions include extending randomized smoothing to watermarking by considering its unique characteristics, deriving the certified robustness guarantees, and designing algorithms to estimate them. Moreover, we extensively evaluate our image watermarks in terms of both certified and empirical robustness. Our code is available at \urlthis https URL.

[CV-87] FIPGNet:Pyramid grafting network with feature interaction strategies

链接: https://arxiv.org/abs/2407.04085
作者: Ziyi Ding,Like Xin
关键词: Cross Attention Module, Salient object detection, proxy Cross Attention, object detection methods, agent Cross Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2309.08365 by other authors

点击查看摘要

Abstract:Salient object detection is designed to identify the objects in an image that attract the most visual attention.Currently, the most advanced method of significance object detection adopts pyramid grafting network architecture.However, pyramid-graft network architecture still has the problem of failing to accurately locate significant targets.We observe that this is mainly due to the fact that current salient object detection methods simply aggregate different scale features, ignoring the correlation between different scale this http URL overcome these problems, we propose a new salience object detection framework(FIPGNet),which is a pyramid graft network with feature interaction strategies.Specifically, we propose an attention-mechanism based feature interaction strategy (FIA) that innovatively introduces spatial agent Cross Attention (SACA) to achieve multi-level feature interaction, highlighting important spatial regions from a spatial perspective, thereby enhancing salient regions.And the channel proxy Cross Attention Module (CCM), which is used to effectively connect the features extracted by the backbone network and the features processed using the spatial proxy cross attention module, eliminating inconsistencies.Finally, under the action of these two modules, the prominent target location problem in the current pyramid grafting network model is solved.Experimental results on six challenging datasets show that the proposed method outperforms the current 12 salient object detection methods on four indicators.

[CV-88] CLIP-DR: Textual Knowledge-Guided Diabetic Retinopathy Grading with Ranking-aware Prompting

链接: https://arxiv.org/abs/2407.04068
作者: Qinkai Yu,Jianyang Xie,Anh Nguyen,He Zhao,Jiong Zhang,Huazhu Fu,Yitian Zhao,Yalin Zheng,Yanda Meng
关键词: Diabetic retinopathy, reach sight-threatening levels, decades to reach, reach sight-threatening, complication of diabetes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Diabetic retinopathy (DR) is a complication of diabetes and usually takes decades to reach sight-threatening levels. Accurate and robust detection of DR severity is critical for the timely management and treatment of diabetes. However, most current DR grading methods suffer from insufficient robustness to data variability (\textite.g. colour fundus images), posing a significant difficulty for accurate and robust grading. In this work, we propose a novel DR grading framework CLIP-DR based on three observations: 1) Recent pre-trained visual language models, such as CLIP, showcase a notable capacity for generalisation across various downstream tasks, serving as effective baseline models. 2) The grading of image-text pairs for DR often adheres to a discernible natural sequence, yet most existing DR grading methods have primarily overlooked this aspect. 3) A long-tailed distribution among DR severity levels complicates the grading process. This work proposes a novel ranking-aware prompting strategy to help the CLIP model exploit the ordinal information. Specifically, we sequentially design learnable prompts between neighbouring text-image pairs in two different ranking directions. Additionally, we introduce a Similarity Matrix Smooth module into the structure of CLIP to balance the class distribution. Finally, we perform extensive comparisons with several state-of-the-art methods on the GDRBench benchmark, demonstrating our CLIP-DR’s robustness and superior performance. The implementation code is available \footnote\urlthis https URL

[CV-89] EMPL: A novel Efficient Meta Prompt Learning Framework for Few-shot Unsupervised Domain Adaptation

链接: https://arxiv.org/abs/2407.04066
作者: Wanqi Yang,Haoran Wang,Lei Wang,Ge Song,Yang Gao
关键词: utilizes few-shot labeled, few-shot labeled source, Few-shot unsupervised domain, few-shot labeled data, unlabeled target domain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few-shot unsupervised domain adaptation (FS-UDA) utilizes few-shot labeled source domain data to realize effective classification in unlabeled target domain. However, current FS-UDA methods are still suffer from two issues: 1) the data from different domains can not be effectively aligned by few-shot labeled data due to the large domain gaps, 2) it is unstable and time-consuming to generalize to new FS-UDA this http URL address this issue, we put forward a novel Efficient Meta Prompt Learning Framework for FS-UDA. Within this framework, we use pre-trained CLIP model as the feature learning base model. First, we design domain-shared prompt learning vectors composed of virtual tokens, which mainly learns the meta knowledge from a large number of meta tasks to mitigate domain gaps. Secondly, we also design a task-shared prompt learning network to adaptively learn specific prompt vectors for each task, which aims to realize fast adaptation and task generalization. Thirdly, we learn a task-specific cross-domain alignment projection and a task-specific classifier with closed-form solutions for each meta task, which can efficiently adapt the model to new tasks in one step. The whole learning process is formulated as a bilevel optimization problem, and a good initialization of model parameters is learned through meta-learning. Extensive experimental study demonstrates the promising performance of our framework on benchmark datasets. Our method has the large improvement of at least 15.4% on 5-way 1-shot and 8.7% on 5-way 5-shot, compared with the state-of-the-art methods. Also, the performance of our method on all the test tasks is more stable than the other methods.

[CV-90] Detect Closer Surfaces that can be Seen: New Modeling and Evaluation in Cross-domain 3D Object Detection

链接: https://arxiv.org/abs/2407.04061
作者: Ruixiao Zhang,Yihong Wu,Juheon Lee,Adam Prugel-Bennett,Xiaohao Cai
关键词: domain adaptation technologies, object detection field, autonomous driving, object detection models’, adaptation technologies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by the 27th European Conference on Artificial Intelligence (ECAI 2024)

点击查看摘要

Abstract:The performance of domain adaptation technologies has not yet reached an ideal level in the current 3D object detection field for autonomous driving, which is mainly due to significant differences in the size of vehicles, as well as the environments they operate in when applied across domains. These factors together hinder the effective transfer and application of knowledge learned from specific datasets. Since the existing evaluation metrics are initially designed for evaluation on a single domain by calculating the 2D or 3D overlap between the prediction and ground-truth bounding boxes, they often suffer from the overfitting problem caused by the size differences among datasets. This raises a fundamental question related to the evaluation of the 3D object detection models’ cross-domain performance: Do we really need models to maintain excellent performance in their original 3D bounding boxes after being applied across domains? From a practical application perspective, one of our main focuses is actually on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the size of vehicles is much more difficult. In other words, as long as a model can accurately identify the closest surfaces to the ego vehicle, it is sufficient to effectively avoid obstacles. In this paper, we propose two metrics to measure 3D object detection models’ ability of detecting the closer surfaces to the sensor on the ego vehicle, which can be used to evaluate their cross-domain performance more comprehensively and reasonably. Furthermore, we propose a refinement head, named EdgeHead, to guide models to focus more on the learnable closer surfaces, which can greatly improve the cross-domain performance of existing models not only under our new metrics, but even also under the original BEV/3D metrics.

[CV-91] Occupancy as Set of Points

链接: https://arxiv.org/abs/2407.04049
作者: Yiang Shi,Tianheng Cheng,Qian Zhang,Wenyu Liu,Xinggang Wang
关键词: multi-view images, Set of Points, occupancy prediction, named Occupancy, occupancy
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted by ECCV 2024. Code and models: this https URL

点击查看摘要

Abstract:In this paper, we explore a novel point representation for 3D occupancy prediction from multi-view images, which is named Occupancy as Set of Points. Existing camera-based methods tend to exploit dense volume-based representation to predict the occupancy of the whole scene, making it hard to focus on the special areas or areas out of the perception range. In comparison, we present the Points of Interest (PoIs) to represent the scene and propose OSP, a novel framework for point-based 3D occupancy prediction. Owing to the inherent flexibility of the point-based representation, OSP achieves strong performance compared with existing methods and excels in terms of training and inference adaptability. It extends beyond traditional perception boundaries and can be seamlessly integrated with volume-based methods to significantly enhance their effectiveness. Experiments on the Occ3D nuScenes occupancy benchmark show that OSP has strong performance and flexibility. Code and models are available at \urlthis https URL.

[CV-92] owards Cross-View-Consistent Self-Supervised Surround Depth Estimation

链接: https://arxiv.org/abs/2407.04041
作者: Laiyan Ding,Hualie Jiang,Jie Li,Yongquan Chen,Rui Huang
关键词: acquiring per-pixel depth, per-pixel depth ground, depth ground truth, Surround Depth Estimation, Self-Supervised Surround Depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Depth estimation is a cornerstone for autonomous driving, yet acquiring per-pixel depth ground truth for supervised learning is challenging. Self-Supervised Surround Depth Estimation (SSSDE) from consecutive images offers an economical alternative. While previous SSSDE methods have proposed different mechanisms to fuse information across images, few of them explicitly consider the cross-view constraints, leading to inferior performance, particularly in overlapping regions. This paper proposes an efficient and consistent pose estimation design and two loss functions to enhance cross-view consistency for SSSDE. For pose estimation, we propose to use only front-view images to reduce training memory and sustain pose estimation consistency. The first loss function is the dense depth consistency loss, which penalizes the difference between predicted depths in overlapping regions. The second one is the multi-view reconstruction consistency loss, which aims to maintain consistency between reconstruction from spatial and spatial-temporal contexts. Additionally, we introduce a novel flipping augmentation to improve the performance further. Our techniques enable a simple neural model to achieve state-of-the-art performance on the DDAD and nuScenes datasets. Last but not least, our proposed techniques can be easily applied to other methods. The code will be made public.

[CV-93] Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier

链接: https://arxiv.org/abs/2407.04036
作者: Prantik Howlader,Srijan Das,Hieu Le,Dimitris Samaras
关键词: Incorporating pixel contextual, contextual information, pixel contextual information, incorporate contextual information, Incorporating pixel
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: to be published in ECCV24

点击查看摘要

Abstract:Incorporating pixel contextual information is critical for accurate segmentation. In this paper, we show that an effective way to incorporate contextual information is through a patch-based classifier. This patch classifier is trained to identify classes present within an image region, which facilitates the elimination of distractors and enhances the classification of small object segments. Specifically, we introduce Multi-scale Patch-based Multi-label Classifier (MPMC), a novel plug-in module designed for existing semi-supervised segmentation (SSS) frameworks. MPMC offers patch-level supervision, enabling the discrimination of pixel regions of different classes within a patch. Furthermore, MPMC learns an adaptive pseudo-label weight, using patch-level classification to alleviate the impact of the teacher’s noisy pseudo-label supervision the student. This lightweight module can be integrated into any SSS framework, significantly enhancing their performance. We demonstrate the efficacy of our proposed MPMC by integrating it into four SSS methodologies and improving them across two natural image and one medical segmentation dataset, notably improving the segmentation results of the baselines across all the three datasets.

[CV-94] Adaptive Step-size Perception Unfolding Network with Non-local Hybrid Attention for Hyperspectral Image Reconstruction

链接: https://arxiv.org/abs/2407.04024
作者: Yanan Yang,Like Xin
关键词: recently shown promising, shown promising results, hyperspectral image, architecture have recently, recently shown
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep unfolding methods and transformer architecture have recently shown promising results in hyperspectral image (HSI) reconstruction. However, there still exist two issues: (1) in the data subproblem, most methods represents the stepsize utilizing a learnable parameter. Nevertheless, for different spectral channel, error between features and ground truth is unequal. (2) Transformer struggles to balance receptive field size with pixel-wise detail information. To overcome the aforementioned drawbacks, We proposed an adaptive step-size perception unfolding network (ASPUN), a deep unfolding network based on FISTA algorithm, which uses an adaptive step-size perception module to estimate the update step-size of each spectral channel. In addition, we design a Non-local Hybrid Attention Transformer(NHAT) module for fully leveraging the receptive field advantage of transformer. By plugging the NLHA into the Non-local Information Aggregation (NLIA) module, the unfolding network can achieve better reconstruction results. Experimental results show that our ASPUN is superior to the existing SOTA algorithms and achieves the best performance.

[CV-95] Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection

链接: https://arxiv.org/abs/2407.04022
作者: Lars Doorenbos,Raphael Sznitman,Pablo Márquez-Neila
关键词: deep learning models, reliable deep learning, handle data drawn, reliable deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:The inability of deep learning models to handle data drawn from unseen distributions has sparked much interest in unsupervised out-of-distribution (U-OOD) detection, as it is crucial for reliable deep learning models. Despite considerable attention, theoretically-motivated approaches are few and far between, with most methods building on top of some form of heuristic. Recently, U-OOD was formalized in the context of data invariants, allowing a clearer understanding of how to characterize U-OOD, and methods leveraging affine invariants have attained state-of-the-art results on large-scale benchmarks. Nevertheless, the restriction to affine invariants hinders the expressiveness of the approach. In this work, we broaden the affine invariants formulation to a more general case and propose a framework consisting of a normalizing flow-like architecture capable of learning non-linear invariants. Our novel approach achieves state-of-the-art results on an extensive U-OOD benchmark, and we demonstrate its further applicability to tabular data. Finally, we show our method has the same desirable properties as those based on affine invariants.

[CV-96] Mitigating Low-Frequency Bias: Feature Recalibration and Frequency Attention Regularization for Adversarial Robustness

链接: https://arxiv.org/abs/2407.04016
作者: Kejia Zhang,Juanjuan Weng,Yuanzheng Cai,Zhiming Luo,Shaozi Li
关键词: computer vision models, long-lasting objective, computer vision, significant and long-lasting, adversarial attacks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Ensuring the robustness of computer vision models against adversarial attacks is a significant and long-lasting objective. Motivated by adversarial attacks, researchers have devoted considerable efforts to enhancing model robustness by adversarial training (AT). However, we observe that while AT improves the models’ robustness against adversarial perturbations, it fails to improve their ability to effectively extract features across all frequency components. Each frequency component contains distinct types of crucial information: low-frequency features provide fundamental structural insights, while high-frequency features capture intricate details and textures. In particular, AT tends to neglect the reliance on susceptible high-frequency features. This low-frequency bias impedes the model’s ability to effectively leverage the potentially meaningful semantic information present in high-frequency features. This paper proposes a novel module called High-Frequency Feature Disentanglement and Recalibration (HFDR), which separates features into high-frequency and low-frequency components and recalibrates the high-frequency feature to capture latent useful semantics. Additionally, we introduce frequency attention regularization to magnitude the model’s extraction of different frequency features and mitigate low-frequency bias during AT. Extensive experiments showcase the immense potential and superiority of our approach in resisting various white-box attacks, transfer attacks, and showcasing strong generalization capabilities.

[CV-97] Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

链接: https://arxiv.org/abs/2407.04003
作者: Mushui Liu,Bozheng Li,Yunlong Yu
关键词: pre-trained Vision-Language Models, Prompt tuning, involves training, training a small, small set
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:Prompt tuning, which involves training a small set of parameters, effectively enhances the pre-trained Vision-Language Models (VLMs) to downstream tasks. However, they often come at the cost of flexibility and adaptability when the tuned models are applied to different datasets or domains. In this paper, we explore capturing the task-specific information via meticulous refinement of entire VLMs, with minimal parameter adjustments. When fine-tuning the entire VLMs for specific tasks under limited supervision, overfitting and catastrophic forgetting become the defacto factors. To mitigate these issues, we propose a framework named CLIP-CITE via designing a discriminative visual-text task, further aligning the visual-text semantics in a supervision manner, and integrating knowledge distillation techniques to preserve the gained knowledge. Extensive experimental results under few-shot learning, base-to-new generalization, domain generalization, and cross-domain generalization settings, demonstrate that our method effectively enhances the performance on specific tasks under limited supervision while preserving the versatility of the VLMs on other datasets.

[CV-98] MineNetCD: A Benchmark for Global Mining Change Detection on Remote Sensing Imagery

链接: https://arxiv.org/abs/2407.03971
作者: Weikang Yu,Xiaokang Zhang,Xiao Xiang Zhu,Richard Gloaguen,Pedram Ghamisi
关键词: poses significant challenges, significant challenges due, industrial controlling, environmental management, regulatory compliance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Monitoring changes triggered by mining activities is crucial for industrial controlling, environmental management and regulatory compliance, yet it poses significant challenges due to the vast and often remote locations of mining sites. Remote sensing technologies have increasingly become indispensable to detect and analyze these changes over time. We thus introduce MineNetCD, a comprehensive benchmark designed for global mining change detection using remote sensing imagery. The benchmark comprises three key contributions. First, we establish a global mining change detection dataset featuring more than 70k paired patches of bi-temporal high-resolution remote sensing images and pixel-level annotations from 100 mining sites worldwide. Second, we develop a novel baseline model based on a change-aware Fast Fourier Transform (ChangeFFT) module, which enhances various backbones by leveraging essential spectrum components within features in the frequency domain and capturing the channel-wise correlation of bi-temporal feature differences to learn change-aware representations. Third, we construct a unified change detection (UCD) framework that integrates over 13 advanced change detection models. This framework is designed for streamlined and efficient processing, utilizing the cloud platform hosted by HuggingFace. Extensive experiments have been conducted to demonstrate the superiority of the proposed baseline model compared with 12 state-of-the-art change detection approaches. Empirical studies on modularized backbones comprehensively confirm the efficacy of different representation learners on change detection. This contribution represents significant advancements in the field of remote sensing and change detection, providing a robust resource for future research and applications in global mining monitoring. Dataset and Codes are available via the link.

[CV-99] Leveraging Latent Diffusion Models for Training-Free In-Distribution Data Augmentation for Surface Defect Detection

链接: https://arxiv.org/abs/2407.03961
作者: Federico Girella,Ziyue Liu,Franco Fummi,Francesco Setti,Marco Cristani,Luigi Capogrosso
关键词: task of identifying, Defect detection, normal samples, data, samples
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at the 21st International Conference on Content-Based Multimedia Indexing (CBMI 2024)

点击查看摘要

Abstract:Defect detection is the task of identifying defects in production samples. Usually, defect detection classifiers are trained on ground-truth data formed by normal samples (negative data) and samples with defects (positive data), where the latter are consistently fewer than normal samples. State-of-the-art data augmentation procedures add synthetic defect data by superimposing artifacts to normal samples to mitigate problems related to unbalanced training data. These techniques often produce out-of-distribution images, resulting in systems that learn what is not a normal sample but cannot accurately identify what a defect looks like. In this work, we introduce DIAG, a training-free Diffusion-based In-distribution Anomaly Generation pipeline for data augmentation. Unlike conventional image generation techniques, we implement a human-in-the-loop pipeline, where domain experts provide multimodal guidance to the model through text descriptions and region localization of the possible anomalies. This strategic shift enhances the interpretability of results and fosters a more robust human feedback loop, facilitating iterative improvements of the generated outputs. Remarkably, our approach operates in a zero-shot manner, avoiding time-consuming fine-tuning procedures while achieving superior performance. We demonstrate the efficacy and versatility of DIAG with respect to state-of-the-art data augmentation approaches on the challenging KSDD2 dataset, with an improvement in AP of approximately 18% when positive samples are available and 28% when they are missing. The source code is available at this https URL.

[CV-100] Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

链接: https://arxiv.org/abs/2407.03958
作者: Young-Jun Lee,Dokyong Lee,Junyoung Youn,Kyeongjin Oh,Byungsoo Ko,Jonghwan Hyeon,Ho-Jin Choi
关键词: instant messaging tools, messaging tools, personal experiences, instant messaging, share a wide
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Project website: this https URL

点击查看摘要

Abstract:Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social interaction, and (2) a lack of personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using our Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset in human evaluation. We make our source code and dataset publicly available.

[CV-101] rackPGD: A White-box Attack using Binary Masks against Robust Transformer Trackers

链接: https://arxiv.org/abs/2407.03946
作者: Fatemeh Nourilenjan Nokabadi,Yann Batiste Pequignot,Jean-Francois Lalonde,Christian Gagné
关键词: achieved robust performance, visual object tracking, performance on visual, trackers, Object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Object trackers with transformer backbones have achieved robust performance on visual object tracking datasets. However, the adversarial robustness of these trackers has not been well studied in the literature. Due to the backbone differences, the adversarial white-box attacks proposed for object tracking are not transferable to all types of trackers. For instance, transformer trackers such as MixFormerM still function well after black-box attacks, especially in predicting the object binary masks. We are proposing a novel white-box attack named TrackPGD, which relies on the predicted object binary mask to attack the robust transformer trackers. That new attack focuses on annotation masks by adapting the well-known SegPGD segmentation attack, allowing to successfully conduct the white-box attack on trackers relying on transformer backbones. The experimental results indicate that the TrackPGD is able to effectively attack transformer-based trackers such as MixFormerM, OSTrackSTS, and TransT-SEG on several tracking datasets.

[CV-102] SfM on-the-fly: Get better 3D from What You Capture

链接: https://arxiv.org/abs/2407.03939
作者: Zhan Zongqian,Yu Yifei,Xia Rui,Gan Wentian,Xie Hong,Perda Giulio,Morelli Luca,Remondino Fabio,Wang Xin
关键词: Structure from Motion, constant research hotspot, Navigable Small World, Hierarchical Navigable Small, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the last twenty years, Structure from Motion (SfM) has been a constant research hotspot in the fields of photogrammetry, computer vision, robotics etc., whereas real-time performance is just a recent topic of growing interest. This work builds upon the original on-the-fly SfM (Zhan et al., 2024) and presents an updated version with three new advancements to get better 3D from what you capture: (i) real-time image matching is further boosted by employing the Hierarchical Navigable Small World (HNSW) graphs, thus more true positive overlapping image candidates are faster identified; (ii) a self-adaptive weighting strategy is proposed for robust hierarchical local bundle adjustment to improve the SfM results; (iii) multiple agents are included for supporting collaborative SfM and seamlessly merge multiple 3D reconstructions into a complete 3D scene when commonly registered images appear. Various comprehensive experiments demonstrate that the proposed SfM method (named on-the-fly SfMv2) can generate more complete and robust 3D reconstructions in a high time-efficient way. Code is available at this http URL.

[CV-103] CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images

链接: https://arxiv.org/abs/2407.03923
作者: Junghe Lee,Donghyeong Kim,Dogyoon Lee,Suhwan Cho,Sangyoun Lee
关键词: received significant attention, significant attention due, view rendering ability, prompting research, received significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page : this https URL

点击查看摘要

Abstract:Neural radiance fields (NeRFs) have received significant attention due to their high-quality novel view rendering ability, prompting research to address various real-world cases. One critical challenge is the camera motion blur caused by camera movement during exposure time, which prevents accurate 3D scene reconstruction. In this study, we propose continuous rigid motion-aware gaussian splatting (CRiM-GS) to reconstruct accurate 3D scene from blurry images with real-time rendering speed. Considering the actual camera motion blurring process, which consists of complex motion patterns, we predict the continuous movement of the camera based on neural ordinary differential equations (ODEs). Specifically, we leverage rigid body transformations to model the camera motion with proper regularization, preserving the shape and size of the object. Furthermore, we introduce a continuous deformable 3D transformation in the \textitSE(3) field to adapt the rigid body transformation to real-world problems by ensuring a higher degree of freedom. By revisiting fundamental camera theory and employing advanced neural network training techniques, we achieve accurate modeling of continuous camera trajectories. We conduct extensive experiments, demonstrating state-of-the-art performance both quantitatively and qualitatively on benchmark datasets.

[CV-104] POLAFFINI: Efficient feature-based polyaffine initialization for improved non-linear image registration

链接: https://arxiv.org/abs/2407.03922
作者: Antoine Legouhy,Ross Callaghan,Hojjat Azadbakht,Hui Zhang
关键词: initialize non-linear image, paper presents, presents an efficient, image registration, nonlinear image registration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: submitted to MICCAI 2024

点击查看摘要

Abstract:This paper presents an efficient feature-based approach to initialize non-linear image registration. Today, nonlinear image registration is dominated by methods relying on intensity-based similarity measures. A good estimate of the initial transformation is essential, both for traditional iterative algorithms and for recent one-shot deep learning (DL)-based alternatives. The established approach to estimate this starting point is to perform affine registration, but this may be insufficient due to its parsimonious, global, and non-bending nature. We propose an improved initialization method that takes advantage of recent advances in DL-based segmentation techniques able to instantly estimate fine-grained regional delineations with state-of-the-art accuracies. Those segmentations are used to produce local, anatomically grounded, feature-based affine matchings using iteration-free closed-form expressions. Estimated local affine transformations are then fused, with the log-Euclidean polyaffine framework, into an overall dense diffeomorphic transformation. We show that, compared to its affine counterpart, the proposed initialization leads to significantly better alignment for both traditional and DL-based non-linear registration algorithms. The proposed approach is also more robust and significantly faster than commonly used affine registration algorithms such as FSL FLIRT.

[CV-105] Concept Bottleneck Models Without Predefined Concepts

链接: https://arxiv.org/abs/2407.03921
作者: Simon Schrodi,Julian Schur,Max Argus,Thomas Brox
关键词: Concept Bottleneck Models, considerable recent interest, predict human-interpretable concepts, Bottleneck Models, interpretable concept-based models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:There has been considerable recent interest in interpretable concept-based models such as Concept Bottleneck Models (CBMs), which first predict human-interpretable concepts and then map them to output classes. To reduce reliance on human-annotated concepts, recent works have converted pretrained black-box models into interpretable CBMs post-hoc. However, these approaches predefine a set of concepts, assuming which concepts a black-box model encodes in its representations. In this work, we eliminate this assumption by leveraging unsupervised concept discovery to automatically extract concepts without human annotations or a predefined set of concepts. We further introduce an input-dependent concept selection mechanism that ensures only a small subset of concepts is used across all classes. We show that our approach improves downstream performance and narrows the performance gap to black-box models, while using significantly fewer concepts in the classification. Finally, we demonstrate how large vision-language models can intervene on the final model weights to correct model errors.

[CV-106] MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks

链接: https://arxiv.org/abs/2407.03919
作者: Elad Hirsch,Gefen Dawidowicz,Ayellet Tal
关键词: X-ray images, unavailable for training, unpaired scenario, paired image-report data, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generating medical reports for X-ray images is a challenging task, particularly in an unpaired scenario where paired image-report data is unavailable for training. To address this challenge, we propose a novel model that leverages the available information in two distinct datasets, one comprising reports and the other consisting of images. The core idea of our model revolves around the notion that combining auto-encoding report generation with multi-modal (report-image) alignment can offer a solution. However, the challenge persists regarding how to achieve this alignment when pair correspondence is absent. Our proposed solution involves the use of auxiliary tasks, particularly contrastive learning and classification, to position related images and reports in close proximity to each other. This approach differs from previous methods that rely on pre-processing steps using external information stored in a knowledge graph. Our model, named MedRAT, surpasses previous state-of-the-art methods, demonstrating the feasibility of generating comprehensive medical reports without the need for paired data or external tools.

[CV-107] mestep-Aware Correction for Quantized Diffusion Models

链接: https://arxiv.org/abs/2407.03917
作者: Yuzhe Yao,Feng Tian,Jun Chen,Haonan Lin,Guang Dai,Yong Liu,Jingdong Wang
关键词: semantically coherent images, Diffusion models, synthesis of semantically, semantically coherent, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Diffusion models have marked a significant breakthrough in the synthesis of semantically coherent images. However, their extensive noise estimation networks and the iterative generation process limit their wider application, particularly on resource-constrained platforms like mobile devices. Existing post-training quantization (PTQ) methods have managed to compress diffusion models to low precision. Nevertheless, due to the iterative nature of diffusion models, quantization errors tend to accumulate throughout the generation process. This accumulation of error becomes particularly problematic in low-precision scenarios, leading to significant distortions in the generated images. We attribute this accumulation issue to two main causes: error propagation and exposure bias. To address these problems, we propose a timestep-aware correction method for quantized diffusion model, which dynamically corrects the quantization error. By leveraging the proposed method in low-precision diffusion models, substantial enhancement of output quality could be achieved with only negligible computation overhead. Extensive experiments underscore our method’s effectiveness and generalizability. By employing the proposed correction strategy, we achieve state-of-the-art (SOTA) results on low-precision models.

[CV-108] DiCTI: Diffusion-based Clothing Designer via Text-guided Input

链接: https://arxiv.org/abs/2407.03901
作者: Ajda Lampe(2),Julija Stopar(1),Deepak Kumar Jain(4),Shinichiro Omachi(3),Peter Peer(2),Vitomir Štruc(1) ((1) University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia, (2) University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, (3) Tohoku University, Graduate School of Engineering, Sendai, Japan, (4) Dalian University of Technology, China)
关键词: Recent developments, deep generative models, leading to significant, creative fields, including the fashion
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to FG 2024

点击查看摘要

Abstract:Recent developments in deep generative models have opened up a wide range of opportunities for image synthesis, leading to significant changes in various creative fields, including the fashion industry. While numerous methods have been proposed to benefit buyers, particularly in virtual try-on applications, there has been relatively less focus on facilitating fast prototyping for designers and customers seeking to order new designs. To address this gap, we introduce DiCTI (Diffusion-based Clothing Designer via Text-guided Input), a straightforward yet highly effective approach that allows designers to quickly visualize fashion-related ideas using text inputs only. Given an image of a person and a description of the desired garments as input, DiCTI automatically generates multiple high-resolution, photorealistic images that capture the expressed semantics. By leveraging a powerful diffusion-based inpainting model conditioned on text inputs, DiCTI is able to synthesize convincing, high-quality images with varied clothing designs that viably follow the provided text descriptions, while being able to process very diverse and challenging inputs, captured in completely unconstrained settings. We evaluate DiCTI in comprehensive experiments on two different datasets (VITON-HD and Fashionpedia) and in comparison to the state-of-the-art (SoTa). The results of our experiments show that DiCTI convincingly outperforms the SoTA competitor in generating higher quality images with more elaborate garments and superior text prompt adherence, both according to standard quantitative evaluation measures and human ratings, generated as part of a user study.

[CV-109] Oracle Bone Inscriptions Multi-modal Dataset

链接: https://arxiv.org/abs/2407.03900
作者: Bang Li,Donghao Luo,Yujie Liang,Jing Yang,Zengmao Ding,Xu Peng,Boyuan Jiang,Shengwei Han,Dan Sui,Peichao Qin,Pian Wu,Chaoyang Wang,Yun Qi,Taisong Jin,Chengjie Wang,Xiaoming Huang,Zhan Shu,Rongrong Ji,Yongge Liu,Yunsheng Wu
关键词: early Shang history, bearing invaluable written, earliest developed writing, developed writing system, invaluable written exemplifications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Oracle bone inscriptions(OBI) is the earliest developed writing system in China, bearing invaluable written exemplifications of early Shang history and paleography. However, the task of deciphering OBI, in the current climate of the scholarship, can prove extremely challenging. Out of the 4,500 oracle bone characters excavated, only a third have been successfully identified. Therefore, leveraging the advantages of advanced AI technology to assist in the decipherment of OBI is a highly essential research topic. However, fully utilizing AI’s capabilities in these matters is reliant on having a comprehensive and high-quality annotated OBI dataset at hand whereas most existing datasets are only annotated in just a single or a few dimensions, limiting the value of their potential application. For instance, the Oracle-MNIST dataset only offers 30k images classified into 10 categories. Therefore, this paper proposes an Oracle Bone Inscriptions Multi-modal Dataset(OBIMD), which includes annotation information for 10,077 pieces of oracle bones. Each piece has two modalities: pixel-level aligned rubbings and facsimiles. The dataset annotates the detection boxes, character categories, transcriptions, corresponding inscription groups, and reading sequences in the groups of each oracle bone character, providing a comprehensive and high-quality level of annotations. This dataset can be used for a variety of AI-related research tasks relevant to the field of OBI, such as OBI Character Detection and Recognition, Rubbing Denoising, Character Matching, Character Generation, Reading Sequence Prediction, Missing Characters Completion task and so on. We believe that the creation and publication of a dataset like this will help significantly advance the application of AI algorithms in the field of OBI research.

[CV-110] Do Generalised Classifiers really work on Human Drawn Sketches?

链接: https://arxiv.org/abs/2407.03893
作者: Hmrishav Bandyopadhyay,Pinaki Nath Chowdhury,Aneeshan Sain,Subhadeep Koley,Tao Xiang,Ayan Kumar Bhunia,Yi-Zhe Song
关键词: marries large foundation, human sketch understanding, large foundation models, marries large, large foundation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV 2024

点击查看摘要

Abstract:This paper, for the first time, marries large foundation models with human sketch understanding. We demonstrate what this brings – a paradigm shift in terms of generalised sketch representation learning (e.g., classification). This generalisation happens on two fronts: (i) generalisation across unknown categories (i.e., open-set), and (ii) generalisation traversing abstraction levels (i.e., good and bad sketches), both being timely challenges that remain unsolved in the sketch literature. Our design is intuitive and centred around transferring the already stellar generalisation ability of CLIP to benefit generalised learning for sketches. We first “condition” the vanilla CLIP model by learning sketch-specific prompts using a novel auxiliary head of raster to vector sketch conversion. This importantly makes CLIP “sketch-aware”. We then make CLIP acute to the inherently different sketch abstraction levels. This is achieved by learning a codebook of abstraction-specific prompt biases, a weighted combination of which facilitates the representation of sketches across abstraction levels – low abstract edge-maps, medium abstract sketches in TU-Berlin, and highly abstract doodles in QuickDraw. Our framework surpasses popular sketch representation learning algorithms in both zero-shot and few-shot setups and in novel settings across different abstraction boundaries.

[CV-111] DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment

链接: https://arxiv.org/abs/2407.03886
作者: Jinsong Shi,Pan Gao,Xiaojiang Peng,Jie Qin
关键词: Image quality assessment, quality assessment, fundamental challenge, IQA, deep learning-based IQA
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Image quality assessment (IQA) has long been a fundamental challenge in image understanding. In recent years, deep learning-based IQA methods have shown promising performance. However, the lack of large amounts of labeled data in the IQA field has hindered further advancements in these methods. This paper introduces DSMix, a novel data augmentation technique specifically designed for IQA tasks, aiming to overcome this limitation. DSMix leverages the distortion-induced sensitivity map (DSM) of an image as prior knowledge. It applies cut and mix operations to diverse categories of synthetic distorted images, assigning confidence scores to class labels based on the aforementioned prior knowledge. In the pre-training phase using DSMix-augmented data, knowledge distillation is employed to enhance the model’s ability to extract semantic features. Experimental results on both synthetic and authentic IQA datasets demonstrate the significant predictive and generalization performance achieved by DSMix, without requiring fine-tuning of the full model. Code is available at \urlthis https URL.

[CV-112] Perception-Guided Quality Metric of 3D Point Clouds Using Hybrid Strategy

链接: https://arxiv.org/abs/2407.03885
作者: Yujie Zhang,Qi Yang,Yiling Xu,Shan Liu
关键词: Full-reference point cloud, point cloud quality, distorted point clouds, cloud quality assessment, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Full-reference point cloud quality assessment (FR-PCQA) aims to infer the quality of distorted point clouds with available references. Most of the existing FR-PCQA metrics ignore the fact that the human visual system (HVS) dynamically tackles visual information according to different distortion levels (i.e., distortion detection for high-quality samples and appearance perception for low-quality samples) and measure point cloud quality using unified features. To bridge the gap, in this paper, we propose a perception-guided hybrid metric (PHM) that adaptively leverages two visual strategies with respect to distortion degree to predict point cloud quality: to measure visible difference in high-quality samples, PHM takes into account the masking effect and employs texture complexity as an effective compensatory factor for absolute difference; on the other hand, PHM leverages spectral graph theory to evaluate appearance degradation in low-quality samples. Variations in geometric signals on graphs and changes in the spectral graph wavelet coefficients are utilized to characterize geometry and texture appearance degradation, respectively. Finally, the results obtained from the two components are combined in a non-linear method to produce an overall quality score of the tested point cloud. The results of the experiment on five independent databases show that PHM achieves state-of-the-art (SOTA) performance and offers significant performance improvement in multiple distortion environments. The code is publicly available at this https URL.

[CV-113] he Solution for the GAIIC2024 RGB-TIR object detection Challenge

链接: https://arxiv.org/abs/2407.03872
作者: Xiangyu Wu,Jinling Xu,Longfei Huang,Yang Yang
关键词: RGB-TIR object detection, object detection, unmanned aerial vehicles, RGB-TIR object, unmanned aerial
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This report introduces a solution to The task of RGB-TIR object detection from the perspective of unmanned aerial vehicles. Unlike traditional object detection methods, RGB-TIR object detection aims to utilize both RGB and TIR images for complementary information during detection. The challenges of RGB-TIR object detection from the perspective of unmanned aerial vehicles include highly complex image backgrounds, frequent changes in lighting, and uncalibrated RGB-TIR image pairs. To address these challenges at the model level, we utilized a lightweight YOLOv9 model with extended multi-level auxiliary branches that enhance the model’s robustness, making it more suitable for practical applications in unmanned aerial vehicle scenarios. For image fusion in RGB-TIR detection, we incorporated a fusion module into the backbone network to fuse images at the feature level, implicitly addressing calibration issues. Our proposed method achieved an mAP score of 0.516 and 0.543 on A and B benchmarks respectively while maintaining the highest inference speed among all models.

[CV-114] PFGS: High Fidelity Point Cloud Rendering via Feature Splatting

链接: https://arxiv.org/abs/2407.03857
作者: Jiaxu Wang,Ziyi Zhang,Junhao He,Renjing Xu
关键词: Rendering high-fidelity images, sparse point clouds, images from sparse, high-fidelity images, point cloud rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Rendering high-fidelity images from sparse point clouds is still challenging. Existing learning-based approaches suffer from either hole artifacts, missing details, or expensive computations. In this paper, we propose a novel framework to render high-quality images from sparse points. This method first attempts to bridge the 3D Gaussian Splatting and point cloud rendering, which includes several cascaded modules. We first use a regressor to estimate Gaussian properties in a point-wise manner, the estimated properties are used to rasterize neural feature descriptors into 2D planes which are extracted from a multiscale extractor. The projected feature volume is gradually decoded toward the final prediction via a multiscale and progressive decoder. The whole pipeline experiences a two-stage training and is driven by our well-designed progressive and multiscale reconstruction loss. Experiments on different benchmarks show the superiority of our method in terms of rendering qualities and the necessities of our main components.

[CV-115] Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

链接: https://arxiv.org/abs/2407.03842
作者: Linlong Fan,Ye Huang,Yanqi Ge,Wen Li,Lixin Duan
关键词: arbitrary views, excel at recognizing, arbitrary, view-based methods excel, recognition under arbitrary
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but their exploration of recognition under arbitrary views is limited. This is a challenging and realistic setting because each object has different viewpoint positions and quantities, and their poses are not aligned. However, most view-based methods, which aggregate multiple view features to obtain a global feature representation, hard to address 3D object recognition under arbitrary views. Due to the unaligned inputs from arbitrary views, it is challenging to robustly aggregate features, leading to performance degradation. In this paper, we introduce a novel Part-aware Network (PANet), which is a part-based representation, to address these issues. This part-based representation aims to localize and understand different parts of 3D objects, such as airplane wings and tails. It has properties such as viewpoint invariance and rotation robustness, which give it an advantage in addressing the 3D object recognition problem under arbitrary views. Our results on benchmark datasets clearly demonstrate that our proposed method outperforms existing view-based aggregation baselines for the task of 3D object recognition under arbitrary views, even surpassing most fixed viewpoint methods.

[CV-116] ADAPT: Multimodal Learning for Detecting Physiological Changes under Missing Modalities

链接: https://arxiv.org/abs/2407.03836
作者: Julie Mordacq,Leo Milecki,Maria Vakalopoulou,Steve Oudot,Vicky Kalogeiton
关键词: recently gained attention, Multimodality has recently, medical domain, health records, Masked Multimodal Transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at MIDL 2024

点击查看摘要

Abstract:Multimodality has recently gained attention in the medical domain, where imaging or video modalities may be integrated with biomedical signals or health records. Yet, two challenges remain: balancing the contributions of modalities, especially in cases with a limited amount of data available, and tackling missing modalities. To address both issues, in this paper, we introduce the AnchoreD multimodAl Physiological Transformer (ADAPT), a multimodal, scalable framework with two key components: (i) aligning all modalities in the space of the strongest, richest modality (called anchor) to learn a joint embedding space, and (ii) a Masked Multimodal Transformer, leveraging both inter- and intra-modality correlations while handling missing modalities. We focus on detecting physiological changes in two real-life scenarios: stress in individuals induced by specific triggers and fighter pilots’ loss of consciousness induced by g -forces. We validate the generalizability of ADAPT through extensive experiments on two datasets for these tasks, where we set the new state of the art while demonstrating its robustness across various modality scenarios and its high potential for real-life applications.

[CV-117] 7th ABAW Competition: Multi-Task Learning and Compound Expression Recognition

链接: https://arxiv.org/abs/2407.03835
作者: Dimitrios Kollias,Stefanos Zafeiriou,Irene Kotsia,Abhinav Dhall,Shreya Ghosh,Chunchang Shao,Guanyu Hu
关键词: Affective Behavior Analysis, respective Workshop held, Compound Expression Recognition, ABAW Competition addresses, Affective Behavior
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper describes the 7th Affective Behavior Analysis in-the-wild (ABAW) Competition, which is part of the respective Workshop held in conjunction with ECCV 2024. The 7th ABAW Competition addresses novel challenges in understanding human expressions and behaviors, crucial for the development of human-centered technologies. The Competition comprises of two sub-challenges: i) Multi-Task Learning (the goal is to learn at the same time, in a multi-task learning setting, to estimate two continuous affect dimensions, valence and arousal, to recognise between the mutually exclusive classes of the 7 basic expressions and ‘other’), and to detect 12 Action Units); and ii) Compound Expression Recognition (the target is to recognise between the 7 mutually exclusive compound expression classes). s-Aff-Wild2, which is a static version of the A/V Aff-Wild2 database and contains annotations for valence-arousal, expressions and Action Units, is utilized for the purposes of the Multi-Task Learning Challenge; a part of C-EXPR-DB, which is an A/V in-the-wild database with compound expression annotations, is utilized for the purposes of the Compound Expression Recognition Challenge. In this paper, we introduce the two challenges, detailing their datasets and the protocols followed for each. We also outline the evaluation metrics, and highlight the baseline systems and their results. Additional information about the competition can be found at \urlthis https URL.

[CV-118] DocXplain: A Novel Model-Agnostic Explainability Method for Document Image Classification

链接: https://arxiv.org/abs/2407.03830
作者: Saifullah Saifullah,Stefan Agne,Andreas Dengel,Sheraz Ahmed
关键词: showcasing superhuman performance, document image classification, Deep learning, document image, deep learning models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ICDAR 2024

点击查看摘要

Abstract:Deep learning (DL) has revolutionized the field of document image analysis, showcasing superhuman performance across a diverse set of tasks. However, the inherent black-box nature of deep learning models still presents a significant challenge to their safe and robust deployment in industry. Regrettably, while a plethora of research has been dedicated in recent years to the development of DL-powered document analysis systems, research addressing their transparency aspects has been relatively scarce. In this paper, we aim to bridge this research gap by introducing DocXplain, a novel model-agnostic explainability method specifically designed for generating high interpretability feature attribution maps for the task of document image classification. In particular, our approach involves independently segmenting the foreground and background features of the documents into different document elements and then ablating these elements to assign feature importance. We extensively evaluate our proposed approach in the context of document image classification, utilizing 4 different evaluation metrics, 2 widely recognized document benchmark datasets, and 10 state-of-the-art document image classification models. By conducting a thorough quantitative and qualitative analysis against 9 existing state-of-the-art attribution methods, we demonstrate the superiority of our approach in terms of both faithfulness and interpretability. To the best of the authors’ knowledge, this work presents the first model-agnostic attribution-based explainability method specifically tailored for document images. We anticipate that our work will significantly contribute to advancing research on transparency, fairness, and robustness of document image classification models.

[CV-119] StreamLTS: Query-based Temporal-Spatial LiDAR Fusion for Cooperative Object Detection

链接: https://arxiv.org/abs/2407.03825
作者: Yunshuang Yuan,Monika Sester
关键词: intelligent traffic agents, autonomous driving, intelligent traffic, great potential, potential to improve
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Cooperative perception via communication among intelligent traffic agents has great potential to improve the safety of autonomous driving. However, limited communication bandwidth, localization errors and asynchronized capturing time of sensor data, all introduce difficulties to the data fusion of different agents. To some extend, previous works have attempted to reduce the shared data size, mitigate the spatial feature misalignment caused by localization errors and communication delay. However, none of them have considered the asynchronized sensor ticking times, which can lead to dynamic object misplacement of more than one meter during data fusion. In this work, we propose Time-Aligned COoperative Object Detection (TA-COOD), for which we adapt widely used dataset OPV2V and DairV2X with considering asynchronous LiDAR sensor ticking times and build an efficient fully sparse framework with modeling the temporal information of individual objects with query-based techniques. The experiment results confirmed the superior efficiency of our fully sparse framework compared to the state-of-the-art dense models. More importantly, they show that the point-wise observation timestamps of the dynamic objects are crucial for accurate modeling the object temporal context and the predictability of their time-related locations.

[CV-120] Markerless Multi-view 3D Human Pose Estimation: a survey

链接: https://arxiv.org/abs/2407.03817
作者: Ana Filipa Rodrigues Nogueira,Hélder P. Oliveira,Luís F. Teixeira
关键词: human skeleton, human pose estimation, body joints, scene by detecting, detecting several body
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 5 figures, submitted to Image and Vision Computing (IMAVIS)

点击查看摘要

Abstract:3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. The creation of accurate and efficient methods is required for several real-world applications including animation, human-robot interaction, surveillance systems or sports, among many others. However, several obstacles such as occlusions, random camera perspectives, or the scarcity of 3D labelled data, have been hampering the models’ performance and limiting their deployment in real-world scenarios. The higher availability of cameras has led researchers to explore multi-view solutions due to the advantage of being able to exploit different perspectives to reconstruct the pose. Thus, the goal of this survey is to present an overview of the methodologies used to estimate the 3D pose in multi-view settings, understand what were the strategies found to address the various challenges and also, identify their limitations. Based on the reviewed articles, it was possible to find that no method is yet capable of solving all the challenges associated with the reconstruction of the 3D pose. Due to the existing trade-off between complexity and performance, the best method depends on the application scenario. Therefore, further research is still required to develop an approach capable of quickly inferring a highly accurate 3D pose with bearable computation cost. To this goal, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology to solve this task. Comments: 22 pages, 5 figures, submitted to Image and Vision Computing (IMAVIS) Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2407.03817 [cs.CV] (or arXiv:2407.03817v1 [cs.CV] for this version)

[CV-121] PECTP: Parameter-Efficient Cross-Task Prompts for Incremental Vision Transformer

链接: https://arxiv.org/abs/2407.03813
作者: Qian Feng,Hanbin Zhao,Chao Zhang,Jiahua Dong,Henghui Ding,Yu-Gang Jiang,Hui Qian
关键词: learn deep models, deep models, sequential tasks continually, Incremental Learning, Prompt Retention Module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Incremental Learning (IL) aims to learn deep models on sequential tasks continually, where each new task includes a batch of new classes and deep models have no access to task-ID information at the inference time. Recent vast pre-trained models (PTMs) have achieved outstanding performance by prompt technique in practical IL without the old samples (rehearsal-free) and with a memory constraint (memory-constrained): Prompt-extending and Prompt-fixed methods. However, prompt-extending methods need a large memory buffer to maintain an ever-expanding prompt pool and meet an extra challenging prompt selection problem. Prompt-fixed methods only learn a single set of prompts on one of the incremental tasks and can not handle all the incremental tasks effectively. To achieve a good balance between the memory cost and the performance on all the tasks, we propose a Parameter-Efficient Cross-Task Prompt (PECTP) framework with Prompt Retention Module (PRM) and classifier Head Retention Module (HRM). To make the final learned prompts effective on all incremental tasks, PRM constrains the evolution of cross-task prompts’ parameters from Outer Prompt Granularity and Inner Prompt Granularity. Besides, we employ HRM to inherit old knowledge in the previously learned classifier heads to facilitate the cross-task prompts’ generalization ability. Extensive experiments show the effectiveness of our method. The source codes will be available at \urlthis https URL.

[CV-122] Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

链接: https://arxiv.org/abs/2407.03788
作者: Thong Nguyen,Yi Bin,Xiaobao Wu,Xinshuai Dong,Zhiyuan Hu,Khoi Le,Cong-Duy Nguyen,See-Kiong Ng,Luu Anh Tuan
关键词: Data quality stands, video-language representation learning, quality stands, forefront of deciding, deciding the effectiveness
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights which enable dynamic adjustment of the model’s focus throughout the training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by large vision-language model, we improve video-language representations and achieve superior performances on commonly used video question answering and text-video retrieval datasets.

[CV-123] Improving Computer Vision Interpretability: Transparent Two-level Classification for Complex Scenes

链接: https://arxiv.org/abs/2407.03786
作者: Stefan Scholz,Nils B. Weidmann,Zachary C. Steinert-Threlkeld,Eda Keremoğlu,Bastian Goldlücke
关键词: Treating images, increasingly popular, Treating, images, objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Treating images as data has become increasingly popular in political science. While existing classifiers for images reach high levels of accuracy, it is difficult to systematically assess the visual features on which they base their classification. This paper presents a two-level classification method that addresses this transparency problem. At the first stage, an image segmenter detects the objects present in the image and a feature vector is created from those objects. In the second stage, this feature vector is used as input for standard machine learning classifiers to discriminate between images. We apply this method to a new dataset of more than 140,000 images to detect which ones display political protest. This analysis demonstrates three advantages to this paper’s approach. First, identifying objects in images improves transparency by providing human-understandable labels for the objects shown on an image. Second, knowing these objects enables analysis of which distinguish protest images from non-protest ones. Third, comparing the importance of objects across countries reveals how protest behavior varies. These insights are not available using conventional computer vision classifiers and provide new opportunities for comparative research.

[CV-124] SpikeGS: Reconstruct 3D scene via fast-moving bio-inspired sensors

链接: https://arxiv.org/abs/2407.03771
作者: Yijia Guo,Liwen Hu,Lei Ma,Tiejun Huang
关键词: unparalleled superior performance, Gaussian Splatting, demonstrates unparalleled superior, Spike Gausian Splatting, unparalleled superior
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) demonstrates unparalleled superior performance in 3D scene reconstruction. However, 3DGS heavily relies on the sharp images. Fulfilling this requirement can be challenging in real-world scenarios especially when the camera moves fast, which severely limits the application of 3DGS. To address these challenges, we proposed Spike Gausian Splatting (SpikeGS), the first framework that integrates the spike streams into 3DGS pipeline to reconstruct 3D scenes via a fast-moving bio-inspired camera. With accumulation rasterization, interval supervision, and a specially designed pipeline, SpikeGS extracts detailed geometry and texture from high temporal resolution but texture lacking spike stream, reconstructs 3D scenes captured in 1 second. Extensive experiments on multiple synthetic and real-world datasets demonstrate the superiority of SpikeGS compared with existing spike-based and deblur 3D scene reconstruction methods. Codes and data will be released soon.

[CV-125] DiffRetouch: Using Diffusion to Retouch on the Shoulder of Experts

链接: https://arxiv.org/abs/2407.03757
作者: Zheng-Peng Duan,Jiawei zhang,Zheng Lin,Xin Jin,Dongqing Zou,Chunle Guo,Chongyi Li
关键词: Image retouching aims, quality of photos, aims to enhance, enhance the visual, visual quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image retouching aims to enhance the visual quality of photos. Considering the different aesthetic preferences of users, the target of retouching is subjective. However, current retouching methods mostly adopt deterministic models, which not only neglects the style diversity in the expert-retouched results and tends to learn an average style during training, but also lacks sample diversity during inference. In this paper, we propose a diffusion-based method, named DiffRetouch. Thanks to the excellent distribution modeling ability of diffusion, our method can capture the complex fine-retouched distribution covering various visual-pleasing styles in the training data. Moreover, four image attributes are made adjustable to provide a user-friendly editing mechanism. By adjusting these attributes in specified ranges, users are allowed to customize preferred styles within the learned fine-retouched distribution. Additionally, the affine bilateral grid and contrastive learning scheme are introduced to handle the problem of texture distortion and control insensitivity respectively. Extensive experiments have demonstrated the superior performance of our method on visually appealing and sample diversity. The code will be made available to the community.

[CV-126] A Computer Vision Approach to Estimate the Localized Sea State

链接: https://arxiv.org/abs/2407.03755
作者: Aleksandar Vorkapic,Miran Pobar,Marina Ivasic-Kos
关键词: carbon reduction targets, legislative carbon reduction, sea state recognition, real-time sea state, aiming to contribute
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication in Ocean Engineering

点击查看摘要

Abstract:This research presents a novel application of computer vision (CV) and deep learning methods for real-time sea state recognition, aiming to contribute to improving the operational safety and energy efficiency of seagoing vessels, key factors in meeting the legislative carbon reduction targets. Our work focuses on utilizing sea images in operational envelopes captured by a single stationary camera mounted on the ship bridge. The collected images are used to train a deep learning model to automatically recognize the state of the sea based on the Beaufort scale. To recognize the sea state, we used 4 state-of-the-art deep neural networks with different characteristics that proved useful in various computer vision tasks: Resnet-101, NASNet, MobileNet_v2, and Transformer ViT-b32. Furthermore, we have defined a unique large-scale dataset, collected over a broad range of sea conditions from an ocean-going vessel prepared for machine learning. We used the transfer learning approach to fine-tune the models on our dataset. The obtained results demonstrate the potential for this approach to complement traditional methods, particularly where in-situ measurements are unfeasible or interpolated weather buoy data is insufficiently accurate. This study sets the groundwork for further development of sea state classification models to address recognized gaps in maritime research and enable safer and more efficient maritime operations.

[CV-127] Semantic Grouping Network for Audio Source Separation

链接: https://arxiv.org/abs/2407.03736
作者: Shentong Mo,Yapeng Tian
关键词: natural synchronization, modalities to boost, source separation performance, boost audio source, Semantic Grouping Network
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extracted high-level semantics from visual inputs as the guidance to help disentangle sound representation for individual sources. Can we directly learn to disentangle the individual semantics from the sound itself? The dilemma is that multiple sound sources are mixed together in the original space. To tackle the difficulty, in this paper, we present a novel Semantic Grouping Network, termed as SGN, that can directly disentangle sound representations and extract high-level semantic information for each source from input audio mixture. Specifically, SGN aggregates category-wise source features through learnable class tokens of sounds. Then, the aggregated semantic features can be used as the guidance to separate the corresponding audio sources from the mixture. We conducted extensive experiments on music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and VGG-Sound. The results demonstrate that our SGN significantly outperforms previous audio-only methods and audio-visual models without utilizing additional visual cues.

[CV-128] Relative Difficulty Distillation for Semantic Segmentation

链接: https://arxiv.org/abs/2407.03719
作者: Dong Liang,Yue Sun,Yun Du,Songcan Chen,Sheng-Jun Huang
关键词: Current knowledge distillation, Current knowledge, transferring various structured, imitate the output, RDD
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current knowledge distillation (KD) methods primarily focus on transferring various structured knowledge and designing corresponding optimization goals to encourage the student network to imitate the output of the teacher network. However, introducing too many additional optimization objectives may lead to unstable training, such as gradient conflicts. Moreover, these methods ignored the guidelines of relative learning difficulty between the teacher and student networks. Inspired by human cognitive science, in this paper, we redefine knowledge from a new perspective – the student and teacher networks’ relative difficulty of samples, and propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD). We propose a two-stage RDD framework: Teacher-Full Evaluated RDD (TFE-RDD) and Teacher-Student Evaluated RDD (TSE-RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals, thus avoiding adjusting learning weights for multiple losses. Extensive experimental evaluations using a general distillation loss function on popular datasets such as Cityscapes, CamVid, Pascal VOC, and ADE20k demonstrate the effectiveness of RDD against state-of-the-art KD methods. Additionally, our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.

[CV-129] Generalized Robust Fundus Photography-based Vision Loss Estimation for High Myopia

链接: https://arxiv.org/abs/2407.03699
作者: Zipei Yan,Zhile Liang,Zhengji Liu,Shuai Wang,Rachel Ka-Man Chun,Jizhou Li,Chea-su Kee,Dong Liang
关键词: myopia significantly increases, irreversible vision loss, High myopia significantly, increases the risk, risk of irreversible
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI 2024, code: this https URL

点击查看摘要

Abstract:High myopia significantly increases the risk of irreversible vision loss. Traditional perimetry-based visual field (VF) assessment provides systematic quantification of visual loss but it is subjective and time-consuming. Consequently, machine learning models utilizing fundus photographs to estimate VF have emerged as promising alternatives. However, due to the high variability and the limited availability of VF data, existing VF estimation models fail to generalize well, particularly when facing out-of-distribution data across diverse centers and populations. To tackle this challenge, we propose a novel, parameter-efficient framework to enhance the generalized robustness of VF estimation on both in- and out-of-distribution data. Specifically, we design a Refinement-by-Denoising (RED) module for feature refinement and adaptation from pretrained vision models, aiming to learn high-entropy feature representations and to mitigate the domain gap effectively and efficiently. Through independent validation on two distinct real-world datasets from separate centers, our method significantly outperforms existing approaches in RMSE, MAE and correlation coefficient for both internal and external validation. Our proposed framework benefits both in- and out-of-distribution VF estimation, offering significant clinical implications and potential utility in real-world ophthalmic practices.

[CV-130] M3:Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask

链接: https://arxiv.org/abs/2407.03695
作者: Xinyu Yang,Xiaochen Ma,Xuekang Zhu,Bo Du,Lei Su,Bingkui Tong,Zeyu Lei,Jizhe Zhou
关键词: Manipulation Mask Manufacturer, image manipulation localization, IML models, Mask Manufacturer Dataset, IML
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the field of image manipulation localization (IML), the small quantity and poor quality of existing datasets have always been major issues. A dataset containing various types of manipulations will greatly help improve the accuracy of IML models. Images on the internet (such as those on Baidu Tieba’s PS Bar) are manipulated using various techniques, and creating a dataset from these images will significantly enrich the types of manipulations in our data. However, images on the internet suffer from resolution and clarity issues, and the masks obtained by simply subtracting the manipulated image from the original contain various noises. These noises are difficult to remove, rendering the masks unusable for IML models. Inspired by the field of change detection, we treat the original and manipulated images as changes over time for the same image and view the data generation task as a change detection task. However, due to clarity issues between images, conventional change detection models perform poorly. Therefore, we introduced a super-resolution module and proposed the Manipulation Mask Manufacturer (MMM) framework. It enhances the resolution of both the original and tampered images, thereby improving image details for better comparison. Simultaneously, the framework converts the original and tampered images into feature embeddings and concatenates them, effectively modeling the context. Additionally, we created the Manipulation Mask Manufacturer Dataset (MMMD), a dataset that covers a wide range of manipulation techniques. We aim to contribute to the fields of image forensics and manipulation detection by providing more realistic manipulation data through MMM and MMMD. Detailed information about MMMD and the download link can be found at: the code and datasets will be made available.

[CV-131] Limited-View Photoacoustic Imaging Reconstruction Via High-quality Self-supervised Neural Representation

链接: https://arxiv.org/abs/2407.03663
作者: Youshen xiao,Yuting Shen,Bowei Yao,Xiran Cai,Yuyao Zhang,Fei Gao
关键词: human body, tissue or organ, crucial information, practical applications, encompass the target
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In practical applications within the human body, it is often challenging to fully encompass the target tissue or organ, necessitating the use of limited-view arrays, which can lead to the loss of crucial information. Addressing the reconstruction of photoacoustic sensor signals in limited-view detection spaces has become a focal point of current research. In this study, we introduce a self-supervised network termed HIgh-quality Self-supervised neural representation (HIS), which tackles the inverse problem of photoacoustic imaging to reconstruct high-quality photoacoustic images from sensor data acquired under limited viewpoints. We regard the desired reconstructed photoacoustic image as an implicit continuous function in 2D image space, viewing the pixels of the image as sparse discrete samples. The HIS’s objective is to learn the continuous function from limited observations by utilizing a fully connected neural network combined with Fourier feature position encoding. By simply minimizing the error between the network’s predicted sensor data and the actual sensor data, HIS is trained to represent the observed continuous model. The results indicate that the proposed HIS model offers superior image reconstruction quality compared to three commonly used methods for photoacoustic image reconstruction.

[CV-132] reBEN: Refined BigEarthNet Dataset for Remote Sensing Image Analysis

链接: https://arxiv.org/abs/2407.03653
作者: Kai Norman Clasen,Leonard Hackel,Tom Burgert,Gencer Sumbul,Begüm Demir,Volker Markl
关键词: remote sensing image, remote sensing dataset, sensing image analysis, sensing dataset constructed, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This paper presents refined BigEarthNet (reBEN) that is a large-scale, multi-modal remote sensing dataset constructed to support deep learning (DL) studies for remote sensing image analysis. The reBEN dataset consists of 549,488 pairs of Sentinel-1 and Sentinel-2 image patches. To construct reBEN, we initially consider the Sentinel-1 and Sentinel-2 tiles used to construct the BigEarthNet dataset and then divide them into patches of size 1200 m x 1200 m. We apply atmospheric correction to the Sentinel-2 patches using the latest version of the sen2cor tool, resulting in higher-quality patches compared to those present in BigEarthNet. Each patch is then associated with a pixel-level reference map and scene-level multi-labels. This makes reBEN suitable for pixel- and scene-based learning tasks. The labels are derived from the most recent CORINE Land Cover (CLC) map of 2018 by utilizing the 19-class nomenclature as in BigEarthNet. The use of the most recent CLC map results in overcoming the label noise present in BigEarthNet. Furthermore, we introduce a new geographical-based split assignment algorithm that significantly reduces the spatial correlation among the train, validation, and test sets with respect to those present in BigEarthNet. This increases the reliability of the evaluation of DL models. To minimize the DL model training time, we introduce software tools that convert the reBEN dataset into a DL-optimized data format. In our experiments, we show the potential of reBEN for multi-modal multi-label image classification problems by considering several state-of-the-art DL models. The pre-trained model weights, associated code, and complete dataset are available at this https URL.

[CV-133] Generative Technology for Human Emotion Recognition: A Scope Review

链接: https://arxiv.org/abs/2407.03640
作者: Fei Ma,Yucheng Yuan,Yifan Xie,Hongwei Ren,Ivan Liu,Ying He,Fuji Ren,Fei Richard Yu,Shiguang Ni
关键词: Affective computing stands, Affective computing, emotion recognition, seeking to imbue, Large Language Model
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

点击查看摘要

Abstract:Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progress has been made in generative models, including Autoencoder, Generative Adversarial Network, Diffusion Model, and Large Language Model. These models, with their powerful data generation capabilities, emerge as pivotal tools in advancing emotion recognition. However, up to now, there remains a paucity of systematic efforts that review generative technology for emotion recognition. This survey aims to bridge the gaps in the existing literature by conducting a comprehensive analysis of over 320 research papers until June 2024. Specifically, this survey will firstly introduce the mathematical principles of different generative models and the commonly used datasets. Subsequently, through a taxonomy, it will provide an in-depth analysis of how generative techniques address emotion recognition based on different modalities in several aspects, including data augmentation, feature extraction, semi-supervised learning, cross-domain, etc. Finally, the review will outline future research directions, emphasizing the potential of generative models to advance the field of emotion recognition and enhance the emotional intelligence of AI systems.

[CV-134] Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration

链接: https://arxiv.org/abs/2407.03636
作者: Yuhong Zhang,Hengsheng Zhang,Xinning Chai,Zhengxue Cheng,Rong Xie,Li Song,Wenjun Zhang
关键词: low-level problem aimed, classic low-level problem, recovering high-quality images, classic low-level, aimed at recovering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image restoration is a classic low-level problem aimed at recovering high-quality images from low-quality images with various degradations such as blur, noise, rain, haze, etc. However, due to the inherent complexity and non-uniqueness of degradation in real-world images, it is challenging for a model trained for single tasks to handle real-world restoration problems effectively. Moreover, existing methods often suffer from over-smoothing and lack of realism in the restored results. To address these issues, we propose Diff-Restorer, a universal image restoration method based on the diffusion model, aiming to leverage the prior knowledge of Stable Diffusion to remove degradation while generating high perceptual quality restoration results. Specifically, we utilize the pre-trained visual language model to extract visual prompts from degraded images, including semantic and degradation embeddings. The semantic embeddings serve as content prompts to guide the diffusion model for generation. In contrast, the degradation embeddings modulate the Image-guided Control Module to generate spatial priors for controlling the spatial structure of the diffusion process, ensuring faithfulness to the original image. Additionally, we design a Degradation-aware Decoder to perform structural correction and convert the latent code to the pixel domain. We conducted comprehensive qualitative and quantitative analysis on restoration tasks with different degradations, demonstrating the effectiveness and superiority of our approach.

[CV-135] MRIR: Integrating Multimodal Insights for Diffusion-based Realistic Image Restoration

链接: https://arxiv.org/abs/2407.03635
作者: Yuhong Zhang,Hengsheng Zhang,Xinning Chai,Rong Xie,Li Song,Wenjun Zhang
关键词: Realistic image restoration, produce realistic results, produce realistic, image restoration, Realistic image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Realistic image restoration is a crucial task in computer vision, and the use of diffusion-based models for image restoration has garnered significant attention due to their ability to produce realistic results. However, the quality of the generated images is still a significant challenge due to the severity of image degradation and the uncontrollability of the diffusion model. In this work, we delve into the potential of utilizing pre-trained stable diffusion for image restoration and propose MRIR, a diffusion-based restoration method with multimodal insights. Specifically, we explore the problem from two perspectives: textual level and visual level. For the textual level, we harness the power of the pre-trained multimodal large language model to infer meaningful semantic information from low-quality images. Furthermore, we employ the CLIP image encoder with a designed Refine Layer to capture image details as a supplement. For the visual level, we mainly focus on the pixel level control. Thus, we utilize a Pixel-level Processor and ControlNet to control spatial structures. Finally, we integrate the aforementioned control information into the denoising U-Net using multi-level attention mechanisms and realize controllable image restoration with multimodal insights. The qualitative and quantitative results demonstrate our method’s superiority over other state-of-the-art methods on both synthetic and real-world datasets.

[CV-136] SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection

链接: https://arxiv.org/abs/2407.03634
作者: Zongxiang Hu,Zhaosheng Zhang
关键词: Visual anomaly detection, Visual anomaly, limiting scalability, industrial manufacturing, extensive normal datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 9 figures, conference

点击查看摘要

Abstract:Visual anomaly detection is critical in industrial manufacturing, but traditional methods often rely on extensive normal datasets and custom models, limiting scalability. Recent advancements in large-scale visual-language models have significantly improved zero/few-shot anomaly detection. However, these approaches may not fully utilize hierarchical features, potentially missing nuanced details. We introduce a window self-attention mechanism based on the CLIP model, combined with learnable prompts to process multi-level features within a Soldier-Offier Window self-Attention (SOWA) framework. Our method has been tested on five benchmark datasets, demonstrating superior performance by leading in 18 out of 20 metrics compared to existing state-of-the-art techniques.

[CV-137] CLASH: Complementary Learning with Neural Architecture Search for Gait Recognition

链接: https://arxiv.org/abs/2407.03632
作者: Huanzhang Dou,Pengyi Zhang,Yuhan Zhao,Lu Jin,Xi Li
关键词: walking pattern, achieved great success, great success based, walking, walking pattern sensitive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gait recognition, which aims at identifying individuals by their walking patterns, has achieved great success based on silhouette. The binary silhouette sequence encodes the walking pattern within the sparse boundary representation. Therefore, most pixels in the silhouette are under-sensitive to the walking pattern since the sparse boundary lacks dense spatial-temporal information, which is suitable to be represented with dense texture. To enhance the sensitivity to the walking pattern while maintaining the robustness of recognition, we present a Complementary Learning with neural Architecture Search (CLASH) framework, consisting of walking pattern sensitive gait descriptor named dense spatial-temporal field (DSTF) and neural architecture search based complementary learning (NCL). Specifically, DSTF transforms the representation from the sparse binary boundary into the dense distance-based texture, which is sensitive to the walking pattern at the pixel level. Further, NCL presents a task-specific search space for complementary learning, which mutually complements the sensitivity of DSTF and the robustness of the silhouette to represent the walking pattern effectively. Extensive experiments demonstrate the effectiveness of the proposed methods under both in-the-lab and in-the-wild scenarios. On CASIA-B, we achieve rank-1 accuracy of 98.8%, 96.5%, and 89.3% under three conditions. On OU-MVLP, we achieve rank-1 accuracy of 91.9%. Under the latest in-the-wild datasets, we outperform the latest silhouette-based methods by 16.3% and 19.7% on Gait3D and GREW, respectively.

[CV-138] Wood Surface Inspection Using Structural and Conditional Statistical Features

链接: https://arxiv.org/abs/2407.03630
作者: Cem Ünsalan
关键词: extremely important issue, wood surfaces, extremely important, important issue, wood
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:Surface quality is an extremely important issue for wood products in the market. Although quality inspection can be made by a human expert while manufacturing, this operation is prone to errors. One possible solution may be using standard machine vision techniques to automatically detect defects on wood surfaces. Due to the random texture on wood surfaces, this solution is also not possible most of the times. Therefore, more advanced and novel machine vision techniques are needed to automatically inspect wood surfaces. In this study, we propose such a solution based on support region extraction from the gradient magnitude and the Laplacian of Gaussian response of the wood surface image. We introduce novel structural and conditional statistical features using these support regions. Then, we classify different defect types on wood surfaces using our novel features. We tested our automated wood surface inspection system on a large data set and obtained very promising results.

[CV-139] Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes

链接: https://arxiv.org/abs/2407.03623
作者: Yusuke Hirota,Jerone T. A. Andrew,Dora Zhao,Orestis Papakyriakopoulos,Apostolos Modas,Yuta Nakashima,Alice Xiang
关键词: removing spurious correlations, tackle societal bias, tackle societal, image-text datasets, datasets by removing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures protected group independence from all attributes and mitigates inpainting biases through data filtering. Evaluations on multi-label image classification and image captioning tasks show our method effectively reduces bias without compromising performance across various models.

[CV-140] VDMA: Video Question Answering with Dynamically Generated Multi-Agents

链接: https://arxiv.org/abs/2407.03610
作者: Noriyuki Kugo,Tatsuya Ishibashi,Kosuke Ono,Yuji Sato
关键词: EgoSchema Challenge, EgoSchema Challenge aims, Video Question Answering, detailed description, Dynamically Generated
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 2 figures

点击查看摘要

Abstract:This technical report provides a detailed description of our approach to the EgoSchema Challenge 2024. The EgoSchema Challenge aims to identify the most appropriate responses to questions regarding a given video clip. In this paper, we propose Video Question Answering with Dynamically Generated Multi-Agents (VDMA). This method is a complementary approach to existing response generation systems by employing a multi-agent system with dynamically generated expert agents. This method aims to provide the most accurate and contextually appropriate responses. This report details the stages of our approach, the tools employed, and the results of our experiments.

[CV-141] Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations

链接: https://arxiv.org/abs/2407.03604
作者: Zhiyang Xu,Minqian Liu,Ying Shen,Joy Rimchala,Jiaxin Zhang,Qifan Wang,Yu Cheng,Lifu Huang
关键词: Vision-Language Generalists, Recent advancements, capable of understanding, Lateralization LoRA, Recent
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 Pages, visual instruction tuning, parameter-efficient tuning

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have led to the development of Vision-Language Generalists (VLGs) capable of understanding and generating interleaved images and text. Despite these advances, VLGs still struggle to follow user instructions for interleaved text and image generation. To address this issue, we introduce LeafInstruct, the first open-sourced interleaved instruction tuning data with over 30,000 high-quality instances across more than 10 domains. Due to the extensive size of existing VLGs, we opt for parameter-efficient tuning. However, we observe that VLGs tuned with a standard LoRA typically exhibit inferior performance in interleaved text-image generation. We attribute this problem to modality interference and the lack of modality-specialized adaptation design. Hence, we propose Lateralization LoRA, a novel modality-specialized adaptation method inspired by the concept of brain lateralization. Lateralization LoRA employs a hybrid approach, combining the traditional linear LoRA and a Convolutional LoRA for generating text and images, enabling the generation of high-quality text and images by leveraging modality-specific structures and parameter sets. We perform instruction tuning of the VLG (i.e., EMU2) using Lateralization LoRA on the LeafInstruct dataset. Extensive experiments demonstrate that EMU2 tuned with Lateralization LoRA achieve state-of-the-art performance, significantly surpassing baseline models in complex interleaved tasks.

[CV-142] ASteISR: Adapting Single Image Super-resolution Pre-trained Model for Efficient Stereo Image Super-resolution

链接: https://arxiv.org/abs/2407.03598
作者: Yuanbo Zhou,Yuyang Xue,Wei Deng,Xinlin Zhang,Qinquan Gao,Tong Tong
关键词: low-level vision tasks, significant challenges persist, pre-trained SISR model, vision tasks, significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite advances in the paradigm of pre-training then fine-tuning in low-level vision tasks, significant challenges persist particularly regarding the increased size of pre-trained models such as memory usage and training time. Another concern often encountered is the unsatisfying results yielded when directly applying pre-trained single-image models to multi-image domain. In this paper, we propose a efficient method for transferring a pre-trained single-image super-resolution (SISR) transformer network to the domain of stereo image super-resolution (SteISR) through a parameter-efficient fine-tuning (PEFT) method. Specifically, we introduce the concept of stereo adapters and spatial adapters which are incorporated into the pre-trained SISR transformer network. Subsequently, the pre-trained SISR model is frozen, enabling us to fine-tune the adapters using stereo datasets along. By adopting this training method, we enhance the ability of the SISR model to accurately infer stereo images by 0.79dB on the Flickr1024 dataset. This method allows us to train only 4.8% of the original model parameters, achieving state-of-the-art performance on four commonly used SteISR benchmarks. Compared to the more complicated full fine-tuning approach, our method reduces training time and memory consumption by 57% and 15%, respectively.

[CV-143] Self Adaptive Threshold Pseudo-labeling and Unreliable Sample Contrastive Loss for Semi-supervised Image Classification

链接: https://arxiv.org/abs/2407.03596
作者: Xuerong Zhang,Li Huang,Jing Lv,Ming Yang
关键词: attracting blooming attention, combining unlabeled data, unlabeled data, blooming attention, Discarding unlabeled data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICANN24 accepted

点击查看摘要

Abstract:Semi-supervised learning is attracting blooming attention, due to its success in combining unlabeled data. However, pseudo-labeling-based semi-supervised approaches suffer from two problems in image classification: (1) Existing methods might fail to adopt suitable thresholds since they either use a pre-defined/fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. (2) Discarding unlabeled data with confidence below the thresholds results in the loss of discriminating information. To solve these issues, we develop an effective method to make sufficient use of unlabeled data. Specifically, we design a self adaptive threshold pseudo-labeling strategy, which thresholds for each class can be dynamically adjusted to increase the number of reliable samples. Meanwhile, in order to effectively utilise unlabeled data with confidence below the thresholds, we propose an unreliable sample contrastive loss to mine the discriminative information in low-confidence samples by learning the similarities and differences between sample features. We evaluate our method on several classification benchmarks under partially labeled settings and demonstrate its superiority over the other approaches.

[CV-144] UniPlane: Unified Plane Detection and Reconstruction from Posed Monocular Videos

链接: https://arxiv.org/abs/2407.03594
作者: Yuzhong Huang,Chen Liu,Ji Hou,Ke Huo,Shiyu Dong,Fred Morstatter
关键词: posed monocular videos, posed monocular, unifies plane detection, monocular videos, plane detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2206.07710 by other authors

点击查看摘要

Abstract:We present UniPlane, a novel method that unifies plane detection and reconstruction from posed monocular videos. Unlike existing methods that detect planes from local observations and associate them across the video for the final reconstruction, UniPlane unifies both the detection and the reconstruction tasks in a single network, which allows us to directly optimize final reconstruction quality and fully leverage temporal information. Specifically, we build a Transformers-based deep neural network that jointly constructs a 3D feature volume for the environment and estimates a set of per-plane embeddings as queries. UniPlane directly reconstructs the 3D planes by taking dot products between voxel embeddings and the plane embeddings followed by binary thresholding. Extensive experiments on real-world datasets demonstrate that UniPlane outperforms state-of-the-art methods in both plane detection and reconstruction tasks, achieving +4.6 in F-score in geometry as well as consistent improvements in other geometry and segmentation metrics.

[CV-145] Feedback-guided Domain Synthesis with Multi-Source Conditional Diffusion Models for Domain Generalization

链接: https://arxiv.org/abs/2407.03588
作者: Mehrdad Noori,Milad Cheraghalikhani,Ali Bahri,Gustavo Adolfo Vargas Hakim,David Osowiechi,Moslem Yazdanpanah,Ismail Ben Ayed,Christian Desrosiers
关键词: Standard deep learning, deep learning architectures, convolutional neural networks, Standard deep, previously unseen domains
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Standard deep learning architectures such as convolutional neural networks and vision transformers often fail to generalize to previously unseen domains due to the implicit assumption that both source and target data are drawn from independent and identically distributed (i.i.d.) populations. In response, Domain Generalization techniques aim to enhance model robustness by simulating novel data distributions during training, typically through various augmentation or stylization strategies. However, these methods frequently suffer from limited control over the diversity of generated images and lack assurance that these images span distinct distributions. To address these challenges, we propose FDS, a novel strategy that employs diffusion models to synthesize samples from new domains by training on source distribution samples and performing domain mixing. By incorporating images that pose classification challenges to models trained on original samples, alongside the original dataset, we ensure the generation of a training set that spans a broad distribution spectrum. Our comprehensive evaluations demonstrate that this methodology sets new benchmarks in domain generalization performance across a range of challenging datasets, effectively managing diverse types of domain shifts. The implementation is available at: \urlthis https URL.

[CV-146] Vision Mamba for Classification of Breast Ultrasound Images

链接: https://arxiv.org/abs/2407.03552
作者: Ali Nasiri-Sarvi,Mahdi S. Hosseini,Hassan Rivaz
关键词: offer promising performance, promising performance improvements, computer vision tasks, VMamba and Vim, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Mamba-based models, VMamba and Vim, are a recent family of vision encoders that offer promising performance improvements in many computer vision tasks. This paper compares Mamba-based models with traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) using the breast ultrasound BUSI and B datasets. Our evaluation, which includes multiple runs of experiments and statistical significance analysis, demonstrates that Mamba-based architectures frequently outperform CNN and ViT models with statistically significant results. These Mamba-based models effectively capture long-range dependencies while maintaining inductive biases, making them suitable for applications with limited data.

[CV-147] CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

链接: https://arxiv.org/abs/2407.03550
作者: Emanuele Vivoli,Marco Bertini,Dimosthenis Karatzas
关键词: domain is rapidly, rapidly advancing, development of single-page, comic, single-page analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review. Repository link: this https URL

点击查看摘要

Abstract:The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as object detection or text recognition, CoMix addresses a broader range of tasks including object detection, speaker identification, character re-identification, reading order, and multi-modal reasoning tasks like character naming and dialogue generation. Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation. To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books, thereby enriching the diversity of comic styles. CoMix is designed to assess pre-trained models in zero-shot and limited fine-tuning settings, probing their transfer capabilities across different comic styles and tasks. The validation split of the benchmark is publicly available for research purposes, and an evaluation server for the held-out test split is also provided. Comparative results between human performance and state-of-the-art models reveal a significant performance gap, highlighting substantial opportunities for advancements in comic understanding. The dataset, baseline models, and code are accessible at the repository link. This initiative sets a new standard for comprehensive comic analysis, providing the community with a common benchmark for evaluation on a large and varied set.

[CV-148] POSTURE: Pose Guided Unsupervised Domain Adaptation for Human Body Part Segmentation

链接: https://arxiv.org/abs/2407.03549
作者: Arindam Dutta,Rohit Lal,Yash Garg,Calvin-Khang Ta,Dripta S. Raychaudhuri,Hannah Dela Cruz,Amit K. Roy-Chowdhury
关键词: shown promising results, underline, primarily relying, human body part, shown promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing algorithms for human body part segmentation have shown promising results on challenging datasets, primarily relying on end-to-end supervision. However, these algorithms exhibit severe performance drops in the face of domain shifts, leading to inaccurate segmentation masks. To tackle this issue, we introduce POSTURE: \underlinePose Guided Un\underlinesupervised Domain Adap\underlinetation for H\underlineuman Body Pa\underlinert S\underlineegmentation - an innovative pseudo-labelling approach designed to improve segmentation performance on the unlabeled target data. Distinct from conventional domain adaptive methods for general semantic segmentation, POSTURE stands out by considering the underlying structure of the human body and uses anatomical guidance from pose keypoints to drive the adaptation process. This strong inductive prior translates to impressive performance improvements, averaging 8% over existing state-of-the-art domain adaptive semantic segmentation methods across three benchmark datasets. Furthermore, the inherent flexibility of our proposed approach facilitates seamless extension to source-free settings (SF-POSTURE), effectively mitigating potential privacy and computational concerns, with negligible drop in performance.

[CV-149] HiDiff: Hybrid Diffusion Framework for Medical Image Segmentation

链接: https://arxiv.org/abs/2407.03548
作者: Tao Chen,Chenhui Wang,Zhihao Chen,Yiming Lei,Hongming Shan
关键词: underlying data distribution, segmentation, Medical image segmentation, deep learning, significantly advanced
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Transactions on Medical Imaging 2024

点击查看摘要

Abstract:Medical image segmentation has been significantly advanced with the rapid development of deep learning (DL) techniques. Existing DL-based segmentation models are typically discriminative; i.e., they aim to learn a mapping from the input image to segmentation masks. However, these discriminative methods neglect the underlying data distribution and intrinsic class characteristics, suffering from unstable feature space. In this work, we propose to complement discriminative segmentation methods with the knowledge of underlying data distribution from generative models. To that end, we propose a novel hybrid diffusion framework for medical image segmentation, termed HiDiff, which can synergize the strengths of existing discriminative segmentation models and new generative diffusion models. HiDiff comprises two key components: discriminative segmentor and diffusion refiner. First, we utilize any conventional trained segmentation models as discriminative segmentor, which can provide a segmentation mask prior for diffusion refiner. Second, we propose a novel binary Bernoulli diffusion model (BBDM) as the diffusion refiner, which can effectively, efficiently, and interactively refine the segmentation mask by modeling the underlying data distribution. Third, we train the segmentor and BBDM in an alternate-collaborative manner to mutually boost each other. Extensive experimental results on abdomen organ, brain tumor, polyps, and retinal vessels segmentation datasets, covering four widely-used modalities, demonstrate the superior performance of HiDiff over existing medical segmentation algorithms, including the state-of-the-art transformer- and diffusion-based ones. In addition, HiDiff excels at segmenting small objects and generalizing to new datasets. Source codes are made available at this https URL.

[CV-150] Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

链接: https://arxiv.org/abs/2407.03540
作者: Emanuele Vivoli,Irene Campaioli,Mariateresa Nardoni,Niccolò Biondi,Marco Bertini,Dimosthenis Karatzas
关键词: uniquely combine text, Comics Datasets Framework, uniquely combine, real-world visuals, combine text
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MANPU - COMICS workshop at ICDAR

点击查看摘要

Abstract:Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as small datasets, inconsistent annotations, inaccessible model weights, and results that cannot be directly compared due to varying train/test splits and metrics. To address these issues, we aim to standardize annotations across datasets, introduce a variety of comic styles into the datasets, and establish benchmark results with clear, replicable settings. Our proposed Comics Datasets Framework standardizes dataset annotations into a common format and addresses the overrepresentation of manga by introducing Comics100, a curated collection of 100 books from the Digital Comics Museum, annotated for detection in our uniform format. We have benchmarked a variety of detection architectures using the Comics Datasets Framework. All related code, model weights, and detailed evaluation processes are available at this https URL, ensuring transparency and facilitating replication. This initiative is a significant advancement towards improving object detection in comics, laying the groundwork for more complex computational tasks dependent on precise object recognition.

[CV-151] BVI-RLV: A Fully Registered Dataset and Benchmarks for Low-Light Video Enhancement

链接: https://arxiv.org/abs/2407.03535
作者: Ruirui Lin,Nantheera Anantrasirichai,Guoxi Huang,Joanne Lin,Qi Sun,Alexandra Malyugina,David R Bull
关键词: computer vision applications, exhibit spatiotemporal incoherent, spatiotemporal incoherent noise, compromising visibility, vision applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.01970

点击查看摘要

Abstract:Low-light videos often exhibit spatiotemporal incoherent noise, compromising visibility and performance in computer vision applications. One significant challenge in enhancing such content using deep learning is the scarcity of training data. This paper introduces a novel low-light video dataset, consisting of 40 scenes with various motion scenarios under two distinct low-lighting conditions, incorporating genuine noise and temporal artifacts. We provide fully registered ground truth data captured in normal light using a programmable motorized dolly and refine it via an image-based approach for pixel-wise frame alignment across different light levels. We provide benchmarks based on four different technologies: convolutional neural networks, transformers, diffusion models, and state space models (mamba). Our experimental results demonstrate the significance of fully registered video pairs for low-light video enhancement (LLVE) and the comprehensive evaluation shows that the models trained with our dataset outperform those trained with the existing datasets. Our dataset and links to benchmarks are publicly available at this https URL.

[CV-152] Iris and Palmprint Multimodal Biometric Recognition using Novel Preactivated Inverted ResNet and Hybrid Metaheuristic Optimized DenseNet

链接: https://arxiv.org/abs/2407.03498
作者: Indu Singh,Gunbir Singh Baveja,Shruti Khatri,Sunaina Luthra,Tanvi Singh
关键词: witnessed widespread integration, daily life due, Biometric recognition technology, multimodal biometric recognition, biometric recognition system
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Biometric recognition technology has witnessed widespread integration into daily life due to the growing emphasis on information security. In this domain, multimodal biometrics, which combines multiple biometric traits, has overcome limitations found in unimodal systems like susceptibility to spoof attacks or failure to adapt to changes over time. This paper proposes a novel multimodal biometric recognition system that utilizes deep learning algorithms using iris and palmprint modalities. A pioneering approach is introduced, beginning with the implementation of the novel Modified Firefly Algorithm with Lévy Flights (MFALF) to optimize the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm, thereby effectively enhancing image contrast. Subsequently, feature selection is carried out through a unique hybrid of ReliefF and Moth Flame Optimization (MFOR) to extract informative features. For classification, we employ a parallel approach, first introducing a novel Preactivated Inverted ResNet (PIR) architecture, and secondly, harnessing metaheuristics with hybrid of innovative Johnson Flower Pollination Algorithm and Rainfall Optimization Algorithm for fine tuning of the learning rate and dropout parameters of Transfer Learning based DenseNet architecture (JFPA-ROA). Finally, a score-level fusion strategy is implemented to combine the outputs of the two classifiers, providing a robust and accurate multimodal biometric recognition system. The system’s performance is assessed based on accuracy, Detection Error Tradeoff (DET) Curve, Equal Error Rate (EER), and Total Training time. The proposed multimodal recognition architecture, tested across CASIA Palmprint, MMU, BMPD, and IIT datasets, achieves 100% recognition accuracy, outperforming unimodal iris and palmprint identification approaches.

[CV-153] FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning

链接: https://arxiv.org/abs/2407.03489
作者: Saandeep Aathreya,Shaun Canavan
关键词: OOD, increasingly critical, real-world applications, applications of deep, OOD samples
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Identifying Out-of-distribution (OOD) data is becoming increasingly critical as the real-world applications of deep learning methods expand. Post-hoc methods modify softmax scores fine-tuned on outlier data or leverage intermediate feature layers to identify distinctive patterns between In-Distribution (ID) and OOD samples. Other methods focus on employing diverse OOD samples to learn discrepancies between ID and OOD. These techniques, however, are typically dependent on the quality of the outlier samples assumed. Density-based methods explicitly model class-conditioned distributions but this requires long training time or retraining the classifier. To tackle these issues, we introduce \textitFlowCon, a new density-based OOD detection technique. Our main innovation lies in efficiently combining the properties of normalizing flow with supervised contrastive learning, ensuring robust representation learning with tractable density estimation. Empirical evaluation shows the enhanced performance of our method across common vision datasets such as CIFAR-10 and CIFAR-100 pretrained on ResNet18 and WideResNet classifiers. We also perform quantitative analysis using likelihood plots and qualitative visualization using UMAP embeddings and demonstrate the robustness of the proposed method under various OOD contexts. Code will be open-sourced post decision.

[CV-154] Celeb-FBI: A Benchmark Dataset on Human Full Body Images and Age Gender Height and Weight Estimation using Deep Learning Approach

链接: https://arxiv.org/abs/2407.03486
作者: Pronay Debnath,Usafa Akther Rifa,Busra Kamal Rafa,Ali Haider Talukder Akib,Md. Aminur Rahman
关键词: respective fields, scarcity of comprehensive, healthcare poses, challenge for researchers, researchers in exploring
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication in 3rd International Conference on Advanced Communication and Intelligent Systems

点击查看摘要

Abstract:The scarcity of comprehensive datasets in surveillance, identification, image retrieval systems, and healthcare poses a significant challenge for researchers in exploring new methodologies and advancing knowledge in these respective fields. Furthermore, the need for full-body image datasets with detailed attributes like height, weight, age, and gender is particularly significant in areas such as fashion industry analytics, ergonomic design assessment, virtual reality avatar creation, and sports performance analysis. To address this gap, we have created the ‘Celeb-FBI’ dataset which contains 7,211 full-body images of individuals accompanied by detailed information on their height, age, weight, and gender. Following the dataset creation, we proceed with the preprocessing stages, including image cleaning, scaling, and the application of Synthetic Minority Oversampling Technique (SMOTE). Subsequently, utilizing this prepared dataset, we employed three deep learning approaches: Convolutional Neural Network (CNN), 50-layer ResNet, and 16-layer VGG, which are used for estimating height, weight, age, and gender from human full-body images. From the results obtained, ResNet-50 performed best for the system with an accuracy rate of 79.18% for age, 95.43% for gender, 85.60% for height and 81.91% for weight.

[CV-155] Domain-Aware Fine-Tuning of Foundation Models

链接: https://arxiv.org/abs/2407.03482
作者: Ugur Ali Kaplan,Margret Keuper,Anna Khoreva,Dan Zhang,Yumeng Li
关键词: enabling effective learning, Foundation models, enabling effective, effective learning, revolutionized computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICML 2024 Workshop on Foundation Models in the Wild

点击查看摘要

Abstract:Foundation models (FMs) have revolutionized computer vision, enabling effective learning across different domains. However, their performance under domain shift is yet underexplored. This paper investigates the zero-shot domain adaptation potential of FMs by comparing different backbone architectures and introducing novel domain-aware components that leverage domain related textual embeddings. We propose domain adaptive normalization, termed as Domino, which explicitly leverages domain embeddings during fine-tuning, thus making the model domain aware. Ultimately, Domino enables more robust computer vision models that can adapt effectively to various unseen domains.

[CV-156] Learning Action and Reasoning-Centric Image Editing from Videos and Simulations

链接: https://arxiv.org/abs/2407.03471
作者: Benno Krojer,Dheeraj Vattikonda,Luis Lara,Varun Jampani,Eva Portelance,Christopher Pal,Siva Reddy
关键词: require many forms, perform diverse edits, changing attributes, attributes or style, performing actions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to NeurIPS (Dataset Benchmarks)

点击查看摘要

Abstract:An image editing model should be able to perform diverse edits, ranging from object replacement, changing attributes or style, to performing actions or movement, which require many forms of reasoning. Current general instruction-guided editing models have significant shortcomings with action and reasoning-centric edits. Object, attribute or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action and reasoning-centric edits is scarce and has to come from entirely different sources that cover e.g. physical dynamics, temporality and spatial reasoning. To this end, we meticulously curate the AURORA Dataset (Action-Reasoning-Object-Attribute), a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We focus on a key aspect of quality training data: triplets (source image, prompt, target image) contain a single meaningful visual change described by the prompt, i.e., truly minimal changes between source and target images. To demonstrate the value of our dataset, we evaluate an AURORA-finetuned model on a new expert-curated benchmark (AURORA-Bench) covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters. For automatic evaluations, we find important flaws in previous metrics and caution their use for semantically hard editing tasks. Instead, we propose a new automatic metric that focuses on discriminative understanding. We hope that our efforts : (1) curating a quality training dataset and an evaluation benchmark, (2) developing critical evaluations, and (3) releasing a state-of-the-art model, will fuel further progress on general image editing.

[CV-157] Precision at Scale: Domain-Specific Datasets On-Demand

链接: https://arxiv.org/abs/2407.03463
作者: Jesús M Rodríguez-de-Vera,Imanol G Estepa,Ignacio Sarasúa,Bhalaji Nagarajan,Petia Radeva
关键词: self-supervised learning, conventional wisdom, utility of massive, realm of self-supervised, wisdom has gravitated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the realm of self-supervised learning (SSL), conventional wisdom has gravitated towards the utility of massive, general domain datasets for pretraining robust backbones. In this paper, we challenge this idea by exploring if it is possible to bridge the scale between general-domain datasets and (traditionally smaller) domain-specific datasets to reduce the current performance gap. More specifically, we propose Precision at Scale (PaS), a novel method for the autonomous creation of domain-specific datasets on-demand. The modularity of the PaS pipeline enables leveraging state-of-the-art foundational and generative models to create a collection of images of any given size belonging to any given domain with minimal human intervention. Extensive analysis in two complex domains, proves the superiority of PaS datasets over existing traditional domain-specific datasets in terms of diversity, scale, and effectiveness in training visual transformers and convolutional neural networks. Most notably, we prove that automatically generated domain-specific datasets lead to better pretraining than large-scale supervised datasets such as ImageNet-1k and ImageNet-21k. Concretely, models trained on domain-specific datasets constructed by PaS pipeline, beat ImageNet-1k pretrained backbones by at least 12% in all the considered domains and classification tasks and lead to better food domain performance than supervised ImageNet-21k pretrain while being 12 times smaller. Code repository: this https URL

[CV-158] Fisher-aware Quantization for DETR Detectors with Critical-category Objectives

链接: https://arxiv.org/abs/2407.03442
作者: Huanrui Yang,Yafeng Huang,Zhen Dong,Denis A Gudovskiy,Tomoyuki Okuno,Yohei Nakata,Yuan Du,Kurt Keutzer,Shanghang Zhang
关键词: deep learning models, well-studied problem, deep learning, performance, critical categories
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Poster presentation at the 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)

点击查看摘要

Abstract:The impact of quantization on the overall performance of deep learning models is a well-studied problem. However, understanding and mitigating its effects on a more fine-grained level is still lacking, especially for harder tasks such as object detection with both classification and regression objectives. This work defines the performance for a subset of task-critical categories, i.e. the critical-category performance, as a crucial yet largely overlooked fine-grained objective for detection tasks. We analyze the impact of quantization at the category-level granularity, and propose methods to improve performance for the critical categories. Specifically, we find that certain critical categories have a higher sensitivity to quantization, and are prone to overfitting after quantization-aware training (QAT). To explain this, we provide theoretical and empirical links between their performance gaps and the corresponding loss landscapes with the Fisher information framework. Using this evidence, we apply a Fisher-aware mixed-precision quantization scheme, and a Fisher-trace regularization for the QAT on the critical-category loss landscape. The proposed methods improve critical-category metrics of the quantized transformer-based DETR detectors. They are even more significant in case of larger models and higher number of classes where the overfitting becomes more severe. For example, our methods lead to 10.4% and 14.5% mAP gains for, correspondingly, 4-bit DETR-R50 and Deformable DETR on the most impacted critical classes in the COCO Panoptic dataset.

[CV-159] DACB-Net: Dual Attention Guided Compact Bilinear Convolution Neural Network for Skin Disease Classification

链接: https://arxiv.org/abs/2407.03439
作者: Belal Ahmad,Mohd Usama,Tanvir Ahmad,Adnan Saeed,Shabnam Khatoon,Min Chen
关键词: Dual Attention-Guided Compact, Attention-Guided Compact Bilinear, Compact Bilinear CNN, three-branch Dual Attention-Guided, Compact Bilinear
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 18 figures, 6 tables

点击查看摘要

Abstract:This paper introduces the three-branch Dual Attention-Guided Compact Bilinear CNN (DACB-Net) by focusing on learning from disease-specific regions to enhance accuracy and alignment. A global branch compensates for lost discriminative features, generating Attention Heat Maps (AHM) for relevant cropped regions. Finally, the last pooling layers of global and local branches are concatenated for fine-tuning, which offers a comprehensive solution to the challenges posed by skin disease diagnosis. Although current CNNs employ Stochastic Gradient Descent (SGD) for discriminative feature learning, using distinct pairs of local image patches to compute gradients and incorporating a modulation factor in the loss for focusing on complex data during training. However, this approach can lead to dataset imbalance, weight adjustments, and vulnerability to overfitting. The proposed solution combines two supervision branches and a novel loss function to address these issues, enhancing performance and interpretability. The framework integrates data augmentation, transfer learning, and fine-tuning to tackle data imbalance to improve classification performance, and reduce computational costs. Simulations on the HAM10000 and ISIC2019 datasets demonstrate the effectiveness of this approach, showcasing a 2.59% increase in accuracy compared to the state-of-the-art.

[CV-160] Lift Splat Map: Lifting Foundation Masks for Label-Free Semantic Scene Completion

链接: https://arxiv.org/abs/2407.03425
作者: Arthur Zhang,Rainier Heijne,Joydeep Biswas
关键词: Autonomous mobile robots, Autonomous mobile, mobile robots deployed, robust to occlusions, semantic scene completion
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 17 pages, 6 figures, 2 Tables

点击查看摘要

Abstract:Autonomous mobile robots deployed in urban environments must be context-aware, i.e., able to distinguish between different semantic entities, and robust to occlusions. Current approaches like semantic scene completion (SSC) require pre-enumerating the set of classes and costly human annotations, while representation learning methods relax these assumptions but are not robust to occlusions and learn representations tailored towards auxiliary tasks. To address these limitations, we propose LSMap, a method that lifts masks from visual foundation models to predict a continuous, open-set semantic and elevation-aware representation in bird’s eye view (BEV) for the entire scene, including regions underneath dynamic entities and in occluded areas. Our model only requires a single RGBD image, does not require human labels, and operates in real time. We quantitatively demonstrate our approach outperforms existing models trained from scratch on semantic and elevation scene completion tasks with finetuning. Furthermore, we show that our pre-trained representation outperforms existing visual foundation models at unsupervised semantic scene completion. We evaluate our approach using CODa, a large-scale, real-world urban robot dataset. Supplementary visualizations, code, data, and pre-trained models, will be publicly available soon.

[CV-161] HEMM: Holistic Evaluation of Multimodal Foundation Models

链接: https://arxiv.org/abs/2407.03418
作者: Paul Pu Liang,Akshay Goindani,Talha Chafekar,Leena Mathur,Haofei Yu,Ruslan Salakhutdinov,Louis-Philippe Morency
关键词: Multimodal foundation models, text alongside images, holistically process text, process text alongside, Multimodal foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code available at this https URL

点击查看摘要

Abstract:Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today’s models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impacts of instruction tuning yield actionable insights for future work in multimodal foundation models.

[CV-162] Visual Robustness Benchmark for Visual Question Answering (VQA)

链接: https://arxiv.org/abs/2407.03386
作者: Md Farhan Ishmam,Ishmam Tashdeed,Talukder Asir Saadat,Md Hamjajul Ashmafee,Dr. Abu Raihan Mostofa Kamal,Dr. Md. Azam Hossain
关键词: Visual Question Answering, Question Answering, Visual Question, systems perform, real world
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects e.g. image blur, which can be detrimental in sensitive applications, such as medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness with the visual corruptions. Our benchmark highlights the need for a balanced approach in model development that considers model performance without compromising the robustness.

[CV-163] Jacobi Set Simplification for Tracking Topological Features in Time-Varying Scalar Fields

链接: https://arxiv.org/abs/2407.03348
作者: Dhruv Meduri,Mohit Sharma,Vijay Natarajan
关键词:
类目: Numerical Analysis (math.NA); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

[CV-164] Dual-Domain Deep D-bar Method for Solving Electrical Impedance Tomography

链接: https://arxiv.org/abs/2407.03335
作者: Xiang Cao,Qiaoqiao Ding,Xiaoqun Zhang
关键词: Electrical Impedance Tomography, solving Electrical Impedance, Impedance Tomography, Electrical Impedance, regularized D-bar method
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:The regularized D-bar method is one of the most prominent methods for solving Electrical Impedance Tomography (EIT) problems due to its efficiency and simplicity. It provides a direct approach by applying low-pass filtering to the scattering data in the non-linear Fourier domain, thereby yielding a smoothed conductivity approximation. However, D-bar images often present low contrast and low resolution due to the absence of accurate high-frequency information and ill-posedness of the problem. In this paper, we proposed a dual-domain neural network architecture to retrieve high-contrast D-bar image sequences from low-contrast D-bar images. To further accentuate the spatial features of the conductivity distribution, the widely adopted U-net has been tailored for conductivity image calibration from the predicted D-bar image sequences. We call such a hybrid approach by Dual-Domain Deep D-bar method due to the consideration of both scattering data and image information. Compared to the single-scale structure, our proposed multi-scale structure exhibits superior capabilities in reducing artifacts and refining conductivity approximation. Additionally, solving discrete D-bar systems using the GMRES algorithm entails significant computational complexity, which is extremely time-consuming on CPU-based devices. To remedy this, we designed a surrogate GPU-based Richardson iterative method to accelerate the data enhancement process by D-bar. Numerical results are presented for simulated EIT data from the KIT4 and ACT4 systems to demonstrate notable improvements in absolute EIT imaging quality when compared to existing methodologies.

[CV-165] DDPM-MoCo: Advancing Industrial Surface Defect Generation and Detection with Generative and Contrastive Learning

链接: https://arxiv.org/abs/2407.03332
作者: Yangfan He,Xinyan Wang,Tianyu Shi
关键词: industrial detection based, effective data samples, convenient model training, obtaining sufficient, Denoising Diffusion Probabilistic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The task of industrial detection based on deep learning often involves solving two problems: (1) obtaining sufficient and effective data samples, (2) and using efficient and convenient model training methods. In this paper, we introduce a novel defect-generation method, named DDPM-MoCo, to address these issues. Firstly, we utilize the Denoising Diffusion Probabilistic Model (DDPM) to generate high-quality defect data samples, overcoming the problem of insufficient sample data for model learning. Furthermore, we utilize the unsupervised learning Momentum Contrast model (MoCo) with an enhanced batch contrastive loss function for training the model on unlabeled data, addressing the efficiency and consistency challenges in large-scale negative sample encoding during diffusion model training. The experimental results showcase an enhanced visual detection method for identifying defects on metal surfaces, covering the entire process, starting from generating unlabeled sample data for training the diffusion model, to utilizing the same labeled sample data for downstream detection tasks. This study offers valuable practical insights and application potential for visual detection in the metal processing industry.

[CV-166] Anole: Adapting Diverse Compressed Models For Cross-Scene Prediction On Mobile Devices

链接: https://arxiv.org/abs/2407.03331
作者: Yunzhe Li,Hongzi Zhu,Zhuohong Deng,Yunlong Cheng,Liang Zhang,Shan Chang,Minyi Guo
关键词:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

[CV-167] Efficient Visibility Approximation for Game AI using Neural Omnidirectional Distance Fields

链接: https://arxiv.org/abs/2407.03330
作者: Zhi Ying,Nicholas Edwards,Mikhail Kutuzov
关键词: Omnidirectional Distance Fields, real-time systems, information is critical, computational cost, Omnidirectional Distance
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: I3D 2024

点击查看摘要

Abstract:Visibility information is critical in game AI applications, but the computational cost of raycasting-based methods poses a challenge for real-time systems. To address this challenge, we propose a novel method that represents a partitioned game scene as neural Omnidirectional Distance Fields (ODFs), allowing scalable and efficient visibility approximation between positions without raycasting. For each position of interest, we map its omnidirectional distance data from the spherical surface onto a UV plane. We then use multi-resolution grids and bilinearly interpolated features to encode directions. This allows us to use a compact multi-layer perceptron (MLP) to reconstruct the high-frequency directional distance data at these positions, ensuring fast inference speed. We demonstrate the effectiveness of our method through offline experiments and in-game evaluation. For in-game evaluation, we conduct a side-by-side comparison with raycasting-based visibility tests in three different scenes. Using a compact MLP (128 neurons and 2 layers), our method achieves an average cold start speedup of 9.35 times and warm start speedup of 4.8 times across these scenes. In addition, unlike the raycasting-based method, whose evaluation time is affected by the characteristics of the scenes, our method’s evaluation time remains constant.

[CV-168] Embracing Massive Medical Data

链接: https://arxiv.org/abs/2407.04687
作者: Yu-Cheng Chou,Zongwei Zhou,Alan Yuille
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MICCAI 2024

点击查看摘要

[CV-169] Efficient Betti Matching Enables Topology-Aware 3D Segmentation via Persistent Homology

链接: https://arxiv.org/abs/2407.04683
作者: Nico Stucki,Vincent Bürgin,Johannes C. Paetzold,Ulrich Bauer
关键词:
类目: Algebraic Topology (math.AT); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-170] Few-Shot Airway-Tree Modeling using Data-Driven Sparse Priors

链接: https://arxiv.org/abs/2407.04507
作者: Ali Keshavarzi,Elsa Angelini
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at 21st IEEE International Symposium on Biomedical Imaging (ISBI)

点击查看摘要

[CV-171] Hard-Attention Gates with Gradient Routing for Endoscopic Image Computing

链接: https://arxiv.org/abs/2407.04400
作者: Giorgio Roffo,Carlo Biffi,Pietro Salvagnini,Andrea Cherubini
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Attention Gates, Hard-Attention Gates, Gradient Routing, Feature Selection Gates, Endoscopy, Medical Image Processing, Computer Vision

点击查看摘要

[CV-172] Segmenting Medical Images: From UNet to Res-UNet and nnUNet

链接: https://arxiv.org/abs/2407.04353
作者: Lina Huang,Alina Miron,Kate Hone,Yongmin Li
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 3 figures

点击查看摘要

[CV-173] Measurement Embedded Schr"odinger Bridge for Inverse Problems

链接: https://arxiv.org/abs/2407.04162
作者: Yuang Wang,Pengfei Jin,Siyeop Yoon,Matthew Tivnan,Quanzheng Li,Li Zhang,Dufan Wu
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 2 figures, Neurips preprint

点击查看摘要

[CV-174] Autoencoded Image Compression for Secure and Fast Transmission

链接: https://arxiv.org/abs/2407.03990
作者: Aryan Kashyap Naveen,Sunil Thunga,Anuhya Murki,Mahati A Kalale,Shriya Anil
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 7 figures

点击查看摘要

[CV-175] LeDNet: Localization-enabled Deep Neural Network for Multi-Label Radiography Image Classification

链接: https://arxiv.org/abs/2407.03931
作者: Lalit Pant,Shubham Arora
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 7 figures

点击查看摘要

[CV-176] Unsupervised Analysis of Alzheimers Disease Signatures using 3D Deformable Autoencoders

链接: https://arxiv.org/abs/2407.03863
作者: Mehmet Yigit Avci,Emily Chan,Veronika Zimmer,Daniel Rueckert,Benedikt Wiestler,Julia A. Schnabel,Cosmin I. Bercea
关键词:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures

点击查看摘要

[CV-177] CardioSpectrum: Comprehensive Myocardium Motion Analysis with 3D Deep Learning and Geometric Insights

链接: https://arxiv.org/abs/2407.03794
作者: Shahar Zuler,Shai Tejman-Yarden,Dan Raviv
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been early accepted to MICCAI 2024

点击查看摘要

[CV-178] CS3: Cascade SAM for Sperm Segmentation

链接: https://arxiv.org/abs/2407.03772
作者: Yi Shi,Xu-Peng Tian,Yun-Kai Wang,Tie-Yi Zhang,Bin Yao,Hui Wang,Yong Shao,Cen-Cen Wang,Rong Zeng,De-Chuan Zhan
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

[CV-179] HyperSpace: Hypernetworks for spacing-adaptive image segmentation

链接: https://arxiv.org/abs/2407.03681
作者: Samuel Joutard,Maximilian Pietsch,Raphael Prevost
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024

点击查看摘要

[CV-180] Pathological Semantics-Preserving Learning for HE-to-IHC Virtual Staining

链接: https://arxiv.org/abs/2407.03655
作者: Fuqiang Chen,Ranran Zhang,Boyun Zheng,Yiwen Sun,Jiahui He,Wenjian Qin
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-181] Orthogonal Constrained Minimization with Tensor ell_2p Regularization for HSI Denoising and Destriping

链接: https://arxiv.org/abs/2407.03605
作者: Xiaoxia Liu,Shijie Yu,Jian Lu,Xiaojun Chen
关键词:
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-182] DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

链接: https://arxiv.org/abs/2407.03575
作者: Wenhui Zhu,Xiwen Chen,Peijie Qiu,Aristeidis Sotiras,Abolfazl Razi,Yalin Wang
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

[CV-183] Probing Perfection: The Relentless Art of Meddling for Pulmonary Airway Segmentation from HRCT via a Human-AI Collaboration Based Active Learning Method

链接: https://arxiv.org/abs/2407.03542
作者: Shiyi Wang,Yang Nan,Sheng Zhang,Federico Felder,Xiaodan Xing,Yingying Fang,Javier Del Ser,Simon L F Walsh,Guang Yang
关键词:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

机器学习

[LG-0] Me Myself and AI: The Situational Awareness Dataset (SAD) for LLMs

链接: https://arxiv.org/abs/2407.04694
作者: Rudolf Laine,Bilal Chughtai,Jan Betley,Kaivalya Hariharan,Jeremy Scheurer,Mikita Balesni,Marius Hobbhahn,Alexander Meinke,Owain Evans
关键词: situational awareness, ChatGPT are trained, trained to respond, respond to users, SAD
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 page main body, 98 page appendix, 58 figures

点击查看摘要

Abstract:AI assistants such as ChatGPT are trained to respond to users by saying, “I am a large language model”. This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model’s knowledge of itself and its circumstances as situational awareness. To quantify situational awareness in LLMs, we introduce a range of behavioral tests, based on question answering and instruction following. These tests form the \textbfSituational Awareness Dataset (SAD) , a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. We evaluate 16 LLMs on SAD, including both base (pretrained) and chat models. While all models perform better than chance, even the highest-scoring model (Claude 3 Opus) is far from a human baseline on certain tasks. We also observe that performance on SAD is only partially predicted by metrics of general knowledge (e.g. MMLU). Chat models, which are finetuned to serve as AI assistants, outperform their corresponding base models on SAD but not on general knowledge tasks. The purpose of SAD is to facilitate scientific understanding of situational awareness in LLMs by breaking it down into quantitative abilities. Situational awareness is important because it enhances a model’s capacity for autonomous planning and action. While this has potential benefits for automation, it also introduces novel risks related to AI safety and control. Code and latest results available at this https URL . Comments: 11 page main body, 98 page appendix, 58 figures Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2407.04694 [cs.CL] (or arXiv:2407.04694v1 [cs.CL] for this version)

[LG-1] Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks

链接: https://arxiv.org/abs/2407.04690
作者: Aaron Mueller
关键词: causality for granted, Interpretability research, Abstract, counterfactual, counterfactual theories
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Interpretability research takes counterfactual theories of causality for granted. Most causal methods rely on counterfactual interventions to inputs or the activations of particular model components, followed by observations of the change in models’ output logits or behaviors. While this yields more faithful evidence than correlational methods, counterfactuals nonetheless have key problems that bias our findings in specific and predictable ways. Specifically, (i) counterfactual theories do not effectively capture multiple independently sufficient causes of the same effect, which leads us to miss certain causes entirely; and (ii) counterfactual dependencies in neural networks are generally not transitive, which complicates methods for extracting and interpreting causal graphs from neural networks. We discuss the implications of these challenges for interpretability researchers and propose concrete suggestions for future work.

[LG-2] Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

链接: https://arxiv.org/abs/2407.04681
作者: Yuanze Lin,Yunsheng Li,Dongdong Chen,Weijian Xu,Ronald Clark,Philip Torr,Lu Yuan
关键词: multimodal large language, high-quality image-text datasets, made significant strides, vast high-quality image-text, generally understand images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring an understanding of detailed or localized visual elements. Drawing inspiration from the Retrieval-Augmented Generation (RAG) concept, this paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models (e.g., instance segmentation/OCR models), into MLLMs. This is a promising yet underexplored direction for enhancing MLLMs’ performance. Our approach diverges from concurrent works, which transform external knowledge into additional text prompts, necessitating the model to indirectly learn the correspondence between visual content and text coordinates. Instead, we propose embedding fine-grained knowledge information directly into a spatial embedding map as a visual prompt. This design can be effortlessly incorporated into various MLLMs, such as LLaVA and Mipha, considerably improving their visual understanding performance. Through rigorous experiments, we demonstrate that our method can enhance MLLM performance across nine benchmarks, amplifying their fine-grained context-aware capabilities.

[LG-3] XQSV: A Structurally Variable Network to Imitate Human Play in Xiangqi

链接: https://arxiv.org/abs/2407.04678
作者: Chenliang Zhou
关键词: Xiangqi Structurally Variable, termed Xiangqi Structurally, Structurally Variable, Chinese Chess, deep learning architecture
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we introduce an innovative deep learning architecture, termed Xiangqi Structurally Variable (XQSV), designed to emulate the behavioral patterns of human players in Xiangqi, or Chinese Chess. The unique attribute of XQSV is its capacity to alter its structural configuration dynamically, optimizing performance for the task based on the particular subset of data on which it is trained. We have incorporated several design improvements to significantly enhance the network’s predictive accuracy, including a local illegal move filter, an Elo range partitioning, a sequential one-dimensional input, and a simulation of imperfect memory capacity. Empirical evaluations reveal that XQSV attains a predictive accuracy of approximately 40%, with its performance peaking within the trained Elo range. This indicates the model’s success in mimicking the play behavior of individuals within that specific range. A three-terminal Turing Test was employed to demonstrate that the XQSV model imitates human behavior more accurately than conventional Xiangqi engines, rendering it indistinguishable from actual human opponents. Given the inherent nondeterminism in human gameplay, we propose two supplementary relaxed evaluation metrics. To our knowledge, XQSV represents the first model to mimic Xiangqi players.

[LG-4] Unsupervised 4D Cardiac Motion Tracking with Spatiotemporal Optical Flow Networks

链接: https://arxiv.org/abs/2407.04663
作者: Long Teng,Wei Feng,Menglong Zhu,Xinchao Li
关键词: quantify myocardial motion, motion tracking, motion, quantify myocardial, cardiac cycle
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cardiac motion tracking from echocardiography can be used to estimate and quantify myocardial motion within a cardiac cycle. It is a cost-efficient and effective approach for assessing myocardial function. However, ultrasound imaging has the inherent characteristics of spatially low resolution and temporally random noise, which leads to difficulties in obtaining reliable annotation. Thus it is difficult to perform supervised learning for motion tracking. In addition, there is no end-to-end unsupervised method currently in the literature. This paper presents a motion tracking method where unsupervised optical flow networks are designed with spatial reconstruction loss and temporal-consistency loss. Our proposed loss functions make use of the pair-wise and temporal correlation to estimate cardiac motion from noisy background. Experiments using a synthetic 4D echocardiography dataset has shown the effectiveness of our approach, and its superiority over existing methods on both accuracy and running speed. To the best of our knowledge, this is the first work performed that uses unsupervised end-to-end deep learning optical flow network for 4D cardiac motion tracking.

[LG-5] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

链接: https://arxiv.org/abs/2407.04656
作者: Yongji Wu,Wenjie Qu,Tianyang Tao,Zhuang Wang,Wei Bai,Zhuohao Li,Yuan Tian,Jiaheng Zhang,Matthew Lentz,Danyang Zhuo
关键词: scale large language, large language models, large language, sub-linear scaling, scaling for computation
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs) due to its sub-linear scaling for computation costs. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant, as all GPUs need to wait idle until the failure is resolved, potentially losing considerable training progress as training has to restart from checkpoints. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speeds-up training, while a provably optimal expert placement algorithm is developed to maximize the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2407.04656 [cs.DC] (or arXiv:2407.04656v1 [cs.DC] for this version)

[LG-6] On scalable oversight with weak LLMs judging strong LLMs

链接: https://arxiv.org/abs/2407.04622
作者: Zachary Kenton,Noah Y. Siegel,János Kramár,Jonah Brown-Cohen,Samuel Albanie,Jannis Bulian,Rishabh Agarwal,David Lindner,Yunhao Tang,Noah D. Goodman,Rohin Shah
关键词: Scalable oversight protocols, oversight protocols aim, accurately supervise superhuman, Scalable oversight, oversight protocols
类目: Machine Learning (cs.LG)
*备注: 15 pages (53 including appendices)

点击查看摘要

Abstract:Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI’s compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.

[LG-7] Learning to (Learn at Test Time): RNNs with Expressive Hidden States

链接: https://arxiv.org/abs/2407.04620
作者: Yu Sun,Xinhao Li,Karan Dalal,Jiarui Xu,Arjun Vikram,Genghan Zhang,Yann Dubois,Xinlei Chen,Xiaolong Wang,Sanmi Koyejo,Tatsunori Hashimoto,Carlos Guestrin
关键词: Self-attention performs, hidden state, long context, Self-attention, state
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

[LG-8] Randomized Physics-Informed Neural Networks for Bayesian Data Assimilation

链接: https://arxiv.org/abs/2407.04617
作者: Yifei Zong,David Barajas-Solano,Alexandre M. Tartakovsky
关键词: inverse PDE PINN, physics-informed neural network, randomized physics-informed neural, PDE PINN solutions, inverse partial differential
类目: Machine Learning (cs.LG)
*备注: 38 pages, 8 figures

点击查看摘要

Abstract:We propose a randomized physics-informed neural network (PINN) or rPINN method for uncertainty quantification in inverse partial differential equation (PDE) problems with noisy data. This method is used to quantify uncertainty in the inverse PDE PINN solutions. Recently, the Bayesian PINN (BPINN) method was proposed, where the posterior distribution of the PINN parameters was formulated using the Bayes’ theorem and sampled using approximate inference methods such as the Hamiltonian Monte Carlo (HMC) and variational inference (VI) methods. In this work, we demonstrate that HMC fails to converge for non-linear inverse PDE problems. As an alternative to HMC, we sample the distribution by solving the stochastic optimization problem obtained by randomizing the PINN loss function. The effectiveness of the rPINN method is tested for linear and non-linear Poisson equations, and the diffusion equation with a high-dimensional space-dependent diffusion coefficient. The rPINN method provides informative distributions for all considered problems. For the linear Poisson equation, HMC and rPINN produce similar distributions, but rPINN is on average 27 times faster than HMC. For the non-linear Poison and diffusion equations, the HMC method fails to converge because a single HMC chain cannot sample multiple modes of the posterior distribution of the PINN parameters in a reasonable amount of time.

[LG-9] Isomorphic Pruning for Vision Models

链接: https://arxiv.org/abs/2407.04616
作者: Gongfan Fang,Xinyin Ma,Michael Bi Mi,Xinchao Wang
关键词: Structured pruning reduces, deep neural networks, Structured pruning, removing redundant sub-structures, overhead of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures. However, assessing the relative importance of different sub-structures remains a significant challenge, particularly in advanced vision models featuring novel mechanisms and architectures like self-attention, depth-wise convolutions, or residual connections. These heterogeneous substructures usually exhibit diverged parameter scales, weight distributions, and computational topology, introducing considerable difficulty to importance comparison. To overcome this, we present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures such as Vision Transformers and CNNs, and delivers competitive performance across different model sizes. Isomorphic Pruning originates from an observation that, when evaluated under a pre-defined importance criterion, heterogeneous sub-structures demonstrate significant divergence in their importance distribution, as opposed to isomorphic structures that present similar importance patterns. This inspires us to perform isolated ranking and comparison on different types of sub-structures for more reliable pruning. Our empirical results on ImageNet-1K demonstrate that Isomorphic Pruning surpasses several pruning baselines dedicatedly designed for Transformers or CNNs. For instance, we improve the accuracy of DeiT-Tiny from 74.52% to 77.50% by pruning an off-the-shelf DeiT-Base model. And for ConvNext-Tiny, we enhanced performance from 82.06% to 82.18%, while reducing the number of parameters and memory usage. Code is available at \urlthis https URL.

[LG-10] Understanding the Gains from Repeated Self-Distillation

链接: https://arxiv.org/abs/2407.04600
作者: Divyansh Pareek,Simon S. Du,Sewoong Oh
关键词: special type, type of knowledge, knowledge distillation, Self-Distillation, student model
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 10 figures

点击查看摘要

Abstract:Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as d , where d is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model’s risk (MSE) by up to 47%.

[LG-11] Proximal Point Method for Online Saddle Point Problem

链接: https://arxiv.org/abs/2407.04591
作者: Qing-xin Meng,Jian-wei Liu
关键词: time-varying convex-concave games, two-player time-varying convex-concave, saddle point problem, Nash equilibrium regret, online saddle point
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper focuses on the online saddle point problem, which involves a sequence of two-player time-varying convex-concave games. Considering the nonstationarity of the environment, we adopt the duality gap and the dynamic Nash equilibrium regret as performance metrics for algorithm design. We present three variants of the proximal point method: the Online Proximal Point Method~(OPPM), the Optimistic OPPM~(OptOPPM), and the OptOPPM with multiple predictors. Each algorithm guarantees upper bounds for both the duality gap and dynamic Nash equilibrium regret, achieving near-optimality when measured against the duality gap. Specifically, in certain benign environments, such as sequences of stationary payoff functions, these algorithms maintain a nearly constant metric bound. Experimental results further validate the effectiveness of these algorithms. Lastly, this paper discusses potential reliability concerns associated with using dynamic Nash equilibrium regret as a performance metric.

[LG-12] Remembering Everything Makes You Vulnerable: A Limelight on Machine Unlearning for Personalized Healthcare Sector

链接: https://arxiv.org/abs/2407.04589
作者: Ahan Chatterjee,Sai Anirudh Aryasomayajula,Rajat Chaudhari,Subhajit Paul,Vishwa Mohan Singh
关键词: Machine Unlearning, adversarial attacks, continues to rise, increasingly paramount, prevalence of data-driven
类目: Machine Learning (cs.LG)
*备注: 15 Pages, Exploring unlearning techniques on ECG Classifier

点击查看摘要

Abstract:As the prevalence of data-driven technologies in healthcare continues to rise, concerns regarding data privacy and security become increasingly paramount. This thesis aims to address the vulnerability of personalized healthcare models, particularly in the context of ECG monitoring, to adversarial attacks that compromise patient privacy. We propose an approach termed “Machine Unlearning” to mitigate the impact of exposed data points on machine learning models, thereby enhancing model robustness against adversarial attacks while preserving individual privacy. Specifically, we investigate the efficacy of Machine Unlearning in the context of personalized ECG monitoring, utilizing a dataset of clinical ECG recordings. Our methodology involves training a deep neural classifier on ECG data and fine-tuning the model for individual patients. We demonstrate the susceptibility of fine-tuned models to adversarial attacks, such as the Fast Gradient Sign Method (FGSM), which can exploit additional data points in personalized models. To address this vulnerability, we propose a Machine Unlearning algorithm that selectively removes sensitive data points from fine-tuned models, effectively enhancing model resilience against adversarial manipulation. Experimental results demonstrate the effectiveness of our approach in mitigating the impact of adversarial attacks while maintaining the pre-trained model accuracy.

[LG-13] Multimodal Classification via Modal-Aware Interactive Enhancement

链接: https://arxiv.org/abs/2407.04587
作者: Qing-Yuan Jiang,Zhouyang Chi,Yang Yang
关键词: modality imbalance problem, notorious modality imbalance, achieve satisfactory performance, multimodal learning, imbalance problem
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Due to the notorious modality imbalance problem, multimodal learning (MML) leads to the phenomenon of optimization imbalance, thus struggling to achieve satisfactory performance. Recently, some representative methods have been proposed to boost the performance, mainly focusing on adaptive adjusting the optimization of each modality to rebalance the learning speed of dominant and non-dominant modalities. To better facilitate the interaction of model information in multimodal learning, in this paper, we propose a novel multimodal learning method, called modal-aware interactive enhancement (MIE). Specifically, we first utilize an optimization strategy based on sharpness aware minimization (SAM) to smooth the learning objective during the forward phase. Then, with the help of the geometry property of SAM, we propose a gradient modification strategy to impose the influence between different modalities during the backward phase. Therefore, we can improve the generalization ability and alleviate the modality forgetting phenomenon simultaneously for multimodal learning. Extensive experiments on widely used datasets demonstrate that our proposed method can outperform various state-of-the-art baselines to achieve the best performance.

[LG-14] Leveraging Large Language Models for Integrated Satellite-Aerial-Terrestrial Networks: Recent Advances and Future Directions

链接: https://arxiv.org/abs/2407.04581
作者: Shumaila Javaid,Ruhul Amin Khalil,Nasir Saeed,Bin He,Mohamed-Slim Alouini
关键词: Large Language Models, Integrated satellite, advanced Artificial Intelligence, diverse communication technologies, integrating Large Language
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Integrated satellite, aerial, and terrestrial networks (ISATNs) represent a sophisticated convergence of diverse communication technologies to ensure seamless connectivity across different altitudes and platforms. This paper explores the transformative potential of integrating Large Language Models (LLMs) into ISATNs, leveraging advanced Artificial Intelligence (AI) and Machine Learning (ML) capabilities to enhance these networks. We outline the current architecture of ISATNs and highlight the significant role LLMs can play in optimizing data flow, signal processing, and network management to advance 5G/6G communication technologies through advanced predictive algorithms and real-time decision-making. A comprehensive analysis of ISATN components is conducted, assessing how LLMs can effectively address traditional data transmission and processing bottlenecks. The paper delves into the network management challenges within ISATNs, emphasizing the necessity for sophisticated resource allocation strategies, traffic routing, and security management to ensure seamless connectivity and optimal performance under varying conditions. Furthermore, we examine the technical challenges and limitations associated with integrating LLMs into ISATNs, such as data integration for LLM processing, scalability issues, latency in decision-making processes, and the design of robust, fault-tolerant systems. The study also identifies key future research directions for fully harnessing LLM capabilities in ISATNs, which is crucial for enhancing network reliability, optimizing performance, and achieving a truly interconnected and intelligent global network system.

[LG-15] GOALPlace: Begin with the End in Mind

链接: https://arxiv.org/abs/2407.04579
作者: Anthony Agnesina,Rongjian Liang,Geraldo Pradipta,Anand Rajaram,Haoxing Ren
关键词: Co-optimizing placement, Co-optimizing, achieving high-quality designs, empirical Bayes, achieving high-quality
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, preprint

点击查看摘要

Abstract:Co-optimizing placement with congestion is integral to achieving high-quality designs. This paper presents GOALPlace, a new learning-based general approach to improving placement congestion by controlling cell density. Our method efficiently learns from an EDA tool’s post-route optimized results and uses an empirical Bayes technique to adapt this goal/target to a specific placer’s solutions, effectively beginning with the end in mind. It enhances correlation with the long-running heuristics of the tool’s router and timing-opt engine – while solving placement globally without expensive incremental congestion estimation and mitigation methods. A statistical analysis with a new hierarchical netlist clustering establishes the importance of density and the potential for an adequate cell density target across placements. Our experiments show that our method, integrated as a demonstration inside an academic GPU-accelerated global placer, consistently produces macro and standard cell placements of superior or comparable quality to commercial tools. Our empirical Bayes methodology also allows a substantial quality improvement over state-of-the-art academic mixed-size placers, achieving up to 10x fewer design rule check (DRC) violations, a 5% decrease in wirelength, and a 30% and 60% reduction in worst and total negative slack (WNS/TNS).

[LG-16] Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence Grounding and Repetition

链接: https://arxiv.org/abs/2407.04559
作者: Aditya K Surikuchi,Raquel Fernández,Sandro Pezzelle
关键词: temporally ordered sequence, Visual storytelling consists, sequence of images, consists in generating, generating a natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story ‘good’. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a ‘good’ story may require more than a human-like level of visual grounding, coherence, and repetition.

[LG-17] An AI Architecture with the Capability to Classify and Explain Hardware Trojans

链接: https://arxiv.org/abs/2407.04551
作者: Paul Whitten,Francis Wolff,Chris Papachristou
关键词: identify suspected circuits, Hardware trojan detection, trojan detection methods, machine learning, identify suspected
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hardware trojan detection methods, based on machine learning (ML) techniques, mainly identify suspected circuits but lack the ability to explain how the decision was arrived at. An explainable methodology and architecture is introduced based on the existing hardware trojan detection features. Results are provided for explaining digital hardware trojans within a netlist using trust-hub trojan benchmarks.

[LG-18] Real-time Timbre Remapping with Differentiable DSP

链接: https://arxiv.org/abs/2407.04547
作者: Jordie Shier,Charalampos Saitis,Andrew Robertson,Andrew McPherson
关键词: diverse musical contexts, primary mode, timbral expression, expression, musical contexts
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Accepted for publication at the 24th International Conference on New Interfaces for Musical Expression in Utrecht, Netherlands

点击查看摘要

Abstract:Timbre is a primary mode of expression in diverse musical contexts. However, prevalent audio-driven synthesis methods predominantly rely on pitch and loudness envelopes, effectively flattening timbral expression from the input. Our approach draws on the concept of timbre analogies and investigates how timbral expression from an input signal can be mapped onto controls for a synthesizer. Leveraging differentiable digital signal processing, our method facilitates direct optimization of synthesizer parameters through a novel feature difference loss. This loss function, designed to learn relative timbral differences between musical events, prioritizes the subtleties of graded timbre modulations within phrases, allowing for meaningful translations in a timbre space. Using snare drum performances as a case study, where timbral expression is central, we demonstrate real-time timbre remapping from acoustic snare drums to a differentiable synthesizer modeled after the Roland TR-808.

[LG-19] Rethinking Image Compression on the Web with Generative AI

链接: https://arxiv.org/abs/2407.04542
作者: Shayan Ali Hassan,Danish Humair,Ihsan Ayyub Qazi,Zafar Ayyub Qazi
关键词: increased webpage sizes, significant data transfer, made images central, web browsing, Web experience
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The rapid growth of the Internet, driven by social media, web browsing, and video streaming, has made images central to the Web experience, resulting in significant data transfer and increased webpage sizes. Traditional image compression methods, while reducing bandwidth, often degrade image quality. This paper explores a novel approach using generative AI to reconstruct images at the edge or client-side. We develop a framework that leverages text prompts and provides additional conditioning inputs like Canny edges and color palettes to a text-to-image model, achieving up to 99.8% bandwidth savings in the best cases and 92.6% on average, while maintaining high perceptual similarity. Empirical analysis and a user study show that our method preserves image meaning and structure more effectively than traditional compression methods, offering a promising solution for reducing bandwidth usage and improving Internet affordability with minimal degradation in image quality.

[LG-20] PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts

链接: https://arxiv.org/abs/2407.04541
作者: Ana-Cristina Rogoz,Maria Ilinca Nechita,Radu Tudor Ionescu
关键词: collected from Reddit, Romanian posts collected, Popularity Prediction, Reddit, posts collected
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ICPR 2024

点击查看摘要

Abstract:We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at this https URL.

[LG-21] PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

链接: https://arxiv.org/abs/2407.04538
作者: Ananthu Aniraj,Cassio F.Dantas,Dino Ienco,Diego Marcos
关键词: explicitly detect object, detect object parts, Computer vision methods, inherently interpretable models, Computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a main conference paper at the European Conference of Computer Vision (ECCV) 2024

点击查看摘要

Abstract:Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts; they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require to rethink the geometric priors that can be used for unsupervised part discovery.

[LG-22] Introducing Inside Out of Distribution

链接: https://arxiv.org/abs/2407.04534
作者: Teddy Lazebnik
关键词: Detecting and understanding, ensure reliable model, reliable model performance, OOD, samples is crucial
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting and understanding out-of-distribution (OOD) samples is crucial in machine learning (ML) to ensure reliable model performance. Current OOD studies, in general, and in the context of ML, in particular, primarily focus on extrapolatory OOD (outside), neglecting potential cases of interpolatory OOD (inside). This study introduces a novel perspective on OOD by suggesting OOD can be divided into inside and outside cases. In addition, following this framework, we examine the inside-outside OOD profiles of datasets and their impact on ML model performance. Our analysis shows that different inside-outside OOD profiles lead to nuanced declines in ML model performance, highlighting the importance of distinguishing between these two cases for developing effective counter-OOD methods.

[LG-23] GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

链接: https://arxiv.org/abs/2407.04528
作者: Aleksander Ficek,Jiaqi Zeng,Oleksii Kuchaiev
关键词: minimizing compute requirements, adapting large language, Retrieval-Augmented Generation, large language models, Parameter-Efficient Fine-Tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation (RAG) have become popular methods for adapting large language models while minimizing compute requirements. In this paper, we apply PEFT methods (P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes, ranging from 823 million to 48 billion parameters. We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process but GPT models have higher performance potential with PEFT. Additionally, our study indicates that 8B parameter models strike an optimal balance between cost and performance and P-tuning lags behind other PEFT techniques. We further provide a comparative analysis of between applying PEFT to an Instruction-tuned RETRO model and base RETRO model. This work presents the first comprehensive comparison of various PEFT methods integrated with RAG, applied to both GPT and RETRO models, highlighting their relative performance.

[LG-24] Graph Reinforcement Learning in Power Grids: A Survey

链接: https://arxiv.org/abs/2407.04522
作者: Mohamed Hassouna,Clara Holzhüter,Pawel Lytaev,Josephine Thomas,Bernhard Sick,Christoph Scholz
关键词: distributed electricity generation, electricity generation motivate, deep learning approaches, power grids, posed by renewable
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The challenges posed by renewable energy and distributed electricity generation motivate the development of deep learning approaches to overcome the lack of flexibility of traditional methods in power grids use cases. The application of GNNs is particularly promising due to their ability to learn from graph-structured data present in power grids. Combined with RL, they can serve as control approaches to determine remedial grid actions. This review analyses the ability of GRL to capture the inherent graph structure of power grids to improve representation learning and decision making in different power grid use cases. It distinguishes between common problems in transmission and distribution grids and explores the synergy between RL and GNNs. In transmission grids, GRL typically addresses automated grid management and topology control, whereas on the distribution side, GRL concentrates more on voltage regulation. We analyzed the selected papers based on their graph structure and GNN model, the applied RL algorithm, and their overall contributions. Although GRL demonstrate adaptability in the face of unpredictable events and noisy or incomplete data, it primarily serves as a proof of concept at this stage. There are multiple open challenges and limitations that need to be addressed when considering the application of RL to real power grid operation.

[LG-25] G-Adaptive mesh refinement – leveraging graph neural networks and differentiable finite element solvers

链接: https://arxiv.org/abs/2407.04516
作者: James Rowbottom,Georg Maierhofer,Teo Deveney,Katharina Schratz,Pietro Liò,Carola-Bibiane Schönlieb,Chris Budd
关键词: mesh point locations, finite element methods, mesh point, mesh, adaptivity in finite
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We present a novel, and effective, approach to the long-standing problem of mesh adaptivity in finite element methods (FEM). FE solvers are powerful tools for solving partial differential equations (PDEs), but their cost and accuracy are critically dependent on the choice of mesh points. To keep computational costs low, mesh relocation (r-adaptivity) seeks to optimise the position of a fixed number of mesh points to obtain the best FE solution accuracy. Classical approaches to this problem require the solution of a separate nonlinear “meshing” PDE to find the mesh point locations. This incurs significant cost at remeshing and relies on certain a-priori assumptions and guiding heuristics for optimal mesh point location. Recent machine learning approaches to r-adaptivity have mainly focused on the construction of fast surrogates for such classical methods. Our new approach combines a graph neural network (GNN) powered architecture, with training based on direct minimisation of the FE solution error with respect to the mesh point locations. The GNN employs graph neural diffusion (GRAND), closely aligning the mesh solution space to that of classical meshing methodologies, thus replacing heuristics with a learnable strategy, and providing a strong inductive bias. This allows for rapid and robust training and results in an extremely efficient and effective GNN approach to online r-adaptivity. This method outperforms classical and prior ML approaches to r-adaptive meshing on the test problems we consider, in particular achieving lower FE solution error, whilst retaining the significant speed-up over classical methods observed in prior ML work.

[LG-26] LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order

链接: https://arxiv.org/abs/2407.04513
作者: Matthias Freiberger,Peter Kun,Anders Sundnes Løvlie,Sebastian Risi
关键词: artificial neural networks, robust toward pruning, typically not robust, neural network architectures, artificial neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to their architecture and how they are trained, artificial neural networks are typically not robust toward pruning, replacing, or shuffling layers at test time. However, such properties would be desirable for different applications, such as distributed neural network architectures where the order of execution cannot be guaranteed or parts of the network can fail during inference. In this work, we address these issues through a number of proposed training approaches for vision transformers whose most important component is randomizing the execution order of attention modules at training time. We show that with our proposed approaches, vision transformers are indeed capable to adapt to arbitrary layer execution orders at test time assuming one tolerates a reduction (about 20%) in accuracy at the same model size. We also find that our trained models can be randomly merged with each other resulting in functional (“Frankenstein”) models without loss of performance compared to the source models. Finally, we layer-prune our models at test time and find that their performance declines gracefully.

[LG-27] PROUD: PaRetO-gUided Diffusion Model for Multi-objective Generation

链接: https://arxiv.org/abs/2407.04493
作者: Yinghua Yao,Yuangang Pan,Jing Li,Ivor Tsang,Xin Yao
关键词: Recent advancements, multiple desired properties, satisfy multiple desired, deep generative models, generative models focus
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in the realm of deep generative models focus on generating samples that satisfy multiple desired properties. However, prevalent approaches optimize these property functions independently, thus omitting the trade-offs among them. In addition, the property optimization is often improperly integrated into the generative models, resulting in an unnecessary compromise on generation quality (i.e., the quality of generated samples). To address these issues, we formulate a constrained optimization problem. It seeks to optimize generation quality while ensuring that generated samples reside at the Pareto front of multiple property objectives. Such a formulation enables the generation of samples that cannot be further improved simultaneously on the conflicting property functions and preserves good quality of generated samples. Building upon this formulation, we introduce the PaRetO-gUided Diffusion model (PROUD), wherein the gradients in the denoising process are dynamically adjusted to enhance generation quality while the generated samples adhere to Pareto optimality. Experimental evaluations on image generation and protein generation tasks demonstrate that our PROUD consistently maintains superior generation quality while approaching Pareto optimality across multiple property functions compared to various baselines.

[LG-28] Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

链接: https://arxiv.org/abs/2407.04491
作者: David Holzmüller,Léo Grinsztajn,Ingo Steinwart
关键词: gradient-boosted decision trees, slower deep learning, deep learning methods, decision trees, dominance of gradient-boosted
类目: Machine Learning (cs.LG)
*备注: 10 pages + 44 pages appendix. Code is available at this http URL and this http URL

点击查看摘要

Abstract:For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) improved default parameters for GBDTs and RealMLP. We tune RealMLP and the default parameters on a meta-train benchmark with 71 classification and 47 regression datasets and compare them to hyperparameter-optimized versions on a disjoint meta-test benchmark with 48 classification and 42 regression datasets, as well as the GBDT-friendly benchmark by Grinsztajn et al. (2022). Our benchmark results show that RealMLP offers a better time-accuracy tradeoff than other neural nets and is competitive with GBDTs. Moreover, a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results on medium-sized tabular datasets (1K–500K samples) without hyperparameter tuning.

[LG-29] Leveraging Graph Structures to Detect Hallucinations in Large Language Models

链接: https://arxiv.org/abs/2407.04485
作者: Noa Nonkes,Sergei Agaronian,Evangelos Kanoulas,Roxana Petcu
关键词: Large language models, providing financial guidance, Large language, content creation, educational tutoring
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models are extensively applied across a wide range of tasks, such as customer support, content creation, educational tutoring, and providing financial guidance. However, a well-known drawback is their predisposition to generate hallucinations. This damages the trustworthiness of the information these models provide, impacting decision-making and user confidence. We propose a method to detect hallucinations by looking at the structure of the latent space and finding associations within hallucinated and non-hallucinated generations. We create a graph structure that connects generations that lie closely in the embedding space. Moreover, we employ a Graph Attention Network which utilizes message passing to aggregate information from neighboring nodes and assigns varying degrees of importance to each neighbor based on their relevance. Our findings show that 1) there exists a structure in the latent space that differentiates between hallucinated and non-hallucinated generations, 2) Graph Attention Networks can learn this structure and generalize it to unseen generations, and 3) the robustness of our method is enhanced when incorporating contrastive learning. When evaluated against evidence-based benchmarks, our model performs similarly without access to search-based methods.

[LG-30] Using Petri Nets as an Integrated Constraint Mechanism for Reinforcement Learning Tasks

链接: https://arxiv.org/abs/2407.04481
作者: Timon Sachweh,Pierre Haritz,Thomas Liebig
关键词: Reinforcement Learning, autonomous vehicles, lack of trust, lack of verifiability, production plants
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The lack of trust in algorithms is usually an issue when using Reinforcement Learning (RL) agents for control in real-world domains such as production plants, autonomous vehicles, or traffic-related infrastructure, partly due to the lack of verifiability of the model itself. In such scenarios, Petri nets (PNs) are often available for flowcharts or process steps, as they are versatile and standardized. In order to facilitate integration of RL models and as a step towards increasing AI trustworthiness, we propose an approach that uses PNs with three main advantages over typical RL approaches: Firstly, the agent can now easily be modeled with a combined state including both external environmental observations and agent-specific state information from a given PN. Secondly, we can enforce constraints for state-dependent actions through the inherent PN model. And lastly, we can increase trustworthiness by verifying PN properties through techniques such as model checking. We test our approach on a typical four-way intersection traffic light control setting and present our results, beating cycle-based baselines.

[LG-31] LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

链接: https://arxiv.org/abs/2407.04480
作者: Xingyu Xie,Zhijie Lin,Kim-Chuan Toh,Pan Zhou
关键词: local GPU nodes, GPU nodes, local GPU, efficiently train large-scale, Low-bit Communication Adaptor
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to compression information loss. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo designs a moving average of historical compensation errors to stably estimate concurrent compression error and then adopts it to compensate for the concurrent gradient compression, yielding a less lossless compression. This mechanism allows it to be compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on nonconvex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch’s FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam’s training speed by 14% to 40% without performance degradation on large language models like LLAMAs and MoE.

[LG-32] Rethinking Data Input for Point Cloud Upsampling

链接: https://arxiv.org/abs/2407.04476
作者: Tongxu Zhang
关键词: point cloud, point cloud upsampling, point cloud model, patch based, recent years
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:In recent years, point cloud upsampling has been widely applied in fields such as 3D reconstruction and surface generation. However, existing point cloud upsampling inputs are all patch based, and there is no research discussing the differences and principles between point cloud model full input and patch based input. In order to compare with patch based point cloud input, this article proposes a new data input method, which divides the full point cloud model to ensure shape integrity while training PU-GCN. This article was validated on the PU1K and ABC datasets, but the results showed that Patch based performance is better than model based full input i.e. Average Segment input. Therefore, this article explores the data input factors and model modules that affect the upsampling results of point clouds.

[LG-33] EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context

链接: https://arxiv.org/abs/2407.04472
作者: Hannes Kunstmann,Joseph Ollier,Joel Persson,Florian von Wangenheim
关键词: Large language models, LLM-driven CRS, Large language, conversational recommender systems, implement LLM-driven CRS
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 27 pages, 3 tables, 5 figures, pre-print manuscript, updated version of manuscript due to typo (previous version, Figure 5 was incorrectly named Figure 6)

点击查看摘要

Abstract:Large language models (LLMs) present an enormous evolution in the strategic potential of conversational recommender systems (CRS). Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of a small to medium enterprises (SME) that makeup the bedrock of the global economy. In the current paper, we detail the design of an LLM-driven CRS in an SME setting, and its subsequent performance in the field using both objective system metrics and subjective user evaluations. While doing so, we additionally outline a short-form revised ResQue model for evaluating LLM-driven CRS, enabling replicability in a rapidly evolving field. Our results reveal good system performance from a user experience perspective (85.5% recommendation accuracy) but underscore latency, cost, and quality issues challenging business viability. Notably, with a median cost of 0.04 per interaction and a latency of 5.7s, cost-effectiveness and response time emerge as crucial areas for achieving a more user-friendly and economically viable LLM-driven CRS for SME settings. One major driver of these costs is the use of an advanced LLM as a ranker within the retrieval-augmented generation (RAG) technique. Our results additionally indicate that relying solely on approaches such as Prompt-based learning with ChatGPT as the underlying LLM makes it challenging to achieve satisfying quality in a production environment. Strategic considerations for SMEs deploying an LLM-driven CRS are outlined, particularly considering trade-offs in the current technical landscape.

[LG-34] Smart Sampling: Helping from Friendly Neighbors for Decentralized Federated Learning

链接: https://arxiv.org/abs/2407.04460
作者: Lin Wang,Yang Chen,Yongxin Guo,Xiaoying Tang
关键词: Federated Learning, gaining widespread interest, reducing communication costs, gaining widespread, widespread interest
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is gaining widespread interest for its ability to share knowledge while preserving privacy and reducing communication costs. Unlike Centralized FL, Decentralized FL (DFL) employs a network architecture that eliminates the need for a central server, allowing direct communication among clients and leading to significant communication resource savings. However, due to data heterogeneity, not all neighboring nodes contribute to enhancing the local client’s model performance. In this work, we introduce \textbf\emphAFIND+, a simple yet efficient algorithm for sampling and aggregating neighbors in DFL, with the aim of leveraging collaboration to improve clients’ model performance. AFIND+ identifies helpful neighbors, adaptively adjusts the number of selected neighbors, and strategically aggregates the sampled neighbors’ models based on their contributions. Numerical results on real-world datasets with diverse data partitions demonstrate that AFIND+ outperforms other sampling algorithms in DFL and is compatible with most existing DFL optimization algorithms.

[LG-35] Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

链接: https://arxiv.org/abs/2407.04451
作者: Chen-Xiao Gao,Shengjun Fang,Chenjun Xiao,Yang Yu,Zongzhang Zhang
关键词: preference-based reinforcement learning, Offline preference-based reinforcement, preference-based reinforcement, focuses on optimizing, optimizing policies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: Humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e. the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains. Our code is publicly released at this https URL.

[LG-36] Multi-modal Masked Siamese Network Improves Chest X-Ray Representation Learning

链接: https://arxiv.org/abs/2407.04449
作者: Saeed Shurrab,Alejandro Guerra-Manzanares,Farah E. Shamout
关键词: images primarily rely, Electronic Health Records, medical images primarily, Masked Siamese Network, images primarily
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Self-supervised learning methods for medical images primarily rely on the imaging modality during pretraining. While such approaches deliver promising results, they do not leverage associated patient or scan information collected within Electronic Health Records (EHR). Here, we propose to incorporate EHR data during self-supervised pretraining with a Masked Siamese Network (MSN) to enhance the quality of chest X-ray representations. We investigate three types of EHR data, including demographic, scan metadata, and inpatient stay information. We evaluate our approach on three publicly available chest X-ray datasets, MIMIC-CXR, CheXpert, and NIH-14, using two vision transformer (ViT) backbones, specifically ViT-Tiny and ViT-Small. In assessing the quality of the representations via linear evaluation, our proposed method demonstrates significant improvement compared to vanilla MSN and state-of-the-art self-supervised learning baselines. Our work highlights the potential of EHR-enhanced self-supervised pre-training for medical imaging. The code is publicly available at: this https URL

[LG-37] Wavelet-based Temporal Attention Improves Traffic Forecasting

链接: https://arxiv.org/abs/2407.04440
作者: Yash Jakhmola,Nitish Kumar Mishra,Kripabandhu Ghosh,Tanujit Chakraborty
关键词: flow data represents, impacting urban traffic, traffic management systems, urban traffic management, impacting urban
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Spatio-temporal forecasting of traffic flow data represents a typical problem in the field of machine learning, impacting urban traffic management systems. Traditional statistical and machine learning methods cannot adequately handle both the temporal and spatial dependencies in these complex traffic flow datasets. A prevalent approach in the field is to combine graph convolutional networks and multi-head attention mechanisms for spatio-temporal processing. This paper proposes a wavelet-based temporal attention model, namely a wavelet-based dynamic spatio-temporal aware graph neural network (W-DSTAGNN), for tackling the traffic forecasting problem. Benchmark experiments using several statistical metrics confirm that our proposal efficiently captures spatio-temporal correlations and outperforms ten state-of-the-art models on three different real-world traffic datasets. Our proposed ensemble data-driven method can handle dynamic temporal and spatial dependencies and make long-term forecasts in an efficient manner.

[LG-38] Enabling On-Device LLMs Personalization with Smartphone Sensing

链接: https://arxiv.org/abs/2407.04418
作者: Shiquan Zhang,Ying Ma,Le Fang,Hong Jia,Simon D’Alfonso,Vassilis Kostakos
关键词: large language models, combines on-device large, on-device large language, smartphone sensing technologies, language models
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, conference demo paper

点击查看摘要

Abstract:This demo presents a novel end-to-end framework that combines on-device large language models (LLMs) with smartphone sensing technologies to achieve context-aware and personalized services. The framework addresses critical limitations of current personalization solutions via cloud-based LLMs, such as privacy concerns, latency and cost, and limited personal sensor data. To achieve this, we innovatively proposed deploying LLMs on smartphones with multimodal sensor data and customized prompt engineering, ensuring privacy and enhancing personalization performance through context-aware sensing. A case study involving a university student demonstrated the proposed framework’s capability to provide tailored recommendations. In addition, we show that the proposed framework achieves the best trade-off in privacy, performance, latency, cost, battery and energy consumption between on-device and cloud LLMs. Future work aims to integrate more diverse sensor data and conduct large-scale user studies to further refine the personalization. We envision the proposed framework could significantly improve user experiences in various domains such as healthcare, productivity, and entertainment by providing secure, context-aware, and efficient interactions directly on users’ devices.

[LG-39] rustworthy Classification through Rank-Based Conformal Prediction Sets

链接: https://arxiv.org/abs/2407.04407
作者: Rui Luo,Zhixin Zhou
关键词: learning classification tasks, tasks often benefit, benefit from predicting, conformal prediction method, conformal prediction
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning classification tasks often benefit from predicting a set of possible labels with confidence scores to capture uncertainty. However, existing methods struggle with the high-dimensional nature of the data and the lack of well-calibrated probabilities from modern classification models. We propose a novel conformal prediction method that employs a rank-based score function suitable for classification models that predict the order of labels correctly, even if not well-calibrated. Our approach constructs prediction sets that achieve the desired coverage rate while managing their size. We provide a theoretical analysis of the expected size of the conformal prediction sets based on the rank distribution of the underlying classifier. Through extensive experiments, we demonstrate that our method outperforms existing techniques on various datasets, providing reliable uncertainty quantification. Our contributions include a novel conformal prediction method, theoretical analysis, and empirical evaluation. This work advances the practical deployment of machine learning systems by enabling reliable uncertainty quantification.

[LG-40] On Quantum Channel Learning

链接: https://arxiv.org/abs/2407.04406
作者: Mikhail Gennadievich Belov,Victor Victorovich Dubov,Alexey Vladimirovich Filimonov,Vladislav Gennadievich Malyshkin
关键词: optimization problem maximizing, probability preservation constraints, Hilbert spaces, varrho, general quantum channel
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantum Physics (quant-ph)
*备注: The unitary learning from arXiv:2405.10263 is generalized to density matrices and quantum channels

点击查看摘要

Abstract:The problem of an optimal mapping between Hilbert spaces IN and OUT , based on a series of density matrix mapping measurements \rho^(l) \to \varrho^(l) , l=1\dots M , is formulated as an optimization problem maximizing the total fidelity \mathcalF=\sum_l=1^M \omega^(l) F\left(\varrho^(l),\sum_s B_s \rho^(l) B^\dagger_s\right) subject to probability preservation constraints on Kraus operators B_s . For F(\varrho,\sigma) in the form that total fidelity can be represented as a quadratic form with superoperator \mathcalF=\sum_s\left\langle B_s\middle|S\middle| B_s \right\rangle (either exactly or as an approximation) an iterative algorithm is developed to find the global maximum. The result comprises in N_s operators B_s that collectively form an IN to OUT quantum channel A^OUT=\sum_s B_s A^IN B_s^\dagger . The work introduces two important generalizations of unitary learning: 1. IN / OUT states are represented as density matrices. 2. The mapping itself is formulated as a general quantum channel. This marks a crucial advancement from the commonly studied unitary mapping of pure states \phi_l=\mathcalU \psi_l to a general quantum channel, what allows us to distinguish probabilistic mixture of states and their superposition. An application of the approach is demonstrated on unitary learning of density matrix mapping \varrho^(l)=\mathcalU \rho^(l) \mathcalU^\dagger , in this case a quadratic on \mathcalU fidelity can be constructed by considering \sqrt\rho^(l) \to \sqrt\varrho^(l) mapping, and on a general quantum channel of Kraus rank N_s , where quadratic on B_s fidelity is an approximation – a quantum channel is then built as a hierarchy of unitary mappings. The approach can be applied to study decoherence effects, spontaneous coherence, synchronizing, etc.

[LG-41] Discovering symbolic expressions with parallelized tree search

链接: https://arxiv.org/abs/2407.04405
作者: Kai Ruan,Ze-Feng Gao,Yike Guo,Hao Sun,Ji-Rong Wen,Yang Liu
关键词: modern scientific research, Symbolic regression plays, plays a crucial, crucial role, role in modern
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Symbolic regression plays a crucial role in modern scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. A grand challenge lies in the arduous search for parsimonious and generalizable mathematical formulas, in an infinite search space, while intending to fit the training data. Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade when handling problems of complexity, which essentially hinders the pace of applying symbolic regression for scientific exploration across interdisciplinary domains. To this end, we introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data. Through a series of extensive experiments, we demonstrate the superior accuracy and efficiency of PTS for equation discovery, which greatly outperforms the state-of-the-art baseline models on over 80 synthetic and experimental datasets (e.g., lifting its performance by up to 99% accuracy improvement and one-order of magnitude speed up). PTS represents a key advance in accurate and efficient data-driven discovery of symbolic, interpretable models (e.g., underlying physical laws) and marks a pivotal transition towards scalable symbolic learning.

[LG-42] Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density

链接: https://arxiv.org/abs/2407.04370
作者: Peiyu Yang,Naveed Akhtar,Mubarak Shah,Ajmal Mian
关键词: Trustworthy machine learning, machine learning necessitates, learning necessitates meticulous, necessitates meticulous regulation, Trustworthy machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Trustworthy machine learning necessitates meticulous regulation of model reliance on non-robust features. We propose a framework to delineate and regulate such features by attributing model predictions to the input. Within our approach, robust feature attributions exhibit a certain consistency, while non-robust feature attributions are susceptible to fluctuations. This behavior allows identification of correlation between model reliance on non-robust features and smoothness of marginal density of the input samples. Hence, we uniquely regularize the gradients of the marginal density w.r.t. the input features for robustness. We also devise an efficient implementation of our regularization to address the potential numerical instability of the underlying optimization process. Moreover, we analytically reveal that, as opposed to our marginal density smoothing, the prevalent input gradient regularization smoothens conditional or joint density of the input, which can cause limited robustness. Our experiments validate the effectiveness of the proposed method, providing clear evidence of its capability to address the feature leakage problem and mitigate spurious correlations. Extensive results further establish that our technique enables the model to exhibit robustness against perturbations in pixel values, input gradients, and density.

[LG-43] UpStory: the Uppsala Storytelling dataset

链接: https://arxiv.org/abs/2407.04352
作者: Marc Fraile,Natalia Calvo-Barajas,Anastasia Sophia Apeiron,Giovanna Varni,Joakim Lindblad,Nataša Sladoje,Ginevra Castellano
关键词: constructive social interactions, educational settings due, student outcomes, important role, formation of constructive
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Friendship and rapport play an important role in the formation of constructive social interactions, and have been widely studied in educational settings due to their impact on student outcomes. Given the growing interest in automating the analysis of such phenomena through Machine Learning (ML), access to annotated interaction datasets is highly valuable. However, no dataset on dyadic child-child interactions explicitly capturing rapport currently exists. Moreover, despite advances in the automatic analysis of human behaviour, no previous work has addressed the prediction of rapport in child-child dyadic interactions in educational settings. We present UpStory – the Uppsala Storytelling dataset: a novel dataset of naturalistic dyadic interactions between primary school aged children, with an experimental manipulation of rapport. Pairs of children aged 8-10 participate in a task-oriented activity: designing a story together, while being allowed free movement within the play area. We promote balanced collection of different levels of rapport by using a within-subjects design: self-reported friendships are used to pair each child twice, either minimizing or maximizing pair separation in the friendship network. The dataset contains data for 35 pairs, totalling 3h 40m of audio and video recordings. It includes two video sources covering the play area, as well as separate voice recordings for each child. An anonymized version of the dataset is made publicly available, containing per-frame head pose, body pose, and face features; as well as per-pair information, including the level of rapport. Finally, we provide ML baselines for the prediction of rapport.

[LG-44] Enhancing Safety for Autonomous Agents in Partly Concealed Urban Traffic Environments Through Representation-Based Shielding

链接: https://arxiv.org/abs/2407.04343
作者: Pierre Haritz,David Wanke,Thomas Liebig
关键词: Navigating unsignalized intersections, unpredictable pedestrian crossings, Navigating unsignalized, diverse traffic participants, traffic participants demand
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Navigating unsignalized intersections in urban environments poses a complex challenge for self-driving vehicles, where issues such as view obstructions, unpredictable pedestrian crossings, and diverse traffic participants demand a great focus on crash prevention. In this paper, we propose a novel state representation for Reinforcement Learning (RL) agents centered around the information perceivable by an autonomous agent, enabling the safe navigation of previously uncharted road maps. Our approach surpasses several baseline models by a sig nificant margin in terms of safety and energy consumption metrics. These improvements are achieved while maintaining a competitive average travel speed. Our findings pave the way for more robust and reliable autonomous navigation strategies, promising safer and more efficient urban traffic environments.

[LG-45] Geometrically Inspired Kernel Machines for Collaborative Learning Beyond Gradient Descent

链接: https://arxiv.org/abs/2407.04335
作者: Mohit Kumar,Alexander Valentinitsch,Magdalena Fuchs,Mathias Brucker,Juliana Bowles,Adnan Husakovic,Ali Abbas,Bernhard A. Moser(Institute of Signal Processing)
关键词: geometrically inspired kernel, inspired kernel machines, Kernel Hilbert Space, Reproducing Kernel Hilbert, approximation errors
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper develops a novel mathematical framework for collaborative learning by means of geometrically inspired kernel machines which includes statements on the bounds of generalisation and approximation errors, and sample complexity. For classification problems, this approach allows us to learn bounded geometric structures around given data points and hence solve the global model learning problem in an efficient way by exploiting convexity properties of the related optimisation problem in a Reproducing Kernel Hilbert Space (RKHS). In this way, we can reduce classification problems to determining the closest bounded geometric structure from a given data point. Further advantages that come with our solution is that our approach does not require clients to perform multiple epochs of local optimisation using stochastic gradient descent, nor require rounds of communication between client/server for optimising the global model. We highlight that numerous experiments have shown that the proposed method is a competitive alternative to the state-of-the-art.

[LG-46] Learning Geometric Invariant Features for Classification of Vector Polygons with Graph Message-passing Neural Network

链接: https://arxiv.org/abs/2407.04334
作者: Zexian Huang,Kourosh Khoshelham,Martin Tomko
关键词: non-trivial learning task, vector polygons remains, deep learning approaches, vector polygons, polygons
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Geometric shape classification of vector polygons remains a non-trivial learning task in spatial analysis. Previous studies mainly focus on devising deep learning approaches for representation learning of rasterized vector polygons, whereas the study of discrete representations of polygons and subsequent deep learning approaches have not been fully investigated. In this study, we investigate a graph representation of vector polygons and propose a novel graph message-passing neural network (PolyMP) to learn the geometric-invariant features for shape classification of polygons. Through extensive experiments, we show that the graph representation of polygons combined with a permutation-invariant graph message-passing neural network achieves highly robust performances on benchmark datasets (i.e., synthetic glyph and real-world building footprint datasets) as compared to baseline methods. We demonstrate that the proposed graph-based PolyMP network enables the learning of expressive geometric features invariant to geometric transformations of polygons (i.e., translation, rotation, scaling and shearing) and is robust to trivial vertex removals of polygons. We further show the strong generalizability of PolyMP, which enables generalizing the learned geometric features from the synthetic glyph polygons to the real-world building footprints.

[LG-47] EAGERx: Graph-Based Framework for Sim2real Robot Learning

链接: https://arxiv.org/abs/2407.04328
作者: Bas van der Heijden,Jelle Luijkx,Laura Ferranti,Jens Kober,Robert Babuska
关键词: handle complex tasks, efficiently handle complex, learned control policies, complex tasks, transfer of learned
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: For an introductory video, see this http URL . The documentation, tutorials, and our open-source code can be found at this http URL

点击查看摘要

Abstract:Sim2real, that is, the transfer of learned control policies from simulation to real world, is an area of growing interest in robotics due to its potential to efficiently handle complex tasks. The sim2real approach faces challenges due to mismatches between simulation and reality. These discrepancies arise from inaccuracies in modeling physical phenomena and asynchronous control, among other factors. To this end, we introduce EAGERx, a framework with a unified software pipeline for both real and simulated robot learning. It can support various simulators and aids in integrating state, action and time-scale abstractions to facilitate learning. EAGERx’s integrated delay simulation, domain randomization features, and proposed synchronization algorithm contribute to narrowing the sim2real gap. We demonstrate (in the context of robot learning and beyond) the efficacy of EAGERx in accommodating diverse robotic systems and maintaining consistent simulation behavior. EAGERx is open source and its code is available at this https URL.

[LG-48] Understanding the Role of Invariance in Transfer Learning

链接: https://arxiv.org/abs/2407.04325
作者: Till Speicher,Vedant Nanda,Krishna P. Gummadi
关键词: powerful technique, technique for knowledge-sharing, Transfer, Transfer learning, invariance
类目: Machine Learning (cs.LG)
*备注: Published at TMLR 2024

点击查看摘要

Abstract:Transfer learning is a powerful technique for knowledge-sharing between different tasks. Recent work has found that the representations of models with certain invariances, such as to adversarial input perturbations, achieve higher performance on downstream tasks. These findings suggest that invariance may be an important property in the context of transfer learning. However, the relationship of invariance with transfer performance is not fully understood yet and a number of questions remain. For instance, how important is invariance compared to other factors of the pretraining task? How transferable is learned invariance? In this work, we systematically investigate the importance of representational invariance for transfer learning, as well as how it interacts with other parameters during pretraining. To do so, we introduce a family of synthetic datasets that allow us to precisely control factors of variation both in training and test data. Using these datasets, we a) show that for learning representations with high transfer performance, invariance to the right transformations is as, or often more, important than most other factors such as the number of training samples, the model architecture and the identity of the pretraining classes, b) show conditions under which invariance can harm the ability to transfer representations and c) explore how transferable invariance is between tasks. The code is available at \urlthis https URL.

[LG-49] SSP-GNN: Learning to Track via Bilevel Optimization

链接: https://arxiv.org/abs/2407.04308
作者: Griffin Golias,Masa Nakura-Fan,Vitaly Ablavsky
关键词: graph-based tracking formulation, re-identification features, propose a graph-based, formulation for multi-object, kinematic information
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a graph-based tracking formulation for multi-object tracking (MOT) where target detections contain kinematic information and re-identification features (attributes). Our method applies a successive shortest paths (SSP) algorithm to a tracking graph defined over a batch of frames. The edge costs in this tracking graph are computed via a message-passing network, a graph neural network (GNN) variant. The parameters of the GNN, and hence, the tracker, are learned end-to-end on a training set of example ground-truth tracks and detections. Specifically, learning takes the form of bilevel optimization guided by our novel loss function. We evaluate our algorithm on simulated scenarios to understand its sensitivity to scenario aspects and model hyperparameters. Across varied scenario complexities, our method compares favorably to a strong baseline.

[LG-50] Crafting Large Language Models for Enhanced Interpretability

链接: https://arxiv.org/abs/2407.04307
作者: Chung-En Sun,Tuomas Oikarinen,Tsui-Wei Weng
关键词: Bottleneck Large Language, Concept Bottleneck Large, interpretable Large Language, inherently interpretable Large, Large Language Model
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Present at ICML 2024 Mechanistic Interpretability (MI) Workshop

点击查看摘要

Abstract:We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. This innovation not only advances transparency in language models but also enhances their effectiveness. Our unique Automatic Concept Correction (ACC) strategy successfully narrows the performance gap with conventional black-box LLMs, positioning CB-LLM as a model that combines the high accuracy of traditional LLMs with the added benefit of clear interpretability – a feature markedly absent in existing LLMs.

[LG-51] Fair Federated Data Clustering through Personalization: Bridging the Gap between Diverse Data Distributions

链接: https://arxiv.org/abs/2407.04302
作者: Shivam Gupta,Tarushi,Tsering Wangzes,Shweta Jain
关键词: data, machine learning, machine learning paradigms, machine learning algorithms, rapid growth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of data from edge devices has catalyzed the performance of machine learning algorithms. However, the data generated resides at client devices thus there are majorly two challenge faced by traditional machine learning paradigms - centralization of data for training and secondly for most the generated data the class labels are missing and there is very poor incentives to clients to manually label their data owing to high cost and lack of expertise. To overcome these issues, there have been initial attempts to handle unlabelled data in a privacy preserving distributed manner using unsupervised federated data clustering. The goal is partition the data available on clients into k partitions (called clusters) without actual exchange of data. Most of the existing algorithms are highly dependent on data distribution patterns across clients or are computationally expensive. Furthermore, due to presence of skewed nature of data across clients in most of practical scenarios existing models might result in clients suffering high clustering cost making them reluctant to participate in federated process. To this, we are first to introduce the idea of personalization in federated clustering. The goal is achieve balance between achieving lower clustering cost and at same time achieving uniform cost across clients. We propose p-FClus that addresses these goal in a single round of communication between server and clients. We validate the efficacy of p-FClus against variety of federated datasets showcasing it’s data independence nature, applicability to any finite \ell -norm, while simultaneously achieving lower cost and variance.

[LG-52] Jailbreak Attacks and Defenses Against Large Language Models: A Survey

链接: https://arxiv.org/abs/2407.04295
作者: Sibo Yi,Yule Liu,Zhen Sun,Tianshuo Cong,Xinlei He,Jiaxing Song,Ke Xu,Qi Li
关键词: Large Language Models, Large Language, including question answering, Language Models, code completion
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of “jailbreaking”, which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

[LG-53] Robust Decision Transformer: Tackling Data Corruption in Offline RL via Sequence Modeling

链接: https://arxiv.org/abs/2407.04285
作者: Jiawei Xu,Rui Yang,Feng Luo,Meng Fang,Baoxiang Wang,Lei Han
关键词: costly online interactions, scaling data-driven decision-making, Decision Transformer, holds promise, online interactions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning policies from offline datasets through offline reinforcement learning (RL) holds promise for scaling data-driven decision-making and avoiding unsafe and costly online interactions. However, real-world data collected from sensors or humans often contains noise and errors, posing a significant challenge for existing offline RL methods. Our study indicates that traditional offline RL methods based on temporal difference learning tend to underperform Decision Transformer (DT) under data corruption, especially when the amount of data is limited. This suggests the potential of sequential modeling for tackling data corruption in offline RL. To further unleash the potential of sequence modeling methods, we propose Robust Decision Transformer (RDT) by incorporating several robust techniques. Specifically, we introduce Gaussian weighted learning and iterative data correction to reduce the effect of corrupted data. Additionally, we leverage embedding dropout to enhance the model’s resistance to erroneous inputs. Extensive experiments on MoJoCo, KitChen, and Adroit tasks demonstrate RDT’s superior performance under diverse data corruption compared to previous methods. Moreover, RDT exhibits remarkable robustness in a challenging setting that combines training-time data corruption with testing-time observation perturbations. These results highlight the potential of robust sequence modeling for learning from noisy or corrupted offline datasets, thereby promoting the reliable application of offline RL in real-world tasks.

[LG-54] BiosERC: Integrating Biography Speakers Supported by LLMs for ERC Tasks

链接: https://arxiv.org/abs/2407.04279
作者: Jieying Xue,Minh Phuong Nguyen,Blake Matheny,Le Minh Nguyen
关键词: Emotion Recognition, utilized attention mechanisms, attention mechanisms exploring, mechanisms exploring relationships, modeling emotional interaction
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted in the 33rd International Conference on Artificial Neural Networks (ICANN 2024)

点击查看摘要

Abstract:In the Emotion Recognition in Conversation task, recent investigations have utilized attention mechanisms exploring relationships among utterances from intra- and inter-speakers for modeling emotional interaction between them. However, attributes such as speaker personality traits remain unexplored and present challenges in terms of their applicability to other tasks or compatibility with diverse model architectures. Therefore, this work introduces a novel framework named BiosERC, which investigates speaker characteristics in a conversation. By employing Large Language Models (LLMs), we extract the “biographical information” of the speaker within a conversation as supplementary knowledge injected into the model to classify emotional labels for each utterance. Our proposed method achieved state-of-the-art (SOTA) results on three famous benchmark datasets: IEMOCAP, MELD, and EmoryNLP, demonstrating the effectiveness and generalization of our model and showcasing its potential for adaptation to various conversation analysis tasks. Our source code is available at this https URL.

[LG-55] Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression

链接: https://arxiv.org/abs/2407.04272
作者: Hao Feng,Boyuan Zhang,Fanjiang Ye,Min Si,Ching-Hsiang Chu,Jiannan Tian,Chunxing Yin,Summer Deng,Yuchen Hao,Pavan Balaji,Tong Geng,Dingwen Tao
关键词: gained widespread adoption, recommendation system model, recommendation system, industry applications, gained widespread
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: accepted by SC '24

点击查看摘要

Abstract:DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training. We develop a novel error-bounded lossy compression algorithm, informed by an in-depth analysis of embedding data features, to achieve high compression ratios. Moreover, we introduce a dual-level adaptive strategy for error-bound adjustment, spanning both table-wise and iteration-wise aspects, to balance the compression benefits with the potential impacts on accuracy. We further optimize our compressor for PyTorch tensors on GPUs, minimizing compression overhead. Evaluation shows that our method achieves a 1.38 \times training speedup with a minimal accuracy impact.

[LG-56] Variational Partial Group Convolutions for Input-Aware Partial Equivariance of Rotations and Color-Shifts

链接: https://arxiv.org/abs/2407.04271
作者: Hyunsu Kim,Yegon Kim,Hongseok Yang,Juho Lee
关键词: Group Equivariant CNNs, shown promising efficacy, Equivariant CNNs, equivariant manner, capture hierarchical features
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ICML2024

点击查看摘要

Abstract:Group Equivariant CNNs (G-CNNs) have shown promising efficacy in various tasks, owing to their ability to capture hierarchical features in an equivariant manner. However, their equivariance is fixed to the symmetry of the whole group, limiting adaptability to diverse partial symmetries in real-world datasets, such as limited rotation symmetry of handwritten digit images and limited color-shift symmetry of flower images. Recent efforts address this limitation, one example being Partial G-CNN which restricts the output group space of convolution layers to break full equivariance. However, such an approach still fails to adjust equivariance levels across data. In this paper, we propose a novel approach, Variational Partial G-CNN (VP G-CNN), to capture varying levels of partial equivariance specific to each data instance. VP G-CNN redesigns the distribution of the output group elements to be conditioned on input data, leveraging variational inference to avoid overfitting. This enables the model to adjust its equivariance levels according to the needs of individual data points. Additionally, we address training instability inherent in discrete group equivariance models by redesigning the reparametrizable distribution. We demonstrate the effectiveness of VP G-CNN on both toy and real-world datasets, including MNIST67-180, CIFAR10, ColorMNIST, and Flowers102. Our results show robust performance, even in uncertainty metrics.

[LG-57] NeuFair: Neural Network Fairness Repair with Dropout

链接: https://arxiv.org/abs/2407.04268
作者: Vishnu Asutosh Dasu,Ashish Kumar,Saeid Tizpaz-Niari,Gang Tan
关键词: deep neural networks, paper investigates, neural networks, deep neural, neural dropout method
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Paper accepted at ACM ISSTA 2024

点击查看摘要

Abstract:This paper investigates the neural dropout method as a post-processing bias mitigation for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While neural networks are exceptionally good at finding statistical patterns from data, they are notorious for overfitting to the training datasets that may encode and amplify existing biases from the historical data. Existing bias mitigation algorithms often require either modifying the input dataset or modifying the learning algorithms. We posit that the prevalent dropout methods that prevent over-fitting during training by randomly dropping neurons may be an effective and less intrusive approach to improve fairness of pre-trained DNNs. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs. Our randomized search is guided by an objective to minimize discrimination while maintaining the model utility. We show that our design of randomized algorithms provides statistical guarantees on finding optimal solutions, and we empirically evaluate the efficacy and efficiency of NeuFair in improving fairness, with minimal or no performance degradation. Our results show that NeuFair improves fairness by up to 69% and outperforms state-of-the-art post-processing bias techniques.

[LG-58] Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials

链接: https://arxiv.org/abs/2407.04264
作者: August Y. Chen,Ayush Sekhari,Karthik Sridharan
关键词: Stochastic Gradient Langevin, SGLD, Stochastic Gradient, Gradient Langevin Dynamics, Langevin Dynamics
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study the problem of non-convex optimization using Stochastic Gradient Langevin Dynamics (SGLD). SGLD is a natural and popular variation of stochastic gradient descent where at each step, appropriately scaled Gaussian noise is added. To our knowledge, the only strategy for showing global convergence of SGLD on the loss function is to show that SGLD can sample from a stationary distribution which assigns larger mass when the function is small (the Gibbs measure), and then to convert these guarantees to optimization results. We employ a new strategy to analyze the convergence of SGLD to global minima, based on Lyapunov potentials and optimization. We convert the same mild conditions from previous works on SGLD into geometric properties based on Lyapunov potentials. This adapts well to the case with a stochastic gradient oracle, which is natural for machine learning applications where one wants to minimize population loss but only has access to stochastic gradients via minibatch training samples. Here we provide 1) improved rates in the setting of previous works studying SGLD for optimization, 2) the first finite gradient complexity guarantee for SGLD where the function is Lipschitz and the Gibbs measure defined by the function satisfies a Poincaré Inequality, and 3) prove if continuous-time Langevin Dynamics succeeds for optimization, then discrete-time SGLD succeeds under mild regularity assumptions. Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2407.04264 [cs.LG] (or arXiv:2407.04264v1 [cs.LG] for this version)

[LG-59] Unsupervised Video Summarization via Reinforcement Learning and a Trained Evaluator

链接: https://arxiv.org/abs/2407.04258
作者: Mehryar Abbasi,Hadi Hadizadeh,Parvaneh Saeedi
关键词: unsupervised video summarization, paper presents, video, summarizer model, reward generation pipeline
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach for unsupervised video summarization using reinforcement learning. It aims to address the existing limitations of current unsupervised methods, including unstable training of adversarial generator-discriminator architectures and reliance on hand-crafted reward functions for quality evaluation. The proposed method is based on the concept that a concise and informative summary should result in a reconstructed video that closely resembles the original. The summarizer model assigns an importance score to each frame and generates a video summary. In the proposed scheme, reinforcement learning, coupled with a unique reward generation pipeline, is employed to train the summarizer model. The reward generation pipeline trains the summarizer to create summaries that lead to improved reconstructions. It comprises a generator model capable of reconstructing masked frames from a partially masked video, along with a reward mechanism that compares the reconstructed video from the summary against the original. The video generator is trained in a self-supervised manner to reconstruct randomly masked frames, enhancing its ability to generate accurate summaries. This training pipeline results in a summarizer model that better mimics human-generated video summaries compared to methods relying on hand-crafted rewards. The training process consists of two stable and isolated training steps, unlike adversarial architectures. Experimental results demonstrate promising performance, with F-scores of 62.3 and 54.5 on TVSum and SumMe datasets, respectively. Additionally, the inference stage is 300 times faster than our previously reported state-of-the-art method.

[LG-60] Unified Interpretation of Smoothing Methods for Negative Sampling Loss Functions in Knowledge Graph Embedding

链接: https://arxiv.org/abs/2407.04251
作者: Xincan Feng,Hidetaka Kamigaito,Katsuhiko Hayashi,Taro Watanabe
关键词: Knowledge Graphs, tasks in NLP, Negative Sampling, Adaptive Negative Sampling, fundamental resources
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, 2 tables; accepted to workshop RepL4NLP held in conjunction with ACL 2024

点击查看摘要

Abstract:Knowledge Graphs (KGs) are fundamental resources in knowledge-intensive tasks in NLP. Due to the limitation of manually creating KGs, KG Completion (KGC) has an important role in automatically completing KGs by scoring their links with KG Embedding (KGE). To handle many entities in training, KGE relies on Negative Sampling (NS) loss that can reduce the computational cost by sampling. Since the appearance frequencies for each link are at most one in KGs, sparsity is an essential and inevitable problem. The NS loss is no exception. As a solution, the NS loss in KGE relies on smoothing methods like Self-Adversarial Negative Sampling (SANS) and subsampling. However, it is uncertain what kind of smoothing method is suitable for this purpose due to the lack of theoretical understanding. This paper provides theoretical interpretations of the smoothing methods for the NS loss in KGE and induces a new NS loss, Triplet Adaptive Negative Sampling (TANS), that can cover the characteristics of the conventional smoothing methods. Experimental results of TransE, DistMult, ComplEx, RotatE, HAKE, and HousE on FB15k-237, WN18RR, and YAGO3-10 datasets and their sparser subsets show the soundness of our interpretation and performance improvement by our TANS.

[LG-61] A Two-Step Minimax Q-learning Algorithm for Two-Player Zero-Sum Markov Games

链接: https://arxiv.org/abs/2407.04240
作者: Shreyas S R,Antony Vijesh
关键词: two-player zero-sum Markov, interesting iterative procedure, zero-sum Markov games, Markov game, zero-sum Markov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An interesting iterative procedure is proposed to solve a two-player zero-sum Markov games. First this problem is expressed as a min-max Markov game. Next, a two-step Q-learning algorithm for solving Markov decision problem (MDP) is suitably modified to solve this Markov game. Under a suitable assumption, the boundedness of the proposed iterates is obtained theoretically. Using results from stochastic approximation, the almost sure convergence of the proposed two-step minimax Q-learning is obtained theoretically. More specifically, the proposed algorithm converges to the game theoretic optimal value with probability one, when the model information is not known. Numerical simulation authenticate that the proposed algorithm is effective and easy to implement.

[LG-62] Graph Pooling via Ricci Flow

链接: https://arxiv.org/abs/2407.04236
作者: Amy Feng,Melanie Weber
关键词: Graph Machine Learning, Machine Learning, Learning often involves, Graph Neural Networks, Graph Machine
类目: Machine Learning (cs.LG)
*备注: 32 pages, 7 figures

点击查看摘要

Abstract:Graph Machine Learning often involves the clustering of nodes based on similarity structure encoded in the graph’s topology and the nodes’ attributes. On homophilous graphs, the integration of pooling layers has been shown to enhance the performance of Graph Neural Networks by accounting for inherent multi-scale structure. Here, similar nodes are grouped together to coarsen the graph and reduce the input size in subsequent layers in deeper architectures. In both settings, the underlying clustering approach can be implemented via graph pooling operators, which often rely on classical tools from Graph Theory. In this work, we introduce a graph pooling operator (ORC-Pool), which utilizes a characterization of the graph’s geometry via Ollivier’s discrete Ricci curvature and an associated geometric flow. Previous Ricci flow based clustering approaches have shown great promise across several domains, but are by construction unable to account for similarity structure encoded in the node attributes. However, in many ML applications, such information is vital for downstream tasks. ORC-Pool extends such clustering approaches to attributed graphs, allowing for the integration of geometric coarsening into Graph Neural Networks as a pooling layer.

[LG-63] meLDM: Latent Diffusion Model for Unconditional Time Series Generation

链接: https://arxiv.org/abs/2407.04211
作者: Jian Qian,Miao Sun,Sifan Zhou,Biao Wan,Minhao Li,Patrick Chiang
关键词: crucial research topic, Time series, latent diffusion model, latent diffusion, Time series generation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series generation is a crucial research topic in the area of deep learning, which can be used for data augmentation, imputing missing values, and forecasting. Currently, latent diffusion models are ascending to the forefront of generative modeling for many important data representations. Being the most pivotal in the computer vision domain, latent diffusion models have also recently attracted interest in other communities, including NLP, Speech, and Geometric Space. In this work, we propose TimeLDM, a novel latent diffusion model for high-quality time series generation. TimeLDM is composed of a variational autoencoder that encodes time series into an informative and smoothed latent content and a latent diffusion model operating in the latent space to generate latent information. We evaluate the ability of our method to generate synthetic time series with simulated and realistic datasets, benchmark the performance against existing state-of-the-art methods. Qualitatively and quantitatively, we find that the proposed TimeLDM persistently delivers high-quality generated time series. Sores from Context-FID and Discriminative indicate that TimeLDM consistently and significantly outperforms current state-of-the-art benchmarks with an average improvement of 3.4 \times and 3.8 \times , respectively. Further studies demonstrate that our method presents better performance on different lengths of time series data generation. To the best of our knowledge, this is the first study to explore the potential of the latent diffusion model for unconditional time series generation and establish a new baseline for synthetic time series.

[LG-64] KAN-ODEs: Kolmogorov-Arnold Network Ordinary Differential Equations for Learning Dynamical Systems and Hidden Physics

链接: https://arxiv.org/abs/2407.04192
作者: Benjamin C. Koenig,Suyong Kim,Sili Deng
关键词: recent development demonstrating, Neural Ordinary Differential, Multi-layer perceptrons, alternative to Multi-layer, Ordinary Differential Equation
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures plus 1 appendix figure, 1 table plus 1 appendix table. B.C.K. and S.K. contributed equally to this work

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) as an alternative to Multi-layer perceptrons (MLPs) are a recent development demonstrating strong potential for data-driven modeling. This work applies KANs as the backbone of a Neural Ordinary Differential Equation framework, generalizing their use to the time-dependent and grid-sensitive cases often seen in scientific machine learning applications. The proposed KAN-ODEs retain the flexible dynamical system modeling framework of Neural ODEs while leveraging the many benefits of KANs, including faster neural scaling, stronger interpretability, and lower parameter counts when compared against MLPs. We demonstrate these benefits in three test cases: the Lotka-Volterra predator-prey model, Burgers’ equation, and the Fisher-KPP PDE. We showcase the strong performance of parameter-lean KAN-ODE systems generally in reconstructing entire dynamical systems, and also in targeted applications to the inference of a source term in an otherwise known flow field. We additionally demonstrate the interpretability of KAN-ODEs via activation function visualization and symbolic regression of trained results. The successful training of KAN-ODEs and their improved performance when compared to traditional Neural ODEs implies significant potential in leveraging this novel network architecture in myriad scientific machine learning applications.

[LG-65] Meta-Learning and representation learner: A short theoretical note

链接: https://arxiv.org/abs/2407.04189
作者: Mouad El Bouchattaoui
关键词: process over time, develop models, learning, machine learning, Meta-learning
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Meta-learning, or “learning to learn,” is a subfield of machine learning where the goal is to develop models and algorithms that can learn from various tasks and improve their learning process over time. Unlike traditional machine learning methods focusing on learning a specific task, meta-learning aims to leverage experience from previous tasks to enhance future learning. This approach is particularly beneficial in scenarios where the available data for a new task is limited, but there exists abundant data from related tasks. By extracting and utilizing the underlying structure and patterns across these tasks, meta-learning algorithms can achieve faster convergence and better performance with fewer data. The following notes are mainly inspired from \citevanschoren2018meta, \citebaxter2019learning, and \citemaurer2005algorithmic.

[LG-66] Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs

链接: https://arxiv.org/abs/2407.04173
作者: Faisal Hamman,Pasan Dissanayake,Saumitra Mishra,Freddy Lecue,Sanghamitra Dutta
关键词: random weight initialization, limited tabular data, Fine-tuning large language, make conflicting predictions, large language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on limited tabular data for classification tasks can lead to \textitfine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs due to variations in the training process (i.e., seed, random weight initialization, retraining on additional or deleted samples). This raises critical concerns about the robustness and reliability of Tabular LLMs, particularly when deployed for high-stakes decision-making, such as finance, hiring, education, healthcare, etc. This work formalizes the challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining. Our metric quantifies a prediction’s stability by analyzing (sampling) the model’s local behavior around the input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic robustness guarantees against a broad class of fine-tuned models. By leveraging Bernstein’s Inequality, we show that predictions with sufficiently high robustness (as defined by our measure) will remain consistent with high probability. We also provide empirical evaluation on real-world datasets to support our theoretical results. Our work highlights the importance of addressing fine-tuning instabilities to enable trustworthy deployment of LLMs in high-stakes and safety-critical applications.

[LG-67] Learning Interpretable Differentiable Logic Networks

链接: https://arxiv.org/abs/2407.04168
作者: Chang Yue,Niraj K. Jha
关键词: natural language processing, capturing complex relationships, real-world applications, language processing, underscores their immense
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The ubiquity of neural networks (NNs) in real-world applications, from healthcare to natural language processing, underscores their immense utility in capturing complex relationships within high-dimensional data. However, NNs come with notable disadvantages, such as their “black-box” nature, which hampers interpretability, as well as their tendency to overfit the training data. We introduce a novel method for learning interpretable differentiable logic networks (DLNs) that are architectures that employ multiple layers of binary logic operators. We train these networks by softening and differentiating their discrete components, e.g., through binarization of inputs, binary logic operations, and connections between neurons. This approach enables the use of gradient-based learning methods. Experimental results on twenty classification tasks indicate that differentiable logic networks can achieve accuracies comparable to or exceeding that of traditional NNs. Equally importantly, these networks offer the advantage of interpretability. Moreover, their relatively simple structure results in the number of logic gate-level operations during inference being up to a thousand times smaller than NNs, making them suitable for deployment on edge devices.

[LG-68] Finite Operator Learning: Bridging Neural Operators and Numerical Methods for Efficient Parametric Solution and Optimization of PDEs

链接: https://arxiv.org/abs/2407.04157
作者: Shahed Rezaei,Reza Najian Asl,Kianoosh Taghikhani,Ahmad Moeineddin,Michael Kaliske,Markus Apel
关键词: physics-informed machine learning, standard numerical methods, combines neural operators, Finite Operator Learning, physics-informed machine
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2401.02363

点击查看摘要

Abstract:We introduce a method that combines neural operators, physics-informed machine learning, and standard numerical methods for solving PDEs. The proposed approach extends each of the aforementioned methods and unifies them within a single framework. We can parametrically solve partial differential equations in a data-free manner and provide accurate sensitivities, meaning the derivatives of the solution space with respect to the design space. These capabilities enable gradient-based optimization without the typical sensitivity analysis costs, unlike adjoint methods that scale directly with the number of response functions. Our Finite Operator Learning (FOL) approach uses an uncomplicated feed-forward neural network model to directly map the discrete design space (i.e. parametric input space) to the discrete solution space (i.e. finite number of sensor points in the arbitrary shape domain) ensuring compliance with physical laws by designing them into loss functions. The discretized governing equations, as well as the design and solution spaces, can be derived from any well-established numerical techniques. In this work, we employ the Finite Element Method (FEM) to approximate fields and their spatial derivatives. Subsequently, we conduct Sobolev training to minimize a multi-objective loss function, which includes the discretized weak form of the energy functional, boundary conditions violations, and the stationarity of the residuals with respect to the design variables. Our study focuses on the steady-state heat equation within heterogeneous materials that exhibits significant phase contrast and possibly temperature-dependent conductivity. The network’s tangent matrix is directly used for gradient-based optimization to improve the microstructure’s heat transfer characteristics. …

[LG-69] Mixture of A Million Experts

链接: https://arxiv.org/abs/2407.04153
作者: Xu Owen He
关键词: layer width grows, hidden layer width, width grows, incur a linear, linear increase
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.

[LG-70] VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation

链接: https://arxiv.org/abs/2407.04152
作者: I-Chun Arthur Liu,Sicheng He,Daniel Seita,Gaurav Sukhatme
关键词: Bimanual manipulation, robotics applications, bimanual manipulation tasks, Vision Language Models, manipulation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bimanual manipulation is critical to many robotics applications. In contrast to single-arm manipulation, bimanual manipulation tasks are challenging due to higher-dimensional action spaces. Prior works leverage large amounts of data and primitive actions to address this problem, but may suffer from sample inefficiency and limited generalization across various tasks. To this end, we propose VoxAct-B, a language-conditioned, voxel-based method that leverages Vision Language Models (VLMs) to prioritize key regions within the scene and reconstruct a voxel grid. We provide this voxel grid to our bimanual manipulation policy to learn acting and stabilizing actions. This approach enables more efficient policy learning from voxels and is generalizable to different tasks. In simulation, we show that VoxAct-B outperforms strong baselines on fine-grained bimanual manipulation tasks. Furthermore, we demonstrate VoxAct-B on real-world \textttOpen Drawer and \textttOpen Jar tasks using two UR5s. Code, data, and videos will be available at this https URL.

[LG-71] Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers

链接: https://arxiv.org/abs/2407.04151
作者: Terry Tong,Jiashu Xu,Qin Liu,Muhao Chen
关键词: conversational large language, popular LLM utilization, large language models, multi-turn conversational large, conversational large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submitted to EMNLP 2024

点击查看摘要

Abstract:The security of multi-turn conversational large language models (LLMs) is understudied despite it being one of the most popular LLM utilization. Specifically, LLMs are vulnerable to data poisoning backdoor attacks, where an adversary manipulates the training data to cause the model to output malicious responses to predefined triggers. Specific to the multi-turn dialogue setting, LLMs are at the risk of even more harmful and stealthy backdoor attacks where the backdoor triggers may span across multiple utterances, giving lee-way to context-driven attacks. In this paper, we explore a novel distributed backdoor trigger attack that serves to be an extra tool in an adversary’s toolbox that can interface with other single-turn attack strategies in a plug and play manner. Results on two representative defense mechanisms indicate that distributed backdoor triggers are robust against existing defense strategies which are designed for single-turn user-model interactions, motivating us to propose a new defense strategy for the multi-turn dialogue setting that is more challenging. To this end, we also explore a novel contrastive decoding based defense that is able to mitigate the backdoor with a low computational tradeoff.

[LG-72] SineKAN: Kolmogorov-Arnold Networks Using Sinusoidal Activation Functions

链接: https://arxiv.org/abs/2407.04149
作者: Eric A. F. Reinhardt,Sergei Gleyzer
关键词: perceptron neural networks, traditional multi-layer perceptron, multi-layer perceptron neural, Recent work, neural networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 8 figures

点击查看摘要

Abstract:Recent work has established an alternative to traditional multi-layer perceptron neural networks in the form of Kolmogorov-Arnold Networks (KAN). The general KAN framework uses learnable activation functions on the edges of the computational graph followed by summation on nodes. The learnable edge activation functions in the original implementation are basis spline functions (B-Spline). Here, we present a model in which learnable grids of B-Spline activation functions can be replaced by grids of re-weighted sine functions. We show that this leads to better or comparable numerical performance to B-Spline KAN models on the MNIST benchmark, while also providing a substantial speed increase on the order of 4-9 times.

[LG-73] Query-Guided Self-Supervised Summarization of Nursing Notes

链接: https://arxiv.org/abs/2407.04125
作者: Ya Gao,Hans Moen,Saila Koivusalo,Miika Koskinen,Pekka Marttinen
关键词: Electronic Health Records, patient health status, Health Records, Electronic Health, component of Electronic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Nursing notes, an important component of Electronic Health Records (EHRs), keep track of the progression of a patient’s health status during a care episode. Distilling the key information in nursing notes through text summarization techniques can improve clinicians’ efficiency in understanding patients’ conditions when reviewing nursing notes. However, existing abstractive summarization methods in the clinical setting have often overlooked nursing notes and require the creation of reference summaries for supervision signals, which is time-consuming. In this work, we introduce QGSumm, a query-guided self-supervised domain adaptation framework for nursing note summarization. Using patient-related clinical queries as guidance, our approach generates high-quality, patient-centered summaries without relying on reference summaries for training. Through automatic and manual evaluation by an expert clinician, we demonstrate the strengths of our approach compared to the state-of-the-art Large Language Models (LLMs) in both zero-shot and few-shot settings. Ultimately, our approach provides a new perspective on conditional text summarization, tailored to the specific interests of clinical personnel.

[LG-74] An Autoencoder Architecture for L-band Passive Microwave Retrieval of Landscape Freeze-Thaw Cycle

链接: https://arxiv.org/abs/2407.04119
作者: Divya Kumawat,Ardeshir Ebtehaj,Xiaolan Xu,Andreas Colliander,Vipin Kumar
关键词: global carbon budgets, understanding permafrost response, Northern Hemisphere, Hemisphere is crucial, global warming
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Estimating the landscape and soil freeze-thaw (FT) dynamics in the Northern Hemisphere is crucial for understanding permafrost response to global warming and changes in regional and global carbon budgets. A new framework is presented for surface FT-cycle retrievals using L-band microwave radiometry based on a deep convolutional autoencoder neural network. This framework defines the landscape FT-cycle retrieval as a time series anomaly detection problem considering the frozen states as normal and thawed states as anomalies. The autoencoder retrieves the FT-cycle probabilistically through supervised reconstruction of the brightness temperature (TB) time series using a contrastive loss function that minimizes (maximizes) the reconstruction error for the peak winter (summer). Using the data provided by the Soil Moisture Active Passive (SMAP) satellite, it is demonstrated that the framework learns to isolate the landscape FT states over different land surface types with varying complexities related to the radiometric characteristics of snow cover, lake-ice phenology, and vegetation canopy. The consistency of the retrievals is evaluated over Alaska, against in situ ground-based observations, showing reduced uncertainties compared to the traditional methods that use thresholding of the normalized polarization ratio.

[LG-75] Predictive Coding Networks and Inference Learning: Tutorial and Survey

链接: https://arxiv.org/abs/2407.04117
作者: Björn van Zwol,Ro Jefferson,Egon L. van den Broek
关键词: artificial intelligence research, intelligence research, years have witnessed, witnessed a growing, growing call
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: 49 pages, 13 figures, 9 tables

点击查看摘要

Abstract:Recent years have witnessed a growing call for renewed emphasis on neuroscience-inspired approaches in artificial intelligence research, under the banner of \textitNeuroAI . This is exemplified by recent attention gained by predictive coding networks (PCNs) within machine learning (ML). PCNs are based on the neuroscientific framework of predictive coding (PC), which views the brain as a hierarchical Bayesian inference model that minimizes prediction errors from feedback connections. PCNs trained with inference learning (IL) have potential advantages to traditional feedforward neural networks (FNNs) trained with backpropagation. While historically more computationally intensive, recent improvements in IL have shown that it can be more efficient than backpropagation with sufficient parallelization, making PCNs promising alternatives for large-scale applications and neuromorphic hardware. Moreover, PCNs can be mathematically considered as a superset of traditional FNNs, which substantially extends the range of possible architectures for both supervised and unsupervised learning. In this work, we provide a comprehensive review as well as a formal specification of PCNs, in particular placing them in the context of modern ML methods, and positioning PC as a versatile and promising framework worthy of further study by the ML community.

[LG-76] Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

链接: https://arxiv.org/abs/2407.04108
作者: Sara Price,Arjun Panickssery,Sam Bowman,Asa Cooper Stickland
关键词: hidden behaviors, https URL, URL, Backdoors, model
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is in the future relative to when it was trained. Through prompting experiments and by probing internal activations, we show that current large language models (LLMs) can distinguish past from future events, with probes on model activations achieving 90% accuracy. We train models with backdoors triggered by a temporal distributional shift; they activate when the model is exposed to news headlines beyond their training cut-off dates. Fine-tuning on helpful, harmless and honest (HHH) data does not work well for removing simpler backdoor triggers but is effective on our backdoored models, although this distinction is smaller for the larger-scale model we tested. We also find that an activation-steering vector representing a model’s internal representation of the date influences the rate of backdoor activation. We take these results as initial evidence that, at least for models at the modest scale we test, standard safety measures are enough to remove these backdoors. We publicly release all relevant code (this https URL), datasets (this https URL), and models (this https URL).

[LG-77] Certifiably Robust Image Watermark

链接: https://arxiv.org/abs/2407.04086
作者: Zhengyuan Jiang,Moyang Guo,Yuepeng Hu,Jinyuan Jia,Neil Zhenqiang Gong
关键词: Generative AI raises, propaganda campaigns, raises many societal, boosting disinformation, disinformation and propaganda
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Generative AI raises many societal concerns such as boosting disinformation and propaganda campaigns. Watermarking AI-generated content is a key technology to address these concerns and has been widely deployed in industry. However, watermarking is vulnerable to removal attacks and forgery attacks. In this work, we propose the first image watermarks with certified robustness guarantees against removal and forgery attacks. Our method leverages randomized smoothing, a popular technique to build certifiably robust classifiers and regression models. Our major technical contributions include extending randomized smoothing to watermarking by considering its unique characteristics, deriving the certified robustness guarantees, and designing algorithms to estimate them. Moreover, we extensively evaluate our image watermarks in terms of both certified and empirical robustness. Our code is available at \urlthis https URL.

[LG-78] DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning

链接: https://arxiv.org/abs/2407.04078
作者: Chengpeng Li,Guanting Dong,Mingfeng Xue,Ru Peng,Xiang Wang,Dayiheng Liu
关键词: Large language models, made impressive progress, Large language, handling simple math, complex mathematical tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) have made impressive progress in handling simple math problems, yet they still struggle with more challenging and complex mathematical tasks. In this paper, we introduce a series of LLMs that employs the Decomposition of thought with code assistance and self-correction for mathematical reasoning, dubbed as DotaMath. DotaMath models tackle complex mathematical tasks by decomposing them into simpler logical subtasks, leveraging code to solve these subtasks, obtaining fine-grained feedback from the code interpreter, and engaging in self-reflection and correction. By annotating diverse interactive tool-use trajectories and employing query evolution on GSM8K and MATH datasets, we generate an instruction fine-tuning dataset called DotaMathQA with 574K query-response pairs. We train a series of base LLMs using imitation learning on DotaMathQA, resulting in DotaMath models that achieve remarkable performance compared to open-source LLMs across various in-domain and out-of-domain benchmarks. Notably, DotaMath-deepseek-7B showcases an outstanding performance of 64.8% on the competitive MATH dataset and 86.7% on GSM8K. Besides, DotaMath-deepseek-7B maintains strong competitiveness on a series of in-domain and out-of-domain benchmarks (Avg. 80.1%). Looking forward, we anticipate that the DotaMath paradigm will open new pathways for addressing intricate mathematical problems. Our code is publicly available at this https URL.

[LG-79] Sparsest Models Elude Pruning: An Expose of Prunings Current Capabilities

链接: https://arxiv.org/abs/2407.04075
作者: Stephen Zhang,Vardan Papyan
关键词: compressing large-scale models, large-scale models, promising approach, approach for compressing, compressing large-scale
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published in Proceedings of the 41st International Conference on Machine Learning

点击查看摘要

Abstract:Pruning has emerged as a promising approach for compressing large-scale models, yet its effectiveness in recovering the sparsest of models has not yet been explored. We conducted an extensive series of 485,838 experiments, applying a range of state-of-the-art pruning algorithms to a synthetic dataset we created, named the Cubist Spiral. Our findings reveal a significant gap in performance compared to ideal sparse networks, which we identified through a novel combinatorial search algorithm. We attribute this performance gap to current pruning algorithms’ poor behaviour under overparameterization, their tendency to induce disconnected paths throughout the network, and their propensity to get stuck at suboptimal solutions, even when given the optimal width and initialization. This gap is concerning, given the simplicity of the network architectures and datasets used in our study. We hope that our research encourages further investigation into new pruning techniques that strive for true network sparsity.

[LG-80] A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges Limitations and Recommendations

链接: https://arxiv.org/abs/2407.04069
作者: Md Tahmid Rahman Laskar,Sawsan Alqahtani,M Saiful Bari,Mizanur Rahman,Mohammad Abdullah Matin Khan,Haidar Khan,Israt Jahan,Amran Bhuiyan,Chee Wei Tan,Md Rizwan Parvez,Enamul Hoque,Shafiq Joty,Jimmy Huang
关键词: Large Language Models, Large Language, recently gained significant, gained significant attention, significant attention due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

[LG-81] On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

链接: https://arxiv.org/abs/2407.04065
作者: Zhimin Zhao,Abdul Ali Bangash,Filipe Roseiro Côgo,Bram Adams,Ahmed E. Hassan
关键词: downstream software engineering, large-scale machine learning, demonstrated remarkable adaptability, large language models, software engineering
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders’ ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios (“leaderboard operations”) and identifying potential leaderboard pitfalls and areas for improvement (“leaderboard smells”). In this regard, we perform a multivocal literature review to collect up to 721 FM leaderboards, after which we examine their documentation and engage in direct communication with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components and their interaction within FM leaderboards. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.

[LG-82] ALENT: A Tabular Analytics and Learning Toolbox

链接: https://arxiv.org/abs/2407.04057
作者: Si-Yang Liu,Hao-Run Cai,Qi-Le Zhou,Han-Jia Ye
关键词: common data sources, Tabular data, sources in machine, Tabular, common data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data is one of the most common data sources in machine learning. Although a wide range of classical methods demonstrate practical utilities in this field, deep learning methods on tabular data are becoming promising alternatives due to their flexibility and ability to capture complex interactions within the data. Considering that deep tabular methods have diverse design philosophies, including the ways they handle features, design learning objectives, and construct model architectures, we introduce a versatile deep-learning toolbox called TALENT (Tabular Analytics and LEarNing Toolbox) to utilize, analyze, and compare tabular methods. TALENT encompasses an extensive collection of more than 20 deep tabular prediction methods, associated with various encoding and normalization modules, and provides a unified interface that is easily integrable with new methods as they emerge. In this paper, we present the design and functionality of the toolbox, illustrate its practical application through several case studies, and investigate the performance of various methods fairly based on our toolbox. Code is available at this https URL.

[LG-83] Robust Learning under Hybrid Noise

链接: https://arxiv.org/abs/2407.04029
作者: Yang Wei,Shuo Chen,Shanshan Ye,Bo Han,Chen Gong
关键词: label noise, Feature noise, machine learning model, noise, pose great challenges
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature noise and label noise are ubiquitous in practical scenarios, which pose great challenges for training a robust machine learning model. Most previous approaches usually deal with only a single problem of either feature noise or label noise. However, in real-world applications, hybrid noise, which contains both feature noise and label noise, is very common due to the unreliable data collection and annotation processes. Although some results have been achieved by a few representation learning based attempts, this issue is still far from being addressed with promising performance and guaranteed theoretical analyses. To address the challenge, we propose a novel unified learning framework called “Feature and Label Recovery” (FLR) to combat the hybrid noise from the perspective of data recovery, where we concurrently reconstruct both the feature matrix and the label matrix of input data. Specifically, the clean feature matrix is discovered by the low-rank approximation, and the ground-truth label matrix is embedded based on the recovered features with a nuclear norm regularization. Meanwhile, the feature noise and label noise are characterized by their respective adaptive matrix norms to satisfy the corresponding maximum likelihood. As this framework leads to a non-convex optimization problem, we develop the non-convex Alternating Direction Method of Multipliers (ADMM) with the convergence guarantee to solve our learning objective. We also provide the theoretical analysis to show that the generalization error of FLR can be upper-bounded in the presence of hybrid noise. Experimental results on several typical benchmark datasets clearly demonstrate the superiority of our proposed method over the state-of-the-art robust learning approaches for various noises.

[LG-84] Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection

链接: https://arxiv.org/abs/2407.04022
作者: Lars Doorenbos,Raphael Sznitman,Pablo Márquez-Neila
关键词: deep learning models, reliable deep learning, handle data drawn, reliable deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:The inability of deep learning models to handle data drawn from unseen distributions has sparked much interest in unsupervised out-of-distribution (U-OOD) detection, as it is crucial for reliable deep learning models. Despite considerable attention, theoretically-motivated approaches are few and far between, with most methods building on top of some form of heuristic. Recently, U-OOD was formalized in the context of data invariants, allowing a clearer understanding of how to characterize U-OOD, and methods leveraging affine invariants have attained state-of-the-art results on large-scale benchmarks. Nevertheless, the restriction to affine invariants hinders the expressiveness of the approach. In this work, we broaden the affine invariants formulation to a more general case and propose a framework consisting of a normalizing flow-like architecture capable of learning non-linear invariants. Our novel approach achieves state-of-the-art results on an extensive U-OOD benchmark, and we demonstrate its further applicability to tabular data. Finally, we show our method has the same desirable properties as those based on affine invariants.

[LG-85] A Critical Assessment of Interpretable and Explainable Machine Learning for Intrusion Detection

链接: https://arxiv.org/abs/2407.04009
作者: Omer Subasi,Johnathan Cree,Joseph Manzano,Elena Peterson
关键词: learning process, large number, feature-based model explanations, Deep Neural Networks, explanations
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There has been a large number of studies in interpretable and explainable ML for cybersecurity, in particular, for intrusion detection. Many of these studies have significant amount of overlapping and repeated evaluations and analysis. At the same time, these studies overlook crucial model, data, learning process, and utility related issues and many times completely disregard them. These issues include the use of overly complex and opaque ML models, unaccounted data imbalances and correlated features, inconsistent influential features across different explanation methods, the inconsistencies stemming from the constituents of a learning process, and the implausible utility of explanations. In this work, we empirically demonstrate these issues, analyze them and propose practical solutions in the context of feature-based model explanations. Specifically, we advise avoiding complex opaque models such as Deep Neural Networks and instead using interpretable ML models such as Decision Trees as the available intrusion datasets are not difficult for such interpretable models to classify successfully. Then, we bring attention to the binary classification metrics such as Matthews Correlation Coefficient (which are well-suited for imbalanced datasets. Moreover, we find that feature-based model explanations are most often inconsistent across different settings. In this respect, to further gauge the extent of inconsistencies, we introduce the notion of cross explanations which corroborates that the features that are determined to be impactful by one explanation method most often differ from those by another method. Furthermore, we show that strongly correlated data features and the constituents of a learning process, such as hyper-parameters and the optimization routine, become yet another source of inconsistent explanations. Finally, we discuss the utility of feature-based explanations.

[LG-86] PaSE: Parallelization Strategies for Efficient DNN Training

链接: https://arxiv.org/abs/2407.04001
作者: Venmugil Elango
关键词: requires substantial computational, deep neural network, neural network, requires substantial, deep neural
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Published as conference paper at IPDPS 2021

点击查看摘要

Abstract:Training a deep neural network (DNN) requires substantial computational and memory requirements. It is common to use multiple devices to train a DNN to reduce the overall training time. There are several choices to parallelize each layer in a DNN. Exhaustively searching this list to find an optimal parallelization strategy is prohibitively time consuming and impractical. The standard practice is to use data parallelism because of its simplicity. However, data parallelism is often sub-optimal, and suffers from poor performance and high memory requirement. Expert-designed strategies have been proposed on a case-by-case basis using domain specific knowledge. These expert-designed strategies do not generalize well to DNNs other than the ones for which they were designed, and are not always necessarily the best choice. In this paper, we propose an approach to automatically find efficient parallelization strategies for DNNs from their computation graphs. We present an efficient algorithm to compute these strategies within a reasonable time in practice. We evaluate the effectiveness of our approach on various DNNs. We also compare the performance of the strategies identified by our approach against data parallelism, expert-designed strategies, and the state-of-the-art approaches. Our results show that the strategies found using our approach outperform the baseline data parallelism strategy in all the cases. In addition, our strategies achieve better performance than the expert-designed strategies and the state-of-the-art approaches. Comments: Published as conference paper at IPDPS 2021 Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2407.04001 [cs.LG] (or arXiv:2407.04001v1 [cs.LG] for this version) Journalreference: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Portland, OR, USA, 2021, pp. 1025-1034 Related DOI: https://doi.org/10.1109/IPDPS49936.2021.00111 Focus to learn more DOI(s) linking to related resources

[LG-87] ROER: Regularized Optimal Experience Replay

链接: https://arxiv.org/abs/2407.03995
作者: Changling Li,Zhang-Wei Hong,Pulkit Agrawal,Divyansh Garg,Joni Pajarinen
关键词: online reinforcement learning, Experience replay serves, reinforcement learning, Experience replay, key component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Experience replay serves as a key component in the success of online reinforcement learning (RL). Prioritized experience replay (PER) reweights experiences by the temporal difference (TD) error empirically enhancing the performance. However, few works have explored the motivation of using TD error. In this work, we provide an alternative perspective on TD-error-based reweighting. We show the connections between the experience prioritization and occupancy optimization. By using a regularized RL objective with f- divergence regularizer and employing its dual form, we show that an optimal solution to the objective is obtained by shifting the distribution of off-policy data in the replay buffer towards the on-policy optimal distribution using TD-error-based occupancy ratios. Our derivation results in a new pipeline of TD error prioritization. We specifically explore the KL divergence as the regularizer and obtain a new form of prioritization scheme, the regularized optimal experience replay (ROER). We evaluate the proposed prioritization scheme with the Soft Actor-Critic (SAC) algorithm in continuous control MuJoCo and DM Control benchmark tasks where our proposed scheme outperforms baselines in 6 out of 11 tasks while the results of the rest match with or do not deviate far from the baselines. Further, using pretraining, ROER achieves noticeable improvement on difficult Antmaze environment where baselines fail, showing applicability to offline-to-online fine-tuning. Code is available at \urlthis https URL.

[LG-88] Zero-failure testing of binary classifiers

链接: https://arxiv.org/abs/2407.03979
作者: Ioannis Ivrissimtzis,Matthew Houliston,Shauna Concannon,Graham Roberts
关键词: performance metrics derived, assess binary classifiers, metrics derived, derived from zero-failure, assess binary
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:We propose using performance metrics derived from zero-failure testing to assess binary classifiers. The principal characteristic of the proposed approach is the asymmetric treatment of the two types of error. In particular, we construct a test set consisting of positive and negative samples, set the operating point of the binary classifier at the lowest value that will result to correct classifications of all positive samples, and use the algorithm’s success rate on the negative samples as a performance measure. A property of the proposed approach, setting it apart from other commonly use