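This post presents the latest papers fetched from arXiv.org on 2024-09-25. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive it by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 11:00 every morning.

Tip: if you would like to receive the daily paper data by email, please leave your email address in the comments.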

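Overview (2024-09-25)

556 papers were updated today, including:

  • Natural language processing: 79 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 157 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 128 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 133 papers (Machine Learning (cs.LG))

Natural Language Processing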

[NLP-0] A fast and sound tagging method for discontinuous named-entity recognition EMNLP2024

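[Quick read]: This paper addresses discontinuous named-entity recognition and proposes a new tagging scheme based on an explicit description of the inner structure of discontinuous mentions. The key to the solution is a weighted finite state automaton used for both marginal and maximum a posteriori inference: the automaton structure ensures that predicted tag sequences are well-formed and gives an unambiguous mapping between tag sequences and discontinuous mentions. The method is evaluated on three English datasets in the biomedical domain, with results comparable to the state of the art and a simpler, faster model.

Link: https://arxiv.org/abs/2409.16243
Authors: Caio Corro
Keywords: named entity recognition, entity recognition based, discontinuous named entity, tagging scheme, named entity
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024
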
Abstract:We introduce a novel tagging scheme for discontinuous named entity recognition based on an explicit description of the inner structure of discontinuous mentions. We rely on a weighted finite state automaton for both marginal and maximum a posteriori inference. As such, our method is sound in the sense that (1) well-formedness of predicted tag sequences is ensured via the automaton structure and (2) there is an unambiguous mapping between well-formed sequences of tags and (discontinuous) mentions. We evaluate our approach on three English datasets in the biomedical domain, and report comparable results to state-of-the-art while having a way simpler and faster model.
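The role of the automaton in guaranteeing well-formed tag sequences can be illustrated with a constrained Viterbi decoder. This is only a minimal sketch: it uses a plain O/B/I tag set as a stand-in (the paper's tag set for discontinuous mentions is richer and is not given in the abstract), unweighted transitions, and hypothetical names such as `ALLOWED` and `constrained_viterbi`.

```python
import math

TAGS = ["O", "B", "I"]

# Transitions of a toy automaton (an assumption, not the paper's automaton):
# "I" may only follow "B" or "I", so a mention can never start with "I".
ALLOWED = {
    ("O", "O"), ("O", "B"),
    ("B", "O"), ("B", "B"), ("B", "I"),
    ("I", "O"), ("I", "B"), ("I", "I"),
}
START_ALLOWED = {"O", "B"}


def constrained_viterbi(scores):
    """Return the highest-scoring tag sequence accepted by the automaton.

    scores: non-empty list with one dict per token, mapping tag -> log-score.
    """
    n = len(scores)
    best = [{t: -math.inf for t in TAGS} for _ in range(n)]
    back = [{t: None for t in TAGS} for _ in range(n)]
    for tag in TAGS:
        if tag in START_ALLOWED:
            best[0][tag] = scores[0][tag]
    for i in range(1, n):
        for cur in TAGS:
            for prev in TAGS:
                if (prev, cur) not in ALLOWED:
                    continue  # forbidden transitions are never explored
                cand = best[i - 1][prev] + scores[i][cur]
                if cand > best[i][cur]:
                    best[i][cur] = cand
                    back[i][cur] = prev
    # Follow back-pointers from the best final tag.
    last = max(TAGS, key=lambda t: best[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    return path[::-1]


# The unconstrained argmax would be ["I", "I"], which is ill-formed.
example = constrained_viterbi([
    {"O": 0.0, "B": -1.0, "I": 2.0},
    {"O": 0.0, "B": -5.0, "I": 3.0},
])
```

Because forbidden transitions are skipped during the search, every returned sequence is well-formed by construction rather than repaired afterwards.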

[NLP-1] EuroLLM: Multilingual Language Models for Europe

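[Quick read]: This paper addresses the fact that current large language models (LLMs) remain predominantly focused on English, and proposes a suite of multilingual LLMs able to understand and generate text in all official European Union languages and several additional relevant languages. The key elements of the solution are: 1) the data collection and filtering process; 2) the development of scaling laws; 3) the creation of a multilingual tokenizer; 4) the design of the data mix and modeling configurations. Building on these steps, the paper releases the initial models EuroLLM-1.7B and EuroLLM-1.7B-Instruct and reports their performance on multilingual general benchmarks and machine translation.

Link: https://arxiv.org/abs/2409.16235
Authors: Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
Keywords: remain predominantly focused, focused on English, European Union languages, official European Union, significant improvement
Categories: Computation and Language (cs.CL)
Comments:
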
Abstract:The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.

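[NLP-2] Towards Enhancing Linked Data Retrieval in Conversational UIs using Large Language Models

[Quick read]: This paper addresses the problem that existing conversational user interface (UI) models need retraining whenever new datasets or updates are introduced, which limits their usefulness as general-purpose data extraction tools. The key to the solution is to integrate large language models (LLMs) into the conversational UI workflow and use their natural language understanding to improve how the system comprehends and processes user queries, so that it generates more accurate SPARQL queries and improves RDF entity extraction without retraining the model. The approach improves system expressivity and response accuracy, and offers a more nuanced, context-aware interaction model for the complex query patterns found in RDF datasets and Linked Open Data (LOD) endpoints.

Link: https://arxiv.org/abs/2409.16220
Authors: Omar Mussa, Omer Rana, Benoît Goossens, Pablo Orozco-Terwengel, Charith Perera
Keywords: Resource Description Framework, Description Framework, Resource Description, Large Language Models, recent broad adoption
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper has been accepted at the 25th International Web Information Systems Engineering Conference (WISE 2024)
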
Abstract:Despite the recent broad adoption of Large Language Models (LLMs) across various domains, their potential for enriching information systems in extracting and exploring Linked Data (LD) and Resource Description Framework (RDF) triplestores has not been extensively explored. This paper examines the integration of LLMs within existing systems, emphasising the enhancement of conversational user interfaces (UIs) and their capabilities for data extraction by producing more accurate SPARQL queries without the requirement for model retraining. Typically, conversational UI models necessitate retraining with the introduction of new datasets or updates, limiting their functionality as general-purpose extraction tools. Our approach addresses this limitation by incorporating LLMs into the conversational UI workflow, significantly enhancing their ability to comprehend and process user queries effectively. By leveraging the advanced natural language understanding capabilities of LLMs, our method improves RDF entity extraction within web systems employing conventional chatbots. This integration facilitates a more nuanced and context-aware interaction model, critical for handling the complex query patterns often encountered in RDF datasets and Linked Open Data (LOD) endpoints. The evaluation of this methodology shows a marked enhancement in system expressivity and the accuracy of responses to user queries, indicating a promising direction for future research in this area. This investigation not only underscores the versatility of LLMs in enhancing existing information systems but also sets the stage for further explorations into their potential applications within more specialised domains of web information systems.

[NLP-3] HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

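[Quick read]: This paper addresses the shortcomings of large language models (LLMs) in long text generation. The key to the solution is the Hierarchical Long Text Generation Benchmark (HelloBench) together with the Hierarchical Long Text Evaluation method (HelloEval). HelloBench evaluates long text generation across five subtasks derived from Bloom's Taxonomy: open-ended QA, summarization, chat, text completion, and heuristic text generation. HelloEval is a human-aligned evaluation method that substantially reduces the time and effort of human evaluation while keeping a high correlation with it. Compared with traditional metrics and LLM-as-a-Judge methods, HelloEval shows the highest correlation with human evaluation, which makes it an effective tool for exposing the current weaknesses of LLMs in long text generation.

Link: https://arxiv.org/abs/2409.16191
Authors: Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen
Keywords: Large Language Models, Large Language, Language Models, long text generation, Hierarchical Long Text
Categories: Computation and Language (cs.CL)
Comments:
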
Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs’ performance in generating long text. Based on Bloom’s Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted extensive experiments across around 30 mainstream LLMs and observed that the current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, we observe that most LLMs cannot generate text that is longer than 4000 words. Second, we observe that while some LLMs can generate longer text, many issues exist (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare HelloEval with traditional metrics (e.g., ROUGE, BLEU, etc.) and LLM-as-a-Judge methods, which show that HelloEval has the highest correlation with human evaluation. We release our code in this https URL.

[NLP-4] Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering

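[Quick read]: This paper addresses the parameter interference and performance degradation that existing LoRA composition methods suffer when merging models. The key to the solution is the concept of Minimal Semantic Units (MSUs), which decomposes LoRA parameters into independent functional units, one per rank. The LoRA-LEGO framework then clusters these units rank-wise, which allows flexible LoRA composition, and applies a dual reweighting strategy to optimize the scale of the merged LoRA. Together, rank-wise clustering and dual reweighting improve the performance of the merged LoRA.

Link: https://arxiv.org/abs/2409.16167
Authors: Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, Fei Wu
Keywords: fine-tuning large language, platforms like Huggingface, large language models, fine-tuning large, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
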
Abstract:Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains due to its modular design and widespread availability on platforms like Huggingface. This modularity has sparked interest in combining multiple LoRAs to enhance LLM capabilities. However, existing methods for LoRA composition primarily focus on task-specific adaptations that require additional training, and current model merging techniques often fail to fully leverage LoRA’s modular nature, leading to parameter interference and performance degradation. In this paper, we investigate the feasibility of disassembling and reassembling multiple LoRAs at a finer granularity, analogous to assembling LEGO blocks. We introduce the concept of Minimal Semantic Units (MSUs), where the parameters corresponding to each rank in LoRA function as independent units. These MSUs demonstrate permutation invariance and concatenation-summation equivalence properties, enabling flexible combinations to create new LoRAs. Building on these insights, we propose the LoRA-LEGO framework. This framework conducts rank-wise parameter clustering by grouping MSUs from different LoRAs into k clusters. The centroid of each cluster serves as a representative MSU, enabling the assembly of a merged LoRA with an adjusted rank of k . Additionally, we apply a dual reweighting strategy to optimize the scale of the merged LoRA. Experiments across various benchmarks demonstrate that our method outperforms existing approaches in LoRA merging.
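The two MSU properties named in the abstract, permutation invariance and concatenation-summation equivalence, follow from writing the LoRA update as a sum of rank-one terms, and can be checked numerically. The sketch below assumes the usual LoRA factorization (update = B @ A) and treats each MSU as one row of A paired with one column of B; the tiny k-means and all variable names are illustrative assumptions, and the dual reweighting step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 3
B1, A1 = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
B2, A2 = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))

# Permutation invariance: reordering the rank dimension leaves B @ A unchanged.
perm = rng.permutation(r)
perm_ok = np.allclose(B1 @ A1, B1[:, perm] @ A1[perm, :])

# Concatenation-summation equivalence: stacking two LoRAs along the rank
# dimension gives the same update as summing their individual updates.
B_cat = np.concatenate([B1, B2], axis=1)   # shape (d_out, 2r)
A_cat = np.concatenate([A1, A2], axis=0)   # shape (2r, d_in)
concat_ok = np.allclose(B_cat @ A_cat, B1 @ A1 + B2 @ A2)


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    local_rng = np.random.default_rng(seed)
    centroids = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids


# One vector per MSU: a row of A concatenated with the matching column of B.
units = np.concatenate([A_cat, B_cat.T], axis=1)   # shape (2r, d_in + d_out)
k = 4
centroids = kmeans(units, k)

# Each centroid acts as a representative MSU of a merged rank-k LoRA.
A_merged = centroids[:, :d_in]       # shape (k, d_in)
B_merged = centroids[:, d_in:].T     # shape (d_out, k)
```

Choosing k below the total number of MSUs (here 4 out of 6) is what lets the merged adapter have an adjusted rank, as the abstract describes.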

[NLP-5] Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework

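[Quick read]: This paper addresses risk control under predictive uncertainty in retrieval-augmented generation (RAG): how to make a RAG model proactively refuse to answer when its confidence is low, so as to reduce uncontrollable risk in real-world applications. The key to the solution is identifying two latent factors that affect a RAG model's confidence: the quality of the retrieved results and the manner in which those results are used. The paper proposes a counterfactual prompting framework that induces the model to alter these factors and analyzes the effect on its answers, which helps the model assess its own confidence along both dimensions. It also introduces a benchmarking procedure in which the model may abstain from answering, and validates the approach through a series of experiments.

Link: https://arxiv.org/abs/2409.16146
Authors: Lu Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, Xueqi Cheng
Keywords: Retrieval-augmented generation, large language models, RAG models, RAG model prediction, RAG
Categories: Computation and Language (cs.CL)
Comments:
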
Abstract:Retrieval-augmented generation (RAG) has emerged as a popular solution to mitigate the hallucination issues of large language models. However, existing studies on RAG seldom address the issue of predictive uncertainty, i.e., how likely it is that a RAG model’s prediction is incorrect, resulting in uncontrollable risks in real-world applications. In this work, we emphasize the importance of risk control, ensuring that RAG models proactively refuse to answer questions with low confidence. Our research identifies two critical latent factors affecting RAG’s confidence in its predictions: the quality of the retrieved results and the manner in which these results are utilized. To guide RAG models in assessing their own confidence based on these two latent factors, we develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers. We also introduce a benchmarking procedure to collect answers with the option to abstain, facilitating a series of experiments. For evaluation, we introduce several risk-related metrics and the experimental results demonstrate the effectiveness of our approach.

[NLP-6] HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

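[Quick read]: This paper addresses the weakness of open-vocabulary object detection (OVD) models at recognizing fine-grained attributes. The key to the solution is a universal, explicit approach that strengthens the attribute-level detection of frozen mainstream OVD models by highlighting fine-grained attributes in an explicit linear space. First, a large language model (LLM) highlights the attribute words in the input text. Then, by adjusting the token masks, the text encoder extracts both a global text feature and an attribute-specific feature, which are explicitly combined as two vectors in linear space to form a new attribute-highlighted feature for detection. The two vectors are reweighted by scalars that are either hand-crafted or learned, and these scalars transfer seamlessly between different OVD models, which supports the claim that the explicit linear composition is universal.

Link: https://arxiv.org/abs/2409.16136
Authors: Yuqi Ma, Mengyin Liu, Chao Zhu, Xu-Cheng Yin
Keywords: Large Multi-modal Models, Large Multi-modal, extensive training data, Open-vocabulary object detection, OVD models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
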
Abstract:Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. However, OVD models are pretrained on large-scale image-text pairs with rich attribute words, whose latent feature space can represent the global text feature as a linear composition of fine-grained attribute tokens without highlighting them. Therefore, we propose in this paper a universal and explicit approach for frozen mainstream OVD models that boosts their attribute-level detection capabilities by highlighting fine-grained attributes in explicit linear space. Firstly, a LLM is leveraged to highlight attribute words within the input text as a zero-shot prompted task. Secondly, by strategically adjusting the token masks, the text encoders of OVD models extract both global text and attribute-specific features, which are then explicitly composited as two vectors in linear space to form the new attribute-highlighted feature for detection tasks, where corresponding scalars are hand-crafted or learned to reweight both two vectors. Notably, these scalars can be seamlessly transferred among different OVD models, which proves that such an explicit linear composition is universal. Empirical evaluation on the FG-OVD dataset demonstrates that our proposed method uniformly improves fine-grained attribute-level OVD of various mainstream models and achieves new state-of-the-art performance.
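The explicit linear composition described above amounts to a weighted sum of two feature vectors. The following is a minimal sketch under assumptions: the function name `compose`, the weight names, and the toy 2-dimensional features are hypothetical, and whether the result is renormalized afterwards is not stated in the abstract.

```python
def compose(global_feat, attr_feat, w_global, w_attr):
    """Weighted sum of a global text feature and an attribute-specific feature.

    Both features are plain lists of floats with equal length; the two
    scalars play the role of the hand-crafted or learned weights.
    """
    if len(global_feat) != len(attr_feat):
        raise ValueError("features must have the same dimension")
    return [w_global * g + w_attr * a for g, a in zip(global_feat, attr_feat)]


# Toy features; a real text encoder would produce much longer vectors.
highlighted = compose([1.0, 2.0], [0.5, -1.0], w_global=1.0, w_attr=2.0)
```

Since only two scalars parameterize the composition, reusing them across models is cheap, which fits the transferability observation in the abstract.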

[NLP-7] Implicit assessment of language learning during practice as accurate as explicit testing

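[Quick read]: This paper addresses how to assess learner ability efficiently and accurately in Intelligent Tutoring Systems (ITS). The key to the solution is Item Response Theory (IRT): it is used to drive adaptive tests, and the linguistic constructs linked to practice exercises are treated as "items" in the IRT model, so that learner ability can be estimated directly from practice. Large-scale experiments show that the approach reduces the testing burden while still giving accurate ability estimates.

Link: https://arxiv.org/abs/2409.16133
Authors: Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber
Keywords: Intelligent Tutoring Systems, Tutoring Systems, Intelligent Tutoring, part of Intelligent, Item Response Theory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
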
Abstract: Assessment of proficiency of the learner is an essential part of Intelligent Tutoring Systems (ITS). We use Item Response Theory (IRT) in computer-aided language learning for assessment of student ability in two contexts: in test sessions, and in exercises during practice sessions. Exhaustive testing across a wide range of skills can provide a detailed picture of proficiency, but may be undesirable for a number of reasons. Therefore, we first aim to replace exhaustive tests with efficient but accurate adaptive tests. We use learner data collected from exhaustive tests under imperfect conditions, to train an IRT model to guide adaptive tests. Simulations and experiments with real learner data confirm that this approach is efficient and accurate. Second, we explore whether we can accurately estimate learner ability directly from the context of practice with exercises, without testing. We transform learner data collected from exercise sessions into a form that can be used for IRT modeling. This is done by linking the exercises to linguistic constructs; the constructs are then treated as “items” within IRT. We present results from large-scale studies with thousands of learners. Using teacher assessments of student ability as “ground truth,” we compare the estimates obtained from tests vs. those from exercises. The experiments confirm that the IRT models can produce accurate ability estimation based on exercises.
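To make the idea of treating constructs as IRT "items" concrete, here is a minimal sketch of ability estimation under a two-parameter logistic (2PL) model. The abstract does not say which IRT variant the authors use, so the 2PL form, the grid-search estimator, and the item parameters below are all assumptions for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability that a learner of ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of binary responses (1 = correct) given ability theta."""
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y else math.log(1.0 - p)
    return total


def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical items as (discrimination, difficulty) pairs; in the paper's
# setting each item would correspond to a linguistic construct.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
weaker = estimate_ability(items, [1, 1, 0, 0])
stronger = estimate_ability(items, [1, 1, 1, 0])
```

A learner who answers more items correctly receives a higher estimate, and the same estimator applies whether the responses come from a test session or from practice exercises.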

[NLP-8] MOSS: Enabling Code-Driven Evolution and Context Management for AI Agents

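[Quick read]: This paper addresses several key problems in developing AI agents on top of large language models (LLMs): code generation that is disconnected from the runtime context, manual protocol development that limits autonomous adaptability, keeping code and context consistent across multi-turn interactions, and isolating local variables. The key to the solution is the MOSS (LLM-oriented Operating System Simulation) framework, which integrates code generation with a dynamic context management system so that the Python context persists across turns while local variables stay isolated. It uses an Inversion of Control (IoC) container together with decorators to enforce the least knowledge principle, improving agent efficiency and capability and moving agents closer to Turing completeness and code-driven evolution.

Link: https://arxiv.org/abs/2409.16120
Authors: Ming Zhu, Yi Zhou
Keywords: true Turing completeness, achieving true Turing, true Turing, Turing completeness, large language models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
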
Abstract: Developing AI agents powered by large language models (LLMs) faces significant challenges in achieving true Turing completeness and adaptive, code-driven evolution. Current approaches often generate code independently of its runtime context, relying heavily on the LLM’s memory, which results in inefficiencies and limits adaptability. Manual protocol development in sandbox environments further constrains the agent’s autonomous adaptability. Crucially, achieving consistency in code and context across multi-turn interactions and ensuring isolation of local variables within each interaction remains an unsolved problem. We introduce MOSS (llM-oriented Operating System Simulation), a novel framework that addresses these challenges by integrating code generation with a dynamic context management system. MOSS ensures consistency and adaptability by using a mechanism that maintains the Python context across interactions, including isolation of local variables and preservation of runtime integrity. At its core, the framework employs an Inversion of Control (IoC) container in conjunction with decorators to enforce the least knowledge principle, allowing agents to focus on abstract interfaces rather than concrete implementations. This facilitates seamless integration of new tools and libraries, enables runtime instance replacement, and reduces prompt complexity, providing a “what you see is what you get” environment for the agent. Through a series of case studies, we show how this framework can enhance the efficiency and capabilities of agent development and highlight its advantages in moving towards Turing-complete agents capable of evolving through code.
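The combination of an IoC container with decorators can be sketched in a few lines of Python. This is not the MOSS API: `Container`, `register`, `bind`, the `Search` interface, and both implementations are hypothetical names chosen only to illustrate how an agent can depend on an abstract interface while the concrete instance is swapped at runtime.

```python
from abc import ABC, abstractmethod


class Container:
    """A tiny Inversion of Control container mapping interfaces to factories."""

    def __init__(self):
        self._providers = {}

    def register(self, interface):
        """Class decorator that binds the decorated class to an interface."""
        def decorator(cls):
            self._providers[interface] = cls
            return cls
        return decorator

    def bind(self, interface, factory):
        """Replace the provider for an interface at runtime."""
        self._providers[interface] = factory

    def get(self, interface):
        """Build an instance of whatever is currently bound to the interface."""
        return self._providers[interface]()


container = Container()


class Search(ABC):
    """Abstract interface; the only thing the agent code needs to know."""

    @abstractmethod
    def query(self, text: str) -> str: ...


@container.register(Search)
class EchoSearch(Search):
    def query(self, text: str) -> str:
        return f"results for {text}"


class UpperSearch(Search):
    def query(self, text: str) -> str:
        return text.upper()


def agent_step(c: Container) -> str:
    # The agent asks for the interface, never for a concrete class.
    return c.get(Search).query("moss")
```

Calling `container.bind(Search, UpperSearch)` changes what `agent_step` receives without touching the agent code, which mirrors the runtime instance replacement mentioned in the abstract.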


[NLP-9] Exploring Hint Generation Approaches in Open-Domain Question Answering EMNLP2024

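[Quick read]: This paper addresses context preparation in automatic Question Answering (QA) systems and proposes a new method called HINTQA. The key to the solution is Automatic Hint Generation (HG): large language models (LLMs) are prompted to produce hints about potential answers, rather than generating or retrieving relevant context directly. The hints improve answer accuracy, and the study shows that HINTQA outperforms traditional retrieval-based and generation-based methods on several QA datasets.

Link: https://arxiv.org/abs/2409.16096
Authors: Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt
Keywords: Automatic Question Answering, Question Answering, provide accurate answers, systems rely, Large Language Models
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at EMNLP 2024
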
Abstract:Automatic Question Answering (QA) systems rely on contextual information to provide accurate answers. Commonly, contexts are prepared through either retrieval-based or generation-based methods. The former involves retrieving relevant documents from a corpus like Wikipedia, whereas the latter uses generative models such as Large Language Models (LLMs) to generate the context. In this paper, we introduce a novel context preparation approach called HINTQA, which employs Automatic Hint Generation (HG) techniques. Unlike traditional methods, HINTQA prompts LLMs to produce hints about potential answers for the question rather than generating relevant context. We evaluate our approach across three QA datasets including TriviaQA, NaturalQuestions, and Web Questions, examining how the number and order of hints impact performance. Our findings show that the HINTQA surpasses both retrieval-based and generation-based approaches. We demonstrate that hints enhance the accuracy of answers more than retrieved and generated contexts.

[NLP-10] Unlocking Markets: A Multilingual Benchmark to Cross-Market Question Answering EMNLP2024

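[Quick read]: This paper addresses Multilingual Cross-market Product-based Question Answering (MCPQA): answering product-related questions in a main marketplace by using information from a resource-rich auxiliary marketplace. The key to the solution is McMarket, a large-scale multilingual cross-market dataset, which is annotated and evaluated for two subtasks: review-based answer generation and product-related question ranking. Experiments show that using cross-market information significantly improves model performance on both tasks.

Link: https://arxiv.org/abs/2409.16025
Authors: Yifei Yuan, Yang Deng, Anders Søgaard, Mohammad Aliannejadi
Keywords: Users post numerous, post numerous product-related, Product-based Question Answering, e-commerce platforms, affecting their purchase
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024
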
Abstract:Users post numerous product-related questions on e-commerce platforms, affecting their purchase decisions. Product-related question answering (PQA) entails utilizing product-related resources to provide precise responses to users. We propose a novel task of Multilingual Cross-market Product-based Question Answering (MCPQA) and define the task as providing answers to product-related questions in a main marketplace by utilizing information from another resource-rich auxiliary marketplace in a multilingual context. We introduce a large-scale dataset comprising over 7 million questions from 17 marketplaces across 11 languages. We then perform automatic translation on the Electronics category of our dataset, naming it as McMarket. We focus on two subtasks: review-based answer generation and product-related question ranking. For each subtask, we label a subset of McMarket using an LLM and further evaluate the quality of the annotations via human assessment. We then conduct experiments to benchmark our dataset, using models ranging from traditional lexical models to LLMs in both single-market and cross-market scenarios across McMarket and the corresponding LLM subset. Results show that incorporating cross-market information significantly enhances performance in both tasks.

[NLP-11] AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment

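[Quick read]: This paper asks whether large language models (LLMs) are affected by cognitive biases, specifically the threshold priming effect, in information retrieval (IR) tasks. The approach is to test experimentally whether LLMs scoring document relevance show the same bias as humans: whether high relevance scores for earlier documents lead to lower scores for later ones, and vice versa. The results show that LLMs are indeed subject to threshold priming, which suggests that potential human-like cognitive biases should be considered and handled when designing and evaluating LLMs.

Link: https://arxiv.org/abs/2409.16022
Authors: Nuo Chen, Jiqun Liu, Xiaoyu Dong, Qijiong Liu, Tetsuya Sakai, Xiao-Ming Wu
Keywords: Cognitive biases, problematic decision-making, extensively studied, systematic deviations, deviations in thinking
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
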
Abstract: Cognitive biases are systematic deviations in thinking that lead to irrational judgments and problematic decision-making, extensively studied across various fields. Recently, large language models (LLMs) have shown advanced understanding capabilities but may inherit human biases from their training data. While social biases in LLMs have been well-studied, cognitive biases have received less attention, with existing research focusing on specific scenarios. The broader impact of cognitive biases on LLMs in various decision-making contexts remains underexplored. We investigated whether LLMs are influenced by the threshold priming effect in relevance judgments, a core task and widely-discussed research topic in the Information Retrieval (IR) community. The priming effect occurs when exposure to certain stimuli unconsciously affects subsequent behavior and decisions. Our experiment employed 10 topics from the TREC 2019 Deep Learning passage track collection, and tested AI judgments under different document relevance scores, batch lengths, and LLM models, including GPT-3.5, GPT-4, LLaMa2-13B and LLaMa2-70B. Results showed that LLMs tend to give lower scores to later documents if earlier ones have high relevance, and vice versa, regardless of the combination and model used. Our finding demonstrates that LLM’s judgments, similar to human judgments, are also influenced by threshold priming biases, and suggests that researchers and system engineers should take into account potential human-like cognitive biases in designing, evaluating, and auditing LLMs in IR tasks and beyond.

[NLP-12] Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs

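[Quick read]: This paper addresses how to use large language models (LLMs) effectively for automatic speech recognition (ASR). The key to the solution is a new training approach: in a pre-training stage, the LLM learns to generate Chinese characters from Pinyin embedding sequences, so that it adapts to producing text from pronunciation features before it sees real speech data. LoRA parameters are then fine-tuned to strengthen the LLM's understanding of speech modality information. On the AISHELL-1 corpus the method gives a 9.5% relative improvement over the baseline, and adding auxiliary text data to the Pinyin-to-Character pre-training raises the relative improvement to 19.0%.

Link: https://arxiv.org/abs/2409.16005
Authors: Yang Yuhang, Peng Yizhou, Eng Siong Chng, Xionghu Zhong
Keywords: large language models, automatic speech recognition, pre-trained speech models, language models, ASR
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by ISCSLP2024-Special session-Speech Processing in LLM Era
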
Abstract: The integration of large language models (LLMs) with pre-trained speech models has opened up new avenues in automatic speech recognition (ASR). While LLMs excel in multimodal understanding tasks, effectively leveraging their capabilities for ASR remains a significant challenge. This paper presents a novel training approach to enhance LLM performance in ASR tasks. We propose pre-training LLMs on Pinyin embedding sequences, which represent pronunciation features, to generate corresponding Chinese characters. This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data. Furthermore, we fine-tune the LoRA parameters to enhance the LLM’s understanding of speech modality information. In AISHELL-1 corpus, our approach yields a 9.5% relative improvement in ASR tasks compared to the baseline without Pinyin-to-Character pre-training. Additionally, incorporating auxiliary text data for Pinyin-to-Character pre-training further boosts performance, achieving a 19.0% relative improvement.

[NLP-13] Finetuning LLMs for Comparative Assessment Tasks

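【Quick read】: This paper addresses automated assessment in natural language generation, in particular reference-free comparative assessment with instruction-tuned LLMs, whose scalability is limited by the quadratic cost of pairwise comparisons. The key to the solution is a framework for fine-tuning LLMs for comparative assessment so that the model's output aligns with the target distribution of comparative probabilities. Training on soft probabilities improves on state-of-the-art performance and retains high performance when only an efficient subset of comparisons is used.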

Link: https://arxiv.org/abs/2409.15979
Authors: Vatsal Raina, Adian Liusie, Mark Gales
Keywords: natural language generation, Automated assessment, challenging task, Automated, natural language
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 5 figures, 6 tables

Abstract:Automated assessment in natural language generation is a challenging task. Instruction-tuned large language models (LLMs) have shown promise in reference-free evaluation, particularly through comparative assessment. However, the quadratic computational complexity of pairwise comparisons limits its scalability. To address this, efficient comparative assessment has been explored by applying comparative strategies on zero-shot LLM probabilities. We propose a framework for finetuning LLMs for comparative assessment to align the model’s output with the target distribution of comparative probabilities. By training on soft probabilities, our approach improves state-of-the-art performance while maintaining high performance with an efficient subset of comparisons.
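To make the cost argument concrete, the sketch below counts the pairwise comparisons needed for N candidates, ranks candidates from a partial set of comparison probabilities, and evaluates a cross-entropy against a soft target. It is only an illustration under assumptions: the function names, the average-win-probability ranking rule, and the example probabilities are hypothetical and are not taken from the paper.

```python
import math


def num_pairwise_comparisons(n: int) -> int:
    """Number of unordered candidate pairs; grows quadratically with n."""
    return n * (n - 1) // 2


def rank_by_win_probability(probs: dict, n: int) -> list:
    """Rank n candidates from a (possibly partial) set of comparisons.

    probs maps (i, j) to an assumed probability that candidate i is better
    than candidate j. Candidates are ordered by average win probability.
    """
    totals = [0.0] * n
    counts = [0] * n
    for (i, j), p in probs.items():
        totals[i] += p
        counts[i] += 1
        totals[j] += 1.0 - p
        counts[j] += 1
    # Candidates that were never compared get a neutral score of 0.5.
    scores = [t / c if c else 0.5 for t, c in zip(totals, counts)]
    return sorted(range(n), key=lambda k: scores[k], reverse=True)


def soft_cross_entropy(predicted: float, target: float) -> float:
    """Binary cross-entropy against a soft (non 0/1) target probability."""
    return -(target * math.log(predicted)
             + (1.0 - target) * math.log(1.0 - predicted))


# Hypothetical comparison probabilities for three candidate texts.
example_probs = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.7}
ranking = rank_by_win_probability(example_probs, 3)
```

The soft cross-entropy is smallest when the predicted probability equals the soft target, which is the sense in which training on soft probabilities pulls the model toward a target distribution rather than toward hard 0/1 labels.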

[NLP-14] Beats of Bias: Analyzing Lyrics with Topic Modeling and Gender Bias Measurements

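【Quick read】: This paper analyzes gender bias in English song lyrics using topic modeling and bias measurement techniques. The key steps are clustering 537,553 English songs into topics with BERTopic and tracking how those topics develop over time. The analysis shows a thematic shift from romance toward the increasing sexualization of women and finds large amounts of profanity and misogynistic lyrics. The paper also uses the Single Category Word Embedding Association Test (SC-WEAT) to compute bias scores for word embeddings per topic and genre: words related to intelligence and strength tend to show a male bias across genres, whereas words related to appearance and weakness are more female-biased, although a closer look reveals differences in bias across topics.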

Link: https://arxiv.org/abs/2409.15949
Authors: Danqing Chen, Adithi Satish, Rasul Khanbayov, Carolin M. Schuster, Georg Groh
Keywords: English song lyrics, bias measurement techniques, English song, measurement techniques, song lyrics
Categories: Computation and Language (cs.CL)
Comments: Accepted and presented at the 17th International Conference on Social Computing, Behavioral-Cultural Modeling, Prediction and Behavior Representation in Modeling and Simulation (see this https URL)

Abstract:This paper uses topic modeling and bias measurement techniques to analyze and determine gender bias in English song lyrics. We utilize BERTopic to cluster 537,553 English songs into distinct topics and chart their development over time. Our analysis shows the thematic shift in song lyrics over the years, from themes of romance to the increasing sexualization of women in songs. We observe large amounts of profanity and misogynistic lyrics on various topics, especially in the overall biggest cluster. Furthermore, to analyze gender bias across topics and genres, we employ the Single Category Word Embedding Association Test (SC-WEAT) to compute bias scores for the word embeddings trained on the most popular topics as well as for each genre. We find that words related to intelligence and strength tend to show a male bias across genres, as opposed to appearance and weakness words, which are more female-biased; however, a closer look also reveals differences in biases across topics.
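As a rough illustration of how an SC-WEAT bias score can be computed, the sketch below uses the commonly cited effect-size form: the difference between a word's mean cosine similarity to two attribute sets, divided by the standard deviation of its similarities to all attribute words. The toy two-dimensional vectors and the choice of the sample standard deviation are assumptions made for illustration; the paper's embeddings and exact settings are not given in the abstract.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sc_weat_effect_size(word, attrs_a, attrs_b):
    """Effect size of one target word against attribute sets A and B.

    Positive values mean the word sits closer to set A, negative values
    mean it sits closer to set B.
    """
    sims_a = [cosine(word, a) for a in attrs_a]
    sims_b = [cosine(word, b) for b in attrs_b]
    return (mean(sims_a) - mean(sims_b)) / stdev(sims_a + sims_b)


# Toy vectors (hypothetical): the target word points along the first axis.
target = [1.0, 0.0]
set_a = [[1.0, 0.1], [1.0, 0.2]]   # attribute words near the target
set_b = [[0.1, 1.0], [0.2, 1.0]]   # attribute words far from the target
effect = sc_weat_effect_size(target, set_a, set_b)
```

Swapping the two attribute sets flips the sign of the score, which is why the direction of a reported bias depends on which set is treated as A.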

[NLP-15] Automated test generation to evaluate tool-augmented LLMs as conversational AI agents EMNLP2024

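【Quick read】: This paper addresses the difficulty of evaluating tool-augmented LLMs as conversational AI agents, which stems from the diversity of possible conversations and from existing datasets covering only single interactions and function calling. The key to the solution is a test generation framework that uses LLMs to generate diverse tests grounded in user-defined procedures, with intermediate graph structures that limit the generator's tendency to produce content unrelated to the input procedures and that enforce high coverage of the possible conversations. The paper also introduces ALMITA, a manually curated dataset for evaluating AI agents in customer support, and shows that existing tool-augmented LLMs struggle with complete conversations.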

Link: https://arxiv.org/abs/2409.15934
Authors: Samuel Arcadinho, David Aparicio, Mariana Almeida
Keywords: call appropriate functions, promising approach, approach to create, realistic conversations, follow procedures
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, Submitted to GenBench@EMNLP2024

Abstract:Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator’s tendency to hallucinate content that is not grounded on input procedures, and enforces high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and capable of AI agents for different domains.
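One way to picture why an intermediate graph helps with coverage: if a procedure is represented as a directed acyclic graph of steps, every start-to-end path corresponds to one conversation that a test could exercise. The sketch below enumerates those paths. The graph contents and function name are hypothetical, and the paper's actual graph format is not described in the abstract.

```python
def enumerate_paths(graph, start):
    """Return every path from start to a terminal node of an acyclic graph.

    graph maps a step name to the list of steps that may follow it.
    """
    successors = graph.get(start, [])
    if not successors:
        return [[start]]
    paths = []
    for nxt in successors:
        for tail in enumerate_paths(graph, nxt):
            paths.append([start] + tail)
    return paths


# Hypothetical customer-support procedure with one branching point.
procedure_graph = {
    "greet": ["ask_order_id"],
    "ask_order_id": ["check_refund_eligibility"],
    "check_refund_eligibility": ["issue_refund", "escalate_to_human"],
    "issue_refund": [],
    "escalate_to_human": [],
}
test_paths = enumerate_paths(procedure_graph, "greet")
covered_steps = {step for path in test_paths for step in path}
```

Generating one test per enumerated path guarantees that every step in the procedure appears in at least one conversation, which is a simple form of the coverage property the abstract mentions.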

[NLP-16] SLIMER-IT: Zero-Shot NER on Italian Language

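【Quick read】: This paper addresses zero-shot named entity recognition (Zero-Shot NER) in Italian, that is, recognizing new entity types without annotated data for them. The key to the solution is SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach that improves zero-shot NER by enriching prompts with definitions and guidelines. Experiments show that SLIMER-IT outperforms other state-of-the-art models on unseen entity tags.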

Link: https://arxiv.org/abs/2409.15933
Authors: Andrew Zamai, Leonardo Rigutini, Marco Maggini, Andrea Zugarini
Keywords: Named Entity Recognition, BIO sequence labeling, sequence labeling problem, Traditional approaches, approaches to Named
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Traditional approaches to Named Entity Recognition (NER) frame the task into a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definition and guidelines. Comparisons with other state-of-the-art models, demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.

[NLP-17] Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain

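【Quick read】: This paper addresses translation from Spanish into three low-resource languages of Spain (Aragonese, Aranese, and Asturian). The key to the solution is training and optimizing a neural machine translation model based on a deep Transformer architecture with strategies including multilingual transfer, regularized dropout, forward and back translation, LaBSE denoising, and transduction ensemble learning, which achieved competitive results in the final WMT 2024 evaluation.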

Link: https://arxiv.org/abs/2409.15924
Authors: Yuanchang Luo, Zhanglin Wu, Daimeng Wei, Hengchao Shang, Zongyao Li, Jiaxin Guo, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Yuhao Xie, Jiawei Zheng, Bin Wei, Hao Yang
Keywords: Translation Service Center, Huawei Translation Service, Service Center, Languages of Spain, Low-Resource Languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, wmt24. arXiv admin note: substantial text overlap with arXiv:2409.14842; text overlap with arXiv:2409.14800

Abstract:This article introduces the submission status of the Translation into Low-Resource Languages of Spain task at (WMT 2024) by Huawei Translation Service Center (HW-TSC). We participated in three translation tasks: spanish to aragonese (es-arg), spanish to aranese (es-arn), and spanish to asturian (es-ast). For these three translation tasks, we use training strategies such as multilingual transfer, regularized dropout, forward translation and back translation, labse denoising, transduction ensemble learning and other strategies to neural machine translation (NMT) model based on training deep transformer-big architecture. By using these enhancement strategies, our submission achieved a competitive result in the final evaluation.

[NLP-18] Explaining word embeddings with perfect fidelity: Case study in research impact prediction

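【Quick read】: This paper addresses the explainability of embedding-based models for scholarly document quality prediction: such models do not allow a classifier's decisions to be explained directly, because individual words no longer correspond to the model's input features. The proposed solution is a new feature importance method, Self-model Rated Entities (SMER), designed for logistic regression classifiers trained on word embeddings. Its key property is theoretically perfect fidelity: its prediction corresponds exactly to the average of the predictions for the individual words in the text, so it can reliably determine which words or entities contribute positively to predicting impactful articles. Five diverse experiments on 50,000 research papers from the CORD-19 corpus show that SMER produces better explanations than LIME.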

Link: https://arxiv.org/abs/2409.15912
Authors: Lucie Dvorackova, Marcin P. Joachimiak, Michal Cerny, Adriana Kubecova, Vilem Sklenak, Tomas Kliegr
Keywords: scholarly document quality, document quality prediction, Self-model Rated Entities, performing approaches, approaches for scholarly
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Best performing approaches for scholarly document quality prediction are based on embedding models, which do not allow direct explanation of classifiers as distinct words no longer correspond to the input features for model training. Although model-agnostic explanation methods such as Local interpretable model-agnostic explanations (LIME) can be applied, these produce results with questionable correspondence to the ML model. We introduce a new feature importance method, Self-model Rated Entities (SMER), for logistic regression-based classification models trained on word embeddings. We show that SMER has theoretically perfect fidelity with the explained model, as its prediction corresponds exactly to the average of predictions for individual words in the text. SMER allows us to reliably determine which words or entities positively contribute to predicting impactful articles. Quantitative and qualitative evaluation is performed through five diverse experiments conducted on 50.000 research papers from the CORD-19 corpus. Through an AOPC curve analysis, we experimentally demonstrate that SMER produces better explanations than LIME for logistic regression.
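The fidelity claim can be illustrated with a small numeric check. The sketch below assumes that a document is represented by the mean of its word embeddings and that "prediction" refers to the linear (pre-sigmoid) score of the logistic regression model; both are my reading of the abstract rather than details it confirms. Under those assumptions, the document score equals the average of the per-word scores because the score is linear in the input. All weights and vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_score(weights, bias, vector):
    """Pre-sigmoid score of a logistic regression model."""
    return dot(weights, vector) + bias


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical model parameters and word embeddings.
weights = [0.5, -1.0, 2.0]
bias = 0.25
word_vectors = {
    "the": [0.2, 0.2, 0.0],
    "vaccine": [1.0, 0.0, 0.5],
    "trial": [0.0, 1.0, 0.5],
}
document = ["the", "vaccine", "trial"]

# Score of the whole document (mean embedding) ...
doc_score = linear_score(
    weights, bias, mean_vector([word_vectors[w] for w in document]))
# ... and the average of the scores of its individual words.
per_word = [linear_score(weights, bias, word_vectors[w]) for w in document]
avg_word_score = sum(per_word) / len(per_word)
top_word = document[per_word.index(max(per_word))]
```

Because the two quantities coincide, each word's own score can be read as its contribution to the document-level score, without fitting a separate surrogate model as LIME does.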

[NLP-19] A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation

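【Quick read】: This paper addresses the optimization conflicts that multi-task learning introduces in Simultaneous Speech Translation (SimulST), which reduce efficiency and lead to high GPU memory consumption. The key to the solution is a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained module level and resolves them with gradient projection. Experiments show that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieves a 0.68 BLEU gain on offline tasks, and reduces GPU memory consumption by over 95% compared with other conflict mitigation methods.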

Link: https://arxiv.org/abs/2409.15911
Authors: Xiaoqian Liu, Yangfan Du, Jianjin Wang, Yuan Ge, Chen Xu, Tong Xiao, Guocheng Chen, Jingbo Zhu
Keywords: Simultaneous Speech Translation, streaming speech input, processing streaming speech, Simultaneous Speech, Speech Translation
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. The existing model-level conflict resolution methods are not well-suited for this task which exacerbates inefficiencies and leads to high GPU memory consumption. To address these challenges, we propose a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained modular level and resolves them utilizing gradient projection. Experimental results demonstrate that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks. Additionally, MGCM reduces GPU memory consumption by over 95% compared to other conflict mitigation methods, establishing it as a robust solution for SimulST tasks.
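The idea of detecting conflicts per module and resolving them by projection can be sketched as follows. This is a minimal illustration that assumes a PCGrad-style rule: two gradients conflict when their dot product is negative, and the primary-task gradient is then projected onto the plane orthogonal to the auxiliary-task gradient. Whether MGCM uses exactly this rule is not stated in the abstract, and the module names and gradient values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting_modules(primary, auxiliary):
    """Resolve gradient conflicts separately for each module.

    primary and auxiliary map a module name to that module's flattened
    gradient. Where the two gradients of a module conflict (negative dot
    product), the component of the primary gradient that opposes the
    auxiliary gradient is removed; other modules are left unchanged.
    """
    resolved = {}
    for name, grad in primary.items():
        aux = auxiliary[name]
        overlap = dot(grad, aux)
        if overlap < 0:
            scale = overlap / dot(aux, aux)
            grad = [g - scale * a for g, a in zip(grad, aux)]
        resolved[name] = grad
    return resolved


# Hypothetical per-module gradients: only the encoder is in conflict.
primary_grads = {"encoder": [1.0, 0.0], "decoder": [1.0, 1.0]}
auxiliary_grads = {"encoder": [-1.0, 1.0], "decoder": [1.0, 0.0]}
resolved_grads = project_conflicting_modules(primary_grads, auxiliary_grads)
```

Working module by module means only the conflicting parameter groups are modified, in contrast to a model-level method that treats the whole gradient vector as a single unit.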

[NLP-20] Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection ECAI2024

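【Quick read】: This paper addresses errors that LLMs make in Text-to-SQL, such as generating wrong table or column names and matching values to the wrong columns, which arise from hallucination and from a lack of domain-specific database knowledge (for example, table schema and cell values). The key to the solution is knowledge injection: incorporating domain-specific database knowledge into the LLM to improve its understanding of schema contents. Experiments show that pre-training LLMs on domain-specific database knowledge and then fine-tuning them on downstream Text-to-SQL tasks significantly improves the Execution Match (EX) and Exact Match (EM) metrics, reduces errors in generating column names and matching values to columns, and generalizes to many downstream Text-to-SQL tasks.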

Link: https://arxiv.org/abs/2409.15907
Authors: Xingyu Ma, Xin Tian, Lingxiang Wu, Xuepeng Wang, Xueming Tang, Jinqiao Wang
Keywords: Large Language Models, Large Language, evolution of Large, Language Models, subtask in semantic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ECAI 2024

Abstract:Text-to-SQL is a subtask in semantic parsing that has seen rapid progress with the evolution of Large Language Models (LLMs). However, LLMs face challenges due to hallucination issues and a lack of domain-specific database knowledge(such as table schema and cell values). As a result, they can make errors in generating table names, columns, and matching values to the correct columns in SQL statements. This paper introduces a method of knowledge injection to enhance LLMs’ ability to understand schema contents by incorporating prior knowledge. This approach improves their performance in Text-to-SQL tasks. Experimental results show that pre-training LLMs on domain-specific database knowledge and fine-tuning them on downstream Text-to-SQL tasks significantly improves the Execution Match (EX) and Exact Match (EM) metrics across various models. This effectively reduces errors in generating column names and matching values to the columns. Furthermore, the knowledge-injected models can be applied to many downstream Text-to-SQL tasks, demonstrating the generalizability of the approach presented in this paper.
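The difference between the two metrics can be shown with a small sketch built on Python's standard `sqlite3` module. It is a simplification under stated assumptions: Exact Match is approximated here by a whitespace- and case-normalized string comparison (benchmark implementations typically compare parsed query components), Execution Match compares unordered result sets, and the `singer` table and queries are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?, ?)",
        [(1, "Ana", 25), (2, "Ben", 41), (3, "Cleo", 33)],
    )
    return conn


def exact_match(predicted_sql, gold_sql):
    """Simplified EM: compare queries after normalizing case and spaces."""
    def normalize(sql):
        return " ".join(sql.lower().split())
    return normalize(predicted_sql) == normalize(gold_sql)


def execution_match(conn, predicted_sql, gold_sql):
    """Simplified EX: both queries must return the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to run (e.g. a wrong column name) is a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows) == sorted(gold_rows)


conn = build_demo_db()
gold = "SELECT name FROM singer WHERE age > 30"
equivalent = "SELECT name FROM singer WHERE NOT age <= 30"
wrong_column = "SELECT nme FROM singer WHERE age > 30"
```

Here `equivalent` passes Execution Match but fails Exact Match, while `wrong_column`, the kind of column-name error the abstract describes, fails both.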

[NLP-21] Konstruktor: A Strong Baseline for Simple Knowledge Graph Question Answering

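【Quick read】: This paper addresses the errors that modern LLMs still make on simple questions such as "Who is the author of Cinderella?", especially for rare entities. The key to the solution is Konstruktor, an efficient and robust approach that breaks the problem into three steps: entity extraction and linking, relation prediction, and querying the knowledge graph. By integrating language models and knowledge graphs, it exploits the power of the former and the interpretability of the latter. For relation detection, the most challenging step, a combination of relation classification/generation and ranking outperforms other methods, and the approach shows strong results on four datasets.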

Link: https://arxiv.org/abs/2409.15902
Authors: Maria Lysyuk, Mikhail Salnikov, Pavel Braslavski, Alexander Panchenko
Keywords: popular question types, author of Cinderella, Large Language Models, completely solved, modern Large Language
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 2 figures, 7 tables

Abstract:While being one of the most popular question types, simple questions such as “Who is the author of Cinderella?”, are still not completely solved. Surprisingly, even the most powerful modern Large Language Models are prone to errors when dealing with such questions, especially when dealing with rare entities. At the same time, as an answer may be one hop away from the question entity, one can try to develop a method that uses structured knowledge graphs (KGs) to answer such questions. In this paper, we introduce Konstruktor - an efficient and robust approach that breaks down the problem into three steps: (i) entity extraction and entity linking, (ii) relation prediction, and (iii) querying the knowledge graph. Our approach integrates language models and knowledge graphs, exploiting the power of the former and the interpretability of the latter. We experiment with two named entity recognition and entity linking methods and several relation detection techniques. We show that for relation detection, the most challenging step of the workflow, a combination of relation classification/generation and ranking outperforms other methods. We report Konstruktor’s strong results on four datasets.
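The three-step decomposition can be sketched with a toy pipeline. Everything below is a stand-in chosen for illustration: the one-triple knowledge graph, the alias table, and the keyword-based relation scorer are hypothetical, whereas the paper uses trained entity linking and relation detection models that the abstract does not specify in detail.

```python
# Toy knowledge graph of (entity, relation) -> answer triples.
TOY_KG = {("Cinderella", "author"): "Charles Perrault"}
ENTITY_ALIASES = {"cinderella": "Cinderella"}
RELATION_KEYWORDS = {"author": ["author", "wrote", "written"]}


def link_entity(question):
    """Step 1: find a known entity mentioned in the question."""
    text = question.lower()
    for alias, entity in ENTITY_ALIASES.items():
        if alias in text:
            return entity
    return None


def predict_relation(question):
    """Step 2: pick the relation whose keywords best match the question."""
    text = question.lower()
    best, best_hits = None, 0
    for relation, keywords in RELATION_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits > best_hits:
            best, best_hits = relation, hits
    return best


def answer(question):
    """Step 3: query the graph with the linked entity and relation."""
    entity = link_entity(question)
    relation = predict_relation(question)
    if entity is None or relation is None:
        return None
    return TOY_KG.get((entity, relation))


result = answer("Who is the author of Cinderella?")
```

Since the answer is one hop from the question entity, the final lookup is trivial once steps 1 and 2 succeed, which is consistent with the abstract naming relation detection as the hardest step.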

[NLP-22] HLB: Benchmarking LLMs Humanlikeness in Language Use

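【Quick read】: This paper addresses how to assess the humanlikeness of LLMs in real-world language use, that is, how closely they reproduce the richness and creativity of human language. The key to the solution is a comprehensive humanlikeness benchmark (HLB) that evaluates 20 LLMs with 10 psycholinguistic experiments covering sound, word, syntax, semantics, and discourse. Responses from over 2,000 human participants are compared with LLM outputs; a coding algorithm extracts the response distributions, and humanlikeness is quantified as the distributional similarity between humans and LLMs. The study finds that improvements in other performance metrics do not necessarily increase humanlikeness and sometimes reduce it. By bringing psycholinguistic methods to model evaluation, the benchmark offers the first framework for systematically assessing the humanlikeness of LLMs.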

Link: https://arxiv.org/abs/2409.15890
Authors: Xufeng Duan, Bei Xiao, Xuemei Tang, Zhenguang G. Cai
Keywords: training language models, generated dialogue, concerns have emerged, potentially losing, synthetic data
Categories: Computation and Language (cs.CL)
Comments:

Abstract: As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see this https URL). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.
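Quantifying humanlikeness through distributional similarity can be sketched as below. The abstract does not name the similarity measure, so this example assumes one reasonable choice: one minus the Jensen-Shannon divergence (base 2) between the human and model response distributions, which gives a score between 0 and 1. The response categories and counts are hypothetical.

```python
import math
from collections import Counter


def response_distribution(responses, categories):
    """Relative frequency of each coded response category."""
    counts = Counter(responses)
    total = len(responses)
    return [counts[c] / total for c in categories]


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def humanlikeness(human_responses, model_responses, categories):
    """1.0 for identical response distributions, 0.0 for disjoint ones."""
    p = response_distribution(human_responses, categories)
    q = response_distribution(model_responses, categories)
    return 1.0 - js_divergence(p, q)


# Hypothetical coded responses from one two-choice experiment.
categories = ["round", "spiky"]
humans = ["round"] * 8 + ["spiky"] * 2
model = ["round"] * 5 + ["spiky"] * 5
score = humanlikeness(humans, model, categories)
```

A score computed this way rewards matching the whole spread of human answers, not just the most common one, which is one reason a model that is more accurate on other metrics need not look more humanlike.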


[NLP-23] Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning

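【Quick read】: This paper addresses machine translation for low-resource Indian languages, specifically Assamese, Manipuri, Khasi, and Mizo. The key to the solution is two knowledge transfer strategies: for Assamese and Manipuri, fine-tuning the existing open-source IndicTrans2 model for bidirectional translation with English; for Khasi and Mizo, training a multilingual baseline model with additional English-Bengali bilingual data and then fine-tuning it, to exploit linguistic features shared among these languages. The reported BLEU scores demonstrate the effectiveness of transfer learning for low-resource machine translation.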

Link: https://arxiv.org/abs/2409.15879
Authors: Bin Wei, Jiawei Zhen, Zongyao Li, Zhanglin Wu, Daimeng Wei, Jiaxin Guo, Zhiqiang Rao, Shaojun Li, Yuanchang Luo, Hengchao Shang, Jinlong Yang, Yuhao Xie, Hao Yang
Keywords: Huawei Translation Center, Shared Task, Indian Languages Machine, Indian Languages, low-resource Indian languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, wmt24. arXiv admin note: substantial text overlap with arXiv:2409.14800

Abstract:This paper introduces the submission by Huawei Translation Center (HW-TSC) to the WMT24 Indian Languages Machine Translation (MT) Shared Task. To develop a reliable machine translation system for low-resource Indian languages, we employed two distinct knowledge transfer strategies, taking into account the characteristics of the language scripts and the support available from existing open-source models for Indian languages. For Assamese(as) and Manipuri(mn), we fine-tuned the existing IndicTrans2 open-source model to enable bidirectional translation between English and these languages. For Khasi (kh) and Mizo (mz), We trained a multilingual model as a baseline using bilingual data from these four language pairs, along with an additional about 8kw English-Bengali bilingual data, all of which share certain linguistic features. This was followed by fine-tuning to achieve bidirectional translation between English and Khasi, as well as English and Mizo. Our transfer learning experiments produced impressive results: 23.5 BLEU for en-as, 31.8 BLEU for en-mn, 36.2 BLEU for as-en, and 47.9 BLEU for mn-en on their respective test sets. Similarly, the multilingual model transfer learning experiments yielded impressive outcomes, achieving 19.7 BLEU for en-kh, 32.8 BLEU for en-mz, 16.1 BLEU for kh-en, and 33.9 BLEU for mz-en on their respective test sets. These results not only highlight the effectiveness of transfer learning techniques for low-resource languages but also contribute to advancing machine translation capabilities for low-resource Indian languages.

[NLP-24] Privacy Evaluation Benchmarks for NLP Models EMNLP2024

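【Quick read】: This paper addresses the assessment of privacy risks that NLP models face under privacy attacks, in particular the lack of systematic analysis and of a comprehensive understanding of the attacks' impact. The key to the solution is a privacy attack and defense evaluation benchmark covering conventional models and large language models (LLMs); it supports a variety of models, datasets, and protocols and includes standardized modules for comprehensive evaluation of attacks and defense strategies. The paper also studies the relationship between auxiliary data from different domains and the strength of privacy attacks, provides an improved attack method based on Knowledge Distillation (KD), and proposes a chained framework that lets practitioners chain multiple attacks to achieve a higher-level attack objective, together with corresponding defense and enhanced attack strategies.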

Link: https://arxiv.org/abs/2409.15868
Authors: Wei Huang, Yinggui Wang, Cen Chen
Keywords: obtain sensitive information, NLP models, attacks, NLP, attackers can obtain
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by the Findings of EMNLP 2024

Abstract:By inducing privacy attacks on NLP models, attackers can obtain sensitive information such as training data and model parameters, etc. Although researchers have studied, in-depth, several kinds of attacks in NLP models, they are non-systematic analyses. It lacks a comprehensive understanding of the impact caused by the attacks. For example, we must consider which scenarios can apply to which attacks, what the common factors are that affect the performance of different attacks, the nature of the relationships between different attacks, and the influence of various datasets and models on the effectiveness of the attacks, etc. Therefore, we need a benchmark to holistically assess the privacy risks faced by NLP models. In this paper, we present a privacy attack and defense evaluation benchmark in the field of NLP, which includes the conventional/small models and large language models (LLMs). This benchmark supports a variety of models, datasets, and protocols, along with standardized modules for comprehensive evaluation of attacks and defense strategies. Based on the above framework, we present a study on the association between auxiliary data from different domains and the strength of privacy attacks. And we provide an improved attack method in this scenario with the help of Knowledge Distillation (KD). Furthermore, we propose a chained framework for privacy attacks. Allowing a practitioner to chain multiple attacks to achieve a higher-level attack objective. Based on this, we provide some defense and enhanced attack strategies. The code for reproducing the results can be found at this https URL.

[NLP-25] BeSimulator: A Large Language Model Powered Text-based Behavior Simulator

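【Quick read】: This paper addresses the high computational cost, inefficiency, and limited adaptability of traditional robot simulators, which focus on physical process modeling and realistic rendering. The key to the solution is BeSimulator, a framework for behavior simulation that emphasizes checking robots' behavior logic and achieving sufficient alignment between the outcomes of robot actions and real scenarios. BeSimulator uses a modular, LLM-powered architecture and performs semantic-level simulation in text-based environments, which lets it generalize across scenarios and handle long-horizon complex simulation. Its core method is a "consider-decide-capture-transfer" Chain of Behavior Simulation, combined with code-driven reasoning and reflective feedback, which improves behavior simulation performance by 14.7% to 26.6% over baselines.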

Link: https://arxiv.org/abs/2409.15865
Authors: Jianan Wang, Bin Li, Xueying Wang, Fu Li, Yunlong Wu, Juan Chen, Xiaodong Yi
Keywords: Traditional robot simulators, high computational costs, physical process modeling, robot simulators focus, Traditional robot
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, 2 tables

Abstract:Traditional robot simulators focus on physical process modeling and realistic rendering, often suffering from high computational costs, inefficiencies, and limited adaptability. To handle this issue, we propose Behavior Simulation in robotics to emphasize checking the behavior logic of robots and achieving sufficient alignment between the outcome of robot actions and real scenarios. In this paper, we introduce BeSimulator, a modular and novel LLM-powered framework, as an attempt towards behavior simulation in the context of text-based environments. By constructing text-based virtual environments and performing semantic-level simulation, BeSimulator can generalize across scenarios and achieve long-horizon complex simulation. Inspired by human cognition processes, it employs a “consider-decide-capture-transfer” methodology, termed Chain of Behavior Simulation, which excels at analyzing action feasibility and state transitions. Additionally, BeSimulator incorporates code-driven reasoning to enable arithmetic operations and enhance reliability, as well as integrates reflective feedback to refine simulation. Based on our manually constructed behavior-tree-based simulation benchmark BTSIMBENCH, our experiments show a significant performance improvement in behavior simulation compared to baselines, ranging from 14.7% to 26.6%.

[NLP-26] A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding

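【Quick read】: This paper addresses the limitations of existing Dialogue State Tracking (DST) methods in adapting to new slot values, in particular their reliance on predefined ontologies and gold domain labels. The key to the solution is a zero-shot, open-vocabulary system that integrates domain classification and DST in a single pipeline, reformulates DST as a question-answering task for less capable models, and uses self-refining prompts for more adaptable ones, so that the system adapts dynamically without fixed slot values. The method achieves up to 20% better Joint Goal Accuracy (JGA) than previous methods on datasets such as MultiWOZ 2.1, with up to 90% fewer requests to the LLM API.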

Link: https://arxiv.org/abs/2409.15861
Authors: Abdulfattah Safa, Gözde Gül Şahin
Keywords: Dialogue State Tracking, State Tracking, Dialogue State, appropriate system actions, task-oriented dialogues
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Dialogue State Tracking (DST) is crucial for understanding user needs and executing appropriate system actions in task-oriented dialogues. Majority of existing DST methods are designed to work within predefined ontologies and assume the availability of gold domain labels, struggling with adapting to new slots values. While Large Language Models (LLMs)-based systems show promising zero-shot DST performance, they either require extensive computational resources or they underperform existing fully-trained systems, limiting their practicality. To address these limitations, we propose a zero-shot, open-vocabulary system that integrates domain classification and DST in a single pipeline. Our approach includes reformulating DST as a question-answering task for less capable models and employing self-refining prompts for more adaptable ones. Our system does not rely on fixed slot values defined in the ontology allowing the system to adapt dynamically. We compare our approach with existing SOTA, and show that it provides up to 20% better Joint Goal Accuracy (JGA) over previous methods on datasets like MultiWOZ 2.1, with up to 90% fewer requests to the LLM API.
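For readers unfamiliar with the metric, Joint Goal Accuracy counts a dialogue turn as correct only if the entire predicted state matches the gold state. The sketch below shows that computation; the slot names, the values, and the light normalization (trimming and lowercasing) are assumptions for illustration and may differ from the evaluation script the authors used.

```python
def normalize_state(state):
    """Lowercase and trim slot values so trivial differences are ignored."""
    return {slot: value.strip().lower() for slot, value in state.items()}


def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold_states:
        return 0.0
    correct = sum(
        1
        for predicted, gold in zip(predicted_states, gold_states)
        if normalize_state(predicted) == normalize_state(gold)
    )
    return correct / len(gold_states)


# Hypothetical three-turn dialogue; the last turn has one wrong slot value.
gold_turns = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-parking": "yes"},
]
predicted_turns = [
    {"hotel-area": "Centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-parking": "no"},
]
jga = joint_goal_accuracy(predicted_turns, gold_turns)
```

Because a single wrong slot makes the whole turn count as incorrect, JGA is a strict metric.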

[NLP-27] iGAiVA: Integrated Generative AI and Visual Analytics in a Machine Learning Workflow for Text Classification

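[Quick Read]: The paper addresses a common problem in building machine learning models for text classification: the collected data is often not ideally distributed, especially when new classes are introduced in response to changes in data and tasks. The key to the solution is using visual analytics (VA) to guide synthetic data generation with large language models. VA helps model developers identify data-related deficiencies, and data synthesis is then targeted at those deficiencies to improve model accuracy. The paper also presents a software tool, iGAiVA, which maps four groups of ML tasks into four VA views and integrates generative AI and VA into an ML workflow for developing and improving text classification models.

Link: https://arxiv.org/abs/2409.15848
Authors: Yuanzhe Jin, Adrian Carrasco-Revilla, Min Chen
Keywords: developing machine learning, machine learning, ideally distributed, common challenge, classes are introduced
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: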

Abstract:In developing machine learning (ML) models for text classification, one common challenge is that the collected data is often not ideally distributed, especially when new classes are introduced in response to changes of data and tasks. In this paper, we present a solution for using visual analytics (VA) to guide the generation of synthetic data using large language models. As VA enables model developers to identify data-related deficiency, data synthesis can be targeted to address such deficiency. We discuss different types of data deficiency, describe different VA techniques for supporting their identification, and demonstrate the effectiveness of targeted data synthesis in improving model accuracy. In addition, we present a software tool, iGAiVA, which maps four groups of ML tasks into four VA views, integrating generative AI and VA into an ML workflow for developing and improving text classification models.

[NLP-28] Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability

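[Quick Read]: The paper investigates how large language models (LLMs) capture aspects of language competence at the neuron level. It applies psycholinguistic paradigms to GPT-2-XL across three tasks: sound-shape association, sound-gender association, and implicit causality. GPT-2-XL struggles with the sound-shape task but shows human-like abilities on the other two. Targeted neuron ablation and activation manipulation indicate that when the model displays a linguistic ability, specific neurons correspond to that competence, while the absence of an ability indicates a lack of specialized neurons. The authors describe this as the first study to use psycholinguistic experiments to probe deep language competence at the neuron level, adding a new level of granularity to the interpretability of transformer-based LLMs.

Link: https://arxiv.org/abs/2409.15827
Authors: Xufeng Duan, Xinyu Zhou, Bei Xiao, Zhenguang G. Cai
Keywords: significant challenge, large language models, remains a significant, language competence remains, implicit causality
Categories: Computation and Language (cs.CL)
Comments: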

Abstract:As large language models (LLMs) become advance in their linguistic capacity, understanding how they capture aspects of language competence remains a significant challenge. This study therefore employs psycholinguistic paradigms, which are well-suited for probing deeper cognitive aspects of language processing, to explore neuron-level representations in language model across three tasks: sound-shape association, sound-gender association, and implicit causality. Our findings indicate that while GPT-2-XL struggles with the sound-shape task, it demonstrates human-like abilities in both sound-gender association and implicit causality. Targeted neuron ablation and activation manipulation reveal a crucial relationship: when GPT-2-XL displays a linguistic ability, specific neurons correspond to that competence; conversely, the absence of such an ability indicates a lack of specialized neurons. This study is the first to utilize psycholinguistic experiments to investigate deep language competence at the neuron level, providing a new level of granularity in model interpretability and insights into the internal mechanisms driving language ability in transformer based LLMs.
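
To make "targeted neuron ablation" concrete, here is a toy sketch on a hand-written two-layer network. It is not GPT-2-XL and not the authors' procedure; the weights, input, and the `forward` helper are all hypothetical. The idea it illustrates is only that zeroing one hidden unit and measuring the change in the output gives a per-neuron effect size.

```python
from typing import List, Optional


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def forward(x: List[float], w_in: List[List[float]], w_out: List[float],
            ablate: Optional[int] = None) -> float:
    """Run a tiny ReLU network; optionally zero out one hidden neuron."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if ablate is not None:
        hidden[ablate] = 0.0  # ablation: silence this neuron's activation
    return sum(w * h for w, h in zip(w_out, hidden))


# Hypothetical toy weights: 2 inputs, 3 hidden neurons, 1 output.
x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
w_out = [0.5, 2.0, 0.0]

baseline = forward(x, w_in, w_out)
# Effect of each neuron = drop in output when that neuron is ablated.
effects = [baseline - forward(x, w_in, w_out, ablate=i) for i in range(len(w_in))]
```

In this toy case the second neuron carries most of the output, while the third is active but has no downstream effect, which is the kind of distinction an ablation study is meant to surface.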

[NLP-29] Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

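[Quick Read]: The paper asks how to fine-tune large language models (LLMs) effectively for question answering (QA). The key is to categorize supervised fine-tuning (SFT) data by the extent to which the pretrained LLM has already memorized the underlying knowledge, and then to analyze three factors empirically across four LLMs from three model families: the amount of data SFT requires, the impact of different SFT datasets on performance, and how data requirements vary across LLMs. The results show that as few as 60 data points can activate knowledge encoded during pre-training and enable the QA task, and that SFT data at different memory levels significantly affects performance, with the optimal dataset depending on the model being fine-tuned.

Link: https://arxiv.org/abs/2409.15825
Authors: Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Keywords: Large language models, encode extensive world, Large language, extensive world knowledge, encode extensive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: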

Abstract:Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.

[NLP-30] Supervised Fine-Tuning: An Activation Pattern Optimization Process for Attention Heads

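[Quick Read]: The paper addresses the unsatisfactory performance of LLMs on complex tasks such as advanced mathematics and complex disease diagnosis, where instruction data is scarce and hard to collect. Its starting point is that LLMs learn simpler tasks quickly when pretraining has supplied adequate prior knowledge, so clarifying the mechanism behind that rapid generalization could help with complex tasks. Using a gradient-based method, the authors analyze how supervised fine-tuning (SFT) adapts LLMs to downstream tasks from the perspective of attention patterns. They find that (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly affect activation patterns after SFT on a small number of samples. They then test whether these findings can improve the efficiency and effectiveness of SFT, particularly for complex tasks and when instructional resources are scarce.

Link: https://arxiv.org/abs/2409.15820
Authors: Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, Bing Qin
Keywords: demonstrating promising potential, complex disease diagnosis, promising potential, complex tasks, demonstrating promising
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: in review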

Abstract:Though demonstrating promising potential, LLMs’ performance on complex tasks, such as advanced mathematics and complex disease diagnosis is still unsatisfactory. A key issue is the present LLMs learn in a data-driven schema, while the instruction dataset about these complex tasks is both scarce and hard to collect or construct. On the contrary, a prominent phenomenon is that LLMs can learn rather fast on those simpler tasks with adequate prior knowledge captured during pretraining stage. Thus, if the prerequisite and mechanism of such rapid generalization could be elucidated, it could be highly beneficial in enhancing the efficiency and effectiveness of the LLM’s ability to learn complex tasks. Thus, in this paper, we employ a gradient-based method, to dissect the process that the SFT process adapts LLMs to downstream tasks via the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, we conduct experiments to examine whether these conclusions could effectively enhance the efficiency and effectiveness of SFT, particularly in handling complex tasks and when instructional resources are scarce. Our research not only uncovers the underlying reasons behind LLMs’ rapid learning and generalization mechanisms but also provides practical solutions for addressing data challenges in complex and specialized tasks.

[NLP-31] AsthmaBot: Multi-modal Multi-Lingual Retrieval Augmented Generation For Asthma Patient Support

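[Quick Read]: The paper addresses how asthma patients can obtain accurate information from an automated support system when access to immediate medical care is limited. The key to the solution is AsthmaBot, a multi-lingual, multi-modal retrieval-augmented generation system that integrates curated documents to improve large language model performance and reduce factually incorrect responses (hallucinations). The authors report its efficacy on an asthma-related frequently-asked-questions dataset, and its interactive interface integrates text, images, and videos to make it accessible to the wider public.

Link: https://arxiv.org/abs/2409.15815
Authors: Adil Bahaj, Mounir Ghogho
Keywords: Chat Generative Pre-trained, Generative Pre-trained Transformer, risen globally, driven by environmental, lifestyle factors
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages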

Abstract: Asthma rates have risen globally, driven by environmental and lifestyle factors. Access to immediate medical care is limited, particularly in developing countries, necessitating automated support systems. Large Language Models like ChatGPT (Chat Generative Pre-trained Transformer) and Gemini have advanced natural language processing in general and question answering in particular, however, they are prone to producing factually incorrect responses (i.e. hallucinations). Retrieval-augmented generation systems, integrating curated documents, can improve large language models’ performance and reduce the incidence of hallucination. We introduce AsthmaBot, a multi-lingual, multi-modal retrieval-augmented generation system for asthma support. Evaluation of an asthma-related frequently asked questions dataset shows AsthmaBot’s efficacy. AsthmaBot has an added interactive and intuitive interface that integrates different data modalities (text, images, videos) to make it accessible to the larger public. AsthmaBot is available online via this http URL.

[NLP-32] NER-Luxury: Named entity recognition for the fashion and luxury domain

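[Quick Read]: The paper addresses several challenges in developing an English named-entity recognition (NER) model for the fashion and luxury industry: entity disambiguation, French technical jargon, the scarcity of ESG methodology, and the sector's disparate company structures, which range from small and medium-sized luxury houses to large conglomerates. The key to the solution is a luxury-oriented taxonomy of 36+ entity types and a dataset of more than 40K sentences that follows a clear hierarchical classification. The paper also presents five supervised fine-tuned NER-Luxury models for different luxury sub-sectors and, in a quantitative assessment, compares them against state-of-the-art open-source large language models, highlighting the benefits of adding a bespoke NER model to existing machine learning pipelines.

Link: https://arxiv.org/abs/2409.15804
Authors: Akim Mousterou
Keywords: French technical jargon, address multiple challenges, disparate company structures, conglomerate leveraging economy, medium-sized luxury houses
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 6 figures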

Abstract: In this study, we address multiple challenges of developing a named-entity recognition model in English for the fashion and luxury industry, namely the entity disambiguation, French technical jargon in multiple sub-sectors, scarcity of the ESG methodology, and a disparate company structures of the sector with small and medium-sized luxury houses to large conglomerate leveraging economy of scale. In this work, we introduce a taxonomy of 36+ entity types with a luxury-oriented annotation scheme, and create a dataset of more than 40K sentences respecting a clear hierarchical classification. We also present five supervised fine-tuned models NER-Luxury for fashion, beauty, watches, jewelry, fragrances, cosmetics, and overall luxury, focusing equally on the aesthetic side and the quantitative side. In an additional experiment, we compare in a quantitative empirical assessment of the NER performance of our models against the state-of-the-art open-source large language models that show promising results and highlights the benefits of incorporating a bespoke NER model in existing machine learning pipelines.

[NLP-33] Small Language Models: Survey Measurements and Insights

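[Quick Read]: The paper addresses the limited academic attention given to small language models (SLMs) compared with large language models, despite their widespread adoption in smart devices. The key is a survey of 59 state-of-the-art open-source SLMs (transformer-based, decoder-only, 100M-5B parameters), analyzed along three axes: architectures, training datasets, and training algorithms. The authors also evaluate the models' capabilities in several domains and benchmark on-device inference latency and memory footprint to offer insights for further research.

Link: https://arxiv.org/abs/2409.15790
Authors: Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, Mengwei Xu
Keywords: modern smart devices, academic attention compared, Small language models, large language model, Small language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: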

Abstract:Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.

[NLP-34] CHBench: A Chinese Dataset for Evaluating Health in Large Language Models

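[Quick Read]: The paper addresses how to evaluate large language models (LLMs) on health-related queries. The key to the solution is CHBench, described as the first comprehensive Chinese health-related benchmark, which evaluates how well LLMs understand physical and mental health across diverse scenarios. CHBench contains 6,493 entries on mental health and 2,999 entries on physical health, covering a broad range of topics. Evaluations of four popular Chinese LLMs show considerable room for improvement in their understanding of health-related information.

Link: https://arxiv.org/abs/2409.15766
Authors: Chenlu Guo, Nuo Xu, Yi Chang, Yuan Wu
Keywords: large language models, assessing their performance, increasingly essential, rapid development, development of large
Categories: Computation and Language (cs.CL)
Comments: 11 pages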

Abstract:With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. It is critical that these models provide accurate and trustworthy health information, as their application in real-world contexts–where misinformation can have serious consequences for individuals seeking medical advice and support–depends on their reliability. In this work, we present CHBench, the first comprehensive Chinese Health-related Benchmark designed to evaluate LLMs’ capabilities in understanding physical and mental health across diverse scenarios. CHBench includes 6,493 entries related to mental health and 2,999 entries focused on physical health, covering a broad spectrum of topics. This dataset serves as a foundation for evaluating Chinese LLMs’ capacity to comprehend and generate accurate health-related information. Our extensive evaluations of four popular Chinese LLMs demonstrate that there remains considerable room for improvement in their understanding of health-related information. The code is available at this https URL.

[NLP-35] XTRUST: On the Multilingual Trustworthiness of Large Language Models

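[Quick Read]: The paper addresses the trustworthiness of large language models (LLMs) in multilingual settings, which matters most in sensitive fields such as healthcare and finance. The key to the solution is XTRUST, described as the first comprehensive multilingual trustworthiness benchmark. It covers illegal activities, hallucination, out-of-distribution (OOD) robustness, physical and mental health, toxicity, fairness, misinformation, privacy, and machine ethics across 10 languages. Using XTRUST, the authors evaluate five widely used LLMs and find that many of them struggle with certain low-resource languages, such as Arabic and Russian, which points to considerable room for improvement in multilingual trustworthiness.

Link: https://arxiv.org/abs/2409.15762
Authors: Yahan Li, Yi Wang, Yi Chang, Yuan Wu
Keywords: Large language models, demonstrated remarkable capabilities, natural language processing, Large language, capturing the attention
Categories: Computation and Language (cs.CL)
Comments: 21 pages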

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a range of natural language processing (NLP) tasks, capturing the attention of both practitioners and the broader public. A key question that now preoccupies the AI community concerns the capabilities and limitations of these models, with trustworthiness emerging as a central issue, particularly as LLMs are increasingly applied in sensitive fields like healthcare and finance, where errors can have serious consequences. However, most previous studies on the trustworthiness of LLMs have been limited to a single language, typically the predominant one in the dataset, such as English. In response to the growing global deployment of LLMs, we introduce XTRUST, the first comprehensive multilingual trustworthiness benchmark. XTRUST encompasses a diverse range of topics, including illegal activities, hallucination, out-of-distribution (OOD) robustness, physical and mental health, toxicity, fairness, misinformation, privacy, and machine ethics, across 10 different languages. Using XTRUST, we conduct an empirical evaluation of the multilingual trustworthiness of five widely used LLMs, offering an in-depth analysis of their performance across languages and tasks. Our results indicate that many LLMs struggle with certain low-resource languages, such as Arabic and Russian, highlighting the considerable room for improvement in the multilingual trustworthiness of current language models. The code is available at this https URL.

[NLP-36] Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens ICASSP2025

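[Quick Read]: The paper addresses speech recognition in multi-speaker scenarios where the number of speakers is unknown and utterances overlap. The key to the solution is an attention-based encoder-decoder model augmented with special speaker class tokens obtained by speaker clustering. During inference, the system selects multiple recognition hypotheses conditioned on predicted speaker cluster tokens and merges them with agglomerative hierarchical clustering (AHC) based on normalized edit distance, which yields multi-speaker transcriptions with the number of speakers determined by AHC. On LibriMix, the method is particularly effective in complex 3-mix conditions, with a 55% relative error reduction on clean data and 36% on noisy data compared with conventional serialized output training.

Link: https://arxiv.org/abs/2409.15732
Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe
Keywords: real-world scenarios, utterances often overlap, number of participants, unknown number, relative error reduction
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025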

Abstract:In many real-world scenarios, such as meetings, multiple speakers are present with an unknown number of participants, and their utterances often overlap. We address these multi-speaker challenges by a novel attention-based encoder-decoder method augmented with special speaker class tokens obtained by speaker clustering. During inference, we select multiple recognition hypotheses conditioned on predicted speaker cluster tokens, and these hypotheses are merged by agglomerative hierarchical clustering (AHC) based on the normalized edit distance. The clustered hypotheses result in the multi-speaker transcriptions with the appropriate number of speakers determined by AHC. Our experiments on the LibriMix dataset demonstrate that our proposed method was particularly effective in complex 3-mix environments, achieving a 55% relative error reduction on clean data and a 36% relative error reduction on noisy data compared with conventional serialized output training.
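
The merging step described in this entry can be sketched as follows: compute a word-level edit distance between hypotheses, normalize it by the longer hypothesis, and merge clusters bottom-up until the closest pair is farther apart than a threshold. This is a minimal illustration, not the authors' implementation. The average linkage, the normalization by the longer sequence, the threshold of 0.5, and the example hypotheses are all assumptions.

```python
from typing import List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer hypothesis length (assumed normalization)."""
    ta, tb = a.split(), b.split()
    longest = max(len(ta), len(tb))
    return edit_distance(ta, tb) / longest if longest else 0.0


def cluster(hyps: List[str], threshold: float) -> List[List[int]]:
    """Average-linkage agglomerative clustering; stops at the distance threshold."""
    clusters = [[i] for i in range(len(hyps))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(normalized_distance(hyps[i], hyps[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        dist, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical hypotheses: two near-duplicates per underlying utterance.
hyps = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "please open the door",
    "please open that door",
]
groups = cluster(hyps, threshold=0.5)
```

The number of surviving clusters plays the role of the estimated speaker count, so the threshold is the main tuning knob in a sketch like this.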

[NLP-37] Federated Large Language Models: Current Progress and Future Directions

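[Quick Read]: The paper surveys federated learning (FL) for large language models (LLMs) and the main challenges it raises, including model convergence issues caused by heterogeneous data and high communication costs. It focuses on two aspects, fine-tuning and prompt learning in a federated setting, reviewing existing work and the associated research challenges. It closes by proposing research directions for federated LLMs, including pre-training and ways in which LLMs can further enhance federated learning.

Link: https://arxiv.org/abs/2409.15723
Authors: Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong
Keywords: Large language models, rapidly gaining popularity, Large language, real-world applications, rapidly gaining
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: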

Abstract:Large language models are rapidly gaining popularity and have been widely adopted in real-world applications. While the quality of training data is essential, privacy concerns arise during data collection. Federated learning offers a solution by allowing multiple clients to collaboratively train LLMs without sharing local data. However, FL introduces new challenges, such as model convergence issues due to heterogeneous data and high communication costs. A comprehensive study is required to address these challenges and guide future research. This paper surveys Federated learning for LLMs (FedLLM), highlighting recent advances and future directions. We focus on two key aspects: fine-tuning and prompt learning in a federated setting, discussing existing work and associated research challenges. We finally propose potential research directions for federated LLMs, including pre-training and how LLMs can further enhance federated learning.

[NLP-38] Making Text Embedders Few-Shot Learners

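[Quick Read]: The paper asks how the in-context learning (ICL) ability of large language models (LLMs) can be used to improve text embeddings. The key to the solution is a new model, bge-en-icl, which places task-related few-shot examples directly on the query side and reports new state-of-the-art (SOTA) results on the MTEB and AIR-Bench benchmarks. The study also examines how best to use LLMs as embedding models, including attention mechanisms and pooling methods, and finds that retaining the original framework often works best.

Link: https://arxiv.org/abs/2409.15700
Authors: Chaofan Li, MingHao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, Zheng Liu
Keywords: Large language models, remarkable in-context learning, Large language, decoder-only architectures demonstrate, architectures demonstrate remarkable
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: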

Abstract:Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-context learning (ICL) capabilities. This feature enables them to effectively handle both familiar and novel tasks by utilizing examples provided within their input context. Recognizing the potential of this capability, we propose leveraging the ICL feature in LLMs to enhance the process of text embedding generation. To this end, we introduce a novel model bge-en-icl, which employs few-shot examples to produce high-quality text embeddings. Our approach integrates task-related examples directly into the query side, resulting in significant improvements across various tasks. Additionally, we have investigated how to effectively utilize LLMs as embedding models, including various attention mechanisms, pooling methods, etc. Our findings suggest that retaining the original framework often yields the best results, underscoring that simplicity is best. Experimental results on the MTEB and AIR-Bench benchmarks demonstrate that our approach sets new state-of-the-art (SOTA) performance. Our model, code and dataset are freely available at this https URL .
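
"Integrating task-related examples directly into the query side" can be pictured as building one text input that contains a few solved examples followed by the real query, and embedding that whole string. The sketch below only shows the string construction. The `Instruct:` / `Query:` / `Response:` template and the `build_icl_query` helper are hypothetical, and the actual prompt format used by bge-en-icl may differ.

```python
from typing import List, Tuple


def build_icl_query(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend few-shot (query, response) demonstrations to the real query."""
    parts = []
    for ex_query, ex_response in examples:
        parts.append(f"Instruct: {task}\nQuery: {ex_query}\nResponse: {ex_response}")
    # The real query comes last and has no response; the embedder encodes the whole text.
    parts.append(f"Instruct: {task}\nQuery: {query}")
    return "\n\n".join(parts)


# Hypothetical retrieval task with two demonstrations.
task = "Given a web search query, retrieve relevant passages."
examples = [
    ("what causes tides", "Tides are caused by the gravitational pull of the moon and sun."),
    ("why is the sky blue", "Shorter wavelengths of sunlight scatter more in the atmosphere."),
]
prompt = build_icl_query(task, examples, "how do vaccines work")
```

Only the query side changes in this setup; documents would be embedded as usual, which keeps the index independent of the chosen demonstrations.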

[NLP-39] Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation

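[Quick Read]: The paper addresses the cost and effectiveness problems of existing retrieval-augmented generation (RAG) systems. Encoding lengthy retrieved contexts imposes substantial computational overhead, directly using generic large language models (LLMs) often gives sub-optimal answers, and task-specific fine-tuning may compromise the LLMs' general capabilities. The proposed solution, FlexRAG (Flexible Context Adaptation for RAG), compresses retrieved contexts into compact embeddings before the LLM encodes them and optimizes those embeddings for downstream RAG performance. Its flexibility supports diverse compression ratios and selective preservation of important contexts, so it achieves better generation quality while significantly reducing running costs.

Link: https://arxiv.org/abs/2409.15699
Authors: Zheng Liu, Chenyuan Wu, Ninglu Shao, Shitao Xiao, Chaozhuo Li, Defu Lian
Keywords: existing Retrieval-Augmented Generation, face significant challenges, systems face significant, Large Language Models, existing Retrieval-Augmented
Categories: Computation and Language (cs.CL)
Comments: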

Abstract:The existing Retrieval-Augmented Generation (RAG) systems face significant challenges in terms of cost and effectiveness. On one hand, they need to encode the lengthy retrieved contexts before responding to the input tasks, which imposes substantial computational overhead. On the other hand, directly using generic Large Language Models (LLMs) often leads to sub-optimal answers, while task-specific fine-tuning may compromise the LLMs’ general capabilities. To address these challenges, we introduce a novel approach called FlexRAG (Flexible Context Adaptation for RAG). In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs. Simultaneously, these compressed embeddings are optimized to enhance downstream RAG performance. A key feature of FlexRAG is its flexibility, which enables effective support for diverse compression ratios and selective preservation of important contexts. Thanks to these technical designs, FlexRAG achieves superior generation quality while significantly reducing running costs. Comprehensive experiments on various question-answering datasets validate our approach as a cost-effective and flexible solution for RAG systems.

[NLP-40] A Survey of Stance Detection on Social Media: New Directions and Perspectives

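[Quick Read]: The paper surveys stance detection on social media, the automatic detection of user stances in online conversations, as a way to understand public attitudes toward complex issues. It covers task definitions, datasets, approaches, and future work, reviewing both traditional stance detection models and state-of-the-art methods based on large language models along with their strengths and limitations. It highlights the need for more robust and generalizable models and points to emerging challenges such as multi-modal stance detection and stance detection in low-resource languages.

Link: https://arxiv.org/abs/2409.15690
Authors: Bowen Zhang, Genan Dai, Fuqiang Niu, Nan Yin, Xiaomao Fan, Hu Huang
Keywords: modern digital environments, frequently express opinions, users frequently express, stance detection, digital environments
Categories: Computation and Language (cs.CL)
Comments: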

Abstract:In modern digital environments, users frequently express opinions on contentious topics, providing a wealth of information on prevailing attitudes. The systematic analysis of these opinions offers valuable insights for decision-making in various sectors, including marketing and politics. As a result, stance detection has emerged as a crucial subfield within affective computing, enabling the automatic detection of user stances in social media conversations and providing a nuanced understanding of public sentiment on complex issues. Recent years have seen a surge of research interest in developing effective stance detection methods, with contributions from multiple communities, including natural language processing, web science, and social computing. This paper provides a comprehensive survey of stance detection techniques on social media, covering task definitions, datasets, approaches, and future works. We review traditional stance detection models, as well as state-of-the-art methods based on large language models, and discuss their strengths and limitations. Our survey highlights the importance of stance detection in understanding public opinion and sentiment, and identifies gaps in current research. We conclude by outlining potential future directions for stance detection on social media, including the need for more robust and generalizable models, and the importance of addressing emerging challenges such as multi-modal stance detection and stance detection in low-resource languages.

[NLP-41] Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint

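[Quick Read]: The paper addresses semantic leakage in cross-lingual sentence embeddings, the authors' term for a substantial amount of language-specific information unintentionally leaking into semantic representations, which hinders the disentanglement of semantics and language. The key to the solution is a new training objective, ORthogonAlity Constraint LEarning (ORACLE), which enforces orthogonality between semantic and language embeddings. ORACLE builds on two components, intra-class clustering and inter-class separation. Experiments on cross-lingual retrieval and semantic textual similarity show that it reduces semantic leakage and improves semantic alignment in the embedding space.

Link: https://arxiv.org/abs/2409.15664
Authors: Dayeon Ki, Cheonbok Park, Hyunjoong Kim
Keywords: Accurately aligning contextual, parallel data mining, Accurately aligning, effective parallel data, aligning contextual representations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 16 figures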

Abstract:Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage - a term we introduce to describe when a substantial amount of language-specific information is unintentionally leaked into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctively represent the meaning of the sentence. To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and language embeddings. ORACLE builds upon two components: intra-class clustering and inter-class separation. Through experiments on cross-lingual retrieval and semantic textual similarity tasks, we demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
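
To make the orthogonality idea concrete, here is a minimal sketch of a penalty that is zero when a semantic embedding and a language embedding are orthogonal. It is an illustration under stated assumptions, not the ORACLE objective itself: the abstract does not give the loss formula, so the squared-cosine penalty, the function names, and the 3-dimensional vectors are all hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def orthogonality_penalty(semantic, language):
    """Squared cosine similarity (assumed penalty): 0 when the vectors are orthogonal."""
    return cosine(semantic, language) ** 2

# Hypothetical embeddings for one sentence.
semantic_vec = [1.0, 0.0, 0.0]
language_vec_leaky = [0.6, 0.8, 0.0]  # shares a direction with the semantic vector
language_vec_clean = [0.0, 1.0, 0.0]  # orthogonal to the semantic vector

leaky = orthogonality_penalty(semantic_vec, language_vec_leaky)
clean = orthogonality_penalty(semantic_vec, language_vec_clean)
```

In a real training setup a term like this would be added to the task loss and minimized by gradient descent; the intra-class clustering and inter-class separation components named in the abstract are not modeled here.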

[NLP-42] MMPT: Multimodal Prompt Tuning for Zero-shot Instruction Learning EMNLP2024

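[Quick read]: This paper aims to improve the zero-shot generalization of Multimodal Large Language Models (MLLMs), in particular their cross-modal generalization to unseen tasks. The key to the solution is a new Multimodal Prompt Tuning (MMPT) approach, which integrates visual and textual prompts into the vision encoder and the language processor, respectively, during finetuning. This facilitates the extraction and alignment of features across modalities, and the authors report better performance than several state-of-the-art baselines on multimodal evaluation datasets.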

Link: https://arxiv.org/abs/2409.15657
Authors: Taowen Wang, Yiyang Liu, James Chenhao Liang, Junhan Zhao, Yiming Cui, Yuning Mao, Shaoliang Nie, Jiahao Liu, Fuli Feng, Zenglin Xu, Cheng Han, Lifu Huang, Qifan Wang, Dongfang Liu
Keywords (EN): Large Language Models, Multimodal Large Language, zero-shot generalization capabilities, Large Language, Multimodal Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024

Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (MMPT) approach for efficient instruction tuning of MLLMs. MMPT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.

[NLP-43] English offensive text detection using CNN based Bi-GRU model

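[Quick read]: This paper addresses the growing volume of hate content on social media platforms. The key to the solution is a new Bi-GRU-CNN model that automatically classifies whether a text is offensive. According to the authors, combining a bidirectional gated recurrent unit (Bi-GRU) with a convolutional neural network (CNN) outperforms the existing model at identifying and classifying inappropriate content.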

Link: https://arxiv.org/abs/2409.15652
Authors: Tonmoy Roy, Md Robiul Islam, Asif Ahmed Miazi, Anika Antara, Al Amin, Sunjim Hossain
Keywords (EN): increased drastically, number of users, social media, social, People frequently share
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 6 pages and 6 figures

Abstract:Over the years, the number of users of social media has increased drastically. People frequently share their thoughts through social platforms, and this leads to an increase in hate content. In this virtual community, individuals share their views, express their feelings, and post photos, videos, blogs, and more. Social networking sites like Facebook and Twitter provide platforms to share vast amounts of content with a single click. However, these platforms do not impose restrictions on the uploaded content, which may include abusive language and explicit images unsuitable for social media. To resolve this issue, a new approach is needed to separate out the inappropriate content. Numerous studies have been done to automate the process. In this paper, we propose a new Bi-GRU-CNN model to classify whether the text is offensive or not. The combination of the Bi-GRU and CNN models outperforms the existing model.

[NLP-44] Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents EMNLP

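[Quick read]: This paper addresses the lack of "full-duplex" capability in existing spoken dialogue agents, which cannot reproduce the quick and dynamic turn-taking, overlapping speech, and backchanneling of human conversation. The key to the solution is a new mechanism that integrates time information into a pre-trained large language model (Llama3-8b) so that it runs synchronously with the real-world clock. The training recipe uses 212k hours of synthetic spoken dialogue data generated from text dialogue data, together with just 2k hours of real-world spoken dialogue data. The resulting model outperforms the state of the art in dialogue meaningfulness while maintaining naturalness.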

Link: https://arxiv.org/abs/2409.15594
Authors: Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, Shyamnath Gollakota
Keywords (EN): responses requiring explicit, requiring explicit prompting, spoken dialogue, dialogue, approaches are inherently
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: EMNLP Main 2024

Abstract:Despite broad interest in modeling spoken dialogue agents, most approaches are inherently “half-duplex” – restricted to turn-based interaction with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is “full-duplex” allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony as pre-trained LLMs do not have a sense of “time”. To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that they run synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model’s ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms. Webpage: this https URL.

[NLP-45] Optimizing News Text Classification with Bi-LSTM and Attention Mechanism for Efficient Data Processing

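[Quick read]: This paper addresses the problem of efficiently filtering valuable content out of a large volume of news. The key to the solution is a deep-learning-based automatic classification scheme, specifically an optimized model that combines a bidirectional long short-term memory network (Bi-LSTM) with an attention mechanism. The authors report that the approach improves classification accuracy and timeliness and reduces the need for manual intervention.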

Link: https://arxiv.org/abs/2409.15576
Authors: Bingyao Liu, Jiajing Chen, Rui Wang, Junming Huang, Yuanshuai Luo, Jianjun Wei
Keywords (EN): development of Internet, Internet technology, technology has led, rapid increase, Internet
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:The development of Internet technology has led to a rapid increase in news information. Filtering out valuable content from complex information has become an urgent problem that needs to be solved. In view of the shortcomings of traditional manual classification methods that are time-consuming and inefficient, this paper proposes an automatic classification scheme for news texts based on deep learning. This solution achieves efficient classification and management of news texts by introducing advanced machine learning algorithms, especially an optimization model that combines Bi-directional Long Short-Term Memory Network (Bi-LSTM) and Attention Mechanism. Experimental results show that this solution can not only significantly improve the accuracy and timeliness of classification, but also significantly reduce the need for manual intervention. It has important practical significance for improving the information processing capabilities of the news industry and accelerating the speed of information flow. Through comparative analysis of multiple common models, the effectiveness and advancement of the proposed method are proved, laying a solid foundation for future news text classification research.

[NLP-46] Asking an AI for salary negotiation advice is a matter of concern: Controlled experimental perturbation of ChatGPT for protected and non-protected group discrimination on a contextual task with no clear ground truth answers

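[Quick read]: This paper examines possible bias in ChatGPT's salary negotiation advice. The key to the approach is a large-scale controlled experimental bias audit that systematically varies the employee's gender, university, and major, and compares prompts written in the employee's voice with prompts written in the employer's voice. The largest gaps appear between model versions and between the two voices. Gaps for gender, a protected class, are statistically significant but smaller than those for the non-protected attributes of university and major, and many of the biases are not consistent across model versions. The authors conclude that ChatGPT as a multi-model platform is not robust and consistent enough to be trusted for this task.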

Link: https://arxiv.org/abs/2409.15567
Authors: R. Stuart Geiger, Flynn O’Sullivan, Elsie Wang, Jonathan Lo
Keywords (EN): conducted controlled experimental, conducted controlled, asked to recommend, recommend an opening, model versions
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:We conducted controlled experimental bias audits for four versions of ChatGPT, which we asked to recommend an opening offer in salary negotiations for a new hire. We submitted 98,800 prompts to each version, systematically varying the employee’s gender, university, and major, and tested prompts in the voice of each side of the negotiation: the employee versus the employer. We find ChatGPT as a multi-model platform is not robust and consistent enough to be trusted for such a task. We observed statistically significant differences in salary offers when varying gender for all four models, although with smaller gaps than for other attributes tested. The largest gaps were between different model versions and between the employee- vs employer-voiced prompts. We also observed substantial gaps when varying university and major, but many of the biases were not consistent across model versions. We tested for fictional and fraudulent universities and found wildly inconsistent results across cases and model versions. We make broader contributions to the AI/ML fairness literature. Our scenario and our experimental design differ from mainstream AI/ML auditing efforts in key ways. Bias audits typically test discrimination for protected classes like gender, which we contrast with testing non-protected classes of university and major. Asking for negotiation advice includes how aggressive one ought to be in a negotiation relative to known empirical salary distributions and scales, which is a deeply contextual and personalized task that has no objective ground truth to validate. These results raise concerns for the specific model versions we tested and ChatGPT as a multi-model platform in continuous development. Our epistemology does not permit us to definitively certify these models as either generally biased or unbiased on the attributes we test, but our study raises matters of concern for stakeholders to further investigate.
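
The factorial design described above (every combination of gender, university, major, and negotiation voice) can be sketched as a small prompt grid. This is only an illustration: the attribute values and prompt templates below are hypothetical placeholders, not the wording or categories the authors used, and the real study submitted 98,800 prompts per model version.

```python
from itertools import product

# Hypothetical attribute levels; the paper's actual lists are not given in the abstract.
GENDERS = ["female", "male", "nonbinary"]
UNIVERSITIES = ["University A", "University B"]
MAJORS = ["Computer Science", "History"]
VOICES = ["employee", "employer"]

# Hypothetical templates, one per side of the negotiation.
TEMPLATES = {
    "employee": ("I am a {gender} graduate of {university} with a degree in "
                 "{major}. What opening salary should I ask for?"),
    "employer": ("I am hiring a {gender} graduate of {university} with a degree "
                 "in {major}. What opening salary should I offer?"),
}

def build_prompts():
    """Return one prompt record per combination of voice and attributes."""
    records = []
    for voice, gender, university, major in product(
            VOICES, GENDERS, UNIVERSITIES, MAJORS):
        text = TEMPLATES[voice].format(
            gender=gender, university=university, major=major)
        records.append({"voice": voice, "gender": gender,
                        "university": university, "major": major,
                        "prompt": text})
    return records

prompts = build_prompts()
```

Because only one attribute differs between neighbouring records, any difference in the recommended salary can be attributed to that attribute, which is what makes the audit a controlled experiment.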

[NLP-47] GEM-RAG: Graphical Eigen Memories For Retrieval Augmented Generation

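[Quick read]: This paper addresses the shortcomings of large language models (LLMs) in encoding, storing, and retrieving memories, which limit their use as AI agents in niche domains. The key to the solution is Graphical Eigen Memories For Retrieval Augmented Generation (GEM-RAG). GEM-RAG tags each text chunk with LLM-generated "utility" questions, builds a graph from the similarity of both the text and the utility questions, and uses the eigendecomposition of that graph to build higher-level summary nodes, improving on standard retrieval-augmented generation (RAG). Experiments show that GEM-RAG outperforms other state-of-the-art RAG methods on two standard question-answering tasks.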

Link: https://arxiv.org/abs/2409.15566
Authors: Brendan Hogan Rappazzo, Yingheng Wang, Aaron Ferber, Carla Gomes
Keywords (EN): shaping entities capable, Retrieval Augmented Generation, Large Language Models, general intelligence, shaping entities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:The ability to form, retrieve, and reason about memories in response to stimuli serves as the cornerstone for general intelligence - shaping entities capable of learning, adaptation, and intuitive insight. Large Language Models (LLMs) have proven their ability, given the proper memories or context, to reason and respond meaningfully to stimuli. However, they are still unable to optimally encode, store, and retrieve memories - the ability to do this would unlock their full ability to operate as AI agents, and to specialize to niche domains. To remedy this, one promising area of research is Retrieval Augmented Generation (RAG), which aims to augment LLMs by providing them with rich in-context examples and information. In question-answering (QA) applications, RAG methods embed the text of interest in chunks, and retrieve the most relevant chunks for a prompt using text embeddings. Motivated by human memory encoding and retrieval, we aim to improve over standard RAG methods by generating and encoding higher-level information and tagging the chunks by their utility to answer questions. We introduce Graphical Eigen Memories For Retrieval Augmented Generation (GEM-RAG). GEM-RAG works by tagging each chunk of text in a given text corpus with LLM generated "utility" questions, connecting chunks in a graph based on the similarity of both their text and utility questions, and then using the eigendecomposition of the memory graph to build higher level summary nodes that capture the main themes of the text. We evaluate GEM-RAG, using both UnifiedQA and GPT-3.5 Turbo as the LLMs, with SBERT and OpenAI’s text encoders on two standard QA tasks, showing that GEM-RAG outperforms other state-of-the-art RAG methods on these tasks. We also discuss the implications of having a robust RAG system and future directions.
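
As a rough illustration of using the spectrum of a chunk-similarity graph, the sketch below finds the principal eigenvector by power iteration and picks the chunks with the largest components as candidates for a summary node. The abstract does not say how GEM-RAG turns eigenvectors into summary nodes, so the selection rule, the function name, and the similarity values here are all assumptions.

```python
def power_iteration(matrix, steps=200):
    """Approximate the principal eigenvector of a square matrix."""
    n = len(matrix)
    vec = [1.0 / n] * n
    for _ in range(steps):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical symmetric similarity graph over four text chunks:
# chunks 0-2 are strongly related, chunk 3 is nearly unrelated.
similarity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

principal = power_iteration(similarity)

# Rank chunks by their weight in the principal eigenvector (assumed rule).
ranked = sorted(range(len(principal)), key=lambda i: principal[i], reverse=True)
top_chunks = ranked[:2]  # candidates to be summarized into one higher-level node
```

In a fuller pipeline the selected chunks would be passed to an LLM to write the summary text, and further eigenvectors could yield additional themes; a numerical library would normally replace the hand-written power iteration.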

[NLP-48] Learning When to Retrieve What to Rewrite and How to Respond in Conversational QA EMNLP

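[Quick read]: This paper addresses how large language models (LLMs) in conversational question answering can better understand a user's contextual search intent and decide when to retrieve. The key to the solution is the SELF-multi-RAG framework, which extends the single-turn SELF-RAG method to multi-turn conversations. The LLM decides whether retrieval is needed, rewrites the conversation into a summarized query for passage retrieval, and judges the relevance of the returned passages before generating a response. Experiments on three conversational QA datasets show an improvement of about 13% in response quality as measured by human annotation.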

Link: https://arxiv.org/abs/2409.15515
Authors: Nirmal Roy, Leonardo F. R. Ribeiro, Rexhina Blloshmi, Kevin Small
Keywords (EN): Augmenting Large Language, Large Language Models, Augmenting Large, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in EMNLP (findings) 2024

Abstract:Augmenting Large Language Models (LLMs) with information retrieval capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial for knowledge-intensive tasks. However, understanding users’ contextual search intent when generating responses is an understudied topic for conversational question answering (QA). This conversational extension leads to additional concerns when compared to single-turn QA as it is more challenging for systems to comprehend conversational context and manage retrieved passages over multiple turns. In this work, we propose a method for enabling LLMs to decide when to retrieve in RAG settings given a conversational context. When retrieval is deemed necessary, the LLM then rewrites the conversation for passage retrieval and judges the relevance of returned passages before response generation. Operationally, we build on the single-turn SELF-RAG framework (Asai et al., 2023) and propose SELF-multi-RAG for conversational settings. SELF-multi-RAG demonstrates improved capabilities over single-turn variants with respect to retrieving relevant passages (by using summarized conversational context) and assessing the quality of generated responses. Experiments on three conversational QA datasets validate the enhanced response generation capabilities of SELF-multi-RAG, with improvements of ~13% measured by human annotation.
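
The control flow described in the abstract (decide whether to retrieve, rewrite the conversation into a query, retrieve, then judge passage relevance before responding) can be sketched with toy stand-ins. Every function below is a hypothetical placeholder: in SELF-multi-RAG these decisions are made by the LLM itself, whereas here they are simple keyword rules so that the flow can run on its own.

```python
def tokens(text):
    """Lowercased word set with question marks stripped."""
    return set(text.lower().replace("?", "").split())

def should_retrieve(conversation):
    """Toy rule standing in for the LLM's decision: retrieve on questions."""
    return conversation[-1].rstrip().endswith("?")

def rewrite_query(conversation):
    """Toy rewrite standing in for LLM summarization: join all turns."""
    return " ".join(conversation)

def retrieve(query, corpus):
    """Toy retriever: rank passages by word overlap with the query."""
    scored = [(len(tokens(query) & tokens(p)), p) for p in corpus]
    return [p for score, p in sorted(scored, reverse=True) if score > 0]

def is_relevant(passage, query, min_overlap=2):
    """Toy relevance judgment standing in for the LLM's critique step."""
    return len(tokens(query) & tokens(passage)) >= min_overlap

def answer_turn(conversation, corpus):
    """Run decide -> rewrite -> retrieve -> filter for the latest turn."""
    if not should_retrieve(conversation):
        return {"retrieved": False, "passages": []}
    query = rewrite_query(conversation)
    passages = [p for p in retrieve(query, corpus) if is_relevant(p, query)]
    return {"retrieved": True, "query": query, "passages": passages}

corpus = [
    "Paris is the capital of France",
    "The Nile is a river in Africa",
]
conversation = ["Tell me about France", "What is its capital?"]
result = answer_turn(conversation, corpus)
```

The second turn alone ("What is its capital?") does not mention France, which is why the query is built from the whole conversation rather than from the last turn only.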

[NLP-49] RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration

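[Quick read]: This paper addresses how to use large language models (LLMs) to generate high-quality educational dialogue data, especially for liberal arts dialogues, that meets the standards of Humanized communication, Teaching expertise, and Safety-ethics (HTS). The key to the solution is a framework named Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C). It builds HTS-guided knowledge bases covering teaching skills, psychology, and safety ethics, and organizes LLMs that are retrieval-augmented by these knowledge bases into multi-expert groups with distinct roles to generate an HTS-compliant teaching dialogue dataset. The dataset is then used to fine-tune LLMs so that their teaching responses are more personalized and ethically safe.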

Link: https://arxiv.org/abs/2409.15461
Authors: Haoyu Huang, Tong Niu, Rui Yang, Luping Shi
Keywords (EN): large language models, utilizing large language, textbf, language models, Recently
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recently, many studies have focused on applying large language models (LLMs) to educational dialogues. Especially within liberal arts dialogues, educators must balance Humanized communication, Teaching expertise, and Safety-ethics (HTS), besides the subject knowledge itself. However, because collecting massive amounts of HTS-compliant teaching dialogues from the real world as a training corpus is expensive, the outputs of existing LLMs in teaching dialogues fall short of human standards. To address this, we design a Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C) framework to automatically generate such dialogue data. Specifically, we first establish HTS-guided knowledge bases, encompassing three domains of knowledge: teaching skills, psychology, and safety ethics. Then, RAM2C organizes LLMs, which are retrieval-augmented by the above different knowledge bases, into multi-expert groups with distinct roles to generate the HTS-compliant educational dialogues dataset. We then fine-tuned the LLMs using this dataset. Empirical evaluations indicate that RAM2C-empowered LLMs excel in Chinese reading teaching, offering more personalized and ethically safe teaching responses, demonstrating RAM2C’s practicality and high quality. We release the experiments at this https URL.

[NLP-50] In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models EMNLP2024

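[Quick read]: This paper examines the weak inhibitory control of large language models (LLMs), in particular the infant-like cognitive limitation they show on the A-Not-B error. The key to the approach is a text-based multiple-choice question-answering scenario that mirrors the A-Not-B experimental setting and systematically tests the inhibitory control of LLMs. The study finds that LLMs perform well with in-context learning (ICL), but when the context changes trivially their performance on reasoning tasks drops by as much as 83.3%. This suggests that LLMs often fail to suppress a previously established response pattern, with inhibitory control only on par with human infants in this regard.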

Link: https://arxiv.org/abs/2409.15454
Authors: Pengrui Han, Peiyang Song, Haofei Yu, Jiaxuan You
Keywords (EN): large language models, highly capable large, capable large language, Recent advancements, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2024 Findings

Abstract:Recent advancements in artificial intelligence have led to the creation of highly capable large language models (LLMs) that can perform tasks in a human-like manner. However, LLMs exhibit only infant-level cognitive abilities in certain areas. One such area is the A-Not-B error, a phenomenon seen in infants where they repeat a previously rewarded behavior despite well-observed changed conditions. This highlights their lack of inhibitory control – the ability to stop a habitual or impulsive response. In our work, we design a text-based multi-choice QA scenario similar to the A-Not-B experimental settings to systematically test the inhibitory control abilities of LLMs. We found that state-of-the-art LLMs (like Llama3-8b) perform consistently well with in-context learning (ICL) but make errors and show a significant drop of as many as 83.3% in reasoning tasks when the context changes trivially. This suggests that LLMs only have inhibitory control abilities on par with human infants in this regard, often failing to suppress the previously established response pattern during ICL.
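
An A-Not-B style prompt can be sketched as a few-shot context in which every worked example has the answer "A", followed by a test question whose correct answer is "B". The arithmetic items and the prompt layout below are hypothetical; the abstract does not describe the authors' actual questions or formatting.

```python
def build_a_not_b_prompt(examples, test_question, test_options):
    """Format few-shot examples followed by an unanswered test question."""
    lines = []
    for question, options, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A) {options[0]}  B) {options[1]}")
        lines.append(f"Answer: {answer}")
        lines.append("")
    lines.append(f"Q: {test_question}")
    lines.append(f"A) {test_options[0]}  B) {test_options[1]}")
    lines.append("Answer:")
    return "\n".join(lines)

# Hypothetical items: the correct option is always A in the context...
examples = [
    ("What is 2 + 3?", ("5", "6"), "A"),
    ("What is 4 + 4?", ("8", "9"), "A"),
    ("What is 7 + 1?", ("8", "7"), "A"),
]

# ...but B for the test question, so repeating the habit gives a wrong answer.
prompt = build_a_not_b_prompt(examples, "What is 6 + 3?", ("8", "9"))
expected_answer = "B"
```

A model that has merely picked up the "always A" pattern from the context will answer A here, which is the kind of failure to suppress an established response that the paper measures.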

[NLP-51] CUTE: Measuring LLMs Understanding of Their Tokens EMNLP2024

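[Quick read]: This paper asks to what extent large language models (LLMs) can learn and use orthographic information. The key to the approach is a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. Evaluating popular LLMs on CUTE shows that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, which calls into question how much of this knowledge is generalizable.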

Link: https://arxiv.org/abs/2409.15452
Authors: Lukas Edman, Helmut Schmid, Alexander Fraser
Keywords (EN): Large Language Models, Large Language, Language Models, show remarkable performance, show remarkable
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 main conference

Abstract:Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.
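
To illustrate the difference between knowing a spelling and manipulating text at the character level, the sketch below generates three kinds of items with reference answers. The task types, prompt wording, and function names are hypothetical; the abstract does not list CUTE's actual tasks.

```python
def spelling_task(word):
    """Knowledge item: spell a word letter by letter."""
    return {"prompt": f'Spell out the word "{word}" letter by letter.',
            "answer": " ".join(word)}

def contains_char_task(word, char):
    """Knowledge item: check whether a letter occurs in a word."""
    return {"prompt": f'Does the word "{word}" contain the letter "{char}"?',
            "answer": "yes" if char in word else "no"}

def delete_char_task(word, char):
    """Manipulation item: remove every occurrence of a letter."""
    return {"prompt": f'Delete every "{char}" from "{word}".',
            "answer": word.replace(char, "")}

tasks = [
    spelling_task("token"),
    contains_char_task("token", "k"),
    delete_char_task("banana", "a"),
]
```

Each record pairs a prompt with a programmatically computed reference answer, so a model's output can be scored by exact match without human annotation.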

[NLP-52] Parse Trees Guided LLM Prompt Compression

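[Quick read]: This paper addresses the higher computational cost and input-length limits that large language models (LLMs) face with long prompts. The key to the solution is a new selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules and computes the local information entropy of each node. The local parse trees are then organized into a global tree according to hierarchical structure, such as the dependency of sentences, paragraphs, and sections. Root-ward and leaf-ward propagation adjust the node values, and a recursive algorithm prunes the global tree based on the adjusted values. Experiments show state-of-the-art performance across datasets, metrics, compression ratios, and target LLMs, along with better coherence of the compressed prompts and strong results on extremely long prompts.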

Link: https://arxiv.org/abs/2409.15395
Authors: Wenhao Mao, Chengbin Hou, Tianyu Zhang, Xinyu Lin, Ke Tang, Hairong Lv
Keywords (EN): Large Language Models, Offering rich contexts, Large Language, resulting longer prompt, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Offering rich contexts to Large Language Models (LLMs) has been shown to boost the performance in various tasks, but the resulting longer prompt would increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten the length of prompts by using language models to generate shorter prompts or by developing computational models to select important parts of the original prompt. The generative compression methods would suffer from issues like hallucination, while the selective compression methods have not involved linguistic rules and overlook the global structure of the prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in a parse tree. These local parse trees are then organized into a global tree according to the hierarchical structure such as the dependency of sentences, paragraphs, and sections. After that, the root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt achieves state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. The in-depth ablation studies confirm the effectiveness of designs in PartPrompt, and other additional experiments also demonstrate its superiority in terms of the coherence of compressed prompts and in the extreme long prompt scenario.
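
The final step, recursively pruning a tree by node value, can be sketched as follows. This is a simplified illustration and not the PartPrompt algorithm: the node values are made up rather than derived from information entropy, no root-ward or leaf-ward propagation is applied, and the fixed threshold and dictionary layout are assumptions.

```python
def prune(node, threshold):
    """Return a pruned copy of node, or None if its value is below threshold.

    A child is only considered if its parent is kept, so dropping a node
    drops its whole subtree.
    """
    if node["value"] < threshold:
        return None
    kept = [prune(child, threshold) for child in node.get("children", [])]
    return {"text": node["text"], "value": node["value"],
            "children": [c for c in kept if c is not None]}

def flatten(node):
    """Collect the text of a (pruned) tree in pre-order."""
    words = [node["text"]] if node["text"] else []
    for child in node["children"]:
        words.extend(flatten(child))
    return words

# Hypothetical global tree: an empty root over one small parse tree.
tree = {"text": "", "value": 1.0, "children": [
    {"text": "model", "value": 0.9, "children": [
        {"text": "the", "value": 0.1, "children": []},
        {"text": "compresses", "value": 0.8, "children": [
            {"text": "prompts", "value": 0.7, "children": []},
            {"text": "really", "value": 0.2, "children": []},
        ]},
    ]},
]}

compressed = " ".join(flatten(prune(tree, 0.5)))
```

A real implementation would also restore the original word order and choose the threshold to hit a target compression ratio; this sketch reads the words in tree order only.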

[NLP-53] Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation EMNLP2024

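[Quick read]: This paper studies how vulnerable text-to-image (T2I) models are to perturbations of different part-of-speech (POS) tags in text prompts. The key to the approach is a high-quality dataset for realistic POS tag token swapping, combined with gradient-based attacks that find adversarial suffixes which mislead T2I models into generating images with altered tokens. The attack success rate (ASR) varies significantly across POS categories, with nouns, proper nouns, and adjectives being the easiest to attack. The study also examines the mechanism behind the steering effect of adversarial suffixes: the number of critical tokens and the degree of content fusion vary among POS tags, while features such as suffix transferability are consistent across categories.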

Link: https://arxiv.org/abs/2409.15381
Authors: G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
Keywords (EN): Recent studies show, Recent studies, text prompts, POS tags, POS tag
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Findings of the EMNLP 2024

Abstract:Recent studies show that text-to-image (T2I) models are vulnerable to adversarial attacks, especially with noun perturbations in text prompts. In this study, we investigate the impact of adversarial attacks on different POS tags within text prompts on the images generated by T2I models. We create a high-quality dataset for realistic POS tag token swapping and perform gradient-based attacks to find adversarial suffixes that mislead T2I models into generating images with altered tokens. Our empirical results show that the attack success rate (ASR) varies significantly among different POS tag categories, with nouns, proper nouns, and adjectives being the easiest to attack. We explore the mechanism behind the steering effect of adversarial suffixes, finding that the number of critical tokens and content fusion vary among POS tags, while features like suffix transferability are consistent across categories. We have made our implementation publicly available at - this https URL.

[NLP-54] Kalahi: A handcrafted grassroots cultural LLM evaluation suite for Filipino

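[Quick read]: This paper addresses the possibility that current multilingual large language models (LLMs) do not provide culturally appropriate and relevant responses to Filipino users. The key to the solution is Kalahi, a cultural LLM evaluation suite created collaboratively by native Filipino speakers. Kalahi consists of 150 high-quality, handcrafted, and culturally nuanced prompts that test whether LLMs generate responses relevant to shared Filipino cultural knowledge and values. Experiments on LLMs with multilingual and Filipino language support show that Kalahi, while trivial for Filipinos, is challenging for LLMs: the best model answers only 46.0% of the questions correctly, compared with native Filipino performance of 89.10%. The authors conclude that Kalahi can be used to evaluate Filipino cultural representation in LLMs accurately and reliably.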

Link: https://arxiv.org/abs/2409.15380
Authors: Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, William Chandra Tjhi, Alham Fikri Aji
Keywords (EN): necessarily provide culturally, Filipino, necessarily provide, provide culturally, Filipino users
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to its Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model’s ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.
摘要:当今的多语言大语言模型 (LLM) 未必能为其菲律宾用户提供文化上适当且相关的回应。我们引入了 Kalahi,这是一个由菲律宾本土人士共同创建的文化 LLM 评估套件。它由 150 个高质量、手工制作且细致入微的提示组成,用于测试 LLM 生成与菲律宾共享文化知识和价值观相关的回应的能力。在 Kalahi 中表现出色的 LLM 表明其能够生成类似于普通菲律宾人在特定情境下所说或所做的回应。我们对支持多语言和菲律宾语的 LLM 进行了实验。结果显示,尽管 Kalahi 对菲律宾人来说微不足道,但对 LLM 来说却颇具挑战性,最佳模型仅能正确回答 46.0% 的问题,而菲律宾本土表现则为 89.10%。因此,Kalahi 可用于准确可靠地评估 LLM 中菲律宾文化的呈现。

[NLP-55] Prompting Large Language Models for Supporting the Differential Diagnosis of Anemia

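[Quick Read]: The paper addresses the shortcomings of clinical guidelines when dealing with uncommon conditions and rapidly changing medical practice. The key to its solution is to use large language models (LLMs) such as GPT-4, LLaMA, and Mistral, together with advanced prompting techniques, to generate diagnostic pathways. Experimental results indicate that these models hold great potential for discovering clinical pathways from patient data, with GPT-4 performing best in all experiments.

Link: https://arxiv.org/abs/2409.15377
Authors: Elisa Castagnari (HeKA), Lillian Muyama (HeKA), Adrien Coulet (HeKA)
Keywords: laboratory exams, sequence of steps, Large Language Models, reach diagnosis decisions, clinicians achieve
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract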

Abstract:In practice, clinicians achieve a diagnosis by following a sequence of steps, such as laboratory exams, observations, or imaging. The pathways to reach diagnosis decisions are documented by guidelines authored by expert organizations, which guide clinicians to reach a correct diagnosis through these sequences of steps. While these guidelines are beneficial for following medical reasoning and consolidating medical knowledge, they have some drawbacks. They often fail to address patients with uncommon conditions due to their focus on the majority population, and are slow and costly to update, making them unsuitable for rapidly emerging diseases or new practices. Inspired by clinical guidelines, our study aimed to develop pathways similar to those that can be obtained in clinical guidelines. We tested three Large Language Models (LLMs), namely Generative Pretrained Transformer 4 (GPT-4), Large Language Model Meta AI (LLaMA), and Mistral, on a synthetic yet realistic dataset to differentially diagnose anemia and its subtypes. By using advanced prompting techniques to enhance the decision-making process, we generated diagnostic pathways using these models. Experimental results indicate that LLMs hold huge potential in clinical pathway discovery from patient data, with GPT-4 exhibiting the best performance in all conducted experiments.
摘要:在实际操作中,临床医生通过一系列步骤来实现诊断,例如实验室检查、观察或影像学检查。这些诊断决策的路径由专家组织编写的指南记录,指导临床医生通过这些步骤序列达到正确的诊断。尽管这些指南对遵循医学推理和巩固医学知识有益,但它们也存在一些缺点。由于它们主要关注大多数人群,因此往往无法解决罕见疾病患者的问题,并且更新速度慢且成本高,使其不适用于快速出现的疾病或新实践。受临床指南的启发,我们的研究旨在开发类似于临床指南中可获得的路径。我们测试了三种大语言模型 (LLMs) - 生成式预训练 Transformer 4 (GPT-4)、大语言模型 Meta AI (LLaMA) 和 Mistral - 在一个合成但现实的数据集上,以区分诊断贫血及其亚型。通过使用先进的提示技术来增强决策过程,我们使用这些模型生成了诊断路径。实验结果表明,LLMs 在从患者数据中发现临床路径方面具有巨大潜力,其中 GPT-4 在所有进行的实验中表现最佳。

[NLP-56] ControlMath: Controllable Data Generation Promotes Math Generalist Models

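[Quick Read]: The paper addresses the limited problem diversity of existing data augmentation methods for mathematical reasoning, which tend to be restricted to in-domain or in-distribution data generation. The key to its solution is ControlMath, an iterative method that combines an equation-generator module with two LLM-based agents, Problem-Crafter and Reverse-Agent. The equation generator creates diverse equations, Problem-Crafter turns them into math word problems, and Reverse-Agent filters and selects high-quality data following the "less is more" principle. The approach produces diverse math problems that are not tied to a specific domain or distribution, which improves the model's ability to generalize on mathematical reasoning tasks.

Link: https://arxiv.org/abs/2409.15376
Authors: Nuo Chen, Ning Wu, Jianhui Chang, Jia Li
Keywords: Utilizing large language, Utilizing large, large language models, yielded encouraging results, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages

Click to view abstract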

Abstract:Utilizing large language models (LLMs) for data augmentation has yielded encouraging results in mathematical reasoning. However, these approaches face constraints in problem diversity, potentially restricting them to in-domain/distribution data generation. To this end, we propose ControlMath, an iterative method involving an equation-generator module and two LLM-based agents. The module creates diverse equations, which the Problem-Crafter agent then transforms into math word problems. The Reverse-Agent filters and selects high-quality data, adhering to the “less is more” principle, achieving better results with fewer data points. This approach enables the generation of diverse math problems, not limited to specific domains or distributions. As a result, we collect ControlMathQA, which involves 190k math word problems. Extensive results prove that combining our dataset with in-domain datasets like GSM8K can help improve the model’s mathematical ability to generalize, leading to improved performances both within and beyond specific domains.
摘要:利用大语言模型 (LLMs) 进行数据增强在数学推理方面取得了令人鼓舞的成果。然而,这些方法在问题多样性方面存在局限,可能限制它们仅生成领域内/分布内的数据。为此,我们提出了 ControlMath,一种迭代方法,涉及一个方程生成模块和两个基于 LLM 的智能体。该模块创建多样化的方程,然后由 Problem-Crafter 智能体将其转化为数学应用题。Reverse-Agent 则根据“少即是多”的原则筛选和选择高质量数据,以较少的数据点实现更好的效果。这种方法能够生成多样化的数学问题,不受特定领域或分布的限制。因此,我们收集了 ControlMathQA,其中包含 19 万个数学应用题。广泛的实验结果证明,将我们的数据集与 GSM8K 等领域内数据集结合,可以帮助提升模型在数学能力上的泛化能力,从而在特定领域内外均取得更好的表现。

[NLP-57] Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models

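[Quick Read]: The paper addresses the growing compute and memory cost of fine-tuning large language models (LLMs), and the difficulty existing Low-Rank Adaptation (LoRA) variants have in surpassing full-parameter fine-tuning. The key to its solution is Bone (Block Affine), a new method that reduces memory overhead and emphasizes the internal connections between weights, giving faster convergence and better data fitting. Without complex initialization, the Bone structure converges quickly and fits data well across LLM architectures (LLaMA2 and RWKV6) and a range of parameter scales.

Link: https://arxiv.org/abs/2409.15371
Authors: Jiale Kang
Keywords: Large Language Models, Language Models, Large Language, requirements increase correspondingly, memory requirements increase
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract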

Abstract:As Large Language Models (LLMs) continue to grow in size, their computational and memory requirements increase correspondingly. Consequently, the exploration of cost-effective and efficient fine-tuning methods has become increasingly important. Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices, establishing itself as the predominant fine-tuning method for LLMs. In pursuit of performance closer to full-parameter training, a series of LoRA variants have emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. However, these methods also make the fine-tuning initialization process more complex, and it remains challenging to surpass the performance ceiling of full fine-tuning. To address these issues, this paper introduces an innovative method called Bone (Block Affine), which not only reduces memory overhead but also emphasizes the internal connections between weights, leading to faster convergence and better data fitting. Experimental comparisons across two different LLM architectures (LLaMA2, RWKV6) and various parameter scales demonstrate that the Bone structure can achieve rapid convergence and superior data fitting without the need for complex initialization. For example, when fine-tuning LLaMA2-7B on the MetaMathQA dataset and validating on GSM8k and math benchmarks, Bone achieved fine-tuning scores of 49.36 and 8.8, respectively, outperforming PISSA by 5.84% and 1.96%.
摘要:随着大语言模型 (LLM) 的规模不断扩大,其计算和内存需求也随之增加。因此,探索成本效益高且高效的微调方法变得愈发重要。低秩适应 (LoRA) 通过冻结原始权重并仅训练低秩矩阵,取得了显著的训练成果,已成为 LLM 微调的主导方法。为了追求更接近全参数训练的性能,一系列 LoRA 变体应运而生,如 LoRA+、PISSA、Olora 和 LoRA-GA。然而,这些方法也使得微调初始化过程更加复杂,且难以超越全微调的性能上限。为解决这些问题,本文提出了一种名为 Bone (Block Affine) 的创新方法,该方法不仅减少了内存开销,还强调了权重之间的内部联系,从而实现了更快的收敛和更好的数据拟合。在两种不同 LLM 架构 (LLaMA2, RWKV6) 和多种参数规模上的实验比较表明,Bone 结构无需复杂的初始化即可实现快速收敛和优越的数据拟合。例如,在 MetaMathQA 数据集上微调 LLaMA2-7B 并在 GSM8k 和数学基准上验证时,Bone 分别取得了 49.36 和 8.8 的微调分数,分别比 PISSA 高出 5.84% 和 1.96%。

[NLP-58] MedCodER: A Generative AI Assistant for Medical Coding

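[Quick Read]: The paper addresses the challenges of automating medical coding: a large label space, lengthy text inputs, and the absence of supporting evidence annotations. The key to its solution is MedCodER, a generative AI framework built around extraction, retrieval, and re-ranking as core components. It reaches a micro-F1 score of 0.60 on International Classification of Diseases (ICD) code prediction, significantly outperforming state-of-the-art methods.

Link: https://arxiv.org/abs/2409.15368
Authors: Krishanu Das Baksi, Elijah Soba, John J. Higgins, Ravi Saini, Jaden Wood, Jane Cook, Jack Scott, Nirmala Pudota, Tim Weninger, Edward Bowen, Sanmitra Bhattacharya
Keywords: standardizing clinical data, Natural Language Processing, Traditional Natural Language, Generative Artificial Intelligence, prone to errors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract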

Abstract:Medical coding is essential for standardizing clinical data and communication but is often time-consuming and prone to errors. Traditional Natural Language Processing (NLP) methods struggle with automating coding due to the large label space, lengthy text inputs, and the absence of supporting evidence annotations that justify code selection. Recent advancements in Generative Artificial Intelligence (AI) offer promising solutions to these challenges. In this work, we introduce MedCodER, a Generative AI framework for automatic medical coding that leverages extraction, retrieval, and re-ranking techniques as core components. MedCodER achieves a micro-F1 score of 0.60 on International Classification of Diseases (ICD) code prediction, significantly outperforming state-of-the-art methods. Additionally, we present a new dataset containing medical records annotated with disease diagnoses, ICD codes, and supporting evidence texts (this https URL). Ablation tests confirm that MedCodER’s performance depends on the integration of each of its aforementioned components, as performance declines when these components are evaluated in isolation.
摘要:医疗编码对于标准化临床数据和沟通至关重要,但通常耗时且容易出错。传统的自然语言处理 (NLP) 方法在自动化编码方面遇到困难,主要是因为标签空间庞大、文本输入冗长以及缺乏支持代码选择的证据注释。近期生成式人工智能 (Generative AI) 的进展为这些挑战提供了有前景的解决方案。在本研究中,我们介绍了 MedCodER,这是一个利用提取、检索和重排序技术作为核心组件的自动医疗编码生成式 AI 框架。MedCodER 在国际疾病分类 (ICD) 代码预测中达到了 0.60 的微观 F1 分数,显著优于最先进的方法。此外,我们提供了一个新的数据集,其中包含标注了疾病诊断、ICD 代码和支持证据文本的医疗记录 (此 https URL)。消融测试证实,MedCodER 的性能依赖于上述各组件的集成,当这些组件单独评估时,性能会下降。
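
The micro-F1 score reported for MedCodER pools true positives, false positives, and false negatives over all records before computing precision and recall. The sketch below shows that standard calculation for multi-label code prediction; the function name `micro_f1` and the two toy records are illustrative assumptions and are not taken from the paper or its dataset.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for multi-label prediction.

    gold and predicted are lists of sets of codes, one set per record.
    Counts are pooled over all records before precision/recall are computed.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # codes that were predicted and are correct
        fp += len(p - g)  # codes that were predicted but are not in gold
        fn += len(g - p)  # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy records (hypothetical annotations, for illustration only).
gold = [{"E11.9", "I10"}, {"J45.909"}]
predicted = [{"E11.9"}, {"J45.909", "I10"}]
score = micro_f1(gold, predicted)
print(round(score, 3))
```

Because counts are pooled, frequent codes weigh more than rare ones, which matters when the label space is as large as ICD.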

[NLP-59] Fine-Tuning a Time Series Foundation Model with Wasserstein Loss

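[Quick Read]: The paper addresses the limitations of cross-entropy loss in time series forecasting: cross-entropy is designed primarily for classification and does not account for the distance between classes. The key to its solution is to replace cross-entropy with the Wasserstein loss, which better reflects the distance structure of time series data. Fine-tuning a foundational time series model on 22 zero-shot datasets shows that the Wasserstein loss significantly improves point estimation.

Link: https://arxiv.org/abs/2409.15367
Authors: Andrei Chernov
Keywords: Natural Language Processing, Language Processing, Natural Language, large language models, Inspired by recent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 main pages; 2 figures

Click to view abstract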

Abstract:Inspired by recent advancements in large language models (LLMs) for Natural Language Processing (NLP), there has been a surge in research focused on developing foundational models for time series forecasting. One approach involves training LLM architectures on tokenized time series data using cross-entropy loss. Although this method has demonstrated promising results, cross-entropy loss is primarily designed for classification tasks and does not account for the distance between classes. To address this limitation, we propose using the Wasserstein loss for such architectures. To validate our approach, we fine-tuned a foundational time series model on 22 zero-shot datasets, comparing the performance of cross-entropy loss with that of Wasserstein loss. Our results demonstrate that replacing cross-entropy loss with Wasserstein loss significantly improves point estimation.
摘要:受到自然语言处理 (Natural Language Processing, NLP) 领域中大语言模型 (Large Language Models, LLMs) 最新进展的启发,针对时间序列预测的基础模型开发研究激增。一种方法涉及使用交叉熵损失 (cross-entropy loss) 在 Token 化的时间序列数据上训练 LLM 架构。尽管这种方法已显示出有希望的结果,但交叉熵损失主要设计用于分类任务,并未考虑类别之间的距离。为解决这一局限性,我们建议在这些架构中使用 Wasserstein 损失。为验证我们的方法,我们在 22 个零样本数据集上对基础时间序列模型进行了微调,比较了交叉熵损失与 Wasserstein 损失的性能。我们的结果表明,用 Wasserstein 损失替代交叉熵损失显著提升了点估计的准确性。
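
To make the point about class distance concrete, the sketch below compares cross-entropy with a 1-D Wasserstein distance on tokenized (binned) values. It assumes ordered bins with unit spacing and a one-hot target, and uses the textbook identity that the 1-D Wasserstein-1 distance equals the summed absolute difference of the two CDFs. The abstract does not give the paper's exact loss, so this is an illustration of the idea rather than the author's implementation.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy against a one-hot target: only probs[target] matters."""
    return -math.log(probs[target])


def wasserstein_1d(probs, target):
    """Wasserstein-1 distance between a predicted distribution over ordered
    bins and a one-hot target, assuming unit spacing between bins."""
    distance = 0.0
    cdf_pred = 0.0
    for i, p in enumerate(probs):
        cdf_pred += p
        cdf_true = 1.0 if i >= target else 0.0
        distance += abs(cdf_pred - cdf_true)
    return distance


target = 2
near = [0.0, 0.45, 0.1, 0.45, 0.0]  # wrong mass sits next to the target bin
far = [0.45, 0.0, 0.1, 0.0, 0.45]   # wrong mass sits two bins away

ce_near, ce_far = cross_entropy(near, target), cross_entropy(far, target)
w_near, w_far = wasserstein_1d(near, target), wasserstein_1d(far, target)
print(ce_near, ce_far)  # identical: cross-entropy ignores where the mass went
print(w_near, w_far)    # the Wasserstein distance is larger for `far`
```

Both predictions put the same probability on the correct bin, so cross-entropy cannot tell them apart, while the Wasserstein distance penalizes the prediction whose mass lies further from the target.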

[NLP-60] VERA: Validation and Enhancement for Retrieval Augmented systems

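[Quick Read]: The paper addresses the problem that large language models (LLMs) may rely on irrelevant documents or extrapolate incorrectly from their training knowledge when generating responses. The key to its solution is VERA, a system that evaluates and enhances a Retrieval-Augmented Generation (RAG) framework in two main steps: 1) before response generation, it evaluates and refines the retrieved context to ensure relevance and remove redundant information; 2) after response generation, it splits the response into atomic statements, assesses their relevance to the query, and ensures adherence to the context. By combining multiple evaluation and refinement steps, VERA mitigates hallucinations and improves the accuracy and reliability of both retrieval and response.

Link: https://arxiv.org/abs/2409.15364
Authors: Nitin Aravind Birur, Tanay Baswa, Divyanshu Kumar, Jatan Loya, Sahil Agarwal, Prashanth Harshangi
Keywords: Large language models, exhibit remarkable capabilities, Large language, VERA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract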

Abstract:Large language models (LLMs) exhibit remarkable capabilities but often produce inaccurate responses, as they rely solely on their embedded knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating an external information retrieval system, supplying additional context along with the query to mitigate inaccuracies for a particular context. However, accuracy issues still remain, as the model may rely on irrelevant documents or extrapolate incorrectly from its training knowledge. To assess and improve the performance of both the retrieval system and the LLM in a RAG framework, we propose VERA (Validation and Enhancement for Retrieval Augmented systems), a system designed to: 1) Evaluate and enhance the retrieved context before response generation, and 2) Evaluate and refine the LLM-generated response to ensure precision and minimize errors. VERA employs an evaluator-cum-enhancer LLM that first checks if external retrieval is necessary, evaluates the relevance and redundancy of the retrieved context, and refines it to eliminate non-essential information. Post-response generation, VERA splits the response into atomic statements, assesses their relevance to the query, and ensures adherence to the context. Our experiments demonstrate VERA’s remarkable efficacy not only in improving the performance of smaller open-source models, but also larger state-of-the-art models. These enhancements underscore VERA’s potential to produce accurate and relevant responses, advancing the state-of-the-art in retrieval-augmented language modeling. VERA’s robust methodology, combining multiple evaluation and refinement steps, effectively mitigates hallucinations and improves retrieval and response processes, making it a valuable tool for applications demanding high accuracy and reliability in information generation.
摘要: 大语言模型 (LLM) 展现出显著的能力,但往往产生不准确的响应,因为它们完全依赖于嵌入的知识。检索增强生成 (RAG) 通过整合外部信息检索系统来增强 LLM,为查询提供额外的上下文,以减少特定上下文中的不准确性。然而,准确性问题仍然存在,因为模型可能依赖于不相关的文档或从其训练知识中错误地推断。为了评估和提升 RAG 框架中检索系统和 LLM 的性能,我们提出了 VERA (Validation and Enhancement for Retrieval Augmented systems),这是一个旨在:1) 在响应生成前评估和增强检索到的上下文,以及 2) 评估和优化 LLM 生成的响应以确保精确性和最小化错误的系统。VERA 采用了一种评估兼增强型 LLM,首先检查是否需要外部检索,评估检索上下文的相关性和冗余性,并对其进行优化以消除非必要信息。在响应生成后,VERA 将响应分解为原子陈述,评估其与查询的相关性,并确保与上下文的一致性。我们的实验表明,VERA 不仅显著提升了较小开源模型的性能,还对更大规模的最先进模型产生了积极影响。这些改进凸显了 VERA 在生成准确和相关响应方面的潜力,推动了检索增强语言建模的最新技术。VERA 结合了多重评估和优化步骤的稳健方法,有效减少了幻觉现象,并改进了检索和响应过程,使其成为在信息生成中要求高准确性和可靠性的应用中的宝贵工具。

[NLP-61] Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

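[Quick Read]: The paper addresses the degradation of safety guardrails when large language models (LLMs) are fine-tuned on downstream tasks. The study finds that fine-tuning for code generation and translation causes the largest safety degradation, and that existing safety measures lack robustness across tasks. The key to its solution is a new multitask safety dataset; training on it reduces attack success rates across tasks without compromising the model's overall helpfulness, improving both safety and robustness.

Link: https://arxiv.org/abs/2409.15361
Authors: Essa Jan, Nouar AlDahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki
Keywords: Large Language Models, Large Language, Recent breakthroughs, breakthroughs in Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 11 figures

Click to view abstract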

Abstract:Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, ranging from code generation to machine translation and sentiment analysis, etc. Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including fine-tuning task, model calibrations, etc. This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset effectively reducing attack success rates across a range of tasks without compromising the model’s overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
摘要:近期大语言模型 (Large Language Models, LLMs) 的突破性进展使其在从代码生成到机器翻译、情感分析等广泛任务中得到了应用。红队测试/安全对齐的努力表明,在良性 (非有害) 数据上微调模型可能会影响安全性。然而,这一现象受不同变量(包括微调任务、模型校准等)影响的程度尚不清楚。本文探讨了在各种校准下,对下游任务(如摘要生成、代码生成、翻译和分类)进行微调导致的任务相关安全性下降。我们的研究结果揭示:1) 对代码生成和翻译任务进行微调会导致最高的安全防护降级。2) 大语言模型在翻译和分类任务上的防护较弱,在基线和其它校准下,73-92% 的有害提示回答落入两个关注类别之一。3) 当前的解决方案,包括防护措施和安全微调数据集,缺乏跨任务的鲁棒性。为解决这些问题,我们开发了一种新的多任务安全数据集,有效降低了各种任务中的攻击成功率,同时不损害模型的整体有用性。我们的工作强调了需要通用的对齐措施来确保更安全和更鲁棒的模型。

[NLP-62] Reward-Robust RLHF in LLMs

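[Quick Read]: The paper addresses reward model instability and potential reward hacking in reward-model-based (RM-based) Reinforcement Learning from Human Feedback (RLHF), which can lead to misalignment with human intentions. The key to its solution is a new reward-robust RLHF framework that uses Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions, so that the optimization objective balances performance and robustness. This yields more stable learning even with imperfect reward models, and the authors support its effectiveness with theoretical analysis and experimental results.

Link: https://arxiv.org/abs/2409.15360
Authors: Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen
Keywords: Artificial General Intelligence, achieving Artificial General, Large Language Models, Large Language, Artificial General
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract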

Abstract:As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect reward models. Empirical results demonstrate that our framework consistently outperforms traditional RLHF across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be effective in a stochastic-case analysis. Together, these contributions highlight the framework’s potential to enhance both the performance and stability of LLM alignment with RLHF.
摘要:随着大语言模型 (Large Language Models, LLM) 不断向更高级的智能形式发展,基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 被越来越多地视为实现通用人工智能 (Artificial General Intelligence, AGI) 的关键途径。然而,依赖于奖励模型 (Reward Model, RM) 的对齐方法由于奖励模型固有的不稳定性和不完美性,引入了显著的挑战,可能导致诸如奖励作弊和与人类意图不一致等严重问题。本文介绍了一种旨在解决这些根本挑战的鲁棒奖励 RLHF 框架,为 LLM 中更可靠和更具弹性的学习铺平了道路。我们的方法引入了一种新颖的优化目标,通过结合贝叶斯奖励模型集成 (Bayesian Reward Model Ensembles, BRME) 来建模奖励函数的不确定集合,从而在性能和鲁棒性之间进行精心平衡。这使得框架能够整合名义性能和最小奖励信号,即使在奖励模型不完美的情况下也能确保更稳定的学习。实证结果表明,我们的框架在各种基准测试中始终优于传统的 RLHF,显示出更高的准确性和长期稳定性。我们还提供了理论分析,证明鲁棒奖励 RLHF 方法在随机情况下接近恒定奖励设置的稳定性,这被证明是有效的。这些贡献共同突显了该框架在通过 RLHF 增强 LLM 对齐的性能和稳定性方面的潜力。
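
The abstract describes an objective that integrates a nominal performance signal with the minimum reward over an uncertainty set. A minimal sketch of one way to combine the two is shown below. Several things here are assumptions made for illustration and are not stated in the abstract: the nominal signal is taken to be the ensemble mean, the trade-off weight `lam` and the function name `robust_reward` are hypothetical, and the rewards are plain floats rather than outputs of a Bayesian ensemble.

```python
def robust_reward(ensemble_rewards, lam=0.5):
    """Blend a nominal reward with the worst-case reward of an ensemble.

    ensemble_rewards: scores that different reward models assign to the
        same response (stand-ins for the ensemble's outputs).
    lam: hypothetical trade-off weight; 1.0 uses only the nominal signal,
        0.0 uses only the minimum (most pessimistic) signal.
    """
    nominal = sum(ensemble_rewards) / len(ensemble_rewards)  # assumed: mean
    worst_case = min(ensemble_rewards)
    return lam * nominal + (1.0 - lam) * worst_case


# Two responses with the same mean reward but different disagreement.
agreed = [0.6, 0.5, 0.4]
disputed = [1.5, 1.0, -1.0]
print(robust_reward(agreed), robust_reward(disputed))
```

With the same mean, the response on which the reward models disagree receives the lower blended reward, which is the intuition behind penalizing responses that only some reward models score highly.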

[NLP-63] Watch Your Steps: Observable and Modular Chains of Thought

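[Quick Read]: The paper addresses the problem that explanations produced by conventional Chain of Thought (CoT) prompting are hard to observe. The key to its solution is a new method called Program Trace Prompting: few-shot CoT demonstrations are wrapped in a formal Python-based syntax that identifies and names steps, defines the input/output behavior of each step, and replaces the CoT explanations of in-context examples with chains of these formalized steps. The method makes explanations more observable while preserving the power, generality, and flexibility of CoT; it applies to many tasks and achieves strong results on the 23 diverse tasks of the BIG-Bench Hard benchmark. It also exposes "non-local errors" as an unaddressed issue in CoT learning and provides methods for verifying the modularity of steps in a CoT explanation.

Link: https://arxiv.org/abs/2409.15359
Authors: Cassandra A. Cohen, William W. Cohen
Keywords: Program Trace Prompting, called Program Trace, prompting called Program, Trace Prompting, Program Trace
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract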

Abstract:We propose a variant of chain of thought (CoT) prompting called Program Trace Prompting that makes explanations more observable while preserving the power, generality and flexibility of CoT. In our approach, few-shot CoT demonstrations are wrapped in a formal syntax based on Python, and each prompt: identifies and names steps; defines the input/output behavior of steps; and replaces CoT explanations of in-context examples with chains of these formalized steps on the same examples. Program Trace Prompting is applicable to many tasks, achieving strong results on the 23 diverse tasks in the BIG-Bench Hard benchmark. More importantly, by instrumenting explanations in this way, we enable new types of analysis. In particular, we identify “non-local errors” (which correspond to incorrectly learning the reasoning method illustrated in the demonstrations) as an unaddressed issue in CoT learning, and we present methods for verifying the modularity of steps in a CoT explanation.
摘要:我们提出了一种名为程序跟踪提示 (Program Trace Prompting) 的链式思维 (Chain of Thought, CoT) 提示变体,该方法在保留 CoT 的强大性、通用性和灵活性的同时,使解释更加可观察。在我们的方法中,少样本 CoT 演示被包装在基于 Python 的正式语法中,每个提示:识别并命名步骤;定义步骤的输入/输出行为;并用这些正式化的步骤链替换上下文示例中的 CoT 解释。程序跟踪提示适用于多种任务,在 BIG-Bench Hard 基准测试中的 23 个多样化任务上取得了优异的结果。更重要的是,通过这种方式对解释进行工具化,我们能够进行新的分析类型。特别是,我们识别出“非局部错误”(对应于错误地学习了演示中展示的推理方法)作为 CoT 学习中未解决的问题,并提出了验证 CoT 解释中步骤模块化的方法。
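
To give a feel for what "named steps with defined input/output behavior" can look like, the sketch below runs two toy Python steps and records a trace of each call. This is only an analogy: in the paper the steps are part of a prompt and the LLM produces the trace, whereas here ordinary Python executes them. The step names, the trace wording, and the `step` decorator are all hypothetical and do not reproduce the paper's prompt format.

```python
import functools

TRACE = []  # collected trace lines, in call order


def step(fn):
    """Record each call to a named step and the value it returns."""
    @functools.wraps(fn)
    def wrapper(*args):
        TRACE.append(f"Calling {fn.__name__}{args!r}...")
        result = fn(*args)
        TRACE.append(f"...{fn.__name__} returned {result!r}")
        return result
    return wrapper


@step
def extract_numbers(question: str) -> list:
    """Step 1: pull the integers out of a word problem."""
    return [int(tok) for tok in question.split() if tok.isdigit()]


@step
def add_all(numbers: list) -> int:
    """Step 2: sum the extracted numbers."""
    return sum(numbers)


answer = add_all(extract_numbers("Ana has 3 apples and buys 4 more"))
print("\n".join(TRACE))
print("answer:", answer)
```

Because every step has a name and an explicit input and output, each line of the trace can be checked on its own, which is the kind of observability and modularity analysis the abstract refers to.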

[NLP-64] Block-Attention for Low-Latency RAG

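[Quick Read]: The paper addresses the increased inference latency in Retrieval-Augmented Generation (RAG) scenarios. The key to its solution is Block-Attention, an attention mechanism that divides the input sequence into blocks, where each block except the final one computes its key-value (KV) states independently. In RAG scenarios, each passage is defined as a block, so the KV states of all passages can be pre-computed and cached. The implementation involves block segmentation, positional encoding calculation, and fine-tuning the large language model (LLM) to adapt to Block-Attention. Experiments show performance comparable to, and in some cases better than, self-attention models on several RAG benchmarks, together with a much lower time to first token (TTFT).

Link: https://arxiv.org/abs/2409.15355
Authors: East Sun, Yan Wang, Lan Tian
Keywords: increased inference latency, Retrieval-Augmented Generation, attention mechanism designed, designed to address, address the increased
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract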

Abstract:We introduce Block-Attention, an attention mechanism designed to address the increased inference latency in Retrieval-Augmented Generation (RAG) scenarios. Its main idea lies in dividing the input sequence into blocks, where each block calculates its key-value (KV) states independently except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to pre-compute the KV states for all passages and cache them in memory. The implementation involves block segmentation, positional encoding calculation, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block Attention model can achieve performance comparable to (68.4% vs 67.9% on Llama3) or even better (62.8% vs 59.6% on Mistral) than self-attention models. Notably, Block-Attention reduces the TTFT to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared with the self-attention model, the time consumption is reduced by 98.7%.
摘要:我们介绍了 Block-Attention,这是一种注意力机制,旨在解决在检索增强生成 (Retrieval-Augmented Generation, RAG) 场景中推理延迟增加的问题。其主要思想在于将输入序列划分为多个块,每个块独立计算其键值 (Key-Value, KV) 状态,除了最后一个块。在 RAG 场景中,通过将每个段落定义为一个块,Block-Attention 使我们能够预先计算所有段落的 KV 状态并将其缓存到内存中。实现过程包括块分割、位置编码计算以及微调大语言模型 (Large Language Model, LLM) 以适应 Block-Attention 机制。在四个 RAG 基准测试上的实验表明,经过块微调后,Block Attention 模型可以达到与自注意力模型相当 (Llama3 上为 68.4% 对 67.9%) 或甚至更好的性能 (Mistral 上为 62.8% 对 59.6%)。值得注意的是,Block-Attention 将首次 Token 生成时间 (TTFT) 降低到了极低的水平。对于总长度为 32K 的输入序列,输出第一个 Token 仅需 45 毫秒。与自注意力模型相比,时间消耗减少了 98.7%。
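
The block structure described above can be pictured as an attention mask. The sketch below builds one under two assumptions that go beyond the abstract: attention is causal (decoder-only), and "independent" means a token in a non-final block attends only to earlier tokens of its own block, while tokens in the final block attend to everything before them. The function name `block_attention_mask` is hypothetical, and positional encoding handling and fine-tuning are not modeled.

```python
def block_attention_mask(block_lengths):
    """Return mask[i][j] == True when token i may attend to token j.

    block_lengths: number of tokens in each block, in sequence order.
    Non-final blocks see only themselves; the final block sees all blocks.
    """
    # Map every token position to the index of the block it belongs to.
    block_of = []
    for block_index, length in enumerate(block_lengths):
        block_of.extend([block_index] * length)

    n = len(block_of)
    last_block = len(block_lengths) - 1
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only current and earlier positions
            if block_of[i] == last_block or block_of[i] == block_of[j]:
                mask[i][j] = True
    return mask


# Two retrieved passages of 2 tokens each, then a 1-token final block (query).
mask = block_attention_mask([2, 2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Since the rows of a non-final block never reference other blocks, its KV states do not depend on which other passages were retrieved, which is what makes pre-computing and caching them possible.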


[NLP-65] Revisiting the Solution of Meta KDD Cup 2024: CRAG

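[Quick Read]: The paper addresses the diverse and dynamic challenges that Retrieval-Augmented Generation (RAG) systems face, which existing question answering benchmarks evaluate poorly. The key to its solution is a routing-based domain and dynamic adaptive RAG pipeline that applies specific processing for the diversity and dynamism of questions in all three stages: retrieval, augmentation, and generation. The method performed strongly on the CRAG benchmark and ranked 2nd in the Meta KDD CUP 2024 CRAG Comprehensive RAG Benchmark Challenge.

Link: https://arxiv.org/abs/2409.15337
Authors: Jie Ouyang, Yucong Luo, Mingyue Cheng, Daoyu Wang, Shuo Yu, Qi Liu, Enhong Chen
Keywords: Meta KDD CUP, KDD CUP, Meta KDD, RAG Benchmark Challenge, CRAG Comprehensive RAG
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract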

Abstract:This paper presents the solution of our team APEX in the Meta KDD CUP 2024: CRAG Comprehensive RAG Benchmark Challenge. The CRAG benchmark addresses the limitations of existing QA benchmarks in evaluating the diverse and dynamic challenges faced by Retrieval-Augmented Generation (RAG) systems. It provides a more comprehensive assessment of RAG performance and contributes to advancing research in this field. We propose a routing-based domain and dynamic adaptive RAG pipeline, which performs specific processing for the diverse and dynamic nature of the question in all three stages: retrieval, augmentation, and generation. Our method achieved superior performance on CRAG and ranked 2nd for Task 2&3 on the final competition leaderboard. Our implementation is available at this link: this https URL.
摘要:本文介绍了我们团队 APEX 在 Meta KDD CUP 2024 中的 CRAG 综合 RAG 基准挑战赛中的解决方案。CRAG 基准解决了现有问答基准在评估检索增强生成 (RAG) 系统所面临的多样化和动态挑战方面的局限性。它提供了对 RAG 性能更全面的评估,并有助于推动该领域的研究进展。我们提出了一种基于路由的领域和动态自适应 RAG 流水线,该流水线针对问题的多样化和动态特性,在检索、增强和生成三个阶段进行特定处理。我们的方法在 CRAG 上取得了优异的性能,并在最终竞赛排行榜上在任务 23 中排名第二。我们的实现代码可在此链接中获取:this https URL。

[NLP-66] Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail?

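[Quick Read]: The paper addresses the question of how well large language models (LLMs) understand languages other than English, specifically Spanish. The key to its solution is the TELEIA benchmark, a test set modeled on Spanish exams for foreign students that covers reading comprehension, word formation, meaning and compositional semantics, and grammar. Evaluating state-of-the-art LLMs on it shows that they understand Spanish well but remain far from native-speaker level in grammatical competence.

Link: https://arxiv.org/abs/2409.15334
Authors: Marina Mayor-Rocher, Nina Melero, Elena Merino-Gómez, María Grandury, Javier Conde, Pedro Reviriego
Keywords: Large Language Models, Large Language, Language Models, language understanding tasks, profusely evaluated
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract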

Abstract:Large Language Models (LLMs) have been profusely evaluated on their ability to answer questions on many topics and their performance on different natural language understanding tasks. Those tests are usually conducted in English, but most LLM users are not native English speakers. Therefore, it is of interest to analyze how LLMs understand other languages at different levels: from paragraphs to morphemes. In this paper, we evaluate the performance of state-of-the-art LLMs in TELEIA, a recently released benchmark with similar questions to those of Spanish exams for foreign students, covering topics such as reading comprehension, word formation, meaning and compositional semantics, and grammar. The results show that LLMs perform well at understanding Spanish but are still far from achieving the level of a native speaker in terms of grammatical competence.
摘要:大语言模型 (LLMs) 已经在其回答多种主题问题以及在不同自然语言理解任务中的表现方面得到了广泛评估。这些测试通常以英语进行,但大多数 LLM 用户并非英语母语者。因此,分析 LLM 在不同层次上理解其他语言的能力显得尤为重要:从段落到词素。本文中,我们评估了最先进的 LLM 在 TELEIA 中的表现,TELEIA 是一个最近发布的基准测试,其问题与西班牙语考试中的问题相似,涵盖了阅读理解、词形形成、意义和组合语义以及语法等主题。结果显示,LLM 在理解西班牙语方面表现良好,但在语法能力方面仍远未达到母语者的水平。

[NLP-67] Sorbet: A Neuromorphic Hardware-Compatible Transformer-Based Spiking Language Model

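[Quick Read]: The paper addresses the energy-efficiency problem of deploying language models on resource-constrained edge devices. The key to its solution is Sorbet, a transformer-based language model built on spiking neural networks (SNNs). Sorbet replaces the energy-intensive softmax and layer normalization operations of conventional transformers with a novel shifting-based softmax (PTsoftmax) and a bit-shifting power normalization method (BSPN). With knowledge distillation and model quantization, Sorbet obtains a highly compressed binary-weight model that maintains competitive performance while significantly reducing energy consumption.

Link: https://arxiv.org/abs/2409.15298
Authors: Kaiwen Tang, Zhanglu Yan, Weng-Fai Wong
Keywords: language, language models, Sorbet, energy efficiency, model
Categories: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract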

Abstract:For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models (SLMs) targeted for deployment in resource-constrained devices where energy efficiency is a significant concern. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and there are already works on realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more neuromorphic hardware-compatible. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a power normalization method using bit-shifting (BSPN), both designed to replace the respective energy-intensive operations. By leveraging knowledge distillation and model quantization, Sorbet achieved a highly compressed binary weight model that maintains competitive performance while significantly reducing energy consumption. We validate Sorbet’s effectiveness through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference.
摘要:由于隐私等原因,语言模型在边缘设备上有应用场景。这促使了针对资源受限设备的小型语言模型 (SLM) 的发展,其中能源效率是一个重要考虑因素。脉冲神经网络 (SNN) 因其能源效率而成为一种有前景的解决方案,并且已有研究在 SNN 上实现基于 Transformer 的模型。然而,像 softmax 和层归一化 (LN) 这样的关键操作在神经形态硬件上难以实现,许多早期工作都回避了这些问题。为了解决这些挑战,我们引入了 Sorbet,这是一种基于 Transformer 的脉冲语言模型,更兼容神经形态硬件。Sorbet 包含了一种新颖的基于移位的 softmax 方法,称为 PTsoftmax,以及一种使用位移的幂归一化方法 (BSPN),两者都旨在替代相应的能耗密集型操作。通过利用知识蒸馏和模型量化,Sorbet 实现了一个高度压缩的二进制权重模型,该模型在保持竞争性能的同时显著降低了能耗。我们通过在 GLUE 基准上的广泛测试和一系列消融研究验证了 Sorbet 的有效性,展示了其作为语言模型推理的能源高效解决方案的潜力。

[NLP-68] The NGT200 Dataset: Geometric Multi-View Isolated Sign Recognition

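[Quick Read]: The paper addresses multi-view isolated sign recognition (MV-ISR) and highlights the essential role of 3D awareness and geometry in sign language processing systems. The key to its solution is the NGT200 dataset, a novel spatio-temporal multi-view benchmark that establishes MV-ISR as distinct from single-view ISR (SV-ISR). The paper also demonstrates the benefits of synthetic data and proposes conditioning sign representations on the spatial symmetries inherent in sign language. An SE(2) equivariant model improves MV-ISR performance by 8%-22% over the baseline.

Link: https://arxiv.org/abs/2409.15284
Authors: Oline Ranum, David R. Wessels, Gomer Otterspeer, Erik J. Bekkers, Floris Roelofsen, Jari I. Andersen
Keywords: Sign Language Processing, real-world applications, Language Processing, achieve practical, inclusive future
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Proceedings of the Geometry-grounded Representation Learning and Generative Modeling Workshop (GRaM) at the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 251, 2024

Click to view abstract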

Abstract:Sign Language Processing (SLP) provides a foundation for a more inclusive future in language technology; however, the field faces several significant challenges that must be addressed to achieve practical, real-world applications. This work addresses multi-view isolated sign recognition (MV-ISR), and highlights the essential role of 3D awareness and geometry in SLP systems. We introduce the NGT200 dataset, a novel spatio-temporal multi-view benchmark, establishing MV-ISR as distinct from single-view ISR (SV-ISR). We demonstrate the benefits of synthetic data and propose conditioning sign representations on spatial symmetries inherent in sign language. Leveraging an SE(2) equivariant model improves MV-ISR performance by 8%-22% over the baseline.
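
To make the SE(2) symmetry concrete, here is a minimal sketch under the assumption that a sign frame is represented as 2D keypoints (the abstract does not say which representation the model uses). SE(2) is the group of planar rotations plus translations; the sketch shows one equivariant quantity (the centroid moves with the transform) and one invariant quantity (pairwise distances do not change). All function names are illustrative.

```python
import math

def se2_apply(points, theta, tx, ty):
    """Rotate 2D points by theta (radians), then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def centroid(points):
    """Mean position; an SE(2)-equivariant feature."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def pairwise_distances(points):
    """Distances between all point pairs; an SE(2)-invariant feature."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Hypothetical keypoints, e.g. three hand landmarks in one camera view.
keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
moved = se2_apply(keypoints, theta=0.7, tx=3.0, ty=-1.0)
```

Transforming the input and then taking the centroid gives the same point as taking the centroid and then transforming it, which is the equivariance property the abstract refers to.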

[NLP-69] StyleSinger 2: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control EMNLP2024

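[Quick read]: This paper addresses style transfer and style control in zero-shot singing voice synthesis (SVS), in particular high-quality generation for unseen timbres and singing styles (singing method, emotion, rhythm, technique, and pronunciation). The core of the solution is StyleSinger 2, which achieves style transfer across cross-lingual speech and singing styles, with multi-level style control, through three main modules: 1) a clustering style encoder that uses a clustering vector quantization model to stably condense style information into a compact latent space; 2) a Style and Duration Language Model (S&D-LM) that predicts style information and phoneme duration concurrently, which benefits both; 3) a style adaptive decoder that uses a novel mel-style adaptive normalization method to generate singing voices with enhanced detail. Experiments show that StyleSinger 2 outperforms all baseline models in synthesis quality, singer similarity, and style controllability.

Link: https://arxiv.org/abs/2409.15977
Authors: Yu Zhang, Ziyue Jiang, Ruiqi Li, Changhao Pan, Jinzheng He, Rongjie Huang, Chuxin Wang, Zhou Zhao
Keywords: style transfer, style, generate singing voices, high-quality singing voices, generate high-quality singing
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by EMNLP 2024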

Abstract:Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce StyleSinger 2, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, StyleSinger 2 proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that StyleSinger 2 outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at this https URL.

[NLP-70] dnaGrinder: a lightweight and high-capacity genomic foundation model

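[Quick read]: This paper addresses the understanding and interpretation of the complex information encoded in genomic sequences, in particular the efficient management of long-range dependencies, the effective representation of nucleotide variations, and the high computational cost of large model architectures and pretraining datasets. The core of the solution is dnaGrinder, an efficient genomic foundation model that handles long-range dependencies well while keeping computational cost low, with results comparable and often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. dnaGrinder is also designed for easy fine-tuning on workstation-grade GPUs with input lengths exceeding 17,000 tokens, and supports sequences longer than 140,000 tokens on a single high-performance GPU, making it an efficient and accessible tool for basic biological research and clinical applications.

Link: https://arxiv.org/abs/2409.15697
Authors: Qihang Zhao, Chi Zhang, Weixiong Zhang
Keywords: complex information encoded, genomic sequences remains, task of understanding, understanding and interpreting, interpreting the complex
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: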

Abstract:The task of understanding and interpreting the complex information encoded within genomic sequences remains a grand challenge in biological research and clinical applications. In this context, recent advancements in large language model research have led to the development of both encoder-only and decoder-only foundation models designed to decode intricate information in DNA sequences. However, several issues persist, particularly regarding the efficient management of long-range dependencies inherent in genomic sequences, the effective representation of nucleotide variations, and the considerable computational costs associated with large model architectures and extensive pretraining datasets. Current genomic foundation models often face a critical tradeoff: smaller models with mediocre performance versus large models with improved performance. To address these challenges, we introduce dnaGrinder, a unique and efficient genomic foundation model. dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance. It achieves results that are not just comparable but often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. Furthermore, dnaGrinder is designed for easy fine-tuning on workstation-grade GPUs, accommodating input lengths exceeding 17,000 tokens. On a single high-performance GPU, it supports sequences longer than 140,000 tokens, making it a highly efficient and accessible tool for both basic biological research and clinical applications.

[NLP-71] Language-based Audio Moment Retrieval

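[Quick read]: This paper addresses audio moment retrieval (AMR): predicting the relevant moments in untrimmed long audio from a text query. The core of the solution is a dedicated dataset, Clotho-Moment, and a DETR-based model, Audio Moment DETR (AM-DETR), which captures temporal dependencies within audio features and thereby surpasses conventional clip-level audio retrieval methods. Experiments show that AM-DETR outperforms a sliding-window baseline on all metrics, improving Recall1@0.7 by 9.00 points in particular.

Link: https://arxiv.org/abs/2409.15672
Authors: Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
Keywords: called audio moment, audio, task called audio, audio retrieval, AMR
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: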

Abstract:In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available in this https URL.
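
The abstract reports Recall1@0.7 without defining it. A minimal sketch follows, assuming the convention common in video moment retrieval: a query counts as a hit when its top-1 predicted interval has temporal IoU of at least 0.7 with the annotated moment. That reading, and the function names, are assumptions rather than details from the paper.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, truths, iou_threshold=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if not truths:
        raise ValueError("need at least one query")
    hits = sum(
        temporal_iou(p, t) >= iou_threshold for p, t in zip(top1_preds, truths)
    )
    return hits / len(truths)

# Two hypothetical queries: the first prediction overlaps well, the second does not.
preds = [(0.0, 10.0), (20.0, 30.0)]
truths = [(0.0, 9.0), (28.0, 40.0)]
print(recall_at_1(preds, truths))  # 0.5
```

Under this reading, an improvement of 9.00 points means 9 more queries per 100 whose top prediction clears the 0.7 IoU bar.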

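[NLP-72] Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction

[Quick read]: This paper addresses the efficacy and reliability of annotating and recognizing speech emotion with large language models (LLMs). The core of the solution is a set of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology, whose effectiveness is examined on automatic speech recognition (ASR) transcripts in contrast with ground-truth transcripts. The paper also proposes a Revise-Reason-Recognize prompting pipeline for robust emotion recognition from speech with ASR errors, and examines the usefulness of LLM training schemes such as context-aware learning, in-context learning, and instruction tuning. The experiments demonstrate the efficacy of these methods for LLM-based emotion recognition.

Link: https://arxiv.org/abs/2409.15551
Authors: Yuanchao Li, Yuan Gong, Chao-Han Huck Yang, Peter Bell, Catherine Lai
Keywords: Large Language Models, reliability remain questionable, Annotating and recognizing, advancement of Large, Language Models
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments: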

Abstract:Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. Subsequently, we examine the effectiveness of LLM-based prompting on Automatic Speech Recognition (ASR) transcription, contrasting it with ground-truth transcription. Furthermore, we propose a Revise-Reason-Recognize prompting pipeline for robust LLM-based emotion recognition from spoken language with ASR errors. Additionally, experiments on context-aware learning, in-context learning, and instruction tuning are performed to examine the usefulness of LLM training schemes in this direction. Finally, we investigate the sensitivity of LLMs to minor prompt variations. Experimental results demonstrate the efficacy of the emotion-specific prompts, ASR error correction, and LLM training schemes for LLM-based emotion recognition. Our study aims to refine the use of LLMs in emotion recognition and related domains.

[NLP-73] Rethinking Emotion Bias in Music via Frechet Audio Distance

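[Quick read]: This paper addresses emotion bias in music emotion recognition (MER) and emotional music generation (EMG), which arises especially when relying on a single audio encoder, emotion classifier, or evaluation metric. The core of the solution is to use diverse audio encoders together with the Frechet Audio Distance (FAD), a reference-free evaluation metric, to give a more objective measure of music emotion. Beyond FAD-based evaluation across multiple encoders, the paper proposes an enhanced EMG approach that improves the variation and prominence of generated music emotion, and thus its realism. The experiments underscore the emotion bias problem in both MER and EMG and show the potential of FAD and diverse encoders for evaluating music emotion objectively.

Link: https://arxiv.org/abs/2409.15545
Authors: Yuanchao Li, Azalea Gui, Dimitra Emmanouilidou, Hannes Gamper
Keywords: Frechet Audio Distance, Emotional Music Generation, single audio encoder, music emotion, diverse audio encoders
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments: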

Abstract:The subjective nature of music emotion introduces inherent bias in both recognition and generation, especially when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside the Frechet Audio Distance (FAD), a reference-free evaluation metric. Our study begins with a benchmark evaluation of MER, highlighting the limitations associated with using a single audio encoder and the disparities observed across different measurements. We then propose assessing MER performance using FAD from multiple encoders to provide a more objective measure of music emotion. Furthermore, we introduce an enhanced EMG approach designed to improve both the variation and prominence of generated music emotion, thus enhancing realism. Additionally, we investigate the realism disparities between the emotions conveyed in real and synthetic music, comparing our EMG model against two baseline models. Experimental results underscore the emotion bias problem in both MER and EMG and demonstrate the potential of using FAD and diverse audio encoders to evaluate music emotion objectively.
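
The abstract does not spell out how FAD is computed. As a reminder, and assuming the usual definition (a Fréchet distance between Gaussians fitted to encoder embeddings), it can be written as below. Here $\mu_a, \Sigma_a$ and $\mu_b, \Sigma_b$ denote the mean and covariance of the embeddings of the two audio sets being compared; the symbols are illustrative notation, not taken from the paper.

```latex
\mathrm{FAD}(a, b) = \lVert \mu_a - \mu_b \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_a + \Sigma_b - 2\,(\Sigma_a \Sigma_b)^{1/2} \right)
```

Because the means and covariances depend on the encoder that produces the embeddings, the same pair of audio sets can receive different scores under different encoders, which is consistent with the paper's case for evaluating with several encoders.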

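[NLP-74] The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

[Quick read]: This paper addresses the lack of speech-and-text-aligned data for less-resourced languages. The core of the solution is to take transcripts of parliamentary proceedings from national European parliaments, together with their recordings, and to align long sequences of text and audio in a large search space with a novel approach, yielding high-quality aligned datasets. In a pilot on three Slavic languages (Croatian, Polish, and Serbian), the team built datasets spanning more than 5,000 hours of speech and text, showing the potential of the approach for many more languages.

Link: https://arxiv.org/abs/2409.15397
Authors: Nikola Ljubešić, Peter Rupnik, Danijel Koržinek
Keywords: Recent significant improvements, Recent significant, explicit supervision, raw language data, significant improvements
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to SPECOM 2024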

Abstract:Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data as well as various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful type of explicit supervision is still the alignment between the speech signal and its corresponding text transcript, which is a data type that is not available for many languages. In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings. Our starting point are the ParlaMint comparable corpora of transcripts of parliamentary proceedings of 26 national European parliaments. In the pilot run on expanding the ParlaMint corpora with aligned publicly available recordings, we focus on three Slavic languages, namely Croatian, Polish, and Serbian. The main challenge of our approach is the lack of any global alignment between the ParlaMint texts and the available recordings, as well as the sometimes varying data order in each of the modalities, which requires a novel approach in aligning long sequences of text and audio in a large search space. The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts. Although these datasets already make a huge difference in the availability of spoken and textual data for the three languages, we want to emphasize the potential of the presented approach in building similar datasets for many more languages.

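[NLP-75] Toward Automated Clinical Transcriptions

[Quick read]: This paper addresses the rising healthcare costs and adverse outcomes driven by administrative documentation, in particular physician burnout and diminished quality of care. The core of the solution is a secure system that applies recent speech-to-text transcription and speaker-labeling (diarization) techniques to patient-provider conversations. The system is optimized to produce accurate transcriptions and to highlight potential errors for rapid human verification, reducing manual effort. Applied to over 40 hours of simulated conversations, it offers a promising foundation for automating clinical transcription.

Link: https://arxiv.org/abs/2409.15378
Authors: Mitchell A. Klusty, W. Vaiden Logan, Samuel E. Armstrong, Aaron D. Mullen, Caroline N. Leach, Jeff Talbert, V. K. Cody Bumgardner
Keywords: including physician burnout, rising healthcare costs, Administrative documentation, adverse outcomes, including physician
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 7 pages, 6 figures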

Abstract:Administrative documentation is a major driver of rising healthcare costs and is linked to adverse outcomes, including physician burnout and diminished quality of care. This paper introduces a secure system that applies recent advancements in speech-to-text transcription and speaker-labeling (diarization) to patient-provider conversations. This system is optimized to produce accurate transcriptions and highlight potential errors to promote rapid human verification, further reducing the necessary manual effort. Applied to over 40 hours of simulated conversations, this system offers a promising foundation for automating clinical transcriptions.

[NLP-76] A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework

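[Quick read]: This paper addresses the absence of relational thinking in artificial speech recognition systems, that is, how to let machines understand and exploit relations among speech signals as humans do. The core of the solution is a novel spectro-temporal relational thinking based acoustic modeling framework: it generates probabilistic graphs to model the relationships among speech segments across both time and frequency domains, then aggregates the relational information and embeds it into latent representations. Models built on the framework achieve a 7.82% improvement in phoneme recognition on TIMIT, mainly through better recognition of vowels, which are the most easily confused.

Link: https://arxiv.org/abs/2409.15357
Authors: Zheng Nan, Ting Dang, Vidhyasaharan Sethu, Beena Ahmed
Keywords: form mental impressions, Relational thinking refers, Relational thinking, prior knowledge, form mental
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: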

Abstract:Relational thinking refers to the inherent ability of humans to form mental impressions about relations between sensory signals and prior knowledge, and subsequently incorporate them into their model of their world. Despite the crucial role relational thinking plays in human understanding of speech, it has yet to be leveraged in any artificial speech recognition systems. Recently, there have been some attempts to correct this oversight, but these have been limited to coarse utterance-level models that operate exclusively in the time domain. In an attempt to narrow the gap between artificial systems and human abilities, this paper presents a novel spectro-temporal relational thinking based acoustic modeling framework. Specifically, it first generates numerous probabilistic graphs to model the relationships among speech segments across both time and frequency domains. The relational information rooted in every pair of nodes within these graphs is then aggregated and embedded into latent representations that can be utilized by downstream tasks. Models built upon this framework outperform state-of-the-art systems with a 7.82% improvement in phoneme recognition tasks over the TIMIT dataset. In-depth analyses further reveal that our proposed relational thinking modeling mainly improves the model’s ability to recognize vowels, which are the most likely to be confused by phoneme recognizers.

[NLP-77] Contextualization of ASR with LLM using phonetic retrieval-based augmentation

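[Quick read]: This paper addresses the difficulty large language models (LLMs) have in recognizing personal named entities, such as contacts in a phone book, from speech input. The core of the solution is a retrieval-based approach: the LLM first detects named entities in speech without any context; these entities are then used as queries to retrieve phonetically similar named entities from a personal database; the retrieved entities are fed back to the LLM for context-aware decoding. In a voice assistant task, the approach achieves up to 30.2% relative word error rate reduction and 73.6% relative named entity error rate reduction, while avoiding prompting the LLM with the full named entity database, which keeps it efficient and applicable to large databases.

Link: https://arxiv.org/abs/2409.15353
Authors: Zhihong Lei, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, Zhen Huang
Keywords: shown superb capability, modeling multimodal signals, multimodal signals including, signals including audio, Large language models
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: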

Abstract:Large language models (LLMs) have shown superb capability of modeling multimodal signals including audio and text, allowing the model to generate spoken or textual response given a speech input. However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM: we first let the LLM detect named entities in speech without any context, then use this named entity as a query to retrieve phonetically similar named entities from a personal database and feed them to the LLM, and finally run context-aware LLM decoding. In a voice assistant task, our solution achieved up to 30.2% relative word error rate reduction and 73.6% relative named entity error rate reduction compared to a baseline system without contextualization. Notably, our solution by design avoids prompting the LLM with the full named entity database, making it highly efficient and applicable to large named entity databases.
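
The retrieval step can be sketched as follows. The abstract does not say how phonetic similarity is computed, so this example swaps in a deliberately crude stand-in: a hypothetical spelling normalization (`phonetic_key`) followed by string similarity from Python's standard `difflib`. It illustrates only the idea of passing a short list of candidates to the LLM instead of the whole contact database.

```python
import difflib

def phonetic_key(name):
    """Crude, hypothetical phonetic normalization (not the paper's method)."""
    key = name.lower()
    for src, dst in (("ph", "f"), ("ck", "k"), ("c", "k"), ("y", "i"), ("z", "s")):
        key = key.replace(src, dst)
    collapsed = []
    for ch in key:
        # Keep letters only and collapse runs of the same letter.
        if ch.isalpha() and (not collapsed or collapsed[-1] != ch):
            collapsed.append(ch)
    return "".join(collapsed)

def retrieve_similar(query, contacts, top_k=3):
    """Return the top_k contacts whose phonetic keys are closest to the query's."""
    q = phonetic_key(query)
    scored = [
        (difflib.SequenceMatcher(None, q, phonetic_key(c)).ratio(), c)
        for c in contacts
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# The first-pass LLM output might misspell a contact's name.
contacts = ["Kathryn Lee", "John Smith", "Maria Gomez"]
print(retrieve_similar("Catherine Li", contacts, top_k=1))  # ['Kathryn Lee']
```

Only the retrieved candidates would then be placed in the prompt for the second, context-aware decoding pass.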

[NLP-78] A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation

[Quick read]: This paper addresses the lack of spontaneous speech corpora for Brazilian Portuguese and evaluates the suitability of a new corpus for automatic speech recognition (ASR). The core of the solution is the NURC-SP Audio Corpus, a large spontaneous speech corpus with 239.30 hours of transcribed audio from 401 speakers (204 female, 197 male). By fine-tuning Wav2Vec2-XLSR-53 and Distil-Whisper models on the corpus, the paper shows its promise for Brazilian Portuguese ASR. The fine-tuned Distil-Whisper model achieves a word error rate (WER) of 24.22%, better than the 33.73% of the fine-tuned Wav2Vec2-XLSR-53 model.

Link: https://arxiv.org/abs/2409.15350
Authors: Rodrigo Lima, Sidney Evaldo Leal, Arnaldo Candido Junior, Sandra Maria Aluísio
Keywords: Brazilian Portuguese language, NURC-SP Audio Corpus, report preliminary automatic, Brazilian Portuguese, Audio Corpus
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments:

Abstract:We present a freely available spontaneous speech corpus for the Brazilian Portuguese language and report preliminary automatic speech recognition (ASR) results, using both the Wav2Vec2-XLSR-53 and Distil-Whisper models fine-tuned and trained on our corpus. The NURC-SP Audio Corpus comprises 401 different speakers (204 females, 197 males) with a total of 239.30 hours of transcribed audio recordings. To the best of our knowledge, this is the first large Paulistano accented spontaneous speech corpus dedicated to the ASR task in Portuguese. We first present the design and development procedures of the NURC-SP Audio Corpus, and then describe four ASR experiments in detail. The experiments demonstrated promising results for the applicability of the corpus for ASR. Specifically, we fine-tuned two versions of Wav2Vec2-XLSR-53 model, trained a Distil-Whisper model using our dataset with labels determined by Whisper Large-V3 model, and fine-tuned this Distil-Whisper model with our corpus. Our best results were the Distil-Whisper fine-tuned over NURC-SP Audio Corpus with a WER of 24.22% followed by a fine-tuned versions of Wav2Vec2-XLSR-53 model with a WER of 33.73%, that is almost 10% point worse than Distil-Whisper’s. To enable experiment reproducibility, we share the NURC-SP Audio Corpus dataset, pre-trained models, and training recipes in Hugging-Face and Github repositories.
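
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. A minimal sketch follows; whitespace tokenization and the absence of text normalization are simplifying assumptions, and the paper's scoring setup may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of four reference words.
print(word_error_rate("o gato preto dorme", "o gato dorme"))  # 0.25
```

On the reported numbers, the gap between the two fine-tuned models is 33.73 - 24.22 = 9.51 percentage points.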

Artificial Intelligence

[AI-0] Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking

Link: https://arxiv.org/abs/2409.16287
Authors: Xi Wang, Tianxing Chen, Qiaojun Yu, Tianling Xu, Zanxin Chen, Yiting Fu, Cewu Lu, Yao Mu, Ping Luo
Keywords: carefully considered, interactive perception, articulated objects, Articulated, interactive
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Articulated object manipulation requires precise object interaction, where the object’s axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but typically, open-loop approaches often suffer from overlooking the interaction dynamics. To address this limitation, we present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds. Our method leverages any interactive perception technique as a foundation for interactive perception, inducing slight object movement to generate point cloud frames of the evolving dynamic scene. These point clouds are then segmented using Segment Anything Model 2 (SAM2), after which the moving part of the object is masked for accurate motion online axis estimation, guiding subsequent robotic actions. Our approach significantly enhances the precision and efficiency of manipulation tasks involving articulated objects. Experiments in simulated environments demonstrate that our method outperforms baseline approaches, especially in tasks that demand precise axis-based control. Project Page: this https URL.

[AI-1] Fields of The World: A Machine Learning Benchmark Dataset For Global Agricultural Field Boundary Segmentation

Link: https://arxiv.org/abs/2409.16252
Authors: Hannah Kerner, Snehal Chaudhari, Aninda Ghosh, Caleb Robinson, Adeel Ahmad, Eddie Choi, Nathan Jacobs, Chris Holmes, Matthias Mohr, Rahul Dodhia, Juan M. Lavista Ferres, Jennifer Marcus
Keywords: Crop field boundaries, Crop field, collect manually, monitoring and assessments, expensive to collect
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Crop field boundaries are foundational datasets for agricultural monitoring and assessments but are expensive to collect manually. Machine learning (ML) methods for automatically extracting field boundaries from remotely sensed images could help realize the demand for these datasets at a global scale. However, current ML methods for field instance segmentation lack sufficient geographic coverage, accuracy, and generalization capabilities. Further, research on improving ML methods is restricted by the lack of labeled datasets representing the diversity of global agricultural fields. We present Fields of The World (FTW) – a novel ML benchmark dataset for agricultural field instance segmentation spanning 24 countries on four continents (Europe, Africa, Asia, and South America). FTW is an order of magnitude larger than previous datasets with 70,462 samples, each containing instance and semantic segmentation masks paired with multi-date, multi-spectral Sentinel-2 satellite images. We provide results from baseline models for the new FTW benchmark, show that models trained on FTW have better zero-shot and fine-tuning performance in held-out countries than models that aren’t pre-trained with diverse datasets, and show positive qualitative zero-shot results of FTW models in a real-world scenario – running on Sentinel-2 scenes over Ethiopia.

[AI-2] LLM Echo Chamber: personalized and automated disinformation

Link: https://arxiv.org/abs/2409.16241
Authors: Tony Ma
Keywords: Large Language Models, Large Language, Recent advancements, capabilities of Large, Language Models
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 42 pages

Abstract:Recent advancements have showcased the capabilities of Large Language Models like GPT4 and Llama2 in tasks such as summarization, translation, and content review. However, their widespread use raises concerns, particularly around the potential for LLMs to spread persuasive, humanlike misinformation at scale, which could significantly influence public opinion. This study examines these risks, focusing on LLMs ability to propagate misinformation as factual. To investigate this, we built the LLM Echo Chamber, a controlled digital environment simulating social media chatrooms, where misinformation often spreads. Echo chambers, where individuals only interact with like minded people, further entrench beliefs. By studying malicious bots spreading misinformation in this environment, we can better understand this phenomenon. We reviewed current LLMs, explored misinformation risks, and applied sota finetuning techniques. Using Microsoft phi2 model, finetuned with our custom dataset, we generated harmful content to create the Echo Chamber. This setup, evaluated by GPT4 for persuasiveness and harmfulness, sheds light on the ethical concerns surrounding LLMs and emphasizes the need for stronger safeguards against misinformation.

[AI-3] Label-Augmented Dataset Distillation

Link: https://arxiv.org/abs/2409.16239
Authors: Seoungyoon Kang, Youngsun Lim, Hyunjung Shim
Keywords: Traditional dataset distillation, distillation primarily focuses, Traditional dataset, dataset distillation, primarily focuses
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional dataset distillation primarily focuses on image representation while often overlooking the important role of labels. In this study, we introduce Label-Augmented Dataset Distillation (LADD), a new dataset distillation framework enhancing dataset distillation with label augmentations. LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics. These dense labels require only a 2.5% increase in storage (ImageNet subsets) with significant performance benefits, providing strong learning signals. Our label generation strategy can complement existing dataset distillation methods for significantly enhancing their training efficiency and performance. Experimental results demonstrate that LADD outperforms existing methods in terms of computational overhead and accuracy. With three high-performance dataset distillation algorithms, LADD achieves remarkable gains by an average of 14.9% in accuracy. Furthermore, the effectiveness of our method is proven across various datasets, distillation hyperparameters, and algorithms. Finally, our method improves the cross-architecture robustness of the distilled dataset, which is important in the application scenario.

[AI-4] Efficiently Learning Probabilistic Logical Models by Cheaply Ranking Mined Rules

Link: https://arxiv.org/abs/2409.16238
Authors: Jonathan Feldstein, Dominic Phillips, Efthymia Tsamoura
Keywords: require high explainability, Probabilistic logical models, logical models, Probabilistic logical, high explainability
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract:Probabilistic logical models are a core component of neurosymbolic AI and are important models in their own right for tasks that require high explainability. Unlike neural networks, logical models are often handcrafted using domain expertise, making their development costly and prone to errors. While there are algorithms that learn logical models from data, they are generally prohibitively expensive, limiting their applicability in real-world settings. In this work, we introduce precision and recall for logical rules and define their composition as rule utility – a cost-effective measure to evaluate the predictive power of logical models. Further, we introduce SPECTRUM, a scalable framework for learning logical models from relational data. Its scalability derives from a linear-time algorithm that mines recurrent structures in the data along with a second algorithm that, using the cheap utility measure, efficiently ranks rules built from these structures. Moreover, we derive theoretical guarantees on the utility of the learnt logical model. As a result, SPECTRUM learns more accurate logical models orders of magnitude faster than previous methods on real-world datasets.

[AI-5] Predicting Deterioration in Mild Cognitive Impairment with Survival Transformers, Extreme Gradient Boosting and Cox Proportional Hazard Modelling ICANN2024

Link: https://arxiv.org/abs/2409.16231
Authors: Henry Musto, Daniel Stamate, Doina Logofatu, Daniel Stahl
Keywords: mild cognitive impairment, predicting cognitive deterioration, extreme gradient boosting, ADNI cohort, gradient boosting models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted to ICANN 2024

Abstract:The paper proposes a novel approach of survival transformers and extreme gradient boosting models in predicting cognitive deterioration in individuals with mild cognitive impairment (MCI) using metabolomics data in the ADNI cohort. By leveraging advanced machine learning and transformer-based techniques applied in survival analysis, the proposed approach highlights the potential of these techniques for more accurate early detection and intervention in Alzheimer’s dementia disease. This research also underscores the importance of non-invasive biomarkers and innovative modelling tools in enhancing the accuracy of dementia risk assessments, offering new avenues for clinical practice and patient care. A comprehensive Monte Carlo simulation procedure, consisting of 100 repetitions of a nested cross-validation in which models were trained and evaluated, indicates that the survival machine learning models based on Transformer and XGBoost achieved the highest mean C-index performances, namely 0.85 and 0.8, respectively, and that they are superior to the conventional survival analysis Cox Proportional Hazards model, which achieved a mean C-index of 0.77. Moreover, based on the standard deviations of the C-index performances obtained in the Monte Carlo simulation, we established that both survival machine learning models above are more stable than the conventional statistical model.
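
The C-index values quoted in this abstract measure how often a model ranks a pair of patients in the correct order of deterioration. As a rough illustration only, here is a minimal sketch of Harrell's concordance index for right-censored data. It is not the authors' evaluation code: the function name `concordance_index` and the toy inputs are assumptions, and pairs with tied observed times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (simplified sketch).

    times  -- observed time for each subject (event or censoring time)
    events -- 1/True if the event was observed, 0/False if censored
    risks  -- predicted risk score; higher means earlier expected event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier subject had an observed event.
        # Tied times are skipped here for simplicity (an assumption of this sketch).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # model ranked the earlier event as higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the 0.85, 0.8 and 0.77 figures should be read.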

[AI-6] Fine-Tuning is Fine if Calibrated

Link: https://arxiv.org/abs/2409.16223
Authors: Zheda Mai, Arpita Chowdhury, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, Wei-Lun Chao
Keywords: losing valuable knowledge, fine-tuned model, model, classes, pre-trained model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first three authors contribute equally

Abstract:Fine-tuning is arguably the most straightforward way to tailor a pre-trained model (e.g., a foundation model) to downstream applications, but it also comes with the risk of losing valuable knowledge the model had learned in pre-training. For example, fine-tuning a pre-trained classifier capable of recognizing a large number of classes to master a subset of classes at hand is shown to drastically degrade the model’s accuracy in the other classes it had previously learned. As such, it is hard to further use the fine-tuned model when it encounters classes beyond the fine-tuning data. In this paper, we systematically dissect the issue, aiming to answer the fundamental question, “What has been damaged in the fine-tuned model?” To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes. Instead, the fine-tuned model often produces more discriminative features for these other classes, even if they were missing during fine-tuning! What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes, implying that a simple post-processing calibration would bring back the pre-trained model’s capability and at the same time unveil the feature improvement over all classes. We conduct an extensive empirical study to demonstrate the robustness of our findings and provide preliminary explanations underlying them, suggesting new directions for future theoretical analysis. Our code is available at this https URL.
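
To make the idea of post-processing calibration concrete, the sketch below adds a constant offset to the logits of classes that were absent from fine-tuning before taking the argmax. This is one simple form such a calibration could take, not necessarily the one used in the paper; `calibrated_predict`, `absent_classes` and the offset `gamma` are hypothetical names introduced for illustration.

```python
def calibrated_predict(logits, absent_classes, gamma):
    """Return the predicted class index after shifting absent-class logits.

    logits         -- list of raw scores, one per class
    absent_classes -- set of class indices not seen during fine-tuning
    gamma          -- hypothetical offset added to absent-class logits
    """
    adjusted = [
        z + gamma if k in absent_classes else z
        for k, z in enumerate(logits)
    ]
    # argmax over the adjusted scores
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy example: class 3 was absent from fine-tuning and its logit is slightly
# too low; a modest offset restores it as the top prediction.
logits = [5.0, 4.2, 1.0, 3.9]
print(calibrated_predict(logits, {2, 3}, gamma=0.0))  # uncalibrated
print(calibrated_predict(logits, {2, 3}, gamma=2.0))  # calibrated
```

The features are untouched in this sketch; only the relative scale between the two groups of logits changes, which matches the abstract's claim that the damage lies in logit scales rather than in the learned representation.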

[AI-7] Towards Enhancing Linked Data Retrieval in Conversational UIs using Large Language Models

Link: https://arxiv.org/abs/2409.16220
Authors: Omar Mussa, Omer Rana, Benoît Goossens, Pablo Orozco-Terwengel, Charith Perera
Keywords: Resource Description Framework, Description Framework, Resource Description, Large Language Models, recent broad adoption
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper has been accepted at the 25th International Web Information Systems Engineering Conference (WISE 2024)

Abstract:Despite the recent broad adoption of Large Language Models (LLMs) across various domains, their potential for enriching information systems in extracting and exploring Linked Data (LD) and Resource Description Framework (RDF) triplestores has not been extensively explored. This paper examines the integration of LLMs within existing systems, emphasising the enhancement of conversational user interfaces (UIs) and their capabilities for data extraction by producing more accurate SPARQL queries without the requirement for model retraining. Typically, conversational UI models necessitate retraining with the introduction of new datasets or updates, limiting their functionality as general-purpose extraction tools. Our approach addresses this limitation by incorporating LLMs into the conversational UI workflow, significantly enhancing their ability to comprehend and process user queries effectively. By leveraging the advanced natural language understanding capabilities of LLMs, our method improves RDF entity extraction within web systems employing conventional chatbots. This integration facilitates a more nuanced and context-aware interaction model, critical for handling the complex query patterns often encountered in RDF datasets and Linked Open Data (LOD) endpoints. The evaluation of this methodology shows a marked enhancement in system expressivity and the accuracy of responses to user queries, indicating a promising direction for future research in this area. This investigation not only underscores the versatility of LLMs in enhancing existing information systems but also sets the stage for further explorations into their potential applications within more specialised domains of web information systems.

[AI-8] Problem-oriented AutoML in Clustering

Link: https://arxiv.org/abs/2409.16218
Authors: Matheus Camilo da Silva, Gabriel Marques Tavares, Eric Medvet, Sylvio Barbon Junior
Keywords: Clustering Validity Indexes, flexible approach, automating clustering tasks, Problem-oriented AutoML, approach to automating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The Problem-oriented AutoML in Clustering (PoAC) framework introduces a novel, flexible approach to automating clustering tasks by addressing the shortcomings of traditional AutoML solutions. Conventional methods often rely on predefined internal Clustering Validity Indexes (CVIs) and static meta-features, limiting their adaptability and effectiveness across diverse clustering tasks. In contrast, PoAC establishes a dynamic connection between the clustering problem, CVIs, and meta-features, allowing users to customize these components based on the specific context and goals of their task. At its core, PoAC employs a surrogate model trained on a large meta-knowledge base of previous clustering datasets and solutions, enabling it to infer the quality of new clustering pipelines and synthesize optimal solutions for unseen datasets. Unlike many AutoML frameworks that are constrained by fixed evaluation metrics and algorithm sets, PoAC is algorithm-agnostic, adapting seamlessly to different clustering problems without requiring additional data or retraining. Experimental results demonstrate that PoAC not only outperforms state-of-the-art frameworks on a variety of datasets but also excels in specific tasks such as data visualization, and highlight its ability to dynamically adjust pipeline configurations based on dataset complexity.

[AI-9] Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech ECCV

Link: https://arxiv.org/abs/2409.16203
Authors: Yunji Chu, Yunseob Shim, Unsang Park
Keywords: synthesizes emotionally expressive, emotionally expressive speech, innovative zero-shot, synthesizes emotionally, emotionally expressive
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 13 pages, 3 figures, accepted to ECCV Workshop ABAW (Affective Behavior Analysis in-the-wild) 7 (to appear)

Abstract:We propose FEIM-TTS, an innovative zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech, aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS transcends traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without dependence on labeled datasets. To address sparse audio-visual-emotional data, the model is trained using LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. FEIM-TTS’s unique capability to produce high-quality, speaker-agnostic speech makes it suitable for creating adaptable voices for virtual characters. Moreover, FEIM-TTS significantly enhances accessibility for individuals with visual impairments or those who have trouble seeing. By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics, allowing visually impaired users to enjoy these narratives more fully. Comprehensive evaluation evidences its proficiency in modulating emotion and intensity, advancing emotional speech synthesis and accessibility. Samples are available at: this https URL.

[AI-10] CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data

Link: https://arxiv.org/abs/2409.16202
Authors: Qianwen Zhang, Haochen Wang, Fang Li, Siyu An, Lingfeng Qiao, Liangcai Gao, Di Yin, Xing Sun
Keywords: Large Language Models, Online education platforms, digital infrastructure, significantly transformed, transformed the dissemination
Categories: Artificial Intelligence (cs.AI)

Abstract:Online education platforms have significantly transformed the dissemination of educational resources by providing a dynamic and digital infrastructure. With the further enhancement of this transformation, the advent of Large Language Models (LLMs) has elevated the intelligence levels of these platforms. However, current academic benchmarks provide limited guidance for real-world industry scenarios. This limitation arises because educational applications require more than mere test question responses. To bridge this gap, we introduce CJEval, a benchmark based on Chinese Junior High School Exam Evaluations. CJEval consists of 26,136 samples across four application-level educational tasks covering ten subjects. These samples include not only questions and answers but also detailed annotations such as question types, difficulty levels, knowledge concepts, and answer explanations. By utilizing this benchmark, we assessed LLMs’ potential applications and conducted a comprehensive analysis of their performance by fine-tuning on various educational tasks. Extensive experiments and discussions have highlighted the opportunities and challenges of applying LLMs in the field of education.

[AI-11] Leveraging Estimated Transferability Over Human Intuition for Model Selection in Text Ranking EMNLP2024

Link: https://arxiv.org/abs/2409.16198
Authors: Jun Bai, Zhuofan Chen, Zhenzi Li, Hanhua Hong, Jianfei Zhang, Chen Li, Chenghua Lin, Wenge Rong
Keywords: Pre-trained Language Models, Pre-trained Language, enhanced by Pre-trained, witnessed significant advancements, Language Models
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 main conference

Abstract:Text ranking has witnessed significant advancements, attributed to the utilization of dual-encoder enhanced by Pre-trained Language Models (PLMs). Given the proliferation of available PLMs, selecting the most effective one for a given dataset has become a non-trivial challenge. As a promising alternative to human intuition and brute-force fine-tuning, Transferability Estimation (TE) has emerged as an effective approach to model selection. However, current TE methods are primarily designed for classification tasks, and their estimated transferability may not align well with the objectives of text ranking. To address this challenge, we propose to compute the expected rank as transferability, explicitly reflecting the model’s ranking capability. Furthermore, to mitigate anisotropy and incorporate training dynamics, we adaptively scale isotropic sentence embeddings to yield an accurate expected rank score. Our resulting method, Adaptive Ranking Transferability (AiRTran), can effectively capture subtle differences between models. On challenging model selection scenarios across various text ranking datasets, it demonstrates significant improvements over previous classification-oriented TE methods, human intuition, and ChatGPT with minor time consumption.

[AI-12] Second Order Bounds for Contextual Bandits with Function Approximation

Link: https://arxiv.org/abs/2409.16197
Authors: Aldo Pacchiano
Keywords: context-action pairs belongs, developed algorithms no-regret, algorithms no-regret algorithms, square root, function class
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 12 pages main, 33 pages total

Abstract:Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean rewards over context-action pairs belong to a function class. Although there are many approaches to this problem, one that has gained in importance is the use of algorithms based on the optimism principle such as optimistic least squares. It can be shown that the regret of this algorithm scales as the square root of the product of the eluder dimension (a statistical measure of the complexity of the function class), the logarithm of the function class size and the time horizon. Unfortunately, even if the variance of the measurement noise of the rewards at each time is changing and is very small, the regret of the optimistic least squares algorithm scales with the square root of the time horizon. In this work we are the first to develop algorithms that satisfy regret bounds scaling not with the square root of the time horizon, but with the square root of the sum of the measurement variances in the setting of contextual bandits with function approximation when the variances are unknown. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems.
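
The two scalings described in this abstract can be written out as follows. The symbols are assumptions introduced here, not notation from the paper: $T$ is the time horizon, $d_{\mathrm{eluder}}$ the eluder dimension, $\mathcal{F}$ the function class, and $\sigma_t^2$ the variance of the reward noise at round $t$. The abstract does not state the constants, logarithmic factors, or the full dependence on $d_{\mathrm{eluder}}$ and $|\mathcal{F}|$ in the second bound, so only the horizon-dependent part is shown there.

```latex
% Optimistic least squares, as described in the abstract:
\mathrm{Regret}(T) \;\lesssim\; \sqrt{d_{\mathrm{eluder}} \cdot \log|\mathcal{F}| \cdot T}

% Second-order bound: the horizon T is replaced by the cumulative noise variance
\mathrm{Regret}(T) \;\text{scales with}\; \sqrt{\sum_{t=1}^{T} \sigma_t^{2}}
\quad \text{instead of} \quad \sqrt{T}
```

When every $\sigma_t^2$ is bounded by a constant, the sum is at most of order $T$ and the second form is no worse than the first; when the noise is small, it can be much tighter.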

[AI-13] Cyber Knowledge Completion Using Large Language Models

Link: https://arxiv.org/abs/2409.16176
Authors: Braden K Webb, Sumit Purohit, Rounak Meyur
Keywords: Internet of Things, exploit emerging vulnerabilities, Cyber-Physical Systems, potential to exploit, exploit emerging
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures. Submitted to 2024 IEEE International Conference on Big Data

Abstract:The integration of the Internet of Things (IoT) into Cyber-Physical Systems (CPSs) has expanded their cyber-attack surface, introducing new and sophisticated threats with potential to exploit emerging vulnerabilities. Assessing the risks of CPSs is increasingly difficult due to incomplete and outdated cybersecurity knowledge. This highlights the urgent need for better-informed risk assessments and mitigation strategies. While previous efforts have relied on rule-based natural language processing (NLP) tools to map vulnerabilities, weaknesses, and attack patterns, recent advancements in Large Language Models (LLMs) present a unique opportunity to enhance cyber-attack knowledge completion through improved reasoning, inference, and summarization capabilities. We apply embedding models to encapsulate information on attack patterns and adversarial techniques, generating mappings between them using vector embeddings. Additionally, we propose a Retrieval-Augmented Generation (RAG)-based approach that leverages pre-trained models to create structured mappings between different taxonomies of threat patterns. Further, we use a small hand-labeled dataset to compare the proposed RAG-based approach to a baseline standard binary classification model. Thus, the proposed approach provides a comprehensive framework to address the challenge of cyber-attack knowledge graph completion.

[AI-14] Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering

Link: https://arxiv.org/abs/2409.16167
Authors: Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, Fei Wu
Keywords: fine-tuning large language, platforms like Huggingface, large language models, fine-tuning large, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains due to its modular design and widespread availability on platforms like Huggingface. This modularity has sparked interest in combining multiple LoRAs to enhance LLM capabilities. However, existing methods for LoRA composition primarily focus on task-specific adaptations that require additional training, and current model merging techniques often fail to fully leverage LoRA’s modular nature, leading to parameter interference and performance degradation. In this paper, we investigate the feasibility of disassembling and reassembling multiple LoRAs at a finer granularity, analogous to assembling LEGO blocks. We introduce the concept of Minimal Semantic Units (MSUs), where the parameters corresponding to each rank in LoRA function as independent units. These MSUs demonstrate permutation invariance and concatenation-summation equivalence properties, enabling flexible combinations to create new LoRAs. Building on these insights, we propose the LoRA-LEGO framework. This framework conducts rank-wise parameter clustering by grouping MSUs from different LoRAs into k clusters. The centroid of each cluster serves as a representative MSU, enabling the assembly of a merged LoRA with an adjusted rank of k. Additionally, we apply a dual reweighting strategy to optimize the scale of the merged LoRA. Experiments across various benchmarks demonstrate that our method outperforms existing approaches in LoRA merging.
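
The two MSU properties named in this abstract can be checked on a toy example. The sketch below assumes the standard LoRA form, where the weight update is the product of a column matrix B and a row matrix A, so that each rank contributes one outer product. Representing an MSU as a `(b_column, a_row)` pair and the helper `delta_w` are choices made here for illustration, and the reading of "concatenation-summation equivalence" as "concatenating ranks equals summing updates" is an interpretation of the abstract, not a quotation from the paper.

```python
def delta_w(msus):
    """Sum of rank-one outer products b * a^T over a list of (b, a) pairs."""
    d, k = len(msus[0][0]), len(msus[0][1])
    out = [[0] * k for _ in range(d)]
    for b, a in msus:
        for i in range(d):
            for j in range(k):
                out[i][j] += b[i] * a[j]
    return out


def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Two toy LoRAs for a 2x3 weight matrix: rank 2 and rank 1.
lora1 = [([1, 2], [3, 0, 1]), ([0, 1], [1, 1, 0])]
lora2 = [([2, 0], [0, 1, 1])]

# Permutation invariance: reordering the ranks leaves the update unchanged.
print(delta_w(lora1) == delta_w(list(reversed(lora1))))

# Concatenation-summation equivalence: stacking the ranks of both LoRAs
# gives the same update as adding the two updates.
print(delta_w(lora1 + lora2) == add(delta_w(lora1), delta_w(lora2)))
```

Because the update is just a sum over ranks, individual ranks can be regrouped freely, which is what makes clustering them across LoRAs well defined.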

[AI-15] EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

Link: https://arxiv.org/abs/2409.16165
Authors: Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, Ofir Press
Keywords: demonstrating growing potential, language model, demonstrating growing, growing potential, limited due
Categories: Artificial Intelligence (cs.AI)

Abstract:Although language model (LM) agents are demonstrating growing potential in many domains, their success in cybersecurity has been limited due to simplistic design and the lack of fundamental features for this domain. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. EnIGMA introduces new Agent-Computer Interfaces (ACIs) to improve the success rate on CTF challenges. We establish the novel Interactive Agent Tool concept, which enables LM agents to run interactive command-line utilities essential for these challenges. Empirical analysis of EnIGMA on over 350 CTF challenges from three different benchmarks indicates that providing a robust set of new tools with demonstration of their usage helps the LM solve complex problems and achieves state-of-the-art results on the NYU CTF and Intercode-CTF benchmarks. Finally, we discuss insights on ACI design and agent behavior on cybersecurity tasks that highlight the need to adapt real-world tools for LM agents.

[AI-16] Seeing Faces in Things: A Model and Dataset for Pareidolia

Link: https://arxiv.org/abs/2409.16143
Authors: Mark Hamilton, Simon Stent, Vasha DuTell, Anne Harrington, Jennifer Corbett, Ruth Rosenholtz, William T. Freeman
Keywords: human visual system, shapes and sizes, visual system, system is well-tuned, faces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. “Face pareidolia” describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset of “Faces in Things”, consisting of five thousand web images with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We find that the evolutionary need for humans to detect animal faces, as well as human faces, may explain some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia. Dataset and Website: this https URL

[AI-17] HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2409.16136
Authors: Yuqi Ma, Mengyin Liu, Chao Zhu, Xu-Cheng Yin
Keywords: Large Multi-modal Models, Large Multi-modal, extensive training data, Open-vocabulary object detection, OVD models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and large number of parameters. Mainstream OVD models prioritize coarse-grained object categories rather than fine-grained attributes, e.g., colors or materials, and thus fail to identify objects specified with certain attributes. However, OVD models are pretrained on large-scale image-text pairs with rich attribute words, whose latent feature space can represent the global text feature as a linear composition of fine-grained attribute tokens without highlighting them. Therefore, we propose in this paper a universal and explicit approach for frozen mainstream OVD models that boosts their attribute-level detection capabilities by highlighting fine-grained attributes in explicit linear space. Firstly, an LLM is leveraged to highlight attribute words within the input text as a zero-shot prompted task. Secondly, by strategically adjusting the token masks, the text encoders of OVD models extract both global text and attribute-specific features, which are then explicitly composited as two vectors in linear space to form the new attribute-highlighted feature for detection tasks, where corresponding scalars are hand-crafted or learned to reweight the two vectors. Notably, these scalars can be seamlessly transferred among different OVD models, which proves that such an explicit linear composition is universal. Empirical evaluation on the FG-OVD dataset demonstrates that our proposed method uniformly improves fine-grained attribute-level OVD of various mainstream models and achieves new state-of-the-art performance.
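
The explicit linear composition described in this abstract amounts to a weighted sum of two feature vectors. The sketch below shows only that step, under the assumption that both features have already been extracted by the text encoder; `compose_features`, `alpha` and `beta` are hypothetical names, and the numeric values are made up for illustration.

```python
def compose_features(global_feat, attr_feat, alpha, beta):
    """Weighted sum alpha * global + beta * attribute, element by element."""
    if len(global_feat) != len(attr_feat):
        raise ValueError("feature vectors must have the same length")
    return [alpha * g + beta * a for g, a in zip(global_feat, attr_feat)]


# Toy 2-dimensional features; real text features would be much longer.
global_feat = [1.0, 2.0]   # feature of the whole prompt
attr_feat = [0.0, 4.0]     # feature of the highlighted attribute words
print(compose_features(global_feat, attr_feat, alpha=1.0, beta=0.5))
```

Since the two scalars are the only free parameters in this step, it is plausible that they can be reused across detectors, as the abstract reports.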

[AI-18] Implicit assessment of language learning during practice as accurate as explicit testing

Link: https://arxiv.org/abs/2409.16133
Authors: Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber
Keywords: Intelligent Tutoring Systems, Tutoring Systems, Intelligent Tutoring, part of Intelligent, Item Response Theory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:Assessment of proficiency of the learner is an essential part of Intelligent Tutoring Systems (ITS). We use Item Response Theory (IRT) in computer-aided language learning for assessment of student ability in two contexts: in test sessions, and in exercises during practice sessions. Exhaustive testing across a wide range of skills can provide a detailed picture of proficiency, but may be undesirable for a number of reasons. Therefore, we first aim to replace exhaustive tests with efficient but accurate adaptive tests. We use learner data collected from exhaustive tests under imperfect conditions, to train an IRT model to guide adaptive tests. Simulations and experiments with real learner data confirm that this approach is efficient and accurate. Second, we explore whether we can accurately estimate learner ability directly from the context of practice with exercises, without testing. We transform learner data collected from exercise sessions into a form that can be used for IRT modeling. This is done by linking the exercises to linguistic constructs; the constructs are then treated as “items” within IRT. We present results from large-scale studies with thousands of learners. Using teacher assessments of student ability as “ground truth,” we compare the estimates obtained from tests vs. those from exercises. The experiments confirm that the IRT models can produce accurate ability estimation based on exercises.
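
For readers unfamiliar with IRT, the following sketch shows how an ability estimate can be obtained from right/wrong responses once item parameters are known. The abstract does not say which IRT variant the authors use; this example assumes the common two-parameter logistic (2PL) model and a simple grid search over ability values, and the names `p_correct` and `estimate_ability` as well as the item parameters are illustrative only.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability that a learner of ability theta answers correctly.

    a -- item discrimination, b -- item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate over a fixed grid of theta values.

    responses -- list of 1 (correct) / 0 (incorrect), one per item
    items     -- list of (a, b) parameter pairs, one per item
    """
    if grid is None:
        grid = [x / 10.0 for x in range(-40, 41)]  # -4.0 ... 4.0

    def log_likelihood(theta):
        total = 0.0
        for correct, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)


# Three items of increasing difficulty; in the paper's exercise setting each
# "item" would be a linguistic construct rather than a test question.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(estimate_ability([1, 1, 0], items))  # stronger learner
print(estimate_ability([1, 0, 0], items))  # weaker learner
```

The same estimator applies whether the responses come from a test session or from practice exercises, which is the comparison the abstract describes.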

[AI-19] Analyzing Probabilistic Methods for Evaluating Agent Capabilities

Link: https://arxiv.org/abs/2409.16125
Authors: Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer
Keywords: Monte Carlo estimators, Monte Carlo, mitigate risks, capabilities accurately, Carlo estimators
Categories: Artificial Intelligence (cs.AI)

Abstract:To mitigate risks from AI systems, we need to assess their capabilities accurately. This is especially difficult in cases where capabilities are only rarely displayed. Phuong et al. propose two methods that aim to obtain better estimates of the probability of an AI agent successfully completing a given task. The milestone method decomposes tasks into subtasks, aiming to improve overall success rate estimation, while the expert best-of-N method leverages human guidance as a proxy for the model’s independent performance. Our analysis of these methods as Monte Carlo estimators reveals that while both effectively reduce variance compared to naive Monte Carlo sampling, they also introduce bias. Experimental results demonstrate that the milestone method underestimates true solve rates for many real-world tasks due to its constraining assumptions. The expert best-of-N method exhibits even more severe underestimation across all tasks, attributed to an inherently flawed re-weighting factor. To enhance the accuracy of capability estimates of AI agents on difficult tasks, we suggest future work should leverage the rich literature on Monte Carlo Estimators.
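
The contrast between naive Monte Carlo sampling and a milestone-style estimate can be illustrated with a few lines of arithmetic. The sketch below assumes the simplest possible form of the milestone idea, namely multiplying per-stage success rates that were measured separately; the abstract does not spell out the exact estimator from Phuong et al., so the function names and the counts are hypothetical.

```python
def naive_estimate(successes, trials):
    """Fraction of end-to-end runs that solved the whole task."""
    return successes / trials


def milestone_estimate(stage_counts):
    """Product of per-stage success rates.

    stage_counts -- list of (passed, attempted) pairs, one per milestone,
                    where each stage is attempted from a state in which the
                    previous milestones are already complete
    """
    estimate = 1.0
    for passed, attempted in stage_counts:
        estimate *= passed / attempted
    return estimate


# With a rare capability, ten end-to-end runs may show no success at all,
# while the staged estimate is still non-zero.
print(naive_estimate(0, 10))
print(milestone_estimate([(5, 10), (2, 10), (1, 10)]))
```

The staged estimate is less noisy for rare events, but it is only valid if the task really has to pass through those milestones in that order, which is the kind of constraining assumption the abstract identifies as a source of bias.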

[AI-20] MOSS: Enabling Code-Driven Evolution and Context Management for AI Agents

Link: https://arxiv.org/abs/2409.16120
Authors: Ming Zhu, Yi Zhou
Keywords: true Turing completeness, achieving true Turing, true Turing, Turing completeness, large language models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Developing AI agents powered by large language models (LLMs) faces significant challenges in achieving true Turing completeness and adaptive, code-driven evolution. Current approaches often generate code independently of its runtime context, relying heavily on the LLM’s memory, which results in inefficiencies and limits adaptability. Manual protocol development in sandbox environments further constrains the agent’s autonomous adaptability. Crucially, achieving consistency in code and context across multi-turn interactions and ensuring isolation of local variables within each interaction remains an unsolved problem. We introduce MOSS (llM-oriented Operating System Simulation), a novel framework that addresses these challenges by integrating code generation with a dynamic context management system. MOSS ensures consistency and adaptability by using a mechanism that maintains the Python context across interactions, including isolation of local variables and preservation of runtime integrity. At its core, the framework employs an Inversion of Control (IoC) container in conjunction with decorators to enforce the least knowledge principle, allowing agents to focus on abstract interfaces rather than concrete implementations. This facilitates seamless integration of new tools and libraries, enables runtime instance replacement, and reduces prompt complexity, providing a “what you see is what you get” environment for the agent. Through a series of case studies, we show how this framework can enhance the efficiency and capabilities of agent development and highlight its advantages in moving towards Turing-complete agents capable of evolving through code.

[AI-21] Neuromorphic Drone Detection: an Event-RGB Multimodal Approach ECCV24

Link: https://arxiv.org/abs/2409.16099
Authors: Gabriele Magrini, Federico Becattini, Pietro Pala, Alberto Del Bimbo, Antonio Porta
Keywords: drone detection, recent years, extreme interest, identifying such elements, subject of extreme
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeVi Workshop at ECCV24

Abstract:In recent years, drone detection has quickly become a subject of extreme interest: the potential for fast-moving objects of contained dimensions to be used for malicious intents or even terrorist attacks has drawn attention to the need for precise and resilient systems for detecting and identifying such elements. While extensive literature and works exist on object detection based on RGB data, it is also critical to recognize the limits of this modality when applied to UAV detection. Detecting drones indeed poses several challenges such as fast-moving objects and scenes with a high dynamic range or, even worse, scarce illumination levels. Neuromorphic cameras, on the other hand, can retain precise and rich spatio-temporal information in situations that are challenging for RGB cameras. They are resilient to both high-speed moving objects and scarce illumination settings, but are prone to a rapid loss of information when the objects in the scene are static. In this context, we present a novel model for integrating both domains, leveraging multimodal data to take advantage of the best of both worlds. To this end, we also release NeRDD (Neuromorphic-RGB Drone Detection), a novel spatio-temporally synchronized Event-RGB Drone detection dataset of more than 3.5 hours of multimodal annotated recordings.

[AI-22] The Digital Transformation in Health: How AI Can Improve the Performance of Health Systems ALT

Link: https://arxiv.org/abs/2409.16098
Authors: África Periáñez, Ana Fernández del Río, Ivan Nazarov, Enric Jané, Moiz Hassan, Aditya Rastogi, Dexian Tang
Keywords (EN): revolutionize health care, Artificial Intelligence, integrating Artificial Intelligence, health, health care delivery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: This article has been accepted for publication in Health Systems & Reform, published by Taylor & Francis

Abstract:Mobile health has the potential to revolutionize health care delivery and patient engagement. In this work, we discuss how integrating Artificial Intelligence into digital health applications (focused on supply chain, patient management, and capacity building, among other use cases) can improve the health system and public health performance. We present an Artificial Intelligence and Reinforcement Learning platform that allows the delivery of adaptive interventions whose impact can be optimized through experimentation and real-time monitoring. The system can integrate multiple data sources and digital health applications. The flexibility of this platform to connect to various mobile health applications and digital devices and send personalized recommendations based on past data and predictions can significantly improve the impact of digital tools on health system outcomes. The potential for resource-poor settings, where the impact of this approach on health outcomes could be more decisive, is discussed specifically. This framework is, however, similarly applicable to improving efficiency in health systems where scarcity is not an issue.

[AI-23] From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing

Link: https://arxiv.org/abs/2409.16089
Authors: Ivan DeAndres-Tame, Muhammad Faisal, Ruben Tolosana, Rouqaiah Al-Refai, Ruben Vera-Rodriguez, Philipp Terhörst
Keywords (EN): achieving high accuracy, Explainable Artificial Intelligence, deep learning, achieving high, advanced significantly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Face Recognition (FR) has advanced significantly with the development of deep learning, achieving high accuracy in several applications. However, the lack of interpretability of these systems raises concerns about their accountability, fairness, and reliability. In the present study, we propose an interactive framework to enhance the explainability of FR models by combining model-agnostic Explainable Artificial Intelligence (XAI) and Natural Language Processing (NLP) techniques. The proposed framework is able to accurately answer various questions of the user through an interactive chatbot. In particular, the explanations generated by our proposed method are in the form of natural language text and visual representations, which for example can describe how different facial regions contribute to the similarity measure between two faces. This is achieved through the automatic analysis of the output’s saliency heatmaps of the face images and a BERT question-answering model, providing users with an interface that facilitates a comprehensive understanding of the FR decisions. The proposed approach is interactive, allowing the users to ask questions to get more precise information based on the user’s background knowledge. More importantly, in contrast to previous studies, our solution does not decrease the face recognition performance. We demonstrate the effectiveness of the method through different experiments, highlighting its potential to make FR systems more interpretable and user-friendly, especially in sensitive applications where decision-making transparency is crucial.

[AI-24] Assessing Simplification Levels in Neural Networks: The Impact of Hyperparameter Configurations on Complexity and Sensitivity

Link: https://arxiv.org/abs/2409.16086
Authors: (Joy) Huixin Guan
Keywords (EN): Lempel Ziv complexity, Lempel Ziv, experimental study focused, effects on Lempel, specifically investigating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper presents an experimental study focused on understanding the simplification properties of neural networks under different hyperparameter configurations, specifically investigating the effects on Lempel Ziv complexity and sensitivity. By adjusting key hyperparameters such as activation functions, hidden layers, and learning rate, this study evaluates how these parameters impact the complexity of network outputs and their robustness to input perturbations. The experiments conducted using the MNIST dataset aim to provide insights into the relationships between hyperparameters, complexity, and sensitivity, contributing to a deeper theoretical understanding of these concepts in neural networks.
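
For readers unfamiliar with the complexity measure named in the abstract above, here is a minimal sketch of one common variant, the Lempel-Ziv (1976) phrase count of a symbol string. Which variant the paper uses, and how it turns network outputs into strings, is not stated in the abstract; the function name and the example strings are assumptions made for illustration.

```python
def lempel_ziv_complexity(s: str) -> int:
    """Count phrases in the LZ76 exhaustive parsing of s.

    Each phrase is the shortest substring starting at the current position
    that has not already occurred in the text before its own last symbol.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


# A constant string parses into very few phrases; a less regular one into more.
constant = lempel_ziv_complexity("0000000000000000")
irregular = lempel_ziv_complexity("0001101001000101")
print(constant, irregular)
```

A low count indicates a highly regular (simple) output string, so comparing counts across hyperparameter settings is one way to quantify the simplification the abstract refers to.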

[AI-25] Online Multi-level Contrastive Representation Distillation for Cross-Subject fNIRS Emotion Recognition

Link: https://arxiv.org/abs/2409.16081
Authors: Zhili Lai, Chunmei Qing, Junpeng Tan, Wanxiang Luo, Xiangmin Xu
Keywords (EN): Utilizing functional near-infrared, functional near-infrared spectroscopy, Utilizing functional, understanding human emotions, emotion recognition
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted in ACMMM-2024 Workshop BCI. Codes are available at this https URL

Abstract:Utilizing functional near-infrared spectroscopy (fNIRS) signals for emotion recognition is a significant advancement in understanding human emotions. However, due to the lack of artificial intelligence data and algorithms in this field, current research faces the following challenges: 1) The portable wearable devices have higher requirements for lightweight models; 2) The objective differences of physiology and psychology among different subjects aggravate the difficulty of emotion recognition. To address these challenges, we propose a novel cross-subject fNIRS emotion recognition method, called the Online Multi-level Contrastive Representation Distillation framework (OMCRD). Specifically, OMCRD is a framework designed for mutual learning among multiple lightweight student networks. It utilizes multi-level fNIRS feature extractor for each sub-network and conducts multi-view sentimental mining using physiological signals. The proposed Inter-Subject Interaction Contrastive Representation (IS-ICR) facilitates knowledge transfer for interactions between student models, enhancing cross-subject emotion recognition performance. The optimal student network can be selected and deployed on a wearable device. Some experimental results demonstrate that OMCRD achieves state-of-the-art results in emotional perception and affective imagery tasks.

[AI-26] Leveraging Mixture of Experts for Improved Speech Deepfake Detection ICASSP2025

Link: https://arxiv.org/abs/2409.16077
Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro
Keywords (EN): Speech deepfakes pose, content authenticity, speech deepfake detection, pose a significant, significant threat
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025

Abstract:Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems have to face is the generalization over unseen data to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well-suited for the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism to dynamically assign expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of our proposed approach.
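
The gating idea in the abstract above, where each input receives its own set of expert weights, can be sketched as a softmax gate over per-expert scores. This is a generic dense Mixture of Experts combination, not the authors' architecture: the linear gate, the toy experts, and all names below are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_score(
    features: Sequence[float],
    experts: Sequence[Callable[[Sequence[float]], float]],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[float, List[float]]:
    """Combine expert scores with input-dependent gate weights.

    gate_weights holds one row per expert; the gate logit of an expert is the
    dot product of its row with the input features (a hypothetical linear gate).
    """
    logits = [sum(w * f for w, f in zip(row, features)) for row in gate_weights]
    weights = softmax(logits)
    scores = [expert(features) for expert in experts]
    return sum(w * s for w, s in zip(weights, scores)), weights


# Two toy "detectors": one always says real (0.0), one always says fake (1.0).
experts = [lambda f: 0.0, lambda f: 1.0]
neutral_gate = [[0.0, 0.0], [0.0, 0.0]]   # gate ignores the input
biased_gate = [[0.0, 0.0], [5.0, 0.0]]    # gate favours expert 1 when f[0] > 0

score_neutral, w_neutral = moe_score([1.0, 2.0], experts, neutral_gate)
score_biased, w_biased = moe_score([1.0, 2.0], experts, biased_gate)
print(score_neutral, score_biased)
```

With the neutral gate both experts contribute equally; with the biased gate the second expert dominates for this input, which is the per-input specialization the abstract describes.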

[AI-27] Towards Robust Object Detection: Identifying and Removing Backdoors via Module Inconsistency Analysis

Link: https://arxiv.org/abs/2409.16057
Authors: Xianda Zhang, Siyuan Liang
Keywords (EN): Object detection models, Region Proposal Network, Object detection, security-critical applications, specific patterns
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Object detection models, widely used in security-critical applications, are vulnerable to backdoor attacks that cause targeted misclassifications when triggered by specific patterns. Existing backdoor defense techniques, primarily designed for simpler models like image classifiers, often fail to effectively detect and remove backdoors in object detectors. We propose a backdoor defense framework tailored to object detection models, based on the observation that backdoor attacks cause significant inconsistencies between local modules’ behaviors, such as the Region Proposal Network (RPN) and classification head. By quantifying and analyzing these inconsistencies, we develop an algorithm to detect backdoors. We find that the inconsistent module is usually the main source of backdoor behavior, leading to a removal method that localizes the affected module, resets its parameters, and fine-tunes the model on a small clean dataset. Extensive experiments with state-of-the-art two-stage object detectors show our method achieves a 90% improvement in backdoor removal rate over fine-tuning baselines, while limiting clean data accuracy loss to less than 4%. To the best of our knowledge, this work presents the first approach that addresses both the detection and removal of backdoors in two-stage object detection models, advancing the field of securing these complex systems against backdoor attacks.

[AI-28] Adversarial Watermarking for Face Recognition

Link: https://arxiv.org/abs/2409.16056
Authors: Yuguang Yao, Anil Jain, Sijia Liu
Keywords (EN): monitor unauthorized alterations, Watermarking, embedding an identifier, unauthorized alterations, essential technique
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Watermarking is an essential technique for embedding an identifier (i.e., watermark message) within digital images to assert ownership and monitor unauthorized alterations. In face recognition systems, watermarking plays a pivotal role in ensuring data integrity and security. However, an adversary could potentially interfere with the watermarking process, significantly impairing recognition performance. We explore the interaction between watermarking and adversarial attacks on face recognition models. Our findings reveal that while watermarking or input-level perturbation alone may have a negligible effect on recognition accuracy, the combined effect of watermarking and perturbation can result in an adversarial watermarking attack, significantly degrading recognition performance. Specifically, we introduce a novel threat model, the adversarial watermarking attack, which remains stealthy in the absence of watermarking, allowing images to be correctly recognized initially. However, once watermarking is applied, the attack is activated, causing recognition failures. Our study reveals a previously unrecognized vulnerability: adversarial perturbations can exploit the watermark message to evade face recognition systems. Evaluated on the CASIA-WebFace dataset, our proposed adversarial watermarking attack reduces face matching accuracy by 67.2% with an $\ell_\infty$ norm-measured perturbation strength of 2/255 and by 95.9% with a strength of 4/255.

[AI-29] Whole-body end-effector pose tracking

Link: https://arxiv.org/abs/2409.16048
Authors: Tifanny Portela, Andrei Cramariuc, Mayank Mittal, Marco Hutter
Keywords (EN): Combining manipulation, mobility of legged, Combining, recent Reinforcement Learning, robotic applications
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Combining manipulation with the mobility of legged robots is essential for a wide range of robotic applications. However, integrating an arm with a mobile base significantly increases the system’s complexity, making precise end-effector control challenging. Existing model-based approaches are often constrained by their modeling assumptions, leading to limited robustness. Meanwhile, recent Reinforcement Learning (RL) implementations restrict the arm’s workspace to be in front of the robot or track only the position to obtain decent tracking accuracy. In this work, we address these limitations by introducing a whole-body RL formulation for end-effector pose tracking in a large workspace on rough, unstructured terrains. Our proposed method involves a terrain-aware sampling strategy for the robot’s initial configuration and end-effector pose commands, as well as a game-based curriculum to extend the robot’s operating range. We validate our approach on the ANYmal quadrupedal robot with a six DoF robotic arm. Through our experiments, we show that the learned controller achieves precise command tracking over a large workspace and adapts across varying terrains such as stairs and slopes. On deployment, it achieves a pose-tracking error of 2.64 cm and 3.64 degrees, outperforming existing competitive baselines.

[AI-30] LTNtorch: PyTorch Implementation of Logic Tensor Networks

Link: https://arxiv.org/abs/2409.16045
Authors: Tommaso Carraro, Luciano Serafini, Fabio Aiolli
Keywords (EN): effectively incorporates deep, Logic Tensor Networks, incorporates deep learning, effectively incorporates, incorporates deep
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures

Abstract:Logic Tensor Networks (LTN) is a Neuro-Symbolic framework that effectively incorporates deep learning and logical reasoning. In particular, LTN allows defining a logical knowledge base and using it as the objective of a neural model. This makes learning by logical reasoning possible as the parameters of the model are optimized by minimizing a loss function composed of a set of logical formulas expressing facts about the learning task. The framework learns via gradient-descent optimization. Fuzzy logic, a relaxation of classical logic permitting continuous truth values in the interval [0,1], makes this learning possible. Specifically, the training of an LTN consists of three steps. Firstly, (1) the training data is used to ground the formulas. Then, (2) the formulas are evaluated, and the loss function is computed. Lastly, (3) the gradients are back-propagated through the logical computational graph, and the weights of the neural model are changed so the knowledge base is maximally satisfied. LTNtorch is the fully documented and tested PyTorch implementation of Logic Tensor Networks. This paper presents the formalization of LTN and how LTNtorch implements it. Moreover, it provides a basic binary classification example.
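
The three training steps listed in the abstract above can be sketched in plain Python, without LTNtorch itself. Everything in this sketch is an assumption made for illustration: the predicate is a one-parameter-pair sigmoid `P(x) = sigmoid(w*x + b)`, negation is the standard fuzzy `1 - t`, the universal quantifier is approximated by a plain mean, and gradients are written out by hand rather than obtained through autograd as a PyTorch implementation would.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def satisfaction(w: float, b: float, pos: List[float], neg: List[float]) -> float:
    """Steps 1 and 2: ground the formulas on data and evaluate them.

    Knowledge base (hypothetical): P(x) holds for every positive example and
    NOT P(x) holds for every negative one. Truth values lie in [0, 1].
    """
    truths = [sigmoid(w * x + b) for x in pos]
    truths += [1.0 - sigmoid(w * x + b) for x in neg]
    return sum(truths) / len(truths)  # mean as a simple "for all" aggregator


def train(pos: List[float], neg: List[float],
          steps: int = 200, lr: float = 1.0) -> Tuple[float, float]:
    """Step 3: gradient descent on loss = 1 - satisfaction."""
    w, b = 0.0, 0.0
    n = len(pos) + len(neg)
    labelled = [(x, 1.0) for x in pos] + [(x, -1.0) for x in neg]
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, sign in labelled:
            p = sigmoid(w * x + b)
            slope = sign * p * (1.0 - p)  # derivative of the truth value w.r.t. z
            grad_w += slope * x / n
            grad_b += slope / n
        # Descending on the loss is the same as ascending on satisfaction.
        w += lr * grad_w
        b += lr * grad_b
    return w, b


POS = [1.0, 2.0, 3.0]
NEG = [-1.0, -2.0, -3.0]
before = satisfaction(0.0, 0.0, POS, NEG)
w, b = train(POS, NEG)
after = satisfaction(w, b, POS, NEG)
print(before, after)
```

The loss is simply one minus the degree to which the knowledge base is satisfied, so minimizing it pushes the predicate toward truth values that make every formula hold.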

[AI-31] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Link: https://arxiv.org/abs/2409.16040
Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin
Keywords (EN): Deep learning, past decades, time series, time series forecasting, Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 10 figures, 13 tables

Abstract:Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data Time-300B, which spans over 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
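
The phrase "activating only a subset of networks for each prediction" in the abstract above refers to sparse routing. A minimal sketch of generic top-k routing follows; the value of `k`, the router logits, and the toy experts are assumptions for illustration, and Time-MoE's actual router is not described in the abstract.

```python
import math
from typing import Callable, Dict, List, Sequence


def top_k_route(logits: Sequence[float], k: int) -> Dict[int, float]:
    """Keep the k largest router logits and renormalise them with a softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_moe(x: float, experts: Sequence[Callable[[float], float]],
               logits: Sequence[float], k: int, called: List[int]) -> float:
    """Evaluate only the selected experts and mix their outputs."""
    routing = top_k_route(logits, k)
    out = 0.0
    for idx, weight in routing.items():
        called.append(idx)  # record which experts actually ran
        out += weight * experts[idx](x)
    return out


# Four toy experts; expert i multiplies its input by i.
experts = [lambda x, i=i: x * i for i in range(4)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router output for one input
called: List[int] = []
y = sparse_moe(1.5, experts, router_logits, k=2, called=called)
print(y, sorted(called))
```

Only two of the four experts run for this input, so compute per prediction stays roughly constant as more experts (and hence more parameters) are added.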

[AI-32] Bridging Environments and Language with Rendering Functions and Vision-Language Models

Link: https://arxiv.org/abs/2409.16024
Authors: Theo Cachet, Christopher R. Dance, Olivier Sigaud
Keywords (EN): enabling language-conditioned agents, perform diverse tasks, Vision-language models, grounding language, language-conditioned agents
Categories: Artificial Intelligence (cs.AI)

Abstract:Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, this paper introduces a novel decomposition of the problem of building an LCA: first find an environment configuration that has a high VLM score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably, the use of distilled models, and the evaluation of configurations from multiple viewpoints to resolve the ambiguities inherent in a single 2D view. We demonstrate our approach on the Humanoid environment, showing that it results in LCAs that outperform MTRL baselines in zero-shot generalization, without requiring any textual task descriptions or other forms of environment-specific annotation during training. Videos and an interactive demo can be found at this https URL

[AI-33] AI Can Be Cognitively Biased: An Exploratory Study on Threshold Priming in LLM-Based Batch Relevance Assessment

Link: https://arxiv.org/abs/2409.16022
Authors: Nuo Chen, Jiqun Liu, Xiaoyu Dong, Qijiong Liu, Tetsuya Sakai, Xiao-Ming Wu
Keywords (EN): Cognitive biases, problematic decision-making, extensively studied, systematic deviations, deviations in thinking
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Cognitive biases are systematic deviations in thinking that lead to irrational judgments and problematic decision-making, extensively studied across various fields. Recently, large language models (LLMs) have shown advanced understanding capabilities but may inherit human biases from their training data. While social biases in LLMs have been well-studied, cognitive biases have received less attention, with existing research focusing on specific scenarios. The broader impact of cognitive biases on LLMs in various decision-making contexts remains underexplored. We investigated whether LLMs are influenced by the threshold priming effect in relevance judgments, a core task and widely-discussed research topic in the Information Retrieval (IR) community. The priming effect occurs when exposure to certain stimuli unconsciously affects subsequent behavior and decisions. Our experiment employed 10 topics from the TREC 2019 Deep Learning passage track collection, and tested AI judgments under different document relevance scores, batch lengths, and LLM models, including GPT-3.5, GPT-4, LLaMa2-13B and LLaMa2-70B. Results showed that LLMs tend to give lower scores to later documents if earlier ones have high relevance, and vice versa, regardless of the combination and model used. Our finding demonstrates that LLMs’ judgments, similar to human judgments, are also influenced by threshold priming biases, and suggests that researchers and system engineers should take into account potential human-like cognitive biases in designing, evaluating, and auditing LLMs in IR tasks and beyond.

[AI-34] Artificial Human Intelligence: The role of Humans in the Development of Next Generation AI

Link: https://arxiv.org/abs/2409.16001
Authors: Suayb S. Arslan
Keywords (EN): evolutionary path forward, accessible form, source of reasoning, hosted by biological, biological hardware
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 34 pages, 5 figures, submitted to IEEE Trans. on Artificial Intelligence

Abstract:Human intelligence, the most evident and accessible form of source of reasoning, hosted by biological hardware, has evolved and been refined over thousands of years, positioning itself today to create new artificial forms and preparing to self-design their evolutionary path forward. Beginning with the advent of foundation models, the rate at which human and artificial intelligence interact with each other has surpassed any anticipated quantitative figures. The close engagement led to both bits of intelligence to be impacted in various ways, which naturally resulted in complex confluences that warrant close scrutiny. In the sequel, we shall explore the interplay between human and machine intelligence, focusing on the crucial role humans play in developing ethical, responsible, and robust intelligent systems. We slightly delve into interesting aspects of implementation inspired by the mechanisms underlying neuroscience and human cognition. Additionally, we propose future perspectives, capitalizing on the advantages of symbiotic designs to suggest a human-centered direction for next-generation AI development. We finalize this evolving document with a few thoughts and open questions yet to be addressed by the broader community.

[AI-35] Improvements to SDXL in NovelAI Diffusion V3

Link: https://arxiv.org/abs/2409.15997
Authors: Juan Ossa, Eren Doğan, Alex Birch, F. Johnson
Keywords (EN): training NovelAI Diffusion, image generation model, art anime image, anime image generation, NovelAI Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Abstract:In this technical report, we document the changes we made to SDXL in the process of training NovelAI Diffusion V3, our state of the art anime image generation model.

[AI-36] DataGpt-SQL-7B: An Open-Source Language Model for Text-to-SQL

Link: https://arxiv.org/abs/2409.15985
Authors: Lixia Wu, Peng Li, Junhong Lou, Lei Fu
Keywords (EN): closed-source Large Language, Large Language Models, translating natural language, natural language queries, Large Language
Categories: Artificial Intelligence (cs.AI)

Abstract:In addressing the pivotal role of translating natural language queries into SQL commands, we propose a suite of compact, fine-tuned models and self-refine mechanisms to democratize data access and analysis for non-expert users, mitigating risks associated with closed-source Large Language Models. Specifically, we constructed a dataset of over 20K samples for Text-to-SQL as well as a preference dataset, to improve the efficiency in the domain of SQL generation. To further ensure code validity, a code corrector was integrated into the model. Our system, DataGpt-sql, achieved 87.2% accuracy on spider-dev, showcasing the effectiveness of our solution in text-to-SQL conversion tasks. Our code, data, and models are available at this https URL

[AI-37] Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

Link: https://arxiv.org/abs/2409.15980
Authors: Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup
Keywords (EN): Traditional machine learning-based, extensive data collection, Traditional machine, machine learning-based visual, require extensive data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive cameras, computing equipment, and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through openVINO. The results show that this cost-effective system can complete anomaly detection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers.

[AI-38] Disentangling Age and Identity with a Mutual Information Minimization Approach for Cross-Age Speaker Verification INTERSPEECH2024

Link: https://arxiv.org/abs/2409.15974
Authors: Fengrun Zhang, Wangjin Zhou, Yiming Liu, Wang Geng, Yahui Shan, Chen Zhang
Keywords (EN): increasing research interest, increasing research, research interest, existing speaker verification, speaker verification
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2024

Abstract:There has been an increasing research interest in cross-age speaker verification (CASV). However, existing speaker verification systems perform poorly in CASV due to the great individual differences in voice caused by aging. In this paper, we propose a disentangled representation learning framework for CASV based on mutual information (MI) minimization. In our method, a backbone model is trained to disentangle the identity- and age-related embeddings from speaker information, and an MI estimator is trained to minimize the correlation between age- and identity-related embeddings via MI minimization, resulting in age-invariant speaker embeddings. Furthermore, by using the age gaps between positive and negative samples, we propose an aging-aware MI minimization loss function that allows the backbone model to focus more on the vocal changes with large age gaps. Experimental results show that the proposed method outperforms other methods on multiple Cross-Age test sets of Vox-CA.

[AI-39] Edge-device Collaborative Computing for Multi-view Classification

Link: https://arxiv.org/abs/2409.15973
Authors: Marco Palena, Tania Cerquitelli, Carla Fabiana Chiasserini
Keywords (EN): pushing deep learning, deep learning, deliver faster responses, deep learning computations, realize deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

Abstract:Motivated by the proliferation of Internet-of-Things (IoT) devices and the rapid advances in the field of deep learning, there is a growing interest in pushing deep learning computations, conventionally handled by the cloud, to the edge of the network to deliver faster responses to end users, reduce bandwidth consumption to the cloud, and address privacy concerns. However, to fully realize deep learning at the edge, two main challenges still need to be addressed: (i) how to meet the high resource requirements of deep learning on resource-constrained devices, and (ii) how to leverage the availability of multiple streams of spatially correlated data, to increase the effectiveness of deep learning and improve application-level performance. To address the above challenges, we explore collaborative inference at the edge, in which edge nodes and end devices share correlated data and the inference computational burden by leveraging different ways to split computation and fuse data. Besides traditional centralized and distributed schemes for edge-end device collaborative inference, we introduce selective schemes that decrease bandwidth resource consumption by effectively reducing data redundancy. As a reference scenario, we focus on multi-view classification in a networked system in which sensing nodes can capture overlapping fields of view. The proposed schemes are compared in terms of accuracy, computational expenditure at the nodes, communication overhead, inference latency, robustness, and noise sensitivity. Experimental results highlight that selective collaborative schemes can achieve different trade-offs between the above performance metrics, with some of them bringing substantial communication savings (from 18% to 74% of the transmitted data with respect to centralized inference) while still keeping the inference accuracy well above 90%.

[AI-40] Creating Healthy Friction: Determining Stakeholder Requirements of Job Recommendation Explanations RECSYS

Link: https://arxiv.org/abs/2409.15971
Authors: Roan Schellingerhout, Francesco Barile, Nava Tintarev
Keywords (EN): retrieval in recruitment, information retrieval, large impact, job seekers, explanations
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, to be published in ACM RecSys in HR '24: 4th Workshop on Recommender Systems for Human Resources

Abstract:The increased use of information retrieval in recruitment, primarily through job recommender systems (JRSs), can have a large impact on job seekers, recruiters, and companies. As a result, such systems have been determined to be high-risk in recent legislature. This requires JRSs to be trustworthy and transparent, allowing stakeholders to understand why specific recommendations were made. To fulfill this requirement, the stakeholders’ exact preferences and needs need to be determined. To do so, we evaluated an explainable job recommender system using a realistic, task-based, mixed-design user study (n=30) in which stakeholders had to make decisions based on the model’s explanations. This mixed-methods evaluation consisted of two objective metrics - correctness and efficiency, along with three subjective metrics - trust, transparency, and usefulness. These metrics were evaluated twice per participant, once using real explanations and once using random explanations. The study included a qualitative analysis following a think-aloud protocol while performing tasks adapted to each stakeholder group. We find that providing stakeholders with real explanations does not significantly improve decision-making speed and accuracy. Our results showed a non-significant trend for the real explanations to outperform the random ones on perceived trust, usefulness, and transparency of the system for all stakeholder types. We determine that stakeholders benefit more from interacting with explanations as decision support capable of providing healthy friction, rather than as previously-assumed persuasive tools.

[AI-41] Provably Efficient Exploration in Inverse Constrained Reinforcement Learning

Link: https://arxiv.org/abs/2409.15963
Authors: Bo Yue, Jian Li, Guiliang Liu
Keywords: Inverse Constrained Reinforcement, Constrained Reinforcement Learning, Inverse Constrained, Reinforcement Learning, Constrained Reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:To obtain the optimal constraints in complex environments, Inverse Constrained Reinforcement Learning (ICRL) seeks to recover these constraints from expert demonstrations in a data-driven manner. Existing ICRL algorithms collect training samples from an interactive environment. However, the efficacy and efficiency of these sampling strategies remain unknown. To bridge this gap, we introduce a strategic exploration framework with provable efficiency. Specifically, we define a feasible constraint set for ICRL problems and investigate how expert policy and environmental dynamics influence the optimality of constraints. Motivated by our findings, we propose two exploratory algorithms to achieve efficient constraint inference via 1) dynamically reducing the bounded aggregate error of cost estimation and 2) strategically constraining the exploration policy. Both algorithms are theoretically grounded with tractable sample complexity. We empirically demonstrate the performance of our algorithms under various environments.

[AI-42] ASD-Diffusion: Anomalous Sound Detection with Diffusion Models ICPR2024

Link: https://arxiv.org/abs/2409.15957
Authors: Fengrun Zhang, Xiang Xie, Kai Guo
Keywords: Anomalous Sound Detection, Unsupervised Anomalous Sound, Sound Detection based, Unsupervised Anomalous, Anomalous Sound
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: This paper will appear at ICPR 2024

Abstract:Unsupervised Anomalous Sound Detection (ASD) aims to design a generalizable method that can be used to detect anomalies when only normal sounds are given. In this paper, Anomalous Sound Detection based on Diffusion Models (ASD-Diffusion) is proposed for ASD in real-world factories. In our pipeline, the anomalies in acoustic features are reconstructed from their noisy corrupted features into their approximate normal pattern. Secondly, a post-processing anomalies filter algorithm is proposed to detect anomalies that exhibit significant deviation from the original input after reconstruction. Furthermore, denoising diffusion implicit model is introduced to accelerate the inference speed by a longer sampling interval of the denoising process. The proposed method is innovative in the application of diffusion models as a new scheme. Experimental results on the development set of DCASE 2023 challenge task 2 outperform the baseline by 7.75%, demonstrating the effectiveness of the proposed method.

[AI-43] Historical Trajectory Assisted Zeroth-Order Federated Optimization

Link: https://arxiv.org/abs/2409.15955
Authors: Xiaoyu He, Chenlin Wu, Zike Li, Zibin Zheng
Keywords: train models individually, distributed learning framework, learning framework, train models, models individually
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages with theoretical proof

Abstract:Federated learning is a distributed learning framework which enables clients to train models individually and to upload their model updates for aggregation. The local training process heavily relies on distributed gradient descent techniques. In the situation where gradient information is not available, the gradients need to be estimated from zeroth-order information, which typically involves computing finite-differences along isotropic random directions. This method suffers from high estimation errors, as the geometric features of the objective landscape may be overlooked during the isotropic sampling. In this work, we propose a non-isotropic sampling method to improve the gradient estimation procedure. Gradients in our method are estimated in a subspace spanned by historical trajectories of solutions, aiming to encourage the exploration of promising regions and hence improve the convergence. We implement this method in zeroth-order federated settings, and show that the convergence rate aligns with existing ones while introducing no significant overheads in communication or local computation. The effectiveness of our proposal is verified on several numerical experiments in comparison to several commonly-used zeroth-order federated optimization algorithms.
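
As a rough illustration of the baseline this abstract refers to, the sketch below estimates a gradient from function values only, using two-point finite differences along isotropic Gaussian directions. It is not the paper's non-isotropic, trajectory-based sampler; the function name `zo_gradient`, the smoothing step `mu`, the number of directions, and the test objective `quadratic` are all assumptions chosen for the example.

```python
import random


def zo_gradient(f, x, num_dirs=2000, mu=1e-4, seed=0):
    """Estimate grad f(x) from function values only (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_dirs):
        # Isotropic sampling: every direction is drawn from N(0, I).
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Central finite difference approximates the directional derivative.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_dirs
    return grad


def quadratic(x):
    # Toy objective with known gradient [2 * x0, 4 * x1].
    return sum((i + 1) * xi * xi for i, xi in enumerate(x))


estimate = zo_gradient(quadratic, [1.0, -1.0])
```

For this toy objective the true gradient at `[1.0, -1.0]` is `[2.0, -4.0]`; the estimate is noisy and only approaches it as `num_dirs` grows, which is the estimation error the abstract attributes to isotropic sampling.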

[AI-44] TSFeatLIME: An Online User Study in Enhancing Explainability in Univariate Time Series Forecasting

Link: https://arxiv.org/abs/2409.15950
Authors: Hongnan Ma, Kevin McAreavey, Weiru Liu
Keywords: Time series forecasting, employs complex models, Time series, employs complex, series forecasting
Categories: Artificial Intelligence (cs.AI)

Abstract:Time series forecasting, while vital in various applications, often employs complex models that are difficult for humans to understand. Effective explainable AI techniques are crucial to bridging the gap between model predictions and user understanding. This paper presents a framework - TSFeatLIME, extending TSLIME, tailored specifically for explaining univariate time series forecasting. TSFeatLIME integrates an auxiliary feature into the surrogate model and considers the pairwise Euclidean distances between the queried time series and the generated samples to improve the fidelity of the surrogate models. However, the usefulness of such explanations for human beings remains an open question. We address this by conducting a user study with 160 participants through two interactive interfaces, aiming to measure how individuals from different backgrounds can simulate or predict model output changes in the treatment group and control group. Our results show that the surrogate model under the TSFeatLIME framework is able to better simulate the behaviour of the black-box considering distance, without sacrificing accuracy. In addition, the user study suggests that the explanations were significantly more effective for participants without a computer science background.

[AI-45] Automated test generation to evaluate tool-augmented LLMs as conversational AI agents EMNLP2024

Link: https://arxiv.org/abs/2409.15934
Authors: Samuel Arcadinho, David Aparicio, Mariana Almeida
Keywords: call appropriate functions, promising approach, approach to create, realistic conversations, follow procedures
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, Submitted to GenBench@EMNLP2024

Abstract:Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator’s tendency to hallucinate content that is not grounded on input procedures, and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and applicable to AI agents in different domains.

[AI-46] Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain

Link: https://arxiv.org/abs/2409.15924
Authors: Yuanchang Luo, Zhanglin Wu, Daimeng Wei, Hengchao Shang, Zongyao Li, Jiaxin Guo, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Yuhao Xie, Jiawei Zheng, Bin Wei, Hao Yang
Keywords: Translation Service Center, Huawei Translation Service, Service Center, Languages of Spain, Low-Resource Languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, wmt24. arXiv admin note: substantial text overlap with arXiv:2409.14842; text overlap with arXiv:2409.14800

Abstract:This article describes the submission by Huawei Translation Service Center (HW-TSC) to the Translation into Low-Resource Languages of Spain task at WMT 2024. We participated in three translation tasks: Spanish to Aragonese (es-arg), Spanish to Aranese (es-arn), and Spanish to Asturian (es-ast). For these three translation tasks, we apply training strategies such as multilingual transfer, regularized dropout, forward translation and back translation, LaBSE denoising, and transduction ensemble learning to a neural machine translation (NMT) model based on the deep transformer-big architecture. By using these enhancement strategies, our submission achieved a competitive result in the final evaluation.

[AI-47] Planning in the Dark: LLM-Symbolic Planning Pipeline without Experts

Link: https://arxiv.org/abs/2409.15915
Authors: Sukai Huang, Nir Lipovetzky, Trevor Cohn
Keywords: Large Language Models, solving natural language-described, Large Language, Language Models, language-described planning tasks
Categories: Artificial Intelligence (cs.AI)
Comments: 8 main body pages, 10 appendix pages

Abstract:Large Language Models (LLMs) have shown promise in solving natural language-described planning tasks, but their direct use often leads to inconsistent reasoning and hallucination. While hybrid LLM-symbolic planning pipelines have emerged as a more robust alternative, they typically require extensive expert intervention to refine and validate generated action schemas. This not only limits scalability but also introduces a potential for biased interpretation, as a single expert’s interpretation of ambiguous natural language descriptions might not align with the user’s actual intent. To address this, we propose a novel approach that constructs an action schema library to generate multiple candidates, accounting for the diverse possible interpretations of natural language descriptions. We further introduce a semantic validation and ranking module that automatically filters and ranks the generated schemas and plans without an expert in the loop. Our experiments show that our pipeline maintains superiority in planning over the direct LLM planning approach. These findings demonstrate the feasibility of a fully automated end-to-end LLM-symbolic planner that requires no expert intervention, opening up the possibility for a broader audience to engage with AI planning with fewer prerequisites in domain expertise.

[AI-48] Enhancing IoT based Plant Health Monitoring through Advanced Human Plant Interaction using Large Language Models and Mobile Applications

Link: https://arxiv.org/abs/2409.15910
Authors: Kriti Agarwal, Samhruth Ananthanarayanan, Srinitish Srinivasan, Abirami S
Keywords: AI-powered language models, real-time sensor data, presents the development, language models, Gemini API
Categories: Artificial Intelligence (cs.AI)
Comments: Pre-print Version. Submitted to conference

Abstract:This paper presents the development of a novel plant communication application that allows plants to “talk” to humans using real-time sensor data and AI-powered language models. Utilizing soil sensors that track moisture, temperature, and nutrient levels, the system feeds this data into the Gemini API, where it is processed and transformed into natural language insights about the plant’s health and “mood.” Developed using Flutter, Firebase, and ThingSpeak, the app offers a seamless user experience with real-time interaction capabilities. By fostering human-plant connectivity, this system enhances plant care practices, promotes sustainability, and introduces innovative applications for AI and IoT technologies in both personal and agricultural contexts. The paper explores the technical architecture, system integration, and broader implications of AI-driven plant communication.

[AI-49] Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection ECAI2024

Link: https://arxiv.org/abs/2409.15907
Authors: Xingyu Ma, Xin Tian, Lingxiang Wu, Xuepeng Wang, Xueming Tang, Jinqiao Wang
Keywords: Large Language Models, Large Language, evolution of Large, Language Models, subtask in semantic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ECAI 2024

Abstract:Text-to-SQL is a subtask in semantic parsing that has seen rapid progress with the evolution of Large Language Models (LLMs). However, LLMs face challenges due to hallucination issues and a lack of domain-specific database knowledge (such as table schema and cell values). As a result, they can make errors in generating table names, columns, and matching values to the correct columns in SQL statements. This paper introduces a method of knowledge injection to enhance LLMs’ ability to understand schema contents by incorporating prior knowledge. This approach improves their performance in Text-to-SQL tasks. Experimental results show that pre-training LLMs on domain-specific database knowledge and fine-tuning them on downstream Text-to-SQL tasks significantly improves the Execution Match (EX) and Exact Match (EM) metrics across various models. This effectively reduces errors in generating column names and matching values to the columns. Furthermore, the knowledge-injected models can be applied to many downstream Text-to-SQL tasks, demonstrating the generalizability of the approach presented in this paper.

[AI-50] Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM ICASSP2025

Link: https://arxiv.org/abs/2409.15905
Authors: Fengrun Zhang, Wang Geng, Hukai Huang, Cheng Yi, He Qu
Keywords: Automatic Speech Recognition, speech-conditioned Large Language, Large Language Model, speech-conditioned Large, Automatic Speech
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to ICASSP 2025

Abstract:In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). Specifically, we propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task. We also present a connector with an MoE architecture that manages multiple languages efficiently. To further enhance the collaboration of multiple experts and leverage the understanding capabilities of the LLM, we propose a two-stage progressive training strategy: 1) The connector is unfrozen and trained with language-specialized experts to map speech representations to the text space. 2) The connector and LLM LoRA adaptor are trained with the proposed IDIT mechanism and all experts are activated to learn general representations. Experimental results demonstrate that our method significantly outperforms state-of-the-art models, including end-to-end and large-scale audio-language models.

[AI-51] Five questions and answers about artificial intelligence

Link: https://arxiv.org/abs/2409.15903
Authors: Alberto Prieto, Beatriz Prieto
Keywords: Artificial Intelligence, Rapid advances, advances in Artificial, controversy in society, scientific basis
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 0 figures, Scientific and technological popularization article

Abstract:Rapid advances in Artificial Intelligence (AI) are generating much controversy in society, often without scientific basis. As occurred with the development of other emerging technologies, such as the introduction of electricity in the early 20th century, AI causes both fascination and fear. Following the philosopher R.W. Emerson’s advice that knowledge is the antidote to fear, this paper seeks to contribute to the dissemination of knowledge about AI. To this end, it reflects on the following questions: the origins of AI, its possible future evolution, its ability to show feelings, the associated threats and dangers, and the concept of AI singularity.

[AI-52] Symmetries and Expressive Requirements for Learning General Policies KR2024

Link: https://arxiv.org/abs/2409.15892
Authors: Dominik Drexler, Simon Ståhlberg, Blai Bonet, Hector Geffner
Keywords: State symmetries play, general policies, play an important, important role, learning general policies
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR2024) in the Reasoning, Learning, and Decision Making track

Abstract:State symmetries play an important role in planning and generalized planning. In the first case, state symmetries can be used to reduce the size of the search; in the second, to reduce the size of the training set. In the case of general planning, however, it is also critical to distinguish non-symmetric states, i.e., states that represent non-isomorphic relational structures. However, while the language of first-order logic distinguishes non-symmetric states, the languages and architectures used to represent and learn general policies do not. In particular, recent approaches for learning general policies use state features derived from description logics or learned via graph neural networks (GNNs) that are known to be limited by the expressive power of C_2, first-order logic with two variables and counting. In this work, we address the problem of detecting symmetries in planning and generalized planning and use the results to assess the expressive requirements for learning general policies over various planning domains. For this, we map planning states to plain graphs, run off-the-shelf algorithms to determine whether two states are isomorphic with respect to the goal, and run coloring algorithms to determine if C_2 features computed logically or via GNNs distinguish non-isomorphic states. Symmetry detection results in more effective learning, while the failure to detect non-symmetries prevents general policies from being learned at all in certain domains.
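
The coloring test mentioned in this abstract can be sketched with plain color refinement (1-WL), whose distinguishing power is known to match C_2. The code below is only an illustration on small undirected graphs given as adjacency dictionaries; the function name `wl_histograms` and the example graphs are assumptions, and the paper's own encoding of planning states and goals is not reproduced here.

```python
from collections import Counter


def wl_histograms(adj_a, adj_b):
    """Run color refinement on the disjoint union of two graphs and
    return the color histogram of each graph (hypothetical helper)."""
    # Tag nodes by graph so both graphs share a single color palette.
    nodes = [("a", v) for v in adj_a] + [("b", v) for v in adj_b]
    nbrs = {("a", v): [("a", w) for w in adj_a[v]] for v in adj_a}
    nbrs.update({("b", v): [("b", w) for w in adj_b[v]] for v in adj_b})
    color = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        # New color = old color plus the sorted multiset of neighbor colors.
        signature = {
            n: (color[n], tuple(sorted(color[m] for m in nbrs[n])))
            for n in nodes
        }
        palette = {s: i for i, s in enumerate(sorted(set(signature.values())))}
        new_color = {n: palette[signature[n]] for n in nodes}
        stable = len(set(new_color.values())) == len(set(color.values()))
        color = new_color
        if stable:  # refinement never merges classes, so equal counts mean done
            break
    hist_a = Counter(color[n] for n in nodes if n[0] == "a")
    hist_b = Counter(color[n] for n in nodes if n[0] == "b")
    return hist_a, hist_b


hexagon = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

hex_hist, tri2_hist = wl_histograms(hexagon, two_triangles)
indistinguishable = hex_hist == tri2_hist
path_hist, tri_hist = wl_histograms(path, triangle)
distinguished = path_hist != tri_hist
```

The hexagon and the pair of triangles are not isomorphic, yet every node has degree 2, so color refinement assigns them identical histograms. This is the kind of failure to separate non-symmetric states that the abstract describes for C_2-bounded features.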

[AI-53] Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning

Link: https://arxiv.org/abs/2409.15879
Authors: Bin Wei, Jiawei Zhen, Zongyao Li, Zhanglin Wu, Daimeng Wei, Jiaxin Guo, Zhiqiang Rao, Shaojun Li, Yuanchang Luo, Hengchao Shang, Jinlong Yang, Yuhao Xie, Hao Yang
Keywords: Huawei Translation Center, Shared Task, Indian Languages Machine, Indian Languages, low-resource Indian languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, wmt24. arXiv admin note: substantial text overlap with arXiv:2409.14800

Abstract:This paper introduces the submission by Huawei Translation Center (HW-TSC) to the WMT24 Indian Languages Machine Translation (MT) Shared Task. To develop a reliable machine translation system for low-resource Indian languages, we employed two distinct knowledge transfer strategies, taking into account the characteristics of the language scripts and the support available from existing open-source models for Indian languages. For Assamese (as) and Manipuri (mn), we fine-tuned the existing IndicTrans2 open-source model to enable bidirectional translation between English and these languages. For Khasi (kh) and Mizo (mz), we trained a multilingual model as a baseline using bilingual data from these four language pairs, along with about 8kw additional English-Bengali bilingual data, all of which share certain linguistic features. This was followed by fine-tuning to achieve bidirectional translation between English and Khasi, as well as English and Mizo. Our transfer learning experiments produced impressive results: 23.5 BLEU for en-as, 31.8 BLEU for en-mn, 36.2 BLEU for as-en, and 47.9 BLEU for mn-en on their respective test sets. Similarly, the multilingual model transfer learning experiments yielded impressive outcomes, achieving 19.7 BLEU for en-kh, 32.8 BLEU for en-mz, 16.1 BLEU for kh-en, and 33.9 BLEU for mz-en on their respective test sets. These results not only highlight the effectiveness of transfer learning techniques for low-resource languages but also contribute to advancing machine translation capabilities for low-resource Indian languages.

[AI-54] In-Context Ensemble Improves Video-Language Models for Low-Level Workflow Understanding from Human Demonstrations

Link: https://arxiv.org/abs/2409.15867
Authors: Moucheng Xu, Evangelos Chatzaroulas, Luc McCutcheon, Abdul Ahad, Hamzah Azeem, Janusz Marecki, Ammar Anwar
Keywords: Standard Operating Procedure, Operating Procedure, Standard Operating, business software workflow, software workflow based
Categories: Artificial Intelligence (cs.AI)
Comments: multimodal in-context ensemble learning; video-language models; SOP generation; pseudo-labels

Abstract:A Standard Operating Procedure (SOP) defines a low-level, step-by-step written guide for a business software workflow based on a video demonstration. SOPs are a crucial step toward automating end-to-end software workflows. Manually creating SOPs can be time-consuming. Recent advancements in large video-language models offer the potential for automating SOP generation by analyzing recordings of human demonstrations. However, current large video-language models face challenges with zero-shot SOP generation. We explore in-context learning with video-language models for SOP generation. We report that in-context learning sometimes helps video-language models at SOP generation. We then propose an in-context ensemble learning to further enhance the capabilities of the models in SOP generation.

[AI-55] BeSimulator: A Large Language Model Powered Text-based Behavior Simulator

Link: https://arxiv.org/abs/2409.15865
Authors: Jianan Wang, Bin Li, Xueying Wang, Fu Li, Yunlong Wu, Juan Chen, Xiaodong Yi
Keywords: Traditional robot simulators, high computational costs, physical process modeling, robot simulators focus, Traditional robot
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, 2 tables

Abstract:Traditional robot simulators focus on physical process modeling and realistic rendering, often suffering from high computational costs, inefficiencies, and limited adaptability. To handle this issue, we propose Behavior Simulation in robotics to emphasize checking the behavior logic of robots and achieving sufficient alignment between the outcome of robot actions and real scenarios. In this paper, we introduce BeSimulator, a modular and novel LLM-powered framework, as an attempt towards behavior simulation in the context of text-based environments. By constructing text-based virtual environments and performing semantic-level simulation, BeSimulator can generalize across scenarios and achieve long-horizon complex simulation. Inspired by human cognition processes, it employs a “consider-decide-capture-transfer” methodology, termed Chain of Behavior Simulation, which excels at analyzing action feasibility and state transitions. Additionally, BeSimulator incorporates code-driven reasoning to enable arithmetic operations and enhance reliability, and integrates reflective feedback to refine simulation. Based on our manually constructed behavior-tree-based simulation benchmark BTSIMBENCH, our experiments show a significant performance improvement in behavior simulation compared to baselines, ranging from 14.7% to 26.6%.

[AI-56] A Zero-Shot Open-Vocabulary Pipeline for Dialogue Understanding

Link: https://arxiv.org/abs/2409.15861
Authors: Abdulfattah Safa, Gözde Gül Şahin
Keywords: Dialogue State Tracking, State Tracking, Dialogue State, appropriate system actions, task-oriented dialogues
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Dialogue State Tracking (DST) is crucial for understanding user needs and executing appropriate system actions in task-oriented dialogues. The majority of existing DST methods are designed to work within predefined ontologies and assume the availability of gold domain labels, struggling to adapt to new slot values. While Large Language Model (LLM)-based systems show promising zero-shot DST performance, they either require extensive computational resources or underperform existing fully-trained systems, limiting their practicality. To address these limitations, we propose a zero-shot, open-vocabulary system that integrates domain classification and DST in a single pipeline. Our approach includes reformulating DST as a question-answering task for less capable models and employing self-refining prompts for more adaptable ones. Our system does not rely on fixed slot values defined in the ontology, allowing the system to adapt dynamically. We compare our approach with existing SOTA, and show that it provides up to 20% better Joint Goal Accuracy (JGA) over previous methods on datasets like MultiWOZ 2.1, with up to 90% fewer requests to the LLM API.

[AI-57] Identification For Control Based on Neural Networks: Approximately Linearizable Models

Link: https://arxiv.org/abs/2409.15858
Authors: Maxime Thieffry, Alexandre Hache, Mohamed Yagoubi, Philippe Chevrel
Keywords: control-oriented identification scheme, work presents, presents a control-oriented, scheme for efficient, efficient control design
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures, 6 tables, accepted as a poster in SysDO 2024, Stuttgart, Germany

Abstract:This work presents a control-oriented identification scheme for efficient control design and stability analysis of nonlinear systems. Neural networks are used to identify a discrete-time nonlinear state-space model to approximate the time-domain input-output behavior of a nonlinear system. The network is constructed such that the identified model is approximately linearizable by feedback, ensuring that the control law trivially follows from the learning stage. After the identification and quasi-linearization procedures, linear control theory can be used to design robust controllers and study the stability of the closed-loop system. The effectiveness and interest of the methodology are illustrated throughout the paper on popular benchmarks for system identification.

[AI-58] From Passive Watching to Active Learning: Empowering Proactive Participation in Digital Classrooms with AI Video Assistant

Link: https://arxiv.org/abs/2409.15843
Authors: Anna Bodonhelyi, Enkeleda Thaqi, Süleyman Özdel, Efe Bozkir, Enkelejda Kasneci
Keywords: crucial for enhancing, SAM, enhancing learning outcomes, online education, innovative tools
Categories: Artificial Intelligence (cs.AI)

Abstract:In online education, innovative tools are crucial for enhancing learning outcomes. SAM (Study with AI Mentor) is an advanced platform that integrates educational videos with a context-aware chat interface powered by large language models. SAM encourages students to ask questions and explore unclear concepts in real-time, offering personalized, context-specific assistance, including explanations of formulas, slides, and images. In a crowdsourced user study involving 140 participants, SAM was evaluated through pre- and post-knowledge tests, comparing a group using SAM with a control group. The results demonstrated that SAM users achieved greater knowledge gains, with a 96.8% answer accuracy. Participants also provided positive feedback on SAM’s usability and effectiveness. SAM’s proactive approach to learning not only enhances learning outcomes but also empowers students to take full ownership of their educational experience, representing a promising future direction for online learning tools.

[AI-59] Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

Link: https://arxiv.org/abs/2409.15825
Authors: Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Keywords: Large language models, encode extensive world, Large language, extensive world knowledge, encode extensive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.

[AI-60] SwiftDossier: Tailored Automatic Dossier for Drug Discovery with LLMs and Agents

Link: https://arxiv.org/abs/2409.15817
Authors: Gabriele Fossi, Youssef Boulaimena, Leila Outemzabeta, Nathalie Jeanraya, Stephane Gerarta, Sebastien Vachenca, Joanna Giemzaa, Salvatore Raieli
Keywords: artificial intelligence algorithms, Large Language Models, including Large Language, artificial intelligence, Artificial intelligence systems
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, 2 tables

Abstract:The advancement of artificial intelligence algorithms has expanded their application to several fields such as the biomedical domain. Artificial intelligence systems, including Large Language Models (LLMs), can be particularly advantageous in drug discovery, which is a very long and expensive process. However, LLMs by themselves lack in-depth knowledge about specific domains and can generate factually incorrect information. Moreover, they are not able to perform more complex actions that imply the usage of external tools. Our work is focused on these two issues. Firstly, we show how the implementation of an advanced RAG system can help the LLM to generate more accurate answers to drug-discovery-related questions. The results show that the answers generated by the LLM with the RAG system surpass in quality the answers produced by the model without RAG. Secondly, we show how to create an automatic target dossier using LLMs and incorporating them with external tools that they can use to execute more intricate tasks to gather data such as accessing databases and executing code. The result is a production-ready target dossier containing the acquired information summarized into a PDF and a PowerPoint presentation.

[AI-61] AsthmaBot: Multi-modal Multi-Lingual Retrieval Augmented Generation For Asthma Patient Support

Link: https://arxiv.org/abs/2409.15815
Authors: Adil Bahaj, Mounir Ghogho
Keywords: Chat Generative Pre-trained, Generative Pre-trained Transformer, risen globally, driven by environmental, lifestyle factors
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Asthma rates have risen globally, driven by environmental and lifestyle factors. Access to immediate medical care is limited, particularly in developing countries, necessitating automated support systems. Large Language Models like ChatGPT (Chat Generative Pre-trained Transformer) and Gemini have advanced natural language processing in general and question answering in particular; however, they are prone to producing factually incorrect responses (i.e. hallucinations). Retrieval-augmented generation systems, integrating curated documents, can improve large language models’ performance and reduce the incidence of hallucination. We introduce AsthmaBot, a multi-lingual, multi-modal retrieval-augmented generation system for asthma support. Evaluation on an asthma-related frequently asked questions dataset shows AsthmaBot’s efficacy. AsthmaBot has an added interactive and intuitive interface that integrates different data modalities (text, images, videos) to make it accessible to the larger public. AsthmaBot is available online via this http URL.

[AI-62] Interactive Example-based Explanations to Improve Health Professionals Onboarding with AI for Human-AI Collaborative Decision Making

Link: https://arxiv.org/abs/2409.15814
Authors: Min Hun Lee, Renee Bao Xuan Ng, Silvana Xinyi Choo, Shamala Thilarajah
Keywords: growing research explores, interactive example-based explanations, example-based explanations, interactive example-based, growing research
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:A growing body of research explores the use of AI explanations in users’ decision phases for human-AI collaborative decision-making. However, previous studies found issues of overreliance on ‘wrong’ AI outputs. In this paper, we propose interactive example-based explanations to improve health professionals' onboarding with AI so that they rely on AI more appropriately during AI-assisted decision-making. We implemented an AI-based decision support system that utilizes a neural network to assess the quality of post-stroke survivors' exercises, together with interactive example-based explanations that systematically surface the nearest neighbors of a test/task sample from the training set of the AI model to assist users' onboarding with the AI model. To investigate the effect of interactive example-based explanations, we conducted a study with domain experts (health professionals) to evaluate their performance and reliance on AI. Our interactive example-based explanations during onboarding helped health professionals rely on AI more appropriately and make a higher ratio of ‘right’ decisions and a lower ratio of ‘wrong’ decisions than providing only feature-based explanations during the decision-support phase. Our study discusses new challenges of assisting users’ onboarding with AI for human-AI collaborative decision-making.
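
The example-based explanations described above surface training samples that lie close to a test sample. A minimal sketch of that retrieval step follows, assuming plain Euclidean distance over feature vectors. The function name `nearest_examples`, the two-dimensional features, and the labels are all hypothetical and are not taken from the paper.

```python
import math


def nearest_examples(query, training_set, k=3):
    """Return the k training samples closest to `query` (Euclidean distance).

    `training_set` is a list of (feature_vector, label) pairs.
    Each result is a (distance, feature_vector, label) tuple.
    """
    scored = []
    for features, label in training_set:
        scored.append((math.dist(query, features), features, label))
    # Sort by distance only, so labels and features are never compared.
    scored.sort(key=lambda item: item[0])
    return scored[:k]


# Hypothetical 2-D feature vectors with exercise-quality labels.
training_set = [
    ((0.9, 0.8), "good"),
    ((0.2, 0.1), "poor"),
    ((0.85, 0.75), "good"),
    ((0.3, 0.2), "poor"),
]
neighbors = nearest_examples((0.8, 0.8), training_set, k=2)
for distance, features, label in neighbors:
    print(f"{features} -> {label} (distance {distance:.3f})")
```

Showing the retrieved samples next to the model's prediction lets a user judge whether the prediction resembles cases the model has actually seen.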

[AI-63] Layer-wise Model Merging for Unsupervised Domain Adaptation in Segmentation Tasks

Link: https://arxiv.org/abs/2409.15813
Authors: Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Jose M Martínez
Keywords: creation and inference, strategy to enhance, prior work, work is limited, ensemble creation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Abstract:Merging parameters of multiple models has resurfaced as an effective strategy to enhance task performance and robustness, but prior work is limited by the high costs of ensemble creation and inference. In this paper, we leverage the abundance of freely accessible trained models to introduce a cost-free approach to model merging. It focuses on a layer-wise integration of merged models, aiming to maintain the distinctiveness of the task-specific final layers while unifying the initial layers, which are primarily associated with feature extraction. This approach ensures parameter consistency across all layers, essential for boosting performance. Moreover, it facilitates seamless integration of knowledge, enabling effective merging of models from different datasets and tasks. Specifically, we investigate its applicability in Unsupervised Domain Adaptation (UDA), an unexplored area for model merging, for Semantic and Panoptic Segmentation. Experimental results demonstrate substantial UDA improvements without additional costs for merging same-architecture models from distinct datasets (+2.6% mIoU) and different-architecture models with a shared backbone (+6.8% mIoU). Furthermore, merging Semantic and Panoptic Segmentation models increases mPQ by 7%. These findings are validated across a wide variety of UDA strategies, architectures, and datasets.
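
The layer-wise idea above (unify the early feature-extraction layers, keep the task-specific final layers distinct) can be illustrated with a toy sketch. It assumes simple parameter averaging for the shared layers, which is a stand-in and not necessarily the paper's merging rule. The layer names and the `backbone.` prefix convention are hypothetical.

```python
def merge_layerwise(model_a, model_b, shared_prefixes=("backbone.",)):
    """Merge two models stored as {layer_name: list_of_floats}.

    Layers whose names start with one of `shared_prefixes` are averaged
    element-wise; every other layer (e.g. a task head) is copied from
    model_a unchanged.
    """
    merged = {}
    for name, weights_a in model_a.items():
        if name.startswith(shared_prefixes) and name in model_b:
            weights_b = model_b[name]
            merged[name] = [(a + b) / 2 for a, b in zip(weights_a, weights_b)]
        else:
            merged[name] = list(weights_a)
    return merged


# Toy "models" with a shared backbone layer and different task heads.
model_a = {"backbone.conv1": [1.0, 2.0], "head.cls": [0.5, 0.5]}
model_b = {"backbone.conv1": [3.0, 4.0], "head.cls": [9.0, 9.0]}
merged = merge_layerwise(model_a, model_b)
print(merged)
```

In a real framework the values would be tensors in a state dictionary, but the name-based split between shared and task-specific layers would look much the same.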

[AI-64] CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation

Link: https://arxiv.org/abs/2409.15806
Authors: Fuxian Huang, Qi Zhang, Shaopeng Zhai, Jie Wang, Tianyi Zhang, Haoran Zhang, Ming Zhou, Yu Liu, Yu Qiao
Keywords: important research area, multimodal large language, artificial intelligence, research area, important research
Categories: Artificial Intelligence (cs.AI)

Abstract:With the rapid development of artificial intelligence, multimodal learning has become an important research area. For intelligent agents, the state is a crucial modality to convey precise information alongside common modalities like images, videos, and language. This becomes especially clear with the broad adoption of reinforcement learning and multimodal large language models. Nevertheless, the representation of state modality still lags in development. To this end, we propose a High-Fidelity Contrastive Language-State Pre-training (CLSP) method, which can accurately encode state information into general representations for both reinforcement learning and multimodal large language models. Specifically, we first design a pre-training task based on the classification to train an encoder with coarse-grained information. Next, we construct data pairs of states and language descriptions, utilizing the pre-trained encoder to initialize the CLSP encoder. Then, we deploy contrastive learning to train the CLSP encoder to effectively represent precise state information. Additionally, we enhance the representation of numerical information using the Random Fourier Features (RFF) method for high-fidelity mapping. Extensive experiments demonstrate the superior precision and generalization capabilities of our representation, achieving outstanding results in text-state retrieval, reinforcement learning navigation tasks, and multimodal large language model understanding.
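
The abstract mentions Random Fourier Features (RFF) for encoding numerical state information. Below is a minimal sketch of the standard RFF construction for approximating a Gaussian kernel, using only the standard library. How CLSP actually wires RFF into its encoder is not stated here, so the function `make_rff`, the feature count, and the `gamma` value are illustrative assumptions.

```python
import math
import random


def make_rff(input_dim, num_features, gamma, seed=0):
    """Build an RFF encoder approximating k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma)
    weights = [
        [rng.gauss(0.0, std) for _ in range(input_dim)]
        for _ in range(num_features)
    ]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def encode(x):
        # One cosine feature per random frequency and phase offset.
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return encode


encode = make_rff(input_dim=1, num_features=2000, gamma=0.5)
z_a, z_b = encode([1.0]), encode([1.5])
approx = sum(p * q for p, q in zip(z_a, z_b))   # inner product of features
exact = math.exp(-0.5 * (1.0 - 1.5) ** 2)       # true kernel value
print(f"approx={approx:.3f} exact={exact:.3f}")
```

The inner product of two encodings approximates the kernel value, so nearby scalars map to similar vectors while small numerical differences still remain distinguishable.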

[AI-65] Towards Universal Large-Scale Foundational Model for Natural Gas Demand Forecasting

Link: https://arxiv.org/abs/2409.15794
Authors: Xinxing Zhou, Jiaqi Ye, Shubao Zhao, Ming Jin, Zhaoxiang Hou, Chengyi Yang, Zengxiang Li, Yanlong Wen, Xiaojie Yuan
Keywords: global energy strategy, ensuring efficient resource, efficient resource allocation, gas demand forecasting, natural gas demand
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In the context of global energy strategy, accurate natural gas demand forecasting is crucial for ensuring efficient resource allocation and operational planning. Traditional forecasting methods struggle to cope with the growing complexity and variability of gas consumption patterns across diverse industries and commercial sectors. To address these challenges, we propose the first foundation model specifically tailored for natural gas demand forecasting. Foundation models, known for their ability to generalize across tasks and datasets, offer a robust solution to the limitations of traditional methods, such as the need for separate models for different customer segments and their limited generalization capabilities. Our approach leverages contrastive learning to improve prediction accuracy in real-world scenarios, particularly by tackling issues such as noise in historical consumption data and the potential misclassification of similar data samples, which can degrade the quality of the representation and thus the accuracy of downstream forecasting tasks. By integrating advanced noise filtering techniques within the contrastive learning framework, our model enhances the quality of learned representations, leading to more accurate predictions. Furthermore, the model undergoes industry-specific fine-tuning during pretraining, enabling it to better capture the unique characteristics of gas consumption across various sectors. We conducted extensive experiments using a large-scale dataset from ENN Group, which includes data from over 10,000 industrial, commercial, and welfare-related customers across multiple regions. Our model outperformed existing state-of-the-art methods, demonstrating a relative improvement in MSE by 3.68% and in MASE by 6.15% compared to the best available model.
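
The reported gains are relative improvements in MSE and MASE. As a reminder of what these metrics measure, here is a small sketch using the usual definitions: MASE divides the forecast's mean absolute error by the in-sample error of a one-step naive forecast. The numbers are made up for illustration and are unrelated to the paper's data.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mase(actual, predicted, train):
    """Mean absolute scaled error, scaled by the naive one-step forecast on `train`."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive_mae = sum(
        abs(train[t] - train[t - 1]) for t in range(1, len(train))
    ) / (len(train) - 1)
    return mae / naive_mae


def relative_improvement(baseline, candidate):
    """Percentage by which `candidate` lowers an error metric versus `baseline`."""
    return (baseline - candidate) / baseline * 100.0


# Made-up demand series, purely for illustration.
train = [10.0, 12.0, 11.0, 13.0]
actual = [14.0, 15.0]
predicted = [13.0, 16.0]

print(mse(actual, predicted))
print(mase(actual, predicted, train))
print(relative_improvement(2.0, 1.5))
```

A MASE below 1 means the forecast beats the naive "repeat the last value" baseline on average.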

[AI-66] Small Language Models: Survey Measurements and Insights

Link: https://arxiv.org/abs/2409.15790
Authors: Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, Mengwei Xu
Keywords: modern smart devices, academic attention compared, Small language models, large language model, Small language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.

[AI-67] Spatial-Temporal Mixture-of-Graph-Experts for Multi-Type Crime Prediction

Link: https://arxiv.org/abs/2409.15764
Authors: Ziyang Wu, Fan Liu, Jindong Han, Yuxuan Liang, Hao Liu
Keywords: effective prevention measures, threaten public safety, multiple types, economic development, predicting the occurrence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As various types of crime continue to threaten public safety and economic development, predicting the occurrence of multiple types of crimes becomes increasingly vital for effective prevention measures. Although extensive efforts have been made, most existing approaches overlook the heterogeneity of different crime categories and fail to address the issue of imbalanced spatial distribution. In this work, we propose a Spatial-Temporal Mixture-of-Graph-Experts (ST-MoGE) framework for collective multiple-type crime prediction. To enhance the model’s ability to identify diverse spatial-temporal dependencies and mitigate potential conflicts caused by spatial-temporal heterogeneity of different crime categories, we introduce an attentive-gated Mixture-of-Graph-Experts (MGEs) module to capture the distinctive and shared crime patterns of each crime category. Then, we propose Cross-Expert Contrastive Learning (CECL) to update the MGEs and force each expert to focus on specific pattern modeling, thereby reducing blending and redundancy. Furthermore, to address the issue of imbalanced spatial distribution, we propose a Hierarchical Adaptive Loss Re-weighting (HALR) approach to eliminate biases and insufficient learning of data-scarce regions. To evaluate the effectiveness of our methods, we conduct comprehensive experiments on two real-world crime datasets and compare our results with twelve advanced baselines. The experimental results demonstrate the superiority of our methods.

[AI-68] IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios

Link: https://arxiv.org/abs/2409.15763
Authors: Hai Lin, Shaoxiong Zhan, Junyou Su, Haitao Zheng, Hui Wang
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, Language Models, RAG tasks
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In Retrieval-Augmented Generation (RAG) tasks using Large Language Models (LLMs), the quality of retrieved information is critical to the final output. This paper introduces the IRSC benchmark for evaluating the performance of embedding models in multilingual RAG tasks. The benchmark encompasses five retrieval tasks: query retrieval, title retrieval, part-of-paragraph retrieval, keyword retrieval, and summary retrieval. Our research addresses the current lack of comprehensive testing and effective comparison methods for embedding models in RAG scenarios. We introduced new metrics: the Similarity of Semantic Comprehension Index (SSCI) and the Retrieval Capability Contest Index (RCCI), and evaluated models such as Snowflake-Arctic, BGE, GTE, and M3E. Our contributions include: 1) the IRSC benchmark, 2) the SSCI and RCCI metrics, and 3) insights into the cross-lingual limitations of embedding models. The IRSC benchmark aims to enhance the understanding and development of accurate retrieval systems in RAG tasks. All code and datasets are available at: this https URL_Benchmark

[AI-69] TFG: Unified Training-Free Guidance for Diffusion Models

Link: https://arxiv.org/abs/2409.15761
Authors: Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, Stefano Ermon
Keywords: desirable target properties, property of interest, additional training, generate samples, samples with desirable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Given an unconditional diffusion model and a predictor for a target property of interest (e.g., a classifier), the goal of training-free guidance is to generate samples with desirable target properties without additional training. Existing methods, though effective in various individual applications, often lack theoretical grounding and rigorous testing on extensive benchmarks. As a result, they could even fail on simple tasks, and applying them to a new problem becomes unavoidably difficult. This paper introduces a novel algorithmic framework encompassing existing methods as special cases, unifying the study of training-free guidance into the analysis of an algorithm-agnostic design space. Via theoretical and empirical investigation, we propose an efficient and effective hyper-parameter searching strategy that can be readily applied to any downstream task. We systematically benchmark across 7 diffusion models on 16 tasks with 40 targets, and improve performance by 8.5% on average. Our framework and benchmark offer a solid foundation for conditional generation in a training-free manner.

[AI-70] Stage-Wise Reward Shaping for Acrobatic Robots: A Constrained Multi-Objective Reinforcement Learning Approach

Link: https://arxiv.org/abs/2409.15755
Authors: Dohyeong Kim, Hyeokjin Kwon, Junseok Kim, Gunmin Lee, Songhwai Oh
Keywords: reinforcement learning, highly complicated, addressed through reinforcement, Abstract, increases
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:As the complexity of tasks addressed through reinforcement learning (RL) increases, the definition of reward functions also has become highly complicated. We introduce an RL method aimed at simplifying the reward-shaping process through intuitive strategies. Initially, instead of a single reward function composed of various terms, we define multiple reward and cost functions within a constrained multi-objective RL (CMORL) framework. For tasks involving sequential complex movements, we segment the task into distinct stages and define multiple rewards and costs for each stage. Finally, we introduce a practical CMORL algorithm that maximizes objectives based on these rewards while satisfying constraints defined by the costs. The proposed method has been successfully demonstrated across a variety of acrobatic tasks in both simulation and real-world environments. Additionally, it has been shown to successfully perform tasks compared to existing RL and constrained RL algorithms. Our code is available at this https URL.

[AI-71] Development and Validation of Heparin Dosing Policies Using an Offline Reinforcement Learning Algorithm

Link: https://arxiv.org/abs/2409.15753
Authors: Yooseok Lim, Inbeom Park, Sujee Lee
Keywords: intensive care unit, patient survival, care unit, Intensive Care III, ICU
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Appropriate medication dosages in the intensive care unit (ICU) are critical for patient survival. Heparin, used to treat thrombosis and inhibit blood clotting in the ICU, requires careful administration due to its complexity and sensitivity to various factors, including patient clinical characteristics, underlying medical conditions, and potential drug interactions. Incorrect dosing can lead to severe complications such as strokes or excessive bleeding. To address these challenges, this study proposes a reinforcement learning (RL)-based personalized optimal heparin dosing policy that guides dosing decisions reliably within the therapeutic range based on individual patient conditions. A batch-constrained policy was implemented to minimize out-of-distribution errors in an offline RL environment and effectively integrate RL with existing clinician policies. The policy’s effectiveness was evaluated using weighted importance sampling, an off-policy evaluation method, and the relationship between state representations and Q-values was explored using t-SNE. Both quantitative and qualitative analyses were conducted using the Medical Information Mart for Intensive Care III (MIMIC-III) database, demonstrating the efficacy of the proposed RL-based medication policy. Leveraging advanced machine learning techniques and extensive clinical data, this research enhances heparin administration practices and establishes a precedent for the development of sophisticated decision-support tools in medicine.
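
The abstract evaluates the learned policy with weighted importance sampling (WIS), an off-policy evaluation method. A minimal sketch of the standard trajectory-level WIS estimator follows. The trajectory format and the toy probabilities are assumptions for illustration and do not come from MIMIC-III or from the paper's code.

```python
def weighted_importance_sampling(trajectories, gamma=1.0):
    """Estimate the target policy's value from behavior-policy trajectories.

    Each trajectory is a list of (pi_target, pi_behavior, reward) steps, where
    pi_target and pi_behavior are the probabilities each policy assigns to the
    action that was actually taken.
    """
    weights, returns = [], []
    for steps in trajectories:
        ratio, ret, discount = 1.0, 0.0, 1.0
        for pi_target, pi_behavior, reward in steps:
            ratio *= pi_target / pi_behavior   # cumulative importance ratio
            ret += discount * reward           # discounted return
            discount *= gamma
        weights.append(ratio)
        returns.append(ret)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all importance weights are zero")
    # Normalizing by the sum of weights is what makes this estimator "weighted".
    return sum(w * g for w, g in zip(weights, returns)) / total


# Two toy single-step trajectories (hypothetical numbers).
trajectories = [
    [(0.8, 0.5, 1.0)],  # target policy favors this action; reward 1
    [(0.2, 0.5, 0.0)],  # target policy disfavors this action; reward 0
]
estimate = weighted_importance_sampling(trajectories)
print(estimate)
```

Compared with ordinary importance sampling, normalizing by the weight sum trades a small bias for much lower variance, which matters when clinician and learned policies differ substantially.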

[AI-72] The Roles of Generative Artificial Intelligence in Internet of Electric Vehicles

Link: https://arxiv.org/abs/2409.15750
Authors: Hanwen Zhang, Dusit Niyato, Wei Zhang, Changyuan Zhao, Hongyang Du, Abbas Jamalipour, Sumei Sun, Yiyang Pei
Keywords: generative artificial intelligence, artificial intelligence, significant enhancement, leading to widespread, generation and forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 25 pages

Abstract:With the advancement of generative artificial intelligence (GenAI) models, their capability to generate content is seeing significant enhancement, leading to widespread applications in the field of data generation and forecasting. Furthermore, GenAI has strong capabilities in data modeling and analysis, which enhances Internet of electric vehicles (IoEV) applications in various aspects. In this paper, we investigate and survey applications of GenAI in the IoEV. Specifically, we categorize GenAI for IoEV into four layers, namely the EV battery layer, the individual electric vehicle (EV) layer, the smart grid with EV layer, and the security layer. We first introduce various GenAI techniques used in each layer of IoEV applications. Subsequently, public datasets available for training the GenAI models are summarized. Finally, we provide recommendations for future directions. This survey not only categorizes the applications of GenAI in IoEV across different layers but also serves as a valuable resource for researchers and practitioners by highlighting the design and implementation challenges within each layer. Furthermore, it provides a roadmap for future research directions, enabling the development of more robust and efficient IoEV systems through the integration of advanced GenAI techniques.

[AI-73] Automated Assessment of Multimodal Answer Sheets in the STEM domain

Link: https://arxiv.org/abs/2409.15749
Authors: Rajlaxmi Patil, Aditya Ashutosh Kulkarni, Ruturaj Ghatage, Sharvi Endait, Geetanjali Kale, Raviraj Joshi
Keywords: reshaping traditional learning paradigms, domain encompassing Science, STEM domain encompassing, transformative era, reshaping traditional learning
Categories: Artificial Intelligence (cs.AI)

Abstract:In the domain of education, the integration of technology has led to a transformative era, reshaping traditional learning paradigms. Central to this evolution is the automation of grading processes, particularly within the STEM domain encompassing Science, Technology, Engineering, and Mathematics. While efforts to automate grading have been made in subjects like Literature, the multifaceted nature of STEM assessments presents unique challenges, ranging from quantitative analysis to the interpretation of handwritten diagrams. To address these challenges, this research endeavors to develop efficient and reliable grading methods through the implementation of automated assessment techniques using Artificial Intelligence (AI). Our contributions lie in two key areas: firstly, the development of a robust system for evaluating textual answers in STEM, leveraging sample answers for precise comparison and grading, enabled by advanced algorithms and natural language processing techniques. Secondly, a focus on enhancing diagram evaluation, particularly flowcharts, within the STEM context, by transforming diagrams into textual representations for nuanced assessment using a Large Language Model (LLM). By bridging the gap between visual representation and semantic meaning, our approach ensures accurate evaluation while minimizing manual intervention. Through the integration of models such as CRAFT for text extraction and YoloV5 for object detection, coupled with LLMs like Mistral-7B for textual evaluation, our methodology facilitates comprehensive assessment of multimodal answer sheets. This paper provides a detailed account of our methodology, challenges encountered, results, and implications, emphasizing the potential of AI-driven approaches in revolutionizing grading practices in STEM education.

[AI-74] Training Neural Networks for Modularity aids Interpretability

Link: https://arxiv.org/abs/2409.15747
Authors: Satvik Golechha, Dylan Cope, Nandi Schoots
Keywords: studied independently, improve network interpretability, Abstract, improve network, clusters
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, preprint

Abstract:An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.

[AI-75] Real-Time Pedestrian Detection on IoT Edge Devices: A Lightweight Deep Learning Approach

Link: https://arxiv.org/abs/2409.15740
Authors: Muhammad Dany Alfikri, Rafael Kaliski
Keywords: everyday lives, Artificial intelligence, Computer vision, Edge servers, Edge
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments: 10 pages, 3 tables, 12 figures, article submitted to IEEE for possible publication

Abstract:Artificial intelligence (AI) has become integral to our everyday lives. Computer vision has advanced to the point where it can play the safety critical role of detecting pedestrians at road intersections in intelligent transportation systems and alert vehicular traffic as to potential collisions. Centralized computing analyzes camera feeds and generates alerts for nearby vehicles. However, real-time applications face challenges such as latency, limited data transfer speeds, and the risk of life loss. Edge servers offer a potential solution for real-time applications, providing localized computing and storage resources and lower response times. Unfortunately, edge servers have limited processing power. Lightweight deep learning (DL) techniques enable edge servers to utilize compressed deep neural network (DNN) models. The research explores implementing a lightweight DL model on Artificial Intelligence of Things (AIoT) edge devices. An optimized You Only Look Once (YOLO) based DL model is deployed for real-time pedestrian detection, with detection events transmitted to the edge server using the Message Queuing Telemetry Transport (MQTT) protocol. The simulation results demonstrate that the optimized YOLO model can achieve real-time pedestrian detection, with a fast inference speed of 147 milliseconds, a frame rate of 2.3 frames per second, and an accuracy of 78%, representing significant improvements over baseline models.

[AI-76] EvoFA: Evolvable Fast Adaptation for EEG Emotion Recognition

Link: https://arxiv.org/abs/2409.15733
Authors: Ming Jin, Danni Zhang, Gangming Zhao, Changde Du, Jinpeng Li
Keywords: gained significant traction, significant traction due, accuracy and objectivity, traction due, EEG signals leads
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Electroencephalography (EEG)-based emotion recognition has gained significant traction due to its accuracy and objectivity. However, the non-stationary nature of EEG signals leads to distribution drift over time, causing severe performance degradation when the model is reused. While numerous domain adaptation (DA) approaches have been proposed in recent years to address this issue, their reliance on large amounts of target data for calibration restricts them to offline scenarios, rendering them unsuitable for real-time applications. To address this challenge, this paper proposes Evolvable Fast Adaptation (EvoFA), an online adaptive framework tailored for EEG data. EvoFA organically integrates the rapid adaptation of Few-Shot Learning (FSL) and the distribution matching of Domain Adaptation (DA) through a two-stage generalization process. During the training phase, a robust base meta-learning model is constructed for strong generalization. In the testing phase, a designed evolvable meta-adaptation module iteratively aligns the marginal distribution of target (testing) data with the evolving source (training) data within a model-agnostic meta-learning framework, enabling the model to learn the evolving trends of testing data relative to training data and improving online testing performance. Experimental results demonstrate that EvoFA achieves significant improvements compared to the basic FSL method and previous online methods. The introduction of EvoFA paves the way for broader adoption of EEG-based emotion recognition in real-world applications. Our code will be released upon publication.

[AI-77] Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving

Link: https://arxiv.org/abs/2409.15730
Authors: Lingyu Xiao, Jiang-Jiang Liu, Sen Yang, Xiaofan Li, Xiaoqing Ye, Wankou Yang, Jingdong Wang
Keywords: exhibits robust generalization, robust generalization capabilities, vectorized scene understanding, autoregressive world model, model exhibits robust
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:The autoregressive world model exhibits robust generalization capabilities in vectorized scene understanding but encounters difficulties in deriving actions due to insufficient uncertainty modeling and self-delusion. In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses. We propose LatentDriver, a framework that models the environment’s next states and the ego vehicle’s possible actions as a mixture distribution, from which a deterministic control signal is then derived. By incorporating mixture modeling, the stochastic nature of decision-making is captured. Additionally, the self-delusion problem is mitigated by providing intermediate actions sampled from a distribution to the world model. Experimental results on the recently released closed-loop benchmark Waymax demonstrate that LatentDriver surpasses state-of-the-art reinforcement learning and imitation learning methods, achieving expert-level performance. The code and models will be made available at this https URL.

[AI-78] Sequential Learning in the Dense Associative Memory

Link: https://arxiv.org/abs/2409.15729
Authors: Hayden McAlister, Anthony Robins, Lech Szymanski
Keywords: Dense Associative Memory, associative memory, Dense Associative, Sequential learning, Hopfield network
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

Abstract:Sequential learning involves learning tasks in a sequence, and proves challenging for most neural networks. Biological neural networks regularly conquer the sequential learning challenge and are even capable of transferring knowledge both forward and backwards between tasks. Artificial neural networks often totally fail to transfer performance between tasks, and regularly suffer from degraded performance or catastrophic forgetting on previous tasks. Models of associative memory have been used to investigate the discrepancy between biological and artificial neural networks due to their biological ties and inspirations, of which the Hopfield network is perhaps the most studied model. The Dense Associative Memory, or modern Hopfield network, generalizes the Hopfield network, allowing for greater capacities and prototype learning behaviors, while still retaining the associative memory structure. We investigate the performance of the Dense Associative Memory in sequential learning problems, and benchmark various sequential learning techniques in the network. We give a substantial review of the sequential learning space with particular respect to the Hopfield network and associative memories, as well as describe the techniques we implement in detail. We also draw parallels between the classical and Dense Associative Memory in the context of sequential learning, and discuss the departures from biological inspiration that may influence the utility of the Dense Associative Memory as a tool for studying biological neural networks. We present our findings, and show that existing sequential learning methods can be applied to the Dense Associative Memory to improve sequential learning performance.
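
For readers unfamiliar with the Dense Associative Memory, the sketch below shows recall in a tiny network of that kind. It assumes the commonly used energy E(s) = -Σ F(ξ·s) over stored patterns ξ, with a rectified polynomial interaction F(x) = max(x, 0)^n, and updates one neuron at a time toward lower energy. The patterns, the degree `n=3`, and the function names are illustrative and are not the paper's experimental setup.

```python
def overlap(pattern, state):
    """Dot product between a stored pattern and the current state."""
    return sum(p * s for p, s in zip(pattern, state))


def rectified_poly(x, n=3):
    """Interaction function F(x) = max(x, 0) ** n."""
    return max(x, 0) ** n


def energy(patterns, state, n=3):
    """Dense Associative Memory energy: lower means closer to a stored pattern."""
    return -sum(rectified_poly(overlap(p, state), n) for p in patterns)


def recall(patterns, state, n=3, sweeps=5):
    """Asynchronously set each neuron to the sign that gives the lower energy."""
    state = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(len(state)):
            plus = state[:i] + [1] + state[i + 1:]
            minus = state[:i] + [-1] + state[i + 1:]
            new_value = 1 if energy(patterns, plus, n) <= energy(patterns, minus, n) else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:  # reached a fixed point
            break
    return state


# Two orthogonal stored patterns of length 8.
pattern_a = [1, 1, 1, 1, 1, 1, 1, 1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
patterns = [pattern_a, pattern_b]

corrupted = [-1, 1, 1, 1, 1, 1, 1, 1]  # pattern_a with the first bit flipped
restored = recall(patterns, corrupted)
print(restored == pattern_a)
```

Setting `n=1` makes the energy linear in the overlaps, while higher degrees sharpen the basins around stored patterns, which is the source of the larger capacity mentioned in the abstract.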

[AI-79] LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

Link: https://arxiv.org/abs/2409.15724
Authors: Maram Assi, Safwat Hassan, Ying Zou
Keywords: app market underscores, user, mobile app market, reviews, exponential growth
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 25 pages

Abstract:The exponential growth of the mobile app market underscores the importance of constant innovation and rapid response to user demands. As user satisfaction is paramount to the success of a mobile application (app), developers typically rely on user reviews, which represent user feedback that includes ratings and comments, to identify areas for improvement. However, the sheer volume of user reviews poses challenges in manual analysis, necessitating automated approaches. Existing automated approaches either analyze only the target app's reviews, neglecting the comparison of similar features to competitors, or fail to provide suggestions for feature enhancement. To address these gaps, we propose a Large Language Model (LLM)-based Competitive User Review Analysis for Feature Enhancement (LLM-Cure), an approach powered by LLMs to automatically generate suggestions for mobile app feature improvements. More specifically, LLM-Cure identifies and categorizes features within reviews by applying LLMs. When provided with a complaint in a user review, LLM-Cure curates highly rated (4 and 5 stars) reviews in competing apps related to the complaint and proposes potential improvements tailored to the target application. We evaluate LLM-Cure on 1,056,739 reviews of 70 popular Android apps. Our evaluation demonstrates that LLM-Cure significantly outperforms the state-of-the-art approaches in assigning features to reviews by up to 13% in F1-score, up to 16% in recall and up to 11% in precision. Additionally, LLM-Cure demonstrates its capability to provide suggestions for resolving user complaints. We verify the suggestions using the release notes that reflect the changes of features in the target mobile app. LLM-Cure achieves a promising average implementation rate of 73% for the provided suggestions.

[AI-80] Adversarial Federated Consensus Learning for Surface Defect Classification Under Data Heterogeneity in IIoT

Link: https://arxiv.org/abs/2409.15711
Authors: Jixuan Cui,Jun Li,Zhen Mei,Yiyang Ni,Wen Chen,Zengxiang Li
Keywords: Internet of Things, industrial surface defect, Industrial Internet, surface defect classification, industrial surface
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:The challenge of data scarcity hinders the application of deep learning in industrial surface defect classification (SDC), as it’s difficult to collect and centralize sufficient training data from various entities in Industrial Internet of Things (IIoT) due to privacy concerns. Federated learning (FL) provides a solution by enabling collaborative global model training across clients while maintaining privacy. However, performance may suffer due to data heterogeneity–discrepancies in data distributions among clients. In this paper, we propose a novel personalized FL (PFL) approach, named Adversarial Federated Consensus Learning (AFedCL), for the challenge of data heterogeneity across different clients in SDC. First, we develop a dynamic consensus construction strategy to mitigate the performance degradation caused by data heterogeneity. Through adversarial training, local models from different clients utilize the global model as a bridge to achieve distribution alignment, alleviating the problem of global knowledge forgetting. Complementing this strategy, we propose a consensus-aware aggregation mechanism. It assigns aggregation weights to different clients based on their efficacy in global knowledge learning, thereby enhancing the global model’s generalization capabilities. Finally, we design an adaptive feature fusion module to further enhance global knowledge utilization efficiency. Personalized fusion weights are gradually adjusted for each client to optimally balance global and local features, tailored to their individual global knowledge learning efficacy. Compared with state-of-the-art FL methods like FedALA, the proposed AFedCL method achieves an accuracy increase of up to 5.67% on three SDC datasets.

[AI-81] Autotuning Bipedal Locomotion MPC with GRFM-Net for Efficient Sim-to-Real Transfer

Link: https://arxiv.org/abs/2409.15710
Authors: Qianzhong Chen,Junheng Li,Sheng Cheng,Naira Hovakimyan,Quan Nguyen
Keywords: Bipedal locomotion control, human-centric environments, navigate complex, humanoid robots, Bipedal locomotion
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuning method that leverages differential programming for efficient parameter learning. A major difficulty lies in balancing model fidelity with differentiability. We address this difficulty using a low-fidelity model for differentiability, enhanced by a Ground Reaction Force-and-Moment Network (GRFM-Net) to capture discrepancies between MPC commands and actual control effects. We validate the parameters learned by DiffTune with GRFM-Net in hardware experiments, which demonstrate the parameters’ optimality in a multi-objective setting compared with baseline parameters, reducing the total loss by up to 40.5% compared with the expert-tuned parameters. The results confirm the GRFM-Net’s effectiveness in mitigating the sim-to-real gap, improving the transferability of simulation-learned parameters to real hardware.

[AI-82] Improving Emotional Support Delivery in Text-Based Community Safety Reporting Using Large Language Models

Link: https://arxiv.org/abs/2409.15706
Authors: Yiren Liu,Yerong Li,Ryan Mayfield,Yun Huang
Keywords: Emotional support, crucial aspect, aspect of communication, communication between community, community members
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Emotional support is a crucial aspect of communication between community members and police dispatchers during incident reporting. However, there is a lack of understanding about how emotional support is delivered through text-based systems, especially in various non-emergency contexts. In this study, we analyzed two years of chat logs comprising 57,114 messages across 8,239 incidents from 130 higher education institutions. Our empirical findings revealed significant variations in the emotional support provided by dispatchers, influenced by the type of incident and service time, as well as a noticeable decline in support over time across multiple organizations. To improve the consistency and quality of emotional support, we developed and implemented a fine-tuned Large Language Model (LLM), named dispatcherLLM. We evaluated dispatcherLLM by comparing its generated responses to those of human dispatchers and other off-the-shelf models using real chat messages. Additionally, we conducted a human evaluation to assess the perceived effectiveness of the support provided by dispatcherLLM. This study not only contributes new empirical understandings of emotional support in text-based dispatch systems but also demonstrates the significant potential of generative AI in improving service delivery.

[AI-83] Toward Mixture-of-Experts Enabled Trustworthy Semantic Communication for 6G Networks

Link: https://arxiv.org/abs/2409.15695
Authors: Jiayi He,Xiaofeng Luo,Jiawen Kang,Hongyang Du,Zehui Xiong,Ci Chen,Dusit Niyato,Xuemin Shen
Keywords: future efficient communication, plays a pivotal, offering a viable, pivotal role, viable solution
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 8 pages, 3 figures

Abstract:Semantic Communication (SemCom) plays a pivotal role in 6G networks, offering a viable solution for future efficient communication. Deep Learning (DL)-based semantic codecs further enhance this efficiency. However, the vulnerability of DL models to security threats, such as adversarial attacks, poses significant challenges for practical applications of SemCom systems. These vulnerabilities enable attackers to tamper with messages and eavesdrop on private information, especially in wireless communication scenarios. Although existing defenses attempt to address specific threats, they often fail to simultaneously handle multiple heterogeneous attacks. To overcome this limitation, we introduce a novel Mixture-of-Experts (MoE)-based SemCom system. This system comprises a gating network and multiple experts, each specializing in different security challenges. The gating network adaptively selects suitable experts to counter heterogeneous attacks based on user-defined security requirements. Multiple experts collaborate to accomplish semantic communication tasks while meeting the security requirements of users. A case study in vehicular networks demonstrates the efficacy of the MoE-based SemCom system. Simulation results show that the proposed MoE-based SemCom system effectively mitigates concurrent heterogeneous attacks, with minimal impact on downstream task accuracy.
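
The gating step described in the abstract above can be illustrated with a generic top-k mixture-of-experts router. This is only a minimal sketch under stated assumptions: the names `softmax` and `select_experts`, the top-k rule, and the example scores are hypothetical and are not taken from the paper, which does not specify its gating function here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_experts(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    Hypothetical helper for illustration only.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Example: three security experts, gate favors experts 1 and 2.
chosen = select_experts([0.0, 2.0, 1.0], k=2)
```

In such a design, the outputs of the chosen experts would then be combined using the returned weights; how the paper maps user-defined security requirements to gate scores is not modeled here.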

[AI-84] Safe Navigation for Robotic Digestive Endoscopy via Human Intervention-based Reinforcement Learning

Link: https://arxiv.org/abs/2409.15688
Authors: Min Tan,Yushun Tao,Boyun Zheng,GaoSheng Xie,Lijuan Feng,Zeyang Xia,Jing Xiong
Keywords: robotic digestive endoscopy, narrow digestive tract, automated robotic digestive, digestive endoscopy, Proximal Policy Optimization
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:With the increasing application of automated robotic digestive endoscopy (RDE), ensuring safe and efficient navigation in the unstructured and narrow digestive tract has become a critical challenge. Existing automated reinforcement learning navigation algorithms often result in potentially risky collisions due to the absence of essential human intervention, which significantly limits the safety and effectiveness of RDE in actual clinical practice. To address this limitation, we propose a Human Intervention (HI)-based Proximal Policy Optimization (PPO) framework, dubbed HI-PPO, which incorporates expert knowledge to enhance RDE’s safety. Specifically, we introduce an Enhanced Exploration Mechanism (EEM) to address the low exploration efficiency of the standard PPO. Additionally, a reward-penalty adjustment (RPA) is implemented to penalize unsafe actions during initial interventions. Furthermore, Behavior Cloning Similarity (BCS) is included as an auxiliary objective to ensure the agent emulates expert actions. Comparative experiments conducted in a simulated platform across various anatomical colon segments demonstrate that our model effectively and safely guides RDE.

[AI-85] A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Link: https://arxiv.org/abs/2409.15687
Authors: Abdelrahman Hanafi,Mohammed Saad,Noureldin Zahran,Radwa J. Hanafy,Mohammed E. Fouda
Keywords: Large language models, Large language, shown promise, disorder severity evaluation, binary disorder detection
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language models have shown promise in various domains, including healthcare. In this study, we conduct a comprehensive evaluation of LLMs in the context of mental health tasks using social media data. We explore the zero-shot (ZS) and few-shot (FS) capabilities of various LLMs, including GPT-4, Llama 3, Gemini, and others, on tasks such as binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Our evaluation involved 33 models tested with 9 main prompt templates across the tasks. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets. Moreover, prompt engineering played a crucial role in enhancing model performance. Notably, the Mixtral 8x22b model showed an improvement of over 20%, while Gemma 7b experienced a similar boost in performance. In the task of disorder severity evaluation, we observed that FS learning significantly improved the model’s accuracy, highlighting the importance of contextual examples in complex assessments. Notably, the Phi-3-mini model exhibited a substantial increase in performance, with balanced accuracy improving by over 6.80% and mean average error dropping by nearly 1.3 when moving from ZS to FS learning. In the psychiatric knowledge task, recent models generally outperformed older, larger counterparts, with the Llama 3.1 405b achieving an accuracy of 91.2%. Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering. Furthermore, the ethical guards imposed by many LLM providers hamper the ability to accurately evaluate their performance, due to their tendency to not respond to potentially sensitive queries.

[AI-86] Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint

Link: https://arxiv.org/abs/2409.15664
Authors: Dayeon Ki,Cheonbok Park,Hyunjoong Kim
Keywords: Accurately aligning contextual, parallel data mining, Accurately aligning, effective parallel data, aligning contextual representations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 16 figures

Abstract:Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage - a term we introduce to describe when a substantial amount of language-specific information is unintentionally leaked into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctively represent the meaning of the sentence. To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and language embeddings. ORACLE builds upon two components: intra-class clustering and inter-class separation. Through experiments on cross-lingual retrieval and semantic textual similarity tasks, we demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
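
To make the idea of enforcing orthogonality between semantic and language embeddings concrete, the sketch below computes a generic penalty: the mean squared cosine similarity between each sentence's two embeddings. It is an illustrative stand-in, not the ORACLE objective itself (which the abstract says is built from intra-class clustering and inter-class separation); the function names and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def orthogonality_penalty(semantic_embs, language_embs):
    """Mean squared cosine between paired semantic and language embeddings.

    The value is 0 when every pair is orthogonal and 1 when every pair is
    parallel. Hypothetical loss term for illustration only.
    """
    pairs = list(zip(semantic_embs, language_embs))
    return sum(cosine(s, l) ** 2 for s, l in pairs) / len(pairs)

# Toy example: the first pair is orthogonal, the second is fully aligned.
penalty = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]],
                                [[0.0, 1.0], [1.0, 0.0]])
```

A term like this would typically be added to the main training loss, so that minimizing it pushes language-specific information out of the semantic representation.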

[AI-87] Double-Path Adaptive-correlation Spatial-Temporal Inverted Transformer for Stock Time Series Forecasting

Link: https://arxiv.org/abs/2409.15662
Authors: Wenbo Yan,Ying Tan
Keywords: graph neural networks, achieved significant success, Spatial-temporal graph neural, series forecasting tasks, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Spatial-temporal graph neural networks (STGNNs) have achieved significant success in various time series forecasting tasks. However, due to the lack of explicit and fixed spatial relationships in stock prediction tasks, many STGNNs fail to perform effectively in this domain. While some STGNNs learn spatial relationships from time series, they often lack comprehensiveness. Research indicates that modeling time series using feature changes as tokens reveals entirely different information compared to using time steps as tokens. To more comprehensively extract dynamic spatial information from stock data, we propose a Double-Path Adaptive-correlation Spatial-Temporal Inverted Transformer (DPA-STIFormer). DPA-STIFormer models each node via continuous changes in features as tokens and introduces a Double Direction Self-adaptation Fusion mechanism. This mechanism decomposes node encoding into temporal and feature representations, simultaneously extracting different spatial correlations from a double path approach, and proposes a Double-path gating mechanism to fuse these two types of correlation information. Experiments conducted on four stock market datasets demonstrate state-of-the-art results, validating the model’s superior capability in uncovering latent temporal-correlation patterns.

[AI-88] ReLEP: A Novel Framework for Real-world Long-horizon Embodied Planning

Link: https://arxiv.org/abs/2409.15658
Authors: Siyuan Liu,Jiawei Du,Sicheng Xiang,Zibo Wang,Dingsheng Luo
Keywords: long-horizon embodied planning, embodied planning underpins, planning underpins embodied, long-horizon embodied, Real-world long-horizon embodied
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Real-world long-horizon embodied planning underpins embodied AI. To accomplish long-horizon tasks, agents need to decompose abstract instructions into detailed steps. Prior works mostly rely on GPT-4V for task decomposition into predefined actions, which limits task diversity due to GPT-4V’s finite understanding of larger skillsets. Therefore, we present ReLEP, a groundbreaking framework for Real world Long-horizon Embodied Planning, which can accomplish a wide range of daily tasks. At its core lies a fine-tuned large vision language model that formulates plans as sequences of skill functions according to input instruction and scene image. These functions are selected from a carefully designed skill library. ReLEP is also equipped with a Memory module for plan and status recall, and a Robot Configuration module for versatility across robot types. In addition, we propose a semi-automatic data generation pipeline to tackle dataset scarcity. Real-world off-line experiments across eight daily embodied tasks demonstrate that ReLEP is able to accomplish long-horizon embodied tasks and outperforms other state-of-the-art baseline methods.

[AI-89] MMPT: Multimodal Prompt Tuning for Zero-shot Instruction Learning EMNLP2024

Link: https://arxiv.org/abs/2409.15657
Authors: Taowen Wang,Yiyang Liu,James Chenhao Liang,junhan zhao,Yiming Cui,Yuning Mao,Shaoliang Nie,Jiahao Liu,Fuli Feng,Zenglin Xu,Cheng Han,Lifu Huang,Qifan Wang,Dongfang Liu
Keywords: Large Language Models, Multimodal Large Language, zero-shot generalization capabilities, Large Language, Multimodal Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024

Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (MMPT) approach for efficient instruction tuning of MLLMs. MMPT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.

[AI-90] Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

Link: https://arxiv.org/abs/2409.15637
Authors: Tianyue Ou,Frank F. Xu,Aman Madaan,Jiarui Liu,Robert Lo,Abishek Sridhar,Sudipta Sengupta,Dan Roth,Graham Neubig,Shuyan Zhou
Keywords: complete specific objectives, specific objectives, act as autonomous, environments and complete, complete specific
Categories: Artificial Intelligence (cs.AI)

Abstract:LLMs can now act as autonomous agents that interact with digital environments and complete specific objectives (e.g., arranging an online meeting). However, accuracy is still far from satisfactory, partly due to a lack of large-scale, direct demonstrations for digital tasks. Obtaining supervised data from humans is costly, and automatic data collection through exploration or reinforcement learning relies on complex environmental and content setup, resulting in datasets that lack comprehensive coverage of various scenarios. On the other hand, there is abundant knowledge that may indirectly assist task completion, such as online tutorials that were created for human consumption. In this work, we present Synatra, an approach that effectively transforms this indirect knowledge into direct supervision at scale. We define different types of indirect knowledge, and carefully study the available sources to obtain it, methods to encode the structure of direct demonstrations, and finally methods to transform indirect knowledge into direct demonstrations. We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks, Mind2Web, MiniWoB++, and WebArena, as well as surpassing GPT-3.5 on WebArena and Mind2Web. In addition, while synthetic demonstrations prove to be only 3% of the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains.

[AI-91] Personalized Federated Learning via Backbone Self-Distillation ACM-MM

Link: https://arxiv.org/abs/2409.15636
Authors: Pengju Wang,Bochao Liu,Dan Zeng,Chenggang Yan,Shiming Ge
Keywords: frequently necessitates training, necessitates training personalized, learning frequently necessitates, federated learning frequently, training personalized models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ACM MMAsia 2023

Abstract:In practical scenarios, federated learning frequently necessitates training personalized models for each client using heterogeneous data. This paper proposes a backbone self-distillation approach to facilitate personalized federated learning. In this approach, each client trains its local model and only sends the backbone weights to the server. These weights are then aggregated to create a global backbone, which is returned to each client for updating. However, the client’s local backbone lacks personalization because of the common representation. To solve this problem, each client further performs backbone self-distillation by using the global backbone as a teacher and transferring knowledge to update the local backbone. This process involves learning two components: the shared backbone for common representation and the private head for local personalization, which enables effective global knowledge transfer. Extensive experiments and comparisons with 12 state-of-the-art approaches demonstrate the effectiveness of our approach.
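
The communication pattern in the abstract above (clients upload only backbone weights, the server averages them, and heads stay private) can be sketched as follows. This is a simplified illustration with weights as flat lists of floats; the dictionary layout and the name `aggregate_backbones` are assumptions, and the self-distillation step itself is not modeled.

```python
def aggregate_backbones(client_models):
    """Average only the 'backbone' weights across clients.

    Each client model is a dict with a 'backbone' and a 'head' list of floats.
    Heads are never read here, mirroring the idea that they stay on the client.
    Hypothetical helper for illustration only.
    """
    n = len(client_models)
    backbones = [m["backbone"] for m in client_models]
    return [sum(ws) / n for ws in zip(*backbones)]

clients = [
    {"backbone": [1.0, 2.0], "head": [9.0]},
    {"backbone": [3.0, 4.0], "head": [-9.0]},
]
global_backbone = aggregate_backbones(clients)  # element-wise mean of backbones
```

In the approach the abstract describes, each client would then use the returned global backbone as a teacher to update its local backbone, while keeping its own head for personalization.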

[AI-92] Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI

Link: https://arxiv.org/abs/2409.15631
Authors: Liang Zhang,Jionghao Lin,John Sabatini,Conrad Borchers,Daniel Weitekamp,Meng Cao,John Hollander,Xiangen Hu,Arthur C. Graesser
Keywords: intelligent tutoring systems, data describe correct, Learning performance data, Learning performance, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Learning performance data describe correct and incorrect answers or problem-solving attempts in adaptive learning, such as in intelligent tutoring systems (ITSs). Learning performance data tend to be highly sparse (80% to 90% missing observations) in most real-world applications due to adaptive item selection. This data sparsity presents challenges to using learner models to effectively predict future performance and explore new hypotheses about learning. This article proposes a systematic framework for augmenting learner data to address data sparsity in learning performance data. First, learning performance is represented as a three-dimensional tensor of learners’ questions, answers, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation on knowledge tracing tasks that predict missing performance values based on real observations. Third, a module for generating patterns of learning is used. This study contrasts two forms of generative Artificial Intelligence (AI), Generative Adversarial Networks (GANs) and Generative Pre-trained Transformers (GPT), to generate data associated with different clusters of learner data. We tested this approach on an adult literacy dataset from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that: (1) tensor factorization improved the performance in tracing and predicting knowledge mastery compared with other knowledge tracing techniques without data augmentation, showing higher relative fidelity for this imputation method, and (2) the GAN-based simulation showed greater overall stability and less statistical bias based on a divergence evaluation with varying simulation sample sizes compared to GPT.

[AI-93] Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling

Link: https://arxiv.org/abs/2409.15626
Authors: Satya Kapoor,Alex Gil,Sreyoshi Bhaduri,Anshul Mittal,Rutu Mulkar
Keywords: Latent Dirichlet Allocation, uncovering thematic structures, Topic modeling, topic modeling approaches, Topic
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 tables, 1 figure

Abstract:Topic modeling is a widely used technique for uncovering thematic structures from large text corpora. However, most topic modeling approaches, e.g., Latent Dirichlet Allocation (LDA), struggle to capture the nuanced semantics and contextual understanding required to accurately model complex narratives. Recent advancements in this area include methods like BERTopic, which have demonstrated significantly improved topic coherence and thus established a new standard for benchmarking. In this paper, we present a novel approach, the Qualitative Insights Tool (QualIT), that integrates large language models (LLMs) with existing clustering-based topic modeling approaches. Our method leverages the deep contextual understanding and powerful language generation capabilities of LLMs to enrich the topic modeling process using clustering. We evaluate our approach on a large corpus of news articles and demonstrate substantial improvements in topic coherence and topic diversity compared to baseline topic modeling techniques. On the 20 ground-truth topics, our method shows 70% topic coherence (vs 65% and 57% for the benchmarks) and 95.5% topic diversity (vs 85% and 72% for the benchmarks). Our findings suggest that the integration of LLMs can unlock new opportunities for topic modeling of dynamic and complex text data, as is common in talent management research contexts.
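
For readers unfamiliar with the topic diversity metric reported above, one widely used definition is the fraction of unique words among the top-k words of all topics. The sketch below implements that definition; whether QualIT computes diversity in exactly this way is an assumption, and the function name and toy topics are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each ordered by importance.
    A value of 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

# Toy example: two topics sharing one of their top-2 words.
score = topic_diversity([["market", "stock"], ["stock", "election"]], top_k=2)
```

Under this definition, a diversity of 95.5% would mean that almost none of the top words are repeated across topics.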

[AI-94] Revolutionizing Biomarker Discovery: Leveraging Generative AI for Bio-Knowledge-Embedded Continuous Space Exploration

Link: https://arxiv.org/abs/2409.15612
Authors: Wangyang Ying,Dongjie Wang,Xuanming Hu,Ji Qiu,Jin Park,Yanjie Fu
Keywords: advancing personalized medicine, personalized medicine, offering insights, disease diagnosis, therapeutic efficacy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Biomarker discovery is vital in advancing personalized medicine, offering insights into disease diagnosis, prognosis, and therapeutic efficacy. Traditionally, the identification and validation of biomarkers heavily depend on extensive experiments and statistical analyses. These approaches are time-consuming, demand extensive domain expertise, and are constrained by the complexity of biological systems. These limitations motivate us to ask: Can we automatically identify the effective biomarker subset without substantial human effort? Inspired by the success of generative AI, we think that the intricate knowledge of biomarker identification can be compressed into a continuous embedding space, thus enhancing the search for better biomarkers. Thus, we propose a new biomarker identification framework with two important modules: 1) training data preparation and 2) embedding-optimization-generation. The first module uses a multi-agent system to automatically collect pairs of biomarker subsets and their corresponding prediction accuracy as training data. These data establish a strong knowledge base for biomarker identification. The second module employs an encoder-evaluator-decoder learning paradigm to compress the knowledge of the collected data into a continuous space. Then, it utilizes gradient-based search techniques and autoregressive-based reconstruction to efficiently identify the optimal subset of biomarkers. Finally, we conduct extensive experiments on three real-world datasets to show the efficiency, robustness, and effectiveness of our method.

[AI-95] Physics Enhanced Residual Policy Learning (PERPL) for safety cruising in mixed traffic platooning under actuator and communication delay

Link: https://arxiv.org/abs/2409.15595
Authors: Keke Long,Haotian Shi,Yang Zhou,Xiaopeng Li
Keywords: gained extensive application, gained extensive, extensive application, Residual Policy Learning, vehicle control due
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Linear control models have gained extensive application in vehicle control due to their simplicity, ease of use, and support for stability analysis. However, these models lack adaptability to the changing environment and multi-objective settings. Reinforcement learning (RL) models, on the other hand, offer adaptability but suffer from a lack of interpretability and generalization capabilities. This paper aims to develop a family of RL-based controllers enhanced by physics-informed policies, leveraging the advantages of both physics-based models (data-efficient and interpretable) and RL methods (flexible to multiple objectives and fast computing). We propose the Physics-Enhanced Residual Policy Learning (PERPL) framework, where the physics component provides model interpretability and stability. The learning-based Residual Policy adjusts the physics-based policy to adapt to the changing environment, thereby refining the decisions of the physics model. We apply our proposed model to the decentralized control of a mixed traffic platoon of Connected and Automated Vehicles (CAVs) and Human-driven Vehicles (HVs), using a constant time gap (CTG) strategy for cruising and incorporating actuator and communication delays. Experimental results demonstrate that our method achieves smaller headway errors and better oscillation dampening than linear models and RL alone in scenarios with artificially extreme conditions and real preceding vehicle trajectories. At the macroscopic level, overall traffic oscillations are also reduced as the penetration rate of CAVs employing the PERPL scheme increases.
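
The residual-policy idea in the abstract above, a physics-based action corrected by a learned term, can be sketched with a linear constant time gap controller. All gains, limits, and function names below are illustrative assumptions rather than values from the paper, and actuator and communication delays are not modeled.

```python
def ctg_physics_accel(gap, ego_speed, lead_speed,
                      time_gap=1.5, standstill=2.0, k_gap=0.2, k_speed=0.5):
    """Linear constant-time-gap controller; gains are hypothetical.

    The desired gap grows with ego speed: standstill + time_gap * ego_speed.
    """
    desired_gap = standstill + time_gap * ego_speed
    return k_gap * (gap - desired_gap) + k_speed * (lead_speed - ego_speed)

def residual_accel(gap, ego_speed, lead_speed, residual_policy,
                   accel_limits=(-3.0, 2.0)):
    """Physics action plus a learned residual, clipped to actuator limits."""
    base = ctg_physics_accel(gap, ego_speed, lead_speed)
    residual = residual_policy(gap, ego_speed, lead_speed)
    low, high = accel_limits
    return max(low, min(high, base + residual))

# At equilibrium (gap equals desired gap, equal speeds) the physics term is 0,
# so the command equals whatever the stand-in residual policy returns.
command = residual_accel(32.0, 20.0, 20.0, lambda g, v, lv: 0.3)
```

Here `residual_policy` is any callable; in an RL setting it would be the trained network, while the physics term keeps the behavior interpretable when the residual is small.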

[AI-96] FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality

Link: https://arxiv.org/abs/2409.15584
Authors: Junyuan Ding,Ziteng Wang,Chang Gao,Min Liu,Qinyu Chen
Keywords: Extended Reality, traditional frame-based systems, frame-based systems struggle, Event-based Eye Tracking, interactions in Extended
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR’s demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. We enhance the EV-Eye dataset by expanding annotated data and converting original mask labels to ellipse-based annotations to train the model. In addition, a novel trigonometric loss is adopted to address angle discontinuities and a fast causal event volume event representation method is put forward. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6× and 1.8× compared to the prior art, EV-Eye, with 4.4× and 11.7× fewer parameters and arithmetic operations. The code is available at this https URL.
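
One common way to handle the angle discontinuity mentioned in the abstract above is to compare orientations through the sine and cosine of the doubled angle, since an ellipse rotated by π is unchanged. The sketch below shows that generic idea only; the name `trig_angle_loss` and the exact formula are assumptions and not necessarily the loss used by FACET.

```python
import math

def trig_angle_loss(theta_pred, theta_true):
    """Hypothetical orientation loss for ellipses (angles in radians).

    Both angles are doubled so that theta and theta + pi, which describe
    the same ellipse, give zero loss. Using sin and cos keeps the loss
    continuous where the raw angle wraps around.
    """
    d_sin = math.sin(2 * theta_pred) - math.sin(2 * theta_true)
    d_cos = math.cos(2 * theta_pred) - math.cos(2 * theta_true)
    return d_sin ** 2 + d_cos ** 2

# Angles just either side of the wrap are physically close, so the loss is
# small, whereas a plain squared difference of the raw angles would be large.
near_wrap = trig_angle_loss(0.01, math.pi - 0.01)
```

The loss peaks when the two orientations are perpendicular, which is the largest possible disagreement between two ellipse axes.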

[AI-97] Asking an AI for salary negotiation advice is a matter of concern: Controlled experimental perturbation of ChatGPT for protected and non-protected group discrimination on a contextual task with no clear ground truth answers

Link: https://arxiv.org/abs/2409.15567
Authors: R. Stuart Geiger,Flynn O’Sullivan,Elsie Wang,Jonathan Lo
Keywords: conducted controlled experimental, conducted controlled, asked to recommend, recommend an opening, model versions
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:We conducted controlled experimental bias audits for four versions of ChatGPT, which we asked to recommend an opening offer in salary negotiations for a new hire. We submitted 98,800 prompts to each version, systematically varying the employee’s gender, university, and major, and tested prompts in the voice of each side of the negotiation: the employee versus the employer. We find that ChatGPT as a multi-model platform is not robust and consistent enough to be trusted for such a task. We observed statistically significant differences in salary offers when varying gender for all four models, although with smaller gaps than for other attributes tested. The largest gaps were between different model versions and between the employee- vs employer-voiced prompts. We also observed substantial gaps when varying university and major, but many of the biases were not consistent across model versions. We tested for fictional and fraudulent universities and found wildly inconsistent results across cases and model versions. We make broader contributions to the AI/ML fairness literature. Our scenario and our experimental design differ from mainstream AI/ML auditing efforts in key ways. Bias audits typically test discrimination for protected classes like gender, which we contrast with testing non-protected classes of university and major. Asking for negotiation advice includes how aggressive one ought to be in a negotiation relative to known empirical salary distributions and scales, which is a deeply contextual and personalized task that has no objective ground truth to validate. These results raise concerns for the specific model versions we tested and ChatGPT as a multi-model platform in continuous development. Our epistemology does not permit us to definitively certify these models as either generally biased or unbiased on the attributes we test, but our study raises matters of concern for stakeholders to further investigate.

[AI-98] GEM-RAG: Graphical Eigen Memories For Retrieval Augmented Generation

Link: https://arxiv.org/abs/2409.15566
Authors: Brendan Hogan Rappazzo, Yingheng Wang, Aaron Ferber, Carla Gomes
Keywords: shaping entities capable, Retrieval Augmented Generation, Large Language Models, general intelligence, shaping entities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 8 pages

Abstract:The ability to form, retrieve, and reason about memories in response to stimuli serves as the cornerstone for general intelligence - shaping entities capable of learning, adaptation, and intuitive insight. Large Language Models (LLMs) have proven their ability, given the proper memories or context, to reason and respond meaningfully to stimuli. However, they are still unable to optimally encode, store, and retrieve memories - the ability to do this would unlock their full ability to operate as AI agents, and to specialize to niche domains. To remedy this, one promising area of research is Retrieval Augmented Generation (RAG), which aims to augment LLMs by providing them with rich in-context examples and information. In question-answering (QA) applications, RAG methods embed the text of interest in chunks, and retrieve the most relevant chunks for a prompt using text embeddings. Motivated by human memory encoding and retrieval, we aim to improve over standard RAG methods by generating and encoding higher-level information and tagging the chunks by their utility to answer questions. We introduce Graphical Eigen Memories For Retrieval Augmented Generation (GEM-RAG). GEM-RAG works by tagging each chunk of text in a given text corpus with LLM-generated “utility” questions, connecting chunks in a graph based on the similarity of both their text and utility questions, and then using the eigendecomposition of the memory graph to build higher level summary nodes that capture the main themes of the text. We evaluate GEM-RAG, using both UnifiedQA and GPT-3.5 Turbo as the LLMs, with SBERT and OpenAI’s text encoders on two standard QA tasks, showing that GEM-RAG outperforms other state-of-the-art RAG methods on these tasks. We also discuss the implications of having a robust RAG system and future directions.
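
To make the graph step more concrete, here is a minimal sketch under heavy assumptions. The chunk vectors are hand-made toy embeddings (a real system would use SBERT or another encoder), each chunk has a single vector rather than separate text and utility-question embeddings, and power iteration for the leading eigenvector stands in for a full eigendecomposition. How GEM-RAG turns eigenvectors into summary nodes is not described in the abstract and is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_graph(vectors):
    """Weighted adjacency matrix with zero diagonal (no self-loops)."""
    n = len(vectors)
    return [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


def principal_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and renormalise."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Toy embeddings: chunks 0-2 share a theme, chunk 3 is an outlier.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]

scores = principal_eigenvector(similarity_graph(chunk_vectors))
for index, score in enumerate(scores):
    print(f"chunk {index}: weight in leading eigenvector = {score:.3f}")
```

Chunks with large weights in the leading eigenvector belong to the dominant theme of the graph, which is the kind of grouping a summary node could be built from.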

[AI-99] SEAL: Suite for Evaluating API-use of LLMs

Link: https://arxiv.org/abs/2409.15523
Authors: Woojeong Kim, Ashish Jagmohan, Aditya Vempaty
Keywords: Large language models, Large language, require real-time access, language models, limitations in handling
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have limitations in handling tasks that require real-time access to external APIs. While several benchmarks like ToolBench and APIGen have been developed to assess LLMs’ API-use capabilities, they often suffer from issues such as lack of generalizability, limited multi-step reasoning coverage, and instability due to real-time API fluctuations. In this paper, we introduce SEAL, an end-to-end testbed designed to evaluate LLMs in real-world API usage. SEAL standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs by introducing a GPT-4-powered API simulator with caching for deterministic evaluations. Our testbed provides a comprehensive evaluation pipeline that covers API retrieval, API calls, and final responses, offering a reliable framework for structured performance comparison in diverse real-world scenarios. SEAL is publicly available, with ongoing updates for new benchmarks.
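
The caching idea can be sketched as follows. This is not SEAL's code: `CachedApiSimulator`, `noisy_simulator`, and the `get_weather` call are hypothetical names for this example, and a random stub replaces the GPT-4-powered simulator. The point is only that keying the cache on a canonical form of the call makes repeated evaluations return the same response.

```python
import json
import random
from typing import Callable, Dict


class CachedApiSimulator:
    """Wraps a (possibly nondeterministic) simulator with a response cache."""

    def __init__(self, simulate: Callable[[str, dict], str]):
        self._simulate = simulate
        self._cache: Dict[str, str] = {}

    def call(self, api_name: str, arguments: dict) -> str:
        # Canonical key: argument order must not affect cache lookups.
        key = json.dumps([api_name, arguments], sort_keys=True)
        if key not in self._cache:
            self._cache[key] = self._simulate(api_name, arguments)
        return self._cache[key]


def noisy_simulator(api_name: str, arguments: dict) -> str:
    """Stand-in for an LLM-backed simulator: output varies between calls."""
    return f"{api_name} -> result #{random.randint(0, 10**9)}"


sim = CachedApiSimulator(noisy_simulator)
first = sim.call("get_weather", {"city": "Paris", "unit": "C"})
second = sim.call("get_weather", {"unit": "C", "city": "Paris"})
print(first == second)  # same call, same cached response
```

A persistent store (a file or database) would be needed for determinism across runs; the in-memory dictionary here only covers a single process.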

[AI-100] CANDERE-COACH: Reinforcement Learning from Noisy Feedback

Link: https://arxiv.org/abs/2409.15521
Authors: Yuxuan Li, Srijita Das, Matthew E. Taylor
Keywords: recent times, challenging tasks, widely applied, Reinforcement learning, learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent times, Reinforcement learning (RL) has been widely applied to many challenging tasks. However, in order to perform well, it requires access to a good reward function which is often sparse or manually engineered with scope for error. Introducing human prior knowledge is often seen as a possible solution to the above-mentioned problem, such as imitation learning, learning from preference, and inverse reinforcement learning. Learning from feedback is another framework that enables an RL agent to learn from binary evaluative signals describing the teacher’s (positive or negative) evaluation of the agent’s action. However, these methods often make the assumption that evaluative teacher feedback is perfect, which is a restrictive assumption. In practice, such feedback can be noisy due to limited teacher expertise or other exacerbating factors like cognitive load, availability, distraction, etc. In this work, we propose the CANDERE-COACH algorithm, which is capable of learning from noisy feedback by a nonoptimal teacher. We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect. Experiments on three common domains demonstrate the effectiveness of the proposed approach.

[AI-101] Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA EMNLP

Link: https://arxiv.org/abs/2409.15515
Authors: Nirmal Roy, Leonardo F. R. Ribeiro, Rexhina Blloshmi, Kevin Small
Keywords: Augmenting Large Language, Large Language Models, Augmenting Large, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted in EMNLP (findings) 2024

Abstract:Augmenting Large Language Models (LLMs) with information retrieval capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial for knowledge-intensive tasks. However, understanding users’ contextual search intent when generating responses is an understudied topic for conversational question answering (QA). This conversational extension leads to additional concerns when compared to single-turn QA as it is more challenging for systems to comprehend conversational context and manage retrieved passages over multiple turns. In this work, we propose a method for enabling LLMs to decide when to retrieve in RAG settings given a conversational context. When retrieval is deemed necessary, the LLM then rewrites the conversation for passage retrieval and judges the relevance of returned passages before response generation. Operationally, we build on the single-turn SELF-RAG framework (Asai et al., 2023) and propose SELF-multi-RAG for conversational settings. SELF-multi-RAG demonstrates improved capabilities over single-turn variants with respect to retrieving relevant passages (by using summarized conversational context) and assessing the quality of generated responses. Experiments on three conversational QA datasets validate the enhanced response generation capabilities of SELF-multi-RAG, with improvements of ~13% measured by human annotation.

[AI-102] PixelBytes: Catching Unified Embedding for Multimodal Generation

Link: https://arxiv.org/abs/2409.15512
Authors: Fabien Furfaro
Keywords: report introduces PixelBytes, multimodal representation learning, report introduces, Recurrent Neural Networks, representation learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This report introduces PixelBytes Embedding, a novel approach for unified multimodal representation learning. Our method captures diverse inputs in a single, cohesive representation, enabling emergent properties for multimodal sequence generation, particularly for text and pixelated images. Inspired by state-of-the-art sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, PixelBytes aims to address the challenges of integrating different data types. We explore various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, focusing on bidirectional processing and our innovative PxBy embedding technique. Our experiments, conducted on a specialized PixelBytes Pokémon dataset, demonstrate that bidirectional sequence models with PxBy embedding and convolutional layers can generate coherent multimodal sequences. This work contributes to the advancement of integrated AI models capable of understanding and generating multimodal data in a unified manner.

[AI-103] From Text to Treatment Effects: A Meta-Learning Approach to Handling Text-Based Confounding

Link: https://arxiv.org/abs/2409.15503
Authors: Henri Arno, Paloma Rabaey, Thomas Demeester
Keywords: heterogeneous treatment effects, causal machine learning, average treatment effects, treatment effects, central goals
Categories: Artificial Intelligence (cs.AI)

Abstract:One of the central goals of causal machine learning is the accurate estimation of heterogeneous treatment effects from observational data. In recent years, meta-learning has emerged as a flexible, model-agnostic paradigm for estimating conditional average treatment effects (CATE) using any supervised model. This paper examines the performance of meta-learners when the confounding variables are embedded in text. Through synthetic data experiments, we show that learners using pre-trained text representations of confounders, in addition to tabular background variables, achieve improved CATE estimates compared to those relying solely on the tabular variables, particularly when sufficient data is available. However, due to the entangled nature of the text embeddings, these models do not fully match the performance of meta-learners with perfect confounder knowledge. These findings highlight both the potential and the limitations of pre-trained text representations for causal inference and open up interesting avenues for future research.

[AI-104] VLMine: Long-Tail Data Mining with Vision Language Models

Link: https://arxiv.org/abs/2409.15486
Authors: Mao Ye, Gregory P. Meyer, Zaiwei Zhang, Dennis Park, Siva Karthik Mustikovela, Yuning Chai, Eric M Wolff
Keywords: Ensuring robust performance, Ensuring robust, machine learning, autonomous driving, robust performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and on 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
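
A minimal sketch of frequency-based mining, assuming the VLM keywords have already been extracted. The image ids and keywords are made up, and the scoring rule (mean negative log document frequency over an image's keywords) is one plausible choice for this example rather than the rule used in the paper.

```python
import math
from collections import Counter


def rarity_scores(keywords_per_image):
    """Score each image by the mean negative log document frequency
    of its keywords; higher means rarer content."""
    total = len(keywords_per_image)
    # Count in how many images each keyword appears.
    doc_freq = Counter(
        keyword
        for keywords in keywords_per_image.values()
        for keyword in set(keywords)
    )
    scores = {}
    for image_id, keywords in keywords_per_image.items():
        unique = set(keywords)
        if not unique:
            scores[image_id] = 0.0
            continue
        scores[image_id] = sum(
            -math.log(doc_freq[k] / total) for k in unique
        ) / len(unique)
    return scores


# Hypothetical keyword sets, as a VLM might produce them.
keywords_per_image = {
    "img1": ["car", "road"],
    "img2": ["car", "road", "pedestrian"],
    "img3": ["car", "road"],
    "img4": ["road", "horse", "rider"],
}

scores = rarity_scores(keywords_per_image)
for image_id in sorted(scores, key=scores.get, reverse=True):
    print(image_id, round(scores[image_id], 3))
```

Images at the top of the ranking are candidates for the long tail; the abstract notes that this signal is meant to be combined with uncertainty-based mining rather than used alone.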

[AI-105] RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration

Link: https://arxiv.org/abs/2409.15461
Authors: Haoyu Huang, Tong Niu, Rui Yang, Luping Shi
Keywords: large language models, utilizing large language, language models, Recently
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recently, many studies have focused on integrating large language models (LLMs) into educational dialogues. In liberal arts dialogues especially, educators must balance Humanized communication, Teaching expertise, and Safety-ethics (HTS), besides the subject knowledge itself. However, because collecting massive amounts of HTS-compliant teaching dialogues from the real world as a training corpus is expensive, the outputs of existing LLMs in teaching dialogues fall short of human standards. To address this, we design a Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C) framework to automatically generate such dialogue data. Specifically, we first establish HTS-guided knowledge bases, encompassing three domains of knowledge: teaching skills, psychology, and safety ethics. Then, RAM2C organizes LLMs, which are retrieval-augmented by the above knowledge bases, into multi-expert groups with distinct roles to generate the HTS-compliant educational dialogue dataset. We then fine-tune the LLMs using this dataset. Empirical evaluations indicate that RAM2C-empowered LLMs excel in Chinese reading teaching, offering more personalized and ethically safe teaching responses, demonstrating RAM2C’s practicality and high quality. We release the experiments at this https URL.

[AI-106] In-Context Learning May Not Elicit Trustworthy Reasoning: A-Not-B Errors in Pretrained Language Models EMNLP2024

Link: https://arxiv.org/abs/2409.15454
Authors: Pengrui Han, Peiyang Song, Haofei Yu, Jiaxuan You
Keywords: large language models, highly capable large, capable large language, Recent advancements, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted at EMNLP 2024 Findings

Abstract:Recent advancements in artificial intelligence have led to the creation of highly capable large language models (LLMs) that can perform tasks in a human-like manner. However, LLMs exhibit only infant-level cognitive abilities in certain areas. One such area is the A-Not-B error, a phenomenon seen in infants where they repeat a previously rewarded behavior despite well-observed changed conditions. This highlights their lack of inhibitory control, the ability to stop a habitual or impulsive response. In our work, we design a text-based multi-choice QA scenario similar to the A-Not-B experimental settings to systematically test the inhibitory control abilities of LLMs. We found that state-of-the-art LLMs (like Llama3-8b) perform consistently well with in-context learning (ICL) but make errors and show a significant drop of as much as 83.3% in reasoning tasks when the context changes trivially. This suggests that LLMs only have inhibitory control abilities on par with human infants in this regard, often failing to suppress the previously established response pattern during ICL.

[AI-107] Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models

Link: https://arxiv.org/abs/2409.15451
Authors: Mike Zhang, Kaixian Qu, Vaishakh Patil, Cesar Cadena, Marco Hutter
Keywords: Large Language Models, Large Language, common sense reasoning, Language Models, sense reasoning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large Language Models (LLM) have emerged as a tool for robots to generate task plans using common sense reasoning. For the LLM to generate actionable plans, scene context must be provided, often through a map. Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps based on queryable embeddings capable of representing any semantic class. However, embeddings cannot directly report the scene context as they are implicit, requiring further processing for LLM integration. To address this, we propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs due to their text-based nature by building upon large-scale image recognition models. We study how entities in our map can be localized and show through evaluations that our text-based map localizations perform comparably to those from open vocabulary maps while using two to four orders of magnitude less memory. Real-robot experiments demonstrate the grounding of an LLM with the text-based map to solve user tasks.

[AI-108] Steward: Natural Language Web Automation

Link: https://arxiv.org/abs/2409.15441
Authors: Brian Tang, Kang G. Shin
Keywords: large language models, demonstrated exceptional capabilities, demonstrated exceptional, Recently, large language
Categories: Artificial Intelligence (cs.AI)

Abstract:Recently, large language models (LLMs) have demonstrated exceptional capabilities in serving as the foundation for AI assistants. One emerging application of LLMs, navigating through websites and interacting with UI elements across various web pages, remains somewhat underexplored. We introduce Steward, a novel LLM-powered web automation tool designed to serve as a cost-effective, scalable, end-to-end solution for automating web interactions. Traditional browser automation frameworks like Selenium, Puppeteer, and Playwright are not scalable for extensive web interaction tasks, such as studying recommendation algorithms on platforms like YouTube and Twitter. These frameworks require manual coding of interactions, limiting their utility in large-scale or dynamic contexts. Steward addresses these limitations by integrating LLM capabilities with browser automation, allowing for natural language-driven interaction with websites. Steward operates by receiving natural language instructions and reactively planning and executing a sequence of actions on websites, looping until completion, making it a practical tool for developers and researchers to use. It achieves high efficiency, completing actions in 8.52 to 10.14 seconds at a cost of $0.028 per action or an average of $0.18 per task, which is further reduced to 4.8 seconds and $0.022 through a caching mechanism. It runs tasks on real websites with a 40% completion success rate. We discuss various design and implementation challenges, including state representation, action sequence selection, system responsiveness, detecting task completion, and caching implementation.

[AI-109] Attack Atlas: A Practitioner’s Perspective on Challenges and Pitfalls in Red Teaming GenAI

Link: https://arxiv.org/abs/2409.15398
Authors: Ambrish Rawat, Stefan Schoepf, Giulio Zizzo, Giandomenico Cornacchia, Muhammad Zaid Hameed, Kieran Fraser, Erik Miehling, Beat Buesser, Elizabeth M. Daly, Mark Purcell, Prasanna Sattigeri, Pin-Yu Chen, Kush R. Varshney
Keywords: large language models, language models, large language, natural language, production applications
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As generative AI, particularly large language models (LLMs), become increasingly integrated into production applications, new attack surfaces and vulnerabilities emerge and put a focus on adversarial threats in natural language and multi-modal systems. Red-teaming has gained importance in proactively identifying weaknesses in these systems, while blue-teaming works to protect against such adversarial attacks. Despite growing academic interest in adversarial risks for generative AI, there is limited guidance tailored for practitioners to assess and mitigate these challenges in real-world environments. To address this, our contributions include: (1) a practical examination of red- and blue-teaming strategies for securing generative AI, (2) identification of key challenges and open questions in defense development and evaluation, and (3) the Attack Atlas, an intuitive framework that brings a practical approach to analyzing single-turn input attacks, placing it at the forefront for practitioners. This work aims to bridge the gap between academic insights and practical security measures for the protection of generative AI systems.

[AI-110] Parse Trees Guided LLM Prompt Compression

Link: https://arxiv.org/abs/2409.15395
Authors: Wenhao Mao, Chengbin Hou, Tianyu Zhang, Xinyu Lin, Ke Tang, Hairong Lv
Keywords: Large Language Models, Offering rich contexts, Large Language, resulting longer prompt, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Offering rich contexts to Large Language Models (LLMs) has been shown to boost performance in various tasks, but the resulting longer prompts increase the computational cost and might exceed the input limit of LLMs. Recently, some prompt compression methods have been suggested to shorten prompts by using language models to generate shorter prompts or by developing computational models to select important parts of the original prompt. The generative compression methods suffer from issues like hallucination, while the selective compression methods do not involve linguistic rules and overlook the global structure of the prompt. To this end, we propose a novel selective compression method called PartPrompt. It first obtains a parse tree for each sentence based on linguistic rules, and calculates local information entropy for each node in a parse tree. These local parse trees are then organized into a global tree according to the hierarchical structure, such as the dependency of sentences, paragraphs, and sections. After that, root-ward propagation and leaf-ward propagation are proposed to adjust node values over the global tree. Finally, a recursive algorithm is developed to prune the global tree based on the adjusted node values. The experiments show that PartPrompt achieves state-of-the-art performance across various datasets, metrics, compression ratios, and target LLMs for inference. The in-depth ablation studies confirm the effectiveness of the designs in PartPrompt, and additional experiments also demonstrate its superiority in terms of the coherence of compressed prompts and in the extremely long prompt scenario.
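
The final pruning step can be illustrated with a small recursive sketch. Everything here is an assumption for illustration: the `Node` class, the hand-assigned values, the fixed threshold rule, and the simplified document/sentence/word tree (a real parse tree and the entropy and propagation steps are not reproduced).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    text: str     # empty for structural nodes (document, sentence)
    value: float  # stand-in for the adjusted node value
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, threshold: float) -> Optional[Node]:
    """Drop a node and its whole subtree if its value is below the threshold."""
    if node.value < threshold:
        return None
    kept = [
        child
        for child in (prune(c, threshold) for c in node.children)
        if child is not None
    ]
    return Node(node.text, node.value, kept)


def flatten(node: Node) -> List[str]:
    """Collect the surviving words in tree order."""
    words = [node.text] if node.text else []
    for child in node.children:
        words.extend(flatten(child))
    return words


root = Node("", 1.0, [
    Node("", 0.9, [
        Node("PartPrompt", 0.9),
        Node("really", 0.1),
        Node("prunes", 0.8),
        Node("trees", 0.7),
    ]),
    Node("", 0.2, [  # low-value sentence: removed together with its words
        Node("as", 0.5),
        Node("noted", 0.5),
    ]),
])

compressed = " ".join(flatten(prune(root, threshold=0.5)))
print(compressed)
```

A compression-ratio target would typically replace the fixed threshold, for example by choosing the threshold that leaves a given number of tokens.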

[AI-111] Neural Control Variates with Automatic Integration

Link: https://arxiv.org/abs/2409.15394
Authors: Zilu Li, Guandao Yang, Qingqing Zhao, Xi Deng, Leonidas Guibas, Bharath Hariharan, Gordon Wetzstein
Keywords: Monte Carlo integration, Monte Carlo, network, leverage arbitrary neural, control variates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Abstract:This paper presents a method to leverage arbitrary neural network architecture for control variates. Control variates are crucial in reducing the variance of Monte Carlo integration, but they hinge on finding a function that both correlates with the integrand and has a known analytical integral. Traditional approaches rely on heuristics to choose this function, which might not be expressive enough to correlate well with the integrand. Recent research alleviates this issue by modeling the integrands with a learnable parametric model, such as a neural network. However, the challenge remains in creating an expressive parametric model with a known analytical integral. This paper proposes a novel approach to construct learnable parametric control variates functions from arbitrary neural network architectures. Instead of using a network to approximate the integrand directly, we employ the network to approximate the anti-derivative of the integrand. This allows us to use automatic differentiation to create a function whose integration can be constructed by the antiderivative network. We apply our method to solve partial differential equations using the Walk-on-sphere algorithm. Our results indicate that this approach is unbiased and uses various network architectures to achieve lower variance than other control variate methods.
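
A one-dimensional sketch of the antiderivative idea, under stated assumptions: `antiderivative_net` is a fixed `tanh` function standing in for a trained network G, its derivative is written by hand instead of being obtained by automatic differentiation, and the integrand `cos(x)` on [0, π/2] is a toy example. Because the control variate g = dG/dx integrates exactly to G(b) − G(a), the estimator stays unbiased whatever G is; a G whose derivative tracks the integrand just lowers the variance.

```python
import math
import random
import statistics


def antiderivative_net(x: float) -> float:
    """G(x): stand-in for a learned antiderivative network."""
    return math.tanh(x)


def control_variate(x: float) -> float:
    """g(x) = dG/dx, derived by hand here (autodiff in a real setup)."""
    return 1.0 - math.tanh(x) ** 2


def integrand(x: float) -> float:
    """f(x): toy integrand whose integral over [0, pi/2] is 1."""
    return math.cos(x)


def estimate(n: int, lo: float, hi: float, rng: random.Random):
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    width = hi - lo
    # Exact integral of g over [lo, hi], available from G alone.
    known_integral = antiderivative_net(hi) - antiderivative_net(lo)
    residuals = [integrand(x) - control_variate(x) for x in xs]
    plain = [integrand(x) for x in xs]
    cv_estimate = known_integral + width * statistics.fmean(residuals)
    plain_estimate = width * statistics.fmean(plain)
    return (cv_estimate, plain_estimate,
            statistics.pvariance(residuals), statistics.pvariance(plain))


cv_est, plain_est, cv_var, plain_var = estimate(
    10_000, 0.0, math.pi / 2, random.Random(0)
)
print("control-variate estimate:", cv_est)
print("plain Monte Carlo estimate:", plain_est)
print("per-sample variance (cv vs plain):", cv_var, plain_var)
```

In the paper's setting the parameters of G would be trained so that its derivative matches the integrand; here they are fixed by hand only to keep the example short.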

[AI-112] Approximated Orthogonal Projection Unit: Stabilizing Regression Network Training Using Natural Gradient

Link: https://arxiv.org/abs/2409.15393
Authors: Shaoqi Wang, Chunjie Yang, Siwei Lou
Keywords: function approximation capabilities, Neural networks, approximation capabilities, cutting-edge soft sensor, extensively studied
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Neural networks (NN) are extensively studied in cutting-edge soft sensor models due to their feature extraction and function approximation capabilities. Current research into network-based methods primarily focuses on models’ offline accuracy. Notably, in industrial soft sensor context, online optimizing stability and interpretability are prioritized, followed by accuracy. This requires a clearer understanding of network’s training process. To bridge this gap, we propose a novel NN named the Approximated Orthogonal Projection Unit (AOPU) which has solid mathematical basis and presents superior training stability. AOPU truncates the gradient backpropagation at dual parameters, optimizes the trackable parameters updates, and enhances the robustness of training. We further prove that AOPU attains minimum variance estimation (MVE) in NN, wherein the truncated gradient approximates the natural gradient (NG). Empirical results on two chemical process datasets clearly show that AOPU outperforms other models in achieving stable convergence, marking a significant advancement in soft sensor field.

[AI-113] Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation EMNLP2024

Link: https://arxiv.org/abs/2409.15381
Authors: G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
Keywords: Recent studies show, Recent studies, text prompts, POS tags, POS tag
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: Findings of the EMNLP 2024

Abstract:Recent studies show that text-to-image (T2I) models are vulnerable to adversarial attacks, especially with noun perturbations in text prompts. In this study, we investigate the impact of adversarial attacks on different POS tags within text prompts on the images generated by T2I models. We create a high-quality dataset for realistic POS tag token swapping and perform gradient-based attacks to find adversarial suffixes that mislead T2I models into generating images with altered tokens. Our empirical results show that the attack success rate (ASR) varies significantly among different POS tag categories, with nouns, proper nouns, and adjectives being the easiest to attack. We explore the mechanism behind the steering effect of adversarial suffixes, finding that the number of critical tokens and content fusion vary among POS tags, while features like suffix transferability are consistent across categories. We have made our implementation publicly available at this https URL.

[AI-114] Kalahi: A handcrafted grassroots cultural LLM evaluation suite for Filipino

Link: https://arxiv.org/abs/2409.15380
Authors: Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, William Chandra Tjhi, Alham Fikri Aji
Keywords: necessarily provide culturally, Filipino, necessarily provide, provide culturally, Filipino users
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to its Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model’s ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.

[AI-115] Prompting Large Language Models for Supporting the Differential Diagnosis of Anemia

Link: https://arxiv.org/abs/2409.15377
Authors: Elisa Castagnari (HeKA), Lillian Muyama (HeKA), Adrien Coulet (HeKA)
Keywords: laboratory exams, sequence of steps, Large Language Models, reach diagnosis decisions, clinicians achieve
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In practice, clinicians achieve a diagnosis by following a sequence of steps, such as laboratory exams, observations, or imaging. The pathways to reach diagnosis decisions are documented by guidelines authored by expert organizations, which guide clinicians to reach a correct diagnosis through these sequences of steps. While these guidelines are beneficial for following medical reasoning and consolidating medical knowledge, they have some drawbacks. They often fail to address patients with uncommon conditions due to their focus on the majority population, and are slow and costly to update, making them unsuitable for rapidly emerging diseases or new practices. Inspired by clinical guidelines, our study aimed to develop pathways similar to those that can be obtained in clinical guidelines. We tested three Large Language Models (LLMs), namely Generative Pretrained Transformer 4 (GPT-4), Large Language Model Meta AI (LLaMA), and Mistral, on a synthetic yet realistic dataset to differentially diagnose anemia and its subtypes. By using advanced prompting techniques to enhance the decision-making process, we generated diagnostic pathways using these models. Experimental results indicate that LLMs hold huge potential in clinical pathway discovery from patient data, with GPT-4 exhibiting the best performance in all conducted experiments.

[AI-116] ControlMath: Controllable Data Generation Promotes Math Generalist Models

Link: https://arxiv.org/abs/2409.15376
Authors: Nuo Chen, Ning Wu, Jianhui Chang, Jia Li
Keywords: Utilizing large language, Utilizing large, large language models, yielded encouraging results, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages

Abstract:Utilizing large language models (LLMs) for data augmentation has yielded encouraging results in mathematical reasoning. However, these approaches face constraints in problem diversity, potentially restricting them to in-domain/distribution data generation. To this end, we propose ControlMath, an iterative method involving an equation-generator module and two LLM-based agents. The module creates diverse equations, which the Problem-Crafter agent then transforms into math word problems. The Reverse-Agent filters and selects high-quality data, adhering to the “less is more” principle, achieving better results with fewer data points. This approach enables the generation of diverse math problems, not limited to specific domains or distributions. As a result, we collect ControlMathQA, which involves 190k math word problems. Extensive results prove that combining our dataset with in-domain datasets like GSM8K can help improve the model’s mathematical ability to generalize, leading to improved performances both within and beyond specific domains.

[AI-117] DS2TA: Denoising Spiking Transformer with Attenuated Spatiotemporal Attention

Link: https://arxiv.org/abs/2409.15375
Authors: Boxun Xu, Hejia Geng, Yuxuan Yin, Peng Li
Keywords: current high-performance models, vision applications, current high-performance, high-performance models, models of choice
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2311.09376

Abstract:Vision Transformers (ViT) are current high-performance models of choice for various vision applications. Recent developments have given rise to biologically inspired spiking transformers that thrive in ultra-low power operations on neuromorphic hardware, but they do so without fully unlocking the potential of spiking neural networks. We introduce DS2TA, a Denoising Spiking transformer with attenuated SpatioTemporal Attention, designed specifically for vision applications. DS2TA introduces a new spiking attenuated spatiotemporal attention mechanism that considers input firing correlations occurring in both time and space, thereby fully harnessing the computational power of spiking neurons at the core of the transformer architecture. Importantly, DS2TA facilitates parameter-efficient spatiotemporal attention computation without introducing extra weights. DS2TA employs efficient hashmap-based nonlinear spiking attention denoisers to enhance the robustness and expressive power of spiking attention maps. DS2TA demonstrates state-of-the-art performance on several widely adopted static image and dynamic neuromorphic datasets. Operated over 4 time steps, DS2TA achieves 94.92% top-1 accuracy on CIFAR10 and 77.47% top-1 accuracy on CIFAR100, as well as 79.1% and 94.44% on CIFAR10-DVS and DVS-Gesture using 10 time steps.

[AI-118] Enhancing Performance and Scalability of Large-Scale Recommendation Systems with Jagged Flash Attention

Link: https://arxiv.org/abs/2409.15373
Authors: Rengan Xu, Junjie Yang, Yifan Xu, Hong Li, Xing Liu, Devashish Shankar, Haoci Zhang, Meng Liu, Boyang Li, Yuxi Hu, Mingwei Tang, Zehua Zhang, Tunhou Zhang, Dai Li, Sijia Chen, Gian-Paolo Musumeci, Jiaqi Zhai, Bill Zhu, Hong Yan, Srihari Reddy
Keywords: previously deemed impractical, paradigms previously deemed, deemed impractical, integration of hardware, hardware accelerators
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 3 pages, 2 figures

Abstract:The integration of hardware accelerators has significantly advanced the capabilities of modern recommendation systems, enabling the exploration of complex ranking paradigms previously deemed impractical. However, the GPU-based computational costs present substantial challenges. In this paper, we demonstrate our development of an efficiency-driven approach to explore these paradigms, moving beyond traditional reliance on native PyTorch modules. We address the specific challenges posed by ranking models’ dependence on categorical features, which vary in length and complicate GPU utilization. We introduce Jagged Feature Interaction Kernels, a novel method designed to extract fine-grained insights from long categorical features through efficient handling of dynamically sized tensors. We further enhance the performance of attention mechanisms by integrating Jagged tensors with Flash Attention. Our novel Jagged Flash Attention achieves up to 9x speedup and 22x memory reduction compared to dense attention. Notably, it also outperforms dense flash attention, with up to 3x speedup and 53% more memory efficiency. In production models, we observe 10% QPS improvement and 18% memory savings, enabling us to scale our recommendation systems with longer features and more complex architectures.
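
The abstract does not describe the kernels themselves, so the following is only a rough sketch of the general "jagged" layout it refers to: variable-length feature rows stored as one flat list of values plus row offsets, instead of a padded dense matrix. The function names `to_jagged`, `to_padded`, and `jagged_row_sums` are hypothetical and are not taken from the paper or from PyTorch.

```python
from typing import List, Tuple


def to_jagged(rows: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Pack variable-length rows into flat values plus row offsets."""
    values: List[float] = []
    offsets = [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end position of this row
    return values, offsets


def to_padded(rows: List[List[float]], pad: float = 0.0) -> List[List[float]]:
    """Dense alternative: pad every row to the longest row's length."""
    width = max((len(r) for r in rows), default=0)
    return [list(r) + [pad] * (width - len(r)) for r in rows]


def jagged_row_sums(values: List[float], offsets: List[int]) -> List[float]:
    """Per-row reduction that touches only real elements, never padding."""
    return [sum(values[offsets[i]:offsets[i + 1]])
            for i in range(len(offsets) - 1)]


rows = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
values, offsets = to_jagged(rows)
padded_cells = sum(len(r) for r in to_padded(rows))
print(len(values), "stored values vs", padded_cells, "padded cells")
```

The gap between stored values and padded cells grows with the spread of row lengths, which is one plausible reason a jagged layout saves memory on long categorical features.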

[AI-119] Fuzzy Rule based Intelligent Cardiovascular Disease Prediction using Complex Event Processing

Link: https://arxiv.org/abs/2409.15372
Authors: Shashi Shekhar Kumar, Anurag Harsh, Ritesh Chandra, Sonali Agarwal
Keywords: rapidly rising global, rising global concern, global concern due, World Health Organization, risk
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Cardiovascular diseases (CVDs) are a rapidly rising global concern due to unhealthy diets, lack of physical activity, and other factors. According to the World Health Organization (WHO), primary risk factors include elevated blood pressure, glucose, blood lipids, and obesity. Recent research has focused on accurate and timely disease prediction to reduce risk and fatalities, often relying on predictive models trained on large datasets, which require intensive training. An intelligent system for CVD patients could greatly assist in making informed decisions by effectively analyzing health parameters. Complex Event Processing (CEP) has emerged as a valuable method for solving real-time challenges by aggregating patterns of interest and their causes and effects on end users. In this work, we propose a fuzzy rule-based system for monitoring clinical data to provide real-time decision support. We designed fuzzy rules based on clinical and WHO standards to ensure accurate predictions. Our integrated approach uses Apache Kafka and Spark for data streaming, and the Siddhi CEP engine for event processing. Additionally, we pass numerous cardiovascular disease-related parameters through CEP engines to ensure fast and reliable prediction decisions. To validate the effectiveness of our approach, we simulated real-time, unseen data to predict cardiovascular disease. Using synthetic data (1000 samples), we categorized it into “Very Low Risk, Low Risk, Medium Risk, High Risk, and Very High Risk.” Validation results showed that 20% of samples were categorized as very low risk, 15-45% as low risk, 35-65% as medium risk, 55-85% as high risk, and 75% as very high risk.
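
As a rough illustration of what a fuzzy rule can look like, the sketch below combines two triangular membership functions with a fuzzy AND (minimum). It is not the paper's rule base: the function names and every numeric breakpoint are made-up placeholders chosen for readability, not clinical or WHO thresholds, and the Kafka, Spark, and Siddhi parts of the pipeline are not shown.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: 0 outside (a, c), rising linearly to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def high_risk_degree(systolic: float, glucose: float) -> float:
    """Hypothetical rule: IF pressure is high AND glucose is high THEN risk is high."""
    # Illustrative breakpoints only; these are assumptions, not medical guidance.
    pressure_high = triangular(systolic, 120.0, 160.0, 200.0)
    glucose_high = triangular(glucose, 100.0, 150.0, 200.0)
    return min(pressure_high, glucose_high)  # fuzzy AND as the minimum


print(high_risk_degree(140.0, 150.0))
```

A full system would evaluate many such rules and defuzzify the combined result into one of the five risk categories.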

[AI-120] Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models

Link: https://arxiv.org/abs/2409.15371
Authors: Jiale Kang
Keywords: Large Language Models, Language Models, Large Language, requirements increase correspondingly, memory requirements increase
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:As Large Language Models (LLMs) continue to grow in size, their computational and memory requirements increase correspondingly. Consequently, the exploration of cost-effective and efficient fine-tuning methods has become increasingly important. Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices, establishing itself as the predominant fine-tuning method for LLMs. In pursuit of performance closer to full-parameter training, a series of LoRA variants have emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. However, these methods also make the fine-tuning initialization process more complex, and it remains challenging to surpass the performance ceiling of full fine-tuning. To address these issues, this paper introduces an innovative method called Bone (Block Affine), which not only reduces memory overhead but also emphasizes the internal connections between weights, leading to faster convergence and better data fitting. Experimental comparisons across two different LLM architectures (LLaMA2, RWKV6) and various parameter scales demonstrate that the Bone structure can achieve rapid convergence and superior data fitting without the need for complex initialization. For example, when fine-tuning LLaMA2-7B on the MetaMathQA dataset and validating on GSM8k and math benchmarks, Bone achieved fine-tuning scores of 49.36 and 8.8, respectively, outperforming PISSA by 5.84% and 1.96%.
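
For readers unfamiliar with the LoRA baseline that the abstract compares against, here is a minimal pure-Python sketch of the idea it describes: the original weight matrix stays frozen and only two small low-rank factors are trained. The helper names are hypothetical, and the sketch does not attempt to reproduce Bone's block affine update, which the abstract does not specify.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product, enough for a small illustration."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix,
                          scale: float = 1.0) -> Matrix:
    """Frozen W (d x k) plus the trainable low-rank update B (d x r) @ A (r x k)."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in B and A, versus d * k for full fine-tuning."""
    return r * (d + k)


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight
b = [[1.0], [0.0]]             # trainable factor, rank r = 1
a = [[5.0, 6.0]]               # trainable factor
print(lora_effective_weight(w, b, a))
print(lora_trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

With a zero-initialized `b`, the effective weight equals the pretrained one, which is why training can start from the original model's behavior.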

[AI-121] Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

Link: https://arxiv.org/abs/2409.15370
Authors: Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan
Keywords: leveraging transformer architectures, accelerating molecular design, Molecular Foundation Models, material science, leveraging transformer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 26 pages, 6 figures

Abstract:Molecular Foundation Models are emerging as powerful tools for accelerating molecular design, material science, and cheminformatics, leveraging transformer architectures to speed up the discovery of new materials and drugs while reducing the computational cost of traditional ab initio methods. However, current models are constrained by closed-vocabulary tokenizers that fail to capture the full diversity of molecular structures. In this work, we systematically evaluate thirteen chemistry-specific tokenizers for their coverage of the SMILES language, uncovering substantial gaps. Using N-gram language models, we assessed the impact of tokenizer choice on model performance and quantified the information loss of unknown tokens. We introduce two new tokenizers, smirk and smirk-gpe, which can represent the entirety of the OpenSMILES specification while avoiding the pitfalls of existing tokenizers. Our work highlights the importance of open-vocabulary modeling for molecular foundation models and the need for chemically diverse benchmarks for cheminformatics.

[AI-122] Geometric Relational Embeddings

Link: https://arxiv.org/abs/2409.15369
Authors: Bo Xiong
Keywords: low-dimensional vector representations, transforms relational data, relational data, learning transforms relational, geometric relational embeddings
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Doctoral Dissertation, 177 pages

Abstract:Relational representation learning transforms relational data into continuous and low-dimensional vector representations. However, vector-based representations fall short in capturing crucial properties of relational data that are complex and symbolic. We propose geometric relational embeddings, a paradigm of relational embeddings that respect the underlying symbolic structures. Specifically, this dissertation introduces various geometric relational embedding models capable of capturing: 1) complex structured patterns like hierarchies and cycles in networks and knowledge graphs; 2) logical structures in ontologies and logical constraints applicable for constraining machine learning model outputs; and 3) high-order structures between entities and relations. Our results obtained from benchmark and real-world datasets demonstrate the efficacy of geometric relational embeddings in adeptly capturing these discrete, symbolic, and structured properties inherent in relational data.

[AI-123] MedCodER: A Generative AI Assistant for Medical Coding

Link: https://arxiv.org/abs/2409.15368
Authors: Krishanu Das Baksi, Elijah Soba, John J. Higgins, Ravi Saini, Jaden Wood, Jane Cook, Jack Scott, Nirmala Pudota, Tim Weninger, Edward Bowen, Sanmitra Bhattacharya
Keywords: standardizing clinical data, Natural Language Processing, Traditional Natural Language, Generative Artificial Intelligence, prone to errors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Medical coding is essential for standardizing clinical data and communication but is often time-consuming and prone to errors. Traditional Natural Language Processing (NLP) methods struggle with automating coding due to the large label space, lengthy text inputs, and the absence of supporting evidence annotations that justify code selection. Recent advancements in Generative Artificial Intelligence (AI) offer promising solutions to these challenges. In this work, we introduce MedCodER, a Generative AI framework for automatic medical coding that leverages extraction, retrieval, and re-ranking techniques as core components. MedCodER achieves a micro-F1 score of 0.60 on International Classification of Diseases (ICD) code prediction, significantly outperforming state-of-the-art methods. Additionally, we present a new dataset containing medical records annotated with disease diagnoses, ICD codes, and supporting evidence texts (this https URL). Ablation tests confirm that MedCodER’s performance depends on the integration of each of its aforementioned components, as performance declines when these components are evaluated in isolation.

[AI-124] Fine-Tuning a Time Series Foundation Model with Wasserstein Loss

Link: https://arxiv.org/abs/2409.15367
Authors: Andrei Chernov
Keywords: Natural Language Processing, Language Processing, Natural Language, large language models, Inspired by recent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 main pages; 2 figures

Abstract:Inspired by recent advancements in large language models (LLMs) for Natural Language Processing (NLP), there has been a surge in research focused on developing foundational models for time series forecasting. One approach involves training LLM architectures on tokenized time series data using cross-entropy loss. Although this method has demonstrated promising results, cross-entropy loss is primarily designed for classification tasks and does not account for the distance between classes. To address this limitation, we propose using the Wasserstein loss for such architectures. To validate our approach, we fine-tuned a foundational time series model on 22 zero-shot datasets, comparing the performance of cross-entropy loss with that of Wasserstein loss. Our results demonstrate that replacing cross-entropy loss with Wasserstein loss significantly improves point estimation.
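
The point about cross-entropy ignoring the distance between classes can be seen in a small sketch. It assumes the tokens are ordered value bins with unit spacing, so the one-dimensional Wasserstein-1 distance reduces to the sum of absolute differences between the two cumulative distributions. The function names are hypothetical and the paper's exact loss may differ.

```python
import math
from itertools import accumulate
from typing import Sequence


def cross_entropy(pred: Sequence[float], target_index: int) -> float:
    """Cross-entropy against a one-hot target: only the target bin matters."""
    return -math.log(pred[target_index])


def wasserstein_1d(pred: Sequence[float], target: Sequence[float]) -> float:
    """W1 between distributions on ordered, unit-spaced bins (CDF difference)."""
    return sum(abs(p - t)
               for p, t in zip(accumulate(pred), accumulate(target)))


target = [0.0, 0.0, 1.0, 0.0, 0.0]   # true value falls in bin 2
near = [0.0, 0.0, 0.5, 0.5, 0.0]     # leftover mass one bin away
far = [0.0, 0.0, 0.5, 0.0, 0.5]      # leftover mass two bins away

# Cross-entropy cannot tell the two predictions apart; Wasserstein can.
print(cross_entropy(near, 2), cross_entropy(far, 2))
print(wasserstein_1d(near, target), wasserstein_1d(far, target))
```

Both predictions put the same probability on the correct bin, so cross-entropy scores them identically, while the Wasserstein distance penalizes the prediction whose stray mass lies further from the target.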

[AI-125] Trajectory Anomaly Detection with Language Models

Link: https://arxiv.org/abs/2409.15366
Authors: Jonathan Mbuya, Dieter Pfoser, Antonios Anastasopoulos
Keywords: autoregressive causal-attention model, paper presents, autoregressive causal-attention, trajectory anomaly detection, anomaly detection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper presents a novel approach for trajectory anomaly detection using an autoregressive causal-attention model, termed LM-TAD. This method leverages the similarities between language statements and trajectories, both of which consist of ordered elements requiring coherence through external rules and contextual variations. By treating trajectories as sequences of tokens, our model learns the probability distributions over trajectories, enabling the identification of anomalous locations with high precision. We incorporate user-specific tokens to account for individual behavior patterns, enhancing anomaly detection tailored to user context. Our experiments demonstrate the effectiveness of LM-TAD on both synthetic and real-world datasets. In particular, the model outperforms existing methods on the Pattern of Life (PoL) dataset by detecting user-contextual anomalies and achieves competitive results on the Porto taxi dataset, highlighting its adaptability and robustness. Additionally, we introduce the use of perplexity and surprisal rate metrics for detecting outliers and pinpointing specific anomalous locations within trajectories. The LM-TAD framework supports various trajectory representations, including GPS coordinates, staypoints, and activity types, proving its versatility in handling diverse trajectory data. Moreover, our approach is well-suited for online trajectory anomaly detection, significantly reducing computational latency by caching key-value states of the attention mechanism, thereby avoiding repeated computations.
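
The two scoring ideas named in the abstract can be sketched with their standard definitions: the surprisal of a token is the negative log of the probability the model assigned to it, and perplexity is the exponential of the mean surprisal over the sequence. The per-token probabilities below are invented inputs standing in for a trained model's outputs, and the helper names are hypothetical; the paper's "surprisal rate" may be defined differently.

```python
import math
from typing import List, Sequence


def surprisals(token_probs: Sequence[float]) -> List[float]:
    """Per-location surprisal: -log of the probability given to each token."""
    return [-math.log(p) for p in token_probs]


def perplexity(token_probs: Sequence[float]) -> float:
    """Whole-trajectory score: exp of the mean surprisal."""
    s = surprisals(token_probs)
    return math.exp(sum(s) / len(s))


def most_surprising_index(token_probs: Sequence[float]) -> int:
    """Position of the location the model found least expected."""
    s = surprisals(token_probs)
    return max(range(len(s)), key=s.__getitem__)


# Made-up probabilities for a five-location trajectory.
probs = [0.5, 0.4, 0.01, 0.5, 0.6]
print(perplexity(probs), most_surprising_index(probs))
```

Perplexity flags a trajectory as a whole, while the per-location surprisal points at where inside the trajectory the anomaly sits.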

[AI-126] Novel Saliency Analysis for the Forward Forward Algorithm

Link: https://arxiv.org/abs/2409.15365
Authors: Mitra Bakhshi
Keywords: Forward Forward algorithm, dual forward mechanism, Forward Forward, Forward Forward framework, Forward
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things, (AIBThings)

Abstract:Incorporating the Forward Forward algorithm into neural network training represents a transformative shift from traditional methods, introducing a dual forward mechanism that streamlines the learning process by bypassing the complexities of derivative propagation. This method is noted for its simplicity and efficiency and involves executing two forward passes: the first with actual data to promote positive reinforcement, and the second with synthetically generated negative data to enable discriminative learning. Our experiments confirm that the Forward Forward algorithm is not merely an experimental novelty but a viable training strategy that competes robustly with conventional multilayer perceptron (MLP) architectures. To overcome the limitations inherent in traditional saliency techniques, which predominantly rely on gradient-based methods, we developed a bespoke saliency algorithm specifically tailored for the Forward Forward framework. This innovative algorithm enhances the intuitive understanding of feature importance and network decision-making, providing clear visualizations of the data features most influential in model predictions. By leveraging this specialized saliency method, we gain deeper insights into the internal workings of the model, significantly enhancing our interpretative capabilities beyond those offered by standard approaches. Our evaluations, utilizing the MNIST and Fashion MNIST datasets, demonstrate that our method performs comparably to traditional MLP-based models.

[AI-127] VERA: Validation and Enhancement for Retrieval Augmented systems

Link: https://arxiv.org/abs/2409.15364
Authors: Nitin Aravind Birur, Tanay Baswa, Divyanshu Kumar, Jatan Loya, Sahil Agarwal, Prashanth Harshangi
Keywords: Large language models, exhibit remarkable capabilities, Large language, VERA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Large language models (LLMs) exhibit remarkable capabilities but often produce inaccurate responses, as they rely solely on their embedded knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating an external information retrieval system, supplying additional context along with the query to mitigate inaccuracies for a particular context. However, accuracy issues still remain, as the model may rely on irrelevant documents or extrapolate incorrectly from its training knowledge. To assess and improve the performance of both the retrieval system and the LLM in a RAG framework, we propose VERA (Validation and Enhancement for Retrieval Augmented systems), a system designed to: 1) Evaluate and enhance the retrieved context before response generation, and 2) Evaluate and refine the LLM-generated response to ensure precision and minimize errors. VERA employs an evaluator-cum-enhancer LLM that first checks if external retrieval is necessary, evaluates the relevance and redundancy of the retrieved context, and refines it to eliminate non-essential information. Post-response generation, VERA splits the response into atomic statements, assesses their relevance to the query, and ensures adherence to the context. Our experiments demonstrate VERA’s remarkable efficacy not only in improving the performance of smaller open-source models, but also larger state-of-the-art models. These enhancements underscore VERA’s potential to produce accurate and relevant responses, advancing the state-of-the-art in retrieval-augmented language modeling. VERA’s robust methodology, combining multiple evaluation and refinement steps, effectively mitigates hallucinations and improves retrieval and response processes, making it a valuable tool for applications demanding high accuracy and reliability in information generation.

[AI-128] Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Link: https://arxiv.org/abs/2409.15361
Authors: Essa Jan, Nouar AlDahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki
Keywords: Large Language Models, Large Language, Recent breakthroughs, breakthroughs in Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 11 figures

Abstract:Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red teaming and safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including the fine-tuning task and model calibrations. This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset effectively reducing attack success rates across a range of tasks without compromising the model’s overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.

[AI-129] Reward-Robust RLHF in LLMs

Link: https://arxiv.org/abs/2409.15360
Authors: Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen
Keywords: Artificial General Intelligence, achieving Artificial General, Large Language Models, Large Language, Artificial General
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect reward models. Empirical results demonstrate that our framework consistently outperforms traditional RLHF across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be effective in a stochastic-case analysis. Together, these contributions highlight the framework’s potential to enhance both the performance and stability of LLM alignment with RLHF.
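
One way to picture "integrating both nominal performance and minimum reward signals" is a weighted mix of a nominal score and the worst score across an ensemble of reward models. The sketch below is an assumption-laden illustration, not the paper's objective: taking the ensemble mean as the nominal signal, the trade-off weight `lam`, and the function name are all hypothetical choices.

```python
from typing import Sequence


def robust_reward(ensemble_rewards: Sequence[float], lam: float = 0.5) -> float:
    """Blend a nominal reward with the most pessimistic ensemble member.

    Assumptions (not from the paper): the nominal signal is the ensemble
    mean, and lam in [0, 1] trades performance against robustness.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    nominal = sum(ensemble_rewards) / len(ensemble_rewards)
    worst_case = min(ensemble_rewards)
    return lam * nominal + (1.0 - lam) * worst_case


# Three hypothetical reward heads disagree about the same response.
print(robust_reward([1.0, 0.5, 0.0], lam=0.5))
```

Setting `lam` to 1 recovers the plain nominal reward, while lower values make the policy more cautious about responses on which the reward heads disagree.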

[AI-130] Watch Your Steps: Observable and Modular Chains of Thought

Link: https://arxiv.org/abs/2409.15359
Authors: Cassandra A. Cohen, William W. Cohen
Keywords: Program Trace Prompting, called Program Trace, prompting called Program, Trace Prompting, Program Trace
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We propose a variant of chain of thought (CoT) prompting called Program Trace Prompting that makes explanations more observable while preserving the power, generality and flexibility of CoT. In our approach, few-shot CoT demonstrations are wrapped in a formal syntax based on Python, and each prompt: identifies and names steps; defines the input/output behavior of steps; and replaces CoT explanations of in-context examples with chains of these formalized steps on the same examples. Program Trace Prompting is applicable to many tasks, achieving strong results on the 23 diverse tasks in the BIG-Bench Hard benchmark. More importantly, by instrumenting explanations in this way, we enable new types of analysis. In particular, we identify “non-local errors” (which correspond to incorrectly learning the reasoning method illustrated in the demonstrations) as an unaddressed issue in CoT learning, and we present methods for verifying the modularity of steps in a CoT explanation.

[AI-131] Block-Attention for Low-Latency RAG

Link: https://arxiv.org/abs/2409.15355
Authors: East Sun, Yan Wang, Lan Tian
Keywords: increased inference latency, Retrieval-Augmented Generation, attention mechanism designed, designed to address, address the increased
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We introduce Block-Attention, an attention mechanism designed to address the increased inference latency in Retrieval-Augmented Generation (RAG) scenarios. Its main idea lies in dividing the input sequence into blocks, where each block calculates its key-value (KV) states independently except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to pre-compute the KV states for all passages and cache them in memory. The implementation involves block segmentation, positional encoding calculation, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block Attention model can achieve performance comparable to (68.4% vs 67.9% on Llama3) or even better (62.8% vs 59.6% on Mistral) than self-attention models. Notably, Block-Attention reduces the TTFT to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared with the self-attention model, the time consumption is reduced by 98.7%.
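
The blocking rule in the abstract can be pictured as an attention mask. In the sketch below, which is one reading of that description rather than the authors' implementation, a token attends causally within its own block, and only tokens in the final block may also attend to every earlier block. The function name is hypothetical, and positional re-encoding and fine-tuning are not shown.

```python
from typing import List


def block_attention_mask(block_lengths: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j."""
    # Block id of every token position, e.g. [2, 2] -> [0, 0, 1, 1].
    block_of = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    last_block = len(block_lengths) - 1
    total = len(block_of)
    return [
        [
            j <= i  # causal: never look ahead
            and (block_of[i] == last_block or block_of[j] == block_of[i])
            for j in range(total)
        ]
        for i in range(total)
    ]


# Two retrieved passages of two tokens each, then a two-token query block.
for row in block_attention_mask([2, 2, 2]):
    print("".join("x" if allowed else "." for allowed in row))
```

Because the rows for non-final blocks never reference other blocks, their KV states do not depend on which other passages are retrieved, which is what makes caching them per passage possible.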

[AI-132] Recall: Empowering Multimodal Embedding for Edge Devices

Link: https://arxiv.org/abs/2409.15342
Authors: Dongqi Cai, Shangguang Wang, Chen Peng, Zeling Zhang, Mengwei Xu
Keywords: prone to forgetting, Human memory, inherently prone, Human, Abstract
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Human memory is inherently prone to forgetting. To address this, multimodal embedding models have been introduced, which transform diverse real-world data into a unified embedding space. These embeddings can be retrieved efficiently, aiding mobile users in recalling past information. However, as model complexity grows, so do its resource demands, leading to reduced throughput and heavy computational requirements that limit mobile device implementation. In this paper, we introduce RECALL, a novel on-device multimodal embedding system optimized for resource-limited mobile environments. RECALL achieves high-throughput, accurate retrieval by generating coarse-grained embeddings and leveraging query-based filtering for refined retrieval. Experimental results demonstrate that RECALL delivers high-quality embeddings with superior throughput, all while operating unobtrusively with minimal memory and energy consumption.

[AI-133] Explainable AI: Definition and attributes of a good explanation for health AI

Link: https://arxiv.org/abs/2409.15338
Authors: Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
Keywords: Proposals of artificial, accurate predictive models, artificial intelligence, based on increasingly, predictive models
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract:Proposals of artificial intelligence (AI) solutions based on increasingly complex and accurate predictive models are becoming ubiquitous across many disciplines. As the complexity of these models grows, transparency and users’ understanding often diminish. This suggests that accurate prediction alone is insufficient for making an AI-based solution truly useful. In the development of healthcare systems, this introduces new issues related to accountability and safety. Understanding how and why an AI system makes a recommendation may require complex explanations of its inner workings and reasoning processes. Although research on explainable AI (XAI) has significantly increased in recent years and there is high demand for XAI in medicine, defining what constitutes a good explanation remains ad hoc, and providing adequate explanations continues to be challenging. To fully realize the potential of AI, it is critical to address two fundamental questions about explanations for safety-critical AI applications, such as health-AI: (1) What is an explanation in health-AI? and (2) What are the attributes of a good explanation in health-AI? In this study, we examined published literature and gathered expert opinions through a two-round Delphi study. The research outputs include (1) a definition of what constitutes an explanation in health-AI and (2) a comprehensive list of attributes that characterize a good explanation in health-AI.

[AI-134] Revisiting the Solution of Meta KDD Cup 2024: CRAG

Link: https://arxiv.org/abs/2409.15337
Authors: Jie Ouyang, Yucong Luo, Mingyue Cheng, Daoyu Wang, Shuo Yu, Qi Liu, Enhong Chen
Keywords: Meta KDD CUP, KDD CUP, Meta KDD, RAG Benchmark Challenge, CRAG Comprehensive RAG
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This paper presents the solution of our team APEX in the Meta KDD CUP 2024: CRAG Comprehensive RAG Benchmark Challenge. The CRAG benchmark addresses the limitations of existing QA benchmarks in evaluating the diverse and dynamic challenges faced by Retrieval-Augmented Generation (RAG) systems. It provides a more comprehensive assessment of RAG performance and contributes to advancing research in this field. We propose a routing-based domain and dynamic adaptive RAG pipeline, which performs specific processing for the diverse and dynamic nature of the question in all three stages: retrieval, augmentation, and generation. Our method achieved superior performance on CRAG and ranked 2nd for Tasks 2 and 3 on the final competition leaderboard. Our implementation is available at this link: this https URL.

[AI-135] Causality-Driven Reinforcement Learning for Joint Communication and Sensing

Link: https://arxiv.org/abs/2409.15329
Authors: Anik Roy,Serene Banerjee,Jishnu Sadasivan,Arnab Sarkar,Soumyajit Dey
Keywords (EN): improve spectrum efficiency, next-generation wireless network, based Joint Communication, Multiple-Input Multiple Output, wireless network
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures, 4 tables, 1 algorithm

Abstract:The next-generation wireless network, 6G and beyond, envisions to integrate communication and sensing to overcome interference, improve spectrum efficiency, and reduce hardware and power consumption. Massive Multiple-Input Multiple Output (mMIMO)-based Joint Communication and Sensing (JCAS) systems realize this integration for 6G applications such as autonomous driving, as it requires accurate environmental sensing and time-critical communication with neighboring vehicles. Reinforcement Learning (RL) is used for mMIMO antenna beamforming in the existing literature. However, the huge search space for actions associated with antenna beamforming causes the learning process for the RL agent to be inefficient due to high beam training overhead. The learning process does not consider the causal relationship between action space and the reward, and gives all actions equal importance. In this work, we explore a causally-aware RL agent which can intervene and discover causal relationships for mMIMO-based JCAS environments, during the training phase. We use a state dependent action dimension selection strategy to realize causal discovery for RL-based JCAS. Evaluation of the causally-aware RL framework in different JCAS scenarios shows the benefit of our proposed framework over baseline methods in terms of the beamforming gain.

[AI-136] Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4

Link: https://arxiv.org/abs/2409.15326
Authors: Daniel Jung,Alex Butler,Joongheum Park,Yair Saperstein
Keywords (EN): Large language models, rapidly growing interest, Language Model Augmented, Model Augmented Retrieval, clear source citations
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure

Abstract:The use of Large language models (LLMs) to augment clinical decision support systems is a topic with rapidly growing interest, but current shortcomings such as hallucinations and lack of clear source citations make them unreliable for use in the clinical environment. This study evaluates Ask Avo, an LLM-derived software by AvoMD that incorporates a proprietary Language Model Augmented Retrieval (LMAR) system, in-built visual citation cues, and prompt engineering designed for interactions with physicians, against ChatGPT-4 in end-user experience for physicians in a simulated clinical scenario environment. Eight clinical questions derived from medical guideline documents in various specialties were prompted to both models by 62 study participants, with each response rated on trustworthiness, actionability, relevancy, comprehensiveness, and friendly format from 1 to 5. Ask Avo significantly outperformed ChatGPT-4 in all criteria: trustworthiness (4.52 vs. 3.34, p<0.001), actionability (4.41 vs. 3.19, p<0.001), relevancy (4.55 vs. 3.49, p<0.001), comprehensiveness (4.50 vs. 3.37, p<0.001), and friendly format (4.52 vs. 3.60, p<0.001). Our findings suggest that specialized LLMs designed with the needs of clinicians in mind can offer substantial improvements in user experience over general-purpose LLMs. Ask Avo’s evidence-based approach tailored to clinician needs shows promise in the adoption of LLM-augmented clinical decision support software.

[AI-137] Cognitive phantoms in LLMs through the lens of latent variables

Link: https://arxiv.org/abs/2409.15324
Authors: Sanne Peereboom,Inga Schwabe,Bennett Kleinberg
Keywords (EN): increasingly reach real-world, reach real-world applications, increasingly reach, real-world applications, reach real-world
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour. Their size and complexity complicate traditional assessment methods, causing the emergence of alternative approaches inspired by the field of psychology. Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing LLM behaviour. However, this approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans. Typical procedures rarely acknowledge the validity problem in LLMs, comparing and interpreting average LLM scores. This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires. Findings suggest that questionnaires designed for humans do not validly measure similar constructs in LLMs, and that these constructs may not exist in LLMs at all, highlighting the need for psychometric analyses of LLM responses to avoid chasing cognitive phantoms. Keywords: large language models, psychometrics, machine behaviour, latent variable modeling, validity

[AI-138] Introducing ELLIPS: An Ethics-Centered Approach to Research on LLM-Based Inference of Psychiatric Conditions

Link: https://arxiv.org/abs/2409.15323
Authors: Roberta Rocca,Giada Pistilli,Kritika Maheshwari,Riccardo Fusaroli
Keywords (EN): mental health care, health care systems, care systems worldwide, systems worldwide struggle, infer neuropsychiatric conditions
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:As mental health care systems worldwide struggle to meet demand, there is increasing focus on using language models to infer neuropsychiatric conditions or psychopathological traits from language production. Yet, so far, this research has only delivered solutions with limited clinical applicability, due to insufficient consideration of ethical questions crucial to ensuring the synergy between possible applications and model design. To accelerate progress towards clinically applicable models, our paper charts the ethical landscape of research on language-based inference of psychopathology and provides a practical tool for researchers to navigate it. We identify seven core ethical principles that should guide model development and deployment in this domain, translate them into ELLIPS, an ethical toolkit operationalizing these principles into questions that can guide researchers’ choices with respect to data selection, architectures, evaluation, and model deployment, and provide a case study exemplifying its use. With this, we aim to facilitate the emergence of model technology with concrete potential for real-world applicability.

[AI-139] On the Complexity of Neural Computation in Superposition

Link: https://arxiv.org/abs/2409.15318
Authors: Micah Adler,Nir Shavit
Keywords (EN): key mechanism underlying, Recent advances, multiple features simultaneously, represent multiple features, neural networks suggest
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 43 pages, 8 figures

Abstract:Recent advances in the understanding of neural networks suggest that superposition, the ability of a single neuron to represent multiple features simultaneously, is a key mechanism underlying the computational efficiency of large-scale networks. This paper explores the theoretical foundations of computing in superposition, focusing on explicit, provably correct algorithms and their efficiency. We present the first lower bounds showing that for a broad class of problems, including permutations and pairwise logical operations, a neural network computing in superposition requires at least $\Omega(m' \log m')$ parameters and $\Omega(\sqrt{m' \log m'})$ neurons, where $m'$ is the number of output features being computed. This implies that any "lottery ticket" sparse sub-network must have at least $\Omega(m' \log m')$ parameters no matter what the initial dense network size. Conversely, we show a nearly tight upper bound: logical operations like pairwise AND can be computed using $O(\sqrt{m'} \log m')$ neurons and $O(m' \log^2 m')$ parameters. There is thus an exponential gap between computing in superposition, the subject of this work, and representing features in superposition, which can require as little as $O(\log m')$ neurons based on the Johnson-Lindenstrauss Lemma. Our hope is that our results open a path for using complexity theoretic techniques in neural network interpretability research.
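
For readers skimming the bounds in the abstract above, the following block places them side by side. It is only a restatement with hidden constants dropped; the symbols $N$ (neuron count) and $P$ (parameter count) are introduced here for illustration and are not the paper's notation. The lower bounds hold for the broad problem class the abstract names, while the upper bounds are stated for operations such as pairwise AND. The ratios on the right are simple arithmetic on those quoted bounds, showing why the abstract calls the upper bound "nearly tight": the two sides differ only by logarithmic factors.

```latex
\begin{align*}
N_{\text{lower}} &= \Omega\!\big(\sqrt{m' \log m'}\big), &
N_{\text{upper}} &= O\!\big(\sqrt{m'}\,\log m'\big), &
\frac{\sqrt{m'}\,\log m'}{\sqrt{m' \log m'}} &= \sqrt{\log m'}, \\
P_{\text{lower}} &= \Omega\!\big(m' \log m'\big), &
P_{\text{upper}} &= O\!\big(m' \log^{2} m'\big), &
\frac{m' \log^{2} m'}{m' \log m'} &= \log m'.
\end{align*}
```

By contrast, merely representing $m'$ features in superposition can need as few as $O(\log m')$ neurons, which is exponentially smaller than the $\Omega(\sqrt{m' \log m'})$ neurons needed to compute with them.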

[AI-140] Shared Autonomy with IDA: Interventional Diffusion Assistance

Link: https://arxiv.org/abs/2409.15317
Authors: Brandon J. McMahan,Zhenghao Peng,Bolei Zhou,Jonathan C. Kao
Keywords (EN): controlling advanced technologies, artificial intelligence, advanced technologies, IDA, human
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 main figures, 2 appendix figures

Abstract:The rapid development of artificial intelligence (AI) has unearthed the potential to assist humans in controlling advanced technologies. Shared autonomy (SA) facilitates control by combining inputs from a human pilot and an AI copilot. In prior SA studies, the copilot is constantly active in determining the action played at each time step. This limits human autonomy and may have deleterious effects on performance. In general, the amount of helpful copilot assistance can vary greatly depending on the task dynamics. We therefore hypothesize that human autonomy and SA performance improve through dynamic and selective copilot intervention. To address this, we develop a goal-agnostic intervention assistance (IA) that dynamically shares control by having the copilot intervene only when the expected value of the copilot’s action exceeds that of the human’s action across all possible goals. We implement IA with a diffusion copilot (termed IDA) trained on expert demonstrations with goal masking. We prove a lower bound on the performance of IA that depends on pilot and copilot performance. Experiments with simulated human pilots show that IDA achieves higher performance than pilot-only and traditional SA control in variants of the Reacher environment and Lunar Lander. We then demonstrate that IDA achieves better control in Lunar Lander with human-in-the-loop experiments. Human participants report greater autonomy with IDA and prefer IDA over pilot-only and traditional SA control. We attribute the success of IDA to preserving human autonomy while simultaneously offering assistance to prevent the human pilot from entering universally bad states.
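
The intervention rule described in the abstract above can be illustrated with a small sketch. This is one literal reading of "intervene only when the expected value of the copilot's action exceeds that of the human's action across all possible goals", not the authors' implementation: it assumes a finite set of goals and discrete actions, and the table `q_values` and the function `choose_action` are hypothetical names introduced here. The paper itself uses a diffusion copilot, which this sketch does not model.

```python
def choose_action(q_values, human_action, copilot_action):
    """Pick the action to execute under a goal-agnostic intervention rule.

    q_values is a hypothetical table: q_values[g][a] is an estimate of the
    expected return of taking action a when the true goal is g.
    """
    # Intervene only if the copilot's action is strictly better under
    # every candidate goal; otherwise the human keeps control.
    copilot_better_everywhere = all(
        row[copilot_action] > row[human_action] for row in q_values
    )
    return copilot_action if copilot_better_everywhere else human_action


# Two goals (rows), three actions (columns); values are made up.
q = [
    [1.0, 2.0, 0.5],
    [0.2, 0.9, 1.5],
]
print(choose_action(q, human_action=0, copilot_action=1))
print(choose_action(q, human_action=2, copilot_action=1))
```

In the first call the copilot's action is better under both goals, so it overrides the human. In the second, the human's action is better under the second goal, so the human's choice is kept even though the copilot would win under the first goal.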

[AI-141] An Efficient Recommendation Model Based on Knowledge Graph Attention-Assisted Network (KGAT-AX)

Link: https://arxiv.org/abs/2409.15315
Authors: Zhizhong Wu
Keywords (EN): helping users filter, Recommendation systems play, Graph Attention-assisted Network, play a crucial, crucial role
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Recommendation systems play a crucial role in helping users filter through vast amounts of information. However, traditional recommendation algorithms often overlook the integration and utilization of multi-source information, limiting system performance. Therefore, this study proposes a novel recommendation model, Knowledge Graph Attention-assisted Network (KGAT-AX). We first incorporate the knowledge graph into the recommendation model, introducing an attention mechanism to explore higher order connectivity more explicitly. By using multilayer interactive information propagation, the model aggregates information to enhance its generalization ability. Furthermore, we integrate auxiliary information into entities through holographic embeddings, aggregating the information of adjacent entities for each entity by learning their inferential relationships. This allows for better utilization of auxiliary information associated with entities. We conducted experiments on real datasets to demonstrate the rationality and effectiveness of the KGAT-AX model. Through experimental analysis, we observed the effectiveness and potential of KGAT-AX compared to other baseline models on public datasets. KGAT-AX demonstrates better knowledge information capture and relationship learning capabilities.

[AI-142] Irrelevant Alternatives Bias Large Language Model Hiring Decisions

Link: https://arxiv.org/abs/2409.15299
Authors: Kremena Valkanova,Pencho Yordanov
Keywords (EN): well-known human cognitive, human cognitive bias, attraction effect, hiring decisions, attraction effect occurs
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:We investigate whether LLMs display a well-known human cognitive bias, the attraction effect, in hiring decisions. The attraction effect occurs when the presence of an inferior candidate makes a superior candidate more appealing, increasing the likelihood of the superior candidate being chosen over a non-dominated competitor. Our study finds consistent and significant evidence of the attraction effect in GPT-3.5 and GPT-4 when they assume the role of a recruiter. Irrelevant attributes of the decoy, such as its gender, further amplify the observed bias. GPT-4 exhibits greater bias variation than GPT-3.5. Our findings remain robust even when warnings against the decoy effect are included and the recruiter role definition is varied.

[AI-143] SketcherX: AI-Driven Interactive Robotic drawing with Diffusion model and Vectorization Techniques

Link: https://arxiv.org/abs/2409.15292
Authors: Jookyung Song,Mookyoung Kang,Nojun Kwak
Keywords (EN): Large Language Model, Stable Diffusion model, Vector Low Rank, interactive human-robot engagement, introduce SketcherX
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures

Abstract:We introduce SketcherX, a novel robotic system for personalized portrait drawing through interactive human-robot engagement. Unlike traditional robotic art systems that rely on analog printing techniques, SketcherX captures and processes facial images to produce vectorized drawings in a distinctive, human-like artistic style. The system comprises two 6-axis robotic arms: a face robot, which is equipped with a head-mounted camera and Large Language Model (LLM) for real-time interaction, and a drawing robot, utilizing a fine-tuned Stable Diffusion model, ControlNet, and Vision-Language models for dynamic, stylized drawing. Our contributions include the development of a custom Vector Low Rank Adaptation model (LoRA), enabling seamless adaptation to various artistic styles, and integrating a pair-wise fine-tuning approach to enhance stroke quality and stylistic accuracy. Experimental results demonstrate the system’s ability to produce high-quality, personalized portraits within two minutes, highlighting its potential as a new paradigm in robotic creativity. This work advances the field of robotic art by positioning robots as active participants in the creative process, paving the way for future explorations in interactive, human-robot artistic collaboration.

[AI-144] Broadening Access to Simulations for End-Users via Large Language Models: Challenges and Opportunities

Link: https://arxiv.org/abs/2409.15290
Authors: Philippe J. Giabbanelli,Jose J. Padilla,Ameeta Agrawal
Keywords (EN): Large Language Models, create intelligent virtual, intelligent virtual assistants, Large Language, exemplified in marketing
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To appear in proceedings of the 2024 Winter Simulation Conference

Abstract:Large Language Models (LLMs) are becoming ubiquitous to create intelligent virtual assistants that assist users in interacting with a system, as exemplified in marketing. Although LLMs have been discussed in Modeling & Simulation (M&S), the community has focused on generating code or explaining results. We examine the possibility of using LLMs to broaden access to simulations, by enabling non-simulation end-users to ask what-if questions in everyday language. Specifically, we discuss the opportunities and challenges in designing such an end-to-end system, divided into three broad phases. First, assuming the general case in which several simulation models are available, textual queries are mapped to the most relevant model. Second, if a mapping cannot be found, the query can be automatically reformulated and clarifying questions can be generated. Finally, simulation results are produced and contextualized for decision-making. Our vision for such a system articulates long-term research opportunities spanning M&S, LLMs, information retrieval, and ethics.

[AI-145] Scenario of Use Scheme: Threat Model Specification for Speaker Privacy Protection in the Medical Domain INTERSPEECH2024

Link: https://arxiv.org/abs/2409.16106
Authors: Mehtab Ur Rahman,Martha Larson,Louis ten Bosch,Cristian Tejedor-García
Keywords (EN): monitor disease, detect and monitor, Speech recordings, privacy concerns, medical analysis purposes
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Sound (cs.SD)
Comments: Accepted and published at SPSC Symposium 2024 4th Symposium on Security and Privacy in Speech Communication. Interspeech 2024

Abstract:Speech recordings are being more frequently used to detect and monitor disease, leading to privacy concerns. Beyond cryptography, protection of speech can be addressed by approaches, such as perturbation, disentanglement, and re-synthesis, that eliminate sensitive information of the speaker, leaving the information necessary for medical analysis purposes. In order for such privacy protective approaches to be developed, clear and systematic specifications of assumptions concerning medical settings and the needs of medical professionals are necessary. In this paper, we propose a Scenario of Use Scheme that incorporates an Attacker Model, which characterizes the adversary against whom the speaker’s privacy must be defended, and a Protector Model, which specifies the defense. We discuss the connection of the scheme with previous work on speech privacy. Finally, we present a concrete example of a specified Scenario of Use and a set of experiments about protecting speaker data against gender inference attacks while maintaining utility for Parkinson’s detection.

[AI-146] Grounded Computation & Consciousness: A Framework for Exploring Consciousness in Machines & Other Organisms

Link: https://arxiv.org/abs/2409.16036
Authors: Ryan Williams
Keywords (EN): critical tool, tool for understanding, understanding consciousness, consciousness, Computational modeling
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Computational modeling is a critical tool for understanding consciousness, but is it enough on its own? This paper discusses the necessity for an ontological basis of consciousness, and introduces a formal framework for grounding computational descriptions into an ontological substrate. Utilizing this technique, a method is demonstrated for estimating the difference in qualitative experience between two systems. This framework has wide applicability to computational theories of consciousness.

[AI-147] Deep chroma compression of tone-mapped images

Link: https://arxiv.org/abs/2409.16032
Authors: Xenios Milidonis,Francesco Banterle,Alessandro Artusi
Keywords (EN): high dynamic range, Acquisition of high, high-quality output, high dynamic, thriving due
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Acquisition of high dynamic range (HDR) images is thriving due to the increasing use of smart devices and the demand for high-quality output. Extensive research has focused on developing methods for reducing the luminance range in HDR images using conventional and deep learning-based tone mapping operators to enable accurate reproduction on conventional 8 and 10-bit digital displays. However, these methods often fail to account for pixels that may lie outside the target display’s gamut, resulting in visible chromatic distortions or color clipping artifacts. Previous studies suggested that a gamut management step ensures that all pixels remain within the target gamut. However, such approaches are computationally expensive and cannot be deployed on devices with limited computational resources. We propose a generative adversarial network for fast and reliable chroma compression of HDR tone-mapped images. We design a loss function that considers the hue property of generated images to improve color accuracy, and train the model on an extensive image dataset. Quantitative experiments demonstrate that the proposed model outperforms state-of-the-art image generation and enhancement networks in color accuracy, while a subjective study suggests that the generated images are on par or superior to those produced by conventional chroma compression methods in terms of visual quality. Additionally, the model achieves real-time performance, showing promising results for deployment on devices with limited computational resources.

[AI-148] Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

Link: https://arxiv.org/abs/2409.15869
Authors: Yael Segal-Feldman,Aviv Shamsian,Aviv Navon,Gill Hetz,Joseph Keshet
Keywords (EN): transcription and translation, Large transformer-based models, speech transcription, Large transformer-based, Word Error Rate
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Under Review

Abstract:Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges, as these large and computationally intensive models lead to slow inference speeds. Various optimization strategies have been proposed to improve performance, including efficient hardware utilization and algorithmic enhancements. In this paper, we introduce Whisper-Medusa, a novel approach designed to enhance processing speed with minimal impact on Word Error Rate (WER). The proposed model extends the OpenAI’s Whisper architecture by predicting multiple tokens per iteration, resulting in a 50% reduction in latency. We showcase the effectiveness of Whisper-Medusa across different learning setups and datasets.

[AI-149] Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection

Link: https://arxiv.org/abs/2409.15844
Authors: Matteo Zecchin,Osvaldo Simeone
Keywords (EN): finite-sample statistical guarantees, introduce adaptive, efficient hyperparameter selection, hyperparameter selection procedure, sequential data-dependent MHT
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and hyperparameter tuning for engineering systems, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds.
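
To give a feel for what "sequential testing with early termination by leveraging e-processes" can look like, here is a minimal sketch for a single candidate hyperparameter. It is not the aLTT procedure from the paper: it assumes losses bounded in [0, 1], uses a fixed betting fraction, and leaves out the multiple-hypothesis part entirely. The function name `sequential_risk_test` and its parameters are hypothetical. The null hypothesis is that the population risk is at least `alpha`; a wealth process is multiplied by `1 + bet * (alpha - loss)` each round, which has expectation at most 1 under the null, so by Ville's inequality it reaches `1 / delta` with probability at most `delta`.

```python
def sequential_risk_test(losses, alpha, delta, bet=0.5):
    """Sequentially test H0: E[loss] >= alpha for losses in [0, 1].

    Returns (rejected, rounds_used). Rejecting H0 certifies, with error
    probability at most delta, that the risk is below alpha.
    Assumes 0 < alpha < 1 and 0 < delta < 1.
    """
    # The bet must keep every factor non-negative, even when loss == 1.
    if not 0.0 <= bet <= 1.0 / (1.0 - alpha):
        raise ValueError("bet must lie in [0, 1 / (1 - alpha)]")
    wealth = 1.0
    rounds = 0
    for loss in losses:
        rounds += 1
        # Wealth grows when the observed loss is below the target alpha.
        wealth *= 1.0 + bet * (alpha - loss)
        if wealth >= 1.0 / delta:
            return True, rounds  # early termination: enough evidence
    return False, rounds


good = sequential_risk_test([0.0] * 100, alpha=0.2, delta=0.1)
bad = sequential_risk_test([1.0] * 10, alpha=0.2, delta=0.1)
print(good, bad)
```

With a stream of zero losses the test stops well before all 100 rounds are consumed, which is the kind of saving in testing rounds the abstract refers to; with a stream of maximal losses the wealth shrinks and the hypothesis is never rejected.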

[AI-150] dnaGrinder: a lightweight and high-capacity genomic foundation model

Link: https://arxiv.org/abs/2409.15697
Authors: Qihang Zhao,Chi Zhang,Weixiong Zhang
Keywords (EN): complex information encoded, genomic sequences remains, task of understanding, understanding and interpreting, interpreting the complex
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:

Abstract:The task of understanding and interpreting the complex information encoded within genomic sequences remains a grand challenge in biological research and clinical applications. In this context, recent advancements in large language model research have led to the development of both encoder-only and decoder-only foundation models designed to decode intricate information in DNA sequences. However, several issues persist, particularly regarding the efficient management of long-range dependencies inherent in genomic sequences, the effective representation of nucleotide variations, and the considerable computational costs associated with large model architectures and extensive pretraining datasets. Current genomic foundation models often face a critical tradeoff: smaller models with mediocre performance versus large models with improved performance. To address these challenges, we introduce dnaGrinder, a unique and efficient genomic foundation model. dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance. It achieves results that are not just comparable but often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. Furthermore, dnaGrinder is designed for easy fine-tuning on workstation-grade GPUs, accommodating input lengths exceeding 17,000 tokens. On a single high-performance GPU, it supports sequences longer than 140,000 tokens, making it a highly efficient and accessible tool for both basic biological research and clinical applications.

[AI-151] Safe Guard: an LLM-agent for Real-time Voice-based Hate Speech Detection in Social Virtual Reality

Link: https://arxiv.org/abs/2409.15623
Authors: Yiwen Xu,Qinyang Hou,Hongyu Wan,Mirjana Prpa
Keywords (EN): present Safe Guard, Safe Guard, present Safe, voice-based interactions, system leverages Open
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:In this paper, we present Safe Guard, an LLM-agent for the detection of hate speech in voice-based interactions in social VR (VRChat). Our system leverages Open AI GPT and audio feature extraction for real-time voice interactions. We contribute a system design and evaluation of the system that demonstrates the capability of our approach in detecting hate speech, and reducing false positives compared to currently available approaches. Our results indicate the potential of LLM-based agents in creating safer virtual environments and set the groundwork for further advancements in LLM-driven moderation approaches.

[AI-152] TFT-multi: simultaneous forecasting of vital sign trajectories in the ICU

Link: https://arxiv.org/abs/2409.15586
Authors: Rosemary Y. He,Jeff N. Chiang
Keywords (EN): Trajectory forecasting, computational methods, important area, area of research, research in precision
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Trajectory forecasting in healthcare data has been an important area of research in precision care and clinical integration for computational methods. In recent years, generative AI models have demonstrated promising results in capturing short and long range dependencies in time series data. While these models have also been applied in healthcare, most of them only predict one value at a time, which is unrealistic in a clinical setting where multiple measures are taken at once. In this work, we extend the framework temporal fusion transformer (TFT), a multi-horizon time series prediction tool, and propose TFT-multi, an end-to-end framework that can predict multiple vital trajectories simultaneously. We apply TFT-multi to forecast 5 vital signs recorded in the intensive care unit: blood pressure, pulse, SpO2, temperature and respiratory rate. We hypothesize that by jointly predicting these measures, which are often correlated with one another, we can make more accurate predictions, especially in variables with large missingness. We validate our model on the public MIMIC dataset and an independent institutional dataset, and demonstrate that this approach outperforms state-of-the-art univariate prediction tools including the original TFT and Prophet, as well as vector regression modeling for multivariate prediction. Furthermore, we perform a study case analysis by applying our pipeline to forecast blood pressure changes in response to actual and hypothetical pressor administration.

[AI-153] Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction

Link: https://arxiv.org/abs/2409.15551
Authors: Yuanchao Li,Yuan Gong,Chao-Han Huck Yang,Peter Bell,Catherine Lai
Keywords (EN): Large Language Models, reliability remain questionable, Annotating and recognizing, advancement of Large, Language Models
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Abstract:Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. Subsequently, we examine the effectiveness of LLM-based prompting on Automatic Speech Recognition (ASR) transcription, contrasting it with ground-truth transcription. Furthermore, we propose a Revise-Reason-Recognize prompting pipeline for robust LLM-based emotion recognition from spoken language with ASR errors. Additionally, experiments on context-aware learning, in-context learning, and instruction tuning are performed to examine the usefulness of LLM training schemes in this direction. Finally, we investigate the sensitivity of LLMs to minor prompt variations. Experimental results demonstrate the efficacy of the emotion-specific prompts, ASR error correction, and LLM training schemes for LLM-based emotion recognition. Our study aims to refine the use of LLMs in emotion recognition and related domains.

[AI-154] Computational Pathology for Accurate Prediction of Breast Cancer Recurrence: Development and Validation of a Deep Learning-based Tool

Link: https://arxiv.org/abs/2409.15491
Authors: Ziyu Su,Yongxin Guo,Robert Wesolowski,Gary Tozbikian,Nathaniel S. O’Connell,M. Khalid Khan Niazi,Metin N. Gurcan
Keywords (EN): Accurate recurrence risk, Accurate recurrence, optimizing treatment plans, stratification is crucial, crucial for optimizing
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Accurate recurrence risk stratification is crucial for optimizing treatment plans for breast cancer patients. Current prognostic tools like Oncotype DX (ODX) offer valuable genomic insights for HR+/HER2- patients but are limited by cost and accessibility, particularly in underserved populations. In this study, we present Deep-BCR-Auto, a deep learning-based computational pathology approach that predicts breast cancer recurrence risk from routine H&E-stained whole slide images (WSIs). Our methodology was validated on two independent cohorts: the TCGA-BRCA dataset and an in-house dataset from The Ohio State University (OSU). Deep-BCR-Auto demonstrated robust performance in stratifying patients into low- and high-recurrence risk categories. On the TCGA-BRCA dataset, the model achieved an area under the receiver operating characteristic curve (AUROC) of 0.827, significantly outperforming existing weakly supervised models (p=0.041). In the independent OSU dataset, Deep-BCR-Auto maintained strong generalizability, achieving an AUROC of 0.832, along with 82.0% accuracy, 85.0% specificity, and 67.7% sensitivity. These findings highlight the potential of computational pathology as a cost-effective alternative for recurrence risk assessment, broadening access to personalized treatment strategies. This study underscores the clinical utility of integrating deep learning-based computational pathology into routine pathological assessment for breast cancer prognosis across diverse clinical settings.
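
The metrics quoted in this abstract (AUROC, accuracy, specificity, sensitivity) can be computed from per-patient risk scores roughly as follows. This is a minimal sketch on hypothetical toy data; the function names, the 0.5 threshold, and the scores are assumptions for illustration and are not taken from the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive case outscores a random negative one.

    Ties count as half a win (the Mann-Whitney U formulation of AUROC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(labels, scores, threshold=0.5):
    """Accuracy, sensitivity and specificity after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),  # fraction of high-risk cases detected
        "specificity": tn / (tn + fp),  # fraction of low-risk cases cleared
    }


# Hypothetical toy data: 1 = high recurrence risk, 0 = low recurrence risk.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))
print(threshold_metrics(labels, scores))
```

AUROC is threshold-free, while the other three numbers depend on the chosen operating point, which is why a model can report a high AUROC alongside a comparatively low sensitivity.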

[AI-155] Toward Automated Clinical Transcriptions

Link: https://arxiv.org/abs/2409.15378
Authors: Mitchell A. Klusty, W. Vaiden Logan, Samuel E. Armstrong, Aaron D. Mullen, Caroline N. Leach, Jeff Talbert, V. K. Cody Bumgardner
Keywords: including physician burnout, rising healthcare costs, Administrative documentation, adverse outcomes, including physician
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Notes: 7 pages, 6 figures

Abstract:Administrative documentation is a major driver of rising healthcare costs and is linked to adverse outcomes, including physician burnout and diminished quality of care. This paper introduces a secure system that applies recent advancements in speech-to-text transcription and speaker-labeling (diarization) to patient-provider conversations. This system is optimized to produce accurate transcriptions and highlight potential errors to promote rapid human verification, further reducing the necessary manual effort. Applied to over 40 hours of simulated conversations, this system offers a promising foundation for automating clinical transcriptions.

[AI-156] Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data

Link: https://arxiv.org/abs/2409.15374
Authors: Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
Keywords: Autism Spectrum Disorder, Spectrum Disorder, ASD, Autism Spectrum, Early diagnosis
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Early diagnosis and intervention for Autism Spectrum Disorder (ASD) have been shown to significantly improve the quality of life of autistic individuals. However, diagnostic methods for ASD rely on assessments based on clinical presentation that are prone to bias and can make early diagnosis challenging. There is a need for objective biomarkers of ASD which can help improve diagnostic accuracy. Deep learning (DL) has achieved outstanding performance in diagnosing diseases and conditions from medical imaging data. Extensive research has been conducted on creating models that classify ASD using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, existing models lack interpretability. This research aims to improve the accuracy and interpretability of ASD diagnosis by creating a DL model that can not only accurately classify ASD but also provide explainable insights into its workings. The dataset used is a preprocessed version of the Autism Brain Imaging Data Exchange (ABIDE) with 884 samples. Our findings show a model that can accurately classify ASD and highlight critical brain regions differing between ASD and typical controls, with potential implications for early diagnosis and understanding of the neural basis of ASD. These findings are validated by studies in the literature that use different datasets and modalities, confirming that the model actually learned characteristics of ASD and not just the dataset. This study advances the field of explainable AI in medical imaging by providing a robust and interpretable model, thereby contributing to a future with objective and reliable ASD diagnostics.

Computer Vision

[CV-0] Self-Supervised Any-Point Tracking by Contrastive Random Walks ECCV2024

Link: https://arxiv.org/abs/2409.16288
Authors: Ayush Shrivastava, Andrew Owens
Keywords: present a simple, global matching, global matching transformer, contrastive random walks, TAP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ECCV 2024. Project link: this https URL . Code: this https URL

Abstract:We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks, using the transformer’s attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform “all pairs” comparisons between points allows the model to obtain high spatial precision and to obtain a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.
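
The cycle-consistency objective behind contrastive random walks can be sketched as follows. This is an illustrative toy only: plain dot-product affinities with a softmax stand in for the transformer's attention-based global matching, and the names `transition`, `cycle_loss`, and the temperature value are assumptions, not the authors' implementation.

```python
import math


def softmax_rows(matrix):
    """Row-wise softmax, turning affinities into transition probabilities."""
    out = []
    for row in matrix:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transition(feats_a, feats_b, temperature=0.1):
    """Transition matrix from the points of one frame to the points of another."""
    affinity = [
        [sum(x * y for x, y in zip(fa, fb)) / temperature for fb in feats_b]
        for fa in feats_a
    ]
    return softmax_rows(affinity)


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cycle_loss(frames, temperature=0.1):
    """Walk forward through the frames and back, then score the return probability."""
    palindrome = frames + frames[-2::-1]  # e.g. [f0, f1, f2, f1, f0]
    walk = transition(palindrome[0], palindrome[1], temperature)
    for a, b in zip(palindrome[1:], palindrome[2:]):
        walk = matmul(walk, transition(a, b, temperature))
    # Cross-entropy against the identity: each point should return to itself.
    return -sum(math.log(walk[i][i]) for i in range(len(walk))) / len(walk)


# Two frames with two distinguishable points each (hypothetical features).
distinct = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
print(cycle_loss(distinct))
```

When the points are indistinguishable, the walk returns to its start only by chance and the loss rises to log 2 for two points, which is the signal that pushes the features apart during training.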

[CV-1] Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Link: https://arxiv.org/abs/2409.16283
Authors: Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
Keywords: manipulation policies generalize, policies generalize, video, human video generation, video generation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Notes: Preprint. Under Review

Abstract:How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at this https URL

[CV-2] MonoFormer: One Transformer for Both Diffusion and Autoregression

Link: https://arxiv.org/abs/2409.16280
Authors: Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
Keywords: diffusion-based continuous visual, autoregression-based discrete text, existing multimodality methods, separate backbones, existing multimodality
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) Transformer is successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, and the difference merely lies in that diffusion uses bidirectional attention mask and autoregression uses causal attention mask. Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods as well as maintains the text generation capability. The project is publicly available at this https URL.
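
The difference between the two attention masks named in this abstract can be illustrated with a small sketch. The helper names are hypothetical, and the code only shows the masking idea, not the MonoFormer architecture itself.

```python
import math


def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) or not causal for j in range(seq_len)] for i in range(seq_len)]


def masked_attention_weights(scores, mask):
    """Softmax over each row of scores, restricted to the allowed positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        m = max(kept)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 3 for _ in range(3)]  # uniform toy scores

# Autoregression: each token sees only itself and earlier tokens.
print(masked_attention_weights(scores, attention_mask(3, causal=True)))
# Diffusion: every token sees every other token.
print(masked_attention_weights(scores, attention_mask(3, causal=False)))
```

Because only the mask changes, the same transformer weights can in principle serve both modes, which is the feasibility argument the abstract makes.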

[CV-3] Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation

Link: https://arxiv.org/abs/2409.16278
Authors: Yong Xien Chng, Xuchong Qiu, Yizeng Han, Kai Ding, Wan Ding, Gao Huang
Keywords: emerging task aiming, Open-vocabulary panoptic segmentation, semantically meaningful masks, Open-vocabulary panoptic, emerging task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 6 figures

Abstract:Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks based on a set of texts. Despite existing efforts, it remains challenging to develop a high-performing method that generalizes effectively across new domains and requires minimal training resources. Our in-depth analysis of current methods reveals a crucial insight: mask classification is the main performance bottleneck for open-vocab. panoptic segmentation. Based on this, we propose Semantic Refocused Tuning (SMART), a novel framework that greatly enhances open-vocab. panoptic segmentation by improving mask classification through two key innovations. First, SMART adopts a multimodal Semantic-guided Mask Attention mechanism that injects task-awareness into the regional information extraction process. This enables the model to capture task-specific and contextually relevant information for more effective mask classification. Second, it incorporates Query Projection Tuning, which strategically fine-tunes the query projection layers within the Vision Language Model (VLM) used for mask classification. This adjustment allows the model to adapt the image focus of mask tokens to new distributions with minimal training resources, while preserving the VLM’s pre-trained knowledge. Extensive ablation studies confirm the superiority of our approach. Notably, SMART sets new state-of-the-art results, demonstrating improvements of up to +1.3 PQ and +5.4 mIoU across representative benchmarks, while reducing training costs by nearly 10x compared to the previous best method. Our code and data will be released.

[CV-4] AIM 2024 Challenge on UHD Blind Photo Quality Assessment ECCV2024

Link: https://arxiv.org/abs/2409.16271
Authors: Vlad Hosu, Marcos V. Conde, Lorenzo Agnolucci, Nabajeet Barman, Saman Zadtootaghaj, Radu Timofte
Keywords: Image Quality Assessment, UHD-IQA Benchmark Database, Quality Assessment, No-Reference Image Quality, task for modern
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ECCV 2024 - Advances in Image Manipulation (AIM). arXiv admin note: text overlap with arXiv:2401.10511 by other authors

Abstract:We introduce the AIM 2024 UHD-IQA Challenge, a competition to advance the No-Reference Image Quality Assessment (NR-IQA) task for modern, high-resolution photos. The challenge is based on the recently released UHD-IQA Benchmark Database, which comprises 6,073 UHD-1 (4K) images annotated with perceptual quality ratings from expert raters. Unlike previous NR-IQA datasets, UHD-IQA focuses on highly aesthetic photos of superior technical quality, reflecting the ever-increasing standards of digital photography. This challenge aims to develop efficient and effective NR-IQA models. Participants are tasked with creating novel architectures and training strategies to achieve high predictive performance on UHD-1 images within a computational budget of 50G MACs. This enables model deployment on edge devices and scalable processing of extensive image collections. Winners are determined based on a combination of performance metrics, including correlation measures (SRCC, PLCC, KRCC), absolute error metrics (MAE, RMSE), and computational efficiency (G MACs). To excel in this challenge, participants leverage techniques like knowledge distillation, low-precision inference, and multi-scale training. By pushing the boundaries of NR-IQA for high-resolution photos, the UHD-IQA Challenge aims to stimulate the development of practical models that can keep pace with the rapidly evolving landscape of digital photography. The innovative solutions emerging from this competition will have implications for various applications, from photo curation and enhancement to image compression.
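
The correlation and error metrics listed in this abstract can be computed from predicted scores and ground-truth quality ratings as in the sketch below. The function names are illustrative, `krcc` is the simple tau-a variant without tie correction (an assumption; the challenge may use a tie-corrected variant), and the toy scores are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def krcc(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def mae(x, y):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


predicted = [1.0, 2.0, 3.0, 4.0]  # hypothetical model outputs
ratings = [2.0, 4.0, 6.0, 8.0]    # hypothetical quality ratings
print(srcc(predicted, ratings), plcc(predicted, ratings), krcc(predicted, ratings))
print(mae(predicted, ratings), rmse(predicted, ratings))
```

The toy data shows why both families of metrics are reported: the predictions are perfectly correlated with the ratings yet still carry a nonzero absolute error because they sit on a different scale.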

[CV-5] CDChat: A Large Multimodal Model for Remote Sensing Change Description

Link: https://arxiv.org/abs/2409.16261
Authors: Mubashir Noman, Noor Ahsan, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan
Keywords: Large multimodal models, Large multimodal, natural image domain, visual instruction tuning, shown encouraging performance
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large multimodal models (LMMs) have shown encouraging performance in the natural image domain using visual instruction tuning. However, these LMMs struggle to describe the content of remote sensing (RS) images for tasks such as image or region grounding, classification, etc. Recently, GeoChat made an effort to describe the contents of RS images. Although GeoChat achieves promising performance on various RS tasks, it struggles to describe the changes between bi-temporal RS images, which is a key RS task. This necessitates the development of an LMM that can describe the changes between bi-temporal RS images. However, datasets that can be used to tune LMMs for this purpose are scarce. To address this, we introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images. Furthermore, we show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieve better performance.

[CV-6] Fields of The World: A Machine Learning Benchmark Dataset For Global Agricultural Field Boundary Segmentation

Link: https://arxiv.org/abs/2409.16252
Authors: Hannah Kerner, Snehal Chaudhari, Aninda Ghosh, Caleb Robinson, Adeel Ahmad, Eddie Choi, Nathan Jacobs, Chris Holmes, Matthias Mohr, Rahul Dodhia, Juan M. Lavista Ferres, Jennifer Marcus
Keywords: Crop field boundaries, Crop field, collect manually, monitoring and assessments, expensive to collect
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Crop field boundaries are foundational datasets for agricultural monitoring and assessments but are expensive to collect manually. Machine learning (ML) methods for automatically extracting field boundaries from remotely sensed images could help realize the demand for these datasets at a global scale. However, current ML methods for field instance segmentation lack sufficient geographic coverage, accuracy, and generalization capabilities. Further, research on improving ML methods is restricted by the lack of labeled datasets representing the diversity of global agricultural fields. We present Fields of The World (FTW) – a novel ML benchmark dataset for agricultural field instance segmentation spanning 24 countries on four continents (Europe, Africa, Asia, and South America). FTW is an order of magnitude larger than previous datasets with 70,462 samples, each containing instance and semantic segmentation masks paired with multi-date, multi-spectral Sentinel-2 satellite images. We provide results from baseline models for the new FTW benchmark, show that models trained on FTW have better zero-shot and fine-tuning performance in held-out countries than models that aren’t pre-trained with diverse datasets, and show positive qualitative zero-shot results of FTW models in a real-world scenario – running on Sentinel-2 scenes over Ethiopia.

[CV-7] Label-Augmented Dataset Distillation

Link: https://arxiv.org/abs/2409.16239
Authors: Seoungyoon Kang, Youngsun Lim, Hyunjung Shim
Keywords: Traditional dataset distillation, distillation primarily focuses, Traditional dataset, dataset distillation, primarily focuses
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Traditional dataset distillation primarily focuses on image representation while often overlooking the important role of labels. In this study, we introduce Label-Augmented Dataset Distillation (LADD), a new dataset distillation framework enhancing dataset distillation with label augmentations. LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics. These dense labels require only a 2.5% increase in storage (ImageNet subsets) with significant performance benefits, providing strong learning signals. Our label generation strategy can complement existing dataset distillation methods for significantly enhancing their training efficiency and performance. Experimental results demonstrate that LADD outperforms existing methods in terms of computational overhead and accuracy. With three high-performance dataset distillation algorithms, LADD achieves remarkable gains by an average of 14.9% in accuracy. Furthermore, the effectiveness of our method is proven across various datasets, distillation hyperparameters, and algorithms. Finally, our method improves the cross-architecture robustness of the distilled dataset, which is important in the application scenario.

[CV-8] VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection ACCV2024

Link: https://arxiv.org/abs/2409.16225
Authors: Sunghyun Ahn, Youngwan Jo, Kijung Lee, Sanghyun Park
Keywords: Video anomaly detection, anomaly detection, computer vision, analysis and surveillance, surveillance within computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ACCV 2024

Abstract:Video anomaly detection (VAD) is a crucial task in video analysis and surveillance within computer vision. Currently, VAD is gaining attention with memory techniques that store the features of normal frames. The stored features are utilized for frame reconstruction, identifying an abnormality when a significant difference exists between the reconstructed and input frames. However, this approach faces several challenges due to the simultaneous optimization required for both the memory and encoder-decoder model. These challenges include increased optimization difficulty, complexity of implementation, and performance variability depending on the memory size. To address these challenges, we propose an effective memory method for VAD, called VideoPatchCore. Inspired by PatchCore, our approach introduces a structure that prioritizes memory optimization and configures three types of memory tailored to the characteristics of video data. This method effectively addresses the limitations of existing memory-based methods, achieving good performance comparable to state-of-the-art methods. Furthermore, our method requires no training and is straightforward to implement, making VAD tasks more accessible. Our code is available online at this http URL.
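
A training-free memory bank in the PatchCore style can be sketched as below. This is a generic illustration, assuming greedy farthest-point subsampling and nearest-neighbour scoring; it does not reproduce the three video-specific memories described in the abstract, and the feature vectors are hypothetical.

```python
import math


def greedy_coreset(features, k):
    """Farthest-point sampling: keep k features that cover the normal data well."""
    selected = [0]  # start from the first feature (an arbitrary choice)
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < k:
        # Pick the feature farthest from everything already in the memory.
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [
            min(d, math.dist(f, features[nxt])) for d, f in zip(min_dist, features)
        ]
    return [features[i] for i in selected]


def anomaly_score(query, memory):
    """Distance from a test feature to its nearest neighbour in the memory bank."""
    return min(math.dist(query, m) for m in memory)


# Hypothetical 2-D features extracted from normal frames.
normal = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
memory = greedy_coreset(normal, k=2)

print(anomaly_score((0.0, 0.2), memory))  # close to stored normality
print(anomaly_score((2.5, 2.5), memory))  # far from anything stored
```

Nothing here is optimized by gradient descent: the memory is built once from normal data, which is what makes this family of methods training-free.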

[CV-9] Fine-Tuning is Fine if Calibrated

Link: https://arxiv.org/abs/2409.16223
Authors: Zheda Mai, Arpita Chowdhury, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, Wei-Lun Chao
Keywords: losing valuable knowledge, fine-tuned model, model, classes, pre-trained model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: The first three authors contribute equally

Abstract:Fine-tuning is arguably the most straightforward way to tailor a pre-trained model (e.g., a foundation model) to downstream applications, but it also comes with the risk of losing valuable knowledge the model had learned in pre-training. For example, fine-tuning a pre-trained classifier capable of recognizing a large number of classes to master a subset of classes at hand is shown to drastically degrade the model’s accuracy in the other classes it had previously learned. As such, it is hard to further use the fine-tuned model when it encounters classes beyond the fine-tuning data. In this paper, we systematically dissect the issue, aiming to answer the fundamental question, “What has been damaged in the fine-tuned model?” To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes. Instead, the fine-tuned model often produces more discriminative features for these other classes, even if they were missing during fine-tuning! What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes, implying that a simple post-processing calibration would bring back the pre-trained model’s capability and at the same time unveil the feature improvement over all classes. We conduct an extensive empirical study to demonstrate the robustness of our findings and provide preliminary explanations underlying them, suggesting new directions for future theoretical analysis. Our code is available at this https URL.
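
One simple form the post-processing calibration could take is a constant offset added to the logits of the classes that were absent during fine-tuning. The abstract does not spell out the exact procedure, so the offset `gamma`, the function names, and the toy logits below are all assumptions for illustration.

```python
def calibrate(logits, absent_classes, gamma):
    """Add a constant offset gamma to the logits of classes absent from fine-tuning."""
    return [z + gamma if c in absent_classes else z for c, z in enumerate(logits)]


def predict(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda c: logits[c])


# Hypothetical logits for a sample whose true class (2) was absent from fine-tuning.
# Classes 0 and 1 were fine-tuned, so their logits sit on a larger scale.
logits = [5.0, 1.0, 3.5]

print(predict(logits))                             # biased toward a fine-tuned class
print(predict(calibrate(logits, {2}, gamma=2.0)))  # offset restores the absent class
```

In practice `gamma` would have to be chosen on held-out data; the point of the sketch is only that the features can be left untouched while the decision rule is corrected.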

[CV-10] Tiny Robotics Dataset and Benchmark for Continual Object Detection

Link: https://arxiv.org/abs/2409.16215
Authors: Francesco Pasti, Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, Nicola Bellotto
Keywords: Detecting objects, numerous applications, navigation to inspection, autonomous navigation, object detection systems
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: Paper under review

Abstract:Detecting objects in mobile robotics is crucial for numerous applications, from autonomous navigation to inspection. However, robots are often required to perform tasks in different domains with respect to the training one and need to adapt to these changes. Tiny mobile robots, subject to size, power, and computational constraints, encounter even more difficulties in running and adapting these algorithms. Such adaptability, though, is crucial for real-world deployment, where robots must operate effectively in dynamic and unpredictable settings. In this work, we introduce a novel benchmark to evaluate the continual learning capabilities of object detection systems in tiny robotic platforms. Our contributions include: (i) Tiny Robotics Object Detection (TiROD), a comprehensive dataset collected using a small mobile robot, designed to test the adaptability of object detectors across various domains and classes; (ii) an evaluation of state-of-the-art real-time object detectors combined with different continual learning strategies on this dataset, providing detailed insights into their performance and limitations; and (iii) we publish the data and the code to replicate the results to foster continuous advancements in this field. Our benchmark results indicate key challenges that must be addressed to advance the development of robust and efficient object detection systems for tiny robotics.

[CV-11] Deep Learning for Precision Agriculture: Post-Spraying Evaluation and Deposition Estimation

Link: https://arxiv.org/abs/2409.16213
Authors: Harry Rogers, Tahmina Zebin, Grzegorz Cielniak, Beatriz De La Iglesia, Ben Magri
Keywords: Precision spraying, eXplainable Artificial Intelligence, requires automation primarily, Precision spraying evaluation, spraying evaluation requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Precision spraying evaluation requires automation primarily in post-spraying imagery. In this paper we propose an eXplainable Artificial Intelligence (XAI) computer vision pipeline to evaluate a precision spraying system post-spraying without the need for traditional agricultural methods. The developed system can semantically segment potential targets such as lettuce, chickweed, and meadowgrass and correctly identify if targets have been sprayed. Furthermore, this pipeline evaluates using a domain-specific Weakly Supervised Deposition Estimation task, allowing for class-specific quantification of spray deposit weights in μL. Estimation of coverage rates of spray deposition in a class-wise manner allows for further understanding of effectiveness of precision spraying systems. Our study evaluates different Class Activation Mapping techniques, namely AblationCAM and ScoreCAM, to determine which is more effective and interpretable for these tasks. In the pipeline, inference-only feature fusion is used to allow for further interpretability and to enable the automation of precision spraying evaluation post-spray. Our findings indicate that a Fully Convolutional Network with an EfficientNet-B0 backbone and inference-only feature fusion achieves an average absolute difference in deposition values of 156.8 μL across three classes in our test set. The dataset curated in this paper is publicly available at this https URL

[CV-12] MaskBit: Embedding-free Image Generation via Bit Tokens

Link: https://arxiv.org/abs/2409.16211
Authors: Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
Keywords: Masked transformer models, Masked transformer, class-conditional image generation, subsequent Transformer model, compelling alternative
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Project page: this https URL

Abstract:Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of mere 305M parameters.
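
The idea of a bit token, a binary quantized representation of a latent vector, can be illustrated with a short sketch. Sign-based quantization is an assumption made here for concreteness (the abstract only says the tokens are binary quantized), and the helper names are hypothetical.

```python
def to_bit_token(latent):
    """Quantize each latent channel to one bit by its sign (1 if positive, else 0)."""
    return [1 if v > 0 else 0 for v in latent]


def bits_to_index(bits):
    """Read the bits as a binary number, most significant bit first."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_bits(index, num_bits):
    """Inverse of bits_to_index for a fixed token width."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


latent = [0.3, -1.2, 0.0, 2.5]  # hypothetical 4-channel latent vector
bits = to_bit_token(latent)
print(bits, bits_to_index(bits))
```

A token with K bits addresses an implicit codebook of 2^K entries, so a generator that consumes the bits directly does not need to look up a learned embedding for each index.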

[CV-13] LLMCount: Enhancing Stationary mmWave Detection with Multimodal-LLM

Link: https://arxiv.org/abs/2409.16209
Authors: Boyan Li, Shengyi Ding, Deen Ma, Yixuan Wu, Hongjie Liao, Kaiyuan Hu
Keywords: Millimeter wave sensing, Millimeter wave, huge application potential, holds huge application, wave sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Millimeter wave sensing provides people with the capability of sensing the surrounding crowds in a non-invasive and privacy-preserving manner, which holds huge application potential. However, detecting stationary crowds remains challenging due to several factors such as minimal movements (like breathing or casual fidgets), which can be easily treated as noise clusters during data collection and consequently filtered in the following processing procedures. Additionally, the uneven distribution of signal power due to signal power attenuation and interferences resulting from external reflectors or absorbers further complicates accurate detection. To address these challenges and enable stationary crowd detection across various application scenarios requiring specialized domain adaption, we introduce LLMCount, the first system to harness the capabilities of large-language models (LLMs) to enhance crowd detection performance. By exploiting the decision-making capability of LLM, we can successfully compensate the signal power to acquire a uniform distribution and thereby achieve a detection with higher accuracy. To assess the system’s performance, comprehensive evaluations are conducted under diversified scenarios like hall, meeting room, and cinema. The evaluation results show that our proposed approach reaches high detection accuracy with lower overall latency compared with previous methods.

[CV-14] Segmentation Strategies in Deep Learning for Prostate Cancer Diagnosis: A Comparative Study of Mamba SAM and YOLO

Link: https://arxiv.org/abs/2409.16205
Authors: Ali Badiezadeh, Amin Malekmohammadi, Seyed Mostafa Mirhassani, Parisa Gifani, Majid Vafaeezadeh
Keywords: prostate cancer histopathology, cancer histopathology images, prostate cancer, cancer histopathology, treatment planning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate segmentation of prostate cancer histopathology images is crucial for diagnosis and treatment planning. This study presents a comparative analysis of three deep learning-based methods, Mamba, SAM, and YOLO, for segmenting prostate cancer histopathology images. We evaluated the performance of these models on two comprehensive datasets, Gleason 2019 and SICAPv2, using Dice score, precision, and recall metrics. Our results show that the High-order Vision Mamba UNet (H-vmunet) model outperforms the other two models, achieving the highest scores across all metrics on both datasets. The H-vmunet model’s advanced architecture, which integrates high-order visual state spaces and 2D-selective-scan operations, enables efficient and sensitive lesion detection across different scales. Our study demonstrates the potential of the H-vmunet model for clinical applications and highlights the importance of robust validation and comparison of deep learning-based methods for medical image analysis. The findings of this study contribute to the development of accurate and reliable computer-aided diagnosis systems for prostate cancer. The code is available at this http URL.
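
The three evaluation metrics named in this abstract (Dice score, precision, recall) can be computed for a pair of binary segmentation masks as follows. The function name and the tiny masks are hypothetical, and the sketch assumes each mask contains at least one foreground pixel so that no denominator is zero.

```python
def segmentation_scores(pred_mask, true_mask):
    """Dice, precision and recall for binary masks given as nested lists of 0/1."""
    pred = [p for row in pred_mask for p in row]
    true = [t for row in true_mask for t in row]
    tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),  # overlap relative to both masks
        "precision": tp / (tp + fp),          # predicted pixels that are correct
        "recall": tp / (tp + fn),             # true pixels that were found
    }


# Hypothetical 2x2 masks: 1 = lesion pixel, 0 = background.
predicted = [[1, 1], [0, 0]]
annotated = [[1, 0], [1, 0]]
print(segmentation_scores(predicted, annotated))
```

Dice is the harmonic mean of precision and recall at the pixel level, so it penalizes both over-segmentation and under-segmentation.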

[CV-15] Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation

Link: https://arxiv.org/abs/2409.16183
Authors: Xiaohong Liu, Guoxing Yang, Yulin Luo, Jiaji Mao, Xiang Zhang, Ming Gao, Shanghang Zhang, Jun Shen, Guangyu Wang
Keywords: vital and complex, complex component, component of modern, Radiology, modern clinical workflow
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Radiology is a vital and complex component of modern clinical workflow and covers many tasks. Recently, vision-language (VL) foundation models in medicine have shown potential in processing multimodal information, offering a unified solution for various radiology tasks. However, existing studies either pre-trained VL models on natural data or did not fully integrate vision-language architecture and pretraining, often neglecting the unique multimodal complexity in radiology images and their textual contexts. Additionally, their practical applicability in real-world scenarios remains underexplored. Here, we present RadFound, a large and open-source vision-language foundation model tailored for radiology, that is trained on the most extensive dataset of over 8.1 million images and 250,000 image-text pairs, covering 19 major organ systems and 10 imaging modalities. To establish expert-level multimodal perception and generation capabilities, RadFound introduces an enhanced vision encoder to capture intra-image local features and inter-image contextual information, and a unified cross-modal learning design tailored to radiology. To fully assess the models’ capability, we construct a benchmark, RadVLBench, including radiology interpretation tasks like medical vision-language question-answering, as well as text generation tasks ranging from captioning to report generation. We also propose a human evaluation framework. When evaluated on the real-world benchmark involving three representative modalities, 2D images (chest X-rays), multi-view images (mammograms), and 3D images (thyroid CT scans), RadFound significantly outperforms other VL foundation models on both quantitative metrics and human evaluation. In summary, the development of RadFound represents an advancement in radiology generalists, demonstrating broad applicability potential for integration into clinical workflows.

[CV-16] SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image

Link: https://arxiv.org/abs/2409.16178
Authors: Dimitrije Antić, Sai Kumar Dwivedi, Shashank Tripathi, Theo Gevers, Dimitrios Tzionas
Keywords: focus on recovering, object pose, shape, images, SDFit
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 2 tables

Abstract:We focus on recovering 3D object pose and shape from single images. This is highly challenging due to strong (self-)occlusions, depth ambiguities, the enormous shape variance, and lack of 3D ground truth for natural images. Recent work relies mostly on learning from finite datasets, so it struggles to generalize, and it focuses mostly on the shape itself, largely ignoring the alignment with pixels. Moreover, it performs feed-forward inference, so it cannot refine estimates. We tackle these limitations with a novel framework, called SDFit. To this end, we make three key observations: (1) Learned signed-distance-function (SDF) models act as a strong morphable shape prior. (2) Foundational models embed 2D images and 3D shapes in a joint space, and (3) also infer rich features from images. SDFit exploits these as follows. First, it uses a category-level morphable SDF (mSDF) model, called DIT, to generate 3D shape hypotheses. This mSDF is initialized by querying OpenShape’s latent space conditioned on the input image. Then, it computes 2D-to-3D correspondences by extracting and matching features from the image and mSDF. Last, it fits the mSDF to the image in a render-and-compare fashion, to iteratively refine estimates. We evaluate SDFit on the Pix3D and Pascal3D+ datasets of real-world images. SDFit performs roughly on par with state-of-the-art learned methods, but, uniquely, requires no re-training. Thus, SDFit is promising for generalizing in the wild, paving the way for future research. Code will be released.

[CV-17] Fine Tuning Text-to-Image Diffusion Models for Correcting Anomalous Images

Link: https://arxiv.org/abs/2409.16174
Authors: Hyunwoo Yoo
Keywords: Stable Diffusion, GANs and VAEs, continuously evolved, image generation models, advent of GANs
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Since the advent of GANs and VAEs, image generation models have continuously evolved, opening up various real-world applications with the introduction of Stable Diffusion and DALL-E models. These text-to-image models can generate high-quality images for fields such as art, design, and advertising. However, they often produce aberrant images for certain prompts. This study proposes a method to mitigate such issues by fine-tuning the Stable Diffusion 3 model using the DreamBooth technique. Experimental results targeting the prompt “lying on the grass/street” demonstrate that the fine-tuned model shows improved performance in visual evaluation and metrics such as Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Frechet Inception Distance (FID). User surveys also indicated a higher preference for the fine-tuned model. This research is expected to make contributions to enhancing the practicality and reliability of text-to-image models.
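
Of the metrics named in this abstract, PSNR is the simplest to state: it is 10·log10(MAX² / MSE), where MSE is the mean squared pixel error and MAX is the peak intensity. The snippet below is a minimal sketch of that formula and is not taken from the paper's evaluation code. The function name `psnr`, the flat-list image representation, and the 8-bit peak of 255 are all assumptions made for illustration.

```python
import math


def psnr(reference, generated, peak=255.0):
    """Peak Signal-to-Noise Ratio (in dB) between two equally sized images.

    Images are passed as flat sequences of pixel intensities. This is an
    assumption for brevity; real pipelines usually operate on arrays.
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher values mean the generated image is closer to the reference. SSIM and FID need considerably more machinery and are usually taken from an existing library rather than written by hand.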

[CV-18] MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Link: https://arxiv.org/abs/2409.16160
Authors: Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo
Keywords: produce realistic videos, aims to produce, produce realistic, animatable characters, arbitrary characters
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. It is a fundamental problem in the computer vision and graphics community. 3D works typically require multi-view captures for per-case training, which severely limits their applicability to modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video into compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded into a canonical identity code, a structured motion code and a full scene code, which are used as control signals for the synthesis process. The design of spatial decomposed modeling enables flexible user control, complex motion expression, as well as 3D-aware synthesis for scene interactions. Experimental results demonstrate the effectiveness and robustness of the proposed method.
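
The depth-based split into three layers can be pictured with a toy example. The sketch below is a hypothetical illustration, not the authors' implementation: the function name `decompose_by_depth`, the flat per-pixel lists, and the specific rule (any non-human pixel closer to the camera than the nearest human pixel counts as floating occlusion) are all assumptions.

```python
def decompose_by_depth(depth, human_mask):
    """Label each pixel as 'human', 'occlusion', or 'scene'.

    depth:      per-pixel depth values (smaller means closer to the camera).
    human_mask: booleans of the same length marking the main human.

    Hypothetical rule: a non-human pixel that is closer than the nearest
    human pixel is treated as a floating occlusion; the rest is scene.
    """
    if len(depth) != len(human_mask):
        raise ValueError("depth and human_mask must have the same length")
    human_depths = [d for d, is_human in zip(depth, human_mask) if is_human]
    # With no human present, nothing can occlude it, so all pixels are scene.
    nearest_human = min(human_depths) if human_depths else float("-inf")
    labels = []
    for d, is_human in zip(depth, human_mask):
        if is_human:
            labels.append("human")
        elif d < nearest_human:
            labels.append("occlusion")
        else:
            labels.append("scene")
    return labels
```

For instance, with depths `[1.0, 2.0, 2.5, 5.0]` and the two middle pixels marked as human, the first pixel is labelled as occlusion and the last as scene.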

[CV-19] ComiCap: A VLMs pipeline for dense captioning of Comic Panels ECCV2024

Link: https://arxiv.org/abs/2409.16159
Authors: Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas
Keywords: development of single, domain is rapidly, rapidly advancing, comic domain, multi-page analysis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024 Workshop (AI for Visual Art), repo: this https URL

Abstract:The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models’ capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test set to fairly evaluate open-source VLMs and select the best captioning model according to our metric. Our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models, without requiring any additional training. Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be available on the project page this https URL.

[CV-20] Efficient Motion Prediction: A Lightweight Accurate Trajectory Prediction Model With Fast Training and Inference Speed IROS2024

Link: https://arxiv.org/abs/2409.16154
Authors: Alexander Prutsch, Horst Bischof, Horst Possegger
Keywords: safe autonomous driving, traffic agents, vehicles can predict, Abstract, motion prediction
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IROS 2024

Abstract:For efficient and safe autonomous driving, it is essential that autonomous vehicles can predict the motion of other traffic agents. While highly accurate, current motion prediction models often impose significant challenges in terms of training resource requirements and deployment on embedded hardware. We propose a new efficient motion prediction model, which achieves highly competitive benchmark results while training for only a few hours on a single GPU. Due to our lightweight architectural choices and the focus on reducing the required training resources, our model can easily be applied to custom datasets. Furthermore, its low inference latency makes it particularly suitable for deployment in autonomous applications with limited computing resources.

[CV-21] MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving

Link: https://arxiv.org/abs/2409.16149
Authors: Xiyang Wang, Shouzheng Qi, Jieyou Zhao, Hangning Zhou, Siyu Zhang, Guoan Wang, Kai Tu, Songlin Guo, Jianbo Zhao, Jian Li, Mu Yang
Keywords: paper introduces MCTrack, performance across KITTI, Waymo datasets, paper introduces, multi-object tracking method
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures

Abstract:This paper introduces MCTrack, a new 3D multi-object tracking method that achieves state-of-the-art (SOTA) performance across the KITTI, nuScenes, and Waymo datasets. Addressing the gap in existing tracking paradigms, which often perform well on specific datasets but lack generalizability, MCTrack offers a unified solution. Additionally, we have standardized the format of perceptual results across various datasets, termed BaseVersion, allowing researchers in the field of multi-object tracking (MOT) to concentrate on core algorithmic development without the undue burden of data preprocessing. Finally, recognizing the limitations of current evaluation metrics, we propose a novel set of metrics that assess motion information output, such as velocity and acceleration, which is crucial for downstream tasks. The source code of the proposed method is available at this https URL

[CV-22] Gaussian Déjà-vu: Creating Controllable 3D Gaussian Head-Avatars with Enhanced Generalization and Personalization Abilities WACV2025

Link: https://arxiv.org/abs/2409.16147
Authors: Peizhi Yan, Rabab Ward, Qiang Tang, Shan Du
Keywords: providing greater flexibility, unlocked significant potential, efficient rendering compared, Gaussian Splatting, Gaussian head avatars
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, Accepted by WACV 2025 in Round 1

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have unlocked significant potential for modeling 3D head avatars, providing greater flexibility than mesh-based methods and more efficient rendering compared to NeRF-based approaches. Despite these advancements, the creation of controllable 3DGS-based head avatars remains time-intensive, often requiring tens of minutes to hours. To expedite this process, we here introduce the "Gaussian Déjà-vu" framework, which first obtains a generalized model of the head avatar and then personalizes the result. The generalized model is trained on large 2D (synthetic and real) image datasets. This model provides a well-initialized 3D Gaussian head that is further refined using a monocular video to achieve the personalized head avatar. For personalizing, we propose learnable expression-aware rectification blendmaps to correct the initial 3D Gaussians, ensuring rapid convergence without the reliance on neural networks. Experiments demonstrate that the proposed method meets its objectives. It outperforms state-of-the-art 3D Gaussian head avatars in terms of photorealistic quality and reduces training time to at least a quarter of that of existing methods, producing the avatar in minutes.

[CV-23] Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment ECCV2024

Link: https://arxiv.org/abs/2409.16145
Authors: Yuxiao Chen, Kai Li, Wentao Bao, Deep Patel, Yu Kong, Martin Renqiang Min, Dimitris N. Metaxas
Keywords: localize temporal boundaries, annotated large-scale training, localize temporal, temporal boundaries, challenging due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024

Abstract:Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for the alignment noise, i.e., narrations irrelevant to the instructional task in videos and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding by 5.9%, 3.1%, and 2.8%, respectively.
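
The fusion of several alignment pathways into pseudo step-video matches can be sketched in a few lines. This is an illustrative guess at the general shape of such a step, not the MPTVA implementation: the function names, the element-wise mean as the fusion rule, and the 0.5 threshold are assumptions.

```python
def fuse_pathways(score_matrices):
    """Element-wise mean of several step-by-segment alignment score matrices.

    Each matrix is a list of rows (one row per LLM-step, one column per
    video segment). Mean fusion is an assumption made for this sketch.
    """
    n = len(score_matrices)
    rows = len(score_matrices[0])
    cols = len(score_matrices[0][0])
    return [
        [sum(m[i][j] for m in score_matrices) / n for j in range(cols)]
        for i in range(rows)
    ]


def pseudo_matches(fused, threshold=0.5):
    """For each step, return the best-scoring segment index, or None when
    the best fused score falls below the (hypothetical) threshold."""
    matches = []
    for row in fused:
        best = max(range(len(row)), key=row.__getitem__)
        matches.append(best if row[best] >= threshold else None)
    return matches
```

A step that only one pathway scores highly ends up with a low fused score and is left unmatched, which is the intuition behind combining pathways to suppress noisy alignments.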

[CV-24] Seeing Faces in Things: A Model and Dataset for Pareidolia

Link: https://arxiv.org/abs/2409.16143
Authors: Mark Hamilton, Simon Stent, Vasha DuTell, Anne Harrington, Jennifer Corbett, Ruth Rosenholtz, William T. Freeman
Keywords: human visual system, shapes and sizes, visual system, system is well-tuned, faces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset of "Faces in Things", consisting of five thousand web images with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We find that the evolutionary need for humans to detect animal faces, as well as human faces, may explain some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia. Dataset and Website: this https URL

[CV-25] HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2409.16136
Authors: Yuqi Ma, Mengyin Liu, Chao Zhu, Xu-Cheng Yin
Keywords: Large Multi-modal Models, Large Multi-modal, extensive training data, Open-vocabulary object detection, OVD models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and large number of parameters. Mainstream OVD models prioritize objects' coarse-grained categories over their fine-grained attributes, e.g., colors or materials, and thus fail to identify objects specified with certain attributes. However, OVD models are pretrained on large-scale image-text pairs with rich attribute words, whose latent feature space can represent the global text feature as a linear composition of fine-grained attribute tokens without highlighting them. Therefore, we propose in this paper a universal and explicit approach for frozen mainstream OVD models that boosts their attribute-level detection capabilities by highlighting fine-grained attributes in explicit linear space. Firstly, an LLM is leveraged to highlight attribute words within the input text as a zero-shot prompted task. Secondly, by strategically adjusting the token masks, the text encoders of OVD models extract both global text and attribute-specific features, which are then explicitly composited as two vectors in linear space to form the new attribute-highlighted feature for detection tasks, where the corresponding scalars are hand-crafted or learned to reweight the two vectors. Notably, these scalars can be seamlessly transferred among different OVD models, which suggests that such an explicit linear composition is universal. Empirical evaluation on the FG-OVD dataset demonstrates that our proposed method uniformly improves fine-grained attribute-level OVD of various mainstream models and achieves new state-of-the-art performance.
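
The explicit linear composition described in this abstract amounts to a weighted sum of two text feature vectors. The sketch below shows only that arithmetic and makes no claim about the paper's code: the function name `compose_features` and the scalar names `alpha` and `beta` are hypothetical, and plain Python lists stand in for real encoder outputs.

```python
def compose_features(global_feat, attr_feat, alpha=1.0, beta=1.0):
    """Return alpha * global_feat + beta * attr_feat, element by element.

    global_feat: text feature of the full prompt (assumed precomputed).
    attr_feat:   text feature of the attribute words only (assumed precomputed).
    alpha, beta: reweighting scalars, hand-set or learned (names are hypothetical).
    """
    if len(global_feat) != len(attr_feat):
        raise ValueError("feature vectors must have the same dimension")
    return [alpha * g + beta * a for g, a in zip(global_feat, attr_feat)]
```

Raising `beta` relative to `alpha` pushes the composed feature toward the attribute words, which is the sense in which the attributes are highlighted.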

[CV-26] VisioPhysioENet: Multimodal Engagement Detection using Visual and Physiological Signals

Link: https://arxiv.org/abs/2409.16126
Authors: Alakhsimar Singh, Nischay Verma, Kanav Goyal, Amritpal Singh, Puneet Kumar, Xiaobai Li
Keywords: leverages visual cues, paper presents VisioPhysioENet, detect learner engagement, paper presents, detect learner
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 Pages, 2 figures

Abstract:This paper presents VisioPhysioENet, a novel multimodal system that leverages visual cues and physiological signals to detect learner engagement. It employs a two-level approach for visual feature extraction using the Dlib library for facial landmark extraction and the OpenCV library for further estimations. This is complemented by extracting physiological signals using the plane-orthogonal-to-skin method to assess cardiovascular activity. These features are integrated using advanced machine learning classifiers, enhancing the detection of various engagement levels. We rigorously evaluate VisioPhysioENet on the DAiSEE dataset, where it achieves an accuracy of 63.09%, demonstrating a superior ability to discern various levels of engagement compared to existing methodologies. The proposed system’s code can be accessed at this https URL.

[CV-27] CloudTrack: Scalable UAV Tracking with Cloud Semantics

Link: https://arxiv.org/abs/2409.16111
Authors: Yannik Blei, Michael Krawez, Nisarga Nilavadi, Tanja Katharina Kaiser, Wolfram Burgard
Keywords: unmanned aerial vehicles, rescue scenarios, scenarios to gather, gather information, search area
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures

Abstract:Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and rescue scenarios to gather information in the search area. The automatic identification of the person searched for in aerial footage could increase the autonomy of such systems, reduce the search time, and thus increase the missing person’s chances of survival. In this paper, we present a novel approach to perform semantically conditioned open vocabulary object tracking that is specifically designed to cope with the limitations of UAV hardware. Our approach has several advantages: it can run with verbal descriptions of the missing person (e.g., the color of the shirt), it does not require dedicated training to execute the mission, and it can efficiently track a potentially moving person. Our experimental results demonstrate the versatility and efficacy of our approach.

[CV-28] Neuromorphic Drone Detection: an Event-RGB Multimodal Approach ECCV24

Link: https://arxiv.org/abs/2409.16099
Authors: Gabriele Magrini, Federico Becattini, Pietro Pala, Alberto Del Bimbo, Antonio Porta
Keywords: drone detection, recent years, extreme interest, identifying such elements, subject of extreme
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeVi Workshop at ECCV24

Abstract:In recent years, drone detection has quickly become a subject of extreme interest: the potential for fast-moving objects of contained dimensions to be used for malicious intents or even terrorist attacks has drawn attention to the need for precise and resilient systems for detecting and identifying such elements. While extensive literature exists on object detection based on RGB data, it is also critical to recognize the limits of this modality when applied to UAV detection. Detecting drones indeed poses several challenges, such as fast-moving objects and scenes with a high dynamic range or, even worse, scarce illumination levels. Neuromorphic cameras, on the other hand, can retain precise and rich spatio-temporal information in situations that are challenging for RGB cameras. They are resilient to both high-speed moving objects and scarce illumination settings, but are prone to a rapid loss of information when the objects in the scene are static. In this context, we present a novel model for integrating both domains, leveraging multimodal data to take advantage of the best of both worlds. To this end, we also release NeRDD (Neuromorphic-RGB Drone Detection), a novel spatio-temporally synchronized Event-RGB drone detection dataset of more than 3.5 hours of multimodal annotated recordings.

[CV-29] From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing

Link: https://arxiv.org/abs/2409.16089
Authors: Ivan DeAndres-Tame, Muhammad Faisal, Ruben Tolosana, Rouqaiah Al-Refai, Ruben Vera-Rodriguez, Philipp Terhörst
Keywords: achieving high accuracy, Explainable Artificial Intelligence, deep learning, achieving high, advanced significantly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Face Recognition (FR) has advanced significantly with the development of deep learning, achieving high accuracy in several applications. However, the lack of interpretability of these systems raises concerns about their accountability, fairness, and reliability. In the present study, we propose an interactive framework to enhance the explainability of FR models by combining model-agnostic Explainable Artificial Intelligence (XAI) and Natural Language Processing (NLP) techniques. The proposed framework is able to accurately answer various questions of the user through an interactive chatbot. In particular, the explanations generated by our proposed method are in the form of natural language text and visual representations, which for example can describe how different facial regions contribute to the similarity measure between two faces. This is achieved through the automatic analysis of the output’s saliency heatmaps of the face images and a BERT question-answering model, providing users with an interface that facilitates a comprehensive understanding of the FR decisions. The proposed approach is interactive, allowing the users to ask questions to get more precise information based on the user’s background knowledge. More importantly, in contrast to previous studies, our solution does not decrease the face recognition performance. We demonstrate the effectiveness of the method through different experiments, highlighting its potential to make FR systems more interpretable and user-friendly, especially in sensitive applications where decision-making transparency is crucial.

[CV-30] MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Link: https://arxiv.org/abs/2409.16084
Authors: Jiacheng Ruan, Wenzhen Yuan, Zehao Lin, Ning Liao, Zhiyu Li, Feiyu Xiong, Ting Liu, Yuzhuo Fu
Keywords: Large visual-language models, achieved great success, Large visual-language, multiple applications, achieved great
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures. Work in progress

Abstract:Large visual-language models (LVLMs) have achieved great success in multiple applications. However, they still encounter challenges in complex scenes, especially those involving camouflaged objects. This is primarily due to the lack of samples related to camouflaged scenes in the training dataset. To mitigate this issue, we construct the MM-CamObj dataset for the first time, comprising two subsets: CamObj-Align and CamObj-Instruct. Specifically, CamObj-Align contains 11,363 image-text pairs, and it is designed for VL alignment and injecting rich knowledge of camouflaged scenes into LVLMs. CamObj-Instruct is collected for fine-tuning the LVLMs with improved instruction-following capabilities, and it includes 11,363 images and 68,849 conversations with diverse instructions. Based on the MM-CamObj dataset, we propose CamObj-Llava, an LVLM specifically designed for addressing tasks in camouflaged scenes. To facilitate our model’s effective acquisition of knowledge about camouflaged objects and scenes, we introduce a curriculum learning strategy with six distinct modes. Additionally, we construct CamObj-Bench to evaluate existing LVLMs’ capabilities of understanding, recognition, localization and counting in camouflaged scenes. This benchmark includes 600 images and 7 tasks, with a total of 9,449 questions. Extensive experiments are conducted on CamObj-Bench with CamObj-Llava, 8 existing open-source and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o. Code and datasets will be available at this https URL.

[CV-31] GS-Net: Global Self-Attention Guided CNN for Multi-Stage Glaucoma Classification

Link: https://arxiv.org/abs/2409.16082
Authors: Dipankar Das, Deepak Ranjan Nayak
Keywords: common eye disease, timely detected, common eye, eye disease, disease that leads
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Abstract:Glaucoma is a common eye disease that leads to irreversible blindness unless detected in time. Hence, glaucoma detection at an early stage is of utmost importance for a better treatment plan and ultimately saving the patient's vision. The recent literature has shown the prominence of CNN-based methods to detect glaucoma from retinal fundus images. However, such methods mainly focus on solving binary classification tasks and have not been thoroughly explored for the detection of different glaucoma stages, which is relatively challenging due to minute lesion size variations and high inter-class similarities. This paper proposes a global self-attention based network called GS-Net for efficient multi-stage glaucoma classification. We introduce a global self-attention module (GSAM) consisting of two parallel attention modules, a channel attention module (CAM) and a spatial attention module (SAM), to learn global feature dependencies across channel and spatial dimensions. The GSAM encourages extracting more discriminative and class-specific features from the fundus images. The experimental results on a publicly available dataset demonstrate that our GS-Net outperforms state-of-the-art methods. Also, the GSAM achieves competitive performance against popular attention modules.

[CV-32] Open-World Object Detection with Instance Representation Learning

Link: https://arxiv.org/abs/2409.16073
Authors: Sunoh Lee, Minsik Jeon, Jihong Min, Junwon Seo
Keywords: humans naturally identify, deep learning-based object, Open World Object, World Object Detection, deep learning-based
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Our project website can be found at this https URL

Abstract:While humans naturally identify novel objects and understand their relationships, deep learning-based object detectors struggle to detect and relate objects that are not observed during training. To overcome this issue, Open World Object Detection (OWOD) has been introduced to enable models to detect unknown objects in open-world scenarios. However, OWOD methods fail to capture the fine-grained relationships between detected objects, which are crucial for comprehensive scene understanding and applications such as class discovery and tracking. In this paper, we propose a method to train an object detector that can both detect novel objects and extract semantically rich features in open-world conditions by leveraging the knowledge of Vision Foundation Models (VFM). We first utilize the semantic masks from the Segment Anything Model to supervise the box regression of unknown objects, ensuring accurate localization. By transferring the instance-wise similarities obtained from the VFM features to the detector’s instance embeddings, our method then learns a semantically rich feature space of these embeddings. Extensive experiments show that our method learns a robust and generalizable feature space, outperforming other OWOD-based feature extraction methods. Additionally, we demonstrate that the enhanced feature from our model increases the detector’s applicability to tasks such as open-world tracking.
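
One common way to transfer instance-wise similarities from one feature space to another is to match the two pairwise similarity matrices. The sketch below illustrates that idea only; it is not the authors' loss. The function names, the use of cosine similarity, and the mean squared difference as the penalty are assumptions, and all input vectors are assumed to be non-zero.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between non-zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    return [
        [
            sum(a * b for a, b in zip(vectors[i], vectors[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def similarity_transfer_loss(detector_embeddings, vfm_features):
    """Mean squared difference between the two pairwise similarity matrices.

    detector_embeddings: one embedding per detected instance (student).
    vfm_features:        one foundation-model feature per instance (teacher).
    The two sets may have different dimensions; only similarities are compared.
    """
    student = cosine_similarity_matrix(detector_embeddings)
    teacher = cosine_similarity_matrix(vfm_features)
    n = len(student)
    total = sum(
        (student[i][j] - teacher[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)
```

The loss is zero when the detector embeddings reproduce the teacher's similarity structure exactly, and it grows as instances the teacher considers alike drift apart in the detector's space.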

[CV-33] Machine learning approaches for automatic defect detection in photovoltaic systems

Link: https://arxiv.org/abs/2409.16069
Authors: Swayam Rajat Mohanty, Moin Uddin Maruf, Vaibhav Singh, Zeeshan Ahmad
Keywords: power conversion efficiency, damage during manufacturing, prone to damage, Solar photovoltaic, power conversion
Categories: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: 31 pages, 14 figures

Abstract:Solar photovoltaic (PV) modules are prone to damage during manufacturing, installation and operation which reduces their power conversion efficiency. This diminishes their positive environmental impact over the lifecycle. Continuous monitoring of PV modules during operation via unmanned aerial vehicles is essential to ensure that defective panels are promptly replaced or repaired to maintain high power conversion efficiencies. Computer vision provides an automatic, non-destructive and cost-effective tool for monitoring defects in large-scale PV plants. We review the current landscape of deep learning-based computer vision techniques used for detecting defects in solar modules. We compare and evaluate the existing approaches at different levels, namely the type of images used, data collection and processing method, deep learning architectures employed, and model interpretability. Most approaches use convolutional neural networks together with data augmentation or generative adversarial network-based techniques. We evaluate the deep learning approaches by performing interpretability analysis on classification tasks. This analysis reveals that the model focuses on the darker regions of the image to perform the classification. We find clear gaps in the existing approaches while also laying out the groundwork for mitigating these challenges when building new models. We conclude with the relevant research gaps that need to be addressed and approaches for progress in this field: integrating geometric deep learning with existing approaches for building more robust and reliable models, leveraging physics-based neural networks that combine domain expertise of physical laws to build more domain-aware deep learning models, and incorporating interpretability as a factor for building models that can be trusted. The review points towards a clear roadmap for making this technology commercially relevant.

[CV-34] Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data MICCAI2024

链接: https://arxiv.org/abs/2409.16063
作者: An Wang,Haochen Yin,Beilei Cui,Mengya Xu,Hongliang Ren
关键词-EN: Accurate depth perception, image distortions common, depth estimation, Accurate depth, endoscopic depth estimation
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: To appear at the Simulation and Synthesis in Medical Imaging (SASHIMI) workshop at MICCAI 2024

点击查看摘要

Abstract:Accurate depth perception is crucial for patient outcomes in endoscopic surgery, yet it is compromised by image distortions common in surgical settings. To tackle this issue, our study presents a benchmark for assessing the robustness of endoscopic depth estimation models. We have compiled a comprehensive dataset that reflects real-world conditions, incorporating a range of synthetically induced corruptions at varying severity levels. To further this effort, we introduce the Depth Estimation Robustness Score (DERS), a novel metric that combines measures of error, accuracy, and robustness to meet the multifaceted requirements of surgical applications. This metric acts as a foundational element for evaluating performance, establishing a new paradigm for the comparative analysis of depth estimation technologies. Additionally, we set forth a benchmark focused on robustness for the evaluation of depth estimation in endoscopic surgery, with the aim of driving progress in model refinement. A thorough analysis of two monocular depth estimation models using our framework reveals crucial information about their reliability under adverse conditions. Our results emphasize the essential need for algorithms that can tolerate data corruption, thereby advancing discussions on improving model robustness. The impact of this research transcends theoretical frameworks, providing concrete gains in surgical precision and patient safety. This study establishes a benchmark for the robustness of depth estimation and serves as a foundation for developing more resilient surgical support technologies. Code is available at this https URL.

[CV-35] Generative 3D Cardiac Shape Modelling for In-Silico Trials

链接: https://arxiv.org/abs/2409.16058
作者: Andrei Gasparovici,Alex Serban
关键词-EN: deep learning method, neural signed distance, signed distance field, propose a deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: EFMI Special Topic Conference 2024

点击查看摘要

Abstract:We propose a deep learning method to model and generate synthetic aortic shapes based on representing shapes as the zero-level set of a neural signed distance field, conditioned by a family of trainable embedding vectors which encode the geometric features of each shape. The network is trained on a dataset of aortic root meshes reconstructed from CT images by making the neural field vanish on sampled surface points and enforcing its spatial gradient to have unit norm. Empirical results show that our model can represent aortic shapes with high fidelity. Moreover, by sampling from the learned embedding vectors, we can generate novel shapes that resemble real patient anatomies, which can be used for in-silico trials.
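
The two training constraints named in the abstract (the field vanishes on surface points, and its spatial gradient has unit norm) can be illustrated with a small sketch. This is not the authors' code: an analytic sphere stands in for the neural field, and the names `sphere_sdf`, `numerical_gradient` and `sdf_losses` are hypothetical helpers introduced only for this illustration.

```python
import math
import random

def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin (stand-in for a neural field)."""
    return math.sqrt(sum(c * c for c in p)) - radius

def numerical_gradient(f, p, eps=1e-5):
    """Central finite-difference gradient of f at p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def sdf_losses(f, surface_points, space_points):
    """Return (surface loss, unit-gradient loss) for a candidate field f."""
    # Surface term: the field should be zero on points sampled from the shape.
    surface = sum(abs(f(p)) for p in surface_points) / len(surface_points)
    # Unit-gradient term: the gradient norm should equal 1 everywhere.
    unit_grad = 0.0
    for p in space_points:
        norm = math.sqrt(sum(g * g for g in numerical_gradient(f, p)))
        unit_grad += (norm - 1.0) ** 2
    return surface, unit_grad / len(space_points)

rng = random.Random(0)
# Points on the unit sphere (normalised Gaussian directions).
surface_points = []
for _ in range(50):
    d = [rng.gauss(0, 1) for _ in range(3)]
    n = math.sqrt(sum(c * c for c in d))
    surface_points.append([c / n for c in d])
# Points in space, kept away from the origin where the gradient is undefined.
space_points = [[rng.uniform(0.5, 2.0) for _ in range(3)] for _ in range(50)]

good_surface, good_unit = sdf_losses(sphere_sdf, surface_points, space_points)
# A scaled field still vanishes on the surface but violates the unit-gradient constraint.
bad_surface, bad_unit = sdf_losses(lambda p: 2.0 * sphere_sdf(p), surface_points, space_points)
```

A true signed distance field drives both terms to zero, while the scaled field shows why the surface term alone is not enough: it is still zero on the surface, but its gradient norm is 2 rather than 1.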

[CV-36] Towards Robust Object Detection: Identifying and Removing Backdoors via Module Inconsistency Analysis

链接: https://arxiv.org/abs/2409.16057
作者: Xianda Zhang,Siyuan Liang
关键词-EN: Object detection models, Region Proposal Network, Object detection, security-critical applications, specific patterns
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Object detection models, widely used in security-critical applications, are vulnerable to backdoor attacks that cause targeted misclassifications when triggered by specific patterns. Existing backdoor defense techniques, primarily designed for simpler models like image classifiers, often fail to effectively detect and remove backdoors in object detectors. We propose a backdoor defense framework tailored to object detection models, based on the observation that backdoor attacks cause significant inconsistencies between local modules’ behaviors, such as the Region Proposal Network (RPN) and classification head. By quantifying and analyzing these inconsistencies, we develop an algorithm to detect backdoors. We find that the inconsistent module is usually the main source of backdoor behavior, leading to a removal method that localizes the affected module, resets its parameters, and fine-tunes the model on a small clean dataset. Extensive experiments with state-of-the-art two-stage object detectors show our method achieves a 90% improvement in backdoor removal rate over fine-tuning baselines, while limiting clean data accuracy loss to less than 4%. To the best of our knowledge, this work presents the first approach that addresses both the detection and removal of backdoors in two-stage object detection models, advancing the field of securing these complex systems against backdoor attacks.

[CV-37] Adversarial Watermarking for Face Recognition

链接: https://arxiv.org/abs/2409.16056
作者: Yuguang Yao,Anil Jain,Sijia Liu
关键词-EN: monitor unauthorized alterations, Watermarking, embedding an identifier, unauthorized alterations, essential technique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Watermarking is an essential technique for embedding an identifier (i.e., watermark message) within digital images to assert ownership and monitor unauthorized alterations. In face recognition systems, watermarking plays a pivotal role in ensuring data integrity and security. However, an adversary could potentially interfere with the watermarking process, significantly impairing recognition performance. We explore the interaction between watermarking and adversarial attacks on face recognition models. Our findings reveal that while watermarking or input-level perturbation alone may have a negligible effect on recognition accuracy, the combined effect of watermarking and perturbation can result in an adversarial watermarking attack, significantly degrading recognition performance. Specifically, we introduce a novel threat model, the adversarial watermarking attack, which remains stealthy in the absence of watermarking, allowing images to be correctly recognized initially. However, once watermarking is applied, the attack is activated, causing recognition failures. Our study reveals a previously unrecognized vulnerability: adversarial perturbations can exploit the watermark message to evade face recognition systems. Evaluated on the CASIA-WebFace dataset, our proposed adversarial watermarking attack reduces face matching accuracy by 67.2% with an ℓ∞-norm perturbation strength of 2/255 and by 95.9% with a strength of 4/255.
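
To make the perturbation budget concrete: a strength of 2/255 in the ℓ∞ norm means that no pixel (on a [0, 1] scale) may change by more than 2/255. The sketch below only shows how such a budget is typically enforced by clipping. It is not the attack from the paper, and `project_linf` is a hypothetical helper name.

```python
def project_linf(original, perturbed, eps):
    """Clip a perturbed image so each pixel stays within eps of the original and inside [0, 1]."""
    projected = []
    for x, x_adv in zip(original, perturbed):
        delta = max(-eps, min(eps, x_adv - x))            # enforce the per-pixel budget
        projected.append(max(0.0, min(1.0, x + delta)))   # keep a valid pixel value
    return projected

EPS = 2 / 255                      # perturbation strength quoted in the abstract
clean = [0.10, 0.50, 0.995]        # toy "image" of three pixels
noisy = [0.30, 0.495, 1.20]        # unconstrained perturbed pixels
bounded = project_linf(clean, noisy, EPS)
max_change = max(abs(b - c) for b, c in zip(bounded, clean))
```

Here the first pixel is pulled back to within 2/255 of its clean value, the second is left alone because it was already inside the budget, and the third is additionally clipped to the valid range.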

[CV-38] Unleashing the Potential of Synthetic Images: A Study on Histopathology Image Classification ECCV2024

链接: https://arxiv.org/abs/2409.16002
作者: Leire Benito-Del-Valle,Aitor Alvarez-Gila,Itziar Eguskiza,Cristina L. Saratxaga
关键词-EN: accurate identification, identification and diagnosis, large and diverse, Histopathology image, Histopathology image classification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024 - BioImage Computing Workshop

点击查看摘要

Abstract:Histopathology image classification is crucial for the accurate identification and diagnosis of various diseases but requires large and diverse datasets. Obtaining such datasets, however, is often costly and time-consuming due to the need for expert annotations and ethical constraints. To address this, we examine the suitability of different generative models and image selection approaches to create realistic synthetic histopathology image patches conditioned on class labels. Our findings highlight the importance of selecting an appropriate generative model type and architecture to enhance performance. Our experiments over the PCam dataset show that diffusion models are effective for transfer learning, while GAN-generated samples are better suited for augmentation. Additionally, transformer-based generative models do not require image filtering, in contrast to those derived from Convolutional Neural Networks (CNNs), which benefit from realism score-based selection. Therefore, we show that synthetic images can effectively augment existing datasets, ultimately improving the performance of the downstream histopathology image classification task.

[CV-39] Improvements to SDXL in NovelAI Diffusion V3

链接: https://arxiv.org/abs/2409.15997
作者: Juan Ossa,Eren Doğan,Alex Birch,F. Johnson
关键词-EN: training NovelAI Diffusion, image generation model, art anime image, anime image generation, NovelAI Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:In this technical report, we document the changes we made to SDXL in the process of training NovelAI Diffusion V3, our state-of-the-art anime image generation model.

[CV-40] Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

链接: https://arxiv.org/abs/2409.15980
作者: Yunbo Long,Zhengyang Ling,Sam Brook,Duncan McFarlane,Alexandra Brintrup
关键词-EN: Traditional machine learning-based, extensive data collection, Traditional machine, machine learning-based visual, require extensive data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive cameras, computing equipment, and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through OpenVINO. The results show that this cost-effective system can complete anomaly detection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers.

[CV-41] Adversarial Backdoor Defense in CLIP

链接: https://arxiv.org/abs/2409.15968
作者: Junhao Kuang,Siyuan Liang,Jiawei Liang,Kuanrong Liu,Xiaochun Cao
关键词-EN: Multimodal contrastive pretraining, contrastive pretraining, backdoor, defense, Multimodal contrastive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal contrastive pretraining, exemplified by models like CLIP, has been found to be vulnerable to backdoor attacks. While current backdoor defense methods primarily employ conventional data augmentation to create augmented samples aimed at feature alignment, these methods fail to capture the distinct features of backdoor samples, resulting in suboptimal defense performance. Observations reveal that adversarial examples and backdoor samples exhibit similarities in the feature space within the compromised models. Building on this insight, we propose Adversarial Backdoor Defense (ABD), a novel data augmentation strategy that aligns features with meticulously crafted adversarial examples. This approach effectively disrupts the backdoor association. Our experiments demonstrate that ABD provides robust defense against both traditional uni-modal and multimodal backdoor attacks targeting CLIP. Compared to the current state-of-the-art defense method, CleanCLIP, ABD reduces the attack success rate by 8.66% for BadNet, 10.52% for Blended, and 53.64% for BadCLIP, while maintaining a minimal average decrease of just 1.73% in clean accuracy.

[CV-42] Semantics-Controlled Gaussian Splatting for Outdoor Scene Reconstruction and Rendering in Virtual Reality

链接: https://arxiv.org/abs/2409.15959
作者: Hannah Schieber,Jacob Young,Tobias Langlotz,Stefanie Zollmann,Daniel Roth
关键词-EN: Gaussian Splatting, real-time rendering, virtual reality, view synthesis, synthesis and real-time
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Advancements in 3D rendering like Gaussian Splatting (GS) allow novel view synthesis and real-time rendering in virtual reality (VR). However, GS-created 3D environments are often difficult to edit. For scene enhancement or to incorporate 3D assets, segmenting Gaussians by class is essential. Existing segmentation approaches are typically limited to certain types of scenes, e.g., "circular" scenes, to determine clear object boundaries. However, this method is ineffective when removing large objects in non-"circling" scenes such as large outdoor scenes. We propose Semantics-Controlled GS (SCGS), a segmentation-driven GS approach, enabling the separation of large scene parts in uncontrolled, natural environments. SCGS allows scene editing and the extraction of scene parts for VR. Additionally, we introduce a challenging outdoor dataset, overcoming the "circling" setup. We outperform the state-of-the-art in visual quality on our dataset and in segmentation quality on the 3D-OVS dataset. We conducted an exploratory user study, comparing a 360-video, plain GS, and SCGS in VR with a fixed viewpoint. In our subsequent main study, users were allowed to move freely, evaluating plain GS and SCGS. Our main study results show that participants clearly prefer SCGS over plain GS. Overall, we present an innovative approach that surpasses the state-of-the-art both technically and in user experience.

[CV-43] An ensemble framework approach of hybrid Quantum convolutional neural networks for classification of breast cancer images

链接: https://arxiv.org/abs/2409.15958
作者: Dibyasree Guha,Shyamali Mitra,Somenath Kuiry,Nibaran Das
关键词-EN: superposition and entanglement, Quantum neural networks, deemed suitable, suitable to replace, ability to learn
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in the 3rd International Conference on Data Electronics and Computing

点击查看摘要

Abstract:Quantum neural networks are deemed suitable to replace classical neural networks in their ability to learn and scale up network models using quantum-exclusive phenomena like superposition and entanglement. However, in the noisy intermediate-scale quantum (NISQ) era, the trainability and expressibility of quantum models are still under investigation. Medical image classification, on the other hand, lends itself well to deep learning, particularly convolutional neural networks. In this paper, we carry out a study of three hybrid classical-quantum neural network architectures and combine them using standard ensembling techniques on a breast cancer histopathological dataset. The best accuracy obtained by an individual model is 85.59%, whereas ensembling yields an accuracy as high as 86.72%, an improvement over the individual hybrid networks as well as the classical neural network counterparts of the hybrid network models.
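
The abstract does not say which "standard ensembling techniques" were used. As an assumption for illustration only, the sketch below shows two common choices, soft voting (averaging class probabilities) and hard voting (majority over predicted labels), applied to made-up outputs from three models. The probability values are invented and are not results from the paper.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def soft_vote(model_probs):
    """Average class probabilities across models, then pick the top class per sample."""
    num_models = len(model_probs)
    predictions = []
    for sample in zip(*model_probs):          # per-sample probabilities from every model
        mean = [sum(col) / num_models for col in zip(*sample)]
        predictions.append(argmax(mean))
    return predictions

def hard_vote(model_probs):
    """Each model votes for its top class; the most common vote wins."""
    predictions = []
    for sample in zip(*model_probs):
        votes = Counter(argmax(p) for p in sample)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Invented outputs: three models, two samples, two classes (e.g. benign / malignant).
model_probs = [
    [[0.90, 0.10], [0.40, 0.60]],
    [[0.40, 0.60], [0.45, 0.55]],
    [[0.45, 0.55], [0.80, 0.20]],
]
soft = soft_vote(model_probs)
hard = hard_vote(model_probs)
```

The toy numbers are chosen so that the two rules disagree: one confident model outweighs two hesitant ones under soft voting, while hard voting follows the majority.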

[CV-44] Mind the Prompt: A Novel Benchmark for Prompt-based Class-Agnostic Counting

链接: https://arxiv.org/abs/2409.15953
作者: Luca Ciampi,Nicola Messina,Matteo Pierucci,Giuseppe Amato,Marco Avvenuti,Fabrizio Falchi
关键词-EN: arbitrary object classes, Class-agnostic counting, computer vision, vision that aims, aims to estimate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Class-agnostic counting (CAC) is a recent task in computer vision that aims to estimate the number of instances of arbitrary object classes never seen during model training. With the recent advancement of robust vision-and-language foundation models, there is a growing interest in prompt-based CAC, where object categories to be counted can be specified using natural language. However, we identify significant limitations in current benchmarks for evaluating this task, which hinder both accurate assessment and the development of more effective solutions. Specifically, we argue that the current evaluation protocols do not measure the ability of the model to understand which object has to be counted. This is due to two main factors: (i) the shortcomings of CAC datasets, which primarily consist of images containing objects from a single class, and (ii) the limitations of current counting performance evaluators, which are based on traditional class-specific counting and focus solely on counting errors. To fill this gap, we introduce the Prompt-Aware Counting (PrACo) benchmark, which comprises two targeted tests, each accompanied by appropriate evaluation metrics. We evaluate state-of-the-art methods and demonstrate that, although some achieve impressive results on standard class-specific counting metrics, they exhibit a significant deficiency in understanding the input prompt, indicating the need for more careful training procedures or revised designs. The code for reproducing our results is available at this https URL.

[CV-45] A Formalization of Image Vectorization by Region Merging

链接: https://arxiv.org/abs/2409.15940
作者: Roy Y. He,Sung Ha Kang,Jean-Michel Morel
关键词-EN: vector graphics composed, converts raster images, Image vectorization converts, vectorization converts raster, vector graphics
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Image vectorization converts raster images into vector graphics composed of regions separated by curves. Typical vectorization methods first define the regions by grouping similar colored regions via color quantization, then approximate their boundaries by Bezier curves. In that way, the raster input is converted into an SVG format parameterizing the regions’ colors and the Bezier control points. This compact representation has many graphical applications thanks to its universality and resolution-independence. In this paper, we remark that image vectorization is nothing but an image segmentation, and that it can be built by fine to coarse region merging. Our analysis of the problem leads us to propose a vectorization method alternating region merging and curve smoothing. We formalize the method by alternate operations on the dual and primal graph induced from any domain partition. In that way, we address a limitation of current vectorization methods, which separate the update of regional information from curve approximation. We formalize region merging methods by associating them with various gain functionals, including the classic Beaulieu-Goldberg and Mumford-Shah functionals. More generally, we introduce and compare region merging criteria involving region number, scale, area, and internal standard deviation. We also show that the curve smoothing, implicit in all vectorization methods, can be performed by the shape-preserving affine scale space. We extend this flow to a network of curves and give a sufficient condition for the topological preservation of the segmentation. The general vectorization method that follows from this analysis shows explainable behaviors, explicitly controlled by a few intuitive parameters. It is experimentally compared to state-of-the-art software and proved to have comparable or superior fidelity and cost efficiency.

[CV-46] Self-supervised Shape Completion via Involution and Implicit Correspondences ECCV2024

链接: https://arxiv.org/abs/2409.15939
作者: Mengya Liu,Ajad Chhatkuli,Janis Postels,Luc Van Gool,Federico Tombari
关键词-EN: completion, traditionally solved, shape completion, shape, distribution learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:3D shape completion is traditionally solved using supervised training or by distribution learning on complete shape examples. Recently, self-supervised learning approaches that do not require any complete 3D shape examples have gained more interest. In this paper, we propose a non-adversarial self-supervised approach for the shape completion task. Our first finding is that completion problems can be formulated as an involutory function trivially, which implies a special constraint on the completion function G, such that G(G(X)) = X. Our second constraint on self-supervised shape completion relies on the fact that shape completion becomes easier to solve with correspondences and, similarly, completion can simplify the correspondence problem. We formulate a consistency measure in the canonical space in order to supervise the completion function. We efficiently optimize the completion and correspondence modules using a "freeze and alternate" strategy. The overall approach performs well for rigid shapes in a category as well as dynamic non-rigid shapes. We ablate our design choices and compare our solution against state-of-the-art methods, showing remarkable accuracy approaching supervised accuracy in some cases.

[CV-47] DepMamba: Progressive Fusion Mamba for Multimodal Depression Detection

链接: https://arxiv.org/abs/2409.15936
作者: Jiaxin Ye,Junping Zhang,Hongming Shan
关键词-EN: common mental disorder, people worldwide, common mental, mental disorder, disorder that affects
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Depression is a common mental disorder that affects millions of people worldwide. Although promising, current multimodal methods hinge on aligned or aggregated multimodal fusion, suffering from two significant limitations: (i) inefficient long-range temporal modeling, and (ii) sub-optimal multimodal fusion between intermodal fusion and intramodal processing. In this paper, we propose an audio-visual progressive fusion Mamba for multimodal depression detection, termed DepMamba. DepMamba features two core designs: hierarchical contextual modeling and progressive multimodal fusion. On the one hand, hierarchical modeling introduces convolutional neural networks and Mamba to extract the local-to-global features within long-range sequences. On the other hand, the progressive fusion first presents a multimodal collaborative State Space Model (SSM) extracting intermodal and intramodal information for each modality, and then utilizes a multimodal enhanced SSM for modality cohesion. Extensive experimental results on two large-scale depression datasets demonstrate the superior performance of our DepMamba over existing state-of-the-art methods. Code is available at this https URL.

[CV-48] Automatic Registration of SHG and HE Images with Feature-based Initial Alignment and Intensity-based Instance Optimization: Contribution to the COMULIS Challenge

链接: https://arxiv.org/abs/2409.15931
作者: Marek Wodzinski,Henning Müller
关键词-EN: noninvasive second-harmonic generation, second-harmonic generation microscopy, highly desired, generation microscopy, microscopy to hematoxylin
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The automatic registration of noninvasive second-harmonic generation microscopy to hematoxylin and eosin slides is a highly desired, yet still unsolved problem. The task is challenging because the second-harmonic images contain only partial information, in contrast to the stained HE slides that provide more information about the tissue morphology. Moreover, the two imaging methods have different intensity distributions. Therefore, the task can be formulated as a multi-modal registration problem with missing data. In this work, we propose a method based on automatic keypoint matching followed by deformable registration based on instance optimization. The method does not require any training and is evaluated using the dataset provided in the Learn2Reg challenge by the COMULIS organization. The method achieved relatively good generalizability, resulting in an 88% success rate in the initial alignment and an average target registration error of 2.48 on the external validation set. We openly release the source code and incorporate it in the DeeperHistReg image registration framework.

[CV-49] Facing Asymmetry - Uncovering the Causal Link between Facial Symmetry and Expression Classifiers using Synthetic Interventions ACCV2024

链接: https://arxiv.org/abs/2409.15927
作者: Tim Büchner,Niklas Penzel,Orlando Guntinas-Lichius,Joachim Denzler
关键词-EN: trained black box, achieve high performance, black box models, box models achieve, models achieve high
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 45 pages; 26 figures; accepted at ACCV 2024

点击查看摘要

Abstract:Understanding expressions is vital for deciphering human behavior, and nowadays, end-to-end trained black box models achieve high performance. Due to the black-box nature of these models, it is unclear how they behave when applied out-of-distribution. Specifically, these models show decreased performance for unilateral facial palsy patients. We hypothesize that one crucial factor guiding the internal decision rules is facial symmetry. In this work, we use insights from causal reasoning to investigate the hypothesis. After deriving a structural causal model, we develop a synthetic interventional framework. This approach allows us to analyze how facial symmetry impacts a network’s output behavior while keeping other factors fixed. All 17 investigated expression classifiers significantly lower their output activations for reduced symmetry. This result is congruent with observed behavior on real-world data from healthy subjects and facial palsy patients. As such, our investigation serves as a case study for identifying causal factors that influence the behavior of black-box models.

[CV-50] Learning Compact Channel Correlation Representation for LiDAR Place Recognition ICRA2025

链接: https://arxiv.org/abs/2409.15919
作者: Saimunur Rahman,Peyman Moghadam
关键词-EN: learn compact channel, compact channel correlation, traditional covariance pooling, channel correlation representation, covariance pooling methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:This paper presents a novel approach to learn compact channel correlation representation for LiDAR place recognition, called C3R, aimed at reducing the computational burden and dimensionality associated with traditional covariance pooling methods for place recognition tasks. Our method partitions the feature matrix into smaller groups, computes group-wise covariance matrices, and aggregates them via a learnable aggregation strategy. Matrix power normalization is applied to ensure stability. Theoretical analyses are also given to demonstrate the effectiveness of the proposed method, including its ability to preserve permutation invariance and maintain high mutual information between the original features and the aggregated representation. We conduct extensive experiments on four large-scale, public LiDAR place recognition datasets, namely Oxford RobotCar, In-house, MulRan, and WildPlaces, to validate our approach's superiority in accuracy and robustness. Furthermore, we provide the quantitative results of our approach for a deeper understanding. The code will be released upon acceptance.
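
The partition, group-wise covariance and aggregation steps described above can be sketched on a toy feature matrix. This is a simplified illustration, not the C3R implementation: the aggregation here is a fixed weighted sum standing in for the learnable strategy, matrix power normalization is omitted, and `covariance` and `grouped_covariance` are hypothetical helper names.

```python
def covariance(rows):
    """Sample covariance matrix of a list of feature vectors (one row per point)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
            for a in range(d)]

def grouped_covariance(features, num_groups, weights):
    """Split channels into equal groups, compute one covariance per group, and sum them with weights."""
    size = len(features[0]) // num_groups
    out = [[0.0] * size for _ in range(size)]
    for g in range(num_groups):
        block = [row[g * size:(g + 1) * size] for row in features]  # channels of group g
        cov = covariance(block)
        for a in range(size):
            for b in range(size):
                out[a][b] += weights[g] * cov[a][b]   # assumed stand-in for learnable aggregation
    return out

# Toy input: 4 points with 4 feature channels each.
features = [
    [1.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
    [3.0, 6.0, 0.0, 1.0],
    [4.0, 8.0, 1.0, 0.0],
]
compact = grouped_covariance(features, num_groups=2, weights=[0.5, 0.5])
full = covariance(features)
```

With 4 channels, the full covariance has 16 entries while the grouped representation has 4. The saving grows quadratically with channel count, which is the dimensionality reduction the abstract refers to.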

[CV-51] Exploring the potential of collaborative UAV 3D mapping in Kenyan savanna for wildlife research

链接: https://arxiv.org/abs/2409.15914
作者: Vandita Shukla,Luca Morelli,Pawel Trybala,Fabio Remondino,Wentian Gan,Yifei Yu,Xin Wang
关键词-EN: UAV-based biodiversity conservation, biodiversity conservation applications, data acquisition advantages, UAV-based biodiversity, advantages for researchers
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted at IMAV 2024

点击查看摘要

Abstract:UAV-based biodiversity conservation applications have exhibited many data acquisition advantages for researchers. UAV platforms with embedded data processing hardware can support conservation challenges through 3D habitat mapping, surveillance and monitoring solutions. High-quality real-time scene reconstruction as well as real-time UAV localization can optimize the exploration vs exploitation balance of single or collaborative missions. In this work, we explore the potential of two collaborative frameworks, Visual Simultaneous Localization and Mapping (V-SLAM) and Structure-from-Motion (SfM), for 3D mapping purposes and compare results with standard offline approaches.

[CV-52] Unimotion: Unifying 3D Human Motion Synthesis and Understanding

链接: https://arxiv.org/abs/2409.15904
作者: Chuqiao Li,Julian Chibane,Yannan He,Naama Pearl,Andreas Geiger,Gerard Pons-moll
关键词-EN: unified multi-task human, multi-task human motion, motion, Unimotion, text
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce Unimotion, the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding. While existing works control avatar motion with global text conditioning, or with fine-grained per-frame scripts, none can do both at once. In addition, none of the existing works can output frame-level text paired with the generated poses. In contrast, Unimotion allows motion to be controlled with global text, local frame-level text, or both at once, providing more flexible control for users. Importantly, Unimotion is the first model which by design outputs local text paired with the generated poses, allowing users to know what motion happens and when, which is necessary for a wide range of applications. We show Unimotion opens up new applications: (1) hierarchical control, allowing users to specify motion at different levels of detail; (2) obtaining motion text descriptions for existing MoCap data or YouTube videos; and (3) editability, generating motion from text and editing the motion via text edits. Moreover, Unimotion attains state-of-the-art results for the frame-level text-to-motion task on the established HumanML3D dataset. The pre-trained model and code are available on our project page at this https URL.

[CV-53] FedRepOpt: Gradient Re-parametrized Optimizers in Federated Learning

链接: https://arxiv.org/abs/2409.15898
作者: Kin Wai Lau,Yasar Abbas Ur Rehman,Pedro Porto Buarque de Gusmão,Lai-Man Po,Lan Ma,Yuyang Xie
关键词-EN: Federated Learning, training machine learning, machine learning models, machine learning, edge devices
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a privacy-preserving method for training machine learning models in a distributed manner on edge devices. However, on-device models face inherent computational power and memory limitations, potentially resulting in constrained gradient updates. As the model’s size increases, the frequency of gradient updates on edge devices decreases, ultimately leading to suboptimal training outcomes during any particular FL round. This limits the feasibility of deploying advanced and large-scale models on edge devices, hindering the potential for performance enhancements. To address this issue, we propose FedRepOpt, a gradient re-parameterized optimizer for FL. The gradient re-parameterized method allows training a simple local model with a similar performance as a complex model by modifying the optimizer’s gradients according to a set of model-specific hyperparameters obtained from the complex models. In this work, we focus on VGG-style and Ghost-style models in the FL environment. Extensive experiments demonstrate that models using FedRepOpt obtain a significant boost in performance of 16.7% and 11.4% compared to the RepGhost-style and RepVGG-style networks, while also demonstrating a faster convergence time of 11.7% and 57.4% compared to their complex structure.

[CV-54] Unsupervised Attention Regularization Based Domain Adaptation for Oracle Character Recognition

Link: https://arxiv.org/abs/2409.15893
Authors: Mei Wang, Weihong Deng, Jiani Hu, Sen Su
Keywords: role in Chinese, Chinese archaeology, oracle characters plays, oracle characters, archaeology and philology
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: The study of oracle characters plays an important role in Chinese archaeology and philology. However, the difficulty of collecting and annotating real-world scanned oracle characters hinders the development of oracle character recognition. In this paper, we develop a novel unsupervised domain adaptation (UDA) method, i.e., unsupervised attention regularization network (UARN), to transfer recognition knowledge from labeled handprinted oracle characters to unlabeled scanned data. First, we experimentally prove that existing UDA methods are not always consistent with human priors and cannot achieve optimal performance on the target domain. For these oracle characters with flip-insensitivity and high inter-class similarity, model interpretations are not flip-consistent and class-separable. To tackle this challenge, we take into consideration visual perceptual plausibility when adapting. Specifically, our method enforces attention consistency between the original and flipped images to make the model robust to flipping. Simultaneously, we constrain attention separability between the pseudo class and the most confusing class to improve the model discriminability. Extensive experiments demonstrate that UARN shows better interpretability and achieves state-of-the-art performance on the Oracle-241 dataset, substantially outperforming the previous structure-texture separation network by 8.5%.
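
The attention-consistency constraint between original and flipped images can be illustrated with a small toy computation. This is a sketch under assumptions, not the authors' loss: attention maps are plain nested lists, the helper names are hypothetical, and mean squared error is assumed as the distance because the abstract does not name one.

```python
def hflip(att):
    """Mirror a 2-D attention map left to right."""
    return [row[::-1] for row in att]


def flip_consistency_loss(att_orig, att_of_flipped):
    """Mean squared difference between the attention on the original image
    and the attention on the mirrored image after mirroring it back."""
    aligned = hflip(att_of_flipped)
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(att_orig, aligned)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


att = [[0.1, 0.9], [0.4, 0.6]]
consistent = flip_consistency_loss(att, hflip(att))  # flip-consistent model
inconsistent = flip_consistency_loss(att, att)       # attention ignored the flip
```

A flip-consistent model gives zero loss, while a model whose attention does not follow the mirrored content is penalized.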

[CV-55] CAD: Memory Efficient Convolutional Adapter for Segment Anything

Link: https://arxiv.org/abs/2409.15889
Authors: Joohyeok Kim, Joonhyeon Song, Seohwan Yun, Seongho Yoon, Sangmin Lee
Keywords: Foundation model, actively researched, Segment, SAM, image segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract: The foundation model for image segmentation, Segment Anything (SAM), has been actively researched in various fields since its proposal. Various studies have been proposed to adapt SAM to specific domains, with one notable approach involving the addition and training of lightweight adapter modules. While adapter-based fine-tuning approaches have reported parameter efficiency and significant performance improvements, they face an often-overlooked issue: the excessive consumption of GPU memory relative to the number of trainable parameters. Addressing this issue, this paper proposes a memory-efficient parallel convolutional adapter architecture. This architecture connects in parallel with SAM’s image encoder, eliminating the need to store activations and gradients of the image encoder during model training. Our proposed architecture demonstrated competitive experimental results while using less than half the GPU memory compared to SAM Adapter, indicating its value as an alternative to simple decoder fine-tuning when hardware limitations preclude adapter-based learning. Our code implementation is available on our GitHub.

[CV-56] Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization

Link: https://arxiv.org/abs/2409.15882
Authors: Sotheara Leang (CADT, M-PSI), Anderson Augusma (M-PSI, SVH), Eric Castelli (M-PSI), Frédérique Letué (SAM), Sethserey Sam (CADT), Dominique Vaufreydaz (M-PSI)
Keywords: Human speech conveys, Human speech, speech conveys prosody, Vector-Quantized Variational Auto-Encoder, Human
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Abstract: Human speech conveys prosody, linguistic content, and speaker identity. This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to deal with these speech components. This approach is designed to disentangle these components to specifically target and modify the speaker identity while preserving the linguistic and emotional content. To do so, three separate branches compute embeddings for content, prosody, and speaker identity respectively. During synthesis, taking these embeddings, the decoder of the proposed architecture is conditioned on both speaker and prosody information, allowing for capturing more nuanced emotional states and precise adjustments to speaker identification. Findings indicate that this method outperforms most baseline techniques in preserving emotional information. However, it exhibits more limited performance on other voice privacy tasks, emphasizing the need for further improvements.

[CV-57] Zero-Shot Detection of AI-Generated Images

Link: https://arxiv.org/abs/2409.15875
Authors: Davide Cozzolino, Giovanni Poggi, Matthias Nießner, Luisa Verdoliva
Keywords: Detecting AI-generated images, extraordinarily difficult challenge, Detecting AI-generated, generative architectures emerge, unprecedented realism
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Detecting AI-generated images has become an extraordinarily difficult challenge as new generative architectures emerge on a daily basis with more and more capabilities and unprecedented realism. New versions of many commercial tools, such as DALLE, Midjourney, and Stable Diffusion, have been released recently, and it is impractical to continually update and retrain supervised forensic detectors to handle such a large variety of models. To address this challenge, we propose a zero-shot entropy-based detector (ZED) that neither needs AI-generated training data nor relies on knowledge of generative architectures to artificially synthesize their artifacts. Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images. To this end, we rely on a lossless image encoder that estimates the probability distribution of each pixel given its context. To ensure computational efficiency, the encoder has a multi-resolution architecture and contexts comprise mostly pixels of the lower-resolution version of the image. Since only real images are needed to learn the model, the detector is independent of generator architectures and synthetic training data. Using a single discriminative feature, the proposed detector achieves state-of-the-art performance. On a wide variety of generative models it achieves an average improvement of more than 3% over the SoTA in terms of accuracy. Code is available at this https URL.
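
One way to make "how surprising the image is" concrete is to compare the actual coding cost of each pixel with the cost the model expected to pay. The sketch below is an assumption-laden illustration, not the ZED feature itself: `coding_gap` is a hypothetical name, each per-pixel distribution is a plain list of probabilities indexed by pixel value, and the abstract does not state the exact statistic that is used.

```python
import math


def coding_gap(pixels, predicted_dists):
    """Average of (negative log-likelihood - entropy) in bits per pixel.

    pixels[i] is the observed value; predicted_dists[i][v] is the modelled
    probability of value v at that position (must be > 0 for the observed value).
    """
    total = 0.0
    for value, dist in zip(pixels, predicted_dists):
        nll = -math.log2(dist[value])                            # actual cost
        entropy = -sum(p * math.log2(p) for p in dist if p > 0)  # expected cost
        total += nll - entropy
    return total / len(pixels)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.5, 0.25, 0.125, 0.125]
gap_expected = coding_gap([0, 3], [uniform, uniform])  # as surprising as predicted
gap_surprising = coding_gap([2], [peaked])             # an unlikely value was observed
```

A gap near zero means the pixels behave the way the model of real images expects; a large positive or negative gap indicates a mismatch that a detector could threshold.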

[CV-58] Potential Field as Scene Affordance for Behavior Change-Based Visual Risk Object Identification

Link: https://arxiv.org/abs/2409.15846
Authors: Pang-Yuan Pao, Shu-Wei Lu, Ze-Yan Lu, Yi-Ting Chen
Keywords: intelligent driving systems, critical framework designed, risk object identification, detect potential hazards, visual risk object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:We study behavior change-based visual risk object identification (Visual-ROI), a critical framework designed to detect potential hazards for intelligent driving systems. Existing methods often show significant limitations in spatial accuracy and temporal consistency, stemming from an incomplete understanding of scene affordance. For example, these methods frequently misidentify vehicles that do not impact the ego vehicle as risk objects. Furthermore, existing behavior change-based methods are inefficient because they implement causal inference in the perspective image space. We propose a new framework with a Bird’s Eye View (BEV) representation to overcome the above challenges. Specifically, we utilize potential fields as scene affordance, involving repulsive forces derived from road infrastructure and traffic participants, along with attractive forces sourced from target destinations. In this work, we compute potential fields by assigning different energy levels according to the semantic labels obtained from BEV semantic segmentation. We conduct thorough experiments and ablation studies, comparing the proposed method with various state-of-the-art algorithms on both synthetic and real-world datasets. Our results show a notable increase in spatial and temporal consistency, with enhancements of 20.3% and 11.6% on the RiskBench dataset, respectively. Additionally, we can improve computational efficiency by 88%. We achieve improvements of 5.4% in spatial accuracy and 7.2% in temporal consistency on the nuScenes dataset.
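
The potential-field construction described above (repulsive energy assigned by semantic label, attraction toward a target destination) can be sketched on a tiny BEV grid. Everything specific here is an assumption for illustration: the label set, the energy values in `REPULSIVE_ENERGY`, the linear attraction term, and the function name are hypothetical and not taken from the paper.

```python
import math

# Hypothetical energy level per BEV semantic label (illustrative values only).
REPULSIVE_ENERGY = {"road": 0.0, "sidewalk": 2.0, "vehicle": 5.0, "pedestrian": 8.0}


def potential_field(bev_labels, goal, attract_gain=0.5):
    """Return a grid of potentials: label-based repulsion plus an attraction
    term that grows with Euclidean distance (in cells) from the goal."""
    field = []
    for r, row in enumerate(bev_labels):
        out_row = []
        for c, label in enumerate(row):
            repulsive = REPULSIVE_ENERGY[label]
            attractive = attract_gain * math.hypot(r - goal[0], c - goal[1])
            out_row.append(repulsive + attractive)
        field.append(out_row)
    return field


grid = [["road", "vehicle"],
        ["road", "road"]]
field = potential_field(grid, goal=(0, 0))
```

Lower values mark cells that are both free of obstacles and close to the destination, so a planner following the negative gradient would steer around the vehicle cell.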

[CV-59] FSF-Net: Enhance 4D Occupancy Forecasting with Coarse BEV Scene Flow for Autonomous Driving

Link: https://arxiv.org/abs/2409.15841
Authors: Erxin Guo, Pei An, You Yang, Qiong Liu, An-An Liu
Keywords: BEV scene flow, Scene flow, BEV scene, coarse BEV scene, avoid potential risk
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: 4D occupancy forecasting is one of the important techniques for autonomous driving, as it helps avoid potential risks in complex traffic scenes. Scene flow is a crucial element for describing the tendency of a 4D occupancy map. However, accurate scene flow is difficult to predict in real scenes. In this paper, we find that BEV scene flow can approximately represent 3D scene flow in most traffic scenes, and that coarse BEV scene flow is easy to generate. Based on this observation, we propose FSF-Net, a 4D occupancy forecasting method based on coarse BEV scene flow. First, we develop a general occupancy forecasting architecture based on coarse BEV scene flow. Then, to further enhance the 4D occupancy feature representation ability, we propose a vector-quantized Mamba (VQ-Mamba) network to mine spatial-temporal structural scene features. After that, to effectively fuse the coarse occupancy maps forecasted from BEV scene flow with the latent features, we design a U-Net based quality fusion (UQF) network to generate the fine-grained forecasting result. Extensive experiments are conducted on the public Occ3D dataset. FSF-Net achieves IoU and mIoU that are 9.56% and 10.87% higher than the state-of-the-art method, respectively. Hence, we believe that the proposed FSF-Net benefits the safety of autonomous driving.

[CV-60] Deep Learning Techniques for Automatic Lateral X-ray Cephalometric Landmark Detection: Is the Problem Solved?

Link: https://arxiv.org/abs/2409.15834
Authors: Hongyuan Zhang, Ching-Wei Wang, Hikam Muzakky, Juan Dai, Xuguang Li, Chenglong Ma, Qian Wu, Xianan Cui, Kunlun Xu, Pengfei He, Dongqian Guo, Xianlong Wang, Hyunseok Lee, Zhangnan Zhong, Zhu Zhu, Bingsheng Huang
Keywords: Cephalometric Landmark Detection, Cephalometric Landmark, Landmark Detection, deep learning methods, fundamental task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures

Abstract:Localization of the craniofacial landmarks from lateral cephalograms is a fundamental task in cephalometric analysis. The automation of the corresponding tasks has thus been the subject of intense research over the past decades. In this paper, we introduce the “Cephalometric Landmark Detection (CL-Detection)” dataset, which is the largest publicly available and comprehensive dataset for cephalometric landmark detection. This multi-center and multi-vendor dataset includes 600 lateral X-ray images with 38 landmarks acquired with different equipment from three medical centers. The overarching objective of this paper is to measure how far state-of-the-art deep learning methods can go for cephalometric landmark detection. Following the 2023 MICCAI CL-Detection Challenge, we report the results of the top ten research groups using deep learning methods. Results show that the best methods closely approximate the expert analysis, achieving a mean detection rate of 75.719% and a mean radial error of 1.518 mm. While there is room for improvement, these findings undeniably open the door to highly accurate and fully automatic location of craniofacial landmarks. We also identify scenarios for which deep learning methods are still failing. Both the dataset and detailed results are publicly available online, while the platform will remain open for the community to benchmark future algorithm developments at this https URL.
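
The two headline metrics, mean radial error and detection rate, can be computed from predicted and reference landmark coordinates as sketched below. This is an illustrative sketch with assumptions: the function names are hypothetical, coordinates are assumed to be convertible to millimetres with a single `mm_per_pixel` factor, and the 2 mm success threshold is an assumed convention because the abstract does not state which threshold its detection rate uses.

```python
import math


def radial_errors(pred, truth, mm_per_pixel=1.0):
    """Euclidean distance in mm between each predicted and reference landmark."""
    return [math.dist(p, t) * mm_per_pixel for p, t in zip(pred, truth)]


def mean_radial_error(errors):
    """Average radial error over all landmarks."""
    return sum(errors) / len(errors)


def detection_rate(errors, threshold_mm=2.0):
    """Percentage of landmarks whose error is within the threshold (assumed 2 mm)."""
    return 100.0 * sum(e <= threshold_mm for e in errors) / len(errors)


pred = [(0.0, 0.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.0, 0.0)]
errors = radial_errors(pred, truth)
mre = mean_radial_error(errors)
sdr = detection_rate(errors)
```

Reporting both numbers matters because a low mean error can hide a few landmarks that miss by a wide margin, which the detection rate exposes.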

[CV-61] PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings ICRA2025

Link: https://arxiv.org/abs/2409.15832
Authors: Sutharsan Mahendren, Saimunur Rahman, Piotr Koniusz, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Peyman Moghadam
Keywords: enhances global feature, cloud mask autoencoder, self-supervised learning framework, point cloud mask, global feature representation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICRA2025

Abstract:We propose PseudoNeg-MAE, a novel self-supervised learning framework that enhances global feature representation of point cloud mask autoencoder by making them both discriminative and sensitive to transformations. Traditional contrastive learning methods focus on achieving invariance, which can lead to the loss of valuable transformation-related information. In contrast, PseudoNeg-MAE explicitly models the relationship between original and transformed data points using a parametric network COPE, which learns the localized displacements caused by transformations within the latent space. However, jointly training COPE with the MAE leads to undesirable trivial solutions where COPE outputs collapse to an identity. To address this, we introduce a novel loss function incorporating pseudo-negatives, which effectively penalizes these trivial invariant solutions and promotes transformation sensitivity in the embeddings. We validate PseudoNeg-MAE on shape classification and relative pose estimation tasks, where PseudoNeg-MAE achieves state-of-the-art performance on the ModelNet40 and ScanObjectNN datasets under challenging evaluation protocols and demonstrates superior accuracy in estimating relative poses. These results show the effectiveness of PseudoNeg-MAE in learning discriminative and transformation-sensitive representations.

[CV-62] Layer-wise Model Merging for Unsupervised Domain Adaptation in Segmentation Tasks

Link: https://arxiv.org/abs/2409.15813
Authors: Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Jose M Martínez
Keywords: creation and inference, strategy to enhance, prior work, work is limited, ensemble creation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Abstract: Merging parameters of multiple models has resurfaced as an effective strategy to enhance task performance and robustness, but prior work is limited by the high costs of ensemble creation and inference. In this paper, we leverage the abundance of freely accessible trained models to introduce a cost-free approach to model merging. It focuses on a layer-wise integration of merged models, aiming to maintain the distinctiveness of the task-specific final layers while unifying the initial layers, which are primarily associated with feature extraction. This approach ensures parameter consistency across all layers, essential for boosting performance. Moreover, it facilitates seamless integration of knowledge, enabling effective merging of models from different datasets and tasks. Specifically, we investigate its applicability in Unsupervised Domain Adaptation (UDA), an unexplored area for model merging, for Semantic and Panoptic Segmentation. Experimental results demonstrate substantial UDA improvements without additional costs for merging same-architecture models from distinct datasets (+2.6% mIoU) and different-architecture models with a shared backbone (+6.8% mIoU). Furthermore, merging Semantic and Panoptic Segmentation models increases mPQ by 7%. These findings are validated across a wide variety of UDA strategies, architectures, and datasets.
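
The layer-wise idea (unify the early feature-extraction layers, keep the task-specific final layers distinct) can be sketched with toy state dictionaries. This is a hedged illustration rather than the paper's procedure: plain averaging is an assumed merging rule, weights are flat lists of floats, and the `head.` prefix and function name are hypothetical.

```python
def merge_layerwise(model_a, model_b, task_specific_prefixes=("head.",)):
    """Average shared layers of two same-shaped models; copy task-specific
    layers (matched by name prefix) unchanged from model_a."""
    merged = {}
    for name, weights_a in model_a.items():
        if name.startswith(task_specific_prefixes):
            merged[name] = list(weights_a)  # keep the final layers distinct
        else:
            weights_b = model_b[name]
            merged[name] = [(a + b) / 2 for a, b in zip(weights_a, weights_b)]
    return merged


model_a = {"backbone.w": [1.0, 3.0], "head.w": [10.0]}
model_b = {"backbone.w": [3.0, 5.0], "head.w": [20.0]}
merged = merge_layerwise(model_a, model_b)
```

Because merging only touches stored parameters, it adds no training or inference cost beyond loading the two checkpoints.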

[CV-63] Aided design of bridge aesthetics based on Stable Diffusion fine-tuning

Link: https://arxiv.org/abs/2409.15812
Authors: Leye Zhang, Xiangxiang Tian, Chengli Zhang, Hongjun Zhang
Keywords: assist bridge-type innovation, Stable Diffusion, Stable Diffusion fine-tuning, Diffusion fine-tuning technique, bridge-type innovation
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 13 figures

Abstract: The Stable Diffusion fine-tuning technique is applied to assist bridge-type innovation. A dataset of real bridge photos is built, and Stable Diffusion is fine-tuned using four methods: Textual Inversion, Dreambooth, Hypernetwork, and LoRA. All of them can capture the main characteristics of the dataset images and achieve personalized customization of Stable Diffusion. Through fine-tuning, Stable Diffusion is not only a drawing tool but also has the designer’s innovative thinking ability. The fine-tuned model can generate a large number of innovative new bridge types, which can provide rich inspiration for human designers. The results show that this technology can be used as an engine of creativity and a power multiplier for human designers.

[CV-64] Hyperbolic Image-and-Pointcloud Contrastive Learning for 3D Classification IROS2024

Link: https://arxiv.org/abs/2409.15810
Authors: Naiwen Hu, Haozhe Cheng, Yifan Xie, Pengcheng Shi, Jihua Zhu
Keywords: exhibited remarkable efficacy, downstream tasks, exhibited remarkable, remarkable efficacy, existing contrastive learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IROS2024

Abstract:3D contrastive representation learning has exhibited remarkable efficacy across various downstream tasks. However, existing contrastive learning paradigms based on cosine similarity fail to deeply explore the potential intra-modal hierarchical and cross-modal semantic correlations about multi-modal data in Euclidean space. In response, we seek solutions in hyperbolic space and propose a hyperbolic image-and-pointcloud contrastive learning method (HyperIPC). For the intra-modal branch, we rely on the intrinsic geometric structure to explore the hyperbolic embedding representation of point cloud to capture invariant features. For the cross-modal branch, we leverage images to guide the point cloud in establishing strong semantic hierarchical correlations. Empirical experiments underscore the outstanding classification performance of HyperIPC. Notably, HyperIPC enhances object classification results by 2.8% and few-shot classification outcomes by 5.9% on ScanObjectNN compared to the baseline. Furthermore, ablation studies and confirmatory testing validate the rationality of HyperIPC’s parameter settings and the effectiveness of its submodules.

[CV-65] A Computer Vision Approach for Autonomous Cars to Drive Safe at Construction Zone

Link: https://arxiv.org/abs/2409.15809
Authors: Abu Shad Ahammed, Md Shahi Amran Hossain, Roman Obermaisser
Keywords: sustainable transportation system, autonomous driving system, key requirement, road transportation system, transportation system
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 6 Pages, Double columns

Abstract: To build a smarter and safer city, a secure, efficient, and sustainable transportation system is a key requirement. The autonomous driving system (ADS) plays an important role in the development of smart transportation and is considered one of the major challenges facing the automotive sector in recent decades. A car equipped with an ADS comes with various cutting-edge functionalities such as adaptive cruise control, collision alerts, automated parking, and more. A primary area of research within ADAS involves identifying road obstacles in construction zones regardless of the driving environment. This paper presents an innovative and highly accurate road obstacle detection model utilizing computer vision technology that can be activated in construction zones and functions under diverse drift conditions, ultimately contributing to building a safer road transportation system. The model developed with the YOLO framework achieved a mean average precision exceeding 94% and demonstrated an inference time of 1.6 milliseconds on the validation dataset, underscoring the robustness of the methodology applied to mitigate hazards and risks for autonomous vehicles.

[CV-66] 3D-JEPA: A Joint Embedding Predictive Architecture for 3D Self-Supervised Representation Learning

Link: https://arxiv.org/abs/2409.15803
Authors: Naiwen Hu, Haozhe Cheng, Yifan Xie, Shiqi Li, Jihua Zhu
Keywords: Invariance-based and generative, self-supervised representation learning, generative methods, methods have shown, shown a conspicuous
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Invariance-based and generative methods have shown a conspicuous performance for 3D self-supervised representation learning (SSRL). However, the former relies on hand-crafted data augmentations that introduce bias not universally applicable to all downstream tasks, and the latter indiscriminately reconstructs masked regions, resulting in irrelevant details being saved in the representation space. To solve the problem above, we introduce 3D-JEPA, a novel non-generative 3D SSRL framework. Specifically, we propose a multi-block sampling strategy that produces a sufficiently informative context block and several representative target blocks. We present the context-aware decoder to enhance the reconstruction of the target blocks. Concretely, the context information is fed to the decoder continuously, facilitating the encoder in learning semantic modeling rather than memorizing the context information related to target blocks. Overall, 3D-JEPA predicts the representation of target blocks from a context block using the encoder and context-aware decoder architecture. Various downstream tasks on different datasets demonstrate 3D-JEPA’s effectiveness and efficiency, achieving higher accuracy with fewer pretraining epochs, e.g., 88.65% accuracy on PB_T50_RS with 150 pretraining epochs.

[CV-67] DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation ECCV

Link: https://arxiv.org/abs/2409.15801
Authors: Soojin Jang, Jungmin Yun, Junehyoung Kwon, Eunju Lee, Youngbin Kim
Keywords: Weakly supervised semantic, approaches typically rely, initial seed generation, global context due, supervised semantic segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by the European Conference on Computer Vision (ECCV), 2024

Abstract: Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, which often fail to capture global context due to limited supervision from image-level labels. To address this issue, we introduce DALNet, a Dense Alignment Learning Network that leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual-level alignment strategy: (1) Global Implicit Alignment (GIA) to capture global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing the similarity with background embeddings, and (2) Local Explicit Alignment (LEA) to improve object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross-contrastive learning approach that aligns foreground features between image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state-of-the-art WSSS methods. In particular, our approach allows for a more efficient end-to-end process as a single-stage method.

[CV-68] Training Data Attribution: Was Your Model Secretly Trained On Data Created By Mine?

Link: https://arxiv.org/abs/2409.15781
Authors: Likun Zhang, Hao Wu, Lingcui Zhang, Fengyuan Xu, Jin Cao, Fenghua Li, Ben Niu
Keywords: source model, sparked significant interest, recently sparked significant, model, model training data
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: The emergence of text-to-image models has recently sparked significant interest, but it is accompanied by the looming risk of infringement through violation of user terms. Specifically, an adversary may exploit data created by a commercial model to train their own without proper authorization. To address such risk, it is crucial to investigate the attribution of a suspicious model’s training data by determining whether its training data originates, wholly or partially, from a specific source model. To trace the generated data, existing methods require applying extra watermarks during either the training or inference phases of the source model. However, these methods are impractical for pre-trained models that have been released, especially when model owners lack security expertise. To tackle this challenge, we propose an injection-free training data attribution method for text-to-image models. It can identify whether a suspicious model’s training data stems from a source model, without additional modifications to the source model. The crux of our method lies in the inherent memorization characteristic of text-to-image models. Our core insight is that the memorization of the training dataset is passed down through the data generated by the source model to the model trained on that data, making the source model and the infringing model exhibit consistent behaviors on specific samples. Therefore, our approach involves developing algorithms to uncover these distinct samples and using them as inherent watermarks to verify if a suspicious model originates from the source model. Our experiments demonstrate that our method achieves an accuracy of over 80% in identifying the source of a suspicious model’s training data, without interfering with the original training or generation process of the source model.

[CV-69] ManiNeg: Manifestation-guided Multimodal Pretraining for Mammography Classification

Link: https://arxiv.org/abs/2409.15745
Authors: Xujun Li, Xin Wei, Jing Jiang, Danxiang Chen, Wei Zhang, Jinpeng Li
Keywords: hard negative samples, human health, Breast cancer, significant threat, threat to human
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Breast cancer is a significant threat to human health. Contrastive learning has emerged as an effective method to extract critical lesion features from mammograms, thereby offering a potent tool for breast cancer screening and analysis. A crucial aspect of contrastive learning involves negative sampling, where the selection of appropriate hard negative samples is essential for driving representations to retain detailed information about lesions. In contrastive learning, it is often assumed that features can sufficiently capture semantic content, and that each minibatch inherently includes ideal hard negative samples. However, the characteristics of breast lumps challenge these assumptions. In response, we introduce ManiNeg, a novel approach that leverages manifestations as proxies to mine hard negative samples. Manifestations, which refer to the observable symptoms or signs of a disease, provide a knowledge-driven and robust basis for choosing hard negative samples. This approach benefits from its invariance to model optimization, facilitating efficient sampling. To support ManiNeg and future research endeavors, we developed the MVKL dataset, which includes multi-view mammograms, corresponding reports, meticulously annotated manifestations, and pathologically confirmed benign-malignant outcomes. We evaluate ManiNeg on the benign and malignant classification task. Our results demonstrate that ManiNeg not only improves representation in both unimodal and multimodal contexts but also shows generalization across datasets. The MVKL dataset and our codes are publicly available at this https URL.

[CV-70] ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features

Link: https://arxiv.org/abs/2409.15744
Authors: Xin Wei, Yaling Tao, Changde Du, Gangming Zhao, Yizhou Yu, Jinpeng Li
Keywords: breast cancer diagnosis, primary imaging tool, cancer diagnosis, primary imaging, imaging tool
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Mammography is the primary imaging tool for breast cancer diagnosis. Despite significant strides in applying deep learning to interpret mammography images, efforts that focus predominantly on visual features often struggle with generalization across datasets. We hypothesize that integrating additional modalities in the radiology practice, notably the linguistic features of reports and manifestation features embodying radiological insights, offers a more powerful, interpretable and generalizable representation. In this paper, we announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports. Based on this dataset, we focus on the challenging task of unsupervised pretraining and propose ViKL, an innovative framework that synergizes Visual, Knowledge, and Linguistic features. This framework relies solely on pairing information without the necessity for pathology labels, which are often challenging to acquire. ViKL employs a triple contrastive learning approach to merge linguistic and knowledge-based insights with visual data, enabling both inter-modality and intra-modality feature enhancement. Our research yields significant findings: 1) Integrating reports and manifestations with unsupervised visual pretraining, ViKL substantially enhances the pathological classification and fosters multimodal interactions. 2) Manifestations can introduce a novel hard negative sample selection mechanism. 3) The multimodal features demonstrate transferability across different datasets. 4) The multimodal pretraining approach curbs miscalibrations and crafts a high-quality representation space. The MVKL dataset and ViKL code are publicly available at this https URL to support a broad spectrum of future research.

[CV-71] Real-Time Pedestrian Detection on IoT Edge Devices: A Lightweight Deep Learning Approach

Link: https://arxiv.org/abs/2409.15740
Authors: Muhammad Dany Alfikri, Rafael Kaliski
Keywords: everyday lives, Artificial intelligence, Computer vision, Edge servers, Edge
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments: 10 pages, 3 tables, 12 figures, article submitted to IEEE for possible publication

Abstract: Artificial intelligence (AI) has become integral to our everyday lives. Computer vision has advanced to the point where it can play the safety critical role of detecting pedestrians at road intersections in intelligent transportation systems and alert vehicular traffic as to potential collisions. Centralized computing analyzes camera feeds and generates alerts for nearby vehicles. However, real-time applications face challenges such as latency, limited data transfer speeds, and the risk of life loss. Edge servers offer a potential solution for real-time applications, providing localized computing and storage resources and lower response times. Unfortunately, edge servers have limited processing power. Lightweight deep learning (DL) techniques enable edge servers to utilize compressed deep neural network (DNN) models. The research explores implementing a lightweight DL model on Artificial Intelligence of Things (AIoT) edge devices. An optimized You Only Look Once (YOLO) based DL model is deployed for real-time pedestrian detection, with detection events transmitted to the edge server using the Message Queuing Telemetry Transport (MQTT) protocol. The simulation results demonstrate that the optimized YOLO model can achieve real-time pedestrian detection, with a fast inference speed of 147 milliseconds, a frame rate of 2.3 frames per second, and an accuracy of 78%, representing significant improvements over baseline models.

[CV-72] Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint ECCV’2024

Link: https://arxiv.org/abs/2409.15739
Authors: Sixiang Chen,Tian Ye,Kai Zhang,Zhaohu Xing,Yunlong Lin,Lei Zhu
Keywords: pose significant challenges, real world pose, world pose significant, Recent advancements, adverse weather restoration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV’2024

Abstract:Recent advancements in adverse weather restoration have shown potential, yet the unpredictable and varied combinations of weather degradations in the real world pose significant challenges. Previous methods typically struggle with dynamically handling intricate degradation combinations and carrying on background reconstruction precisely, leading to performance and generalization limitations. Drawing inspiration from prompt learning and the “Teaching Tailored to Talent” concept, we introduce a novel pipeline, T3-DiffWeather. Specifically, we employ a prompt pool that allows the network to autonomously combine sub-prompts to construct weather-prompts, harnessing the necessary attributes to adaptively tackle unforeseen weather input. Moreover, from a scene modeling perspective, we incorporate general prompts constrained by Depth-Anything feature to provide the scene-specific condition for the diffusion process. Furthermore, by incorporating contrastive prompt loss, we ensure distinctive representations for both types of prompts by a mutual pushing strategy. Experimental results demonstrate that our method achieves state-of-the-art performance across various synthetic and real-world datasets, markedly outperforming existing diffusion techniques in terms of computational efficiency.

[CV-73] LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation ECCV2024

Link: https://arxiv.org/abs/2409.15727
Authors: Ruida Zhang,Ziqin Huang,Gu Wang,Chenyangguang Zhang,Yan Di,Xingxing Zuo,Jiwen Tang,Xiangyang Ji
Keywords: depth data limits, estimation hold promise, pose estimation hold, object pose estimation, hold promise
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024

Abstract:While RGBD-based methods for category-level object pose estimation hold promise, their reliance on depth data limits their applicability in diverse scenarios. In response, recent efforts have turned to RGB-based methods; however, they face significant challenges stemming from the absence of depth information. On one hand, the lack of depth exacerbates the difficulty in handling intra-class shape variation, resulting in increased uncertainty in shape predictions. On the other hand, RGB-only inputs introduce inherent scale ambiguity, rendering the estimation of object size and translation an ill-posed problem. To tackle these challenges, we propose LaPose, a novel framework that models the object shape as the Laplacian mixture model for Pose estimation. By representing each point as a probabilistic distribution, we explicitly quantify the shape uncertainty. LaPose leverages both a generalized 3D information stream and a specialized feature stream to independently predict the Laplacian distribution for each point, capturing different aspects of object geometry. These two distributions are then integrated as a Laplacian mixture model to establish the 2D-3D correspondences, which are utilized to solve the pose via the PnP module. In order to mitigate scale ambiguity, we introduce a scale-agnostic representation for object size and translation, enhancing training efficiency and overall robustness. Extensive experiments on the NOCS datasets validate the effectiveness of LaPose, yielding state-of-the-art performance in RGB-based category-level object pose estimation. Codes are released at this https URL

[CV-74] Disentangled Generation and Aggregation for Robust Radiance Fields ECCV’2024

Link: https://arxiv.org/abs/2409.15715
Authors: Shihe Shen,Huachen Gao,Wangze Xu,Rui Peng,Luyang Tang,Kaiqiang Xiong,Jianbo Jiao,Ronggang Wang
Keywords: low computation cost, triplane-based radiance fields, recent years due, effectively disentangle, computation cost
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 27 pages, 11 figures, Accepted by ECCV’2024

Abstract:The utilization of the triplane-based radiance fields has gained attention in recent years due to its ability to effectively disentangle 3D scenes with a high-quality representation and low computation cost. A key requirement of this method is the precise input of camera poses. However, due to the local update property of the triplane, a similar joint estimation as previous joint pose-NeRF optimization works easily results in local minima. To this end, we propose the Disentangled Triplane Generation module to introduce global feature context and smoothness into triplane learning, which mitigates errors caused by local updating. Then, we propose the Disentangled Plane Aggregation to mitigate the entanglement caused by the common triplane feature aggregation during camera pose updating. In addition, we introduce a two-stage warm-start training strategy to reduce the implicit constraints caused by the triplane generator. Quantitative and qualitative results demonstrate that our proposed method achieves state-of-the-art performance in novel view synthesis with noisy or unknown camera poses, as well as efficient convergence of optimization. Project page: this https URL.

[CV-75] Plenoptic PNG: Real-Time Neural Radiance Fields in 150 KB

Link: https://arxiv.org/abs/2409.15689
Authors: Jae Yong Lee,Yuqun Wu,Chuhang Zou,Derek Hoiem,Shenlong Wang
Keywords: extremely compact representation, extremely compact, Gaussian Splats, compact representation, NeRFs and Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The goal of this paper is to encode a 3D scene into an extremely compact representation from 2D images and to enable its transmittance, decoding and rendering in real-time across various platforms. Despite the progress in NeRFs and Gaussian Splats, their large model size and specialized renderers make it challenging to distribute free-viewpoint 3D content as easily as images. To address this, we have designed a novel 3D representation that encodes the plenoptic function into sinusoidal function indexed dense volumes. This approach facilitates feature sharing across different locations, improving compactness over traditional spatial voxels. The memory footprint of the dense 3D feature grid can be further reduced using spatial decomposition techniques. This design combines the strengths of spatial hashing functions and voxel decomposition, resulting in a model size as small as 150 KB for each 3D scene. Moreover, PPNG features a lightweight rendering pipeline with only 300 lines of code that decodes its representation into standard GL textures and fragment shaders. This enables real-time rendering using the traditional GL pipeline, ensuring universal compatibility and efficiency across various platforms without additional dependencies.

[CV-76] PDT: Uav Target Detection Dataset for Pests and Diseases Tree

Link: https://arxiv.org/abs/2409.15679
Authors: Mingle Zhou,Rui Xing,Delong Han,Zhiyong Qi,Gang Li
Keywords: PDT dataset, visual weed identification, dataset, proposed PDT dataset, CWC dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 11 figures, European Conference on Computer Vision 2024

Abstract:UAVs emerge as the optimal carriers for visual weed identification and integrated pest and disease management in crops. However, the absence of specialized datasets impedes the advancement of model development in this domain. To address this, we have developed the Pests and Diseases Tree dataset (PDT dataset). PDT dataset represents the first high-precision UAV-based dataset for targeted detection of tree pests and diseases, which is collected in real-world operational environments and aims to fill the gap in available datasets for this field. Moreover, by aggregating public datasets and network data, we further introduced the Common Weed and Crop dataset (CWC dataset) to address the challenge of inadequate classification capabilities of test models within datasets for this field. Finally, we propose the YOLO-Dense Pest (YOLO-DP) model for high-precision object detection of weed, pest, and disease crop images. We re-evaluate the state-of-the-art detection models with our proposed PDT dataset and CWC dataset, showing the completeness of the dataset and the effectiveness of the YOLO-DP. The proposed PDT dataset, CWC dataset, and YOLO-DP model are presented at this https URL.

[CV-77] Autonomous Hiking Trail Navigation via Semantic Segmentation and Geometric Analysis

Link: https://arxiv.org/abs/2409.15671
Authors: Camndon Reed,Christopher Tatsch,Jason N. Gross,Yu Gu
Keywords: Natural environments pose, pose significant challenges, environments pose significant, ever-changing nature, autonomous robot navigation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Natural environments pose significant challenges for autonomous robot navigation, particularly due to their unstructured and ever-changing nature. Hiking trails, with their dynamic conditions influenced by weather, vegetation, and human traffic, represent one such challenge. This work introduces a novel approach to autonomous hiking trail navigation that balances trail adherence with the flexibility to adapt to off-trail routes when necessary. The solution is a Traversability Analysis module that integrates semantic data from camera images with geometric information from LiDAR to create a comprehensive understanding of the surrounding terrain. A planner uses this traversability map to navigate safely, adhering to trails while allowing off-trail movement when necessary to avoid on-trail hazards or for safe off-trail shortcuts. The method is evaluated through simulation to determine the balance between semantic and geometric information in traversability estimation. These simulations tested various weights to assess their impact on navigation performance across different trail scenarios. Weights were then validated through field tests at the West Virginia University Core Arboretum, demonstrating the method’s effectiveness in a real-world environment.
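
The weighting between semantic and geometric information described above can be illustrated with a minimal sketch. It assumes the two sources are already per-cell scores in [0, 1] on the same grid and that they are blended by a convex combination; the abstract does not state the actual fusion rule, so `fuse_traversability` and `semantic_weight` are hypothetical names used only for illustration.

```python
from typing import List

Grid = List[List[float]]


def fuse_traversability(semantic: Grid, geometric: Grid, semantic_weight: float) -> Grid:
    """Blend two per-cell score grids with a convex combination (assumed rule)."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must lie in [0, 1]")
    geometric_weight = 1.0 - semantic_weight
    return [
        [semantic_weight * s + geometric_weight * g for s, g in zip(s_row, g_row)]
        for s_row, g_row in zip(semantic, geometric)
    ]


# One row, two cells: the semantic source favors the first cell,
# the geometric source favors the second.
fused = fuse_traversability([[1.0, 0.0]], [[0.0, 1.0]], semantic_weight=0.75)
```

Sweeping `semantic_weight` over a range of values is one simple way to reproduce the kind of weight study the abstract mentions.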

[CV-78] ImPoster: Text and Frequency Guidance for Subject Driven Action Personalization using Diffusion Models

Link: https://arxiv.org/abs/2409.15650
Authors: Divya Kothandaraman,Kuldeep Kulkarni,Sumit Shekhar,Balaji Vasan Srinivasan,Dinesh Manocha
Keywords: image, driving image, driving action, driving, subject
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present ImPoster, a novel algorithm for generating a target image of a ‘source’ subject performing a ‘driving’ action. The inputs to our algorithm are a single pair of a source image with the subject that we wish to edit and a driving image with a subject of an arbitrary class performing the driving action, along with the text descriptions of the two images. Our approach is completely unsupervised and does not require any access to additional annotations like keypoints or pose. Our approach builds on a pretrained text-to-image latent diffusion model and learns the characteristics of the source and the driving image by finetuning the diffusion model for a small number of iterations. At inference time, ImPoster performs step-wise text prompting i.e. it denoises by first moving in the direction of the image manifold corresponding to the driving image followed by the direction of the image manifold corresponding to the text description of the desired target image. We propose a novel diffusion guidance formulation, image frequency guidance, to steer the generation towards the manifold of the source subject and the driving action at every step of the inference denoising. Our frequency guidance formulations are derived from the frequency domain properties of images. We extensively evaluate ImPoster on a diverse set of source-driving image pairs to demonstrate improvements over baselines. To the best of our knowledge, ImPoster is the first approach towards achieving both subject-driven as well as action-driven image personalization. Code and data is available at this https URL.

[CV-79] Personalized Federated Learning via Backbone Self-Distillation ACM-MM

Link: https://arxiv.org/abs/2409.15636
Authors: Pengju Wang,Bochao Liu,Dan Zeng,Chenggang Yan,Shiming Ge
Keywords: frequently necessitates training, necessitates training personalized, learning frequently necessitates, federated learning frequently, training personalized models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ACM MMAsia 2023

Abstract:In practical scenarios, federated learning frequently necessitates training personalized models for each client using heterogeneous data. This paper proposes a backbone self-distillation approach to facilitate personalized federated learning. In this approach, each client trains its local model and only sends the backbone weights to the server. These weights are then aggregated to create a global backbone, which is returned to each client for updating. However, the client’s local backbone lacks personalization because of the common representation. To solve this problem, each client further performs backbone self-distillation by using the global backbone as a teacher and transferring knowledge to update the local backbone. This process involves learning two components: the shared backbone for common representation and the private head for local personalization, which enables effective global knowledge transfer. Extensive experiments and comparisons with 12 state-of-the-art approaches demonstrate the effectiveness of our approach.
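
The server-side step described above, in which only backbone weights are sent and aggregated while heads stay private, might look like the following sketch. It assumes parameters are stored as flat lists keyed by name, that backbone parameters share a `backbone.` prefix, and that aggregation is an unweighted mean; none of these details come from the abstract, and the self-distillation step itself is not shown.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def aggregate_backbones(client_models: List[Weights], backbone_prefix: str = "backbone.") -> Weights:
    """Average backbone parameters across clients; head parameters are ignored."""
    n_clients = len(client_models)
    backbone_keys = [k for k in client_models[0] if k.startswith(backbone_prefix)]
    return {
        key: [
            sum(model[key][i] for model in client_models) / n_clients
            for i in range(len(client_models[0][key]))
        ]
        for key in backbone_keys
    }


# Two hypothetical clients with one backbone tensor and one private head tensor.
clients = [
    {"backbone.w": [1.0, 2.0], "head.w": [10.0]},
    {"backbone.w": [3.0, 4.0], "head.w": [20.0]},
]
global_backbone = aggregate_backbones(clients)
```

In the approach the abstract describes, each client would then use this global backbone as a teacher when updating its local backbone, rather than simply overwriting it.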

[CV-80] KISS-Matcher: Fast and Robust Point Cloud Registration Revisited

Link: https://arxiv.org/abs/2409.15615
Authors: Hyungtae Lim,Daebeom Kim,Gunhee Shin,Jingnan Shi,Ignacio Vizzo,Hyun Myung,Jaesik Park,Luca Carlone
Keywords: global point cloud, cloud registration systems, point cloud registration, specific components, pose solvers
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 9 figures

Abstract:While global point cloud registration systems have advanced significantly in all aspects, many studies have focused on specific components, such as feature extraction, graph-theoretic pruning, or pose solvers. In this paper, we take a holistic view on the registration problem and develop an open-source and versatile C++ library for point cloud registration, called KISS-Matcher. KISS-Matcher combines a novel feature detector, Faster-PFH, that improves over the classical fast point feature histogram (FPFH). Moreover, it adopts a k-core-based graph-theoretic pruning to reduce the time complexity of rejecting outlier correspondences. Finally, it combines these modules in a complete, user-friendly, and ready-to-use pipeline. As verified by extensive experiments, KISS-Matcher has superior scalability and broad applicability, achieving a substantial speed-up compared to state-of-the-art outlier-robust registration pipelines while preserving accuracy. Our code will be available at this https URL.
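
To make the k-core idea concrete, here is a small sketch of the standard peeling algorithm on a compatibility graph, where nodes stand for candidate correspondences and an edge means two correspondences are mutually consistent. This is a generic textbook k-core computation in Python, not the KISS-Matcher C++ implementation, and the graph encoding is an assumption made for illustration.

```python
from collections import deque
from typing import Dict, Set


def k_core(adjacency: Dict[int, Set[int]], k: int) -> Set[int]:
    """Return the nodes of the maximal subgraph where every node has degree >= k."""
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    removed: Set[int] = set()
    queue = deque(node for node, d in degree.items() if d < k)
    while queue:
        node = queue.popleft()
        if node in removed:
            continue  # a node can be queued more than once
        removed.add(node)
        for other in adjacency[node]:
            if other not in removed:
                degree[other] -= 1
                if degree[other] < k:
                    queue.append(other)
    return set(adjacency) - removed


# Correspondences 0-3 are mutually consistent; 4 and 5 are weakly supported.
graph = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
    5: set(),
}
inliers = k_core(graph, k=3)
```

Peeling low-degree nodes touches each edge a bounded number of times, which is why this kind of pruning is cheap compared with searching for a maximum clique.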

[CV-81] Assessment of Submillimeter Precision via Structure from Motion Technique in Close-Range Capture Environments

Link: https://arxiv.org/abs/2409.15602
Authors: Francisco Roza de Moraes,Irineu da Silva
Keywords: cost-effective structural monitoring, structural monitoring strategy, Structure from Motion, Motion technique, monitoring strategy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This study comprises 23 pages, 15 figures, and 5 tables. It is part of an ongoing PhD thesis currently under development

Abstract:Creating 3D models through the Structure from Motion technique is a recognized, efficient, cost-effective structural monitoring strategy. This technique is applied in several engineering fields, particularly for creating models of large structures from photographs taken a few tens of meters away. However, discussions about its usability and the procedures for conducting laboratory analysis, such as structural tests, are rarely addressed. This study investigates the potential of the SfM method to create submillimeter-quality models for structural tests, with short-distance captures. A series of experiments was carried out, with photographic captures at a 1-meter distance, using different quality settings: camera calibration model, Scale Bars dispersion, overlapping rates, and the use of vertical and oblique images. Employing a calibration model with images taken over a test board and a set of Scale Bars (SB) appropriately distributed over the test area, an overlap rate of 80 percent, and the integration of vertical and oblique images, RMSE values of approximately 0.1 mm were obtained. This result indicates the potential application of the technique for 3D modeling with submillimeter positional quality, as required for structural tests in laboratory environments.

[CV-82] MapEx: Indoor Structure Exploration with Probabilistic Information Gain from Global Map Predictions

Link: https://arxiv.org/abs/2409.15590
Authors: Cherie Ho,Seungchan Kim,Brady Moon,Aditya Parandekar,Narek Harutyunyan,Chen Wang,Katia Sycara,Graeme Best,Sebastian Scherer
Keywords: challenge in robotics, centered on understanding, understanding unknown environments, critical challenge, information gain
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:Exploration is a critical challenge in robotics, centered on understanding unknown environments. In this work, we focus on robots exploring structured indoor environments which are often predictable and composed of repeating patterns. Most existing approaches, such as conventional frontier approaches, have difficulty leveraging the predictability and explore with simple heuristics such as ‘closest first’. Recent works use deep learning techniques to predict unknown regions of the map, using these predictions for information gain calculation. However, these approaches are often sensitive to the predicted map quality or do not reason over sensor coverage. To overcome these issues, our key insight is to jointly reason over what the robot can observe and its uncertainty to calculate probabilistic information gain. We introduce MapEx, a new exploration framework that uses predicted maps to form a probabilistic sensor model for information gain estimation. MapEx generates multiple predicted maps based on observed information, and takes into consideration both the computed variances of predicted maps and estimated visible area to estimate the information gain of a given viewpoint. Experiments on the real-world KTH dataset showed on average a 12.4% improvement over representative map-prediction based exploration and a 25.4% improvement over the nearest frontier approach.
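
A rough sketch of how variance across several predicted maps and an estimated visible area could be combined into a viewpoint score is shown below. It assumes each predicted map is a grid of occupancy probabilities and that the score is the per-cell ensemble variance summed over the cells marked visible; the abstract does not give the exact formula, so this is an illustrative stand-in rather than the MapEx definition.

```python
from statistics import pvariance
from typing import List, Sequence

Grid = List[List[float]]
Mask = List[List[bool]]


def cell_variance(predictions: Sequence[Grid]) -> Grid:
    """Per-cell population variance across an ensemble of predicted maps."""
    rows, cols = len(predictions[0]), len(predictions[0][0])
    return [
        [pvariance([p[r][c] for p in predictions]) for c in range(cols)]
        for r in range(rows)
    ]


def information_gain(predictions: Sequence[Grid], visible: Mask) -> float:
    """Sum the ensemble variance over cells estimated to be visible (assumed rule)."""
    variance = cell_variance(predictions)
    return sum(
        variance[r][c]
        for r in range(len(visible))
        for c in range(len(visible[0]))
        if visible[r][c]
    )


# Two predicted maps that disagree only on the first cell.
ensemble = [[[0.0, 1.0]], [[1.0, 1.0]]]
gain_all = information_gain(ensemble, [[True, True]])
gain_second_only = information_gain(ensemble, [[False, True]])
```

Under this scoring, a viewpoint that only sees cells on which all predictions agree contributes no gain, which matches the intuition of reasoning jointly over visibility and uncertainty.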

[CV-83] FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality

Link: https://arxiv.org/abs/2409.15584
Authors: Junyuan Ding,Ziteng Wang,Chang Gao,Min Liu,Qinyu Chen
Keywords: Extended Reality, traditional frame-based systems, frame-based systems struggle, Event-based Eye Tracking, interactions in Extended
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR’s demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. We enhance the EV-Eye dataset by expanding annotated data and converting original mask labels to ellipse-based annotations to train the model. Besides, a novel trigonometric loss is adopted to address angle discontinuities and a fast causal event volume event representation method is put forward. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6× and 1.8× compared to the prior art, EV-Eye, with 4.4× and 11.7× fewer parameters and arithmetic operations. The code is available at this https URL.
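
The angle-discontinuity problem mentioned above arises because an ellipse rotated by θ and by θ + π is the same ellipse, so a plain squared error on the angle penalizes equivalent predictions. The sketch below shows one common periodic formulation, 1 − cos(2Δθ); the abstract does not spell out FACET's trigonometric loss, so this form and the name `trig_angle_loss` are assumptions for illustration only.

```python
import math


def trig_angle_loss(pred: float, target: float) -> float:
    """Periodic angle loss with period pi: zero for equivalent ellipse orientations."""
    return 1.0 - math.cos(2.0 * (pred - target))


def squared_angle_loss(pred: float, target: float) -> float:
    """Naive squared error on the raw angle, shown for comparison."""
    return (pred - target) ** 2


# Two nearly identical ellipse orientations on opposite sides of the wrap-around.
pred, target = 0.01, math.pi - 0.01
periodic = trig_angle_loss(pred, target)
naive = squared_angle_loss(pred, target)
```

Here the naive loss is large even though the two ellipses almost coincide, while the periodic loss stays close to zero.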

[CV-84] Mixing Data-driven and Geometric Models for Satellite Docking Port State Estimation using an RGB or Event Camera ICRA2025

Link: https://arxiv.org/abs/2409.15581
Authors: Cedric Le Gentil,Jack Naylor,Nuwan Munasinghe,Jasprabhjit Mehami,Benny Dai,Mikhail Asavkin,Donald G. Dansereau,Teresa Vidal-Calleja
Keywords: In-orbit automated servicing, In-orbit automated, orbital debris, promising path, path towards lowering
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE ICRA 2025

Abstract:In-orbit automated servicing is a promising path towards lowering the cost of satellite operations and reducing the amount of orbital debris. For this purpose, we present a pipeline for automated satellite docking port detection and state estimation using monocular vision data from standard RGB sensing or an event camera. Rather than taking snapshots of the environment, an event camera has independent pixels that asynchronously respond to light changes, offering advantages such as high dynamic range, low power consumption and latency, etc. This work focuses on satellite-agnostic operations (only a geometric knowledge of the actual port is required) using the recently released Lockheed Martin Mission Augmentation Port (LM-MAP) as the target. By leveraging shallow data-driven techniques to preprocess the incoming data to highlight the LM-MAP’s reflective navigational aids and then using basic geometric models for state estimation, we present a lightweight and data-efficient pipeline that can be used independently with either RGB or event cameras. We demonstrate the soundness of the pipeline and perform a quantitative comparison of the two modalities based on data collected with a photometrically accurate test bench that includes a robotic arm to simulate the target satellite’s uncontrolled motion.

[CV-85] Clinical-grade Multi-Organ Pathology Report Generation for Multi-scale Whole Slide Images via a Semantically Guided Medical Text Foundation Model

Link: https://arxiv.org/abs/2409.15574
Authors: Jing Wei Tan,SeungKyu Kim,Eunsu Kim,Sung Hak Lee,Sangjeong Ahn,Won-Ki Jeong
Keywords: pathology report generation, natural language comprehension, image recognition tasks, report generation, pathology report
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision language models (VLM) have achieved success in both natural language comprehension and image recognition tasks. However, their use in pathology report generation for whole slide images (WSIs) is still limited due to the huge size of multi-scale WSIs and the high cost of WSI annotation. Moreover, in most of the existing research on pathology report generation, sufficient validation regarding clinical efficacy has not been conducted. Herein, we propose a novel Patient-level Multi-organ Pathology Report Generation (PMPRG) model, which utilizes the multi-scale WSI features from our proposed multi-scale regional vision transformer (MR-ViT) model and their real pathology reports to guide VLM training for accurate pathology report generation. The model then automatically generates a report based on the provided key features attended regional features. We assessed our model using a WSI dataset consisting of multiple organs, including the colon and kidney. Our model achieved a METEOR score of 0.68, demonstrating the effectiveness of our approach. This model allows pathologists to efficiently generate pathology reports for patients, regardless of the number of WSIs involved.

[CV-86] Critic Loss for Image Classification

Link: https://arxiv.org/abs/2409.15565
Authors: Brendan Hogan Rappazzo,Aaron Ferber,Carla Gomes
Keywords: Modern neural network, achieve remarkable performance, frequently exhibit overconfidence, Modern neural, network classifiers achieve
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Modern neural network classifiers achieve remarkable performance across a variety of tasks; however, they frequently exhibit overconfidence in their predictions due to the cross-entropy loss. Inspired by this problem, we propose the Critic Loss for Image Classification (CrtCl, pronounced Critical). CrtCl formulates image classification training in a generator-critic framework, with a base classifier acting as a generator, and a correctness critic imposing a loss on the classifier. The base classifier, acting as the generator, given images, generates the probability distribution over classes and intermediate embeddings. The critic model, given the image, intermediate embeddings, and output predictions of the base model, predicts the probability that the base model has produced the correct classification, which then can be back propagated as a self supervision signal. Notably, the critic does not use the label as input, meaning that the critic can train the base model on both labeled and unlabeled data in semi-supervised learning settings. CrtCl represents a learned loss method for accuracy, alleviating the negative side effects of using cross-entropy loss. Additionally, CrtCl provides a powerful way to select data to be labeled in an active learning setting, by estimating the classification ability of the base model on unlabeled data. We study the effectiveness of CrtCl in low-labeled data regimes, and in the context of active learning. In classification, we find that CrtCl, compared to recent baselines, increases classifier generalization and calibration with various amounts of labeled data. In active learning, we show our method outperforms baselines in accuracy and calibration. We observe consistent results across three image classification datasets.

[CV-87] QUB-PHEO: A Visual-Based Dyadic Multi-View Dataset for Intention Inference in Collaborative Assembly

Link: https://arxiv.org/abs/2409.15560
Authors: Samuel Adebayo,Seán McLoone,Joost C. Dessing
Keywords: advancing human-robot interaction, introduces a visual-based, potential of advancing, advancing human-robot, intention inference
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Abstract:QUB-PHEO introduces a visual-based, dyadic dataset with the potential of advancing human-robot interaction (HRI) research in assembly operations and intention inference. This dataset captures rich multimodal interactions between two participants, one acting as a ‘robot surrogate,’ across a variety of assembly tasks that are further broken down into 36 distinct subtasks. With rich visual annotations, such as facial landmarks, gaze, hand movements, object localization, and more for 70 participants, QUB-PHEO offers two versions: full video data for 50 participants and visual cues for all 70. Designed to improve machine learning models for HRI, QUB-PHEO enables deeper analysis of subtle interaction cues and intentions, promising contributions to the field. The dataset will be available at this https URL subject to an End-User License Agreement (EULA).

[CV-88] Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection ECCV2024

Link: https://arxiv.org/abs/2409.15557
Authors: Alireza Ganjdanesh,Yan Kang,Yuchen Liu,Richard Zhang,Zhe Lin,Heng Huang
Keywords: generate high-quality samples, Diffusion probabilistic models, high-quality samples, generate high-quality, Diffusion probabilistic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 18th European Conference on Computer Vision, ECCV 2024

Abstract:Diffusion probabilistic models can generate high-quality samples. Yet, their sampling process requires numerous denoising steps, making it slow and computationally intensive. We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. First, we study the similarities between pairs of denoising timesteps, observing a natural clustering, even across different datasets. This suggests that rather than having a single model for all time steps, separate models can serve as “experts” for their respective time intervals. As such, we separately fine-tune the pretrained model on each interval, with elastic dimensions in depth and width, to obtain experts specialized in their corresponding denoising interval. To optimize the resource usage between experts, we introduce our Expert Routing Agent, which learns to select a set of proper network configurations. By doing so, our method can allocate the computing budget between the experts in an end-to-end manner without requiring manual heuristics. Finally, with a selected configuration, we fine-tune our pruned experts to obtain our mixture of efficient experts. We demonstrate the effectiveness of our method, DiffPruning, across several datasets, LSUN-Church, LSUN-Beds, FFHQ, and ImageNet, on the Latent Diffusion Model architecture.
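
The idea of one expert per denoising interval can be sketched as a simple lookup from timestep to expert index. The interval boundaries below are hypothetical placeholders; in the paper they come from the observed clustering of timestep similarities, and the experts themselves are pruned, fine-tuned networks that this sketch does not model.

```python
from bisect import bisect_right
from typing import Sequence


def expert_for_timestep(timestep: int, boundaries: Sequence[int]) -> int:
    """Map a denoising timestep to the index of the interval (expert) containing it.

    `boundaries` holds the sorted interior split points, so [250, 600] splits
    the timesteps into [0, 250), [250, 600), and [600, ...).
    """
    return bisect_right(boundaries, timestep)


# Hypothetical split of 1000 timesteps into three intervals.
boundaries = [250, 600]
schedule = [expert_for_timestep(t, boundaries) for t in (0, 249, 250, 599, 600, 999)]
```

During sampling, each denoising step would call the expert selected this way, so smaller experts can be assigned to intervals that tolerate more pruning.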

[CV-89] SOFI: Multi-Scale Deformable Transformer for Camera Calibration with Enhanced Line Queries

Link: https://arxiv.org/abs/2409.15553
Authors: Sebastian Janampa,Marios Pattichis
Keywords: zenith vanishing point, Camera calibration consists, estimating camera parameters, camera parameters, zenith vanishing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Camera calibration consists of estimating camera parameters such as the zenith vanishing point and horizon line. Estimating the camera parameters enables other tasks such as 3D rendering, artificial reality effects, and object insertion in an image. Transformer-based models have provided promising results; however, they lack cross-scale interaction. In this work, we introduce multi-Scale defOrmable transFormer for camera calibratIon with enhanced line queries, SOFI. SOFI improves the line queries used in CTRL-C and MSCC by using both line content and line geometric features. Moreover, SOFI’s line queries allow transformer models to adopt the multi-scale deformable attention mechanism to promote cross-scale interaction between the feature maps produced by the backbone. SOFI outperforms existing methods on the Google Street View, Horizon Line in the Wild, and Holicity datasets while keeping a competitive inference speed.

[CV-90] VaLID: Verification as Late Integration of Detections for LiDAR-Camera Fusion

Link: https://arxiv.org/abs/2409.15529
Authors: Vanshika Vats,Marzia Binta Nizam,James Davis
Keywords: Vehicle object detection, Vehicle object, Vehicle, LiDAR, camera
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vehicle object detection is possible using both LiDAR and camera data. Methods using LiDAR generally outperform those using cameras only. The highest accuracy methods utilize both of these modalities through data fusion. In our study, we propose a model-independent late fusion method, VaLID, which validates whether each predicted bounding box is acceptable or not. Our method verifies the higher-performing, yet overly optimistic LiDAR model detections using camera detections that are obtained from either specially trained, general, or open-vocabulary models. VaLID uses a simple multi-layer perceptron trained with a high recall bias to reduce the false predictions made by the LiDAR detector, while still preserving the true ones. Evaluating with multiple combinations of LiDAR and camera detectors on the KITTI dataset, we reduce false positives by an average of 63.9%, thus outperforming the individual detectors on 2D average precision (2DAP). Our approach is model-agnostic and demonstrates state-of-the-art competitive performance even when using generic camera detectors that were not trained specifically for this dataset.

[CV-91] MATCH POLICY: A Simple Pipeline from Point Cloud Registration to Manipulation Policies

Link: https://arxiv.org/abs/2409.15517
Authors: Haojie Huang,Haotian Liu,Dian Wang,Robin Walters,Robert Platt
Keywords: rearrange objects relative, rearrange objects, MATCH POLICY, objects relative, propose MATCH POLICY
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project URL: this https URL

Abstract:Many manipulation tasks require the robot to rearrange objects relative to one another. Such tasks can be described as a sequence of relative poses between parts of a set of rigid bodies. In this work, we propose MATCH POLICY, a simple but novel pipeline for solving high-precision pick and place tasks. Instead of predicting actions directly, our method registers the pick and place targets to the stored demonstrations. This transfers action inference into a point cloud registration task and enables us to realize nontrivial manipulation policies without any training. MATCH POLICY is designed to solve high-precision tasks with a key-frame setting. By leveraging the geometric interaction and the symmetries of the task, it achieves extremely high sample efficiency and generalizability to unseen configurations. We demonstrate its state-of-the-art performance across various tasks on RLBench benchmark compared with several strong baselines and test it on a real robot with six tasks.

[CV-92] SpaGBOL: Spatial-Graph-Based Orientated Localisation

Link: https://arxiv.org/abs/2409.15514
Authors: Tavis Shore,Oscar Mendez,Simon Hadfield
Keywords: urban regions, regions is challenging, challenging in part, part due, lack of geo-spatial
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Cross-View Geo-Localisation within urban regions is challenging in part due to the lack of geo-spatial structuring within current datasets and techniques. We propose utilising graph representations to model sequences of local observations and the connectivity of the target location. Modelling as a graph enables generating previously unseen sequences by sampling with new parameter configurations. To leverage this newly available information, we propose a GNN-based architecture, producing spatially strong embeddings and improving discriminability over isolated image embeddings. We outline SpaGBOL, introducing three novel contributions. 1) The first graph-structured dataset for Cross-View Geo-Localisation, containing multiple streetview images per node to improve generalisation. 2) Introducing GNNs to the problem, we develop the first system that exploits the correlation between node proximity and feature similarity. 3) Leveraging the unique properties of the graph representation - we demonstrate a novel retrieval filtering approach based on neighbourhood bearings. SpaGBOL achieves state-of-the-art accuracies on the unseen test graph - with relative Top-1 retrieval improvements on previous techniques of 11%, and 50% when filtering with Bearing Vector Matching on the SpaGBOL dataset.

[CV-93] PixelBytes: Catching Unified Embedding for Multimodal Generation

Link: https://arxiv.org/abs/2409.15512
Authors: Fabien Furfaro
Keywords: report introduces PixelBytes, multimodal representation learning, report introduces, Recurrent Neural Networks, representation learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This report introduces PixelBytes Embedding, a novel approach for unified multimodal representation learning. Our method captures diverse inputs in a single, cohesive representation, enabling emergent properties for multimodal sequence generation, particularly for text and pixelated images. Inspired by state-of-the-art sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, PixelBytes aims to address the challenges of integrating different data types. We explore various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, focusing on bidirectional processing and our innovative PxBy embedding technique. Our experiments, conducted on a specialized PixelBytes Pokémon dataset, demonstrate that bidirectional sequence models with PxBy embedding and convolutional layers can generate coherent multimodal sequences. This work contributes to the advancement of integrated AI models capable of understanding and generating multimodal data in a unified manner.

[CV-94] Analysis of Human Perception in Distinguishing Real and AI-Generated Faces: An Eye-Tracking Based Study

Link: https://arxiv.org/abs/2409.15498
Authors: Jin Huang,Subhadra Gopalakrishnan,Trisha Mittal,Jake Zuena,Jaclyn Pytlarz
Keywords: Artificial Intelligence, Intelligence have led, Recent advancements, generating realistic human, led to remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advancements in Artificial Intelligence have led to remarkable improvements in generating realistic human faces. While these advancements demonstrate significant progress in generative models, they also raise concerns about the potential misuse of these generated images. In this study, we investigate how humans perceive and distinguish between real and fake images. We designed a perceptual experiment using eye-tracking technology to analyze how individuals differentiate real faces from those generated by AI. Our analysis of StyleGAN-3 generated images reveals that participants can distinguish real from fake faces with an average accuracy of 76.80%. Additionally, we found that participants scrutinize images more closely when they suspect an image to be fake. We believe this study offers valuable insights into human perception of AI-generated media.

[CV-95] Autonomous Exploration and Semantic Updating of Large-Scale Indoor Environments with Mobile Robots

Link: https://arxiv.org/abs/2409.15493
Authors: Sai Haneesh Allu,Itay Kadosh,Tyler Summers,Yu Xiang
Keywords: semantic map, map, autonomously explore, explore an unknown, semantic
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 7 figures. Project page is available at this https URL

Abstract:We introduce a new robotic system that enables a mobile robot to autonomously explore an unknown environment, build a semantic map of the environment, and subsequently update the semantic map to reflect environment changes, such as location changes of objects. Our system leverages a LiDAR scanner for 2D occupancy grid mapping and an RGB-D camera for object perception. We introduce a semantic map representation that combines a 2D occupancy grid map for geometry, with a topological map for object semantics. This map representation enables us to effectively update the semantics by deleting or adding nodes to the topological map. Our system has been tested on a Fetch robot. The robot can semantically map a 93m x 90m floor and update the semantic map once objects are moved in the environment.

[CV-96] VLMine: Long-Tail Data Mining with Vision Language Models

Link: https://arxiv.org/abs/2409.15486
Authors: Mao Ye,Gregory P. Meyer,Zaiwei Zhang,Dennis Park,Siva Karthik Mustikovela,Yuning Chai,Eric M Wolff
Keywords: Ensuring robust performance, Ensuring robust, machine learning, autonomous driving, robust performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and on 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
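
The keyword-frequency idea in this abstract can be illustrated with a small sketch. The scoring rule below (mean negative log document frequency of an image's keywords) and the names `rarity_scores` and `corpus` are assumptions made for illustration; the abstract does not specify the actual scoring function, and the keywords here are hand-written stand-ins for VLM output.

```python
import math
from collections import Counter


def rarity_scores(keywords_per_image):
    """Score each image by the mean negative log document frequency of its keywords.

    Hypothetical scoring rule: higher scores mean the image's keywords
    appear in fewer images of the corpus.
    """
    n_images = len(keywords_per_image)
    # Document frequency: in how many images does each keyword occur?
    doc_freq = Counter(kw for kws in keywords_per_image for kw in set(kws))
    scores = []
    for kws in keywords_per_image:
        unique = set(kws)
        if not unique:
            scores.append(0.0)  # no keywords, no evidence of rarity
            continue
        total = sum(-math.log(doc_freq[kw] / n_images) for kw in unique)
        scores.append(total / len(unique))
    return scores


# Stand-in for keywords a VLM might produce for four driving images.
corpus = [
    ["car", "road"],
    ["car", "road", "pedestrian"],
    ["car", "road"],
    ["car", "horse", "carriage"],
]
scores = rarity_scores(corpus)
rarest = max(range(len(corpus)), key=scores.__getitem__)
```

In this toy corpus the last image (index 3) gets the highest score because "horse" and "carriage" occur only once. A real pipeline would need keyword normalization (synonyms, plurals) before counting.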

[CV-97] Adapting Segment Anything Model for Unseen Object Instance Segmentation ICRA2025

Link: https://arxiv.org/abs/2409.15481
Authors: Rui Cao,Chuanxin Song,Biqi Yang,Jiangliu Wang,Pheng-Ann Heng,Yun-Hui Liu
Keywords: autonomous robots operating, Object Instance Segmentation, Unseen Object Instance, Instance Segmentation, unstructured environments
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICRA 2025

Abstract:Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments. Previous approaches require full supervision on large-scale tabletop datasets for effective pretraining. In this paper, we propose UOIS-SAM, a data-efficient solution for the UOIS task that leverages SAM’s high accuracy and strong generalization capabilities. UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM’s mask decoder, mitigating issues introduced by the SAM baseline, such as background confusion and over-segmentation, especially in scenarios involving occlusion and texture-rich objects. Extensive experimental results on OCID, OSD, and additional photometrically challenging datasets including PhoCAL and HouseCat6D, demonstrate that, even using only 10% of the training samples compared to previous methods, UOIS-SAM achieves state-of-the-art performance in unseen object segmentation, highlighting its effectiveness and robustness in various tabletop scenes.

[CV-98] MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Link: https://arxiv.org/abs/2409.15477
Authors: Mohammad Shahab Sepehri,Zalan Fabian,Maryam Soltanolkotabi,Mahdi Soltanolkotabi
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, providing automated solutions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 figures

Abstract:Multimodal Large Language Models (MLLMs) have tremendous potential to improve the accuracy, availability, and cost-effectiveness of healthcare by providing automated solutions or serving as aids to medical professionals. Despite promising first steps in developing medical MLLMs in the past few years, their capabilities and limitations are not well-understood. Recently, many benchmark datasets have been proposed that test the general medical knowledge of such models across a variety of medical areas. However, the systematic failure modes and vulnerabilities of such models are severely underexplored with most medical benchmarks failing to expose the shortcomings of existing models in this safety-critical domain. In this paper, we introduce MediConfusion, a challenging medical Visual Question Answering (VQA) benchmark dataset, that probes the failure modes of medical MLLMs from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct for medical experts. Strikingly, all available models (open-source or proprietary) achieve performance below random guessing on MediConfusion, raising serious concerns about the reliability of existing medical MLLMs for healthcare deployment. We also extract common patterns of model failure that may help the design of a new generation of more trustworthy and reliable MLLMs in healthcare.

[CV-99] Matern Kernels for Tunable Implicit Surface Reconstruction

Link: https://arxiv.org/abs/2409.15466
Authors: Maximilian Weiherer,Bernhard Egger
Keywords: oriented point clouds, tunable implicit surface, implicit surface reconstruction, Matérn kernels, Matérn
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures

Abstract:We propose to use the family of Matérn kernels for tunable implicit surface reconstruction, building upon the recent success of kernel methods for 3D reconstruction of oriented point clouds. As we show from both a theoretical and a practical perspective, Matérn kernels have some appealing properties which make them particularly well suited for surface reconstruction – outperforming state-of-the-art methods based on the arc-cosine kernel while being significantly easier to implement, faster to compute, and scalable. Because Matérn kernels are stationary, we demonstrate that their spectrum can be tuned in the same fashion as Fourier feature mappings help coordinate-based MLPs to overcome spectral bias. Moreover, we theoretically analyze the Matérn kernel’s connection to SIREN networks as well as its relation to previously employed arc-cosine kernels. Finally, based on recently introduced Neural Kernel Fields, we present data-dependent Matérn kernels and conclude that especially the Laplace kernel (being part of the Matérn family) is extremely competitive, performing almost on par with state-of-the-art methods in the noise-free case while having a more than five times shorter training time.
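
For readers unfamiliar with the kernel family mentioned above, the standard textbook definition is reproduced here. The symbols (smoothness $\nu$, length scale $h$, distance $r$) are a common convention and may differ from the paper's own notation.

```latex
% Matérn kernel with smoothness \nu > 0 and length scale h > 0,
% where r = \lVert x - y \rVert and K_\nu is the modified Bessel
% function of the second kind.
k_{\nu}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}
             \left( \frac{\sqrt{2\nu}\, r}{h} \right)^{\nu}
             K_{\nu}\!\left( \frac{\sqrt{2\nu}\, r}{h} \right)

% Special cases:
k_{1/2}(r) = \exp\!\left( -\frac{r}{h} \right)
  \qquad \text{(Laplace / exponential kernel)}

\lim_{\nu \to \infty} k_{\nu}(r) = \exp\!\left( -\frac{r^{2}}{2h^{2}} \right)
  \qquad \text{(Gaussian kernel)}
```

The Laplace kernel singled out in the abstract is the $\nu = 1/2$ member of the family, and $\nu$ and $h$ are the two tunable parameters of the general form.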

[CV-100] Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models

Link: https://arxiv.org/abs/2409.15451
Authors: Mike Zhang,Kaixian Qu,Vaishakh Patil,Cesar Cadena,Marco Hutter
Keywords: Large Language Models, Large Language, common sense reasoning, Language Models, sense reasoning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large Language Models (LLM) have emerged as a tool for robots to generate task plans using common sense reasoning. For the LLM to generate actionable plans, scene context must be provided, often through a map. Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps based on queryable embeddings capable of representing any semantic class. However, embeddings cannot directly report the scene context as they are implicit, requiring further processing for LLM integration. To address this, we propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs due to their text-based nature by building upon large-scale image recognition models. We study how entities in our map can be localized and show through evaluations that our text-based map localizations perform comparably to those from open vocabulary maps while using two to four orders of magnitude less memory. Real-robot experiments demonstrate the grounding of an LLM with the text-based map to solve user tasks.

[CV-101] Revealing an Unattractivity Bias in Mental Reconstruction of Occluded Faces using Generative Image Models

Link: https://arxiv.org/abs/2409.15443
Authors: Frederik Riedmann,Bernhard Egger,Tim Rohe
Keywords: Previous studies, Previous, occluded, partially occluded, mental reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper and a corresponding poster were presented at the Cognitive Computational Neuroscience conference in 2024

Abstract:Previous studies have shown that faces are rated as more attractive when they are partially occluded. The cause of this observation remains unclear. One explanation is a mental reconstruction of the occluded face parts which is biased towards a more attractive percept as shown in face-attractiveness rating tasks. We aimed to test for this hypothesis by using a delayed matching-to-sample task, which directly requires mental reconstruction. In two online experiments, we presented observers with unattractive, neutral or attractive synthetic reconstructions of the occluded face parts using a state-of-the-art diffusion-based image generator. Our experiments do not support the initial hypothesis and reveal an unattractiveness bias for occluded faces instead. This suggests that facial attractiveness rating tasks do not prompt reconstructions. Rather, the attractivity bias may arise from global image features, and faces may actually be reconstructed with unattractive properties when mental reconstruction is applied.

[CV-102] DS2TA: Denoising Spiking Transformer with Attenuated Spatiotemporal Attention

Link: https://arxiv.org/abs/2409.15375
Authors: Boxun Xu,Hejia Geng,Yuxuan Yin,Peng Li
Keywords: current high-performance models, vision applications, current high-performance, high-performance models, models of choice
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2311.09376

Abstract:Vision Transformers (ViT) are current high-performance models of choice for various vision applications. Recent developments have given rise to biologically inspired spiking transformers that thrive in ultra-low power operations on neuromorphic hardware, however, without fully unlocking the potential of spiking neural networks. We introduce DS2TA, a Denoising Spiking transformer with attenuated SpatioTemporal Attention, designed specifically for vision applications. DS2TA introduces a new spiking attenuated spatiotemporal attention mechanism that considers input firing correlations occurring in both time and space, thereby fully harnessing the computational power of spiking neurons at the core of the transformer architecture. Importantly, DS2TA facilitates parameter-efficient spatiotemporal attention computation without introducing extra weights. DS2TA employs efficient hashmap-based nonlinear spiking attention denoisers to enhance the robustness and expressive power of spiking attention maps. DS2TA demonstrates state-of-the-art performances on several widely adopted static image and dynamic neuromorphic datasets. Operated over 4 time steps, DS2TA achieves 94.92% top-1 accuracy on CIFAR10 and 77.47% top-1 accuracy on CIFAR100, as well as 79.1% and 94.44% on CIFAR10-DVS and DVS-Gesture using 10 time steps.

[CV-103] Damage detection in an uncertain nonlinear beam based on stochastic Volterra series

Link: https://arxiv.org/abs/2409.15349
Authors: Luis Gustavo Giacon Villani,Samuel da Silva,Americo Cunha Jr
Keywords: Structural Health Monitoring, called Structural Health, Health Monitoring, commonly called Structural, Structural Health
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)

Abstract:The damage detection problem in mechanical systems, using vibration measurements, is commonly called Structural Health Monitoring (SHM). Many tools are able to detect damage from changes in the vibration pattern, mainly when the damage induces nonlinear behavior. However, a more difficult problem is to detect structural variation associated with damage when the mechanical system has nonlinear behavior even in the reference condition. In these cases, more sophisticated methods are required to detect whether the changes in the response are caused by some structural variation or by changes in the vibration regime, because both can generate nonlinearities. Among the many ways to solve this problem, the use of the Volterra series has several favorable points, because it is a generalization of the linear convolution, allowing the separation of linear and nonlinear contributions by input filtering through the Volterra kernels. On the other hand, the presence of uncertainties in mechanical systems, due to noise, geometric imperfections, manufacturing irregularities, environmental conditions, and others, can also change the responses, making the damage detection procedure more difficult. An approach based on a stochastic version of the Volterra series is proposed for detecting a breathing crack in a beam vibrating in a nonlinear regime of motion, even in the reference condition (without crack). The system uncertainties are simulated by the variation imposed in the linear stiffness and damping coefficient. The results show that the nonlinear analysis, considering the high-order Volterra kernels, allows the approach to detect the crack with a small propagation and probability confidence, even in the presence of uncertainties.
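
The sense in which the Volterra series generalizes linear convolution can be made explicit with the standard discrete-time form. The notation below (input $u$, output $y$, kernels $\mathcal{H}_\eta$, memory lengths $N_\eta$) is a common convention and is an assumption here; the paper may use different symbols.

```latex
% Discrete-time Volterra series: the output is a sum of contributions
% of increasing order \eta, each a multidimensional convolution of the
% input u with the \eta-th order kernel \mathcal{H}_\eta.
y(k) = \sum_{\eta=1}^{\infty}
       \sum_{n_1=0}^{N_\eta-1} \cdots \sum_{n_\eta=0}^{N_\eta-1}
       \mathcal{H}_{\eta}(n_1, \dots, n_\eta)
       \prod_{i=1}^{\eta} u(k - n_i)
     = y_1(k) + y_2(k) + y_3(k) + \cdots

% The first-order term is ordinary linear convolution:
y_1(k) = \sum_{n_1=0}^{N_1-1} \mathcal{H}_1(n_1)\, u(k - n_1)
```

Because each term $y_\eta$ is computed separately, the linear part $y_1$ and the nonlinear parts $y_2, y_3, \dots$ can be inspected in isolation. This is the separation of linear and nonlinear contributions that the abstract refers to.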

[CV-104] Ultrafast vision perception by neuromorphic optical flow

Link: https://arxiv.org/abs/2409.15345
Authors: Shengbo Wang,Shuo Gao,Tongming Pu,Liangbing Zhao,Arokia Nathan
Keywords: current methods primarily, methods primarily operate, capturing movement velocities, vertical dimensions, Optical flow
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 17 pages, 4 figures

Abstract:Optical flow is crucial for robotic visual perception, yet current methods primarily operate in a 2D format, capturing movement velocities only in horizontal and vertical dimensions. This limitation results in incomplete motion cues, such as missing regions of interest or detailed motion analysis of different regions, leading to delays in processing high-volume visual data in real-world settings. Here, we report a 3D neuromorphic optical flow method that leverages the time-domain processing capability of memristors to embed external motion features directly into hardware, thereby completing motion cues and dramatically accelerating the computation of movement velocities and subsequent task-specific algorithms. In our demonstration, this approach reduces visual data processing time by an average of 0.3 seconds while maintaining or improving the accuracy of motion prediction, object tracking, and object segmentation. Interframe visual processing is achieved for the first time in UAV scenarios. Furthermore, the neuromorphic optical flow algorithm’s flexibility allows seamless integration with existing algorithms, ensuring broad applicability. These advancements open unprecedented avenues for robotic perception, without the trade-off between accuracy and efficiency.

[CV-105] Video-Driven Graph Network-Based Simulators

Link: https://arxiv.org/abs/2409.15344
Authors: Franciszek Szewczyk,Gilles Louppe,Matthia Sabatelli
Keywords: precise physics simulations, typically requiring extensive, requiring extensive computational, extensive computational resources, detailed physical input
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Lifelike visualizations in design, cinematography, and gaming rely on precise physics simulations, typically requiring extensive computational resources and detailed physical input. This paper presents a method that can infer a system’s physical properties from a short video, eliminating the need for explicit parameter input, provided it is close to the training condition. The learned representation is then used within a Graph Network-based Simulator to emulate the trajectories of physical systems. We demonstrate that the video-derived encodings effectively capture the physical properties of the system and showcase a linear dependence between some of the encodings and the system’s motion.

[CV-106] StyleReiser: Stylizing Video With Reinforced Structure Guide

Link: https://arxiv.org/abs/2409.15341
Authors: Radim Spetlik,David Futschik,Daniel Sykora
Keywords: maintaining visual consistency, entire video sequence, example-based video stylization, introduce StyleReiser, maintaining visual
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We introduce StyleReiser, an example-based video stylization method that transfers style from a given keyframe to the entire video sequence while maintaining visual consistency even in distant frames where the scene structure may change significantly. Unlike previous keyframe-based methods, our approach considers consistency with the prescribed style and maintains fidelity to new structural elements appearing in the target video sequence. This combination can significantly improve the quality of the stylized sequence without the need to add more correction keyframes. We also demonstrate that our approach can notably enhance the output of text-driven video stylization methods by suppressing their structural instability and enabling the user to perform custom edits on the generated keyframes. Moreover, due to its capability to perform inference in real-time, our technique can also be applied in interactive scenarios, such as consistently stylized video calls, which are difficult to achieve using text-driven approaches.

[CV-107] Electrooptical Image Synthesis from SAR Imagery Using Generative Adversarial Networks

Link: https://arxiv.org/abs/2409.15331
Authors: Grant Rosario,David Noever
Keywords: Synthetic Aperture Radar, Aperture Radar, Synthetic Aperture, utility of Synthetic, satellite image analysis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:The utility of Synthetic Aperture Radar (SAR) imagery in remote sensing and satellite image analysis is well established, offering robustness under various weather and lighting conditions. However, SAR images, characterized by their unique structural and texture characteristics, often pose interpretability challenges for analysts accustomed to electrooptical (EO) imagery. This application compares state-of-the-art Generative Adversarial Networks (GANs) including Pix2Pix, CycleGan, S-CycleGan, and a novel dual-generator GAN utilizing partial convolutions and a novel dual-generator architecture utilizing transformers. These models are designed to progressively refine the realism in the translated optical images, thereby enhancing the visual interpretability of SAR data. We demonstrate the efficacy of our approach through qualitative and quantitative evaluations, comparing the synthesized EO images with actual EO images in terms of visual fidelity and feature preservation. The results show significant improvements in interpretability, making SAR data more accessible for analysts familiar with EO imagery. Furthermore, we explore the potential of this technology in various applications, including environmental monitoring, urban planning, and military reconnaissance, where rapid, accurate interpretation of SAR data is crucial. Our research contributes to the field of remote sensing by bridging the gap between SAR and EO imagery, offering a novel tool for enhanced data interpretation and broader application of SAR technology in various domains.

[CV-108] Texture Discrimination via Hilbert Curve Path Based Information Quantifiers

Link: https://arxiv.org/abs/2409.15327
Authors: Aurelio F. Bariviera,Roberta Hansen,Verónica E. Pastor
Keywords: range of applications, spatial arrangement, relevant due, wide range, Hilbert curve
Categories: Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)

Abstract:The analysis of the spatial arrangement of colors and roughness/smoothness of figures is relevant due to its wide range of applications. This paper proposes a texture classification method that extracts data from images using the Hilbert curve. Three information theory quantifiers are then computed: permutation entropy, permutation complexity, and Fisher information measure. The proposal exhibits some important properties: (i) it allows to discriminate figures according to varying degrees of correlations (as measured by the Hurst exponent), (ii) it is invariant to rotation and symmetry transformations, (iii) it can be used either in black and white or color images. Validations have been made not only using synthetic images but also using the well-known Brodatz image database.
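
The pipeline described above (scan the image along a Hilbert curve, then compute ordinal-pattern quantifiers on the resulting 1D series) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only the normalized Bandt-Pompe permutation entropy (not the complexity or Fisher information measures), uses the standard index-to-coordinate Hilbert mapping, and the names `hilbert_d2xy`, `hilbert_scan`, and `permutation_entropy`, the embedding dimension of 3, and the toy image are all illustrative choices.

```python
import math
from collections import Counter


def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant so sub-curves connect
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(image, order):
    """Flatten a square image (list of rows) into a 1D series along the Hilbert curve."""
    coords = (hilbert_d2xy(order, d) for d in range(4 ** order))
    return [image[y][x] for x, y in coords]


def permutation_entropy(series, dim=3):
    """Normalized permutation entropy in [0, 1]; ties are broken by position."""
    patterns = Counter(
        tuple(sorted(range(dim), key=lambda k: series[i + k]))
        for i in range(len(series) - dim + 1)
    )
    total = sum(patterns.values())
    h = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return h / math.log(math.factorial(dim))


# Toy 4x4 "texture" (order 2); the values are arbitrary.
image = [[(3 * x + 5 * y) % 7 for x in range(4)] for y in range(4)]
scan = hilbert_scan(image, order=2)
pe = permutation_entropy(scan, dim=3)
```

The Hilbert curve is used instead of a row-by-row scan because consecutive samples in the series are always adjacent pixels, so local spatial correlations survive the flattening.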

[CV-109] Deep Transfer Learning for Breast Cancer Classification

Link: https://arxiv.org/abs/2409.15313
Authors: Prudence Djagba,J. K. Buwa Mbouobda
Keywords: major global health, global health issue, Breast cancer, women worldwide, major global
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Breast cancer is a major global health issue that affects millions of women worldwide. Classification of breast cancer as early and accurately as possible is crucial for effective treatment and enhanced patient outcomes. Deep transfer learning has emerged as a promising technique for improving breast cancer classification by utilizing pre-trained models and transferring knowledge across related tasks. In this study, we examine the use of VGG, Vision Transformers (ViT) and Resnet to classify images for Invasive Ductal Carcinoma (IDC) cancer and make a comparative analysis of the algorithms. The results show a great advantage for Resnet-34, with an accuracy of 90.40% in classifying cancer images. However, the pretrained VGG-16 demonstrates a higher F1-score because there are fewer parameters to update. We believe that the field of breast cancer diagnosis stands to benefit greatly from the use of deep transfer learning. Transfer learning may help increase the accuracy and accessibility of breast cancer screening by allowing deep learning models to be trained with little data.

[CV-110] Enhancing coastal water body segmentation with Landsat Irish Coastal Segmentation (LICS) dataset

Link: https://arxiv.org/abs/2409.15311
Authors: Conor O’Sullivan,Ambrish Kashyap,Seamus Coveney,Xavier Monteys,Soumyabrata Dev
Keywords: deep learning methods, dynamic resource, human activities, Ireland coastline, critical and dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Ireland’s coastline, a critical and dynamic resource, is facing challenges such as erosion, sedimentation, and human activities. Monitoring these changes is a complex task we approach using a combination of satellite imagery and deep learning methods. However, limited research exists in this area, particularly for Ireland. This paper presents the Landsat Irish Coastal Segmentation (LICS) dataset, which aims to facilitate the development of deep learning methods for coastal water body segmentation while addressing modelling challenges specific to Irish meteorology and coastal types. The dataset is used to evaluate various automated approaches for segmentation, with U-NET achieving the highest accuracy of 95.0% among deep learning methods. Nevertheless, the Normalised Difference Water Index (NDWI) benchmark outperformed U-NET with an average accuracy of 97.2%. The study suggests that deep learning approaches can be further improved with more accurate training data and by considering alternative measurements of erosion. The LICS dataset and code are freely available to support reproducible research and further advancements in coastal monitoring efforts.
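
The NDWI baseline mentioned above is a simple band ratio, so it is easy to show what the deep learning models are being compared against. The sketch below uses the commonly cited McFeeters form, NDWI = (Green - NIR) / (Green + NIR), with a threshold of 0 to label water. The abstract does not state which NDWI variant, bands, or threshold the authors used, so those choices, the function names, and the toy reflectance values are assumptions.

```python
def ndwi(green, nir):
    """Normalised Difference Water Index for one pixel (McFeeters form)."""
    denom = green + nir
    # Guard against division by zero for pixels with no signal in either band.
    return 0.0 if denom == 0 else (green - nir) / denom


def water_mask(green_band, nir_band, threshold=0.0):
    """Label a pixel as water when its NDWI exceeds the threshold (assumed 0.0)."""
    return [
        [ndwi(g, n) > threshold for g, n in zip(g_row, n_row)]
        for g_row, n_row in zip(green_band, nir_band)
    ]


# Toy 2x2 reflectance values: water reflects more green than near-infrared,
# while vegetation and soil tend to do the opposite.
green = [[0.10, 0.08], [0.05, 0.30]]
nir = [[0.02, 0.03], [0.40, 0.25]]
mask = water_mask(green, nir)
# mask == [[True, True], [False, True]]
```

An index like this needs no training data, which is one plausible reason it remains a strong baseline when training labels are noisy, as the abstract suggests.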

[CV-111] Visual Prompting in Multimodal Large Language Models: A Survey

Link: https://arxiv.org/abs/2409.15310
Authors: Junda Wu,Zhehao Zhang,Yu Xia,Xintong Li,Zhaoyang Xia,Aaron Chang,Tong Yu,Sungchul Kim,Ryan A. Rossi,Ruiyi Zhang,Subrata Mitra,Dimitris N. Metaxas,Lina Yao,Jingbo Shang,Julian McAuley
Keywords: Multimodal large language, equip pre-trained large-language, Multimodal large, large language models, pre-trained large-language models
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Multimodal large language models (MLLMs) equip pre-trained large-language models (LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied, visual prompting has emerged for more fine-grained and free-form visual instructions. This paper presents the first comprehensive survey on visual prompting methods in MLLMs, focusing on visual prompting, prompt generation, compositional reasoning, and prompt learning. We categorize existing visual prompts and discuss generative methods for automatic prompt annotations on the images. We also examine visual prompting methods that enable better alignment between visual encoders and backbone LLMs, concerning MLLM’s visual grounding, object referring, and compositional reasoning abilities. In addition, we provide a summary of model training and in-context learning methods to improve MLLM’s perception and understanding of visual prompts. This paper examines visual prompting methods developed in MLLMs and provides a vision of the future of these methods.

[CV-112] The NGT200 Dataset: Geometric Multi-View Isolated Sign Recognition

Link: https://arxiv.org/abs/2409.15284
Authors: Oline Ranum, David R. Wessels, Gomer Otterspeer, Erik J. Bekkers, Floris Roelofsen, Jari I. Andersen
Keywords: Sign Language Processing, real-world applications, Language Processing, achieve practical, inclusive future
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Proceedings of the Geometry-grounded Representation Learning and Generative Modeling Workshop (GRaM) at the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 251, 2024

Abstract:Sign Language Processing (SLP) provides a foundation for a more inclusive future in language technology; however, the field faces several significant challenges that must be addressed to achieve practical, real-world applications. This work addresses multi-view isolated sign recognition (MV-ISR), and highlights the essential role of 3D awareness and geometry in SLP systems. We introduce the NGT200 dataset, a novel spatio-temporal multi-view benchmark, establishing MV-ISR as distinct from single-view ISR (SV-ISR). We demonstrate the benefits of synthetic data and propose conditioning sign representations on spatial symmetries inherent in sign language. Leveraging an SE(2) equivariant model improves MV-ISR performance by 8%-22% over the baseline.

[CV-113] Compressed Depth Map Super-Resolution and Restoration: AIM 2024 Challenge Results ECCV2024

Link: https://arxiv.org/abs/2409.16277
Authors: Marcos V. Conde, Florin-Alexandru Vasluianu, Jinhui Xiong, Wei Ye, Rakesh Ranjan, Radu Timofte
Keywords: efficient depth information, augmented reality, virtual reality, increasing demand, demand for augmented
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024 - Advances in Image Manipulation (AIM)

Abstract:The increasing demand for augmented reality (AR) and virtual reality (VR) applications highlights the need for efficient depth information processing. Depth maps, essential for rendering realistic scenes and supporting advanced functionalities, are typically large and challenging to stream efficiently due to their size. This challenge introduces a focus on developing innovative depth upsampling techniques to reconstruct high-quality depth maps from compressed data. These techniques are crucial for overcoming the limitations posed by depth compression, which often degrades quality, loses scene details and introduces artifacts. By enhancing depth upsampling methods, this challenge aims to improve the efficiency and quality of depth map reconstruction. Our goal is to advance the state-of-the-art in depth processing technologies, thereby enhancing the overall user experience in AR and VR applications.

[CV-114] Upper-body free-breathing Magnetic Resonance Fingerprinting applied to the quantification of water T1 and fat fraction

Link: https://arxiv.org/abs/2409.16200
Authors: Constantin Slioussarenko, Pierre-Yves Baudin, Marc Lapert, Benjamin Marty
Keywords: Magnetic Resonance Fingerprinting, multiple MRI parameters, including fat fraction, Magnetic Resonance, Resonance Fingerprinting
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures, 3 tables

Abstract:Over the past decade, Magnetic Resonance Fingerprinting (MRF) has emerged as an efficient paradigm for the rapid and simultaneous quantification of multiple MRI parameters, including fat fraction (FF), water T1 (T1_H2O), water T2 (T2_H2O), and fat T1 (T1_fat). These parameters serve as promising imaging biomarkers in various anatomical targets such as the heart, liver, and skeletal muscles. However, measuring these parameters in the upper body poses challenges due to physiological motion, particularly respiratory motion. In this work, we propose a novel approach, motion-corrected (MoCo) MRF T1-FF, which estimates the motion field using an optimized preliminary motion scan and uses it to correct the MRF acquisition data before dictionary search for reconstructing motion-corrected FF and T1_H2O parametric maps of the upper-body region. We validated this framework using an in vivo dataset comprising ten healthy volunteers and a 10-year-old boy with Duchenne muscular dystrophy. At the ROI level, in regions minimally affected by motion, no significant bias was observed between the uncorrected and MoCo reconstructions for FF (mean difference of -0.7%) and T1_H2O (-4.9 ms) values. Moreover, MoCo MRF T1-FF significantly reduced the standard deviations of distributions assessed in these regions, indicating improved precision. Notably, in regions heavily affected by motion, such as respiratory muscles, liver, and kidneys, the MRF parametric maps exhibited a marked reduction in motion blurring and streaking artifacts after motion correction. Furthermore, the diaphragm was consistently discernible on parametric maps after motion correction. This approach lays the groundwork for the joint 3D quantification of FF and T1_H2O in regions that are rarely studied, such as the respiratory muscles, particularly the intercostal muscles and diaphragm.

[CV-115] Multi-Model Ensemble Approach for Accurate Bi-Atrial Segmentation in LGE-MRI of Atrial Fibrillation Patients

Link: https://arxiv.org/abs/2409.16083
Authors: Lucas Beveridge, Le Zhang
Keywords: morbidity and mortality, prevalent form, form of cardiac, increased morbidity, Atrial fibrillation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Atrial fibrillation (AF) is the most prevalent form of cardiac arrhythmia and is associated with increased morbidity and mortality. The effectiveness of current clinical interventions for AF is often limited by an incomplete understanding of the atrial anatomical structures that sustain this arrhythmia. Late Gadolinium-Enhanced MRI (LGE-MRI) has emerged as a critical imaging modality for assessing atrial fibrosis and scarring, which are essential markers for predicting the success of ablation procedures in AF patients. The Multi-class Bi-Atrial Segmentation (MBAS) challenge at MICCAI 2024 aims to enhance the segmentation of both left and right atria and their walls using a comprehensive dataset of 200 multi-center 3D LGE-MRIs, labelled by experts. This work presents an ensemble approach that integrates multiple machine learning models, including Unet, ResNet, EfficientNet and VGG, to perform automatic bi-atrial segmentation from LGE-MRI data. The ensemble model was evaluated using the Dice Similarity Coefficient (DSC) and 95% Hausdorff distance (HD95) on the left and right atrium wall, right atrium cavity, and left atrium cavity. On the internal testing dataset, the model achieved a DSC of 88.41%, 98.48%, and 98.45% and an HD95 of 1.07, 0.95, and 0.64, respectively. This demonstrates the effectiveness of the ensemble model in improving segmentation accuracy. The approach contributes to advancing the understanding of AF and supports the development of more targeted and effective ablation strategies.
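
As a reminder of what the DSC figures measure, the Dice Similarity Coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch on flattened 0/1 masks; the function name, the empty-mask convention, and the toy masks are assumptions for illustration and do not reproduce the authors' evaluation code.

```python
def dice(pred, target):
    """Dice Similarity Coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    size_sum = sum(pred) + sum(target)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


# Hypothetical flattened masks: 2 overlapping voxels, 3 foreground voxels in each mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means perfect overlap and 0.0 means no overlap, so the reported 88.41% for the thin atrial wall versus roughly 98% for the cavities reflects how much harder small, thin structures are to match.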

[CV-116] Enhanced Unsupervised Image-to-Image Translation Using Contrastive Learning and Histogram of Oriented Gradients

Link: https://arxiv.org/abs/2409.16042
Authors: Wanchen Zhao
Keywords: Contrastive Unpaired Translation, vital area, area of computer, computer vision, vision that focuses
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

Abstract:Image-to-Image Translation is a vital area of computer vision that focuses on transforming images from one visual domain to another while preserving their core content and structure. However, this field faces two major challenges: first, the data from the two domains are often unpaired, making it difficult to train generative adversarial networks effectively; second, existing methods tend to produce artifacts or hallucinations during image generation, leading to a decline in image quality. To address these issues, this paper proposes an enhanced unsupervised image-to-image translation method based on the Contrastive Unpaired Translation (CUT) model, incorporating Histogram of Oriented Gradients (HOG) features. This novel approach ensures the preservation of the semantic structure of images, even without semantic labels, by minimizing the loss between the HOG features of input and generated images. The method was tested on translating synthetic game environments from the GTA5 dataset to realistic urban scenes in the Cityscapes dataset, demonstrating significant improvements in reducing hallucinations and enhancing image quality.

[CV-117] Deep chroma compression of tone-mapped images

Link: https://arxiv.org/abs/2409.16032
Authors: Xenios Milidonis, Francesco Banterle, Alessandro Artusi
Keywords: high dynamic range, Acquisition of high, high-quality output, high dynamic, thriving due
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Acquisition of high dynamic range (HDR) images is thriving due to the increasing use of smart devices and the demand for high-quality output. Extensive research has focused on developing methods for reducing the luminance range in HDR images using conventional and deep learning-based tone mapping operators to enable accurate reproduction on conventional 8 and 10-bit digital displays. However, these methods often fail to account for pixels that may lie outside the target display’s gamut, resulting in visible chromatic distortions or color clipping artifacts. Previous studies suggested that a gamut management step ensures that all pixels remain within the target gamut. However, such approaches are computationally expensive and cannot be deployed on devices with limited computational resources. We propose a generative adversarial network for fast and reliable chroma compression of HDR tone-mapped images. We design a loss function that considers the hue property of generated images to improve color accuracy, and train the model on an extensive image dataset. Quantitative experiments demonstrate that the proposed model outperforms state-of-the-art image generation and enhancement networks in color accuracy, while a subjective study suggests that the generated images are on par or superior to those produced by conventional chroma compression methods in terms of visual quality. Additionally, the model achieves real-time performance, showing promising results for deployment on devices with limited computational resources.

[CV-118] VascX Models: Model Ensembles for Retinal Vascular Analysis from Color Fundus Images

Link: https://arxiv.org/abs/2409.16016
Authors: Jose Vargas Quiros, Bart Liefers, Karin van Garderen, Jeroen Vermeulen, Eyened Reading Center, Sinergia Consortium, Caroline Klaver
Keywords: color fundus images, color fundus, population-based Rotterdam Study, introduce VascX models, analyzing retinal vasculature
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)

Abstract:We introduce VascX models, a comprehensive set of model ensembles for analyzing retinal vasculature from color fundus images (CFIs). Annotated CFIs were aggregated from public datasets for vessel, artery-vein, and disc segmentation, and for fovea localization. Additional CFIs from the population-based Rotterdam Study were included, with arteries and veins annotated by graders at pixel level. Our models achieved robust performance across devices from different vendors, varying levels of image quality, and diverse pathologies. Our models demonstrated superior segmentation performance compared to existing systems under a variety of conditions. Significant enhancements were observed in artery-vein and disc segmentation performance, particularly in segmentations of these structures on CFIs of intermediate quality, a common characteristic of large cohorts and clinical datasets. Our model outperformed human graders in segmenting vessels with greater precision. With VascX models we provide a robust, ready-to-use set of model ensembles and inference code aimed at simplifying the implementation and enhancing the quality of automated retinal vasculature analyses. The precise vessel parameters generated by the model can serve as starting points for the identification of disease patterns in and outside of the eye.

[CV-119] Investigating Gender Bias in Lymph-node Segmentation with Anatomical Priors

Link: https://arxiv.org/abs/2409.15888
Authors: Ricardo Coimbra Brioso, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono
Keywords: Clinical Target Volume, Radiotherapy requires precise, maximize treatment efficacy, Radiotherapy requires, Target Volume
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Radiotherapy requires precise segmentation of organs at risk (OARs) and of the Clinical Target Volume (CTV) to maximize treatment efficacy and minimize toxicity. While deep learning (DL) has significantly advanced automatic contouring, complex targets like CTVs remain challenging. This study explores the use of simpler, well-segmented structures (e.g., OARs) as Anatomical Prior (AP) information to improve CTV segmentation. We investigate gender bias in segmentation models and the mitigation effect of the prior information. Findings indicate that incorporating prior knowledge with the discussed strategies enhances segmentation quality in female patients and reduces gender bias, particularly in the abdomen region. This research provides a comparative analysis of new encoding strategies and highlights the potential of using AP to achieve fairer segmentation outcomes.

[CV-120] Unsupervised dMRI Artifact Detection via Angular Resolution Enhancement and Cycle Consistency Learning

Link: https://arxiv.org/abs/2409.15883
Authors: Sheng Chen, Zihao Tang, Xinyi Wang, Chenyu Wang, Weidong Cai
Keywords: Diffusion magnetic resonance, magnetic resonance imaging, Diffusion magnetic, resonance imaging
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AJCAI2024, dMRI, Unsupervised artifact detection, Angular resolution enhancement, Cycle consistency

Abstract:Diffusion magnetic resonance imaging (dMRI) is a crucial technique in neuroimaging studies, allowing for the non-invasive probing of the underlying structures of brain tissues. Clinical dMRI data is susceptible to various artifacts during acquisition, which can lead to unreliable subsequent analyses. Therefore, dMRI preprocessing is essential for improving image quality, and manual inspection is often required to ensure that the preprocessed data is sufficiently corrected. However, manual inspection requires expertise and is time-consuming, especially with large-scale dMRI datasets. Given these challenges, an automated dMRI artifact detection tool is necessary to increase the productivity and reliability of dMRI data analysis. To this end, we propose a novel unsupervised deep learning framework called Unsupervised dMRI Artifact Detection via Angular Resolution Enhancement and Cycle Consistency Learning (UdAD-AC). UdAD-AC leverages dMRI angular resolution enhancement and cycle consistency learning to capture the effective representation of artifact-free dMRI data during training, and it identifies data containing artifacts using a designed confidence score during inference. To assess the capability of UdAD-AC, several commonly reported dMRI artifacts, including bias field, susceptibility distortion, and corrupted volume, were added to the testing data. Experimental results demonstrate that UdAD-AC achieves the best performance compared to competitive methods in unsupervised dMRI artifact detection.

[CV-121] A Novel Framework for the Automated Characterization of Gram-Stained Blood Culture Slides Using a Large-Scale Vision Transformer

Link: https://arxiv.org/abs/2409.15546
Authors: Jack McMahon, Naofumi Tomita, Elizabeth S. Tatishev, Adrienne A. Workman, Cristina R Costales, Niaz Banaei, Isabella W. Martin, Saeed Hassanpour
Keywords: artificial intelligence-assisted characterization, Gram-stained whole-slide images, Gram stain, whole-slide images, artificial intelligence-assisted
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This study introduces a new framework for the artificial intelligence-assisted characterization of Gram-stained whole-slide images (WSIs). As a test for the diagnosis of bloodstream infections, Gram stains provide critical early data to inform patient treatment. Rapid and reliable analysis of Gram stains has been shown to be positively associated with better clinical outcomes, underscoring the need for improved tools to automate Gram stain analysis. In this work, we developed a novel transformer-based model for Gram-stained WSI classification, which is more scalable to large datasets than previous convolutional neural network (CNN) -based methods as it does not require patch-level manual annotations. We also introduce a large Gram stain dataset from Dartmouth-Hitchcock Medical Center (Lebanon, New Hampshire, USA) to evaluate our model, exploring the classification of five major categories of Gram-stained WSIs: Gram-positive cocci in clusters, Gram-positive cocci in pairs/chains, Gram-positive rods, Gram-negative rods, and slides with no bacteria. Our model achieves a classification accuracy of 0.858 (95% CI: 0.805, 0.905) and an AUC of 0.952 (95% CI: 0.922, 0.976) using five-fold nested cross-validation on our 475-slide dataset, demonstrating the potential of large-scale transformer models for Gram stain classification. We further demonstrate the generalizability of our trained model, which achieves strong performance on external datasets without additional fine-tuning.

[CV-122] Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech

Link: https://arxiv.org/abs/2409.15525
Authors: Hong Nguyen, Sean Foley, Kevin Huang, Xuan Shi, Tiantian Feng, Shrikanth Narayanan
Keywords: Understanding speech production, learning system designs, language learning system, Magnetic Resonance Imaging, Understanding speech
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 4 pages

Abstract:Understanding speech production both visually and kinematically can inform second language learning system designs, as well as the creation of speaking characters in video games and animations. In this work, we introduce a data-driven method to visually represent articulator motion in Magnetic Resonance Imaging (MRI) videos of the human vocal tract during speech based on arbitrary audio or speech input. We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data using a speech-to-video diffusion model. Our findings demonstrate that the visual generation significantly benefits from the pre-trained speech representations. We also observed that evaluating phonemes in isolation is challenging but becomes more straightforward when assessed within the context of spoken words. Limitations of the current results include the presence of unsmooth tongue motion and video distortion when the tongue contacts the palate.

[CV-123] Bayesian computation with generative diffusion models by Multilevel Monte Carlo

Link: https://arxiv.org/abs/2409.15511
Authors: Abdul-Lateef Haji-Ali, Marcelo Pereyra, Luke Shaw, Konstantinos Zygalakis
Keywords: Generative diffusion models, Multilevel Monte Carlo, delivering remarkably accurate, Monte Carlo, remarkably accurate solutions
Categories: Computation (stat.CO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 images

Abstract:Generative diffusion models have recently emerged as a powerful strategy to perform stochastic sampling in Bayesian inverse problems, delivering remarkably accurate solutions for a wide range of challenging applications. However, diffusion models often require a large number of neural function evaluations per sample in order to deliver accurate posterior samples. As a result, using diffusion models as stochastic samplers for Monte Carlo integration in Bayesian computation can be highly computationally expensive. This cost is especially high in large-scale inverse problems such as computational imaging, which rely on large neural networks that are expensive to evaluate. With Bayesian imaging problems in mind, this paper presents a Multilevel Monte Carlo strategy that significantly reduces the cost of Bayesian computation with diffusion models. This is achieved by exploiting cost-accuracy trade-offs inherent to diffusion models to carefully couple models of different levels of accuracy in a manner that significantly reduces the overall cost of the calculation, without reducing the final accuracy. The effectiveness of the proposed Multilevel Monte Carlo approach is demonstrated with three canonical computational imaging problems, where we observe a 4×-to-8× reduction in computational cost compared to conventional Monte Carlo averaging.
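
For readers unfamiliar with the technique, the generic Multilevel Monte Carlo idea rests on a telescoping identity across accuracy levels. The block below states that standard identity and the resulting estimator; the notation is assumed for illustration (f_ℓ is the quantity of interest computed with the level-ℓ sampler, level L being the most accurate and most expensive, and N_ℓ the number of coupled sample pairs at level ℓ) and is not taken from the paper, whose specific coupling of diffusion models is not described in the abstract.

```latex
\mathbb{E}[f_L] \;=\; \mathbb{E}[f_0] \;+\; \sum_{\ell=1}^{L} \mathbb{E}\big[f_\ell - f_{\ell-1}\big],
\qquad
\widehat{Q}_{\mathrm{MLMC}} \;=\; \frac{1}{N_0}\sum_{i=1}^{N_0} f_0^{(0,i)}
\;+\; \sum_{\ell=1}^{L} \frac{1}{N_\ell}\sum_{i=1}^{N_\ell}
\Big(f_\ell^{(\ell,i)} - f_{\ell-1}^{(\ell,i)}\Big)
```

When the coupled differences have small variance, most samples can be drawn at the cheap coarse levels and only a few at the expensive fine levels, which is where the cost saving comes from.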

[CV-124] Adenocarcinoma Segmentation Using Pre-trained Swin-UNet with Parallel Cross-Attention for Multi-Domain Imaging

Link: https://arxiv.org/abs/2409.15501
Authors: Abdul Qayyum, Moona Mazher, Imran Razzak, Steven A Niederer
Keywords: Computer aided pathological, aided pathological analysis, Computer aided, tumor diagnosis, aided pathological
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures

Abstract:Computer-aided pathological analysis has been the gold standard for tumor diagnosis; however, domain shift is a significant problem in histopathology. Domain shift, which may be caused by variability in anatomical structures, tissue preparation, and imaging processes, challenges the robustness of segmentation models. In this work, we present a framework consisting of a pre-trained encoder with a Swin-UNet architecture enhanced by a parallel cross-attention module to tackle the problem of adenocarcinoma segmentation across different organs and scanners, considering both morphological changes and scanner-induced domain variations. Experiments conducted on the Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation challenge dataset show that our framework achieves segmentation scores of 0.7469 for the cross-organ track and 0.7597 for the cross-scanner track on the final challenge test sets, effectively navigating diverse imaging conditions and improving segmentation accuracy across varying domains.

[CV-125] BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow

Link: https://arxiv.org/abs/2409.15384
Authors: EungGu Kang, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin
Keywords: MFSR leverages abundant, Multi frame super-resolution, single image super-resolution, leverages abundant information, achieves higher performance
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Multi-frame super-resolution (MFSR) achieves higher performance than single image super-resolution (SISR), because MFSR leverages abundant information from multiple frames. Recent MFSR approaches adapt the deformable convolution network (DCN) to align the frames. However, existing MFSR methods suffer from misalignments between the reference and source frames due to the limitations of DCN, such as small receptive fields and the predefined number of kernels. Because of these problems, existing MFSR approaches struggle to represent high-frequency information. To this end, we propose Deep Burst Multi-scale SR using Fourier Space with Optical Flow (BurstM). The proposed method estimates the optical flow offset for accurate alignment and predicts the continuous Fourier coefficient of each frame for representing high-frequency textures. In addition, we have enhanced the network flexibility by supporting various super-resolution (SR) scale factors with a single model. We demonstrate that our method achieves higher performance and flexibility than the existing MFSR methods. Our source code is available at this https URL

[CV-126] Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data

Link: https://arxiv.org/abs/2409.15374
Authors: Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
Keywords: Autism Spectrum Disorder, Spectrum Disorder, ASD, Autism Spectrum, Early diagnosis
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Early diagnosis and intervention for Autism Spectrum Disorder (ASD) have been shown to significantly improve the quality of life of autistic individuals. However, diagnostic methods for ASD rely on assessments based on clinical presentation, which are prone to bias and can make it challenging to arrive at an early diagnosis. There is a need for objective biomarkers of ASD which can help improve diagnostic accuracy. Deep learning (DL) has achieved outstanding performance in diagnosing diseases and conditions from medical imaging data. Extensive research has been conducted on creating models that classify ASD using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, existing models lack interpretability. This research aims to improve the accuracy and interpretability of ASD diagnosis by creating a DL model that can not only accurately classify ASD but also provide explainable insights into its working. The dataset used is a preprocessed version of the Autism Brain Imaging Data Exchange (ABIDE) with 884 samples. Our findings show a model that can accurately classify ASD and highlight critical brain regions differing between ASD and typical controls, with potential implications for early diagnosis and understanding of the neural basis of ASD. These findings are validated by studies in the literature that use different datasets and modalities, confirming that the model actually learned characteristics of ASD and not just the dataset. This study advances the field of explainable AI in medical imaging by providing a robust and interpretable model, thereby contributing to a future with objective and reliable ASD diagnostics.

[CV-127] A Lightweight GAN-Based Image Fusion Algorithm for Visible and Infrared Images

Link: https://arxiv.org/abs/2409.15332
Authors: Zhizhong Wu, Hao Gong, Jiajing Chen, Zhou Yuru, LiangHao Tan, Ge Shi
Keywords: merging visible light, Generative Adversarial Network, Block Attention Module, Depthwise Separable Convolution, algorithm specifically designed
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper presents a lightweight image fusion algorithm specifically designed for merging visible light and infrared images, with an emphasis on balancing performance and efficiency. The proposed method enhances the generator in a Generative Adversarial Network (GAN) by integrating the Convolutional Block Attention Module (CBAM) to improve feature focus and utilizing Depthwise Separable Convolution (DSConv) for more efficient computations. These innovations significantly reduce the model’s computational cost, including the number of parameters and inference latency, while maintaining or even enhancing the quality of the fused images. Comparative experiments using the M3FD dataset demonstrate that the proposed algorithm not only outperforms similar image fusion methods in terms of fusion quality but also offers a more resource-efficient solution suitable for deployment on embedded devices. The effectiveness of the lightweight design is validated through extensive ablation studies, confirming its potential for real-time applications in complex environments.
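
To make the efficiency claim concrete, a depthwise separable convolution replaces one k×k convolution over all channel pairs with a per-channel k×k depthwise convolution followed by a 1×1 pointwise convolution. The sketch below only counts weights for the two layouts; the function names and the 64-to-128-channel, 3×3 layer are hypothetical illustration values, biases are ignored, and none of it is taken from the paper's architecture.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
std = standard_conv_params(64, 128, 3)        # 73728
sep = depthwise_separable_params(64, 128, 3)  # 576 + 8192 = 8768
ratio = sep / std                             # equals 1/c_out + 1/k**2
```

For this layer the separable form needs roughly 12% of the weights, which illustrates why the technique is popular for embedded deployment.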

Machine Learning

[LG-0] Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking

Link: https://arxiv.org/abs/2409.16287
Authors: Xi Wang, Tianxing Chen, Qiaojun Yu, Tianling Xu, Zanxin Chen, Yiting Fu, Cewu Lu, Yao Mu, Ping Luo
Keywords: carefully considered, interactive perception, articulated objects, Articulated, interactive
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Articulated object manipulation requires precise object interaction, where the object’s axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but open-loop approaches often suffer from overlooking the interaction dynamics. To address this limitation, we present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds. Our method can build on any interactive perception technique, inducing slight object movement to generate point cloud frames of the evolving dynamic scene. These point clouds are then segmented using Segment Anything Model 2 (SAM2), after which the moving part of the object is masked for accurate online axis estimation, guiding subsequent robotic actions. Our approach significantly enhances the precision and efficiency of manipulation tasks involving articulated objects. Experiments in simulated environments demonstrate that our method outperforms baseline approaches, especially in tasks that demand precise axis-based control. Project Page: this https URL.

[LG-1] Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Link: https://arxiv.org/abs/2409.16283
Authors: Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani
Keywords: manipulation policies generalize, policies generalize, video, human video generation, video generation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Preprint. Under Review

Abstract:How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at this https URL

[LG-2] Learning To Help: Training Models to Assist Legacy Devices

Link: https://arxiv.org/abs/2409.16253
Authors: Yu Wu,Anand Sarwate
Keywords: Machine learning models, Machine learning, learning models implemented, long time, implemented in hardware
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 4 figures

Abstract:Machine learning models implemented in hardware on physical devices may be deployed for a long time. The computational abilities of the device may be limited and become outdated with respect to newer improvements. Because of the size of ML models, offloading some computation (e.g. to an edge cloud) can help such legacy devices. We cast this problem in the framework of learning with abstention (LWA) in which the expert (edge) must be trained to assist the client (device). Prior work on LWA trains the client assuming the edge is either an oracle or a human expert. In this work, we formalize the reverse problem of training the expert for a fixed (legacy) client. As in LWA, the client uses a rejection rule to decide when to offload inference to the expert (at a cost). We find the Bayes-optimal rule, prove a generalization bound, and find a consistent surrogate loss function. Empirical results show that our framework outperforms confidence-based rejection rules.
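The confidence-based rejection rule that the abstract uses as a point of comparison can be sketched as follows. This is a minimal illustration of the classical threshold rule (offload when the client's top probability falls below `1 - cost`), not the Bayes-optimal rule derived in the paper; the function names and the `cost` parameter are assumptions made for this example.

```python
def should_offload(probs, cost):
    """Confidence-based rejection: offload when the client is not confident enough.

    probs: class probabilities produced by the (fixed) client model.
    cost:  hypothetical price of querying the expert, between 0 and 1.
    """
    return max(probs) < 1.0 - cost


def route(probs, cost):
    """Return the client's predicted class index, or "offload" if it abstains."""
    if should_offload(probs, cost):
        return "offload"
    return max(range(len(probs)), key=probs.__getitem__)


confident = route([0.9, 0.1], cost=0.2)    # client keeps the decision
uncertain = route([0.55, 0.45], cost=0.2)  # client defers to the expert
```

In this sketch the client is fixed and only the threshold matters; the paper instead studies how to train the expert for such a fixed client.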

[LG-3] Fields of The World: A Machine Learning Benchmark Dataset For Global Agricultural Field Boundary Segmentation

Link: https://arxiv.org/abs/2409.16252
Authors: Hannah Kerner,Snehal Chaudhari,Aninda Ghosh,Caleb Robinson,Adeel Ahmad,Eddie Choi,Nathan Jacobs,Chris Holmes,Matthias Mohr,Rahul Dodhia,Juan M. Lavista Ferres,Jennifer Marcus
Keywords: Crop field boundaries, Crop field, collect manually, monitoring and assessments, expensive to collect
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Crop field boundaries are foundational datasets for agricultural monitoring and assessments but are expensive to collect manually. Machine learning (ML) methods for automatically extracting field boundaries from remotely sensed images could help realize the demand for these datasets at a global scale. However, current ML methods for field instance segmentation lack sufficient geographic coverage, accuracy, and generalization capabilities. Further, research on improving ML methods is restricted by the lack of labeled datasets representing the diversity of global agricultural fields. We present Fields of The World (FTW) – a novel ML benchmark dataset for agricultural field instance segmentation spanning 24 countries on four continents (Europe, Africa, Asia, and South America). FTW is an order of magnitude larger than previous datasets with 70,462 samples, each containing instance and semantic segmentation masks paired with multi-date, multi-spectral Sentinel-2 satellite images. We provide results from baseline models for the new FTW benchmark, show that models trained on FTW have better zero-shot and fine-tuning performance in held-out countries than models that aren’t pre-trained with diverse datasets, and show positive qualitative zero-shot results of FTW models in a real-world scenario – running on Sentinel-2 scenes over Ethiopia.

[LG-4] Predicting Deterioration in Mild Cognitive Impairment with Survival Transformers, Extreme Gradient Boosting and Cox Proportional Hazard Modelling ICANN2024

Link: https://arxiv.org/abs/2409.16231
Authors: Henry Musto,Daniel Stamate,Doina Logofatu,Daniel Stahl
Keywords: mild cognitive impairment, predicting cognitive deterioration, extreme gradient boosting, ADNI cohort, gradient boosting models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted to ICANN 2024

Abstract:The paper proposes a novel approach of survival transformers and extreme gradient boosting models in predicting cognitive deterioration in individuals with mild cognitive impairment (MCI) using metabolomics data in the ADNI cohort. By leveraging advanced machine learning and transformer-based techniques applied in survival analysis, the proposed approach highlights the potential of these techniques for more accurate early detection and intervention in Alzheimer’s dementia disease. This research also underscores the importance of non-invasive biomarkers and innovative modelling tools in enhancing the accuracy of dementia risk assessments, offering new avenues for clinical practice and patient care. A comprehensive Monte Carlo simulation procedure consisting of 100 repetitions of a nested cross-validation in which models were trained and evaluated, indicates that the survival machine learning models based on Transformer and XGBoost achieved the highest mean C-index performances, namely 0.85 and 0.8, respectively, and that they are superior to the conventional survival analysis Cox Proportional Hazards model which achieved a mean C-Index of 0.77. Moreover, based on the standard deviations of the C-Index performances obtained in the Monte Carlo simulation, we established that both survival machine learning models above are more stable than the conventional statistical model.
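The C-index reported above measures how often a model ranks two patients' risks in the same order as their observed event times. A minimal sketch of Harrell's concordance index is given below, assuming right-censored data; the function name and toy inputs are illustrative and not taken from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed time for each subject (event or censoring time).
    events: 1 if the event was observed, 0 if the subject was censored.
    risks:  predicted risk scores (higher means an earlier expected event).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third subject is censored at time 6.
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the 0.85, 0.8 and 0.77 figures above should be read.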

[LG-5] Fine-Tuning is Fine if Calibrated

Link: https://arxiv.org/abs/2409.16223
Authors: Zheda Mai,Arpita Chowdhury,Ping Zhang,Cheng-Hao Tu,Hong-You Chen,Vardaan Pahuja,Tanya Berger-Wolf,Song Gao,Charles Stewart,Yu Su,Wei-Lun Chao
Keywords: losing valuable knowledge, fine-tuned model, model, classes, pre-trained model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first three authors contribute equally

Abstract:Fine-tuning is arguably the most straightforward way to tailor a pre-trained model (e.g., a foundation model) to downstream applications, but it also comes with the risk of losing valuable knowledge the model had learned in pre-training. For example, fine-tuning a pre-trained classifier capable of recognizing a large number of classes to master a subset of classes at hand is shown to drastically degrade the model’s accuracy in the other classes it had previously learned. As such, it is hard to further use the fine-tuned model when it encounters classes beyond the fine-tuning data. In this paper, we systematically dissect the issue, aiming to answer the fundamental question, “What has been damaged in the fine-tuned model?” To our surprise, we find that the fine-tuned model neither forgets the relationship among the other classes nor degrades the features to recognize these classes. Instead, the fine-tuned model often produces more discriminative features for these other classes, even if they were missing during fine-tuning! What really hurts the accuracy is the discrepant logit scales between the fine-tuning classes and the other classes, implying that a simple post-processing calibration would bring back the pre-trained model’s capability and at the same time unveil the feature improvement over all classes. We conduct an extensive empirical study to demonstrate the robustness of our findings and provide preliminary explanations underlying them, suggesting new directions for future theoretical analysis. Our code is available at this https URL.
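One simple form that such a post-processing calibration could take is adding a constant offset to the logits of the classes that were absent from fine-tuning before taking the argmax. The sketch below assumes exactly that; the offset `gamma`, the function name, and the toy logits are hypothetical and may differ from the authors' released code.

```python
def calibrated_predict(logits, absent_classes, gamma):
    """Predict a class after boosting the logits of classes unseen in fine-tuning.

    logits:         raw scores from the fine-tuned classifier, one per class.
    absent_classes: set of class indices missing from the fine-tuning data.
    gamma:          hypothetical calibration offset (would be tuned on held-out data).
    """
    adjusted = [
        z + gamma if k in absent_classes else z
        for k, z in enumerate(logits)
    ]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


logits = [5.0, 4.0, 3.5, 1.0]   # classes 0 and 1 were fine-tuned, 2 and 3 were not
before = calibrated_predict(logits, {2, 3}, gamma=0.0)
after = calibrated_predict(logits, {2, 3}, gamma=2.0)
```

The features and the relative ordering within each group are left untouched; only the scale gap between the two groups of logits is corrected.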

[LG-6] Problem-oriented AutoML in Clustering

Link: https://arxiv.org/abs/2409.16218
Authors: Matheus Camilo da Silva,Gabriel Marques Tavares,Eric Medvet,Sylvio Barbon Junior
Keywords: Clustering Validity Indexes, flexible approach, automating clustering tasks, Problem-oriented AutoML, approach to automating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The Problem-oriented AutoML in Clustering (PoAC) framework introduces a novel, flexible approach to automating clustering tasks by addressing the shortcomings of traditional AutoML solutions. Conventional methods often rely on predefined internal Clustering Validity Indexes (CVIs) and static meta-features, limiting their adaptability and effectiveness across diverse clustering tasks. In contrast, PoAC establishes a dynamic connection between the clustering problem, CVIs, and meta-features, allowing users to customize these components based on the specific context and goals of their task. At its core, PoAC employs a surrogate model trained on a large meta-knowledge base of previous clustering datasets and solutions, enabling it to infer the quality of new clustering pipelines and synthesize optimal solutions for unseen datasets. Unlike many AutoML frameworks that are constrained by fixed evaluation metrics and algorithm sets, PoAC is algorithm-agnostic, adapting seamlessly to different clustering problems without requiring additional data or retraining. Experimental results demonstrate that PoAC not only outperforms state-of-the-art frameworks on a variety of datasets but also excels in specific tasks such as data visualization, and highlight its ability to dynamically adjust pipeline configurations based on dataset complexity.

[LG-7] Deep Learning for Precision Agriculture: Post-Spraying Evaluation and Deposition Estimation

Link: https://arxiv.org/abs/2409.16213
Authors: Harry Rogers,Tahmina Zebin,Grzegorz Cielniak,Beatriz De La Iglesia,Ben Magri
Keywords: Precision spraying, eXplainable Artificial Intelligence, requires automation primarily, Precision spraying evaluation, spraying evaluation requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Precision spraying evaluation requires automation primarily in post-spraying imagery. In this paper we propose an eXplainable Artificial Intelligence (XAI) computer vision pipeline to evaluate a precision spraying system post-spraying without the need for traditional agricultural methods. The developed system can semantically segment potential targets such as lettuce, chickweed, and meadowgrass and correctly identify if targets have been sprayed. Furthermore, this pipeline evaluates using a domain-specific Weakly Supervised Deposition Estimation task, allowing for class-specific quantification of spray deposit weights in μL. Estimation of coverage rates of spray deposition in a class-wise manner allows for further understanding of effectiveness of precision spraying systems. Our study evaluates different Class Activation Mapping techniques, namely AblationCAM and ScoreCAM, to determine which is more effective and interpretable for these tasks. In the pipeline, inference-only feature fusion is used to allow for further interpretability and to enable the automation of precision spraying evaluation post-spray. Our findings indicate that a Fully Convolutional Network with an EfficientNet-B0 backbone and inference-only feature fusion achieves an average absolute difference in deposition values of 156.8 μL across three classes in our test set. The dataset curated in this paper is publicly available at this https URL

[LG-8] MaskBit: Embedding-free Image Generation via Bit Tokens

Link: https://arxiv.org/abs/2409.16211
Authors: Mark Weber,Lijun Yu,Qihang Yu,Xueqing Deng,Xiaohui Shen,Daniel Cremers,Liang-Chieh Chen
Keywords: Masked transformer models, Masked transformer, class-conditional image generation, subsequent Transformer model, compelling alternative
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of mere 305M parameters.

[LG-9] Second Order Bounds for Contextual Bandits with Function Approximation

Link: https://arxiv.org/abs/2409.16197
Authors: Aldo Pacchiano
Keywords: context-action pairs belongs, developed algorithms no-regret, algorithms no-regret algorithms, square root, function class
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 12 pages main, 33 pages total

Abstract:Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean rewards over context-action pairs belong to a function class. Although there are many approaches to this problem, one that has gained in importance is the use of algorithms based on the optimism principle such as optimistic least squares. It can be shown that the regret of this algorithm scales as the square root of the product of the eluder dimension (a statistical measure of the complexity of the function class), the logarithm of the function class size and the time horizon. Unfortunately, even if the variance of the measurement noise of the rewards at each time is changing and is very small, the regret of the optimistic least squares algorithm scales with the square root of the time horizon. In this work we are the first to develop algorithms that satisfy regret bounds scaling not with the square root of the time horizon, but with the square root of the sum of the measurement variances in the setting of contextual bandits with function approximation when the variances are unknown. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems.
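The two scalings described in words above can be written out as follows. This is only a transcription of the abstract's verbal description; the symbols ($d_{\mathrm{eluder}}$ for the eluder dimension, $\mathcal{F}$ for the function class, $T$ for the time horizon, $\sigma_t^2$ for the noise variance at time $t$) are notation assumed here, and constants, logarithmic factors and any additional complexity terms in the paper's actual bounds are omitted.

```latex
% First-order scaling of optimistic least squares, as described in the abstract:
\mathrm{Regret}(T) \;\lesssim\; \sqrt{\, d_{\mathrm{eluder}} \cdot \log|\mathcal{F}| \cdot T \,}

% Second-order dependence targeted by the paper: the horizon T is replaced
% by the cumulative measurement variance.
\mathrm{Regret}(T) \;\text{ scales with }\; \sqrt{\, \sum_{t=1}^{T} \sigma_t^{2} \,}
```

When every $\sigma_t^2$ is bounded by a constant the second expression is at most of order $\sqrt{T}$, and it becomes much smaller when the noise is small, which is the regime the abstract highlights.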

[LG-10] Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering

Link: https://arxiv.org/abs/2409.16167
Authors: Ziyu Zhao,Tao Shen,Didi Zhu,Zexi Li,Jing Su,Xuwu Wang,Kun Kuang,Fei Wu
Keywords: fine-tuning large language, platforms like Huggingface, large language models, fine-tuning large, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains due to its modular design and widespread availability on platforms like Huggingface. This modularity has sparked interest in combining multiple LoRAs to enhance LLM capabilities. However, existing methods for LoRA composition primarily focus on task-specific adaptations that require additional training, and current model merging techniques often fail to fully leverage LoRA’s modular nature, leading to parameter interference and performance degradation. In this paper, we investigate the feasibility of disassembling and reassembling multiple LoRAs at a finer granularity, analogous to assembling LEGO blocks. We introduce the concept of Minimal Semantic Units (MSUs), where the parameters corresponding to each rank in LoRA function as independent units. These MSUs demonstrate permutation invariance and concatenation-summation equivalence properties, enabling flexible combinations to create new LoRAs. Building on these insights, we propose the LoRA-LEGO framework. This framework conducts rank-wise parameter clustering by grouping MSUs from different LoRAs into k clusters. The centroid of each cluster serves as a representative MSU, enabling the assembly of a merged LoRA with an adjusted rank of k. Additionally, we apply a dual reweighting strategy to optimize the scale of the merged LoRA. Experiments across various benchmarks demonstrate that our method outperforms existing approaches in LoRA merging.
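The two MSU properties named in the abstract follow from the fact that a LoRA update B·A is a sum of rank-one terms, one per rank. The sketch below checks both properties on tiny matrices in plain Python; the helper names are illustrative, and the clustering and reweighting steps of LoRA-LEGO are not reproduced here.

```python
def matmul(B, A):
    """Plain-Python matrix product of B (d_out x r) and A (r x d_in)."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]


def split_units(B, A):
    """Split a LoRA into per-rank units: (k-th column of B, k-th row of A)."""
    return [([row[k] for row in B], A[k]) for k in range(len(A))]


def assemble(units):
    """Rebuild the weight update as a sum of rank-one outer products."""
    d_out, d_in = len(units[0][0]), len(units[0][1])
    W = [[0.0] * d_in for _ in range(d_out)]
    for b, a in units:
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] += b[i] * a[j]
    return W


B1, A1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
B2, A2 = [[1.0], [0.0], [2.0]], [[2.0, 1.0, 0.0]]
u1, u2 = split_units(B1, A1), split_units(B2, A2)

# Permutation invariance: reordering the units leaves the update unchanged.
same_after_shuffle = assemble(u1[::-1]) == matmul(B1, A1)

# Concatenation-summation equivalence: concatenating the units of two LoRAs
# yields the sum of their individual updates.
W1, W2 = matmul(B1, A1), matmul(B2, A2)
summed = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(W1, W2)]
same_after_concat = assemble(u1 + u2) == summed
```

Because units from different LoRAs can be pooled this way, grouping them into k clusters and keeping one representative per cluster gives a merged LoRA of rank k, as the abstract describes.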

[LG-11] Seeing Faces in Things: A Model and Dataset for Pareidolia

Link: https://arxiv.org/abs/2409.16143
Authors: Mark Hamilton,Simon Stent,Vasha DuTell,Anne Harrington,Jennifer Corbett,Ruth Rosenholtz,William T. Freeman
Keywords: human visual system, shapes and sizes, visual system, system is well-tuned, faces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. “Face pareidolia” describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset of “Faces in Things”, consisting of five thousand web images with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We find that the evolutionary need for humans to detect animal faces, as well as human faces, may explain some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia. Dataset and Website: this https URL

[LG-12] TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

Link: https://arxiv.org/abs/2409.16118
Authors: Andrei Margeloiu,Xiangjian Jiang,Nikola Simidjievski,Mateja Jamnik
Keywords: Data, difficult in critical, critical fields, synthetic data, classification performance
Categories: Machine Learning (cs.LG)
Comments: 48 pages, 15 figures, 30 tables

Abstract:Data collection is often difficult in critical fields such as medicine, physics, and chemistry. As a result, classification methods usually perform poorly with these small datasets, leading to weak predictive performance. Increasing the training set with additional synthetic data, similar to data augmentation in images, is commonly believed to improve downstream classification performance. However, current tabular generative methods that learn either the joint distribution $p(\mathbf{x}, y)$ or the class-conditional distribution $p(\mathbf{x} \mid y)$ often overfit on small datasets, resulting in poor-quality synthetic data, usually worsening classification performance compared to using real data alone. To solve these challenges, we introduce TabEBM, a novel class-conditional generative method using Energy-Based Models (EBMs). Unlike existing methods that use a shared model to approximate all class-conditional densities, our key innovation is to create distinct EBM generative models for each class, each modelling its class-specific data distribution individually. This approach creates robust energy landscapes, even in ambiguous class distributions. Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently improves the classification performance across diverse datasets of various sizes, especially small ones.

[LG-13] Self-attention as an attractor network: transient memories without backpropagation

Link: https://arxiv.org/abs/2409.16112
Authors: Francesco D’Amico,Matteo Negri
Keywords: modern neural networks, modern Hopfield network, neural networks, successful architectures, modern neural
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)

Abstract:Transformers are one of the most successful architectures of modern neural networks. At their core there is the so-called attention mechanism, which recently interested the physics community as it can be written as the derivative of an energy function in certain cases: while it is possible to write the cross-attention layer as a modern Hopfield network, the same is not possible for the self-attention, which is used in the GPT architectures and other autoregressive models. In this work we show that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics shows transient states that are strongly correlated with both train and test examples. Overall we present a novel framework to interpret self-attention as an attractor network, potentially paving the way for new theoretical approaches inspired from physics to understand transformers.

[LG-14] The Digital Transformation in Health: How AI Can Improve the Performance of Health Systems

Link: https://arxiv.org/abs/2409.16098
Authors: África Periáñez,Ana Fernández del Río,Ivan Nazarov,Enric Jané,Moiz Hassan,Aditya Rastogi,Dexian Tang
Keywords: revolutionize health care, Artificial Intelligence, integrating Artificial Intelligence, health, health care delivery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: This article has been accepted for publication in Health Systems & Reform, published by Taylor & Francis

Abstract:Mobile health has the potential to revolutionize health care delivery and patient engagement. In this work, we discuss how integrating Artificial Intelligence into digital health applications (focused on supply chain, patient management, and capacity building, among other use cases) can improve the health system and public health performance. We present an Artificial Intelligence and Reinforcement Learning platform that allows the delivery of adaptive interventions whose impact can be optimized through experimentation and real-time monitoring. The system can integrate multiple data sources and digital health applications. The flexibility of this platform to connect to various mobile health applications and digital devices and send personalized recommendations based on past data and predictions can significantly improve the impact of digital tools on health system outcomes. The potential for resource-poor settings, where the impact of this approach on health outcomes could be more decisive, is discussed specifically. This framework is, however, similarly applicable to improving efficiency in health systems where scarcity is not an issue.

[LG-15] From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing

Link: https://arxiv.org/abs/2409.16089
Authors: Ivan DeAndres-Tame,Muhammad Faisal,Ruben Tolosana,Rouqaiah Al-Refai,Ruben Vera-Rodriguez,Philipp Terhörst
Keywords: achieving high accuracy, Explainable Artificial Intelligence, deep learning, achieving high, advanced significantly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Face Recognition (FR) has advanced significantly with the development of deep learning, achieving high accuracy in several applications. However, the lack of interpretability of these systems raises concerns about their accountability, fairness, and reliability. In the present study, we propose an interactive framework to enhance the explainability of FR models by combining model-agnostic Explainable Artificial Intelligence (XAI) and Natural Language Processing (NLP) techniques. The proposed framework is able to accurately answer various questions of the user through an interactive chatbot. In particular, the explanations generated by our proposed method are in the form of natural language text and visual representations, which for example can describe how different facial regions contribute to the similarity measure between two faces. This is achieved through the automatic analysis of the output’s saliency heatmaps of the face images and a BERT question-answering model, providing users with an interface that facilitates a comprehensive understanding of the FR decisions. The proposed approach is interactive, allowing the users to ask questions to get more precise information based on the user’s background knowledge. More importantly, in contrast to previous studies, our solution does not decrease the face recognition performance. We demonstrate the effectiveness of the method through different experiments, highlighting its potential to make FR systems more interpretable and user-friendly, especially in sensitive applications where decision-making transparency is crucial.

[LG-16] Assessing Simplification Levels in Neural Networks: The Impact of Hyperparameter Configurations on Complexity and Sensitivity

Link: https://arxiv.org/abs/2409.16086
Authors: (Joy) Huixin Guan
Keywords: Lempel Ziv complexity, Lempel Ziv, experimental study focused, effects on Lempel, specifically investigating
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper presents an experimental study focused on understanding the simplification properties of neural networks under different hyperparameter configurations, specifically investigating the effects on Lempel Ziv complexity and sensitivity. By adjusting key hyperparameters such as activation functions, hidden layers, and learning rate, this study evaluates how these parameters impact the complexity of network outputs and their robustness to input perturbations. The experiments conducted using the MNIST dataset aim to provide insights into the relationships between hyperparameters, complexity, and sensitivity, contributing to a deeper theoretical understanding of these concepts in neural networks.
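Lempel-Ziv complexity counts how many distinct phrases are needed to parse a symbol sequence, so regular outputs score low and irregular ones score high. The abstract does not say which variant the study uses; the sketch below uses an LZ78-style phrase count as a stand-in, and the function name is illustrative.

```python
def lz78_phrase_count(sequence):
    """Count the phrases in an LZ78-style parsing of a string.

    The string is scanned left to right; each phrase is the shortest
    prefix of the remaining input that has not been seen as a phrase before.
    """
    phrases = set()
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            current = ""
    # A trailing fragment that repeats an earlier phrase still counts once.
    return len(phrases) + (1 if current else 0)


constant_output = lz78_phrase_count("0" * 10)       # highly regular sequence
irregular_output = lz78_phrase_count("0110100110")  # less regular sequence
```

Applied to a binarized string of network outputs over an ordered set of inputs, a lower count would indicate a simpler input-output function in the sense studied here.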

[LG-17] Learning with Confidence: Training Better Classifiers from Soft Labels

Link: https://arxiv.org/abs/2409.16071
Authors: Sjoerd de Vries,Dirk Thierens
Keywords: supervised machine learning, SLL methods, definite assignments, labels, supervised machine
Categories: Machine Learning (cs.LG)

Abstract:In supervised machine learning, models are typically trained using data with hard labels, i.e., definite assignments of class membership. This traditional approach, however, does not take the inherent uncertainty in these labels into account. We investigate whether incorporating label uncertainty, represented as discrete probability distributions over the class labels – known as soft labels – improves the predictive performance of classification models. We first demonstrate the potential value of soft label learning (SLL) for estimating model parameters in a simulation experiment, particularly for limited sample sizes and imbalanced data. Subsequently, we compare the performance of various wrapper methods for learning from both hard and soft labels using identical base classifiers. On real-world-inspired synthetic data with clean labels, the SLL methods consistently outperform hard label methods. Since real-world data is often noisy and precise soft labels are challenging to obtain, we study the effect that noisy probability estimates have on model performance. Alongside conventional noise models, our study examines four types of miscalibration that are known to affect human annotators. The results show that SLL methods outperform the hard label methods in the majority of settings. Finally, we evaluate the methods on a real-world dataset with confidence scores, where the SLL methods are shown to match the traditional methods for predicting the (noisy) hard labels while providing more accurate confidence estimates.
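The difference between hard and soft labels is easy to see in the training loss. The sketch below uses cross-entropy against a full target distribution, which reduces to the usual hard-label loss when the target is one-hot; it is a generic illustration and not one of the specific wrapper methods compared in the paper.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities.

    target:    probabilities over classes (one-hot for a hard label).
    predicted: model probabilities over the same classes.
    eps:       small floor to avoid log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


hard_label = [1.0, 0.0]   # annotator forced to pick class 0
soft_label = [0.7, 0.3]   # annotator is 70% sure it is class 0

overconfident = [0.9, 0.1]
matched = [0.7, 0.3]

# Under the soft label, the prediction that matches the stated uncertainty
# receives the lower loss; under the hard label, the overconfident one does.
soft_prefers_matched = cross_entropy(soft_label, matched) < cross_entropy(soft_label, overconfident)
hard_prefers_overconfident = cross_entropy(hard_label, overconfident) < cross_entropy(hard_label, matched)
```

This is why soft labels can yield better-calibrated confidence estimates: the loss is minimized by reproducing the annotators' uncertainty rather than by pushing probabilities to 0 or 1.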

[LG-18] Whole-body end-effector pose tracking

Link: https://arxiv.org/abs/2409.16048
Authors: Tifanny Portela,Andrei Cramariuc,Mayank Mittal,Marco Hutter
Keywords: Combining manipulation, mobility of legged, Combining, recent Reinforcement Learning, robotic applications
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Combining manipulation with the mobility of legged robots is essential for a wide range of robotic applications. However, integrating an arm with a mobile base significantly increases the system’s complexity, making precise end-effector control challenging. Existing model-based approaches are often constrained by their modeling assumptions, leading to limited robustness. Meanwhile, recent Reinforcement Learning (RL) implementations restrict the arm’s workspace to be in front of the robot or track only the position to obtain decent tracking accuracy. In this work, we address these limitations by introducing a whole-body RL formulation for end-effector pose tracking in a large workspace on rough, unstructured terrains. Our proposed method involves a terrain-aware sampling strategy for the robot’s initial configuration and end-effector pose commands, as well as a game-based curriculum to extend the robot’s operating range. We validate our approach on the ANYmal quadrupedal robot with a six DoF robotic arm. Through our experiments, we show that the learned controller achieves precise command tracking over a large workspace and adapts across varying terrains such as stairs and slopes. On deployment, it achieves a pose-tracking error of 2.64 cm and 3.64 degrees, outperforming existing competitive baselines.

[LG-19] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Link: https://arxiv.org/abs/2409.16040
Authors: Xiaoming Shi,Shiyu Wang,Yuqi Nie,Dianqi Li,Zhou Ye,Qingsong Wen,Ming Jin
Keywords: Deep learning, past decades, time series, time series forecasting, Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 10 figures, 13 tables

Abstract:Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger, more capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data Time-300B, which spans over 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
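The idea of activating only a subset of networks per prediction is commonly implemented with top-k gating: a router scores every expert, only the k best are evaluated, and their outputs are mixed with renormalized weights. The sketch below shows that generic mechanism with scalar toy experts; the function names, the value of `k`, and the gate scores are assumptions and do not describe Time-MoE's actual router.

```python
import math


def top_k_weights(gate_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_moe(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_weights(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Three toy experts; the third is never evaluated because its gate score is lowest.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
output = sparse_moe(1.0, experts, gate_logits=[0.0, 0.0, -5.0], k=2)
```

Compute per prediction depends on k rather than on the total number of experts, which is how capacity can grow without a matching increase in inference cost.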

[LG-20] Robust Neural IDA-PBC: passivity-based stabilization under approximations

Link: https://arxiv.org/abs/2409.16008
Authors: Santiago Sanchez-Escalonilla,Samuele Zoboli,Bayu Jayawardhana
Keywords: Passivity Based Control, Damping Assignment, Passivity Based, Based Control, Interconnection and Damping
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Preprint

Abstract:In this paper, we restructure the Neural Interconnection and Damping Assignment - Passivity Based Control (Neural IDA-PBC) design methodology, and we formally analyze its closed-loop properties. Neural IDA-PBC redefines the IDA-PBC design approach as an optimization problem by building on the framework of Physics Informed Neural Networks (PINNs). However, the closed-loop stability and robustness properties under Neural IDA-PBC remain unexplored. To address the issue, we study the behavior of classical IDA-PBC under approximations. Our theoretical analysis allows deriving conditions for practical and asymptotic stability of the desired equilibrium point. Moreover, it extends the Neural IDA-PBC applicability to port-Hamiltonian systems where the matching conditions cannot be solved exactly. Our renewed optimization-based design introduces three significant aspects: i) it involves a novel optimization objective including stability and robustness constraints issued from our theoretical analysis; ii) it employs separate Neural Networks (NNs), which can be structured to reduce the search space to relevant functions; iii) it does not require knowledge about the port-Hamiltonian formulation of the system’s model. Our methodology is validated with simulations on three standard benchmarks: a double pendulum, a nonlinear mass-spring-damper and a cartpole. Notably, classical IDA-PBC designs cannot be analytically derived for the latter.
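
As background for the "matching conditions" mentioned in the abstract, the classical IDA-PBC problem is commonly stated as below. This is the textbook formulation, not the paper's neural optimization objective; the symbols $f$, $g$, $g^\perp$ (a full-rank left annihilator of $g$), $J_d$, $R_d$, $H_d$ and $x^\star$ follow common convention and are assumptions here.

```latex
\begin{align}
\dot{x} &= f(x) + g(x)\,u
  && \text{(open-loop system)} \\
\dot{x} &= \bigl[J_d(x) - R_d(x)\bigr]\nabla H_d(x),
  \quad J_d = -J_d^\top,\; R_d = R_d^\top \succeq 0,\;
  x^\star = \arg\min_x H_d(x)
  && \text{(desired closed loop)} \\
g^\perp(x)\,f(x) &= g^\perp(x)\bigl[J_d(x) - R_d(x)\bigr]\nabla H_d(x)
  && \text{(matching condition)} \\
u &= \bigl[g^\top(x)\,g(x)\bigr]^{-1} g^\top(x)
  \Bigl(\bigl[J_d(x) - R_d(x)\bigr]\nabla H_d(x) - f(x)\Bigr)
  && \text{(control law)}
\end{align}
```

When the matching condition cannot be solved exactly, its residual is nonzero, which corresponds to the approximate setting whose stability the abstract says it analyzes.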

[LG-21] Improvements to SDXL in NovelAI Diffusion V3

Link: https://arxiv.org/abs/2409.15997
Authors: Juan Ossa,Eren Doğan,Alex Birch,F. Johnson
Keywords: training NovelAI Diffusion, image generation model, art anime image, anime image generation, NovelAI Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Abstract:In this technical report, we document the changes we made to SDXL in the process of training NovelAI Diffusion V3, our state-of-the-art anime image generation model.

[LG-22] Exploring the Impact of Outlier Variability on Anomaly Detection Evaluation Metrics

Link: https://arxiv.org/abs/2409.15986
Authors: Minjae Ok,Simon Klüttermann,Emmanuel Müller
Keywords: Operating Characteristic Area, Receiver Operating Characteristic, ROC AUC, Anomaly detection, Precision-Recall Curve Area
Subjects: Machine Learning (cs.LG)
Comments: 8 Pages, 5 figures

Abstract:Anomaly detection is a dynamic field, in which the evaluation of models plays a critical role in understanding their effectiveness. The selection and interpretation of the evaluation metrics are pivotal, particularly in scenarios with varying amounts of anomalies. This study focuses on examining the behaviors of three widely used anomaly detection metrics under different conditions: F1 score, Receiver Operating Characteristic Area Under Curve (ROC AUC), and Precision-Recall Curve Area Under Curve (AUCPR). Our study critically analyzes the extent to which these metrics provide reliable and distinct insights into model performance, especially considering varying levels of outlier fractions and contamination thresholds in datasets. Through a comprehensive experimental setup involving widely recognized algorithms for anomaly detection, we present findings that challenge the conventional understanding of these metrics and reveal nuanced behaviors under varying conditions. We demonstrated that while the F1 score and AUCPR are sensitive to outlier fractions, the ROC AUC maintains consistency and is unaffected by such variability. Additionally, under conditions of a fixed outlier fraction in the test set, we observe an alignment between ROC AUC and AUCPR, indicating that the choice between these two metrics may be less critical in such scenarios. The results of our study contribute to a more refined understanding of metric selection and interpretation in anomaly detection, offering valuable insights for both researchers and practitioners in the field.
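
The claim that ROC AUC is insensitive to the outlier fraction while AUCPR is not can be checked on a toy example. The sketch below is not the paper's experimental setup: it uses average precision as one common estimator of AUCPR, lowers the outlier fraction by replicating inliers, and assumes no score ties between outliers and inliers. All function names are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (outlier, inlier) pairs ranked correctly.

    Ties between an outlier and an inlier count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each outlier.

    Assumes at least one outlier and no score ties across classes.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def replicate_inliers(labels, scores, factor):
    """Lower the outlier fraction by repeating every inlier `factor` times."""
    new_labels, new_scores = [], []
    for label, score in zip(labels, scores):
        copies = factor if label == 0 else 1
        new_labels.extend([label] * copies)
        new_scores.extend([score] * copies)
    return new_labels, new_scores


if __name__ == "__main__":
    labels = [1, 0, 1, 0]            # 1 = outlier, 0 = inlier
    scores = [0.9, 0.8, 0.7, 0.1]    # higher score = more anomalous
    for factor in (1, 3, 10):
        l, s = replicate_inliers(labels, scores, factor)
        fraction = sum(l) / len(l)
        print(f"outlier fraction {fraction:.2f}: "
              f"ROC AUC {roc_auc(l, s):.3f}, AP {average_precision(l, s):.3f}")
```

Replicating inliers leaves the share of correctly ordered outlier/inlier pairs unchanged, so ROC AUC stays fixed, while every extra inlier ranked above an outlier lowers the precision at that outlier's rank.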