本篇博文主要展示 2024-11-14 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下您的邮箱。

目录

概览 (2024-11-14)

今日共更新362篇论文,其中:

  • 自然语言处理35篇(Computation and Language (cs.CL))
  • 人工智能103篇(Artificial Intelligence (cs.AI))
  • 计算机视觉68篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习103篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models EMNLP2024

【速读】: 该论文试图解决的问题是评估通过领域适应预训练(Domain-Adaptive Pretraining, DAPT)开发的医疗领域基础模型在医疗问答任务中的实际表现。解决方案的关键在于通过以下三个方面进行全面比较和分析:(i) 直接将医疗模型与其对应的基模型进行一对一的对比;(ii) 在零样本/少样本提示(zero-/few-shot prompting)中为每个模型单独优化提示;(iii) 考虑比较中的统计不确定性。论文发现,尽管在特定问答任务上经过微调后,医疗模型可能表现出性能提升,但这种提升并不适用于基于临床笔记的任务。研究结果表明,最先进的通用领域模型已经展现出强大的医疗知识和推理能力,并提出了加强未来研究结论的建议。

链接: https://arxiv.org/abs/2411.08870
作者: Daniel P. Jeong,Pranav Mani,Saurabh Garg,Zachary C. Lipton,Michael Oberst
关键词-EN: adapting general-purpose large, general-purpose large language, recent works seek, develop foundation models, foundation models specifically
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Extended version of EMNLP 2024 paper arXiv:2411.04118 . Includes additional results on clinical note QA tasks and supervised fine-tuning evaluations

点击查看摘要

Abstract:Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare ten public “medical” LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question-answering (QA). For instance, across all tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 22.7% of cases, reach a (statistical) tie in 36.8% of cases, and are significantly worse than their base models in the remaining 40.5% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Meanwhile, we find that after fine-tuning on specific QA tasks, medical LLMs can show performance improvements, but the benefits do not carry over to tasks based on clinical notes. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
摘要:近期有几项研究致力于开发专门用于医疗领域的基础模型,通过在公开的生物医学语料库上继续预训练,将通用大语言模型 (LLMs) 和视觉-语言模型 (VLMs) 进行领域适应性预训练 (DAPT)。这些研究通常声称,这种领域适应性预训练能够提升下游医疗任务的性能,例如回答医学执照考试问题。在本研究中,我们对比了十个公开的“医疗”LLMs 和两个 VLMs 与其对应的基线模型,得出了不同的结论:所有医疗 VLMs 和几乎所有医疗 LLMs 在零样本/少样本提示和监督微调机制下,对于医疗问答 (QA) 任务,均未能持续优于其基线模型。例如,在 3-shot 设置下,在我们考虑的所有任务和模型对中,医疗 LLMs 仅在22.7%的情况下优于其基线模型,在36.8%的情况下达到(统计上的)平局,而在剩余的40.5%的情况下显著劣于其基线模型。我们的结论基于以下几点:(i) 直接将每个医疗模型与对应的基线模型进行一对一比较;(ii) 在零样本/少样本提示中分别优化每个模型的提示;(iii) 考虑比较中的统计不确定性。尽管这些基本实践在文献中并未被一致采用,但我们的消融实验表明它们对结论有显著影响。同时,我们发现,在特定 QA 任务上进行微调后,医疗 LLMs 可以显示出性能提升,但这种提升并未延续到基于临床笔记的任务中。我们的研究结果表明,最先进的通用领域模型可能已经展现出强大的医疗知识和推理能力,并提出了加强未来研究结论的建议。

[NLP-1] CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

【速读】: 该论文试图解决法语语言模型(如CamemBERT)在面对时间概念漂移(temporal concept drift)时性能下降的问题,特别是在处理新话题和术语时。解决方案的关键在于引入两个新的CamemBERT版本:CamemBERTav2和CamemBERTv2。CamemBERTav2基于DeBERTaV3架构,采用替换令牌检测(Replaced Token Detection, RTD)目标以增强上下文理解;而CamemBERTv2基于RoBERTa,使用掩码语言建模(Masked Language Modeling, MLM)目标。两者均在更大、更新的数据集上训练,具有更长的上下文长度和改进的令牌化器,以提升法语的令牌化性能。这些新模型在通用领域和特定领域(如医疗领域)的任务中表现出色,显著优于前代模型,成为现代自然语言处理(NLP)系统的宝贵工具。

链接: https://arxiv.org/abs/2411.08868
作者: Wissam Antoun,Francis Kulumba,Rian Touchent,Éric de la Clergerie,Benoît Sagot,Djamé Seddah
关键词-EN: natural language processing, million downloads, downloads per month, widely adopted, adopted across industries
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model-CamemBERTav2 and CamemBERTv2-designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.
摘要:法语语言模型,如 CamemBERT,已在各行业广泛应用于自然语言处理 (NLP) 任务,其中 CamemBERT 每月下载量超过 400 万次。然而,这些模型面临时间概念漂移的挑战,即过时的训练数据导致性能下降,尤其是在遇到新话题和术语时。这一问题凸显了更新模型以反映当前语言趋势的必要性。本文介绍了 CamemBERT 基础模型的两个新版本——CamemBERTav2 和 CamemBERTv2,旨在解决这些挑战。CamemBERTav2 基于 DeBERTaV3 架构,采用替换 Token 检测 (RTD) 目标以提升上下文理解能力,而 CamemBERTv2 则基于 RoBERTa,使用掩码语言建模 (MLM) 目标。两者均在更大规模、更新的数据集上进行训练,具有更长的上下文长度和更新的 Tokenizer,从而提升了法语的分词性能。我们评估了这些模型在通用领域 NLP 任务和特定领域应用(如医疗领域任务)中的表现,展示了它们在多种应用场景中的多功能性和有效性。结果表明,这些更新模型显著优于其前代,成为现代 NLP 系统中的宝贵工具。所有新模型及其中间检查点均在 Huggingface 上公开发布。

[NLP-2] Can sparse autoencoders be used to decompose and interpret steering vectors?

【速读】: 该论文试图解决的问题是如何准确解释和理解控制大型语言模型行为的转向向量(steering vectors)。论文指出,尽管稀疏自编码器(Sparse Autoencoders, SAEs)可能是一种潜在的解释方法,但直接应用SAE重建的向量往往缺乏原始转向向量的控制特性。解决方案的关键在于识别并解释了两个主要原因:(1) 转向向量超出了SAE设计时所针对的输入分布范围;(2) 转向向量在特征方向上可能存在有意义的负投影,而SAE并未设计来处理这种情况。这些发现揭示了直接使用SAE解释转向向量的局限性,并为未来的研究提供了改进方向。

链接: https://arxiv.org/abs/2411.08790
作者: Harry Mayne,Yushi Yang,Adam Mahdi
关键词-EN: large language models, Steering vectors, language models, promising approach, approach to control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
摘要:转向向量(Steering vectors)是一种有前景的方法,用于控制大语言模型的行为。然而,其背后的机制仍未被充分理解。尽管稀疏自编码器(Sparse Autoencoders, SAEs)可能提供了一种解释转向向量的潜在方法,但最近的发现表明,SAE重建的向量往往缺乏原始向量的转向特性。本文探讨了为何直接应用SAE于转向向量会产生误导性的分解,并识别出两个原因:(1)转向向量超出了SAE设计时所针对的输入分布范围;(2)转向向量在特征方向上可能存在有意义的负投影,而SAE并未设计来处理这种情况。这些局限性阻碍了SAE直接用于解释转向向量的可行性。
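
为便于理解上文提到的“SAE 重建”与“特征方向上的负投影”,下面给出一个与该论文实现无关的最小 numpy 示意:用一个随机构造的特征字典充当简化的 SAE(编码被约束为非负),重建一个假设的转向向量,并统计投影为负的特征比例,而这部分信息正是非负编码无法表达的。其中的维度、字典和向量均为随意假设。

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 256            # 假设的隐藏维度与 SAE 特征数

W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)    # 每个特征方向归一化

steering_vec = rng.normal(size=d_model)                   # 假设的转向向量

# 简化的 SAE 重建:编码被约束为非负(ReLU),因此表达不了负投影
codes = W_dec @ steering_vec                              # 与各特征方向的内积
recon = np.maximum(codes, 0) @ W_dec                      # 仅用非负激活重建

neg_ratio = np.mean(codes < 0)                            # 投影为负的特征比例
cos = recon @ steering_vec / (np.linalg.norm(recon) * np.linalg.norm(steering_vec))
print(f"负投影特征比例: {neg_ratio:.2f}, 重建与原向量的余弦相似度: {cos:.2f}")
```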

[NLP-3] Zero-shot Cross-lingual Transfer Learning with Multiple Source and Target Languages for Information Extraction: Language Selection and Adversarial Training

【速读】: 该论文试图解决多语言信息抽取(IE)系统在实际应用中难以泛化到多种语言的问题。解决方案的关键在于深入分析跨语言多迁移性(Cross-Lingual Multi-Transferability),特别是在涵盖多种语言的最新IE语料库中。论文首先确定了单迁移性能与多种语言距离之间的相关性,并基于此开发了一种结合语言距离的度量方法,该方法在不同任务和模型规模下均表现出高度相关性和鲁棒性。随后,论文探讨了更广泛的零样本多语言迁移设置,并提出基于新定义的语言距离进行语言聚类,以指导数据(语言)选择问题中的最佳成本-性能权衡。最后,论文提出了一种关系迁移设置,通过基于上述语言距离的关系进行对抗训练,进一步整合多语言未标注数据。

链接: https://arxiv.org/abs/2411.08785
作者: Nghia Trung Ngo,Thien Huu Nguyen
关键词-EN: previous researches addressing, researches addressing multi-lingual, high-resource languages predominantly, majority of previous, previous researches
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The majority of previous researches addressing multi-lingual IE are limited to zero-shot cross-lingual single-transfer (one-to-one) setting, with high-resource languages predominantly as source training data. As a result, these works provide little understanding and benefit for the realistic goal of developing a multi-lingual IE system that can generalize to as many languages as possible. Our study aims to fill this gap by providing a detailed analysis on Cross-Lingual Multi-Transferability (many-to-many transfer learning), for the recent IE corpora that cover a diverse set of languages. Specifically, we first determine the correlation between single-transfer performance and a wide range of linguistic-based distances. From the obtained insights, a combined language distance metric can be developed that is not only highly correlated but also robust across different tasks and model scales. Next, we investigate the more general zero-shot multi-lingual transfer settings where multiple languages are involved in the training and evaluation processes. Language clustering based on the newly defined distance can provide directions for achieving the optimal cost-performance trade-off in data (languages) selection problem. Finally, a relational-transfer setting is proposed to further incorporate multi-lingual unlabeled data based on adversarial training using the relation induced from the above linguistic distance.
摘要:以往大多数研究多语言信息抽取(IE)的工作局限于零样本跨语言单一转移(一对一)设置,主要使用高资源语言作为源训练数据。因此,这些研究对于开发能够泛化到尽可能多语言的多语言IE系统的现实目标提供的理解和帮助有限。我们的研究旨在填补这一空白,通过对涵盖多种语言的最新IE语料库进行跨语言多转移性(多对多迁移学习)的详细分析。具体而言,我们首先确定了单一转移性能与广泛的语言学距离之间的相关性。基于这些发现,我们可以开发一种不仅高度相关而且在不同任务和模型规模上表现稳健的综合语言距离度量。接着,我们研究了更为普遍的零样本多语言转移设置,其中涉及多个语言的训练和评估过程。基于新定义的距离进行语言聚类,可以为在数据(语言)选择问题中实现最佳成本-性能权衡提供方向。最后,我们提出了一种关系转移设置,通过基于上述语言距离的关系诱导进行对抗训练,进一步整合多语言未标注数据。
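
下面用一个极简示意说明“单一迁移性能与语言学距离的相关性”以及按相关性加权得到综合距离的思路;其中的迁移性能数值与三种距离均为随机生成的演示数据,与论文结果无关。

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_langs = 8                                     # 假设:8 个源语言
transfer_f1 = rng.uniform(0.3, 0.8, n_langs)    # 随机生成的“单一迁移”性能,仅作演示
distances = {name: rng.uniform(0, 1, n_langs)
             for name in ["syntactic", "phonological", "genetic"]}

# 逐一计算各语言学距离与迁移性能的皮尔逊相关系数
corrs = {}
for name, d in distances.items():
    r, _ = pearsonr(d, transfer_f1)
    corrs[name] = round(r, 3)
print(corrs)

# 以相关系数绝对值为权重,组合成一个简单的综合距离(示意)
w = np.array([abs(c) for c in corrs.values()])
combined = sum(wi * d for wi, d in zip(w / w.sum(), distances.values()))
r_comb, _ = pearsonr(combined, transfer_f1)
print("combined r:", round(r_comb, 3))
```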

[NLP-4] Multi-Perspective Stance Detection

【速读】: 该论文试图解决在主观自然语言处理任务中,由于标注者背景和经验的多样性导致的标注不一致性问题。解决方案的关键在于采用多视角(multi-perspective)方法,即在模型训练中纳入多个标注者的意见,而不是传统方法中仅使用单一的“真实标签”。研究结果表明,多视角方法在立场检测任务中显著提升了分类模型的性能,表明设计更具包容性的视角感知AI模型不仅是实现负责任和伦理AI的重要步骤,还能在性能上超越传统方法。

链接: https://arxiv.org/abs/2411.08752
作者: Benedetta Muscato,Praveen Bushipaka,Gizem Gezici,Lucia Passaro,Fosca Giannotti
关键词-EN: Subjective NLP tasks, Subjective NLP, human annotations provided, life experiences, NLP tasks
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Subjective NLP tasks usually rely on human annotations provided by multiple annotators, whose judgments may vary due to their diverse backgrounds and life experiences. Traditional methods often aggregate multiple annotations into a single ground truth, disregarding the diversity in perspectives that arises from annotator disagreement. In this preliminary study, we examine the effect of including multiple annotations on model accuracy in classification. Our methodology investigates the performance of perspective-aware classification models in stance detection task and further inspects if annotator disagreement affects the model confidence. The results show that multi-perspective approach yields better classification performance outperforming the baseline which uses the single label. This entails that designing more inclusive perspective-aware AI models is not only an essential first step in implementing responsible and ethical AI, but it can also achieve superior results than using the traditional approaches.
摘要:主观自然语言处理任务通常依赖于由多个标注者提供的人工标注,这些标注者的判断可能因各自不同的背景和生活经历而有所不同。传统方法通常将多个标注聚合为一个单一的“真实”标签,忽略了标注者之间意见分歧所带来的视角多样性。在本初步研究中,我们探讨了在分类任务中包含多个标注对模型准确性的影响。我们的方法研究了视角感知分类模型在立场检测任务中的表现,并进一步考察了标注者之间的分歧是否会影响模型的置信度。结果表明,多视角方法在分类性能上优于使用单一标签的基线模型。这表明,设计更具包容性的视角感知AI模型不仅是实现负责任和合乎伦理的AI的关键第一步,而且还能在结果上超越传统方法。

[NLP-5] Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers ATC ICML2024

【速读】: 该论文试图解决的核心问题是大型语言模型(LLMs)是否能够在多语言环境中发展出一种与特定语言解耦的通用概念表示。解决方案的关键在于通过分析基于Transformer的LLMs在单词翻译任务中的潜在表示(latents),特别是通过从源翻译提示中提取latents并将其插入目标翻译提示的前向传递中,发现输出语言在较早的层级就被编码,而待翻译的概念则在更深的层级被编码。基于这一发现,论文通过激活补丁(activation patching)技术,展示了在不改变语言的情况下改变概念,以及在不改变概念的情况下改变语言的能力。此外,论文还证明了使用不同语言的latents均值进行补丁操作不仅不会损害模型性能,反而能提升翻译概念的表现。这些结果为所研究模型中存在语言无关的概念表示提供了证据。

链接: https://arxiv.org/abs/2411.08745
作者: Clément Dumas,Chris Wendler,Veniamin Veselovsky,Giovanni Monea,Robert West
关键词-EN: multilingual language modeling, universal concept representation, develop a universal, disentangled from specific, central question
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures, previously published under the title “How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching” at the ICML 2024 mechanistic interpretability workshop this https URL

点击查看摘要

Abstract:A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. By doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language and vice versa through activation patching alone. Second, we show that patching with the mean over latents across different languages does not impair and instead improves the models’ performance in translating the concept. Our results provide evidence for the existence of language-agnostic concept representations within the investigated models.
摘要:多语言语言建模中的一个核心问题是,大语言模型 (LLMs) 是否发展出了一种与特定语言解耦的通用概念表示。本文通过分析基于 Transformer 的大语言模型在单词翻译任务中的潜在表示 (latents) 来探讨这一问题。我们策略性地从源翻译提示中提取 latents,并将其插入到目标翻译提示的前向传递中。通过这种方式,我们发现输出语言在比待翻译概念更早的层级中被编码在 latent 中。基于这一发现,我们进行了两项关键实验。首先,我们证明了仅通过激活补丁 (activation patching) 即可在不改变语言的情况下改变概念,反之亦然。其次,我们展示了使用不同语言间 latents 的平均值进行补丁操作不仅不会损害,反而会提升模型在翻译概念时的性能。我们的研究结果为所调查模型中存在语言无关的概念表示提供了证据。
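
下面给出激活补丁(activation patching)的一个简化 PyTorch 示意:先在源提示的前向传播中用钩子记录某一层的隐藏状态,再在目标提示的前向传播中用另一只钩子将对应位置替换为该隐藏状态。模型名称、层索引与提示内容均为示意性假设,不同模型的层访问路径可能不同。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"      # 假设的模型,仅作示意
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx, patch_pos = 10, -1                # 假设:第 10 层、最后一个 token 位置
layer = model.model.layers[layer_idx]        # Llama 类模型的层访问路径,其他结构可能不同
saved = {}

def save_hook(module, inputs, output):
    # 记录源提示在该层的隐藏状态(输出可能是元组,首元素为 hidden states)
    hidden = output[0] if isinstance(output, tuple) else output
    saved["latent"] = hidden[:, patch_pos, :].detach().clone()

def patch_hook(module, inputs, output):
    # 在目标提示的前向传播中,用源提示的 latent 覆盖对应位置
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, patch_pos, :] = saved["latent"]
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

src = tok('Français: "chat" -> English:', return_tensors="pt")   # 源翻译提示(示意)
tgt = tok('Deutsch: "Hund" -> English:', return_tensors="pt")    # 目标翻译提示(示意)

h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**src)
h.remove()

h = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(**tgt).logits
h.remove()

print(tok.decode(int(logits[0, -1].argmax())))   # 观察补丁后预测的下一个 token
```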

[NLP-6] A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

【速读】: 该论文试图解决的问题是评估和比较离散语音标记(discrete speech tokens)与连续语音特征(continuous speech features)在语音大语言模型(Speech LLMs)中的性能差异。解决方案的关键在于通过使用轻量级语言模型(Qwen1.5-0.5B)进行一系列语义相关任务的公平和彻底的对比,揭示了连续特征在需要细粒度语义理解的任务中通常优于离散标记。此外,论文还深入分析了离散标记性能不足的关键因素,如标记粒度有限和信息保留效率低,并基于这些分析探讨了提升离散标记性能的潜在方向。

链接: https://arxiv.org/abs/2411.08742
作者: Dingdong Wang,Mingyu Cui,Dongchao Yang,Xueyuan Chen,Helen Meng
关键词-EN: Large Language Models, Speech Large Language, Language Models, Large Language, Speech Large
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 tables, 4 figures

点击查看摘要

Abstract:With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens seamlessly. Compared to most studies that focus on continuous speech features, although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token granularity and inefficient information retention. To enhance the performance of discrete tokens, we explore potential aspects based on our analysis. We hope our results can offer new insights into the opportunities for advancing discrete speech tokens in Speech LLMs.
摘要:随着语音大语言模型 (Speech Large Language Models, Speech LLMs) 的兴起,离散语音 Token 因其能够与基于文本的 Token 无缝集成而引起了越来越多的关注。与大多数专注于连续语音特征的研究相比,尽管基于离散 Token 的大语言模型在某些任务上展示了有前景的结果,但这两种范式之间的性能差距却鲜有探讨。本文通过使用轻量级大语言模型 (Qwen1.5-0.5B),对离散和连续特征在多种语义相关任务上进行了公平且全面的比较。我们的研究发现,连续特征通常优于离散 Token,特别是在需要细粒度语义理解的任务中。此外,本研究不仅停留在表面层次的比较,还识别了离散 Token 表现不佳的关键因素,如 Token 粒度有限和信息保留效率低下。为了提升离散 Token 的性能,我们基于分析探讨了潜在的改进方向。我们希望这些结果能为推动语音大语言模型中离散语音 Token 的发展提供新的见解。

[NLP-7] Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models EMNLP2024

【速读】: 该论文试图解决大型语言模型(LLMs)在传统对齐过程中依赖高成本训练和人类偏好标注的问题。解决方案的关键在于引入了一种无需调整的自我对齐方法,称为“动态奖励与提示优化(Dynamic Rewarding with Prompt Optimization)”。该方法利用基于搜索的优化框架,使LLMs能够通过迭代自我改进来制定最佳对齐指令,无需额外的训练或人类干预。其核心机制是动态奖励机制,能够识别并纠正模型特定的对齐弱点,使LLMs能够高效适应多样化的对齐挑战。实验结果表明,该方法显著提升了对齐性能,甚至在基础模型上超越了经过SFT/RLHF调整的模型,并且自动优化的提示词优于人类专家设计的提示词,验证了该方法的有效性。

链接: https://arxiv.org/abs/2411.08733
作者: Somanshu Singla,Zhen Wang,Tianyang Liu,Abdullah Ashfaq,Zhiting Hu,Eric P. Xing
关键词-EN: Aligning Large Language, Large Language Models, Aligning Large, Large Language, Language Models
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 Main

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To further lower costs and achieve alignment without any expensive tuning or annotations, we introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (\ours). Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions, all without additional training or human intervention. The core of \ours is a dynamic rewarding mechanism, which identifies and rectifies model-specific alignment weaknesses, allowing LLMs to adapt efficiently to diverse alignment challenges. Empirical evaluations on eight recent LLMs, both open- and closed-sourced, demonstrate that \ours significantly enhances alignment performance, with base models outperforming their SFT/RLHF-tuned counterparts. Moreover, the prompts automatically optimized by \ours surpass those curated by human experts, further validating the effectiveness of our approach. Our findings highlight the great potential of current LLMs to achieve adaptive self-alignment through inference-time optimization, complementing tuning-based alignment methods.
摘要:传统上,对大语言模型 (LLM) 的校准依赖于昂贵的训练和人类偏好注释。自我校准旨在通过使模型能够自我校准来减少这些成本。为了进一步降低成本并实现无需任何昂贵调整或注释的校准,我们引入了一种新的无需调整的自我校准方法,即基于提示优化的动态奖励 (Dynamic Rewarding with Prompt Optimization, \ours)。我们的方法利用了一种基于搜索的优化框架,使 LLM 能够迭代自我改进并制定最佳的校准指令,而无需额外的训练或人工干预。\ours 的核心是一个动态奖励机制,该机制识别并纠正模型特定的校准弱点,使 LLM 能够高效适应多样化的校准挑战。在八个最新的 LLM(包括开源和闭源)上的实证评估表明,\ours 显著提升了校准性能,基础模型在校准表现上优于其 SFT/RLHF 调整的对应模型。此外,\ours 自动优化的提示超越了人类专家精心设计的提示,进一步验证了我们方法的有效性。我们的研究结果突显了当前 LLM 通过推理时优化实现自适应自我校准的巨大潜力,补充了基于调整的校准方法。

[NLP-8] Analyst Reports and Stock Performance: Evidence from the Chinese Market

【速读】: 该论文试图解决通过自然语言处理 (NLP) 技术从分析师报告中提取和量化文本信息,以预测股票表现的问题。解决方案的关键在于使用定制化的 BERT 深度学习模型处理中文文本,将分析师报告的情感分类为正面、中性或负面,并发现这些情感分类对股票波动性、超额收益和交易量的预测能力。具体来说,强烈正面的报告会增加超额收益和日内波动性,而强烈负面的报告会增加波动性和交易量,但会减少未来的超额收益。研究结果表明,正面情感报告的影响大于负面情感报告。

链接: https://arxiv.org/abs/2411.08726
作者: Rui Liu,Jiayou Liang,Haolong Chen,Yujia Hu
关键词-EN: natural language processing, applies natural language, quantify textual information, predict stock performance, article applies natural
类目: Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:This article applies natural language processing (NLP) to extract and quantify textual information to predict stock performance. Using an extensive dataset of Chinese analyst reports and employing a customized BERT deep learning model for Chinese text, this study categorizes the sentiment of the reports as positive, neutral, or negative. The findings underscore the predictive capacity of this sentiment indicator for stock volatility, excess returns, and trading volume. Specifically, analyst reports with strong positive sentiment will increase excess return and intraday volatility, and vice versa, reports with strong negative sentiment also increase volatility and trading volume, but decrease future excess return. The magnitude of this effect is greater for positive sentiment reports than for negative sentiment reports. This article contributes to the empirical literature on sentiment analysis and the response of the stock market to news in the Chinese stock market.
摘要:本文运用自然语言处理 (NLP) 技术,从文本信息中提取并量化数据,以预测股票表现。研究采用了一个包含中国分析师报告的大规模数据集,并利用定制的 BERT 深度学习模型进行中文文本处理,将报告的情绪分类为正面、中性或负面。研究结果强调了这种情绪指标对股票波动性、超额收益和交易量的预测能力。具体而言,情绪强烈的正面报告会增加超额收益和日内波动性,反之,情绪强烈的负面报告也会增加波动性和交易量,但会减少未来的超额收益。正面情绪报告的这种效应幅度大于负面情绪报告。本文为情绪分析和股票市场对新闻反应的实证研究贡献了新的视角,特别是在中国股票市场。
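
作为补充,下面给出用 Hugging Face transformers 对中文研报文本做三分类情感推理的一个最小示意。论文使用的是自行定制并微调的中文 BERT;这里仅以公开的 bert-base-chinese 加一个未经训练的三分类头演示推理流程,输出标签不具实际意义。

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 论文使用自行微调的中文 BERT;此处分类头未经训练,仅演示流程
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)
labels = ["负面", "中性", "正面"]

report = "公司三季度业绩大超预期,上调盈利预测,维持买入评级。"
inputs = tok(report, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]
print({l: round(float(p), 3) for l, p in zip(labels, probs)})
```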

[NLP-9] QCG-Rerank: Chunks Graph Rerank with Query Expansion in Retrieval-Augmented LLMs for Tourism Domain

【速读】: 该论文试图解决在旅游领域中,由于查询通常简短且数据库内容多样,现有的检索增强生成 (Retrieval-Augmented Generation, RAG) 模型在检索后可能包含大量无关或矛盾信息的问题。解决方案的关键在于提出了QCG-Rerank模型,该模型通过以下步骤实现:首先进行初步检索以获取候选片段,然后通过提取关键信息来增强语义并扩展原始查询。接着,利用扩展后的查询和候选片段计算相似度分数作为初始转移概率,并构建片段图。随后,通过迭代计算转移概率直至收敛,最终选择得分最高的片段输入到大语言模型 (Large Language Models, LLMs) 中生成响应。实验结果表明,QCG-Rerank方法在多个数据集上均表现出有效性和优越性。

链接: https://arxiv.org/abs/2411.08724
作者: Qikai Wei,Mingzhi Yang,Chunlong Han,Jingfu Wei,Minghao Zhang,Feifei Shi,Huansheng Ning
关键词-EN: Large Language Models, Large Language, Retrieval-Augmented Generation, information retrieval techniques, hallucination in Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) mitigates the issue of hallucination in Large Language Models (LLMs) by integrating information retrieval techniques. However, in the tourism domain, since the query is usually brief and the content in the database is diverse, existing RAG may contain a significant amount of irrelevant or contradictory information contents after retrieval. To address this challenge, we propose the QCG-Rerank model. This model first performs an initial retrieval to obtain candidate chunks and then enhances semantics by extracting critical information to expand the original query. Next, we utilize the expanded query and candidate chunks to calculate similarity scores as the initial transition probability and construct the chunks graph. Subsequently, We iteratively compute the transition probabilities based on an initial estimate until convergence. The chunks with the highest score are selected and input into the LLMs to generate responses. We evaluate the model on Cultour, IIRC, StrategyQA, HotpotQA, SQuAD, and MuSiQue datasets. The experimental results demonstrate the effectiveness and superiority of the QCG-Rerank method.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 通过整合信息检索技术,缓解了大语言模型 (Large Language Models, LLMs) 中的幻觉问题。然而,在旅游领域,由于查询通常简短且数据库内容多样,现有的 RAG 在检索后可能包含大量不相关或矛盾的信息内容。为应对这一挑战,我们提出了 QCG-Rerank 模型。该模型首先进行初步检索以获取候选片段,然后通过提取关键信息来扩展原始查询,从而增强语义。接着,我们利用扩展后的查询和候选片段计算相似度分数,作为初始转移概率,并构建片段图。随后,我们基于初始估计迭代计算转移概率,直至收敛。最终,选择得分最高的片段输入到 LLMs 中生成响应。我们在 Cultour、IIRC、StrategyQA、HotpotQA、SQuAD 和 MuSiQue 数据集上评估了该模型。实验结果证明了 QCG-Rerank 方法的有效性和优越性。
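
QCG-Rerank 中“以相似度为初始转移概率、在片段图上迭代至收敛”的步骤,本质上与 PageRank 式的幂迭代类似。下面是一个与论文实现无关的 numpy 示意,相似度矩阵为随意构造的演示数据。

```python
import numpy as np

def rerank_chunks(sim, damping=0.85, tol=1e-6, max_iter=100):
    """sim: (n, n) 非负相似度矩阵;返回各候选片段的稳态得分。"""
    n = sim.shape[0]
    P = sim / sim.sum(axis=1, keepdims=True)      # 行归一化,得到转移概率
    score = np.full(n, 1.0 / n)                   # 初始均匀分布
    for _ in range(max_iter):
        new = (1 - damping) / n + damping * (score @ P)
        if np.abs(new - score).sum() < tol:       # 收敛判据
            break
        score = new
    return score

sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])                 # 假设的 3 个候选片段之间的相似度
scores = rerank_chunks(sim)
print("片段得分:", np.round(scores, 3), "得分最高的片段:", int(scores.argmax()))
```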

[NLP-10] Are Triggers Needed for Document-Level Event Extraction?

【速读】: 该论文试图解决文档级事件抽取中触发词(trigger)的作用问题。解决方案的关键在于探讨不同质量的触发词(包括人工标注、LLM生成、关键词和随机生成)对事件抽取模型性能的影响。研究发现,基本自动生成的触发词可以作为人工标注触发词的可行替代方案,且详细的事件描述有助于在触发词质量下降时保持模型性能的稳健性。此外,即使使用随机生成的触发词,也能对基于提示的LLM方法的任务表现产生积极影响。

链接: https://arxiv.org/abs/2411.08708
作者: Shaden Shaar,Wayne Chen,Maitreyi Chatterjee,Barry Wang,Wenting Zhao,Claire Cardie
关键词-EN: event extraction, event, existing work, focused on sentence-level, sentence-level texts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Most existing work on event extraction has focused on sentence-level texts and presumes the identification of a trigger-span – a word or phrase in the input that evokes the occurrence of an event of interest. Event arguments are then extracted with respect to the trigger. Indeed, triggers are treated as integral to, and trigger detection as an essential component of, event extraction. In this paper, we provide the first investigation of the role of triggers for the more difficult and much less studied task of document-level event extraction. We analyze their usefulness in multiple end-to-end and pipelined neural event extraction models for three document-level event extraction datasets, measuring performance using triggers of varying quality (human-annotated, LLM-generated, keyword-based, and random). Our research shows that trigger effectiveness varies based on the extraction task’s characteristics and data quality, with basic, automatically-generated triggers serving as a viable alternative to human-annotated ones. Furthermore, providing detailed event descriptions to the extraction model helps maintain robust performance even when trigger quality degrades. Perhaps surprisingly, we also find that the mere existence of trigger input, even random ones, is important for prompt-based LLM approaches to the task.
摘要:现有的大多数事件抽取工作主要集中在句子级别的文本上,并假设能够识别出一个触发词(trigger-span)——即输入文本中引发感兴趣事件发生的词语或短语。随后,事件参数根据该触发词进行抽取。实际上,触发词被视为事件抽取不可或缺的部分,而触发词检测则是其核心组成部分。本文首次探讨了在更为复杂且研究较少的文档级别事件抽取任务中,触发词的作用。我们分析了触发词在多个端到端和管道式神经网络事件抽取模型中的有效性,这些模型针对三个文档级别的事件抽取数据集进行了测试,通过使用不同质量的触发词(人工标注、大语言模型生成、基于关键词以及随机生成)来衡量性能。我们的研究表明,触发词的有效性取决于抽取任务的特性和数据质量,而基本的自动生成触发词可以作为人工标注触发词的可行替代方案。此外,向抽取模型提供详细的事件描述有助于在触发词质量下降时保持稳健的性能。令人意外的是,我们发现即使是随机生成的触发词,其存在对于基于提示的大语言模型方法完成任务也至关重要。

[NLP-11] Theoretical Analysis of Byte-Pair Encoding

【速读】: 该论文试图解决字节对编码 (Byte-Pair Encoding, BPE) 在子词分词中的优化问题,即寻找一种能够实现最佳压缩效用的字节对编码。论文的关键解决方案在于证明了这一优化问题是APX-complete的,这意味着它不太可能存在多项式时间近似方案。尽管如此,论文还展示了BPE在压缩效用上对最优字节对编码的近似程度,给出了一个在0.333到0.625之间的最坏情况近似因子。这些结果旨在解释BPE的持续成功,并提供了对其压缩效用的首个严格保证,适用于所有输入。

链接: https://arxiv.org/abs/2411.08671
作者: László Kozma,Johannes Voderholzer
关键词-EN: grammar-based text compression, subword tokenization, widely used method, method for subword, origins in grammar-based
类目: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM) pretraining, to create a token dictionary of a prescribed size. Most evaluations of BPE to date are empirical, and the reasons for its good practical performance are not well understood. In this paper we focus on the optimization problem underlying BPE: finding a pair encoding that achieves optimal compression utility. We show that this problem is APX-complete, indicating that it is unlikely to admit a polynomial-time approximation scheme. This answers, in a stronger form, a question recently raised by Zouhar et al. On the positive side, we show that BPE approximates the compression utility of the optimal pair encoding to a worst-case factor between 0.333 and 0.625. Our results aim to explain the ongoing success of BPE and are, to our knowledge, the first rigorous guarantees on its compression utility that hold for all inputs.
摘要:字节对编码 (Byte-Pair Encoding, BPE) 是一种广泛使用的子词 Token 化方法,起源于基于语法的文本压缩技术。它被应用于多种语言处理任务中,如机器翻译或大语言模型 (Large Language Model, LLM) 的预训练,以创建一个规定大小的 Token 字典。迄今为止,大多数对 BPE 的评估都是基于经验的,其良好的实际性能背后的原因尚未得到充分理解。本文聚焦于 BPE 背后的优化问题:寻找一种能够实现最佳压缩效用的配对编码。我们证明,这个问题是 APX-complete 的,表明它不太可能存在多项式时间的近似方案。这一结论以更强的形式回答了 Zouhar 等人最近提出的问题。在积极的一面,我们证明了 BPE 在最坏情况下能够将最优配对编码的压缩效用近似到 0.333 至 0.625 的因子范围内。我们的研究旨在解释 BPE 持续成功的原因,并且据我们所知,这是首次为所有输入提供其压缩效用的严格保证。
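
为直观理解论文所分析的“配对编码”对象,下面给出 BPE 贪心合并过程的一个教科书式最小实现:每轮统计相邻符号对的频次,合并最频繁的一对。该示例只说明 BPE 机制本身,与论文的 APX 难度证明无直接对应。

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    """word_freqs: {词: 频次};返回按顺序学到的合并规则列表。"""
    vocab = {tuple(w): c for w, c in word_freqs.items()}   # 词被表示为符号元组
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, cnt in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += cnt
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                    # 贪心:合并最频繁的符号对
        merges.append(best)
        new_vocab = {}
        for sym, cnt in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] = cnt
        vocab = new_vocab
    return merges

print(bpe_train({"lower": 5, "low": 7, "newest": 3}, num_merges=5))
```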

[NLP-12] Dynamic Subset Tuning: Expanding the Operational Range of Parameter-Efficient Training for Large Language Models NEURIPS2024

【速读】: 该论文试图解决大规模语言模型在适应下游任务时的参数效率问题。解决方案的关键在于提出了一种新颖的参数高效训练(Parameter-Efficient Training, PET)方法,该方法通过优化模型参数的一个动态子集来实现任务适应。与传统的固定参数子集方法不同,该方法中被优化的参数子集在训练过程中不断演化,从而能够在更少的参数数量下实现良好的性能。此外,该方法能够灵活地调整子集大小,覆盖任意比例的模型总参数,超越了现有的PET方法如提示调优(prompt tuning)和低秩适应(LoRA)的适用范围,并在多种自然语言处理任务(如机器翻译、问答、GSM8K、SuperGLUE)中表现出优越的性能。

链接: https://arxiv.org/abs/2411.08610
作者: Felix Stahlberg,Jared Lichtarge,Shankar Kumar
关键词-EN: large language models, large language, parameter-efficient training, optimizing a small, existing model parameters
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NeurIPS 2024 Workshop on Adaptive Foundation Models

点击查看摘要

Abstract:We propose a novel parameter-efficient training (PET) method for large language models that adapts models to downstream tasks by optimizing a small subset of the existing model parameters. Unlike prior methods, this subset is not fixed in location but rather which parameters are modified evolves over the course of training. This dynamic parameter selection can yield good performance with many fewer parameters than extant methods. Our method enables a seamless scaling of the subset size across an arbitrary proportion of the total model size, while popular PET approaches like prompt tuning and LoRA cover only a small part of this spectrum. We match or outperform prompt tuning and LoRA in most cases on a variety of NLP tasks (MT, QA, GSM8K, SuperGLUE) for a given parameter budget across different model families and sizes.
摘要:我们提出了一种新颖的参数高效训练 (Parameter-Efficient Training, PET) 方法,用于大语言模型,通过优化现有模型参数的一小部分子集来适应下游任务。与先前的方法不同,这一子集的位置并非固定,而是随着训练过程的进行,被修改的参数会动态变化。这种动态参数选择能够在比现有方法更少的参数下实现良好的性能。我们的方法能够无缝扩展子集大小,覆盖任意比例的总模型大小,而流行的 PET 方法如提示调优 (prompt tuning) 和低秩适应 (LoRA) 仅覆盖了这一范围的一小部分。在不同的模型家族和大小下,对于给定的参数预算,我们在多种 NLP 任务(机器翻译 (MT)、问答 (QA)、GSM8K、SuperGLUE)中大多数情况下与提示调优和 LoRA 持平或表现更优。
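
论文摘要未给出子集选择的具体规则,下面仅以“每步按梯度幅值选出 top-k 参数、其余梯度置零”这一假设性策略示意“动态参数子集更新”的形式,并非论文方法本身。

```python
import torch

def apply_dynamic_mask(model, keep_ratio=0.01):
    """在 loss.backward() 之后、optimizer.step() 之前调用:
    只保留梯度幅值最大的 keep_ratio 比例参数的更新,其余梯度置零。"""
    grads = torch.cat([p.grad.abs().flatten()
                       for p in model.parameters() if p.grad is not None])
    k = max(1, int(keep_ratio * grads.numel()))
    threshold = torch.topk(grads, k).values.min()      # 第 k 大的梯度幅值作为阈值
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_((p.grad.abs() >= threshold).float())

# 典型训练步中的用法(示意):
# loss.backward()
# apply_dynamic_mask(model, keep_ratio=0.01)   # 被更新的子集随每步梯度而变化
# optimizer.step(); optimizer.zero_grad()
```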

[NLP-13] XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL

【速读】: 该论文试图解决在大语言模型(LLM)处理自然语言到SQL任务中的性能挑战。解决方案的关键在于引入XiYan-SQL框架,该框架采用多生成器集成策略(multi-generator ensemble strategy)来提升候选SQL查询的生成质量。核心创新包括:1) 提出M-Schema半结构化模式表示方法(semi-structured schema representation method),以增强对数据库结构的理解;2) 结合上下文学习(in-context learning, ICL)与监督微调(supervised fine-tuning),通过训练策略生成高质量且多样化的候选SQL查询;3) 实施基于命名实体识别(named entity recognition)的示例选择方法,防止实体过度强调;4) 通过优化器纠正候选SQL查询中的逻辑或语法错误;5) 微调选择模型以区分候选SQL查询的细微差别。这些创新共同提升了SQL查询的生成质量和多样性,并在多个数据集上实现了最先进的执行准确率。

链接: https://arxiv.org/abs/2411.08599
作者: Yingqi Gao,Yifu Liu,Xiaoxia Li,Xiaorong Shi,Yin Zhu,Yiming Wang,Shiqi Li,Wei Li,Yuntao Hong,Zhiling Luo,Jinyang Gao,Liyu Mou,Yu Li
关键词-EN: multi-generator ensemble strategy, large language model, language model performance, improve candidate generation, candidate SQL queries
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To tackle the challenges of large language model performance in natural language to SQL tasks, we introduce XiYan-SQL, an innovative framework that employs a multi-generator ensemble strategy to improve candidate generation. We introduce M-Schema, a semi-structured schema representation method designed to enhance the understanding of database structures. To enhance the quality and diversity of generated candidate SQL queries, XiYan-SQL integrates the significant potential of in-context learning (ICL) with the precise control of supervised fine-tuning. On one hand, we propose a series of training strategies to fine-tune models to generate high-quality candidates with diverse preferences. On the other hand, we implement the ICL approach with an example selection method based on named entity recognition to prevent overemphasis on entities. The refiner optimizes each candidate by correcting logical or syntactical errors. To address the challenge of identifying the best candidate, we fine-tune a selection model to distinguish nuances of candidate SQL queries. The experimental results on multiple dialect datasets demonstrate the robustness of XiYan-SQL in addressing challenges across different scenarios. Overall, our proposed XiYan-SQL achieves the state-of-the-art execution accuracy of 89.65% on the Spider test set, 69.86% on SQL-Eval, 41.20% on NL2GQL, and a competitive score of 72.23% on the Bird development benchmark. The proposed framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods.
摘要:为了应对大语言模型在自然语言到SQL任务中的性能挑战,我们引入了XiYan-SQL,这是一个创新的框架,采用多生成器集成策略来提升候选生成。我们提出了M-Schema,一种半结构化的模式表示方法,旨在增强对数据库结构的理解。为了提高生成候选SQL查询的质量和多样性,XiYan-SQL结合了上下文学习(ICL)的巨大潜力与监督微调的精确控制。一方面,我们提出了一系列训练策略,以微调模型生成具有多样偏好的高质量候选。另一方面,我们通过基于命名实体识别的示例选择方法实施ICL,以防止过度强调实体。精炼器通过纠正逻辑或句法错误来优化每个候选。为了解决识别最佳候选的挑战,我们微调了一个选择模型,以区分候选SQL查询的细微差别。在多个方言数据集上的实验结果证明了XiYan-SQL在不同场景下应对挑战的鲁棒性。总体而言,我们提出的XiYan-SQL在Spider测试集上达到了89.65%的执行准确率,在SQL-Eval上达到了69.86%,在NL2GQL上达到了41.20%,在Bird开发基准上达到了72.23%的竞争性分数。该框架不仅提高了SQL查询的质量和多样性,还优于以往的方法。

[NLP-14] CorrSynth – A Correlated Sampling Method for Diverse Dataset Generation from LLMs EMNLP2024

【速读】: 该论文试图解决大型语言模型(LLMs)在生成数据时缺乏多样性、对提示的遵循度不高以及可能存在的偏见问题。解决方案的关键是提出了一种名为CorrSynth的方法,通过解码时基于相关采样的引导策略,生成更具多样性和忠实于输入提示的数据。该方法克服了其他基于引导技术(如基于分类器的引导)的复杂性缺点,并通过广泛的实验验证了其有效性,特别是在提高数据多样性和学生模型性能方面。

链接: https://arxiv.org/abs/2411.08553
作者: Suhas S Kowshik,Abhishek Divekar,Vijit Malik
关键词-EN: Large language models, demonstrated remarkable performance, Large language, few-shot prompting, demonstrated remarkable
类目: Computation and Language (cs.CL)
备注: Published as a main conference paper at EMNLP 2024; First two authors contributed equally

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance in diverse tasks using zero-shot and few-shot prompting. Even though their capabilities of data synthesis have been studied well in recent years, the generated data suffers from a lack of diversity, less adherence to the prompt, and potential biases that creep into the data from the generator model. In this work, we tackle the challenge of generating datasets with high diversity, upon which a student model is trained for downstream tasks. Taking the route of decoding-time guidance-based approaches, we propose CorrSynth, which generates data that is more diverse and faithful to the input prompt using a correlated sampling strategy. Further, our method overcomes the complexity drawbacks of some other guidance-based techniques like classifier-based guidance. With extensive experiments, we show the effectiveness of our approach and substantiate our claims. In particular, we perform intrinsic evaluation to show the improvements in diversity. Our experiments show that CorrSynth improves both student metrics and intrinsic metrics upon competitive baselines across four datasets, showing the innate advantage of our method.
摘要:大语言模型(LLMs)在多种任务中展示了通过零样本和少样本提示的显著性能。尽管近年来对其数据合成能力进行了深入研究,但生成的数据存在多样性不足、对提示的遵循度不高以及生成模型中潜在偏差的问题。在本研究中,我们解决了生成具有高多样性数据集的挑战,这些数据集用于训练学生模型以执行下游任务。我们采用了解码时基于指导的方法,提出了CorrSynth,该方法通过相关采样策略生成更具多样性且忠实于输入提示的数据。此外,我们的方法克服了其他基于指导技术(如基于分类器的指导)的复杂性缺点。通过广泛的实验,我们展示了我们方法的有效性并验证了我们的主张。特别是,我们进行了内在评估以展示多样性的改进。我们的实验表明,CorrSynth在四个数据集上均提升了学生模型指标和内在指标,显示出我们方法的固有优势。

[NLP-15] Neural Topic Modeling with Large Language Models in the Loop

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)在主题建模(topic modeling)中存在的主题覆盖不全、主题对齐不准确以及效率低下的问题。解决方案的关键在于提出了一种名为LLM-ITL的新型LLM-in-the-loop框架,该框架将LLMs与现有的神经主题模型(Neural Topic Models, NTMs)相结合。在LLM-ITL中,全局主题和文档表示通过NTM学习,而LLM则通过基于置信度加权的最佳传输(Optimal Transport, OT)对齐目标来优化主题。这一过程不仅增强了学习主题的可解释性和连贯性,同时保持了NTMs的效率。实验结果表明,LLM-ITL能够显著提升NTMs的主题可解释性,同时保持文档表示的质量。

链接: https://arxiv.org/abs/2411.08534
作者: Xiaohao Yang,He Zhao,Weijie Xu,Yuanyuan Qi,Jueqing Lu,Dinh Phung,Lan Du
关键词-EN: natural language processing, Large Language Models, latent thematic structures, Neural Topic Models, language processing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Topic modeling is a fundamental task in natural language processing, allowing the discovery of latent thematic structures in text corpora. While Large Language Models (LLMs) have demonstrated promising capabilities in topic discovery, their direct application to topic modeling suffers from issues such as incomplete topic coverage, misalignment of topics, and inefficiency. To address these limitations, we propose LLM-ITL, a novel LLM-in-the-loop framework that integrates LLMs with many existing Neural Topic Models (NTMs). In LLM-ITL, global topics and document representations are learned through the NTM, while an LLM refines the topics via a confidence-weighted Optimal Transport (OT)-based alignment objective. This process enhances the interpretability and coherence of the learned topics, while maintaining the efficiency of NTMs. Extensive experiments demonstrate that LLM-ITL can help NTMs significantly improve their topic interpretability while maintaining the quality of document representation.
摘要:主题建模是自然语言处理中的一个基础任务,旨在从文本语料库中发现潜在的主题结构。尽管大语言模型(Large Language Models, LLMs)在主题发现方面展示了令人鼓舞的能力,但它们直接应用于主题建模时存在主题覆盖不全、主题错位和效率低下的问题。为了解决这些限制,我们提出了LLM-ITL,这是一种新颖的LLM-in-the-loop框架,将LLMs与现有的多种神经主题模型(Neural Topic Models, NTMs)相结合。在LLM-ITL中,通过NTM学习全局主题和文档表示,而LLM则通过基于置信度加权的最佳传输(Optimal Transport, OT)对齐目标来细化主题。这一过程增强了所学主题的可解释性和连贯性,同时保持了NTM的效率。广泛的实验表明,LLM-ITL能够显著提升NTM的主题可解释性,同时保持文档表示的质量。
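
LLM-ITL 的对齐目标基于最优传输(OT)。下面给出熵正则化 OT 的 Sinkhorn 迭代的最小 numpy 示意,用随机嵌入演示如何在主题词与 LLM 精炼词之间求一个软对齐矩阵;置信度加权等论文细节未包含在内。

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """熵正则化最优传输:cost 为代价矩阵,a、b 为两侧边缘分布,返回传输计划。"""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
topic_emb = rng.normal(size=(5, 32))     # 假设:NTM 主题词嵌入
llm_emb = rng.normal(size=(5, 32))       # 假设:LLM 精炼后的词嵌入
cost = 1 - (topic_emb @ llm_emb.T) / (
    np.linalg.norm(topic_emb, axis=1)[:, None] * np.linalg.norm(llm_emb, axis=1)[None, :])

plan = sinkhorn(cost, np.full(5, 0.2), np.full(5, 0.2))   # 软对齐矩阵
print(np.round(plan, 3))
```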

[NLP-16] Tree-of-Table: Unleashing the Power of LLMs for Enhanced Large-Scale Table Understanding

【速读】: 该论文试图解决大型语言模型(LLMs)在处理大规模复杂表格数据时面临的挑战,特别是表格尺寸和复杂关系的问题。解决方案的关键在于引入了一种名为“Tree-of-Table”的新方法,通过表格凝聚(Table Condensation)和分解(Decomposition)将相关数据重新组织成可管理的格式,并构建一个层次化的表格树(Table-Tree)来促进树结构推理。随后,通过细致的表格树执行(Table-Tree Execution)过程,系统地解开树结构推理链以得出解决方案。实验结果表明,该方法在多个数据集上显著提升了性能,展示了在大型表格推理中的高效性和泛化能力。

链接: https://arxiv.org/abs/2411.08516
作者: Deyi Ji,Lanyun Zhu,Siqi Gao,Peng Xu,Hongtao Lu,Jieping Ye,Feng Zhao
关键词-EN: domains necessitate advanced, necessitate advanced methods, amounts of information, domains necessitate, necessitate advanced
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ubiquity and value of tables as semi-structured data across various domains necessitate advanced methods for understanding their complexity and vast amounts of information. Despite the impressive capabilities of large language models (LLMs) in advancing the natural language understanding frontier, their application to large-scale tabular data presents significant challenges, specifically regarding table size and complex intricate relationships. Existing works have shown promise with small-scale tables but often flounder when tasked with the complex reasoning required by larger, interconnected tables found in real-world scenarios. To address this gap, we introduce “Tree-of-Table”, a novel approach designed to enhance LLMs’ reasoning capabilities over large and complex tables. Our method employs Table Condensation and Decomposition to distill and reorganize relevant data into a manageable format, followed by the construction of a hierarchical Table-Tree that facilitates tree-structured reasoning. Through a meticulous Table-Tree Execution process, we systematically unravel the tree-structured reasoning chain to derive the solutions. Experiments across diverse datasets, including WikiTQ, TableFact, FeTaQA, and BIRD, demonstrate that Tree-of-Table sets a new benchmark with superior performance, showcasing remarkable efficiency and generalization capabilities in large-scale table reasoning.
摘要:表格作为跨多个领域的半结构化数据的普遍性和价值,要求开发先进的方法来理解其复杂性和海量信息。尽管大语言模型 (LLM) 在推进自然语言理解前沿方面展现出令人瞩目的能力,但将其应用于大规模表格数据时仍面临显著挑战,特别是关于表格大小和复杂关系的问题。现有研究在小规模表格上显示出潜力,但在处理现实场景中更大、更互联的表格所需的复杂推理时往往力不从心。为填补这一空白,我们提出了“表格树 (Tree-of-Table)”,这是一种旨在增强大语言模型对大型复杂表格推理能力的新方法。我们的方法采用表格浓缩与分解技术,将相关数据提炼并重组为可管理格式,随后构建一个层次化的表格树,以促进树结构推理。通过细致的表格树执行过程,我们系统地解开树结构推理链条,从而得出解决方案。在包括 WikiTQ、TableFact、FeTaQA 和 BIRD 在内的多样化数据集上的实验表明,表格树在大型表格推理中树立了新的性能标杆,展现出卓越的效率和泛化能力。

[NLP-17] An Information Theoretic Approach to Operationalize Right to Data Protection

【速读】: 该论文试图解决大规模数据抓取用于微调语言模型(LMs)时引发的显著法律和伦理问题,特别是关于遵守数据保护法律如《通用数据保护条例》(GDPR)的问题。解决方案的关键在于引入RegText框架,该框架通过向自然语言数据集中注入不可察觉的虚假相关性,使得数据变得不可学习,同时不影响数据的语义内容。这种方法有效地限制了如GPT-4o和Llama等新型模型在其生成的数据上的学习能力,导致测试准确率下降,从而为保护公共数据提供了生成不可学习文本的新途径。

链接: https://arxiv.org/abs/2411.08506
作者: Abhinav Java,Simra Shahid,Chirag Agarwal
关键词-EN: Data Protection Regulation, General Data Protection, data protection laws, raises significant legal, Protection Regulation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First two authors contributed equally to this work

点击查看摘要

Abstract:The widespread practice of indiscriminate data scraping to fine-tune language models (LMs) raises significant legal and ethical concerns, particularly regarding compliance with data protection laws such as the General Data Protection Regulation (GDPR). This practice often results in the unauthorized use of personal information, prompting growing debate within the academic and regulatory communities. Recent works have introduced the concept of generating unlearnable datasets (by adding imperceptible noise to the clean data), such that the underlying model achieves lower loss during training but fails to generalize to the unseen test setting. Though somewhat effective, these approaches are predominantly designed for images and are limited by several practical constraints like requiring knowledge of the target model. To this end, we introduce RegText, a framework that injects imperceptible spurious correlations into natural language datasets, effectively rendering them unlearnable without affecting semantic content. We demonstrate RegText’s utility through rigorous empirical analysis of small and large LMs. Notably, RegText can restrict newer models like GPT-4o and Llama from learning on our generated data, resulting in a drop in their test accuracy compared to their zero-shot performance and paving the way for generating unlearnable text to protect public data.
摘要:广泛采用的无差别数据抓取以微调语言模型 (LMs) 的做法引发了重大的法律和伦理问题,特别是在遵守数据保护法律如《通用数据保护条例》(GDPR) 方面。这种做法往往导致个人信息的未经授权使用,引发了学术界和监管界的日益激烈的讨论。近期研究引入了生成不可学习数据集的概念(通过向干净数据添加不可察觉的噪声),使得底层模型在训练期间损失降低,但在未见测试环境中泛化能力下降。尽管这些方法在一定程度上有效,但它们主要针对图像设计,并受限于若干实际限制,如需要了解目标模型。为此,我们提出了 RegText,这是一个框架,通过向自然语言数据集注入不可察觉的虚假关联,使其在不影响语义内容的情况下变得不可学习。我们通过严格的实证分析展示了 RegText 在小规模和大规模 LMs 中的效用。值得注意的是,RegText 能够限制 GPT-4o 和 Llama 等较新模型在我们生成的数据上的学习,导致它们的测试准确率相较于零样本性能有所下降,为生成不可学习的文本以保护公共数据铺平了道路。

[NLP-18] Towards Objective and Unbiased Decision Assessments with LLM-Enhanced Hierarchical Attention Networks

【速读】: 该论文试图解决在高风险决策过程中人类专家的认知偏差问题,特别是在大学招生等实际场景中的决策有效性。解决方案的关键在于提出了一个名为 BGM-HAN 的层次注意力网络,该网络通过字节对编码 (byte-pair encoding)、多头注意力机制 (multi-head attention) 和门控残差连接 (gated residual connection) 增强,并结合 Shortlist-Analyse-Recommend (SAR) 代理工作流程,模拟实际决策过程。实验结果表明,该模型和工作流程在提升决策质量和减少偏差方面显著优于人类判断和其他模型,且在真实世界数据中得到了验证。

链接: https://arxiv.org/abs/2411.08504
作者: Junhua Liu,Kwan Hui Lim,Roy Ka-Wei Lee
关键词-EN: objective and unbiased, cognitive bias, Abstract, high-stake decision making, decision making process
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How objective and unbiased are we while making decisions? This work investigates cognitive bias identification in high-stake decision making process by human experts, questioning its effectiveness in real-world settings, such as candidates assessments for university admission. We begin with a statistical analysis assessing correlations among different decision points among in the current process, which discovers discrepancies that imply cognitive bias and inconsistency in decisions. This motivates our exploration of bias-aware AI-augmented workflow that surpass human judgment. We propose BGM-HAN, a hierarchical attention network enhanced by byte-pair encoding, multi-head attention and gated residual connection. Using it as backbone model, we further propose a Shortlist-Analyse-Recommend (SAR) agentic workflow, which simulate real-world decision-making. In our experiments, both the proposed model and the agentic workflow significantly improves on both human judgment and alternative models, validated with real-world data.
摘要:在做出决策时,我们有多客观和公正?本研究探讨了人类专家在高风险决策过程中的认知偏差识别问题,质疑其在实际场景中的有效性,例如大学录取候选人的评估。我们首先进行统计分析,评估当前过程中不同决策点之间的相关性,发现这些差异暗示了决策中的认知偏差和不一致性。这促使我们探索超越人类判断的偏差感知AI增强工作流程。我们提出了BGM-HAN,这是一种由字节对编码、多头注意力和门控残差连接增强的分层注意力网络。以此为基础模型,我们进一步提出了一个模拟现实世界决策的短名单-分析-推荐(SAR)智能体工作流程。在我们的实验中,所提出的模型和智能体工作流程在人类判断和替代模型上均显著提升,并通过真实世界数据得到了验证。

[NLP-19] Towards Evaluating Large Language Models for Graph Query Generation

【速读】: 该论文试图解决生成式 AI (Generative AI) 在图数据库和知识图谱 (Knowledge Graphs, KGs) 中生成 Cypher 查询的挑战。解决方案的关键在于使用开放访问的大型语言模型 (Large Language Models, LLMs) 进行对比研究,并通过设计少样本学习提示 (few-shot learning prompt) 和基于检索增强生成 (Retrieval Augmented Generation, RAG) 以及思维链 (Chain-of-Thoughts, CoT) 推理的方法来评估多个 LLM 代理(如 OpenAI ChatGPT 4o, Claude Sonnet 3.5, Google Gemini Pro 1.5, 和本地部署的 Llama 3.1 8B)的查询生成准确性。研究结果表明,Claude Sonnet 3.5 在生成 Cypher 查询方面表现优于其他模型,并指出了未来研究的方向以解决现有局限性并推进 LLM 驱动的图数据库查询生成技术。

链接: https://arxiv.org/abs/2411.08449
作者: Siraj Munir,Alessandro Aldini
关键词-EN: Generative Artificial Intelligence, Large Language Models, Artificial Intelligence, Generative Artificial, solutions emerging rapidly
类目: Emerging Technologies (cs.ET); Computation and Language (cs.CL)
备注: Paper accepted and will be presented at CSCI2024 in December 2024, Later will be published at Springer LNCS

点击查看摘要

Abstract:Large Language Models (LLMs) are revolutionizing the landscape of Generative Artificial Intelligence (GenAI), with innovative LLM-backed solutions emerging rapidly. However, when applied to database technologies, specifically query generation for graph databases and Knowledge Graphs (KGs), LLMs still face significant challenges. While research on LLM-driven query generation for Structured Query Language (SQL) exists, similar systems for graph databases remain underdeveloped. This paper presents a comparative study addressing the challenge of generating Cypher queries a powerful language for interacting with graph databases using open-access LLMs. We rigorously evaluate several LLM agents (OpenAI ChatGPT 4o, Claude Sonnet 3.5, Google Gemini Pro 1.5, and a locally deployed Llama 3.1 8B) using a designed few-shot learning prompt and Retrieval Augmented Generation (RAG) backed by Chain-of-Thoughts (CoT) reasoning. Our empirical analysis of query generation accuracy reveals that Claude Sonnet 3.5 outperforms its counterparts in this specific domain. Further, we highlight promising future research directions to address the identified limitations and advance LLM-driven query generation for graph databases.
摘要:大语言模型 (LLM) 正在革新生成式人工智能 (Generative AI) 的领域,涌现出许多创新的 LLM 支持的解决方案。然而,当应用于数据库技术,特别是图数据库和知识图谱 (Knowledge Graphs, KGs) 的查询生成时,LLM 仍然面临重大挑战。尽管已有关于 LLM 驱动的结构化查询语言 (SQL) 查询生成的研究,但针对图数据库的类似系统仍处于发展初期。本文针对使用开放访问的 LLM 生成 Cypher 查询这一挑战进行了比较研究,Cypher 是一种用于与图数据库交互的强大语言。我们严格评估了几种 LLM 智能体(包括 OpenAI ChatGPT 4o、Claude Sonnet 3.5、Google Gemini Pro 1.5 和本地部署的 Llama 3.1 8B),使用了设计的少样本学习提示和基于链式思维 (Chain-of-Thoughts, CoT) 推理的检索增强生成 (Retrieval Augmented Generation, RAG)。我们的实证分析表明,在生成 Cypher 查询的准确性方面,Claude Sonnet 3.5 在这一特定领域优于其他智能体。此外,我们指出了未来有前景的研究方向,以解决已识别的局限性并推动 LLM 驱动的图数据库查询生成技术的发展。
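
下面给出“少样本提示生成 Cypher 查询”这一做法的极简示意:把图模式、示例问答与用户问题拼成提示后交给任意 LLM 接口。其中的图模式、示例与 call_llm 函数均为假设的占位,并非某个真实 API。

```python
FEW_SHOT_CYPHER_PROMPT = """你是图数据库查询助手,请根据图模式生成 Cypher 查询。

图模式(假设): (:Person {name})-[:ACTED_IN]->(:Movie {title, year})

示例:
问题: Tom Hanks 出演过哪些电影?
Cypher: MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN m.title

问题: {question}
Cypher:"""

def generate_cypher(question, call_llm):
    # call_llm 为假设的 LLM 调用函数:输入提示字符串,返回模型生成的文本
    prompt = FEW_SHOT_CYPHER_PROMPT.replace("{question}", question)
    return call_llm(prompt).strip()

# 用法示意:query = generate_cypher("2000 年以后上映的电影有哪些?", call_llm=my_llm)
```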

[NLP-20] One STEP at a time: Language Agents are Stepwise Planners

【速读】: 该论文试图解决语言代理在需要规划的任务中表现不足的问题。解决方案的关键在于引入了一个名为STEP的新框架,该框架通过四个相互连接的组件来增强语言代理的规划能力:1) 规划器(Planner)负责将任务分解为子任务并提供相关见解;2) 执行器(Executor)生成行动候选;3) 评估器(Evaluator)确保行动符合从先前经验中学到的规则;4) 记忆(Memory)存储经验以指导未来的决策。在ScienceWorld基准测试中,STEP显著优于现有最先进模型,展示了其在动态环境中提升任务解决能力的潜力。

链接: https://arxiv.org/abs/2411.08432
作者: Minh Nguyen,Ehsan Shareghi
关键词-EN: shown promising adaptability, perform complex tasks, Language agents, shown promising, promising adaptability
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language agents have shown promising adaptability in dynamic environments to perform complex tasks. However, despite the versatile knowledge embedded in large language models, these agents still fall short when it comes to tasks that require planning. We introduce STEP, a novel framework designed to efficiently learn from previous experiences to enhance the planning capabilities of language agents in future steps. Concretely, STEP functions through four interconnected components. First, the Planner takes on the task, breaks it down into subtasks and provides relevant insights. Then the Executor generates action candidates, while the Evaluator ensures the actions align with learned rules from previous experiences. Lastly, Memory stores experiences to inform future decisions. In the ScienceWorld benchmark, our results show that STEP consistently outperforms state-of-the-art models, achieving an overall score of 67.4 and successfully completing 12 out of 18 tasks. These findings highlight STEP’s potential as a framework for enhancing planning capabilities in language agents, paving the way for more sophisticated task-solving in dynamic environments.
摘要:语言智能体在动态环境中展现出了执行复杂任务的适应性潜力。然而,尽管大语言模型中嵌入了丰富的知识,这些智能体在需要规划的任务上仍显不足。我们提出了 STEP,一种新颖的框架,旨在通过从前经验中高效学习来增强语言智能体在未来步骤中的规划能力。具体而言,STEP 通过四个相互关联的组件运作。首先,规划器承担任务,将其分解为子任务并提供相关见解。接着,执行器生成行动候选方案,而评估器则确保这些行动符合从前经验中学习到的规则。最后,记忆模块存储经验,以指导未来的决策。在 ScienceWorld 基准测试中,我们的结果显示 STEP 持续优于最先进的模型,总体得分为 67.4,并成功完成了 18 项任务中的 12 项。这些发现突显了 STEP 作为增强语言智能体规划能力框架的潜力,为在动态环境中更复杂的任务解决铺平了道路。

[NLP-21] CLaSP: Learning Concepts for Time-Series Signals from Natural Language Supervision

【速读】: 该论文试图解决时间序列信号数据在自然语言描述下的搜索问题。传统方法在设计时间序列信号特征的常规类别、量化这些特征以及创建同义词词典方面存在挑战。论文提出的解决方案之关键是引入基于对比学习的神经网络模型,称为CLaSP。该模型通过使用包含时间序列信号及其自然语言描述的数据集(如TRUCE和SUSHI)进行训练,能够直接利用数据分析师常用的自然语言词汇进行搜索,无需预定义的同义词词典,并利用大规模语言模型(LLM)中嵌入的常识知识来增强搜索能力。实验结果表明,CLaSP能够有效地支持自然语言搜索时间序列信号数据,并准确识别信号数据的变化点。

链接: https://arxiv.org/abs/2411.08397
作者: Aoi Ito,Kota Dohi,Yohei Kawaguchi
关键词-EN: time series signal, time series, foundation model called, series signal, series signal data
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper proposes a foundation model called “CLaSP” that can search time series signals using natural language that describes the characteristics of the signals as queries. Previous efforts to represent time series signal data in natural language have had challenges in designing a conventional class of time series signal characteristics, formulating their quantification, and creating a dictionary of synonyms. To overcome these limitations, the proposed method introduces a neural network based on contrastive learning. This network is first trained using the datasets TRUCE and SUSHI, which consist of time series signals and their corresponding natural language descriptions. Previous studies have proposed vocabularies that data analysts use to describe signal characteristics, and SUSHI was designed to cover these terms. We believe that a neural network trained on these datasets will enable data analysts to search using natural language vocabulary. Furthermore, our method does not require a dictionary of predefined synonyms, and it leverages common sense knowledge embedded in a large-scale language model (LLM). Experimental results demonstrate that CLaSP enables natural language search of time series signal data and can accurately learn the points at which signal data changes.
摘要:本文提出了一种名为“CLaSP”的基础模型,该模型能够使用描述信号特征的自然语言作为查询来搜索时间序列信号。以往在将时间序列信号数据表示为自然语言方面的工作面临挑战,包括设计常规的时间序列信号特征类别、制定其量化方法以及创建同义词词典。为了克服这些限制,本文提出的方法引入了一种基于对比学习的神经网络。该网络首先使用TRUCE和SUSHI数据集进行训练,这些数据集包含了时间序列信号及其相应的自然语言描述。以往的研究提出了数据分析师用于描述信号特征的词汇,而SUSHI数据集旨在涵盖这些术语。我们相信,在这些数据集上训练的神经网络将使数据分析师能够使用自然语言词汇进行搜索。此外,我们的方法不需要预定义的同义词词典,而是利用了大语言模型(LLM)中嵌入的常识知识。实验结果表明,CLaSP能够实现时间序列信号数据的自然语言搜索,并且能够准确学习信号数据变化的点。
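
CLaSP 所用的对比学习通常可以写成 CLIP 式的对称 InfoNCE 损失。下面是一个最小 PyTorch 示意,用随机张量代替信号编码器与文本编码器的输出,仅说明损失的计算方式,并非论文的完整训练流程。

```python
import torch
import torch.nn.functional as F

def clip_style_loss(signal_emb, text_emb, temperature=0.07):
    """signal_emb、text_emb: (batch, dim),同一行互为正样本对。"""
    signal_emb = F.normalize(signal_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = signal_emb @ text_emb.t() / temperature       # 批内两两相似度
    labels = torch.arange(logits.size(0))
    # 对称交叉熵:信号检索文本 + 文本检索信号
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

loss = clip_style_loss(torch.randn(8, 128), torch.randn(8, 128))   # 随机张量代替编码器输出
print(float(loss))
```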

[NLP-22] Interpretable Syntactic Representations Enable Hierarchical Word Vectors

【速读】: 该论文试图解决当前分布式表示(distributed representations)中存在的密集且难以解释的问题。解决方案的关键在于提出了一种将词向量转换为简化语法表示(reduced syntactic representations)的方法。这种方法生成的表示形式紧凑且可解释,便于词向量的可视化和比较,并且与人类判断一致。通过增量学习(incremental learning)方法,这些语法表示被用于创建层次化的词向量,类似于人类学习的层次结构。由于这些表示是从预训练向量中提取的,生成过程和学习方法在计算上高效。最重要的是,语法表示为向量提供了合理的解释,并且在基准测试中,后续的层次化向量表现优于原始向量。

链接: https://arxiv.org/abs/2411.08384
作者: Biraj Silwal
关键词-EN: dense and uninterpretable, hard to interpret, representations, syntactic representations, vectors
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The distributed representations currently used are dense and uninterpretable, leading to interpretations that themselves are relative, overcomplete, and hard to interpret. We propose a method that transforms these word vectors into reduced syntactic representations. The resulting representations are compact and interpretable allowing better visualization and comparison of the word vectors and we successively demonstrate that the drawn interpretations are in line with human judgment. The syntactic representations are then used to create hierarchical word vectors using an incremental learning approach similar to the hierarchical aspect of human learning. As these representations are drawn from pre-trained vectors, the generation process and learning approach are computationally efficient. Most importantly, we find out that syntactic representations provide a plausible interpretation of the vectors and subsequent hierarchical vectors outperform the original vectors in benchmark tests.
摘要:当前使用的分布式表示是密集且不可解释的,导致其解释本身具有相对性、过度完备性和难以解释性。我们提出了一种方法,将这些词向量转换为简化的句法表示。由此产生的表示形式紧凑且可解释,便于更好地可视化和比较词向量,并且我们成功地证明了所得到的解释与人类判断一致。随后,这些句法表示被用于通过类似于人类学习层次结构的增量学习方法,创建层次化的词向量。由于这些表示是从预训练向量中提取的,生成过程和学习方法在计算上非常高效。最重要的是,我们发现句法表示为向量提供了合理的解释,并且后续的层次化向量在基准测试中优于原始向量。

[NLP-23] Refining Translations with LLMs: A Constraint-Aware Iterative Prompting Approach

【速读】: 该论文试图解决大语言模型(LLMs)在机器翻译(MT)中处理低资源或领域特定上下文中的罕见词汇时面临的挑战。解决方案的关键在于提出了一种多步骤的提示链(multi-step prompt chain),通过优先考虑对语义准确性至关重要的关键词来增强翻译的忠实度。具体方法包括:首先识别这些关键词并从双语词典中检索其翻译,然后利用检索增强生成(Retrieval-Augmented Generation, RAG)将这些翻译整合到LLM的上下文中;此外,通过迭代自检机制(iterative self-checking mechanism),LLM根据词汇和语义约束对其翻译进行细化,从而减少长提示可能导致的输出幻觉(output hallucinations)。实验结果表明,该方法在FLORES-200和WMT数据集上显著优于基线模型,特别是在低资源场景下,显著提升了翻译的忠实度和鲁棒性。

链接: https://arxiv.org/abs/2411.08348
作者: Shangfeng Chen,Xiayang Shi,Pu Li,Yinlin Li,Jingjing Liu
关键词-EN: Large language models, demonstrated remarkable proficiency, Large language, demonstrated remarkable, remarkable proficiency
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable proficiency in machine translation (MT), even without specific training on the languages in question. However, translating rare words in low-resource or domain-specific contexts remains challenging for LLMs. To address this issue, we propose a multi-step prompt chain that enhances translation faithfulness by prioritizing key terms crucial for semantic accuracy. Our method first identifies these keywords and retrieves their translations from a bilingual dictionary, integrating them into the LLM’s context using Retrieval-Augmented Generation (RAG). We further mitigate potential output hallucinations caused by long prompts through an iterative self-checking mechanism, where the LLM refines its translations based on lexical and semantic constraints. Experiments using Llama and Qwen as base models on the FLORES-200 and WMT datasets demonstrate significant improvements over baselines, highlighting the effectiveness of our approach in enhancing translation faithfulness and robustness, particularly in low-resource scenarios.
摘要:大语言模型(LLMs)在机器翻译(MT)方面展示了显著的熟练度,即使在没有针对特定语言进行专门训练的情况下也是如此。然而,在低资源或领域特定的情境中翻译罕见词汇仍然是LLMs面临的一个挑战。为了解决这一问题,我们提出了一种多步骤的提示链,通过优先考虑对语义准确性至关重要的关键词来增强翻译的忠实度。我们的方法首先识别这些关键词,并从双语词典中检索其翻译,利用检索增强生成(RAG)将其整合到LLM的上下文中。我们进一步通过迭代自检机制减轻了由长提示引起的潜在输出幻觉问题,其中LLM根据词汇和语义约束来优化其翻译。在FLORES-200和WMT数据集上使用Llama和Qwen作为基础模型进行的实验表明,与基线相比有显著改进,突显了我们的方法在增强翻译忠实度和鲁棒性方面的有效性,特别是在低资源场景中。
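
论文的多步提示链可以概括为:识别关键术语、从双语词典检索译文并注入提示、让模型自检并修正。下面是该流程的极简示意,其中 bilingual_dict 与 call_llm 均为假设的占位对象,提示词措辞也仅作演示。

```python
def translate_with_constraints(src_text, keywords, bilingual_dict, call_llm):
    # 第一步:从双语词典检索关键术语的译文(最简形式的 RAG)
    glossary = {w: bilingual_dict[w] for w in keywords if w in bilingual_dict}
    glossary_str = "\n".join(f"- {s} => {t}" for s, t in glossary.items())

    # 第二步:将术语约束注入翻译提示
    draft = call_llm(
        f"请将下文翻译成英文,务必使用给定的术语译法。\n术语表:\n{glossary_str}\n原文:{src_text}"
    )

    # 第三步:迭代自检,若译文缺少约定术语则要求模型修正
    missing = [t for t in glossary.values() if t not in draft]
    if missing:
        draft = call_llm(
            f"下面的译文缺少术语 {missing},请在保持语义不变的前提下修正。\n译文:{draft}"
        )
    return draft
```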

[NLP-24] A Chinese Multi-label Affective Computing Dataset Based on Social Media Network Users

链接: https://arxiv.org/abs/2411.08347
作者: Jingyi Zhou,Senlin Luo,Haofan Chen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

[NLP-25] Bangla Grammatical Error Detection Leveraging Transformer-based Token Classification

【速读】: 该论文试图解决的是孟加拉语(Bangla)自动化语法检查器开发这一研究不足的问题。解决方案的关键在于将任务分解为Token分类问题,并利用最先进的基于Transformer的模型进行处理。最终,通过结合这些模型的输出并应用基于规则的后处理步骤,生成更可靠和全面的结果。该系统在包含超过25,000条文本的数据集上进行了评估,最佳模型达到了1.04的Levenshtein距离分数。

链接: https://arxiv.org/abs/2411.08344
作者: Shayekh Bin Islam,Ridwanul Hasan Tanvir,Sihat Afnan
关键词-EN: automated grammar checker, automated Bangla typing, Bangla grammatical error, spoken language, Bangla typing assistant
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Bangla is the seventh most spoken language by a total number of speakers in the world, and yet the development of an automated grammar checker in this language is an understudied problem. Bangla grammatical error detection is a task of detecting sub-strings of a Bangla text that contain grammatical, punctuation, or spelling errors, which is crucial for developing an automated Bangla typing assistant. Our approach involves breaking down the task as a token classification problem and utilizing state-of-the-art transformer-based models. Finally, we combine the output of these models and apply rule-based post-processing to generate a more reliable and comprehensive result. Our system is evaluated on a dataset consisting of over 25,000 texts from various sources. Our best model achieves a Levenshtein distance score of 1.04. Finally, we provide a detailed analysis of different components of our system.
摘要:孟加拉语是全球使用人数第七多的语言,然而,针对该语言的自动语法检查器开发却是一个研究不足的问题。孟加拉语语法错误检测任务旨在识别孟加拉语文本中包含语法、标点或拼写错误的子字符串,这对于开发自动孟加拉语输入助手至关重要。我们的方法将该任务分解为Token分类问题,并利用基于Transformer的先进模型进行处理。最终,我们将这些模型的输出结果结合,并应用基于规则的后处理步骤,以生成更为可靠和全面的结果。我们的系统在一个包含超过25,000篇来自不同来源的文本数据集上进行了评估。我们表现最佳的模型达到了1.04的Levenshtein距离分数。最后,我们对系统的各个组成部分进行了详细分析。
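
作为补充,下面给出评测中常用的 Levenshtein(编辑)距离的标准实现,以及把多个模型的 token 级预测做多数投票融合的一个简单示意;这只是通用写法,并非该论文发布的代码,函数划分亦为假设:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """经典动态规划实现的编辑距离。"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # 删除
                            curr[j - 1] + 1,           # 插入
                            prev[j - 1] + (ca != cb))) # 替换
        prev = curr
    return prev[-1]

def ensemble_labels(predictions):
    """对多个模型的 token 级标签(如 0=正确, 1=错误)做逐位置多数投票。"""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

print(levenshtein("kitten", "sitting"))                  # 3
print(ensemble_labels([[0, 1, 0], [0, 1, 1], [0, 0, 1]]))  # [0, 1, 1]
```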

[NLP-26] Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

【速读】: 该论文试图解决现有大型语言模型(LLMs)评估基准因新模型和训练数据的出现而迅速过时的问题,以及这些基准缺乏时间维度,无法评估模型性能随时间变化的情况。解决方案的关键在于提出了一种持续的评估方法,即使用未来事件预测来评估LLMs的时间泛化能力和预测能力。具体来说,论文提出了名为“Daily Oracle”的基准,该基准通过自动从每日新闻中生成问答对,挑战LLMs预测“未来”事件的结果。研究发现,随着预训练数据的过时,LLM的性能会随时间下降,尽管检索增强生成(RAG)有潜力提高预测准确性,但性能下降的模式仍然存在,这强调了持续模型更新的必要性。

链接: https://arxiv.org/abs/2411.08324
作者: Hui Dai,Ryan Teehan,Mengye Ren
关键词-EN: Large Language Models, Large Language, Language Models, existing evaluation benchmarks, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs’ temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict “future” event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
摘要:现有的许多大语言模型(LLM)评估基准由于新模型和训练数据的出现而迅速过时。这些基准在评估LLM性能随时间变化方面也存在不足,因为它们由静态问题组成,缺乏时间维度。为了解决这些局限性,我们提出使用未来事件预测作为连续评估方法,以评估LLM的时间泛化能力和预测能力。我们的基准测试——Daily Oracle,自动从每日新闻中生成问答(QA)对,挑战LLM预测“未来”事件结果。我们的研究发现,随着预训练数据变得过时,LLM的性能随时间推移而下降。尽管检索增强生成(RAG)有潜力提高预测准确性,但性能下降的模式依然存在,这突显了持续模型更新的必要性。

[NLP-27] R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

【速读】: 该论文试图解决当前强化学习从人类反馈 (Reinforcement Learning from Human Feedback, RLHF) 方法中存在的奖励分配问题,即现有的方法对整个输出序列分配单一、稀疏且延迟的奖励,可能忽略了每个token对最终结果的个别贡献。解决方案的关键在于提出了一种新的奖励再分配方法,称为R3HF,该方法通过将奖励模型的预测任务视为回归问题,实现了更细粒度的token级奖励分配。具体来说,R3HF计算每个token对奖励模型输出的具体贡献,从而提高了模型对语言细微差别的理解,进而提升了模型的性能。该方法设计为与大多数现有技术无缝集成,且计算成本较低。

链接: https://arxiv.org/abs/2411.08302
作者: Jiahui Li,Tai-wei Chang,Fengda Zhang,Kun Kuang,Long Chen
关键词-EN: human feedback, aligning large language, pairwise human feedback, Reinforcement learning, paradigm for aligning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. This involves the initial training of a reward model based on pairwise human feedback. The reward model is subsequently utilized in reinforcement learning to assess the scores of each generated sentence as a whole, further guiding the optimization of LLMs. However, current approaches have a significant shortcoming: They allocate a single, sparse, and delayed reward to an entire sequence of output. This may overlook some significant individual contributions of each token towards the desired outcome. To overcome this limitation, our paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation. Specifically, our method treats the reward prediction task of the reward model as a regression problem. As a result, the redistributed rewards are computed by evaluating the specific contribution of each token to the reward model’s output. This detailed approach improves the model’s understanding of language nuances, leading to more precise enhancements in its performance. Our method is crafted to integrate seamlessly with most current techniques while incurring minimal computational costs. Through comprehensive experiments across diverse datasets and tasks, we have verified the effectiveness and superiority of our approach.
摘要:基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 提供了一种将大语言模型 (Large Language Models, LLMs) 与人类偏好对齐的范式。这一过程首先基于成对的人类反馈训练一个奖励模型。随后,该奖励模型在强化学习中被用于评估每个生成的句子整体的得分,从而进一步指导大语言模型的优化。然而,当前的方法存在一个显著的缺陷:它们为整个输出序列分配一个单一、稀疏且延迟的奖励。这可能会忽略每个 Token 对期望结果的个别重要贡献。为了克服这一限制,我们的论文提出了一种名为 R3HF 的新型奖励再分配方法,该方法促进了更细粒度的 Token 级别奖励分配。具体而言,我们的方法将奖励模型的奖励预测任务视为一个回归问题。因此,再分配的奖励是通过评估每个 Token 对奖励模型输出的具体贡献来计算的。这种详细的方法提高了模型对语言细微差别的理解,从而在性能上实现了更精确的提升。我们的方法设计为能够无缝集成到大多数现有技术中,同时只产生最小的计算成本。通过在多种数据集和任务上的全面实验,我们验证了该方法的有效性和优越性。
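
一个直观理解 token 级奖励再分配的方式,是用“前缀奖励差分”把序列级奖励拆到每个 token 上。下面的示意代码仅用于说明这一思路(论文的精确做法是将奖励预测视为回归问题,可能与此不同),其中 reward_model 为假设的序列级打分接口:

```python
def redistribute_rewards(prompt, response_tokens, reward_model):
    """用前缀奖励的增量近似每个 token 的贡献(示意,非论文精确算法)。

    reward_model(prompt, partial_response) -> float,为假设的序列级打分接口。
    """
    token_rewards = []
    prev_score = reward_model(prompt, "")            # 空回复的基准分
    partial = []
    for tok in response_tokens:
        partial.append(tok)
        score = reward_model(prompt, " ".join(partial))
        token_rewards.append(score - prev_score)     # 该 token 带来的奖励增量
        prev_score = score
    return token_rewards

# 最小可运行示例:一个只数“好词”的玩具奖励模型
toy_rm = lambda prompt, resp: sum(w in {"helpful", "polite"} for w in resp.split())
print(redistribute_rewards("q", ["a", "helpful", "answer"], toy_rm))  # [0, 1, 0]
```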

[NLP-28] Knowledge Bases in Support of Large Language Models for Processing Web News

【速读】: 该论文试图解决大语言模型(LLMs)在预训练过程中隐式记忆的事实知识难以被下游应用有效利用的问题,特别是缺乏常识推理能力的问题。解决方案的关键在于引入一个通用框架,通过LLMs辅助构建知识库,并针对处理网络新闻进行定制。该框架的核心组件包括:1) 基于规则的新闻信息提取器(NewsIE),用于从新闻条目中提取结构化信息,形成关系元组;2) BERTGraph,用于将NewsIE提取的关系元组与LLMs获取的隐式知识事实进行图卷积,从而实现新闻分类。通过这两个轻量级组件的结合,论文在不同的新闻相关数据集上进行了新闻类别分类的实验评估,取得了良好的实验结果。

链接: https://arxiv.org/abs/2411.08278
作者: Yihe Zhang,Nabin Pakka,Nian-feng Tzeng
关键词-EN: Large Language Models, Large Language, received considerable interest, Language Models, received considerable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have received considerable interest in wide applications lately. During pre-training via massive datasets, such a model implicitly memorizes the factual knowledge of trained datasets in its hidden parameters. However, knowledge held implicitly in parameters often makes its use by downstream applications ineffective due to the lack of common-sense reasoning. In this article, we introduce a general framework that permits to build knowledge bases with an aid of LLMs, tailored for processing Web news. The framework applies a rule-based News Information Extractor (NewsIE) to news items for extracting their relational tuples, referred to as knowledge bases, which are then graph-convoluted with the implicit knowledge facts of news items obtained by LLMs, for their classification. It involves two lightweight components: 1) NewsIE: for extracting the structural information of every news item, in the form of relational tuples; 2) BERTGraph: for graph convoluting the implicit knowledge facts with relational tuples extracted by NewsIE. We have evaluated our framework under different news-related datasets for news category classification, with promising experimental results.
摘要:大语言模型 (LLM) 近期在广泛应用中引起了极大的关注。在通过大规模数据集进行预训练的过程中,此类模型在其隐藏参数中隐式地记忆了训练数据集的事实知识。然而,由于缺乏常识推理能力,参数中隐含的知识往往使得下游应用的使用效果不佳。本文介绍了一种通用框架,该框架借助 LLM 构建知识库,专门用于处理网络新闻。该框架应用基于规则的新闻信息提取器 (NewsIE) 从新闻条目中提取关系元组,即知识库,然后通过 LLM 获取的新闻条目隐含知识事实与这些关系元组进行图卷积,以进行分类。该框架包含两个轻量级组件:1) NewsIE:用于提取每个新闻条目的结构信息,以关系元组的形式呈现;2) BERTGraph:用于将 NewsIE 提取的关系元组与隐含知识事实进行图卷积。我们在不同的新闻相关数据集上对新闻类别分类进行了评估,实验结果令人鼓舞。

[NLP-29] A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look

【速读】: 该论文试图解决在大规模信息检索评估中,如何利用大型语言模型(LLMs)生成相关性评估以替代传统全手动评估的问题。解决方案的关键在于通过UMBRELA工具部署三种不同的LLM辅助评估方法,并与传统的全手动评估方法进行对比,以评估这些方法在系统排名上的相关性、成本效益和质量。研究结果表明,UMBRELA生成的自动评估在nDCG@20、nDCG@100和Recall@100等指标上与全手动评估高度相关,且在捕捉系统级别有效性方面表现出色,验证了LLMs在学术TREC风格评估中的应用潜力。

链接: https://arxiv.org/abs/2411.08275
作者: Shivani Upadhyay,Ronak Pradeep,Nandan Thakur,Daniel Campos,Nick Craswell,Ian Soboroff,Hoa Trang Dang,Jimmy Lin
关键词-EN: natural language processing, large language models, advance information retrieval, presents exciting opportunities, assessments presents exciting
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the “standard” fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
摘要:大语言模型在提供相关性评估方面的应用为信息检索、自然语言处理等领域带来了激动人心的进展机遇,但迄今为止仍有许多未知之处。本文报告了大规模评估(TREC 2024 RAG 赛道)的结果,其中四种不同的相关性评估方法被实地部署:NIST 几十年来实施的“标准”全手动流程,以及三种利用开源工具 UMBRELA 不同程度地借助大语言模型的替代方案。这种设置使我们能够关联由不同方法诱导的系统排名,以表征成本与质量之间的权衡。我们发现,在 nDCG@20、nDCG@100 和 Recall@100 方面,UMBRELA 自动生成的相关性评估诱导的系统排名与全手动评估诱导的排名在来自 19 支团队的 77 个多样化运行中高度相关。我们的结果表明,UMBRELA 自动生成的判断可以替代全手动判断,以准确捕捉运行级别的效果。令人惊讶的是,我们发现大语言模型的辅助并未显著增加与全手动评估的相关性,这表明与人在回路中的流程相关的成本并未带来明显的实质性收益。总体而言,人类评估者在应用相关性标准时似乎比 UMBRELA 更为严格。我们的工作验证了大语言模型在学术 TREC 风格评估中的应用,并为未来的研究奠定了基础。
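
下面用一个虚构的小例子示意如何衡量“人工评估诱导的系统排名”与“自动评估诱导的系统排名”之间的相关性(此处用 Kendall tau,分数均为编造的示例数据,并非论文的评测脚本):

```python
from scipy.stats import kendalltau

# 虚构示例:5 个提交系统分别在人工评估与自动评估下得到的 nDCG@20
manual_scores = {"run_a": 0.52, "run_b": 0.47, "run_c": 0.61, "run_d": 0.40, "run_e": 0.55}
auto_scores   = {"run_a": 0.50, "run_b": 0.49, "run_c": 0.63, "run_d": 0.38, "run_e": 0.54}

runs = sorted(manual_scores)                 # 固定系统顺序
tau, p_value = kendalltau([manual_scores[r] for r in runs],
                          [auto_scores[r] for r in runs])
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```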

[NLP-30] Deceiving Question-Answering Models: A Hybrid Word-Level Adversarial Approach

【速读】: 该论文试图解决自然语言处理(NLP)中问答(QA)模型在面对对抗性攻击时的鲁棒性问题。解决方案的关键在于提出了一种名为QA-Attack的新型词级对抗策略,该策略通过利用自定义的注意力机制和删除排序策略,识别并针对上下文段落中的特定词汇进行攻击。具体方法是通过精心选择和替换同义词,生成具有欺骗性的输入,同时保持语法完整性,从而误导模型产生错误的回答。该方法在处理各种问题类型,尤其是长文本输入时表现出广泛的适用性,并在多个基准数据集上的实验中展示了其优于现有对抗技术的成功率、语义变化、BLEU分数、流畅性和语法错误率。

链接: https://arxiv.org/abs/2411.08248
作者: Jiyao Li,Mingze Ni,Yongshun Gong,Wei Liu
关键词-EN: Deep learning underpins, neural machine translation, natural language processing, advanced natural language, Deep learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning underpins most of the currently advanced natural language processing (NLP) tasks such as textual classification, neural machine translation (NMT), abstractive summarization and question-answering (QA). However, the robustness of the models, particularly QA models, against adversarial attacks is a critical concern that remains insufficiently explored. This paper introduces QA-Attack (Question Answering Attack), a novel word-level adversarial strategy that fools QA models. Our attention-based attack exploits the customized attention mechanism and deletion ranking strategy to identify and target specific words within contextual passages. It creates deceptive inputs by carefully choosing and substituting synonyms, preserving grammatical integrity while misleading the model to produce incorrect responses. Our approach demonstrates versatility across various question types, particularly when dealing with extensive long textual inputs. Extensive experiments on multiple benchmark datasets demonstrate that QA-Attack successfully deceives baseline QA models and surpasses existing adversarial techniques regarding success rate, semantics changes, BLEU score, fluency and grammar error rate.
摘要:深度学习支撑了当前大多数先进的自然语言处理 (NLP) 任务,如文本分类、神经机器翻译 (NMT)、抽象摘要和问答 (QA)。然而,模型,特别是 QA 模型,在对抗攻击下的鲁棒性是一个关键问题,目前尚未得到充分探索。本文介绍了 QA-Attack(问答攻击),一种新颖的词级对抗策略,旨在欺骗 QA 模型。我们的基于注意力的攻击利用了定制的注意力机制和删除排序策略,以识别和针对上下文段落中的特定词。通过精心选择和替换同义词,它创建了具有欺骗性的输入,同时保持了语法完整性,误导模型产生错误响应。我们的方法在各种问题类型中表现出广泛的适用性,特别是在处理大量长文本输入时。在多个基准数据集上的广泛实验表明,QA-Attack 成功欺骗了基线 QA 模型,并在成功率、语义变化、BLEU 分数、流畅性和语法错误率方面超越了现有的对抗技术。
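
下面是“删除排序 + 同义词替换”这类词级攻击的一个极简骨架,仅用于说明流程(非论文实现);qa_model(返回正确答案的置信度)与 synonyms(同义词表)均为假设的占位对象:

```python
def rank_words_by_deletion(question, context, qa_model):
    """逐个删除上下文中的词,置信度下降越多的词越重要。"""
    words = context.split()
    base = qa_model(question, context)
    drops = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        drops.append((base - qa_model(question, reduced), i))
    return [i for _, i in sorted(drops, reverse=True)]   # 按重要性排序的词下标

def attack(question, context, qa_model, synonyms, budget=3):
    """按重要性顺序把关键位置替换为同义词,只保留能降低模型置信度的替换。"""
    words = context.split()
    for i in rank_words_by_deletion(question, context, qa_model)[:budget]:
        for cand in synonyms.get(words[i], []):
            trial = words[:i] + [cand] + words[i + 1:]
            if qa_model(question, " ".join(trial)) < qa_model(question, " ".join(words)):
                words = trial
                break
    return " ".join(words)
```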

[NLP-31] Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset

【速读】: 该论文试图解决大型语言模型(LLMs)在应用学习人类反馈(LHF)以减少有害输出和提高帮助性时,反馈质量和效果不明确的问题。解决方案的关键在于对广泛使用的Helpful and Harmless (HH)数据集进行审计,包括对其内容的全面调查(手动和自动化评估)、展示数据集对模型安全性的影响,以及分析引用该数据集的100篇最具影响力的论文。研究发现,HH数据集中的概念化失败和质量问题可能导致不同人群之间的安全行为差异,强调了在LLMs中需要更细致、上下文敏感的安全缓解策略。

链接: https://arxiv.org/abs/2411.08243
作者: Khaoula Chehbouni,Jonathan Colaço-Carr,Yash More,Jackie CK Cheung,Golnoosh Farnadi
关键词-EN: large language models, language models, learning from human, effort to mitigate, large language
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Prepared for conference submission

点击查看摘要

Abstract:In an effort to mitigate the harms of large language models (LLMs), learning from human feedback (LHF) has been used to steer LLMs towards outputs that are intended to be both less harmful and more helpful. Despite the widespread adoption of LHF in practice, the quality of this feedback and its effectiveness as a safety mitigation technique remain unclear. This study addresses these issues by auditing the widely-used Helpful and Harmless (HH) dataset by Anthropic. Our work includes: (1) a thorough investigation of the dataset’s content through both manual and automated evaluation; (2) experiments demonstrating the dataset’s impact on models’ safety; and (3) an analysis of the 100 most influential papers citing this dataset. Through our audit, we showcase how conceptualization failures and quality issues identified in the HH dataset can create additional harms by leading to disparate safety behaviors across demographic groups. Our findings highlight the need for more nuanced, context-sensitive approaches to safety mitigation in LLMs.
摘要:为了减轻大语言模型 (LLM) 带来的危害,学习人类反馈 (LHF) 已被用于引导 LLM 生成既减少危害又更具帮助性的输出。尽管 LHF 在实践中得到了广泛应用,但这种反馈的质量及其作为安全缓解技术的有效性仍不明确。本研究通过审查 Anthropic 广泛使用的 Helpful and Harmless (HH) 数据集来解决这些问题。我们的工作包括:(1) 通过手动和自动化评估对数据集内容进行全面调查;(2) 实验展示数据集对模型安全性的影响;(3) 分析引用该数据集的 100 篇最具影响力的论文。通过我们的审查,我们展示了 HH 数据集中概念化失败和质量问题如何通过导致不同人口群体间安全行为的差异而产生额外危害。我们的研究结果强调了在 LLM 中需要更加细致、上下文敏感的安全缓解方法。

[NLP-32] Retrieval Reasoning Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion

【速读】: 该论文试图解决知识图谱补全 (Knowledge Graph Completion, KGC) 任务中现有嵌入式方法依赖单一三元组数据导致的易受虚假关系模式和长尾实体影响的问题,以及基于文本的方法在知识图谱三元组与自然语言之间存在的语义鸿沟问题。解决方案的关键在于提出了一个上下文增强的框架 KGR3,该框架由三个模块组成:检索模块 (Retrieval module) 负责从知识图谱中收集支持性三元组、从基础嵌入模型中获取候选答案,并检索相关实体的上下文;推理模块 (Reasoning module) 利用大型语言模型生成每个查询三元组的潜在答案;重排序模块 (Re-ranking module) 结合前两个模块的候选答案,并微调大型语言模型以提供最佳答案。通过这种方式,KGR3 能够有效提升各种 KGC 方法的性能,实验结果表明,KGR3 在 FB15k237 和 WN18RR 数据集上分别实现了 12.3% 和 5.6% 的 Hits@1 绝对提升。

链接: https://arxiv.org/abs/2411.08165
作者: Muzhi Li,Cehao Yang,Chengjin Xu,Xuhui Jiang,Yiyan Qi,Jian Guo,Ho-fung Leung,Irwin King
关键词-EN: Knowledge Graph Completion, Graph Completion, Knowledge Graph, task aims, aims to infer
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Knowledge Graph Completion~(KGC) task aims to infer the missing entity from an incomplete triple. Existing embedding-based methods rely solely on triples in the KG, which is vulnerable to specious relation patterns and long-tail entities. On the other hand, text-based methods struggle with the semantic gap between KG triples and natural language. Apart from triples, entity contexts (e.g., labels, descriptions, aliases) also play a significant role in augmenting KGs. To address these limitations, we propose KGR3, a context-enriched framework for KGC. KGR3 is composed of three modules. Firstly, the Retrieval module gathers supporting triples from the KG, collects plausible candidate answers from a base embedding model, and retrieves context for each related entity. Then, the Reasoning module employs a large language model to generate potential answers for each query triple. Finally, the Re-ranking module combines candidate answers from the two modules mentioned above, and fine-tunes an LLM to provide the best answer. Extensive experiments on widely used datasets demonstrate that KGR3 consistently improves various KGC methods. Specifically, the best variant of KGR3 achieves absolute Hits@1 improvements of 12.3% and 5.6% on the FB15k237 and WN18RR datasets.
摘要:知识图谱补全 (Knowledge Graph Completion, KGC) 任务旨在从不完整的三元组中推断出缺失的实体。现有的基于嵌入的方法仅依赖于知识图谱 (KG) 中的三元组,这使得它们容易受到虚假关系模式和长尾实体的影响。另一方面,基于文本的方法则在处理知识图谱三元组与自然语言之间的语义鸿沟时遇到困难。除了三元组,实体上下文(例如标签、描述、别名)在增强知识图谱方面也起着重要作用。为了解决这些局限性,我们提出了 KGR3,这是一个用于知识图谱补全的上下文增强框架。KGR3 由三个模块组成。首先,检索模块从知识图谱中收集支持性三元组,从基础嵌入模型中收集可能的候选答案,并为每个相关实体检索上下文。接着,推理模块采用大语言模型 (LLM) 为每个查询三元组生成潜在答案。最后,重排序模块结合上述两个模块的候选答案,并微调大语言模型以提供最佳答案。在广泛使用的数据集上的大量实验表明,KGR3 持续改进了各种知识图谱补全方法。具体而言,KGR3 的最佳变体在 FB15k237 和 WN18RR 数据集上分别实现了 12.3% 和 5.6% 的绝对 Hits@1 提升。
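
为便于理解三个模块的衔接方式,下面给出一个骨架式示意(检索 → 推理 → 重排序);其中 kg 的数据结构、embed_model、llm_generate、llm_rerank 均为假设的占位接口,并非论文发布的实现:

```python
def kgc_retrieve(query_triple, kg, embed_model, k=10):
    """检索模块:收集支持三元组、基础嵌入模型的候选答案与实体上下文。"""
    head, relation, _ = query_triple
    support = [t for t in kg["triples"] if head in (t[0], t[2])]
    candidates = embed_model(query_triple, k)                 # 嵌入模型给出的候选实体
    contexts = {e: kg["contexts"].get(e, "") for e in candidates}
    return support, candidates, contexts

def kgc_answer(query_triple, kg, embed_model, llm_generate, llm_rerank):
    support, candidates, contexts = kgc_retrieve(query_triple, kg, embed_model)
    # 推理模块:LLM 基于支持三元组与上下文直接生成潜在答案
    generated = llm_generate(query_triple, support, contexts)
    # 重排序模块:合并两路候选(去重保序),由(微调后的)LLM 选出最佳答案
    merged = list(dict.fromkeys(list(candidates) + list(generated)))
    return llm_rerank(query_triple, merged, support, contexts)
```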

[NLP-33] Large Language Models Can Self-Improve in Long-context Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在长上下文推理中的局限性问题。现有方法通常依赖于人工专家或高级模型(如GPT-4)生成的合成数据进行微调,这限制了进一步的进展。论文提出的解决方案的关键是自我改进(self-improvement),即让LLMs在长上下文推理中自我提升。具体而言,该方法通过采样多个输出、使用最小贝叶斯风险(Minimum Bayes Risk)评分,并基于这些评分进行监督微调或偏好优化。实验结果表明,该方法在多个领先的LLMs上显著提升了性能,例如Llama-3.1-8B-Instruct的绝对提升达到4.2分,并且优于依赖专家或高级模型生成数据的传统方法。

链接: https://arxiv.org/abs/2411.08147
作者: Siheng Li,Cheng Yang,Zesen Cheng,Lemao Liu,Mo Yu,Yujiu Yang,Wai Lam
关键词-EN: Large language models, achieved substantial progress, processing long contexts, Large language, achieved substantial
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of this approach, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, it achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
摘要:大语言模型(LLMs)在处理长上下文方面取得了显著进展,但在长上下文推理方面仍面临挑战。现有方法通常通过使用合成数据对 LLMs 进行微调,这些数据依赖于人类专家或高级模型(如 GPT-4)的标注,从而限制了进一步的发展。为了解决这一问题,我们研究了 LLMs 在长上下文推理中自我改进的潜力,并提出了一种专门为此目的设计的方法。该方法简单直接:我们为每个问题采样多个输出,使用最小贝叶斯风险(Minimum Bayes Risk)对其进行评分,然后基于这些输出进行监督微调或偏好优化。在多个领先 LLMs 上的广泛实验表明了该方法的有效性,Llama-3.1-8B-Instruct 的绝对提升达到了 4.2 分。此外,与依赖人类专家或高级模型生成的数据的前期方法相比,该方法表现更为优异。我们预计这项工作将为长上下文场景中的自我改进技术开辟新的途径,这对于 LLMs 的不断进步至关重要。
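
其中“最小贝叶斯风险”打分的常见做法,是以每个候选输出与其余候选的平均相似度作为分数,取分数最高者作为(伪)标签再做微调或偏好优化。下面用简单的词重叠 F1 作为相似度给出示意(论文实际使用的相似度度量可能不同):

```python
def similarity(a: str, b: str) -> float:
    """简单的词重叠 F1,仅作为相似度度量的占位实现。"""
    ta, tb = a.split(), b.split()
    common = len(set(ta) & set(tb))
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def mbr_select(candidates):
    """最小贝叶斯风险选择:返回与其余候选平均相似度最高的输出。"""
    best, best_score = None, float("-inf")
    for i, c in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        score = sum(similarity(c, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = c, score
    return best

samples = ["the answer is 42", "answer is 42", "it is 7"]
print(mbr_select(samples))   # 选出与其他候选最一致的输出
```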

[NLP-34] On the Role of Speech Data in Reducing Toxicity Detection Bias

【速读】: 该论文试图解决文本毒性检测系统中存在的显著偏见问题,特别是在涉及提及人口统计群体的样本时,这些系统会产生不成比例的高误报率。解决方案的关键在于通过比较基于语音和基于文本的毒性分类器,研究语音数据在推理过程中如何减少对群体提及的偏见。研究发现,利用语音数据可以显著减少对模糊和引发争议样本的偏见,并且改进分类器比改进转录流程更能有效减少群体偏见。此外,论文还公开发布了相关注释,并提出了未来毒性数据集构建的建议。

链接: https://arxiv.org/abs/2411.08135
作者: Samuel J. Bell,Mariano Coria Meglioli,Megan Richards,Eduardo Sánchez,Christophe Ropers,Skyler Wang,Adina Williams,Levent Sagun,Marta R. Costa-jussà
关键词-EN: producing disproportionate rates, Text toxicity detection, exhibit significant biases, systems exhibit significant, Text toxicity
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-based biases are mitigated by speech-based systems, we produce a set of high-quality group annotations for the multilingual MuTox dataset, and then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions, particularly for ambiguous and disagreement-inducing samples. Our results also suggest that improving classifiers, rather than transcription pipelines, is more helpful for reducing group bias. We publicly release our annotations and provide recommendations for future toxicity dataset construction.
摘要:文本毒性检测系统表现出显著的偏见,对提及人口统计群体的样本产生不成比例的高误报率。但语音中的毒性检测情况如何呢?为了探究基于文本的偏见在基于语音的系统中得到缓解的程度,我们为多语言MuTox数据集生成了一组高质量的群体标注,并利用这些标注系统地比较了基于语音和基于文本的毒性分类器。我们的研究结果表明,在推理过程中访问语音数据有助于减少对群体提及的偏见,特别是在模糊和引发争议的样本中。我们的结果还表明,改进分类器而非转录管道,对于减少群体偏见更为有效。我们公开发布了我们的标注,并提供了未来毒性数据集构建的建议。

人工智能

[AI-0] 4D Gaussian Splatting in the Wild with Uncertainty-Aware Regularization NEURIPS2024

链接: https://arxiv.org/abs/2411.08879
作者: Mijeong Kim,Jongwoo Lim,Bohyung Han
关键词-EN: including augmented, virtual reality, augmented and virtual, Gaussian Splatting, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Novel view synthesis of dynamic scenes is becoming important in various applications, including augmented and virtual reality. We propose a novel 4D Gaussian Splatting (4DGS) algorithm for dynamic scenes from casually recorded monocular videos. To overcome the overfitting problem of existing work for these real-world videos, we introduce an uncertainty-aware regularization that identifies uncertain regions with few observations and selectively imposes additional priors based on diffusion models and depth smoothness on such regions. This approach improves both the performance of novel view synthesis and the quality of training image reconstruction. We also identify the initialization problem of 4DGS in fast-moving dynamic regions, where the Structure from Motion (SfM) algorithm fails to provide reliable 3D landmarks. To initialize Gaussian primitives in such regions, we present a dynamic region densification method using the estimated depth maps and scene flow. Our experiments show that the proposed method improves the performance of 4DGS reconstruction from a video captured by a handheld monocular camera and also exhibits promising results in few-shot static scene reconstruction.

[AI-1] A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos

链接: https://arxiv.org/abs/2411.08878
作者: Debidatta Dwibedi,Yusuf Aytar,Jonathan Tompson,Pierre Sermanet,Andrew Zisserman
关键词-EN: discuss some consistent, Abstract, consistent issues, URL, RepNet
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We discuss some consistent issues on how RepNet has been evaluated in various papers. As a way to mitigate these issues, we report RepNet performance results on different datasets, and release evaluation code and the RepNet checkpoint to obtain these results. Code URL: this https URL

[AI-2] Causal Explanations for Image Classifiers

链接: https://arxiv.org/abs/2411.08875
作者: Hana Chockler,David A. Kelly,Daniel Kroening,Youcheng Sun
关键词-EN: explaining the output, output of image, image classifiers, variety of techniques, techniques to extract
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing algorithms for explaining the output of image classifiers use different definitions of explanations and a variety of techniques to extract them. However, none of the existing tools use a principled approach based on formal definitions of causes and explanations for the explanation extraction. In this paper we present a novel black-box approach to computing explanations grounded in the theory of actual causality. We prove relevant theoretical results and present an algorithm for computing approximate explanations based on these definitions. We prove termination of our algorithm and discuss its complexity and the amount of approximation compared to the precise definition. We implemented the framework in a tool rex and we present experimental results and a comparison with state-of-the-art tools. We demonstrate that rex is the most efficient tool and produces the smallest explanations, in addition to outperforming other black-box tools on standard quality measures.

[AI-3] Offline Adaptation of Quadruped Locomotion using Diffusion Models

链接: https://arxiv.org/abs/2411.08832
作者: Reece O’Mahoney,Alexander L. Mitchell,Wanming Yu,Ingmar Posner,Ioannis Havoutis
关键词-EN: offline adapting, present a diffusion-based, simultaneously addresses, addresses the limitations, limitations of learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a diffusion-based approach to quadrupedal locomotion that simultaneously addresses the limitations of learning and interpolating between multiple skills (modes) and of offline adapting to new locomotion behaviours after training. This is the first framework to apply classifier-free guided diffusion to quadruped locomotion and demonstrate its efficacy by extracting goal-conditioned behaviour from an originally unlabelled dataset. We show that these capabilities are compatible with a multi-skill policy and can be applied with little modification and minimal compute overhead, i.e., running entirely on the robot’s onboard CPU. We verify the validity of our approach with hardware experiments on the ANYmal quadruped platform.

[AI-4] Process-aware Human Activity Recognition

链接: https://arxiv.org/abs/2411.08814
作者: Jiawei Zheng,Petros Papapanagiotou,Jacques D. Fleuriot,Jane Hillston
关键词-EN: Humans naturally follow, naturally follow distinct, follow distinct patterns, daily activities, daily routines
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Humans naturally follow distinct patterns when conducting their daily activities, which are driven by established practices and processes, such as production workflows, social norms and daily routines. Human activity recognition (HAR) algorithms usually use neural networks or machine learning techniques to analyse inherent relationships within the data. However, these approaches often overlook the contextual information in which the data are generated, potentially limiting their effectiveness. We propose a novel approach that incorporates process information from context to enhance the HAR performance. Specifically, we align probabilistic events generated by machine learning models with process models derived from contextual information. This alignment adaptively weighs these two sources of information to optimise HAR accuracy. Our experiments demonstrate that our approach achieves better accuracy and Macro F1-score compared to baseline models.

[AI-5] Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique NEURIPS2024

链接: https://arxiv.org/abs/2411.08813
作者: Suhas Hariharan,Zainab Ali Majid,Jaime Raldua Veuthey,Jacob Haimes
关键词-EN: cybersecurity evaluations space, CyberSecEval approach, cybersecurity evaluations, evaluations space, work carried
类目: Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024, 2 pages

点击查看摘要

Abstract:A key development in the cybersecurity evaluations space is the work carried out by Meta, through their CyberSecEval approach. While this work is undoubtedly a useful contribution to a nascent field, there are notable features that limit its utility. Key drawbacks focus on the insecure code detection part of Meta’s methodology. We explore these limitations, and use our exploration as a test case for LLM-assisted benchmark analysis.

[AI-6] Evaluating World Models with LLM for Decision Making

链接: https://arxiv.org/abs/2411.08794
作者: Chang Yang,Xinrun Wang,Junzhe Jiang,Qinggang Zhang,Xiao Huang
关键词-EN: Dreamer achieve remarkable, World model emerges, achieve remarkable successes, World model, Large Language Models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:World model emerges as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world models are either evaluated as a general world simulator, or as a functional module of the agent, i.e., predicting the transitions to assist the planning. In this work, we propose a comprehensive evaluation of the world models with LLMs from the decision making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023;2024) and curate the rule-based policy of each environment for the diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, where the world models can be used for decision making solely. Finally, we conduct the comprehensive evaluation of the advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on the environments for the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially for the tasks which require the domain knowledge, ii) the performance of the world model with LLM will be decreased for long-term decision-making tasks, and iii) the combination of different functionalities of the world model will brings additional unstabilities of the performance.

[AI-7] Sharingan: Extract User Action Sequence from Desktop Recordings

链接: https://arxiv.org/abs/2411.08768
作者: Yanting Chen,Yi Ren,Xiaoting Qin,Jue Zhang,Kehong Yuan,Lu Han,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan,Qi Zhang
关键词-EN: understanding user behaviors, desktop recordings, offer a rich, automating processes, rich source
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being re-playable though Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.

[AI-8] SANDWICH: Towards an Offline Differentiable Fully-Trainable Wireless Neural Ray-Tracing Surrogate ICASSP2025

链接: https://arxiv.org/abs/2411.08767
作者: Yifei Jin,Ali Maatouk,Sarunas Girdzijauskas,Shugong Xu,Leandros Tassiulas,Rex Ying
关键词-EN: wireless channel modeling, tool for three-dimensional, driven by advances, graphical rendering, key tool
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: Submitted in ICASSP 2025

点击查看摘要

Abstract:Wireless ray-tracing (RT) is emerging as a key tool for three-dimensional (3D) wireless channel modeling, driven by advances in graphical rendering. Current approaches struggle to accurately model beyond 5G (B5G) network signaling, which often operates at higher frequencies and is more susceptible to environmental conditions and changes. Existing online learning solutions require real-time environmental supervision during training, which is both costly and incompatible with GPU-based processing. In response, we propose a novel approach that redefines ray trajectory generation as a sequential decision-making problem, leveraging generative models to jointly learn the optical, physical, and signal properties within each designated environment. Our work introduces the Scene-Aware Neural Decision Wireless Channel Raytracing Hierarchy (SANDWICH), an innovative offline, fully differentiable approach that can be trained entirely on GPUs. SANDWICH offers superior performance compared to existing online learning methods, outperforms the baseline by 4e^-2 radian in RT accuracy, and only fades 0.5 dB away from toplined channel gain estimation.

[AI-9] Flow reconstruction in time-varying geometries using graph neural networks

链接: https://arxiv.org/abs/2411.08764
作者: Bogdan A. Danciu,Vito A. Pagone,Benjamin Böhm,Marius Schmidt,Christos E. Frouzakis
关键词-EN: Graph Attention Convolutional, Attention Convolutional Network, Graph Attention, presents a Graph, Attention Convolutional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:The paper presents a Graph Attention Convolutional Network (GACN) for flow reconstruction from very sparse data in time-varying geometries. The model incorporates a feature propagation algorithm as a preprocessing step to handle extremely sparse inputs, leveraging information from neighboring nodes to initialize missing features. In addition, a binary indicator is introduced as a validity mask to distinguish between the original and propagated data points, enabling more effective learning from sparse inputs. Trained on a unique data set of Direct Numerical Simulations (DNS) of a motored engine at a technically relevant operating condition, the GACN shows robust performance across different resolutions and domain sizes and can effectively handle unstructured data and variable input sizes. The model is tested on previously unseen DNS data as well as on an experimental data set from Particle Image Velocimetry (PIV) measurements that were not considered during training. A comparative analysis shows that the GACN consistently outperforms both a conventional Convolutional Neural Network (CNN) and cubic interpolation methods on the DNS and PIV test sets by achieving lower reconstruction errors and better capturing fine-scale turbulent structures. In particular, the GACN effectively reconstructs flow fields from domains up to 14 times larger than those observed during training, with the performance advantage increasing for larger domains.
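
“特征传播预处理 + 有效性掩码”的思路可以用下面这个极简示意来理解(numpy 实现的玩具版本,并非论文代码):在小图上迭代地用邻居均值填补缺失节点特征,并返回区分原始/传播数据点的二值指示:

```python
import numpy as np

def propagate_features(adj, x, known_mask, n_iters=20):
    """在图上迭代传播特征以填补缺失值(示意实现)。

    adj: (N, N) 邻接矩阵;x: (N, F) 特征,缺失处取值任意;
    known_mask: (N,) 布尔数组,True 表示原始观测节点。
    返回填补后的特征与有效性指示(1=原始, 0=传播得到)。
    """
    x = x.astype(float).copy()
    x[~known_mask] = 0.0
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(n_iters):
        x_new = adj @ x / deg               # 邻居均值
        x_new[known_mask] = x[known_mask]   # 已知节点保持不变
        x = x_new
    validity = known_mask.astype(float)[:, None]
    return x, validity

# 4 节点链式图,下标为 2 的节点缺失特征
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [0.0], [4.0]])
known = np.array([True, True, False, True])
filled, validity = propagate_features(adj, x, known)
print(filled.ravel(), validity.ravel())     # 缺失节点被填为其邻居均值 3.0
```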

[AI-10] Polymetis: Large Language Modeling for Multiple Material Domains

链接: https://arxiv.org/abs/2411.08728
作者: Chao Huang,Huichen Xiao,Chen Chen,Chunyan Chen,Yi Zhao,Shiyu Du,Yiming Zhang,He Sha,Ruixin Gu
关键词-EN: materials science research, materials science, materials, language model Polymetis, large language models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the application of large language models in various fields continues to expand, materials science also ushers in opportunities for AI-driven innovation. The traditional way of relying on manual search for materials science-related information is now using artificial intelligence technology as an auxiliary tool to improve the efficiency of materials science research. To accelerate researchers’ knowledge acquisition and intelligent decision-making support in materials science research, this paper proposes a large language model Polymetis model for a variety of materials fields, aiming to provide highly professional knowledge answers in the field of materials, covering energy materials, functional materials, alloy materials, physical chemistry, biology, and other material directions. The model uses a dataset of about 2 million material knowledge instructions, and in the process of building the dataset, we developed the Intelligent Extraction Large Model (IELM), which is specially used to extract and form structured knowledge from scientific texts, avoiding a large number of costs that need to be manually annotated, and improving efficiency. We inject this data into the GLM4-9B model for learning to enhance its inference capabilities in a variety of material domains. In addition, we have introduced enhanced prompt strategies to ensure that the answers to the model are more organized and comprehensive, providing efficient and comprehensive intelligent support for the diverse needs of materials science exploration, and promoting the development of material science.

[AI-11] Searching Latent Program Spaces

链接: https://arxiv.org/abs/2411.08706
作者: Clément Bonnet,Matthew V Macfarlane
关键词-EN: generate programs restricted, automatically generate programs, synthesis methods aim, input-output pairs, aim to automatically
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Code available at this https URL

点击查看摘要

Abstract:Program synthesis methods aim to automatically generate programs restricted to a language that can explain a given specification of input-output pairs. While purely symbolic approaches suffer from a combinatorial search space, recent methods leverage neural networks to learn distributions over program structures to narrow this search space significantly, enabling more efficient search. However, for challenging problems, it remains difficult to train models to perform program synthesis in one shot, making test-time search essential. Most neural methods lack structured search mechanisms during inference, relying instead on stochastic sampling or gradient updates, which can be inefficient. In this work, we propose the Latent Program Network (LPN), a general algorithm for program induction that learns a distribution over latent programs in a continuous space, enabling efficient search and test-time adaptation. We explore how to train these networks to optimize for test-time computation and demonstrate the use of gradient-based search both during training and at test time. We evaluate LPN on ARC-AGI, a program synthesis benchmark that evaluates performance by generalizing programs to new inputs rather than explaining the underlying specification. We show that LPN can generalize beyond its training distribution and adapt to unseen tasks by utilizing test-time computation, outperforming algorithms without test-time adaptation mechanisms.
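
下面用 PyTorch 给出“在连续隐空间中做基于梯度的测试时搜索”这一思想的玩具示意(解码器被简化为线性函数,与论文的 LPN 架构无关):针对给定的输入-输出样例优化隐向量 z,再用找到的 z 泛化到新输入:

```python
import torch

torch.manual_seed(0)

# 玩具“解码器”:隐向量 z 参数化一个线性程序 y = z[0] * x + z[1]
def decode(z, x):
    return z[0] * x + z[1]

# 测试时给定的输入-输出样例(真实规则为 y = 3x + 2)
xs = torch.tensor([1.0, 2.0, 3.0])
ys = torch.tensor([5.0, 8.0, 11.0])

z = torch.zeros(2, requires_grad=True)        # 隐向量初始化
opt = torch.optim.Adam([z], lr=0.1)
for step in range(500):                       # 测试时的梯度搜索
    loss = torch.mean((decode(z, xs) - ys) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(z.detach())                              # 应接近 [3, 2]
print(decode(z, torch.tensor(10.0)).item())    # 用找到的“程序”泛化到新输入,约为 32
```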

[AI-12] MVKTrans: Multi-View Knowledge Transfer for Robust Multiomics Classification

链接: https://arxiv.org/abs/2411.08703
作者: Shan Cong,Zhiling Sang,Hongwei Liu,Haoran Luo,Xin Wang,Hong Liang,Jie Hao,Xiaohui Yao
关键词-EN: including complex interactions, address unique challenges, multiomics prediction, including complex, clinical symptoms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The distinct characteristics of multiomics data, including complex interactions within and across biological layers and disease heterogeneity (e.g., heterogeneity in etiology and clinical symptoms), drive us to develop novel designs to address unique challenges in multiomics prediction. In this paper, we propose the multi-view knowledge transfer learning (MVKTrans) framework, which transfers intra- and inter-omics knowledge in an adaptive manner by reviewing data heterogeneity and suppressing bias transfer, thereby enhancing classification performance. Specifically, we design a graph contrastive module that is trained on unlabeled data to effectively learn and transfer the underlying intra-omics patterns to the supervised task. This unsupervised pretraining promotes learning general and unbiased representations for each modality, regardless of the downstream tasks. In light of the varying discriminative capacities of modalities across different diseases and/or samples, we introduce an adaptive and bi-directional cross-omics distillation module. This module automatically identifies richer modalities and facilitates dynamic knowledge transfer from more informative to less informative omics, thereby enabling a more robust and generalized integration. Extensive experiments on four real biomedical datasets demonstrate the superior performance and robustness of MVKTrans compared to the state-of-the-art. Code and data are available at this https URL.

[AI-13] TRACE: Transformer-based Risk Assessment for Clinical Evaluation

链接: https://arxiv.org/abs/2411.08701
作者: Dionysis Christopoulos,Sotiris Spanos,Valsamis Ntouskos,Konstantinos Karantzalos
关键词-EN: present TRACE, enhanced feature interaction, clinical risk assessment, Clinical Evaluation, Transformer-based Risk Assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present TRACE (Transformer-based Risk Assessment for Clinical Evaluation), a novel method for clinical risk assessment based on clinical data, leveraging the self-attention mechanism for enhanced feature interaction and result interpretation. Our approach is able to handle different data modalities, including continuous, categorical and multiple-choice (checkbox) attributes. The proposed architecture features a shared representation of the clinical data obtained by integrating specialized embeddings of each data modality, enabling the detection of high-risk individuals using Transformer encoder layers. To assess the effectiveness of the proposed method, a strong baseline based on non-negative multi-layer perceptrons (MLPs) is introduced. The proposed method outperforms various baselines widely used in the domain of clinical risk assessment, while effectively handling missing values. In terms of explainability, our Transformer-based method offers easily interpretable results via attention weights, further enhancing the clinicians’ decision-making process.
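
下面用 PyTorch 搭一个同类结构的极简示意(并非论文的 TRACE 实现,维度与超参均为随意取值):连续特征经线性映射、类别特征经查表嵌入、多选(勾选框)特征经线性映射到同一维度,拼成 token 序列后送入 Transformer 编码器,再输出风险概率:

```python
import torch
import torch.nn as nn

class TabularTransformer(nn.Module):
    def __init__(self, n_cont, cat_cardinalities, n_checkbox, d_model=32):
        super().__init__()
        self.cont_proj = nn.ModuleList([nn.Linear(1, d_model) for _ in range(n_cont)])
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d_model) for c in cat_cardinalities])
        self.chk_proj = nn.Linear(n_checkbox, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x_cont, x_cat, x_chk):
        tokens = [proj(x_cont[:, i:i + 1]) for i, proj in enumerate(self.cont_proj)]
        tokens += [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_emb)]
        tokens += [self.chk_proj(x_chk)]
        h = self.encoder(torch.stack(tokens, dim=1))    # (batch, n_tokens, d_model)
        return torch.sigmoid(self.head(h.mean(dim=1)))  # 池化后输出风险概率

model = TabularTransformer(n_cont=2, cat_cardinalities=[3, 5], n_checkbox=4)
prob = model(torch.randn(8, 2), torch.randint(0, 3, (8, 2)), torch.rand(8, 4))
print(prob.shape)   # torch.Size([8, 1])
```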

[AI-14] Rethinking negative sampling in content-based news recommendation

链接: https://arxiv.org/abs/2411.08700
作者: Miguel Ângelo Rebelo,João Vinagre,Ivo Pereira,Álvaro Figueira
关键词-EN: rapid relevance decay, undergo rapid relevance, lifespan of articles, relevance decay, undergo rapid
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:News recommender systems are hindered by the brief lifespan of articles, as they undergo rapid relevance decay. Recent studies have demonstrated the potential of content-based neural techniques in tackling this problem. However, these models often involve complex neural architectures and often lack consideration for negative examples. In this study, we posit that the careful sampling of negative examples has a big impact on the model’s outcome. We devise a negative sampling technique that not only improves the accuracy of the model but also facilitates the decentralization of the recommendation system. The experimental results obtained using the MIND dataset demonstrate that the accuracy of the method under consideration can compete with that of State-of-the-Art models. The utilization of the sampling technique is essential in reducing model complexity and accelerating the training process, while maintaining a high level of accuracy. Finally, we discuss how decentralized models can help improve privacy and scalability.
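
下面给出一个与论文无关的通用示意,说明新闻推荐中最常见的一种负例采样:从同一次曝光(impression)中未被点击的新闻里随机抽取 K 个负例;数据字段名均为假设:

```python
import random

def sample_negatives(impression, k=4, rng=random):
    """为每个正例(被点击新闻)从同一次曝光中采样 k 个未点击新闻作为负例。

    impression: {"clicked": [...], "shown": [...]},字段名为示意用的假设。
    """
    clicked = set(impression["clicked"])
    pool = [n for n in impression["shown"] if n not in clicked]
    samples = []
    for pos in impression["clicked"]:
        negs = rng.sample(pool, min(k, len(pool)))
        samples.append({"positive": pos, "negatives": negs})
    return samples

imp = {"clicked": ["n3"], "shown": ["n1", "n2", "n3", "n4", "n5", "n6"]}
print(sample_negatives(imp, k=2))
```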

[AI-15] Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs

链接: https://arxiv.org/abs/2411.08696
作者: Nandana Mihindukulasooriya,Sanju Tiwari,Daniil Dobriy,Finn Årup Nielsen,Tek Raj Chhetri,Axel Polleres
关键词-EN: respective Knowledge Graphs, Knowledge Graphs, create respective Knowledge, respective Knowledge, Semantic Web-related conferences
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 17 pages, accepted at EKAW-24

点击查看摘要

Abstract:Several initiatives have been undertaken to conceptually model the domain of scholarly data using ontologies and to create respective Knowledge Graphs. Yet, the full potential seems unleashed, as automated means for automatic population of said ontologies are lacking, and respective initiatives from the Semantic Web community are not necessarily connected: we propose to make scholarly data more sustainably accessible by leveraging Wikidata’s infrastructure and automating its population in a sustainable manner through LLMs by tapping into unstructured sources like conference Web sites and proceedings texts as well as already existing structured conference datasets. While an initial analysis shows that Semantic Web conferences are only minimally represented in Wikidata, we argue that our methodology can help to populate, evolve and maintain scholarly data as a community within Wikidata. Our main contributions include (a) an analysis of ontologies for representing scholarly data to identify gaps and relevant entities/properties in Wikidata, (b) semi-automated extraction – requiring (minimal) manual validation – of conference metadata (e.g., acceptance rates, organizer roles, programme committee members, best paper awards, keynotes, and sponsors) from websites and proceedings texts using LLMs. Finally, we discuss (c) extensions to visualization tools in the Wikidata context for data exploration of the generated scholarly data. Our study focuses on data from 105 Semantic Web-related conferences and extends/adds more than 6000 entities in Wikidata. It is important to note that the method can be more generally applicable beyond Semantic Web-related conferences for enhancing Wikidata’s utility as a comprehensive scholarly resource. Source Repository: this https URL DOI: this https URL License: Creative Commons CC0 (Data), MIT (Code)

[AI-16] Analogical Reasoning Within a Conceptual Hyperspace IJCAI2024

链接: https://arxiv.org/abs/2411.08684
作者: Howard Goldowsky,Vasanth Sarathy
关键词-EN: Conceptual Spaces Theory, Conceptual Spaces, complex-sampled hyperdimensional computing, Spaces Theory, neuro-symbolic computational power
类目: Artificial Intelligence (cs.AI)
*备注: Analogy-angle workshop full paper at IJCAI 2024

点击查看摘要

Abstract:We propose an approach to analogical inference that marries the neuro-symbolic computational power of complex-sampled hyperdimensional computing (HDC) with Conceptual Spaces Theory (CST), a promising theory of semantic meaning. CST sketches, at an abstract level, approaches to analogical inference that go beyond the standard predicate-based structure mapping theories. But it does not describe how such an approach can be operationalized. We propose a concrete HDC-based architecture that computes several types of analogy classified by CST. We present preliminary proof-of-concept experimental results within a toy domain and describe how it can perform category-based and property-based analogical reasoning.

[AI-17] A Survey on Vision Autoregressive Model

链接: https://arxiv.org/abs/2411.08666
作者: Kai Jiang,Jiaxing Huang
关键词-EN: natural language processing, demonstrated great performance, Autoregressive models, vision autoregressive models, adaptability and generalizability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in NLP field, autoregressive models have been intensively investigated recently for computer vision, which perform next-token predictions by representing visual data as visual tokens and enables autoregressive modelling for a wide range of vision tasks, ranging from visual generation and visual understanding to the very recent multimodal generation that unifies visual generation and understanding with a single autoregressive model. This paper provides a systematic review of vision autoregressive models, including the development of a taxonomy of existing methods and highlighting their major contributions, strengths, and limitations, covering various vision tasks such as image generation, video generation, image editing, motion generation, medical image analysis, 3D generation, robotic manipulation, unified multimodal generation, etc. Besides, we investigate and analyze the latest advancements in autoregressive models, including thorough benchmarking and discussion of existing methods across various evaluation datasets. Finally, we outline key challenges and promising directions for future research, offering a roadmap to guide further advancements in vision autoregressive models.

[AI-18] Estimating unknown parameters in differential equations with a reinforcement learning based PSO method

链接: https://arxiv.org/abs/2411.08651
作者: Wenkui Sun,Xiaoya Fan,Lijuan Jia,Tinyi Chu,Shing-Tung Yau,Rongling Wu,Zhong Wang
关键词-EN: numerous scientific fields, Differential equations, complex dynamic systems, Differential, Differential equations offer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Differential equations offer a foundational yet powerful framework for modeling interactions within complex dynamic systems and are widely applied across numerous scientific fields. One common challenge in this area is estimating the unknown parameters of these dynamic relationships. However, traditional numerical optimization methods rely on the selection of initial parameter values, making them prone to local optima. Meanwhile, deep learning and Bayesian methods require training models on specific differential equations, resulting in poor versatility. This paper reformulates the parameter estimation problem of differential equations as an optimization problem by introducing the concept of particles from the particle swarm optimization algorithm. Building on reinforcement learning-based particle swarm optimization (RLLPSO), this paper proposes a novel method, DERLPSO, for estimating unknown parameters of differential equations. We compared its performance on three typical ordinary differential equations with the state-of-the-art methods, including the RLLPSO algorithm, traditional numerical methods, deep learning approaches, and Bayesian methods. The experimental results demonstrate that our DERLPSO consistently outperforms other methods in terms of performance, achieving an average Mean Square Error of 1.13e-05, which reduces the error by approximately 4 orders of magnitude compared to other methods. Apart from ordinary differential equations, our DERLPSO also show great promise for estimating unknown parameters of partial differential equations. The DERLPSO method proposed in this paper has high accuracy, is independent of initial parameter values, and possesses strong versatility and stability. This work provides new insights into unknown parameter estimation for differential equations.
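
作为“把微分方程参数估计转化为优化问题并用粒子群求解”这一思路的入门示意(仅为基础 PSO,不包含论文中的强化学习改进,参数与数据均为编造),下面拟合逻辑斯蒂增长方程 dy/dt = r·y·(1 − y/K) 的参数 r 与 K:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
t_obs = np.linspace(0, 10, 21)
true_r, true_K, y0 = 0.8, 10.0, 0.5

def simulate(r, K):
    sol = solve_ivp(lambda t, y: r * y * (1 - y / K), (0, 10), [y0], t_eval=t_obs)
    return sol.y[0]

y_obs = simulate(true_r, true_K)              # 用真实参数生成“观测”数据

def mse(params):                              # 目标函数:模拟轨迹与观测的均方误差
    r, K = params
    return np.mean((simulate(r, K) - y_obs) ** 2)

# 基础粒子群优化(无强化学习控制)
n_particles, n_iters, w, c1, c2 = 20, 60, 0.7, 1.5, 1.5
lo, hi = np.array([0.1, 1.0]), np.array([2.0, 20.0])
pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([mse(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([mse(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("估计的 (r, K):", gbest, " MSE:", pbest_val.min())
```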

[AI-19] A System Level Performance Evaluation for Superconducting Digital Systems

链接: https://arxiv.org/abs/2411.08645
作者: Joyjit Kundu,Debjyoti Bhattacharjee,Nathan Josephsen,Ankit Pokhrel,Udara De Silva,Wenzhe Guo,Steven Van Winckel,Steven Brebels,Manu Perumkunnil,Quentin Herr,Anna Herr
关键词-EN: Superconducting Digital, offers significant potential, generation large scale, scale compute workloads, technology offers significant
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 8 figures

点击查看摘要

Abstract:Superconducting Digital (SCD) technology offers significant potential for enhancing the performance of next generation large scale compute workloads. By leveraging advanced lithography and a 300 mm platform, SCD devices can reduce energy consumption and boost computational power. This paper presents a cross-layer modeling approach to evaluate the system-level performance benefits of SCD architectures for Large Language Model (LLM) training and inference. Our findings, based on experimental data and Pulse Conserving Logic (PCL) design principles, demonstrate substantial performance gain in both training and inference. We are, thus, able to convincingly show that the SCD technology can address memory and interconnect limitations of present day solutions for next-generation compute systems.

[AI-20] Towards More Accurate Fake Detection on Images Generated from Advanced Generative and Neural Rendering Models

链接: https://arxiv.org/abs/2411.08642
作者: Chengdong Dong,Vijayakumar Bhagavatula,Zhenyu Zhou,Ajay Kumar
关键词-EN: Neural Radiance Fields, visual data generation, Radiance Fields, Gaussian splatting, Neural Radiance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 8 Figures

点击查看摘要

Abstract:The remarkable progress in neural-network-driven visual data generation, especially with neural rendering techniques like Neural Radiance Fields and 3D Gaussian splatting, offers a powerful alternative to GANs and diffusion models. These methods can produce high-fidelity images and lifelike avatars, highlighting the need for robust detection methods. In response, an unsupervised training technique is proposed that enables the model to extract comprehensive features from the Fourier spectrum magnitude, thereby overcoming the challenges of reconstructing the spectrum due to its centrosymmetric properties. By leveraging the spectral domain and dynamically combining it with spatial domain information, we create a robust multimodal detector that demonstrates superior generalization capabilities in identifying challenging synthetic images generated by the latest image synthesis techniques. To address the absence of a 3D neural rendering-based fake image database, we develop a comprehensive database that includes images generated by diverse neural rendering techniques, providing a robust foundation for evaluating and advancing detection methods.

[AI-21] DipMe: Haptic Recognition of Granular Media for Tangible Interactive Applications

链接: https://arxiv.org/abs/2411.08641
作者: Xinkai Wang,Shuo Zhang,Ziyi Zhao,Lifeng Zhu,Aiguo Song
关键词-EN: granular materials, granular, shown its power, power in naturally, naturally interacting
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:While tangible user interfaces have shown their power in naturally interacting with rigid or soft objects, users cannot conveniently use different types of granular materials as the interaction media. We introduce DipMe as a smart device to recognize the types of granular media in real time, which can be used to connect the granular materials in the physical world with various virtual content. Unlike vision-based solutions, we propose a dip operation of our device and exploit the haptic signals to recognize different types of granular materials. With modern machine learning tools, we find the haptic signals from different granular media are distinguishable by DipMe. With the online granular object recognition, we build several tangible interactive applications, demonstrating the effects of DipMe in perceiving granular materials and its potential in developing a tangible user interface with granular objects as the new media.

[AI-22] Precision-Focused Reinforcement Learning Model for Robotic Object Pushing

链接: https://arxiv.org/abs/2411.08622
作者: Lara Bergmann,David Leins,Robert Haschke,Klaus Neumann
关键词-EN: Non-prehensile manipulation, everyday situations, desired target position, important skill, assist humans
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Non-prehensile manipulation, such as pushing objects to a desired target position, is an important skill for robots to assist humans in everyday situations. However, the task is challenging due to the large variety of objects with different and sometimes unknown physical properties, such as shape, size, mass, and friction. This can lead to the object overshooting its target position, requiring fast corrective movements of the robot around the object, especially in cases where objects need to be precisely pushed. In this paper, we improve the state-of-the-art by introducing a new memory-based vision-proprioception RL model to push objects more precisely to target positions using fewer corrective movements.

[AI-23] Lo-MARVE: A Low Cost Autonomous Underwater Vehicle for Marine Exploration

链接: https://arxiv.org/abs/2411.08605
作者: Karl Mason,Daniel Kelly
关键词-EN: Robotic Vehicle Explorer, Marine Autonomous Robotic, Autonomous Robotic Vehicle, Low-cost Marine Autonomous, autonomous underwater vehicle
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: This paper was presented at the 12th International Conference on Control, Mechatronics and Automation (ICCMA 2024), held in London, UK, from November 11-13, 2024

点击查看摘要

Abstract:This paper presents Low-cost Marine Autonomous Robotic Vehicle Explorer (Lo-MARVE), a novel autonomous underwater vehicle (AUV) designed to provide a low cost solution for underwater exploration and environmental monitoring in shallow water environments. Lo-MARVE offers a cost-effective alternative to existing AUVs, featuring a modular design, low-cost sensors, and wireless communication capabilities. The total cost of Lo-MARVE is approximately EUR 500. Lo-MARVE is developed using the Raspberry Pi 4B microprocessor, with control software written in Python. The proposed AUV was validated through field testing outside of a laboratory setting, in the freshwater environment of the River Corrib in Galway, Ireland. This demonstrates its ability to navigate autonomously, collect data, and communicate effectively outside of a controlled laboratory setting. The successful deployment of Lo-MARVE in a real-world environment validates its proof of concept.

[AI-24] DeepUQ: Assessing the Aleatoric Uncertainties from two Deep Learning Methods NEURIPS2024

链接: https://arxiv.org/abs/2411.08587
作者: Rebecca Nevin,Aleksandra Ćiprijanović,Brian D. Nord
关键词-EN: Deep Evidential Regression, Assessing the quality, deep learning methods, Deep Ensembles, uncertainty
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the Machine Learning for Physical Sciences workshop at NeurIPS 2024; 11 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Assessing the quality of aleatoric uncertainty estimates from uncertainty quantification (UQ) deep learning methods is important in scientific contexts, where uncertainty is physically meaningful and important to characterize and interpret exactly. We systematically compare aleatoric uncertainty measured by two UQ techniques, Deep Ensembles (DE) and Deep Evidential Regression (DER). Our method focuses on both zero-dimensional (0D) and two-dimensional (2D) data, to explore how the UQ methods function for different data dimensionalities. We investigate uncertainty injected on the input and output variables and include a method to propagate uncertainty in the case of input uncertainty so that we can compare the predicted aleatoric uncertainty to the known values. We experiment with three levels of noise. The aleatoric uncertainty predicted across all models and experiments scales with the injected noise level. However, the predicted uncertainty $\mathrm{std}(\sigma_{\mathrm{al}})$ is miscalibrated with respect to the true uncertainty for half of the DE experiments and almost all of the DER experiments. The predicted uncertainty is the least accurate for both UQ methods for the 2D input uncertainty experiment and the high-noise level. While these results do not apply to more complex data, they highlight that further research on post-facto calibration for these methods would be beneficial, particularly for high-noise and high-dimensional settings.
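
As a hedged illustration of what measuring aleatoric uncertainty with Deep Ensembles involves (this is not the paper's models, data, or propagation method), the sketch below trains a small ensemble on noisy 0D data; each member predicts a mean and log-variance under a Gaussian negative log-likelihood, and the ensemble's mean predicted variance serves as the aleatoric estimate.

```python
# Toy Deep-Ensemble aleatoric uncertainty on 0D data (illustrative assumptions only).
import torch
import torch.nn as nn

x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = x ** 3 + 0.05 * torch.randn_like(x)        # injected aleatoric noise of known scale

def make_member():
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))

ensemble = [make_member() for _ in range(5)]
for net in ensemble:
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        out = net(x)
        mu, log_var = out[:, :1], out[:, 1:]
        nll = 0.5 * log_var + 0.5 * (y - mu) ** 2 / log_var.exp()   # Gaussian NLL
        opt.zero_grad(); nll.mean().backward(); opt.step()

with torch.no_grad():
    outs = torch.stack([net(x) for net in ensemble])     # (members, N, 2)
    sigma_al = outs[..., 1].exp().mean(dim=0).sqrt()     # predicted aleatoric std
print("mean predicted aleatoric std:", sigma_al.mean().item(), "(injected: 0.05)")
```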

[AI-25] Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method

链接: https://arxiv.org/abs/2411.08586
作者: Guoqing Zhang,Keita Fukuyama,Kazumasa Kishimoto,Tomohiro Kuroda
关键词-EN: Summarizing patient clinical, patient clinical notes, Summarizing patient, reducing documentation burdens, documentation burdens
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Summarizing patient clinical notes is vital for reducing documentation burdens, yet manual summarization remains a struggle for medical staff. We propose an automatic method using LLMs, but long inputs cause LLMs to lose context, reducing output quality, especially in small models. We used a 7B model, open-calm-7b, enhanced with Native Bayes Context Extend and a redesigned decoding mechanism that references one sentence at a time, keeping inputs within the 2048-token context window. Our improved model achieved near parity with Google's Gemini (over 175B parameters) on ROUGE-L metrics over 200 samples, indicating strong performance with far fewer resources and enhancing the feasibility of automated EMR summarization.

[AI-26] An Empirical Examination of the Evaluative AI Framework

链接: https://arxiv.org/abs/2411.08583
作者: Jaroslaw Kornowicz
关键词-EN: study empirically examines, study empirically, empirically examines, aims to enhance, recommendation-based approach
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study empirically examines the “Evaluative AI” framework, which aims to enhance the decision-making process for AI users by transitioning from a recommendation-based approach to a hypothesis-driven one. Rather than offering direct recommendations, this framework presents users pro and con evidence for hypotheses to support more informed decisions. However, findings from the current behavioral experiment reveal no significant improvement in decision-making performance and limited user engagement with the evidence provided, resulting in cognitive processes similar to those observed in traditional AI systems. Despite these results, the framework still holds promise for further exploration in future research.

[AI-27] Leveraging LLMs for Predictive Insights in Food Policy and Behavioral Interventions

链接: https://arxiv.org/abs/2411.08563
作者: Micha Kaiser,Paul Lohmann,Peter Ochieng,Billy Shi,Cass R. Sunstein,Lucia A. Reisch
关键词-EN: greenhouse gas emissions, global greenhouse gas, mitigating climate change, production contribute significantly, crucial entry points
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Food consumption and production contribute significantly to global greenhouse gas emissions, making them crucial entry points for mitigating climate change and maintaining a liveable planet. Over the past two decades, food policy initiatives have explored interventions to reshape production and consumption patterns, focusing on reducing food waste and curbing ruminant meat consumption. While the evidence of “what works” improves, evaluating which policies are appropriate and effective in specific contexts remains difficult due to external validity challenges. This paper demonstrates that a fine-tuned large language model (LLM) can accurately predict the direction of outcomes in approximately 80% of empirical studies measuring dietary-based impacts (e.g. food choices, sales, waste) resulting from behavioral interventions and policies. Approximately 75 prompts were required to achieve optimal results, with performance showing signs of catastrophic loss beyond this point. Our findings indicate that greater input detail enhances predictive accuracy, although the model still faces challenges with unseen studies, underscoring the importance of a representative training sample. As LLMs continue to improve and diversify, they hold promise for advancing data-driven, evidence-based policymaking.

[AI-28] Neural Corrective Machine Unranking

链接: https://arxiv.org/abs/2411.08562
作者: Jingrui Hou,Axel Finke,Georgina Cosma
关键词-EN: systems requires removing, requires removing specific, removing specific data, specific data whilst, data whilst maintaining
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: submitted to Information Sciences

点击查看摘要

Abstract:Machine unlearning in neural information retrieval (IR) systems requires removing specific data whilst maintaining model performance. Applying existing machine unlearning methods to IR may compromise retrieval effectiveness or inadvertently expose unlearning actions due to the removal of particular items from the retrieved results presented to users. We formalise corrective unranking, which extends machine unlearning in the (neural) IR context by integrating substitute documents to preserve ranking integrity, and propose a novel teacher-student framework, Corrective unRanking Distillation (CuRD), for this task. CuRD (1) facilitates forgetting by adjusting the (trained) neural IR model such that its output relevance scores of to-be-forgotten samples mimic those of low-ranking, non-retrievable samples; (2) enables correction by fine-tuning the relevance scores for the substitute samples to match those of corresponding to-be-forgotten samples closely; (3) seeks to preserve performance on samples that are not targeted for forgetting. We evaluate CuRD on four neural IR models (BERTcat, BERTdot, ColBERT, PARADE) using MS MARCO and TREC CAR datasets. Experiments with forget set sizes from 1% to 20% of the training dataset demonstrate that CuRD outperforms seven state-of-the-art baselines in terms of forgetting and correction while maintaining model retention and generalisation capabilities.
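
A minimal sketch of how the three objectives above might be combined into a single distillation loss. The weighting, the choice of the low-score target, and the pairing of substitutes to forgotten documents are assumptions for illustration, not CuRD's actual formulation.

```python
# Assumed outline of a CuRD-style objective (not the paper's exact loss).
import torch
import torch.nn.functional as F

def corrective_unranking_loss(student_scores, teacher_scores,
                              forget_idx, substitute_idx, retain_idx,
                              low_score=0.0):
    # forget_idx and substitute_idx are index tensors of equal length,
    # pairing each to-be-forgotten document with its substitute.
    forget = F.mse_loss(student_scores[forget_idx],
                        torch.full_like(student_scores[forget_idx], low_score))
    correct = F.mse_loss(student_scores[substitute_idx],
                         teacher_scores[forget_idx])
    retain = F.mse_loss(student_scores[retain_idx],
                        teacher_scores[retain_idx])
    return forget + correct + retain
```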

[AI-29] LogLLM: Log-based Anomaly Detection Using Large Language Models

链接: https://arxiv.org/abs/2411.08561
作者: Wei Guan,Jian Cao,Shiyou Qian,Jianqi Gao
关键词-EN: record important runtime, important runtime information, Software systems, record important, important runtime
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Software systems often record important runtime information in logs to help with troubleshooting. Log-based anomaly detection has become a key research area that aims to identify system issues through log data, ultimately enhancing the reliability of software systems. Traditional deep learning methods often struggle to capture the semantic information embedded in log data, which is typically organized in natural language. In this paper, we propose LogLLM, a log-based anomaly detection framework that leverages large language models (LLMs). LogLLM employs BERT for extracting semantic vectors from log messages, while utilizing Llama, a transformer decoder-based model, for classifying log sequences. Additionally, we introduce a projector to align the vector representation spaces of BERT and Llama, ensuring a cohesive understanding of log semantics. Unlike conventional methods that require log parsers to extract templates, LogLLM preprocesses log messages with regular expressions, streamlining the entire process. Our framework is trained through a novel three-stage procedure designed to enhance performance and adaptability. Experimental results across four public datasets demonstrate that LogLLM outperforms state-of-the-art methods. Even when handling unstable logs, it effectively captures the semantic meaning of log messages and detects anomalies accurately.
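
The regex preprocessing step (used in place of a log parser) is easy to illustrate. The patterns below are assumptions for demonstration rather than LogLLM's actual rules; the point is that variable fields are masked with placeholder tokens before the text reaches BERT.

```python
# Illustrative regex normalization of log messages (patterns are assumed, not LogLLM's).
import re

PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def preprocess(log_line: str) -> str:
    for pattern, token in PATTERNS:
        log_line = pattern.sub(token, log_line)
    return log_line

print(preprocess("Failed login from 192.168.0.17 port 22 session 0x3fa2"))
# -> Failed login from <IP> port <NUM> session <HEX>
```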

[AI-30] Leveraging Pre-Trained Neural Networks to Enhance Machine Learning with Variational Quantum Circuits

链接: https://arxiv.org/abs/2411.08552
作者: Jun Qi,Chao-Han Yang,Samuel Yen-Chi Chen,Pin-Yu Chen,Hector Zenil,Jesper Tegner
关键词-EN: offers tremendous potential, Variational Quantum Circuits, offers tremendous, Quantum Machine Learning, Machine Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: In submission

点击查看摘要

Abstract:Quantum Machine Learning (QML) offers tremendous potential but is currently limited by the availability of qubits. We introduce an innovative approach that utilizes pre-trained neural networks to enhance Variational Quantum Circuits (VQC). This technique effectively separates approximation error from qubit count and removes the need for restrictive conditions, making QML more viable for real-world applications. Our method significantly improves parameter optimization for VQC while delivering notable gains in representation and generalization capabilities, as evidenced by rigorous theoretical analysis and extensive empirical testing on quantum dot classification tasks. Moreover, our results extend to applications such as human genome analysis, demonstrating the broad applicability of our approach. By addressing the constraints of current quantum hardware, our work paves the way for a new era of advanced QML applications, unlocking the full potential of quantum computing in fields such as machine learning, materials science, medicine, mimetics, and various interdisciplinary areas.

[AI-31] Deeper Insights into Learning Performance of Stochastic Configuration Networks

链接: https://arxiv.org/abs/2411.08544
作者: Xiufeng Yan,Dianhui Wang
关键词-EN: Stochastic Configuration Networks, randomized neural networks, Stochastic Configuration, integrate randomized algorithms, Configuration Networks
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Stochastic Configuration Networks (SCNs) are a class of randomized neural networks that integrate randomized algorithms within an incremental learning framework. A defining feature of SCNs is the supervisory mechanism, which adaptively adjusts the distribution to generate effective random basis functions, thereby enabling error-free learning. In this paper, we present a comprehensive analysis of the impact of the supervisory mechanism on the learning performance of SCNs. Our findings reveal that the current SCN framework evaluates the effectiveness of each random basis function in reducing residual errors using a lower bound on its error reduction potential, which constrains SCNs’ overall learning efficiency. Specifically, SCNs may fail to consistently select the most effective random candidate as the new basis function during each training iteration. To overcome this problem, we propose a novel method for evaluating the hidden layer’s output matrix, supported by a new supervisory mechanism that accurately assesses the error reduction potential of random basis functions without requiring the computation of the Moore-Penrose inverse of the output matrix. This approach enhances the selection of basis functions, reducing computational complexity and improving the overall scalability and learning capabilities of SCNs. We introduce a Recursive Moore-Penrose Inverse-SCN (RMPI-SCN) training scheme based on the new supervisory mechanism and demonstrate its effectiveness through simulations over some benchmark datasets. Experiments show that RMPI-SCN outperforms the conventional SCN in terms of learning capability, underscoring its potential to advance the SCN framework for large-scale data modeling applications.
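
A simplified sketch of the incremental construction under discussion: at every step several random sigmoid nodes are sampled and the candidate giving the largest actual reduction in residual error is kept, with output weights refit by least squares. This captures the selection idea behind the paper's critique, but omits both the classical supervisory inequality and the recursive Moore-Penrose update of RMPI-SCN; the target function and parameter ranges are assumptions.

```python
# Greedy incremental construction of a random-basis network (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=200)   # toy regression target

H = np.ones((200, 1))                       # start with a bias column
for _ in range(20):                         # add up to 20 hidden nodes
    best_err, best_col = np.inf, None
    for _ in range(30):                     # random candidate pool per step
        w, b = rng.uniform(-5, 5, size=1), rng.uniform(-1, 1)
        col = 1.0 / (1.0 + np.exp(-(X @ w + b)))         # sigmoid basis function
        H_try = np.column_stack([H, col])
        beta, *_ = np.linalg.lstsq(H_try, y, rcond=None)
        err = np.mean((H_try @ beta - y) ** 2)
        if err < best_err:
            best_err, best_col = err, col
    H = np.column_stack([H, best_col])

beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print("final training MSE:", np.mean((H @ beta - y) ** 2))
```

Refitting the least-squares solution from scratch for every candidate is exactly the cost that a recursive pseudo-inverse update is meant to avoid.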

[AI-32] MLV2-Net: Rater-Based Majority-Label Voting for Consistent Meningeal Lymphatic Vessel Segmentation ML4H2024

链接: https://arxiv.org/abs/2411.08537
作者: Fabian Bongratz,Markus Karmann,Adrian Holz,Moritz Bonhoeffer,Viktor Neumaier,Sarah Deli,Benita Schmitz-Koep,Claus Zimmer,Christian Sorg,Melissa Thalhammer,Dennis M Hedderich,Christian Wachinger
关键词-EN: Meningeal lymphatic vessels, Meningeal lymphatic, lymphatic vessels, drainage of waste, waste products
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ML4H 2024

点击查看摘要

Abstract:Meningeal lymphatic vessels (MLVs) are responsible for the drainage of waste products from the human brain. An impairment in their functionality has been associated with aging as well as brain disorders like multiple sclerosis and Alzheimer’s disease. However, MLVs have only recently been described for the first time in magnetic resonance imaging (MRI), and their ramified structure renders manual segmentation particularly difficult. Further, as there is no consistent notion of their appearance, human-annotated MLV structures contain a high inter-rater variability that most automatic segmentation methods cannot take into account. In this work, we propose a new rater-aware training scheme for the popular nnU-Net model, and we explore rater-based ensembling strategies for accurate and consistent segmentation of MLVs. This enables us to boost nnU-Net’s performance while obtaining explicit predictions in different annotation styles and a rater-based uncertainty estimation. Our final model, MLV^2-Net, achieves a Dice similarity coefficient of 0.806 with respect to the human reference standard. The model further matches the human inter-rater reliability and replicates age-related associations with MLV volume.
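
The rater-aware training scheme is the actual contribution and is not sketched here; purely as a reminder of what majority-label voting over per-rater predictions means, a minimal aggregation function is:

```python
# Majority-label voting over binary masks predicted in different rater styles.
import numpy as np

def majority_vote(masks: np.ndarray) -> np.ndarray:
    """masks: (n_raters, H, W) array of binary predictions."""
    return (2 * masks.sum(axis=0) >= masks.shape[0]).astype(np.uint8)

masks = np.array([[[1, 0], [1, 1]],
                  [[1, 0], [0, 1]],
                  [[0, 1], [0, 1]]])
print(majority_vote(masks))   # [[1 0] [0 1]]
```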

[AI-33] ACROSS: A Deformation-Based Cross-Modal Representation for Robotic Tactile Perception ICRA2025

链接: https://arxiv.org/abs/2411.08533
作者: Wadhah Zai El Amri,Malte Kuhlmann,Nicolás Navarro-Guerrero
关键词-EN: perception is essential, Tactile, sensor, human interaction, Tactile perception
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Paper Submitted to ICRA2025. arXiv admin note: text overlap with arXiv:2410.14310

点击查看摘要

Abstract:Tactile perception is essential for human interaction with the environment and is becoming increasingly crucial in robotics. Tactile sensors like the BioTac mimic human fingertips and provide detailed interaction data. Despite its utility in applications like slip detection and object identification, this sensor is now deprecated, making many existing valuable datasets obsolete. However, recreating similar datasets with newer sensor technologies is both tedious and time-consuming. Therefore, it is crucial to adapt these existing datasets for use with new setups and modalities. In response, we introduce ACROSS, a novel framework for translating data between tactile sensors by exploiting sensor deformation information. We demonstrate the approach by translating BioTac signals into the DIGIT sensor. Our framework consists of first converting the input signals into 3D deformation meshes. We then transition from the 3D deformation mesh of one sensor to the mesh of another, and finally convert the generated 3D deformation mesh into the corresponding output space. We demonstrate our approach to the most challenging problem of going from a low-dimensional tactile representation to a high-dimensional one. In particular, we transfer the tactile signals of a BioTac sensor to DIGIT tactile images. Our approach enables the continued use of valuable datasets and the exchange of data between groups with different setups.

[AI-34] Gendered Words and Grant Rates: A Textual Analysis of Disparate Outcomes in the Patent System

链接: https://arxiv.org/abs/2411.08526
作者: Deborah Gerhardt,Miriam Marcowitz-Bitton,W. Michael Schuster,Avshalom Elmalech,Omri Suissa,Moshe Mash
关键词-EN: study examines gender, law by analyzing, patent, study examines, examines gender disparities
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study examines gender disparities in patent law by analyzing the textual content of patent applications. While prior research has primarily focused on the study of metadata (i.e., filing year or technological class), we employ machine learning and natural language processing techniques to derive latent information from patent texts. In particular, these methods are used to predict inventor gender based on textual characteristics. We find that gender can be identified with notable accuracy - even without knowing the inventor’s name. This ability to discern gender through text suggests that anonymized patent examination - often proposed as a solution to mitigate disparities in patent grant rate - may not fully address gender-specific outcomes in securing a patent. Our analysis additionally identifies gendered differences in textual choices within patent documents and the fields in which inventors choose to work. These findings highlight the complex interaction between textual choices, gender, and success in securing a patent. As discussed herein, this raises critical questions about the efficacy of current proposals aimed at achieving gender parity and efficiency in the patent system.

[AI-35] SAD-TIME: a Spatiotemporal-fused network for depression detection with Automated multi-scale Depth-wise and TIME-interval-related common feature extractor

链接: https://arxiv.org/abs/2411.08521
作者: Han-Guang Wang,Hui-Rang Hou,Li-Cheng Jin,Chen-Yang Xu,Zhong-Yi Zhang,Qing-Hao Meng
关键词-EN: severe mental disorder, Background and Objective, severe mental, cure and rehabilitation, rehabilitation of people
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 7 figures

点击查看摘要

Abstract:Background and Objective: Depression is a severe mental disorder, and accurate diagnosis is pivotal to the cure and rehabilitation of people with depression. However, the current questionnaire-based diagnostic methods could bring subjective biases and may be denied by subjects. In search of a more objective means of diagnosis, researchers have begun to experiment with deep learning-based methods for identifying depressive disorders in recent years. Methods: In this study, a novel Spatiotemporal-fused network with Automated multi-scale Depth-wise and TIME-interval-related common feature extractor (SAD-TIME) is proposed. SAD-TIME incorporates an automated nodes’ common features extractor (CFE), a spatial sector (SpS), a modified temporal sector (TeS), and a domain adversarial learner (DAL). The CFE includes a multi-scale depth-wise 1D-convolutional neural network and a time-interval embedding generator, where the unique information of each channel is preserved. The SpS fuses the functional connectivity with the distance-based connectivity containing spatial position of EEG electrodes. A multi-head-attention graph convolutional network is also applied in the SpS to fuse the features from different EEG channels. The TeS is based on long short-term memory and graph transformer networks, where the temporal information of different time-windows is fused. Moreover, the DAL is used after the SpS to obtain the domain-invariant feature. Results: Experimental results under tenfold cross-validation show that the proposed SAD-TIME method achieves 92.00% and 94.00% depression classification accuracies on two datasets, respectively, in cross-subject mode. Conclusion: SAD-TIME is a robust depression detection model, where the automatedly-generated features, the SpS and the TeS assist the classification performance with the fusion of the innate spatiotemporal information in the EEG signals.

[AI-36] Explainers' Mental Representations of Explainees' Needs in Everyday Explanations

链接: https://arxiv.org/abs/2411.08514
作者: Michael Erol Schaffer,Lutz Terfloth,Carsten Schulte,Heike M. Buhl
关键词-EN: mental representations, explainers’ mental representations, explainees’ developing knowledge, Architecture, Relevance
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In explanations, explainers have mental representations of explainees’ developing knowledge and shifting interests regarding the explanandum. These mental representations are dynamic in nature and develop over time, thereby enabling explainers to react to explainees’ needs by adapting and customizing the explanation. XAI should be able to react to explainees’ needs in a similar manner. Therefore, a component that incorporates aspects of explainers’ mental representations of explainees is required. In this study, we took first steps by investigating explainers’ mental representations in everyday explanations of technological artifacts. According to the dual nature theory, technological artifacts require explanations with two distinct perspectives, namely observable and measurable features addressing “Architecture” or interpretable aspects addressing “Relevance”. We conducted extended semi-structured pre-, post-, and video-recall interviews with explainers (N=9) in the context of an explanation. The transcribed interviews were analyzed utilizing qualitative content analysis. The explainers’ answers regarding the explainees’ knowledge of and interest in the technological artifact showed that explainers’ initially vague assumptions developed into strong beliefs over the course of the explanations. The assumed knowledge of explainees in the beginning is centered around Architecture and develops toward knowledge with regard to both Architecture and Relevance. In contrast, explainers initially assumed a higher interest in Relevance, shifting toward interest in both Architecture and Relevance as the explanation progressed. Further, explainers often finished the explanation despite their perception that explainees still had gaps in knowledge. These findings translate into practical implications for user models in adaptive explainable systems.

[AI-37] Learning Model Agnostic Explanations via Constraint Programming

链接: https://arxiv.org/abs/2411.08478
作者: Frederic Koriche,Jean-Marie Lagniez,Stefan Mengel,Chi Tran
关键词-EN: Interpretable Machine Learning, Machine Learning faces, Interpretable Machine, Machine Learning, Learning faces
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Interpretable Machine Learning faces a recurring challenge of explaining the predictions made by opaque classifiers such as ensemble models, kernel methods, or neural networks in terms that are understandable to humans. When the model is viewed as a black box, the objective is to identify a small set of features that jointly determine the black box response with minimal error. However, finding such model-agnostic explanations is computationally demanding, as the problem is intractable even for binary classifiers. In this paper, the task is framed as a Constraint Optimization Problem, where the constraint solver seeks an explanation of minimum error and bounded size for an input data instance and a set of samples generated by the black box. From a theoretical perspective, this constraint programming approach offers PAC-style guarantees for the output explanation. We evaluate the approach empirically on various datasets and show that it statistically outperforms the state-of-the-art heuristic Anchors method.

[AI-38] Building Trustworthy AI: Transparent AI Systems via Large Language Models Ontologies and Logical Reasoning (TranspNet)

链接: https://arxiv.org/abs/2411.08469
作者: Fadi Al Machot,Martin Thomas Horsch,Habib Ullah
关键词-EN: Large Language Models, Growing concerns, healthcare and finance, high-stakes fields, fields like healthcare
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Growing concerns over the lack of transparency in AI, particularly in high-stakes fields like healthcare and finance, drive the need for explainable and trustworthy systems. While Large Language Models (LLMs) perform exceptionally well in generating accurate outputs, their “black box” nature poses significant challenges to transparency and trust. To address this, the paper proposes the TranspNet pipeline, which integrates symbolic AI with LLMs. By leveraging domain expert knowledge, retrieval-augmented generation (RAG), and formal reasoning frameworks like Answer Set Programming (ASP), TranspNet enhances LLM outputs with structured reasoning and verification. This approach ensures that AI systems deliver not only accurate but also explainable and trustworthy results, meeting regulatory demands for transparency and accountability. TranspNet provides a comprehensive solution for developing AI systems that are reliable and interpretable, making it suitable for real-world applications where trust is critical.

[AI-39] Crystal Structure Generation Based On Material Properties

链接: https://arxiv.org/abs/2411.08464
作者: Chao Huang,JiaHui Chen,HongRui Liang,ChunYan Chen,Chen Chen
关键词-EN: crystal structure, material properties, materials science, crystal, structure
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:The discovery of new materials is very important to the field of materials science. When researchers explore new materials, they often have expected performance requirements for their crystal structure. In recent years, data-driven methods have made great progress in crystal structure generation, but there is still a lack of methods that can effectively map material properties to crystal structure. In this paper, we propose a Crystal DiT model to generate the crystal structure from the expected material properties by embedding the material properties and combining the symmetry information predicted by a large language model. Experimental verification shows that our proposed method has good performance.

[AI-40] Symbolic-AI-Fusion Deep Learning (SAIF-DL): Encoding Knowledge into Training with Answer Set Programming Loss Penalties by a Novel Loss Function Approach

链接: https://arxiv.org/abs/2411.08463
作者: Fadi Al Machot,Martin Thomas Horsch,Habib Ullah
关键词-EN: answer set programming, model learning process, set programming, paper presents, presents a hybrid
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:This paper presents a hybrid methodology that enhances the training process of deep learning (DL) models by embedding domain expert knowledge using ontologies and answer set programming (ASP). By integrating these symbolic AI methods, we encode domain-specific constraints, rules, and logical reasoning directly into the model’s learning process, thereby improving both performance and trustworthiness. The proposed approach is flexible and applicable to both regression and classification tasks, demonstrating generalizability across various fields such as healthcare, autonomous systems, engineering, and battery manufacturing applications. Unlike other state-of-the-art methods, the strength of our approach lies in its scalability across different domains. The design allows for the automation of the loss function by simply updating the ASP rules, making the system highly scalable and user-friendly. This facilitates seamless adaptation to new domains without significant redesign, offering a practical solution for integrating expert knowledge into DL models in industrial settings such as battery manufacturing.
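
The loss-penalty idea can be sketched without the ASP machinery. Below, a made-up domain rule ('predicted battery capacity must not increase with age') becomes a differentiable penalty added to the task loss; in SAIF-DL such penalties would be derived from ASP rules rather than hand-written as here.

```python
# Task loss plus a penalty for violating a hand-written (hypothetical) domain rule.
import torch
import torch.nn.functional as F

def rule_penalty(age, predicted_capacity):
    # hypothetical rule: capacity is non-increasing in age;
    # penalize positive jumps along the age-sorted predictions
    order = torch.argsort(age)
    diffs = predicted_capacity[order][1:] - predicted_capacity[order][:-1]
    return torch.relu(diffs).mean()

def knowledge_penalized_loss(pred, target, age, lam=1.0):
    return F.mse_loss(pred, target) + lam * rule_penalty(age, pred)
```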

[AI-41] Trap-MID: Trapdoor-based Defense against Model Inversion Attacks NEURIPS

链接: https://arxiv.org/abs/2411.08460
作者: Zhen-Ting Liu,Shang-Tse Chen
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, privacy of Deep, Networks by recovering
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by Neural Information Processing Systems (NeurIPS) 2024

点击查看摘要

Abstract:Model Inversion (MI) attacks pose a significant threat to the privacy of Deep Neural Networks by recovering training data distribution from well-trained models. While existing defenses often rely on regularization techniques to reduce information leakage, they remain vulnerable to recent attacks. In this paper, we propose the Trapdoor-based Model Inversion Defense (Trap-MID) to mislead MI attacks. A trapdoor is integrated into the model to predict a specific label when the input is injected with the corresponding trigger. Consequently, this trapdoor information serves as the “shortcut” for MI attacks, leading them to extract trapdoor triggers rather than private data. We provide theoretical insights into the impacts of trapdoor’s effectiveness and naturalness on deceiving MI attacks. In addition, empirical experiments demonstrate the state-of-the-art defense performance of Trap-MID against various MI attacks without the requirements for extra data or large computational overhead. Our source code is publicly available at this https URL.
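
A conceptual sketch of trapdoor injection (the trigger design, injection fraction, and naturalness constraints are assumptions, not the paper's recipe): during ordinary training, a small share of each batch is perturbed with a fixed trigger and relabeled to a designated trapdoor class, so the model learns the shortcut that later misleads inversion attacks.

```python
# Assumed trapdoor-injection step inside a normal training loop (illustrative only).
import torch

def inject_trapdoor(images, labels, trigger, trapdoor_label, fraction=0.1):
    n = max(1, int(fraction * images.size(0)))
    idx = torch.randperm(images.size(0))[:n]
    images, labels = images.clone(), labels.clone()
    images[idx] = (images[idx] + trigger).clamp(0.0, 1.0)
    labels[idx] = trapdoor_label
    return images, labels

# usage inside the loop, before the forward pass (trigger shape matches one image):
# x, y = inject_trapdoor(x, y, trigger=0.05 * torch.randn(1, 3, 32, 32), trapdoor_label=0)
```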

[AI-42] Learning Dynamic Cognitive Map with Autonomous Navigation

链接: https://arxiv.org/abs/2411.08447
作者: Daria de Tinguy,Tim Verbelen,Bart Dhoedt
关键词-EN: biologically inspired principles, inspired principles, biologically inspired, animal navigation strategies, space rooted
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: under submission at Frontiers Computer Neuroscience

点击查看摘要

Abstract:Inspired by animal navigation strategies, we introduce a novel computational model to navigate and map a space rooted in biologically inspired principles. Animals exhibit extraordinary navigation prowess, harnessing memory, imagination, and strategic decision-making to traverse complex and aliased environments adeptly. Our model aims to replicate these capabilities by incorporating a dynamically expanding cognitive map over predicted poses within an Active Inference framework, enhancing our agent’s generative model plasticity to novelty and environmental changes. Through structure learning and active inference navigation, our model demonstrates efficient exploration and exploitation, dynamically expanding its model capacity in response to anticipated novel un-visited locations and updating the map given new evidence contradicting previous beliefs. Comparative analyses in mini-grid environments with the Clone-Structured Cognitive Graph model (CSCG), which shares similar objectives, highlight our model’s ability to rapidly learn environmental structures within a single episode, with minimal navigation overlap. Our model achieves this without prior knowledge of observation and world dimensions, underscoring its robustness and efficacy in navigating intricate environments.

[AI-43] Towards Optimizing a Retrieval Augmented Generation using Large Language Model on Academic Data

链接: https://arxiv.org/abs/2411.08438
作者: Anum Afzal,Juraj Vladika,Gentrit Fazlija,Andrei Staradubets,Florian Matthes
关键词-EN: Retrieval Augmented Generation, integrating Retrieval Augmented, Augmented Generation, organizations integrating Retrieval, Retrieval Augmented
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Given the growing trend of many organizations integrating Retrieval Augmented Generation (RAG) into their operations, we assess RAG on domain-specific data and test state-of-the-art models across various optimization techniques. We incorporate four optimizations: Multi-Query, Child-Parent-Retriever, Ensemble Retriever, and In-Context-Learning, to enhance the functionality and performance in the academic domain. We focus on data retrieval, specifically targeting various study programs at a large technical university. We additionally introduce a novel evaluation approach, the RAG Confusion Matrix, designed to assess the effectiveness of various configurations within the RAG framework. By exploring the integration of both open-source (e.g., Llama2, Mistral) and closed-source (GPT-3.5 and GPT-4) Large Language Models, we offer valuable insights into the application and optimization of RAG frameworks in domain-specific contexts. Our experiments show a significant performance increase when including multi-query in the retrieval phase.
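
Of the four optimizations, Multi-Query is the simplest to sketch in isolation. In the hedged outline below, `paraphrase` and `retrieve` are placeholder callables standing in for an LLM-based query rewriter and any retriever that returns (doc_id, score) pairs; they are assumptions, not a specific framework's API.

```python
# Multi-Query retrieval sketch with placeholder paraphrase/retrieve callables.
def multi_query_retrieve(query, paraphrase, retrieve, n_variants=3, top_k=5):
    variants = [query] + [paraphrase(query) for _ in range(n_variants)]
    scored = {}
    for q in variants:                      # retrieve for each reformulation
        for doc_id, score in retrieve(q, top_k):
            scored[doc_id] = max(score, scored.get(doc_id, float("-inf")))
    # merge by best score across all query variants
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```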

[AI-44] 3D Multi-Object Tracking with Semi-Supervised GRU-Kalman Filter

链接: https://arxiv.org/abs/2411.08433
作者: Xiaoxiang Wang,Jiaxin Liu,Miaojie Feng,Zhaoxing Zhang,Xin Yang
关键词-EN: environmental perception, robotic sensing, fundamental component, component of environmental, essential for intelligent
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:3D Multi-Object Tracking (MOT), a fundamental component of environmental perception, is essential for intelligent systems like autonomous driving and robotic sensing. Although Tracking-by-Detection frameworks have demonstrated excellent performance in recent years, their application in real-world scenarios faces significant challenges. Object movement in complex environments is often highly nonlinear, while existing methods typically rely on linear approximations of motion. Furthermore, system noise is frequently modeled as a Gaussian distribution, which fails to capture the true complexity of the noise dynamics. These oversimplified modeling assumptions can lead to significant reductions in tracking precision. To address this, we propose a GRU-based MOT method, which introduces a learnable Kalman filter into the motion module. This approach is able to learn object motion characteristics through data-driven learning, thereby avoiding the need for manual model design and model error. At the same time, to avoid abnormal supervision caused by the wrong association between annotations and trajectories, we design a semi-supervised learning strategy to accelerate the convergence speed and improve the robustness of the model. Evaluation experiment on the nuScenes and Argoverse2 datasets demonstrates that our system exhibits superior performance and significant potential compared to traditional TBD methods.
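
A conceptual sketch of what a learnable, GRU-based Kalman-style update could look like (the architecture details are assumptions and not taken from the paper): a GRU cell consumes the predicted and measured states and emits a per-dimension gain that blends them, playing the role of a learned Kalman gain.

```python
# Assumed GRU-based state update that learns a Kalman-like gain from data.
import torch
import torch.nn as nn

class GRUKalmanUpdate(nn.Module):
    def __init__(self, state_dim, hidden_dim=32):
        super().__init__()
        self.cell = nn.GRUCell(2 * state_dim, hidden_dim)
        self.gain = nn.Sequential(nn.Linear(hidden_dim, state_dim), nn.Sigmoid())

    def forward(self, predicted, measured, hidden):
        # predicted/measured: (batch, state_dim); hidden: (batch, hidden_dim)
        hidden = self.cell(torch.cat([predicted, measured], dim=-1), hidden)
        k = self.gain(hidden)                         # learned gain in [0, 1]
        updated = predicted + k * (measured - predicted)
        return updated, hidden
```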

[AI-45] A Heterogeneous Graph Neural Network Fusing Functional and Structural Connectivity for MCI Diagnosis

链接: https://arxiv.org/abs/2411.08424
作者: Feiyu Yin,Yu Lei,Siyuan Dai,Wenwen Zeng,Guoqing Wu,Liang Zhan,Jinhua Yu
关键词-EN: diffusion tensor imaging, resting-state functional imaging, tensor imaging, graph neural networks, Brain connectivity alternations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain connectivity alterations associated with brain disorders have been widely reported in resting-state functional imaging (rs-fMRI) and diffusion tensor imaging (DTI). While many dual-modal fusion methods based on graph neural networks (GNNs) have been proposed, they generally follow homogeneous fusion approaches, ignoring the rich heterogeneity of dual-modal information. To address this issue, we propose a novel method that integrates functional and structural connectivity based on heterogeneous graph neural networks (HGNNs) to better leverage the rich heterogeneity in dual-modal images. We firstly use blood oxygen level dependency and white matter structure information provided by rs-fMRI and DTI to establish homo-meta-path, capturing node relationships within the same modality. At the same time, we propose to establish hetero-meta-path based on structure-function coupling and brain community searching to capture relations among cross-modal nodes. Secondly, we further introduce a heterogeneous graph pooling strategy that automatically balances homo- and hetero-meta-path, effectively leveraging heterogeneous information and preventing feature confusion after pooling. Thirdly, based on the flexibility of heterogeneous graphs, we propose a heterogeneous graph data augmentation approach that can conveniently address the sample imbalance issue commonly seen in clinical diagnosis. We evaluate our method on ADNI-3 dataset for mild cognitive impairment (MCI) diagnosis. Experimental results indicate the proposed method is effective and superior to other algorithms, with a mean classification accuracy of 93.3%.

[AI-46] Enhanced Classroom Dialogue Sequences Analysis with a Hybrid AI Agent: Merging Expert Rule-Base with Large Language Models

链接: https://arxiv.org/abs/2411.08418
作者: Yun Long,Yu Zhang
关键词-EN: fostering student engagement, Classroom dialogue plays, deeper learning, plays a crucial, crucial role
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Classroom dialogue plays a crucial role in fostering student engagement and deeper learning. However, analysing dialogue sequences has traditionally relied on either theoretical frameworks or empirical descriptions of practice, with limited integration between the two. This study addresses this gap by developing a comprehensive rule base of dialogue sequences and an Artificial Intelligence (AI) agent that combines expert-informed rule-based systems with a large language model (LLM). The agent applies expert knowledge while adapting to the complexities of natural language, enabling accurate and flexible categorisation of classroom dialogue sequences. By synthesising findings from over 30 studies, we established a comprehensive framework for dialogue analysis. The agent was validated against human expert coding, achieving high levels of precision and reliability. The results demonstrate that the agent provides theory-grounded and adaptive functions, tremendously enhancing the efficiency and scalability of classroom dialogue analysis, offering significant potential in improving classroom teaching practices and supporting teacher professional development.

[AI-47] Material Property Prediction with Element Attribute Knowledge Graphs and Multimodal Representation Learning

链接: https://arxiv.org/abs/2411.08414
作者: Chao Huang,Chunyan Chen,Ling Shi,Chen Chen
关键词-EN: Machine learning, crucial tool, Machine, tool for predicting, crystalline materials
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning has become a crucial tool for predicting the properties of crystalline materials. However, existing methods primarily represent material information by constructing multi-edge graphs of crystal structures, often overlooking the chemical and physical properties of elements (such as atomic radius, electronegativity, melting point, and ionization energy), which have a significant impact on material performance. To address this limitation, we first constructed an element property knowledge graph and utilized an embedding model to encode the element attributes within the knowledge graph. Furthermore, we propose a multimodal fusion framework, ESNet, which integrates element property features with crystal structure features to generate joint multimodal representations. This provides a more comprehensive perspective for predicting the performance of crystalline materials, enabling the model to consider both microstructural composition and chemical characteristics of the materials. We conducted experiments on the Materials Project benchmark dataset, which showed leading performance in the bandgap prediction task and achieved results on a par with existing benchmarks in the formation energy prediction task.

[AI-48] DiVR: incorporating context from diverse VR scenes for human trajectory prediction

链接: https://arxiv.org/abs/2411.08409
作者: Franz Franco Gallo(BIOVISION),Hui-Yin Wu(BIOVISION),Lucile Sassatelli(UniCA, IUF)
关键词-EN: offering unique opportunities, collecting detailed data, Virtual environments provide, offering unique, provide a rich
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Virtual environments provide a rich and controlled setting for collecting detailed data on human behavior, offering unique opportunities for predicting human trajectories in dynamic scenes. However, most existing approaches have overlooked the potential of these environments, focusing instead on static contexts without considering user-specific factors. Employing the CREATTIVE3D dataset, our work models trajectories recorded in virtual reality (VR) scenes for diverse situations including road-crossing tasks with user interactions and simulated visual impairments. We propose Diverse Context VR Human Motion Prediction (DiVR), a cross-modal transformer based on the Perceiver architecture that integrates both static and dynamic scene context using a heterogeneous graph convolution network. We conduct extensive experiments comparing DiVR against existing architectures including MLP, LSTM, and transformers with gaze and point cloud context. Additionally, we stress test our model’s generalizability across different users, tasks, and scenes. Results show that DiVR achieves higher accuracy and adaptability compared to other models and to static graphs. This work highlights the advantages of using VR datasets for context-aware human trajectory modeling, with potential applications in enhancing user experiences in the metaverse. Our source code is publicly available at this https URL.

[AI-49] BAMAX: Backtrack Assisted Multi-Agent Exploration using Reinforcement Learning

链接: https://arxiv.org/abs/2411.08400
作者: Geetansh Kalra,Amit Patel,Atul Chaudhari,Divye Singh
关键词-EN: Autonomous robots collaboratively, robots collaboratively exploring, Autonomous robots, collaboratively exploring, exploring an unknown
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous robots collaboratively exploring an unknown environment is still an open problem. The problem has its roots in coordination among non-stationary agents, each with only a partial view of information. The problem is compounded when the multiple robots must completely explore the environment. In this paper, we introduce Backtrack Assisted Multi-Agent Exploration using Reinforcement Learning (BAMAX), a method for collaborative exploration in multi-agent systems which attempts to explore an entire virtual environment. As the name suggests, BAMAX leverages backtrack assistance to enhance the performance of agents in exploration tasks. To evaluate BAMAX against traditional approaches, we present the results of experiments conducted across multiple hexagonal-shaped grids with sizes ranging from 10x10 to 60x60. The results demonstrate that BAMAX outperforms other methods in terms of faster coverage and less backtracking across these environments.
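
To illustrate only the backtracking mechanism (the policy that BAMAX actually learns with RL is replaced here by a trivial 'take any unvisited neighbor' rule, and the grid is square rather than hexagonal), a single-agent, stack-based explorer looks like this:

```python
# Stack-based exploration with backtracking on a toy square grid (single agent).
def explore(grid_size):
    visited, stack, path = set(), [(0, 0)], []
    while stack:
        cell = stack[-1]
        visited.add(cell)
        path.append(cell)
        neighbors = [(cell[0] + dx, cell[1] + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        frontier = [c for c in neighbors
                    if 0 <= c[0] < grid_size and 0 <= c[1] < grid_size
                    and c not in visited]
        if frontier:
            stack.append(frontier[0])     # move to an unvisited neighbor
        else:
            stack.pop()                   # backtrack toward the last cell with options
    return path

path = explore(5)
print(len(set(path)), "cells covered in", len(path), "moves")   # 25 cells
```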

[AI-50] RLInspect: An Interactive Visual Approach to Assess Reinforcement Learning Algorithm

链接: https://arxiv.org/abs/2411.08392
作者: Geetansh Kalra,Divye Singh,Justin Jose
关键词-EN: rapidly growing area, Reinforcement Learning, machine learning techniques, machine learning, range of domains
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) is a rapidly growing area of machine learning that finds its application in a broad range of domains, from finance and healthcare to robotics and gaming. Compared to other machine learning techniques, RL agents learn from their own experiences using trial and error, and improve their performance over time. However, assessing RL models can be challenging, which makes it difficult to interpret their behaviour. While reward is a widely used metric to evaluate RL models, it may not always provide an accurate measure of training performance. In some cases, the reward may seem increasing while the model’s performance is actually decreasing, leading to misleading conclusions about the effectiveness of the training. To overcome this limitation, we have developed RLInspect - an interactive visual analytic tool, that takes into account different components of the RL model - state, action, agent architecture and reward, and provides a more comprehensive view of the RL training. By using RLInspect, users can gain insights into the model’s behaviour, identify issues during training, and potentially correct them effectively, leading to a more robust and reliable RL system.

[AI-51] Physics Informed Distillation for Diffusion Models

链接: https://arxiv.org/abs/2411.08378
作者: Joshua Tian Jin Tee,Kang Zhang,Hee Suk Yoon,Dhananjaya Nagaraja Gowda,Chanwoo Kim,Chang D. Yoo
关键词-EN: Diffusion models, Probability Flow Ordinary, Flow Ordinary Differential, Physics Informed Neural, Informed Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have recently emerged as a potent tool in generative modeling. However, their inherent iterative nature often results in sluggish image generation due to the requirement for multiple model evaluations. Recent progress has unveiled the intrinsic link between diffusion models and Probability Flow Ordinary Differential Equations (ODEs), thus enabling us to conceptualize diffusion models as ODE systems. Simultaneously, Physics Informed Neural Networks (PINNs) have substantiated their effectiveness in solving intricate differential equations through implicit modeling of their solutions. Building upon these foundational insights, we introduce Physics Informed Distillation (PID), which employs a student model to represent the solution of the ODE system corresponding to the teacher diffusion model, akin to the principles employed in PINNs. Through experiments on CIFAR 10 and ImageNet 64x64, we observe that PID achieves performance comparable to recent distillation methods. Notably, it demonstrates predictable trends concerning method-specific hyperparameters and eliminates the need for synthetic dataset generation during the distillation process. Both of which contribute to its easy-to-use nature as a distillation approach for Diffusion Models. Our code and pre-trained checkpoint are publicly available at: this https URL.
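
A hedged sketch of the core objective for a one-dimensional toy state (in the paper the teacher is a trained diffusion model and its probability-flow ODE; `student` and `teacher_velocity` below are placeholder callables): the student is penalized whenever the time derivative of its output disagrees with the teacher's velocity field, in the spirit of a PINN residual.

```python
# Assumed PINN-style residual for distilling an ODE solution into a student network.
import torch

def pid_residual_loss(student, teacher_velocity, z, t):
    """z, t: 1-D tensors of latent samples and times; student(z, t) returns 1-D states."""
    t = t.clone().requires_grad_(True)
    x = student(z, t)
    dx_dt = torch.autograd.grad(x, t, grad_outputs=torch.ones_like(x),
                                create_graph=True)[0]    # per-sample dx/dt
    return ((dx_dt - teacher_velocity(x, t)) ** 2).mean()
```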

[AI-52] Developing an Effective Training Dataset to Enhance the Performance of AI-based Speaker Separation Systems

链接: https://arxiv.org/abs/2411.08375
作者: Rawad Melhem,Assef Jafar,Oumayma Al Dakkak
关键词-EN: active research topic, promising results achieved, recent years, paper addresses, addresses the challenge
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: in Arabic language

点击查看摘要

Abstract:This paper addresses the challenge of speaker separation, which remains an active research topic despite the promising results achieved in recent years. These results, however, often degrade in real recording conditions due to the presence of noise, echo, and other interferences. This is because neural models are typically trained on synthetic datasets consisting of mixed audio signals and their corresponding ground truths, which are generated using computer software and do not fully represent the complexities of real-world recording scenarios. The lack of realistic training sets for speaker separation remains a major hurdle, as obtaining individual sounds from mixed audio signals is a nontrivial task. To address this issue, we propose a novel method for constructing a realistic training set that includes mixture signals and corresponding ground truths for each speaker. We evaluate this dataset on a deep learning model and compare it to a synthetic dataset. We got a 1.65 dB improvement in Scale Invariant Signal to Distortion Ratio (SI-SDR) for speaker separation accuracy in realistic mixing. Our findings highlight the potential of realistic training sets for enhancing the performance of speaker separation models in real-world scenarios.
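
For reference, the SI-SDR metric reported above is commonly computed as follows (standard definition, not code from the paper); higher values mean the estimate is closer to the reference after optimal rescaling, so the reported 1.65 dB gain corresponds to cleaner separation on realistic mixtures.

```python
# Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), standard definition.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference                 # projection onto the reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```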

[AI-53] A Fuzzy Reinforcement LSTM-based Long-term Prediction Model for Fault Conditions in Nuclear Power Plants

链接: https://arxiv.org/abs/2411.08370
作者: Siwei Li,Jiayan Fang,Yichun Wua,Wei Wang,Chengxin Li,Jiangwen Chen
关键词-EN: significantly mitigate operational, mitigate operational risks, timely maintenance scheduling, Early fault detection, operator decision-making
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Early fault detection and timely maintenance scheduling can significantly mitigate operational risks in NPPs and enhance the reliability of operator decision-making. Therefore, it is necessary to develop an efficient Prognostics and Health Management (PHM) multi-step prediction model for predicting of system health status and prompt execution of maintenance operations. In this study, we propose a novel predictive model that integrates reinforcement learning with Long Short-Term Memory (LSTM) neural networks and the Expert Fuzzy Evaluation Method. The model is validated using parameter data for 20 different breach sizes in the Main Steam Line Break (MSLB) accident condition of the CPR1000 pressurized water reactor simulation model and it demonstrates a remarkable capability in accurately forecasting NPP parameter changes up to 128 steps ahead (with a time interval of 10 seconds per step, i.e., 1280 seconds), thereby satisfying the temporal advance requirement for fault prognostics in NPPs. Furthermore, this method provides an effective reference solution for PHM applications such as anomaly detection and remaining useful life prediction.

[AI-54] Surprisingly Popular Voting for Concentric Rank-Order Models

链接: https://arxiv.org/abs/2411.08367
作者: Hadi Hosseini,Debmalya Mandal,Amrit Puhan
关键词-EN: social information sites, ground truth, ground truth ranking, important problem, problem on social
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An important problem on social information sites is the recovery of ground truth from individual reports when the experts are in the minority. The wisdom of the crowd, i.e., the collective opinion of a group of individuals, fails in such a scenario. However, the surprisingly popular (SP) algorithm (Prelec et al., 2017) can recover the ground truth even when the experts are in the minority, by asking the individuals to report additional prediction reports: their beliefs about the reports of others. Several recent works have extended the surprisingly popular algorithm to an equivalent voting rule (SP-voting) to recover the ground truth ranking over a set of m alternatives. However, we are yet to fully understand when SP-voting can recover the ground truth ranking, and if so, how many samples (votes and predictions) it needs. We answer this question by proposing two rank-order models and analyzing the sample complexity of SP-voting under these models. In particular, we propose concentric mixtures of Mallows and Plackett-Luce models with $G$ ($\ge 2$) groups. Our models generalize previously proposed concentric mixtures of Mallows models with 2 groups, and we highlight the importance of $G > 2$ groups by identifying three distinct groups (expert, intermediate, and non-expert) from existing datasets. Next, we provide conditions on the parameters of the underlying models so that SP-voting can recover ground-truth rankings with high probability, and also derive sample complexities under the same. We complement the theoretical results by evaluating SP-voting on simulated and real datasets.
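
The primitive behind SP-voting is easy to state for a single binary question (this toy version ignores rankings over m alternatives and the mixture models analyzed in the paper): an answer is selected if it is more popular than the respondents, on average, predicted it would be.

```python
# Surprisingly popular rule for one binary question (toy illustration).
def surprisingly_popular(votes, predictions):
    """votes: 0/1 answers; predictions: each voter's predicted share of 1-votes."""
    actual_share = sum(votes) / len(votes)
    predicted_share = sum(predictions) / len(predictions)
    return 1 if actual_share > predicted_share else 0

# Only 40% answer "yes", but the crowd predicted that 25% would,
# so "yes" is surprisingly popular and wins despite being the minority answer.
print(surprisingly_popular([1, 0, 1, 0, 0], [0.30, 0.20, 0.40, 0.20, 0.15]))  # -> 1
```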

[AI-55] Generative AI for Data Augmentation in Wireless Networks: Analysis, Applications, and Case Study

链接: https://arxiv.org/abs/2411.08341
作者: Jinbo Wen,Jiawen Kang,Dusit Niyato,Yang Zhang,Jiacheng Wang,Biplab Sikdar,Ping Zhang
关键词-EN: Data augmentation, Data, data augmentation techniques, GenAI-driven data augmentation, mitigate data scarcity
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data augmentation is a powerful technique to mitigate data scarcity. However, owing to fundamental differences in wireless data structures, traditional data augmentation techniques may not be suitable for wireless data. Fortunately, Generative Artificial Intelligence (GenAI) can be an effective alternative to wireless data augmentation due to its excellent data generation capability. This article systemically explores the potential and effectiveness of GenAI-driven data augmentation in wireless networks. We first briefly review data augmentation techniques, discuss their limitations in wireless networks, and introduce generative data augmentation, including reviewing GenAI models and their applications in data augmentation. We then explore the application prospects of GenAI-driven data augmentation in wireless networks from the physical, network, and application layers, which provides a GenAI-driven data augmentation architecture for each application. Subsequently, we propose a general generative diffusion model-based data augmentation framework for Wi-Fi gesture recognition, which uses transformer-based diffusion models to generate high-quality channel state information data. Furthermore, we develop residual neural network models for Wi-Fi gesture recognition to evaluate the role of augmented data and conduct a case study based on a real dataset. Simulation results demonstrate the effectiveness of the proposed framework. Finally, we discuss research directions for generative data augmentation.

[AI-56] DEEGITS: Deep Learning based Framework for Measuring Heterogenous Traffic State in Challenging Traffic Scenarios

链接: https://arxiv.org/abs/2411.08335
作者: Muttahirul Islam,Nazmul Haque,Md. Hadiuzzaman
关键词-EN: Deep Learning Based, paper presents DEEGITS, convolutional neural network, Learning Based Heterogeneous, rapidly detect vehicles
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Submitted for presentation at the 103rd Annual Meeting of Transportation Research Board and publication in Transportation Research Record: Journal of Transportation Research Board

点击查看摘要

Abstract:This paper presents DEEGITS (Deep Learning Based Heterogeneous Traffic State Measurement), a comprehensive framework that leverages state-of-the-art convolutional neural network (CNN) techniques to accurately and rapidly detect vehicles and pedestrians, as well as to measure traffic states in challenging scenarios (i.e., congestion, occlusion). In this study, we enhance the training dataset through data fusion, enabling simultaneous detection of vehicles and pedestrians. Image preprocessing and augmentation are subsequently performed to improve the quality and quantity of the dataset. Transfer learning is applied on the YOLOv8 pretrained model to increase the model’s capability to identify a diverse array of vehicles. Optimal hyperparameters are obtained using the Grid Search algorithm, with the Stochastic Gradient Descent (SGD) optimizer outperforming other optimizers under these settings. Extensive experimentation and evaluation demonstrate substantial accuracy within the detection framework, with the model achieving 0.794 mAP@0.5 on the validation set and 0.786 mAP@0.5 on the test set, surpassing previous benchmarks on similar datasets. The DeepSORT multi-object tracking algorithm is incorporated to track detected vehicles and pedestrians in this study. Finally, the framework is tested to measure heterogeneous traffic states in mixed traffic conditions. Two locations with differing traffic compositions and congestion levels are selected: one motorized-dominant location with moderate density and one non-motorized-dominant location with higher density. Errors are statistically insignificant for both cases, showing correlations from 0.99 to 0.88 and 0.91 to 0.97 for heterogeneous traffic flow and speed measurements, respectively.
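
As a rough illustration of the transfer-learning step described above, the sketch below fine-tunes a pretrained YOLOv8 model with the ultralytics API (assumed to be installed); the dataset YAML, epochs, and SGD hyperparameters are placeholders rather than the settings used in DEEGITS.

```python
from ultralytics import YOLO

# Fine-tune a pretrained YOLOv8 model on a fused vehicle/pedestrian dataset.
# "traffic_fused.yaml" and the hyperparameters below are hypothetical placeholders.
model = YOLO("yolov8m.pt")          # start from pretrained weights (transfer learning)
model.train(
    data="traffic_fused.yaml",      # hypothetical dataset config with vehicle + pedestrian classes
    epochs=100,
    imgsz=640,
    optimizer="SGD",                # SGD reportedly performed best in the study
    lr0=0.01,
)
metrics = model.val()               # validation metrics, including mAP@0.5
print(metrics)
```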

[AI-57] Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval

链接: https://arxiv.org/abs/2411.08334
作者: Yeong-Joon Ju,Ho-Joong Kim,Seong-Whan Lee
关键词-EN: Existing multimodal retrieval, Existing multimodal, image comprehension, caption generators, leading to cumbersome
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Existing multimodal retrieval systems often rely on disjointed models for image comprehension, such as object detectors and caption generators, leading to cumbersome implementations and training processes. To overcome this limitation, we propose an end-to-end retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries via dynamic modality interaction. Ret-XKnow leverages a partial convolution mechanism to focus on visual information relevant to the given textual query, thereby enhancing multimodal query representations. To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval (ViD2R) dataset automatically constructed from visual dialogue datasets. Our dataset construction process ensures that the dialogues are transformed into suitable information retrieval tasks using a text retriever. We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios. Our code is publicly available: this https URL.

[AI-58] Responsible AI in Construction Safety: Systematic Evaluation of Large Language Models and Prompt Engineering

链接: https://arxiv.org/abs/2411.08320
作者: Farouq Sammour,Jia Xu,Xi Wang,Mo Hu,Zhenyu Zhang
关键词-EN: Large Language Models, hazardous sectors, Large Language, Certified Safety Professionals, safety
类目: Artificial Intelligence (cs.AI)
*备注: 29 pages, 5 figures

点击查看摘要

Abstract:Construction remains one of the most hazardous sectors. Recent advancements in AI, particularly Large Language Models (LLMs), offer promising opportunities for enhancing workplace safety. However, responsible integration of LLMs requires systematic evaluation, as deploying them without understanding their capabilities and limitations risks generating inaccurate information, fostering misplaced confidence, and compromising worker safety. This study evaluates the performance of two widely used LLMs, GPT-3.5 and GPT-4o, across three standardized exams administered by the Board of Certified Safety Professionals (BCSP). Using 385 questions spanning seven safety knowledge areas, the study analyzes the models’ accuracy, consistency, and reliability. Results show that both models consistently exceed the BCSP benchmark, with GPT-4o achieving an accuracy rate of 84.6% and GPT-3.5 reaching 73.8%. Both models demonstrate strengths in safety management systems and hazard identification and control, but exhibit weaknesses in science, mathematics, emergency response, and fire prevention. An error analysis identifies four primary limitations affecting LLM performance: lack of knowledge, reasoning flaws, memory issues, and calculation errors. Our study also highlights the impact of prompt engineering strategies, with variations in accuracy reaching 13.5% for GPT-3.5 and 7.9% for GPT-4o. However, no single prompt configuration proves universally effective. This research advances knowledge in three ways: by identifying areas where LLMs can support safety practices and where human oversight remains essential, by offering practical insights into improving LLM implementation through prompt engineering, and by providing evidence-based direction for future research and development. These contributions support the responsible integration of AI in construction safety management toward achieving zero injuries.
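
A sketch of the kind of multiple-choice evaluation loop this study describes is given below, assuming the openai Python client; the sample question, prompt wording, and scoring are illustrative and do not reproduce BCSP exam content.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One hypothetical multiple-choice safety question; real exams contain hundreds.
questions = [
    {"stem": "Which document lists the hazards of a chemical product?",
     "choices": {"A": "Safety Data Sheet", "B": "Invoice", "C": "Packing list", "D": "Purchase order"},
     "answer": "A"},
]

def ask(model_name: str, q: dict) -> str:
    prompt = (q["stem"] + "\n"
              + "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
              + "\nAnswer with a single letter.")
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()[0]  # first character = chosen letter

correct = sum(ask("gpt-4o", q) == q["answer"] for q in questions)
print(f"accuracy: {correct / len(questions):.1%}")
```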

[AI-59] PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

链接: https://arxiv.org/abs/2411.08307
作者: Yungang Yi,Weihua Li,Matthew Kuo,Quan Bai
关键词-EN: progressed significantly, domain of audio, Segmentation and Scale, audio generation, Music
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: this https URL.

[AI-60] DNN Task Assignment in UAV Networks: A Generative AI Enhanced Multi-Agent Reinforcement Learning Approach

链接: https://arxiv.org/abs/2411.08299
作者: Xin Tang,Qian Chen,Wenjie Weng,Binhan Liao,Jiacheng Wang,Xianbin Cao,Xiaohuan Li
关键词-EN: Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, Internet of Things, possess high mobility
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) possess high mobility and flexible deployment capabilities, prompting the development of UAVs for various application scenarios within the Internet of Things (IoT). The unique capabilities of UAVs give rise to increasingly critical and complex tasks in uncertain and potentially harsh environments. The substantial amount of data generated from these applications necessitates processing and analysis through deep neural networks (DNNs). However, UAVs encounter challenges due to their limited computing resources when managing DNN models. This paper presents a joint approach that combines multiple-agent reinforcement learning (MARL) and generative diffusion models (GDM) for assigning DNN tasks to a UAV swarm, aimed at reducing latency from task capture to result output. To address these challenges, we first consider the task size of the target area to be inspected and the shortest flying path as optimization constraints, employing a greedy algorithm to resolve the subproblem with a focus on minimizing the UAV’s flying path and the overall system cost. In the second stage, we introduce a novel DNN task assignment algorithm, termed GDM-MADDPG, which utilizes the reverse denoising process of GDM to replace the actor network in multi-agent deep deterministic policy gradient (MADDPG). This approach generates specific DNN task assignment actions based on agents’ observations in a dynamic environment. Simulation results indicate that our algorithm performs favorably compared to benchmarks in terms of path planning, Age of Information (AoI), energy consumption, and task load balancing.

[AI-61] towerDebias: A Novel Debiasing Method based on the Tower Property

链接: https://arxiv.org/abs/2411.08297
作者: Norman Matloff,Aditya Mittal
关键词-EN: machine learning tools, sophisticated machine learning, Decision-making processes, black-box machine learning, raising concerns
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
*备注: To be submitted to a journal soon

点击查看摘要

Abstract:Decision-making processes have increasingly come to rely on sophisticated machine learning tools, raising concerns about the fairness of their predictions with respect to any sensitive groups. The widespread use of commercial black-box machine learning models necessitates careful consideration of their legal and ethical implications on consumers. In situations where users have access to these “black-box” models, a key question emerges: how can we mitigate or eliminate the influence of sensitive attributes, such as race or gender? We propose towerDebias (tDB), a novel approach designed to reduce the influence of sensitive variables in predictions made by black-box models. Using the Tower Property from probability theory, tDB aims to improve prediction fairness during the post-processing stage in a manner amenable to the Fairness-Utility Tradeoff. This method is highly flexible, requiring no prior knowledge of the original model’s internal structure, and can be extended to a range of different applications. We provide a formal improvement theorem for tDB and demonstrate its effectiveness in both regression and classification tasks, underscoring its impact on the fairness-utility tradeoff.
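
A toy sketch of the tower-property idea behind this kind of post-processing is shown below: the black-box prediction f(x, s) is replaced by an estimate of E[f(X, S) | X = x] obtained by averaging over candidate values of the sensitive attribute. The black-box model and weights are stand-ins; towerDebias itself estimates this conditional expectation more carefully.

```python
import numpy as np

def black_box(x, s):
    # Hypothetical black-box model whose output leaks the sensitive attribute s.
    return 2.0 * x + 0.5 * s

def debiased_prediction(x, sensitive_values, weights=None):
    """Approximate E[f(X, S) | X = x] by a (weighted) average over values of S."""
    preds = np.array([black_box(x, s) for s in sensitive_values])
    if weights is None:
        weights = np.full(len(preds), 1.0 / len(preds))
    return float(np.dot(weights, preds))

# The averaged prediction no longer changes with the individual's value of S.
print(debiased_prediction(x=1.0, sensitive_values=[0, 1]))
```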

[AI-62] RESOLVE: Relational Reasoning with Symbolic and Object-Level Features Using Vector Symbolic Processing

链接: https://arxiv.org/abs/2411.08290
作者: Mohamed Mejri,Chandramouli Amarnath,Abhijit Chatterjee
关键词-EN: Modern transformer-based encoder-decoder, Modern transformer-based, effectively extract relational, transformer-based encoder-decoder architectures, encoder-decoder architectures struggle
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern transformer-based encoder-decoder architectures struggle with reasoning tasks due to their inability to effectively extract relational information between input objects (data/tokens). Recent work introduced the Abstractor module, embedded between transformer layers, to address this gap. However, the Abstractor layer while excelling at capturing relational information (pure relational reasoning), faces challenges in tasks that require both object and relational-level reasoning (partial relational reasoning). To address this, we propose RESOLVE, a neuro-vector symbolic architecture that combines object-level features with relational representations in high-dimensional spaces, using fast and efficient operations such as bundling (summation) and binding (Hadamard product) allowing both object-level features and relational representations to coexist within the same structure without interfering with one another. RESOLVE is driven by a novel attention mechanism that operates in a bipolar high dimensional space, allowing fast attention score computation compared to the state-of-the-art. By leveraging this design, the model achieves both low compute latency and memory efficiency. RESOLVE also offers better generalizability while achieving higher accuracy in purely relational reasoning tasks such as sorting as well as partial relational reasoning tasks such as math problem-solving compared to state-of-the-art methods.
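
The two vector-symbolic operations named above are easy to demonstrate with random bipolar hypervectors; the sketch below only illustrates bundling and binding, not RESOLVE's attention mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
role = rng.choice([-1, 1], size=d)     # e.g., a relational "role" hypervector
obj_a = rng.choice([-1, 1], size=d)    # object-level feature hypervector A
obj_b = rng.choice([-1, 1], size=d)    # object-level feature hypervector B

bound = role * obj_a                   # binding: Hadamard product associates the role with A
bundle = np.sign(bound + obj_b)        # bundling: summation superposes the pair and B

# Binding is approximately invertible: multiplying by `role` again recovers obj_a.
recovered = bundle * role
similarity = np.dot(recovered, obj_a) / d
print(f"normalized similarity to obj_a: {similarity:.2f}")  # ~0.5, well above chance (~0)
```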

[AI-63] Hashing for Protein Structure Similarity Search

链接: https://arxiv.org/abs/2411.08286
作者: Jin Han,Wu-Jun Li
关键词-EN: structure similarity search, protein function prediction, PSSS, plays a crucial, molecular evolution
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Protein structure similarity search (PSSS), which tries to search proteins with similar structures, plays a crucial role across diverse domains from drug design to protein function prediction and molecular evolution. Traditional alignment-based PSSS methods, which directly calculate alignment on the protein structures, are highly time-consuming with high memory cost. Recently, alignment-free methods, which represent protein structures as fixed-length real-valued vectors, are proposed for PSSS. Although these methods have lower time and memory cost than alignment-based methods, their time and memory cost is still too high for large-scale PSSS, and their accuracy is unsatisfactory. In this paper, we propose a novel method, called protein structure hashing (POSH), for PSSS. POSH learns a binary vector representation for each protein structure, which can dramatically reduce the time and memory cost for PSSS compared with real-valued vector representation based methods. Furthermore, in POSH we also propose expressive hand-crafted features and a structure encoder to well model both node and edge interactions in proteins. Experimental results on real datasets show that POSH can outperform other methods to achieve state-of-the-art accuracy. Furthermore, POSH achieves a memory saving of more than six times and speed improvement of more than four times, compared with other methods.
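
The speed advantage of binary codes comes from Hamming-distance comparisons, as in the toy search below; the codes are random placeholders, whereas POSH learns them from protein structure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_db, n_bits = 100_000, 256
database_codes = rng.integers(0, 2, size=(n_db, n_bits), dtype=np.uint8)  # placeholder codes
query_code = rng.integers(0, 2, size=n_bits, dtype=np.uint8)

# Hamming distance to every database entry, computed in one vectorized pass.
hamming = np.count_nonzero(database_codes != query_code, axis=1)
top10 = np.argsort(hamming)[:10]            # candidate similar structures
print(top10, hamming[top10])
```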

[AI-64] GPTree: Towards Explainable Decision-Making via LLM-powered Decision Trees

链接: https://arxiv.org/abs/2411.08257
作者: Sichao Xiong,Yigit Ihlamur,Fuat Alican,Aaron Ontoyin Yin
关键词-EN: Traditional decision tree, high-dimensional data, struggle with non-linear, limiting its applicability, Traditional decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Traditional decision tree algorithms are explainable but struggle with non-linear, high-dimensional data, limiting its applicability in complex decision-making. Neural networks excel at capturing complex patterns but sacrifice explainability in the process. In this work, we present GPTree, a novel framework combining explainability of decision trees with the advanced reasoning capabilities of LLMs. GPTree eliminates the need for feature engineering and prompt chaining, requiring only a task-specific prompt and leveraging a tree-based structure to dynamically split samples. We also introduce an expert-in-the-loop feedback mechanism to further enhance performance by enabling human intervention to refine and rebuild decision paths, emphasizing the harmony between human expertise and machine intelligence. Our decision tree achieved a 7.8% precision rate for identifying “unicorn” startups at the inception stage of a startup, surpassing gpt-4o with few-shot learning as well as the best human decision-makers (3.1% to 5.6%).

[AI-65] VALTEST: Automated Validation of Language Model Generated Test Cases

链接: https://arxiv.org/abs/2411.08254
作者: Hamed Taherkhani,Hadi Hemmati
关键词-EN: Large Language Models, Large Language, demonstrated significant potential, test cases, generating unit test
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases. However, the validation of LLM-generated test cases remains a challenge, particularly when the ground truth is unavailable. This paper introduces VALTEST, a novel framework designed to automatically validate test cases generated by LLMs by leveraging token probabilities. We evaluate VALTEST using nine test suites generated from three datasets (HumanEval, MBPP, and LeetCode) across three LLMs (GPT-4o, GPT-3.5-turbo, and LLama3.1 8b). By extracting statistical features from token probabilities, we train a machine learning model to predict test case validity. VALTEST increases the validity rate of test cases by 6.2% to 24%, depending on the dataset and LLM. Our results suggest that token probabilities are reliable indicators for distinguishing between valid and invalid test cases, which provides a robust solution for improving the correctness of LLM-generated test cases in software testing. In addition, we found that replacing the identified invalid test cases by VALTEST, using a Chain-of-Thought prompting results in a more effective test suite while keeping the high validity rates.
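
The core idea of turning token probabilities into validity predictions can be sketched as below; the probability sequences and labels are synthetic, and VALTEST's actual feature set and classifier are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(token_probs):
    """Summarize a test case's token probabilities into fixed-length statistics."""
    p = np.asarray(token_probs)
    return [p.mean(), p.min(), p.max(), p.std()]

rng = np.random.default_rng(0)
token_prob_sequences = [rng.uniform(0.2, 1.0, size=30) for _ in range(200)]  # synthetic sequences
X = np.array([features(p) for p in token_prob_sequences])
y = (X[:, 0] > 0.6).astype(int)  # synthetic labels: higher average confidence -> "valid"

clf = LogisticRegression().fit(X, y)
print("predicted validity of first 5 test cases:", clf.predict(X[:5]))
```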

[AI-66] Retrieval Augmented Time Series Forecasting

链接: https://arxiv.org/abs/2411.08249
作者: Kutay Tire,Ege Onur Taga,Muhammed Emrullah Ildiz,Samet Oymak
关键词-EN: modern LLM systems, Retrieval-augmented generation, LLM systems, modern LLM, user queries
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a central component of modern LLM systems, particularly in scenarios where up-to-date information is crucial for accurately responding to user queries or when queries exceed the scope of the training data. The advent of time-series foundation models (TSFM), such as Chronos, and the need for effective zero-shot forecasting performance across various time-series domains motivates the question: Do benefits of RAG similarly carry over to time series forecasting? In this paper, we advocate that the dynamic and event-driven nature of time-series data makes RAG a crucial component of TSFMs and introduce a principled RAG framework for time-series forecasting, called Retrieval Augmented Forecasting (RAF). Within RAF, we develop efficient strategies for retrieving related time-series examples and incorporating them into forecast. Through experiments and mechanistic studies, we demonstrate that RAF indeed improves the forecasting accuracy across diverse time series domains and the improvement is more significant for larger TSFM sizes.
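
A toy version of retrieval-augmented forecasting is sketched below: the most similar historical windows are retrieved and their continuations are averaged. RAF instead feeds the retrieved examples to a time-series foundation model; the synthetic series and window sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 60, 2000)) + 0.1 * rng.standard_normal(2000)
context_len, horizon = 64, 16

query = series[-context_len:]  # the window we want to extend

# Retrieve the k historical windows closest to the query (excluding the query itself).
candidates = []
for start in range(len(series) - 2 * context_len - horizon):
    dist = np.linalg.norm(series[start:start + context_len] - query)
    candidates.append((dist, start))
top_k = sorted(candidates)[:5]

# Use the retrieved continuations to form a naive forecast.
retrieved_futures = [series[s + context_len:s + context_len + horizon] for _, s in top_k]
forecast = np.mean(retrieved_futures, axis=0)
print(forecast[:5])
```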

[AI-67] A Social Outcomes and Priorities centered (SOP) Framework for AI policy

链接: https://arxiv.org/abs/2411.08241
作者: Mohak Shah
关键词-EN: build robust guardrails, risk containment plans, ensuring equitable benefits, Rapid developments, betterment of society
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rapid developments in AI and its adoption across various domains have necessitated a need to build robust guardrails and risk containment plans while ensuring equitable benefits for the betterment of society. The current technology-centered approach has resulted in a fragmented, reactive, and ineffective policy apparatus. This paper highlights the immediate and urgent need to pivot to a society-centered approach to develop comprehensive, coherent, forward-looking AI policy. To this end, we present a Social Outcomes and Priorities centered (SOP) framework for AI policy along with proposals on implementation of its various components. While the SOP framework is presented from a US-centric view, the takeaways are general and applicable globally.

[AI-68] DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection

链接: https://arxiv.org/abs/2411.08227
作者: Shawn Li,Huixian Gong,Hao Dong,Tiankai Yang,Zhengzhong Tu,Yue Zhao
关键词-EN: machine learning models, OOD detection, multimodal OOD detection, OOD, training distribution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is essential for ensuring the robustness of machine learning models by identifying samples that deviate from the training distribution. While traditional OOD detection has primarily focused on single-modality inputs, such as images, recent advances in multimodal models have demonstrated the potential of leveraging multiple modalities (e.g., video, optical flow, audio) to enhance detection performance. However, existing methods often overlook intra-class variability within in-distribution (ID) data, assuming that samples of the same class are perfectly cohesive and consistent. This assumption can lead to performance degradation, especially when prediction discrepancies are uniformly amplified across all samples. To address this issue, we propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection that accounts for intra-class variations. Our method dynamically updates class center representations for each class by measuring the variance of similar samples within each batch, enabling adaptive adjustments. This approach allows us to amplify prediction discrepancies based on the updated class centers, thereby improving the model’s robustness and generalization across different modalities. Extensive experiments on two tasks, five datasets, and nine base OOD algorithms demonstrate that DPU significantly improves OOD detection performance, setting a new state-of-the-art in multimodal OOD detection, with improvements of up to 80 percent in Far-OOD detection. To facilitate accessibility and reproducibility, our code is publicly available on GitHub.
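
A very rough sketch of variance-aware prototype updating is given below: the class center moves toward the batch mean, with a step size damped by the intra-class spread observed in the batch. This only illustrates the idea; DPU's actual update rule and its use across modalities are not reproduced.

```python
import numpy as np

def update_prototype(prototype, batch_feats, base_lr=0.5):
    """Move the class prototype toward the batch mean, damped by intra-class variance."""
    batch_mean = batch_feats.mean(axis=0)
    spread = batch_feats.var(axis=0).mean()       # how dispersed this class is in the batch
    step = base_lr / (1.0 + spread)               # smaller steps when the class is less cohesive
    return (1 - step) * prototype + step * batch_mean

rng = np.random.default_rng(0)
proto = rng.standard_normal(128)                              # current class center
feats = proto + 0.3 * rng.standard_normal((32, 128))          # batch of same-class features
proto = update_prototype(proto, feats)
print(proto[:3])
```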

[AI-69] PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

链接: https://arxiv.org/abs/2411.08212
作者: Yilun Liu,Yunpu Ma,Shuo Chen,Zifeng Ding,Bailan He,Zhen Han,Volker Tresp
关键词-EN: improved resource utilization, paradigm has emerged, resource utilization, powerful approach, approach for scaling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Code available via this https URL

点击查看摘要

Abstract:The Mixture-of-Experts (MoE) paradigm has emerged as a powerful approach for scaling transformers with improved resource utilization. However, efficiently fine-tuning MoE models remains largely underexplored. Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism. Aligning with the core principles and architecture of MoE, our framework encompasses a set of design dimensions including various functional and composition strategies. By combining design choices within our framework, we introduce Parameter-Efficient Routed Fine-Tuning (PERFT) as a flexible and scalable family of PEFT strategies tailored for MoE models. Extensive experiments on adapting OLMoE-1B-7B and Mixtral-8×7B for commonsense and arithmetic reasoning tasks demonstrate the effectiveness, scalability, and intriguing dynamics of PERFT. Additionally, we provide empirical findings for each specific design choice to facilitate better application of MoE and PEFT.

[AI-70] An Explainable Machine Learning Approach for Age and Gender Estimation in Living Individuals Using Dental Biometrics

链接: https://arxiv.org/abs/2411.08195
作者: Mohsin Ali,Haider Raza,John Q Gan,Ariel Pokhojaev,Matanel Katz,Esra Kosan,Dian Agustin Wahjuningrum,Omnina Saleh,Rachel Sarig,Akhilanada Chaurasia
关键词-EN: Gradient Boosting Machine, Tooth Coronal Index, Pulp Cavity Height, Coronal Pulp Cavity, Light Gradient Boosting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Objectives: Age and gender estimation is crucial for various applications, including forensic investigations and anthropological studies. This research aims to develop a predictive system for age and gender estimation in living individuals, leveraging dental measurements such as Coronal Height (CH), Coronal Pulp Cavity Height (CPCH), and Tooth Coronal Index (TCI). Methods: Machine learning models were employed in our study, including Cat Boost Classifier (Catboost), Gradient Boosting Machine (GBM), Ada Boost Classifier (AdaBoost), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LGB), and Extra Trees Classifier (ETC), to analyze dental data from 862 living individuals (459 males and 403 females). Specifically, periapical radiographs from six teeth per individual were utilized, including premolars and molars from both maxillary and mandibular. A novel ensemble learning technique was developed, which uses multiple models each tailored to distinct dental metrics, to estimate age and gender accurately. Furthermore, an explainable AI model has been created utilizing SHAP, enabling dental experts to make judicious decisions based on comprehensible insight. Results: The RF and XGB models were particularly effective, yielding the highest F1 score for age and gender estimation. Notably, the XGB model showed a slightly better performance in age estimation, achieving an F1 score of 73.26%. A similar trend for the RF model was also observed in gender estimation, achieving a F1 score of 77.53%. Conclusions: This study marks a significant advancement in dental forensic methods, showcasing the potential of machine learning to automate age and gender estimation processes with improved accuracy.

[AI-71] TractoEmbed: Modular Multi-level Embedding framework for white matter tract segmentation ICPR

链接: https://arxiv.org/abs/2411.08187
作者: Anoushkrit Goel,Bipanjit Singh,Ankita Joshi,Ranjeet Ranjan Jha,Chirag Ahuja,Aditya Nigam,Arnav Bhavsar
关键词-EN: studying brain structural, brain structural connectivity, neurosurgical planning, White matter tract, matter tract segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at 27th International Conference on Pattern Recognition (ICPR), 2024. 15 pages, 2 figures

点击查看摘要

Abstract:White matter tract segmentation is crucial for studying brain structural connectivity and neurosurgical planning. However, segmentation remains challenging due to issues like class imbalance between major and minor tracts, structural similarity, subject variability, symmetric streamlines between hemispheres etc. To address these challenges, we propose TractoEmbed, a modular multi-level embedding framework, that encodes localized representations through learning tasks in respective encoders. In this paper, TractoEmbed introduces a novel hierarchical streamline data representation that captures maximum spatial information at each level i.e. individual streamlines, clusters, and patches. Experiments show that TractoEmbed outperforms state-of-the-art methods in white matter tract segmentation across different datasets, and spanning various age groups. The modular framework directly allows the integration of additional embeddings in future works.

[AI-72] SCORE: Syntactic Code Representations for Static Script Malware Detection

链接: https://arxiv.org/abs/2411.08182
作者: Ecenaz Erdemir,Kyuhong Park,Michael J. Morais,Vianne R. Gao,Marion Marschalek,Yi Fan
关键词-EN: businesses increasingly adopt, adopt cloud technologies, increasingly adopt cloud, security challenges, businesses increasingly
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As businesses increasingly adopt cloud technologies, they also need to be aware of new security challenges, such as server-side script attacks, to ensure the integrity of their systems and data. These scripts can steal data, compromise credentials, and disrupt operations. Unlike executables with standardized formats (e.g., ELF, PE), scripts are plaintext files with diverse syntax, making them harder to detect using traditional methods. As a result, more sophisticated approaches are needed to protect cloud infrastructures from these evolving threats. In this paper, we propose novel feature extraction and deep learning (DL)-based approaches for static script malware detection, targeting server-side threats. We extract features from plain-text code using two techniques: syntactic code highlighting (SCH) and abstract syntax tree (AST) construction. SCH leverages complex regexes to parse syntactic elements of code, such as keywords, variable names, etc. ASTs generate a hierarchical representation of a program’s syntactic structure. We then propose a sequential and a graph-based model that exploits these feature representations to detect script malware. We evaluate our approach on more than 400K server-side scripts in Bash, Python and Perl. We use a balanced dataset of 90K scripts for training, validation, and testing, with the remaining from 400K reserved for further analysis. Experiments show that our method achieves a true positive rate (TPR) up to 81% higher than leading signature-based antivirus solutions, while maintaining a low false positive rate (FPR) of 0.17%. Moreover, our approach outperforms various neural network-based detectors, demonstrating its effectiveness in learning code maliciousness for accurate detection of script malware.
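
For the AST branch of the feature extraction, a minimal Python-only sketch is shown below, counting node types in a script's abstract syntax tree; SCORE additionally uses regex-based syntactic highlighting and also handles Bash and Perl, which is not covered here.

```python
import ast
from collections import Counter

# A small (benign) example script; a detector would run this over many scripts.
source = """
import os
for name in os.listdir("."):
    print(name)
"""

tree = ast.parse(source)
node_counts = Counter(type(node).__name__ for node in ast.walk(tree))
print(node_counts)  # counts of Import, For, Call, Name, ... nodes as simple syntactic features
```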

[AI-73] Challenges in Guardrailing Large Language Models for Science

链接: https://arxiv.org/abs/2411.08181
作者: Nishan Pantha,Muthukumaran Ramasubramanian,Iksha Gurung,Manil Maskey,Rahul Ramachandran
关键词-EN: offering significant benefits, natural language processing, large language models, processing and understanding, offering significant
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development in large language models (LLMs) has transformed the landscape of natural language processing and understanding (NLP/NLU), offering significant benefits across various domains. However, when applied to scientific research, these powerful models exhibit critical failure modes related to scientific integrity and trustworthiness. Existing general-purpose LLM guardrails are insufficient to address these unique challenges in the scientific domain. We provide comprehensive guidelines for deploying LLM guardrails in the scientific domain. We identify specific challenges – including time sensitivity, knowledge contextualization, conflict resolution, and intellectual property concerns – and propose a guideline framework for the guardrails that can align with scientific needs. These guardrail dimensions include trustworthiness, ethics bias, safety, and legal aspects. We also outline in detail the implementation strategies that employ white-box, black-box, and gray-box methodologies that can be enforced within scientific contexts.

[AI-74] Comprehensive and Comparative Analysis between Transfer Learning and Custom Built VGG and CNN-SVM Models for Wildfire Detection

链接: https://arxiv.org/abs/2411.08171
作者: Aditya V. Jonnalagadda,Hashim A. Hashim,Andrew Harris
关键词-EN: Contemporary Artificial Intelligence, Contemporary Artificial, Artificial Intelligence, Convolutional Neural Network, Residual Neural Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: In Proc. of the 2024 IEEE International Conference On Intelligent Computing in Data Sciences

点击查看摘要

Abstract:Contemporary Artificial Intelligence (AI) and Machine Learning (ML) research places a significant emphasis on transfer learning, showcasing its transformative potential in enhancing model performance across diverse domains. This paper examines the efficiency and effectiveness of transfer learning in the context of wildfire detection. Three purpose-built models – Visual Geometry Group (VGG)-7, VGG-10, and Convolutional Neural Network (CNN)-Support Vector Machine(SVM) CNN-SVM – are rigorously compared with three pretrained models – VGG-16, VGG-19, and Residual Neural Network (ResNet) ResNet101. We trained and evaluated these models using a dataset that captures the complexities of wildfires, incorporating variables such as varying lighting conditions, time of day, and diverse terrains. The objective is to discern how transfer learning performs against models trained from scratch in addressing the intricacies of the wildfire detection problem. By assessing the performance metrics, including accuracy, precision, recall, and F1 score, a comprehensive understanding of the advantages and disadvantages of transfer learning in this specific domain is obtained. This study contributes valuable insights to the ongoing discourse, guiding future directions in AI and ML research. Keywords: Wildfire prediction, deep learning, machine learning fire, detection

[AI-75] Adaptive Meta-Learning for Robust Deepfake Detection: A Multi-Agent Framework to Data Drift and Model Generalization

链接: https://arxiv.org/abs/2411.08148
作者: Dinesh Srivasthav P,Badri Narayan Subudhi
关键词-EN: enabled significant possibilities, Pioneering advancements, content creation, false content, artificial intelligence
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pioneering advancements in artificial intelligence, especially in genAI, have enabled significant possibilities for content creation, but also led to widespread misinformation and false content. The growing sophistication and realism of deepfakes is raising concerns about privacy invasion, identity theft, and has societal, business impacts, including reputational damage and financial loss. Many deepfake detectors have been developed to tackle this problem. Nevertheless, as for every AI model, the deepfake detectors face the wrath of lack of considerable generalization to unseen scenarios and cross-domain deepfakes. Besides, adversarial robustness is another critical challenge, as detectors drastically underperform to the slightest imperceptible change. Most state-of-the-art detectors are trained on static datasets and lack the ability to adapt to emerging deepfake attack trends. These three crucial challenges though hold paramount importance for reliability in practise, particularly in the deepfake domain, are also the problems with any other AI application. This paper proposes an adversarial meta-learning algorithm using task-specific adaptive sample synthesis and consistency regularization, in a refinement phase. By focussing on the classifier’s strengths and weaknesses, it boosts both robustness and generalization of the model. Additionally, the paper introduces a hierarchical multi-agent retrieval-augmented generation workflow with a sample synthesis module to dynamically adapt the model to new data trends by generating custom deepfake samples. The paper further presents a framework integrating the meta-learning algorithm with the hierarchical multi-agent workflow, offering a holistic solution for enhancing generalization, robustness, and adaptability. Experimental results demonstrate the model’s consistent performance across various datasets, outperforming the models in comparison.

[AI-76] Online Collision Risk Estimation via Monocular Depth-Aware Object Detectors and Fuzzy Inference ICRA2025

链接: https://arxiv.org/abs/2411.08060
作者: Brian Hsuan-Cheng Liao,Yingjie Xu,Chih-Hong Cheng,Hasan Esen,Alois Knoll
关键词-EN: monocular camera images, object detector performance, autonomous vehicle, camera images, collision risk
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages (IEEE double column format), 5 figures, 3 tables, submitted to ICRA 2025

点击查看摘要

Abstract:This paper presents a monitoring framework that infers the level of autonomous vehicle (AV) collision risk based on its object detector’s performance using only monocular camera images. Essentially, the framework takes two sets of predictions produced by different algorithms and associates their inconsistencies with the collision risk via fuzzy inference. The first set of predictions is obtained through retrieving safety-critical 2.5D objects from a depth map, and the second set comes from the AV’s 3D object detector. We experimentally validate that, based on Intersection-over-Union (IoU) and a depth discrepancy measure, the inconsistencies between the two sets of predictions strongly correlate to the safety-related error of the 3D object detector against ground truths. This correlation allows us to construct a fuzzy inference system and map the inconsistency measures to an existing collision risk indicator. In particular, we apply various knowledge- and data-driven techniques and find using particle swarm optimization that learns general fuzzy rules gives the best mapping result. Lastly, we validate our monitor’s capability to produce relevant risk estimates with the large-scale nuScenes dataset and show it can safeguard an AV in closed-loop simulations.
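
The two inconsistency signals named above can be illustrated with a single object as below; the box coordinates and depths are invented, and the final threshold rule is only a crude stand-in for the learned fuzzy inference system.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

depth_box = (100, 120, 180, 200)        # 2D box of a safety-critical 2.5D object from the depth map
detector_box = (110, 125, 185, 210)     # projected box from the 3D object detector
depth_discrepancy = abs(14.2 - 15.0)    # metres, illustrative depth estimates

consistency = iou(depth_box, detector_box)
risk = "high" if consistency < 0.3 or depth_discrepancy > 2.0 else "low"
print(f"IoU={consistency:.2f}, depth discrepancy={depth_discrepancy:.1f} m, risk={risk}")
```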

[AI-77] GREI Data Repository AI Taxonomy

链接: https://arxiv.org/abs/2411.08054
作者: John Chodacki(California Digital Library),Mark Hanhel(figshare),Stefano Iacus(Dataverse),Ryan Scherle(Dryad),Eric Olson(Center for Open Science),Nici Pfeiffer(Center for Open Science),Kristi Holmes(Zenodo),Mohammad Hosseini(Zenodo)
关键词-EN: Repository Ecosystem Initiative, Generalist Repository Ecosystem, Ecosystem Initiative, data repository roles, Generalist Repository
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Generalist Repository Ecosystem Initiative (GREI), funded by the NIH, developed an AI taxonomy tailored to data repository roles to guide AI integration across repository management. It categorizes the roles into stages, including acquisition, validation, organization, enhancement, analysis, sharing, and user support, providing a structured framework for implementing AI in repository workflows.

[AI-78] GraphAide: Advanced Graph-Assisted Query and Reasoning System

链接: https://arxiv.org/abs/2411.08041
作者: Sumit Purohit,George Chin,Patrick S Mackey,Joseph A Cottam
关键词-EN: multiple siloed sources, Curating knowledge, multiple siloed, structured and unstructured, Large Language Models
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Curating knowledge from multiple siloed sources that contain both structured and unstructured data is a major challenge in many real-world applications. Pattern matching and querying represent fundamental tasks in modern data analytics that leverage this curated knowledge. The development of such applications necessitates overcoming several research challenges, including data extraction, named entity recognition, data modeling, and designing query interfaces. Moreover, the explainability of these functionalities is critical for their broader adoption. The emergence of Large Language Models (LLMs) has accelerated the development lifecycle of new capabilities. Nonetheless, there is an ongoing need for domain-specific tools tailored to user activities. The creation of digital assistants has gained considerable traction in recent years, with LLMs offering a promising avenue to develop such assistants utilizing domain-specific knowledge and assumptions. In this context, we introduce an advanced query and reasoning system, GraphAide, which constructs a knowledge graph (KG) from diverse sources and allows to query and reason over the resulting KG. GraphAide harnesses both the KG and LLMs to rapidly develop domain-specific digital assistants. It integrates design patterns from retrieval augmented generation (RAG) and the semantic web to create an agentic LLM application. GraphAide underscores the potential for streamlined and efficient development of specialized digital assistants, thereby enhancing their applicability across various domains.

[AI-79] The Universal PDDL Domain

链接: https://arxiv.org/abs/2411.08040
作者: Patrik Haslum,Augusto B. Corrêa
关键词-EN: related problem instances, problem instances, common to distinguish, generally understood, domain
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:In AI planning, it is common to distinguish between planning domains and problem instances, where a “domain” is generally understood as a set of related problem instances. This distinction is important, for example, in generalised planning, which aims to find a single, general plan or policy that solves all instances of a given domain. In PDDL, domains and problem instances are clearly separated: the domain defines the types, predicate symbols, and action schemata, while the problem instance specifies the concrete set of (typed) objects, the initial state, and the goal condition. In this paper, we show that it is quite easy to define a PDDL domain such that any propositional planning problem instance, from any domain, becomes an instance of this (lifted) “universal” domain. We construct different formulations of the universal domain, and discuss their implications for the complexity of lifted domain-dependent or generalised planning.

[AI-80] Interaction Testing in Variation Analysis

链接: https://arxiv.org/abs/2411.08861
作者: Drago Plecko
关键词-EN: explaining scientific phenomena, scientific phenomena, prime importance, importance for explaining, explaining scientific
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Relationships of cause and effect are of prime importance for explaining scientific phenomena. Often, rather than just understanding the effects of causes, researchers also wish to understand how a cause X affects an outcome Y mechanistically, i.e., what are the causal pathways that are activated between X and Y. For analyzing such questions, a range of methods has been developed over decades under the rubric of causal mediation analysis. Traditional mediation analysis focuses on decomposing the average treatment effect (ATE) into direct and indirect effects, and therefore focuses on the ATE as the central quantity. This corresponds to providing explanations for associations in the interventional regime, such as when the treatment X is randomized. Commonly, however, it is of interest to explain associations in the observational regime, and not just in the interventional regime. In this paper, we introduce variation analysis, an extension of mediation analysis that focuses on the total variation (TV) measure between X and Y, written as E[Y | X = x_1] - E[Y | X = x_0]. The TV measure encompasses both causal and confounded effects, as opposed to the ATE which only encompasses causal (direct and mediated) variations. In this way, the TV measure is suitable for providing explanations in the natural regime and answering questions such as "why is X associated with Y?". Our focus is on decomposing the TV measure, in a way that explicitly includes direct, indirect, and confounded variations. Furthermore, we also decompose the TV measure to include interaction terms between these different pathways. Subsequently, interaction testing is introduced, involving hypothesis tests to determine if interaction terms are significantly different from zero. If interactions are not significant, more parsimonious decompositions of the TV measure can be used.

[AI-81] Data-driven Surface Solar Irradiance Estimation using Neural Operators at Global Scale

链接: https://arxiv.org/abs/2411.08843
作者: Alberto Carpentieri,Jussi Leinonen,Jeff Adie,Boris Bonev,Doris Folini,Farah Hariri
关键词-EN: Accurate surface solar, Accurate surface, surface solar irradiance, essential for optimizing, Accurate
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate surface solar irradiance (SSI) forecasting is essential for optimizing renewable energy systems, particularly in the context of long-term energy planning on a global scale. This paper presents a pioneering approach to solar radiation forecasting that leverages recent advancements in numerical weather prediction (NWP) and data-driven machine learning weather models. These advances facilitate long, stable rollouts and enable large ensemble forecasts, enhancing the reliability of predictions. Our flexible model utilizes variables forecast by these NWP and AI weather models to estimate 6-hourly SSI at global scale. Developed using NVIDIA Modulus, our model represents the first adaptive global framework capable of providing long-term SSI forecasts. Furthermore, it can be fine-tuned using satellite data, which significantly enhances its performance in the fine-tuned regions, while maintaining accuracy elsewhere. The improved accuracy of these forecasts has substantial implications for the integration of solar energy into power grids, enabling more efficient energy management and contributing to the global transition to renewable energy sources.

[AI-82] AstroM3: A self-supervised multimodal model for astronomy

链接: https://arxiv.org/abs/2411.08842
作者: Mariia Rizhko,Joshua S. Bloom
关键词-EN: facilitate astronomical inquiry, model inputs tend, primary data source, time series, advanced approaches
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While machine-learned models are now routinely employed to facilitate astronomical inquiry, model inputs tend to be limited to a primary data source (namely images or time series) and, in the more advanced approaches, some metadata. Yet with the growing use of wide-field, multiplexed observational resources, individual sources of interest often have a broad range of observational modes available. Here we construct an astronomical multimodal dataset and propose AstroM^3, a self-supervised pre-training approach that enables a model to learn from multiple modalities simultaneously. Specifically, we extend the CLIP (Contrastive Language-Image Pretraining) model to a trimodal setting, allowing the integration of time-series photometry data, spectra, and astrophysical metadata. In a fine-tuning supervised setting, our results demonstrate that CLIP pre-training improves classification performance for time-series photometry, where accuracy increases from 84.6% to 91.5%. Furthermore, CLIP boosts classification accuracy by up to 12.6% when the availability of labeled data is limited, showing the effectiveness of leveraging larger corpora of unlabeled data. In addition to fine-tuned classification, we can use the trained model in other downstream tasks that are not explicitly contemplated during the construction of the self-supervised model. In particular we show the efficacy of using the learned embeddings for misclassifications identification, similarity search, and anomaly detection. One surprising highlight is the “rediscovery” of Mira subtypes and two Rotational variable subclasses using manifold learning and dimension reduction algorithm. To our knowledge this is the first construction of an n > 2 mode model in astronomy. Extensions to n > 3 modes are naturally anticipated with this approach.

[AI-83] Intelligent Algorithms For Signature Diagnostics Of Three-Phase Motors

链接: https://arxiv.org/abs/2411.08582
作者: Stepan Svirin,Artem Ryzhikov,Saraa Ali,Denis Derkach
关键词-EN: enhance diagnostic performance, machine learning, intelligent diagnosis, diagnosis of three-phase, significantly enhance diagnostic
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The application of machine learning (ML) algorithms in the intelligent diagnosis of three-phase engines has the potential to significantly enhance diagnostic performance and accuracy. Traditional methods largely rely on signature analysis, which, despite being a standard practice, can benefit from the integration of advanced ML techniques. In our study, we innovate by combining state of the art algorithms with a novel unsupervised anomaly generation methodology that takes into account physics model of the engine. This hybrid approach leverages the strengths of both supervised ML and unsupervised signature analysis, achieving superior diagnostic accuracy and reliability along with a wide industrial application. Our experimental results demonstrate that this method significantly outperforms existing ML and non-ML state-of-the-art approaches while retaining the practical advantages of an unsupervised methodology. The findings highlight the potential of our approach to significantly contribute to the field of engine diagnostics, offering a robust and efficient solution for real-world applications.

[AI-84] What Representational Similarity Measures Imply about Decodable Information

链接: https://arxiv.org/abs/2411.08197
作者: Sarah E. Harvey,David Lipshutz,Alex H. Williams
关键词-EN: Neural responses encode, responses encode information, variety of downstream, responses encode, Neural responses
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural responses encode information that is useful for a variety of downstream tasks. A common approach to understand these systems is to build regression models or ``decoders’’ that reconstruct features of the stimulus from neural responses. Popular neural network similarity measures like centered kernel alignment (CKA), canonical correlation analysis (CCA), and Procrustes shape distance, do not explicitly leverage this perspective and instead highlight geometric invariances to orthogonal or affine transformations when comparing representations. Here, we show that many of these measures can, in fact, be equivalently motivated from a decoding perspective. Specifically, measures like CKA and CCA quantify the average alignment between optimal linear readouts across a distribution of decoding tasks. We also show that the Procrustes shape distance upper bounds the distance between optimal linear readouts and that the converse holds for representations with low participation ratio. Overall, our work demonstrates a tight link between the geometry of neural representations and the ability to linearly decode information. This perspective suggests new ways of measuring similarity between neural systems and also provides novel, unifying interpretations of existing measures.
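
One of the measures discussed, linear centered kernel alignment, is short enough to sketch directly; the random matrices below are only for demonstration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between response matrices X and Y (samples x features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
Y = X @ rng.standard_normal((50, 30)) + 0.1 * rng.standard_normal((200, 30))  # near-linear map of X
print(f"CKA(X, Y) = {linear_cka(X, Y):.3f}")  # close to 1 when Y is (almost) a linear readout of X
```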

[AI-85] MatPilot: an LLM-enabled AI Materials Scientist under the Framework of Human-Machine Collaboration

链接: https://arxiv.org/abs/2411.08063
作者: Ziqi Ni,Yahao Li,Kaijia Hu,Kunyuan Han,Ming Xu,Xingyu Chen,Fengqi Liu,Yicong Ye,Shuxin Bai
关键词-EN: presents unprecedented opportunities, materials science research, artificial intelligence, presents unprecedented, rapid evolution
类目: Physics and Society (physics.soc-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid evolution of artificial intelligence, particularly large language models, presents unprecedented opportunities for materials science research. We proposed and developed an AI materials scientist named MatPilot, which has shown encouraging abilities in the discovery of new materials. The core strength of MatPilot is its natural language interactive human-machine collaboration, which augments the research capabilities of human scientist teams through a multi-agent system. MatPilot integrates unique cognitive abilities, extensive accumulated experience, and ongoing curiosity of human-beings with the AI agents’ capabilities of advanced abstraction, complex knowledge storage and high-dimensional information processing. It could generate scientific hypotheses and experimental schemes, and employ predictive models and optimization algorithms to drive an automated experimental platform for experiments. It turns out that our system demonstrates capabilities for efficient validation, continuous learning, and iterative optimization.

计算机视觉

[CV-0] Multimodal Instruction Tuning with Hybrid State Space Models

链接: https://arxiv.org/abs/2411.08840
作者: Jianing Zhou,Han Li,Shuai Zhang,Ning Xie,Ruijie Wang,Xiaohan Nie,Sheng Liu,Lingyun Wang
关键词-EN: Handling lengthy context, Handling lengthy, high frame rate, large language models, multimodal large language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Handling lengthy context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs) in applications such as processing high-resolution images or high frame rate videos. The rise in image resolution and frame rate substantially increases computational demands due to the increased number of input tokens. This challenge is further exacerbated by the quadratic complexity with respect to sequence length of the self-attention mechanism. Most prior works either pre-train models with long contexts, overlooking the efficiency problem, or attempt to reduce the context length via downsampling (e.g., identify the key image patches or frames) to decrease the context length, which may result in information loss. To circumvent this issue while keeping the remarkable effectiveness of MLLMs, we propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications. Our multimodal model can effectively process long context input exceeding 100k tokens, outperforming existing models across various benchmarks. Remarkably, our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models, with efficiency gains increasing as image resolution or video frames rise. Furthermore, our model is the first to be trained on low-resolution images or low-frame-rate videos while being capable of inference on high-resolution images and high-frame-rate videos, offering flexibility for inference in diverse scenarios.

[CV-1] LUDO: Low-Latency Understanding of Highly Deformable Objects using Point Cloud Occupancy Functions

链接: https://arxiv.org/abs/2411.08777
作者: Pit Henrich,Franziska Mathis-Ullrich,Paul Maria Scheikl
关键词-EN: Accurately determining, require precise targeting, deformable objects, determining the shape, shape and location
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurately determining the shape and location of internal structures within deformable objects is crucial for medical tasks that require precise targeting, such as robotic biopsies. We introduce LUDO, a method for accurate low-latency understanding of deformable objects. LUDO reconstructs objects in their deformed state, including their internal structures, from a single-view point cloud observation in under 30 ms using occupancy networks. We demonstrate LUDO’s abilities for autonomous targeting of internal regions of interest (ROIs) in highly deformable objects. Additionally, LUDO provides uncertainty estimates and explainability for its predictions, both of which are important in safety-critical applications such as surgical interventions. We evaluate LUDO in real-world robotic experiments, achieving a success rate of 98.9% for puncturing various ROIs inside highly deformable objects. LUDO demonstrates the potential to interact with deformable objects without the need for deformable registration methods.

[CV-2] Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

链接: https://arxiv.org/abs/2411.08756
作者: Yangyang Li,Xuanting Hao,Ronghua Shang,Licheng Jiao
关键词-EN: masked image modeling, self-supervised learning share, representative self-supervised learning, integrated representative self-supervised, self-supervised learning paradigms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:In view of the fact that semi- and self-supervised learning share a fundamental principle, effectively modeling knowledge from unlabeled data, various semi-supervised semantic segmentation methods have integrated representative self-supervised learning paradigms for further regularization. However, the potential of the state-of-the-art generative self-supervised paradigm, masked image modeling, has been scarcely studied. This paradigm learns knowledge by establishing connections between the masked and visible parts of the masked image during the pixel reconstruction process. By inheriting and extending this insight, we successfully leverage masked image modeling to boost semi-supervised semantic segmentation. Specifically, we introduce a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes. In this way, the mask-induced connections are established within each class, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling. To strengthen these intra-class connections, we further develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class. Additionally, in semantic space, we explore the application of masked image modeling to enhance regularization. Extensive experiments conducted on well-known benchmarks demonstrate that our approach achieves state-of-the-art performance. The code will be available at this https URL.

[CV-3] Weakly-Supervised Anomaly Detection in Surveillance Videos Based on Two-Stream I3D Convolution Network

链接: https://arxiv.org/abs/2411.08755
作者: Sareh Soltani Nejad,Anwar Haque
关键词-EN: enhanced public safety, ensure enhanced public, anomaly detection, Convolutional Networks, public safety
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:The widespread implementation of urban surveillance systems has necessitated more sophisticated techniques for anomaly detection to ensure enhanced public safety. This paper presents a significant advancement in the field of anomaly detection through the application of Two-Stream Inflated 3D (I3D) Convolutional Networks. These networks substantially outperform traditional 3D Convolutional Networks (C3D) by more effectively extracting spatial and temporal features from surveillance videos, thus improving the precision of anomaly detection. Our research advances the field by implementing a weakly supervised learning framework based on Multiple Instance Learning (MIL), which uniquely conceptualizes surveillance videos as collections of ‘bags’ that contain instances (video clips). Each instance is innovatively processed through a ranking mechanism that prioritizes clips based on their potential to display anomalies. This novel strategy not only enhances the accuracy and precision of anomaly detection but also significantly diminishes the dependency on extensive manual annotations. Moreover, through meticulous optimization of model settings, including the choice of optimizer, our approach not only establishes new benchmarks in the performance of anomaly detection systems but also offers a scalable and efficient solution for real-world surveillance applications. This paper contributes significantly to the field of computer vision by delivering a more adaptable, efficient, and context-aware anomaly detection system, which is poised to redefine practices in urban surveillance.
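
为便于理解摘要中基于多示例学习(MIL)的排序机制,下面给出一个极简示意(PyTorch 写法;每段视频的片段数、margin 与正则项权重均为笔者假设,并非论文官方实现):

```python
import torch

def mil_ranking_loss(scores_anomalous, scores_normal, margin=1.0):
    """MIL ranking loss sketch: the highest-scoring clip in an anomalous
    video (bag) should score higher than the highest-scoring clip in a
    normal video, by at least `margin`."""
    max_anom = scores_anomalous.max()   # most anomalous clip in the positive bag
    max_norm = scores_normal.max()      # most anomalous-looking clip in the negative bag
    ranking = torch.relu(margin - max_anom + max_norm)
    # smoothness and sparsity terms are commonly added on the positive bag
    smoothness = ((scores_anomalous[1:] - scores_anomalous[:-1]) ** 2).sum()
    sparsity = scores_anomalous.sum()
    return ranking + 8e-5 * smoothness + 8e-5 * sparsity

# usage: per-clip anomaly scores (e.g. 32 clips per video) from the I3D feature + scoring head
loss = mil_ranking_loss(torch.rand(32, requires_grad=True), torch.rand(32))
loss.backward()
```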

[CV-4] Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos

链接: https://arxiv.org/abs/2411.08753
作者: Sagnik Majumder,Tushar Nagarajan,Ziad Al-Halah,Reina Pradhan,Kristen Grauman
关键词-EN: multi-view video, instructional multi-view video, human observer, multi-view, Existing methods rely
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video – no language or camera poses – and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
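
摘要的核心假设是"字幕预测越准确的视角信息量越大"。下面用 NumPy 给出把各视角字幕损失转成最佳视角伪标签的极简示意(margin 阈值与变量名均为笔者假设,仅演示思路):

```python
import numpy as np

def best_view_pseudo_labels(caption_losses, margin=0.1):
    """caption_losses: array of shape (T, V) -- per-timestep caption
    prediction loss for each of V camera views.  The view with the
    lowest loss (highest relative accuracy) becomes the pseudo-label;
    timesteps where no view is clearly better are left unlabeled (-1)."""
    best = caption_losses.argmin(axis=1)
    sorted_losses = np.sort(caption_losses, axis=1)
    confident = (sorted_losses[:, 1] - sorted_losses[:, 0]) > margin
    return np.where(confident, best, -1)

losses = np.random.rand(8, 4)          # 8 timesteps, 4 views (toy example)
print(best_view_pseudo_labels(losses))
```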

[CV-5] Retrieval Augmented Recipe Generation WACV

链接: https://arxiv.org/abs/2411.08715
作者: Guoshan Liu,Hailong Yin,Bin Zhu,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
关键词-EN: garnered significant attention, recent years, potential applications, area has garnered, garnered significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCEPT on IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Given the potential applications of generating recipes from food images, this area has garnered significant attention from researchers in recent years. Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients. Large Multi-modal Models (LMMs), which have achieved notable success across a variety of vision and language tasks, shed light on generating both ingredients and instructions directly from images. Nevertheless, LMMs still face the common issue of hallucinations during recipe generation, leading to suboptimal performance. To tackle this, we propose a retrieval augmented large multimodal model for recipe generation. We first introduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieve recipes semantically related to the image from an existing datastore as a supplement, integrating them into the prompt to add diverse and rich context to the input image. Additionally, a Self-Consistency Ensemble Voting mechanism is proposed to determine the most confident recipe prediction as the final output. It calculates the consistency among generated recipe candidates, which use different retrieved recipes as context for generation. Extensive experiments validate the effectiveness of our proposed method, which demonstrates state-of-the-art (SOTA) performance in recipe generation tasks on the Recipe1M dataset.
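
摘要中的 Self-Consistency Ensemble Voting 思路是:用不同检索菜谱作为上下文生成多个候选,再选出与其余候选最一致的一个作为最终输出。下面用简单的词重叠相似度做一个示意(相似度度量为笔者替代的假设,并非论文实际使用的一致性计算方式):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets -- a stand-in for whatever
    consistency metric the paper actually uses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def self_consistency_vote(candidates):
    """Pick the candidate recipe most consistent with all the others."""
    def mean_consistency(i):
        others = [token_overlap(candidates[i], c)
                  for j, c in enumerate(candidates) if j != i]
        return sum(others) / max(len(others), 1)
    return max(range(len(candidates)), key=mean_consistency)

candidates = [
    "whisk eggs add flour and sugar bake 30 minutes",
    "whisk eggs add flour sugar and butter bake 30 minutes",
    "boil noodles add soy sauce",
]
print(self_consistency_vote(candidates))   # index of the most consistent recipe
```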

[CV-6] High-resolution optical and acoustic remote sensing datasets of the Puck Lagoon, Southern Baltic

链接: https://arxiv.org/abs/2411.08712
作者: Łukasz Janowski,Dimitrios Skarlatos,Panagiotis Agrafiotis,Paweł Tysiąc,Andrzej Pydyn,Mateusz Popek,Anna M. Kotarba-Morley,Gottfried Mandlburger,Łukasz Gajewski,Mateusz Kołakowski,Alexandra Papadaki,Juliusz Gajewski
关键词-EN: southern Baltic Sea, shallow marine basin, Baltic Sea, hosts valuable benthic, coast of Poland
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The very shallow marine basin of Puck Lagoon in the southern Baltic Sea, on the Northern coast of Poland, hosts valuable benthic habitats and cultural heritage sites. These include, among others, protected Zostera marina meadows, one of the Baltic’s major medieval harbours, a ship graveyard, and likely other submerged features that are yet to be discovered. Prior to this project, no comprehensive high-resolution remote sensing data were available for this area. This article describes the first Digital Elevation Models (DEMs) derived from a combination of airborne bathymetric LiDAR, multibeam echosounder, airborne photogrammetry and satellite imagery. These datasets also include multibeam echosounder backscatter and LiDAR intensity, allowing determination of the character and properties of the seafloor. Combined, these datasets are a vital resource for assessing and understanding seafloor morphology, benthic habitats, cultural heritage, and submerged landscapes. Given the significance of Puck Lagoon’s hydrographical, ecological, geological, and archaeological environs, the high-resolution bathymetry, acquired by our project, can provide the foundation for sustainable management and informed decision-making for this area of interest.

[CV-7] OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Geometric and Semantic Guidances

链接: https://arxiv.org/abs/2411.08665
作者: Youqi Liao,Xieyuanli Chen,Shuhao Kang,Jianping Li,Zhen Dong,Hongchao Fan,Bisheng Yang
关键词-EN: vectorized map data, volunteered geographic information, nearby visual observations, matching nearby visual, online and versatile
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, technical report

点击查看摘要

Abstract:OpenStreetMap (OSM), an online and versatile source of volunteered geographic information (VGI), is widely used for human self-localization by matching nearby visual observations with vectorized map data. However, due to the divergence in modalities and views, image-to-OSM (I2O) matching and localization remain challenging for robots, preventing the full utilization of VGI data in unmanned ground vehicles and the logistics industry. Inspired by the fact that the human brain relies on geometric and semantic understanding of sensory information for spatial localization tasks, we propose OSMLoc in this paper. OSMLoc is a brain-inspired single-image visual localization method with semantic and geometric guidance to improve accuracy, robustness, and generalization ability. First, we equip OSMLoc with a visual foundation model to extract powerful image features. Second, a geometry-guided depth distribution adapter is proposed to bridge the monocular depth estimation and camera-to-BEV transform. Third, the semantic embeddings from the OSM data are utilized as auxiliary guidance for image-to-OSM feature matching. To validate the proposed OSMLoc, we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation. Experiments on the MGL dataset, CC validation benchmark, and KITTI dataset have demonstrated the superiority of our method. Code, pre-trained models, CC validation benchmark, and additional results are available at: this https URL

[CV-8] Toward Human Understanding with Controllable Synthesis

链接: https://arxiv.org/abs/2411.08663
作者: Hanz Cuevas-Velasquez,Priyanka Patel,Haiwen Feng,Michael Black
关键词-EN: estimation requires diverse, ground truth, requires diverse training, accurate ground truth, ground
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Training methods for robust 3D human pose and shape (HPS) estimation require diverse training images with accurate ground truth. While BEDLAM demonstrates the potential of traditional procedural graphics to generate such data, the training images are clearly synthetic. In contrast, generative image models produce highly realistic images but without ground truth. Putting these methods together seems straightforward: use a generative model with the body ground truth as a controlling signal. However, we find that the more realistic the generated images, the more they deviate from the ground truth, making them inappropriate for training and evaluation. Enhancements of realistic details, such as clothing and facial expressions, can lead to subtle yet significant deviations from the ground truth, potentially misleading training models. We empirically verify that this misalignment causes the accuracy of HPS networks to decline when trained with generated images. To address this, we design a controllable synthesis method that effectively balances image realism with precise ground truth. We use this to create the Generative BEDLAM (Gen-B) dataset, which improves the realism of the existing synthetic BEDLAM dataset while preserving ground truth accuracy. We perform extensive experiments, with various noise-conditioning strategies, to evaluate the tradeoff between visual realism and HPS accuracy. We show, for the first time, that generative image models can be controlled by traditional graphics methods to produce training data that increases the accuracy of HPS methods.

[CV-9] MikuDance: Animating Character Art with Mixed Motion Dynamics

链接: https://arxiv.org/abs/2411.08656
作者: Jiaxu Zhang,Xianfang Zeng,Xin Chen,Wei Zuo,Gang Yu,Zhigang Tu
关键词-EN: pipeline incorporating mixed, diffusion-based pipeline incorporating, incorporating mixed motion, Mixed Motion Modeling, animate stylized character
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate stylized character art. MikuDance consists of two key techniques: Mixed Motion Modeling and Mixed-Control Diffusion, to address the challenges of high-dynamic motion and reference-guidance misalignment in character art animation. Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling. Building on this, the Mixed-Control Diffusion implicitly aligns the scale and body shape of diverse characters with motion guidance, allowing flexible control of local character motion. Subsequently, a Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation. Through extensive experiments, we demonstrate the effectiveness and generalizability of MikuDance across various character art and motion guidance, consistently producing high-quality animations with remarkable motion dynamics.

[CV-10] Zero-shot capability of SAM-family models for bone segmentation in CT scans

链接: https://arxiv.org/abs/2411.08629
作者: Caroline Magg,Hoel Kervadec,Clara I. Sánchez
关键词-EN: similar models build, promptable foundation models, build a family, family of promptable, promptable foundation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM) and similar models build a family of promptable foundation models (FMs) for image and video segmentation. The object of interest is identified using prompts, such as bounding boxes or points. With these FMs becoming part of medical image segmentation, extensive evaluation studies are required to assess their strengths and weaknesses in clinical settings. Since the performance is highly dependent on the chosen prompting strategy, it is important to investigate different prompting techniques to define optimal guidelines that ensure effective use in medical image segmentation. Currently, no dedicated evaluation studies exist specifically for bone segmentation in CT scans, leaving a gap in understanding the performance for this task. Thus, we use non-iterative, "optimal" prompting strategies composed of bounding boxes, points, and their combinations to test the zero-shot capability of SAM-family models for bone CT segmentation on three different skeletal regions. Our results show that the best settings depend on the model type and size, dataset characteristics and objective to optimize. Overall, SAM and SAM2 prompted with a bounding box in combination with the center point for all the components of an object yield the best results across all tested settings. As the results depend on multiple factors, we provide a guideline for informed decision-making in 2D prompting with non-interactive, "optimal" prompts.
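
摘要的结论是"边界框 + 目标各组成部分中心点"的组合提示效果最好。下面给出用 segment-anything 官方接口按该策略做零样本推理的示意(vit_b 模型、权重路径以及框/点坐标均为笔者假设):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# assumed checkpoint path / model size
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # a CT slice rendered as an RGB image
predictor.set_image(image)

box = np.array([100, 120, 300, 380])               # assumed xyxy bounding box of the bone
center = np.array([[200, 250]])                    # assumed center point of the component
masks, scores, _ = predictor.predict(
    point_coords=center,
    point_labels=np.array([1]),                    # 1 = foreground point
    box=box,
    multimask_output=False,
)
```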

[CV-11] LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation ECCV2024

链接: https://arxiv.org/abs/2411.08606
作者: Pengwei Yin,Jingjing Wang,Guanzhong Zeng,Di Xie,Jiang Zhu
关键词-EN: gaze estimation, significantly hindered, factors unrelated, training dataset, gaze
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:The ability of gaze estimation models to generalize is often significantly hindered by various factors unrelated to gaze, especially when the training dataset is limited. Current strategies aim to address this challenge through different domain generalization techniques, yet they have had limited success due to the risk of overfitting when solely relying on value labels for regression. Recent progress in pre-trained vision-language models has motivated us to capitalize on the abundant semantic information available. We propose a novel approach in this paper, reframing the gaze estimation task as a vision-language alignment issue. Our proposed framework, named Language-Guided Gaze Estimation (LG-Gaze), learns continuous and geometry-sensitive features for gaze estimation, benefiting from the rich prior knowledge of vision-language models. Specifically, LG-Gaze aligns gaze features with continuous linguistic features through our proposed multimodal contrastive regression loss, which customizes adaptive weights for different negative samples. Furthermore, to better adapt to the labels for the gaze estimation task, we propose a geometry-aware interpolation method to obtain more precise gaze embeddings. Through extensive experiments, we validate the efficacy of our framework in four different cross-domain evaluation tasks.

[CV-12] Generalized Pose Space Embeddings for Training In-the-Wild using Analysis-by-Synthesis

链接: https://arxiv.org/abs/2411.08603
作者: Dominik Borer,Jakob Buhmann,Martin Guay
关键词-EN: Modern pose estimation, Modern pose, pose estimation models, manually-labelled datasets, real world
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Modern pose estimation models are trained on large, manually-labelled datasets which are costly and may not cover the full extent of human poses and appearances in the real world. With advances in neural rendering, analysis-by-synthesis and the ability to not only predict, but also render the pose, is becoming an appealing framework, which could alleviate the need for large scale manual labelling efforts. While recent work has shown the feasibility of this approach, the predictions admit many flips due to a simplistic intermediate skeleton representation, resulting in low precision and inhibiting the acquisition of any downstream knowledge such as three-dimensional positioning. We solve this problem with a more expressive intermediate skeleton representation capable of capturing the semantics of the pose (left and right), which significantly reduces flips. To successfully train this new representation, we extend the analysis-by-synthesis framework with a training protocol based on synthetic data. We show that our representation results in fewer flips and more accurate predictions. Our approach outperforms previous models trained with analysis-by-synthesis on standard benchmarks.

[CV-13] Slender Object Scene Segmentation in Remote Sensing Image Based on Learnable Morphological Skeleton with Segment Anything Model

链接: https://arxiv.org/abs/2411.08592
作者: Jun Xie,Wenxiao Li,Faqiang Wang,Liqiang Zhang,Zhengyang Hou,Jun Liu
关键词-EN: small structural details, sensing image processing, preserve small structural, Morphological methods play, morphological skeleton prior
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Morphological methods play a crucial role in remote sensing image processing, due to their ability to capture and preserve small structural details. However, most of the existing deep learning models for semantic segmentation are based on the encoder-decoder architecture, including U-net and the Segment Anything Model (SAM), where the downsampling process tends to discard fine details. In this paper, we propose a new approach that integrates a learnable morphological skeleton prior into deep neural networks using the variational method. To address the difficulty in backpropagation in neural networks caused by the non-differentiability present in classical morphological operations, we provide a smooth representation of the morphological skeleton and design a variational segmentation model integrating the morphological skeleton prior by employing operator splitting and dual methods. Then, we integrate this model into the network architecture of SAM, which is achieved by adding a token to the mask decoder and modifying the final sigmoid layer, ensuring the final segmentation results preserve the skeleton structure as much as possible. Experimental results on remote sensing datasets, including buildings and roads, demonstrate that our method outperforms the original SAM on slender object segmentation and exhibits better generalization capability.

[CV-14] NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation

链接: https://arxiv.org/abs/2411.08579
作者: Youzhi Liu,Fanglong Yao,Yuanchang Yue,Guangluan Xu,Xian Sun,Kun Fu
关键词-EN: natural language commands, widely discussed research, discussed research direction, enable embodied agents, complicated visual environments
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN), as a widely discussed research direction in embodied intelligence, aims to enable embodied agents to navigate in complicated visual environments through natural language commands. Most existing VLN methods focus on indoor ground robot scenarios. However, UAV VLN in outdoor urban scenes faces two significant challenges. First, urban scenes contain numerous objects, which makes it challenging to match fine-grained landmarks in images with complex textual descriptions of these landmarks. Second, overall environmental information encompasses multiple modal dimensions, and the diversity of representations significantly increases the complexity of the encoding process. To address these challenges, we propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model. NavAgent undertakes navigation tasks by synthesizing multi-scale environmental information, including topological maps (global), panoramas (medium), and fine-grained landmarks (local). Specifically, we utilize GLIP to build a visual landmark recognizer capable of identifying and linguisticizing fine-grained landmarks. Subsequently, we develop a dynamically growing scene topology map that integrates environmental information and employ Graph Convolutional Networks to encode global environmental data. In addition, to train the visual landmark recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes. In experiments conducted on the Touchdown and Map2seq datasets, NavAgent outperforms strong baseline models. The code and dataset will be released to the community to facilitate the exploration and development of outdoor VLN.

[CV-15] UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation

链接: https://arxiv.org/abs/2411.08569
作者: Chengyuan Zhang,Yilin Zhang,Lei Zhu,Deyin Liu,Lin Wu,Bo Li,Shichao Zhang,Mohammed Bennamoun,Farid Boussaid
关键词-EN: few-shot object detection, Transformer architecture, unified incremental few-shot, incremental few-shot object, instance segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available, with no access to training data for base or old classes, while maintaining high performance across both base and novel classes. To achieve this, we extend Mask-DINO into a two-stage incremental learning framework. Stage 1 focuses on optimizing the model using the base dataset, while Stage 2 involves fine-tuning the model on novel classes. In addition, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates over-fitting when learning novel classes. Furthermore, we implement knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state-of-the-art approaches.

[CV-16] Saliency Map-based Image Retrieval using Invariant Krawtchouk Moments

链接: https://arxiv.org/abs/2411.08567
作者: Ashkan Nejad,Mohammad Reza Faraji,Xiaojun Qi
关键词-EN: digital devices equipped, feature extraction techniques, Internet technology, numerous content-based image, development of Internet
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the widespread adoption of digital devices equipped with cameras and the rapid development of Internet technology, numerous content-based image retrieval systems and novel image feature extraction techniques have emerged in recent years. This paper introduces a saliency map-based image retrieval approach using invariant Krawtchouk moments (SM-IKM) to enhance retrieval speed and accuracy. The proposed method applies a global contrast-based salient region detection algorithm to create a saliency map that effectively isolates the foreground from the background. It then combines multiple orders of invariant Krawtchouk moments (IKM) with local binary patterns (LBPs) and color histograms to comprehensively represent the foreground and background. Additionally, it incorporates LBPs derived from the saliency map to improve discriminative power, facilitating more precise image differentiation. A bag-of-visual-words (BoVW) model is employed to generate a codebook for classification and discrimination. By using compact IKMs in the BoVW framework and integrating a range of region-based features, including color histograms, LBPs, and saliency map-enhanced LBPs, our proposed SM-IKM achieves efficient and accurate image retrieval. Extensive experiments on publicly available datasets, such as Caltech 101 and Wang, demonstrate that SM-IKM outperforms recent state-of-the-art retrieval methods. The source code for SM-IKM is available at this http URL.

[CV-17] APDDv2: Aesthetics of Paintings and Drawings Dataset with Artist Labeled Scores and Comments

链接: https://arxiv.org/abs/2411.08545
作者: Xin Jin,Qianqian Qiao,Yi Lu,Huaye Wang,Heng Huang,Shan Gao,Jianfei Liu,Rui Li
关键词-EN: diverse image samples, training visual models, training visual, abstract understandings, visual features
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Datasets play a pivotal role in training visual models, facilitating the development of abstract understandings of visual features through diverse image samples and multidimensional attributes. However, in the realm of aesthetic evaluation of artistic images, datasets remain relatively scarce. Existing painting datasets are often characterized by limited scoring dimensions and insufficient annotations, thereby constraining the advancement and application of automatic aesthetic evaluation methods in the domain of painting. To bridge this gap, we introduce the Aesthetics Paintings and Drawings Dataset (APDD), the first comprehensive collection of paintings encompassing 24 distinct artistic categories and 10 aesthetic attributes. Building upon the initial release of APDDv1, our ongoing research has identified opportunities for enhancement in data scale and annotation precision. Consequently, APDDv2 boasts an expanded image corpus and improved annotation quality, featuring detailed language comments to better cater to the needs of both researchers and practitioners seeking high-quality painting datasets. Furthermore, we present an updated version of the Art Assessment Network for Specific Painting Styles, denoted as ArtCLIP. Experimental validation demonstrates the superior performance of this revised model in the realm of aesthetic evaluation, surpassing its predecessor in accuracy and efficacy. The dataset and model are available at this https URL.

[CV-18] Classification and Morphological Analysis of DLBCL Subtypes in H&E-Stained Slides

链接: https://arxiv.org/abs/2411.08531
作者: Ravi Kant Gupta,Mohit Jindal,Garima Jain,Epari Sridhar,Subhash Yadav,Hasmukh Jain,Tanuja Shet,Uma Sakhdeo,Manju Sengar,Lingaraj Nayak,Bhausaheb Bagal,Umesh Apkare,Amit Sethi
关键词-EN: large B-cell lymphoma, diffuse large B-cell, B-cell lymphoma, large B-cell, germinal center
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We address the challenge of automated classification of diffuse large B-cell lymphoma (DLBCL) into its two primary subtypes: activated B-cell-like (ABC) and germinal center B-cell-like (GCB). Accurate classification between these subtypes is essential for determining the appropriate therapeutic strategy, given their distinct molecular profiles and treatment responses. Our proposed deep learning model demonstrates robust performance, achieving an average area under the curve (AUC) of (87.4 ± 5.7)% during cross-validation. It shows a high positive predictive value (PPV), highlighting its potential for clinical application, such as triaging for molecular testing. To gain biological insights, we performed an analysis of morphological features of ABC and GCB subtypes. We segmented cell nuclei using a pre-trained deep neural network and compared the statistics of geometric and color features for ABC and GCB. We found that the distributions of these features were not very different for the two subtypes, which suggests that the visual differences between them are more subtle. These results underscore the potential of our method to assist in more precise subtype classification and can contribute to improved treatment management and outcomes for patients with DLBCL.

[CV-19] Efficient Whole Slide Image Classification through Fisher Vector Representation

链接: https://arxiv.org/abs/2411.08530
作者: Ravi Kant Gupta,Dadi Dharani,Shambhavi Shanker,Amit Sethi
关键词-EN: enhance diagnostic precision, significantly enhance diagnostic, precision and efficiency, enhance diagnostic, diagnostic precision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The advancement of digital pathology, particularly through computational analysis of whole slide images (WSI), is poised to significantly enhance diagnostic precision and efficiency. However, the large size and complexity of WSIs make it difficult to analyze and classify them using computers. This study introduces a novel method for WSI classification by automating the identification and examination of the most informative patches, thus eliminating the need to process the entire slide. Our method involves two stages: first, it extracts only a few patches from the WSIs based on their pathological significance; second, it employs Fisher vectors (FVs), known for their robustness in capturing fine-grained details, to represent the features extracted from these patches. This approach not only accentuates key pathological features within the WSI representation but also significantly reduces computational overhead, thus making the process more efficient and scalable. We have rigorously evaluated the proposed method across multiple datasets to benchmark its performance against comprehensive WSI analysis and contemporary weakly-supervised learning methodologies. The empirical results indicate that our focused analysis of select patches, combined with Fisher vector representation, not only aligns with, but at times surpasses, the classification accuracy of standard practices. Moreover, this strategy notably diminishes computational load and resource expenditure, thereby establishing an efficient and precise framework for WSI analysis in the realm of digital pathology.
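
下面给出"关键 patch 特征 → Fisher 向量聚合为 WSI 级表示"这一步的简化示意(仅保留一阶均值偏差项;GMM 组件数、特征维度等均为笔者假设,并非论文完整实现):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_first_order(patch_feats, gmm):
    """Simplified Fisher vector: first-order (mean-deviation) terms only.
    patch_feats: (N, D) features of the selected informative patches."""
    n, _ = patch_feats.shape
    gamma = gmm.predict_proba(patch_feats)               # (N, K) soft assignments
    fv = []
    for k in range(gmm.n_components):
        diff = (patch_feats - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        v = (gamma[:, k:k + 1] * diff).sum(axis=0)
        fv.append(v / (n * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(fv)                              # (K * D,) slide-level descriptor
    return fv / (np.linalg.norm(fv) + 1e-12)             # L2 normalization

feats = np.random.randn(200, 64)                         # toy patch features
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(feats)
wsi_descriptor = fisher_vector_first_order(feats, gmm)
```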

[CV-20] BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

链接: https://arxiv.org/abs/2411.08508
作者: David Svitov,Pietro Morerio,Lourdes Agapito,Alessio Del Bue
关键词-EN: present billboard Splatting, scene representation based, textured geometric primitives, present billboard, representation based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present billboard Splatting (BBSplat) - a novel approach for 3D scene representation based on textured geometric primitives. BBSplat represents the scene as a set of optimizable textured planar primitives with learnable RGB textures and alpha-maps to control their shape. BBSplat primitives can be used in any Gaussian Splatting pipeline as drop-in replacements for Gaussians. Our method’s qualitative and quantitative improvements over 3D and 2D Gaussians are most noticeable when fewer primitives are used, where BBSplat achieves over 1200 FPS. Our novel regularization term encourages textures to have a sparser structure, unlocking an efficient compression that leads to a reduction in storage space of the model. Our experiments show the efficiency of BBSplat on standard datasets of real indoor and outdoor scenes such as Tanks&Temples, DTU, and Mip-NeRF-360. We demonstrate improvements on PSNR, SSIM, and LPIPS metrics compared to the state-of-the-art, especially for the case when fewer primitives are used, which, on the other hand, leads to up to 2 times inference speed improvement for the same rendering quality.

[CV-21] Impact of Iris Pigmentation on Performance Bias in Visible Iris Verification Systems: A Comparative Study

链接: https://arxiv.org/abs/2411.08490
作者: Geetanjali Sharma,Abhishek Tandon,Gaurav Jaswal,Aditya Nigam,Raghavendra Ramachandra
关键词-EN: recognition technology plays, biometric identification systems, Equal Error Rate, True Match Rate, iris pigmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 5 figures, 5 Tables

点击查看摘要

Abstract:Iris recognition technology plays a critical role in biometric identification systems, but their performance can be affected by variations in iris pigmentation. In this work, we investigate the impact of iris pigmentation on the efficacy of biometric recognition systems, focusing on a comparative analysis of blue and dark irises. Data sets were collected using multiple devices, including P1, P2, and P3 smartphones [4], to assess the robustness of the systems in different capture environments [19]. Both traditional machine learning techniques and deep learning models were used, namely Open-Iris, ViT-b, and ResNet50, to evaluate performance metrics such as Equal Error Rate (EER) and True Match Rate (TMR). Our results indicate that iris recognition systems generally exhibit higher accuracy for blue irises compared to dark irises. Furthermore, we examined the generalization capabilities of these systems across different iris colors and devices, finding that while training on diverse datasets enhances recognition performance, the degree of improvement is contingent on the specific model and device used. Our analysis also identifies inherent biases in recognition performance related to iris color and cross-device variability. These findings underscore the need for more inclusive dataset collection and model refinement to reduce bias and promote equitable biometric recognition across varying iris pigmentation and device configurations.
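
摘要以 Equal Error Rate (EER) 等指标评估虹膜识别性能。下面给出由比对分数计算 EER 的常见写法示意(分数与标签为随机模拟数据):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for genuine pairs, 0 for impostor pairs;
    scores: higher = more likely genuine.  EER is the operating point
    where the false accept rate equals the false reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

rng = np.random.default_rng(0)
labels = np.r_[np.ones(500), np.zeros(500)]
scores = np.r_[rng.normal(0.7, 0.1, 500), rng.normal(0.4, 0.1, 500)]
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```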

[CV-22] Methodology for a Statistical Analysis of Influencing Factors on 3D Object Detection Performance

链接: https://arxiv.org/abs/2411.08482
作者: Anton Kuznietsov,Dirk Schweickard,Steven Peters
关键词-EN: autonomous driving, object detection, essential task, task to perceive, localizing and classifying
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In autonomous driving, object detection is an essential task to perceive the environment by localizing and classifying objects. Most object detection algorithms rely on deep learning for their superior performance. However, their black box nature makes it challenging to ensure safety. In this paper, we propose a first-of-its-kind methodology for statistical analysis of the influence of various factors related to the objects to detect or the environment on the detection performance of both LiDAR- and camera-based 3D object detectors. We perform a univariate analysis between each of the factors and the detection error in order to compare the strength of influence. To better identify potential sources of detection errors, we also analyze the performance in dependency of the influencing factors and examine the interdependencies between the different influencing factors. Recognizing the factors that influence detection performance helps identify robustness issues in the trained object detector and supports the safety approval of object detection systems.

[CV-23] A survey on Graph Deep Representation Learning for Facial Expression Recognition

链接: https://arxiv.org/abs/2411.08472
作者: Théo Gueuret,Akrem Sellami,Chaabane Djeraba
关键词-EN: facial expression recognition, comprehensive review delves, review delves deeply, graph representation learning, expression recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This comprehensive review delves deeply into the various methodologies applied to facial expression recognition (FER) through the lens of graph representation learning (GRL). Initially, we introduce the task of FER and the concepts of graph representation and GRL. Afterward, we discuss some of the most prevalent and valuable databases for this task. We explore promising approaches for graph representation in FER, including graph diffusion, spatio-temporal graphs, and multi-stream architectures. Finally, we identify future research opportunities and provide concluding remarks.

[CV-24] HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere NEURIPS2024

链接: https://arxiv.org/abs/2411.08470
作者: Hatef Otroshi Shahreza,Sébastien Marcel
关键词-EN: crawling Internet, Face recognition, individuals’ consents, raising ethical, privacy concerns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in NeurIPS 2024 Safe Generative AI Workshop

点击查看摘要

Abstract:Face recognition datasets are often collected by crawling Internet and without individuals’ consents, raising ethical and privacy concerns. Generating synthetic datasets for training face recognition models has emerged as a promising alternative. However, the generation of synthetic datasets remains challenging as it entails adequate inter-class and intra-class variations. While advances in generative models have made it easier to increase intra-class variations in face datasets (such as pose, illumination, etc.), generating sufficient inter-class variation is still a difficult task. In this paper, we formulate the dataset generation as a packing problem on the embedding space (represented on a hypersphere) of a face recognition model and propose a new synthetic dataset generation approach, called HyperFace. We formalize our packing problem as an optimization problem and solve it with a gradient descent-based approach. Then, we use a conditional face generator model to synthesize face images from the optimized embeddings. We use our generated datasets to train face recognition models and evaluate the trained models on several benchmarking real datasets. Our experimental results show that models trained with HyperFace achieve state-of-the-art performance in training face recognition using synthetic datasets.
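
HyperFace 将数据集生成形式化为人脸嵌入超球面上的装箱(packing)优化问题,并用梯度下降求解。下面是该优化步骤的极简示意(以"压低最相似身份对的余弦相似度"的 soft 最大化为目标;身份数、维度、学习率等均为笔者假设):

```python
import torch

n_ids, dim = 1000, 512                       # number of synthetic identities, embedding dim
emb = torch.nn.Parameter(
    torch.nn.functional.normalize(torch.randn(n_ids, dim), dim=1))
opt = torch.optim.SGD([emb], lr=0.1)
mask = torch.eye(n_ids, dtype=torch.bool)    # ignore self-similarity

for step in range(200):
    opt.zero_grad()
    e = torch.nn.functional.normalize(emb, dim=1)        # keep points on the hypersphere
    sim = (e @ e.t()).masked_fill(mask, -2.0)
    # push the most-similar (closest) identity pairs apart via a smooth maximum
    loss = torch.logsumexp(sim * 10.0, dim=1).mean()
    loss.backward()
    opt.step()

reference_embeddings = torch.nn.functional.normalize(emb.detach(), dim=1)
# each row could then condition a face generator to synthesize one identity
```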

[CV-25] Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks?

链接: https://arxiv.org/abs/2411.08466
作者: Quan Zhang,Yuxin Qi
关键词-EN: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, Video Foundation Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues like incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.

[CV-26] Biomass phenotyping of oilseed rape through UAV multi-view oblique imaging with 3DGS and SAM model

链接: https://arxiv.org/abs/2411.08453
作者: Yutao Shen(1 and 2),Hongyu Zhou(3),Xin Yang(1 and 2),Xuqi Lu(1 and 2),Ziyue Guo(1 and 2),Lixi Jiang(3),Yong He(1 and 2),Haiyan Cen(1 and 2) ((1) College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou, P.R. China (2) Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, Hangzhou, P.R. China (3) College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, P.R. China)
关键词-EN: optimizing crop productivity, breeding strategies, crucial for optimizing, productivity and breeding, estimation of oilseed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Biomass estimation of oilseed rape is crucial for optimizing crop productivity and breeding strategies. While UAV-based imaging has advanced high-throughput phenotyping, current methods often rely on orthophoto images, which struggle with overlapping leaves and incomplete structural information in complex field environments. This study integrates 3D Gaussian Splatting (3DGS) with the Segment Anything Model (SAM) for precise 3D reconstruction and biomass estimation of oilseed rape. UAV multi-view oblique images from 36 angles were used to perform 3D reconstruction, with the SAM module enhancing point cloud segmentation. The segmented point clouds were then converted into point cloud volumes, which were fitted to ground-measured biomass using linear regression. The results showed that 3DGS (7k and 30k iterations) provided high accuracy, with peak signal-to-noise ratios (PSNR) of 27.43 and 29.53 and training times of 7 and 49 minutes, respectively. This performance exceeded that of structure from motion (SfM) and mipmap Neural Radiance Fields (Mip-NeRF), demonstrating superior efficiency. The SAM module achieved high segmentation accuracy, with a mean intersection over union (mIoU) of 0.961 and an F1-score of 0.980. Additionally, a comparison of biomass extraction models found the point cloud volume model to be the most accurate, with a determination coefficient (R²) of 0.976, root mean square error (RMSE) of 2.92 g/plant, and mean absolute percentage error (MAPE) of 6.81%, outperforming both the plot crop volume and individual crop volume models. This study highlights the potential of combining 3DGS with multi-view UAV imaging for improved biomass phenotyping.
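
摘要的最后一步是将分割后的点云体积与实测生物量做线性回归,并用 R²、RMSE、MAPE 评估。下面给出该拟合与评估的示意(数据为随机模拟,并非论文数据):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error

rng = np.random.default_rng(42)
volume = rng.uniform(50, 400, size=(60, 1))             # point-cloud volume per plant (simulated)
biomass = 0.12 * volume[:, 0] + rng.normal(0, 3, 60)    # ground-measured biomass in g/plant (simulated)

model = LinearRegression().fit(volume, biomass)
pred = model.predict(volume)

r2 = r2_score(biomass, pred)
rmse = np.sqrt(mean_squared_error(biomass, pred))
mape = mean_absolute_percentage_error(biomass, pred) * 100
print(f"R2={r2:.3f}  RMSE={rmse:.2f} g/plant  MAPE={mape:.2f}%")
```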

[CV-27] AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

链接: https://arxiv.org/abs/2411.08451
作者: Hao Guo,Wei Fan,Baichun Wei,Jianfei Zhu,Jin Tian,Chunzhi Yi,Feng Jiang
关键词-EN: Embodied reference understanding, Embodied reference, predict referents based, attention-dynamic touch line, language descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces the Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object’s bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gestures based on interaction distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the gesture-represented line. Extensive experiments on the YouRefIt dataset demonstrate the efficacy of our gesture information understanding method in significantly improving task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.

[CV-28] Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA

链接: https://arxiv.org/abs/2411.08443
作者: Laiqiao Qin,Tianqing Zhu,Linlin Wang,Wanlei Zhou
关键词-EN: model, emerged technology, technology that removes, removes a subset, unlearning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Machine unlearning is a newly emerged technology that removes a subset of the training data from a trained model without affecting the model performance on the remaining data. This topic is becoming increasingly important in protecting user privacy and eliminating harmful or outdated data. The key challenge lies in effectively and efficiently unlearning specific information without compromising the model’s utility on the retained data. For pre-trained models, fine-tuning is an important way to achieve the unlearning target. Previous work typically fine-tuned the entire model’s parameters, which incurs significant computation costs. In addition, the fine-tuning process may cause shifts in the intermediate layer features, affecting the model’s overall utility. In this work, we propose a novel and efficient machine unlearning method on pre-trained models. We term this method Residual Feature Alignment Unlearning. Specifically, we leverage LoRA (Low-Rank Adaptation) to decompose the model’s intermediate features into pre-trained features and residual features. By adjusting the residual features, we align the unlearned model with the pre-trained model at the intermediate feature level to achieve both the unlearning and retention targets. The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set. Extensive experiments on numerous datasets validate the effectiveness of our approach.
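
摘要的核心是用 LoRA 把中间特征分解为"预训练特征 + 残差特征",并让残差在保留集上趋近于零、在遗忘集上发生偏移。下面是该分解与对齐目标的极简示意(层结构、秩与损失权重均为笔者假设,遗忘集上的"偏移"目标在论文中有更具体的定义):

```python
import torch
import torch.nn as nn

class LoRAResidualLinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank residual.
    output = pretrained_feature + residual_feature."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # residual starts at exactly zero

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

    def residual(self, x):
        return self.lora_b(self.lora_a(x))

layer = LoRAResidualLinear(nn.Linear(256, 256))
x_retain, x_forget = torch.randn(16, 256), torch.randn(16, 256)

# alignment objective sketch: drive residuals to zero on retained data,
# and push residuals away from zero (shifted) on the forget set
loss = layer.residual(x_retain).pow(2).mean() \
       - 0.1 * layer.residual(x_forget).pow(2).mean()
loss.backward()
```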

[CV-29] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

链接: https://arxiv.org/abs/2411.08410
作者: Yangyang Guo,Fangkai Jiao,Liqiang Nie,Mohan Kankanhalli
关键词-EN: Vision Large Language, Large Language Models, Vision Large, Large Language, vulnerability of Vision
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks comes as no surprise. However, recent defense mechanisms against these attacks have reached near-saturation performance on benchmarks, often with minimal effort. This simultaneous high performance in both attack and defense presents a perplexing paradox. Resolving it is critical for advancing the development of trustworthy models. To address this research gap, we first investigate why VLLMs are prone to these attacks. We then make a key observation: existing defense mechanisms suffer from an over-prudence problem, resulting in unexpected abstention even in the presence of benign inputs. Additionally, we find that the two representative evaluation methods for jailbreak often exhibit chance agreement. This limitation makes it potentially misleading when evaluating attack strategies or defense mechanisms. Beyond these empirical observations, another contribution of this work is to repurpose off-the-shelf LLM guardrails as an effective alternative detector applied prior to the VLLM response. We believe these findings offer useful insights to rethink the foundational development of VLLM safety with respect to benchmark datasets, evaluation methods, and defense strategies.

[CV-30] V2X-R: Cooperative LiDAR-4D Radar Fusion for 3D Object Detection with Denoising Diffusion

链接: https://arxiv.org/abs/2411.08402
作者: Xun Huang,Jinlong Wang,Qiming Xia,Siheng Chen,Bisheng Yang,Cheng Wang,Chenglu Wen
关键词-EN: systems have significantly, significantly enhanced, Current, radar, camera data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current Vehicle-to-Everything (V2X) systems have significantly enhanced 3D object detection using LiDAR and camera data. However, these methods suffer from performance degradation in adverse weather conditions. The weather-robust 4D radar provides Doppler and additional geometric information, raising the possibility of addressing this challenge. To this end, we present V2X-R, the first simulated V2X dataset incorporating LiDAR, camera, and 4D radar. V2X-R contains 12,079 scenarios with 37,727 frames of LiDAR and 4D radar point clouds, 150,908 images, and 170,859 annotated 3D vehicle bounding boxes. Subsequently, we propose a novel cooperative LiDAR-4D radar fusion pipeline for 3D object detection and implement it with various fusion strategies. To achieve weather-robust detection, we additionally propose a Multi-modal Denoising Diffusion (MDD) module in our fusion pipeline. MDD utilizes weather-robust 4D radar features as a condition to prompt the diffusion model to denoise noisy LiDAR features. Experiments show that our LiDAR-4D radar fusion pipeline demonstrates superior performance on the V2X-R dataset. Over and above this, our MDD module further improved the performance of the basic fusion model by up to 5.73%/6.70% in foggy/snowy conditions while barely disrupting normal performance. The dataset and code will be publicly available at: this https URL.

[CV-31] MambaXCTrack: Mamba-based Tracker with SSM Cross-correlation and Motion Prompt for Ultrasound Needle Tracking

链接: https://arxiv.org/abs/2411.08395
作者: Yuelin Zhang,Qingpeng Ding,Long Lei,Jiwei Shan,Wenxuan Xie,Tianyi Zhang,Wanquan Yan,Raymond Shing-Yan Tang,Shing Shin Cheng
关键词-EN: percutaneous interventions, guided needle insertion, widely employed, employed in percutaneous, Ultrasound
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Ultrasound (US)-guided needle insertion is widely employed in percutaneous interventions. However, providing feedback on the needle tip position via US image presents challenges due to noise, artifacts, and the thin imaging plane of US, which degrades needle features and leads to intermittent tip visibility. In this paper, a Mamba-based US needle tracker MambaXCTrack utilizing structured state space models cross-correlation (SSMX-Corr) and implicit motion prompt is proposed, which is the first application of Mamba in US needle tracking. The SSMX-Corr enhances cross-correlation by long-range modeling and global searching of distant semantic features between template and search maps, benefiting the tracking under noise and artifacts by implicitly learning potential distant semantic cues. By combining with cross-map interleaved scan (CIS), local pixel-wise interaction with positional inductive bias can also be introduced to SSMX-Corr. The implicit low-level motion descriptor is proposed as a non-visual prompt to enhance tracking robustness, addressing the intermittent tip visibility problem. Extensive experiments on a dataset with motorized needle insertion in both phantom and tissue samples demonstrate that the proposed tracker outperforms other state-of-the-art trackers while ablation studies further highlight the effectiveness of each proposed tracking module.

[CV-32] EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

链接: https://arxiv.org/abs/2411.08380
作者: Xiaofeng Wang,Kang Zhao,Feng Liu,Jiayu Wang,Guosheng Zhao,Xiaoyi Bao,Zheng Zhu,Yingya Zhang,Xingang Wang
关键词-EN: replicate real-world environments, egocentric video generation, Video generation, leveraging visual data, egocentric video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.

[CV-33] Multiscale Graph Construction Using Non-local Cluster Features

链接: https://arxiv.org/abs/2411.08371
作者: Reina Kaneko,Hayate Kojima,Kenta Yanagiya,Junya Hara,Hiroshi Higashi,Yuichi Tanaka
关键词-EN: graph, clusters, features, multiscale, multiscale graph construction
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper presents a multiscale graph construction method using both graph and signal features. A multiscale graph is a hierarchical representation of the graph, where a node at each level indicates a cluster in a finer resolution. To obtain the hierarchical clusters, existing methods often use graph clustering; however, they may ignore signal variations. As a result, these methods could fail to detect the clusters having similar features on nodes. In this paper, we consider graph and node-wise features simultaneously for multiscale clustering of a graph. With given clusters of the graph, the clusters are merged hierarchically in three steps: 1) Feature vectors in the clusters are extracted. 2) Similarities among cluster features are calculated using optimal transport. 3) A variable k-nearest neighbor graph (VkNNG) is constructed and graph spectral clustering is applied to the VkNNG to obtain clusters at a coarser scale. Additionally, the multiscale graph in this paper has non-local characteristics: Nodes with similar features are merged even if they are spatially separated. In experiments on multiscale image and point cloud segmentation, we demonstrate the effectiveness of the proposed method.

[CV-34] DyConfidMatch: Dynamic Thresholding and Re-sampling for 3D Semi-supervised Learning

链接: https://arxiv.org/abs/2411.08340
作者: Zhimin Chen,Bing Li
关键词-EN: leverages limited labeled, Semi-supervised learning, abundant unlabeled data, leverages limited, limited labeled
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by Pattern Recognition Journal

点击查看摘要

Abstract:Semi-supervised learning (SSL) leverages limited labeled and abundant unlabeled data but often faces challenges with data imbalance, especially in 3D contexts. This study investigates class-level confidence as an indicator of learning status in 3D SSL, proposing a novel method that utilizes dynamic thresholding to better use unlabeled data, particularly from underrepresented classes. A re-sampling strategy is also introduced to mitigate bias towards well-represented classes, ensuring equitable class representation. Through extensive experiments in 3D SSL, our method surpasses state-of-the-art counterparts in classification and detection tasks, highlighting its effectiveness in tackling data imbalance. This approach presents a significant advancement in SSL for 3D datasets, providing a robust solution for data imbalance issues.
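
A minimal sketch of the class-level dynamic-thresholding idea, assuming a simple scaling rule derived from per-class mean confidence (the paper's exact formulation and its re-sampling step are not reproduced here).

```python
import numpy as np

def dynamic_thresholds(probs, base_tau=0.95):
    """Lower the confidence threshold for classes the model is still learning."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    n_cls = probs.shape[1]
    class_conf = np.array([conf[pred == c].mean() if np.any(pred == c) else 0.0
                           for c in range(n_cls)])      # learning status per class
    status = class_conf / (class_conf.max() + 1e-8)      # normalize to [0, 1]
    return base_tau * status                             # weaker classes get lower thresholds

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[2.0, 1.0, 0.5], size=1000)  # toy predictions on unlabeled data
tau = dynamic_thresholds(probs)
mask = probs.max(axis=1) >= tau[probs.argmax(axis=1)]    # which unlabeled samples get pseudo-labels
print(tau, mask.mean())
```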

[CV-35] SASE: A Searching Architecture for Squeeze and Excitation Operations

链接: https://arxiv.org/abs/2411.08333
作者: Hanming Wang,Yunlong Li,Zijun Wu,Huifen Wang,Yuan Zhang
关键词-EN: introducing low complexity, enhancing network representational, network representational abilities, attention, attention modules
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the past few years, channel-wise and spatial-wise attention blocks have been widely adopted as supplementary modules in deep neural networks, enhancing network representational abilities while introducing low complexity. Most attention modules follow a squeeze-and-excitation paradigm. However, designing such attention modules requires a substantial amount of experimentation and computational resources. Neural Architecture Search (NAS), meanwhile, is able to automate the design of neural networks and spares the numerous experiments required for an optimal architecture. This motivates us to design a search architecture that can automatically find near-optimal attention modules through NAS. We propose SASE, a Searching Architecture for Squeeze and Excitation operations, to form a plug-and-play attention block by searching within a certain search space. The search space is separated into 4 different sets, each corresponding to the squeeze or excitation operation along the channel or spatial dimension. Additionally, the search sets include not only existing attention blocks but also other operations that have not been utilized in attention mechanisms before. To the best of our knowledge, SASE is the first attempt to subdivide the attention search space and search for architectures beyond currently known attention modules. The searched attention module is tested with extensive experiments across a range of visual tasks. Experimental results indicate that visual backbone networks (ResNet-50/101) using the SASE attention module achieved the best performance compared to those using the current state-of-the-art attention modules. Codes are included in the supplementary material, and they will be made public later.

[CV-36] Motion Control for Enhanced Complex Action Video Generation

链接: https://arxiv.org/abs/2411.08328
作者: Qiang Zhou,Shaofeng Zhang,Nianzu Yang,Ye Qian,Hao Li
关键词-EN: Existing, struggle with generating, sufficiently pronounced, MVideo, motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Existing text-to-video (T2V) models often struggle with generating videos with sufficiently pronounced or complex actions. A key limitation lies in the text prompt’s inability to precisely convey intricate motion details. To address this, we propose a novel framework, MVideo, designed to produce long-duration videos with precise, fluid actions. MVideo overcomes the limitations of text prompts by incorporating mask sequences as an additional motion condition input, providing a clearer, more accurate representation of intended actions. Leveraging foundational vision models such as GroundingDINO and SAM2, MVideo automatically generates mask sequences, enhancing both efficiency and robustness. Our results demonstrate that, after training, MVideo effectively aligns text prompts with motion conditions to produce videos that simultaneously meet both criteria. This dual control mechanism allows for more dynamic video generation by enabling alterations to either the text prompt or motion condition independently, or both in tandem. Furthermore, MVideo supports motion condition editing and composition, facilitating the generation of videos with more complex actions. MVideo thus advances T2V motion generation, setting a strong benchmark for improved action depiction in current video diffusion models. Our project page is available at this https URL.

[CV-37] Choix d'un espace de représentation image adapté à la détection de réseaux routiers

链接: https://arxiv.org/abs/2411.08293
作者: Jerome Gilles
关键词-EN: algorithms allowing, components have emerged, allowing to decompose, structures and textures, textures components
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Functional Analysis (math.FA)
*备注: in French language

点击查看摘要

Abstract:In recent years, algorithms that decompose an image into its structure and texture components have emerged. In this paper, we present an application of this type of decomposition to the problem of road network detection in aerial or satellite imagery. The algorithmic procedure involves the image decomposition (using a unique property), an alignment detection step based on Gestalt theory, and a refinement step using statistical active contours.

[CV-38] Noisy image decomposition: a new structure, texture and noise model based on local adaptivity

链接: https://arxiv.org/abs/2411.08292
作者: Jerome Gilles
关键词-EN: image decomposition algorithms, proposed to split, decomposition algorithms, image decomposition, image
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Functional Analysis (math.FA)
*备注: arXiv admin note: text overlap with arXiv:2411.05265

点击查看摘要

Abstract:These last few years, image decomposition algorithms have been proposed to split an image into two parts: the structures and the textures. These algorithms are not adapted to the case of noisy images because the textures are corrupted by noise. In this paper, we propose a new model which decomposes an image into three parts (structures, textures and noise) based on a local regularization scheme. We compare our results with the recent work of Aujol and Chambolle. We finish by giving another model which combines the advantages of the two previous ones.

[CV-39] Restoration algorithms and system performance evaluation for active imagers

链接: https://arxiv.org/abs/2411.08291
作者: Jerome Gilles
关键词-EN: active imaging system, paper deals, related to active, active imaging, imaging system
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper deals with two fields related to active imaging systems. First, we explore image processing algorithms to correct artefacts such as speckle, scintillation and image dancing caused by atmospheric turbulence. Next, we examine how to evaluate the performance of this kind of system. For this task, we propose a modified version of the German TRM3 metric, which makes it possible to obtain MTF-like measures. We use the database acquired during the NATO-TG40 field trials for our tests.

[CV-40] MBA-SLAM: Motion Blur Aware Dense Visual SLAM with Radiance Fields Representation

链接: https://arxiv.org/abs/2411.08279
作者: Peng Wang,Lingzhe Zhao,Yin Zhang,Shiyu Zhao,Peidong Liu
关键词-EN: high-quality video sequences, effectiveness in Simultaneous, Neural Radiance Fields, Simultaneous Localization, photo-realistic rendering
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Emerging 3D scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated their effectiveness in Simultaneous Localization and Mapping (SLAM) for photo-realistic rendering, particularly when using high-quality video sequences as input. However, existing methods struggle with motion-blurred frames, which are common in real-world scenarios like low-light or long-exposure conditions. This often results in a significant reduction in both camera localization accuracy and map reconstruction quality. To address this challenge, we propose a dense visual SLAM pipeline (i.e. MBA-SLAM) to handle severe motion-blurred inputs. Our approach integrates an efficient motion blur-aware tracker with either neural radiance fields or Gaussian Splatting based mapper. By accurately modeling the physical image formation process of motion-blurred images, our method simultaneously learns 3D scene representation and estimates the cameras’ local trajectory during exposure time, enabling proactive compensation for motion blur caused by camera movement. In our experiments, we demonstrate that MBA-SLAM surpasses previous state-of-the-art methods in both camera localization and map reconstruction, showcasing superior performance across a range of datasets, including synthetic and real datasets featuring sharp images as well as those affected by motion blur, highlighting the versatility and robustness of our approach. Code is available at this https URL.
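
The blur-aware formation model can be illustrated with a toy renderer: a blurred frame is approximated as the average of sharp renderings along the camera trajectory within the exposure window. The renderer below is a hypothetical 2D blob, not NeRF or 3DGS.

```python
import numpy as np

def render_sharp(cam_x, size=64):
    """Toy 'renderer': a bright blob whose image position depends on the camera pose."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-(((xs - (size / 2 + cam_x)) ** 2 + (ys - size / 2) ** 2) / 30.0))

def render_blurred(pose_start, pose_end, n_samples=16):
    """Average sharp renderings over poses interpolated within the exposure window."""
    poses = np.linspace(pose_start, pose_end, n_samples)
    return np.mean([render_sharp(p) for p in poses], axis=0)

blurred = render_blurred(pose_start=-6.0, pose_end=6.0)
print(blurred.shape, blurred.max())
```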

[CV-41] LBONet: Supervised Spectral Descriptors for Shape Analysis

链接: https://arxiv.org/abs/2411.08272
作者: Oguzhan Yigit,Richard C. Wilson
关键词-EN: non-rigid shape analysis, shape analysis due, countable eigensystem forming, fully characterizing geodesic, characterizing geodesic distances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 13 figures

点击查看摘要

Abstract:The Laplace-Beltrami operator has established itself in the field of non-rigid shape analysis due to its many useful properties such as being invariant under isometric transformation, having a countable eigensystem forming an orthonormal basis, and fully characterizing geodesic distances of the manifold. However, this invariance only applies under isometric deformations, which leads to a performance breakdown in many real-world applications. In recent years, emphasis has been placed upon extracting optimal features using deep learning methods; however, spectral signatures still play a crucial role and add value. In this paper we take a step back, revisiting the LBO and proposing a supervised way to learn several operators on a manifold. Depending on the task, by applying these functions, we can train the LBO eigenbasis to be more task-specific. The optimization of the LBO leads to enormous improvements to established descriptors such as the heat kernel signature in various tasks such as retrieval, classification, segmentation, and correspondence, proving the adaptation of the LBO eigenbasis to both global and highly local learning settings.
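
For context, one of the descriptors the learned eigenbasis feeds into is the heat kernel signature, HKS(x, t) = \sum_i e^{-\lambda_i t} \phi_i(x)^2. The sketch below computes it from placeholder eigenpairs; in practice they come from the (learned) Laplace-Beltrami operator.

```python
import numpy as np

def heat_kernel_signature(evals, evecs, times):
    """evals: (k,), evecs: (n_vertices, k), times: (T,) -> HKS of shape (n_vertices, T)."""
    decay = np.exp(-np.outer(evals, times))   # (k, T)
    return (evecs ** 2) @ decay               # (n, k) @ (k, T)

rng = np.random.default_rng(0)
evals = np.sort(rng.uniform(0.0, 5.0, size=20))   # placeholder LBO eigenvalues
evecs = rng.normal(size=(500, 20))                # placeholder eigenfunctions on 500 vertices
hks = heat_kernel_signature(evals, evecs, times=np.geomspace(0.01, 1.0, 8))
print(hks.shape)  # (500, 8)
```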

[CV-42] GTA: Global Tracklet Association for Multi-Object Tracking in Sports ACCV2024

链接: https://arxiv.org/abs/2411.08216
作者: Jiacheng Sun,Hsiang-Wei Huang,Cheng-Yen Yang,Zhongyu Jiang,Jenq-Neng Hwang
关键词-EN: deep learning techniques, experiencing significant advancements, computer vision, learning techniques, focal points
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACCV 2024 MLCSA Workshop

点击查看摘要

Abstract:Multi-object tracking in sports scenarios has become one of the focal points in computer vision, experiencing significant advancements through the integration of deep learning techniques. Despite these breakthroughs, challenges remain, such as accurately re-identifying players upon re-entry into the scene and minimizing ID switches. In this paper, we propose an appearance-based global tracklet association algorithm designed to enhance tracking performance by splitting tracklets containing multiple identities and connecting tracklets seemingly from the same identity. This method can serve as a plug-and-play refinement tool for any multi-object tracker to further boost their performance. The proposed method achieved a new state-of-the-art performance on the SportsMOT dataset with HOTA score of 81.04%. Similarly, on the SoccerNet dataset, our method enhanced multiple trackers’ performance, consistently increasing the HOTA score from 79.41% to 83.11%. These significant and consistent improvements across different trackers and datasets underscore our proposed method’s potential impact on the application of sports player tracking. We open-source our project codebase at this https URL.
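
A hedged sketch of the "connect" half of the idea: tracklets whose mean appearance embeddings are close in cosine distance are grouped into one identity via hierarchical clustering. The embeddings, the threshold, and the omitted split step are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
tracklet_emb = rng.normal(size=(10, 128))            # one mean appearance embedding per tracklet
tracklet_emb /= np.linalg.norm(tracklet_emb, axis=1, keepdims=True)

dists = pdist(tracklet_emb, metric="cosine")         # pairwise cosine distances
Z = linkage(dists, method="average")                 # hierarchical (average-linkage) clustering
merged_ids = fcluster(Z, t=0.3, criterion="distance")  # tracklets sharing a cluster share an id
print(merged_ids)
```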

[CV-43] Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing

链接: https://arxiv.org/abs/2411.08196
作者: Zitao Shuai,Chenwei Wu,Zhengxu Tang,Bowen Song,Liyue Shen
关键词-EN: recently achieved remarkable, achieved remarkable success, latent space, Diffusion Transformers, text-guided image generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2408.13335

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how they guide the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties: First, DiT’s latent space is inherently semantically disentangled, where different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither encoded image nor text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling precise semantic control without additional training or mask annotations. Based on these insights, we propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing. Specifically, we first encode both the given source image and the text prompt that describes the image, to obtain the joint latent embedding. Then, using our proposed Hessian Score Distillation Sampling (HSDS) method, we identify editing directions that control specific target attributes while preserving other image features. These directions are guided by text prompts and used to manipulate the latent embeddings. Moreover, we propose a new metric to quantify the disentanglement degree of the latent space of diffusion models. Extensive experiment results on our new curated benchmark dataset and analysis demonstrate DiT’s disentanglement properties and effectiveness of the EIM framework.

[CV-44] EAPCR: A Universal Feature Extractor for Scientific Data without Explicit Feature Relation Patterns

链接: https://arxiv.org/abs/2411.08164
作者: Zhuohang Yu,Ling An,Yansong Li,Yu Wu,Zeyu Dong,Zhangdi Liu,Le Gao,Zhenyu Zhang,Chichun Zhou
关键词-EN: including Decision Tree, Decision Tree, system anomaly detection, non-image medical diagnostics, catalysis efficiency prediction
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Conventional methods, including Decision Tree (DT)-based methods, have been effective in scientific tasks, such as non-image medical diagnostics, system anomaly detection, and inorganic catalysis efficiency prediction. However, most deep-learning techniques have struggled to surpass or even match the level of success of traditional machine-learning methods. The primary reason is that these applications involve multi-source, heterogeneous data where features lack explicit relationships. This contrasts with image data, where pixels exhibit spatial relationships; textual data, where words have sequential dependencies; and graph data, where nodes are connected through established associations. The absence of explicit Feature Relation Patterns (FRPs) presents a significant challenge for deep learning techniques in scientific applications that are not image, text, and graph-based. In this paper, we introduce EAPCR, a universal feature extractor designed for data without explicit FRPs. Tested across various scientific tasks, EAPCR consistently outperforms traditional methods and bridges the gap where deep learning models fall short. To further demonstrate its robustness, we synthesize a dataset without explicit FRPs. While Kolmogorov-Arnold Network (KAN) and feature extractors like Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and Transformers struggle, EAPCR excels, demonstrating its robustness and superior performance in scientific tasks without FRPs.

[CV-45] CameraHMR: Aligning People with Perspective

链接: https://arxiv.org/abs/2411.08128
作者: Priyanka Patel,Michael J. Black
关键词-EN: human pose, estimation from monocular, monocular images, model, SMPLify
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 3DV 2025

点击查看摘要

Abstract:We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.
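
The role of the estimated intrinsics can be illustrated with the standard mapping from a vertical field of view to a pinhole camera matrix; the FoV value and image size below are illustrative, not outputs of HumanFoV.

```python
import numpy as np

def intrinsics_from_fov(fov_v_deg, height, width):
    """Focal length in pixels from a vertical FoV; principal point at the image center."""
    f = 0.5 * height / np.tan(0.5 * np.deg2rad(fov_v_deg))
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

K = intrinsics_from_fov(fov_v_deg=55.0, height=720, width=1280)
print(K)
```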

[CV-46] IPO: Text to Image with Text Presampling for Prompt Optimization

链接: https://arxiv.org/abs/2411.08127
作者: Shih-Ying Yeh,Sang-Hyun Park,Giyeong Oh,Min Song,Youngjae Yu
关键词-EN: innovative framework designed, Large Language Models, designed to enhance, innovative framework, framework designed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages, 13 figures

点击查看摘要

Abstract:TIPO (Text to Image with text pre-sampling for Prompt Optimization) is an innovative framework designed to enhance text-to-image (T2I) generation by language model (LM) for automatic prompt engineering. By refining and extending user-provided prompts, TIPO bridges the gap between simple inputs and the detailed prompts required for high-quality image generation. Unlike previous approaches that rely on Large Language Models (LLMs) or reinforcement learning (RL), TIPO adjusts user input prompts with the distribution of a trained prompt dataset, eliminating the need for complex runtime cost via lightweight model. This pre-sampling approach enables efficient and scalable prompt optimization, grounded in the model’s training distribution. Experimental results demonstrate TIPO’s effectiveness in improving aesthetic scores, reducing image corruption, and better aligning generated images with dataset distributions. These findings highlight the critical role of prompt engineering in T2I systems and open avenues for broader applications of automatic prompt refinement.

[CV-47] Deep Learning 2.0: Artificial Neurons That Matter – Reject Correlation Embrace Orthogonality CVPR2025

链接: https://arxiv.org/abs/2411.08085
作者: Taha Bouhsine
关键词-EN: Neural Matter Network, achieves non-linear pattern, non-linear pattern recognition, Neural Matter, activation functions
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); General Topology (math.GN)
*备注: Submitted to CVPR 2025

点击查看摘要

Abstract:We introduce a yat-product-powered neural network, the Neural Matter Network (NMN), a breakthrough in deep learning that achieves non-linear pattern recognition without activation functions. Our key innovation relies on the yat-product, which naturally induces non-linearity by projecting inputs into a pseudo-metric space, eliminating the need for traditional activation functions while maintaining only a softmax layer for final class probability distribution. This approach simplifies network architecture and provides unprecedented transparency into the network's decision-making process. Our comprehensive empirical evaluation across different datasets demonstrates that NMN consistently outperforms traditional MLPs. The results challenge the assumption that separate activation functions are necessary for effective deep-learning models. The implications of this work extend beyond immediate architectural benefits: by eliminating intermediate activation functions while preserving non-linear capabilities, yat-MLP establishes a new paradigm for neural network design that combines simplicity with effectiveness. Most importantly, our approach provides unprecedented insights into the traditionally opaque “black-box” nature of neural networks, offering a clearer understanding of how these models process and classify information.

[CV-48] UNSCT-HRNet: Modeling Anatomical Uncertainty for Landmark Detection in Total Hip Arthroplasty

链接: https://arxiv.org/abs/2411.08488
作者: Jiaxin Wan,Lin Liu,Haoran Wang,Liangwei Li,Wei Li,Shuheng Kou,Runtian Li,Jiayi Tang,Juanxiu Liu,Jing Zhang,Xiaohui Du,Ruqian Hao
关键词-EN: Total hip arthroplasty, irregular patient postures, occluded anatomical markers, anatomical markers pose, markers pose significant
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Total hip arthroplasty (THA) relies on accurate landmark detection from radiographic images, but unstructured data caused by irregular patient postures or occluded anatomical markers pose significant challenges for existing methods. To address this, we propose UNSCT-HRNet (Unstructured CT - High-Resolution Net), a deep learning-based framework that integrates a Spatial Relationship Fusion (SRF) module and an Uncertainty Estimation (UE) module. The SRF module, utilizing coordinate convolution and polarized attention, enhances the model's ability to capture complex spatial relationships. Meanwhile, the UE module, which is based on entropy, ensures predictions are anatomically relevant. For unstructured data, the proposed method can predict landmarks without relying on a fixed number of points, which shows higher accuracy and better robustness compared with existing methods. Our UNSCT-HRNet demonstrates over a 60% improvement across multiple metrics on unstructured data. The experimental results also reveal that our approach maintains good performance on the structured dataset. Overall, the proposed UNSCT-HRNet has the potential to be used as a new reliable, automated solution for THA surgical planning and postoperative monitoring.
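
The coordinate-convolution ingredient of the SRF module follows the well-known CoordConv pattern: normalized (x, y) coordinate channels are concatenated to the feature map before a standard convolution, giving the network an explicit positional signal. The sketch below shows only that pattern; channel sizes are illustrative and the polarized-attention part is omitted.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))  # append coordinate channels

feat = torch.randn(2, 64, 32, 32)
out = CoordConv2d(64, 64)(feat)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```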

[CV-49] Robust Divergence Learning for Missing-Modality Segmentation

链接: https://arxiv.org/abs/2411.08305
作者: Runze Cheng,Zhongao Sun,Ye Zhang,Chun Li
关键词-EN: Multimodal Magnetic Resonance, Magnetic Resonance Imaging, Multimodal Magnetic, Resonance Imaging, Magnetic Resonance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Magnetic Resonance Imaging (MRI) provides essential complementary information for analyzing brain tumor subregions. While methods using four common MRI modalities for automatic segmentation have shown success, they often face challenges with missing modalities due to image quality issues, inconsistent protocols, allergic reactions, or cost factors. Thus, developing a segmentation paradigm that handles missing modalities is clinically valuable. A novel single-modality parallel processing network framework based on Hölder divergence and mutual information is introduced. Each modality is independently input into a shared network backbone for parallel processing, preserving unique information. Additionally, a dynamic sharing framework is introduced that adjusts network parameters based on modality availability. Loss functions based on Hölder divergence and mutual information are used to evaluate discrepancies between predictions and labels. Extensive testing on the BraTS 2018 and BraTS 2020 datasets demonstrates that our method outperforms existing techniques in handling missing modalities and validates each component's effectiveness.

[CV-50] TomoGRAF: A Robust and Generalizable Reconstruction Network for Single-View Computed Tomography

链接: https://arxiv.org/abs/2411.08158
作者: Di Xu,Yang Yang,Hengjie Liu,Qihui Lyu,Martina Descovich,Dan Ruan,Ke Sheng
关键词-EN: high spatial resolution, spatial resolution visualization, Computed tomography, structures for scientific, high spatial
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Computed tomography (CT) provides high spatial resolution visualization of 3D structures for scientific and clinical applications. Traditional analytical/iterative CT reconstruction algorithms require hundreds of angular data samplings, a condition that may not be met in practice due to physical and mechanical limitations. Sparse view CT reconstruction has been proposed using constrained optimization and machine learning methods with varying success, less so for ultra-sparse view CT reconstruction with one to two views. Neural radiance field (NeRF) is a powerful tool for reconstructing and rendering 3D natural scenes from sparse views, but its direct application to 3D medical image reconstruction has been minimally successful due to the differences between optical and X-ray photon transportation. Here, we develop a novel TomoGRAF framework incorporating the unique X-ray transportation physics to reconstruct high-quality 3D volumes using ultra-sparse projections without prior. TomoGRAF captures the CT imaging geometry, simulates the X-ray casting and tracing process, and penalizes the difference between simulated and ground truth CT sub-volume during training. We evaluated the performance of TomoGRAF on an unseen dataset of distinct imaging characteristics from the training data and demonstrated a vast leap in performance compared with state-of-the-art deep learning and NeRF methods. TomoGRAF provides the first generalizable solution for image-guided radiotherapy and interventional radiology applications, where only one or a few X-ray views are available, but 3D volumetric information is desired.

机器学习

[LG-0] Unsupervised Parameter-free Outlier Detection using HDBSCAN* Outlier Profiles

链接: https://arxiv.org/abs/2411.08867
作者: Kushankur Ghosh,Murilo Coelho Naldi,Jörg Sander,Euijin Choo
关键词-EN: introduce irrelevant information, GLOSH, GLOSH scores, statistics and models, machine learning
类目: Machine Learning (cs.LG)
*备注: Accepted at IEEE International Conference on Big Data, IEEE BigData 2024

点击查看摘要

Abstract:In machine learning and data mining, outliers are data points that significantly differ from the dataset and often introduce irrelevant information that can induce bias in its statistics and models. Therefore, unsupervised methods are crucial to detect outliers if there is limited or no information about them. Global-Local Outlier Scores based on Hierarchies (GLOSH) is an unsupervised outlier detection method within HDBSCAN*, a state-of-the-art hierarchical clustering method. GLOSH estimates outlier scores for each data point by comparing its density to the highest density of the region they reside in the HDBSCAN* hierarchy. GLOSH may be sensitive to HDBSCAN*'s minpts parameter that influences density estimation. With limited knowledge about the data, choosing an appropriate minpts value beforehand is challenging as one or some minpts values may better represent the underlying cluster structure than others. Additionally, in the process of searching for "potential outliers", one has to define the number of outliers n a dataset has, which may be impractical and is often unknown. In this paper, we propose an unsupervised strategy to find the "best" minpts value, leveraging the range of GLOSH scores across minpts values to identify the value for which GLOSH scores can best identify outliers from the rest of the dataset. Moreover, we propose an unsupervised strategy to estimate a threshold for classifying points into inliers and (potential) outliers without the need to pre-define any value. Our experiments show that our strategies can automatically find the minpts value and threshold that yield the best or near best outlier detection results using GLOSH.
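
A rough sketch of the minpts sweep, using the hdbscan package's GLOSH scores (exposed as outlier_scores_). The gap-based selection rule and the toy data below are simple stand-ins for the paper's actual strategy, not a reproduction of it.

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = np.vstack([X, np.random.default_rng(0).uniform(-15, 15, size=(20, 2))])  # inject outliers

best = None
for minpts in range(5, 51, 5):
    scores = hdbscan.HDBSCAN(min_samples=minpts).fit(X).outlier_scores_  # GLOSH scores
    scores = np.nan_to_num(scores)
    top, rest = np.sort(scores)[-20:], np.sort(scores)[:-20]  # candidate outliers vs. inliers
    gap = top.min() - rest.max()                              # separation between the two groups
    if best is None or gap > best[1]:
        best = (minpts, gap, scores)

minpts, gap, scores = best
threshold = np.sort(scores)[-20:].min()   # naive threshold taken from the chosen run
print(minpts, round(gap, 3), round(threshold, 3))
```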

[LG-1] LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs AAAI2025

链接: https://arxiv.org/abs/2411.08862
作者: Piyush Jha,Arnav Arora,Vijay Ganesh
关键词-EN: leverages Large Language, Large Language Models, Large Language, automatically generate adversarial, generate adversarial suffixes
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at AAAI 2025

点击查看摘要

Abstract:We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.

[LG-2] Learning Gaussian Multi-Index Models with Gradient Flow: Time Complexity and Directional Convergence AISTATS2025

链接: https://arxiv.org/abs/2411.08798
作者: Berfin Simsek,Amire Bendjeddou,Daniel Hsu
关键词-EN: standard Gaussian data, high-dimensional standard Gaussian, Gaussian data, standard Gaussian, index vectors
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Algebraic Geometry (math.AG)
*备注: 21 pages, 6 figures, under review by AISTATS 2025

点击查看摘要

Abstract:This work focuses on the gradient flow dynamics of a neural network model that uses correlation loss to approximate a multi-index function on high-dimensional standard Gaussian data. Specifically, the multi-index function we consider is a sum of neurons f^*(x) = \sum_{j=1}^{k} \sigma^*(v_j^T x) where v_1, \dots, v_k are unit vectors, and \sigma^* lacks the first and second Hermite polynomials in its Hermite expansion. It is known that, for the single-index case (k = 1), overcoming the search phase requires polynomial time complexity. We first generalize this result to multi-index functions characterized by vectors in arbitrary directions. After the search phase, it is not clear whether the network neurons converge to the index vectors, or get stuck at a sub-optimal solution. When the index vectors are orthogonal, we give a complete characterization of the fixed points and prove that neurons converge to the nearest index vectors. Therefore, using n \asymp k \log k neurons ensures finding the full set of index vectors with gradient flow with high probability over random initialization. When v_i^T v_j = \beta \geq 0 for all i \neq j, we prove the existence of a sharp threshold \beta_c = c/(c+k) at which the fixed point that computes the average of the index vectors transitions from a saddle point to a minimum. Numerical simulations show that using a correlation loss and a mild overparameterization suffices to learn all of the index vectors when they are nearly orthogonal, however, the correlation loss fails when the dot product between the index vectors exceeds a certain threshold.

[LG-3] Locally Private Sampling with Public Data

链接: https://arxiv.org/abs/2411.08791
作者: Behnoosh Zamanlooy,Mario Diaz,Shahab Asoodeh
关键词-EN: Local differential privacy, privacy-preserving machine learning, Local differential, differential privacy, untrusted aggregator
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Local differential privacy (LDP) is increasingly employed in privacy-preserving machine learning to protect user data before sharing it with an untrusted aggregator. Most LDP methods assume that users possess only a single data record, which is a significant limitation since users often gather extensive datasets (e.g., images, text, time-series data) and frequently have access to public datasets. To address this limitation, we propose a locally private sampling framework that leverages both the private and public datasets of each user. Specifically, we assume each user has two distributions: p and q that represent their private dataset and the public dataset, respectively. The objective is to design a mechanism that generates a private sample approximating p while simultaneously preserving q. We frame this objective as a minimax optimization problem using f-divergence as the utility measure. We fully characterize the minimax optimal mechanisms for general f-divergences provided that p and q are discrete distributions. Remarkably, we demonstrate that this optimal mechanism is universal across all f-divergences. Experiments validate the effectiveness of our minimax optimal sampler compared to the state-of-the-art locally private sampler.

[LG-4] Optimal Oblivious Subspace Embeddings with Near-optimal Sparsity

链接: https://arxiv.org/abs/2411.08773
作者: Shabarish Chenakkod,Michał Dereziński,Xiaoyu Dong
关键词-EN: oblivious subspace embedding, subspace embedding, oblivious subspace, epsilon, preserves the norms
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:An oblivious subspace embedding is a random m \times n matrix \Pi such that, for any d-dimensional subspace, with high probability \Pi preserves the norms of all vectors in that subspace within a 1 \pm \epsilon factor. In this work, we give an oblivious subspace embedding with the optimal dimension m = \Theta(d/\epsilon^2) that has a near-optimal sparsity of \tilde{O}(1/\epsilon) non-zero entries per column of \Pi. This is the first result to nearly match the conjecture of Nelson and Nguyen [FOCS 2013] in terms of the best sparsity attainable by an optimal oblivious subspace embedding, improving on a prior bound of \tilde{O}(1/\epsilon^6) non-zeros per column [Chenakkod et al., STOC 2024]. We further extend our approach to the non-oblivious setting, proposing a new family of Leverage Score Sparsified embeddings with Independent Columns, which yield faster runtimes for matrix approximation and regression tasks. In our analysis, we develop a new method which uses a decoupling argument together with the cumulant method for bounding the edge universality error of isotropic random matrices. To achieve near-optimal sparsity, we combine this general-purpose approach with new trace inequalities that leverage the specific structure of our subspace embedding construction.
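
For intuition, a sparse oblivious subspace embedding of the kind discussed here can be sketched as follows: each column of \Pi receives s nonzero entries of value ±1/\sqrt{s} at random rows (an OSNAP-style construction; the dimensions chosen below are illustrative, not the paper's bounds).

```python
import numpy as np

def sparse_embedding(m, n, s, rng):
    """Each column gets s nonzeros of value +-1/sqrt(s) at uniformly random rows."""
    Pi = np.zeros((m, n))
    for col in range(n):
        rows = rng.choice(m, size=s, replace=False)
        Pi[rows, col] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return Pi

rng = np.random.default_rng(0)
n, d, eps = 2000, 10, 0.25
m, s = int(d / eps**2 * 4), 8                     # m ~ d/eps^2, sparsity s per column
U, _ = np.linalg.qr(rng.normal(size=(n, d)))      # orthonormal basis of a random subspace
Pi = sparse_embedding(m, n, s, rng)

# distortion of subspace vectors: singular values of Pi @ U should stay close to 1
sv = np.linalg.svd(Pi @ U, compute_uv=False)
print(sv.min(), sv.max())
```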

[LG-5] Mapping Methane – The Impact of Dairy Farm Practices on Emissions Through Satellite Data and Machine Learning

链接: https://arxiv.org/abs/2411.08766
作者: Hanqing Bi,Suresh Neethirajan
关键词-EN: Eastern Canada, observations in Eastern, Variance Inflation Factor, Principal Component Analysis, study investigates
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:This study investigates the correlation between dairy farm characteristics and methane concentrations as derived from satellite observations in Eastern Canada. Utilizing data from 11 dairy farms collected between January 2020 and December 2022, we integrated Sentinel-5P satellite methane data with critical farm-level attributes, including herd genetics, feeding practices, and management strategies. Initial analyses revealed significant correlations with methane concentrations, leading to the application of Variance Inflation Factor (VIF) and Principal Component Analysis (PCA) to address multicollinearity and enhance model stability. Subsequently, machine learning models - specifically Random Forest and Neural Networks - were employed to evaluate feature importance and predict methane emissions. Our findings indicate a strong negative correlation between the Estimated Breeding Value (EBV) for protein percentage and methane concentrations, suggesting that genetic selection for higher milk protein content could be an effective strategy for emissions reduction. The integration of atmospheric transport models with satellite data further refined our emission estimates, significantly enhancing accuracy and spatial resolution. This research underscores the potential of advanced satellite monitoring, machine learning techniques, and atmospheric modeling in improving methane emission assessments within the dairy sector. It emphasizes the critical role of farm-specific characteristics in developing effective mitigation strategies. Future investigations should focus on expanding the dataset and incorporating inversion modeling for more precise emission quantification. Balancing ecological impacts with economic viability will be essential for fostering sustainable dairy farming practices.

[LG-6] Energy Dissipation Preserving Physics Informed Neural Network for Allen-Cahn Equations

链接: https://arxiv.org/abs/2411.08760
作者: Mustafa Kütük,Hamdullah Yücel
关键词-EN: physics-informed neural network, logarithmic energy functionals, random initial functions, Allen-Cahn equation, degenerate mobility
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This paper investigates a numerical solution of Allen-Cahn equation with constant and degenerate mobility, with polynomial and logarithmic energy functionals, with deterministic and random initial functions, and with advective term in one, two, and three spatial dimensions, based on the physics-informed neural network (PINN). To improve the learning capacity of the PINN, we incorporate the energy dissipation property of the Allen-Cahn equation as a penalty term into the loss function of the network. To facilitate the learning process of random initials, we employ a continuous analogue of the initial random condition by utilizing the Fourier series expansion. Adaptive methods from traditional numerical analysis are also integrated to enhance the effectiveness of the proposed PINN. Numerical results indicate a consistent decrease in the discrete energy, while also revealing phenomena such as phase separation and metastability.
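
A minimal PINN sketch for a 1D Allen-Cahn equation u_t = eps^2 u_xx + u - u^3 with an energy-dissipation penalty added to the loss, in the spirit described above. The network size, sampling, penalty weight, and the omission of initial/boundary terms are simplifications, not the paper's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
eps = 0.1
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def u_and_derivs(x, t):
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    ones = torch.ones_like(u)
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    return u, u_x, u_xx, u_t

def energy(t_value, n=128):
    # Monte Carlo estimate of E(t) = int_{-1}^{1} eps^2/2 * u_x^2 + (u^2 - 1)^2 / 4 dx
    x = torch.rand(n, 1) * 2 - 1
    t = torch.full_like(x, t_value)
    u, u_x, _, _ = u_and_derivs(x, t)
    return 2.0 * torch.mean(0.5 * eps**2 * u_x**2 + 0.25 * (u**2 - 1) ** 2)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(100):  # initial/boundary-condition losses omitted for brevity
    x = torch.rand(256, 1) * 2 - 1
    t = torch.rand(256, 1)
    u, _, u_xx, u_t = u_and_derivs(x, t)
    residual = u_t - eps**2 * u_xx - (u - u**3)              # Allen-Cahn PDE residual
    e = torch.stack([energy(tv.item()) for tv in torch.linspace(0, 1, 5)])
    dissipation = torch.relu(e[1:] - e[:-1]).mean()          # penalize any energy increase
    loss = residual.pow(2).mean() + 0.1 * dissipation
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```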

[LG-7] ScaleNet: Scale Invariance Learning in Directed Graphs

链接: https://arxiv.org/abs/2411.08758
作者: Qin Jiang,Chengjia Wang,Michael Lones,Wei Pang
关键词-EN: Graph Neural Networks, Neural Networks, advanced relational data, relational data analysis, Graph Neural
类目: Machine Learning (cs.LG)
*备注: Scale invariance in node classification is demonstrated and applied in graph transformation to develop ScaleNet, which achieves state-of-the-art performance on both homophilic and heterophilic directed graphs

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have advanced relational data analysis but lack invariance learning techniques common in image classification. In node classification with GNNs, it is actually the ego-graph of the center node that is classified. This research extends the scale invariance concept to node classification by drawing an analogy to image processing: just as scale invariance is used in image classification to capture multi-scale features, we propose the concept of "scaled ego-graphs". Scaled ego-graphs generalize traditional ego-graphs by replacing undirected single-edges with "scaled-edges", which are ordered sequences of multiple directed edges. We empirically assess the performance of the proposed scale invariance in graphs on seven benchmark datasets, across both homophilic and heterophilic structures. Our scale-invariance-based graph learning outperforms inception models derived from random walks by being simpler, faster, and more accurate. The scale invariance explains inception models' success on homophilic graphs and limitations on heterophilic graphs. To ensure applicability of the inception model to heterophilic graphs as well, we further present ScaleNet, an architecture that leverages multi-scaled features. ScaleNet achieves state-of-the-art results on five out of seven datasets (four homophilic and one heterophilic) and matches top performance on the remaining two, demonstrating its excellent applicability. This represents a significant advance in graph learning, offering a unified framework that enhances node classification across various graph types. Our code is available at this https URL.
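
One way to read "scaled edges" is as ordered compositions of directed edges, which can be materialized as products of the adjacency matrix and its transpose. The sketch below follows that simplified reading with a tiny toy graph; it is not the paper's exact construction.

```python
import numpy as np

def scaled_adjacencies(A):
    """Directed adjacency A -> a few scale-2 adjacencies built from ordered edge pairs."""
    return {
        "A":   A,                              # scale 1: original directed edges
        "AA":  (A @ A > 0).astype(int),        # out -edge followed by out-edge
        "AAt": (A @ A.T > 0).astype(int),      # two edges sharing a target
        "AtA": (A.T @ A > 0).astype(int),      # two edges sharing a source
    }

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])
for name, M in scaled_adjacencies(A).items():
    print(name, M.sum())
```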

[LG-8] Optimal Transport-Based Displacement Interpolation with Data Augmentation for Reduced Order Modeling of Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2411.08750
作者: Moaad Khamlich,Federico Pichi,Michele Girfoglio,Annalisa Quaini,Gianluigi Rozza
关键词-EN: leverages optimal transport, reduced-order Model, optimal transport, theory and displacement, leverages optimal
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel reduced-order Model (ROM) that leverages optimal transport (OT) theory and displacement interpolation to enhance the representation of nonlinear dynamics in complex systems. While traditional ROM techniques face challenges in this scenario, especially when data (i.e., observational snapshots) is limited, our method addresses these issues by introducing a data augmentation strategy based on OT principles. The proposed framework generates interpolated solutions tracing geodesic paths in the space of probability distributions, enriching the training dataset for the ROM. A key feature of our approach is its ability to provide a continuous representation of the solution’s dynamics by exploiting a virtual-to-real time mapping. This enables the reconstruction of solutions at finer temporal scales than those provided by the original data. To further improve prediction accuracy, we employ Gaussian Process Regression to learn the residual and correct the representation between the interpolated snapshots and the physical solution. We demonstrate the effectiveness of our methodology with atmospheric mesoscale benchmarks characterized by highly nonlinear, advection-dominated dynamics. Our results show improved accuracy and efficiency in predicting complex system behaviors, indicating the potential of this approach for a wide range of applications in computational physics and engineering.

[LG-9] Bayesian Comparisons Between Representations

链接: https://arxiv.org/abs/2411.08739
作者: Heiko H. Schütt
关键词-EN: learning and neuroscience, fundamental question, machine learning, representations, Bayesian statistics
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Which neural networks are similar is a fundamental question for both machine learning and neuroscience. Our novel method compares representations based on Bayesian statistics about linear readouts from the representations. Concretely, we suggest using the total variation distance or Jensen-Shannon distance between prior predictive distributions to compare representations. The prior predictive distribution is a full description of the inductive bias and generalization of a model in Bayesian statistics, making it a great basis for comparisons. As Jensen-Shannon distance and total variation distance are metrics, our dissimilarity measures are pseudo-metrics for representations. For a linear readout, our metrics just depend on the linear kernel matrix of the representations. Thus, our metric connects linear read-out based comparisons to kernel-based metrics like centered kernel alignment and representational similarity analysis. We apply our new metrics to deep neural networks trained on ImageNet-1k. Our new metrics can be computed efficiently including a stochastic gradient without dimensionality reductions of the representations. It broadly agrees with existing metrics, but is more stringent. It varies less across different random image samples, and it measures how well two representations could be distinguished based on a linear read out. Thus our metric nicely extends our toolkit for comparing representations.

[LG-10] Recommender systems and reinforcement learning for building control and occupant interaction: A text-mining driven review of scientific literature

链接: https://arxiv.org/abs/2411.08734
作者: Wenhao Zhang,Matias Quintana,Clayton Miller
关键词-EN: greatly affects health, key research focus, environment greatly affects, health and well-being, enhancing health
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The indoor environment greatly affects health and well-being; enhancing health and reducing energy use in these settings is a key research focus. With advancing Information and Communication Technology (ICT), recommendation systems and reinforcement learning have emerged as promising methods to induce behavioral changes that improve indoor environments and building energy efficiency. This study employs text-mining and Natural Language Processing (NLP) to examine these approaches in building control and occupant interaction. Analyzing approximately 27,000 articles from the ScienceDirect database, we found extensive use of recommendation systems and reinforcement learning for space optimization, location recommendations, and personalized control suggestions. Despite broad applications, their use in optimizing indoor environments and energy efficiency is limited. Traditional recommendation algorithms are commonly used, but optimizing indoor conditions and energy efficiency often requires advanced machine learning techniques like reinforcement and deep learning. This review highlights the potential for expanding recommender systems and reinforcement learning applications in buildings and indoor environments. Areas for innovation include predictive maintenance, building-related product recommendations, and optimizing environments for specific needs like sleep and productivity enhancements based on user feedback.

[LG-11] Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs

链接: https://arxiv.org/abs/2411.08719
作者: Kazuki Fujii,Taishi Nakamura,Rio Yokota
关键词-EN: Large Language Models, attracted significant attention, significant attention due, human-like language understanding, Large Language
类目: Machine Learning (cs.LG)
*备注: 2 pages,extended abstract

点击查看摘要

Abstract:Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities, as well as their applicability across various domains. These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language processing. The Llama 3 series, for instance, exemplifies this trend with its flagship model boasting 405 billion parameters trained on 15.6 trillion tokens. The immense computational demands associated with training such models have spurred ongoing research into optimizing the efficiency of the training process, particularly through the use of lower-precision formats. NVIDIA's H100 GPU, which introduces support for FP8 in addition to the more conventional FP16 and BF16 formats, has emerged as a focal point in this optimization effort. Preliminary studies suggest that FP8 could offer substantial reductions in training time without sacrificing model performance when compared to BF16, making it a promising candidate for large-scale model training. However, the broader implications of adopting FP8, particularly in terms of training stability and downstream task performance, have yet to be fully understood. In this study, we delve into the practical trade-offs involved in adopting FP8 over BF16 for training LLMs.

[LG-12] FedSub: Introducing class-aware Subnetworks Fusion to Enhance Personalized Federated Learning in Ubiquitous Systems

链接: https://arxiv.org/abs/2411.08699
作者: Mattia Giovanni Campana,Franca Delmastro
关键词-EN: Personalized Federated Learning, Learning is essential, Personalized Federated, Federated Learning, evolving user behaviors
类目: Machine Learning (cs.LG)
*备注: Submitted to Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)

点击查看摘要

Abstract:Personalized Federated Learning is essential in AI-driven ubiquitous systems, supporting the distributed development of models able to adapt to diverse and evolving user behaviors while safeguarding privacy. Despite addressing heterogeneous user data distributions in collaborative model training, existing methods often face limitations balancing personalization and generalization, oversimplifying user similarities, or relying heavily on global models. In this paper, we propose FedSub, a novel federated approach designed to enhance personalization through the use of class-aware prototypes and model subnetworks. Prototypes serve as compact representations of user data, clustered on the server to identify similarities based on specific label patterns. Concurrently, subnetworks – model components necessary to process each class – are extracted locally and fused by the server according to these clusters, producing highly tailored model updates for each user. This fine-grained, class-specific aggregation of clients’ models allows FedSub to capture the unique characteristics of individual user data patterns. The effectiveness of FedSub is validated in three real-world scenarios characterized by high data heterogeneity, derived from human activity recognition and mobile health applications. Experimental evaluations demonstrate FedSub’s performance improvements with respect to the state-of-the-art and significant advancements in personalization for ubiquitous systems based on personal mobile and wearable devices.

[LG-13] Measuring similarity between embedding spaces using induced neighborhood graphs

链接: https://arxiv.org/abs/2411.08687
作者: Tiago F. Tavares,Fabio Ayres,Paris Smaragdis
关键词-EN: Deep Learning techniques, Deep Learning, capture semantic similarities, Learning techniques, techniques have excelled
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Learning techniques have excelled at generating embedding spaces that capture semantic similarities between items. Often these representations are paired, enabling experiments with analogies (pairs within the same domain) and cross-modality (pairs across domains). These experiments are based on specific assumptions about the geometry of embedding spaces, which allow finding paired items by extrapolating the positional relationships between embedding pairs in the training dataset, allowing for tasks such as finding new analogies, and multimodal zero-shot classification. In this work, we propose a metric to evaluate the similarity between paired item representations. Our proposal is built from the structural similarity between the nearest-neighbors induced graphs of each representation, and can be configured to compare spaces based on different distance metrics and on different neighborhood sizes. We demonstrate that our proposal can be used to identify similar structures at different scales, which is hard to achieve with kernel methods such as Centered Kernel Alignment (CKA). We further illustrate our method with two case studies: an analogy task using GloVe embeddings, and zero-shot classification in the CIFAR-100 dataset using CLIP embeddings. Our results show that accuracy in both analogy and zero-shot classification tasks correlates with the embedding similarity. These findings can help explain performance differences in these tasks, and may lead to improved design of paired-embedding models in the future.
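
A simplified sketch of the idea: build the k-nearest-neighbor sets induced by each embedding space and measure how much they agree (mean Jaccard overlap here; the paper's metric is a structural similarity between the induced graphs and is configurable in distance metric and neighborhood size).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sets(X, k):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop the self-neighbor
    return [set(row) for row in idx]

def neighborhood_agreement(X, Y, k=10):
    """Mean Jaccard overlap of each item's k-NN set in the two embedding spaces."""
    a, b = knn_sets(X, k), knn_sets(Y, k)
    return float(np.mean([len(s & t) / len(s | t) for s, t in zip(a, b)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))                                         # embedding space 1
Y = X @ rng.normal(size=(32, 16)) + 0.1 * rng.normal(size=(300, 16))   # paired space 2
print(neighborhood_agreement(X, Y, k=10))
```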

[LG-14] UniMat: Unifying Materials Embeddings through Multi-modal Learning

链接: https://arxiv.org/abs/2411.08664
作者: Janghoon Ock,Joseph Montoya,Daniel Schweigert,Linda Hung,Santosh K. Suram,Weike Ye
关键词-EN: text-based synthesis conditions, Materials science datasets, microscopic images, Materials science, characterization spectra
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Materials science datasets are inherently heterogeneous and are available in different modalities such as characterization spectra, atomic structures, microscopic images, and text-based synthesis conditions. The advancements in multi-modal learning, particularly in vision and language models, have opened new avenues for integrating data in different forms. In this work, we evaluate common techniques in multi-modal learning (alignment and fusion) in unifying some of the most important modalities in materials science: atomic structure, X-ray diffraction patterns (XRD), and composition. We show that structure graph modality can be enhanced by aligning with XRD patterns. Additionally, we show that aligning and fusing more experimentally accessible data formats, such as XRD patterns and compositions, can create more robust joint embeddings than individual modalities across various tasks. This lays the groundwork for future studies aiming to exploit the full potential of multi-modal data in materials science, facilitating more informed decision-making in materials design and discovery.

[LG-15] Accelerating Quasi-Static Time Series Simulations with Foundation Models

链接: https://arxiv.org/abs/2411.08652
作者: Alban Puech,François Mirallès,Jonas Weiss,Vincent Mai,Alexandre Blondin Massé,Martin de Montigny,Thomas Brunschwiler,Hendrik F. Hamann
关键词-EN: Quasi-static time series, Quasi-static time, distributed energy resources, power flow solvers, power flow
类目: Machine Learning (cs.LG)
*备注: Equal contributors: A.P. and F.M.; Lead contact: A.P

点击查看摘要

Abstract:Quasi-static time series (QSTS) simulations have great potential for evaluating the grid’s ability to accommodate the large-scale integration of distributed energy resources. However, as grids expand and operate closer to their limits, iterative power flow solvers, central to QSTS simulations, become computationally prohibitive and face increasing convergence issues. Neural power flow solvers provide a promising alternative, speeding up power flow computations by 3 to 4 orders of magnitude, though they are costly to train. In this paper, we envision how recently introduced grid foundation models could improve the economic viability of neural power flow solvers. Conceptually, these models amortize training costs by serving as a foundation for a range of grid operation and planning tasks beyond power flow solving, with only minimal fine-tuning required. We call for collaboration between the AI and power grid communities to develop and open-source these models, enabling all operators, even those with limited resources, to benefit from AI without building solutions from scratch.

[LG-16] Towards Secure Intelligent O-RAN Architecture: Vulnerabilities, Threats and Promising Technical Solutions using LLMs


链接: https://arxiv.org/abs/2411.08640
作者: Mojdeh Karbalaee Motalleb,Chafika Benzaid,Tarik Taleb,Marcos Katz,Vahid Shah-Mansouri,JaeSeung Song
关键词-EN: open radio access, radio access network, wireless communication systems, enhanced flexibility, services more efficiently
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:The evolution of wireless communication systems will be fundamentally impacted by an open radio access network (O-RAN), a new concept defining an intelligent architecture with enhanced flexibility, openness, and the ability to slice services more efficiently. For all its promises, and like any technological advancement, O-RAN is not without risks that need to be carefully assessed and properly addressed to accelerate its wide adoption in future mobile networks. In this paper, we present an in-depth security analysis of the O-RAN architecture, discussing the potential threats that may arise in the different O-RAN architecture layers and their impact on the Confidentiality, Integrity, and Availability (CIA) triad. We also promote the potential of zero trust, Moving Target Defense (MTD), blockchain, and large language model (LLM) technologies in fortifying O-RAN’s security posture. Furthermore, we numerically demonstrate the effectiveness of MTD in empowering robust deep reinforcement learning methods for dynamic network slice admission control in the O-RAN architecture. Moreover, we examine the effect of explainable AI (XAI) based on LLMs in securing the system.

[LG-17] Gaussian Mixture Models Based Augmentation Enhances GNN Generalization

链接: https://arxiv.org/abs/2411.08638
作者: Yassine Abbahaddou,Fragkiskos D. Malliaros,Johannes F. Lutzeyer,Amine Mohamed Aboussalah,Michalis Vazirgiannis
关键词-EN: Graph Neural Networks, Neural Networks, shown great promise, struggle to generalize, Graph Neural
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have shown great promise in tasks like node and graph classification, but they often struggle to generalize, particularly to unseen or out-of-distribution (OOD) data. These challenges are exacerbated when training data is limited in size or diversity. To address these issues, we introduce a theoretical framework using Rademacher complexity to compute a regret bound on the generalization error and then characterize the effect of data augmentation. This framework informs the design of GMM-GDA, an efficient graph data augmentation (GDA) algorithm leveraging the capability of Gaussian Mixture Models (GMMs) to approximate any distribution. Our approach not only outperforms existing augmentation techniques in terms of generalization but also offers improved time complexity, making it highly suitable for real-world applications.
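The abstract does not spell out exactly what GMM-GDA models and samples, so the sketch below only illustrates the generic idea of augmenting node features with draws from a fitted Gaussian Mixture Model; the function and parameter names are mine, not the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_feature_augment(node_features, n_components=5, n_new=100, seed=0):
    """Fit a GMM to node features and sample synthetic features for augmentation.

    Illustrative only: the actual GMM-GDA algorithm may model different quantities
    (e.g., per-class or per-graph distributions) and integrate with GNN training.
    """
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(node_features)
    synthetic, _ = gmm.sample(n_new)       # draw new feature vectors
    return np.vstack([node_features, synthetic])

# toy usage on random node features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
X_aug = gmm_feature_augment(X, n_components=3, n_new=200)
print(X.shape, "->", X_aug.shape)
```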

[LG-18] Robot See Robot Do: Imitation Reward for Noisy Financial Environments

链接: https://arxiv.org/abs/2411.08637
作者: Sven Goluža,Tomislav Kovačević,Stjepan Begušić,Zvonko Kostanjčar
关键词-EN: asset trading aligns, trading aligns naturally, financial asset trading, sequential nature, nature of decision-making
类目: Machine Learning (cs.LG); Robotics (cs.RO); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:The sequential nature of decision-making in financial asset trading aligns naturally with the reinforcement learning (RL) framework, making RL a common approach in this domain. However, the low signal-to-noise ratio in financial markets results in noisy estimates of environment components, including the reward function, which hinders effective policy learning by RL agents. Given the critical importance of reward function design in RL problems, this paper introduces a novel and more robust reward function by leveraging imitation learning, where a trend labeling algorithm acts as an expert. We integrate imitation (expert’s) feedback with reinforcement (agent’s) feedback in a model-free RL algorithm, effectively embedding the imitation learning problem within the RL paradigm to handle the stochasticity of reward signals. Empirical results demonstrate that this novel approach improves financial performance metrics compared to traditional benchmarks and RL agents trained solely using reinforcement feedback.
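A minimal sketch of the reward-shaping idea follows, assuming a toy trend-labelling "expert" and a long/short position in {-1, +1}. The paper's actual labelling algorithm and the way imitation and reinforcement feedback are combined inside the RL algorithm may differ; this only conveys the shape of the shaped reward.

```python
import numpy as np

def trend_labels(prices, window=5):
    """Toy 'expert': +1 if the forward return over `window` steps is positive, else -1."""
    fwd = np.roll(prices, -window) - prices
    labels = np.where(fwd > 0, 1, -1)
    labels[-window:] = 0                     # no lookahead labels at the end of the series
    return labels

def shaped_reward(position, price_change, expert_label, alpha=0.5):
    """Mix the noisy reinforcement signal (PnL) with an imitation bonus for agreeing with the expert."""
    pnl = position * price_change                           # reinforcement feedback
    imitation = 1.0 if position == expert_label else -1.0   # expert (imitation) feedback
    return (1 - alpha) * pnl + alpha * imitation

# toy usage on a random walk price series
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=100)) + 100
labels = trend_labels(prices)
print(shaped_reward(position=+1, price_change=prices[11] - prices[10], expert_label=labels[10]))
```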

[LG-19] Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval

链接: https://arxiv.org/abs/2411.08590
作者: Saul Santos,Vlad Niculae,Daniel McNamee,André F. T. Martins
关键词-EN: garnered renewed interest, renewed interest due, Associative memory models, Hopfield networks, self-attention in transformers
类目: Machine Learning (cs.LG)
*备注: 49 pages, 14 figures. arXiv admin note: text overlap with arXiv:2402.13725

点击查看摘要

Abstract:Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework, Hopfield-Fenchel-Young networks, which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like \ell_2-normalization and layer normalization, all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.

[LG-20] Grammarization-Based Grasping with Deep Multi-Autoencoder Latent Space Exploration by Reinforcement Learning Agent ICRA2025

链接: https://arxiv.org/abs/2411.08566
作者: Leonidas Askianakis
关键词-EN: material properties, environmental factors, robot in unstructured, deemed a critical, critical challenge
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted for review at IEEE ICRA 2025

点击查看摘要

Abstract:Grasping by a robot in unstructured environments is deemed a critical challenge because of the requirement for effective adaptation to a wide variation in object geometries, material properties, and other environmental factors. In this paper, we propose a novel framework for robotic grasping based on the idea of compressing high-dimensional target and gripper features in a common latent space using a set of autoencoders. Our approach simplifies grasping by using three autoencoders dedicated to the target, the gripper, and a third one that fuses their latent representations. This allows the RL agent to achieve higher learning rates at the initial stages of exploration of a new environment, as well as at non-zero shot grasp attempts. The agent explores the latent space of the third autoencoder for better quality grasp without explicit reconstruction of objects. By implementing the PoWER algorithm into the RL training process, updates on the agent’s policy will be made through the perturbation in the reward-weighted latent space. The successful exploration efficiently constrains both position and pose integrity for feasible executions of grasps. We evaluate our system on a diverse set of objects, demonstrating the high success rate in grasping with minimum computational overhead. We found that our approach enhances the adaptation of the RL agent by more than 35 % in simulation experiments.

[LG-21] Learning Locally Adaptive Metrics that Enhance Structural Representation with LAMINAR NEURIPS2024

链接: https://arxiv.org/abs/2411.08557
作者: Christian Kleiber,William H. Oliver,Tobias Buck
关键词-EN: unsupervised machine learning, machine learning pipeline, learning pipeline designed, more-informative distance metric, unsupervised machine
类目: Machine Learning (cs.LG)
*备注: Accepted to the NeurIPS 2024 Machine Learning and the Physical Sciences workshop. 6 pages, 6 figures

点击查看摘要

Abstract:We present LAMINAR, a novel unsupervised machine learning pipeline designed to enhance the representation of structure within data via producing a more-informative distance metric. Analysis methods in the physical sciences often rely on standard metrics to define geometric relationships in data, which may fail to capture the underlying structure of complex data sets. LAMINAR addresses this by using a continuous-normalising-flow and inverse-transform-sampling to define a Riemannian manifold in the data space without the need for the user to specify a metric over the data a-priori. The result is a locally-adaptive-metric that produces structurally-informative density-based distances. We demonstrate the utility of LAMINAR by comparing its output to the Euclidean metric for structured data sets.

[LG-22] Graph Neural Networks in Supply Chain Analytics and Optimization: Concepts Perspectives Dataset and Benchmarks

链接: https://arxiv.org/abs/2411.08550
作者: Azmine Toushik Wasi,MD Shafikul Islam,Adipto Raihan Akib,Mahathir Mohammad Bappy
关键词-EN: Graph Neural Networks, Neural Networks, management remains limited, recently gained traction, chain management remains
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
*备注: 27 Pages. Extended journal version of SupplyGraph ( arXiv:2401.15299 ). In Review

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have recently gained traction in transportation, bioinformatics, language and image processing, but research on their application to supply chain management remains limited. Supply chains are inherently graph-like, making them ideal for GNN methodologies, which can optimize and solve complex problems. The barriers include a lack of proper conceptual foundations, familiarity with graph applications in SCM, and real-world benchmark datasets for GNN-based supply chain research. To address this, we discuss and connect supply chains with graph structures for effective GNN application, providing detailed formulations, examples, mathematical definitions, and task guidelines. Additionally, we present a multi-perspective real-world benchmark dataset from a leading FMCG company in Bangladesh, focusing on supply chain planning. We discuss various supply chain tasks using GNNs and benchmark several state-of-the-art models on homogeneous and heterogeneous graphs across six supply chain analytics tasks. Our analysis shows that GNN-based models consistently outperform statistical Machine Learning and other Deep Learning models by around 10-30% in regression, 10-30% in classification and detection tasks, and 15-40% in anomaly detection tasks on designated metrics. With this work, we lay the groundwork for solving supply chain problems using GNNs, supported by conceptual discussions, methodological insights, and a comprehensive dataset.

[LG-23] Properties of fairness measures in the context of varying class imbalance and protected group ratios

链接: https://arxiv.org/abs/2411.08425
作者: Dariusz Brzezinski,Julia Stachowiak,Jerzy Stefanowski,Izabela Szczech,Robert Susmaga,Sofya Aksenyuk,Uladzimir Ivashka,Oleksandr Yasinskyi
关键词-EN: credit risk management, Society is increasingly, fairness measures, criminal justice, credit risk
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Society is increasingly relying on predictive models in fields like criminal justice, credit risk management, or hiring. To prevent such automated systems from discriminating against people belonging to certain groups, fairness measures have become a crucial component in socially relevant applications of machine learning. However, existing fairness measures have been designed to assess the bias between predictions for protected groups without considering the imbalance in the classes of the target variable. Current research on the potential effect of class imbalance on fairness focuses on practical applications rather than dataset-independent measure properties. In this paper, we study the general properties of fairness measures for changing class and protected group proportions. For this purpose, we analyze the probability mass functions of six of the most popular group fairness measures. We also measure how the probability of achieving perfect fairness changes for varying class imbalance ratios. Moreover, we relate the dataset-independent properties of fairness measures described in this paper to classifier fairness in real-life tasks. Our results show that measures such as Equal Opportunity and Positive Predictive Parity are more sensitive to changes in class imbalance than Accuracy Equality. These findings can help guide researchers and practitioners in choosing the most appropriate fairness measures for their classification problems.
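For reference, the measures contrasted in the abstract reduce to comparing simple per-group rates: Equal Opportunity compares true positive rates, Positive Predictive Parity compares precision, and Accuracy Equality compares accuracy. The sketch below uses these standard definitions on toy data with a strong class imbalance in one group; the data and classifier are illustrative only.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group TPR (Equal Opportunity), PPV (Positive Predictive Parity) and accuracy (Accuracy Equality)."""
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tp = np.sum((yt == 1) & (yp == 1))
        fn = np.sum((yt == 1) & (yp == 0))
        fp = np.sum((yt == 0) & (yp == 1))
        tpr = tp / (tp + fn) if (tp + fn) else np.nan
        ppv = tp / (tp + fp) if (tp + fp) else np.nan
        acc = float(np.mean(yt == yp))
        out[g] = {"TPR": tpr, "PPV": ppv, "ACC": acc}
    return out

# toy imbalanced data: group B has far fewer positives than group A
rng = np.random.default_rng(0)
group = np.array(["A"] * 500 + ["B"] * 500)
y_true = np.concatenate([rng.binomial(1, 0.5, 500), rng.binomial(1, 0.05, 500)])
y_pred = np.where(rng.random(1000) < 0.9, y_true, 1 - y_true)   # a noisy classifier
for g, r in group_rates(y_true, y_pred, group).items():
    print(g, r)
```

With the same error rate in both groups, the TPR- and PPV-based measures move much more under the imbalance than accuracy, which is the sensitivity the paper analyses formally.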

[LG-24] Federated Graph Learning with Graphless Clients

链接: https://arxiv.org/abs/2411.08374
作者: Xingbo Fu,Song Wang,Yushun Dong,Binchi Zhang,Chen Chen,Jundong Li
关键词-EN: Graph Neural Networks, Neural Networks, Federated Graph Learning, machine learning models, training machine learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Federated Graph Learning (FGL) is tasked with training machine learning models, such as Graph Neural Networks (GNNs), for multiple clients, each with its own graph data. Existing methods usually assume that each client has both node features and graph structure of its graph data. In real-world scenarios, however, there exist federated systems where only a part of the clients have such data while other clients (i.e. graphless clients) may only have node features. This naturally leads to a novel problem in FGL: how to jointly train a model over distributed graph data with graphless clients? In this paper, we propose a novel framework FedGLS to tackle the problem in FGL with graphless clients. In FedGLS, we devise a local graph learner on each graphless client which learns the local graph structure with the structure knowledge transferred from other clients. To enable structure knowledge transfer, we design a GNN model and a feature encoder on each client. During local training, the feature encoder retains the local graph structure knowledge together with the GNN model via knowledge distillation, and the structure knowledge is transferred among clients in global update. Our extensive experiments demonstrate the superiority of the proposed FedGLS over five baselines.

[LG-25] Coverage Analysis for Digital Cousin Selection – Improving Multi-Environment Q-Learning

链接: https://arxiv.org/abs/2411.08360
作者: Talha Bozkus,Tara Javidi,Urbashi Mitra
关键词-EN: unknown system dynamics, MEMQ algorithms, MEMQ, Q-learning algorithms, Q-learning
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Q-learning is widely employed for optimizing various large-dimensional networks with unknown system dynamics. Recent advancements include multi-environment mixed Q-learning (MEMQ) algorithms, which utilize multiple independent Q-learning algorithms across multiple, structurally related but distinct environments and outperform several state-of-the-art Q-learning algorithms in terms of accuracy, complexity, and robustness. We herein conduct a comprehensive probabilistic coverage analysis to ensure optimal data coverage conditions for MEMQ algorithms. First, we derive upper and lower bounds on the expectation and variance of different coverage coefficients (CC) for MEMQ algorithms. Leveraging these bounds, we develop a simple way of comparing the utilities of multiple environments in MEMQ algorithms. This approach appears to be near optimal versus our previously proposed partial ordering approach. We also present a novel CC-based MEMQ algorithm to improve the accuracy and complexity of existing MEMQ algorithms. Numerical experiments are conducted using random network graphs with four different graph properties. Our algorithm can reduce the average policy error (APE) by 65% compared to partial ordering and is 95% faster than the exhaustive search. It also achieves 60% less APE than several state-of-the-art reinforcement learning and prior MEMQ algorithms. Additionally, we numerically verify the theoretical results and show their scalability with the action-space size.

[LG-26] Learning-Augmented Algorithms for Online Concave Packing and Convex Covering Problems

链接: https://arxiv.org/abs/2411.08332
作者: Elena Grigorescu,Young-San Lin,Maoyuan Song
关键词-EN: computer science community, provide additional information, augment classical algorithms, machine learning predictors, online
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 38 pages. In submission

点击查看摘要

Abstract:Learning-augmented algorithms have been extensively studied across the computer science community in the recent years, driven by advances in machine learning predictors, which can provide additional information to augment classical algorithms. Such predictions are especially powerful in the context of online problems, where decisions have to be made without knowledge of the future, and which traditionally exhibits impossibility results bounding the performance of any online algorithm. The study of learning-augmented algorithms thus aims to use external advice prudently, to overcome classical impossibility results when the advice is accurate, and still perform comparably to the state-of-the-art online algorithms even when the advice is inaccurate. In this paper, we present learning-augmented algorithmic frameworks for two fundamental optimizations settings, extending and generalizing prior works. For online packing with concave objectives, we present a simple but overarching strategy that switches between the advice and the state-of-the-art online algorithm. For online covering with convex objectives, we greatly extend primal-dual methods for online convex covering programs by Azar et al. (FOCS 2016) and previous learning-augmented framework for online covering linear programs from the literature, to many new applications. We show that our algorithms break impossibility results when the advice is accurate, while maintaining comparable performance with state-of-the-art classical online algorithms even when the advice is erroneous.

[LG-27] Neural Conjugate Flows: Physics-informed architectures with flow structure

链接: https://arxiv.org/abs/2411.08326
作者: Arthur Bizzi,Lucas Nissenbaum,João M. Pereira
关键词-EN: introduce Neural Conjugate, Neural Conjugate Flows, Neural Conjugate, Conjugate Flows, NCF
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Neural Conjugate Flows (NCF), a class of neural network architectures equipped with exact flow structure. By leveraging topological conjugation, we prove that these networks are not only naturally isomorphic to a continuous group, but are also universal approximators for flows of ordinary differential equation (ODEs). Furthermore, topological properties of these flows can be enforced by the architecture in an interpretable manner. We demonstrate in numerical experiments how this topological group structure leads to concrete computational gains over other physics informed neural networks in estimating and extrapolating latent dynamics of ODEs, while training up to five times faster than other flow-based architectures.
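The core structural idea, topological conjugation, can be shown in a few lines: wrap an exactly-solvable inner flow between a map φ and its inverse, and the composite automatically inherits the flow (semigroup) property. The sketch below uses a toy affine φ and a linear ODE flow; in the paper both maps are neural networks, so this is only a conceptual illustration.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d)) + 3 * np.eye(d)     # toy conjugating map, assumed invertible
b = rng.normal(size=d)
M = 0.3 * rng.normal(size=(d, d))               # generator of the inner linear flow

phi = lambda x: A @ x + b
phi_inv = lambda y: np.linalg.solve(A, y - b)
inner_flow = lambda t, z: expm(t * M) @ z        # exact flow of dz/dt = M z

def h(t, x):
    """Conjugate flow: h_t = phi^{-1} o Phi_t o phi."""
    return phi_inv(inner_flow(t, phi(x)))

# the conjugated map inherits the flow property: h_{t+s} = h_t o h_s
x = rng.normal(size=d)
print(np.allclose(h(0.7 + 0.4, x), h(0.7, h(0.4, x))))   # True up to numerical error
```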

[LG-28] Conditional Variable Flow Matching: Transforming Conditional Densities with Amortized Conditional Optimal Transport

链接: https://arxiv.org/abs/2411.08314
作者: Adam P. Generale,Andreas E. Robertson,Surya R. Kalidindi
关键词-EN: Forecasting stochastic nonlinear, stochastic nonlinear dynamical, fundamental challenge repeatedly, challenge repeatedly encountered, Forecasting stochastic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Forecasting stochastic nonlinear dynamical systems under the influence of conditioning variables is a fundamental challenge repeatedly encountered across the biological and physical sciences. While flow-based models can impressively predict the temporal evolution of probability distributions representing possible outcomes of a specific process, existing frameworks cannot satisfactorily account for the impact of conditioning variables on these dynamics. Amongst several limitations, existing methods require training data with paired conditions and are developed for discrete conditioning variables. We propose Conditional Variable Flow Matching (CVFM), a framework for learning flows transforming conditional distributions with amortization across continuous conditioning variables - permitting predictions across the conditional density manifold. This is accomplished through several novel advances, in particular, simultaneous sample conditioned flows over the main and conditioning variables, alongside a conditional Wasserstein distance and kernel facilitating conditional optimal transport. Collectively, these advances allow for learning system dynamics provided measurement data whose states and conditioning variables are not in correspondence. We demonstrate CVFM on a suite of increasingly challenging problems, including discrete and continuous conditional mapping benchmarks, image-to-image domain transfer, and modeling the temporal evolution of materials internal structure during manufacturing processes. We observe that CVFM results in improved performance and convergence characteristics over alternative conditional variants.

[LG-29] SDDBench: A Benchmark for Synthesizable Drug Design

链接: https://arxiv.org/abs/2411.08306
作者: Songtao Liu,Zhengkai Tu,Hanjun Dai,Peng Liu
关键词-EN: wet lab experiments, current drug design, wet lab, lab experiments, experiments with current
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:A significant challenge in wet lab experiments with current drug design generative models is the trade-off between pharmacological properties and synthesizability. Molecules predicted to have highly desirable properties are often difficult to synthesize, while those that are easily synthesizable tend to exhibit less favorable properties. As a result, evaluating the synthesizability of molecules in general drug design scenarios remains a significant challenge in the field of drug discovery. The commonly used synthetic accessibility (SA) score aims to evaluate the ease of synthesizing generated molecules, but it falls short of guaranteeing that synthetic routes can actually be found. Inspired by recent advances in top-down synthetic route generation, we propose a new, data-driven metric to evaluate molecule synthesizability. Our approach directly assesses the feasibility of synthetic routes for a given molecule through our proposed round-trip score. This novel metric leverages the synergistic duality between retrosynthetic planners and reaction predictors, both of which are trained on extensive reaction datasets. To demonstrate the efficacy of our method, we conduct a comprehensive evaluation of round-trip scores alongside search success rate across a range of representative molecule generative models. Code is available at this https URL.

[LG-30] Least Squares Training of Quadratic Convolutional Neural Networks with Applications to System Theory

链接: https://arxiv.org/abs/2411.08267
作者: Zachary Yetman Van Egmond,Luis Rodrigues
关键词-EN: quadratic activation functions, convolutional neural network, loss function, activation functions, convolutional neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper provides a least squares formulation for the training of a 2-layer convolutional neural network using quadratic activation functions, a 2-norm loss function, and no regularization term. Using this method, an analytic expression for the globally optimal weights is obtained alongside a quadratic input-output equation for the network. These properties make the network a viable tool in system theory by enabling further analysis, such as the sensitivity of the output to perturbations in the input, which is crucial for safety-critical systems such as aircraft or autonomous vehicles. The least squares method is compared to previously proposed strategies for training quadratic networks and to a back-propagation-trained ReLU network. The proposed method is applied to a system identification problem and a GPS position estimation problem. The least squares network is shown to have a significantly reduced training time with minimal compromises on prediction accuracy alongside the advantages of having an analytic input-output equation. Although these results only apply to 2-layer networks, this paper motivates the exploration of deeper quadratic networks in the context of system theory.
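The key fact that makes a least-squares solution possible is that a network with quadratic activations is linear in a lifted space of quadratic monomials of its input, so its optimal weights solve an ordinary least-squares problem. The sketch below shows that lifting for a plain (non-convolutional) input; the paper's derivation is specific to its 2-layer convolutional architecture, so treat this as a toy analogue.

```python
import numpy as np

def quadratic_features(X):
    """Lift inputs to [1, x_i, x_i * x_j] monomials; a quadratic-activation network's
    output is linear in this lifted space, which is what enables a closed-form fit."""
    n, d = X.shape
    cols = [np.ones((n, 1)), X]
    for i in range(d):
        for j in range(i, d):
            cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)

# toy regression target that is genuinely quadratic in the input
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 2.0 + X @ rng.normal(size=5) + np.sum((X @ rng.normal(size=(5, 3))) ** 2, axis=1)

Phi = quadratic_features(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # global optimum of the 2-norm loss
print(np.mean((Phi @ w - y) ** 2))            # near-zero training error
```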

[LG-31] NVCiM-PT: An NVCiM-assisted Prompt Tuning Framework for Edge LLMs DATE2025

链接: https://arxiv.org/abs/2411.08244
作者: Ruiyang Qin,Pengyu Ren,Zheyu Yan,Liu Liu,Dancheng Liu,Amir Nassereldine,Jinjun Xiong,Kai Ni,Sharon Hu,Yiyu Shi
关键词-EN: Large Language Models, Large Language, Language Models, edge LLMs, edge
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注: Accepted by DATE 2025

点击查看摘要

Abstract:Large Language Models (LLMs) deployed on edge devices, known as edge LLMs, need to continuously fine-tune their model parameters from user-generated data under limited resource constraints. However, most existing learning methods are not applicable for edge LLMs because of their reliance on high resources and low learning capacity. Prompt tuning (PT) has recently emerged as an effective fine-tuning method for edge LLMs by only modifying a small portion of LLM parameters, but it suffers from user domain shifts, resulting in repetitive training and losing resource efficiency. Conventional techniques to address domain shift issues often involve complex neural networks and sophisticated training, which are incompatible for PT for edge LLMs. Therefore, an open research question is how to address domain shift issues for edge LLMs with limited resources. In this paper, we propose a prompt tuning framework for edge LLMs, exploiting the benefits offered by non-volatile computing-in-memory (NVCiM) architectures. We introduce a novel NVCiM-assisted PT framework, where we narrow down the core operations to matrix-matrix multiplication, which can then be accelerated by performing in-situ computation on NVCiM. To the best of our knowledge, this is the first work employing NVCiM to improve the edge LLM PT performance.

[LG-32] Imitation Learning from Observations: An Autoregressive Mixture of Experts Approach

链接: https://arxiv.org/abs/2411.08232
作者: Renzi Wang,Flavia Sofia Acerbo,Tong Duy Son,Panagiotis Patrinos
关键词-EN: paper presents, approach to imitation, autoregressive mixture, mixture of experts, deployed to fit
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach to imitation learning from observations, where an autoregressive mixture of experts model is deployed to fit the underlying policy. The parameters of the model are learned via a two-stage framework. By leveraging the existing dynamics knowledge, the first stage of the framework estimates the control input sequences and hence reduces the problem complexity. At the second stage, the policy is learned by solving a regularized maximum-likelihood estimation problem using the estimated control input sequences. We further extend the learning procedure by incorporating a Lyapunov stability constraint to ensure asymptotic stability of the identified model, for accurate multi-step predictions. The effectiveness of the proposed framework is validated using two autonomous driving datasets collected from human demonstrations, demonstrating its practical applicability in modelling complex nonlinear dynamics.

[LG-33] Joint Diffusion models in Continual Learning

链接: https://arxiv.org/abs/2411.08224
作者: Paweł Skierś,Kamil Deja
关键词-EN: introduce JDCL, joint diffusion models, generative rehearsal based, based on joint, joint diffusion
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we introduce JDCL - a new method for continual learning with generative rehearsal based on joint diffusion models. Neural networks suffer from catastrophic forgetting defined as abrupt loss in the model’s performance when retrained with additional data coming from a different distribution. Generative-replay-based continual learning methods try to mitigate this issue by retraining a model with a combination of new and rehearsal data sampled from a generative model. In this work, we propose to extend this idea by combining a continually trained classifier with a diffusion-based generative model into a single - jointly optimized neural network. We show that such shared parametrization, combined with the knowledge distillation technique allows for stable adaptation to new tasks without catastrophic forgetting. We evaluate our approach on several benchmarks, where it outperforms recent state-of-the-art generative replay techniques. Additionally, we extend our method to the semi-supervised continual learning setup, where it outperforms competing buffer-based replay techniques, and evaluate, in a self-supervised manner, the quality of trained representations.

[LG-34] Fault Localization in Deep Learning-based Software: A System-level Approach

链接: https://arxiv.org/abs/2411.08172
作者: Mohammad Mehdi Morovati,Amin Nikanjam,Foutse Khomh
关键词-EN: Deep Learning, past decade, daily lives, integral part, fault localization
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over the past decade, Deep Learning (DL) has become an integral part of our daily lives. This surge in DL usage has heightened the need for developing reliable DL software systems. Given that fault localization is a critical task in reliability assessment, researchers have proposed several fault localization techniques for DL-based software, primarily focusing on faults within the DL model. While the DL model is central to DL components, there are other elements that significantly impact the performance of DL components. As a result, fault localization methods that concentrate solely on the DL model overlook a large portion of the system. To address this, we introduce FL4Deep, a system-level fault localization approach considering the entire DL development pipeline to effectively localize faults across the DL-based systems. In an evaluation using 100 faulty DL scripts, FL4Deep outperformed four previous approaches in terms of accuracy for three out of six DL-related faults, including issues related to data (84%), mismatched libraries between training and deployment (100%), and loss function (69%). Additionally, FL4Deep demonstrated superior precision and recall in fault localization for five categories of faults including three mentioned fault types in terms of accuracy, plus insufficient training iteration and activation function.

[LG-35] Multi-Agent Stochastic Bandits Robust to Adversarial Corruptions

链接: https://arxiv.org/abs/2411.08167
作者: Fatemeh Ghaffari,Xuchuang Wang,Jinhang Zuo,Mohammad Hajiesmaili
关键词-EN: multi-agent multi-armed bandits, study the problem, multi-armed bandits, accesses a subset, agents
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of multi-agent multi-armed bandits with adversarial corruption in a heterogeneous setting, where each agent accesses a subset of arms. The adversary can corrupt the reward observations for all agents. Agents share these corrupted rewards with each other, and the objective is to maximize the cumulative total reward of all agents (and not be misled by the adversary). We propose a multi-agent cooperative learning algorithm that is robust to adversarial corruptions. For this newly devised algorithm, we demonstrate that an adversary with an unknown corruption budget C only incurs an additive O((L / L_\min) C) term to the standard regret of the model in non-corruption settings, where L is the total number of agents, and L_\min is the minimum number of agents with mutual access to an arm. As a side-product, our algorithm also improves the state-of-the-art regret bounds when reducing to both the single-agent and homogeneous multi-agent scenarios, tightening multiplicative K (the number of arms) and L (the number of agents) factors, respectively.

[LG-36] Tackling Polysemanticity with Neuron Embeddings

链接: https://arxiv.org/abs/2411.08166
作者: Alex Foote
关键词-EN: distinct semantic behaviours, making downstream manual, neuron characteristic dataset, interpretation much easier, identifying the distinct
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron’s characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model’s internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model’s actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).

[LG-37] Impactful Bit-Flip Search on Full-precision Models

链接: https://arxiv.org/abs/2411.08133
作者: Nadav Benedek,Matan Levy,Mahmood Sharif
关键词-EN: shown remarkable performance, model parameters, Neural networks, shown remarkable, remarkable performance
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Neural networks have shown remarkable performance in various tasks, yet they remain susceptible to subtle changes in their input or model parameters. One particularly impactful vulnerability arises through the Bit-Flip Attack (BFA), where flipping a small number of critical bits in a model’s parameters can severely degrade its performance. A common technique for inducing bit flips in DRAM is the Row-Hammer attack, which exploits frequent uncached memory accesses to alter data. Identifying susceptible bits can be achieved through exhaustive search or progressive layer-by-layer analysis, especially in quantized networks. In this work, we introduce Impactful Bit-Flip Search (IBS), a novel method for efficiently pinpointing and flipping critical bits in full-precision networks. Additionally, we propose a Weight-Stealth technique that strategically modifies the model’s parameters in a way that maintains the float values within the original distribution, thereby bypassing simple range checks often used in tamper detection.
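To make the attack surface concrete, the sketch below flips a single exponent bit of a float32 weight through its raw bit pattern and shows the resulting output drift, plus the kind of naive range check that a stealthy modification would need to evade. This is a generic illustration of a bit flip, not the IBS search procedure or the Weight-Stealth technique themselves.

```python
import numpy as np

def flip_bit(value, bit):
    """Flip one bit (0 = least significant, 31 = sign) of a float32 via its raw bit pattern."""
    arr = np.array([value], dtype=np.float32)
    arr.view(np.uint32)[0] ^= np.uint32(1 << bit)
    return arr[0]

# toy "model": a single linear layer
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=8).astype(np.float32)
baseline = W @ x
lo, hi = W.min(), W.max()                 # weight range before tampering

w_orig = W[0, 0]
W[0, 0] = flip_bit(w_orig, bit=30)        # flipping a high exponent bit is usually catastrophic
print("weight:", w_orig, "->", W[0, 0])
print("max output drift:", np.abs(W @ x - baseline).max())
print("passes naive range check:", bool(lo <= W[0, 0] <= hi))
```

The range-check failure above is exactly the detection signal that a value-preserving modification strategy tries to avoid, by keeping tampered weights inside the original distribution.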

[LG-38] Intelligent Green Efficiency for Intrusion Detection

链接: https://arxiv.org/abs/2411.08069
作者: Pedro Pereira,Paulo Mendes,João Vitorino,Eva Maia,Isabel Praça
关键词-EN: recording great progress, Artificial Intelligence, popularity recently, recording great, emerged in popularity
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 16 pages, 9 tables, FPS 2024 conference

点击查看摘要

Abstract:Artificial Intelligence (AI) has emerged in popularity recently, recording great progress in various industries. However, the environmental impact of AI is a growing concern, in terms of the energy consumption and carbon footprint of Machine Learning (ML) and Deep Learning (DL) models, making it essential to investigate Green AI, an attempt to reduce the climate impact of AI systems. This paper presents an assessment of different programming languages and Feature Selection (FS) methods to improve computation performance of AI focusing on Network Intrusion Detection (NID) and cyber-attack classification tasks. Experiments were conducted using five ML models - Random Forest, XGBoost, LightGBM, Multi-Layer Perceptron, and Long Short-Term Memory - implemented in four programming languages - Python, Java, R, and Rust - along with three FS methods - Information Gain, Recursive Feature Elimination, and Chi-Square. The obtained results demonstrated that FS plays an important role in enhancing the computational efficiency of AI models without compromising detection accuracy, highlighting languages like Python and R, which benefit from a rich AI libraries environment. These conclusions can be useful to design efficient and sustainable AI systems that still provide a good generalization and a reliable detection.
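As a Python stand-in for one of the language/feature-selection combinations studied, the sketch below applies scikit-learn's mutual-information selector (an information-gain analogue) before a Random Forest and compares accuracy and fit time with and without selection; the paper's experiments use real intrusion-detection datasets rather than the synthetic data here.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for a network-intrusion dataset
X, y = make_classification(n_samples=5000, n_features=60, n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def fit_and_score(X_train, X_test):
    clf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
    t0 = time.perf_counter()
    clf.fit(X_train, y_tr)
    elapsed = time.perf_counter() - t0
    return accuracy_score(y_te, clf.predict(X_test)), elapsed

acc_all, t_all = fit_and_score(X_tr, X_te)                       # all 60 features
selector = SelectKBest(mutual_info_classif, k=15).fit(X_tr, y_tr)
acc_fs, t_fs = fit_and_score(selector.transform(X_tr), selector.transform(X_te))

print(f"all features : acc={acc_all:.3f}, fit time={t_all:.2f}s")
print(f"top-15 by MI : acc={acc_fs:.3f}, fit time={t_fs:.2f}s")
```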

[LG-39] Equitable Length of Stay Prediction for Patients with Learning Disabilities and Multiple Long-term Conditions Using Machine Learning

链接: https://arxiv.org/abs/2411.08048
作者: Emeka Abakasanga,Rania Kousovista,Georgina Cosma,Ashley Akbari,Francesco Zaccardi,Navjot Kaur,Danielle Fitt,Gyuchan Thomas Jun,Reza Kiani,Satheesh Gangadharan
关键词-EN: premature deaths compared, higher mortality rate, machine learning models, learning disabilities, general public
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 13 pages of article with 9 figures; Supplementary material follows after article with 27 pages and 27 figures

点击查看摘要

Abstract:People with learning disabilities have a higher mortality rate and premature deaths compared to the general public, as reported in published research in the UK and other countries. This study analyses hospitalisations of 9,618 patients identified with learning disabilities and long-term conditions for the population of Wales using electronic health record (EHR) data sources from the SAIL Databank. We describe the demographic characteristics, prevalence of long-term conditions, medication history, hospital visits, and lifestyle history for our study cohort, and apply machine learning models to predict the length of hospital stays for this cohort. The random forest (RF) model achieved an Area Under the Curve (AUC) of 0.759 (males) and 0.756 (females), a false negative rate of 0.224 (males) and 0.229 (females), and a balanced accuracy of 0.690 (males) and 0.689 (females). After examining model performance across ethnic groups, two bias mitigation algorithms (threshold optimization and the reductions algorithm using an exponentiated gradient) were applied to minimise performance discrepancies. The threshold optimizer algorithm outperformed the reductions algorithm, achieving lower ranges in false positive rate and balanced accuracy for the male cohort across the ethnic groups. This study demonstrates the potential of applying machine learning models with effective bias mitigation approaches on EHR data sources to enable equitable prediction of hospital stays by addressing data imbalances across groups.
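A toy version of the threshold-optimization idea used above is sketched below: pick one decision threshold per group so that each group reaches (approximately) the same true positive rate. The study relied on established bias-mitigation implementations applied across ethnic groups; this sketch only conveys the mechanism, and all names and numbers are illustrative.

```python
import numpy as np

def per_group_thresholds(scores, y_true, group, target_tpr=0.8):
    """One threshold per group so each group's TPR is close to a common target
    (a simplified form of threshold-optimizer style post-processing)."""
    thresholds = {}
    for g in np.unique(group):
        pos = np.sort(scores[(group == g) & (y_true == 1)])
        thresholds[g] = np.quantile(pos, 1 - target_tpr)   # (1 - target) quantile of positive scores
    return thresholds

# toy scores where one group's classifier is systematically weaker
rng = np.random.default_rng(0)
n = 2000
group = rng.choice(["A", "B"], size=n)
y_true = rng.binomial(1, 0.4, size=n)
scores = 0.5 * y_true + rng.normal(0, 0.35, size=n) + np.where(group == "B", -0.15, 0.0)

for g, t in per_group_thresholds(scores, y_true, group, target_tpr=0.8).items():
    m = group == g
    tpr = np.mean(scores[m & (y_true == 1)] >= t)
    print(g, "threshold:", round(float(t), 3), "TPR:", round(float(tpr), 3))
```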

[LG-40] Oblique Bayesian additive regression trees

链接: https://arxiv.org/abs/2411.08849
作者: Paul-Hieu V. Nguyen,Ryan Yee,Sameer K. Deshpande
关键词-EN: Bayesian Additive Regression, Additive Regression Trees, Bayesian Additive, Additive Regression, Regression Trees
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current implementations of Bayesian Additive Regression Trees (BART) are based on axis-aligned decision rules that recursively partition the feature space using a single feature at a time. Several authors have demonstrated that oblique trees, whose decision rules are based on linear combinations of features, can sometimes yield better predictions than axis-aligned trees and exhibit excellent theoretical properties. We develop an oblique version of BART that leverages a data-adaptive decision rule prior that recursively partitions the feature space along random hyperplanes. Using several synthetic and real-world benchmark datasets, we systematically compared our oblique BART implementation to axis-aligned BART and other tree ensemble methods, finding that oblique BART was competitive with – and sometimes much better than – those methods.

[LG-41] Model agnostic local variable importance for locally dependent relationships

链接: https://arxiv.org/abs/2411.08821
作者: Kelvyn K. Bladen,Adele Cutler,D. Richard Cutler,Kevin R. Moon
关键词-EN: learning model results, interpret machine learning, machine learning model, Global variable importance, Global variable
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Global variable importance measures are commonly used to interpret machine learning model results. Local variable importance techniques assess how variables contribute to individual observations rather than the entire dataset. Current methods typically fail to accurately reflect locally dependent relationships between variables and instead focus on marginal importance values. Additionally, they are not natively adapted for multi-class classification problems. We propose a new model-agnostic method for calculating local variable importance, CLIQUE, that captures locally dependent relationships, contains improvements over permutation-based methods, and can be directly applied to multi-class classification problems. Simulated and real-world examples show that CLIQUE emphasizes locally dependent information and properly reduces bias in regions where variables do not affect the response.

[LG-42] FinRobot: AI Agent for Equity Research and Valuation with Large Language Models

链接: https://arxiv.org/abs/2411.08804
作者: Tianyu Zhou,Pinqiao Wang,Yilin Wu,Hongyang Yang
关键词-EN: grow increasingly complex, effectively assist human, markets grow increasingly, increasingly complex, grow increasingly
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
*备注: The 1st Workshop on LLMs and Generative AI for Finance, ICAIF 2024

点击查看摘要

Abstract:As financial markets grow increasingly complex, there is a rising need for automated tools that can effectively assist human analysts in equity research, particularly within sell-side research. While Generative AI (GenAI) has attracted significant attention in this field, existing AI solutions often fall short due to their narrow focus on technical factors and limited capacity for discretionary judgment. These limitations hinder their ability to adapt to new data in real-time and accurately assess risks, which diminishes their practical value for investors. This paper presents FinRobot, the first AI agent framework specifically designed for equity research. FinRobot employs a multi-agent Chain of Thought (CoT) system, integrating both quantitative and qualitative analyses to emulate the comprehensive reasoning of a human analyst. The system is structured around three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for robust financial integration; the Concept-CoT Agent, which mimics an analyst’s reasoning to generate actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a coherent investment thesis and report. FinRobot provides thorough company analysis supported by precise numerical data, industry-appropriate valuation metrics, and realistic risk assessments. Its dynamically updatable data pipeline ensures that research remains timely and relevant, adapting seamlessly to new financial information. Unlike existing automated research tools, such as CapitalCube and Wright Reports, FinRobot delivers insights comparable to those produced by major brokerage firms and fundamental research vendors. We open-source FinRobot at https://github.com/AI4Finance-Foundation/FinRobot.

[LG-43] Deep Learning Accelerated Quantum Transport Simulations in Nanoelectronics: From Break Junctions to Field-Effect Transistors

链接: https://arxiv.org/abs/2411.08800
作者: Jijie Zou,Zhanghao Zhouyin,Dongying Lin,Linfeng Zhang,Shimin Hou,Qiangqiang Gu
关键词-EN: Quantum transport calculations, designing nanoelectronic devices, Quantum transport, non-equilibrium Green Function, efficient quantum transport
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Quantum transport calculations are essential for understanding and designing nanoelectronic devices, yet the trade-off between accuracy and computational efficiency has long limited their practical applications. We present a general framework that combines the deep learning tight-binding Hamiltonian (DeePTB) approach with the non-equilibrium Green’s Function (NEGF) method, enabling efficient quantum transport calculations while maintaining first-principles accuracy. We demonstrate the capabilities of the DeePTB-NEGF framework through two representative applications: comprehensive simulation of break junction systems, where conductance histograms show good agreement with experimental measurements in both metallic contact and single-molecule junction cases; and simulation of carbon nanotube field effect transistors through self-consistent NEGF-Poisson calculations, capturing essential physics including the electrostatic potential and transfer characteristic curves under finite bias conditions. This framework bridges the gap between first-principles accuracy and computational efficiency, providing a powerful tool for high-throughput quantum transport simulations across different scales in nanoelectronics.

[LG-44] Deep Generative Demand Learning for Newsvendor and Pricing

链接: https://arxiv.org/abs/2411.08631
作者: Shijin Gong,Huihang Liu,Xinyu Zhang
关键词-EN: feature-based newsvendor problem, structural assumptions, stochastic optimization problem, conditional stochastic optimization, contextual features
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 30 pages, 6 figures

点击查看摘要

Abstract:We consider data-driven inventory and pricing decisions in the feature-based newsvendor problem, where demand is influenced by both price and contextual features and is modeled without any structural assumptions. The unknown demand distribution results in a challenging conditional stochastic optimization problem, further complicated by decision-dependent uncertainty and the integration of features. Inspired by recent advances in deep generative learning, we propose a novel approach leveraging conditional deep generative models (cDGMs) to address these challenges. cDGMs learn the demand distribution and generate probabilistic demand forecasts conditioned on price and features. This generative approach enables accurate profit estimation and supports the design of algorithms for two key objectives: (1) optimizing inventory for arbitrary prices, and (2) jointly determining optimal pricing and inventory levels. We provide theoretical guarantees for our approach, including the consistency of profit estimation and convergence of our decisions to the optimal solution. Extensive simulations-ranging from simple to complex scenarios, including one involving textual features-and a real-world case study demonstrate the effectiveness of our approach. Our method opens a new paradigm in management science and operations research, is adaptable to extensions of the newsvendor and pricing problems, and holds potential for solving other conditional stochastic optimization problems.
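Once a conditional generator can produce demand samples for a given price and feature vector, the inventory decision itself reduces to a critical-ratio quantile of those samples. The sketch below replaces the cDGM with a fixed sampler to show how generated samples feed the decision; the cost parameters and the Gamma demand are arbitrary placeholders.

```python
import numpy as np

def newsvendor_order(demand_samples, unit_cost=4.0, price=10.0, salvage=1.0):
    """Profit-maximizing order quantity given demand samples: the critical-ratio quantile."""
    cu = price - unit_cost          # underage cost: margin lost per unit of unmet demand
    co = unit_cost - salvage        # overage cost: loss per unsold unit
    return np.quantile(demand_samples, cu / (cu + co))

def expected_profit(q, demand_samples, unit_cost=4.0, price=10.0, salvage=1.0):
    sales = np.minimum(demand_samples, q)
    leftover = np.maximum(q - demand_samples, 0)
    return float(np.mean(price * sales + salvage * leftover - unit_cost * q))

# stand-in for cDGM output: demand samples conditioned on a (price, features) query
rng = np.random.default_rng(0)
demand_samples = rng.gamma(shape=8.0, scale=12.0, size=10_000)

q_star = newsvendor_order(demand_samples)
print("order:", round(float(q_star), 1), "profit:", round(expected_profit(q_star, demand_samples), 1))
print("profit at a naive mean-demand order:",
      round(expected_profit(float(np.mean(demand_samples)), demand_samples), 1))
```

The joint pricing-and-inventory variant in the paper additionally searches over the price that conditions the generator; the quantile step stays the same for each candidate price.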

[LG-45] Quantifying Qualitative Insights: Leveraging LLMs to Market Predict

链接: https://arxiv.org/abs/2411.08404
作者: Hoyoung Lee,Youngsoo Choi,Yuhee Kwon
关键词-EN: Large Language Models, Large Language, transform financial analytics, Recent advancements, advancements in Large
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have the potential to transform financial analytics by integrating numerical and textual data. However, challenges such as insufficient context when fusing multimodal information and the difficulty in measuring the utility of qualitative outputs, which LLMs generate as text, have limited their effectiveness in tasks such as financial forecasting. This study addresses these challenges by leveraging daily reports from securities firms to create high-quality contextual information. The reports are segmented into text-based key factors and combined with numerical data, such as price information, to form context sets. By dynamically updating few-shot examples based on the query time, the sets incorporate the latest information, forming a highly relevant set closely aligned with the query point. Additionally, a crafted prompt is designed to assign scores to the key factors, converting qualitative insights into quantitative results. The derived scores undergo a scaling process, transforming them into real-world values that are used for prediction. Our experiments demonstrate that LLMs outperform time-series models in market forecasting, though challenges such as imperfect reproducibility and limited explainability remain.

[LG-46] Communication Efficient Decentralization for Smoothed Online Convex Optimization

链接: https://arxiv.org/abs/2411.08355
作者: Neelkamal Bhuyan,Debankur Mukherjee,Adam Wierman
关键词-EN: Online Convex Optimization, Smoothed Online Convex, multi-agent Smoothed Online, Convex Optimization, Smoothed Online
类目: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 39 pages

点击查看摘要

Abstract:We study the multi-agent Smoothed Online Convex Optimization (SOCO) problem, where N agents interact through a communication graph. In each round, each agent i receives a strongly convex hitting cost function f^i_t in an online fashion and selects an action x^i_t \in \mathbb{R}^d . The objective is to minimize the global cumulative cost, which includes the sum of individual hitting costs f^i_t(x^i_t) , a temporal “switching cost” for changing decisions, and a spatial “dissimilarity cost” that penalizes deviations in decisions among neighboring agents. We propose the first decentralized algorithm for multi-agent SOCO and prove its asymptotic optimality. Our approach allows each agent to operate using only local information from its immediate neighbors in the graph. For finite-time performance, we establish that the optimality gap in competitive ratio decreases with the time horizon T and can be conveniently tuned based on the per-round computation available to each agent. Moreover, our results hold even when the communication graph changes arbitrarily and adaptively over time. Finally, we establish that the computational complexity per round depends only logarithmically on the number of agents and almost linearly on their degree within the graph, ensuring scalability for large-system implementations.
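Putting the three cost components in one place, one natural way to write the global objective described above is the following; the weights λ1, λ2 and the squared norms are illustrative, and the exact penalties used in the paper may differ:

```latex
\mathrm{cost}(T) \;=\;
\underbrace{\sum_{t=1}^{T}\sum_{i=1}^{N} f^{i}_{t}\!\left(x^{i}_{t}\right)}_{\text{hitting costs}}
\;+\;
\underbrace{\sum_{t=1}^{T}\sum_{i=1}^{N} \frac{\lambda_{1}}{2}\,\bigl\lVert x^{i}_{t}-x^{i}_{t-1}\bigr\rVert^{2}}_{\text{temporal switching cost}}
\;+\;
\underbrace{\sum_{t=1}^{T}\sum_{(i,j)\in E} \frac{\lambda_{2}}{2}\,\bigl\lVert x^{i}_{t}-x^{j}_{t}\bigr\rVert^{2}}_{\text{spatial dissimilarity cost}}
```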

[LG-47] SynapsNet: Enhancing Neuronal Population Dynamics Modeling via Learning Functional Connectivity

链接: https://arxiv.org/abs/2411.08221
作者: Parsa Delavari,Ipek Oruc,Timothy H Murphy
关键词-EN: scientifically translatable insights, scientifically translatable, translatable insights, large-scale neuronal population, availability of large-scale
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The availability of large-scale neuronal population datasets necessitates new methods to model population dynamics and extract interpretable, scientifically translatable insights. Existing deep learning methods often overlook the biological mechanisms underlying population activity and thus exhibit suboptimal performance with neuronal data and provide little to no interpretable information about neurons and their interactions. In response, we introduce SynapsNet, a novel deep-learning framework that effectively models population dynamics and functional interactions between neurons. Within this biologically realistic framework, each neuron, characterized by a latent embedding, sends and receives currents through directed connections. A shared decoder uses the input current, previous neuronal activity, neuron embedding, and behavioral data to predict the population activity in the next time step. Unlike common sequential models that treat population activity as a multichannel time series, SynapsNet applies its decoder to each neuron (channel) individually, with the learnable functional connectivity serving as the sole pathway for information flow between neurons. Our experiments, conducted on mouse cortical activity from publicly available datasets and recorded using the two most common population recording modalities (Ca imaging and Neuropixels) across three distinct tasks, demonstrate that SynapsNet consistently outperforms existing models in forecasting population activity. Additionally, our experiments on both real and synthetic data showed that SynapsNet accurately learns functional connectivity that reveals predictive interactions between neurons.

[LG-48] Emergent field theories from neural networks

链接: https://arxiv.org/abs/2411.08138
作者: Vitaly Vanchurin
关键词-EN: network-based learning systems, Hamiltonian systems, relation between Hamiltonian, neural network-based learning, learning systems
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:We establish a duality relation between Hamiltonian systems and neural network-based learning systems. We show that the Hamilton-Jacobi equations for position and momentum variables correspond to the equations governing the activation dynamics of non-trainable variables and the learning dynamics of trainable variables. The duality is then applied to model various field theories using the activation and learning dynamics of neural networks. For Klein-Gordon fields, the corresponding weight tensor is symmetric, while for Dirac fields, the weight tensor must contain an anti-symmetric tensor factor. The dynamical components of the weight and bias tensors correspond, respectively, to the temporal and spatial components of the gauge field.

[LG-49] A Tale of Two Cities: Pessimism and Opportunism in Offline Dynamic Pricing

链接: https://arxiv.org/abs/2411.08126
作者: Zeyu Bian,Zhengling Qi,Cong Shi,Lan Wang
关键词-EN: data coverage assumption, paper studies offline, studies offline dynamic, data coverage, partial identification framework
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies offline dynamic pricing without data coverage assumption, thereby allowing for any price including the optimal one not being observed in the offline data. Previous approaches that rely on the various coverage assumptions such as that the optimal prices are observable, would lead to suboptimal decisions and consequently, reduced profits. We address this challenge by framing the problem to a partial identification framework. Specifically, we establish a partial identification bound for the demand parameter whose associated price is unobserved by leveraging the inherent monotonicity property in the pricing problem. We further incorporate pessimistic and opportunistic strategies within the proposed partial identification framework to derive the estimated policy. Theoretically, we establish rate-optimal finite-sample regret guarantees for both strategies. Empirically, we demonstrate the superior performance of the newly proposed methods via a synthetic environment. This research provides practitioners with valuable insights into offline pricing strategies in the challenging no-coverage setting, ultimately fostering sustainable growth and profitability of the company.

[LG-50] Explainable Deep Learning Framework for SERS Bio-quantification

链接: https://arxiv.org/abs/2411.08082
作者: Jihan K. Zaki,Jakub Tomasik,Jade A. McCune,Sabine Bahn,Pietro Liò,Oren A. Scherman
关键词-EN: Surface-enhanced Raman spectroscopy, Surface-enhanced Raman, discover biomarker-disease relationships, Raman spectroscopy, SERS
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)
*备注:

点击查看摘要

Abstract:Surface-enhanced Raman spectroscopy (SERS) is a potentially fast and inexpensive method of analyte quantification, which can be combined with deep learning to discover biomarker-disease relationships. This study aims to address present challenges of SERS through a novel SERS bio-quantification framework, including spectral processing, analyte quantification, and model explainability. To this end, serotonin quantification in urine media was assessed as a model task with 682 SERS spectra measured in a micromolar range using cucurbit[8]uril chemical spacers. A denoising autoencoder was utilized for spectral enhancement, and convolutional neural networks (CNNs) and vision transformers were utilized for biomarker quantification. Lastly, a novel context representative interpretable model explanations (CRIME) method was developed to suit the current needs of SERS mixture analysis explainability. Serotonin quantification was most efficient in denoised spectra analysed using a convolutional neural network with a three-parameter logistic output layer (mean absolute error = 0.15 μM, mean percentage error = 4.67%). Subsequently, the CRIME method revealed that the CNN model presents six prediction contexts, of which three were associated with serotonin. The proposed framework could unlock a novel, untargeted hypothesis-generating method of biomarker discovery, considering the rapid and inexpensive nature of SERS measurements and the potential to identify biomarkers from CRIME contexts.
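
As an illustration of the spectral-enhancement step, a 1D convolutional denoising autoencoder for spectra might look like the minimal sketch below; the layer sizes and kernel widths are assumptions, and this is not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralDenoiser(nn.Module):
    """Hypothetical 1D denoising autoencoder for SERS spectra (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=7, stride=2, padding=3, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=7, stride=2, padding=3, output_padding=1),
        )

    def forward(self, noisy_spectrum):              # (batch, 1, n_wavenumbers)
        return self.decoder(self.encoder(noisy_spectrum))

# Training would minimize reconstruction error against a clean reference spectrum:
model = SpectralDenoiser()
noisy, clean = torch.rand(4, 1, 1024), torch.rand(4, 1, 1024)
loss = F.mse_loss(model(noisy), clean)
```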

[LG-51] LoRA-BERT: a Natural Language Processing Model for Robust and Accurate Prediction of long non-coding RNAs

链接: https://arxiv.org/abs/2411.08073
作者: Nicholas Jeon,Xiaoning Qian,Lamin SaidyKhan,Paul de Figueiredo,Byung-Jun Yoon
关键词-EN: Long non-coding RNAs, numerous biological processes, serve as crucial, non-coding RNAs, Long non-coding
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long non-coding RNAs (lncRNAs) serve as crucial regulators in numerous biological processes. Although they share sequence similarities with messenger RNAs (mRNAs), lncRNAs perform entirely different roles, providing new avenues for biological research. The emergence of next-generation sequencing technologies has greatly advanced the detection and identification of lncRNA transcripts, and deep learning-based approaches have been introduced to classify lncRNAs. These advanced methods have significantly enhanced the efficiency of identifying lncRNAs. However, many of these methods lack robustness and accuracy due to the extended length of the sequences involved. To tackle this issue, we introduce a novel pre-trained bidirectional encoder representation called LoRA-BERT. LoRA-BERT is designed to capture the importance of nucleotide-level information during sequence classification, leading to more robust and satisfactory outcomes. In a comprehensive comparison with commonly used sequence prediction tools, we demonstrate that LoRA-BERT outperforms them in terms of accuracy and efficiency. Our results indicate that, when utilizing the transformer model, LoRA-BERT achieves state-of-the-art performance in predicting both lncRNAs and mRNAs for human and mouse species. Through the utilization of LoRA-BERT, we acquire valuable insights into the traits of lncRNAs and mRNAs, offering the potential to aid in the comprehension and detection of diseases linked to lncRNAs in humans.
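
The general recipe of BERT-style sequence classifiers for RNA can be sketched as follows. Note that the k-mer tokenization, pooling strategy, and all hyperparameters below are assumptions chosen for illustration; the abstract does not specify LoRA-BERT's tokenizer or architecture.

```python
import torch
import torch.nn as nn
from itertools import product

K = 3  # assumed k-mer size
VOCAB = {"".join(kmer): i + 1 for i, kmer in enumerate(product("ACGT", repeat=K))}  # 0 = padding

def tokenize(seq, max_len=512):
    """Split a nucleotide sequence into overlapping k-mers and pad to max_len."""
    ids = [VOCAB.get(seq[i:i + K], 0) for i in range(len(seq) - K + 1)][:max_len]
    return torch.tensor(ids + [0] * (max_len - len(ids)))

class RNAClassifierSketch(nn.Module):
    """Hypothetical BERT-style encoder for lncRNA vs. mRNA classification."""
    def __init__(self, vocab_size=len(VOCAB) + 1, d_model=128, n_layers=2, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)      # logits: [mRNA, lncRNA]

    def forward(self, token_ids):              # (batch, max_len)
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok(token_ids) + self.pos(pos_ids))
        return self.head(h.mean(dim=1))        # mean-pool token representations, then classify

logits = RNAClassifierSketch()(tokenize("ATGGCGTACGTTAGC").unsqueeze(0))   # shape (1, 2)
```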

[LG-52] Modeling variable guide efficiency in pooled CRISPR screens with ContrastiveVI NEURIPS2024

链接: https://arxiv.org/abs/2411.08072
作者: Ethan Weinberger,Ryan Conrad,Tal Ashuach
关键词-EN: Genetic screens mediated, combined with high-content, biological discovery, high-content readouts, readouts have emerged
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
*备注: 15 pages, 4 figures, will be included in the AIDrugX workshop at NeurIPS 2024

点击查看摘要

Abstract:Genetic screens mediated via CRISPR-Cas9 combined with high-content readouts have emerged as powerful tools for biological discovery. However, computational analyses of these screens come with additional challenges beyond those found in standard scRNA-seq analyses. For example, perturbation-induced variations of interest may be subtle and masked by other dominant sources of variation shared with controls, and variable guide efficiency results in some cells not undergoing genetic perturbation despite expressing a guide RNA. While a number of methods have been developed to address the former problem by explicitly disentangling perturbation-induced variations from those shared with controls, less attention has been paid to the latter problem of noisy perturbation labels. To address this issue, here we propose ContrastiveVI+, a generative modeling framework that disentangles perturbation-induced from non-perturbation-related variations while also inferring whether cells truly underwent genomic edits. Applied to three large-scale Perturb-seq datasets, we find that ContrastiveVI+ better recovers known perturbation-induced variations compared to previous methods while successfully identifying cells that escaped the functional consequences of guide RNA expression. An open-source implementation of our model is available at this https URL.
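
To convey the modeling idea only, here is a deliberately simplified, deterministic stand-in for the variational model: a background latent shared with controls, a salient latent for perturbation effects, and a per-cell probability that the cell was actually edited, which gates how much salient variation enters the reconstruction. The architecture and gating below are assumptions; consult the paper and its open-source implementation for the real model.

```python
import torch
import torch.nn as nn

class ContrastiveSketch(nn.Module):
    """Simplified illustration of the ContrastiveVI+ idea (not the actual model)."""
    def __init__(self, n_genes, z_dim=10, s_dim=10, hidden=128):
        super().__init__()
        self.enc_z = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))
        self.enc_s = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU(), nn.Linear(hidden, s_dim))
        self.edited_logit = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.dec = nn.Sequential(nn.Linear(z_dim + s_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_genes))

    def forward(self, x, is_control):
        z, s = self.enc_z(x), self.enc_s(x)                  # background and salient latents
        pi = torch.sigmoid(self.edited_logit(x))             # P(cell truly edited)
        # controls carry no salient variation; "escaped" perturbed cells are down-weighted by pi
        gate = torch.where(is_control.unsqueeze(-1), torch.zeros_like(pi), pi)
        recon = self.dec(torch.cat([z, gate * s], dim=-1))
        return recon, pi

model = ContrastiveSketch(n_genes=2000)
recon, p_edited = model(torch.rand(16, 2000), torch.zeros(16, dtype=torch.bool))
```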

[LG-53] Mobility-based Traffic Forecasting in a Multimodal Transport System

链接: https://arxiv.org/abs/2411.08052
作者: Henock M. Mboko,Mouhamadou A.M.T. Balde,Babacar M. Ndiaye
关键词-EN: study the analysis, population mobility data, transportation network, indirectly impacts, population mobility
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 17 pages, 18 figures

点击查看摘要

Abstract:We analyze population movements from one node of a transportation network to another in order to observe, measure, and predict the impact of this mobility on traffic. The frequency of road congestion directly or indirectly affects our economic and social welfare. Our work explores machine learning methods for predicting (with a certain probability) traffic in a multimodal transportation network from population mobility data. We analyze how people's movements influence the transportation network and, on the basis of these historical observations, produce a likely prediction of congestion on the network.
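
A toy illustration of the general workflow (predicting a congestion probability from mobility counts) is sketched below; the features, labels, and model choice are invented for illustration and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: hourly origin-destination mobility counts per segment,
# with a binary label for whether congestion was observed that hour.
rng = np.random.default_rng(0)
X = rng.poisson(lam=50, size=(1000, 8))                 # e.g. inflow/outflow counts per mode
y = (X.sum(axis=1) + rng.normal(0, 20, 1000) > 420).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:800], y[:800])
p_congestion = clf.predict_proba(X[800:])[:, 1]         # probability of congestion per sample
```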

[LG-54] Stochastic Reconstruction of Gappy Lagrangian Turbulent Signals by Conditional Diffusion Models

链接: https://arxiv.org/abs/2410.23971
作者: Tianyi Li,Luca Biferale,Fabio Bonaccorso,Michele Buzzicotti,Luca Centurioni
关键词-EN: reconstructing missing spatial, small objects passively, objects passively advected, Global Drifter Program, spatial scales
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:We present a stochastic method for reconstructing missing spatial and velocity data along the trajectories of small objects passively advected by turbulent flows with a wide range of temporal or spatial scales, such as small balloons in the atmosphere or drifters in the ocean. Our approach makes use of conditional generative diffusion models, a recently proposed data-driven machine learning technique. We apply it to two paradigmatic open problems: 3D tracers in homogeneous and isotropic turbulence, and 2D trajectories from the NOAA-funded Global Drifter Program. We show that in both cases our method is able to reconstruct velocity signals retaining non-trivial scale-by-scale properties that are highly non-Gaussian and intermittent. A key feature of our method is its flexibility in dealing with the location and shape of data gaps, as well as its ability to naturally exploit correlations between different components, leading to superior accuracy relative to Gaussian process regression for both pointwise reconstruction and statistical expressivity. Our method also shows promise for a wide range of other Lagrangian problems, including multi-particle dispersion in turbulence, the dynamics of charged particles in astrophysics and plasma physics, and pedestrian dynamics.
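
One common way to condition a diffusion model on gappy observations is to feed the denoiser the noisy trajectory together with an observation mask and the known samples, and to fit the injected noise on the gaps. The sketch below illustrates that masking idea only; the noise schedule, network, and loss weighting are toy assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class MaskedDenoiser(nn.Module):
    """Toy mask-conditioned denoiser for gappy multichannel trajectories."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * channels + 1, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, channels, 5, padding=2),
        )

    def forward(self, noisy, observed, mask):
        # noisy/observed: (batch, channels, time); mask: (batch, 1, time), 1 where data exist
        return self.net(torch.cat([noisy, observed * mask, mask], dim=1))

def training_step(model, clean, mask, t_frac):
    noise = torch.randn_like(clean)
    noisy = (1 - t_frac) ** 0.5 * clean + t_frac ** 0.5 * noise    # toy noise schedule
    pred = model(noisy, clean, mask)
    return ((pred - noise) ** 2 * (1 - mask)).mean()               # fit the noise on the gaps

model = MaskedDenoiser()
loss = training_step(model, torch.rand(4, 3, 256), (torch.rand(4, 1, 256) > 0.3).float(), 0.5)
```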

信息检索

附件下载

点击下载今日全部论文列表