本篇博文主要展示 2024-10-11 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上11:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2024-10-11)

今日共更新561篇论文,其中:

  • 自然语言处理92篇(Computation and Language (cs.CL))
  • 人工智能147篇(Artificial Intelligence (cs.AI))
  • 计算机视觉148篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习198篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

【速读】: 该论文试图解决大规模视觉-语言预训练模型(如CLIP)在特定领域应用时性能不足的问题,特别是在缺乏人工标注数据的情况下。解决方案的关键在于提出了一种无监督的微调方法LatteCLIP,该方法利用大型多模态模型(LMMs)生成图像的文本描述,并通过学习噪声文本中的有用信息和双重伪标签来稳定训练过程,从而在无需人工标注的情况下提升模型在特定领域的分类性能。

链接: https://arxiv.org/abs/2410.08211
作者: Anh-Quan Cao,Maximilian Jaritz,Matthieu Guillaumin,Raoul de Charette,Loris Bazzani
关键词-EN: Large-scale vision-language pre-trained, Large-scale vision-language, applied to diverse, diverse applications, fine-tuning VLP models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.
摘要:大规模视觉-语言预训练 (VLP) 模型(例如 CLIP)以其多功能性著称,因为它们可以在零样本设置下应用于多种应用。然而,当这些模型用于特定领域时,由于领域差距或训练数据中这些领域的代表性不足,其性能往往不尽如人意。尽管在带有人工标注标签的自定义数据集上微调 VLP 模型可以解决这一问题,但即使是小规模数据集(例如 10 万样本)的标注也可能是一项昂贵的任务,尤其是在任务复杂时,通常需要专家标注者。为了应对这些挑战,我们提出了 LatteCLIP,这是一种在自定义领域中使用已知类别名称进行分类的无监督方法,无需依赖人工标注。我们的方法利用大模态模型 (LMM) 为单个图像和图像组生成富有表现力的文本描述。这些描述提供了额外的上下文信息,以指导在自定义领域中的微调过程。由于 LMM 生成的描述容易出现幻觉或遗漏细节,我们引入了一种新颖的策略,仅提取有用的信息并稳定训练。具体而言,我们从噪声生成的文本和双重伪标签中学习丰富的每类原型表示。我们在 10 个特定领域的数据集上的实验表明,LatteCLIP 在 top-1 准确率上平均优于预训练的零样本方法 4.74 个百分点,并优于其他最先进的无监督方法 3.45 个百分点。

[NLP-1] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

【速读】: 该论文试图解决单体多模态大语言模型(MLLM)在扩展视觉编码和语言解码能力时面临的灾难性遗忘问题,以及由此导致的性能下降。解决方案的关键在于采用增量调优(delta tuning)策略,通过将视觉参数嵌入预训练的语言模型(LLM)中,并在优化视觉参数时冻结LLM,从而逐步学习视觉知识。论文提出了Mono-InternVL模型,该模型通过多模态混合专家结构集成了一组视觉专家,并创新性地引入了内生视觉预训练(EViP)策略,以渐进式学习过程最大化视觉专家的能力,从而有效提升模型在多模态任务中的表现和部署效率。

链接: https://arxiv.org/abs/2410.08202
作者: Gen Luo,Xue Yang,Wenhan Dou,Zhaokai Wang,Jifeng Dai,Yu Qiao,Xizhou Zhu
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, monolithic Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has led to an influx of efforts to extend their capabilities to multimodal tasks. Among them, growing attention has been focused on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. Despite the structural simplicity and deployment-friendliness, training a monolithic MLLM with promising performance still remains challenging. In particular, the popular approaches adopt continuous pre-training to extend a pre-trained LLM to a monolithic MLLM, which suffers from catastrophic forgetting and leads to performance degeneration. In this paper, we aim to overcome this limitation from the perspective of delta tuning. Specifically, our core idea is to embed visual parameters into a pre-trained LLM, thereby incrementally learning visual knowledge from massive data via delta tuning, i.e., freezing the LLM when optimizing the visual parameters. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results not only validate the superior performance of Mono-InternVL compared to the state-of-the-art MLLM on 6 multimodal benchmarks, e.g., +113 points over InternVL-1.5 on OCRBench, but also confirm its better deployment efficiency, with first token latency reduced by up to 67%.
摘要:大语言模型 (LLM) 的快速发展促使人们不断努力将其能力扩展到多模态任务中。其中,越来越多的关注集中在将视觉编码和语言解码集成到单一 LLM 中的单体多模态大语言模型 (MLLM) 上。尽管结构简单且便于部署,但训练一个性能优异的单体 MLLM 仍然具有挑战性。特别是,流行的方法采用连续预训练将预训练的 LLM 扩展为单体 MLLM,这种方法存在灾难性遗忘问题,导致性能下降。本文旨在从增量调优的角度克服这一限制。具体来说,我们的核心思想是将视觉参数嵌入到预训练的 LLM 中,从而通过增量调优(即在优化视觉参数时冻结 LLM)从大量数据中逐步学习视觉知识。基于这一原则,我们提出了 Mono-InternVL,这是一种新颖的单体 MLLM,通过多模态混合专家结构无缝集成了一组视觉专家。此外,我们提出了一种创新的预训练策略,以最大化 Mono-InternVL 的视觉能力,即内生视觉预训练 (EViP)。特别地,EViP 被设计为视觉专家的渐进学习过程,旨在从噪声数据到高质量数据充分挖掘视觉知识。为了验证我们的方法,我们在 16 个基准上进行了广泛的实验。实验结果不仅验证了 Mono-InternVL 在 6 个多模态基准上相对于最先进的 MLLM 的优越性能,例如在 OCRBench 上比 InternVL-1.5 高出 113 分,而且还确认了其更好的部署效率,首次 Token 延迟降低了高达 67%。

[NLP-2] From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions

【速读】: 该论文试图解决大语言模型(LLMs)与外部工具之间由于现有工具文档的不充分和不准确而产生的理解鸿沟问题。解决方案的关键在于提出了一种名为DRAFT的新框架,通过动态分析LLMs与工具交互产生的反馈和轨迹,采用试错法进行工具文档的迭代优化。该框架包括三个学习阶段:经验收集、经验学习和文档重写,并通过多样性探索策略和工具自适应终止机制来提高效率和防止过拟合,从而显著提升工具文档的质量,增强LLMs对工具的理解和使用效果。

链接: https://arxiv.org/abs/2410.08197
作者: Changle Qu,Sunhao Dai,Xiaochi Wei,Hengyi Cai,Shuaiqiang Wang,Dawei Yin,Jun Xu,Ji-Rong Wen
关键词-EN: Large Language Models, enables Large Language, Language Models, Large Language, learning enables Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool learning enables Large Language Models (LLMs) to interact with external environments by invoking tools, serving as an effective strategy to mitigate the limitations inherent in their pre-training data. In this process, tool documentation plays a crucial role by providing usage instructions for LLMs, thereby facilitating effective tool utilization. This paper concentrates on the critical challenge of bridging the comprehension gap between LLMs and external tools due to the inadequacies and inaccuracies inherent in existing human-centric tool documentation. We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation through the Analysis of Feedback and Trails emanating from LLMs’ interactions with external tools. This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases: experience gathering, learning from experience, and documentation rewriting, to iteratively enhance the tool documentation. This process is further optimized by implementing a diversity-promoting exploration strategy to ensure explorative diversity and a tool-adaptive termination mechanism to prevent overfitting while enhancing efficiency. Extensive experiments on multiple datasets demonstrate that DRAFT’s iterative, feedback-based refinement significantly ameliorates documentation quality, fostering a deeper comprehension and more effective utilization of tools by LLMs. Notably, our analysis reveals that the tool documentation refined via our approach demonstrates robust cross-model generalization capabilities.
摘要:工具学习使大语言模型 (LLMs) 能够通过调用工具与外部环境进行交互,作为一种有效的策略来缓解其预训练数据中固有的局限性。在此过程中,工具文档通过为 LLMs 提供使用说明,起到了至关重要的作用,从而促进了工具的有效利用。本文集中探讨了由于现有以人为中心的工具文档中固有的不足和不准确性,导致 LLMs 与外部工具之间理解差距的关键挑战。我们提出了一种新颖的框架,DRAFT,旨在通过分析 LLMs 与外部工具交互产生的反馈和轨迹,动态优化工具文档。该方法基于一种创新的试错法,包括三个不同的学习阶段:经验收集、从经验中学习以及文档重写,以迭代方式提升工具文档的质量。此外,通过实施多样性促进的探索策略来确保探索的多样性,并采用工具自适应的终止机制来防止过拟合,同时提高效率。在多个数据集上的广泛实验表明,DRAFT 基于反馈的迭代优化显著提升了文档质量,促进了 LLMs 对工具的更深入理解和更有效利用。值得注意的是,我们的分析显示,通过我们的方法优化的工具文档展现出强大的跨模型泛化能力。

[NLP-3] MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

【速读】: 该论文试图解决大型语言模型在数学推理能力上的不足,特别是通过传统的数学预训练方法(如使用工程、机器学习等领域的数学包代码)无法直接提升数学推理能力的问题。解决方案的关键在于引入了一种新的方法,通过构建高质量的数学预训练数据集(MathCode-Pile),并生成包含推理步骤的数学代码,从而精确捕捉数学推理过程。具体步骤包括从数学相关的网页数据、数学包代码、数学教科书和合成数据中提取LaTeX表达式及其条件和结果,生成相应的代码,并将这些代码与自然语言推理步骤配对,最终形成一个包含19.2B-token的高性能数学预训练语料库。这种方法显著提升了多个流行基础模型的数学能力,并生成了MathCoder2系列模型。

链接: https://arxiv.org/abs/2410.08196
作者: Zimu Lu,Aojun Zhou,Ke Wang,Houxing Ren,Weikang Shi,Junting Pan,Mingjie Zhan,Hongsheng Li
关键词-EN: Code, mathematical, precision and accuracy, reasoning, mathematical reasoning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at this https URL .
摘要:代码因其精确性和准确性,已被证明能有效提升大语言模型的数学推理能力。以往涉及持续数学预训练的工作通常包含使用数学相关包的代码,这些包主要设计用于工程、机器学习、信号处理或模块测试等领域,而非直接针对数学推理。本文中,我们提出了一种生成伴随推理步骤的数学代码的新方法,用于持续预训练。我们的方法首先通过整合数学相关网络数据、使用数学包的代码、数学教科书和合成数据,构建了一个高质量的数学持续预训练数据集。接着,我们从先前收集的数据集中提取 LaTeX 表达式、表达式所需条件及其结果,构建推理步骤。基于这些提取的信息,我们生成相应的代码,以准确捕捉数学推理过程。将生成的代码附加到每个推理步骤后,形成包含自然语言推理步骤及其对应代码的数据对。结合这些数据与原始数据集,我们构建了一个名为 MathCode-Pile 的 19.2B-token 高性能数学预训练语料库。使用该语料库训练多个流行基础模型显著提升了它们的数学能力,从而诞生了 MathCoder2 系列模型。我们的所有数据处理和训练代码均已开源,确保了整个数据收集和训练流程的完全透明和易于复现。代码发布于 https URL。

[NLP-4] GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

【速读】: 该论文试图解决大语言模型(LLMs)在测试阶段如何高效地与人类偏好对齐的问题。传统方法在训练阶段通过微调模型来对齐人类偏好,但这种方法成本高且难以适应多样化的用户需求。论文提出的解决方案是GenARM,其关键在于引入了一种新的自回归奖励模型(Autoregressive Reward Model),该模型能够在不重新训练LLMs的情况下,实时计算下一个token的奖励,从而指导模型的自回归生成过程。这种方法不仅显著提升了测试阶段对齐的效率和效果,还支持多目标对齐和弱到强的指导,无需重新训练模型即可适应不同的用户偏好。

链接: https://arxiv.org/abs/2410.08193
作者: Yuancheng Xu,Udari Madhushani Sehwag,Alec Koppel,Sicheng Zhu,Bang An,Furong Huang,Sumitra Ganesh
关键词-EN: Large Language Models, Large Language, exhibit impressive capabilities, Language Models, exhibit impressive
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model–a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.
摘要:大语言模型 (LLMs) 展现出令人印象深刻的能力,但需要与人类偏好进行仔细的校准。传统的训练时方法通过使用人类偏好数据集对 LLMs 进行微调,但这种方法会带来显著的训练成本,并且需要反复训练以应对多样化的用户偏好。测试时校准方法通过使用奖励模型 (RMs) 来指导冻结的 LLMs,而无需重新训练,从而解决了这一问题。然而,现有的测试时方法依赖于轨迹级别的 RMs,这些模型设计用于评估完整的响应,因此不适合需要从部分响应中计算下一个 Token 奖励的自回归文本生成。为了解决这一问题,我们引入了 GenARM,这是一种测试时校准方法,利用自回归奖励模型——一种新颖的奖励参数化设计,旨在预测下一个 Token 的奖励,以实现高效且有效的自回归生成。理论上,我们证明了这种参数化可以在 KL 正则化的强化学习框架内,证明性地引导冻结的 LLMs 朝向传统 RMs 所能实现的任何分布。实验结果表明,GenARM 显著优于先前的测试时校准基线,并达到了训练时方法的性能。此外,GenARM 支持从弱到强的指导,能够在不增加大型模型训练成本的情况下,使用较小的 RMs 校准较大的 LLMs。此外,GenARM 支持多目标校准,允许在偏好维度之间进行实时权衡,并满足多样化的用户偏好,而无需重新训练。

[NLP-5] MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

【速读】: 该论文试图解决在多模态检索任务中,视觉信息相对于文本信息在某些场景下更为重要或更易获取的问题。解决方案的关键在于引入了一个名为MRAG-Bench的多模态检索增强生成基准,该基准系统地识别并分类了视觉增强知识优于文本知识的场景,并通过16,130张图像和1,353个人工标注的多选题,评估了10个开源和4个专有的视觉-语言大模型(LVLMs)。研究结果表明,所有LVLMs在视觉信息增强下表现显著提升,证实了MRAG-Bench的视觉中心性,并揭示了当前LVLMs在有效利用检索到的视觉知识方面的不足,从而强调了MRAG-Bench在推动社区改进LVLMs利用视觉知识能力方面的重要性。

链接: https://arxiv.org/abs/2410.08182
作者: Wenbo Hu,Jia-Chen Gu,Zi-Yi Dou,Mohsen Fayyaz,Pan Lu,Kai-Wei Chang,Nanyun Peng
关键词-EN: Existing multimodal retrieval, retrieval benchmarks primarily, benchmarks primarily focus, Existing multimodal, primarily focus
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs’ ability to utilize retrieved visual knowledge more effectively.
摘要:现有的多模态检索基准主要关注模型是否能够检索并利用外部文本知识进行问答。然而,在某些场景中,检索视觉信息比文本数据更为有利或更易获取。本文中,我们引入了一个多模态检索增强生成基准,MRAG-Bench,在该基准中,我们系统地识别并分类了视觉增强知识优于文本知识的场景,例如从不同视角获取更多图像。MRAG-Bench 包含 16,130 张图像和 1,353 个由人工标注的多项选择题,涵盖 9 种不同场景。通过 MRAG-Bench,我们对 10 个开源和 4 个专有的视觉语言大模型 (LVLMs) 进行了评估。结果显示,所有 LVLMs 在通过图像增强后表现均有显著提升,相比之下文本知识的增强效果较弱,这证实了 MRAG-Bench 是以视觉为中心的。此外,我们利用 MRAG-Bench 进行了广泛的分析,提供了关于检索增强型 LVLMs 的宝贵见解。值得注意的是,表现最佳的模型 GPT-4o 在有效利用检索知识方面面临挑战,仅实现了 5.82% 的提升,而人类参与者则观察到了 33.16% 的提升。这些发现突显了 MRAG-Bench 在推动社区提升 LVLMs 有效利用检索视觉知识能力方面的重要性。

[NLP-6] Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在开放和封闭场景中的可信度问题,特别是如何在这些模型中构建具有统计保证的预测集。解决方案的关键在于提出了一个名为TRON的两步框架,该框架包括:(1) 一种新的保形分数(conformal score),用于采样最小尺寸的响应集;(2) 一种非保形分数(nonconformity score),基于自一致性理论识别高质量响应,并通过两个特定的风险水平控制错误率。此外,论文首次研究了开放环境中预测集的语义冗余问题,提出了一种基于平均集大小的评估指标,以提高MLLMs的适应性和稳定性。

链接: https://arxiv.org/abs/2410.08174
作者: Qingni Wang,Tiantian Geng,Zhiyuan Wang,Teng Wang,Bo Fu,Feng Zheng
关键词-EN: Multimodal Large Language, Multimodal Large, Large Language Models, significant trustworthiness issues, encounter significant trustworthiness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error rates by two specific risk levels. Furthermore, we investigate semantic redundancy in prediction sets within open-ended contexts for the first time, leading to a promising evaluation metric for MLLMs based on average set size. Our comprehensive experiments across four Video Question-Answering (VideoQA) datasets utilizing eight MLLMs show that TRON achieves desired error rates bounded by two user-specified risk levels. Additionally, deduplicated prediction sets maintain adaptiveness while being more efficient and stable for risk assessment under different risk levels.
摘要:多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在各种任务中展现出有前景的进展,但它们仍然面临显著的可信度问题。先前的研究在语言建模中应用了分割置信预测 (Split Conformal Prediction, SCP),以构建具有统计保证的预测集。然而,这些方法通常依赖于内部模型逻辑或局限于多项选择设置,这限制了它们在动态、开放环境中的通用性和适应性。本文中,我们引入了 TRON,一个适用于任何支持在开放和封闭场景中采样的 MLLM 的风险控制和评估的两步框架。TRON 包含两个主要组件:(1) 一种新颖的置信分数,用于采样最小尺寸的响应集;(2) 一种非一致性分数,基于自一致性理论识别高质量响应,通过两个特定的风险水平控制错误率。此外,我们首次在开放环境中研究了预测集中的语义冗余,从而提出了一种基于平均集大小的 MLLM 评估指标。我们在四个视频问答 (Video Question-Answering, VideoQA) 数据集上使用八个 MLLM 进行了全面实验,结果表明 TRON 实现了由用户指定的两个风险水平界定的期望错误率。此外,去重后的预测集在保持适应性的同时,在不同风险水平下进行风险评估时更加高效和稳定。

[NLP-7] Agent S: An Open Agent ic Framework that Uses Computers Like a Human

【速读】: 该论文试图解决自动化计算机任务中的三大挑战:获取领域特定知识、在长任务周期内进行规划以及处理动态、非均匀的界面。解决方案的关键在于Agent S框架引入了经验增强的分层规划,通过多层次的外部知识搜索和内部经验检索来促进高效的任务规划和子任务执行。此外,Agent S采用Agent-Computer Interface (ACI) 结合多模态大语言模型 (MLLMs) 来提升GUI代理的推理和控制能力,从而在OSWorld基准测试中实现了新的最先进性能,成功率相对基线提升了83.6%。

链接: https://arxiv.org/abs/2410.08164
作者: Saaket Agashe,Jiuzhou Han,Shuyu Gan,Jiachen Yang,Ang Li,Xin Eric Wang
关键词-EN: Graphical User Interface, Graphical User, enables autonomous interaction, transforming human-computer interaction, open agentic framework
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 16 figures, 9 tables

点击查看摘要

Abstract:We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at this https URL.
摘要:我们提出了 Agent S,这是一个开放的智能体框架,旨在通过图形用户界面 (GUI) 实现与计算机的自主交互,从而通过自动化复杂的多步骤任务来转变人机交互。Agent S 旨在解决自动化计算机任务中的三个关键挑战:获取领域特定知识、在长时间任务范围内进行规划,以及处理动态、非均匀的界面。为此,Agent S 引入了经验增强的分层规划,该规划从外部知识搜索和内部经验检索中学习,并在多个层次上进行,从而促进高效的任务规划和子任务执行。此外,它采用了一种智能体-计算机接口 (ACI),以更好地激发基于多模态大语言模型 (MLLMs) 的 GUI 智能体的推理和控制能力。在 OSWorld 基准测试中,Agent S 的成功率比基线高出 9.37%(相对改善 83.6%),并达到了新的最先进水平。全面的分析突显了各个组件的有效性,并为未来的改进提供了见解。此外,Agent S 在新发布的 WindowsAgentArena 基准测试中展示了广泛的通用性,适用于不同的操作系统。代码可在以下链接获取:https URL。

[NLP-8] he Effect of Surprisal on Reading Times in Information Seeking and Repeated Reading CONLL

【速读】: 该论文试图解决在日常语言处理中常见的三种情境(信息搜索、重复处理及两者的结合)下,意外性(surprisal)对处理难度影响的预测问题。解决方案的关键在于使用眼动数据来验证标准意外性估计与特定情境下意外性估计的预测能力。研究发现,标准意外性估计在所有情境下都能有效预测处理时间,但在信息搜索和重复阅读情境中,特定情境下的意外性估计并未提高预测能力,甚至在重复阅读中几乎无预测能力。这表明当前语言模型与人类任务和记忆表征之间存在不匹配,并质疑这些模型在估计认知相关量方面的适用性。

链接: https://arxiv.org/abs/2410.08162
作者: Keren Gruteke Klein,Yoav Meiri,Omer Shubi,Yevgeni Berzak
关键词-EN: investigation in psycholinguistics, central topic, topic of investigation, processing, surprisal
类目: Computation and Language (cs.CL)
备注: Accepted to CoNLL

点击查看摘要

Abstract:The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eyetracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard regime-agnostic surprisal estimates we find that the prediction of surprisal theory regarding the presence of a linear effect of surprisal on processing times, extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve the predictive power of processing times compared to standard surprisals. Further, regime-specific contexts yield near zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss theoretical challenges posed by these results.
摘要:在心理语言学领域,意外性对处理难度的影响一直是研究的核心课题。在此,我们利用眼动追踪数据来探讨日常生活中常见的三种语言处理模式,但这些模式在此问题上尚未得到充分研究:信息搜索、重复处理以及两者的结合。使用标准的与模式无关的意外性估计,我们发现,意外性理论关于意外性对处理时间存在线性影响的预测,在这些模式中同样适用。然而,当使用与人类所处情境和任务相匹配的特定模式上下文中的意外性估计时,我们发现,在信息搜索中,这些估计并未比标准意外性更能提高处理时间的预测能力。此外,在重复阅读中,特定模式上下文产生的意外性估计接近于零,且对处理时间没有预测能力。这些发现表明,人类与当前语言模型在任务和记忆表征之间存在错位,并质疑这些模型在估计与认知相关的量方面的适用性。我们进一步讨论了这些结果带来的理论挑战。

[NLP-9] Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

【速读】: 该论文试图解决在大语言模型中提高多步推理能力的问题,特别是通过改进过程奖励模型(PRMs)来优化探索和强化学习(RL)中的信用分配。解决方案的关键在于设计有效的过程奖励,这些奖励应衡量每一步对未来正确响应概率的改进,即步骤级别的优势。论文提出使用一个与基础策略不同的证明策略来评估这种进步,从而在测试时搜索和在线RL中实现更高的准确性和计算效率。通过训练过程优势验证器(PAVs)来预测这种进步,论文展示了相较于结果奖励模型(ORMs),PRMs在准确性和计算效率上的显著提升。

链接: https://arxiv.org/abs/2410.08146
作者: Amrith Setlur,Chirag Nagpal,Adam Fisch,Xinyang Geng,Jacob Eisenstein,Rishabh Agarwal,Alekh Agarwal,Jonathan Berant,Aviral Kumar
关键词-EN: large language models, promising approach, large language, reward models, language models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: “How should we design process rewards?”. Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is 8% more accurate, and 1.5-5\times more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with 5-6\times gain in sample efficiency, and 6% gain in accuracy, over ORMs.
摘要:一种有前景的提升大语言模型推理能力的方法是使用过程奖励模型 (Process Reward Models, PRMs)。PRMs 在多步推理过程中的每一步都提供反馈,这可能比仅在最终步骤提供反馈的结果奖励模型 (Outcome Reward Models, ORMs) 更能改进信用分配。然而,收集密集的每步人类标签并不具有可扩展性,并且从自动标注的数据中训练 PRMs 迄今为止仅带来了有限的收益。为了通过在 PRM 上运行搜索或将其用作强化学习 (Reinforcement Learning, RL) 的密集奖励来改进基础策略,我们提出问题:“我们应如何设计过程奖励?”我们的关键见解是,为了有效,步骤的过程奖励应衡量进展:在采取步骤前后,未来产生正确响应的可能性变化,这与 RL 中的步骤级优势概念相对应。至关重要的是,这种进展应在不同于基础策略的证明策略下进行衡量。我们从理论上刻画了良好证明策略的集合,并展示了从这些证明策略优化过程奖励可以改善测试时搜索和在线 RL 中的探索。事实上,我们的刻画表明,较弱的证明策略可以显著提升较强的基础策略,我们在实证中也观察到了这一点。我们通过训练过程优势验证器 (Process Advantage Verifiers, PAVs) 来预测在这些证明策略下的进展,并展示与 ORMs 相比,测试时搜索对抗 PAVs 的准确性提高了 8%,计算效率提高了 1.5-5 倍。在线 RL 中使用 PAVs 提供的密集奖励实现了样本效率提升 5-6 倍,准确性提升 6% 的首个结果。

[NLP-10] Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

【速读】: 该论文试图解决多模态大语言模型(MLLMs)中常识性视觉知识冲突的问题,即模型内部常识知识与视觉信息之间的矛盾。解决方案的关键在于引入了一种自动化流水线,结合人机协作的质量控制,构建了一个包含374张原始图像和1,122个高质量问答对的诊断基准,用于模拟和评估这些冲突。通过该基准,论文评估了九种代表性MLLMs的冲突解决能力,并发现模型过度依赖文本查询。为此,论文提出了一种新的提示策略“Focus-on-Vision”(FoV),显著增强了MLLMs在面对冲突时优先考虑视觉数据的能力,从而有效缓解了视觉知识冲突问题。

链接: https://arxiv.org/abs/2410.08145
作者: Xiaoyuan Liu,Wenxuan Wang,Youliang Yuan,Jen-tse Huang,Qiuzhi Liu,Pinjia He,Zhaopeng Tu
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, contradicts model internal
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts model’s internal commonsense knowledge (see Figure 1). To study this issue, we introduce an automated pipeline, augmented with human-in-the-loop quality control, to establish a benchmark aimed at simulating and assessing the conflicts in MLLMs. Utilizing this pipeline, we have crafted a diagnostic benchmark comprising 374 original images and 1,122 high-quality question-answer (QA) pairs. This benchmark covers two types of conflict target and three question difficulty levels, providing a thorough assessment tool. Through this benchmark, we evaluate the conflict-resolution capabilities of nine representative MLLMs across various model families and find a noticeable over-reliance on textual queries. Drawing on these findings, we propose a novel prompting strategy, “Focus-on-Vision” (FoV), which markedly enhances MLLMs’ ability to favor visual data over conflicting textual knowledge. Our detailed analysis and the newly proposed strategy significantly advance the understanding and mitigating of vision-knowledge conflicts in MLLMs. The data and code are made publicly available.
摘要:本文探讨了多模态大语言模型 (MLLMs) 中常识性视觉知识冲突的问题,即视觉信息与模型内部常识知识相矛盾的情况(见图 1)。为了研究这一问题,我们引入了一个自动化流程,并结合人在回路中的质量控制,建立了一个旨在模拟和评估 MLLMs 中冲突的基准。利用这一流程,我们构建了一个包含 374 张原始图像和 1,122 个高质量问答 (QA) 对的诊断基准。该基准涵盖了两种类型的冲突目标和三个问题难度级别,提供了一个全面的评估工具。通过这一基准,我们评估了九个代表性 MLLMs 在不同模型家族中的冲突解决能力,并发现其显著依赖于文本查询。基于这些发现,我们提出了一种新的提示策略,即“聚焦视觉” (Focus-on-Vision, FoV),该策略显著增强了 MLLMs 在视觉数据与冲突文本知识之间优先选择视觉数据的能力。我们的详细分析和所提出的新策略显著推进了对 MLLMs 中视觉知识冲突的理解和缓解。数据和代码已公开发布。

[NLP-11] DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory

【速读】: 该论文试图解决大语言模型在机器翻译中处理整个文档时面临的翻译一致性和准确性问题。解决方案的关键是引入DelTA(Document-levEL Translation Agent),它通过多层次的记忆结构(包括专有名词记录、双语摘要、长期记忆和短期记忆)来存储和更新信息,确保翻译过程中的一致性和质量。DelTA采用逐句翻译策略,避免了句子遗漏,并在记忆效率上优于主流方法,同时提高了代词翻译的准确性,其摘要组件还展示了作为基于查询的摘要任务工具的潜力。

链接: https://arxiv.org/abs/2410.08143
作者: Yutong Wang,Jiali Zeng,Xuebo Liu,Derek F. Wong,Fandong Meng,Jie Zhou,Min Zhang
关键词-EN: Large language models, Large language, reasonable quality improvements, achieved reasonable quality, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks. We release our code and data at this https URL.
摘要:大语言模型 (LLMs) 在机器翻译 (MT) 方面取得了合理的质量提升。然而,当前大多数关于 MT-LLM 的研究在处理整个文档时仍面临保持翻译一致性和准确性的重大挑战。本文中,我们介绍了 DelTA,一种文档级翻译智能体,旨在克服这些限制。DelTA 具有多层次的记忆结构,存储了包括专有名词记录、双语摘要、长期记忆和短期记忆在内的多种粒度和跨度的信息,这些信息由辅助的基于 LLM 的组件持续检索和更新。实验结果表明,DelTA 在翻译一致性和质量方面显著优于四个开源/闭源 LLM 和两个代表性文档翻译数据集的强基线,一致性得分平均提高了 4.58 个百分点,COMET 得分平均提高了 3.16 分。DelTA 采用逐句翻译策略,确保无句子遗漏,并提供了一种比主流方法更高效的内存解决方案。此外,DelTA 提高了代词翻译的准确性,智能体的摘要组件也显示出作为基于查询的摘要任务工具的潜力。我们在 https URL 上发布了代码和数据。

[NLP-12] Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks

【速读】: 该论文试图解决当前大型语言模型(LLM)基准测试中对长期记忆评估的不足,特别是缺乏对情景记忆(episodic memory)的评估。情景记忆涉及将记忆与其发生的时间和地点等上下文关联起来,这在许多认知任务和日常功能中至关重要。论文提出的解决方案是引入序列顺序回忆任务(Sequence Order Recall Tasks, SORT),这是一种从认知心理学中改编的任务,要求模型回忆文本片段的正确顺序。SORT的关键在于提供了一个通用的评估框架,易于扩展且不需要额外的注释,并通过实验证明人类能够基于长期记忆回忆书籍中的序列顺序。论文还展示了在有相关文本上下文的情况下,模型能够高准确度地完成任务,但在仅基于训练时提供的书籍文本时,LLM的表现有所下降。通过这种方式,SORT有助于评估和促进记忆增强模型的开发。

链接: https://arxiv.org/abs/2410.08133
作者: Mathis Pink,Vy A. Vo,Qinyuan Wu,Jianing Mu,Javier S. Turek,Uri Hasson,Kenneth A. Norman,Sebastian Michelmann,Alexander Huth,Mariya Toneva
关键词-EN: primarily assessing semantic, Current LLM benchmarks, Current LLM, assessing semantic aspects, semantic relations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current LLM benchmarks focus on evaluating models’ memory of facts and semantic relations, primarily assessing semantic aspects of long-term memory. However, in humans, long-term memory also includes episodic memory, which links memories to their contexts, such as the time and place they occurred. The ability to contextualize memories is crucial for many cognitive tasks and everyday functions. This form of memory has not been evaluated in LLMs with existing benchmarks. To address the gap in evaluating memory in LLMs, we introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology. SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is both easily extendable and does not require any additional annotations. We present an initial evaluation dataset, Book-SORT, comprising 36k pairs of segments extracted from 9 books recently added to the public domain. Based on a human experiment with 155 participants, we show that humans can recall sequence order based on long-term memory of a book. We find that models can perform the task with high accuracy when relevant text is given in-context during the SORT evaluation. However, when presented with the book text only during training, LLMs’ performance on SORT falls short. By allowing to evaluate more aspects of memory, we believe that SORT will aid in the emerging development of memory-augmented models.
摘要:当前的大语言模型基准主要集中在评估模型对事实和语义关系的记忆,主要评估长期记忆的语义方面。然而,在人类中,长期记忆还包括情景记忆,即将记忆与其发生的背景(如时间和地点)联系起来。记忆的情景化能力对于许多认知任务和日常功能至关重要。现有的基准尚未对大语言模型中的这种记忆形式进行评估。为了填补大语言模型记忆评估的这一空白,我们引入了序列顺序回忆任务 (Sequence Order Recall Tasks, SORT),该任务改编自认知心理学中用于研究情景记忆的任务。SORT 要求大语言模型回忆文本片段的正确顺序,并提供了一个易于扩展且不需要任何额外标注的通用框架。我们提供了一个初步的评估数据集,Book-SORT,包含从最近进入公共领域的 9 本书中提取的 36k 对片段。基于 155 名参与者的实验,我们展示了人类可以基于对一本书的长期记忆来回忆序列顺序。我们发现,当在 SORT 评估期间提供相关文本时,模型可以高精度地完成任务。然而,当仅在训练期间提供书籍文本时,大语言模型在 SORT 上的表现不佳。通过允许评估记忆的更多方面,我们相信 SORT 将有助于记忆增强模型的快速发展。

[NLP-13] hink Beyond Size: Dynamic Prompting for More Effective Reasoning ICLR2025

【速读】: 该论文试图解决大型语言模型(LLMs)在推理能力上的局限性问题,特别是模型大小对推理效果的影响。解决方案的关键在于提出了一种名为“动态提示”(Dynamic Prompting)的新框架,通过根据任务复杂度和模型表现实时调整提示序列和步骤数量,从而提高模型的适应性和问题解决效率。这种动态调整方法显著减少了模型的幻觉和重复循环,使得较小的LLMs也能在推理任务中与更大的模型相媲美,挑战了模型大小作为推理能力主要决定因素的传统观念。

链接: https://arxiv.org/abs/2410.08130
作者: Kamesh R
关键词-EN: Large Language Models, Large Language, paper presents Dynamic, presents Dynamic Prompting, capabilities of Large
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Submitted to ICLR 2025. This is a preprint version. Future revisions will include additional evaluations and refinements

点击查看摘要

Abstract:This paper presents Dynamic Prompting, a novel framework aimed at improving the reasoning capabilities of Large Language Models (LLMs). In contrast to conventional static prompting methods, Dynamic Prompting enables the adaptive modification of prompt sequences and step counts based on real-time task complexity and model performance. This dynamic adaptation facilitates more efficient problem-solving, particularly in smaller models, by reducing hallucinations and repetitive cycles. Our empirical evaluations demonstrate that Dynamic Prompting allows smaller LLMs to perform competitively with much larger models, thereby challenging the conventional emphasis on model size as the primary determinant of reasoning efficacy.
摘要:本文介绍了动态提示 (Dynamic Prompting),这是一种旨在提升大语言模型 (LLM) 推理能力的新框架。与传统的静态提示方法不同,动态提示能够根据实时任务复杂度和模型性能,自适应地调整提示序列和步骤数量。这种动态适应通过减少幻觉和重复循环,促进了更高效的问题解决,尤其是在较小的模型中。我们的实证评估表明,动态提示使得较小的 LLM 能够与更大的模型相媲美,从而挑战了传统上将模型大小作为推理效能主要决定因素的观点。

[NLP-14] Mars: Situated Inductive Reasoning in an Open-World Environment

【速读】: 该论文试图解决机器智能在特定环境中进行情境归纳推理(situated inductive reasoning)的问题。解决方案的关键在于设计了一个名为Mars的交互环境,通过引入反常识的游戏机制(如地形修改、生存设置和任务依赖性),使代理能够主动与环境互动,从中推导有用规则并在特定情境下进行决策。论文通过实验验证了现有基于强化学习和大型语言模型的方法在这一挑战性任务上的不足,并提出了从历史轨迹中进行归纳推理的策略,强调了归纳推理在Mars环境中的重要性,旨在推动情境归纳推理技术的发展,为下一代能够适应性和情境敏感地进行推理的AI系统奠定基础。

链接: https://arxiv.org/abs/2410.08126
作者: Xiaojuan Tang,Jiaqi Li,Yitao Liang,Song-chun Zhu,Muhan Zhang,Zilong Zheng
关键词-EN: Large Language Models, Large Language, Language Models, shown remarkable success, inductive reasoning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) trained on massive corpora have shown remarkable success in knowledge-intensive tasks. Yet, most of them rely on pre-stored knowledge. Inducing new general knowledge from a specific environment and performing reasoning with the acquired knowledge – \textitsituated inductive reasoning, is crucial and challenging for machine intelligence. In this paper, we design Mars, an interactive environment devised for situated inductive reasoning. It introduces counter-commonsense game mechanisms by modifying terrain, survival setting and task dependency while adhering to certain principles. In Mars, agents need to actively interact with their surroundings, derive useful rules and perform decision-making tasks in specific contexts. We conduct experiments on various RL-based and LLM-based methods, finding that they all struggle on this challenging situated inductive reasoning benchmark. Furthermore, we explore \textitInduction from Reflection, where we instruct agents to perform inductive reasoning from history trajectory. The superior performance underscores the importance of inductive reasoning in Mars. Through Mars, we aim to galvanize advancements in situated inductive reasoning and set the stage for developing the next generation of AI systems that can reason in an adaptive and context-sensitive way.
摘要:在大规模语料库上训练的大语言模型 (LLM) 在知识密集型任务中表现出色。然而,大多数模型依赖于预存储的知识。从特定环境中诱导新的一般知识并利用所获得的知识进行推理——即情境归纳推理,对于机器智能来说至关重要且具有挑战性。本文中,我们设计了 Mars,这是一个为情境归纳推理设计的交互环境。它通过修改地形、生存设置和任务依赖性,同时遵循某些原则,引入了反常识的游戏机制。在 Mars 中,智能体需要与其周围环境积极互动,推导出有用的规则并在特定情境下执行决策任务。我们对基于强化学习 (RL) 和基于大语言模型 (LLM) 的方法进行了实验,发现它们在这个具有挑战性的情境归纳推理基准上都表现不佳。此外,我们探索了“从反思中归纳”的方法,指导智能体从历史轨迹中进行归纳推理。优越的性能凸显了归纳推理在 Mars 中的重要性。通过 Mars,我们旨在推动情境归纳推理的进步,并为开发能够以适应性和情境敏感方式进行推理的下一代 AI 系统奠定基础。

[NLP-15] Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

【速读】: 该论文试图解决基于大型语言模型(LLM)的多智能体系统(MAS)在协作问题解决中面临的三大挑战:低通信效率、差扩展性和缺乏有效的参数更新优化方法。解决方案的关键在于提出了一种名为Optima的新框架,通过LLM训练显著提升通信效率和任务效果。Optima采用迭代生成、排序、选择和训练的范式,并引入奖励函数平衡任务性能、令牌效率和通信可读性。此外,论文探索了多种强化学习算法,并结合蒙特卡洛树搜索技术生成DPO数据,以探索多样化的交互路径,从而在多智能体任务中实现显著的性能提升和效率优化。

链接: https://arxiv.org/abs/2410.08115
作者: Weize Chen,Jiarui Yuan,Chen Qian,Cheng Yang,Zhiyuan Liu,Maosong Sun
关键词-EN: Large Language Model, Large Language, Language Model, parameter-updating optimization methods, low communication efficiency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving, yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness in LLM-based MAS through LLM training. Optima employs an iterative generate, rank, select, and train paradigm with a reward function balancing task performance, token efficiency, and communication readability. We explore various RL algorithms, including Supervised Fine-Tuning, Direct Preference Optimization, and their hybrid approaches, providing insights into their effectiveness-efficiency trade-offs. We integrate Monte Carlo Tree Search-inspired techniques for DPO data generation, treating conversation turns as tree nodes to explore diverse interaction paths. Evaluated on common multi-agent tasks, including information-asymmetric question answering and complex reasoning, Optima shows consistent and substantial improvements over single-agent baselines and vanilla MAS based on Llama 3 8B, achieving up to 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. Moreover, Optima’s efficiency gains open new possibilities for leveraging inference-compute more effectively, leading to improved inference-time scaling laws. By addressing fundamental challenges in LLM-based MAS, Optima shows the potential towards scalable, efficient, and effective MAS (this https URL).
摘要:基于大语言模型 (LLM) 的多智能体系统 (MAS) 在协作问题解决方面展现出显著潜力,但仍面临关键挑战:通信效率低下、可扩展性差以及缺乏有效的参数更新优化方法。我们提出了 Optima,这是一种新颖的框架,通过 LLM 训练显著提升了基于 LLM 的 MAS 中的通信效率和任务效果。Optima 采用了一种迭代生成、排序、选择和训练的范式,并结合了一个平衡任务性能、Token 效率和通信可读性的奖励函数。我们探讨了多种强化学习算法,包括监督微调 (Supervised Fine-Tuning)、直接偏好优化 (Direct Preference Optimization) 及其混合方法,提供了关于其效果-效率权衡的见解。我们整合了受蒙特卡洛树搜索启发的技术用于 DPO 数据生成,将对话轮次视为树节点以探索多样化的交互路径。在常见的多智能体任务中,包括信息不对称问答和复杂推理,Optima 相较于单智能体基线和基于 Llama 3 8B 的普通 MAS 显示出一致且显著的改进,在需要大量信息交换的任务中,性能提升高达 2.8 倍,且 Token 使用量减少不到 10%。此外,Optima 的效率提升为更有效地利用推理计算开辟了新的可能性,从而改进了推理时间的扩展规律。通过解决基于 LLM 的 MAS 中的基本挑战,Optima 展示了向可扩展、高效和有效的 MAS 发展的潜力 (this https URL)。

[NLP-16] Robust AI-Generated Text Detection by Restricted Embeddings EMNLP2024

【速读】: 该论文试图解决AI生成文本检测器在面对未知生成模型或语义领域时的鲁棒性问题。解决方案的关键在于通过清除Transformer文本编码器嵌入空间中的有害线性子空间,从而训练出能够忽略领域特定虚假特征的鲁棒分类器。研究者探索了多种子空间分解和特征选择策略,显著提升了跨领域和跨生成模型的迁移性能,特别是在RoBERTa和BERT嵌入的情况下,分别提高了9%和14%的平均分布外(OOD)分类得分。

链接: https://arxiv.org/abs/2410.08113
作者: Kristian Kuznetsov,Eduard Tulchinskii,Laida Kushnareva,German Magai,Serguei Barannikov,Sergey Nikolenko,Irina Piontkovskaya
关键词-EN: texts makes detecting, Growing amount, AI-generated texts makes, content more difficult, amount and quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of EMNLP 2024

点击查看摘要

Abstract:Growing amount and quality of AI-generated texts makes detecting such content more difficult. In most real-world scenarios, the domain (style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier, ignoring domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state of the art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings respectively. We release our code and data: this https URL
摘要:随着 AI 生成文本的数量和质量不断增长,检测此类内容变得更加困难。在大多数现实场景中,生成数据的领域(风格和主题)以及生成模型都是未知的。在本研究中,我们重点关注基于分类器的 AI 生成文本检测器的鲁棒性,即它们在未见过的生成器或语义领域中的迁移能力。我们研究了基于 Transformer 的文本编码器的嵌入空间的几何结构,并发现清除有害的线性子空间有助于训练一个鲁棒的分类器,忽略特定领域的虚假特征。我们探讨了几种子空间分解和特征选择策略,并在跨领域和跨生成器迁移方面显著超越了现有最先进的方法。我们最佳的头部和基于坐标的子空间移除方法分别在 RoBERTa 和 BERT 嵌入的特定设置中将平均分布外(OOD)分类分数提高了高达 9% 和 14%。我们发布了代码和数据:this https URL

[NLP-17] A Closer Look at Machine Unlearning for Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)中因记忆敏感或受版权保护内容而引发的隐私和法律问题。解决方案的关键在于引入机器遗忘(machine unlearning)技术,通过在不完全重新训练的情况下移除特定内容,同时保持模型整体性能。论文提出了三种新的评估指标来衡量遗忘后的模型输出:token多样性、句子语义和事实正确性。此外,论文将遗忘方法分为无目标和有目标两类,并针对各自的问题提出了改进措施:无目标遗忘采用最大化熵(ME)目标,而有目标遗忘则引入答案保留(AP)损失作为正则化手段。实验结果表明,这些方法在不同场景下均有效。

链接: https://arxiv.org/abs/2410.08109
作者: Xiaojian Yuan,Tianyu Pang,Chao Du,Kejiang Chen,Weiming Zhang,Min Lin
关键词-EN: Large language models, Large language, raising privacy, legal concerns, memorize sensitive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at this https URL.
摘要:大语言模型 (LLMs) 可能会记忆敏感或受版权保护的内容,引发隐私和法律问题。由于从头开始重新训练的成本高昂,研究人员尝试采用机器遗忘技术来移除 LLMs 中的特定内容,同时保持整体性能。本文讨论了 LLMs 机器遗忘的几个问题,并提供了我们对可能方法的见解。为了解决遗忘后模型输出评估不足的问题,我们引入了三个额外的指标来评估 Token 多样性、句子语义和事实正确性。随后,我们将遗忘方法分为无目标和有目标两类,并分别讨论了它们的问题。具体而言,无目标遗忘试图近似的行为是不可预测的,可能涉及幻觉,而现有的正则化对于有目标遗忘来说是不充分的。为了缓解这些问题,我们提出使用最大化熵 (ME) 作为无目标遗忘的目标,并将答案保留 (AP) 损失作为有目标遗忘的正则化。在三种场景(即虚构遗忘、持续遗忘和真实世界遗忘)中的实验结果表明了我们方法的有效性。代码可在以下链接获取:https URL。

[NLP-18] What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

【速读】: 该论文试图解决在代码生成任务中,如何通过提示技术(如链式思维)提升大型语言模型(LLMs)的输出效果,特别是在多轮自动重提示和计算需求方面的问题。解决方案的关键在于系统地分解推理、指令和执行反馈提示,并通过广泛的网格搜索在CodeContests和TACO基准上对多个LLM家族和尺寸(如Llama 3.0和3.1、8B、70B、405B和GPT-4o)进行实验。研究揭示了在不同模型和采样预算下,一致提升性能的策略,并通过微调使模型内化这些推理过程,从而在多轮代码生成任务中实现性能和可扩展性的提升。

链接: https://arxiv.org/abs/2410.08105
作者: Kunhao Zheng,Juliette Decugis,Jonas Gehring,Taco Cohen,Benjamin Negrevergne,Gabriel Synnaeve
关键词-EN: popular vehicle, vehicle for improving, improving the outputs, large language models, Prompting techniques
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompting techniques such as chain-of-thought have established themselves as a popular vehicle for improving the outputs of large language models (LLMs). For code generation, however, their exact mechanics and efficacy are under-explored. We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks CodeContests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the induced reasoning process and obtain improvements in performance and scalability for multi-turn code generation.
摘要:链式思维等提示技术已成为提升大语言模型 (LLM) 输出效果的流行手段。然而,在代码生成领域,其具体机制和效能尚未得到充分探索。因此,我们研究了多种提示策略对自动多轮重新提示和计算需求的影响。通过系统地分解推理、指令和执行反馈提示,我们在 CodeContests 和 TACO 这两个竞争性编程基准上,对多个 LLM 系列和规模(Llama 3.0 和 3.1,8B,70B,405B,以及 GPT-4o)进行了广泛的网格搜索。我们的研究表明,某些策略能在所有模型上持续提升性能,无论是在小样本还是大样本预算下。随后,我们展示了如何通过微调来使模型内化这些诱导的推理过程,从而在多轮代码生成中获得性能和可扩展性的提升。

[NLP-19] Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

【速读】: 该论文试图解决在大语言模型(LLM)预训练过程中数据选择效率低下的问题,特别是不同数据选择方法之间的内在冲突。解决方案的关键在于提出了一种新颖的多代理协作数据选择机制,其中每种数据选择方法作为一个独立的代理,并通过一个代理控制台动态整合所有代理的信息,从而在整个LLM训练过程中实现最优的数据选择。实验结果表明,该方法显著提高了数据效率,加速了LLM训练的收敛,并在多个语言模型基准测试中实现了平均10.5%的性能提升。

链接: https://arxiv.org/abs/2410.08102
作者: Tianyi Bai,Ling Yang,Zhen Hao Wong,Jiahui Peng,Xinlin Zhuang,Chi Zhang,Lijun Wu,Qiu Jiantao,Wentao Zhang,Binhang Yuan,Conghui He
关键词-EN: Efficient data selection, Efficient data, data selection, data, Efficient
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Efficient data selection is crucial to accelerate the pretraining of large language models (LLMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LLM pretraining. To tackle this problem, we propose a novel multi-agent collaborative data selection mechanism. In this framework, each data selection method serves as an independent agent, and an agent console is designed to dynamically integrate the information from all agents throughout the LLM training process. We conduct extensive empirical studies to evaluate our multi-agent framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LLM training, and achieves an average performance gain of 10.5% across multiple language model benchmarks compared to the state-of-the-art methods.
摘要:高效的数据选择对于加速大语言模型 (LLM) 的预训练至关重要。尽管已有多种方法被提出以提高数据效率,但鲜有研究探讨这些方法之间固有的冲突,以实现为 LLM 预训练选择最佳数据。为解决这一问题,我们提出了一种新颖的多智能体协同数据选择机制。在此框架中,每种数据选择方法作为一个独立的智能体,并设计了一个智能体控制台,以在整个 LLM 训练过程中动态整合所有智能体的信息。我们进行了广泛的实证研究来评估我们的多智能体框架。实验结果表明,我们的方法显著提高了数据效率,加速了 LLM 训练的收敛,并在多个语言模型基准测试中相较于最先进的方法实现了平均 10.5% 的性能提升。

[NLP-20] Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering

【速读】: 该论文试图解决在开放式、真实世界问答场景中,如何评估知识图谱(KGs)增强的大型语言模型(LLMs)的推理准确性和减少幻觉现象的问题。解决方案的关键在于引入了一个名为OKGQA的新基准,该基准设计用于评估在复杂应用场景下,KGs如何提升LLMs的性能,并特别关注了KGs在语义和结构上可能存在的错误对模型性能的影响,通过OKGQA-P实验设置来模拟这些错误情况。

链接: https://arxiv.org/abs/2410.08085
作者: Yuan Sui,Bryan Hooi
关键词-EN: Large Language Models, integrating Knowledge Graphs, Knowledge Graphs, Large Language, Recent works integrating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:Recent works integrating Knowledge Graphs (KGs) have led to promising improvements in enhancing reasoning accuracy of Large Language Models (LLMs). However, current benchmarks mainly focus on closed tasks, leaving a gap in the assessment of more complex, real-world scenarios. This gap has also obscured the evaluation of KGs’ potential to mitigate the problem of hallucination in LLMs. To fill the gap, we introduce OKGQA, a new benchmark specifically designed to assess LLMs enhanced with KGs under open-ended, real-world question answering scenarios. OKGQA is designed to closely reflect the complexities of practical applications using questions from different types, and incorporates specific metrics to measure both the reduction in hallucinations and the enhancement in reasoning capabilities. To consider the scenario in which KGs may have varying levels of mistakes, we further propose another experiment setting OKGQA-P to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated. OKGQA aims to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on methods and future directions for leveraging KGs to reduce LLMs’ hallucination. We believe that this study can facilitate a more complete performance comparison and encourage continuous improvement in integrating KGs with LLMs.
摘要:近期将知识图谱 (Knowledge Graphs, KGs) 与大语言模型 (Large Language Models, LLMs) 结合的研究工作显示出在提升推理准确性方面的显著成效。然而,当前的基准测试主要集中在封闭任务上,未能全面评估在更复杂、真实世界场景中的表现。这一差距也掩盖了知识图谱在缓解大语言模型幻觉问题上的潜力评估。为了填补这一空白,我们引入了 OKGQA,这是一个专门设计用于评估在开放式、真实世界问答场景中,通过知识图谱增强的大语言模型的新基准。OKGQA 旨在通过不同类型的问答题目,紧密反映实际应用的复杂性,并结合特定的度量标准来衡量幻觉减少和推理能力提升的效果。考虑到知识图谱可能存在不同程度的错误,我们进一步提出了另一种实验设置 OKGQA-P,以评估在知识图谱的语义和结构被故意扰动和污染时模型的表现。OKGQA 的目标是 (1) 探究知识图谱是否能在开放式设置中使大语言模型更加可信,以及 (2) 进行比较分析,揭示利用知识图谱减少大语言模型幻觉的方法和未来方向。我们相信,这项研究能够促进更全面的性能比较,并鼓励知识图谱与大语言模型结合的持续改进。

[NLP-21] Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning

【速读】: 该论文试图解决在监督微调(SFT)阶段中,使用packing技术是否能有效提升训练效率并保持模型性能的问题。解决方案的关键在于通过广泛的对比实验,分析packing与padding方法在不同模型规模和数据集大小下的表现,涵盖知识、推理和编码等多个基准测试,以及GPT模型的评估。研究结果提供了packing方法在不同训练场景中的优势与局限性,并为实际应用提供了实施建议。

链接: https://arxiv.org/abs/2410.08081
作者: Shuhe Wang,Guoyin Wang,Jiwei Li,Eduard Hovy,Chen Guo
关键词-EN: maximum input length, optimization technique designed, maximize hardware resource, model maximum input, hardware resource efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model’s maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2410.08081 [cs.LG] (or arXiv:2410.08081v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.08081 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要:打包技术最初在预训练阶段被采用,旨在通过将不同的训练序列组合以适应模型的最大输入长度,从而最大化硬件资源效率。尽管在预训练阶段已显示出其有效性,但对于监督微调 (SFT) 阶段,以下问题仍缺乏全面分析:(1) 打包是否能在保持性能的同时有效提升训练效率;(2) 适用于打包方法微调的模型和数据集的合适规模;(3) 打包无关或相关的训练样本是否可能导致模型过度忽略或过度依赖上下文。本文通过对比使用填充和打包的 SFT 方法,涵盖了从 69K 到 1.2M 的 SFT 数据集和从 8B 到 70B 的模型,首次全面分析了打包与填充的优缺点,以及在各种训练场景中实施打包的实际考虑。我们的分析涵盖了知识、推理和编码等多种基准,以及基于 GPT 的评估、时间效率和其他微调参数。我们还开源了微调和评估代码,并提供了在不同规模数据集上微调的检查点,旨在推动未来对打包方法的研究。代码可在以下链接获取:this https URL。

主题:机器学习 (cs.LG);人工智能 (cs.AI);计算与语言 (cs.CL)
引用方式:arXiv:2410.08081 [cs.LG] (或 arXiv:2410.08081v1 [cs.LG] 用于此版本)
https://doi.org/10.48550/arXiv.2410.08081
通过 DataCite 发布的 arXiv DOI (待注册)

[NLP-22] aching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

【速读】: 该论文试图解决大语言模型(LLMs)在算术推理任务中的表现不足问题。解决方案的关键在于提出了一种名为“教学启发式综合框架”的新方法,该框架模拟了教师指导学生的教学过程,通过向LLMs传授必要的概念、相关定理以及类似问题的解决策略,从而增强其推理能力。此外,论文还引入了两个新的中文数据集MathMC和MathToF,并通过实验验证了该方法在多个基准测试中的有效性,显著提升了LLMs在算术推理任务中的准确率。

链接: https://arxiv.org/abs/2410.08068
作者: Wenting Tan,Dongxiao Chen,Jieting Xue,Zihao Wang,Taijie Chen
关键词-EN: Large Language Models, Large Language, Language Models, exhibit impressive performance, arithmetic reasoning tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. However, these approaches overlook crucial requirements for prior knowledge of specific concepts, theorems, and tricks to tackle most arithmetic reasoning problems successfully. To address this issue, we propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students. This method equips LLMs with essential concepts, relevant theorems, and similar problems with analogous solution approaches, facilitating the enhancement of reasoning abilities. Additionally, we introduce two new Chinese datasets, MathMC and MathToF, both with detailed explanations and answers. Experiments are conducted on nine benchmarks which demonstrates that our approach improves the reasoning accuracy of LLMs. With GPT-4 and our framework, we achieve new state-of-the-art performance on four math benchmarks (AddSub, SVAMP, Math23K and AQuA) with accuracies of 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) and 81.1% (+1.2%). Our data and code are available at this https URL.
摘要:大语言模型 (LLMs) 在各个领域展现出令人印象深刻的表现,但在算术推理任务上仍面临挑战。最近的研究表明,提示设计方法在提升推理能力方面具有显著效果。然而,这些方法忽视了成功解决大多数算术推理问题所需的关键先验知识,包括特定概念、定理和技巧。为解决这一问题,我们提出了一种新颖且有效的教学启发式集成框架,模拟了教师指导学生的教学过程。该方法为大语言模型配备了必要的概念、相关定理以及具有类似解决方案的类似问题,从而促进了推理能力的提升。此外,我们引入了两个新的中文数据集,MathMC 和 MathToF,两者均包含详细的解释和答案。我们在九个基准上进行了实验,结果表明我们的方法提高了大语言模型的推理准确性。结合 GPT-4 和我们的框架,我们在四个数学基准 (AddSub, SVAMP, Math23K 和 AQuA) 上达到了新的最先进性能,准确率分别为 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) 和 81.1% (+1.2%)。我们的数据和代码可在以下链接获取:https URL。

[NLP-23] Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions EMNLP2024

【速读】: 该论文试图解决自动生成的反馈是否能有效提升学生写作质量的问题,并面临生成反馈的指令缺乏共识的挑战。解决方案的关键在于提出了一种名为PROF的方法,通过学习语言模型模拟的学生修订过程来迭代优化反馈生成器,直接最大化学生整体修订表现的有效性。实验结果表明,PROF不仅在提升学生写作质量方面优于多种基线方法,还展现了增强的教学价值,尽管这一方面并未被显式训练。

链接: https://arxiv.org/abs/2410.08058
作者: Inderjeet Nair,Jiaye Tan,Xiaotian Su,Anne Gere,Xu Wang,Lu Wang
关键词-EN: Providing feedback, widely recognized, recognized as crucial, crucial for refining, students’ writing skills
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Providing feedback is widely recognized as crucial for refining students’ writing skills. Recent advances in language models (LMs) have made it possible to automatically generate feedback that is actionable and well-aligned with human-specified attributes. However, it remains unclear whether the feedback generated by these models is truly effective in enhancing the quality of student revisions. Moreover, prompting LMs with a precise set of instructions to generate feedback is nontrivial due to the lack of consensus regarding the specific attributes that can lead to improved revising performance. To address these challenges, we propose PROF that PROduces Feedback via learning from LM simulated student revisions. PROF aims to iteratively optimize the feedback generator by directly maximizing the effectiveness of students’ overall revising performance as simulated by LMs. Focusing on an economic essay assignment, we empirically test the efficacy of PROF and observe that our approach not only surpasses a variety of baseline methods in effectiveness of improving students’ writing but also demonstrates enhanced pedagogical values, even though it was not explicitly trained for this aspect.
摘要:提供反馈被广泛认为是提升学生写作技能的关键。语言模型 (LMs) 的最新进展使得自动生成与人类指定属性高度一致且具有操作性的反馈成为可能。然而,这些模型生成的反馈是否真正有效提升学生修改质量仍不明确。此外,由于缺乏关于哪些具体属性能够提升修改效果的共识,通过精确指令引导 LMs 生成反馈并非易事。为应对这些挑战,我们提出了 PROF,即通过学习 LM 模拟的学生修改来生成反馈。PROF 旨在通过直接最大化 LM 模拟的学生整体修改效果,迭代优化反馈生成器。我们以经济学论文作业为例,实证检验了 PROF 的有效性,发现我们的方法不仅在提升学生写作效果方面超越了多种基线方法,还展现了增强的教学价值,尽管在这方面并未进行显式训练。

[NLP-24] A Target-Aware Analysis of Data Augmentation for Hate Speech Detection

【速读】: 该论文试图解决现有仇恨言论检测系统在处理少数群体(如残障人士或老年人)相关数据时表现不佳的问题。解决方案的关键在于利用生成语言模型(如自回归模型和序列到序列模型)来扩展现有数据集,特别是针对代表性不足的群体生成合成数据,从而减少目标不平衡问题。通过结合传统数据增强方法和生成模型,论文发现这种混合方法在某些仇恨类别(如种族、宗教和残障)的分类性能上比无增强基线提高了超过10%的F1分数,从而推动了更公平和更具包容性的仇恨言论检测系统的发展。

链接: https://arxiv.org/abs/2410.08053
作者: Camilla Casula,Sara Tonelli
关键词-EN: main threats posed, Hate speech, hate speech detection, Measuring Hate Speech, social networks
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hate speech is one of the main threats posed by the widespread use of social networks, despite efforts to limit it. Although attention has been devoted to this issue, the lack of datasets and case studies centered around scarcely represented phenomena, such as ableism or ageism, can lead to hate speech detection systems that do not perform well on underrepresented identity groups. Given the unpreceded capabilities of LLMs in producing high-quality data, we investigate the possibility of augmenting existing data with generative language models, reducing target imbalance. We experiment with augmenting 1,000 posts from the Measuring Hate Speech corpus, an English dataset annotated with target identity information, adding around 30,000 synthetic examples using both simple data augmentation methods and different types of generative models, comparing autoregressive and sequence-to-sequence approaches. We find traditional DA methods to often be preferable to generative models, but the combination of the two tends to lead to the best results. Indeed, for some hate categories such as origin, religion, and disability, hate speech classification using augmented data for training improves by more than 10% F1 over the no augmentation baseline. This work contributes to the development of systems for hate speech detection that are not only better performing but also fairer and more inclusive towards targets that have been neglected so far.
摘要:仇恨言论是社交网络广泛使用带来的主要威胁之一,尽管各方努力限制其传播。尽管对此问题已引起关注,但由于缺乏针对较少代表性现象(如残障歧视或年龄歧视)的数据集和案例研究,仇恨言论检测系统在处理代表性不足的身份群体时表现不佳。鉴于大语言模型 (LLM) 在生成高质量数据方面的卓越能力,我们探讨了利用生成式语言模型增强现有数据以减少目标不平衡的可能性。我们通过实验,对来自 Measuring Hate Speech 语料库的 1,000 条帖子进行了数据增强,该语料库是一个带有目标身份信息标注的英文数据集,通过简单数据增强方法和不同类型的生成模型增加了约 30,000 个合成样本,比较了自回归和序列到序列方法的效果。我们发现,传统数据增强方法通常优于生成模型,但两者的结合往往能带来最佳结果。实际上,对于某些仇恨类别,如起源、宗教和残疾,使用增强数据进行训练的仇恨言论分类在 F1 分数上比无增强基线提高了超过 10%。这项工作有助于开发不仅性能更优,而且对迄今为止被忽视的目标群体更加公平和包容的仇恨言论检测系统。

[NLP-25] VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

【速读】: 该论文试图解决将强化学习中的Q-learning技术应用于大型语言模型(LLMs)中的验证器模型时所面临的关键挑战,包括处理话语级别的马尔可夫决策过程(MDPs)、管理大规模动作空间以及缓解高估偏差。解决方案的关键在于提出了VerifierQ方法,该方法通过引入改进的Bellman更新、结合隐式Q-learning(IQL)以高效管理动作空间,并采用保守Q-learning(CQL)公式来平衡Q值估计,从而实现了并行Q值计算和提高训练效率。这一集成强化学习原则的验证器模型方法,不仅提升了生成器技术的现有进展,还为LLMs在复杂认知任务中的鲁棒性和适应性提供了潜在的增强。

链接: https://arxiv.org/abs/2410.08048
作者: Jianing Qi,Hao Tang,Zhigang Zhu
关键词-EN: test time compute, Large Language Models, Large Language, Language Models, verifier models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced Q-value estimation. Our method enables parallel Q-value computation and improving training efficiency. While recent work has explored RL techniques like MCTS for generators, VerifierQ is among the first to investigate the verifier (critic) aspect in LLMs through Q-learning. This integration of RL principles into verifier models complements existing advancements in generator techniques, potentially enabling more robust and adaptive reasoning in LLMs. Experimental results on mathematical reasoning tasks demonstrate VerifierQ’s superior performance compared to traditional supervised fine-tuning approaches, with improvements in efficiency, accuracy and robustness. By enhancing the synergy between generation and evaluation capabilities, VerifierQ contributes to the ongoing evolution of AI systems in addressing complex cognitive tasks across various domains.
摘要:近年来,测试时间计算的进步,特别是通过使用验证模型,显著增强了大型语言模型 (LLM) 的推理能力。这种生成器-验证器方法与强化学习 (RL) 中的演员-评论家框架非常相似。然而,当前的 LLM 验证模型通常依赖于监督微调,而没有采用如 Q-学习这样的时间差分学习。本文介绍了 VerifierQ,这是一种将离线 Q-学习集成到 LLM 验证模型中的新颖方法。我们解决了将 Q-学习应用于 LLM 的三个关键挑战:(1) 处理话语级别的马尔可夫决策过程 (MDP),(2) 管理大规模动作空间,以及 (3) 缓解高估偏差。VerifierQ 引入了针对有界 Q-值的修正贝尔曼更新,结合了隐式 Q-学习 (IQL) 以实现高效的动作空间管理,并集成了新的保守 Q-学习 (CQL) 公式以实现平衡的 Q-值估计。我们的方法支持并行 Q-值计算并提高训练效率。尽管最近的工作已经探索了如蒙特卡洛树搜索 (MCTS) 这样的 RL 技术用于生成器,但 VerifierQ 是首批通过 Q-学习研究 LLM 中验证器 (评论家) 方面的方法之一。这种将 RL 原理集成到验证模型中的做法补充了现有生成器技术的进步,有望使 LLM 在推理方面更加稳健和适应性强。在数学推理任务上的实验结果表明,与传统的监督微调方法相比,VerifierQ 在效率、准确性和鲁棒性方面均表现出优越的性能。通过增强生成和评估能力之间的协同作用,VerifierQ 为 AI 系统在处理跨多个领域的复杂认知任务方面的持续进化做出了贡献。

[NLP-26] Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning

【速读】: 该论文试图解决大型语言模型(LLM)在处理复杂逻辑推理任务时,由于无法完全捕捉自然语言中的复杂逻辑语义,导致翻译成一阶逻辑公式时出现困难的问题。解决方案的关键在于提出了一种组合式一阶逻辑翻译方法(Compositional First-Order Logic Translation),即首先将自然语言句子解析为新定义的逻辑依赖结构,然后依次翻译这些解析后的子句。为确保翻译结果的可靠性,论文还引入了两种验证算法,并通过SAT求解器严格比较生成的一阶逻辑公式的语义,选择最可能的公式。实验结果表明,该方法(CLOVER)在多个逻辑推理基准测试中优于以往的神经符号方法,达到了新的最先进水平。

链接: https://arxiv.org/abs/2410.08047
作者: Hyun Ryu,Gyeongman Kim,Hyemin S. Lee,Eunho Yang
关键词-EN: large language model, reasoning tasks require, prompting still falls, falls short, tasks require
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Complex logical reasoning tasks require a long sequence of reasoning, which a large language model (LLM) with chain-of-thought prompting still falls short. To alleviate this issue, neurosymbolic approaches incorporate a symbolic solver. Specifically, an LLM only translates a natural language problem into a satisfiability (SAT) problem that consists of first-order logic formulas, and a sound symbolic solver returns a mathematically correct solution. However, we discover that LLMs have difficulties to capture complex logical semantics hidden in the natural language during translation. To resolve this limitation, we propose a Compositional First-Order Logic Translation. An LLM first parses a natural language sentence into newly defined logical dependency structures that consist of an atomic subsentence and its dependents, then sequentially translate the parsed subsentences. Since multiple logical dependency structures and sequential translations are possible for a single sentence, we also introduce two Verification algorithms to ensure more reliable results. We utilize an SAT solver to rigorously compare semantics of generated first-order logic formulas and select the most probable one. We evaluate the proposed method, dubbed CLOVER, on seven logical reasoning benchmarks and show that it outperforms the previous neurosymbolic approaches and achieves new state-of-the-art results.
摘要:复杂的逻辑推理任务需要进行长序列的推理,即使使用带有思维链提示的大语言模型 (LLM) 也难以胜任。为了缓解这一问题,神经符号方法引入了符号求解器。具体而言,LLM 仅将自然语言问题转化为由一阶逻辑公式组成的可满足性 (SAT) 问题,而一个可靠的符号求解器则返回一个数学上正确的解决方案。然而,我们发现 LLM 在翻译过程中难以捕捉自然语言中隐藏的复杂逻辑语义。为了解决这一局限性,我们提出了一种组合式一阶逻辑翻译方法。LLM 首先将自然语言句子解析为新定义的逻辑依赖结构,这些结构由一个原子子句及其依赖项组成,然后依次翻译这些解析后的子句。由于单个句子可能存在多种逻辑依赖结构和顺序翻译方式,我们还引入了两种验证算法,以确保结果的可靠性。我们利用 SAT 求解器严格比较生成的一阶逻辑公式的语义,并选择最可能的公式。我们在七个逻辑推理基准上评估了所提出的方法,称为 CLOVER,结果表明它优于以往的神经符号方法,并达到了新的最先进水平。

[NLP-27] he Rise of AI-Generated Content in Wikipedia

【速读】: 该论文试图解决AI生成内容在信息源中的广泛存在对责任性、准确性和偏见放大的担忧,以及对基于大规模互联网数据训练语言模型长期可行性的质疑。解决方案的关键在于使用GPTZero和Binoculars两种AI检测工具,通过对比GPT-3.5发布前后创建的维基百科页面,确定AI生成内容的下限。研究发现,新创建的维基百科页面中AI生成内容显著增加,且这些内容通常质量较低,倾向于自推广或带有特定观点,尤其是在争议性话题上。

链接: https://arxiv.org/abs/2410.08044
作者: Creston Brooks,Samuel Eggert,Denis Peskoff
关键词-EN: popular information sources, information sources raises, sources raises significant, raises significant concerns, concerns about accountability
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of AI-generated content in popular information sources raises significant concerns about accountability, accuracy, and bias amplification. Beyond directly impacting consumers, the widespread presence of this content poses questions for the long-term viability of training language models on vast internet sweeps. We use GPTZero, a proprietary AI detector, and Binoculars, an open-source alternative, to establish lower bounds on the presence of AI-generated content in recently created Wikipedia pages. Both detectors reveal a marked increase in AI-generated content in recent pages compared to those from before the release of GPT-3.5. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint on controversial topics.
摘要:AI生成内容在流行信息源中的兴起引发了关于责任、准确性和偏见放大的重大关切。除了直接影响消费者外,这种内容的广泛存在还对在大规模互联网数据上训练语言模型的长期可行性提出了质疑。我们使用GPTZero,一种专有的AI检测器,以及Binoculars,一种开源替代方案,来确定最近创建的维基百科页面中AI生成内容的下限。两种检测器均显示,与GPT-3.5发布前的页面相比,近期页面中AI生成内容显著增加。在预GPT-3.5文章上校准阈值以实现1%的误报率后,检测器标记了超过5%的新创建的英文维基百科文章为AI生成,德语、法语和意大利语文章的标记比例较低。被标记的维基百科文章通常质量较低,且往往具有自我宣传性或对争议话题的特定观点有所偏颇。

[NLP-28] Composite Learning Units: Generalized Learning Beyond Parameter Updates to Transform LLMs into Adaptive Reasoners

【速读】: 该论文试图解决传统机器学习模型在面对复杂任务时缺乏持续学习和适应能力的问题。解决方案的关键在于引入复合学习单元(Composite Learning Units, CLUs),通过构建动态知识库(包括通用知识空间和提示特定知识空间),使大型语言模型(LLMs)能够在不进行传统参数更新的情况下,通过持续的交互和反馈进行广义和连续的学习。CLUs通过目标驱动的迭代过程不断优化知识库,从而在复杂任务中动态适应、提取细微见解并自主积累经验,显著提升模型的推理能力。

链接: https://arxiv.org/abs/2410.08037
作者: Santosh Kumar Radha,Oktay Goktas
关键词-EN: Human learning thrives, Large Language Models, Composite Learning Units, static machine learning, Human learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Human learning thrives on the ability to learn from mistakes, adapt through feedback, and refine understanding-processes often missing in static machine learning models. In this work, we introduce Composite Learning Units (CLUs) designed to transform reasoners, such as Large Language Models (LLMs), into learners capable of generalized, continuous learning without conventional parameter updates while enhancing their reasoning abilities through continual interaction and feedback. CLUs are built on an architecture that allows a reasoning model to maintain and evolve a dynamic knowledge repository: a General Knowledge Space for broad, reusable insights and a Prompt-Specific Knowledge Space for task-specific learning. Through goal-driven interactions, CLUs iteratively refine these knowledge spaces, enabling the system to adapt dynamically to complex tasks, extract nuanced insights, and build upon past experiences autonomously. We demonstrate CLUs’ effectiveness through a cryptographic reasoning task, where they continuously evolve their understanding through feedback to uncover hidden transformation rules. While conventional models struggle to grasp underlying logic, CLUs excel by engaging in an iterative, goal-oriented process. Specialized components-handling knowledge retrieval, prompt generation, and feedback analysis-work together within a reinforcing feedback loop. This approach allows CLUs to retain the memory of past failures and successes, adapt autonomously, and apply sophisticated reasoning effectively, continually learning from mistakes while also building on breakthroughs.
摘要:人类学习之所以蓬勃发展,是因为其具备从错误中学习、通过反馈进行适应以及不断完善理解的能力,而这些过程在静态机器学习模型中往往缺失。在本研究中,我们引入了复合学习单元 (Composite Learning Units, CLUs),旨在将推理器(如大语言模型 (Large Language Models, LLMs))转变为能够进行广义、连续学习的学习者,而无需传统的参数更新,同时通过持续的交互和反馈增强其推理能力。CLUs 构建于一种架构之上,该架构允许推理模型维护并演化一个动态的知识库:一个用于广泛、可复用洞察的通用知识空间 (General Knowledge Space) 和一个用于任务特定学习的提示特定知识空间 (Prompt-Specific Knowledge Space)。通过目标驱动的交互,CLUs 迭代地优化这些知识空间,使系统能够动态适应复杂任务,提取细微的洞察,并自主地基于过去的经验进行构建。我们通过一个密码学推理任务展示了 CLUs 的有效性,在该任务中,CLUs 通过反馈持续演化其理解,以揭示隐藏的变换规则。尽管传统模型难以掌握底层逻辑,但 CLUs 通过参与迭代、目标导向的过程表现出色。专门处理知识检索、提示生成和反馈分析的组件在一个增强反馈回路中协同工作。这种方法使 CLUs 能够保留过去失败和成功的记忆,自主适应,并有效地应用复杂的推理,持续从错误中学习,同时也在突破中不断进步。

[NLP-29] Private Language Models via Truncated Laplacian Mechanism EMNLP2024

【速读】: 该论文试图解决在自然语言处理任务中深度学习模型易受隐私攻击的问题,特别是在高隐私保护需求下,现有的差分隐私(DP)方法在嵌入空间中的扰动效果不佳的问题。解决方案的关键在于提出了一种新的私有嵌入方法——高维截断拉普拉斯机制。该方法通过引入截断拉普拉斯机制的高维扩展,理论上证明了其方差低于现有方法,从而在高隐私保护需求下仍能保持较高的实用性。实验结果表明,即使在高度隐私保护的场景下,该方法的实用性损失也较小。

链接: https://arxiv.org/abs/2410.08027
作者: Tianhao Huang,Tao Yang,Ivan Habernal,Lijie Hu,Di Wang
关键词-EN: Deep learning models, Deep learning, models for NLP, truncated Laplacian mechanism, NLP tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by EMNLP 2024, Main Track

点击查看摘要

Abstract:Deep learning models for NLP tasks are prone to variants of privacy attacks. To prevent privacy leakage, researchers have investigated word-level perturbations, relying on the formal guarantees of differential privacy (DP) in the embedding space. However, many existing approaches either achieve unsatisfactory performance in the high privacy regime when using the Laplacian or Gaussian mechanism, or resort to weaker relaxations of DP that are inferior to the canonical DP in terms of privacy strength. This raises the question of whether a new method for private word embedding can be designed to overcome these limitations. In this paper, we propose a novel private embedding method called the high dimensional truncated Laplacian mechanism. Specifically, we introduce a non-trivial extension of the truncated Laplacian mechanism, which was previously only investigated in one-dimensional space cases. Theoretically, we show that our method has a lower variance compared to the previous private word embedding methods. To further validate its effectiveness, we conduct comprehensive experiments on private embedding and downstream tasks using three datasets. Remarkably, even in the high privacy regime, our approach only incurs a slight decrease in utility compared to the non-private scenario.
摘要:用于自然语言处理 (NLP) 任务的深度学习模型容易受到多种隐私攻击。为了防止隐私泄露,研究人员探索了基于嵌入空间中差分隐私 (Differential Privacy, DP) 形式保证的词级扰动方法。然而,许多现有方法要么在使用拉普拉斯或高斯机制的高隐私环境下表现不佳,要么依赖于比标准差分隐私更弱的松弛形式,从而在隐私强度方面不如标准差分隐私。这引发了一个问题:是否可以设计一种新的私有词嵌入方法来克服这些局限性。在本文中,我们提出了一种名为高维截断拉普拉斯机制的新型私有嵌入方法。具体而言,我们引入了一种非平凡的截断拉普拉斯机制扩展,该机制此前仅在一维空间中进行了研究。理论上,我们证明了我们的方法相比之前的私有词嵌入方法具有更低的方差。为了进一步验证其有效性,我们在三个数据集上进行了全面的实验,涵盖私有嵌入和下游任务。值得注意的是,即使在高度隐私的环境下,我们的方法相较于非私有场景仅导致轻微的效用下降。

[NLP-30] LLM Cascade with Multi-Objective Optimal Consideration

【速读】: 该论文试图解决大型语言模型(LLMs)在实际应用中部署成本高的问题,并提出了一种新的多目标优化LLM级联策略。解决方案的关键在于通过级联本地和服务器模型,不仅优化性能与成本的权衡,还考虑了隐私等额外目标,从而更好地满足实际应用中的复杂需求,同时保持原有的级联能力。实验结果表明,该方法在多个基准测试中表现出色,验证了其有效性和优越性。

链接: https://arxiv.org/abs/2410.08014
作者: Kai Zhang,Liqian Peng,Congchao Wang,Alec Go,Xiaozhong Liu
关键词-EN: Large Language Models, generating natural language, Large Language, demonstrated exceptional capabilities, natural language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities in understanding and generating natural language. However, their high deployment costs often pose a barrier to practical applications, especially. Cascading local and server models offers a promising solution to this challenge. While existing studies on LLM cascades have primarily focused on the performance-cost trade-off, real-world scenarios often involve more complex requirements. This paper introduces a novel LLM Cascade strategy with Multi-Objective Optimization, enabling LLM cascades to consider additional objectives (e.g., privacy) and better align with the specific demands of real-world applications while maintaining their original cascading abilities. Extensive experiments on three benchmarks validate the effectiveness and superiority of our approach.
摘要:大语言模型 (LLMs) 在理解和生成自然语言方面展示了卓越的能力。然而,其高昂的部署成本常常成为实际应用的障碍,尤其是在这方面。级联本地和服务器模型为解决这一挑战提供了有前景的解决方案。尽管现有关于 LLM 级联的研究主要集中在性能-成本权衡上,但现实场景往往涉及更为复杂的需求。本文介绍了一种新颖的 LLM 级联策略,采用多目标优化,使 LLM 级联能够考虑额外的目标(例如,隐私),并在保持其原有级联能力的同时,更好地满足现实应用的特定需求。在三个基准上的广泛实验验证了我们方法的有效性和优越性。

[NLP-31] Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets

【速读】: 该论文试图解决在线平台上仇恨言论检测系统中的人类标注偏见问题,特别是标注者与仇恨言论目标之间的社会人口学特征如何相互作用并影响标注结果。解决方案的关键在于利用包含丰富社会人口学信息的广泛数据集,分析标注者与目标特征之间的关联,量化并描述这些偏见的强度和普遍性,并将其与基于角色的语言模型(LLMs)的偏见进行比较。通过这种方式,论文揭示了人类标注偏见与LLMs偏见之间的显著差异,为设计更公正的AI驱动的仇恨言论检测系统提供了新的见解。

链接: https://arxiv.org/abs/2410.07991
作者: Tommaso Giorgi,Lorenzo Cima,Tiziano Fagni,Marco Avvenuti,Stefano Cresci
关键词-EN: online platforms exacerbated, hate speech detection, hate speech, speech detection systems, speech detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rise of online platforms exacerbated the spread of hate speech, demanding scalable and effective detection. However, the accuracy of hate speech detection systems heavily relies on human-labeled data, which is inherently susceptible to biases. While previous work has examined the issue, the interplay between the characteristics of the annotator and those of the target of the hate are still unexplored. We fill this gap by leveraging an extensive dataset with rich socio-demographic information of both annotators and targets, uncovering how human biases manifest in relation to the target’s attributes. Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence, revealing marked differences. Furthermore, we compare human biases with those exhibited by persona-based LLMs. Our findings indicate that while persona-based LLMs do exhibit biases, these differ significantly from those of human annotators. Overall, our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.
摘要:在线平台的兴起加剧了仇恨言论的传播,迫切需要可扩展且有效的检测手段。然而,仇恨言论检测系统的准确性在很大程度上依赖于人工标注的数据,而这些数据本身就容易受到偏见的影响。尽管先前的工作已经探讨了这一问题,但标注者特征与仇恨言论目标特征之间的相互作用仍未得到充分研究。我们通过利用一个包含丰富社会人口统计信息的大型数据集,填补了这一空白,揭示了人类偏见如何与目标属性相关联。我们的分析揭示了普遍存在的偏见,并根据其强度和普遍性对其进行了定量描述和特征化,发现了显著差异。此外,我们将人类偏见与基于角色的大语言模型 (LLM) 所展现的偏见进行了比较。研究结果表明,尽管基于角色的大语言模型确实表现出偏见,但这些偏见与人类标注者的偏见存在显著差异。总体而言,我们的工作为仇恨言论标注中的人类偏见提供了新的细致入微的结果,并为基于AI的仇恨言论检测系统的设计带来了新的见解。

[NLP-32] Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

【速读】: 该论文试图解决现有大型语言模型(LLMs)在数学推理能力评估中的不足,特别是现有基准(如GSM8K和MATH)已无法有效挑战这些模型的问题。解决方案的关键在于提出一个全新的、更具挑战性的基准,专门用于评估LLMs在奥林匹克数学竞赛级别的数学推理能力。该基准包含4428个经过严格人工标注的竞赛级数学问题,涵盖33个子领域和10个难度级别,能够全面评估模型在奥林匹克数学推理中的表现。实验结果表明,即使是目前最先进的模型(如OpenAI o1-mini和OpenAI o1-preview)在处理高难度奥林匹克级别问题时也面临显著挑战,准确率分别为60.54%和52.55%,凸显了奥林匹克级别数学推理的难度。

链接: https://arxiv.org/abs/2410.07985
作者: Bofei Gao,Feifan Song,Zhe Yang,Zefan Cai,Yibo Miao,Qingxiu Dong,Lei Li,Chenghao Ma,Liang Chen,Runxin Xu,Zhengyang Tang,Benyou Wang,Daoguang Zan,Shanghaoran Quan,Ge Zhang,Lei Sha,Yichang Zhang,Xuancheng Ren,Tianyu Liu,Baobao Chang
关键词-EN: Recent advancements, large language models, advancements in large, large language, mathematical reasoning capabilities
类目: Computation and Language (cs.CL)
备注: 26 Pages, 17 Figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs’ mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.
摘要:近年来,大语言模型 (LLM) 在数学推理能力方面取得了显著进展。然而,现有的基准测试如 GSM8K 或 MATH 已经被高精度解决(例如,OpenAI o1 在 MATH 数据集上达到了 94.8% 的准确率),这表明它们已不足以真正挑战这些模型。为了填补这一空白,我们提出了一项全面且具有挑战性的基准测试,专门用于评估 LLM 在奥林匹克级别的数学推理能力。与现有的奥林匹克相关基准不同,我们的数据集专注于数学领域,并包含大量经过严格人工标注的 4428 道竞赛级别问题。这些问题被精心分类为超过 33 个子领域,并涵盖了超过 10 种不同的难度级别,从而能够全面评估模型在奥林匹克数学推理中的表现。此外,我们基于此基准进行了深入分析。实验结果表明,即使是目前最先进的模型,OpenAI o1-mini 和 OpenAI o1-preview,在处理高难度的奥林匹克级别问题时也面临困难,准确率分别为 60.54% 和 52.55%,这突显了奥林匹克级别数学推理中的重大挑战。

[NLP-33] COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act

【速读】: 该论文试图解决欧盟人工智能法案(AI Act)在技术层面的解释和实施问题,特别是如何将法案的广泛监管要求转化为可测量的技术要求,并应用于大型语言模型(LLMs)的评估。解决方案的关键在于提出了COMPL-AI框架,该框架包括对AI Act的首个技术解释,以及一个基于最先进LLM基准的开源基准测试套件。通过评估12个知名LLM,论文揭示了现有模型和基准在鲁棒性、安全性、多样性和公平性等方面的不足,强调了未来LLM开发和监管基准应更加注重这些方面,从而推动更全面、符合法规的模型评估和开发。

链接: https://arxiv.org/abs/2410.07959
作者: Philipp Guldimann,Alexander Spiridonov,Robin Staab,Nikola Jovanović,Mark Vero,Velko Vechev,Anna Gueorguieva,Mislav Balunović,Nikola Konstantinov,Pavol Bielik,Petar Tsankov,Martin Vechev
关键词-EN: Artificial Intelligence Act, assess models’ compliance, Artificial Intelligence, lacks clear technical, clear technical interpretation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The EU’s Artificial Intelligence Act (AI Act) is a significant step towards responsible AI development, but lacks clear technical interpretation, making it difficult to assess models’ compliance. This work presents COMPL-AI, a comprehensive framework consisting of (i) the first technical interpretation of the EU AI Act, translating its broad regulatory requirements into measurable technical requirements, with the focus on large language models (LLMs), and (ii) an open-source Act-centered benchmarking suite, based on thorough surveying and implementation of state-of-the-art LLM benchmarks. By evaluating 12 prominent LLMs in the context of COMPL-AI, we reveal shortcomings in existing models and benchmarks, particularly in areas like robustness, safety, diversity, and fairness. This work highlights the need for a shift in focus towards these aspects, encouraging balanced development of LLMs and more comprehensive regulation-aligned benchmarks. Simultaneously, COMPL-AI for the first time demonstrates the possibilities and difficulties of bringing the Act’s obligations to a more concrete, technical level. As such, our work can serve as a useful first step towards having actionable recommendations for model providers, and contributes to ongoing efforts of the EU to enable application of the Act, such as the drafting of the GPAI Code of Practice.
摘要:欧盟的人工智能法案 (AI Act) 是迈向负责任 AI 开发的重要一步,但缺乏明确的技术解释,使得评估模型合规性变得困难。本研究提出了 COMPL-AI,这是一个全面的框架,包括 (i) 欧盟 AI 法案的首次技术解释,将广泛的监管要求转化为可衡量的技术要求,重点在于大语言模型 (LLMs),以及 (ii) 一个基于对最先进 LLM 基准进行全面调查和实施的开源法案中心基准测试套件。通过在 COMPL-AI 的背景下评估 12 个著名 LLM,我们揭示了现有模型和基准在鲁棒性、安全性、多样性和公平性等方面的不足。本研究强调了在这些方面进行重点转变的必要性,鼓励 LLM 的平衡发展以及更全面的法规对齐基准。同时,COMPL-AI 首次展示了将法案义务转化为更具体、技术层面的可能性和困难。因此,我们的工作可以作为模型提供者可操作建议的有用第一步,并有助于欧盟推动法案应用的持续努力,例如起草 GPAI 行为准则。

[NLP-34] Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions

【速读】: 该论文试图解决疾病实体识别(DER)和疾病实体归一化(DEN)中由于罕见疾病在训练语料和知识图谱中提及较少,导致高质量训练样本不足的问题。解决方案的关键在于利用大型语言模型(LLM)如LLaMa-2 13B Chat生成合成训练数据,通过微调模型生成包含统一医学语言系统(UMLS)疾病语义组概念的标准化提及,从而显著提升DEN的性能,尤其是在分布外(OOD)数据上的表现。虽然对DER的提升有限,但合成数据的引入显著改善了DEN的准确性和鲁棒性。

链接: https://arxiv.org/abs/2410.07951
作者: Kuleen Sasse,Shinjitha Vadlakonda,Richard E. Kennedy,John D. Osborne
关键词-EN: Knowledge Graphs, clinical named entity, named entity recognition, Disease Entity Recognition, entity recognition
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 3 figures, 7 tables

点击查看摘要

Abstract:Background: Machine learning methods for clinical named entity recognition and entity normalization systems can utilize both labeled corpora and Knowledge Graphs (KGs) for learning. However, infrequently occurring concepts may have few mentions in training corpora and lack detailed descriptions or synonyms, even in large KGs. For Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), this can result in fewer high quality training examples relative to the number of known diseases. Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks. Methods: We fine-tuned a LLaMa-2 13B Chat LLM to generate a synthetic corpus containing normalized mentions of concepts from the Unified Medical Language System (UMLS) Disease Semantic Group. We measured overall and Out of Distribution (OOD) performance for DER and DEN, with and without synthetic data augmentation. We evaluated performance on 3 different disease corpora using 4 different data augmentation strategies, assessed using BioBERT for DER and SapBERT and KrissBERT for DEN. Results: Our synthetic data yielded a substantial improvement for DEN, in all 3 training corpora the top 1 accuracy of both SapBERT and KrissBERT improved by 3-9 points in overall performance and by 20-55 points in OOD data. A small improvement (1-2 points) was also seen for DER in overall performance, but only one dataset showed OOD improvement. Conclusion: LLM generation of normalized disease mentions can improve DEN relative to normalization approaches that do not utilize LLMs to augment data with synthetic mentions. Ablation studies indicate that performance gains for DEN were only partially attributable to improvements in OOD performance. The same approach has only a limited ability to improve DER. We make our software and dataset publicly available. Comments: 21 pages, 3 figures, 7 tables Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) ACMclasses: I.2.7; J.3 Cite as: arXiv:2410.07951 [cs.CL] (or arXiv:2410.07951v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.07951 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: John Osborne [view email] [v1] Thu, 10 Oct 2024 14:18:34 UTC (1,574 KB)
摘要:背景:用于临床命名实体识别和实体归一化的机器学习方法可以利用标注语料库和知识图谱 (Knowledge Graphs, KGs) 进行学习。然而,不常出现的概念在训练语料库中提及次数较少,且在大规模知识图谱中缺乏详细描述或同义词。对于疾病实体识别 (Disease Entity Recognition, DER) 和疾病实体归一化 (Disease Entity Normalization, DEN),这可能导致高质量训练样本的数量相对于已知疾病数量较少。大语言模型 (Large Language Model, LLM) 生成合成训练样本可以提升这些信息抽取任务的性能。方法:我们微调了一个 LLaMa-2 13B Chat 大语言模型,以生成包含统一医学语言系统 (Unified Medical Language System, UMLS) 疾病语义组概念归一化提及的合成语料库。我们测量了 DER 和 DEN 的总体和分布外 (Out of Distribution, OOD) 性能,无论是否使用合成数据增强。我们在 3 个不同的疾病语料库上使用 4 种不同的数据增强策略评估了性能,DER 使用 BioBERT 评估,DEN 使用 SapBERT 和 KrissBERT 评估。结果:我们的合成数据显著提升了 DEN 的性能,在所有 3 个训练语料库中,SapBERT 和 KrissBERT 的总体性能的 top 1 准确率提高了 3-9 个百分点,OOD 数据提高了 20-55 个百分点。DER 的总体性能也有小幅提升 (1-2 个百分点),但只有一个数据集显示出 OOD 性能的提升。结论:大语言模型生成的归一化疾病提及可以相对于不使用大语言模型进行数据增强的归一化方法提升 DEN 的性能。消融研究表明,DEN 性能的提升仅部分归因于 OOD 性能的改进。同样的方法对 DER 的改进能力有限。我们将软件和数据集公开发布。

评论:21 页,3 幅图,7 张表
主题:计算与语言 (cs.CL); 机器学习 (cs.LG)
ACM 分类:I.2.7; J.3
引用为:arXiv:2410.07951 [cs.CL] (或 arXiv:2410.07951v1 [cs.CL] 用于此版本)
https://doi.org/10.48550/arXiv.2410.07951
专注于了解更多
arXiv 发布的 DOI 通过 DataCite (待注册)
提交历史
从:John Osborne [查看电子邮件]
[v1] 2024 年 10 月 10 日 14:18:34 UTC (1,574 KB)

[NLP-35] InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions

【速读】: 该论文试图解决人工智能在生物分子研究中的应用与研究人员直觉之间的鸿沟,特别是如何通过自然语言将复杂的分子结构与人类意图对齐。解决方案的关键在于提出了InstructBioMol,这是一种新型的大型语言模型(LLM),旨在通过全面的“任意到任意”对齐方式,将自然语言、分子和蛋白质进行整合。该模型能够接受多模态生物分子输入,并允许研究人员用自然语言表达设计目标,从而生成满足特定生物需求的生物分子输出。实验结果表明,InstructBioMol能够理解并设计生物分子,显著提高了药物分子的结合亲和力和酶的设计性能。

链接: https://arxiv.org/abs/2410.07919
作者: Xiang Zhuang,Keyan Ding,Tianwen Lyu,Yinuo Jiang,Xiaotong Li,Zhuoyi Xiang,Zeyuan Wang,Ming Qin,Kehua Feng,Jike Wang,Qiang Zhang,Huajun Chen
关键词-EN: Understanding and designing, advancing drug discovery, natural language, synthetic biology, central to advancing
类目: Computation and Language (cs.CL); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology, and enzyme engineering. Recent breakthroughs in Artificial Intelligence (AI) have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between AI’s computational power and researchers’ intuition, using natural language to align molecular complexity with human intentions. Large Language Models (LLMs) have shown potential to interpret human intentions, yet their application to biomolecular research remains nascent due to challenges including specialized knowledge requirements, multimodal data integration, and semantic alignment between natural language and biomolecules. To address these limitations, we present InstructBioMol, a novel LLM designed to bridge natural language and biomolecules through a comprehensive any-to-any alignment of natural language, molecules, and proteins. This model can integrate multimodal biomolecules as input, and enable researchers to articulate design goals in natural language, providing biomolecular outputs that meet precise biological needs. Experimental results demonstrate InstructBioMol can understand and design biomolecules following human instructions. Notably, it can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an ESP Score of 70.4, making it the only method to surpass the enzyme-substrate interaction threshold of 60.0 recommended by the ESP developer. This highlights its potential to transform real-world biomolecular research.
摘要:理解和设计生物分子,如蛋白质和小分子,是推动药物发现、合成生物学和酶工程发展的核心。近年来,人工智能 (AI) 的突破性进展彻底改变了生物分子研究,在生物分子预测和设计方面取得了显著的准确性。然而,AI 的计算能力与研究人员通过自然语言将分子复杂性与人类意图对齐的直觉之间仍存在关键差距。大语言模型 (LLM) 显示出解读人类意图的潜力,但由于专业知识需求、多模态数据整合以及自然语言与生物分子之间的语义对齐等挑战,其在生物分子研究中的应用仍处于初级阶段。为了解决这些限制,我们提出了 InstructBioMol,这是一种新型 LLM,旨在通过自然语言、分子和蛋白质的全面任意对齐来弥合自然语言与生物分子之间的鸿沟。该模型能够整合多模态生物分子作为输入,并使研究人员能够用自然语言表达设计目标,提供满足精确生物需求的生物分子输出。实验结果表明,InstructBioMol 能够根据人类指令理解和设计生物分子。值得注意的是,它可以生成结合亲和力提高 10% 的药物分子,并设计出 ESP 评分达到 70.4 的酶,使其成为唯一一种超越 ESP 开发者推荐的 60.0 酶-底物相互作用阈值的方法。这突显了其在实际生物分子研究中转化的潜力。

[NLP-36] Unsupervised Data Validation Methods for Efficient Model Training

【速读】: 该论文试图解决低资源语言在机器学习系统中的应用问题,特别是自然语言处理(NLP)、文本到语音(TTS)、语音到文本(STT)和视觉语言模型(VLM)中由于缺乏大规模数据集而面临的挑战。解决方案的关键在于定义“高质量数据”,开发数据生成和增强方法,以及提高模型训练的可访问性。论文综述了当前的方法,包括数据增强、多语言迁移学习、合成数据生成和数据选择技术,并指出了这些方法的进步和局限性。通过优化数据利用、减少所需数据量并保持高质量模型性能,论文旨在使先进的机器学习模型更适用于低资源语言,从而提升其在各领域的实用性和影响力。

链接: https://arxiv.org/abs/2410.07880
作者: Yurii Paniv
关键词-EN: low-resource languages, potential solutions, solutions for improving, systems for low-resource, machine learning systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages. State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and vision-language models (VLM) rely heavily on large datasets, which are often unavailable for low-resource languages. This research explores key areas such as defining “quality data,” developing methods for generating appropriate data and enhancing accessibility to model training. A comprehensive review of current methodologies, including data augmentation, multilingual transfer learning, synthetic data generation, and data selection techniques, highlights both advancements and limitations. Several open research questions are identified, providing a framework for future studies aimed at optimizing data utilization, reducing the required data quantity, and maintaining high-quality model performance. By addressing these challenges, the paper aims to make advanced machine learning models more accessible for low-resource languages, enhancing their utility and impact across various sectors.
摘要:本文探讨了为低资源语言改进机器学习系统所面临的挑战及潜在解决方案。自然语言处理 (NLP)、文本到语音 (TTS)、语音到文本 (STT) 以及视觉语言模型 (VLM) 中的最先进模型严重依赖于大型数据集,而这些数据集通常在低资源语言中不可用。本研究探索了关键领域,如定义“高质量数据”、开发生成合适数据的方法以及提高模型训练的可访问性。对当前方法的综合评述,包括数据增强、多语言迁移学习、合成数据生成和数据选择技术,突显了进展与局限。识别出若干开放的研究问题,为未来的研究提供了框架,旨在优化数据利用、减少所需数据量并保持高质量的模型性能。通过应对这些挑战,本文旨在使先进的机器学习模型更易于低资源语言使用,从而增强其在各领域的实用性和影响力。

[NLP-37] Benchmarking Agent ic Workflow Generation

【速读】: 该论文试图解决现有工作流评估框架在全面性能评估、场景覆盖、工作流结构复杂性和评估标准严格性方面的局限性。解决方案的关键在于引入WorFBench,这是一个统一的工作流生成基准,具有多方面的场景和复杂的工作流结构。同时,提出WorFEval,一种系统性的评估协议,利用子序列和子图匹配算法来准确量化LLM代理的工作流生成能力。通过这些创新,论文能够全面评估不同类型LLM的序列规划和图规划能力,并揭示它们之间的差距,例如GPT-4在序列和图规划能力上存在约15%的差距。此外,论文还展示了生成的工作流如何增强下游任务的性能,从而在推理过程中实现更高效的表现。

链接: https://arxiv.org/abs/2410.07869
作者: Shuofei Qiao,Runnan Fang,Zhisong Qiu,Xiaobin Wang,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen
关键词-EN: Large Language Models, Large Language, driven significant advancements, decomposing complex problems, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent’s workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset will be available at this https URL.
摘要:大语言模型 (LLMs) 凭借其处理广泛任务的卓越能力,在解决推理和规划任务方面取得了显著进展,其中将复杂问题分解为可执行的工作流程是这一过程中的关键步骤。现有的工作流程评估框架要么仅关注整体性能,要么存在场景覆盖受限、工作流程结构简单以及评估标准宽松等局限性。为此,我们引入了 WorFBench,这是一个统一的工作流程生成基准,涵盖多方面场景和复杂图工作流程结构。此外,我们提出了 WorFEval,一种系统性评估协议,利用子序列和子图匹配算法来准确量化大语言模型智能体的工作流程生成能力。通过在不同类型的大语言模型上进行全面评估,我们发现大语言模型智能体的序列规划能力和图规划能力之间存在显著差距,即使是 GPT-4 也表现出约 15% 的差距。我们还训练了两个开源模型,并评估了它们在保留任务上的泛化能力。此外,我们观察到生成的工作流程可以增强下游任务,使它们在推理过程中以更少的时间实现更优的性能。代码和数据集将在此 https URL 上提供。

[NLP-38] Enhancing Language Model Reasoning via Weighted Reasoning in Self-Consistency NEURIPS2024

【速读】: 该论文试图解决大型语言模型(LLMs)在推理任务中的不足,特别是它们在复杂问题上的表现。解决方案的关键在于改进现有的自一致性框架,通过在最终决策之前,不仅考虑多个推理路径的最终决策,还分析和整合这些路径的详细推理步骤。这种方法不仅提高了推理路径的可靠性,还显著增强了模型在复杂推理任务中的表现。

链接: https://arxiv.org/abs/2410.07839
作者: Tim Knappe,Ryan Li,Ayush Chauhan,Kaylee Chhua,Kevin Zhu,Sean O’Brien
关键词-EN: large language models, reasoning tasks, tasks, large language, rapidly improved
类目: Computation and Language (cs.CL)
备注: Accepted to MATH-AI at NeurIPS 2024

点击查看摘要

Abstract:While large language models (LLMs) have rapidly improved their performance on a broad number of tasks, they still often fall short on reasoning tasks. As LLMs become more integrated in diverse real-world tasks, advancing their reasoning capabilities is crucial to their effectiveness in nuanced, complex problems. Wang et al’s self-consistency framework reveals that sampling multiple rationales before taking a majority vote reliably improves model performance across various closed-answer reasoning tasks. Standard methods based on this framework aggregate the final decisions of these rationales but fail to utilize the detailed step-by-step reasoning paths applied by these paths. Our work enhances this approach by incorporating and analyzing both the reasoning paths of these rationales in addition to their final decisions before taking a majority vote. These methods not only improve the reliability of reasoning paths but also cause more robust performance on complex reasoning tasks.
摘要:尽管大语言模型 (Large Language Models, LLMs) 在众多任务中的表现迅速提升,但在推理任务上仍常常表现不足。随着 LLMs 在多样化的实际任务中得到更广泛的应用,提升其推理能力对于其在复杂、微妙问题中的有效性至关重要。Wang 等人的自洽性框架揭示了,在各种封闭答案推理任务中,通过采样多个理由并在多数投票前进行可靠的改进,可以显著提升模型性能。基于此框架的标准方法仅聚合了这些理由的最终决策,而未能利用这些路径所应用的详细逐步推理路径。我们的工作通过在多数投票前结合并分析这些理由的推理路径及其最终决策,进一步增强了这种方法。这些方法不仅提高了推理路径的可靠性,还使得在复杂推理任务中表现出更强的鲁棒性。

[NLP-39] NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models NEURIPS2024

【速读】: 该论文试图解决低资源语言(如巴厘语和米南卡保语)的机器翻译问题,其关键解决方案在于利用预训练的大型语言模型(LLM)LLaMA2-7B,并通过继续预训练、监督微调(SFT)、自学习和基于LLM的数据清洗等技术,有效减少数据噪声并提升翻译质量。具体来说,NusaMT-7B模型在FLORES-200基准测试中,针对低资源语言的翻译性能显著超越了当前最先进(SoTA)的神经机器翻译模型,但在高资源语言的翻译上表现稍逊。这一研究展示了通过精细调整LLM,可以显著提升低资源语言的翻译质量,有助于语言保护和跨文化交流。

链接: https://arxiv.org/abs/2410.07830
作者: William Tan,Kevin Zhu
关键词-EN: demonstrated exceptional promise, Large Language Models, Large Language, Balinese and Minangkabau, demonstrated exceptional
类目: Computation and Language (cs.CL)
备注: Accepted to SoLaR @ NeurIPS 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional promise in translation tasks for high-resource languages. However, their performance in low-resource languages is limited by the scarcity of both parallel and monolingual corpora, as well as the presence of noise. Consequently, such LLMs suffer with alignment and have lagged behind State-of-The-Art (SoTA) neural machine translation (NMT) models in these settings. This paper introduces NusaMT-7B, an LLM-based machine translation model for low-resource Indonesian languages, starting with Balinese and Minangkabau. Leveraging the pretrained LLaMA2-7B, our approach integrates continued pre-training on monolingual data, Supervised Fine-Tuning (SFT), self-learning, and an LLM-based data cleaner to reduce noise in parallel sentences. In the FLORES-200 multilingual translation benchmark, NusaMT-7B outperforms SoTA models in the spBLEU metric by up to +6.69 spBLEU in translations into Balinese and Minangkabau, but underperforms by up to -3.38 spBLEU in translations into higher-resource languages. Our results show that fine-tuned LLMs can enhance translation quality for low-resource languages, aiding in linguistic preservation and cross-cultural communication.
摘要:大语言模型 (Large Language Models, LLMs) 在高资源语言的翻译任务中展现了卓越的潜力。然而,在低资源语言中,由于平行语料和单语语料的稀缺性以及噪声的存在,这些 LLMs 的表现受到限制。因此,在这些情况下,LLMs 在对齐方面存在问题,并且落后于最先进的 (State-of-The-Art, SoTA) 神经机器翻译 (Neural Machine Translation, NMT) 模型。本文介绍了 NusaMT-7B,这是一个基于 LLM 的机器翻译模型,专门针对低资源的印度尼西亚语言,首先从巴厘语和米南卡保语开始。利用预训练的 LLaMA2-7B,我们的方法整合了在单语数据上的继续预训练、监督微调 (Supervised Fine-Tuning, SFT)、自学习和基于 LLM 的数据清洗器,以减少平行句子中的噪声。在 FLORES-200 多语言翻译基准测试中,NusaMT-7B 在翻译成巴厘语和米南卡保语时,在 spBLEU 指标上比 SoTA 模型高出最多 +6.69 spBLEU,但在翻译成高资源语言时,表现最多落后 -3.38 spBLEU。我们的结果表明,经过微调的 LLMs 可以提高低资源语言的翻译质量,有助于语言保护和跨文化交流。

[NLP-40] Why do objects have many names? A study on word informativeness in language use and lexical systems EMNLP2024

【速读】: 该论文试图解决语言使用与词汇系统结构之间的鸿沟问题,即如何在考虑上下文沟通的同时,优化词汇系统的信息传递效率。解决方案的关键在于提出了一种基于视觉空间的信息量度量方法,并分析了英语和普通话的颜色命名数据,得出最优词汇系统应允许多个词汇指向同一指称对象,从而在不同情境下实现信息传递的最大准确性和最小冗余。

链接: https://arxiv.org/abs/2410.07827
作者: Eleonora Gualdoni,Gemma Boleda
关键词-EN: Human lexicons, lexical systems, Human, lexical, systems
类目: Computation and Language (cs.CL)
备注: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)

点击查看摘要

Abstract:Human lexicons contain many different words that speakers can use to refer to the same object, e.g., “purple” or “magenta” for the same shade of color. On the one hand, studies on language use have explored how speakers adapt their referring expressions to successfully communicate in context, without focusing on properties of the lexical system. On the other hand, studies in language evolution have discussed how competing pressures for informativeness and simplicity shape lexical systems, without tackling in-context communication. We aim at bridging the gap between these traditions, and explore why a soft mapping between referents and words is a good solution for communication, by taking into account both in-context communication and the structure of the lexicon. We propose a simple measure of informativeness for words and lexical systems, grounded in a visual space, and analyze color naming data for English and Mandarin Chinese. We conclude that optimal lexical systems are those where multiple words can apply to the same referent, conveying different amounts of information. Such systems allow speakers to maximize communication accuracy and minimize the amount of information they convey when communicating about referents in contexts.
摘要:人类词汇库中包含许多不同的词语,这些词语可以用来指代同一个对象,例如,对于同一种颜色,可以使用“紫色”或“洋红色”。一方面,语言使用研究探讨了说话者如何在上下文中调整其指称表达以成功进行交流,而未关注词汇系统的属性。另一方面,语言进化研究讨论了信息性和简洁性之间的竞争压力如何塑造词汇系统,而未涉及上下文中的交流。我们的目标是在这些传统之间架起桥梁,并通过考虑上下文交流和词汇结构,探讨为什么对象与词语之间的软映射是交流的良好解决方案。我们提出了一种基于视觉空间的词语和词汇系统的信息性简单度量,并分析了英语和普通话的颜色命名数据。我们得出结论,最优的词汇系统是那些多个词语可以应用于同一对象,传达不同信息量的系统。这样的系统使说话者能够在上下文中最大化交流准确性,并在交流对象时最小化传达的信息量。

[NLP-41] Fine-Tuning Language Models for Ethical Ambiguity: A Comparative Study of Alignment with Human Responses NEURIPS2024

【速读】: 该论文试图解决语言模型在处理道德模糊情境时与人类判断不一致的问题。解决方案的关键在于通过微调模型,特别是采用文本到文本的格式,提升模型对文本分布的理解,从而增强其在复杂决策情境中的表现和与人类判断的匹配度。实验结果显示,微调后的模型在交叉熵和狄利克雷得分上均有显著提升,尤其是在道德模糊情境下的表现接近GPT-4,但仍需进一步研究以完善伦理推理技术和捕捉人类判断的细微差别。

链接: https://arxiv.org/abs/2410.07826
作者: Pranav Senthilkumar,Visshwa Balasubramanian,Prisha Jain,Aneesa Maity,Jonathan Lu,Kevin Zhu
关键词-EN: well-recognized in NLP, misinterpret human intentions, human intentions due, Language models, handling of ambiguity
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2024, SoLaR workshop

点击查看摘要

Abstract:Language models often misinterpret human intentions due to their handling of ambiguity, a limitation well-recognized in NLP research. While morally clear scenarios are more discernible to LLMs, greater difficulty is encountered in morally ambiguous contexts. In this investigation, we explored LLM calibration to show that human and LLM judgments are poorly aligned in such scenarios. We used two curated datasets from the Scruples project for evaluation: DILEMMAS, which involves pairs of distinct moral scenarios to assess the model’s ability to compare and contrast ethical situations, and ANECDOTES, which presents individual narratives to evaluate the model’s skill in drawing out details, interpreting, and analyzing distinct moral scenarios. Model answer probabilities were extracted for all possible choices and compared with human annotations to benchmark the alignment of three models: Llama-3.1-8b, Zephyr-7b-beta, and Mistral-7b. Significant improvements were observed after fine-tuning, with notable enhancements in both cross-entropy and Dirichlet scores, particularly in the latter. Notably, after fine-tuning, the performance of Mistral-7B-Instruct-v0.3 was on par with GPT-4o. However, the experimental models that were examined were all still outperformed by the BERT and RoBERTa models in terms of cross-entropy scores. Our fine-tuning approach, which improves the model’s understanding of text distributions in a text-to-text format, effectively enhances performance and alignment in complex decision-making contexts, underscoring the need for further research to refine ethical reasoning techniques and capture human judgment nuances.
摘要:语言模型由于处理歧义的方式,常常误解人类的意图,这一局限性在自然语言处理 (NLP) 研究中已被广泛认识。虽然道德清晰的情境对大语言模型 (LLM) 来说更容易辨别,但在道德模糊的背景下,难度显著增加。在本研究中,我们探讨了 LLM 的校准,以展示在这种情况下的模型判断与人类判断之间存在的不一致性。我们使用了 Scruples 项目中的两个精心策划的数据集进行评估:DILEMMAS,该数据集包含成对的独特道德情境,用于评估模型比较和对比伦理情境的能力;ANECDOTES,该数据集提供单独的叙述,用于评估模型提取细节、解释和分析不同道德情境的技能。我们提取了所有可能选择的模型回答概率,并与人类注释进行比较,以基准化三个模型的对齐情况:Llama-3.1-8b、Zephyr-7b-beta 和 Mistral-7b。在微调后,观察到显著的改进,特别是在交叉熵和 Dirichlet 分数方面,后者尤为显著。值得注意的是,经过微调后,Mistral-7B-Instruct-v0.3 的性能与 GPT-4o 相当。然而,所考察的实验模型在交叉熵分数方面仍不及 BERT 和 RoBERTa 模型。我们的微调方法,通过改进模型对文本到文本格式中文字分布的理解,有效地提升了在复杂决策情境中的性能和对齐度,强调了进一步研究以完善伦理推理技术和捕捉人类判断细微差别的必要性。

[NLP-42] Extracting and Transferring Abilities For Building Multi-lingual Ability-enhanced Large Language Models

【速读】: 该论文试图解决低资源语言在大型语言模型(LLMs)中缺乏多语言能力的问题。解决方案的关键在于提出了一种名为MAET的多语言能力提取与转移方法,通过分解和提取与语言无关的能力相关权重,并利用简单的加减操作在不同语言间进行能力转移,而无需额外的训练数据。该方法包括提取和转移两个阶段,首先定位与特定能力高度相关的关键神经元,提取可转移的能力特定权重,然后在转移阶段选择能力相关参数张量,并设计基于语言和能力特定权重的合并策略,以构建多语言能力增强的LLM。实验结果表明,MAET在数学和科学任务中表现优异,优于基于训练的基线方法。

链接: https://arxiv.org/abs/2410.07825
作者: Zhipeng Chen,Liang Song,Kun Zhou,Wayne Xin Zhao,Bingning Wang,Weipeng Chen,Ji-Rong Wen
关键词-EN: large language models, Multi-lingual ability transfer, Multi-lingual Ability Extraction, Multi-lingual ability, increasingly important
类目: Computation and Language (cs.CL)
备注: 18 Pages. Working in progress

点击查看摘要

Abstract:Multi-lingual ability transfer has become increasingly important for the broad application of large language models (LLMs). Existing work highly relies on training with the multi-lingual ability-related data, which may be not available for low-resource languages. To solve it, we propose a Multi-lingual Ability Extraction and Transfer approach, named as MAET. Our key idea is to decompose and extract language-agnostic ability-related weights from LLMs, and transfer them across different languages by simple addition and subtraction operations without training. Specially, our MAET consists of the extraction and transfer stages. In the extraction stage, we firstly locate key neurons that are highly related to specific abilities, and then employ them to extract the transferable ability-specific weights. In the transfer stage, we further select the ability-related parameter tensors, and design the merging strategy based on the linguistic and ability specific weights, to build the multi-lingual ability-enhanced LLM. To demonstrate the effectiveness of our proposed approach, we conduct extensive experiments on mathematical and scientific tasks in both high-resource lingual and low-resource lingual scenarios. Experiment results have shown that MAET can effectively and efficiently extract and transfer the advanced abilities, and outperform training-based baseline methods. Our code and data are available at \urlthis https URL.
摘要:多语言能力迁移对于大语言模型 (LLM) 的广泛应用变得越来越重要。现有工作高度依赖于使用与多语言能力相关的数据进行训练,这对于低资源语言可能不可行。为了解决这一问题,我们提出了一种名为 MAET (Multi-lingual Ability Extraction and Transfer) 的多语言能力提取与迁移方法。我们的核心思想是从 LLM 中分解并提取与语言无关的能力相关权重,并通过简单的加减操作在不同语言之间进行迁移,而无需进行训练。具体来说,我们的 MAET 包括提取和迁移两个阶段。在提取阶段,我们首先定位与特定能力高度相关的关键神经元,然后利用这些神经元提取可迁移的能力特定权重。在迁移阶段,我们进一步选择与能力相关的参数张量,并基于语言和能力特定权重设计合并策略,以构建多语言能力增强的 LLM。为了验证我们提出的方法的有效性,我们在高资源语言和低资源语言场景下对数学和科学任务进行了广泛的实验。实验结果表明,MAET 能够有效地提取和迁移高级能力,并优于基于训练的基线方法。我们的代码和数据可在 \urlthis https URL 获取。

[NLP-43] Mitigating Gender Bias in Code Large Language Models via Model Editing

【速读】: 该论文试图解决大型语言模型(LLM)在代码生成过程中存在的性别偏见问题。解决方案的关键在于提出了一个名为CodeGenBias的数据集和一个评估指标FB-Score,用于量化性别偏见。论文进一步开发了一种多粒度模型编辑方法MG-Editing,通过在模型参数的不同层次(如全参数、层、模块、行和神经元级别)进行定位和编辑,有效减少了性别偏见,同时保持了模型的代码生成能力。实验结果表明,MG-Editing在行和神经元级别的粒度上应用时效果最佳。

链接: https://arxiv.org/abs/2410.07820
作者: Zhanyue Qin,Haochuan Wang,Zecheng Wang,Deyuan Liu,Cunhang Fan,Zhao Lv,Zhiying Tu,Dianhui Chu,Dianbo Sui
关键词-EN: program synthesis automatically, high-quality programming code, gender bias, Factual Bias Score, large language model
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, with the maturation of large language model (LLM) technology and the emergence of high-quality programming code datasets, researchers have become increasingly confident in addressing the challenges of program synthesis automatically. However, since most of the training samples for LLMs are unscreened, it is inevitable that LLMs’ performance may not align with real-world scenarios, leading to the presence of social bias. To evaluate and quantify the gender bias in code LLMs, we propose a dataset named CodeGenBias (Gender Bias in the Code Generation) and an evaluation metric called FB-Score (Factual Bias Score) based on the actual gender distribution of correlative professions. With the help of CodeGenBias and FB-Score, we evaluate and analyze the gender bias in eight mainstream Code LLMs. Previous work has demonstrated that model editing methods that perform well in knowledge editing have the potential to mitigate social bias in LLMs. Therefore, we develop a model editing approach named MG-Editing (Multi-Granularity model Editing), which includes the locating and editing phases. Our model editing method MG-Editing can be applied at five different levels of model parameter granularity: full parameters level, layer level, module level, row level, and neuron level. Extensive experiments not only demonstrate that our MG-Editing can effectively mitigate the gender bias in code LLMs while maintaining their general code generation capabilities, but also showcase its excellent generalization. At the same time, the experimental results show that, considering both the gender bias of the model and its general code generation capability, MG-Editing is most effective when applied at the row and neuron levels of granularity.
摘要:近年来,随着大语言模型 (LLM) 技术的成熟以及高质量编程代码数据集的出现,研究人员越来越有信心自动解决程序合成的挑战。然而,由于大多数 LLM 的训练样本未经筛选,LLM 的表现可能与现实场景不符,从而导致社会偏见的出现。为了评估和量化代码 LLM 中的性别偏见,我们提出了一种名为 CodeGenBias (代码生成中的性别偏见) 的数据集和一个基于相关职业实际性别分布的评估指标 FB-Score (事实偏见分数)。借助 CodeGenBias 和 FB-Score,我们评估并分析了八种主流代码 LLM 中的性别偏见。先前的工作表明,在知识编辑中表现良好的模型编辑方法有可能减轻 LLM 中的社会偏见。因此,我们开发了一种名为 MG-Editing (多粒度模型编辑) 的模型编辑方法,该方法包括定位和编辑两个阶段。我们的模型编辑方法 MG-Editing 可以在五个不同层次的模型参数粒度上应用:全参数层、层级、模块级、行级和神经元级。大量的实验不仅证明了我们的 MG-Editing 在保持代码生成能力的同时,能够有效减轻代码 LLM 中的性别偏见,而且还展示了其出色的泛化能力。同时,实验结果表明,在考虑模型的性别偏见和其一般代码生成能力的情况下,MG-Editing 在行级和神经元级粒度上应用最为有效。

[NLP-44] Uncovering Overfitting in Large Language Model Editing

【速读】: 该论文试图解决知识编辑过程中出现的“编辑过拟合”问题,即在复杂任务(如多跳推理)中,编辑后的模型过度依赖编辑目标,导致新知识的泛化能力受限。解决方案的关键在于提出了一种名为“Learn to Inference (LTI)”的插拔式策略,通过引入多阶段推理约束模块,指导编辑后的模型在推理过程中模仿未编辑的大型语言模型(LLMs)的知识召回机制,从而有效缓解编辑过拟合现象。

链接: https://arxiv.org/abs/2410.07819
作者: Mengqi Zhang,Xiaotian Ye,Qiang Liu,Pengjie Ren,Shu Wu,Zhumin Chen
关键词-EN: Large Language Models, Large Language, Editing Overfit, Language Models, editing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge editing has been proposed as an effective method for updating and correcting the internal knowledge of Large Language Models (LLMs). However, existing editing methods often struggle with complex tasks, such as multi-hop reasoning. In this paper, we identify and investigate the phenomenon of Editing Overfit, where edited models assign disproportionately high probabilities to the edit target, hindering the generalization of new knowledge in complex scenarios. We attribute this issue to the current editing paradigm, which places excessive emphasis on the direct correspondence between the input prompt and the edit target for each edit sample. To further explore this issue, we introduce a new benchmark, EVOKE (EValuation of Editing Overfit in Knowledge Editing), along with fine-grained evaluation metrics. Through comprehensive experiments and analysis, we demonstrate that Editing Overfit is prevalent in current editing methods and that common overfitting mitigation strategies are of limited effectiveness in knowledge editing. To overcome this, inspired by LLMs’ knowledge recall mechanisms, we propose a new plug-and-play strategy called Learn to Inference (LTI), which introduce a Multi-stage Inference Constraint module to guide the edited models in recalling new knowledge similarly to how unedited LLMs leverage knowledge through in-context learning. Extensive experimental results across a wide range of tasks validate the effectiveness of LTI in mitigating Editing Overfit.
摘要:知识编辑已被提出作为一种有效的方法,用于更新和纠正大语言模型 (LLM) 的内部知识。然而,现有的编辑方法在处理复杂任务(如多跳推理)时往往表现不佳。本文中,我们识别并研究了编辑过拟合现象,即编辑后的模型对编辑目标分配了不成比例的高概率,从而阻碍了新知识在复杂场景中的泛化。我们将这一问题归因于当前的编辑范式,该范式过度强调每个编辑样本的输入提示与编辑目标之间的直接对应关系。为进一步探讨这一问题,我们引入了一个新的基准 EVOKE (EValuation of Editing Overfit in Knowledge Editing),并附带细粒度的评估指标。通过全面的实验和分析,我们证明编辑过拟合在当前的编辑方法中普遍存在,且常见的过拟合缓解策略在知识编辑中效果有限。为克服这一问题,受大语言模型知识召回机制的启发,我们提出了一种新的即插即用策略,称为学习推理 (LTI),该策略引入了一个多阶段推理约束模块,以指导编辑后的模型像未编辑的大语言模型通过上下文学习利用知识一样召回新知识。广泛的实验结果验证了 LTI 在缓解编辑过拟合方面的有效性。

[NLP-45] Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

【速读】: 该论文试图解决多语言模型在不同语言上表现不均衡的问题,特别是由于某些语言的泛化能力有限,导致模型在某些语言上的性能不佳。解决方案的关键在于通过语言学信息指导的语言选择方法进行指令调优,以提升模型在多种语言和任务上的表现。研究提出了一种基于语言多样性的简单算法来选择语言,并通过实验验证了这种选择方法在各种基准测试和开放式问题上的有效性,结果表明,精心选择的语言通常比随机选择的语言带来更好的性能提升。

链接: https://arxiv.org/abs/2410.07809
作者: Gürkan Soykan,Gözde Gül Şahin
关键词-EN: limited generalization capabilities, languages, Instruction tuning, perform unevenly, due to limited
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 31 pages, 6 figures

点击查看摘要

Abstract:Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the “curse of multilinguality” – the performance drop per language after adding many languages. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random. We suggest a new and simple way of enhancing multilingual models by selecting diverse languages based on linguistic features that could help develop better multilingual systems and guide dataset creation efforts. All resources, including the code for language selection and multilingual instruction tuning, are made available in our official repository at this https URL enabling reproducibility and further research in this area.
摘要:多语言语言模型在不同语言上的表现往往不均衡,这是由于某些语言的泛化能力有限。这一问题在当前对构建适用于所有语言的通用语言模型的兴趣日益增长的背景下显得尤为重要。通过使用多语言指令-响应对进行指令调优,已被用于提升模型在多种语言上的性能。然而,这种方法面临着高计算成本、缺乏适用于所有语言的高质量调优数据以及“多语言诅咒”——即在添加多种语言后每种语言的性能下降——的挑战。最近的研究发现,使用包含少量语言和较少实例的数据集可能是有益的。然而,目前尚无系统性的研究探讨选择不同语言对多语言指令调优的影响。我们的研究提出了一种基于语言学信息选择语言进行指令调优的方法,旨在提升模型在多种语言和任务上的表现。我们采用一种简单的算法来选择多样化的语言,并在各种基准测试和开放式问题中测试其有效性。我们的结果表明,这种精心选择通常比随机选择语言能带来更好的结果。我们提出了一种新的、简单的方法来增强多语言模型,即基于语言特征选择多样化的语言,这有助于开发更好的多语言系统并指导数据集的创建工作。所有资源,包括语言选择和多语言指令调优的代码,均在我们的官方仓库中公开,网址为 https URL,以支持可重复性和该领域的进一步研究。

[NLP-46] Rewriting Conversational Utterances with Instructed Large Language Models

【速读】: 该论文试图解决在对话式搜索中,如何通过重写用户问题来提高搜索效果的问题。解决方案的关键在于利用经过指令微调的大型语言模型(LLMs),通过零样本或少量样本提示技术,生成更具信息量的重写问题,从而显著提升检索性能。实验结果表明,这种方法在多个评价指标上(如MRR、Precision@1、NDCG@3和Recall@500)相比现有最先进技术有显著提升。

链接: https://arxiv.org/abs/2410.07797
作者: Elnara Galimzhanova,Cristina Ioana Muntean,Franco Maria Nardini,Raffaele Perego,Guido Rocchietti
关键词-EN: large language models, text summarization, NLP tasks, recent studies, studies have shown
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Many recent studies have shown the ability of large language models (LLMs) to achieve state-of-the-art performance on many NLP tasks, such as question answering, text summarization, coding, and translation. In some cases, the results provided by LLMs are on par with those of human experts. These models’ most disruptive innovation is their ability to perform tasks via zero-shot or few-shot prompting. This capability has been successfully exploited to train instructed LLMs, where reinforcement learning with human feedback is used to guide the model to follow the user’s requests directly. In this paper, we investigate the ability of instructed LLMs to improve conversational search effectiveness by rewriting user questions in a conversational setting. We study which prompts provide the most informative rewritten utterances that lead to the best retrieval performance. Reproducible experiments are conducted on publicly-available TREC CAST datasets. The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.
摘要:近年来,许多研究表明,大语言模型 (LLMs) 在诸多自然语言处理 (NLP) 任务中,如问答、文本摘要、编码和翻译,均能达到最先进的性能。在某些情况下,LLMs 提供的结果与人类专家的水平相当。这些模型最具颠覆性的创新在于其通过零样本 (Zero-shot) 或少样本 (Few-shot) 提示即可执行任务的能力。这一能力已被成功应用于训练指令型 LLMs,其中通过结合人类反馈的强化学习来直接引导模型遵循用户请求。本文探讨了指令型 LLMs 在对话环境中通过重写用户问题来提升对话搜索效果的能力。我们研究了哪些提示能够提供最具信息量的重写话语,从而实现最佳的检索性能。我们在公开的 TREC CAST 数据集上进行了可重复的实验。结果显示,使用指令型 LLMs 重写对话话语,在 MRR 上实现了高达 25.2% 的提升,在 Precision@1 上提升了 31.7%,在 NDCG@3 上提升了 27%,在 Recall@500 上提升了 11.5%,均优于现有最先进的技术。

[NLP-47] Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation EMNLP

【速读】: 该论文试图解决机器翻译(MT)中如何更准确地对齐人类偏好以提高翻译质量的问题。解决方案的关键在于结合人工评估和自动度量,通过收集专业语言学家的句子级质量评估,分析当前自动度量的偏好恢复能力,并据此构建了一个新的MT-Pref数据集。该数据集包含18k个实例,涵盖18种语言方向,用于训练TOWER模型,从而显著提升WMT23和FLORES基准测试中的翻译质量。

链接: https://arxiv.org/abs/2410.07779
作者: Sweta Agrawal,José G. C. de Souza,Ricardo Rei,António Farinhas,Gonçalo Faria,Patrick Fernandes,Nuno M Guerreiro,Andre Martins
关键词-EN: important step, step in developing, developing accurate, accurate and safe, Alignment
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP Main 2024

点击查看摘要

Abstract:Alignment with human preferences is an important step in developing accurate and safe large language models. This is no exception in machine translation (MT), where better handling of language nuances and context-specific variations leads to improved quality. However, preference data based on human feedback can be very expensive to obtain and curate at a large scale. Automatic metrics, on the other hand, can induce preferences, but they might not match human expectations perfectly. In this paper, we propose an approach that leverages the best of both worlds. We first collect sentence-level quality assessments from professional linguists on translations generated by multiple high-quality MT systems and evaluate the ability of current automatic metrics to recover these preferences. We then use this analysis to curate a new dataset, MT-Pref (metric induced translation preference) dataset, which comprises 18k instances covering 18 language directions, using texts sourced from multiple domains post-2022. We show that aligning TOWER models on MT-Pref significantly improves translation quality on WMT23 and FLORES benchmarks.
摘要:与人类偏好对齐是开发准确且安全的大语言模型的重要步骤。在机器翻译 (MT) 领域也不例外,更好地处理语言细微差别和上下文特定变化可以提高翻译质量。然而,基于人类反馈的偏好数据在大规模获取和整理时成本非常高。另一方面,自动评估指标虽然可以诱导偏好,但可能无法完全匹配人类的期望。在本文中,我们提出了一种结合两者优势的方法。我们首先从专业语言学家那里收集了多个高质量机器翻译系统生成的翻译的句子级质量评估,并评估了当前自动评估指标恢复这些偏好的能力。然后,我们利用这一分析来整理一个新的数据集,即 MT-Pref (metric induced translation preference) 数据集,该数据集包含 18k 个实例,涵盖 18 种语言方向,使用 2022 年后的多领域文本。我们展示了在 MT-Pref 上对齐 TOWER 模型显著提高了 WMT23 和 FLORES 基准上的翻译质量。

[NLP-48] Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models ICASSP2025

【速读】: 该论文试图解决大规模基于Conformer的语音识别模型在从头开始训练时的低秩权重训练问题。解决方案的关键在于:1) 发现仅对注意力模块应用低秩结构可以显著提升性能,即使秩减少12%;2) 初始化和层级秩分配在低秩训练中起着至关重要的作用,特别是使用SVD初始化和线性层级秩映射;3) 提出Low-Rank Speech Model from Scratch (LR-SMS)方法,该方法在保持与全秩训练相同性能的同时,显著减少了参数数量(至少2倍)和训练时间(ASR加速1.3倍,AVSR加速1.15倍)。

链接: https://arxiv.org/abs/2410.07771
作者: Adriana Fernandez-Lopez,Shiwei Liu,Lu Yin,Stavros Petridis,Maja Pantic
关键词-EN: Conformer-based speech recognition, large-scale Conformer-based speech, large-scale Conformer-based, speech recognition models, Conformer-based speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:This paper investigates the under-explored area of low-rank weight training for large-scale Conformer-based speech recognition models from scratch. Our study demonstrates the viability of this training paradigm for such models, yielding several notable findings. Firstly, we discover that applying a low-rank structure exclusively to the attention modules can unexpectedly enhance performance, even with a significant rank reduction of 12%. In contrast, feed-forward layers present greater challenges, as they begin to exhibit performance degradation with a moderate 50% rank reduction. Furthermore, we find that both initialization and layer-wise rank assignment play critical roles in successful low-rank training. Specifically, employing SVD initialization and linear layer-wise rank mapping significantly boosts the efficacy of low-rank weight training. Building on these insights, we introduce the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while delivering substantial reductions in parameters count (by at least 2x), and training time speedups (by 1.3x for ASR and 1.15x for AVSR).
摘要:本文探讨了大规模基于 Conformer 的语音识别模型从头开始进行低秩权重训练的未被充分研究领域。我们的研究展示了这种训练范式对这类模型的可行性,并取得了若干显著发现。首先,我们发现仅将低秩结构应用于注意力模块可以出乎意料地提升性能,即使在秩减少 12% 的情况下也是如此。相比之下,前馈层面临更大的挑战,因为它们在秩减少 50% 时开始表现出性能下降。此外,我们发现初始化和逐层秩分配在成功的低秩训练中起着关键作用。具体而言,采用 SVD 初始化和线性逐层秩映射显著提升了低秩权重训练的效率。基于这些见解,我们引入了从头开始的低秩语音模型 (LR-SMS),该方法在实现与全秩训练性能相当的同时,显著减少了参数数量 (至少减少 2 倍),并加快了训练时间 (ASR 为 1.3 倍,AVSR 为 1.15 倍)。

[NLP-49] Dialectical Behavior Therapy Approach to LLM Prompting

【速读】: 该论文试图解决复杂推理任务在大型语言模型(LLMs)中的表现问题。解决方案的关键在于提出了一种受辩证行为疗法(DBT)启发的提示策略,通过将DBT的基本对话塑造概念应用于提示构造,指导模型逐步分解任务并进行推理。实验结果表明,这种基于DBT技术的提示策略显著提升了较小模型在多个数据集上的表现,尤其是在8亿和14亿参数模型上分别取得了显著的准确率提升。

链接: https://arxiv.org/abs/2410.07768
作者: Oxana Vitman,Nika Amaglobeli,Paul Plachinda
关键词-EN: Large language models, Large language, language models demonstrated, Dialectical Behavioral Therapy, CoT prompting guides
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models demonstrated state-of-the-art results on various reasoning tasks when applying the chain-of-thought (CoT) prompting technique. CoT prompting guides the model into breaking tasks into a few intermediate steps and provides step-by-step demonstrations. However, solving complex reasoning tasks remains a challenge. In this paper, we propose a novel prompting strategy inspired by Dialectical Behavioral Therapy (DBT). DBT, a form of cognitive-behavioral therapy, aims to help individuals cope with stress by developing a system of reasoning. We applied DBT’s basic concepts of shaping dialog to construct prompts and conducted experiments on different datasets and LLMs with various numbers of parameters. Our results show that prompts crafted with DBT techniques significantly improve results on smaller models, achieving a 7% increase in accuracy on the StrategyQA, 4.8% on Aqua dataset using 8b parameters model, and a 16.2% increase on the StrategyQA, 5.3% on GSM8K dataset with 14b parameters model.
摘要:大语言模型在应用链式思维 (Chain-of-Thought, CoT) 提示技术时,在各种推理任务上展示了最先进的结果。CoT 提示技术引导模型将任务分解为几个中间步骤,并提供逐步演示。然而,解决复杂的推理任务仍然是一个挑战。在本文中,我们提出了一种受辩证行为疗法 (Dialectical Behavioral Therapy, DBT) 启发的新型提示策略。DBT 是一种认知行为疗法,旨在通过发展一套推理系统来帮助个体应对压力。我们将 DBT 的基本对话塑造概念应用于构建提示,并在不同数据集和具有不同参数数量的大语言模型上进行了实验。我们的结果表明,使用 DBT 技术构建的提示显著提高了较小模型的结果,在使用 8b 参数模型时,StrategyQA 的准确率提高了 7%,Aqua 数据集提高了 4.8%,而在使用 14b 参数模型时,StrategyQA 的准确率提高了 16.2%,GSM8K 数据集提高了 5.3%。

[NLP-50] GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps NEURIPS2024

【速读】: 该论文试图解决大型语言模型(LLMs)在规划能力方面的评估问题。解决方案的关键在于提出了GameTraversalBenchmark(GTB),这是一个包含多样化2D网格游戏地图的基准测试,用于评估LLMs在完成给定目标时的路径规划能力,要求模型以最少的步数和最少的生成错误完成任务。通过GTB_Score(GTBS)这一综合评分标准,论文评估了多个LLMs的表现,发现GPT-4-Turbo在GTB上取得了最高的44.97%的分数,同时初步测试了大型推理模型o1,其得分为67.84%,表明当前模型在该基准测试中仍面临挑战。

链接: https://arxiv.org/abs/2410.07765
作者: Muhammad Umair Nasir,Steven James,Julian Togelius
关键词-EN: recently demonstrated great, demonstrated great success, understanding natural language, recently demonstrated, demonstrated great
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question as to what extent and in which way these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through given objectives, with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and found that GPT-4-Turbo achieved the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three above criteria. Furthermore, we preliminarily test large reasoning models, namely o1, which scores 67.84% on GTBS, indicating that the benchmark remains challenging for current models. Code, data, and documentation are available at this https URL.
摘要:大语言模型 (LLMs) 近期在生成和理解自然语言方面展示了巨大的成功。尽管它们在自然语言领域之外也展现了潜力,但这些 LLMs 在多大程度上以及以何种方式进行规划仍是一个开放的问题。我们通过提出 GameTraversalBenchmark (GTB) 来研究它们的规划能力,GTB 是一个包含多样化二维网格游戏地图的基准测试。一个 LLM 如果能够在最少步骤和最少生成错误的情况下完成给定的目标,则视为成功。我们在 GTB 上评估了多个 LLMs,发现 GPT-4-Turbo 在 GTB_Score (GTBS) 上取得了最高分 44.97%,GTBS 是一个综合评分,结合了上述三个标准。此外,我们初步测试了大型推理模型,即 o1,它在 GTBS 上得分为 67.84%,表明该基准对当前模型仍然具有挑战性。代码、数据和文档可在以下链接获取:https URL。

[NLP-51] textitJump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

【速读】: 该论文试图解决离散扩散模型(DDMs)在并行采样过程中引入的复合解码误差(CDE)问题,该误差导致采样速度加快的同时样本质量下降。解决方案的关键是提出了一种名为“Jump Your Steps”(JYS)的新方法,通过优化离散采样时间步的分配来最小化CDE,而无需额外的计算成本。具体而言,论文推导了CDE的实际上限,并提出了一种高效的算法来搜索最优的采样调度,从而在图像、音乐和文本生成等多个领域显著提升采样质量。

链接: https://arxiv.org/abs/2410.07761
作者: Yong-Hyun Park,Chieh-Hsin Lai,Satoshi Hayakawa,Yuhta Takida,Yuki Mitsufuji
关键词-EN: discrete diffusion models, Diffusion models, Compounding Decoding Error, continuous domains, notable success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like \tau -leaping accelerate this process, they introduce \textitCompounding Decoding Error (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we present \textitJump Your Steps (JYS), a novel approach that optimizes the allocation of discrete sampling timesteps by minimizing CDE without extra computational cost. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling.
摘要:扩散模型在连续领域取得了显著的成功,进而推动了离散扩散模型 (DDMs) 在离散变量中的发展。尽管近期有所进展,DDMs 仍面临采样速度慢的挑战。虽然并行采样方法如 \tau -leaping 加速了这一过程,但它们引入了复合解码误差 (CDE),即在并行 Token 生成过程中,真实分布与近似分布之间出现差异,导致样本质量下降。在本研究中,我们提出了 \textitJump Your Steps (JYS),一种通过最小化 CDE 来优化离散采样时间步分配的新方法,且无需额外计算成本。更具体地说,我们推导了 CDE 的实际上限,并提出了一种高效的算法来搜索最优采样计划。在图像、音乐和文本生成方面的广泛实验表明,JYS 显著提升了采样质量,确立了其在快速采样中增强 DDM 性能的多功能框架地位。

[NLP-52] StepTool: A Step-grained Reinforcement Learning Framework for Tool Learning in LLMs

【速读】: 该论文试图解决大语言模型(LLMs)在工具学习中面临的两个主要挑战:一是模仿静态轨迹限制了其在新任务中的泛化能力,二是专家轨迹可能并非最优,存在更好的解决方案路径。解决方案的关键在于引入了一个名为StepTool的新型步进式强化学习框架,该框架包括步进式奖励塑造和步进式优化两个组件。步进式奖励塑造根据工具调用成功及其对任务的贡献在每次工具交互时分配奖励,而步进式优化则使用策略梯度方法以多步方式优化模型。实验结果表明,StepTool在多步工具任务中显著优于现有方法,为复杂任务环境提供了强有力的解决方案。

链接: https://arxiv.org/abs/2410.07745
作者: Yuanqing Yu,Zhefan Wang,Weizhi Ma,Zhicheng Guo,Jingtao Zhan,Shuai Wang,Chuhan Wu,Zhiqiang Guo,Min Zhang
关键词-EN: Large Language Models, Large Language, acquire real-time information, real-time information retrieval, Language Models
类目: Computation and Language (cs.CL)
备注: Ongoning Work

点击查看摘要

Abstract:Despite having powerful reasoning and inference capabilities, Large Language Models (LLMs) still need external tools to acquire real-time information retrieval or domain-specific expertise to solve complex tasks, which is referred to as tool learning. Existing tool learning methods primarily rely on tuning with expert trajectories, focusing on token-sequence learning from a linguistic perspective. However, there are several challenges: 1) imitating static trajectories limits their ability to generalize to new tasks. 2) even expert trajectories can be suboptimal, and better solution paths may exist. In this work, we introduce StepTool, a novel step-grained reinforcement learning framework to improve tool learning in LLMs. It consists of two components: Step-grained Reward Shaping, which assigns rewards at each tool interaction based on tool invocation success and its contribution to the task, and Step-grained Optimization, which uses policy gradient methods to optimize the model in a multi-step manner. Experimental results demonstrate that StepTool significantly outperforms existing methods in multi-step, tool-based tasks, providing a robust solution for complex task environments. Codes are available at this https URL.
摘要:尽管大语言模型 (LLM) 具备强大的推理和推断能力,但仍需借助外部工具来获取实时信息检索或特定领域的专业知识以解决复杂任务,这一过程被称为工具学习。现有的工具学习方法主要依赖于专家轨迹的调优,侧重于从语言学角度进行 Token 序列学习。然而,存在以下几个挑战:1) 模仿静态轨迹限制了其对新任务的泛化能力。2) 即使是专家轨迹也可能并非最优,更好的解决方案路径可能存在。在本研究中,我们提出了 StepTool,一种新颖的步进式强化学习框架,以改进大语言模型中的工具学习。该框架包含两个组成部分:步进式奖励塑造 (Step-grained Reward Shaping),根据工具调用成功及其对任务的贡献在每次工具交互时分配奖励;以及步进式优化 (Step-grained Optimization),采用策略梯度方法以多步方式优化模型。实验结果表明,StepTool 在多步、基于工具的任务中显著优于现有方法,为复杂任务环境提供了一个稳健的解决方案。代码可在此 https URL 获取。

[NLP-53] SLIM: Let LLM Learn More and Forget Less with Soft LoRA and Identity Mixture

【速读】: 该论文试图解决在训练大型语言模型(LLMs)时如何平衡训练预算、下游任务性能和模型通用能力的问题。解决方案的关键在于提出了一种基于Soft LoRA和Identity Mixture(SLIM)的新型混合专家(MoE)框架,该框架通过动态路由机制在LoRA适配器和跳跃连接之间进行切换,从而在减少训练成本的同时,有效缓解了灾难性遗忘问题,并保持了模型的通用能力。此外,论文还采用了滑动聚类和模型合并技术,进一步提升了模型在下游任务中的表现和泛化能力。

链接: https://arxiv.org/abs/2410.07739
作者: Jiayi Han,Liang Du,Hongwei Du,Xiangguo Zhou,Yiwen Wu,Weibo Zheng,Donghong Han
关键词-EN: downstream tasks, challenge to balance, general capabilities, training budget, downstream performance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Although many efforts have been made, it is still a challenge to balance the training budget, downstream performance, and the general capabilities of the LLMs in many applications. Training the whole model for downstream tasks is expensive, and could easily result in catastrophic forgetting. By introducing parameter-efficient fine-tuning (PEFT), the training cost could be reduced, but it still suffers from forgetting, and limits the learning on the downstream tasks. To efficiently fine-tune the LLMs with less limitation to their downstream performance while mitigating the forgetting of general capabilities, we propose a novel mixture of expert (MoE) framework based on Soft LoRA and Identity Mixture (SLIM), that allows dynamic routing between LoRA adapters and skipping connection, enables the suppression of forgetting. We adopt weight-yielding with sliding clustering for better out-of-domain distinguish to enhance the routing. We also propose to convert the mixture of low-rank adapters to the model merging formulation and introduce fast dynamic merging of LoRA adapters to keep the general capabilities of the base model. Extensive experiments demonstrate that the proposed SLIM is comparable to the state-of-the-art PEFT approaches on the downstream tasks while achieving the leading performance in mitigating catastrophic forgetting.
摘要:尽管已经做出了许多努力,但在许多应用中平衡大语言模型 (LLM) 的训练预算、下游任务性能和通用能力仍然是一个挑战。为下游任务训练整个模型成本高昂,并且容易导致灾难性遗忘。通过引入参数高效微调 (PEFT),可以降低训练成本,但仍然存在遗忘问题,并限制了下游任务的学习。为了在减少对下游任务性能限制的同时,有效微调大语言模型并缓解遗忘通用能力的问题,我们提出了一种基于 Soft LoRA 和 Identity Mixture (SLIM) 的新型专家混合 (MoE) 框架,该框架允许 LoRA 适配器与跳跃连接之间的动态路由,从而实现遗忘的抑制。我们采用带滑动聚类的权重生成方法,以增强域外区分能力,从而提升路由效果。我们还提出将低秩适配器的混合转换为模型合并公式,并引入 LoRA 适配器的快速动态合并,以保持基础模型的通用能力。广泛的实验表明,所提出的 SLIM 在下游任务上与最先进的 PEFT 方法相当,同时在缓解灾难性遗忘方面表现领先。

[NLP-54] Agent Bank: Towards Generalized LLM Agents via Fine-Tuning on 50000 Interaction Trajectories EMNLP2024

【速读】: 该论文旨在通过在代理-环境交互轨迹数据上进行微调,提升开源大型语言模型(LLMs)的通用代理能力。解决方案的关键在于引入AgentBank,这是一个包含超过5万条多样化高质量交互轨迹的数据集,涵盖16个任务和五个不同的代理技能维度。通过创新的标注流程,论文成功减少了数据集的难度偏差,并在此基础上微调LLMs生成了一系列名为Samoyed的代理模型。实验结果表明,扩展交互轨迹数据集能有效提升代理的通用能力,并揭示了轨迹微调和代理技能泛化的关键观察。

链接: https://arxiv.org/abs/2410.07706
作者: Yifan Song,Weimin Xiong,Xiutian Zhao,Dawei Zhu,Wenhao Wu,Ke Wang,Cheng Li,Wei Peng,Sujian Li
关键词-EN: holds significant promise, open-source large language, Fine-tuning on agent-environment, large language models, data holds significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of EMNLP 2024

点击查看摘要

Abstract:Fine-tuning on agent-environment interaction trajectory data holds significant promise for surfacing generalized agent capabilities in open-source large language models (LLMs). In this work, we introduce AgentBank, by far the largest trajectory tuning data collection featuring more than 50k diverse high-quality interaction trajectories which comprises 16 tasks covering five distinct agent skill dimensions. Leveraging a novel annotation pipeline, we are able to scale the annotated trajectories and generate a trajectory dataset with minimized difficulty bias. Furthermore, we fine-tune LLMs on AgentBank to get a series of agent models, Samoyed. Our comparative experiments demonstrate the effectiveness of scaling the interaction trajectory data to acquire generalized agent capabilities. Additional studies also reveal some key observations regarding trajectory tuning and agent skill generalization.
摘要:在智能体-环境交互轨迹数据上进行微调,对于在开源大语言模型 (LLM) 中发掘通用智能体能力具有重要潜力。本文中,我们引入了 AgentBank,这是迄今为止最大的轨迹微调数据集,包含了超过 50,000 条多样化的优质交互轨迹,涵盖了 16 项任务,涉及五个不同的智能体技能维度。通过采用一种新颖的标注流程,我们能够扩展标注的轨迹数量,并生成一个难度偏差最小化的轨迹数据集。此外,我们在 AgentBank 上对 LLM 进行微调,得到了一系列智能体模型,称为 Samoyed。我们的对比实验证明了扩展交互轨迹数据以获取通用智能体能力的有效性。进一步的研究还揭示了关于轨迹微调和智能体技能泛化的若干关键观察。

[NLP-55] Multi-Facet Counterfactual Learning for Content Quality Evaluation

【速读】: 该论文试图解决传统内容质量评估方法仅依赖单一评分信号,难以区分文档在多个质量维度上的差异的问题。解决方案的关键在于提出了Multi-facet cOunterfactual LEarning (MOLE)框架,通过利用大型语言模型生成与原始文档在关键质量维度上存在差异的反事实内容,并结合对比学习和监督学习的联合训练策略,使评估器能够识别和区分不同的质量维度,从而提高内容质量评分的准确性和与人类判断的相关性。

链接: https://arxiv.org/abs/2410.07693
作者: Jiasheng Zheng,Hongyu Lin,Boxi Cao,Meng Liao,Yaojie Lu,Xianpei Han,Le Sun
关键词-EN: current massive amount, essential for filtering, current massive, massive amount, content quality
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating the quality of documents is essential for filtering valuable content from the current massive amount of information. Conventional approaches typically rely on a single score as a supervision signal for training content quality evaluators, which is inadequate to differentiate documents with quality variations across multiple facets. In this paper, we propose Multi-facet cOunterfactual LEarning (MOLE), a framework for efficiently constructing evaluators that perceive multiple facets of content quality evaluation. Given a specific scenario, we prompt large language models to generate counterfactual content that exhibits variations in critical quality facets compared to the original document. Furthermore, we leverage a joint training strategy based on contrastive learning and supervised learning to enable the evaluator to distinguish between different quality facets, resulting in more accurate predictions of content quality scores. Experimental results on 2 datasets across different scenarios demonstrate that our proposed MOLE framework effectively improves the correlation of document content quality evaluations with human judgments, which serve as a valuable toolkit for effective information acquisition.
摘要:评估文档质量对于从当前海量信息中筛选出有价值的内容至关重要。传统方法通常依赖单一评分作为训练内容质量评估器的监督信号,这不足以区分在多个方面质量有所变化的文档。本文提出了一种名为多方面反事实学习 (Multi-facet cOunterfactual LEarning, MOLE) 的框架,用于高效构建能够感知内容质量评估多个方面的评估器。在特定场景下,我们引导大语言模型生成与原始文档相比在关键质量方面有所变化的反事实内容。此外,我们利用基于对比学习和监督学习的联合训练策略,使评估器能够区分不同的质量方面,从而更准确地预测内容质量评分。在不同场景下的两个数据集上的实验结果表明,我们提出的 MOLE 框架有效地提高了文档内容质量评估与人类判断的相关性,为有效信息获取提供了一个有价值的工具包。

[NLP-56] Smart Audit System Empowered by LLM

【速读】: 该论文试图解决传统制造质量审计过程中存在的劳动密集、依赖人工经验和难以在全球复杂供应链中保持透明度、问责制及持续改进的问题。解决方案的关键在于引入基于大型语言模型(LLMs)的智能审计系统,该系统通过三个创新点提升审计效率和效果:一是动态风险评估模型,优化审计流程和资源分配;二是制造合规助手,增强数据处理和知识库的自进化能力;三是Re-act框架通用性分析代理,提供实时定制化分析,帮助工程师改进供应商表现。实测结果显示,该系统能将审计效率提升超过24%。

链接: https://arxiv.org/abs/2410.07677
作者: Xu Yao,Xiaoxu Wu,Xi Li,Huan Xu,Chenlei Li,Ping Huang,Si Li,Xiaoning Ma,Jiulong Shan
关键词-EN: mass production environments, ensuring high product, high product standards, production environments, pivotal for ensuring
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Manufacturing quality audits are pivotal for ensuring high product standards in mass production environments. Traditional auditing processes, however, are labor-intensive and reliant on human expertise, posing challenges in maintaining transparency, accountability, and continuous improvement across complex global supply chains. To address these challenges, we propose a smart audit system empowered by large language models (LLMs). Our approach introduces three innovations: a dynamic risk assessment model that streamlines audit procedures and optimizes resource allocation; a manufacturing compliance copilot that enhances data processing, retrieval, and evaluation for a self-evolving manufacturing knowledge base; and a Re-act framework commonality analysis agent that provides real-time, customized analysis to empower engineers with insights for supplier improvement. These enhancements elevate audit efficiency and effectiveness, with testing scenarios demonstrating an improvement of over 24%.
摘要:在批量生产环境中,制造质量审核对于确保高产品标准至关重要。然而,传统的审核流程依赖于人工操作和专业知识,这使得在复杂的全球供应链中保持透明度、问责制和持续改进面临挑战。为应对这些挑战,我们提出了一种由大语言模型 (LLM) 驱动的智能审核系统。我们的方法引入了三项创新:一个动态风险评估模型,用于简化审核流程并优化资源分配;一个制造合规助手,增强数据处理、检索和评估,以构建一个自我进化的制造知识库;以及一个 Re-act 框架共性分析智能体,提供实时、定制化的分析,帮助工程师洞察供应商改进的机会。这些改进提升了审核的效率和效果,测试场景显示效率提升了超过 24%。

[NLP-57] MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

【速读】: 该论文试图解决在大型语言模型(LLMs)超越人类能力的场景下,如何通过弱监督实现强学生模型的有效对齐问题。解决方案的关键在于提出了多代理对比偏好优化(MACPO)框架,该框架通过迭代强化不熟悉的正面行为并惩罚熟悉的负面行为,促进弱教师和强学生之间的相互学习。具体策略包括相互正面行为增强策略和硬负面行为构建策略,前者鼓励双方学习对方的正面行为,后者通过微调负面行为数据促使双方生成熟悉的负面行为。实验结果表明,MACPO不仅提升了强学生的对齐性能,还随着弱教师数量的增加,通过更多迭代优化轮次实现了更好的弱到强对齐效果。

链接: https://arxiv.org/abs/2410.07672
作者: Yougang Lyu,Lingyong Yan,Zihan Wang,Dawei Yin,Pengjie Ren,Maarten de Rijke,Zhaochun Ren
关键词-EN: large language models, achieving near-human capabilities, weak teachers, strong students, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:As large language models (LLMs) are rapidly advancing and achieving near-human capabilities, aligning them with human values is becoming more urgent. In scenarios where LLMs outperform humans, we face a weak-to-strong alignment problem where we need to effectively align strong student LLMs through weak supervision generated by weak teachers. Existing alignment methods mainly focus on strong-to-weak alignment and self-alignment settings, and it is impractical to adapt them to the much harder weak-to-strong alignment setting. To fill this gap, we propose a multi-agent contrastive preference optimization (MACPO) framework. MACPO facilitates weak teachers and strong students to learn from each other by iteratively reinforcing unfamiliar positive behaviors while penalizing familiar negative ones. To get this, we devise a mutual positive behavior augmentation strategy to encourage weak teachers and strong students to learn from each other’s positive behavior and further provide higher quality positive behavior for the next iteration. Additionally, we propose a hard negative behavior construction strategy to induce weak teachers and strong students to generate familiar negative behavior by fine-tuning on negative behavioral data. Experimental results on the HH-RLHF and PKU-SafeRLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers. Moreover, as the number of weak teachers increases, MACPO achieves better weak-to-strong alignment performance through more iteration optimization rounds.
摘要:随着大语言模型 (LLM) 的快速发展并接近人类能力,使其与人类价值观对齐变得愈发紧迫。在 LLM 超越人类的场景中,我们面临一个弱到强的对齐问题,即需要通过弱教师生成的弱监督来有效对齐强学生 LLM。现有的对齐方法主要集中在强到弱对齐和自对齐设置上,将其适应于更为困难的弱到强对齐设置是不切实际的。为了填补这一空白,我们提出了一种多智能体对比偏好优化 (MACPO) 框架。MACPO 通过迭代强化不熟悉的积极行为同时惩罚熟悉的消极行为,促进弱教师和强学生相互学习。为此,我们设计了一种相互积极行为增强策略,鼓励弱教师和强学生从彼此的积极行为中学习,并为下一轮迭代提供更高质量的积极行为。此外,我们提出了一种硬消极行为构建策略,通过在消极行为数据上微调,诱导弱教师和强学生生成熟悉的消极行为。在 HH-RLHF 和 PKU-SafeRLHF 数据集上的实验结果,通过自动指标和人类判断评估,表明 MACPO 同时提高了强学生和弱教师的对齐性能。此外,随着弱教师数量的增加,MACPO 通过更多迭代优化轮次实现了更好的弱到强对齐性能。

[NLP-58] StablePrompt: Automatic Prompt Tuning using Reinforcement Learning for Large Language Models EMNLP2024

【速读】: 该论文试图解决在大语言模型(LLM)应用中寻找合适提示(prompt)的问题,特别是在使用强化学习(RL)进行提示调优时面临的训练不稳定性和环境依赖性问题。解决方案的关键在于提出了StablePrompt方法,通过将提示调优形式化为在线RL问题,并引入自适应近端策略优化(APPO)算法,利用LLM锚模型自适应调整策略更新速率,从而在保持预训练LLM语言能力的同时,实现灵活的提示搜索和稳定的训练过程。

链接: https://arxiv.org/abs/2410.07652
作者: Minchan Kwon,Gaeun Kim,Jongsuk Kim,Haeil Lee,Junmo Kim
关键词-EN: Large Language Models, Large Language, usage of Large, Language Models, important issue
类目: Computation and Language (cs.CL)
备注: EMNLP 2024 cam-ready

点击查看摘要

Abstract:Finding appropriate prompts for the specific task has become an important issue as the usage of Large Language Models (LLM) has expanded. Reinforcement Learning (RL) is widely used for prompt tuning, but its inherent instability and environmental dependency make it difficult to use in practice. In this paper, we propose StablePrompt, which strikes a balance between training stability and search space, mitigating the instability of RL and producing high-performance prompts. We formulate prompt tuning as an online RL problem between the agent and target LLM and introduce Adaptive Proximal Policy Optimization (APPO). APPO introduces an LLM anchor model to adaptively adjust the rate of policy updates. This allows for flexible prompt search while preserving the linguistic ability of the pre-trained LLM. StablePrompt outperforms previous methods on various tasks including text classification, question answering, and text generation. Our code can be found in github.
摘要:随着大语言模型 (LLM) 的应用范围扩大,寻找适用于特定任务的提示词已成为一个重要问题。强化学习 (RL) 广泛用于提示词调优,但其固有的不稳定性及对环境的依赖性使其在实际应用中难以使用。本文提出 StablePrompt,该方法在训练稳定性和搜索空间之间取得平衡,缓解了 RL 的不稳定性,并生成高性能的提示词。我们将提示词调优形式化为智能体与目标 LLM 之间的在线 RL 问题,并引入自适应近端策略优化 (APPO)。APPO 引入一个 LLM 锚模型,以自适应地调整策略更新速率。这使得提示词搜索具有灵活性,同时保留了预训练 LLM 的语言能力。StablePrompt 在包括文本分类、问答和文本生成在内的多种任务上优于先前的方法。我们的代码可在 github 上找到。

[NLP-59] Automatic Curriculum Expert Iteration for Reliable LLM Reasoning

【速读】: 该论文试图解决大型语言模型(LLM)在推理任务中存在的幻觉(生成看似合理但不准确的内容)和懒惰(过度拒绝或默认回答“我不知道”)问题。解决方案的关键是提出了一种名为自动课程专家迭代(Auto-CEI)的方法,通过增强LLM的推理能力并使其响应与模型能力相匹配,即在能力范围内自信回答,在任务超出能力时适当拒绝。Auto-CEI通过专家迭代探索LLM推理路径,纠正错误路径以减少累积错误并提高鲁棒性,同时鼓励在充分推理尝试后才做出“我不知道”的响应。该方法通过自动调整奖励机制,激励模型在承认无能之前进行更深入的推理,从而推动LLM推理能力的极限并使其行为与这些极限相一致。

链接: https://arxiv.org/abs/2410.07627
作者: Zirui Zhao,Hanze Dong,Amrita Saha,Caiming Xiong,Doyen Sahoo
关键词-EN: generating plausible, inaccurate content, excessive refusals, persist as major, plausible but inaccurate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 20 pages

点击查看摘要

Abstract:Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e. excessive refusals or defaulting to “I don’t know”) persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conservative, limiting their problem-solving capabilities. To mitigate hallucination and laziness in reasoning tasks, we propose Automatic Curriculum Expert Iteration (Auto-CEI) to enhance LLM reasoning and align responses to the model’s capabilities–assertively answering within its limits and declining when tasks exceed them. In our method, Expert Iteration explores the reasoning trajectories near the LLM policy, guiding incorrect paths back on track to reduce compounding errors and improve robustness; it also promotes appropriate “I don’t know” responses after sufficient reasoning attempts. The curriculum automatically adjusts rewards, incentivizing extended reasoning before acknowledging incapability, thereby pushing the limits of LLM reasoning and aligning its behaviour with these limits. We compare Auto-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where Auto-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness.
摘要:幻觉(即生成看似合理但不准确的内容)和懒惰(即过度拒绝或默认回答“我不知道”)仍然是大型语言模型 (LLM) 推理中的主要挑战。当前减少幻觉的努力主要集中在基于知识的任务中的事实错误上,往往忽视了与错误推理相关的幻觉。同时,一些方法使 LLM 过于保守,限制了其解决问题的能力。为了减轻推理任务中的幻觉和懒惰,我们提出了自动课程专家迭代 (Auto-CEI) 来增强 LLM 推理,并使响应与模型的能力相匹配——在其能力范围内自信地回答,并在任务超出其能力时拒绝回答。在我们的方法中,专家迭代探索了 LLM 策略附近的推理轨迹,引导错误的路径回到正轨,以减少累积错误并提高鲁棒性;它还促进了在充分推理尝试后适当的“我不知道”响应。课程自动调整奖励,激励在承认无能之前进行更深入的推理,从而推动 LLM 推理的极限,并使其行为与这些极限相一致。我们将 Auto-CEI 与逻辑推理、数学和规划任务中的各种最先进 (SOTA) 基线进行了比较,Auto-CEI 通过有效平衡自信和保守性实现了优越的一致性。

[NLP-60] urboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text

【速读】: 该论文试图解决当前检索增强生成(RAG)系统在处理大量检索文档块时导致的计算开销大和首次生成时间(TTFT)延迟的问题。解决方案的关键在于引入TurboRAG系统,通过预先离线计算并存储文档的键值(KV)缓存,从而在推理过程中直接检索已保存的KV缓存,避免了在线计算KV缓存的开销。此外,论文还对掩码矩阵和位置嵌入机制进行了深入研究,并对预训练语言模型进行了微调,以保持TurboRAG的模型准确性。该方法无需修改现有的大语言模型及其推理系统,适用于大多数现有应用。实验结果表明,TurboRAG在保持与标准RAG系统相当性能的同时,将TTFT减少了高达9.4倍(平均8.6倍)。

链接: https://arxiv.org/abs/2410.07590
作者: Songshuo Lu,Hua Wang,Yutian Rong,Zhi Chen,Yaohua Tang
关键词-EN: Current Retrieval-Augmented Generation, process numerous retrieved, current RAG system, numerous retrieved document, retrieved document chunks
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks for prefill which requires a large volume of computation, therefore leading to significant latency in time-to-first-token (TTFT). To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a novel RAG system that redesigns the inference paradigm of the current RAG system by first pre-computing and storing the key-value (KV) caches of documents offline, and then directly retrieving the saved KV cache for prefill. Hence, online computation of KV caches is eliminated during inference. In addition, we provide a number of insights into the mask matrix and positional embedding mechanisms, plus fine-tune a pretrained language model to maintain model accuracy of TurboRAG. Our approach is applicable to most existing large language models and their applications without any requirement in modification of models and inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to the conventional RAG systems (on an average of 8.6x), but reserving comparable performance to the standard RAG systems.
摘要: 当前的检索增强生成 (Retrieval-Augmented Generation, RAG) 系统在预填充阶段需要连接和处理大量检索到的文档块,这导致计算量巨大,从而在生成首个 Token 的时间 (Time-to-First-Token, TTFT) 上产生了显著的延迟。为了减少计算开销和 TTFT,我们提出了 TurboRAG,这是一种新颖的 RAG 系统,通过首先离线预计算并存储文档的关键-值 (Key-Value, KV) 缓存,然后在预填充阶段直接检索保存的 KV 缓存,从而重新设计了当前 RAG 系统的推理范式。因此,在推理过程中消除了在线计算 KV 缓存的需求。此外,我们对掩码矩阵和位置嵌入机制提供了多种见解,并微调了一个预训练的语言模型以保持 TurboRAG 的模型准确性。我们的方法适用于大多数现有的大语言模型及其应用,无需对模型和推理系统进行任何修改。在一系列 RAG 基准测试中的实验结果表明,与传统的 RAG 系统相比,TurboRAG 将 TTFT 减少了高达 9.4 倍(平均为 8.6 倍),同时保持了与标准 RAG 系统相当的性能。

[NLP-61] No Free Lunch: Retrieval-Augmented Generation Undermines Fairness in LLMs Even for Vigilant Users

【速读】: 该论文试图解决Retrieval-Augmented Generation (RAG)在大语言模型(LLMs)中可能导致的公平性问题。研究指出,尽管RAG在减少幻觉和增强领域特定生成能力方面表现出色且成本效益高,但其公平性成本不容忽视。论文提出了一个三层次的威胁模型,从用户对公平性的认知角度出发,探讨了不同程度的公平性审查对外部数据集的影响。实验结果表明,即使在完全审查和看似无偏的数据集上,RAG仍可能导致偏见输出,且无需微调或重新训练。关键在于,当前的公平性对齐方法在RAG环境中存在局限,因此急需开发新的策略来确保RAG-based LLMs的公平性。论文提出了潜在的缓解措施,并呼吁进一步研究以开发强有力的公平性保障机制。

链接: https://arxiv.org/abs/2410.07589
作者: Mengxuan Hu,Hongyi Wu,Zihan Guan,Ronghang Zhu,Dongliang Guo,Daiqing Qi,Sheng Li
关键词-EN: domain-specific generation capabilities, Retrieval-Augmented Generation, large language models, domain-specific generation, generation capabilities
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is widely adopted for its effectiveness and cost-efficiency in mitigating hallucinations and enhancing the domain-specific generation capabilities of large language models (LLMs). However, is this effectiveness and cost-efficiency truly a free lunch? In this study, we comprehensively investigate the fairness costs associated with RAG by proposing a practical three-level threat model from the perspective of user awareness of fairness. Specifically, varying levels of user fairness awareness result in different degrees of fairness censorship on the external dataset. We examine the fairness implications of RAG using uncensored, partially censored, and fully censored datasets. Our experiments demonstrate that fairness alignment can be easily undermined through RAG without the need for fine-tuning or retraining. Even with fully censored and supposedly unbiased external datasets, RAG can lead to biased outputs. Our findings underscore the limitations of current alignment methods in the context of RAG-based LLMs and highlight the urgent need for new strategies to ensure fairness. We propose potential mitigations and call for further research to develop robust fairness safeguards in RAG-based LLMs.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 因其有效性和成本效益而被广泛采用,用于缓解幻觉并增强大语言模型 (Large Language Models, LLMs) 的领域特定生成能力。然而,这种有效性和成本效益是否真的是免费的午餐?在本研究中,我们从用户对公平性的认知角度,提出了一个实用的三级威胁模型,全面探讨了 RAG 相关的公平性成本。具体而言,用户对公平性的不同认知水平导致了对外部数据集的不同程度的公平性审查。我们通过使用未审查、部分审查和完全审查的数据集,考察了 RAG 的公平性影响。实验结果表明,无需微调或重新训练,RAG 就能轻易破坏公平性对齐。即使在完全审查且理论上无偏见的外部数据集上,RAG 也可能导致有偏见的输出。我们的研究强调了当前对齐方法在基于 RAG 的 LLMs 中的局限性,并突显了迫切需要新的策略来确保公平性。我们提出了潜在的缓解措施,并呼吁进一步研究,以开发基于 RAG 的 LLMs 中的强大公平性保障措施。

[NLP-62] Detecting Training Data of Large Language Models via Expectation Maximization

【速读】: 该论文试图解决大语言模型(LLMs)中训练数据成员资格推断攻击(MIAs)的问题,特别是在面对大规模预训练数据和成员资格模糊性时的挑战。解决方案的关键在于提出了EM-MIA方法,该方法通过期望最大化算法迭代优化成员分数和前缀分数,利用两者之间的对偶性相互提升估计精度。EM-MIA不仅在WikiMIA数据集上达到了最先进的结果,还引入了OLMoMIA基准,用于控制训练和测试数据分布的重叠程度,从而全面评估MIAs方法,推动该领域的未来研究。

链接: https://arxiv.org/abs/2410.07582
作者: Gyuwan Kim,Yang Li,Evangelia Spiliopoulou,Jie Ma,Miguel Ballesteros,William Yang Wang
关键词-EN: large language models, remains undisclosed, impressive advancements, widespread deployment, deployment of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 14 pages

点击查看摘要

Abstract:The widespread deployment of large language models (LLMs) has led to impressive advancements, yet information about their training data, a critical factor in their performance, remains undisclosed. Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model’s training data. MIAs can offer insights into LLM outputs and help detect and address concerns such as data contamination and compliance with privacy and copyright standards. However, applying MIAs to LLMs presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership. Additionally, creating appropriate benchmarks to evaluate MIA methods is not straightforward, as training and test data distributions are often unknown. In this paper, we introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm, leveraging the duality that the estimates of these scores can be improved by each other. Membership scores and prefix scores assess how each instance is likely to be a member and discriminative as a prefix, respectively. Our method achieves state-of-the-art results on the WikiMIA dataset. To further evaluate EM-MIA, we present OLMoMIA, a benchmark built from OLMo resources, which allows us to control the difficulty of MIA tasks with varying degrees of overlap between training and test data distributions. We believe that EM-MIA serves as a robust MIA method for LLMs and that OLMoMIA provides a valuable resource for comprehensively evaluating MIA approaches, thereby driving future research in this critical area.
摘要:大语言模型 (LLM) 的广泛部署带来了显著的进步,然而关于其训练数据的关键信息仍然未公开。成员推断攻击 (Membership Inference Attacks, MIA) 旨在确定特定实例是否属于目标模型的训练数据。MIA 可以深入了解 LLM 的输出,并有助于检测和解决数据污染、隐私和版权标准合规等问题。然而,将 MIA 应用于 LLM 面临独特的挑战,这主要源于预训练数据的庞大规模和成员关系的模糊性。此外,创建适当的基准来评估 MIA 方法并非易事,因为训练和测试数据分布通常是未知的。本文中,我们提出了 EM-MIA,一种针对 LLM 的新型 MIA 方法,该方法通过期望最大化算法迭代优化成员分数和前缀分数,利用这两种分数估计之间的对偶性来相互提升。成员分数和前缀分数分别评估每个实例成为成员的可能性和作为前缀的区分性。我们的方法在 WikiMIA 数据集上达到了最先进的结果。为进一步评估 EM-MIA,我们推出了 OLMoMIA,这是一个基于 OLMo 资源构建的基准,它使我们能够通过控制训练和测试数据分布之间的重叠程度来调节 MIA 任务的难度。我们相信 EM-MIA 作为一种强大的 MIA 方法适用于 LLM,而 OLMoMIA 则为全面评估 MIA 方法提供了宝贵的资源,从而推动这一关键领域的未来研究。

[NLP-63] RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?

【速读】: 该论文试图解决PHP语言中软件漏洞检测的问题,特别是现有研究缺乏针对PHP漏洞的专门模型以及样本提取和处理方面的挑战。解决方案的关键在于提出了RealVul框架,该框架通过漏洞候选检测方法和代码规范化技术,能够隔离潜在的漏洞触发点并简化代码,去除不必要的语义信息,从而使模型更好地理解和学习生成的漏洞样本。此外,通过改进数据合成方法来解决PHP漏洞样本不足的问题。实验结果表明,RealVul显著提升了现有模型在PHP漏洞检测中的效果和泛化能力。

链接: https://arxiv.org/abs/2410.07573
作者: Di Cao,Yong Liao,Xiuwei Shang
关键词-EN: large language models, software vulnerability detection, latest advancements, advancements in large, sparked interest
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The latest advancements in large language models (LLMs) have sparked interest in their potential for software vulnerability detection. However, there is currently a lack of research specifically focused on vulnerabilities in the PHP language, and challenges in extracting samples and processing persist, hindering the model’s ability to effectively capture the characteristics of specific vulnerabilities. In this paper, we present RealVul, the first LLM-based framework designed for PHP vulnerability detection, addressing these issues. By vulnerability candidate detection methods and employing techniques such as normalization, we can isolate potential vulnerability triggers while streamlining the code and eliminating unnecessary semantic information, enabling the model to better understand and learn from the generated vulnerability samples. We also address the issue of insufficient PHP vulnerability samples by improving data synthesis methods. To evaluate RealVul’s performance, we conduct an extensive analysis using five distinct code LLMs on vulnerability data from 180 PHP projects. The results demonstrate a significant improvement in both effectiveness and generalization compared to existing methods, effectively boosting the vulnerability detection capabilities of these models.
摘要:大语言模型 (LLM) 的最新进展引发了对其在软件漏洞检测方面潜力的关注。然而,目前缺乏专门针对 PHP 语言漏洞的研究,且在提取样本和处理过程中存在挑战,阻碍了模型有效捕捉特定漏洞特征的能力。本文介绍了 RealVul,这是首个基于 LLM 的 PHP 漏洞检测框架,旨在解决这些问题。通过漏洞候选检测方法和采用归一化等技术,我们能够隔离潜在的漏洞触发点,同时简化代码并消除不必要的语义信息,使模型能够更好地理解和学习生成的漏洞样本。我们还通过改进数据合成方法解决了 PHP 漏洞样本不足的问题。为了评估 RealVul 的性能,我们使用五个不同的代码 LLM 对来自 180 个 PHP 项目的漏洞数据进行了广泛分析。结果表明,与现有方法相比,RealVul 在有效性和泛化能力方面均有显著提升,有效增强了这些模型的漏洞检测能力。

[NLP-64] How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

【速读】: 该论文试图解决在将大型语言模型(LLMs)转化为大型视觉-语言模型(LVLMs)过程中,由于视觉-语言适应(VL adaptation)导致的模型安全性下降问题。解决方案的关键在于通过权重合并(weight merging)方法,有效减少安全性下降的同时保持模型的帮助性,从而在多模态任务中实现更可靠和安全的LVLMs。

链接: https://arxiv.org/abs/2410.07571
作者: Seongyun Lee,Geewook Kim,Jiyeon Kim,Hyunji Lee,Hoyeon Chang,Sue Hyun Park,Minjoon Seo
关键词-EN: transforms Large Language, Large Language Models, Large Vision-Language Models, Large Language, Large Vision-Language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs) into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often compromises the inherent safety capabilities embedded in the original LLMs. Despite potential harmfulness due to weakened safety measures, in-depth analysis on the effects of VL adaptation on safety remains under-explored. This study examines how VL adaptation influences safety and evaluates the impact of safety fine-tuning methods. Our analysis reveals that safety degradation occurs during VL adaptation, even when the training data is safe. While safety tuning techniques like supervised fine-tuning with safety datasets or reinforcement learning from human feedback mitigate some risks, they still lead to safety degradation and a reduction in helpfulness due to over-rejection issues. Further analysis of internal model weights suggests that VL adaptation may impact certain safety-related layers, potentially lowering overall safety levels. Additionally, our findings demonstrate that the objectives of VL adaptation and safety tuning are divergent, which often results in their simultaneous application being suboptimal. To address this, we suggest the weight merging approach as an optimal solution effectively reducing safety degradation while maintaining helpfulness. These insights help guide the development of more reliable and secure LVLMs for real-world applications.
摘要:视觉-语言适应 (Vision-Language adaptation, VL adaptation) 将大语言模型 (Large Language Models, LLMs) 转化为大视觉-语言模型 (Large Vision-Language Models, LVLMs),以应对多模态任务,但这一过程往往削弱了原始 LLMs 中嵌入的固有安全能力。尽管由于安全措施的减弱可能带来潜在的危害,但关于 VL 适应对安全影响的深入分析仍未得到充分探索。本研究探讨了 VL 适应如何影响安全性,并评估了安全微调方法的影响。我们的分析表明,即使在训练数据安全的情况下,VL 适应过程中也会出现安全降级。虽然使用安全数据集进行监督微调或基于人类反馈的强化学习等安全调优技术可以缓解部分风险,但由于过度拒绝问题,它们仍会导致安全降级和有用性的降低。进一步分析内部模型权重表明,VL 适应可能影响某些与安全相关的层,从而可能降低整体安全水平。此外,我们的研究结果表明,VL 适应和安全调优的目标是分歧的,这通常导致它们的联合应用效果不佳。为解决这一问题,我们建议采用权重合并方法作为最佳解决方案,有效减少安全降级的同时保持有用性。这些见解有助于指导开发更可靠和安全的 LVLMs,以应用于实际场景。

[NLP-65] When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context

【速读】: 该论文试图解决在文本中提取事件或实体的相关时间和地点信息的问题,以增强信息提取的上下文准确性,从而在构建知识图谱时提高自动化发现的有效性。解决方案的关键在于使用高质量的流行病学论文数据集进行训练,采用编码器-解码器架构,并通过数据增强技术提升模型的性能。研究结果表明,经过微调的小型编码器-解码器模型在预测特定实体或事件的场景信息方面优于现成的语言模型和语义角色标注解析器。

链接: https://arxiv.org/abs/2410.07567
作者: Enrique Noriega-Atala,Robert Vacareanu,Salena Torres Ashton,Adarsh Pyarelal,Clayton T. Morrison,Mihai Surdeanu
关键词-EN: neural architecture finetuned, scenario context generation, context generation, mentioned in text, introduce a neural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:We introduce a neural architecture finetuned for the task of scenario context generation: The relevant location and time of an event or entity mentioned in text. Contextualizing information extraction helps to scope the validity of automated finings when aggregating them as knowledge graphs. Our approach uses a high-quality curated dataset of time and location annotations in a corpus of epidemiology papers to train an encoder-decoder architecture. We also explored the use of data augmentation techniques during training. Our findings suggest that a relatively small fine-tuned encoder-decoder model performs better than out-of-the-box LLMs and semantic role labeling parsers to accurate predict the relevant scenario information of a particular entity or event.
摘要:我们介绍了一种针对场景上下文生成任务进行微调的神经网络架构:即在文本中提到的某个事件或实体的相关位置和时间。上下文化的信息提取有助于在将这些信息聚合为知识图谱时,确定自动化发现的适用范围。我们的方法使用了一个高质量的、经过精心筛选的数据集,该数据集包含流行病学论文语料库中的时间和位置标注,用于训练一个编码器-解码器架构。我们还探讨了在训练过程中使用数据增强技术的应用。我们的研究结果表明,一个相对较小且经过微调的编码器-解码器模型在准确预测特定实体或事件的相关场景信息方面,表现优于现成的大语言模型和语义角色标注解析器。

[NLP-66] PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

【速读】: 该论文旨在开发一个专门针对日语能力的大型语言模型PLaMo-100B。解决方案的关键在于从零开始训练模型,使用2万亿个token,并通过QK Normalization和Z-Loss等架构设计确保训练稳定性。此外,通过监督微调(Supervised Fine-Tuning)和直接偏好优化(Direct Preference Optimization)等后训练技术,进一步提升了模型在日语特定任务中的表现,使其在性能上与前沿模型如GPT-4相媲美。

链接: https://arxiv.org/abs/2410.07563
作者: Kenshin Abe,Kaizaburo Chubachi,Yasuhiro Fujita,Yuta Hirokawa,Kentaro Imajo,Toshiki Kataoka,Hiroyoshi Komatsu,Hiroaki Mikami,Tsuguo Mogami,Shogo Murai,Kosuke Nakago,Daisuke Nishino,Toru Ogawa,Daisuke Okanohara,Yoshihiko Ozaki,Shotaro Sano,Shuji Suzuki,Tianqi Xu,Toshihiko Yanase(Preferred Elements, Inc.)
关键词-EN: Japanese proficiency, designed for Japanese, large-scale language model, language model designed, Direct Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch using 2 trillion tokens, with architecture such as QK Normalization and Z-Loss to ensure training stability during the training process. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model’s performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4.
摘要:我们介绍了 PLaMo-100B,这是一个专为日语能力设计的大规模语言模型。该模型从零开始训练,使用了 2 万亿个 Token,并采用了 QK Normalization 和 Z-Loss 等架构,以确保训练过程中的稳定性。训练后,通过监督微调 (Supervised Fine-Tuning) 和直接偏好优化 (Direct Preference Optimization) 等技术对模型性能进行了优化。基准评估表明,PLaMo-100B 表现出色,特别是在日语特定任务中,其结果与 GPT-4 等前沿模型相媲美。

[NLP-67] AI-Press: A Multi-Agent News Generating and Feedback Simulation System Powered by Large Language Models

【速读】: 该论文试图解决新闻生成过程中大语言模型(LLMs)在专业性和伦理判断上的局限性,以及新闻发布前难以预测公众反馈的问题。解决方案的关键在于引入AI-Press系统,该系统基于多智能体协作和检索增强生成技术,实现了自动化的新闻草稿撰写与润色。此外,论文还开发了一个反馈模拟系统,通过考虑人口分布生成公众反馈,从而在新闻发布前进行有效的反馈预测和内容优化。

链接: https://arxiv.org/abs/2410.07561
作者: Xiawei Liu,Shiyue Yang,Xinnong Zhang,Haoyu Kuang,Libo Sun,Yihang Yang,Siming Chen,Xuanjing Huang,Zhongyu Wei
关键词-EN: transformed journalism, social platforms, platforms has transformed, public feedback, Abstract
类目: Computation and Language (cs.CL)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:The rise of various social platforms has transformed journalism. The growing demand for news content has led to the increased use of large language models (LLMs) in news production due to their speed and cost-effectiveness. However, LLMs still encounter limitations in professionalism and ethical judgment in news generation. Additionally, predicting public feedback is usually difficult before news is released. To tackle these challenges, we introduce AI-Press, an automated news drafting and polishing system based on multi-agent collaboration and Retrieval-Augmented Generation. We develop a feedback simulation system that generates public feedback considering demographic distributions. Through extensive quantitative and qualitative evaluations, our system shows significant improvements in news-generating capabilities and verifies the effectiveness of public feedback simulation.
摘要:各种社交平台的兴起改变了新闻业。新闻内容需求的不断增长,使得大语言模型 (LLM) 在新闻生产中的应用日益增多,因其速度和成本效益。然而,LLM 在新闻生成中的专业性和伦理判断仍存在局限。此外,新闻发布前预测公众反馈通常较为困难。为应对这些挑战,我们推出了 AI-Press,这是一个基于多智能体协作和检索增强生成的自动化新闻起草与润色系统。我们开发了一个反馈模拟系统,该系统考虑人口分布生成公众反馈。通过广泛的定量和定性评估,我们的系统在新闻生成能力上显示出显著改进,并验证了公众反馈模拟的有效性。

[NLP-68] KRAG Framework for Enhancing LLMs in the Legal Domain KR

【速读】: 该论文试图解决大型语言模型(LLMs)在特定领域应用中缺乏关键知识实体和关系的问题。解决方案的关键在于引入知识表示增强生成(KRAG)框架,通过战略性地包含这些缺失的知识实体和关系,提升LLMs在法律等领域的推理、论证和解释能力。具体实现模型Soft PROLEG利用推理图来辅助LLMs进行结构化的法律推理,显著提高了模型处理复杂法律文本和术语的能力。

链接: https://arxiv.org/abs/2410.07551
作者: Nguyen Ha Thanh,Ken Satoh
关键词-EN: introduces Knowledge Representation, Representation Augmented Generation, Knowledge Representation Augmented, Large Language Models, capabilities of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )

点击查看摘要

Abstract:This paper introduces Knowledge Representation Augmented Generation (KRAG), a novel framework designed to enhance the capabilities of Large Language Models (LLMs) within domain-specific applications. KRAG points to the strategic inclusion of critical knowledge entities and relationships that are typically absent in standard data sets and which LLMs do not inherently learn. In the context of legal applications, we present Soft PROLEG, an implementation model under KRAG, which uses inference graphs to aid LLMs in delivering structured legal reasoning, argumentation, and explanations tailored to user inquiries. The integration of KRAG, either as a standalone framework or in tandem with retrieval augmented generation (RAG), markedly improves the ability of language models to navigate and solve the intricate challenges posed by legal texts and terminologies. This paper details KRAG’s methodology, its implementation through Soft PROLEG, and potential broader applications, underscoring its significant role in advancing natural language understanding and processing in specialized knowledge domains.
摘要:本文介绍了知识表示增强生成 (Knowledge Representation Augmented Generation, KRAG),这是一种旨在增强大语言模型 (Large Language Models, LLMs) 在特定领域应用中能力的新框架。KRAG 强调了在标准数据集中通常缺失且 LLMs 本身无法学习的战略性关键知识实体和关系的纳入。在法律应用的背景下,我们提出了 Soft PROLEG,这是 KRAG 下的一个实现模型,它使用推理图来辅助 LLMs 提供结构化的法律推理、论证和针对用户查询的解释。无论是作为独立框架还是与检索增强生成 (Retrieval Augmented Generation, RAG) 结合使用,KRAG 的集成显著提升了语言模型在处理法律文本和术语所提出的复杂挑战方面的能力。本文详细阐述了 KRAG 的方法论、通过 Soft PROLEG 的实现以及其潜在的更广泛应用,强调了其在推进特定知识领域中自然语言理解和处理方面的重要作用。

[NLP-69] OneNet: A Fine-Tuning Free Framework for Few-Shot Entity Linking via Large Language Model Prompting EMNLP2024

【速读】: 该论文试图解决在少样本实体链接(few-shot entity linking)场景下,传统方法依赖大量数据的问题。解决方案的关键在于提出了一种名为OneNet的创新框架,该框架利用大型语言模型(LLMs)的少样本学习能力,无需微调即可实现高效的实体链接。OneNet的核心组件包括:1) 实体简化处理器,通过总结和过滤无关实体来简化输入;2) 双视角实体链接器,结合上下文线索和先验知识进行精确链接;3) 实体一致性判断器,采用独特的算法来减少链接推理中的幻觉现象。通过这些组件,OneNet在多个基准数据集上显著超越了现有的最先进方法。

链接: https://arxiv.org/abs/2410.07549
作者: Xukai Liu,Ye Liu,Kai Zhang,Kehang Wang,Qi Liu,Enhong Chen
关键词-EN: associating ambiguous textual, ambiguous textual mentions, Entity Linking, Large Language Models, few-shot entity linking
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2024 Main

点击查看摘要

Abstract:Entity Linking (EL) is the process of associating ambiguous textual mentions to specific entities in a knowledge base. Traditional EL methods heavily rely on large datasets to enhance their performance, a dependency that becomes problematic in the context of few-shot entity linking, where only a limited number of examples are available for training. To address this challenge, we present OneNet, an innovative framework that utilizes the few-shot learning capabilities of Large Language Models (LLMs) without the need for fine-tuning. To the best of our knowledge, this marks a pioneering approach to applying LLMs to few-shot entity linking tasks. OneNet is structured around three key components prompted by LLMs: (1) an entity reduction processor that simplifies inputs by summarizing and filtering out irrelevant entities, (2) a dual-perspective entity linker that combines contextual cues and prior knowledge for precise entity linking, and (3) an entity consensus judger that employs a unique consistency algorithm to alleviate the hallucination in the entity linking reasoning. Comprehensive evaluations across seven benchmark datasets reveal that OneNet outperforms current state-of-the-art entity linking methods.
摘要:实体链接 (Entity Linking, EL) 是将文本中模糊的提及与知识库中特定实体关联的过程。传统的 EL 方法严重依赖于大型数据集以提升其性能,这种依赖在少样本实体链接的背景下变得问题重重,因为在这种情况下,训练样本数量有限。为了应对这一挑战,我们提出了 OneNet,这是一种创新的框架,利用大语言模型 (Large Language Models, LLMs) 的少样本学习能力,而无需进行微调。据我们所知,这是首次将 LLMs 应用于少样本实体链接任务的开创性方法。OneNet 围绕三个由 LLMs 驱动的关键组件构建:(1) 实体简化处理器,通过总结和过滤无关实体来简化输入;(2) 双视角实体链接器,结合上下文线索和先验知识进行精确的实体链接;(3) 实体一致性判断器,采用独特的连贯性算法来缓解实体链接推理中的幻觉问题。在七个基准数据集上的全面评估显示,OneNet 优于当前最先进的实体链接方法。

[NLP-70] MKGL: Mastery of a Three-Word Language NEURIPS2024

【速读】: 该论文试图解决大语言模型(LLMs)在知识图谱(KGs)应用中的问题,特别是如何减少幻觉现象并提高KG完成任务的准确性。解决方案的关键在于引入了一种专门的知识图谱语言(KG Language, KGL),该语言通过精确的实体名词、关系动词和实体名词的三元组结构来描述事实,并通过定制词典、示例句子和实时KG上下文检索与KGL词嵌入增强来帮助LLMs学习这种新语言。这种方法显著减少了错误率,并使LLMs能够生成准确的KG三元组,甚至在处理未见过的术语时也能表现出色。

链接: https://arxiv.org/abs/2410.07526
作者: Lingbing Guo,Zhongpu Bo,Zhuo Chen,Yichi Zhang,Jiaoyan Chen,Yarong Lan,Mengshu Sun,Zhiqiang Zhang,Yangyifei Luo,Qian Li,Qiang Zhang,Wen Zhang,Huajun Chen
关键词-EN: Large language models, significantly advanced performance, natural language processing, Large language, significantly advanced
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NeurIPS 2024 (spotlight)

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced performance across a spectrum of natural language processing (NLP) tasks. Yet, their application to knowledge graphs (KGs), which describe facts in the form of triplets and allow minimal hallucinations, remains an underexplored frontier. In this paper, we investigate the integration of LLMs with KGs by introducing a specialized KG Language (KGL), where a sentence precisely consists of an entity noun, a relation verb, and ends with another entity noun. Despite KGL’s unfamiliar vocabulary to the LLM, we facilitate its learning through a tailored dictionary and illustrative sentences, and enhance context understanding via real-time KG context retrieval and KGL token embedding augmentation. Our results reveal that LLMs can achieve fluency in KGL, drastically reducing errors compared to conventional KG embedding methods on KG completion. Furthermore, our enhanced LLM shows exceptional competence in generating accurate three-word sentences from an initial entity and interpreting new unseen terms out of KGs.
摘要:大语言模型 (LLMs) 在自然语言处理 (NLP) 任务中的表现取得了显著进步。然而,将其应用于知识图谱 (KGs),这种以三元组形式描述事实并允许极少幻觉的结构,仍然是一个未充分探索的领域。本文中,我们通过引入一种专门的知识图谱语言 (KGL),探讨了 LLMs 与 KGs 的整合。在 KGL 中,一个句子精确地由一个实体名词、一个关系动词组成,并以另一个实体名词结束。尽管 KGL 的词汇对 LLM 来说是陌生的,但我们通过定制词典和示例句子来促进其学习,并通过实时 KG 上下文检索和 KGL Token 嵌入增强来提升上下文理解。我们的研究结果表明,LLMs 能够在 KGL 中实现流畅表达,与传统的 KG 嵌入方法相比,错误率大幅降低。此外,我们增强的 LLM 在从初始实体生成准确的三词句子以及解释知识图谱中未见的新术语方面表现出卓越的能力。

[NLP-71] Upcycling Large Language Models into Mixture of Experts

【速读】: 该论文试图解决如何高效地将预训练的密集语言模型升级为稀疏的混合专家(MoE)模型的问题。解决方案的关键在于提出了一种新的“虚拟组”初始化方案和权重缩放方法,以实现细粒度的MoE架构。通过对比实验,论文发现升级后的模型在性能上优于继续训练的密集模型,并且softmax-then-topK专家路由方法优于topK-then-softmax方法,更高的MoE粒度有助于提升准确性。最终,通过将Nemotron-4 15B模型升级并在1T tokens上进行训练,其性能显著优于相同条件下继续训练的密集模型。

链接: https://arxiv.org/abs/2410.07524
作者: Ethan He,Abhinav Khattar,Ryan Prenger,Vijay Korthikanti,Zijie Yan,Tong Liu,Shiqing Fan,Ashwath Aithal,Mohammad Shoeybi,Bryan Catanzaro
关键词-EN: pre-trained dense language, Upcycling pre-trained dense, Upcycling, language models, Upcycling pre-trained
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel “virtual group” initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
摘要:将预训练的密集语言模型升级为稀疏的专家混合模型 (MoE) 是一种有效的方法,可以增加已训练模型的容量。然而,大规模升级的最佳技术仍不明确。在本研究中,我们对数十亿参数规模语言模型的升级方法和超参数进行了广泛研究。我们提出了一种新颖的“虚拟组”初始化方案和权重缩放方法,以实现向细粒度 MoE 架构的升级。通过消融实验,我们发现升级优于继续密集模型训练。此外,我们展示了 softmax-then-topK 专家路由优于 topK-then-softmax 方法,并且更高粒度的 MoE 可以提高准确性。最后,我们将 Nemotron-4 15B 在 1T Token 上进行了升级,并将其与在相同 1T Token 上连续训练的同一模型版本进行了比较:连续训练的模型达到了 65.3% 的 MMLU,而升级后的模型达到了 67.6%。我们的研究结果为有效利用升级构建 MoE 语言模型提供了见解和最佳实践。

[NLP-72] DemoShapley: Valuation of Demonstrations for In-Context Learning

【速读】: 该论文试图解决大语言模型(LLMs)在上下文学习(ICL)中,由于演示选择和排序不当导致的小样本学习效果不佳的问题。解决方案的关键是引入DemoShapley方法,该方法受Data Shapley估值定理启发,通过评估单个演示实例的影响力,区分出对模型性能有积极贡献和可能阻碍性能的演示,从而优化演示选择,提升模型在准确性、公平性以及跨领域查询的泛化能力,并有助于识别演示集中的噪声数据。

链接: https://arxiv.org/abs/2410.07523
作者: Shan Xie,Man Luo,Chadly Daniel Stern,Mengnan Du,Lu Cheng
关键词-EN: needing task-specific fine-tuning, Large language models, Large language, leveraging in-context learning, task-specific fine-tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) leveraging in-context learning (ICL) have set new benchmarks in few-shot learning across various tasks without needing task-specific fine-tuning. However, extensive research has demonstrated that the effectiveness of ICL is significantly influenced by the selection and ordering of demonstrations. Considering the critical role of demonstration selection in ICL, we introduce DemoShapley which is inspired by the Data Shapley valuation theorem. This approach assesses the influence of individual demonstration instances, distinguishing between those that contribute positively and those that may hinder performance. Our findings reveal that DemoShapley not only enhances model performance in terms of accuracy and fairness but also generalizes queries from domains distinct from those of the in-context demonstrations, highlighting its versatility and effectiveness in optimizing ICL demonstration selection. Last but not least, DemoShapley demonstrates its ability to aid in identifying noisy data within the demonstration set.
摘要:利用上下文学习 (In-Context Learning, ICL) 的大语言模型 (Large Language Models, LLMs) 在无需任务特定微调的情况下,已在各种任务的少样本学习中设定了新的基准。然而,大量研究表明,ICL 的有效性在很大程度上受到演示选择和排序的影响。鉴于演示选择在 IL 中的关键作用,我们引入了 DemoShapley,其灵感来自于 Data Shapley 估值定理。该方法评估了单个演示实例的影响,区分了那些对性能有积极贡献的实例和可能阻碍性能的实例。我们的研究结果表明,DemoShapley 不仅在准确性和公平性方面提升了模型性能,还能推广来自与上下文演示不同领域的查询,突显了其在优化 ICL 演示选择中的多功能性和有效性。最后但同样重要的是,DemoShapley 展示了其在识别演示集中噪声数据方面的能力。

[NLP-73] News Reporter: A Multi-lingual LLM Framework for Broadcast T.V News ICASSP2025

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理电视新闻相关问题时,由于训练数据缺乏验证性而导致答案不准确的问题。解决方案的关键在于收集并分享从美国多个新闻频道的新闻录音转录文本中提取的大量问答对,并使用这些问答对对现有的LLM模型进行微调。此外,论文还提出了一种基于检索增强生成(RAG)的方法,以增强答案的上下文相关性,并确保答案指向可验证的新闻录音,从而提高模型在处理新闻相关问题时的准确性和可信度。

链接: https://arxiv.org/abs/2410.07520
作者: Tarun Jain,Yufei Gao,Sridhar Vanga,Karan Singla
关键词-EN: conversational chatbots due, provide coherent answers, varied queries, essential tools, conversational chatbots
类目: Computation and Language (cs.CL)
备注: 5 pages, under review at ICASSP 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have fast become an essential tools to many conversational chatbots due to their ability to provide coherent answers for varied queries. Datasets used to train these LLMs are often a mix of generic and synthetic samples, thus lacking the verification needed to provide correct and verifiable answers for T.V. News. We collect and share a large collection of QA pairs extracted from transcripts of news recordings from various news-channels across the United States. Resultant QA pairs are then used to fine-tune an off-the-shelf LLM model. Our model surpasses base models of similar size on several open LLM benchmarks. We further integrate and propose a RAG method to improve contextualization of our answers and also point it to a verifiable news recording. Comments: 5 pages, under review at ICASSP 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2410.07520 [cs.CL] (or arXiv:2410.07520v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.07520 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要:大语言模型 (LLMs) 因其能够为各种查询提供连贯的答案,迅速成为许多对话式聊天机器人的重要工具。用于训练这些 LLM 的数据集通常是通用样本和合成样本的混合,因此缺乏为电视新闻提供正确且可验证答案所需的验证。我们收集并分享了从美国各地新闻频道的新闻录音转录文本中提取的大量问答对。随后,这些问答对被用于微调一个现成的 LLM 模型。我们的模型在多个开放的 LLM 基准测试中超越了同规模的基线模型。我们进一步整合并提出了一种 RAG 方法,以提高答案的上下文相关性,并将其指向可验证的新闻录音。

评论:5 页,正在 ICASSP 2025 评审中
主题:计算与语言 (cs.CL)
引用方式:arXiv:2410.07520 [cs.CL]
(或 arXiv:2410.07520v1 [cs.CL] 用于此版本)
https://doi.org/10.48550/arXiv.2410.07520
了解更多信息
arXiv 发布的 DOI 通过 DataCite(待注册)

[NLP-74] Evolutionary Contrastive Distillation for Language Model Alignment

【速读】: 该论文试图解决大型语言模型(LLMs)在执行复杂指令时表现不佳的问题。解决方案的关键在于提出了一种名为进化对比蒸馏(Evolutionary Contrastive Distillation, ECD)的新方法,通过生成高质量的合成偏好数据来增强语言模型对复杂指令的遵循能力。ECD方法通过逐步将简单指令进化为更复杂的指令,并利用对比学习算法(如DPO)来区分成功遵循指令的响应与虽高质量但存在细微错误的响应,从而提升模型在复杂指令下的表现。实验结果表明,该方法使7B模型在复杂指令遵循能力上超越了当前最先进的7B模型,并能与开源的70B模型相媲美。

链接: https://arxiv.org/abs/2410.07513
作者: Julian Katz-Samuels,Zheng Li,Hyokun Yun,Priyanka Nigam,Yi Xu,Vaclav Petricek,Bing Yin,Trishul Chilimbi
关键词-EN: Evolutionary Contrastive Distillation, real-world applications, complex instructions, large language models, execute complex instructions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to execute complex instructions is essential for their real-world applications. However, several recent studies indicate that LLMs struggle with challenging instructions. In this paper, we propose Evolutionary Contrastive Distillation (ECD), a novel method for generating high-quality synthetic preference data designed to enhance the complex instruction-following capability of language models. ECD generates data that specifically illustrates the difference between a response that successfully follows a set of complex instructions and a response that is high-quality, but nevertheless makes some subtle mistakes. This is done by prompting LLMs to progressively evolve simple instructions to more complex instructions. When the complexity of an instruction is increased, the original successful response to the original instruction becomes a “hard negative” response for the new instruction, mostly meeting requirements of the new instruction, but barely missing one or two. By pairing a good response with such a hard negative response, and employing contrastive learning algorithms such as DPO, we improve language models’ ability to follow complex instructions. Empirically, we observe that our method yields a 7B model that exceeds the complex instruction-following performance of current SOTA 7B models and is competitive even with open-source 70B models.
摘要:大语言模型 (LLM) 执行复杂指令的能力对其现实应用至关重要。然而,最近的几项研究表明,LLM 在应对挑战性指令时表现不佳。本文提出了一种名为进化对比蒸馏 (Evolutionary Contrastive Distillation, ECD) 的新方法,旨在生成高质量的合成偏好数据,以增强语言模型遵循复杂指令的能力。ECD 生成的数据专门展示了成功遵循一组复杂指令的响应与高质量但存在细微错误的响应之间的差异。这一过程通过引导 LLM 逐步将简单指令进化为更复杂的指令来实现。当指令的复杂性增加时,原始指令的成功响应变为新指令的“硬负样本”响应,该响应大部分满足新指令的要求,但几乎遗漏了一两个要点。通过将良好响应与这种硬负样本响应配对,并采用对比学习算法如 DPO,我们提升了语言模型遵循复杂指令的能力。实证结果显示,我们的方法使得 7B 模型在遵循复杂指令的表现上超越了当前最先进的 7B 模型,甚至与开源的 70B 模型相媲美。

[NLP-75] hought2Text: Text Generation from EEG Signal using Large Language Models (LLMs)

【速读】: 该论文试图解决将大脑活动解码并以可理解的形式表达的问题。解决方案的关键在于使用指令微调的大型语言模型(LLMs),并通过三个阶段实现:首先训练EEG编码器进行视觉特征提取,其次微调LLMs以处理图像和文本数据,从而生成多模态描述,最后在EEG嵌入上进一步微调,使得在推理过程中能够直接从EEG生成文本。这种方法展示了在神经科学和自然语言处理领域中,实现便携、低成本的“思维到文本”技术的显著进展。

链接: https://arxiv.org/abs/2410.07507
作者: Abhijit Mishra,Shreya Shukla,Jose Torres,Jacek Gwizdka,Shounak Roychowdhury
关键词-EN: expressing brain activity, Decoding and expressing, Large Language Models, expressing brain, brain activity
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Decoding and expressing brain activity in a comprehensible form is a challenging frontier in AI. This paper presents Thought2Text, which uses instruction-tuned Large Language Models (LLMs) fine-tuned with EEG data to achieve this goal. The approach involves three stages: (1) training an EEG encoder for visual feature extraction, (2) fine-tuning LLMs on image and text data, enabling multimodal description generation, and (3) further fine-tuning on EEG embeddings to generate text directly from EEG during inference. Experiments on a public EEG dataset collected for six subjects with image stimuli demonstrate the efficacy of multimodal LLMs (LLaMa-v3, Mistral-v0.3, Qwen2.5), validated using traditional language generation evaluation metrics, GPT-4 based assessments, and evaluations by human expert. This approach marks a significant advancement towards portable, low-cost “thoughts-to-text” technology with potential applications in both neuroscience and natural language processing (NLP).
摘要:将大脑活动解码并以可理解的形式表达是人工智能领域的一个挑战性前沿。本文介绍了 Thought2Text,该系统利用经过 EEG 数据微调的指令调优大语言模型 (LLM) 来实现这一目标。该方法包括三个阶段:(1) 训练 EEG 编码器以进行视觉特征提取,(2) 在图像和文本数据上微调 LLM,使其能够生成多模态描述,(3) 进一步在 EEG 嵌入上进行微调,以便在推理过程中直接从 EEG 生成文本。在一项针对六名受试者使用图像刺激的公开 EEG 数据集上的实验表明,多模态 LLM (LLaMa-v3, Mistral-v0.3, Qwen2.5) 的有效性得到了验证,验证方法包括传统的语言生成评估指标、基于 GPT-4 的评估以及人类专家的评估。这一方法标志着在便携式、低成本的“思维到文本”技术方面取得了重大进展,具有在神经科学和自然语言处理 (NLP) 领域的应用潜力。

[NLP-76] Using LLMs to Discover Legal Factors

【速读】: 该论文试图解决法律分析和计算法律推理中因素(factors)的自动发现问题。解决方案的关键在于利用大型语言模型(LLMs)从原始法院意见中提取并定义有效的法律领域因素,通过半自动化的方法生成因素列表,从而在一定程度上预测案件结果,尽管其准确性尚未达到专家定义因素的水平。

链接: https://arxiv.org/abs/2410.07504
作者: Morgan Gray,Jaromir Savelka,Wesley Oliver,Kevin Ashley
关键词-EN: foundational component, analysis and computational, legal reasoning, legal analysis, computational models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Factors are a foundational component of legal analysis and computational models of legal reasoning. These factor-based representations enable lawyers, judges, and AI and Law researchers to reason about legal cases. In this paper, we introduce a methodology that leverages large language models (LLMs) to discover lists of factors that effectively represent a legal domain. Our method takes as input raw court opinions and produces a set of factors and associated definitions. We demonstrate that a semi-automated approach, incorporating minimal human involvement, produces factor representations that can predict case outcomes with moderate success, if not yet as well as expert-defined factors can.
摘要:因素是法律分析和法律推理计算模型的基础组成部分。这些基于因素的表示使律师、法官以及法律与人工智能研究人员能够对法律案件进行推理。本文介绍了一种利用大语言模型 (LLM) 来发现有效表示法律领域的因素列表的方法。我们的方法以原始法院意见为输入,生成一组因素及其相关定义。我们证明,采用半自动化方法,即使仅包含最少的人工干预,也能产生可以中等成功率预测案件结果的因素表示,尽管尚不及专家定义的因素那样准确。

[NLP-77] PublicHearingBR: A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents

【速读】: 该论文试图解决巴西葡萄牙语长文档摘要生成的问题,关键解决方案是引入了PublicHearingBR数据集,该数据集包含巴西众议院公开听证会的转录文本、新闻文章和结构化摘要,涵盖参与者的信息及其陈述或观点。论文还提出了一种混合摘要系统作为基准,并讨论了在大语言模型摘要生成中评估指标的重要性,特别是如何应对生成的摘要中可能出现的幻觉问题。此外,数据集还提供了用于葡萄牙语自然语言推理任务的标注数据。

链接: https://arxiv.org/abs/2410.07495
作者: Leandro Carísio Fernandes,Guilherme Zeferino Rodrigues Dobins,Roberto Lotufo,Jayr Alencar Pereira
关键词-EN: paper introduces PublicHearingBR, Brazilian Portuguese dataset, Portuguese dataset designed, summarizing long documents, introduces PublicHearingBR
类目: Computation and Language (cs.CL)
备注: 26 pages

点击查看摘要

Abstract:This paper introduces PublicHearingBR, a Brazilian Portuguese dataset designed for summarizing long documents. The dataset consists of transcripts of public hearings held by the Brazilian Chamber of Deputies, paired with news articles and structured summaries containing the individuals participating in the hearing and their statements or opinions. The dataset supports the development and evaluation of long document summarization systems in Portuguese. Our contributions include the dataset, a hybrid summarization system to establish a baseline for future studies, and a discussion on evaluation metrics for summarization involving large language models, addressing the challenge of hallucination in the generated summaries. As a result of this discussion, the dataset also provides annotated data that can be used in Natural Language Inference tasks in Portuguese.
摘要:本文介绍了 PublicHearingBR,这是一个专为长文档摘要设计的巴西葡萄牙语数据集。该数据集由巴西众议院举行的公开听证会记录组成,配以新闻文章和结构化摘要,其中包含参与听证会的个人及其陈述或意见。该数据集支持葡萄牙语长文档摘要系统的发展和评估。我们的贡献包括数据集本身、一个混合摘要系统,用于为未来研究建立基线,以及对涉及大语言模型摘要的评估指标的讨论,解决了生成摘要中出现的幻觉问题。作为这一讨论的结果,数据集还提供了可用于葡萄牙语自然语言推理任务的标注数据。

[NLP-78] ransducer Consistency Regularization for Speech to Text Applications

【速读】: 该论文试图解决在基于转录器(transducer)的语音应用中应用一致性正则化(consistency regularization)的难题。由于转录器优化准则的巨大对齐空间,并非所有对齐都对模型优化同等重要,这使得直接应用一致性正则化变得不直接。解决方案的关键在于提出了一种名为转录器一致性正则化(Transducer Consistency Regularization, TCR)的方法,通过应用如频谱增强和dropout等数据增强技术生成不同的数据视图,并利用职业概率(occupational probabilities)对转录器输出分布进行加权,从而确保只有接近真实对齐的对齐才会对模型学习产生贡献。实验结果表明,该方法相较于其他一致性正则化实现更为优越,能够在Librispeech数据集上相对降低4.3%的词错误率(WER)。

链接: https://arxiv.org/abs/2410.07491
作者: Cindy Tseng,Yun Tang,Vijendra Raj Apsingekar
关键词-EN: generate consistent representation, distorted input features, improve model generalization, Consistency regularization, Transducer Consistency Regularization
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 8 pages, 4 figures. Accepted in IEEE Spoken Language Technology Workshop 2024

点击查看摘要

Abstract:Consistency regularization is a commonly used practice to encourage the model to generate consistent representation from distorted input features and improve model generalization. It shows significant improvement on various speech applications that are optimized with cross entropy criterion. However, it is not straightforward to apply consistency regularization for the transducer-based approaches, which are widely adopted for speech applications due to the competitive performance and streaming characteristic. The main challenge is from the vast alignment space of the transducer optimization criterion and not all the alignments within the space contribute to the model optimization equally. In this study, we present Transducer Consistency Regularization (TCR), a consistency regularization method for transducer models. We apply distortions such as spec augmentation and dropout to create different data views and minimize the distribution difference. We utilize occupational probabilities to give different weights on transducer output distributions, thus only alignments close to oracle alignments would contribute to the model learning. Our experiments show the proposed method is superior to other consistency regularization implementations and could effectively reduce word error rate (WER) by 4.3% relatively comparing with a strong baseline on the \textscLibrispeech dataset.
摘要:一致性正则化是一种常用的方法,旨在通过鼓励模型从扭曲的输入特征中生成一致的表示,从而提高模型的泛化能力。在以交叉熵准则优化的各种语音应用中,这种方法显示出显著的改进。然而,对于基于转录器的方法,直接应用一致性正则化并不容易。转录器方法因其竞争性能和流式特性而在语音应用中被广泛采用。主要挑战来自于转录器优化准则的巨大对齐空间,并非所有对齐方式在该空间中对模型优化的贡献均等。在本研究中,我们提出了转录器一致性正则化 (Transducer Consistency Regularization, TCR),这是一种针对转录器模型的一致性正则化方法。我们应用诸如频谱增强和 dropout 等扭曲方法来创建不同的数据视图,并最小化分布差异。我们利用职业概率为转录器输出分布赋予不同的权重,从而只有接近真实对齐的对齐方式才会对模型学习产生贡献。我们的实验表明,所提出的方法优于其他一致性正则化实现,并且在 \textscLibrispeech 数据集上,与强基线相比,能够相对降低 4.3% 的字错误率 (WER)。

[NLP-79] MoDEM: Mixture of Domain Expert Models

【速读】: 该论文试图解决大型语言模型(LLMs)在性能和效率上的问题,特别是如何在不增加模型规模的情况下提升特定领域的处理能力。解决方案的关键在于结合领域提示路由与领域专用模型,通过BERT-based路由器将输入提示定向到最合适的领域专家模型,这些专家模型针对健康、数学和科学等特定领域进行了优化。这种方法不仅显著提升了模型在各领域的表现,还提高了性能与成本的比率,预示着未来AI发展可能转向构建小型、高度专业化的模型生态系统,并结合复杂的路由系统,以实现更高效的资源利用和降低计算成本。

链接: https://arxiv.org/abs/2410.07490
作者: Toby Simonds,Kemal Kurniawan,Jey Han Lau
关键词-EN: combining domain prompt, large language models, models, combining domain, domain prompt routing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose a novel approach to enhancing the performance and efficiency of large language models (LLMs) by combining domain prompt routing with domain-specialized models. We introduce a system that utilizes a BERT-based router to direct incoming prompts to the most appropriate domain expert model. These expert models are specifically tuned for domains such as health, mathematics and science. Our research demonstrates that this approach can significantly outperform general-purpose models of comparable size, leading to a superior performance-to-cost ratio across various benchmarks. The implications of this study suggest a potential paradigm shift in LLM development and deployment. Rather than focusing solely on creating increasingly large, general-purpose models, the future of AI may lie in developing ecosystems of smaller, highly specialized models coupled with sophisticated routing systems. This approach could lead to more efficient resource utilization, reduced computational costs, and superior overall performance.
摘要:我们提出了一种通过结合领域提示路由与领域专用模型来提升大语言模型 (LLM) 性能和效率的新方法。我们引入了一个系统,该系统利用基于 BERT 的路由器将传入的提示定向到最合适的领域专家模型。这些专家模型专门针对健康、数学和科学等领域进行了调优。我们的研究表明,这种方法在各种基准测试中显著优于同等规模的全能模型,从而在性能与成本比率上表现更优。本研究的含义表明,LLM 的开发和部署可能面临潜在的范式转变。未来,AI 的发展可能不再仅仅专注于创建越来越大的全能模型,而是转向开发由小型、高度专业化的模型与复杂路由系统相结合的生态系统。这种方法有望实现更高效的资源利用、降低计算成本以及提升整体性能。

[NLP-80] Localizing Factual Inconsistencies in Attributable Text Generation

【速读】: 该论文试图解决模型生成文本中事实不一致性的精确定位问题。解决方案的关键在于引入QASemConsistency方法,通过将生成文本分解为最小谓词-论元级别的命题,并以简单问答(QA)对的形式表达,然后评估每个QA对是否得到可信参考文本的支持。这种方法能够有效定位未被支持的信息,并通过人类注释和自动检测方法验证其有效性。

链接: https://arxiv.org/abs/2410.07473
作者: Arie Cattan,Paul Roit,Shiyue Zhang,David Wan,Roee Aharoni,Idan Szpektor,Mohit Bansal,Ido Dagan
关键词-EN: increasing interest, hallucinations in model-generated, varying levels, model-generated texts, detecting hallucinations
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement ( \kappa 0.7) . Then, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and open-source LLMs.
摘要:在模型生成的文本中检测幻觉现象,无论是手动还是自动,在不同粒度级别上都引起了越来越多的关注。然而,大多数现有方法未能精确地定位错误。在本研究中,我们引入了 QASemConsistency,这是一种用于在可归因文本生成中定位事实不一致性的新形式化方法,具有细粒度的特性。受 Neo-Davidsonian 形式语义学的启发,我们提出将生成的文本分解为最小的谓词-论元级别命题,这些命题以简单的问题-答案 (QA) 对的形式表达,并评估每个 QA 对是否由可信的参考文本支持。由于每个 QA 对对应于谓词与论元之间的一个单一语义关系,QASemConsistency 能够有效地定位不支持的信息。我们首先通过收集众包注释的粒度一致性错误,展示了 QASemConsistency 方法在人类注释中的有效性,同时实现了显著的注释者间一致性 ( \kappa 0.7)。接着,我们实现了几种自动检测局部事实不一致性的方法,包括监督蕴涵模型和开源大语言模型。

[NLP-81] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection

【速读】: 该论文试图解决在大语言模型(LLMs)微调过程中,由于使用对抗样本或低质量数据而导致模型预设的安全性和对齐能力受损的问题。解决方案的关键是提出了SEAL框架,该框架通过双层优化学习一个数据排序器,能够提升安全和高质数据的排名,降低不安全或低质数据的排名,从而在微调过程中增强模型的安全性。实验结果表明,使用SEAL框架训练的模型在多个基准测试中表现优异,相较于随机选择数据,分别在Llama-3-8b-Instruct和Merlinite-7b模型上提升了8.5%和9.7%的胜率。

链接: https://arxiv.org/abs/2410.07471
作者: Han Shen,Pin-Yu Chen,Payel Das,Tianyi Chen
关键词-EN: leveraging Large Language, Large Language Models, Large Language, boost downstream performance, leveraging Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning the models on several adversarial samples or even benign data can greatly comprise the model’s pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on the bilevel optimization to up rank the safe and high-quality fine-tuning data and down rank the unsafe or low-quality ones. Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win rate increase compared to random selection respectively on Llama-3-8b-Instruct and Merlinite-7b models. Our code is available on github this https URL.
摘要:在特定任务数据上进行微调以提升下游性能是利用大语言模型 (LLM) 的关键步骤。然而,先前的研究表明,在几个对抗样本甚至良性数据上微调模型会大大削弱模型预先配备的对齐和安全能力。在此工作中,我们提出了 SEAL,一种新颖的框架,用于增强 LLM 微调中的安全性。SEAL 基于双层优化学习一个数据排序器,以提升安全和高质量的微调数据,并降低不安全或低质量的数据。使用 SEAL 训练的模型在多个基线上展示了优越的质量,与随机选择相比,Llama-3-8b-Instruct 和 Merlinite-7b 模型分别提高了 8.5% 和 9.7% 的胜率。我们的代码可在 GitHub 上获取。

[NLP-82] Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning EMNLP2024

【速读】: 该论文试图解决大语言模型(LLM)剪枝过程中校准数据选择的优化问题。解决方案的关键在于评估和比较不同类型的校准数据(包括预训练数据集和下游任务数据集)对剪枝后模型性能的影响。研究发现,常用的C4数据集并非最优选择,而算术数据集在校准效果上表现出色,甚至优于预训练数据集。此外,论文还探讨了In-Context Learning(ICL)和Chain-of-Thought(CoT)对不同数据类别校准效果的影响,为更高效地部署LLM提供了重要指导。

链接: https://arxiv.org/abs/2410.07461
作者: Abhinav Bandari,Lu Yin,Cheng-Yu Hsieh,Ajay Kumar Jaiswal,Tianlong Chen,Li Shen,Ranjay Krishna,Shiwei Liu
关键词-EN: make LLMs cheaper, LLM pruning, Network pruning, calibration data, LLM
类目: Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Network pruning has emerged as a potential solution to make LLMs cheaper to deploy. However, existing LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores, leaving its optimality unexplored. In this study, we evaluate the choice of calibration data on LLM pruning, across a wide range of datasets that are most commonly used in LLM training and evaluation, including four pertaining datasets as well as three categories of downstream tasks encompassing nine datasets. Each downstream dataset is prompted with In-Context Learning (ICL) and Chain-of-Thought (CoT), respectively. Besides the already intriguing observation that the choice of calibration data significantly impacts the performance of pruned LLMs, our results also uncover several subtle and often unexpected findings, summarized as follows: (1) C4 is not the optimal choice for LLM pruning, even among commonly used pre-training datasets; (2) arithmetic datasets, when used as calibration data, performs on par or even better than pre-training datasets; (3) pruning with downstream datasets does not necessarily help the corresponding downstream task, compared to pre-training data; (4) ICL is widely beneficial to all data categories, whereas CoT is only useful on certain tasks. Our findings shed light on the importance of carefully selecting calibration data for LLM pruning and pave the way for more efficient deployment of these powerful models in real-world applications. We release our code at: this https URL.
摘要:网络剪枝已成为降低大语言模型 (LLM) 部署成本的潜在解决方案。然而,现有的 LLM 剪枝方法普遍依赖 C4 数据集作为计算剪枝分数的校准数据,其最优性尚未得到探究。在本研究中,我们评估了校准数据的选择对 LLM 剪枝的影响,涵盖了在 LLM 训练和评估中最常用的广泛数据集,包括四个相关数据集以及涵盖九个数据集的三个下游任务类别。每个下游数据集分别通过上下文学习 (In-Context Learning, ICL) 和思维链 (Chain-of-Thought, CoT) 进行提示。除了已经引人注目的观察结果,即校准数据的选择显著影响剪枝后 LLM 的性能,我们的研究还揭示了几个细微且常常出乎意料的发现,总结如下:(1) C4 并非 LLM 剪枝的最优选择,即使在常用的预训练数据集中也是如此;(2) 算术数据集作为校准数据时,其表现与预训练数据集相当甚至更优;(3) 使用下游数据集进行剪枝并不一定有助于相应的下游任务,相比预训练数据;(4) ICL 对所有数据类别普遍有益,而 CoT 仅在某些任务中有用。我们的发现强调了在 LLM 剪枝中仔细选择校准数据的重要性,并为这些强大模型在实际应用中的更高效部署铺平了道路。我们已在以下链接发布了代码:this https URL。

[NLP-83] Advocating Character Error Rate for Multilingual ASR Evaluation

【速读】: 该论文试图解决自动语音识别(ASR)系统在多语言环境下评估指标的局限性问题,特别是传统的词错误率(WER)在处理形态复杂语言或无明确词边界语言时的不足。解决方案的关键在于提倡使用字符错误率(CER)作为多语言ASR评估的主要指标。CER能够避免WER面临的许多挑战,并在不同书写系统中表现出更高的稳定性。通过在马拉雅拉姆语、英语和阿拉伯语中进行的人类评估实验,论文证明了CER与人类判断的相关性高于WER,即使在英语中也如此。因此,论文建议在多语言ASR评估中优先考虑或至少补充使用CER,以适应不同语言的语法特性。

链接: https://arxiv.org/abs/2410.07400
作者: Thennal D K,Jesin James,Deepa P Gopinath,Muhammed Ashraf K
关键词-EN: Automatic speech recognition, Automatic speech, ASR, speech recognition, WER
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 8 pages

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems have traditionally been evaluated using English datasets, with the word error rate (WER) serving as the predominant metric. WER’s simplicity and ease of interpretation have contributed to its widespread adoption, particularly for English. However, as ASR systems expand to multilingual contexts, WER fails in various ways, particularly with morphologically complex languages or those without clear word boundaries. Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR evaluation. We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems. We support our proposition by conducting human evaluations of ASR transcriptions in three languages: Malayalam, English, and Arabic, which exhibit distinct morphological characteristics. We show that CER correlates more closely with human judgments than WER, even for English. To facilitate further research, we release our human evaluation dataset for future benchmarking of ASR metrics. Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.
摘要:传统的自动语音识别 (ASR) 系统通常使用英语数据集进行评估,其中词错误率 (WER) 是最主要的评估指标。WER 的简单性和易于解释性促使其在英语评估中得到广泛应用。然而,随着 ASR 系统扩展到多语言环境,WER 在多种情况下表现不佳,尤其是在形态复杂或缺乏明确词边界的语言中。我们的研究详细阐述了 WER 作为评估指标的局限性,并主张在多语言 ASR 评估中以字符错误率 (CER) 为主要指标。我们证明,CER 避免了 WER 面临的许多挑战,并且在不同书写系统中表现出更高的稳定性。我们通过在三种语言(马拉雅拉姆语、英语和阿拉伯语)中进行 ASR 转录的人工评估来支持我们的主张,这三种语言具有不同的形态特征。我们发现,即使在英语中,CER 与人类判断的相关性也比 WER 更紧密。为了促进进一步的研究,我们发布了我们的人工评估数据集,供未来对 ASR 指标进行基准测试。我们的研究结果表明,在多语言 ASR 评估中,应优先考虑或至少补充 CER,以适应不同语言的多样语言特征。

[NLP-84] SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers

【速读】: 该论文试图解决Transformer模型在参数高效微调(PEFT)过程中内存密集的问题,特别是针对MLP块的微调。解决方案的关键在于提出了一种名为SparseGrad的新型选择性PEFT方法,通过将层梯度转换为稀疏结构,仅保留约1%的层元素具有显著性,从而大幅减少需要更新的参数数量。这种方法在BERT、RoBERTa和LLaMa-2等模型上进行了实验,结果显示在相同的内存需求下,SparseGrad优于现有的PEFT方法如LoRA和MeProp。

链接: https://arxiv.org/abs/2410.07383
作者: Viktoriia Chekalina,Anna Rudenko,Gleb Mezentsev,Alexander Mikhalev,Alexander Panchenko,Ivan Oseledets
关键词-EN: performance of Transformer, Transformer models, processed text, enhanced by increasing, MLP blocks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer’s elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, robust popular state-of-the-art PEFT approaches.
摘要:Transformer 模型的性能通过增加参数数量和处理文本的长度得到了提升。因此,对整个模型进行微调成为了一个内存密集型的过程。高效的参数微调 (Parameter-Efficient Fine-Tuning, PEFT) 方法通常只关注 Attention 模块,而忽略了包含约一半模型参数的 MLP 模块。我们提出了一种新的选择性 PEFT 方法,即 SparseGrad,该方法在 MLP 模块上表现出色。我们将层梯度转换到一个空间,其中只有约 1% 的层元素保持显著性。通过将梯度转换为稀疏结构,我们减少了更新的参数数量。我们将 SparseGrad 应用于 BERT 和 RoBERTa 的 NLU 任务以及 LLaMa-2 的问答任务微调中。在这些实验中,在相同的内存需求下,我们的方法优于 LoRA 和 MeProp 这两种流行且稳健的 PEFT 方法。

[NLP-85] Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

【速读】: 该论文试图解决现有评价指标在描述生成任务中无法全面捕捉描述质量或细粒度细节的问题。解决方案的关键在于提出了PAC-S++,这是一种可学习的评价指标,利用CLIP模型在经过筛选和清理的数据上进行预训练,并通过额外的视觉和文本正样本对进行正则化。PAC-S++不仅在自批判序列训练(SCST)阶段作为奖励机制用于微调描述生成模型,还通过广泛的实验证明了其在不同图像和视频数据集上的有效性,显著提升了生成描述的语义丰富性、减少重复和语法错误。

链接: https://arxiv.org/abs/2410.07336
作者: Sara Sarto,Nicholas Moratelli,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: significant advancements, fail to capture, capture the full, fine-grained details, existing evaluation metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: this https URL.
摘要:尽管在字幕生成方面取得了显著进展,但现有的评估指标往往无法全面捕捉字幕的质量或细微细节。这主要是因为它们依赖于非特定的手工编写参考文本或嘈杂的预训练数据。然而,找到一种有效的评估指标不仅对字幕评估至关重要,对生成阶段也同样重要。事实上,评估指标在字幕模型的微调阶段可以发挥关键作用,最终提升生成字幕的质量。在本文中,我们提出了 PAC-S++,一种可学习的评估指标,它利用了在网络收集和清理数据上预训练的 CLIP 模型,并通过额外的生成视觉和文本正样本对进行正则化。利用这种更强大且经过筛选的预训练,我们还将 PAC-S++ 作为奖励应用于通常用于微调字幕模型的自批判序列训练 (SCST) 阶段。在不同图像和视频数据集上的广泛实验表明,PAC-S++ 相比任务中常用的评估指标,其有效性更高,包括对对象幻觉的敏感性。此外,我们展示了将 PAC-S++ 集成到字幕模型微调阶段,可以生成语义更丰富、重复更少且语法错误更少的字幕。在域外基准上的评估进一步证明了我们的微调方法在提升模型能力方面的有效性。源代码和训练模型已在以下网址公开:this https URL。

[NLP-86] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models EMNLP2024

【速读】: 该论文试图解决在基于代理的数据科学任务中,大型语言模型(LLMs)在代码生成方面的性能评估问题。解决方案的关键在于引入DA-Code基准,该基准通过设计具有挑战性的任务、基于真实且多样化的数据集,并要求模型使用复杂的数据科学编程语言进行精细的数据处理,来评估LLMs在实际数据分析场景中的表现。此外,论文还开发了DA-Agent基线模型,并通过实验展示了当前最佳LLMs在该基准上的表现仅为30.5%的准确率,表明该领域仍有显著的改进空间。

链接: https://arxiv.org/abs/2410.07331
作者: Yiming Huang,Jianwen Luo,Yan Yu,Yitong Zhang,Fangyu Lei,Yifan Wei,Shizhu He,Lifu Huang,Xiao Liu,Jun Zhao,Kang Liu
关键词-EN: benchmark specifically designed, generation benchmark specifically, code generation tasks, code generation benchmark, agent-based data science
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024

点击查看摘要

Abstract:We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at [this https URL](this https URL).
摘要:我们介绍了 DA-Code,这是一个专门设计用于评估大语言模型 (LLM) 在基于智能体的数据科学任务中的代码生成基准。该基准包含三个核心要素:首先,DA-Code 中的任务本身具有挑战性,与传统的代码生成任务不同,这些任务要求在基础和规划方面具备高级编程技能。其次,DA-Code 中的示例均基于真实且多样化的数据,涵盖了广泛复杂的数據整理和分析任务。第三,为了解决这些任务,模型必须利用复杂的数据科学编程语言,进行精细的数据处理并得出答案。我们在一个可控且可执行的环境中设置了基准,该环境与现实世界的数据分析场景相符且具有可扩展性。标注者精心设计了评估套件,以确保评估的准确性和鲁棒性。我们开发了 DA-Agent 基线。实验表明,尽管基线表现优于其他现有框架,但使用当前最佳的大语言模型仅能达到 30.5% 的准确率,仍有很大的改进空间。我们在此 [https URL](this https URL) 发布了我们的基准。

[NLP-87] Locally Measuring Cross-lingual Lexical Alignment: A Domain and Word Level Perspective

【速读】: 该论文试图解决跨语言词汇表示空间对齐的问题,特别是如何更准确地评估和改进不同语言间词汇意义的对齐效果。解决方案的关键在于提出了一种新的方法论,通过合成验证和自然主义验证(特别是亲属关系领域的词汇空缺)来评估对齐效果,并引入了基于上下文嵌入的新型度量标准。研究涵盖了16种不同语言,表明使用更新的语言模型可以显著提升跨语言词汇对齐的准确性和细致度。

链接: https://arxiv.org/abs/2410.07239
作者: Taelin Karidi,Eitan Grossman,Omri Abend
关键词-EN: aligning lexical representation, aligning language spaces, lexical representation spaces, representation spaces, focused on aligning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:NLP research on aligning lexical representation spaces to one another has so far focused on aligning language spaces in their entirety. However, cognitive science has long focused on a local perspective, investigating whether translation equivalents truly share the same meaning or the extent that cultural and regional influences result in meaning variations. With recent technological advances and the increasing amounts of available data, the longstanding question of cross-lingual lexical alignment can now be approached in a more data-driven manner. However, developing metrics for the task requires some methodology for comparing metric efficacy. We address this gap and present a methodology for analyzing both synthetic validations and a novel naturalistic validation using lexical gaps in the kinship domain. We further propose new metrics, hitherto unexplored on this task, based on contextualized embeddings. Our analysis spans 16 diverse languages, demonstrating that there is substantial room for improvement with the use of newer language models. Our research paves the way for more accurate and nuanced cross-lingual lexical alignment methodologies and evaluation.
摘要:迄今为止,自然语言处理 (NLP) 研究在将词汇表示空间相互对齐方面主要集中在整体语言空间的对齐上。然而,认知科学长期以来关注局部视角,探讨翻译等价词是否真正共享相同意义,或文化与地区影响导致意义变异的程度。随着近期技术进步和可用数据量的增加,长期以来关于跨语言词汇对齐的问题现在可以以更数据驱动的方式进行探讨。然而,为该任务开发度量标准需要一些比较度量有效性的方法。我们填补了这一空白,并提出了一种方法,用于分析合成验证和一种新颖的自然主义验证,后者利用亲属关系领域的词汇空缺。我们进一步提出了基于上下文嵌入的新度量标准,这些度量标准在此任务上尚未被探索。我们的分析涵盖了16种不同语言,表明使用更新的语言模型有显著的改进空间。我们的研究为更准确和细致的跨语言词汇对齐方法和评估铺平了道路。

[NLP-88] he First VoicePrivacy Attacker Challenge Evaluation Plan

【速读】: 该论文旨在解决语音匿名化系统的攻击问题,关键在于开发针对语音匿名化的攻击系统。解决方案的核心是参与者需构建自动说话人验证系统作为攻击系统,并提交其在开发和评估数据上的得分。参与者可以使用任何公开可用的额外训练数据和模型,但必须在指定截止日期前声明。评估指标为等错误率(EER),最终结果将在ICASSP 2025的特别会议上展示,前五名参与者将被邀请提交并展示其挑战系统。

链接: https://arxiv.org/abs/2410.07428
作者: Natalia Tomashenko,Xiaoxiao Miao,Emmanuel Vincent,Junichi Yamagishi
关键词-EN: VoicePrivacy Attacker Challenge, anonymization systems submitted, developing attacker systems, Grand Challenge, VoicePrivacy initiative
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The First VoicePrivacy Attacker Challenge is a new kind of challenge organized as part of the VoicePrivacy initiative and supported by ICASSP 2025 as the SP Grand Challenge It focuses on developing attacker systems against voice anonymization, which will be evaluated against a set of anonymization systems submitted to the VoicePrivacy 2024 Challenge. Training, development, and evaluation datasets are provided along with a baseline attacker system. Participants shall develop their attacker systems in the form of automatic speaker verification systems and submit their scores on the development and evaluation data to the organizers. To do so, they can use any additional training data and models, provided that they are openly available and declared before the specified deadline. The metric for evaluation is equal error rate (EER). Results will be presented at the ICASSP 2025 special session to which 5 selected top-ranked participants will be invited to submit and present their challenge systems.
摘要:第一届 VoicePrivacy 攻击者挑战赛是 VoicePrivacy 计划的一部分,并得到 ICASSP 2025 的支持,作为 SP 重大挑战。该挑战赛专注于开发针对语音匿名化的攻击系统,这些系统将在 VoicePrivacy 2024 挑战赛中提交的一组匿名化系统上进行评估。挑战赛提供了训练、开发和评估数据集,以及一个基线攻击系统。参与者需开发自动说话人验证系统形式的攻击系统,并将他们在开发和评估数据上的得分提交给组织者。为此,他们可以使用任何额外的训练数据和模型,前提是这些数据和模型是公开可用的,并在指定截止日期前声明。评估指标为等错误率 (EER)。结果将在 ICASSP 2025 的特别会议上展示,届时将邀请 5 名排名最高的参与者提交并展示他们的挑战系统。

[NLP-89] Learn from Real: Reality Defenders Submission to ASVspoof5 Challenge

【速读】: 该论文旨在解决音频深度伪造检测问题,特别是通过参与ASVspoof挑战来评估检测模型的通用性和鲁棒性。解决方案的关键在于提出了一种新颖的预训练策略,即SLIM系统,该系统通过自监督对比学习从各种真实语音中学习风格与语言学依赖嵌入。这些嵌入有助于区分伪造语音和真实语音,通过关注风格和语言学方面的关系,从而在保持低计算成本的同时显著提高模型的通用性。

链接: https://arxiv.org/abs/2410.07379
作者: Yi Zhu,Chirag Goel,Surya Koppisetti,Trang Tran,Ankur Kumar,Gaurav Bharaj
关键词-EN: Audio deepfake detection, Audio deepfake, crucial to combat, combat the malicious, deepfake detection
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted into ASVspoof5 workshop

点击查看摘要

Abstract:Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks to evaluate the generalizability and robustness of detection models. In this paper, we present Reality Defender’s submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy which significantly improves generalizability while maintaining low computational cost during training. Our system SLIM learns the style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistics aspects. We evaluated our system on ASVspoof5, ASV2019, and In-the-wild. Our submission achieved minDCF of 0.1499 and EER of 5.5% on ASVspoof5 Track 1, and EER of 7.4% and 10.8% on ASV2019 and In-the-wild respectively.
摘要:音频深度伪造检测对于应对 AI 合成语音的恶意使用至关重要。在社区的众多努力中,ASVspoof 挑战已成为评估检测模型泛化性和鲁棒性的基准之一。本文介绍了 Reality Defender 在 ASVspoof5 挑战中的提交方案,重点介绍了一种新颖的预训练策略,该策略显著提高了泛化性,同时在训练过程中保持了低计算成本。我们的系统 SLIM 通过自监督对比学习从多种类型的真实语音中学习风格-语言依赖嵌入。这些学习到的嵌入通过关注风格和语言方面的关系,有助于区分伪造语音和真实语音。我们在 ASVspoof5、ASV2019 和 In-the-wild 数据集上评估了我们的系统。我们的提交在 ASVspoof5 Track 1 上取得了 minDCF 为 0.1499 和 EER 为 5.5% 的成绩,在 ASV2019 和 In-the-wild 上分别取得了 EER 为 7.4% 和 10.8% 的成绩。

[NLP-90] Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimers Dementia Detection

【速读】: 该论文试图解决阿尔茨海默病(AD)自动检测系统中语音信息包含与认知状态无关的额外信息(如年龄和性别)的问题。解决方案的关键在于提出了一种名为Swin-BERT的语音检测系统,该系统通过以下方式解决上述问题:首先,在声学部分,采用移位窗口多头注意力机制来设计声学系统,并将年龄和性别作为额外输入以解耦其对声学特征提取的影响;其次,在语言学部分,通过去除与节奏相关的信息并使用字符级转录作为额外输入来补偿,以减少节奏信息对检测结果的干扰;最后,将声学特征与语言学系统结合,形成综合的Swin-BERT系统。实验结果表明,该系统在ADReSS和ADReSSo数据集上分别达到了85.58%和87.32%的F-score,优于以往的研究。

链接: https://arxiv.org/abs/2410.07277
作者: Yilin Pan,Yanpei Shi,Yijia Zhang,Mingyu Lu
关键词-EN: automatic Alzheimer dementia, automatic Alzheimer, Alzheimer dementia, early stages, system
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Speech is usually used for constructing an automatic Alzheimer’s dementia (AD) detection system, as the acoustic and linguistic abilities show a decline in people living with AD at the early stages. However, speech includes not only AD-related local and global information but also other information unrelated to cognitive status, such as age and gender. In this paper, we propose a speech-based system named Swin-BERT for automatic dementia detection. For the acoustic part, the shifted windows multi-head attention that proposed to extract local and global information from images, is used for designing our acoustic-based system. To decouple the effect of age and gender on acoustic feature extraction, they are used as an extra input of the designed acoustic system. For the linguistic part, the rhythm-related information, which varies significantly between people living with and without AD, is removed while transcribing the audio recordings into transcripts. To compensate for the removed rhythm-related information, the character-level transcripts are proposed to be used as the extra input of a word-level BERT-style system. Finally, the Swin-BERT combines the acoustic features learned from our proposed acoustic-based system with our linguistic-based system. The experiments are based on the two datasets provided by the international dementia detection challenges: the ADReSS and ADReSSo. The results show that both the proposed acoustic and linguistic systems can be better or comparable with previous research on the two datasets. Superior results are achieved by the proposed Swin-BERT system on the ADReSS and ADReSSo datasets, which are 85.58% F-score and 87.32% F-score respectively.
摘要:语音通常用于构建自动阿尔茨海默病(AD)检测系统,因为患有 AD 的人在早期阶段其声音的声学和语言能力会下降。然而,语音不仅包含与 AD 相关的局部和全局信息,还包含与认知状态无关的其他信息,如年龄和性别。本文提出了一种基于语音的系统,名为 Swin-BERT,用于自动痴呆检测。对于声学部分,我们采用了用于从图像中提取局部和全局信息的移位窗口多头注意力机制来设计我们的声学系统。为了解耦年龄和性别对声学特征提取的影响,我们将它们作为设计的声学系统的额外输入。对于语言部分,在将音频记录转录为文本时,去除了患有 AD 和未患有 AD 的人之间显著不同的节奏相关信息。为了补偿去除的节奏相关信息,提出了使用字符级转录作为词级 BERT 风格系统的额外输入。最后,Swin-BERT 结合了我们提出的声学系统和语言系统所学习的声学特征。实验基于国际痴呆检测挑战提供的两个数据集:ADReSS 和 ADReSSo。结果表明,我们提出的声学和语言系统在两个数据集上的表现均优于或与之前的研究相当。在 ADReSS 和 ADReSSo 数据集上,Swin-BERT 系统分别取得了 85.58% 和 87.32% 的 F 分数,表现优异。

[NLP-91] Distilling Analysis from Generative Models for Investment Decisions

【速读】: 该论文试图解决金融市场中专业人士决策行为的建模问题,特别是如何准确预测专业人士的交易决策。解决方案的关键在于提出了Chain-of-Decision方法,该方法通过引入一个“意见生成器”来增强对新闻事件的主观分析,从而提升模型在模拟专业人士决策过程中的表现。

链接: https://arxiv.org/abs/2410.07225
作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词-EN: decisions, Professionals’, stock analysts’ decisions, professionals’ decision-making processes, decision-making processes
类目: atistical Finance (q-fin.ST); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Professionals’ decisions are the focus of every field. For example, politicians’ decisions will influence the future of the country, and stock analysts’ decisions will impact the market. Recognizing the influential role of professionals’ perspectives, inclinations, and actions in shaping decision-making processes and future trends across multiple fields, we propose three tasks for modeling these decisions in the financial market. To facilitate this, we introduce a novel dataset, A3, designed to simulate professionals’ decision-making processes. While we find current models present challenges in forecasting professionals’ behaviors, particularly in making trading decisions, the proposed Chain-of-Decision approach demonstrates promising improvements. It integrates an opinion-generator-in-the-loop to provide subjective analysis based on each news item, further enhancing the proposed tasks’ performance.
摘要:专业人士的决策是各个领域的焦点。例如,政治家的决策将影响国家的未来,而股票分析师的决策将影响市场。认识到专业人士的观点、倾向和行动在塑造决策过程和多个领域的未来趋势中的影响力,我们提出了三个任务来模拟金融市场中的这些决策。为此,我们引入了一个新的数据集 A3,旨在模拟专业人士的决策过程。尽管我们发现当前模型在预测专业人士的行为,特别是在做出交易决策方面存在挑战,但所提出的 Chain-of-Decision 方法展示了有希望的改进。它集成了一个意见生成器,基于每条新闻提供主观分析,从而进一步提升了所提出任务的性能。

人工智能

[AI-0] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

链接: https://arxiv.org/abs/2410.08211
作者: Anh-Quan Cao,Maximilian Jaritz,Matthieu Guillaumin,Raoul de Charette,Loris Bazzani
关键词-EN: Large-scale vision-language pre-trained, Large-scale vision-language, applied to diverse, diverse applications, fine-tuning VLP models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.

[AI-1] PointOBB-v2: Towards Simpler Faster and Stronger Single Point Supervised Oriented Object Detection

链接: https://arxiv.org/abs/2410.08210
作者: Botao Ren,Xue Yang,Yi Yu,Junwei Luo,Zhidong Deng
关键词-EN: made initial progress, Single point supervised, gained attention, attention and made, made initial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Single point supervised oriented object detection has gained attention and made initial progress within the community. Diverse from those approaches relying on one-shot samples or powerful pretrained models (e.g. SAM), PointOBB has shown promise due to its prior-free feature. In this paper, we propose PointOBB-v2, a simpler, faster, and stronger method to generate pseudo rotated boxes from points without relying on any other prior. Specifically, we first generate a Class Probability Map (CPM) by training the network with non-uniform positive and negative sampling. We show that the CPM is able to learn the approximate object regions and their contours. Then, Principal Component Analysis (PCA) is applied to accurately estimate the orientation and the boundary of objects. By further incorporating a separation mechanism, we resolve the confusion caused by the overlapping on the CPM, enabling its operation in high-density scenarios. Extensive comparisons demonstrate that our method achieves a training speed 15.58x faster and an accuracy improvement of 11.60%/25.15%/21.19% on the DOTA-v1.0/v1.5/v2.0 datasets compared to the previous state-of-the-art, PointOBB. This significantly advances the cutting edge of single point supervised oriented detection in the modular track.

[AI-2] Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

链接: https://arxiv.org/abs/2410.08209
作者: Shengcao Cao,Liang-Yan Gui,Yu-Xiong Wang
关键词-EN: Current large multimodal, relate language components, Current large, large multimodal models, face challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an “attend-and-segment” method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: this https URL.

[AI-3] SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

链接: https://arxiv.org/abs/2410.08208
作者: Haoyi Zhu,Honghui Yang,Yating Wang,Jiange Yang,Limin Wang,Tong He
关键词-EN: vanilla Vision Transformer, embodied representation learning, framework that emphasizes, emphasizes the importance, Vision Transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios. These results highlight the critical role of 3D spatial awareness for embodied representation learning. Our strongest model takes more than 6000 GPU hours to train and we are committed to open-sourcing all code and model weights to foster future research in embodied representation learning. Project Page: this https URL.

[AI-4] From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions

链接: https://arxiv.org/abs/2410.08197
作者: Changle Qu,Sunhao Dai,Xiaochi Wei,Hengyi Cai,Shuaiqiang Wang,Dawei Yin,Jun Xu,Ji-Rong Wen
关键词-EN: Large Language Models, enables Large Language, Language Models, Large Language, learning enables Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tool learning enables Large Language Models (LLMs) to interact with external environments by invoking tools, serving as an effective strategy to mitigate the limitations inherent in their pre-training data. In this process, tool documentation plays a crucial role by providing usage instructions for LLMs, thereby facilitating effective tool utilization. This paper concentrates on the critical challenge of bridging the comprehension gap between LLMs and external tools due to the inadequacies and inaccuracies inherent in existing human-centric tool documentation. We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation through the Analysis of Feedback and Trails emanating from LLMs’ interactions with external tools. This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases: experience gathering, learning from experience, and documentation rewriting, to iteratively enhance the tool documentation. This process is further optimized by implementing a diversity-promoting exploration strategy to ensure explorative diversity and a tool-adaptive termination mechanism to prevent overfitting while enhancing efficiency. Extensive experiments on multiple datasets demonstrate that DRAFT’s iterative, feedback-based refinement significantly ameliorates documentation quality, fostering a deeper comprehension and more effective utilization of tools by LLMs. Notably, our analysis reveals that the tool documentation refined via our approach demonstrates robust cross-model generalization capabilities.

[AI-5] MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

链接: https://arxiv.org/abs/2410.08196
作者: Zimu Lu,Aojun Zhou,Ke Wang,Houxing Ren,Weikang Shi,Junting Pan,Mingjie Zhan,Hongsheng Li
关键词-EN: Code, mathematical, precision and accuracy, reasoning, mathematical reasoning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at this https URL .

[AI-6] DifFRelight: Diffusion-Based Facial Performance Relighting SIGGRAPH WWW

链接: https://arxiv.org/abs/2410.08188
作者: Mingming He,Pascal Clausen,Ahmet Levent Taşel,Li Ma,Oliver Pilarski,Wenqi Xian,Laszlo Rikker,Xueming Yu,Ryan Burgert,Ning Yu,Paul Debevec
关键词-EN: relighting using diffusion-based, free-viewpoint facial performance, facial performance relighting, Stable Diffusion model, lighting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: 18 pages, SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 3–6, 2024, Tokyo, Japan. Project page: this https URL

点击查看摘要

Abstract:We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially-aligned conditioning of flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the models efficiency in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skintexture andhair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.

[AI-7] MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

链接: https://arxiv.org/abs/2410.08182
作者: Wenbo Hu,Jia-Chen Gu,Zi-Yi Dou,Mohsen Fayyaz,Pan Lu,Kai-Wei Chang,Nanyun Peng
关键词-EN: Existing multimodal retrieval, retrieval benchmarks primarily, benchmarks primarily focus, Existing multimodal, primarily focus
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: this https URL

点击查看摘要

Abstract:Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs’ ability to utilize retrieved visual knowledge more effectively.

[AI-8] Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

链接: https://arxiv.org/abs/2410.08174
作者: Qingni Wang,Tiantian Geng,Zhiyuan Wang,Teng Wang,Bo Fu,Feng Zheng
关键词-EN: Multimodal Large Language, Multimodal Large, Large Language Models, significant trustworthiness issues, encounter significant trustworthiness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error rates by two specific risk levels. Furthermore, we investigate semantic redundancy in prediction sets within open-ended contexts for the first time, leading to a promising evaluation metric for MLLMs based on average set size. Our comprehensive experiments across four Video Question-Answering (VideoQA) datasets utilizing eight MLLMs show that TRON achieves desired error rates bounded by two user-specified risk levels. Additionally, deduplicated prediction sets maintain adaptiveness while being more efficient and stable for risk assessment under different risk levels.

[AI-9] On the Evaluation of Generative Robotic Simulations ALT

链接: https://arxiv.org/abs/2410.08172
作者: Feng Chen,Botian Xu,Pu Hua,Peiqi Duan,Yanchao Yang,Yi Ma,Huazhe Xu
关键词-EN: acquiring extensive real-world, scalable simulated robotic, extensive real-world data, simulated robotic tasks, highlighting the importance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capacities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For single-task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision-language models. In terms of diversity, we measure both task and data diversity through text similarity of task descriptions and world model loss trained on collected task trajectories. For task-level generalization, we assess the zero-shot generalization ability on unseen tasks of a policy trained with multiple generated tasks. Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics. Additionally, our analysis further highlights the common challenge of low generalization capability faced by current works. Our anonymous website: this https URL.

[AI-10] Agent S: An Open Agent ic Framework that Uses Computers Like a Human

链接: https://arxiv.org/abs/2410.08164
作者: Saaket Agashe,Jiuzhou Han,Shuyu Gan,Jiachen Yang,Ang Li,Xin Eric Wang
关键词-EN: Graphical User Interface, Graphical User, enables autonomous interaction, transforming human-computer interaction, open agentic framework
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 16 figures, 9 tables

点击查看摘要

Abstract:We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at this https URL.

[AI-11] DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory

链接: https://arxiv.org/abs/2410.08143
作者: Yutong Wang,Jiali Zeng,Xuebo Liu,Derek F. Wong,Fandong Meng,Jie Zhou,Min Zhang
关键词-EN: Large language models, Large language, reasonable quality improvements, achieved reasonable quality, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks. We release our code and data at this https URL.

[AI-12] Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

链接: https://arxiv.org/abs/2410.08134
作者: Jarrid Rector-Brooks,Mohsin Hasan,Zhangzhi Peng,Zachary Quinn,Chenghao Liu,Sarthak Mittal,Nouha Dziri,Michael Bronstein,Yoshua Bengio,Pranam Chatterjee,Alexander Tong,Avishek Joey Bose
关键词-EN: data underlies important, underlies important applications, important applications spanning, discrete data underlies, spanning text-based agents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process - typically via RLHF - to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.

[AI-13] Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks

链接: https://arxiv.org/abs/2410.08133
作者: Mathis Pink,Vy A. Vo,Qinyuan Wu,Jianing Mu,Javier S. Turek,Uri Hasson,Kenneth A. Norman,Sebastian Michelmann,Alexander Huth,Mariya Toneva
关键词-EN: primarily assessing semantic, Current LLM benchmarks, Current LLM, assessing semantic aspects, semantic relations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current LLM benchmarks focus on evaluating models’ memory of facts and semantic relations, primarily assessing semantic aspects of long-term memory. However, in humans, long-term memory also includes episodic memory, which links memories to their contexts, such as the time and place they occurred. The ability to contextualize memories is crucial for many cognitive tasks and everyday functions. This form of memory has not been evaluated in LLMs with existing benchmarks. To address the gap in evaluating memory in LLMs, we introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology. SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is both easily extendable and does not require any additional annotations. We present an initial evaluation dataset, Book-SORT, comprising 36k pairs of segments extracted from 9 books recently added to the public domain. Based on a human experiment with 155 participants, we show that humans can recall sequence order based on long-term memory of a book. We find that models can perform the task with high accuracy when relevant text is given in-context during the SORT evaluation. However, when presented with the book text only during training, LLMs’ performance on SORT falls short. By allowing to evaluate more aspects of memory, we believe that SORT will aid in the emerging development of memory-augmented models.

[AI-14] Mars: Situated Inductive Reasoning in an Open-World Environment

链接: https://arxiv.org/abs/2410.08126
作者: Xiaojuan Tang,Jiaqi Li,Yitao Liang,Song-chun Zhu,Muhan Zhang,Zilong Zheng
关键词-EN: Large Language Models, Large Language, Language Models, shown remarkable success, inductive reasoning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) trained on massive corpora have shown remarkable success in knowledge-intensive tasks. Yet, most of them rely on pre-stored knowledge. Inducing new general knowledge from a specific environment and performing reasoning with the acquired knowledge – \textitsituated inductive reasoning, is crucial and challenging for machine intelligence. In this paper, we design Mars, an interactive environment devised for situated inductive reasoning. It introduces counter-commonsense game mechanisms by modifying terrain, survival setting and task dependency while adhering to certain principles. In Mars, agents need to actively interact with their surroundings, derive useful rules and perform decision-making tasks in specific contexts. We conduct experiments on various RL-based and LLM-based methods, finding that they all struggle on this challenging situated inductive reasoning benchmark. Furthermore, we explore \textitInduction from Reflection, where we instruct agents to perform inductive reasoning from history trajectory. The superior performance underscores the importance of inductive reasoning in Mars. Through Mars, we aim to galvanize advancements in situated inductive reasoning and set the stage for developing the next generation of AI systems that can reason in an adaptive and context-sensitive way.

[AI-15] Heterogeneous Graph Auto-Encoder for CreditCard Fraud Detection

链接: https://arxiv.org/abs/2410.08121
作者: Moirangthem Tiken Singh,Rabinder Kumar Prasad,Gurumayum Robert Michael,N K Kaphungkui,N.Hemarjit Singh
关键词-EN: credit card usage, digital revolution, notable increase, Graph Neural Networks, fraud
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The digital revolution has significantly impacted financial transactions, leading to a notable increase in credit card usage. However, this convenience comes with a trade-off: a substantial rise in fraudulent activities. Traditional machine learning methods for fraud detection often struggle to capture the inherent interconnectedness within financial data. This paper proposes a novel approach for credit card fraud detection that leverages Graph Neural Networks (GNNs) with attention mechanisms applied to heterogeneous graph representations of financial data. Unlike homogeneous graphs, heterogeneous graphs capture intricate relationships between various entities in the financial ecosystem, such as cardholders, merchants, and transactions, providing a richer and more comprehensive data representation for fraud analysis. To address the inherent class imbalance in fraud data, where genuine transactions significantly outnumber fraudulent ones, the proposed approach integrates an autoencoder. This autoencoder, trained on genuine transactions, learns a latent representation and flags deviations during reconstruction as potential fraud. This research investigates two key questions: (1) How effectively can a GNN with an attention mechanism detect and prevent credit card fraud when applied to a heterogeneous graph? (2) How does the efficacy of the autoencoder with attention approach compare to traditional methods? The results are promising, demonstrating that the proposed model outperforms benchmark algorithms such as Graph Sage and FI-GRL, achieving a superior AUC-PR of 0.89 and an F1-score of 0.81. This research significantly advances fraud detection systems and the overall security of financial transactions by leveraging GNNs with attention mechanisms and addressing class imbalance through an autoencoder.

[AI-16] Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

链接: https://arxiv.org/abs/2410.08115
作者: Weize Chen,Jiarui Yuan,Chen Qian,Cheng Yang,Zhiyuan Liu,Maosong Sun
关键词-EN: Large Language Model, Large Language, Language Model, parameter-updating optimization methods, low communication efficiency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving, yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness in LLM-based MAS through LLM training. Optima employs an iterative generate, rank, select, and train paradigm with a reward function balancing task performance, token efficiency, and communication readability. We explore various RL algorithms, including Supervised Fine-Tuning, Direct Preference Optimization, and their hybrid approaches, providing insights into their effectiveness-efficiency trade-offs. We integrate Monte Carlo Tree Search-inspired techniques for DPO data generation, treating conversation turns as tree nodes to explore diverse interaction paths. Evaluated on common multi-agent tasks, including information-asymmetric question answering and complex reasoning, Optima shows consistent and substantial improvements over single-agent baselines and vanilla MAS based on Llama 3 8B, achieving up to 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. Moreover, Optima’s efficiency gains open new possibilities for leveraging inference-compute more effectively, leading to improved inference-time scaling laws. By addressing fundamental challenges in LLM-based MAS, Optima shows the potential towards scalable, efficient, and effective MAS (this https URL).

[AI-17] Robust AI-Generated Text Detection by Restricted Embeddings EMNLP2024

链接: https://arxiv.org/abs/2410.08113
作者: Kristian Kuznetsov,Eduard Tulchinskii,Laida Kushnareva,German Magai,Serguei Barannikov,Sergey Nikolenko,Irina Piontkovskaya
关键词-EN: texts makes detecting, Growing amount, AI-generated texts makes, content more difficult, amount and quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to Findings of EMNLP 2024

点击查看摘要

Abstract:Growing amount and quality of AI-generated texts makes detecting such content more difficult. In most real-world scenarios, the domain (style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier, ignoring domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state of the art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings respectively. We release our code and data: this https URL

[AI-18] Active Fourier Auditor for Estimating Distributional Properties of ML Models

链接: https://arxiv.org/abs/2410.08111
作者: Ayoub Ajarra,Bishwamittra Ghosh,Debabrota Basu
关键词-EN: Machine Learning, deployment of Machine, real-world applications, central concern, pervasive deployment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:With the pervasive deployment of Machine Learning (ML) models in real-world applications, verifying and auditing properties of ML models have become a central concern. In this work, we focus on three properties: robustness, individual fairness, and group fairness. We discuss two approaches for auditing ML model properties: estimation with and without reconstruction of the target model under audit. Though the first approach is studied in the literature, the second approach remains unexplored. For this purpose, we develop a new framework that quantifies different properties in terms of the Fourier coefficients of the ML model under audit but does not parametrically reconstruct it. We propose the Active Fourier Auditor (AFA), which queries sample points according to the Fourier coefficients of the ML model, and further estimates the properties. We derive high probability error bounds on AFA’s estimates, along with the worst-case lower bounds on the sample complexity to audit them. Numerically we demonstrate on multiple datasets and models that AFA is more accurate and sample-efficient to estimate the properties of interest than the baselines.

[AI-19] A Closer Look at Machine Unlearning for Large Language Models

链接: https://arxiv.org/abs/2410.08109
作者: Xiaojian Yuan,Tianyu Pang,Chao Du,Kejiang Chen,Weiming Zhang,Min Lin
关键词-EN: Large language models, Large language, raising privacy, legal concerns, memorize sensitive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at this https URL.

[AI-20] A Generative AI Technique for Synthesizing a Digital Twin for U.S. Residential Solar Adoption and Generation

链接: https://arxiv.org/abs/2410.08098
作者: Aparna Kishore,Swapna Thorve,Madhav Marathe
关键词-EN: reducing carbon emissions, Residential rooftop solar, rooftop solar adoption, Residential rooftop, carbon emissions
类目: Artificial Intelligence (cs.AI)
*备注: 41 pages including references and supplementary

点击查看摘要

Abstract:Residential rooftop solar adoption is considered crucial for reducing carbon emissions. The lack of photovoltaic (PV) data at a finer resolution (e.g., household, hourly levels) poses a significant roadblock to informed decision-making. We discuss a novel methodology to generate a highly granular, residential-scale realistic dataset for rooftop solar adoption across the contiguous United States. The data-driven methodology consists of: (i) integrated machine learning models to identify PV adopters, (ii) methods to augment the data using explainable AI techniques to glean insights about key features and their interactions, and (iii) methods to generate household-level hourly solar energy output using an analytical model. The resulting synthetic datasets are validated using real-world data and can serve as a digital twin for modeling downstream tasks. Finally, a policy-based case study utilizing the digital twin for Virginia demonstrated increased rooftop solar adoption with the 30% Federal Solar Investment Tax Credit, especially in Low-to-Moderate-Income communities.

[AI-21] SAKA: An Intelligent Platform for Semi-automated Knowledge Graph Construction and Application

链接: https://arxiv.org/abs/2410.08094
作者: Hanrong Zhang,Xinyue Wang,Jiabao Pan,Hongwei Wang
关键词-EN: companies offer applications, technology is extensively, offer applications based, extensively utilized, companies offer
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge graph (KG) technology is extensively utilized in many areas, and many companies offer applications based on KG. Nonetheless, the majority of KG platforms necessitate expertise and tremendous time and effort of users to construct KG records manually, which poses great difficulties for ordinary people to use. Additionally, audio data is abundant and holds valuable information, but it is challenging to transform it into a KG. What’s more, the platforms usually do not leverage the full potential of the KGs constructed by users. In this paper, we propose an intelligent and user-friendly platform for Semi-automated KG Construction and Application (SAKA) to address the problems aforementioned. Primarily, users can semi-automatically construct KGs from structured data of numerous areas by interacting with the platform, based on which multi-versions of KG can be stored, viewed, managed, and updated. Moreover, we propose an Audio-based KG Information Extraction (AGIE) method to establish KGs from audio data. Lastly, the platform creates a semantic parsing-based knowledge base question answering (KBQA) system based on the user-created KGs. We prove the feasibility of the semi-automatic KG construction method on the SAKA platform.

[AI-22] Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering

链接: https://arxiv.org/abs/2410.08085
作者: Yuan Sui,Bryan Hooi
关键词-EN: Large Language Models, integrating Knowledge Graphs, Knowledge Graphs, Large Language, Recent works integrating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Work in progress

点击查看摘要

Abstract:Recent works integrating Knowledge Graphs (KGs) have led to promising improvements in enhancing reasoning accuracy of Large Language Models (LLMs). However, current benchmarks mainly focus on closed tasks, leaving a gap in the assessment of more complex, real-world scenarios. This gap has also obscured the evaluation of KGs’ potential to mitigate the problem of hallucination in LLMs. To fill the gap, we introduce OKGQA, a new benchmark specifically designed to assess LLMs enhanced with KGs under open-ended, real-world question answering scenarios. OKGQA is designed to closely reflect the complexities of practical applications using questions from different types, and incorporates specific metrics to measure both the reduction in hallucinations and the enhancement in reasoning capabilities. To consider the scenario in which KGs may have varying levels of mistakes, we further propose another experiment setting OKGQA-P to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated. OKGQA aims to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on methods and future directions for leveraging KGs to reduce LLMs’ hallucination. We believe that this study can facilitate a more complete performance comparison and encourage continuous improvement in integrating KGs with LLMs.

[AI-23] Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning

链接: https://arxiv.org/abs/2410.08081
作者: Shuhe Wang,Guoyin Wang,Jiwei Li,Eduard Hovy,Chen Guo
关键词-EN: maximum input length, optimization technique designed, maximize hardware resource, model maximum input, hardware resource efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model’s maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2410.08081 [cs.LG] (or arXiv:2410.08081v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.08081 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-24] Unlearning-based Neural Interpretations

链接: https://arxiv.org/abs/2410.08069
作者: Ching Lam Choi,Alexandre Duplessis,Serge Belongie
关键词-EN: computing feature importance, Gradient-based interpretations, require an anchor, comparison to avoid, avoid saturation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions–constant mapping, averaging or blurring–inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust interpretations.

[AI-25] aching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

链接: https://arxiv.org/abs/2410.08068
作者: Wenting Tan,Dongxiao Chen,Jieting Xue,Zihao Wang,Taijie Chen
关键词-EN: Large Language Models, Large Language, Language Models, exhibit impressive performance, arithmetic reasoning tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. However, these approaches overlook crucial requirements for prior knowledge of specific concepts, theorems, and tricks to tackle most arithmetic reasoning problems successfully. To address this issue, we propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students. This method equips LLMs with essential concepts, relevant theorems, and similar problems with analogous solution approaches, facilitating the enhancement of reasoning abilities. Additionally, we introduce two new Chinese datasets, MathMC and MathToF, both with detailed explanations and answers. Experiments are conducted on nine benchmarks which demonstrates that our approach improves the reasoning accuracy of LLMs. With GPT-4 and our framework, we achieve new state-of-the-art performance on four math benchmarks (AddSub, SVAMP, Math23K and AQuA) with accuracies of 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) and 81.1% (+1.2%). Our data and code are available at this https URL.

[AI-26] Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

链接: https://arxiv.org/abs/2410.08067
作者: Shenao Zhang,Zhihan Liu,Boyi Liu,Yufeng Zhang,Yingxiang Yang,Yongfei Liu,Liyu Chen,Tao Sun,Zhaoran Wang
关键词-EN: Large Language Models, Large Language, Language Models, instructions and intentions, existing direct alignment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to responses with the highest rewards, which are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. The experimental results across instruction-following benchmarks including AlpacaEval, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. Additionally, our method improves the average accuracy on various academic benchmarks. When applying our method to on-policy data, the resulting DPO model achieves SOTA results on AlpacaEval. Through ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion. Our code is available at this https URL.

[AI-27] Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions EMNLP2024

链接: https://arxiv.org/abs/2410.08058
作者: Inderjeet Nair,Jiaye Tan,Xiaotian Su,Anne Gere,Xu Wang,Lu Wang
关键词-EN: Providing feedback, widely recognized, recognized as crucial, crucial for refining, students’ writing skills
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Providing feedback is widely recognized as crucial for refining students’ writing skills. Recent advances in language models (LMs) have made it possible to automatically generate feedback that is actionable and well-aligned with human-specified attributes. However, it remains unclear whether the feedback generated by these models is truly effective in enhancing the quality of student revisions. Moreover, prompting LMs with a precise set of instructions to generate feedback is nontrivial due to the lack of consensus regarding the specific attributes that can lead to improved revising performance. To address these challenges, we propose PROF that PROduces Feedback via learning from LM simulated student revisions. PROF aims to iteratively optimize the feedback generator by directly maximizing the effectiveness of students’ overall revising performance as simulated by LMs. Focusing on an economic essay assignment, we empirically test the efficacy of PROF and observe that our approach not only surpasses a variety of baseline methods in effectiveness of improving students’ writing but also demonstrates enhanced pedagogical values, even though it was not explicitly trained for this aspect.

[AI-28] Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

链接: https://arxiv.org/abs/2410.08049
作者: Yiyuan Zhang,Xiaohan Ding,Xiangyu Yue
关键词-EN: Convolutional Neural Networks, modern Convolutional Neural, designing modern Convolutional, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This is the journal version of arXiv:2203.06717 and arXiv:2311.15599

点击查看摘要

Abstract:This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematical architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4% but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets with faster inference speed compared with vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs. All codes and models are publicly available at this https URL promoting further research and development in the community.

[AI-29] On the Convergence of (Stochastic) Gradient Descent for Kolmogorov–Arnold Networks

链接: https://arxiv.org/abs/2410.08041
作者: Yihang Gao,Vincent Y. F. Tan
关键词-EN: Arnold Networks, neural network architecture, gained significant attention, proposed neural network, deep learning community
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Kolmogorov–Arnold Networks (KANs), a recently proposed neural network architecture, have gained significant attention in the deep learning community, due to their potential as a viable alternative to multi-layer perceptrons (MLPs) and their broad applicability to various scientific tasks. Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss in various machine learning (e.g., regression, classification, and time series forecasting, etc.) and scientific tasks (e.g., solving partial differential equations). In this paper, we provide a theoretical explanation for the empirical success by conducting a rigorous convergence analysis of gradient descent (GD) and SGD for two-layer KANs in solving both regression and physics-informed tasks. For regression problems, we establish using the neural tangent kernel perspective that GD achieves global linear convergence of the objective function when the hidden dimension of KANs is sufficiently large. We further extend these results to SGD, demonstrating a similar global convergence in expectation. Additionally, we analyze the global convergence of GD and SGD for physics-informed KANs, which unveils additional challenges due to the more complex loss structure. This is the first work establishing the global convergence guarantees for GD and SGD applied to optimize KANs and physics-informed KANs.

[AI-30] Composite Learning Units: Generalized Learning Beyond Parameter Updates to Transform LLMs into Adaptive Reasoners

链接: https://arxiv.org/abs/2410.08037
作者: Santosh Kumar Radha,Oktay Goktas
关键词-EN: Human learning thrives, Large Language Models, Composite Learning Units, static machine learning, Human learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Human learning thrives on the ability to learn from mistakes, adapt through feedback, and refine understanding-processes often missing in static machine learning models. In this work, we introduce Composite Learning Units (CLUs) designed to transform reasoners, such as Large Language Models (LLMs), into learners capable of generalized, continuous learning without conventional parameter updates while enhancing their reasoning abilities through continual interaction and feedback. CLUs are built on an architecture that allows a reasoning model to maintain and evolve a dynamic knowledge repository: a General Knowledge Space for broad, reusable insights and a Prompt-Specific Knowledge Space for task-specific learning. Through goal-driven interactions, CLUs iteratively refine these knowledge spaces, enabling the system to adapt dynamically to complex tasks, extract nuanced insights, and build upon past experiences autonomously. We demonstrate CLUs’ effectiveness through a cryptographic reasoning task, where they continuously evolve their understanding through feedback to uncover hidden transformation rules. While conventional models struggle to grasp underlying logic, CLUs excel by engaging in an iterative, goal-oriented process. Specialized components-handling knowledge retrieval, prompt generation, and feedback analysis-work together within a reinforcing feedback loop. This approach allows CLUs to retain the memory of past failures and successes, adapt autonomously, and apply sophisticated reasoning effectively, continually learning from mistakes while also building on breakthroughs.

[AI-31] IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

链接: https://arxiv.org/abs/2410.08035
作者: Xin Zhang,Xiang Lyu,Zhihao Du,Qian Chen,Dong Zhang,Hangrui Hu,Chaohong Tan,Tianyu Zhao,Yuxuan Wang,Bin Zhang,Heng Lu,Yaqian Zhou,Xipeng Qiu
关键词-EN: maintain content quality, brings computational overhead, Current methods, voice interaction capabilities, capabilities rely heavily
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named \method-500k which includes nearly 500k turns of speech-to-speech dialogues, and a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech response with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at this https URL.

[AI-32] Strategic Classification With Externalities

链接: https://arxiv.org/abs/2410.08032
作者: Yiling Chen,Safwan Hossain,Evi Micha,Ariel Procaccia
关键词-EN: strategic classification problem, pure Nash Equilibrium, possibly manipulated, classification problem, principal reveals
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We propose a new variant of the strategic classification problem: a principal reveals a classifier, and n agents report their (possibly manipulated) features to be classified. Motivated by real-world applications, our model crucially allows the manipulation of one agent to affect another; that is, it explicitly captures inter-agent externalities. The principal-agent interactions are formally modeled as a Stackelberg game, with the resulting agent manipulation dynamics captured as a simultaneous game. We show that under certain assumptions, the pure Nash Equilibrium of this agent manipulation game is unique and can be efficiently computed. Leveraging this result, PAC learning guarantees are established for the learner: informally, we show that it is possible to learn classifiers that minimize loss on the distribution, even when a random number of agents are manipulating their way to a pure Nash Equilibrium. We also comment on the optimization of such classifiers through gradient-based approaches. This work sets the theoretical foundations for a more realistic analysis of classifiers that are robust against multiple strategic actors interacting in a common environment.

[AI-33] Private Language Models via Truncated Laplacian Mechanism EMNLP2024

链接: https://arxiv.org/abs/2410.08027
作者: Tianhao Huang,Tao Yang,Ivan Habernal,Lijie Hu,Di Wang
关键词-EN: Deep learning models, Deep learning, models for NLP, truncated Laplacian mechanism, NLP tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by EMNLP 2024, Main Track

点击查看摘要

Abstract:Deep learning models for NLP tasks are prone to variants of privacy attacks. To prevent privacy leakage, researchers have investigated word-level perturbations, relying on the formal guarantees of differential privacy (DP) in the embedding space. However, many existing approaches either achieve unsatisfactory performance in the high privacy regime when using the Laplacian or Gaussian mechanism, or resort to weaker relaxations of DP that are inferior to the canonical DP in terms of privacy strength. This raises the question of whether a new method for private word embedding can be designed to overcome these limitations. In this paper, we propose a novel private embedding method called the high dimensional truncated Laplacian mechanism. Specifically, we introduce a non-trivial extension of the truncated Laplacian mechanism, which was previously only investigated in one-dimensional space cases. Theoretically, we show that our method has a lower variance compared to the previous private word embedding methods. To further validate its effectiveness, we conduct comprehensive experiments on private embedding and downstream tasks using three datasets. Remarkably, even in the high privacy regime, our approach only incurs a slight decrease in utility compared to the non-private scenario.

[AI-34] he Computational Complexity of Circuit Discovery for Inner Interpretability

链接: https://arxiv.org/abs/2410.08025
作者: Federico Adolfi,Martina G. Vilas,Todd Wareham
关键词-EN: brain science, machine learning, proposed applications, applications of neural, neural networks
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Many proposed applications of neural networks in machine learning, cognitive/brain science, and society hinge on the feasibility of inner interpretability via circuit discovery. This calls for empirical and theoretical explorations of viable algorithmic options. Despite advances in the design and testing of heuristics, there are concerns about their scalability and faithfulness at a time when we lack understanding of the complexity properties of the problems they are deployed to solve. To address this, we study circuit discovery with classical and parameterized computational complexity theory: (1) we describe a conceptual scaffolding to reason about circuit finding queries in terms of affordances for description, explanation, prediction and control; (2) we formalize a comprehensive set of queries that capture mechanistic explanation, and propose a formal framework for their analysis; (3) we use it to settle the complexity of many query variants and relaxations of practical interest on multi-layer perceptrons (part of, e.g., transformers). Our findings reveal a challenging complexity landscape. Many queries are intractable (NP-hard, \Sigma^p_2 -hard), remain fixed-parameter intractable (W[1]-hard) when constraining model/circuit features (e.g., depth), and are inapproximable under additive, multiplicative, and probabilistic approximation schemes. To navigate this landscape, we prove there exist transformations to tackle some of these hard problems (NP- vs. \Sigma^p_2 -complete) with better-understood heuristics, and prove the tractability (PTIME) or fixed-parameter tractability (FPT) of more modest queries which retain useful affordances. This framework allows us to understand the scope and limits of interpretability queries, explore viable options, and compare their resource demands among existing and future architectures.

[AI-35] Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling

链接: https://arxiv.org/abs/2410.08024
作者: Alessio Fallani,Ramil Nugmanov,Jose Arjona-Medina,Jörg Kurt Wegner,Alexandre Tkatchenko,Kostiantyn Chernichenko
关键词-EN: Graph Transformer architectures, pretraining Graph Transformer, Transformer architectures, atom-level quantum-mechanical features, Data Commons ADMET
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features for the modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of drug-like compounds. We compare this pretraining strategy with two others: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and one using a self-supervised atom masking technique. After fine-tuning on Therapeutic Data Commons ADMET datasets, we evaluate the performance improvement in the different models observing that models pretrained with atomic quantum mechanical properties produce in general better results. We then analyse the latent representations and observe that the supervised strategies preserve the pretraining information after finetuning and that different pretrainings produce different trends in latent expressivity across layers. Furthermore, we find that models pretrained on atomic quantum mechanical properties capture more low-frequency laplacian eigenmodes of the input graph via the attention weights and produce better representations of atomic environments within the molecule. Application of the analysis to a much larger non-public dataset for microsomal clearance illustrates generalizability of the studied indicators. In this case the performances of the models are in accordance with the representation analysis and highlight, especially for the case of masking pretraining and atom-level quantum property pretraining, how model types with similar performance on public benchmarks can have different performances on large scale pharmaceutical data.

[AI-36] GrabDAE: An Innovative Framework for Unsupervised Domain Adaptation Utilizing Grab-Mask and Denoise Auto-Encoder

链接: https://arxiv.org/abs/2410.08023
作者: Junzhou Chen,Xuan Wen,Ronghui Zhang,Bingtao Ren,Di Wu,Zhigang Xu,Danwei Wang
关键词-EN: Unsupervised Domain Adaptation, target domain, Unsupervised Domain, Existing Unsupervised Domain, labeled source domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) aims to adapt a model trained on a labeled source domain to an unlabeled target domain by addressing the domain shift. Existing Unsupervised Domain Adaptation (UDA) methods often fall short in fully leveraging contextual information from the target domain, leading to suboptimal decision boundary separation during source and target domain alignment. To address this, we introduce GrabDAE, an innovative UDA framework designed to tackle domain shift in visual classification tasks. GrabDAE incorporates two key innovations: the Grab-Mask module, which blurs background information in target domain images, enabling the model to focus on essential, domain-relevant features through contrastive learning; and the Denoising Auto-Encoder (DAE), which enhances feature alignment by reconstructing features and filtering noise, ensuring a more robust adaptation to the target domain. These components empower GrabDAE to effectively handle unlabeled target domain data, significantly improving both classification accuracy and robustness. Extensive experiments on benchmark datasets, including VisDA-2017, Office-Home, and Office31, demonstrate that GrabDAE consistently surpasses state-of-the-art UDA methods, setting new performance benchmarks. By tackling UDA’s critical challenges with its novel feature masking and denoising approach, GrabDAE offers both significant theoretical and practical advancements in domain adaptation.

[AI-37] Probabilistic Satisfaction of Temporal Logic Constraints in Reinforcement Learning via Adaptive Policy-Switching

链接: https://arxiv.org/abs/2410.08022
作者: Xiaoshan Lin,Sadık Bera Yüksel,Yasin Yazıcıoğlu,Derya Aksaray
关键词-EN: Constrained Reinforcement Learning, Constrained Reinforcement, traditional reinforcement learning, Reinforcement Learning, traditional reinforcement
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Constrained Reinforcement Learning (CRL) is a subset of machine learning that introduces constraints into the traditional reinforcement learning (RL) framework. Unlike conventional RL which aims solely to maximize cumulative rewards, CRL incorporates additional constraints that represent specific mission requirements or limitations that the agent must comply with during the learning process. In this paper, we address a type of CRL problem where an agent aims to learn the optimal policy to maximize reward while ensuring a desired level of temporal logic constraint satisfaction throughout the learning process. We propose a novel framework that relies on switching between pure learning (reward maximization) and constraint satisfaction. This framework estimates the probability of constraint satisfaction based on earlier trials and properly adjusts the probability of switching between learning and constraint satisfaction policies. We theoretically validate the correctness of the proposed algorithm and demonstrate its performance and scalability through comprehensive simulations.

[AI-38] Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs

链接: https://arxiv.org/abs/2410.08020
作者: Jonas Hübotter,Sascha Bongni,Ido Hakimi,Andreas Krause
关键词-EN: Nearest Neighbor retrieval, Recent efforts, Nearest Neighbor, Neighbor retrieval, automatic data selection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbors retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, limiting its effectiveness or even hurting performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model’s response given a prompt, which unifies ideas from retrieval and active learning. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT accounts for information duplication and optimizes the overall information gain of the selected examples. We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. We provide the \textttactiveft (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval.

[AI-39] owards Synergistic Generalized and Efficient Dual-System for Robotic Manipulation

链接: https://arxiv.org/abs/2410.08001
作者: Qingwen Bu,Hongyang Li,Li Chen,Jisong Cai,Jia Zeng,Heming Cui,Maoqing Yao,Yu Qiao
关键词-EN: versatile robotic systems, facilitate broad adaptability, large cross-embodiment data, cross-embodiment data corpus, increasing demand
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:The increasing demand for versatile robotic systems to operate in diverse and dynamic environments has emphasized the importance of a generalist policy, which leverages a large cross-embodiment data corpus to facilitate broad adaptability and high-level reasoning. However, the generalist would struggle with inefficient inference and cost-expensive training. The specialist policy, instead, is curated for specific domain data and excels at task-level precision with efficiency. Yet, it lacks the generalization capacity for a wide range of applications. Inspired by these observations, we introduce RoboDual, a synergistic dual-system that supplements the merits of both generalist and specialist policy. A diffusion transformer-based specialist is devised for multi-step action rollouts, exquisitely conditioned on the high-level task understanding and discretized action output of a vision-language-action (VLA) based generalist. Compared to OpenVLA, RoboDual achieves 26.7% improvement in real-world setting and 12% gain on CALVIN by introducing a specialist policy with merely 20M trainable parameters. It maintains strong performance with 5% of demonstration data only, and enables a 3.8 times higher control frequency in real-world deployment. Code would be made publicly available. Our project page is hosted at: this https URL

[AI-40] Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets

链接: https://arxiv.org/abs/2410.07991
作者: Tommaso Giorgi,Lorenzo Cima,Tiziano Fagni,Marco Avvenuti,Stefano Cresci
关键词-EN: online platforms exacerbated, hate speech detection, hate speech, speech detection systems, speech detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The rise of online platforms exacerbated the spread of hate speech, demanding scalable and effective detection. However, the accuracy of hate speech detection systems heavily relies on human-labeled data, which is inherently susceptible to biases. While previous work has examined the issue, the interplay between the characteristics of the annotator and those of the target of the hate are still unexplored. We fill this gap by leveraging an extensive dataset with rich socio-demographic information of both annotators and targets, uncovering how human biases manifest in relation to the target’s attributes. Our analysis surfaces the presence of widespread biases, which we quantitatively describe and characterize based on their intensity and prevalence, revealing marked differences. Furthermore, we compare human biases with those exhibited by persona-based LLMs. Our findings indicate that while persona-based LLMs do exhibit biases, these differ significantly from those of human annotators. Overall, our work offers new and nuanced results on human biases in hate speech annotations, as well as fresh insights into the design of AI-driven hate speech detection systems.

[AI-41] MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.07981
作者: Andrei Manolache,Dragos Tantaru,Mathias Niepert
关键词-EN: molecular representation learning, multimodal molecular representation, simple transformer-based baseline, SMILES strings, molecular representation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Machine Learning for Structural Biology Workshop, NeurIPS 2024

点击查看摘要

Abstract:In this work, we propose a simple transformer-based baseline for multimodal molecular representation learning, integrating three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules. A key aspect of our approach is the aggregation of 3D conformers, allowing the model to account for the fact that molecules can adopt multiple conformations-an important factor for accurate molecular representation. The tokens for each modality are extracted using modality-specific encoders: a transformer for SMILES strings, a message-passing neural network for 2D graphs, and an equivariant neural network for 3D conformers. The flexibility and modularity of this framework enable easy adaptation and replacement of these encoders, making the model highly versatile for different molecular tasks. The extracted tokens are then combined into a unified multimodal sequence, which is processed by a downstream transformer for prediction tasks. To efficiently scale our model for large multimodal datasets, we utilize Flash Attention 2 and bfloat16 precision. Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets, demonstrating its effectiveness as a strong baseline for multimodal molecular representation learning.

[AI-42] D-Waves Nonlinear-Program Hybrid Solver: Description and Performance Analysis

链接: https://arxiv.org/abs/2410.07980
作者: Eneko Osaba,Pablo Miranda-Rodriguez
关键词-EN: advanced quantum-classical algorithms, quantum computing, development of advanced, advanced quantum-classical, quantum-classical algorithms
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: 10 pages, 8 figures and 7 tables

点击查看摘要

Abstract:The development of advanced quantum-classical algorithms is among the most prominent strategies in quantum computing. Numerous hybrid solvers have been introduced recently. Many of these methods are created ad hoc to address specific use cases. However, several well-established schemes are frequently utilized to address optimization problems. In this context, D-Wave launched the Hybrid Solver Service in 2020, offering a portfolio of methods designed to accelerate time-to-solution for users aiming to optimize performance and operational processes. Recently, a new technique has been added to this portfolio: the Nonlinear-Program Hybrid Solver. This paper describes this solver and evaluates its performance through a benchmark of 45 instances across three combinatorial optimization problems: the Traveling Salesman Problem, the Knapsack Problem, and the Maximum Cut Problem. To facilitate the use of this relatively unexplored solver, we provide details of the implementation used to solve these three optimization problems.

[AI-43] Doobs Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling NEURIPS2024

链接: https://arxiv.org/abs/2410.07974
作者: Yuanqi Du,Michael Plainer,Rob Brekelmans,Chenru Duan,Frank Noé,Carla P. Gomes,Alan Apsuru-Guzik,Kirill Neklyudov
关键词-EN: poses significant computational, significant computational challenges, computational challenges due, fundamental problem arising, exponentially large space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
*备注: Accepted as Spotlight at Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob’s h-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob’s h -transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.

[AI-44] Neural Reasoning Networks: Efficient Interpretable Neural Networks With Automatic Textual Explanations

链接: https://arxiv.org/abs/2410.07966
作者: Stephen Carrow,Kyle Harper Erwin,Olga Vilenskaia,Parikshit Ram,Tim Klinger,Naweed Aghmad Khan,Ndivhuwo Makondo,Alexander Gray
关键词-EN: Neural Reasoning Networks, Recent advances, ensure fairness, legal compliance, advances in machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning have led to a surge in adoption of neural networks for various tasks, but lack of interpretability remains an issue for many others in which an understanding of the features influencing the prediction is necessary to ensure fairness, safety, and legal compliance. In this paper we consider one class of such tasks, tabular dataset classification, and propose a novel neuro-symbolic architecture, Neural Reasoning Networks (NRN), that is scalable and generates logically sound textual explanations for its predictions. NRNs are connected layers of logical neurons which implement a form of real valued logic. A training algorithm (R-NRN) learns the weights of the network as usual using gradient descent optimization with backprop, but also learns the network structure itself using a bandit-based optimization. Both are implemented in an extension to PyTorch (this https URL) that takes full advantage of GPU scaling and batched training. Evaluation on a diverse set of 22 open-source datasets for tabular classification demonstrates performance (measured by ROC AUC) which improves over multi-layer perceptron (MLP) and is statistically similar to other state-of-the-art approaches such as Random Forest, XGBoost and Gradient Boosted Trees, while offering 43% faster training and a more than 2 orders of magnitude reduction in the number of parameters required, on average. Furthermore, R-NRN explanations are shorter than the compared approaches while producing more accurate feature importance scores.

[AI-45] owards Assurance of LLM Adversarial Robustness using Ontology-Driven Argumentation

链接: https://arxiv.org/abs/2410.07962
作者: Tomas Bueno Momcilovic,Beat Buesser,Giulio Zizzo,Mark Purcell,Tomas Bueno Momcilovic
关键词-EN: large language models, challenges remain, ensuring their security, impressive adaptability, adaptability of large
类目: Artificial Intelligence (cs.AI)
*备注: To be published in xAI 2024, late-breaking track

点击查看摘要

Abstract:Despite the impressive adaptability of large language models (LLMs), challenges remain in ensuring their security, transparency, and interpretability. Given their susceptibility to adversarial attacks, LLMs need to be defended with an evolving combination of adversarial training and guardrails. However, managing the implicit and heterogeneous knowledge for continuously assuring robustness is difficult. We introduce a novel approach for assurance of the adversarial robustness of LLMs based on formal argumentation. Using ontologies for formalization, we structure state-of-the-art attacks and defenses, facilitating the creation of a human-readable assurance case, and a machine-readable representation. We demonstrate its application with examples in English language and code translation tasks, and provide implications for theory and practice, by targeting engineers, data scientists, users, and auditors.

[AI-46] COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act

链接: https://arxiv.org/abs/2410.07959
作者: Philipp Guldimann,Alexander Spiridonov,Robin Staab,Nikola Jovanović,Mark Vero,Velko Vechev,Anna Gueorguieva,Mislav Balunović,Nikola Konstantinov,Pavol Bielik,Petar Tsankov,Martin Vechev
关键词-EN: Artificial Intelligence Act, assess models’ compliance, Artificial Intelligence, lacks clear technical, clear technical interpretation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The EU’s Artificial Intelligence Act (AI Act) is a significant step towards responsible AI development, but lacks clear technical interpretation, making it difficult to assess models’ compliance. This work presents COMPL-AI, a comprehensive framework consisting of (i) the first technical interpretation of the EU AI Act, translating its broad regulatory requirements into measurable technical requirements, with the focus on large language models (LLMs), and (ii) an open-source Act-centered benchmarking suite, based on thorough surveying and implementation of state-of-the-art LLM benchmarks. By evaluating 12 prominent LLMs in the context of COMPL-AI, we reveal shortcomings in existing models and benchmarks, particularly in areas like robustness, safety, diversity, and fairness. This work highlights the need for a shift in focus towards these aspects, encouraging balanced development of LLMs and more comprehensive regulation-aligned benchmarks. Simultaneously, COMPL-AI for the first time demonstrates the possibilities and difficulties of bringing the Act’s obligations to a more concrete, technical level. As such, our work can serve as a useful first step towards having actionable recommendations for model providers, and contributes to ongoing efforts of the EU to enable application of the Act, such as the drafting of the GPAI Code of Practice.

[AI-47] he Function-Representation Unification Framework

链接: https://arxiv.org/abs/2410.07928
作者: Alfredo Ibias,Hector Antona,Guillem Ramirez-Miranda,Enric Guinovart,Eduard Alarcon
关键词-EN: Cognitive Architectures, artificial cognition, research into developing, developing an artificial, model of computation
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cognitive Architectures are the forefront of our research into developing an artificial cognition. However, they approach the problem from a separated memory and program model of computation. This model of computation poses a fundamental problem: the knowledge retrieval heuristic. In this paper we propose to solve this problem by using a new model of computation, one where the memory and the program are united: the Function-Representation. We propose a whole framework about how to implement and use these Function-Representations, and we explore their potential through mathematical definitions and proofs. We also talk about different ways to organise multiple Function-Representations, and explore the kind of functions that these Function-Representations can implement. Finally, we also explore the limitations of our proposal.

[AI-48] Deep Learning for Generalised Planning with Background Knowledge

链接: https://arxiv.org/abs/2410.07923
作者: Dillon Z. Chen,Rostislav Horčík,Gustav Šír
关键词-EN: recently drawn attention, declarative problem solving, Automated planning, form of declarative, recently drawn
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Automated planning is a form of declarative problem solving which has recently drawn attention from the machine learning (ML) community. ML has been applied to planning either as a way to test `reasoning capabilities’ of architectures, or more pragmatically in an attempt to scale up solvers with learned domain knowledge. In practice, planning problems are easy to solve but hard to optimise. However, ML approaches still struggle to solve many problems that are often easy for both humans and classical planners. In this paper, we thus propose a new ML approach that allows users to specify background knowledge (BK) through Datalog rules to guide both the learning and planning processes in an integrated fashion. By incorporating BK, our approach bypasses the need to relearn how to solve problems from scratch and instead focuses the learning on plan quality optimisation. Experiments with BK demonstrate that our method successfully scales and learns to plan efficiently with high quality solutions from small training data generated in under 5 seconds.

[AI-49] Meta-Learning Integration in Hierarchical Reinforcement Learning for Advanced Task Complexity

链接: https://arxiv.org/abs/2410.07921
作者: Arash Khajooeinejad,Masoumeh Chapariniya
关键词-EN: Hierarchical Reinforcement Learning, Hierarchical Reinforcement, effectively tackles complex, Reinforcement Learning, effectively tackles
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hierarchical Reinforcement Learning (HRL) effectively tackles complex tasks by decomposing them into structured policies. However, HRL agents often face challenges with efficient exploration and rapid adaptation. To address this, we integrate meta-learning into HRL to enhance the agent’s ability to learn and adapt hierarchical policies swiftly. Our approach employs meta-learning for rapid task adaptation based on prior experience, while intrinsic motivation mechanisms encourage efficient exploration by rewarding novel state visits. Specifically, our agent uses a high-level policy to select among multiple low-level policies operating within custom grid environments. We utilize gradient-based meta-learning with differentiable inner-loop updates, enabling optimization across a curriculum of increasingly difficult tasks. Experimental results demonstrate that our meta-learned hierarchical agent significantly outperforms traditional HRL agents without meta-learning and intrinsic motivation. The agent exhibits accelerated learning, higher cumulative rewards, and improved success rates in complex grid environments. These findings suggest that integrating meta-learning with HRL, alongside curriculum learning and intrinsic motivation, substantially enhances the agent’s capability to handle complex tasks.

[AI-50] Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines

链接: https://arxiv.org/abs/2410.07896
作者: Junyu Lai,Jiahe Xu,Yao Yang,Yunpeng Huang,Chun Cao,Jingwei Xu
关键词-EN: Large Language Models, natural language processing, demonstrated remarkable capabilities, Large Language, natural language
类目: Artificial Intelligence (cs.AI)
*备注: 30 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing and reasoning tasks. However, their performance in the foundational domain of arithmetic remains unsatisfactory. When dealing with arithmetic tasks, LLMs often memorize specific examples rather than learning the underlying computational logic, limiting their ability to generalize to new problems. In this paper, we propose a Composable Arithmetic Execution Framework (CAEF) that enables LLMs to learn to execute step-by-step computations by emulating Turing Machines, thereby gaining a genuine understanding of computational logic. Moreover, the proposed framework is highly scalable, allowing composing learned operators to significantly reduce the difficulty of learning complex operators. In our evaluation, CAEF achieves nearly 100% accuracy across seven common mathematical operations on the LLaMA 3.1-8B model, effectively supporting computations involving operands with up to 100 digits, a level where GPT-4o falls short noticeably in some settings.

[AI-51] Benchmarking Agent ic Workflow Generation

链接: https://arxiv.org/abs/2410.07869
作者: Shuofei Qiao,Runnan Fang,Zhisong Qiu,Xiaobin Wang,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen
关键词-EN: Large Language Models, Large Language, driven significant advancements, decomposing complex problems, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent’s workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset will be available at this https URL.

[AI-52] he Sets of Power

链接: https://arxiv.org/abs/2410.07867
作者: Joao Marques-Silva(1),Carlos Mencía(2),Raúl Mencía(2) ((1) ICREA, University of Lleida, Spain, (2) University of Oviedo, Spain)
关键词-EN: measures of importance, voting power, subject of extensive, Measures, Measures of voting
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Measures of voting power have been the subject of extensive research since the mid 1940s. More recently, similar measures of relative importance have been studied in other domains that include inconsistent knowledge bases, intensity of attacks in argumentation, different problems in the analysis of database management, and explainability. This paper demonstrates that all these examples are instantiations of computing measures of importance for a rather more general problem domain. The paper then shows that the best-known measures of importance can be computed for any reference set whenever one is given a monotonically increasing predicate that partitions the subsets of that reference set. As a consequence, the paper also proves that measures of importance can be devised in several domains, for some of which such measures have not yet been studied nor proposed. Furthermore, the paper highlights several research directions related with computing measures of importance.

[AI-53] System-2 Reasoning via Generality and Adaptation NEURIPS2024

链接: https://arxiv.org/abs/2410.07866
作者: Sejin Kim,Sundong Kim
关键词-EN: Artificial General Intelligence, achieving Artificial General, General Intelligence, Artificial General, current models struggle
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024 Workshop on System 2 Reasoning At Scale

点击查看摘要

Abstract:While significant progress has been made in task-specific applications, current models struggle with deep reasoning, generality, and adaptation – key components of System-2 reasoning that are crucial for achieving Artificial General Intelligence (AGI). Despite the promise of approaches such as program synthesis, language models, and transformers, these methods often fail to generalize beyond their training data and to adapt to novel tasks, limiting their ability to perform human-like reasoning. This paper explores the limitations of existing approaches in achieving advanced System-2 reasoning and highlights the importance of generality and adaptation for AGI. Moreover, we propose four key research directions to address these gaps: (1) learning human intentions from action sequences, (2) combining symbolic and neural models, (3) meta-learning for unfamiliar environments, and (4) reinforcement learning to reason multi-step. Through these directions, we aim to advance the ability to generalize and adapt, bringing computational models closer to the reasoning capabilities required for AGI.

[AI-54] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

链接: https://arxiv.org/abs/2410.07864
作者: Songming Liu,Lingxuan Wu,Bangguo Li,Hengkai Tan,Huayu Chen,Zhengyi Wang,Ke Xu,Hang Su,Jun Zhu
关键词-EN: extremely challenging due, developing foundation models, multi-modal action distributions, Robotics Diffusion Transformer, diffusion foundation model
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, conference

点击查看摘要

Abstract:Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to this https URL for the code and videos.

[AI-55] Learning to Balance Altruism and Self-interest Based on Empathy in Mixed-Motive Games

链接: https://arxiv.org/abs/2410.07863
作者: Fanqi Kong,Yizhe Huang,Song-Chun Zhu,Siyuan Qi,Xue Feng
关键词-EN: Real-world multi-agent scenarios, involve mixed motives, Real-world multi-agent, demanding altruistic agents, mixed motives
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-world multi-agent scenarios often involve mixed motives, demanding altruistic agents capable of self-protection against potential exploitation. However, existing approaches often struggle to achieve both objectives. In this paper, based on that empathic responses are modulated by inferred social relationships between agents, we propose LASE Learning to balance Altruism and Self-interest based on Empathy), a distributed multi-agent reinforcement learning algorithm that fosters altruistic cooperation through gifting while avoiding exploitation by other agents in mixed-motive games. LASE allocates a portion of its rewards to co-players as gifts, with this allocation adapting dynamically based on the social relationship – a metric evaluating the friendliness of co-players estimated by counterfactual reasoning. In particular, social relationship measures each co-player by comparing the estimated Q -function of current joint action to a counterfactual baseline which marginalizes the co-player’s action, with its action distribution inferred by a perspective-taking module. Comprehensive experiments are performed in spatially and temporally extended mixed-motive games, demonstrating LASE’s ability to promote group collaboration without compromising fairness and its capacity to adapt policies to various types of interactive co-players.

[AI-56] From Logits to Hierarchies: Hierarchical Clustering made Simple

链接: https://arxiv.org/abs/2410.07858
作者: Emanuele Palumbo,Moritz Vandenhirtz,Alain Ryser,Imant Daunhawer,Julia E. Vogt
关键词-EN: supervised machine learning, making the modeling, machine learning, intrinsically hierarchical, critical objective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The structure of many real-world datasets is intrinsically hierarchical, making the modeling of such hierarchies a critical objective in both unsupervised and supervised machine learning. Recently, novel approaches for hierarchical clustering with deep architectures have been proposed. In this work, we take a critical perspective on this line of research and demonstrate that many approaches exhibit major limitations when applied to realistic datasets, partly due to their high computational complexity. In particular, we show that a lightweight procedure implemented on top of pre-trained non-hierarchical clustering models outperforms models designed specifically for hierarchical clustering. Our proposed approach is computationally efficient and applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our findings, we illustrate how our method can also be applied in a supervised setup, recovering meaningful hierarchies from a pre-trained ImageNet classifier.

[AI-57] SNN-PAR: Energy Efficient Pedestrian Attribute Recognition via Spiking Neural Networks

链接: https://arxiv.org/abs/2410.07857
作者: Haiyang Wang,Qian Zhu,Mowen She,Yabo Li,Haoyu Song,Minghe Xu,Xiao Wang
关键词-EN: Pedestrian Attribute Recognition, Artificial neural network, Attribute Recognition, neural network, neural network based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Artificial neural network based Pedestrian Attribute Recognition (PAR) has been widely studied in recent years, despite many progresses, however, the energy consumption is still high. To address this issue, in this paper, we propose a Spiking Neural Network (SNN) based framework for energy-efficient attribute recognition. Specifically, we first adopt a spiking tokenizer module to transform the given pedestrian image into spiking feature representations. Then, the output will be fed into the spiking Transformer backbone networks for energy-efficient feature extraction. We feed the enhanced spiking features into a set of feed-forward networks for pedestrian attribute recognition. In addition to the widely used binary cross-entropy loss function, we also exploit knowledge distillation from the artificial neural network to the spiking Transformer network for more accurate attribute recognition. Extensive experiments on three widely used PAR benchmark datasets fully validated the effectiveness of our proposed SNN-PAR framework. The source code of this paper is released on \urlthis https URL.

[AI-58] MinorityPrompt: Text to Minority Image Generation via Prompt Optimization

链接: https://arxiv.org/abs/2410.07838
作者: Soobin Um,Jong Chul Ye
关键词-EN: latent diffusion models, diffusion models, latent diffusion, minority samples, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures

点击查看摘要

Abstract:We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of text-conditional data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for producing high-quality generations. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that can encourage the emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes the generation of minority features by incorporating a carefully-crafted likelihood objective. Our comprehensive experiments, conducted across various types of T2I models, demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.

[AI-59] Masked Generative Priors Improve World Models Sequence Modelling Capabilities

链接: https://arxiv.org/abs/2410.07836
作者: Cristian Meo,Mircea Lica,Zarif Ikram,Akihiro Nakano,Vedant Shah,Aniket Rajiv Didolkar,Dianbo Liu,Anirudh Goyal,Justin Dauwels
关键词-EN: Deep Reinforcement Learning, Transformer-based World Models, creating artificial agents, world models, Deep Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.

[AI-60] LaB-CL: Localized and Balanced Contrastive Learning for improving parking slot detection

链接: https://arxiv.org/abs/2410.07832
作者: U Jin Jeong,Sumin Roh,Il Yong Chun
关键词-EN: Parking slot detection, Parking slot, slot detection, autonomous parking systems, slot
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Parking slot detection is an essential technology in autonomous parking systems. In general, the classification problem of parking slot detection consists of two tasks, a task determining whether localized candidates are junctions of parking slots or not, and the other that identifies a shape of detected junctions. Both classification tasks can easily face biased learning toward the majority class, degrading classification performances. Yet, the data imbalance issue has been overlooked in parking slot detection. We propose the first supervised contrastive learning framework for parking slot detection, Localized and Balanced Contrastive Learning for improving parking slot detection (LaB-CL). The proposed LaB-CL framework uses two main approaches. First, we propose to include class prototypes to consider representations from all classes in every mini batch, from the local perspective. Second, we propose a new hard negative sampling scheme that selects local representations with high prediction error. Experiments with the benchmark dataset demonstrate that the proposed LaB-CL framework can outperform existing parking slot detection methods.

[AI-61] Mitigating Gender Bias in Code Large Language Models via Model Editing

链接: https://arxiv.org/abs/2410.07820
作者: Zhanyue Qin,Haochuan Wang,Zecheng Wang,Deyuan Liu,Cunhang Fan,Zhao Lv,Zhiying Tu,Dianhui Chu,Dianbo Sui
关键词-EN: program synthesis automatically, high-quality programming code, gender bias, Factual Bias Score, large language model
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In recent years, with the maturation of large language model (LLM) technology and the emergence of high-quality programming code datasets, researchers have become increasingly confident in addressing the challenges of program synthesis automatically. However, since most of the training samples for LLMs are unscreened, it is inevitable that LLMs’ performance may not align with real-world scenarios, leading to the presence of social bias. To evaluate and quantify the gender bias in code LLMs, we propose a dataset named CodeGenBias (Gender Bias in the Code Generation) and an evaluation metric called FB-Score (Factual Bias Score) based on the actual gender distribution of correlative professions. With the help of CodeGenBias and FB-Score, we evaluate and analyze the gender bias in eight mainstream Code LLMs. Previous work has demonstrated that model editing methods that perform well in knowledge editing have the potential to mitigate social bias in LLMs. Therefore, we develop a model editing approach named MG-Editing (Multi-Granularity model Editing), which includes the locating and editing phases. Our model editing method MG-Editing can be applied at five different levels of model parameter granularity: full parameters level, layer level, module level, row level, and neuron level. Extensive experiments not only demonstrate that our MG-Editing can effectively mitigate the gender bias in code LLMs while maintaining their general code generation capabilities, but also showcase its excellent generalization. At the same time, the experimental results show that, considering both the gender bias of the model and its general code generation capability, MG-Editing is most effective when applied at the row and neuron levels of granularity.

[AI-62] mporal-Difference Variational Continual Learning

链接: https://arxiv.org/abs/2410.07812
作者: Luckeciano C. Melo,Alessandro Abate,Yarin Gal
关键词-EN: Machine Learning models, capability of Machine, Machine Learning, crucial capability, real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks. This adaptability allows them to respond to potentially inevitable shifts in the data-generating distribution over time. However, in Continual Learning (CL) settings, models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. Variational Continual Learning methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution and enforces it to stay close to the latest posterior estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. We evaluate the proposed objectives on challenging versions of popular CL benchmarks, demonstrating that they outperform standard Variational CL methods and non-variational baselines, effectively alleviating Catastrophic Forgetting.

[AI-63] Rewriting Conversational Utterances with Instructed Large Language Models

链接: https://arxiv.org/abs/2410.07797
作者: Elnara Galimzhanova,Cristina Ioana Muntean,Franco Maria Nardini,Raffaele Perego,Guido Rocchietti
关键词-EN: large language models, text summarization, NLP tasks, recent studies, studies have shown
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Many recent studies have shown the ability of large language models (LLMs) to achieve state-of-the-art performance on many NLP tasks, such as question answering, text summarization, coding, and translation. In some cases, the results provided by LLMs are on par with those of human experts. These models’ most disruptive innovation is their ability to perform tasks via zero-shot or few-shot prompting. This capability has been successfully exploited to train instructed LLMs, where reinforcement learning with human feedback is used to guide the model to follow the user’s requests directly. In this paper, we investigate the ability of instructed LLMs to improve conversational search effectiveness by rewriting user questions in a conversational setting. We study which prompts provide the most informative rewritten utterances that lead to the best retrieval performance. Reproducible experiments are conducted on publicly-available TREC CAST datasets. The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.

[AI-64] Do Current Language Models Support Code Intelligence for R Programming Language?

链接: https://arxiv.org/abs/2410.07793
作者: ZiXiao Zhao,Fatemeh H. Fard
关键词-EN: developing Pre-trained Language, Pre-trained Language Models, Software Engineering, Recent advancements, developing Pre-trained
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in developing Pre-trained Language Models for Code (Code-PLMs) have urged many areas of Software Engineering (SE) and brought breakthrough results for many SE tasks. Though these models have achieved the state-of-the-art performance for SE tasks for many popular programming languages, such as Java and Python, the Scientific Software and its related languages like R programming language have rarely benefited or even been evaluated with the Code-PLMs. Research has shown that R has many differences with other programming languages and requires specific techniques. In this study, we provide the first insights for code intelligence for R. For this purpose, we collect and open source an R dataset, and evaluate Code-PLMs for the two tasks of code summarization and method name prediction using several settings and strategies, including the differences in two R styles, Tidy-verse and Base R. Our results demonstrate that the studied models have experienced varying degrees of performance degradation when processing R programming language code, which is supported by human evaluation. Additionally, not all models show performance improvement in R-specific tasks even after multi-language fine-tuning. The dual syntax paradigms in R significantly impact the models’ performance, particularly in code summarization tasks. Furthermore, the project-specific context inherent in R codebases significantly impacts the performance when attempting cross-project training.

[AI-65] Mastering Contact-rich Tasks by Combining Soft and Rigid Robotics with Imitation Learning

链接: https://arxiv.org/abs/2410.07787
作者: Mariano Ramírez Montero,Ebrahim Shahabi,Giovanni Franzese,Jens Kober,Barbara Mazzolai,Cosimo Della Santina
关键词-EN: control remains challenging, precise control remains, establishing safe, remains challenging, potential to revolutionize
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Soft robots have the potential to revolutionize the use of robotic systems with their capability of establishing safe, robust, and adaptable interactions with their environment, but their precise control remains challenging. In contrast, traditional rigid robots offer high accuracy and repeatability but lack the flexibility of soft robots. We argue that combining these characteristics in a hybrid robotic platform can significantly enhance overall capabilities. This work presents a novel hybrid robotic platform that integrates a rigid manipulator with a fully developed soft arm. This system is equipped with the intelligence necessary to perform flexible and generalizable tasks through imitation learning autonomously. The physical softness and machine learning enable our platform to achieve highly generalizable skills, while the rigid components ensure precision and repeatability.

[AI-66] Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models ICASSP2025

链接: https://arxiv.org/abs/2410.07771
作者: Adriana Fernandez-Lopez,Shiwei Liu,Lu Yin,Stavros Petridis,Maja Pantic
关键词-EN: Conformer-based speech recognition, large-scale Conformer-based speech, large-scale Conformer-based, speech recognition models, Conformer-based speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:This paper investigates the under-explored area of low-rank weight training for large-scale Conformer-based speech recognition models from scratch. Our study demonstrates the viability of this training paradigm for such models, yielding several notable findings. Firstly, we discover that applying a low-rank structure exclusively to the attention modules can unexpectedly enhance performance, even with a significant rank reduction of 12%. In contrast, feed-forward layers present greater challenges, as they begin to exhibit performance degradation with a moderate 50% rank reduction. Furthermore, we find that both initialization and layer-wise rank assignment play critical roles in successful low-rank training. Specifically, employing SVD initialization and linear layer-wise rank mapping significantly boosts the efficacy of low-rank weight training. Building on these insights, we introduce the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while delivering substantial reductions in parameters count (by at least 2x), and training time speedups (by 1.3x for ASR and 1.15x for AVSR).

[AI-67] GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps NEURIPS2024

链接: https://arxiv.org/abs/2410.07765
作者: Muhammad Umair Nasir,Steven James,Julian Togelius
关键词-EN: recently demonstrated great, demonstrated great success, understanding natural language, recently demonstrated, demonstrated great
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question as to what extent and in which way these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through given objectives, with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and found that GPT-4-Turbo achieved the highest score of 44.97% on GTB_Score (GTBS), a composite score that combines the three above criteria. Furthermore, we preliminarily test large reasoning models, namely o1, which scores 67.84% on GTBS, indicating that the benchmark remains challenging for current models. Code, data, and documentation are available at this https URL.

[AI-68] HARIVO: Harnessing Text-to-Image Models for Video Generation ECCV2024

链接: https://arxiv.org/abs/2410.07763
作者: Mingi Kwon,Seoung Wug Oh,Yang Zhou,Difan Liu,Joon-Young Lee,Haoran Cai,Baqiao Liu,Feng Liu,Youngjung Uh
关键词-EN: create diffusion-based video, create diffusion-based, diffusion-based video models, diffusion-based video, video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV2024

点击查看摘要

Abstract:We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. project page: this https URL

[AI-69] textitJump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

链接: https://arxiv.org/abs/2410.07761
作者: Yong-Hyun Park,Chieh-Hsin Lai,Satoshi Hayakawa,Yuhta Takida,Yuki Mitsufuji
关键词-EN: discrete diffusion models, Diffusion models, Compounding Decoding Error, continuous domains, notable success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like \tau -leaping accelerate this process, they introduce \textitCompounding Decoding Error (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we present \textitJump Your Steps (JYS), a novel approach that optimizes the allocation of discrete sampling timesteps by minimizing CDE without extra computational cost. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling.

[AI-70] Learning Low-Level Causal Relations using a Simulated Robotic Arm ICANN

链接: https://arxiv.org/abs/2410.07751
作者: Miroslav Cibula,Matthias Kerzel,Igor Farkaš
关键词-EN: complex actions, humans to predict, plan the execution, actions, causal effects
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 3 tables. Appeared in 2024 International Conference on Artificial Neural Networks (ICANN) proceedings. Published version copyrighted by Springer. This work was funded by the Horizon Europe Twinning project TERAIS, G.A. number 101079338 and in part by the Slovak Grant Agency for Science (VEGA), project 1/0373/23

点击查看摘要

Abstract:Causal learning allows humans to predict the effect of their actions on the known environment and use this knowledge to plan the execution of more complex actions. Such knowledge also captures the behaviour of the environment and can be used for its analysis and the reasoning behind the behaviour. This type of knowledge is also crucial in the design of intelligent robotic systems with common sense. In this paper, we study causal relations by learning the forward and inverse models based on data generated by a simulated robotic arm involved in two sensorimotor tasks. As a next step, we investigate feature attribution methods for the analysis of the forward model, which reveals the low-level causal effects corresponding to individual features of the state vector related to both the arm joints and the environment features. This type of analysis provides solid ground for dimensionality reduction of the state representations, as well as for the aggregation of knowledge towards the explainability of causal effects at higher levels.

[AI-71] Enhancing Federated Domain Adaptation with Multi-Domain Prototype-Based Federated Fine-Tuning

链接: https://arxiv.org/abs/2410.07738
作者: Jingyuan Zhang,Yiyang Duan,Shuaicheng Niu,Yang Cao,Wei Yang Bryan Lim
关键词-EN: Federated Domain Adaptation, unique data domains, Federated Domain, transmitting private data, shared category space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated Domain Adaptation (FDA) is a Federated Learning (FL) scenario where models are trained across multiple clients with unique data domains but a shared category space, without transmitting private data. The primary challenge in FDA is data heterogeneity, which causes significant divergences in gradient updates when using conventional averaging-based aggregation methods, reducing the efficacy of the global model. This further undermines both in-domain and out-of-domain performance (within the same federated system but outside the local client). To address this, we propose a novel framework called \textbfMulti-domain \textbfPrototype-based \textbfFederated Fine-\textbfTuning (MPFT). MPFT fine-tunes a pre-trained model using multi-domain prototypes, i.e., pretrained representations enriched with domain-specific information from category-specific local data. This enables supervised learning on the server to derive a globally optimized adapter that is subsequently distributed to local clients, without the intrusion of data privacy. Empirical results show that MPFT significantly improves both in-domain and out-of-domain accuracy over conventional methods, enhancing knowledge preservation and adaptation in FDA. Notably, MPFT achieves convergence within a single communication round, greatly reducing computation and communication costs. To ensure privacy, MPFT applies differential privacy to protect the prototypes. Additionally, we develop a prototype-based feature space hijacking attack to evaluate robustness, confirming that raw data samples remain unrecoverable even after extensive training epochs. The complete implementation of MPFL is available at \urlthis https URL.

[AI-72] On the Generalization Properties of Deep Learning for Aircraft Fuel Flow Estimation Models

链接: https://arxiv.org/abs/2410.07717
作者: Gabriel Jarry,Ramon Dalmau,Philippe Very,Junzi Sun
关键词-EN: Accurately estimating aircraft, current aviation practices, Accurately estimating, designing next-generation aircraft, aircraft
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurately estimating aircraft fuel flow is essential for evaluating new procedures, designing next-generation aircraft, and monitoring the environmental impact of current aviation practices. This paper investigates the generalization capabilities of deep learning models in predicting fuel consumption, focusing particularly on their performance for aircraft types absent from the training data. We propose a novel methodology that integrates neural network architectures with domain generalization techniques to enhance robustness and reliability across a wide range of aircraft. A comprehensive dataset containing 101 different aircraft types, separated into training and generalization sets, with each aircraft type set containing 1,000 flights. We employed the base of aircraft data (BADA) model for fuel flow estimates, introduced a pseudo-distance metric to assess aircraft type similarity, and explored various sampling strategies to optimize model performance in data-sparse regions. Our results reveal that for previously unseen aircraft types, the introduction of noise into aircraft and engine parameters improved model generalization. The model is able to generalize with acceptable mean absolute percentage error between 2% and 10% for aircraft close to existing aircraft, while performance is below 1% error for known aircraft in the training set. This study highlights the potential of combining domain-specific insights with advanced machine learning techniques to develop scalable, accurate, and generalizable fuel flow estimation models.

[AI-73] Learning Tree Pattern Transformations

链接: https://arxiv.org/abs/2410.07708
作者: Daniel Neider,Leif Sabellek,Johannes Schmidt,Fabian Vehlken,Thomas Zeume
关键词-EN: XML or JSON, understanding tree-structured data, JSON data, structurally differs, computer science
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Explaining why and how a tree t structurally differs from another tree t^* is a question that is encountered throughout computer science, including in understanding tree-structured data such as XML or JSON data. In this article, we explore how to learn explanations for structural differences between pairs of trees from sample data: suppose we are given a set (t_1, t_1^),\dots, (t_n, t_n^)\ of pairs of labelled, ordered trees; is there a small set of rules that explains the structural differences between all pairs (t_i, t_i^*) ? This raises two research questions: (i) what is a good notion of “rule” in this context?; and (ii) how can sets of rules explaining a data set be learnt algorithmically? We explore these questions from the perspective of database theory by (1) introducing a pattern-based specification language for tree transformations; (2) exploring the computational complexity of variants of the above algorithmic problem, e.g. showing NP-hardness for very restricted variants; and (3) discussing how to solve the problem for data from CS education research using SAT solvers. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Databases (cs.DB) Cite as: arXiv:2410.07708 [cs.LG] (or arXiv:2410.07708v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.07708 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-74] Agent Bank: Towards Generalized LLM Agents via Fine-Tuning on 50000 Interaction Trajectories EMNLP2024

链接: https://arxiv.org/abs/2410.07706
作者: Yifan Song,Weimin Xiong,Xiutian Zhao,Dawei Zhu,Wenhao Wu,Ke Wang,Cheng Li,Wei Peng,Sujian Li
关键词-EN: holds significant promise, open-source large language, Fine-tuning on agent-environment, large language models, data holds significant
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Findings of EMNLP 2024

点击查看摘要

Abstract:Fine-tuning on agent-environment interaction trajectory data holds significant promise for surfacing generalized agent capabilities in open-source large language models (LLMs). In this work, we introduce AgentBank, by far the largest trajectory tuning data collection featuring more than 50k diverse high-quality interaction trajectories which comprises 16 tasks covering five distinct agent skill dimensions. Leveraging a novel annotation pipeline, we are able to scale the annotated trajectories and generate a trajectory dataset with minimized difficulty bias. Furthermore, we fine-tune LLMs on AgentBank to get a series of agent models, Samoyed. Our comparative experiments demonstrate the effectiveness of scaling the interaction trajectory data to acquire generalized agent capabilities. Additional studies also reveal some key observations regarding trajectory tuning and agent skill generalization.

[AI-75] Adversarial Robustness Overestimation and Instability in TRADES

链接: https://arxiv.org/abs/2410.07675
作者: Jonathan Weiping Li,Ren-Wei Liang,Cheng-Han Yeh,Cheng-Chang Tsai,Kuanchun Yu,Chun-Shien Lu,Shang-Tse Chen
关键词-EN: paper examines, probabilistic robustness overestimation, PGD validation accuracy, overestimation, TRADES
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task. This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking. We further analyze the parameters contributing to unstable models that lead to overestimation. Our findings indicate that smaller batch sizes, lower beta values (which control the weight of the robust loss term in TRADES), larger learning rates, and higher class complexity (e.g., CIFAR-100 versus CIFAR-10) are associated with an increased likelihood of robustness overestimation. By examining metrics such as the First-Order Stationary Condition (FOSC), inner-maximization, and gradient information, we identify the underlying cause of this phenomenon as gradient masking and provide insights into it. Furthermore, our experiments show that certain unstable training instances may return to a state without robust overestimation, inspiring our attempts at a solution. In addition to adjusting parameter settings to reduce instability or retraining when overestimation occurs, we recommend incorporating Gaussian noise in inputs when the FOSC score exceed the threshold. This method aims to mitigate robustness overestimation of TRADES and other similar methods at its source, ensuring more reliable representation of adversarial robustness during evaluation.

[AI-76] Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference

链接: https://arxiv.org/abs/2410.07673
作者: Jianxing Yu,Shiqi Wang,Han Yin,Zhenlong Sun,Ruobing Xie,Bo Zhang,Yanghui Rao
关键词-EN: detecting clickbait posts, paper focuses, focuses on detecting, detecting clickbait, Web
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper focuses on detecting clickbait posts on the Web. These posts often use eye-catching disinformation in mixed modalities to mislead users to click for profit. That affects the user experience and thus would be blocked by content provider. To escape detection, malicious creators use tricks to add some irrelevant non-bait content into bait posts, dressing them up as legal to fool the detector. This content often has biased relations with non-bait labels, yet traditional detectors tend to make predictions based on simple co-occurrence rather than grasping inherent factors that lead to malicious behavior. This spurious bias would easily cause misjudgments. To address this problem, we propose a new debiased method based on causal inference. We first employ a set of features in multiple modalities to characterize the posts. Considering these features are often mixed up with unknown biases, we then disentangle three kinds of latent factors from them, including the invariant factor that indicates intrinsic bait intention; the causal factor which reflects deceptive patterns in a certain scenario, and non-causal noise. By eliminating the noise that causes bias, we can use invariant and causal factors to build a robust model with good generalization ability. Experiments on three popular datasets show the effectiveness of our approach.

[AI-77] MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

链接: https://arxiv.org/abs/2410.07672
作者: Yougang Lyu,Lingyong Yan,Zihan Wang,Dawei Yin,Pengjie Ren,Maarten de Rijke,Zhaochun Ren
关键词-EN: large language models, achieving near-human capabilities, weak teachers, strong students, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:As large language models (LLMs) are rapidly advancing and achieving near-human capabilities, aligning them with human values is becoming more urgent. In scenarios where LLMs outperform humans, we face a weak-to-strong alignment problem where we need to effectively align strong student LLMs through weak supervision generated by weak teachers. Existing alignment methods mainly focus on strong-to-weak alignment and self-alignment settings, and it is impractical to adapt them to the much harder weak-to-strong alignment setting. To fill this gap, we propose a multi-agent contrastive preference optimization (MACPO) framework. MACPO facilitates weak teachers and strong students to learn from each other by iteratively reinforcing unfamiliar positive behaviors while penalizing familiar negative ones. To get this, we devise a mutual positive behavior augmentation strategy to encourage weak teachers and strong students to learn from each other’s positive behavior and further provide higher quality positive behavior for the next iteration. Additionally, we propose a hard negative behavior construction strategy to induce weak teachers and strong students to generate familiar negative behavior by fine-tuning on negative behavioral data. Experimental results on the HH-RLHF and PKU-SafeRLHF datasets, evaluated using both automatic metrics and human judgments, demonstrate that MACPO simultaneously improves the alignment performance of strong students and weak teachers. Moreover, as the number of weak teachers increases, MACPO achieves better weak-to-strong alignment performance through more iteration optimization rounds.

[AI-78] DISCO: A Hierarchical Disentangled Cognitive Diagnosis Framework for Interpretable Job Recommendation ICDM2024

链接: https://arxiv.org/abs/2410.07671
作者: Xiaoshan Yu,Chuan Qin,Qi Zhang,Chen Zhu,Haiping Ma,Xingyi Zhang,Hengshu Zhu
关键词-EN: created unprecedented opportunities, accurately pinpointing positions, online recruitment platforms, job seekers, skills and preferences
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by ICDM 2024. 10 pages

点击查看摘要

Abstract:The rapid development of online recruitment platforms has created unprecedented opportunities for job seekers while concurrently posing the significant challenge of quickly and accurately pinpointing positions that align with their skills and preferences. Job recommendation systems have significantly alleviated the extensive search burden for job seekers by optimizing user engagement metrics, such as clicks and applications, thus achieving notable success. In recent years, a substantial amount of research has been devoted to developing effective job recommendation models, primarily focusing on text-matching based and behavior modeling based methods. While these approaches have realized impressive outcomes, it is imperative to note that research on the explainability of recruitment recommendations remains profoundly unexplored. To this end, in this paper, we propose DISCO, a hierarchical Disentanglement based Cognitive diagnosis framework, aimed at flexibly accommodating the underlying representation learning model for effective and interpretable job recommendations. Specifically, we first design a hierarchical representation disentangling module to explicitly mine the hierarchical skill-related factors implied in hidden representations of job seekers and jobs. Subsequently, we propose level-aware association modeling to enhance information communication and robust representation learning both inter- and intra-level, which consists of the interlevel knowledge influence module and the level-wise contrastive learning. Finally, we devise an interaction diagnosis module incorporating a neural diagnosis function for effectively modeling the multi-level recruitment interaction process between job seekers and jobs, which introduces the cognitive measurement theory.

[AI-79] Almost Minimax Optimal Best Arm Identification in Piecewise Stationary Linear Bandits NEURIPS2024

链接: https://arxiv.org/abs/2410.07638
作者: Yunlong Hou,Vincent Y. F. Tan,Zixin Zhong
关键词-EN: varepsilon, BAI, piecewise stationary linear, stationary linear bandit, environment randomly samples
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 69 pages. Accepted to NeurIPS 2024

点击查看摘要

Abstract:We propose a \em novel piecewise stationary linear bandit (PSLB) model, where the environment randomly samples a context from an unknown probability distribution at each changepoint, and the quality of an arm is measured by its return averaged over all contexts. The contexts and their distribution, as well as the changepoints are unknown to the agent. We design \em Piecewise-Stationary \varepsilon -Best Arm Identification ^+ (PS \varepsilon BAI ^+ ), an algorithm that is guaranteed to identify an \varepsilon -optimal arm with probability \ge 1-\delta and with a minimal number of samples. PS \varepsilon BAI ^+ consists of two subroutines, PS \varepsilon BAI and \sc Naïve \varepsilon -BAI (N \varepsilon BAI), which are executed in parallel. PS \varepsilon BAI actively detects changepoints and aligns contexts to facilitate the arm identification process. When PS \varepsilon BAI and N \varepsilon BAI are utilized judiciously in parallel, PS \varepsilon BAI ^+ is shown to have a finite expected sample complexity. By proving a lower bound, we show the expected sample complexity of PS \varepsilon BAI ^+ is optimal up to a logarithmic factor. We compare PS \varepsilon BAI ^+ to baseline algorithms using numerical experiments which demonstrate its efficiency. Both our analytical and numerical results corroborate that the efficacy of PS \varepsilon BAI ^+ is due to the delicate change detection and context alignment procedures embedded in PS \varepsilon BAI.

[AI-80] Automatic Curriculum Expert Iteration for Reliable LLM Reasoning

链接: https://arxiv.org/abs/2410.07627
作者: Zirui Zhao,Hanze Dong,Amrita Saha,Caiming Xiong,Doyen Sahoo
关键词-EN: generating plausible, inaccurate content, excessive refusals, persist as major, plausible but inaccurate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: 20 pages

点击查看摘要

Abstract:Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e. excessive refusals or defaulting to “I don’t know”) persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conservative, limiting their problem-solving capabilities. To mitigate hallucination and laziness in reasoning tasks, we propose Automatic Curriculum Expert Iteration (Auto-CEI) to enhance LLM reasoning and align responses to the model’s capabilities–assertively answering within its limits and declining when tasks exceed them. In our method, Expert Iteration explores the reasoning trajectories near the LLM policy, guiding incorrect paths back on track to reduce compounding errors and improve robustness; it also promotes appropriate “I don’t know” responses after sufficient reasoning attempts. The curriculum automatically adjusts rewards, incentivizing extended reasoning before acknowledging incapability, thereby pushing the limits of LLM reasoning and aligning its behaviour with these limits. We compare Auto-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where Auto-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness.

[AI-81] Moyun: A Diffusion-Based Model for Style-Specific Chinese Calligraphy Generation

链接: https://arxiv.org/abs/2410.07618
作者: Kaiyuan Liu,Jiahao Mei,Hengyu Zhang,Yihuai Zhang,Xingjiao Wu,Daoguo Dong,Liang He
关键词-EN: Chinese calligraphy generation, achieved style transfer, style remains challenging, character style remains, Chinese calligraphy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although Chinese calligraphy generation has achieved style transfer, generating calligraphy by specifying the calligrapher, font, and character style remains challenging. To address this, we propose a new Chinese calligraphy generation model ‘Moyun’ , which replaces the Unet in the Diffusion model with Vision Mamba and introduces the TripleLabel control mechanism to achieve controllable calligraphy generation. The model was tested on our large-scale dataset ‘Mobao’ of over 1.9 million images, and the results demonstrate that ‘Moyun’ can effectively control the generation process and produce calligraphy in the specified style. Even for calligraphy the calligrapher has not written, ‘Moyun’ can generate calligraphy that matches the style of the calligrapher.

[AI-82] A Survey for Deep Reinforcement Learning Based Network Intrusion Detection

链接: https://arxiv.org/abs/2410.07612
作者: Wanrong Yang,Alberto Acuto,Yihang Zhou,Dominik Wojtczak
关键词-EN: intrusion detection, network intrusion detection, intrusion detection systems, DRL, detection
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Cyber-attacks are becoming increasingly sophisticated and frequent, highlighting the importance of network intrusion detection systems. This paper explores the potential and challenges of using deep reinforcement learning (DRL) in network intrusion detection. It begins by introducing key DRL concepts and frameworks, such as deep Q-networks and actor-critic algorithms, and reviews recent research utilizing DRL for intrusion detection. The study evaluates challenges related to model training efficiency, detection of minority and unknown class attacks, feature selection, and handling unbalanced datasets. The performance of DRL models is comprehensively analyzed, showing that while DRL holds promise, many recent technologies remain underexplored. Some DRL models achieve state-of-the-art results on public datasets, occasionally outperforming traditional deep learning methods. The paper concludes with recommendations for enhancing DRL deployment and testing in real-world network scenarios, with a focus on Internet of Things intrusion detection. It discusses recent DRL architectures and suggests future policy functions for DRL-based intrusion detection. Finally, the paper proposes integrating DRL with generative methods to further improve performance, addressing current gaps and supporting more robust and adaptive network intrusion detection systems.

[AI-83] CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

链接: https://arxiv.org/abs/2410.07610
作者: Po-han Li,Sandeep P. Chinchali,Ufuk Topcu
关键词-EN: cross-modal retrieval, CSA, excel in tasks, Multimodal, CLIP excel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring 300,000\times fewer multimodal data pairs and 6\times fewer unimodal data for ImageNet classification and misinformative news captions detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.

[AI-84] A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks NEURIPS2024

链接: https://arxiv.org/abs/2410.07593
作者: Hoin Jung,Taeuk Jang,Xiaoqian Wang
关键词-EN: enabled complex multimodal, Recent advancements, image data simultaneously, complex multimodal tasks, data simultaneously
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024, the Thirty-Eighth Annual Conference on Neural Information Processing Systems

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have enabled complex multimodal tasks by processing text and image data simultaneously, significantly enhancing the field of artificial intelligence. However, these models often exhibit biases that can skew outputs towards societal stereotypes, thus necessitating debiasing strategies. Existing debiasing methods focus narrowly on specific modalities or tasks, and require extensive retraining. To address these limitations, this paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low confidence imputation (LCI) to effectively reduce biases in VLMs. SFID is versatile, maintaining the semantic integrity of outputs and costly effective by eliminating the need for retraining. Our experimental results demonstrate SFID’s effectiveness across various VLMs tasks including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation, by significantly reducing gender biases without compromising performance. This approach not only enhances the fairness of VLMs applications but also preserves their efficiency and utility across diverse scenarios.

[AI-85] Diversified and Adaptive Negative Sampling on Knowledge Graphs

链接: https://arxiv.org/abs/2410.07592
作者: Ran Liu,Zhongzhou Liu,Xiaoli Li,Hao Wu,Yuan Fang
关键词-EN: negative triplets, knowledge graph embedding, negative, knowledge graphs, Negative Sampling DANS
类目: Artificial Intelligence (cs.AI)
*备注: 30 pages, 7 figures, Journal

点击查看摘要

Abstract:In knowledge graph embedding, aside from positive triplets (ie: facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. In reality, since knowledge graphs are sparse and incomplete, negative triplets often lack explicit labels, and thus they are often obtained from various sampling strategies (eg: randomly replacing an entity in a positive triplet). An ideal sampled negative triplet should be informative enough to help the model train better. However, existing methods often ignore diversity and adaptiveness in their sampling process, which harms the informativeness of negative triplets. As such, we propose a generative adversarial approach called Diversified and Adaptive Negative Sampling DANS on knowledge graphs. DANS is equipped with a two-way generator that generates more diverse negative triplets through two pathways, and an adaptive mechanism that produces more fine-grained examples by localizing the global generator for different entities and relations. On the one hand, the two-way generator increase the overall informativeness with more diverse negative examples; on the other hand, the adaptive mechanism increases the individual sample-wise informativeness with more fine-grained sampling. Finally, we evaluate the performance of DANS on three benchmark knowledge graphs to demonstrate its effectiveness through quantitative and qualitative experiments.

[AI-86] Detecting Training Data of Large Language Models via Expectation Maximization

链接: https://arxiv.org/abs/2410.07582
作者: Gyuwan Kim,Yang Li,Evangelia Spiliopoulou,Jie Ma,Miguel Ballesteros,William Yang Wang
关键词-EN: large language models, remains undisclosed, impressive advancements, widespread deployment, deployment of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:The widespread deployment of large language models (LLMs) has led to impressive advancements, yet information about their training data, a critical factor in their performance, remains undisclosed. Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model’s training data. MIAs can offer insights into LLM outputs and help detect and address concerns such as data contamination and compliance with privacy and copyright standards. However, applying MIAs to LLMs presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership. Additionally, creating appropriate benchmarks to evaluate MIA methods is not straightforward, as training and test data distributions are often unknown. In this paper, we introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm, leveraging the duality that the estimates of these scores can be improved by each other. Membership scores and prefix scores assess how each instance is likely to be a member and discriminative as a prefix, respectively. Our method achieves state-of-the-art results on the WikiMIA dataset. To further evaluate EM-MIA, we present OLMoMIA, a benchmark built from OLMo resources, which allows us to control the difficulty of MIA tasks with varying degrees of overlap between training and test data distributions. We believe that EM-MIA serves as a robust MIA method for LLMs and that OLMoMIA provides a valuable resource for comprehensively evaluating MIA approaches, thereby driving future research in this critical area.

[AI-87] When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context

链接: https://arxiv.org/abs/2410.07567
作者: Enrique Noriega-Atala,Robert Vacareanu,Salena Torres Ashton,Adarsh Pyarelal,Clayton T. Morrison,Mihai Surdeanu
关键词-EN: neural architecture finetuned, scenario context generation, context generation, mentioned in text, introduce a neural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 9 pages, 7 figures

点击查看摘要

Abstract:We introduce a neural architecture finetuned for the task of scenario context generation: The relevant location and time of an event or entity mentioned in text. Contextualizing information extraction helps to scope the validity of automated finings when aggregating them as knowledge graphs. Our approach uses a high-quality curated dataset of time and location annotations in a corpus of epidemiology papers to train an encoder-decoder architecture. We also explored the use of data augmentation techniques during training. Our findings suggest that a relatively small fine-tuned encoder-decoder model performs better than out-of-the-box LLMs and semantic role labeling parsers to accurate predict the relevant scenario information of a particular entity or event.

[AI-88] PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

链接: https://arxiv.org/abs/2410.07563
作者: Kenshin Abe,Kaizaburo Chubachi,Yasuhiro Fujita,Yuta Hirokawa,Kentaro Imajo,Toshiki Kataoka,Hiroyoshi Komatsu,Hiroaki Mikami,Tsuguo Mogami,Shogo Murai,Kosuke Nakago,Daisuke Nishino,Toru Ogawa,Daisuke Okanohara,Yoshihiko Ozaki,Shotaro Sano,Shuji Suzuki,Tianqi Xu,Toshihiko Yanase(Preferred Elements, Inc.)
关键词-EN: Japanese proficiency, designed for Japanese, large-scale language model, language model designed, Direct Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch using 2 trillion tokens, with architecture such as QK Normalization and Z-Loss to ensure training stability during the training process. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model’s performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4.

[AI-89] COMMA: A Communicative Multimodal Multi-Agent Benchmark

链接: https://arxiv.org/abs/2410.07553
作者: Timothy Ossowski,Jixuan Chen,Danyal Maqbool,Zefan Cai,Tyler Bradshaw,Junjie Hu
关键词-EN: multi-modal agents built, large foundation models, rapid advances, advances of multi-modal, built on large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid advances of multi-modal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. By testing both agent-agent and agent-human collaborations using open-source and closed-source models, our findings reveal surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. These models struggle to outperform even a simple random agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.

[AI-90] KRAG Framework for Enhancing LLMs in the Legal Domain KR

链接: https://arxiv.org/abs/2410.07551
作者: Nguyen Ha Thanh,Ken Satoh
关键词-EN: introduces Knowledge Representation, Representation Augmented Generation, Knowledge Representation Augmented, Large Language Models, capabilities of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )

点击查看摘要

Abstract:This paper introduces Knowledge Representation Augmented Generation (KRAG), a novel framework designed to enhance the capabilities of Large Language Models (LLMs) within domain-specific applications. KRAG points to the strategic inclusion of critical knowledge entities and relationships that are typically absent in standard data sets and which LLMs do not inherently learn. In the context of legal applications, we present Soft PROLEG, an implementation model under KRAG, which uses inference graphs to aid LLMs in delivering structured legal reasoning, argumentation, and explanations tailored to user inquiries. The integration of KRAG, either as a standalone framework or in tandem with retrieval augmented generation (RAG), markedly improves the ability of language models to navigate and solve the intricate challenges posed by legal texts and terminologies. This paper details KRAG’s methodology, its implementation through Soft PROLEG, and potential broader applications, underscoring its significant role in advancing natural language understanding and processing in specialized knowledge domains.

[AI-91] OneNet: A Fine-Tuning Free Framework for Few-Shot Entity Linking via Large Language Model Prompting EMNLP2024

链接: https://arxiv.org/abs/2410.07549
作者: Xukai Liu,Ye Liu,Kai Zhang,Kehang Wang,Qi Liu,Enhong Chen
关键词-EN: associating ambiguous textual, ambiguous textual mentions, Entity Linking, Large Language Models, few-shot entity linking
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by EMNLP 2024 Main

点击查看摘要

Abstract:Entity Linking (EL) is the process of associating ambiguous textual mentions to specific entities in a knowledge base. Traditional EL methods heavily rely on large datasets to enhance their performance, a dependency that becomes problematic in the context of few-shot entity linking, where only a limited number of examples are available for training. To address this challenge, we present OneNet, an innovative framework that utilizes the few-shot learning capabilities of Large Language Models (LLMs) without the need for fine-tuning. To the best of our knowledge, this marks a pioneering approach to applying LLMs to few-shot entity linking tasks. OneNet is structured around three key components prompted by LLMs: (1) an entity reduction processor that simplifies inputs by summarizing and filtering out irrelevant entities, (2) a dual-perspective entity linker that combines contextual cues and prior knowledge for precise entity linking, and (3) an entity consensus judger that employs a unique consistency algorithm to alleviate the hallucination in the entity linking reasoning. Comprehensive evaluations across seven benchmark datasets reveal that OneNet outperforms current state-of-the-art entity linking methods.

[AI-92] Comprehensive Online Training and Deployment for Spiking Neural Networks

链接: https://arxiv.org/abs/2410.07547
作者: Zecheng Hao,Yifan Huang,Zijie Xu,Zhaofei Yu,Tiejun Huang
关键词-EN: Spiking Neural Networks, Neural Networks, Artificial Intelligence, development of Artificial, Spiking Neural
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are considered to have enormous potential in the future development of Artificial Intelligence (AI) due to their brain-inspired and energy-efficient properties. In the current supervised learning domain of SNNs, compared to vanilla Spatial-Temporal Back-propagation (STBP) training, online training can effectively overcome the risk of GPU memory explosion and has received widespread academic attention. However, the current proposed online training methods cannot tackle the inseparability problem of temporal dependent gradients and merely aim to optimize the training memory, resulting in no performance advantages compared to the STBP training models in the inference phase. To address the aforementioned challenges, we propose Efficient Multi-Precision Firing (EM-PF) model, which is a family of advanced spiking models based on floating-point spikes and binary synaptic weights. We point out that EM-PF model can effectively separate temporal gradients and achieve full-stage optimization towards computation speed and memory footprint. Experimental results have demonstrated that EM-PF model can be flexibly combined with various techniques including random back-propagation, parallel computation and channel attention mechanism, to achieve state-of-the-art performance with extremely low computational overhead in the field of online learning.

[AI-93] Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM

链接: https://arxiv.org/abs/2410.07531
作者: Haiyue Ma,Jian Liu,Ronny Krashinsky
关键词-EN: Random Number Generation, RNG, training time, dramatically impact, turn increases
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dropout, a network operator, when enabled is likely to dramatically impact the performance of Flash-Attention, which in turn increases the end-to-end training time of Large-Language-Models (LLMs). The main contributor to such performance degradation is the Random Number Generation (RNG) phase that is traditionally fused into the Flash-Attention kernel. As RNG and Attention have the same hardware bottlenecks, RNG latency can hardly be hidden within the Attention kernel. We propose overlapping RNG with previous GEMM layers in the network to hide RNG runtime and improve end-to-end performance. RNG and GEMM have distinct resource requirements and hardware bottlenecks, so they can run in parallel without compromising each other’s performance. Our fine-grained performance model, cross-validated by silicon results, shows 1.14x speedup on one transformer block (including multi-head attention and feed-forward layers) for Llama2, and up to 1.23x speedup when varying workload sizes, on GH100 GPUs with FP8 precision. Further, we extend our theoretical model to different RNG implementations and hardware architectures, and discuss the widely applicable benefits for overlapping RNG with GEMM layers. Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2410.07531 [cs.AR] (or arXiv:2410.07531v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2410.07531 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-94] Audio Explanation Synthesis with Generative Foundation Models

链接: https://arxiv.org/abs/2410.07530
作者: Alican Akman,Qiyang Sun,Björn W. Schuller
关键词-EN: intricate decision-making processes, audio foundation models, increasing success, tasks has led, improved interpretability
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The increasing success of audio foundation models across various tasks has led to a growing need for improved interpretability to understand their intricate decision-making processes better. Existing methods primarily focus on explaining these models by attributing importance to elements within the input space based on their influence on the final decision. In this paper, we introduce a novel audio explanation method that capitalises on the generative capacity of audio foundation models. Our method leverages the intrinsic representational power of the embedding space within these models by integrating established feature attribution techniques to identify significant features in this space. The method then generates listenable audio explanations by prioritising the most important features. Through rigorous benchmarking against standard datasets, including keyword spotting and speech emotion recognition, our model demonstrates its efficacy in producing audio explanations.

[AI-95] MKGL: Mastery of a Three-Word Language NEURIPS2024

链接: https://arxiv.org/abs/2410.07526
作者: Lingbing Guo,Zhongpu Bo,Zhuo Chen,Yichi Zhang,Jiaoyan Chen,Yarong Lan,Mengshu Sun,Zhiqiang Zhang,Yangyifei Luo,Qian Li,Qiang Zhang,Wen Zhang,Huajun Chen
关键词-EN: Large language models, significantly advanced performance, natural language processing, Large language, significantly advanced
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 (spotlight)

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced performance across a spectrum of natural language processing (NLP) tasks. Yet, their application to knowledge graphs (KGs), which describe facts in the form of triplets and allow minimal hallucinations, remains an underexplored frontier. In this paper, we investigate the integration of LLMs with KGs by introducing a specialized KG Language (KGL), where a sentence precisely consists of an entity noun, a relation verb, and ends with another entity noun. Despite KGL’s unfamiliar vocabulary to the LLM, we facilitate its learning through a tailored dictionary and illustrative sentences, and enhance context understanding via real-time KG context retrieval and KGL token embedding augmentation. Our results reveal that LLMs can achieve fluency in KGL, drastically reducing errors compared to conventional KG embedding methods on KG completion. Furthermore, our enhanced LLM shows exceptional competence in generating accurate three-word sentences from an initial entity and interpreting new unseen terms out of KGs.

[AI-96] Offline Inverse Constrained Reinforcement Learning for Safe-Critical Decision Making in Healthcare

链接: https://arxiv.org/abs/2410.07525
作者: Nan Fang,Guiliang Liu,Wei Gong
关键词-EN: Constrained Reinforcement Learning, Reinforcement Learning, agents overlooking common-sense, Inverse Constrained Reinforcement, Constrained Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) applied in healthcare can lead to unsafe medical decisions and treatment, such as excessive dosages or abrupt changes, often due to agents overlooking common-sense constraints. Consequently, Constrained Reinforcement Learning (CRL) is a natural choice for safe decisions. However, specifying the exact cost function is inherently difficult in healthcare. Recent Inverse Constrained Reinforcement Learning (ICRL) is a promising approach that infers constraints from expert demonstrations. ICRL algorithms model Markovian decisions in an interactive environment. These settings do not align with the practical requirement of a decision-making system in healthcare, where decisions rely on historical treatment recorded in an offline dataset. To tackle these issues, we propose the Constraint Transformer (CT). Specifically, 1) we utilize a causal attention mechanism to incorporate historical decisions and observations into the constraint modeling, while employing a Non-Markovian layer for weighted constraints to capture critical states. 2) A generative world model is used to perform exploratory data augmentation, enabling offline RL methods to simulate unsafe decision sequences. In multiple medical scenarios, empirical results demonstrate that CT can capture unsafe states and achieve strategies that approximate lower mortality rates, reducing the occurrence probability of unsafe behaviors.

[AI-97] Upcycling Large Language Models into Mixture of Experts

链接: https://arxiv.org/abs/2410.07524
作者: Ethan He,Abhinav Khattar,Ryan Prenger,Vijay Korthikanti,Zijie Yan,Tong Liu,Shiqing Fan,Ashwath Aithal,Mohammad Shoeybi,Bryan Catanzaro
关键词-EN: pre-trained dense language, Upcycling pre-trained dense, Upcycling, language models, Upcycling pre-trained
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel “virtual group” initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.

[AI-98] DemoShapley: Valuation of Demonstrations for In-Context Learning

链接: https://arxiv.org/abs/2410.07523
作者: Shan Xie,Man Luo,Chadly Daniel Stern,Mengnan Du,Lu Cheng
关键词-EN: needing task-specific fine-tuning, Large language models, Large language, leveraging in-context learning, task-specific fine-tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) leveraging in-context learning (ICL) have set new benchmarks in few-shot learning across various tasks without needing task-specific fine-tuning. However, extensive research has demonstrated that the effectiveness of ICL is significantly influenced by the selection and ordering of demonstrations. Considering the critical role of demonstration selection in ICL, we introduce DemoShapley which is inspired by the Data Shapley valuation theorem. This approach assesses the influence of individual demonstration instances, distinguishing between those that contribute positively and those that may hinder performance. Our findings reveal that DemoShapley not only enhances model performance in terms of accuracy and fairness but also generalizes queries from domains distinct from those of the in-context demonstrations, highlighting its versatility and effectiveness in optimizing ICL demonstration selection. Last but not least, DemoShapley demonstrates its ability to aid in identifying noisy data within the demonstration set.

[AI-99] Evolutionary Contrastive Distillation for Language Model Alignment

链接: https://arxiv.org/abs/2410.07513
作者: Julian Katz-Samuels,Zheng Li,Hyokun Yun,Priyanka Nigam,Yi Xu,Vaclav Petricek,Bing Yin,Trishul Chilimbi
关键词-EN: Evolutionary Contrastive Distillation, real-world applications, complex instructions, large language models, execute complex instructions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to execute complex instructions is essential for their real-world applications. However, several recent studies indicate that LLMs struggle with challenging instructions. In this paper, we propose Evolutionary Contrastive Distillation (ECD), a novel method for generating high-quality synthetic preference data designed to enhance the complex instruction-following capability of language models. ECD generates data that specifically illustrates the difference between a response that successfully follows a set of complex instructions and a response that is high-quality, but nevertheless makes some subtle mistakes. This is done by prompting LLMs to progressively evolve simple instructions to more complex instructions. When the complexity of an instruction is increased, the original successful response to the original instruction becomes a “hard negative” response for the new instruction, mostly meeting requirements of the new instruction, but barely missing one or two. By pairing a good response with such a hard negative response, and employing contrastive learning algorithms such as DPO, we improve language models’ ability to follow complex instructions. Empirically, we observe that our method yields a 7B model that exceeds the complex instruction-following performance of current SOTA 7B models and is competitive even with open-source 70B models.

[AI-100] CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

链接: https://arxiv.org/abs/2410.07505
作者: Wenyuan Liu,Xindian Ma,Peng Zhang,Yan Wang
关键词-EN: compressing Large Language, Large Language Models, compressing Large, quantization kernel, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs). While many studies focus on quantizing both weights and activations, it is still a challenge to maintain the accuracy of LLM after activating quantization. To investigate the primary cause, we extend the concept of kernel from linear algebra to quantization functions to define a new term, “quantization kernel”, which refers to the set of elements in activations that are quantized to zero. Through quantitative analysis of the quantization kernel, we find that these elements are crucial for maintaining the accuracy of quantized LLMs. With the decrease of quantization kernel, the precision of quantized LLMs increases. If the quantization kernel proportion is kept below 19% for OPT models and below 1% for LLaMA models, the precision loss from quantizing activations to INT8 becomes negligible. Motivated by the goal of developing a quantization method with small quantization kernel, we propose CrossQuant: a simple yet effective method for quantizing activations. CrossQuant cross-quantizes elements using row and column-wise absolute maximum vectors, achieving a quantization kernel of approximately 16% for OPT models and less than 0.1% for LLaMA models. Experimental results on LLMs (LLaMA, OPT) ranging from 6.7B to 70B parameters demonstrate that CrossQuant improves or maintains perplexity and accuracy in language modeling, zero-shot, and few-shot tasks.

[AI-101] Using LLMs to Discover Legal Factors

链接: https://arxiv.org/abs/2410.07504
作者: Morgan Gray,Jaromir Savelka,Wesley Oliver,Kevin Ashley
关键词-EN: foundational component, analysis and computational, legal reasoning, legal analysis, computational models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Factors are a foundational component of legal analysis and computational models of legal reasoning. These factor-based representations enable lawyers, judges, and AI and Law researchers to reason about legal cases. In this paper, we introduce a methodology that leverages large language models (LLMs) to discover lists of factors that effectively represent a legal domain. Our method takes as input raw court opinions and produces a set of factors and associated definitions. We demonstrate that a semi-automated approach, incorporating minimal human involvement, produces factor representations that can predict case outcomes with moderate success, if not yet as well as expert-defined factors can.

[AI-102] Dense Optimizer : An Information Entropy-Guided Structural Search Method for Dense-like Neural Network Design

链接: https://arxiv.org/abs/2410.07499
作者: Liu Tianyuan,Hou Libin,Wang Linyuan,Song Xiyu,Yan Bin
关键词-EN: Dense Convolutional Network, Dense Optimizer, Dense Convolutional, efficient structure, Convolutional Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages,3 figures

点击查看摘要

Abstract:Dense Convolutional Network has been continuously refined to adopt a highly efficient and compact architecture, owing to its lightweight and efficient structure. However, the current Dense-like architectures are mainly designed manually, it becomes increasingly difficult to adjust the channels and reuse level based on past experience. As such, we propose an architecture search method called Dense Optimizer that can search high-performance dense-like network automatically. In Dense Optimizer, we view the dense network as a hierarchical information system, maximize the network’s information entropy while constraining the distribution of the entropy across each stage via a power law, thereby constructing an optimization problem. We also propose a branch-and-bound optimization algorithm, tightly integrates power-law principle with search space scaling to solve the optimization problem efficiently. The superiority of Dense Optimizer has been validated on different computer vision benchmark datasets. Specifically, Dense Optimizer completes high-quality search but only costs 4 hours with one CPU. Our searched model DenseNet-OPT achieved a top 1 accuracy of 84.3% on CIFAR-100, which is 5.97% higher than the original one.

[AI-103] WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

链接: https://arxiv.org/abs/2410.07484
作者: Siyu Zhou,Tianyi Zhou,Yijun Yang,Guodong Long,Deheng Ye,Jing Jiang,Chengqi Zhang
关键词-EN: large language models, LLM, directly serve, powerful world models, large language
类目: Artificial Intelligence (cs.AI)
*备注: 35 pages, including references and appendix

点击查看摘要

Abstract:Can large language models (LLMs) directly serve as powerful world models for model-based agents? While the gaps between the prior knowledge of LLMs and the specified environment’s dynamics do exist, our study reveals that the gaps can be bridged by aligning an LLM with its deployed environment and such “world alignment” can be efficiently achieved by rule learning on LLMs. Given the rich prior knowledge of LLMs, only a few additional rules suffice to align LLM predictions with the specified environment dynamics. To this end, we propose a neurosymbolic approach to learn these rules gradient-free through LLMs, by inducing, updating, and pruning rules based on comparisons of agent-explored trajectories and world model predictions. The resulting world model is composed of the LLM and the learned rules. Our embodied LLM agent “WALL-E” is built upon model-predictive control (MPC). By optimizing look-ahead actions based on the precise world model, MPC significantly improves exploration and learning efficiency. Compared to existing LLM agents, WALL-E’s reasoning only requires a few principal rules rather than verbose buffered trajectories being included in the LLM input. On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods, with lower costs on replanning time and the number of tokens used for reasoning. In Minecraft, WALL-E exceeds baselines by 15-30% in success rate while costing 8-20 fewer replanning rounds and only 60-80% of tokens. In ALFWorld, its success rate surges to a new record high of 95% only after 6 iterations.

[AI-104] Exploring the design space of deep-learning-based weather forecasting systems

链接: https://arxiv.org/abs/2410.07472
作者: Shoaib Ahmed Siddiqui,Jean Kossaifi,Boris Bonev,Christopher Choy,Jan Kautz,David Krueger,Kamyar Azizzadenesheli
关键词-EN: progress in developing, tremendous progress, models, architectures, choices including architecture
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite tremendous progress in developing deep-learning-based weather forecasting systems, their design space, including the impact of different design choices, is yet to be well understood. This paper aims to fill this knowledge gap by systematically analyzing these choices including architecture, problem formulation, pretraining scheme, use of image-based pretrained models, loss functions, noise injection, multi-step inputs, additional static masks, multi-step finetuning (including larger stride models), as well as training on a larger dataset. We study fixed-grid architectures such as UNet, fully convolutional architectures, and transformer-based models, along with grid-invariant architectures, including graph-based and operator-based models. Our results show that fixed-grid architectures outperform grid-invariant architectures, indicating a need for further architectural developments in grid-invariant models such as neural operators. We therefore propose a hybrid system that combines the strong performance of fixed-grid models with the flexibility of grid-invariant architectures. We further show that multi-step fine-tuning is essential for most deep-learning models to work well in practice, which has been a common practice in the past. Pretraining objectives degrade performance in comparison to supervised training, while image-based pretrained models provide useful inductive biases in some cases in comparison to training the model from scratch. Interestingly, we see a strong positive effect of using a larger dataset when training a smaller model as compared to training on a smaller dataset for longer. Larger models, on the other hand, primarily benefit from just an increase in the computational budget. We believe that these results will aid in the design of better weather forecasting systems in the future.

[AI-105] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection

链接: https://arxiv.org/abs/2410.07471
作者: Han Shen,Pin-Yu Chen,Payel Das,Tianyi Chen
关键词-EN: leveraging Large Language, Large Language Models, Large Language, boost downstream performance, leveraging Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning the models on several adversarial samples or even benign data can greatly comprise the model’s pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on the bilevel optimization to up rank the safe and high-quality fine-tuning data and down rank the unsafe or low-quality ones. Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win rate increase compared to random selection respectively on Llama-3-8b-Instruct and Merlinite-7b models. Our code is available on github this https URL.

[AI-106] nyLidarNet: 2D LiDAR-based End-to-End Deep Learning Model for F1TENTH Autonomous Racing

链接: https://arxiv.org/abs/2410.07447
作者: Mohammed Misbah Zarrar,Qitao Weng,Bakhbyergyen Yerjan,Ahmet Soyyigit,Heechul Yun
关键词-EN: raw sensory data, Prior research, sensory data, research has demonstrated, demonstrated the effectiveness
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prior research has demonstrated the effectiveness of end-to-end deep learning for robotic navigation, where the control signals are directly derived from raw sensory data. However, the majority of existing end-to-end navigation solutions are predominantly camera-based. In this paper, we introduce TinyLidarNet, a lightweight 2D LiDAR-based end-to-end deep learning model for autonomous racing. An F1TENTH vehicle using TinyLidarNet won 3rd place in the 12th F1TENTH Autonomous Grand Prix competition, demonstrating its competitive performance. We systematically analyze its performance on untrained tracks and computing requirements for real-time processing. We find that TinyLidarNet’s 1D Convolutional Neural Network (CNN) based architecture significantly outperforms widely used Multi-Layer Perceptron (MLP) based architecture. In addition, we show that it can be processed in real-time on low-end micro-controller units (MCUs).

[AI-107] Zero-Shot Generalization of Vision-Based RL Without Data Augmentation

链接: https://arxiv.org/abs/2410.07441
作者: Sumeet Batra,Gaurav S. Sukhatme
关键词-EN: Generalizing vision-based reinforcement, vision-based reinforcement learning, Generalizing vision-based, reinforcement learning, open challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Generalizing vision-based reinforcement learning (RL) agents to novel environments remains a difficult and open challenge. Current trends are to collect large-scale datasets or use data augmentation techniques to prevent overfitting and improve downstream generalization. However, the computational and data collection costs increase exponentially with the number of task variations and can destabilize the already difficult task of training RL agents. In this work, we take inspiration from recent advances in computational neuroscience and propose a model, Associative Latent DisentAnglement (ALDA), that builds on standard off-policy RL towards zero-shot generalization. Specifically, we revisit the role of latent disentanglement in RL and show how combining it with a model of associative memory achieves zero-shot generalization on difficult task variations without relying on data augmentation. Finally, we formally show that data augmentation techniques are a form of weak disentanglement and discuss the implications of this insight.

[AI-108] Can Transformers Reason Logically? A Study in SAT Solving

链接: https://arxiv.org/abs/2410.07432
作者: Leyan Pan,Vijay Ganesh,Jacob Abernethy,Chris Esposo,Wenke Lee
关键词-EN: Boolean satisfiability, logical reasoning capabilities, study the logical, capabilities of LLMs, solve SAT
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 29 pages, 4 Figures

点击查看摘要

Abstract:We theoretically and empirically study the logical reasoning capabilities of LLMs in the context of the Boolean satisfiability (SAT) problem. First, we construct a decoder-only Transformer that can solve SAT using backtracking and deduction via Chain-of-Thought (CoT). We prove its correctness by showing trace equivalence to the well-known DPLL SAT-solving algorithm. Second, to support the implementation of this abstract construction, we design a compiler \textttPARAT that takes as input a procedural specification and outputs a transformer model implementing this specification. Third, rather than \textitprogramming a transformer to reason, we evaluate empirically whether it can be \textittrained to do so by learning directly from algorithmic traces (“reasoning paths”) of the DPLL algorithm.

[AI-109] CAFEEN: A Cooperative Approach for Energy Efficient NoCs with Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2410.07426
作者: Kamil Khan,Sudeep Pasricha
关键词-EN: efficient power management, emerging high-performance, efficient power, power management, management is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:In emerging high-performance Network-on-Chip (NoC) architectures, efficient power management is crucial to minimize energy consumption. We propose a novel framework called CAFEEN that employs both heuristic-based fine-grained and machine learning-based coarse-grained power-gating for energy-efficient NoCs. CAFEEN uses a fine-grained method to activate only essential NoC buffers during lower network loads. It switches to a coarse-grained method at peak loads to minimize compounding wake-up overhead using multi-agent reinforcement learning. Results show that CAFEEN adaptively balances power-efficiency with performance, reducing total energy by 2.60x for single application workloads and 4.37x for multi-application workloads, compared to state-of-the-art NoC power-gating frameworks.

[AI-110] Exploring Efficient Foundational Multi-modal Models for Video Summarization

链接: https://arxiv.org/abs/2410.07405
作者: Karan Samel,Apoorva Beedu,Nitish Sontakke,Irfan Essa
关键词-EN: generate text outputs, models, model, language model, Foundational models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Foundational models are able to generate text outputs given prompt instructions and text, audio, or image inputs. Recently these models have been combined to perform tasks on video, such as video summarization. Such video foundation models perform pre-training by aligning outputs from each modality-specific model into the same embedding space. Then the embeddings from each model are used within a language model, which is fine-tuned on a desired instruction set. Aligning each modality during pre-training is computationally expensive and prevents rapid testing of different base modality models. During fine-tuning, evaluation is carried out within in-domain videos where it is hard to understand the generalizability and data efficiency of these methods. To alleviate these issues we propose a plug-and-play video language model. It directly uses the texts generated from each input modality into the language model, avoiding pre-training alignment overhead. Instead of fine-tuning we leverage few-shot instruction adaptation strategies. We compare the performance versus the computational costs for our plug-and-play style method and baseline tuning methods. Finally, we explore the generalizability of each method during domain shift and present insights on what data is useful when training data is limited. Through this analysis, we present practical insights on how to leverage multi-modal foundational models for effective results given realistic compute and data limitations.

[AI-111] Fostering Intrinsic Motivation in Reinforcement Learning with Pretrained Foundation Models

链接: https://arxiv.org/abs/2410.07404
作者: Alain Andres,Javier Del Ser
关键词-EN: sparse or non-existent, remains a significant, significant challenge, challenge in reinforcement, environments where extrinsic
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Exploration remains a significant challenge in reinforcement learning, especially in environments where extrinsic rewards are sparse or non-existent. The recent rise of foundation models, such as CLIP, offers an opportunity to leverage pretrained, semantically rich embeddings that encapsulate broad and reusable knowledge. In this work we explore the potential of these foundation models not just to drive exploration, but also to analyze the critical role of the episodic novelty term in enhancing exploration effectiveness of the agent. We also investigate whether providing the intrinsic module with complete state information – rather than just partial observations – can improve exploration, despite the difficulties in handling small variations within large state spaces. Our experiments in the MiniGrid domain reveal that intrinsic modules can effectively utilize full state information, significantly increasing sample efficiency while learning an optimal policy. Moreover, we show that the embeddings provided by foundation models are sometimes even better than those constructed by the agent during training, further accelerating the learning process, especially when coupled with the episodic novelty term to enhance exploration.

[AI-112] LLM Embeddings Improve Test-time Adaptation to Tabular Y|X-Shifts

链接: https://arxiv.org/abs/2410.07395
作者: Yibo Zeng,Jiashuo Liu,Henry Lam,Hongseok Namkoong
关键词-EN: label and covariates, missing variables, common due, due to missing, tabular datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:For tabular datasets, the change in the relationship between the label and covariates ( Y|X -shifts) is common due to missing variables (a.k.a. confounders). Since it is impossible to generalize to a completely new and unknown domain, we study models that are easy to adapt to the target domain even with few labeled examples. We focus on building more informative representations of tabular data that can mitigate Y|X -shifts, and propose to leverage the prior world knowledge in LLMs by serializing (write down) the tabular data to encode it. We find LLM embeddings alone provide inconsistent improvements in robustness, but models trained on them can be well adapted/finetuned to the target domain even using 32 labeled observations. Our finding is based on a comprehensive and systematic study consisting of 7650 source-target pairs and benchmark against 261,000 model configurations trained by 22 algorithms. Our observation holds when ablating the size of accessible target data and different adaptation strategies. The code is available at this https URL.

[AI-113] he Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks

链接: https://arxiv.org/abs/2410.07391
作者: Isaac R. Galatzer-Levy,David Munday,Jed McGiffin,Xin Liu,Danny Karmon,Ilia Labzovsky,Rivka Moroshko,Amir Zait,Daniel McDuff
关键词-EN: general intelligence foundation, Adult Intelligence Scale, Wechsler Adult Intelligence, Working Memory Index, intelligence foundation models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:There is increasing interest in tracking the capabilities of general intelligence foundation models. This study benchmarks leading large language models and vision language models against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV), a comprehensive, population-normed assessment of underlying human cognition and intellectual abilities, with a focus on the domains of VerbalComprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI). Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers, with performance on the Working Memory Index (WMI) greater or equal to the 99.5th percentile when compared to human population normative ability. Performance on the Verbal Comprehension Index (VCI) which measures retrieval of acquired information, and linguistic understanding about the meaning of words and their relationships to each other, also demonstrated consistent performance at or above the 98th percentile. Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI; range 0.1-10th percentile) from multimodal models indicating profound inability to interpret and reason on visual information. Smaller and older model versions consistently performed worse, indicating that training data, parameter count and advances in tuning are resulting in significant advances in cognitive ability.

[AI-114] SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers

链接: https://arxiv.org/abs/2410.07383
作者: Viktoriia Chekalina,Anna Rudenko,Gleb Mezentsev,Alexander Mikhalev,Alexander Panchenko,Ivan Oseledets
关键词-EN: performance of Transformer, Transformer models, processed text, enhanced by increasing, MLP blocks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer’s elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, robust popular state-of-the-art PEFT approaches.

[AI-115] Improving the portability of predicting students performance models by using ontologies

链接: https://arxiv.org/abs/2410.07358
作者: Javier Lopez Zambrano,Juan A. Lara,Cristobal Romero
关键词-EN: Educational Data Mining, Educational Data, Data Mining, Learning Analytics, main current challenges
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the main current challenges in Educational Data Mining and Learning Analytics is the portability or transferability of predictive models obtained for a particular course so that they can be applied to other different courses. To handle this challenge, one of the foremost problems is the models excessive dependence on the low-level attributes used to train them, which reduces the models portability. To solve this issue, the use of high level attributes with more semantic meaning, such as ontologies, may be very useful. Along this line, we propose the utilization of an ontology that uses a taxonomy of actions that summarises students interactions with the Moodle learning management system. We compare the results of this proposed approach against our previous results when we used low-level raw attributes obtained directly from Moodle logs. The results indicate that the use of the proposed ontology improves the portability of the models in terms of predictive accuracy. The main contribution of this paper is to show that the ontological models obtained in one source course can be applied to other different target courses with similar usage levels without losing prediction accuracy.

[AI-116] MoE: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

链接: https://arxiv.org/abs/2410.07348
作者: Peng Jin,Bo Zhu,Li Yuan,Shuicheng Yan
关键词-EN: MoE, aim to simultaneously, simultaneously enhance, enhance the effectiveness, effectiveness and efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 23 pages, Code: this https URL

点击查看摘要

Abstract:In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods. To achieve this, we propose MoE++, a general and heterogeneous MoE framework that integrates both Feed-Forward Network~(FFN) and zero-computation experts. Specifically, we introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. This design offers three key advantages: (i) Low Computing Overhead: Unlike the uniform mixing mechanism for all tokens within vanilla MoE, MoE++ allows each token to engage with a dynamic number of FFNs, be adjusted by constant vectors, or even skip the MoE layer entirely. (ii) High Performance: By enabling simple tokens to utilize fewer FFN experts, MoE++ allows more experts to focus on challenging tokens, thereby unlocking greater performance potential than vanilla MoE. (iii) Deployment Friendly: Given that zero-computation experts have negligible parameters, we can deploy all zero-computation experts on each GPU, eliminating the significant communication overhead and expert load imbalance associated with FFN experts distributed across different GPUs. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts. Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.

[AI-117] Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

链接: https://arxiv.org/abs/2410.07336
作者: Sara Sarto,Nicholas Moratelli,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: significant advancements, fail to capture, capture the full, fine-grained details, existing evaluation metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: this https URL.

[AI-118] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2410.07331
作者: Yiming Huang,Jianwen Luo,Yan Yu,Yitong Zhang,Fangyu Lei,Yifan Wei,Shizhu He,Lifu Huang,Xiao Liu,Jun Zhao,Kang Liu
关键词-EN: benchmark specifically designed, generation benchmark specifically, code generation tasks, code generation benchmark, agent-based data science
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024

点击查看摘要

Abstract:We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at [this https URL](this https URL).

[AI-119] A Blockchain and Artificial Intelligence based System for Halal Food Traceability

链接: https://arxiv.org/abs/2410.07305
作者: Abdulla Alourani,Shahnawaz Khan
关键词-EN: halal food products, halal food, halal food consumers, food products, halal
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:The demand of the halal food products is increasing rapidly around the world. The consumption of halal food product is just not among the Muslims but also among non-Muslims, due to the purity of the halal food products. However, there are several challenges that are faced by the halal food consumers. The challenges raise a doubt among the halal food consumers about the authenticity of the product being halal. Therefore, a solution that can address these issues and can establish trust between consumers and producers. Blockchain technology can provide a distributed ledger of an immutable record of the information. Artificial intelligence supports developing a solution for pattern identification. The proposed research utilizes blockchain an artificial intelligence-based system for developing a system that ensure the authenticity of the halal food products by providing the traceability related to all the operations and processes of the supply chain and sourcing the raw material. The proposed system has been tested with a local supermarket. The results and tests of the developed solution seemed effective and the testers expressed interest in real-world implementation of the proposed system.

[AI-120] he Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making

链接: https://arxiv.org/abs/2410.07304
作者: Basile Garcia,Crystal Qian,Stefano Palminteri
关键词-EN: large language models, language models, integrated into society, increasingly integrated, large language
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) become increasingly integrated into society, their alignment with human morals is crucial. To better understand this alignment, we created a large corpus of human- and LLM-generated responses to various moral scenarios. We found a misalignment between human and LLM moral assessments; although both LLMs and humans tended to reject morally complex utilitarian dilemmas, LLMs were more sensitive to personal framing. We then conducted a quantitative user study involving 230 participants (N=230), who evaluated these responses by determining whether they were AI-generated and assessed their agreement with the responses. Human evaluators preferred LLMs’ assessments in moral scenarios, though a systematic anti-AI bias was observed: participants were less likely to agree with judgments they believed to be machine-generated. Statistical and NLP-based analyses revealed subtle linguistic differences in responses, influencing detection and agreement. Overall, our findings highlight the complexities of human-AI perception in morally charged decision-making.

[AI-121] Examining the Prevalence and Dynamics of AI-Generated Media in Art Subreddits

链接: https://arxiv.org/abs/2410.07302
作者: Hana Matatov,Marianne Aubin Le Quéré,Ofra Amir,Mor Naaman
关键词-EN: Broadly accessible generative, compelling visual art, create compelling visual, Broadly accessible, visual art
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Broadly accessible generative AI models like Dall-E have made it possible for anyone to create compelling visual art. In online communities, the introduction of AI-generated content (AIGC) may impact community dynamics by shifting the kinds of content being posted or the responses to content suspected of being generated by AI. We take steps towards examining the potential impact of AIGC on art-related communities on Reddit. We distinguish between communities that disallow AI content and those without a direct policy. We look at image-based posts made to these communities that are transparently created by AI, or comments in these communities that suspect authors of using generative AI. We find that AI posts (and accusations) have played a very small part in these communities through the end of 2023, accounting for fewer than 0.2% of the image-based posts. Even as the absolute number of author-labelled AI posts dwindles over time, accusations of AI use remain more persistent. We show that AI content is more readily used by newcomers and may help increase participation if it aligns with community rules. However, the tone of comments suspecting AI use by others have become more negative over time, especially in communities that do not have explicit rules about AI. Overall, the results show the changing norms and interactions around AIGC in online communities designated for creativity.

[AI-122] owards Generalisable Time Series Understanding Across Domains

链接: https://arxiv.org/abs/2410.07299
作者: Özgün Turgut,Philip Müller,Martin J. Menten,Daniel Rueckert
关键词-EN: datasets unlocks foundational, natural language processing, large datasets unlocks, time series, unlocks foundational model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In natural language processing and computer vision, self-supervised pre-training on large datasets unlocks foundational model capabilities across domains and tasks. However, this potential has not yet been realised in time series analysis, where existing methods disregard the heterogeneous nature of time series characteristics. Time series are prevalent in many domains, including medicine, engineering, natural sciences, and finance, but their characteristics vary significantly in terms of variate count, inter-variate relationships, temporal dynamics, and sampling frequency. This inherent heterogeneity across domains prevents effective pre-training on large time series corpora. To address this issue, we introduce OTiS, an open model for general time series analysis, that has been specifically designed to handle multi-domain heterogeneity. We propose a novel pre-training paradigm including a tokeniser with learnable domain-specific signatures, a dual masking strategy to capture temporal causality, and a normalised cross-correlation loss to model long-range dependencies. Our model is pre-trained on a large corpus of 640,187 samples and 11 billion time points spanning 8 distinct domains, enabling it to analyse time series from any (unseen) domain. In comprehensive experiments across 15 diverse applications - including classification, regression, and forecasting - OTiS showcases its ability to accurately capture domain-specific data characteristics and demonstrates its competitiveness against state-of-the-art baselines. Our code and pre-trained weights are publicly available at this https URL.

[AI-123] Enhancing Performance of Point Cloud Completion Networks with Consistency Loss

链接: https://arxiv.org/abs/2410.07298
作者: Kevin Tirta Wijaya,Christofel Rio Goenawan,Seung-Hyun Kong
关键词-EN: Point cloud completion, proposed consistency loss, Point cloud, consistency loss, point completion network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: First version of Paper “Enhancing Performance of Point Cloud Completion Networks with Consistency Loss” by Kevin Tirta Wijaya and Christofel Rio Goenawan. In process submission to Neurocomputing Journal 2024

点击查看摘要

Abstract:Point cloud completion networks are conventionally trained to minimize the disparities between the completed point cloud and the ground-truth counterpart. However, an incomplete object-level point cloud can have multiple valid completion solutions when it is examined in isolation. This one-to-many mapping issue can cause contradictory supervision signals to the network because the loss function may produce different values for identical input-output pairs of the network. In many cases, this issue could adversely affect the network optimization process. In this work, we propose to enhance the conventional learning objective using a novel completion consistency loss to mitigate the one-to-many mapping problem. Specifically, the proposed consistency loss ensure that a point cloud completion network generates a coherent completion solution for incomplete objects originating from the same source point cloud. Experimental results across multiple well-established datasets and benchmarks demonstrated the proposed completion consistency loss have excellent capability to enhance the completion performance of various existing networks without any modification to the design of the networks. The proposed consistency loss enhances the performance of the point completion network without affecting the inference speed, thereby increasing the accuracy of point cloud completion. Notably, a state-of-the-art point completion network trained with the proposed consistency loss can achieve state-of-the-art accuracy on the challenging new MVP dataset. The code and result of experiment various point completion models using proposed consistency loss will be available at: this https URL .

[AI-124] Principal Orthogonal Latent Components Analysis (POLCA Net)

链接: https://arxiv.org/abs/2410.07289
作者: Jose Antonio Martin H.,Freddy Perozo,Manuel Lopez
关键词-EN: Components Analysis Network, raw data, POLCA Net, pivotal area, field of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Representation learning is a pivotal area in the field of machine learning, focusing on the development of methods to automatically discover the representations or features needed for a given task from raw data. Unlike traditional feature engineering, which requires manual crafting of features, representation learning aims to learn features that are more useful and relevant for tasks such as classification, prediction, and clustering. We introduce Principal Orthogonal Latent Components Analysis Network (POLCA Net), an approach to mimic and extend PCA and LDA capabilities to non-linear domains. POLCA Net combines an autoencoder framework with a set of specialized loss functions to achieve effective dimensionality reduction, orthogonality, variance-based feature sorting, high-fidelity reconstructions, and additionally, when used with classification labels, a latent representation well suited for linear classifiers and low dimensional visualization of class distribution as well.

[AI-125] Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning NEURIPS’24

链接: https://arxiv.org/abs/2410.07286
作者: Zhilong Li,Xiaohu Wu,Xiaoli Tang,Tiantian He,Yew-Soon Ong,Mengmeng Chen,Qiqi Liu,Qicheng Lao,Xiaoxiao Li,Han Yu
关键词-EN: clients’ local datasets, growing research interest, local datasets, interest in measuring, measuring the statistical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to FL@FM-NeurIPS’24

点击查看摘要

Abstract:There is growing research interest in measuring the statistical heterogeneity of clients’ local datasets. Such measurements are used to estimate the suitability for collaborative training of personalized federated learning (PFL) models. Currently, these research endeavors are taking place in silos and there is a lack of a unified benchmark to provide a fair and convenient comparison among various approaches in common settings. We aim to bridge this important gap in this paper. The proposed benchmarking framework currently includes six representative approaches. Extensive experiments have been conducted to compare these approaches under five standard non-IID FL settings, providing much needed insights into which approaches are advantageous under which settings. The proposed framework offers useful guidance on the suitability of various data divergence measures in FL systems. It is beneficial for keeping related research activities on the right track in terms of: (1) designing PFL schemes, (2) selecting appropriate data heterogeneity evaluation approaches for specific FL application scenarios, and (3) addressing fairness issues in collaborative model training. The code is available at this https URL.

[AI-126] Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

链接: https://arxiv.org/abs/2410.07283
作者: Donghyun Lee,Mo Tiwari
关键词-EN: Large Language Models, Language Models, Large Language, grow increasingly powerful, grow increasingly
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) grow increasingly powerful, multi-agent systems are becoming more prevalent in modern AI applications. Most safety research, however, has focused on vulnerabilities in single-agent LLMs. These include prompt injection attacks, where malicious prompts embedded in external content trick the LLM into executing unintended or harmful actions, compromising the victim’s application. In this paper, we reveal a more dangerous vector: LLM-to-LLM prompt injection within multi-agent systems. We introduce Prompt Infection, a novel attack where malicious prompts self-replicate across interconnected agents, behaving much like a computer virus. This attack poses severe threats, including data theft, scams, misinformation, and system-wide disruption, all while propagating silently through the system. Our extensive experiments demonstrate that multi-agent systems are highly susceptible, even when agents do not publicly share all communications. To address this, we propose LLM Tagging, a defense mechanism that, when combined with existing safeguards, significantly mitigates infection spread. This work underscores the urgent need for advanced security measures as multi-agent LLM systems become more widely adopted.

[AI-127] Retrieval Replace Reduction: An effective visual token reduction method via semantic match

链接: https://arxiv.org/abs/2410.07278
作者: Yingen Liu,Fan Wu,Ruihui Li,Zhuo Tang,Kenli Li
关键词-EN: large language models, Multimodal large language, demonstrated strong performance, language models, training from scratch
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2 figures,3 tables

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have demonstrated strong performance across various tasks without requiring training from scratch. However, they face significant computational and memory constraints, particularly when processing multimodal inputs that exceed context length, limiting their scalability. In this paper, we introduce a new approach, \textbfTRSM (\textbfToken \textbfReduction via \textbfSemantic \textbfMatch), which effectively reduces the number of visual tokens without compromising MLLM performance. Inspired by how humans process multimodal tasks, TRSM leverages semantic information from one modality to match relevant semantics in another, reducing the number of visual this http URL, to retain task relevant visual tokens, we use the text prompt as a query vector to retrieve the most similar vectors from the visual prompt and merge them with the text tokens. Based on experimental results, when applied to LLaVA-1.5\citeliu2023, our approach compresses the visual tokens by 20%, achieving comparable performance across diverse visual question-answering and reasoning tasks.

[AI-128] Mitigation of gender bias in automatic facial non-verbal behaviors generation

链接: https://arxiv.org/abs/2410.07274
作者: Alice Delbosc(TALEP, LIS, AMU),Magalie Ochs(LIS, AMU, R2I),Nicolas Sabouret(CPU, LISN),Brian Ravenet(CPU, LISN),Stephane Ayache(AMU, LIS, QARMA)
关键词-EN: interactive agents focuses, social interactive agents, social interactive, believability and synchronization, Research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Research on non-verbal behavior generation for social interactive agents focuses mainly on the believability and synchronization of non-verbal cues with speech. However, existing models, predominantly based on deep learning architectures, often perpetuate biases inherent in the training data. This raises ethical concerns, depending on the intended application of these agents. This paper addresses these issues by first examining the influence of gender on facial non-verbal behaviors. We concentrate on gaze, head movements, and facial expressions. We introduce a classifier capable of discerning the gender of a speaker from their non-verbal cues. This classifier achieves high accuracy on both real behavior data, extracted using state-of-the-art tools, and synthetic data, generated from a model developed in previous this http URL upon this work, we present a new model, FairGenderGen, which integrates a gender discriminator and a gradient reversal layer into our previous behavior generation model. This new model generates facial non-verbal behaviors from speech features, mitigating gender sensitivity in the generated behaviors. Our experiments demonstrate that the classifier, developed in the initial phase, is no longer effective in distinguishing the gender of the speaker from the generated non-verbal behaviors.

[AI-129] Multi-Task Program Error Repair and Explanatory Diagnosis

链接: https://arxiv.org/abs/2410.07271
作者: Zhenyu Xu,Victor S. Sheng
关键词-EN: unexpected output, performance issues, Multi-task Program Error, program error diagnosis, Program Error Repair
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Program errors can occur in any type of programming, and can manifest in a variety of ways, such as unexpected output, crashes, or performance issues. And program error diagnosis can often be too abstract or technical for developers to understand, especially for beginners. The goal of this paper is to present a novel machine-learning approach for Multi-task Program Error Repair and Explanatory Diagnosis (mPRED). A pre-trained language model is used to encode the source code, and a downstream model is specifically designed to identify and repair errors. Programs and test cases will be augmented and optimized from several perspectives. Additionally, our approach incorporates a “chain of thoughts” method, which enables the models to produce intermediate reasoning explanations before providing the final correction. To aid in visualizing and analyzing the program structure, we use a graph neural network for program structure visualization. Overall, our approach offers a promising approach for repairing program errors across different programming languages and providing helpful explanations to programmers.

[AI-130] Learning Content-Aware Multi-Modal Joint Input Pruning via Birds-Eye-View Representation

链接: https://arxiv.org/abs/2410.07268
作者: Yuxin Li,Yiheng Li,Xulei Yang,Mengying Yu,Zihang Huang,Xiaojun Wu,Chai Kiat Yeo
关键词-EN: substantial academic attention, recently garnered substantial, garnered substantial academic, autonomous driving, representation has recently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the landscape of autonomous driving, Bird’s-Eye-View (BEV) representation has recently garnered substantial academic attention, serving as a transformative framework for the fusion of multi-modal sensor inputs. This BEV paradigm effectively shifts the sensor fusion challenge from a rule-based methodology to a data-centric approach, thereby facilitating more nuanced feature extraction from an array of heterogeneous sensors. Notwithstanding its evident merits, the computational overhead associated with BEV-based techniques often mandates high-capacity hardware infrastructures, thus posing challenges for practical, real-world implementations. To mitigate this limitation, we introduce a novel content-aware multi-modal joint input pruning technique. Our method leverages BEV as a shared anchor to algorithmically identify and eliminate non-essential sensor regions prior to their introduction into the perception model’s backbone. We validatethe efficacy of our approach through extensive experiments on the NuScenes dataset, demonstrating substantial computational efficiency without sacrificing perception accuracy. To the best of our knowledge, this work represents the first attempt to alleviate the computational burden from the input pruning point.

[AI-131] A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

链接: https://arxiv.org/abs/2410.07265
作者: Cong Guo,Feng Cheng,Zhixu Du,James Kiessling,Jonathan Ku,Shiyu Li,Ziru Li,Mingyuan Ma,Tergel Molom-Ochir,Benjamin Morris,Haoxuan Shan,Jingwei Sun,Yitu Wang,Chiyue Wei,Xueying Wu,Yuhao Wu,Hao Frank Yang,Jingyang Zhang,Junyao Zhang,Qilin Zheng,Guanglei Zhou, Hai (Helen)Li,Yiran Chen
关键词-EN: demonstrating remarkable capabilities, natural language processing, large language models, artificial intelligence, demonstrating remarkable
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted by IEEE Circuits and Systems Magazine

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has significantly transformed the field of artificial intelligence, demonstrating remarkable capabilities in natural language processing and moving towards multi-modal functionality. These models are increasingly integrated into diverse applications, impacting both research and industry. However, their development and deployment present substantial challenges, including the need for extensive computational resources, high energy consumption, and complex software optimizations. Unlike traditional deep learning systems, LLMs require unique optimization strategies for training and inference, focusing on system-level efficiency. This paper surveys hardware and software co-design approaches specifically tailored to address the unique characteristics and constraints of large language models. This survey analyzes the challenges and impacts of LLMs on hardware and algorithm research, exploring algorithm optimization, hardware design, and system-level innovations. It aims to provide a comprehensive understanding of the trade-offs and considerations in LLM-centric computing systems, guiding future advancements in AI. Finally, we summarize the existing efforts in this space and outline future directions toward realizing production-grade co-design methodologies for the next generation of large language models and AI systems.

[AI-132] AAAI Workshop on AI Planning for Cyber-Physical Systems – CAIPI24 AAAI

链接: https://arxiv.org/abs/2410.07245
作者: Oliver Niggemann,Gautam Biswas,Alexander Diedrich,Jonas Ehrhardt,René Heesch,Niklas Widulle
关键词-EN: Annual AAAI Conference, Cyber-Physical Systems, Annual AAAI, Intelligence in Vancouver, AAAI Conference
类目: Artificial Intelligence (cs.AI)
*备注: This is the Proceedings of the AAAI Workshop on AI Planning for Cyber-Physical Systems - CAIPI24, which was held in Vancouver, CA, February 26, 2024

点击查看摘要

Abstract:The workshop ‘AI-based Planning for Cyber-Physical Systems’, which took place on February 26, 2024, as part of the 38th Annual AAAI Conference on Artificial Intelligence in Vancouver, Canada, brought together researchers to discuss recent advances in AI planning methods for Cyber-Physical Systems (CPS). CPS pose a major challenge due to their complexity and data-intensive nature, which often exceeds the capabilities of traditional planning algorithms. The workshop highlighted new approaches such as neuro-symbolic architectures, large language models (LLMs), deep reinforcement learning and advances in symbolic planning. These techniques are promising when it comes to managing the complexity of CPS and have potential for real-world applications.

[AI-133] chnical Report: Competition Solution For Modelscope-Sora

链接: https://arxiv.org/abs/2410.07194
作者: Shengfu Chen,Hailong Liu,Wenzhao Wei
关键词-EN: presents the approach, approach adopted, focuses on fine-tuning, video generation models, Modelscope-Sora challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This report presents the approach adopted in the Modelscope-Sora challenge, which focuses on fine-tuning data for video generation models. The challenge evaluates participants’ ability to analyze, clean, and generate high-quality datasets for video-based text-to-video tasks under specific computational constraints. The provided methodology involves data processing techniques such as video description generation, filtering, and acceleration. This report outlines the procedures and tools utilized to enhance the quality of training data, ensuring improved performance in text-to-video generation models.

[AI-134] Does Spatial Cognition Emerge in Frontier Models?

链接: https://arxiv.org/abs/2410.06468
作者: Santhosh Kumar Ramakrishnan,Erik Wijmans,Philipp Kraehenbuehl,Vladlen Koltun
关键词-EN: present SPACE, Abstract, models, benchmark, spatial
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.

[AI-135] Optimal Transportation by Orthogonal Coupling Dynamics

链接: https://arxiv.org/abs/2410.08060
作者: Mohsen Sadr,Peyman Mohajerin Esfehani,Hossein Gorji
关键词-EN: learning tasks rest, algorithms and learning, learning tasks, tasks rest, rest on solution
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Many numerical algorithms and learning tasks rest on solution of the Monge-Kantorovich problem and corresponding Wasserstein distances. While the natural approach is to treat the problem as an infinite-dimensional linear programming, such a methodology severely limits the computational performance due to the polynomial scaling with respect to the sample size along with intensive memory requirements. We propose a novel alternative framework to address the Monge-Kantorovich problem based on a projection type gradient descent scheme. The micro-dynamics is built on the notion of the conditional expectation, where the connection with the opinion dynamics is explored and leveraged to build compact numerical schemes. We demonstrate that the devised dynamics recovers random maps with favourable computational performance. Along with the theoretical insight, the provided dynamics paves the way for innovative approaches to construct numerical schemes for computing optimal transport maps as well as Wasserstein distances.

[AI-136] ONCOPILOT: A Promptable CT Foundation Model For Solid Tumor Evaluation

链接: https://arxiv.org/abs/2410.07908
作者: Léo Machado,Hélène Philippe,Élodie Ferreres,Julien Khlaut,Julie Dupuis,Korentin Le Floch,Denis Habip Gatenyo,Pascal Roux,Jules Grégory,Maxime Ronot,Corentin Dancette,Daniel Tordjman,Pierre Manceron,Paul Hérent
关键词-EN: diverse shapes, proteiform phenomenon, displaying complex, locations and displaying, tumors emerging
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Carcinogenesis is a proteiform phenomenon, with tumors emerging in various locations and displaying complex, diverse shapes. At the crucial intersection of research and clinical practice, it demands precise and flexible assessment. However, current biomarkers, such as RECIST 1.1’s long and short axis measurements, fall short of capturing this complexity, offering an approximate estimate of tumor burden and a simplistic representation of a more intricate process. Additionally, existing supervised AI models face challenges in addressing the variability in tumor presentations, limiting their clinical utility. These limitations arise from the scarcity of annotations and the models’ focus on narrowly defined tasks. To address these challenges, we developed ONCOPILOT, an interactive radiological foundation model trained on approximately 7,500 CT scans covering the whole body, from both normal anatomy and a wide range of oncological cases. ONCOPILOT performs 3D tumor segmentation using visual prompts like point-click and bounding boxes, outperforming state-of-the-art models (e.g., nnUnet) and achieving radiologist-level accuracy in RECIST 1.1 measurements. The key advantage of this foundation model is its ability to surpass state-of-the-art performance while keeping the radiologist in the loop, a capability that previous models could not achieve. When radiologists interactively refine the segmentations, accuracy improves further. ONCOPILOT also accelerates measurement processes and reduces inter-reader variability, facilitating volumetric analysis and unlocking new biomarkers for deeper insights. This AI assistant is expected to enhance the precision of RECIST 1.1 measurements, unlock the potential of volumetric biomarkers, and improve patient stratification and clinical care, while seamlessly integrating into the radiological workflow. Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.07908 [eess.IV] (or arXiv:2410.07908v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2410.07908 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-137] Generalization Ability Analysis of Through-the-Wall Radar Human Activity Recognition

链接: https://arxiv.org/abs/2410.07543
作者: Weicheng Gao,Xiaodong Qu,Xiaopeng Yang
关键词-EN: indoor human motion, analyze indoor human, human activity recognition, TWR HAR, human motion
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figures, 0 table, in Proc. IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2024

点击查看摘要

Abstract:Through-the-Wall radar (TWR) human activity recognition (HAR) is a technology that uses low-frequency ultra-wideband (UWB) signal to detect and analyze indoor human motion. However, the high dependence of existing end-to-end recognition models on the distribution of TWR training data makes it difficult to achieve good generalization across different indoor testers. In this regard, the generalization ability of TWR HAR is analyzed in this paper. In detail, an end-to-end linear neural network method for TWR HAR and its generalization error bound are first discussed. Second, a micro-Doppler corner representation method and the change of the generalization error before and after dimension reduction are presented. The appropriateness of the theoretical generalization errors is proved through numerical simulations and experiments. The results demonstrate that feature dimension reduction is effective in allowing recognition models to generalize across different indoor testers.

[AI-138] Generalizable Indoor Human Activity Recognition Method Based on Micro-Doppler Corner Point Cloud and Dynamic Graph Learning

链接: https://arxiv.org/abs/2410.07542
作者: Xiaopeng Yang,Weicheng Gao,Xiaodong Qu,Haoyu Meng
关键词-EN: intelligent decision-making algorithms, fusing micro-Doppler signature, human activity recognition, decision-making algorithms, micro-Doppler signature extraction
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 15 pages, 12 figures, 6 tables, in IEEE Transactions on Aerospace and Electronics Systems, 2024

点击查看摘要

Abstract:Through-the-wall radar (TWR) human activity recognition can be achieved by fusing micro-Doppler signature extraction and intelligent decision-making algorithms. However, limited by the insufficient priori of tester in practical indoor scenarios, the trained models on one tester are commonly difficult to inference well on other testers, which causes poor generalization ability. To solve this problem, this paper proposes a generalizable indoor human activity recognition method based on micro-Doppler corner point cloud and dynamic graph learning. In the proposed method, DoG-\muD-CornerDet is used for micro-Doppler corner extraction on two types of radar profiles. Then, a micro-Doppler corner filtering method based on polynomial fitting smoothing is proposed to maximize the feature distance under the constraints of the kinematic model. The extracted corners from the two types of radar profiles are concatenated together into three-dimensional point cloud. Finally, the paper proposes a dynamic graph neural network (DGNN)-based recognition method for data-to-activity label mapping. Visualization, comparison and ablation experiments are carried out to verify the effectiveness of the proposed method. The results prove that the proposed method has strong generalization ability on radar data collected from different testers.

[AI-139] Efficient Generation of Molecular Clusters with Dual-Scale Equivariant Flow Matching

链接: https://arxiv.org/abs/2410.07539
作者: Akshay Subramanian,Shuhui Qu,Cheol Woo Park,Sulin Liu,Janghwan Lee,Rafael Gómez-Bombarelli
关键词-EN: Amorphous molecular solids, molecular solids offer, inorganic semiconductors, solution processability, Amorphous molecular
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Amorphous molecular solids offer a promising alternative to inorganic semiconductors, owing to their mechanical flexibility and solution processability. The packing structure of these materials plays a crucial role in determining their electronic and transport properties, which are key to enhancing the efficiency of devices like organic solar cells (OSCs). However, obtaining these optoelectronic properties computationally requires molecular dynamics (MD) simulations to generate a conformational ensemble, a process that can be computationally expensive due to the large system sizes involved. Recent advances have focused on using generative models, particularly flow-based models as Boltzmann generators, to improve the efficiency of MD sampling. In this work, we developed a dual-scale flow matching method that separates training and inference into coarse-grained and all-atom stages and enhances both the accuracy and efficiency of standard flow matching samplers. We demonstrate the effectiveness of this method on a dataset of Y6 molecular clusters obtained through MD simulations, and we benchmark its efficiency and accuracy against single-scale flow matching methods.

[AI-140] Learn from Real: Reality Defenders Submission to ASVspoof5 Challenge

链接: https://arxiv.org/abs/2410.07379
作者: Yi Zhu,Chirag Goel,Surya Koppisetti,Trang Tran,Ankur Kumar,Gaurav Bharaj
关键词-EN: Audio deepfake detection, Audio deepfake, crucial to combat, combat the malicious, deepfake detection
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted into ASVspoof5 workshop

点击查看摘要

Abstract:Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks to evaluate the generalizability and robustness of detection models. In this paper, we present Reality Defender’s submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy which significantly improves generalizability while maintaining low computational cost during training. Our system SLIM learns the style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistics aspects. We evaluated our system on ASVspoof5, ASV2019, and In-the-wild. Our submission achieved minDCF of 0.1499 and EER of 5.5% on ASVspoof5 Track 1, and EER of 7.4% and 10.8% on ASV2019 and In-the-wild respectively.

[AI-141] Unlocking Real-Time Fluorescence Lifetime Imaging: Multi-Pixel Parallelism for FPGA-Accelerated Processing

链接: https://arxiv.org/abs/2410.07364
作者: Ismail Erbas,Aporva Amarnath,Vikas Pandey,Karthik Swaminathan,Naigang Wang,Xavier Intes
关键词-EN: Fluorescence lifetime imaging, Fluorescence lifetime, protein interactions, lifetime imaging, fluorescent molecules
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Fluorescence lifetime imaging (FLI) is a widely used technique in the biomedical field for measuring the decay times of fluorescent molecules, providing insights into metabolic states, protein interactions, and ligand-receptor bindings. However, its broader application in fast biological processes, such as dynamic activity monitoring, and clinical use, such as in guided surgery, is limited by long data acquisition times and computationally demanding data processing. While deep learning has reduced post-processing times, time-resolved data acquisition remains a bottleneck for real-time applications. To address this, we propose a method to achieve real-time FLI using an FPGA-based hardware accelerator. Specifically, we implemented a GRU-based sequence-to-sequence (Seq2Seq) model on an FPGA board compatible with time-resolved cameras. The GRU model balances accurate processing with the resource constraints of FPGAs, which have limited DSP units and BRAM. The limited memory and computational resources on the FPGA require efficient scheduling of operations and memory allocation to deploy deep learning models for low-latency applications. We address these challenges by using STOMP, a queue-based discrete-event simulator that automates and optimizes task scheduling and memory management on hardware. By integrating a GRU-based Seq2Seq model and its compressed version, called Seq2SeqLite, generated through knowledge distillation, we were able to process multiple pixels in parallel, reducing latency compared to sequential processing. We explore various levels of parallelism to achieve an optimal balance between performance and resource utilization. Our results indicate that the proposed techniques achieved a 17.7x and 52.0x speedup over manual scheduling for the Seq2Seq model and the Seq2SeqLite model, respectively.

[AI-142] Crafting desirable climate trajectories with RL explored socio-environmental simulations

链接: https://arxiv.org/abs/2410.07287
作者: James Rudd-Jones,Fiona Thendean,María Pérez-Ortiz
关键词-EN: enact impactful change, necessitating effective climate, effective climate policies, existential threat, necessitating effective
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
*备注: 23 pages, 13 Figures

点击查看摘要

Abstract:Climate change poses an existential threat, necessitating effective climate policies to enact impactful change. Decisions in this domain are incredibly complex, involving conflicting entities and evidence. In the last decades, policymakers increasingly use simulations and computational methods to guide some of their decisions. Integrated Assessment Models (IAMs) are one of such methods, which combine social, economic, and environmental simulations to forecast potential policy effects. For example, the UN uses outputs of IAMs for their recent Intergovernmental Panel on Climate Change (IPCC) reports. Traditionally these have been solved using recursive equation solvers, but have several shortcomings, e.g. struggling at decision making under uncertainty. Recent preliminary work using Reinforcement Learning (RL) to replace the traditional solvers shows promising results in decision making in uncertain and noisy scenarios. We extend on this work by introducing multiple interacting RL agents as a preliminary analysis on modelling the complex interplay of socio-interactions between various stakeholders or nations that drives much of the current climate crisis. Our findings show that cooperative agents in this framework can consistently chart pathways towards more desirable futures in terms of reduced carbon emissions and improved economy. However, upon introducing competition between agents, for instance by using opposing reward functions, desirable climate futures are rarely reached. Modelling competition is key to increased realism in these simulations, as such we employ policy interpretation by visualising what states lead to more uncertain behaviour, to understand algorithm failure. Finally, we highlight the current limitations and avenues for further work to ensure future technology uptake for policy derivation.

[AI-143] Swin-BERT: A Feature Fusion System designed for Speech-based Alzheimers Dementia Detection

链接: https://arxiv.org/abs/2410.07277
作者: Yilin Pan,Yanpei Shi,Yijia Zhang,Mingyu Lu
关键词-EN: automatic Alzheimer dementia, automatic Alzheimer, Alzheimer dementia, early stages, system
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Speech is usually used for constructing an automatic Alzheimer’s dementia (AD) detection system, as the acoustic and linguistic abilities show a decline in people living with AD at the early stages. However, speech includes not only AD-related local and global information but also other information unrelated to cognitive status, such as age and gender. In this paper, we propose a speech-based system named Swin-BERT for automatic dementia detection. For the acoustic part, the shifted windows multi-head attention that proposed to extract local and global information from images, is used for designing our acoustic-based system. To decouple the effect of age and gender on acoustic feature extraction, they are used as an extra input of the designed acoustic system. For the linguistic part, the rhythm-related information, which varies significantly between people living with and without AD, is removed while transcribing the audio recordings into transcripts. To compensate for the removed rhythm-related information, the character-level transcripts are proposed to be used as the extra input of a word-level BERT-style system. Finally, the Swin-BERT combines the acoustic features learned from our proposed acoustic-based system with our linguistic-based system. The experiments are based on the two datasets provided by the international dementia detection challenges: the ADReSS and ADReSSo. The results show that both the proposed acoustic and linguistic systems can be better or comparable with previous research on the two datasets. Superior results are achieved by the proposed Swin-BERT system on the ADReSS and ADReSSo datasets, which are 85.58% F-score and 87.32% F-score respectively.

[AI-144] Deep Learning for Surgical Instrument Recognition and Segmentation in Robotic-Assisted Surgeries: A Systematic Review

链接: https://arxiv.org/abs/2410.07269
作者: Fatimaelzahraa Ali Ahmed,Mahmoud Yousef,Mariam Ali Ahmed,Hasan Omar Ali,Anns Mahboob,Hazrat Ali,Zubair Shah,Omar Aboumarzouk,Abdulla Al Ansari,Shidin Balakrishnan
关键词-EN: Applying deep learning, minimally invasive surgeries, robot-assisted minimally invasive, Applying deep, surgical
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 57 pages, 9 figures, Accepted for publication in Artificial Intelligence Reviews journal this https URL

点击查看摘要

Abstract:Applying deep learning (DL) for annotating surgical instruments in robot-assisted minimally invasive surgeries (MIS) represents a significant advancement in surgical technology. This systematic review examines 48 studies that and advanced DL methods and architectures. These sophisticated DL models have shown notable improvements in the precision and efficiency of detecting and segmenting surgical tools. The enhanced capabilities of these models support various clinical applications, including real-time intraoperative guidance, comprehensive postoperative evaluations, and objective assessments of surgical skills. By accurately identifying and segmenting surgical instruments in video data, DL models provide detailed feedback to surgeons, thereby improving surgical outcomes and reducing complication risks. Furthermore, the application of DL in surgical education is transformative. The review underscores the significant impact of DL on improving the accuracy of skill assessments and the overall quality of surgical training programs. However, implementing DL in surgical tool detection and segmentation faces challenges, such as the need for large, accurately annotated datasets to train these models effectively. The manual annotation process is labor-intensive and time-consuming, posing a significant bottleneck. Future research should focus on automating the detection and segmentation process and enhancing the robustness of DL models against environmental variations. Expanding the application of DL models across various surgical specialties will be essential to fully realize this technology’s potential. Integrating DL with other emerging technologies, such as augmented reality (AR), also offers promising opportunities to further enhance the precision and efficacy of surgical procedures.

[AI-145] Reconstruction of Particle Flow Energy Distribution Using Deep Learning Algorithms

链接: https://arxiv.org/abs/2410.07250
作者: Han Zhang(1),Shengxiang Lin(2),Xingyi Zhang(3),Yu Wang(4),Yangguang Zhang(5) ((1) College of Artificial Intelligence and Automation, Hohai University, (2) Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, (3) School of Mechanical Engineering, Shanghai Jiao Tong University, (4) School of Control and Computer Engineering, North China Electric Power University, (5) School of Automation and Electrical Engineering, University of Science and Technology Beijing)
关键词-EN: complex detector signals, Large Hadron Collider, high-energy particle physics, particle physics, extracting information
类目: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI)
*备注: 11 pages, 1 tables, 9 figures Code available at this https URL

点击查看摘要

Abstract:In high-energy particle physics, extracting information from complex detector signals is crucial for energy reconstruction. Recent advancements involve using deep learning to process calorimeter images from various sub-detectors in experiments like the Large Hadron Collider (LHC) for energy map reconstruction. This paper compares classical algorithms-MLP, CNN, U-Net, and RNN-with variants that include self-attention and 3D convolution modules to evaluate their effectiveness in reconstructing the initial energy distribution. Additionally, a test dataset of jet events is utilized to analyze and compare models’ performance in handling anomalous high-energy events. The analysis highlights the effectiveness of deep learning techniques for energy image reconstruction and explores their potential in this area.

[AI-146] Evaluating Financial Relational Graphs: Interpretation Before Prediction

链接: https://arxiv.org/abs/2410.07216
作者: Yingjie Niu,Lanxin Lu,Rian Dolphin,Valerio Poti,Ruihai Dong
关键词-EN: Accurate and robust, robust stock trend, stock trend forecasting, relationship graphs, stock relationship graphs
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 ACM International Conference on AI in Finance

点击查看摘要

Abstract:Accurate and robust stock trend forecasting has been a crucial and challenging task, as stock price changes are influenced by multiple factors. Graph neural network-based methods have recently achieved remarkable success in this domain by constructing stock relationship graphs that reflect internal factors and relationships between stocks. However, most of these methods rely on predefined factors to construct static stock relationship graphs due to the lack of suitable datasets, failing to capture the dynamic changes in stock relationships. Moreover, the evaluation of relationship graphs in these methods is often tied to the performance of neural network models on downstream tasks, leading to confusion and imprecision. To address these issues, we introduce the SPNews dataset, collected based on S\P 500 Index stocks, to facilitate the construction of dynamic relationship graphs. Furthermore, we propose a novel set of financial relationship graph evaluation methods that are independent of downstream tasks. By using the relationship graph to explain historical financial phenomena, we assess its validity before constructing a graph neural network, ensuring the graph’s effectiveness in capturing relevant financial relationships. Experimental results demonstrate that our evaluation methods can effectively differentiate between various financial relationship graphs, yielding more interpretable results compared to traditional approaches. We make our source code publicly available on GitHub to promote reproducibility and further research in this area.

计算机视觉

[CV-0] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

链接: https://arxiv.org/abs/2410.08211
作者: Anh-Quan Cao,Maximilian Jaritz,Matthieu Guillaumin,Raoul de Charette,Loris Bazzani
关键词-EN: Large-scale vision-language pre-trained, Large-scale vision-language, applied to diverse, diverse applications, fine-tuning VLP models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.

[CV-1] PointOBB-v2: Towards Simpler Faster and Stronger Single Point Supervised Oriented Object Detection

链接: https://arxiv.org/abs/2410.08210
作者: Botao Ren,Xue Yang,Yi Yu,Junwei Luo,Zhidong Deng
关键词-EN: made initial progress, Single point supervised, gained attention, attention and made, made initial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Single point supervised oriented object detection has gained attention and made initial progress within the community. Diverse from those approaches relying on one-shot samples or powerful pretrained models (e.g. SAM), PointOBB has shown promise due to its prior-free feature. In this paper, we propose PointOBB-v2, a simpler, faster, and stronger method to generate pseudo rotated boxes from points without relying on any other prior. Specifically, we first generate a Class Probability Map (CPM) by training the network with non-uniform positive and negative sampling. We show that the CPM is able to learn the approximate object regions and their contours. Then, Principal Component Analysis (PCA) is applied to accurately estimate the orientation and the boundary of objects. By further incorporating a separation mechanism, we resolve the confusion caused by the overlapping on the CPM, enabling its operation in high-density scenarios. Extensive comparisons demonstrate that our method achieves a training speed 15.58x faster and an accuracy improvement of 11.60%/25.15%/21.19% on the DOTA-v1.0/v1.5/v2.0 datasets compared to the previous state-of-the-art, PointOBB. This significantly advances the cutting edge of single point supervised oriented detection in the modular track.

[CV-2] Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

链接: https://arxiv.org/abs/2410.08209
作者: Shengcao Cao,Liang-Yan Gui,Yu-Xiong Wang
关键词-EN: Current large multimodal, relate language components, Current large, large multimodal models, face challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an “attend-and-segment” method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: this https URL.

[CV-3] SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

链接: https://arxiv.org/abs/2410.08208
作者: Haoyi Zhu,Honghui Yang,Yating Wang,Jiange Yang,Limin Wang,Tong He
关键词-EN: vanilla Vision Transformer, embodied representation learning, framework that emphasizes, emphasizes the importance, Vision Transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios. These results highlight the critical role of 3D spatial awareness for embodied representation learning. Our strongest model takes more than 6000 GPU hours to train and we are committed to open-sourcing all code and model weights to foster future research in embodied representation learning. Project Page: this https URL.

[CV-4] DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

链接: https://arxiv.org/abs/2410.08207
作者: Xiaoxiao He,Ligong Han,Quan Dao,Song Wen,Minhao Bai,Di Liu,Han Zhang,Martin Renqiang Min,Felix Juefei-Xu,Chaowei Tan,Bo Liu,Kang Li,Hongdong Li,Junzhou Huang,Faez Ahmed,Akash Srivastava,Dimitris Metaxas
关键词-EN: masked language modeling, Discrete diffusion models, achieved success, success in tasks, language modeling
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces. For project webpage, see this https URL.

[CV-5] Interactive4D: Interactive 4D LiDAR Segmentation

链接: https://arxiv.org/abs/2410.08206
作者: Ilya Fradlin,Idil Esen Zulfikar,Kadir Yilmaz,Theodora Kontogianni,Bastian Leibe
关键词-EN: important role, role in facilitating, LiDAR, future LiDAR datasets, Interactive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

点击查看摘要

Abstract:Interactive segmentation has an important role in facilitating the annotation process of future LiDAR datasets. Existing approaches sequentially segment individual objects at each LiDAR scan, repeating the process throughout the entire sequence, which is redundant and ineffective. In this work, we propose interactive 4D segmentation, a new paradigm that allows segmenting multiple objects on multiple LiDAR scans simultaneously, and Interactive4D, the first interactive 4D segmentation model that segments multiple objects on superimposed consecutive LiDAR scans in a single iteration by utilizing the sequential nature of LiDAR data. While performing interactive segmentation, our model leverages the entire space-time volume, leading to more efficient segmentation. Operating on the 4D volume, it directly provides consistent instance IDs over time and also simplifies tracking annotations. Moreover, we show that click simulations are crucial for successful model training on LiDAR point clouds. To this end, we design a click simulation strategy that is better suited for the characteristics of LiDAR data. To demonstrate its accuracy and effectiveness, we evaluate Interactive4D on multiple LiDAR datasets, where Interactive4D achieves a new state-of-the-art by a large margin. Upon acceptance, we will publicly release the code and models at this https URL.

[CV-6] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

链接: https://arxiv.org/abs/2410.08202
作者: Gen Luo,Xue Yang,Wenhan Dou,Zhaokai Wang,Jifeng Dai,Yu Qiao,Xizhou Zhu
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, monolithic Multimodal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has led to an influx of efforts to extend their capabilities to multimodal tasks. Among them, growing attention has been focused on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. Despite the structural simplicity and deployment-friendliness, training a monolithic MLLM with promising performance still remains challenging. In particular, the popular approaches adopt continuous pre-training to extend a pre-trained LLM to a monolithic MLLM, which suffers from catastrophic forgetting and leads to performance degeneration. In this paper, we aim to overcome this limitation from the perspective of delta tuning. Specifically, our core idea is to embed visual parameters into a pre-trained LLM, thereby incrementally learning visual knowledge from massive data via delta tuning, i.e., freezing the LLM when optimizing the visual parameters. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results not only validate the superior performance of Mono-InternVL compared to the state-of-the-art MLLM on 6 multimodal benchmarks, e.g., +113 points over InternVL-1.5 on OCRBench, but also confirm its better deployment efficiency, with first token latency reduced by up to 67%.

[CV-7] MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

链接: https://arxiv.org/abs/2410.08196
作者: Zimu Lu,Aojun Zhou,Ke Wang,Houxing Ren,Weikang Shi,Junting Pan,Mingjie Zhan,Hongsheng Li
关键词-EN: Code, mathematical, precision and accuracy, reasoning, mathematical reasoning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at this https URL .

[CV-8] HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation ECCV2024

链接: https://arxiv.org/abs/2410.08192
作者: Shanyan Guan,Yanhao Ge,Ying Tai,Jian Yang,Wei Li,Mingyu You
关键词-EN: shown remarkable creative, generating personalized instances, personalized instances based, Recent advancements, remarkable creative capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024, the project page: this https URL

点击查看摘要

Abstract:Recent advancements in text-to-image diffusion models have shown remarkable creative capabilities with textual prompts, but generating personalized instances based on specific subjects, known as subject-driven generation, remains challenging. To tackle this issue, we present a new hybrid framework called HybridBooth, which merges the benefits of optimization-based and direct-regression methods. HybridBooth operates in two stages: the Word Embedding Probe, which generates a robust initial word embedding using a fine-tuned encoder, and the Word Embedding Refinement, which further adapts the encoder to specific subject images by optimizing key parameters. This approach allows for effective and fast inversion of visual concepts into textual embedding, even from a single image, while maintaining the model’s generalization capabilities.

[CV-9] Poison-splat: Computation Cost Attack on 3D Gaussian Splatting

链接: https://arxiv.org/abs/2410.08190
作者: Jiahao Lu,Yifan Zhang,Qiuhong Shen,Xinchao Wang,Shuicheng Yan
关键词-EN: Gaussian splatting, vision tasks, performance and efficiency, representation and brought, groundbreaking performance
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Our code is available at this https URL

点击查看摘要

Abstract:3D Gaussian splatting (3DGS), known for its groundbreaking performance and efficiency, has become a dominant 3D representation and brought progress to many 3D vision tasks. However, in this work, we reveal a significant security vulnerability that has been largely overlooked in 3DGS: the computation cost of training 3DGS could be maliciously tampered by poisoning the input data. By developing an attack named Poison-splat, we reveal a novel attack surface where the adversary can poison the input images to drastically increase the computation memory and time needed for 3DGS training, pushing the algorithm towards its worst computation complexity. In extreme cases, the attack can even consume all allocable memory, leading to a Denial-of-Service (DoS) that disrupts servers, resulting in practical damages to real-world 3DGS service vendors. Such a computation cost attack is achieved by addressing a bi-level optimization problem through three tailored strategies: attack objective approximation, proxy model rendering, and optional constrained optimization. These strategies not only ensure the effectiveness of our attack but also make it difficult to defend with simple defensive measures. We hope the revelation of this novel attack surface can spark attention to this crucial yet overlooked vulnerability of 3DGS systems.

[CV-10] SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation NEURIPS2024

链接: https://arxiv.org/abs/2410.08189
作者: Hang Yin,Xiuwei Xu,Zhenyu Wu,Jie Zhou,Jiwen Lu
关键词-EN: zero-shot object navigation, object navigation, object navigation methods, scene graph, object navigation framework
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted to NeurIPS 2024. Project page: this https URL

点击查看摘要

Abstract:In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt LLM with the text of spatially closed objects, which lacks enough scene context for in-depth reasoning. To better preserve the information of environment and fully exploit the reasoning ability of LLM, we propose to represent the observed scene with 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with a LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help LLM reason the goal location according to scene context by traversing the nodes and edges. Moreover, benefit from the scene graph representation, we further design a re-perception mechanism to empower the object navigation framework with the ability to correct perception error. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while the decision process is explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark.

[CV-11] DifFRelight: Diffusion-Based Facial Performance Relighting SIGGRAPH WWW

链接: https://arxiv.org/abs/2410.08188
作者: Mingming He,Pascal Clausen,Ahmet Levent Taşel,Li Ma,Oliver Pilarski,Wenqi Xian,Laszlo Rikker,Xueming Yu,Ryan Burgert,Ning Yu,Paul Debevec
关键词-EN: relighting using diffusion-based, free-viewpoint facial performance, facial performance relighting, Stable Diffusion model, lighting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: 18 pages, SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 3–6, 2024, Tokyo, Japan. Project page: this https URL

点击查看摘要

Abstract:We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially-aligned conditioning of flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the models efficiency in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skintexture andhair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.

[CV-12] Scaling Laws For Diffusion Transformers

链接: https://arxiv.org/abs/2410.08184
作者: Zhengyang Liang,Hao He,Ceyuan Yang,Bo Dai
关键词-EN: Diffusion transformers, achieved appealing synthesis, content recreation, image and video, achieved appealing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation. However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget. Therefore, experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs are conducted to confirm the existence of scaling laws in DiT for the first time. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute. Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1e21 FLOPs. Additionally, we also demonstrate that the trend of pre-training loss matches the generation performances (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.

[CV-13] MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

链接: https://arxiv.org/abs/2410.08182
作者: Wenbo Hu,Jia-Chen Gu,Zi-Yi Dou,Mohsen Fayyaz,Pan Lu,Kai-Wei Chang,Nanyun Peng
关键词-EN: Existing multimodal retrieval, retrieval benchmarks primarily, benchmarks primarily focus, Existing multimodal, primarily focus
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: this https URL

点击查看摘要

Abstract:Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs’ ability to utilize retrieved visual knowledge more effectively.

[CV-14] RGM: Reconstructing High-fidelity 3D Car Assets with Relightable 3D-GS Generative Model from a Single Image

链接: https://arxiv.org/abs/2410.08181
作者: Xiaoxue Chen,Jv Zheng,Hao Huang,Haoran Xu,Weihao Gu,Kangliang Chen,He xiang,Huan-ang Gao,Hao Zhao,Guyue Zhou,Yaqin Zhang
关键词-EN: including video games, autonomous driving, including video, video games, virtual reality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The generation of high-quality 3D car assets is essential for various applications, including video games, autonomous driving, and virtual reality. Current 3D generation methods utilizing NeRF or 3D-GS as representations for 3D objects, generate a Lambertian object under fixed lighting and lack separated modelings for material and global illumination. As a result, the generated assets are unsuitable for relighting under varying lighting conditions, limiting their applicability in downstream tasks. To address this challenge, we propose a novel relightable 3D object generative framework that automates the creation of 3D car assets, enabling the swift and accurate reconstruction of a vehicle’s geometry, texture, and material properties from a single input image. Our approach begins with introducing a large-scale synthetic car dataset comprising over 1,000 high-precision 3D vehicle models. We represent 3D objects using global illumination and relightable 3D Gaussian primitives integrating with BRDF parameters. Building on this representation, we introduce a feed-forward model that takes images as input and outputs both relightable 3D Gaussians and global illumination parameters. Experimental results demonstrate that our method produces photorealistic 3D car assets that can be seamlessly integrated into road scenes with different illuminations, which offers substantial practical benefits for industrial applications.

[CV-15] ANet: Triplet Attention Network for All-In-One Adverse Weather Image Restoration ACCV2024

链接: https://arxiv.org/abs/2410.08177
作者: Hsing-Hua Wang,Fu-Jen Tsai,Yen-Yu Lin,Chia-Wen Lin
关键词-EN: unwanted degraded artifacts, remove unwanted degraded, weather conditions, Adverse weather image, weather
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages (ACCV 2024)

点击查看摘要

Abstract:Adverse weather image restoration aims to remove unwanted degraded artifacts, such as haze, rain, and snow, caused by adverse weather conditions. Existing methods achieve remarkable results for addressing single-weather conditions. However, they face challenges when encountering unpredictable weather conditions, which often happen in real-world scenarios. Although different weather conditions exhibit different degradation patterns, they share common characteristics that are highly related and complementary, such as occlusions caused by degradation patterns, color distortion, and contrast attenuation due to the scattering of atmospheric particles. Therefore, we focus on leveraging common knowledge across multiple weather conditions to restore images in a unified manner. In this paper, we propose a Triplet Attention Network (TANet) to efficiently and effectively address all-in-one adverse weather image restoration. TANet consists of Triplet Attention Block (TAB) that incorporates three types of attention mechanisms: Local Pixel-wise Attention (LPA) and Global Strip-wise Attention (GSA) to address occlusions caused by non-uniform degradation patterns, and Global Distribution Attention (GDA) to address color distortion and contrast attenuation caused by atmospheric phenomena. By leveraging common knowledge shared across different weather conditions, TANet successfully addresses multiple weather conditions in a unified manner. Experimental results show that TANet efficiently and effectively achieves state-of-the-art performance in all-in-one adverse weather image restoration. The source code is available at this https URL.

[CV-16] On the Evaluation of Generative Robotic Simulations ALT

链接: https://arxiv.org/abs/2410.08172
作者: Feng Chen,Botian Xu,Pu Hua,Peiqi Duan,Yanchao Yang,Yi Ma,Huazhe Xu
关键词-EN: acquiring extensive real-world, scalable simulated robotic, extensive real-world data, simulated robotic tasks, highlighting the importance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capacities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For single-task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision-language models. In terms of diversity, we measure both task and data diversity through text similarity of task descriptions and world model loss trained on collected task trajectories. For task-level generalization, we assess the zero-shot generalization ability on unseen tasks of a policy trained with multiple generated tasks. Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics. Additionally, our analysis further highlights the common challenge of low generalization capability faced by current works. Our anonymous website: this https URL.

[CV-17] ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

链接: https://arxiv.org/abs/2410.08168
作者: Zitian Zhang,Frédéric Fortier-Chouinard,Mathieu Garon,Anand Bhattad,Jean-François Lalonde
关键词-EN: require paired composite-scene, Stable Diffusion model, paired composite-scene images, effective zero-shot, Stable Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present ZeroComp, an effective zero-shot 3D object compositing approach that does not require paired composite-scene images during training. Our method leverages ControlNet to condition from intrinsic images and combines it with a Stable Diffusion model to utilize its scene priors, together operating as an effective rendering engine. During training, ZeroComp uses intrinsic images based on geometry, albedo, and masked shading, all without the need for paired images of scenes with and without composite objects. Once trained, it seamlessly integrates virtual 3D objects into scenes, adjusting shading to create realistic composites. We developed a high-quality evaluation dataset and demonstrate that ZeroComp outperforms methods using explicit lighting estimations and generative techniques in quantitative and human perception benchmarks. Additionally, ZeroComp extends to real and outdoor image compositing, even when trained solely on synthetic indoor data, showcasing its effectiveness in image compositing.

[CV-18] Visual Scratchpads: Enabling Global Reasoning in Vision

链接: https://arxiv.org/abs/2410.08165
作者: Aryo Lotfi,Enrico Fini,Samy Bengio,Moin Nabi,Emmanuel Abbe
关键词-EN: achieved remarkable success, features provide critical, local features provide, provide critical information, Modern vision models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in solving tasks that require more global reasoning, where local features offer no significant information. These tasks are reminiscent of the connectivity tasks discussed by Minsky and Papert in 1969, which exposed the limitations of the perceptron model and contributed to the first AI winter. In this paper, we revisit such tasks by introducing four global visual benchmarks involving path findings and mazes. We show that: (1) although today’s large vision models largely surpass the expressivity limitations of the early models, they still struggle with the learning efficiency; we put forward the “globality degree” notion to understand this limitation; (2) we then demonstrate that the picture changes and global reasoning becomes feasible with the introduction of “visual scratchpads”; similarly to the text scratchpads and chain-of-thoughts used in language models, visual scratchpads help break down global tasks into simpler ones; (3) we finally show that some scratchpads are better than others, in particular, “inductive scratchpads” that take steps relying on less information afford better out-of-distribution generalization and succeed for smaller model sizes.

[CV-19] Agent S: An Open Agent ic Framework that Uses Computers Like a Human

链接: https://arxiv.org/abs/2410.08164
作者: Saaket Agashe,Jiuzhou Han,Shuyu Gan,Jiachen Yang,Ang Li,Xin Eric Wang
关键词-EN: Graphical User Interface, Graphical User, enables autonomous interaction, transforming human-computer interaction, open agentic framework
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages, 16 figures, 9 tables

点击查看摘要

Abstract:We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at this https URL.

[CV-20] DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

链接: https://arxiv.org/abs/2410.08159
作者: Jiatao Gu,Yuyang Wang,Yizhe Zhang,Qihang Zhang,Dinghuai Zhang,Navdeep Jaitly,Josh Susskind,Shuangfei Zhai
关键词-EN: DART, image, Markovian, visual generation, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 23 pages

点击查看摘要

Abstract:Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the models ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.

[CV-21] RayEmb: Arbitrary Landmark Detection in X-Ray Images Using Ray Embedding Subspace ACCV2024

链接: https://arxiv.org/abs/2410.08152
作者: Pragyan Shrestha,Chun Xie,Yuichi Yoshii,Itaru Kitahara
关键词-EN: X-ray images, orthopedic surgeries, X-ray, Intra-operative, pre-operatively acquired
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as an oral presentation at ACCV 2024

点击查看摘要

Abstract:Intra-operative 2D-3D registration of X-ray images with pre-operatively acquired CT scans is a crucial procedure in orthopedic surgeries. Anatomical landmarks pre-annotated in the CT volume can be detected in X-ray images to establish 2D-3D correspondences, which are then utilized for registration. However, registration often fails in certain view angles due to poor landmark visibility. We propose a novel method to address this issue by detecting arbitrary landmark points in X-ray images. Our approach represents 3D points as distinct subspaces, formed by feature vectors (referred to as ray embeddings) corresponding to intersecting rays. Establishing 2D-3D correspondences then becomes a task of finding ray embeddings that are close to a given subspace, essentially performing an intersection test. Unlike conventional methods for landmark estimation, our approach eliminates the need for manually annotating fixed landmarks. We trained our model using the synthetic images generated from CTPelvic1K CLINIC dataset, which contains 103 CT volumes, and evaluated it on the DeepFluoro dataset, comprising real X-ray images. Experimental results demonstrate the superiority of our method over conventional methods. The code is available at this https URL.

[CV-22] Progressive Autoregressive Video Diffusion Models

链接: https://arxiv.org/abs/2410.08151
作者: Desai Xie,Zhan Xu,Yicong Hong,Hao Tan,Difan Liu,Feng Liu,Arie Kaufman,Yang Zhou
关键词-EN: Current frontier video, Current frontier, demonstrated remarkable results, generating high-quality videos, demonstrated remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 5 figures. Our video results and code are available at this https URL

点击查看摘要

Abstract:Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing the architectures. Our key idea is to assign the latent frames with progressively increasing noise levels rather than a single noise level, which allows for fine-grained condition among the latents and large overlaps between the attention windows. Such progressive video denoising allows our models to autoregressively generate video frames without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at this https URL.

[CV-23] Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

链接: https://arxiv.org/abs/2410.08145
作者: Xiaoyuan Liu,Wenxuan Wang,Youliang Yuan,Jen-tse Huang,Qiuzhi Liu,Pinjia He,Zhaopeng Tu
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, contradicts model internal
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts model’s internal commonsense knowledge (see Figure 1). To study this issue, we introduce an automated pipeline, augmented with human-in-the-loop quality control, to establish a benchmark aimed at simulating and assessing the conflicts in MLLMs. Utilizing this pipeline, we have crafted a diagnostic benchmark comprising 374 original images and 1,122 high-quality question-answer (QA) pairs. This benchmark covers two types of conflict target and three question difficulty levels, providing a thorough assessment tool. Through this benchmark, we evaluate the conflict-resolution capabilities of nine representative MLLMs across various model families and find a noticeable over-reliance on textual queries. Drawing on these findings, we propose a novel prompting strategy, “Focus-on-Vision” (FoV), which markedly enhances MLLMs’ ability to favor visual data over conflicting textual knowledge. Our detailed analysis and the newly proposed strategy significantly advance the understanding and mitigating of vision-knowledge conflicts in MLLMs. The data and code are made publicly available.

[CV-24] Efficient Perspective-Correct 3D Gaussian Splatting Using Hybrid Transparency

链接: https://arxiv.org/abs/2410.08129
作者: Florian Hahlbohm,Fabian Friederichs,Tim Weyrich,Linus Franke,Moritz Kappel,Susana Castillo,Marc Stamminger,Martin Eisemann,Marcus Magnor
关键词-EN: versatile rendering primitive, proven a versatile, Gaussian Splats, Splats, rendering primitive
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:3D Gaussian Splats (3DGS) have proven a versatile rendering primitive, both for inverse rendering as well as real-time exploration of scenes. In these applications, coherence across camera frames and multiple views is crucial, be it for robust convergence of a scene reconstruction or for artifact-free fly-throughs. Recent work started mitigating artifacts that break multi-view coherence, including popping artifacts due to inconsistent transparency sorting and perspective-correct outlines of (2D) splats. At the same time, real-time requirements forced such implementations to accept compromises in how transparency of large assemblies of 3D Gaussians is resolved, in turn breaking coherence in other ways. In our work, we aim at achieving maximum coherence, by rendering fully perspective-correct 3D Gaussians while using a high-quality approximation of accurate blending, hybrid transparency, on a per-pixel level, in order to retain real-time frame rates. Our fast and perspectively accurate approach for evaluation of 3D Gaussians does not require matrix inversions, thereby ensuring numerical stability and eliminating the need for special handling of degenerate splats, and the hybrid transparency formulation for blending maintains similar quality as fully resolved per-pixel transparencies at a fraction of the rendering costs. We further show that each of these two components can be independently integrated into Gaussian splatting systems. In combination, they achieve up to 2 \times higher frame rates, 2 \times faster optimization, and equal or better image quality with fewer rendering artifacts compared to traditional 3DGS on common benchmarks.

[CV-25] Q-VLM: Post-training Quantization for Large Vision-Language Models

链接: https://arxiv.org/abs/2410.08119
作者: Changyuan Wang,Ziwei Wang,Xiuwei Xu,Yansong Tang,Jie Zhou,Jiwen Lu
关键词-EN: post-training quantization framework, efficient multi-modal inference, cross-layer dependency, optimal quantization strategy, large vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to acquire optimal quantization strategy without considering cross-layer dependency. On the contrary, we mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy searching with low search cost. Specifically, we observe the strong correlation between the activation entropy and the cross-layer dependency concerning output discretization errors. Therefore, we employ the entropy as the proxy to partition blocks optimally, which aims to achieve satisfying trade-offs between discretization errors and the search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of search space, so that the search cost is further reduced without harming the quantization accuracy. Experimental results demonstrate that our method compresses the memory by 2.78x and increase generate speed by 1.44x about 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at this https URL.

[CV-26] Medical Image Quality Assessment based on Probability of Necessity and Sufficiency

链接: https://arxiv.org/abs/2410.08118
作者: Boyu Chen,Ameenat L. Solebo,Weiye Bao,Paul Taylor
关键词-EN: medical image analysis, reliable medical image, image quality assessment, image analysis, reliable medical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image quality assessment (MIQA) is essential for reliable medical image analysis. While deep learning has shown promise in this field, current models could be misled by spurious correlations learned from data and struggle with out-of-distribution (OOD) scenarios. To that end, we propose an MIQA framework based on a concept from causal inference: Probability of Necessity and Sufficiency (PNS). PNS measures how likely a set of features is to be both necessary (always present for an outcome) and sufficient (capable of guaranteeing an outcome) for a particular result. Our approach leverages this concept by learning hidden features from medical images with high PNS values for quality prediction. This encourages models to capture more essential predictive information, enhancing their robustness to OOD scenarios. We evaluate our framework on an Anterior Segment Optical Coherence Tomography (AS-OCT) dataset for the MIQA task and experimental results demonstrate the effectiveness of our framework.

[CV-27] Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning

链接: https://arxiv.org/abs/2410.08114
作者: Dingkang Liang,Tianrui Feng,Xin Zhou,Yumeng Zhang,Zhikang Zou,Xiang Bai
关键词-EN: leveraging pre-training techniques, hot research topic, point cloud, enhance point cloud, Point cloud Graph
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The code will be made available at this https URL

点击查看摘要

Abstract:Recently, leveraging pre-training techniques to enhance point cloud models has become a hot research topic. However, existing approaches typically require full fine-tuning of pre-trained models to achieve satisfied performance on downstream tasks, accompanying storage-intensive and computationally demanding. To address this issue, we propose a novel Parameter-Efficient Fine-Tuning (PEFT) method for point cloud, called PointGST (Point cloud Graph Spectral Tuning). PointGST freezes the pre-trained model and introduces a lightweight, trainable Point Cloud Spectral Adapter (PCSA) to fine-tune parameters in the spectral domain. The core idea is built on two observations: 1) The inner tokens from frozen models might present confusion in the spatial domain; 2) Task-specific intrinsic information is important for transferring the general knowledge to the downstream task. Specifically, PointGST transfers the point tokens from the spatial domain to the spectral domain, effectively de-correlating confusion among tokens via using orthogonal components for separating. Moreover, the generated spectral basis involves intrinsic information about the downstream point clouds, enabling more targeted tuning. As a result, PointGST facilitates the efficient transfer of general knowledge to downstream tasks while significantly reducing training costs. Extensive experiments on challenging point cloud datasets across various tasks demonstrate that PointGST not only outperforms its fully fine-tuning counterpart but also significantly reduces trainable parameters, making it a promising solution for efficient point cloud learning. It improves upon a solid baseline by +2.28%, 1.16%, and 2.78%, resulting in 99.48%, 97.76%, and 96.18% on the ScanObjNN OBJ BG, OBJ OBLY, and PB T50 RS datasets, respectively. This advancement establishes a new state-of-the-art, using only 0.67% of the trainable parameters.

[CV-28] IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera

链接: https://arxiv.org/abs/2410.08107
作者: Jian Huang,Chengrui Dong,Peidong Liu
关键词-EN: achieved remarkable progress, Implicit neural representation, Gaussian Splatting, RGB and RGB-D, Implicit neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code Page: this https URL

点击查看摘要

Abstract:Implicit neural representation and explicit 3D Gaussian Splatting (3D-GS) for novel view synthesis have achieved remarkable progress with frame-based camera (e.g. RGB and RGB-D cameras) recently. Compared to frame-based camera, a novel type of bio-inspired visual sensor, i.e. event camera, has demonstrated advantages in high temporal resolution, high dynamic range, low power consumption and low latency. Due to its unique asynchronous and irregular data capturing process, limited work has been proposed to apply neural representation or 3D Gaussian splatting for an event camera. In this work, we present IncEventGS, an incremental 3D Gaussian Splatting reconstruction algorithm with a single event camera. To recover the 3D scene representation incrementally, we exploit the tracking and mapping paradigm of conventional SLAM pipelines for IncEventGS. Given the incoming event stream, the tracker firstly estimates an initial camera motion based on prior reconstructed 3D-GS scene representation. The mapper then jointly refines both the 3D scene representation and camera motion based on the previously estimated motion trajectory from the tracker. The experimental results demonstrate that IncEventGS delivers superior performance compared to prior NeRF-based methods and other related baselines, even we do not have the ground-truth camera poses. Furthermore, our method can also deliver better performance compared to state-of-the-art event visual odometry methods in terms of camera motion estimation. Code is publicly available at: this https URL.

[CV-29] CrackSegDiff: Diffusion Probability Model-based Multi-modal Crack Segmentation

链接: https://arxiv.org/abs/2410.08100
作者: Xiaoyan Jiang,Licheng Jiang,Anjie Wang,Kaiying Zhu,Yongbin Gao
关键词-EN: road condition assessments, road inspection robots, improved maintenance strategies, road inspection, road condition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Integrating grayscale and depth data in road inspection robots could enhance the accuracy, reliability, and comprehensiveness of road condition assessments, leading to improved maintenance strategies and safer infrastructure. However, these data sources are often compromised by significant background noise from the pavement. Recent advancements in Diffusion Probabilistic Models (DPM) have demonstrated remarkable success in image segmentation tasks, showcasing potent denoising capabilities, as evidenced in studies like SegDiff \citeamit2021segdiff. Despite these advancements, current DPM-based segmentors do not fully capitalize on the potential of original image data. In this paper, we propose a novel DPM-based approach for crack segmentation, named CrackSegDiff, which uniquely fuses grayscale and range/depth images. This method enhances the reverse diffusion process by intensifying the interaction between local feature extraction via DPM and global feature extraction. Unlike traditional methods that utilize Transformers for global features, our approach employs Vm-unet \citeruan2024vm to efficiently capture long-range information of the original data. The integration of features is further refined through two innovative modules: the Channel Fusion Module (CFM) and the Shallow Feature Compensation Module (SFCM). Our experimental evaluation on the three-class crack image segmentation tasks within the FIND dataset demonstrates that CrackSegDiff outperforms state-of-the-art methods, particularly excelling in the detection of shallow cracks. Code is available at this https URL.

[CV-30] UW-SDF: Exploiting Hybrid Geometric Priors for Neural SDF Reconstruction from Underwater Multi-view Monocular Images IROS2024

链接: https://arxiv.org/abs/2410.08092
作者: Zeyu Chen,Jingyi Tang,Gu Wang,Shengquan Li,Xinghui Li,Xiangyang Ji,Xiu Li
关键词-EN: exploration and mapping, unique characteristics, poses a challenging, challenging problem, problem in tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages, 9 figures, presented at IROS 2024

点击查看摘要

Abstract:Due to the unique characteristics of underwater environments, accurate 3D reconstruction of underwater objects poses a challenging problem in tasks such as underwater exploration and mapping. Traditional methods that rely on multiple sensor data for 3D reconstruction are time-consuming and face challenges in data acquisition in underwater scenarios. We propose UW-SDF, a framework for reconstructing target objects from multi-view underwater images based on neural SDF. We introduce hybrid geometric priors to optimize the reconstruction process, markedly enhancing the quality and efficiency of neural SDF reconstruction. Additionally, to address the challenge of segmentation consistency in multi-view images, we propose a novel few-shot multi-view target segmentation strategy using the general-purpose segmentation model (SAM), enabling rapid automatic segmentation of unseen objects. Through extensive qualitative and quantitative experiments on diverse datasets, we demonstrate that our proposed method outperforms the traditional underwater 3D reconstruction method and other neural rendering approaches in the field of underwater 3D reconstruction.

[CV-31] Distribution Guidance Network for Weakly Supervised Point Cloud Semantic Segmentation

链接: https://arxiv.org/abs/2410.08091
作者: Zhiyi Pan,Wei Gao,Shan Liu,Ge Li
关键词-EN: dense annotations inherent, point cloud semantic, cloud semantic segmentation, semantic segmentation suffers, inadequate supervision signals
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite alleviating the dependence on dense annotations inherent to fully supervised methods, weakly supervised point cloud semantic segmentation suffers from inadequate supervision signals. In response to this challenge, we introduce a novel perspective that imparts auxiliary constraints by regulating the feature space under weak supervision. Our initial investigation identifies which distributions accurately characterize the feature space, subsequently leveraging this priori to guide the alignment of the weakly supervised embeddings. Specifically, we analyze the superiority of the mixture of von Mises-Fisher distributions (moVMF) among several common distribution candidates. Accordingly, we develop a Distribution Guidance Network (DGNet), which comprises a weakly supervised learning branch and a distribution alignment branch. Leveraging reliable clustering initialization derived from the weakly supervised learning branch, the distribution alignment branch alternately updates the parameters of the moVMF and the network, ensuring alignment with the moVMF-defined latent space. Extensive experiments validate the rationality and effectiveness of our distribution choice and network design. Consequently, DGNet achieves state-of-the-art performance under multiple datasets and various weakly supervised settings.

[CV-32] oMiE: Towards Modular Growth in Enhanced SMPL Skeleton for 3D Human with Animatable Garments

链接: https://arxiv.org/abs/2410.08082
作者: Yifan Zhan,Qingtian Zhu,Muyao Niu,Mingze Ma,Jiancheng Zhao,Zhihang Zhong,Xiao Sun,Yu Qiao,Yinqiang Zheng
关键词-EN: complex garments, highlight a critical, overlooked factor, human tasks, garments
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we highlight a critical yet often overlooked factor in most 3D human tasks, namely modeling humans with complex garments. It is known that the parameterized formulation of SMPL is able to fit human skin; while complex garments, e.g., hand-held objects and loose-fitting garments, are difficult to get modeled within the unified framework, since their movements are usually decoupled with the human body. To enhance the capability of SMPL skeleton in response to this situation, we propose a modular growth strategy that enables the joint tree of the skeleton to expand adaptively. Specifically, our method, called ToMiE, consists of parent joints localization and external joints optimization. For parent joints localization, we employ a gradient-based approach guided by both LBS blending weights and motion kernels. Once the external joints are obtained, we proceed to optimize their transformations in SE(3) across different frames, enabling rendering and explicit animation. ToMiE manages to outperform other methods across various cases with garments, not only in rendering quality but also by offering free animation of grown joints, thereby enhancing the expressive ability of SMPL skeleton for a broader range of applications.

[CV-33] Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

链接: https://arxiv.org/abs/2410.08074
作者: Vinith M. Suriyakumar,Rohan Alur,Ayush Sekhari,Manish Raghavan,Ashia C. Wilson
关键词-EN: web-scale datasets, rely on massive, diffusion models rely, diffusion models, diffusion
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 13 figures

点击查看摘要

Abstract:Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with “unlearning” steps (to “forget” existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to “relearn” concepts that were previously “unlearned.” We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments which compose “mass concept erasure” (the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024)) with subsequent fine-tuning of Stable Diffusion v1.4. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.

[CV-34] Unlearning-based Neural Interpretations

链接: https://arxiv.org/abs/2410.08069
作者: Ching Lam Choi,Alexandre Duplessis,Serge Belongie
关键词-EN: computing feature importance, Gradient-based interpretations, require an anchor, comparison to avoid, avoid saturation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions–constant mapping, averaging or blurring–inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust interpretations.

[CV-35] Reversible Decoupling Network for Single Image Reflection Removal

链接: https://arxiv.org/abs/2410.08063
作者: Hao Zhao,Mingjia Li,Qiming Hu,Xiaojie Guo
关键词-EN: shown promising advances, single-image reflection removal, approaches to single-image, promising advances, single-image reflection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent deep-learning-based approaches to single-image reflection removal have shown promising advances, primarily for two reasons: 1) the utilization of recognition-pretrained features as inputs, and 2) the design of dual-stream interaction networks. However, according to the Information Bottleneck principle, high-level semantic clues tend to be compressed or discarded during layer-by-layer propagation. Additionally, interactions in dual-stream networks follow a fixed pattern across different layers, limiting overall performance. To address these limitations, we propose a novel architecture called Reversible Decoupling Network (RDNet), which employs a reversible encoder to secure valuable information while flexibly decoupling transmission- and reflection-relevant features during the forward pass. Furthermore, we customize a transmission-rate-aware prompt generator to dynamically calibrate features, further boosting performance. Extensive experiments demonstrate the superiority of RDNet over existing SOTA methods on five widely-adopted benchmark datasets. Our code will be made publicly available.

[CV-36] A framework for compressing unstructured scientific data via serialization

链接: https://arxiv.org/abs/2410.08059
作者: Viktor Reshniak,Qian Gong,Rick Archibald,Scott Klasky,Norbert Podhorszki
关键词-EN: compressing unstructured scientific, unstructured scientific data, present a general, compressing unstructured, unstructured scientific
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 9 figures

点击查看摘要

Abstract:We present a general framework for compressing unstructured scientific data with known local connectivity. A common application is simulation data defined on arbitrary finite element meshes. The framework employs a greedy topology preserving reordering of original nodes which allows for seamless integration into existing data processing pipelines. This reordering process depends solely on mesh connectivity and can be performed offline for optimal efficiency. However, the algorithm’s greedy nature also supports on-the-fly implementation. The proposed method is compatible with any compression algorithm that leverages spatial correlations within the data. The effectiveness of this approach is demonstrated on a large-scale real dataset using several compression methods, including MGARD, SZ, and ZFP.

[CV-37] Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

链接: https://arxiv.org/abs/2410.08049
作者: Yiyuan Zhang,Xiaohan Ding,Xiangyu Yue
关键词-EN: Convolutional Neural Networks, modern Convolutional Neural, designing modern Convolutional, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This is the journal version of arXiv:2203.06717 and arXiv:2311.15599

点击查看摘要

Abstract:This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematical architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4% but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets with faster inference speed compared with vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs. All codes and models are publicly available at this https URL promoting further research and development in the community.

[CV-38] GrabDAE: An Innovative Framework for Unsupervised Domain Adaptation Utilizing Grab-Mask and Denoise Auto-Encoder

链接: https://arxiv.org/abs/2410.08023
作者: Junzhou Chen,Xuan Wen,Ronghui Zhang,Bingtao Ren,Di Wu,Zhigang Xu,Danwei Wang
关键词-EN: Unsupervised Domain Adaptation, target domain, Unsupervised Domain, Existing Unsupervised Domain, labeled source domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) aims to adapt a model trained on a labeled source domain to an unlabeled target domain by addressing the domain shift. Existing Unsupervised Domain Adaptation (UDA) methods often fall short in fully leveraging contextual information from the target domain, leading to suboptimal decision boundary separation during source and target domain alignment. To address this, we introduce GrabDAE, an innovative UDA framework designed to tackle domain shift in visual classification tasks. GrabDAE incorporates two key innovations: the Grab-Mask module, which blurs background information in target domain images, enabling the model to focus on essential, domain-relevant features through contrastive learning; and the Denoising Auto-Encoder (DAE), which enhances feature alignment by reconstructing features and filtering noise, ensuring a more robust adaptation to the target domain. These components empower GrabDAE to effectively handle unlabeled target domain data, significantly improving both classification accuracy and robustness. Extensive experiments on benchmark datasets, including VisDA-2017, Office-Home, and Office31, demonstrate that GrabDAE consistently surpasses state-of-the-art UDA methods, setting new performance benchmarks. By tackling UDA’s critical challenges with its novel feature masking and denoising approach, GrabDAE offers both significant theoretical and practical advancements in domain adaptation.

[CV-39] OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling NEURIPS2024

链接: https://arxiv.org/abs/2410.08021
作者: Linhui Xiao,Xiaoshan Yang,Fang Peng,Yaowei Wang,Changsheng Xu
关键词-EN: bulky Transformer-based fusion, early-stage interaction technologies, works heavily rely, bulky Transformer-based, Transformer-based fusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. The project page: this https URL

点击查看摘要

Abstract:Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works heavily rely on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. Simultaneously, the current mask visual language modeling (MVLM) fails to capture the nuanced referential relationship between image-text in referring tasks. In this paper, we propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer that unifies the visual and linguistic feature spaces. To modeling the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM), which encompasses both referring-aware mask image modeling and referring-aware mask language modeling. Both modules not only reconstruct modality-related content but also cross-modal referring content. Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region rather than relying on fixed ratios or generic random masking schemes. By leveraging the unified visual language feature space and incorporating MRefM’s ability to model the referential relations, our approach enables direct regression of the referring results without resorting to various complex techniques. Our method consistently surpasses existing approaches and achieves SoTA performance on both grounding and segmentation tasks, providing valuable insights for future research. Our code and models are available at this https URL.

[CV-40] Fast Feedforward 3D Gaussian Splatting Compression

链接: https://arxiv.org/abs/2410.08017
作者: Yihang Chen,Qianyi Wu,Mengyao Li,Weiyao Lin,Mehrtash Harandi,Jianfei Cai
关键词-EN: storage requirements pose, requirements pose challenges, Gaussian Splatting, advancing real-time, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:With 3D Gaussian Splatting (3DGS) advancing real-time and high-fidelity rendering for novel view synthesis, storage requirements pose challenges for their widespread adoption. Although various compression techniques have been proposed, previous art suffers from a common limitation: for any existing 3DGS, per-scene optimization is needed to achieve compression, making the compression sluggish and slow. To address this issue, we introduce Fast Compression of 3D Gaussian Splatting (FCGS), an optimization-free model that can compress 3DGS representations rapidly in a single feed-forward pass, which significantly reduces compression time from minutes to seconds. To enhance compression efficiency, we propose a multi-path entropy module that assigns Gaussian attributes to different entropy constraint paths for balance between size and fidelity. We also carefully design both inter- and intra-Gaussian context models to remove redundancies among the unstructured Gaussian blobs. Overall, FCGS achieves a compression ratio of over 20X while maintaining fidelity, surpassing most per-scene SOTA optimization-based methods. Our code is available at: this https URL.

[CV-41] RegionGrasp: A Novel Task for Contact Region Controllable Hand Grasp Generation ECCV2024 ECCV

链接: https://arxiv.org/abs/2410.07995
作者: Yilin Wang,Chuan Guo,Li Cheng,Hai Jiang
关键词-EN: natural hand grasps, Hand Grasp Generation, Controllable Hand Grasp, Region Controllable Hand, machine automatically generate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for ECCV Workshop: HANDS@ECCV2024

点击查看摘要

Abstract:Can machine automatically generate multiple distinct and natural hand grasps, given specific contact region of an object in 3D? This motivates us to consider a novel task of \textitRegion Controllable Hand Grasp Generation (RegionGrasp), as follows: given as input a 3D object, together with its specific surface area selected as the intended contact region, to generate a diverse set of plausible hand grasps of the object, where the thumb finger tip touches the object surface on the contact region. To address this task, RegionGrasp-CVAE is proposed, which consists of two main parts. First, to enable contact region-awareness, we propose ConditionNet as the condition encoder that includes in it a transformer-backboned object encoder, O-Enc; a pretraining strategy is adopted by O-Enc, where the point patches of object surface are randomly masked off and subsequently restored, to further capture surface geometric information of the object. Second, to realize interaction awareness, HOINet is introduced to encode hand-object interaction features by entangling high-level hand features with embedded object features through geometric-aware multi-head cross attention. Empirical evaluations demonstrate the effectiveness of our approach qualitatively and quantitatively where it is shown to compare favorably with respect to the state of the art methods.

[CV-42] LADIMO: Face Morph Generation through Biometric Template Inversion with Latent Diffusion

链接: https://arxiv.org/abs/2410.07988
作者: Marcel Grimmer,Christoph Busch
关键词-EN: severe security threat, face recognition systems, Face morphing, Face, face morphing approach
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Face morphing attacks pose a severe security threat to face recognition systems, enabling the morphed face image to be verified against multiple identities. To detect such manipulated images, the development of new face morphing methods becomes essential to increase the diversity of training datasets used for face morph detection. In this study, we present a representation-level face morphing approach, namely LADIMO, that performs morphing on two face recognition embeddings. Specifically, we train a Latent Diffusion Model to invert a biometric template - thus reconstructing the face image from an FRS latent representation. Our subsequent vulnerability analysis demonstrates the high morph attack potential in comparison to MIPGAN-II, an established GAN-based face morphing approach. Finally, we exploit the stochastic LADMIO model design in combination with our identity conditioning mechanism to create unlimited morphing attacks from a single face morph image pair. We show that each face morph variant has an individual attack success rate, enabling us to maximize the morph attack potential by applying a simple re-sampling strategy. Code and pre-trained models available here: this https URL

[CV-43] A transition towards virtual representations of visual scenes

链接: https://arxiv.org/abs/2410.07987
作者: Américo Pereira,Pedro Carvalho,Luís Côrte-Real
关键词-EN: extract meaningful information, Visual scene understanding, Visual scene, computer vision, vision that aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual scene understanding is a fundamental task in computer vision that aims to extract meaningful information from visual data. It traditionally involves disjoint and specialized algorithms for different tasks that are tailored for specific application scenarios. This can be cumbersome when designing complex systems that include processing of visual and semantic data extracted from visual scenes, which is even more noticeable nowadays with the influx of applications for virtual or augmented reality. When designing a system that employs automatic visual scene understanding to enable a precise and semantically coherent description of the underlying scene, which can be used to fuel a visualization component with 3D virtual synthesis, the lack of flexibility and unified frameworks become more prominent. To alleviate this issue and its inherent problems, we propose an architecture that addresses the challenges of visual scene understanding and description towards a 3D virtual synthesis that enables an adaptable, unified and coherent solution. Furthermore, we expose how our proposition can be of use into multiple application areas. Additionally, we also present a proof of concept system that employs our architecture to further prove its usability in practice.

[CV-44] Generalizable and Animatable Gaussian Head Avatar NEURIPS2024

链接: https://arxiv.org/abs/2410.07971
作者: Xuangeng Chu,Tatsuya Harada
关键词-EN: one-shot animatable head, Animatable Gaussian head, animatable head avatar, animatable head, propose Generalizable
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: NeurIPS 2024, code is available at this https URL , more demos are available at this https URL

点击查看摘要

Abstract:In this paper, we propose Generalizable and Animatable Gaussian head Avatar (GAGAvatar) for one-shot animatable head avatar reconstruction. Existing methods rely on neural radiance fields, leading to heavy rendering consumption and low reenactment speeds. To address these limitations, we generate the parameters of 3D Gaussians from a single image in a single forward pass. The key innovation of our work is the proposed dual-lifting method, which produces high-fidelity 3D Gaussians that capture identity and facial details. Additionally, we leverage global image features and the 3D morphable model to construct 3D Gaussians for controlling expressions. After training, our model can reconstruct unseen identities without specific optimizations and perform reenactment rendering at real-time speeds. Experiments show that our method exhibits superior performance compared to previous methods in terms of reconstruction quality and expression accuracy. We believe our method can establish new benchmarks for future research and advance applications of digital avatars. Code and demos are available this https URL.

[CV-45] Iterative Optimization Annotation Pipeline and ALSS-YOLO-Seg for Efficient Banana Plantation Segmentation in UAV Imagery

链接: https://arxiv.org/abs/2410.07955
作者: Ang He,Ximei Wu,Xing Xu,Jing Chen,Xiaobin Guo,Sheng Xu
关键词-EN: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, plant health assessment, captured images plays
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Precise segmentation of Unmanned Aerial Vehicle (UAV)-captured images plays a vital role in tasks such as crop yield estimation and plant health assessment in banana plantations. By identifying and classifying planted areas, crop area can be calculated, which is indispensable for accurate yield predictions. However, segmenting banana plantation scenes requires a substantial amount of annotated data, and manual labeling of these images is both time-consuming and labor-intensive, limiting the development of large-scale datasets. Furthermore, challenges such as changing target sizes, complex ground backgrounds, limited computational resources, and correct identification of crop categories make segmentation even more difficult. To address these issues, we proposed a comprehensive solution. Firstly, we designed an iterative optimization annotation pipeline leveraging SAM2’s zero-shot capabilities to generate high-quality segmentation annotations, thereby reducing the cost and time associated with data annotation significantly. Secondly, we developed ALSS-YOLO-Seg, an efficient lightweight segmentation model optimized for UAV imagery. The model’s backbone includes an Adaptive Lightweight Channel Splitting and Shuffling (ALSS) module to improve information exchange between channels and optimize feature extraction, aiding accurate crop identification. Additionally, a Multi-Scale Channel Attention (MSCA) module combines multi-scale feature extraction with channel attention to tackle challenges of varying target sizes and complex ground backgrounds.

[CV-46] Multimodal Perception System for Real Open Environment

链接: https://arxiv.org/abs/2410.07926
作者: Yuyang Sha
关键词-EN: real open environment, multimodal perception system, open environment, paper presents, multimodal perception
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a novel multimodal perception system for a real open environment. The proposed system includes an embedded computation platform, cameras, ultrasonic sensors, GPS, and IMU devices. Unlike the traditional frameworks, our system integrates multiple sensors with advanced computer vision algorithms to help users walk outside reliably. The system can efficiently complete various tasks, including navigating to specific locations, passing through obstacle regions, and crossing intersections. Specifically, we also use ultrasonic sensors and depth cameras to enhance obstacle avoidance performance. The path planning module is designed to find the locally optimal route based on various feedback and the user’s current state. To evaluate the performance of the proposed system, we design several experiments under different scenarios. The results show that the system can help users walk efficiently and independently in complex situations.

[CV-47] Understanding Human Activity with Uncertainty Measure for Novelty in Graph Convolutional Networks

链接: https://arxiv.org/abs/2410.07917
作者: Hao Xing,Darius Burschka
关键词-EN: developing intelligent robots, Understanding human activity, Graph Convolutional Network, Fusion Graph Convolutional, Temporal Fusion Graph
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 10 figures, The International Journal of Robotics Research

点击查看摘要

Abstract:Understanding human activity is a crucial aspect of developing intelligent robots, particularly in the domain of human-robot collaboration. Nevertheless, existing systems encounter challenges such as over-segmentation, attributed to errors in the up-sampling process of the decoder. In response, we introduce a promising solution: the Temporal Fusion Graph Convolutional Network. This innovative approach aims to rectify the inadequate boundary estimation of individual actions within an activity stream and mitigate the issue of over-segmentation in the temporal dimension. Moreover, systems leveraging human activity recognition frameworks for decision-making necessitate more than just the identification of actions. They require a confidence value indicative of the certainty regarding the correspondence between observations and training examples. This is crucial to prevent overly confident responses to unforeseen scenarios that were not part of the training data and may have resulted in mismatches due to weak similarity measures within the system. To address this, we propose the incorporation of a Spectral Normalized Residual connection aimed at enhancing efficient estimation of novelty in observations. This innovative approach ensures the preservation of input distance within the feature space by imposing constraints on the maximum gradients of weight updates. By limiting these gradients, we promote a more robust handling of novel situations, thereby mitigating the risks associated with overconfidence. Our methodology involves the use of a Gaussian process to quantify the distance in feature space. Comments: 15 pages, 10 figures, The International Journal of Robotics Research Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.07917 [cs.RO] (or arXiv:2410.07917v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2410.07917 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-48] A Lightweight Target-Driven Network of Stereo Matching for Inland Waterways

链接: https://arxiv.org/abs/2410.07915
作者: Jing Su,Yiqing Zhou,Yu Zhang,Chao Wang,Yi Wei
关键词-EN: Unmanned Surface Vehicles, Surface Vehicles, Unmanned Surface, navigation of Unmanned, target-driven stereo matching
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 6 figures

点击查看摘要

Abstract:Stereo matching for inland waterways is one of the key technologies for the autonomous navigation of Unmanned Surface Vehicles (USVs), which involves dividing the stereo images into reference images and target images for pixel-level matching. However, due to the challenges of the inland waterway environment, such as blurred textures, large spatial scales, and computational resource constraints of the USVs platform, the participation of geometric features from the target image is required for efficient target-driven matching. Based on this target-driven concept, we propose a lightweight target-driven stereo matching neural network, named LTNet. Specifically, a lightweight and efficient 4D cost volume, named the Geometry Target Volume (GTV), is designed to fully utilize the geometric information of target features by employing the shifted target features as the filtered feature volume. Subsequently, to address the substantial texture interference and object occlusions present in the waterway environment, a Left-Right Consistency Refinement (LRR) module is proposed. The \textLRR utilizes the pixel-level differences in left and right disparities to introduce soft constraints, thereby enhancing the accuracy of predictions during the intermediate stages of the network. Moreover, knowledge distillation is utilized to enhance the generalization capability of lightweight models on the USVInland dataset. Furthermore, a new large-scale benchmark, named Spring, is utilized to validate the applicability of LTNet across various scenarios. In experiments on the aforementioned two datasets, LTNet achieves competitive results, with only 3.7M parameters. The code is available at this https URL .

[CV-49] Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network IROS2022

链接: https://arxiv.org/abs/2410.07912
作者: Hao Xing,Darius Burschka
关键词-EN: Graph Convolutional Network, Human activities recognition, temporal pyramid pooling, Pyramid Graph Convolutional, intelligent robot
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 7 pages, 6 figures, IROS 2022 conference

点击查看摘要

Abstract:Human activities recognition is an important task for an intelligent robot, especially in the field of human-robot collaboration, it requires not only the label of sub-activities but also the temporal structure of the activity. In order to automatically recognize both the label and the temporal structure in sequence of human-object interaction, we propose a novel Pyramid Graph Convolutional Network (PGCN), which employs a pyramidal encoder-decoder architecture consisting of an attention based graph convolution network and a temporal pyramid pooling module for downsampling and upsampling interaction sequence on the temporal axis, respectively. The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph. To learn the human-object relations, a new attention graph convolutional network is trained to extract condensed information from the graph representation. To segment action into sub-actions, a novel temporal pyramid pooling module is proposed, which upsamples compressed features back to the original time scale and classifies actions per frame. We explore various attention layers, namely spatial attention, temporal attention and channel attention, and combine different upsampling decoders to test the performance on action recognition and segmentation. We evaluate our model on two challenging datasets in the field of human-object interaction recognition, i.e. Bimanual Actions and IKEA Assembly datasets. We demonstrate that our classifier significantly improves both framewise action recognition and segmentation, e.g., F1 micro and F1@50 scores on Bimanual Actions dataset are improved by 4.3% and 8.5% respectively. Comments: 7 pages, 6 figures, IROS 2022 conference Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) Cite as: arXiv:2410.07912 [cs.CV] (or arXiv:2410.07912v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.07912 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-50] Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization

链接: https://arxiv.org/abs/2410.07901
作者: Hongtao Wu,Yijun Yang,Angelica I Aviles-Rivero,Jingjing Ren,Sixiang Chen,Haoyu Chen,Lei Zhu
关键词-EN: computer vision tasks, degradations present formidable, present formidable challenges, Snow degradations present, outdoor scenarios
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Snow degradations present formidable challenges to the advancement of computer vision tasks by the undesirable corruption in outdoor scenarios. While current deep learning-based desnowing approaches achieve success on synthetic benchmark datasets, they struggle to restore out-of-distribution real-world snowy videos due to the deficiency of paired real-world training data. To address this bottleneck, we devise a new paradigm for video desnowing in a semi-supervised spirit to involve unlabeled real data for the generalizable snow removal. Specifically, we construct a real-world dataset with 85 snowy videos, and then present a Semi-supervised Video Desnowing Network (SemiVDN) equipped by a novel Distribution-driven Contrastive Regularization. The elaborated contrastive regularization mitigates the distribution gap between the synthetic and real data, and consequently maintains the desired snow-invariant background details. Furthermore, based on the atmospheric scattering model, we introduce a Prior-guided Temporal Decoupling Experts module to decompose the physical components that make up a snowy video in a frame-correlated manner. We evaluate our SemiVDN on benchmark datasets and the collected real snowy data. The experimental results demonstrate the superiority of our approach against state-of-the-art image- and video-level desnowing methods.

[CV-51] Deepfake detection in videos with multiple faces using geometric-fakeness features

链接: https://arxiv.org/abs/2410.07888
作者: Kirill Vyshegorodtsev,Dmitry Kudiyarov,Alexander Balashov,Alexander Kuzmin
关键词-EN: recent years deepfake, video conferencing solutions, facial manipulation techniques, years deepfake detection, deepfake
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Due to the development of facial manipulation techniques in recent years deepfake detection in video stream became an important problem for face biometrics, brand monitoring or online video conferencing solutions. In case of a biometric authentication, if you replace a real datastream with a deepfake, you can bypass a liveness detection system. Using a deepfake in a video conference, you can penetrate into a private meeting. Deepfakes of victims or public figures can also be used by fraudsters for blackmailing, extorsion and financial fraud. Therefore, the task of detecting deepfakes is relevant to ensuring privacy and security. In existing approaches to a deepfake detection their performance deteriorates when multiple faces are present in a video simultaneously or when there are other objects erroneously classified as faces. In our research we propose to use geometric-fakeness features (GFF) that characterize a dynamic degree of a face presence in a video and its per-frame deepfake scores. To analyze temporal inconsistencies in GFFs between the frames we train a complex deep learning model that outputs a final deepfake prediction. We employ our approach to analyze videos with multiple faces that are simultaneously present in a video. Such videos often occur in practice e.g., in an online video conference. In this case, real faces appearing in a frame together with a deepfake face will significantly affect a deepfake detection and our approach allows to counter this problem. Through extensive experiments we demonstrate that our approach outperforms current state-of-the-art methods on popular benchmark datasets such as FaceForensics++, DFDC, Celeb-DF and WildDeepFake. The proposed approach remains accurate when trained to detect multiple different deepfake generation techniques.

[CV-52] Generated Bias: Auditing Internal Bias Dynamics of Text-To-Image Generative Models

链接: https://arxiv.org/abs/2410.07884
作者: Abhishek Mandal,Susan Leavy,Suzanne Little
关键词-EN: text prompts, capable of generating, generating images, images from text, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Text-To-Image (TTI) Diffusion Models such as DALL-E and Stable Diffusion are capable of generating images from text prompts. However, they have been shown to perpetuate gender stereotypes. These models process data internally in multiple stages and employ several constituent models, often trained separately. In this paper, we propose two novel metrics to measure bias internally in these multistage multimodal models. Diffusion Bias was developed to detect and measures bias introduced by the diffusion stage of the models. Bias Amplification measures amplification of bias during the text-to-image conversion process. Our experiments reveal that TTI models amplify gender bias, the diffusion process itself contributes to bias and that Stable Diffusion v2 is more prone to gender bias than DALL-E 2.

[CV-53] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

链接: https://arxiv.org/abs/2410.07864
作者: Songming Liu,Lingxuan Wu,Bangguo Li,Hengkai Tan,Huayu Chen,Zhengyi Wang,Ke Xu,Hang Su,Jun Zhu
关键词-EN: extremely challenging due, developing foundation models, multi-modal action distributions, Robotics Diffusion Transformer, diffusion foundation model
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, conference

点击查看摘要

Abstract:Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to this https URL for the code and videos.

[CV-54] BA-Net: Bridge Attention in Deep Neural Networks

链接: https://arxiv.org/abs/2410.07860
作者: Ronghui Zhang,Runzong Zou,Yue Zhao,Zirui Zhang,Junzhou Chen,Yue Cao,Chuan Hu,Houbing Song
关键词-EN: highly influential, influential in numerous, Attention, numerous computer vision, Attention mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Attention mechanisms, particularly channel attention, have become highly influential in numerous computer vision tasks. Despite their effectiveness, many existing methods primarily focus on optimizing performance through complex attention modules applied at individual convolutional layers, often overlooking the synergistic interactions that can occur across multiple layers. In response to this gap, we introduce bridge attention, a novel approach designed to facilitate more effective integration and information flow between different convolutional layers. Our work extends the original bridge attention model (BAv1) by introducing an adaptive selection operator, which reduces information redundancy and optimizes the overall information exchange. This enhancement results in the development of BAv2, which achieves substantial performance improvements in the ImageNet classification task, obtaining Top-1 accuracies of 80.49% and 81.75% when using ResNet50 and ResNet101 as backbone networks, respectively. These results surpass the retrained baselines by 1.61% and 0.77%, respectively. Furthermore, BAv2 outperforms other existing channel attention techniques, such as the classical SENet101, exceeding its retrained performance by 0.52% Additionally, integrating BAv2 into advanced convolutional networks and vision transformers has led to significant gains in performance across a wide range of computer vision tasks, underscoring its broad applicability.

[CV-55] From Logits to Hierarchies: Hierarchical Clustering made Simple

链接: https://arxiv.org/abs/2410.07858
作者: Emanuele Palumbo,Moritz Vandenhirtz,Alain Ryser,Imant Daunhawer,Julia E. Vogt
关键词-EN: supervised machine learning, making the modeling, machine learning, intrinsically hierarchical, critical objective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The structure of many real-world datasets is intrinsically hierarchical, making the modeling of such hierarchies a critical objective in both unsupervised and supervised machine learning. Recently, novel approaches for hierarchical clustering with deep architectures have been proposed. In this work, we take a critical perspective on this line of research and demonstrate that many approaches exhibit major limitations when applied to realistic datasets, partly due to their high computational complexity. In particular, we show that a lightweight procedure implemented on top of pre-trained non-hierarchical clustering models outperforms models designed specifically for hierarchical clustering. Our proposed approach is computationally efficient and applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our findings, we illustrate how our method can also be applied in a supervised setup, recovering meaningful hierarchies from a pre-trained ImageNet classifier.

[CV-56] SNN-PAR: Energy Efficient Pedestrian Attribute Recognition via Spiking Neural Networks

链接: https://arxiv.org/abs/2410.07857
作者: Haiyang Wang,Qian Zhu,Mowen She,Yabo Li,Haoyu Song,Minghe Xu,Xiao Wang
关键词-EN: Pedestrian Attribute Recognition, Artificial neural network, Attribute Recognition, neural network, neural network based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Artificial neural network based Pedestrian Attribute Recognition (PAR) has been widely studied in recent years, despite many progresses, however, the energy consumption is still high. To address this issue, in this paper, we propose a Spiking Neural Network (SNN) based framework for energy-efficient attribute recognition. Specifically, we first adopt a spiking tokenizer module to transform the given pedestrian image into spiking feature representations. Then, the output will be fed into the spiking Transformer backbone networks for energy-efficient feature extraction. We feed the enhanced spiking features into a set of feed-forward networks for pedestrian attribute recognition. In addition to the widely used binary cross-entropy loss function, we also exploit knowledge distillation from the artificial neural network to the spiking Transformer network for more accurate attribute recognition. Extensive experiments on three widely used PAR benchmark datasets fully validated the effectiveness of our proposed SNN-PAR framework. The source code of this paper is released on \urlthis https URL.

[CV-57] HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter

链接: https://arxiv.org/abs/2410.07854
作者: Yumiao Zhao,Bo Jiang,Xiao Wang,Qin Xu,Jin Tang
关键词-EN: Adapter-based tuning methods, shown significant potential, pre-trained Vision-Language Models, Adapter-based tuning, Heterogeneous Graph
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models to the downstream tasks. However, after reviewing existing adapters, we find they generally fail to fully explore the interactions between different modalities in constructing task-specific knowledge. Also, existing works usually only focus on similarity matching between positive text prompts, making it challenging to distinguish the classes with high similar visual contents. To address these issues, in this paper, we propose a novel Heterogeneous Graph Adapter to achieve tuning VLMs for the downstream tasks. To be specific, we first construct a unified heterogeneous graph mode, which contains i) visual nodes, positive text nodes and negative text nodes, and ii) several types of edge connections to comprehensively model the intra-modality, inter-modality and inter-class structure knowledge together. Next, we employ a specific Heterogeneous Graph Neural Network to excavate multi-modality structure knowledge for adapting both visual and textual features for the downstream tasks. Finally, after HeGraphAdapter, we construct both text-based and visual-based classifiers simultaneously to comprehensively enhance the performance of the CLIP model. Experimental results on 11 benchmark datasets demonstrate the effectiveness and benefits of the proposed HeGraphAdapter.

[CV-58] MinorityPrompt: Text to Minority Image Generation via Prompt Optimization

链接: https://arxiv.org/abs/2410.07838
作者: Soobin Um,Jong Chul Ye
关键词-EN: latent diffusion models, diffusion models, latent diffusion, minority samples, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures

点击查看摘要

Abstract:We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of text-conditional data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for producing high-quality generations. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that can encourage the emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes the generation of minority features by incorporating a carefully-crafted likelihood objective. Our comprehensive experiments, conducted across various types of T2I models, demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.

[CV-59] Multi-Scale Deformable Transformers for Student Learning Behavior Detection in Smart Classroom

链接: https://arxiv.org/abs/2410.07834
作者: Zhifeng Wang,Minghui Wang,Chunyan Zeng,Longlong Li
关键词-EN: Artificial Intelligence, modern educational system, task traditionally dependent, integration of Artificial, rapidly evolving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 Pages

点击查看摘要

Abstract:The integration of Artificial Intelligence into the modern educational system is rapidly evolving, particularly in monitoring student behavior in classrooms, a task traditionally dependent on manual observation. This conventional method is notably inefficient, prompting a shift toward more advanced solutions like computer vision. However, existing target detection models face significant challenges such as occlusion, blurring, and scale disparity, which are exacerbated by the dynamic and complex nature of classroom settings. Furthermore, these models must adeptly handle multiple target detection. To overcome these obstacles, we introduce the Student Learning Behavior Detection with Multi-Scale Deformable Transformers (SCB-DETR), an innovative approach that utilizes large convolutional kernels for upstream feature extraction, and multi-scale feature fusion. This technique significantly improves the detection capabilities for multi-scale and occluded targets, offering a robust solution for analyzing student behavior. SCB-DETR establishes an end-to-end framework that simplifies the detection process and consistently outperforms other deep learning methods. Employing our custom Student Classroom Behavior (SCBehavior) Dataset, SCB-DETR achieves a mean Average Precision (mAP) of 0.626, which is a 1.5% improvement over the baseline model’s mAP and a 6% increase in AP50. These results demonstrate SCB-DETR’s superior performance in handling the uneven distribution of student behaviors and ensuring precise detection in dynamic classroom environments.

[CV-60] LaB-CL: Localized and Balanced Contrastive Learning for improving parking slot detection

链接: https://arxiv.org/abs/2410.07832
作者: U Jin Jeong,Sumin Roh,Il Yong Chun
关键词-EN: Parking slot detection, Parking slot, slot detection, autonomous parking systems, slot
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Parking slot detection is an essential technology in autonomous parking systems. In general, the classification problem of parking slot detection consists of two tasks, a task determining whether localized candidates are junctions of parking slots or not, and the other that identifies a shape of detected junctions. Both classification tasks can easily face biased learning toward the majority class, degrading classification performances. Yet, the data imbalance issue has been overlooked in parking slot detection. We propose the first supervised contrastive learning framework for parking slot detection, Localized and Balanced Contrastive Learning for improving parking slot detection (LaB-CL). The proposed LaB-CL framework uses two main approaches. First, we propose to include class prototypes to consider representations from all classes in every mini batch, from the local perspective. Second, we propose a new hard negative sampling scheme that selects local representations with high prediction error. Experiments with the benchmark dataset demonstrate that the proposed LaB-CL framework can outperform existing parking slot detection methods.

[CV-61] Exploring Foundation Models in Remote Sensing Image Change Detection: A Comprehensive Survey

链接: https://arxiv.org/abs/2410.07824
作者: Zihan Yu,Tianxiao Li,Yuxin Zhu,Rongze Pan
关键词-EN: URL recent years, http URL recent, widely applied technique, http URL, URL recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages

点击查看摘要

Abstract:Change detection, as an important and widely applied technique in the field of remote sensing, aims to analyze changes in surface areas over time and has broad applications in areas such as environmental monitoring, urban development, and land use this http URL recent years, deep learning, especially the development of foundation models, has provided more powerful solutions for feature extraction and data fusion, effectively addressing these complexities. This paper systematically reviews the latest advancements in the field of change detection, with a focus on the application of foundation models in remote sensing tasks.

[CV-62] Simple ReFlow: Improved Techniques for Fast Flow Models

链接: https://arxiv.org/abs/2410.07815
作者: Beomsu Kim,Yu-Guan Hsieh,Michal Klein,Marco Cuturi,Jong Chul Ye,Bahjat Kawar,James Thornton
关键词-EN: remarkable generative performance, Diffusion and flow-matching, flow-matching models achieve, models achieve remarkable, achieve remarkable generative
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many sampling steps, this slows inference and limits applicability to time-critical tasks. The ReFlow procedure can accelerate sampling by straightening generation trajectories. However, ReFlow is an iterative procedure, typically requiring training on simulated data, and results in reduced sample quality. To mitigate sample deterioration, we examine the design space of ReFlow and highlight potential pitfalls in prior heuristic practices. We then propose seven improvements for training dynamics, learning and inference, which are verified with thorough ablation studies on CIFAR10 32 \times 32 , AFHQv2 64 \times 64 , and FFHQ 64 \times 64 . Combining all our techniques, we achieve state-of-the-art FID scores (without / with guidance, resp.) for fast generation via neural ODEs: 2.23 / 1.98 on CIFAR10, 2.30 / 1.91 on AFHQv2, 2.84 / 2.67 on FFHQ, and 3.49 / 1.74 on ImageNet-64, all with merely 9 neural function evaluations.

[CV-63] Robotic framework for autonomous manipulation of laboratory equipment with different degrees of transparency via 6D pose estimation

链接: https://arxiv.org/abs/2410.07801
作者: Maria Makarova,Daria Trinitatova,Dzmitry Tsetserukou
关键词-EN: changing external conditions, special operator skills, require special operator, systems operate autonomously, modern robotic systems
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE); Systems and Control (eess.SY)
*备注: Accepted to the 2024 IEEE International Conference on Robotics and Biomimetics (IEEE ROBIO 2024), 8 pages, 11 figures

点击查看摘要

Abstract:Many modern robotic systems operate autonomously, however they often lack the ability to accurately analyze the environment and adapt to changing external conditions, while teleoperation systems often require special operator skills. In the field of laboratory automation, the number of automated processes is growing, however such systems are usually developed to perform specific tasks. In addition, many of the objects used in this field are transparent, making it difficult to analyze them using visual channels. The contributions of this work include the development of a robotic framework with autonomous mode for manipulating liquid-filled objects with different degrees of transparency in complex pose combinations. The conducted experiments demonstrated the robustness of the designed visual perception system to accurately estimate object poses for autonomous manipulation, and confirmed the performance of the algorithms in dexterous operations such as liquid dispensing. The proposed robotic framework can be applied for laboratory automation, since it allows solving the problem of performing non-trivial manipulation tasks with the analysis of object poses of varying degrees of transparency and liquid levels, requiring high accuracy and repeatability.

[CV-64] Optimal-State Dynamics Estimation for Physics-based Human Motion Capture from Videos NEURIPS2024

链接: https://arxiv.org/abs/2410.07795
作者: Cuong Le,Viktor Johansson,Manon Kok,Bastian Wandt
关键词-EN: made significant progress, recent years, capture from monocular, monocular videos, videos has made
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 7 figure, accepted to NeurIPS 2024

点击查看摘要

Abstract:Human motion capture from monocular videos has made significant progress in recent years. However, modern approaches often produce temporal artifacts, e.g. in form of jittery motion and struggle to achieve smooth and physically plausible motions. Explicitly integrating physics, in form of internal forces and exterior torques, helps alleviating these artifacts. Current state-of-the-art approaches make use of an automatic PD controller to predict torques and reaction forces in order to re-simulate the input kinematics, i.e. the joint angles of a predefined skeleton. However, due to imperfect physical models, these methods often require simplifying assumptions and extensive preprocessing of the input kinematics to achieve good performance. To this end, we propose a novel method to selectively incorporate the physics models with the kinematics observations in an online setting, inspired by a neural Kalman-filtering approach. We develop a control loop as a meta-PD controller to predict internal joint torques and external reaction forces, followed by a physics-based motion simulation. A recurrent neural network is introduced to realize a Kalman filter that attentively balances the kinematics input and simulated motion, resulting in an optimal-state dynamics prediction. We show that this filtering step is crucial to provide an online supervision that helps balancing the shortcoming of the respective input motions, thus being important for not only capturing accurate global motion trajectories but also producing physically plausible human poses. The proposed approach excels in the physics-based human pose estimation task and demonstrates the physical plausibility of the predictive dynamics, compared to state of the art. The code is available on this https URL

[CV-65] Enhancing Hyperspectral Image Prediction with Contrastive Learning in Low-Label Regime

链接: https://arxiv.org/abs/2410.07790
作者: Salma Haidar,José Oramas
关键词-EN: Self-supervised contrastive learning, Self-supervised contrastive, limited labelled data, addressing the challenge, challenge of limited
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-supervised contrastive learning is an effective approach for addressing the challenge of limited labelled data. This study builds upon the previously established two-stage patch-level, multi-label classification method for hyperspectral remote sensing imagery. We evaluate the method’s performance for both the single-label and multi-label classification tasks, particularly under scenarios of limited training data. The methodology unfolds in two stages. Initially, we focus on training an encoder and a projection network using a contrastive learning approach. This step is crucial for enhancing the ability of the encoder to discern patterns within the unlabelled data. Next, we employ the pre-trained encoder to guide the training of two distinct predictors: one for multi-label and another for single-label classification. Empirical results on four public datasets show that the predictors trained with our method perform better than those trained under fully supervised techniques. Notably, the performance is maintained even when the amount of training data is reduced by 50% . This advantage is consistent across both tasks. The method’s effectiveness comes from its streamlined architecture. This design allows for retraining the encoder along with the predictor. As a result, the encoder becomes more adaptable to the features identified by the classifier, improving the overall classification performance. Qualitative analysis reveals the contrastive-learning-based encoder’s capability to provide representations that allow separation among classes and identify location-based features despite not being explicitly trained for that. This observation indicates the method’s potential in uncovering implicit spatial information within the data.

[CV-66] CLIP Multi-modal Hashing for Multimedia Retrieval

链接: https://arxiv.org/abs/2410.07783
作者: Jian Zhu,Mingkai Sheng,Zhangmin Huang,Jingfei Chang,Jinling Jiang,Jian Long,Cheng Luo,Lei Liu
关键词-EN: Multi-modal hashing methods, Multi-modal hashing, CLIP Multi-modal Hashing, binary hash code, hashing methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 31st International Conference on MultiMedia Modeling (MMM2025)

点击查看摘要

Abstract:Multi-modal hashing methods are widely used in multimedia retrieval, which can fuse multi-source data to generate binary hash code. However, the individual backbone networks have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data, resulting in low retrieval accuracy. To address this issue, we propose a novel CLIP Multi-modal Hashing (CLIPMH) method. Our method employs the CLIP framework to extract both text and vision features and then fuses them to generate hash code. Due to enhancement on each modal feature, our method has great improvement in the retrieval performance of multi-modal hashing methods. Compared with state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly improve performance (a maximum increase of 8.38% in mAP).

[CV-67] Neural Semantic Map-Learning for Autonomous Vehicles IROS2024

链接: https://arxiv.org/abs/2410.07780
作者: Markus Herb,Nassir Navab,Federico Tombari
关键词-EN: demand detailed maps, vehicles demand detailed, Autonomous vehicles demand, reliably through traffic, safe operation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

点击查看摘要

Abstract:Autonomous vehicles demand detailed maps to maneuver reliably through traffic, which need to be kept up-to-date to ensure a safe operation. A promising way to adapt the maps to the ever-changing road-network is to use crowd-sourced data from a fleet of vehicles. In this work, we present a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment including drivable area, lane markings, poles, obstacles and more as a 3D mesh. Each vehicle contributes locally reconstructed submaps as lightweight meshes, making our method applicable to a wide range of reconstruction methods and sensor modalities. Our method jointly aligns and merges the noisy and incomplete local submaps using a scene-specific Neural Signed Distance Field, which is supervised using the submap meshes to predict a fused environment representation. We leverage memory-efficient sparse feature-grids to scale to large areas and introduce a confidence score to model uncertainty in scene reconstruction. Our approach is evaluated on two datasets with different local mapping methods, showing improved pose alignment and reconstruction over existing methods. Additionally, we demonstrate the benefit of multi-session mapping and examine the required amount of data to enable high-fidelity map learning for autonomous vehicles.

[CV-68] Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models ICASSP2025

链接: https://arxiv.org/abs/2410.07771
作者: Adriana Fernandez-Lopez,Shiwei Liu,Lu Yin,Stavros Petridis,Maja Pantic
关键词-EN: Conformer-based speech recognition, large-scale Conformer-based speech, large-scale Conformer-based, speech recognition models, Conformer-based speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:This paper investigates the under-explored area of low-rank weight training for large-scale Conformer-based speech recognition models from scratch. Our study demonstrates the viability of this training paradigm for such models, yielding several notable findings. Firstly, we discover that applying a low-rank structure exclusively to the attention modules can unexpectedly enhance performance, even with a significant rank reduction of 12%. In contrast, feed-forward layers present greater challenges, as they begin to exhibit performance degradation with a moderate 50% rank reduction. Furthermore, we find that both initialization and layer-wise rank assignment play critical roles in successful low-rank training. Specifically, employing SVD initialization and linear layer-wise rank mapping significantly boosts the efficacy of low-rank weight training. Building on these insights, we introduce the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while delivering substantial reductions in parameters count (by at least 2x), and training time speedups (by 1.3x for ASR and 1.15x for AVSR).

[CV-69] HARIVO: Harnessing Text-to-Image Models for Video Generation ECCV2024

链接: https://arxiv.org/abs/2410.07763
作者: Mingi Kwon,Seoung Wug Oh,Yang Zhou,Difan Liu,Joon-Young Lee,Haoran Cai,Baqiao Liu,Feng Liu,Youngjung Uh
关键词-EN: create diffusion-based video, create diffusion-based, diffusion-based video models, diffusion-based video, video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV2024

点击查看摘要

Abstract:We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. project page: this https URL

[CV-70] textitJump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

链接: https://arxiv.org/abs/2410.07761
作者: Yong-Hyun Park,Chieh-Hsin Lai,Satoshi Hayakawa,Yuhta Takida,Yuki Mitsufuji
关键词-EN: discrete diffusion models, Diffusion models, Compounding Decoding Error, continuous domains, notable success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like \tau -leaping accelerate this process, they introduce \textitCompounding Decoding Error (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we present \textitJump Your Steps (JYS), a novel approach that optimizes the allocation of discrete sampling timesteps by minimizing CDE without extra computational cost. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling.

[CV-71] HeightFormer: A Semantic Alignment Monocular 3D Object Detection Method from Roadside Perspective

链接: https://arxiv.org/abs/2410.07758
作者: Pei Liu(1),Zihao Zhang(2),Haipeng Liu(3),Nanfang Zheng(4),Meixin Zhu(1),Ziyuan Pu(4) ((1) Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), (2) School of Cyber Science and Engineering, Southeast University, (3) Li Auto Inc, (4) School of Transportation, Southeast University)
关键词-EN: received extensive attention, applying roadside sensors, object detection technology, traffic object detection, critical technology
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The on-board 3D object detection technology has received extensive attention as a critical technology for autonomous driving, while few studies have focused on applying roadside sensors in 3D traffic object detection. Existing studies achieve the projection of 2D image features to 3D features through height estimation based on the frustum. However, they did not consider the height alignment and the extraction efficiency of bird’s-eye-view features. We propose a novel 3D object detection framework integrating Spatial Former and Voxel Pooling Former to enhance 2D-to-3D projection based on height estimation. Extensive experiments were conducted using the Rope3D and DAIR-V2X-I dataset, and the results demonstrated the outperformance of the proposed algorithm in the detection of both vehicles and cyclists. These results indicate that the algorithm is robust and generalized under various detection scenarios. Improving the accuracy of 3D object detection on the roadside is conducive to building a safe and trustworthy intelligent transportation system of vehicle-road coordination and promoting the large-scale application of autonomous driving. The code and pre-trained models will be released on this https URL.

[CV-72] MMHead: Towards Fine-grained Multi-modal 3D Facial Animation

链接: https://arxiv.org/abs/2410.07757
作者: Sijing Wu,Yunhao Li,Yichao Yan,Huiyu Duan,Ziwei Liu,Guangtao Zhai
关键词-EN: attracted considerable attention, facial animation, considerable attention due, facial, animation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACMMM 2024. Project page: this https URL

点击查看摘要

Abstract:3D facial animation has attracted considerable attention due to its extensive applications in the multimedia field. Audio-driven 3D facial animation has been widely explored with promising results. However, multi-modal 3D facial animation, especially text-guided 3D facial animation is rarely explored due to the lack of multi-modal 3D facial animation dataset. To fill this gap, we first construct a large-scale multi-modal 3D facial animation dataset, MMHead, which consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations. Each text annotation contains abstract action and emotion descriptions, fine-grained facial and head movements (i.e., expression and head pose) descriptions, and three possible scenarios that may cause such emotion. Concretely, we integrate five public 2D portrait video datasets, and propose an automatic pipeline to 1) reconstruct 3D facial motion sequences from monocular videos; and 2) obtain hierarchical text annotations with the help of AU detection and ChatGPT. Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation. Moreover, a simple but efficient VQ-VAE-based method named MM2Face is proposed to unify the multi-modal information and generate diverse and plausible 3D facial motions, which achieves competitive results on both benchmarks. Extensive experiments and comprehensive analysis demonstrate the significant potential of our dataset and benchmarks in promoting the development of multi-modal 3D facial animation.

[CV-73] Synthesizing Multi-Class Surgical Datasets with Anatomy-Aware Diffusion Models

链接: https://arxiv.org/abs/2410.07753
作者: Danush Kumar Venkatesh,Dominik Rivoir,Micha Pfeiffer,Fiona Kolbinger,Stefanie Speidel
关键词-EN: providing intraoperative assistance, automatically recognizing anatomical, computer-assisted surgery, automatically recognizing, intraoperative assistance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In computer-assisted surgery, automatically recognizing anatomical organs is crucial for understanding the surgical scene and providing intraoperative assistance. While machine learning models can identify such structures, their deployment is hindered by the need for labeled, diverse surgical datasets with anatomical annotations. Labeling multiple classes (i.e., organs) in a surgical scene is time-intensive, requiring medical experts. Although synthetically generated images can enhance segmentation performance, maintaining both organ structure and texture during generation is challenging. We introduce a multi-stage approach using diffusion models to generate multi-class surgical datasets with annotations. Our framework improves anatomy awareness by training organ specific models with an inpainting objective guided by binary segmentation masks. The organs are generated with an inference pipeline using pre-trained ControlNet to maintain the organ structure. The synthetic multi-class datasets are constructed through an image composition step, ensuring structural and textural consistency. This versatile approach allows the generation of multi-class datasets from real binary datasets and simulated surgical masks. We thoroughly evaluate the generated datasets on image quality and downstream segmentation, achieving a 15% improvement in segmentation scores when combined with real images. Our codebase this https URL

[CV-74] VBench: Redesigning Video-Language Evaluation

链接: https://arxiv.org/abs/2410.07752
作者: Daniel Cores,Michael Dorkenwald,Manuel Mucientes,Cees G. M. Snoek,Yuki M. Asano
关键词-EN: Large language models, Large language, demonstrated impressive performance, demonstrated impressive, integrated with vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large language models have demonstrated impressive performance when integrated with vision models even enabling video understanding. However, evaluating these video models presents its own unique challenges, for which several benchmarks have been proposed. In this paper, we show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative. As a solution, we propose TVBench, a novel open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluations that it requires a high level of temporal understanding. Surprisingly, we find that most recent state-of-the-art video-language models perform similarly to random performance on TVBench, with only Gemini-Pro and Tarsier clearly surpassing this baseline.

[CV-75] MGMapNet: Multi-Granularity Representation Learning for End-to-End Vectorized HD Map Construction

链接: https://arxiv.org/abs/2410.07733
作者: Jing Yang,Minyue Jiang,Sen Yang,Xiao Tan,Yingying Li,Errui Ding,Hanli Wang,Jingdong Wang
关键词-EN: typically requires capturing, Vectorized High-Definition, construction of Vectorized, map typically requires, typically requires
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The construction of Vectorized High-Definition (HD) map typically requires capturing both category and geometry information of map elements. Current state-of-the-art methods often adopt solely either point-level or instance-level representation, overlooking the strong intrinsic relationships between points and instances. In this work, we propose a simple yet efficient framework named MGMapNet (Multi-Granularity Map Network) to model map element with a multi-granularity representation, integrating both coarse-grained instance-level and fine-grained point-level queries. Specifically, these two granularities of queries are generated from the multi-scale bird’s eye view (BEV) features using a proposed Multi-Granularity Aggregator. In this module, instance-level query aggregates features over the entire scope covered by an instance, and the point-level query aggregates features locally. Furthermore, a Point Instance Interaction module is designed to encourage information exchange between instance-level and point-level queries. Experimental results demonstrate that the proposed MGMapNet achieves state-of-the-art performance, surpassing MapTRv2 by 5.3 mAP on nuScenes and 4.4 mAP on Argoverse2 respectively.

[CV-76] Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

链接: https://arxiv.org/abs/2410.07718
作者: Jiahao Cui,Hui Li,Yao Yao,Hao Zhu,Hanlin Shang,Kaihui Cheng,Hang Zhou,Siyu Zhu,Jingdong Wang
关键词-EN: diffusion-based generative models, Recent advances, latent diffusion-based generative, achieved impressive results, diffusion-based generative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in latent diffusion-based generative models for portrait image animation, such as Hallo, have achieved impressive results in short-duration video synthesis. In this paper, we present updates to Hallo, introducing several design enhancements to extend its capabilities. First, we extend the method to produce long-duration videos. To address substantial challenges such as appearance drift and temporal artifacts, we investigate augmentation strategies within the image space of conditional motion frames. Specifically, we introduce a patch-drop technique augmented with Gaussian noise to enhance visual consistency and temporal coherence over long duration. Second, we achieve 4K resolution portrait video generation. To accomplish this, we implement vector quantization of latent codes and apply temporal alignment techniques to maintain coherence across the temporal dimension. By integrating a high-quality decoder, we realize visual synthesis at 4K resolution. Third, we incorporate adjustable semantic textual labels for portrait expressions as conditional inputs. This extends beyond traditional audio cues to improve controllability and increase the diversity of the generated content. To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts. We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced “Wild” dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for duration extending up to tens of minutes. Project page this https URL

[CV-77] MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting NEURIPS2024

链接: https://arxiv.org/abs/2410.07707
作者: Ruijie Zhu,Yanzhe Liang,Hanzhi Chang,Jiacheng Deng,Jiahao Lu,Wenfei Yang,Tianzhu Zhang,Yongdong Zhang
关键词-EN: Gaussian Splatting, Dynamic scene reconstruction, long-term challenge, Gaussian splatting framework, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024. 21 pages, 14 figures,7 tables

点击查看摘要

Abstract:Dynamic scene reconstruction is a long-term challenge in the field of 3D vision. Recently, the emergence of 3D Gaussian Splatting has provided new insights into this problem. Although subsequent efforts rapidly extend static 3D Gaussian to dynamic scenes, they often lack explicit constraints on object motion, leading to optimization difficulties and performance degradation. To address the above issues, we propose a novel deformable 3D Gaussian splatting framework called MotionGS, which explores explicit motion priors to guide the deformation of 3D Gaussians. Specifically, we first introduce an optical flow decoupling module that decouples optical flow into camera flow and motion flow, corresponding to camera movement and object motion respectively. Then the motion flow can effectively constrain the deformation of 3D Gaussians, thus simulating the motion of dynamic objects. Additionally, a camera pose refinement module is proposed to alternately optimize 3D Gaussians and camera poses, mitigating the impact of inaccurate camera poses. Extensive experiments in the monocular dynamic scenes validate that MotionGS surpasses state-of-the-art methods and exhibits significant superiority in both qualitative and quantitative results. Project page: this https URL

[CV-78] st-Time Intensity Consistency Adaptation for Shadow Detection ICONIP2024

链接: https://arxiv.org/abs/2410.07695
作者: Leyi Zhu,Weihuang Liu,Xinyi Chen,Zimeng Li,Xuhang Chen,Zhen Wang,Chi-Man Pun
关键词-EN: accurate scene understanding, object geometry, scene context, computer vision, variations in illumination
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 5 figures, published to ICONIP 2024

点击查看摘要

Abstract:Shadow detection is crucial for accurate scene understanding in computer vision, yet it is challenged by the diverse appearances of shadows caused by variations in illumination, object geometry, and scene context. Deep learning models often struggle to generalize to real-world images due to the limited size and diversity of training datasets. To address this, we introduce TICA, a novel framework that leverages light-intensity information during test-time adaptation to enhance shadow detection accuracy. TICA exploits the inherent inconsistencies in light intensity across shadow regions to guide the model toward a more consistent prediction. A basic encoder-decoder model is initially trained on a labeled dataset for shadow detection. Then, during the testing phase, the network is adjusted for each test sample by enforcing consistent intensity predictions between two augmented input image versions. This consistency training specifically targets both foreground and background intersection regions to identify shadow regions within images accurately for robust adaptation. Extensive evaluations on the ISTD and SBU shadow detection datasets reveal that TICA significantly demonstrates that TICA outperforms existing state-of-the-art methods, achieving superior results in balanced error rate (BER).

[CV-79] Growing Efficient Accurate and Robust Neural Networks on the Edge

链接: https://arxiv.org/abs/2410.07691
作者: Vignesh Sundaresha,Naresh Shanbhag
关键词-EN: occurring common corruptions, deep learning systems, computational complexity coupled, naturally occurring common, resource-constrained Edge devices
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:The ubiquitous deployment of deep learning systems on resource-constrained Edge devices is hindered by their high computational complexity coupled with their fragility to out-of-distribution (OOD) data, especially to naturally occurring common corruptions. Current solutions rely on the Cloud to train and compress models before deploying to the Edge. This incurs high energy and latency costs in transmitting locally acquired field data to the Cloud while also raising privacy concerns. We propose GEARnn (Growing Efficient, Accurate, and Robust neural networks) to grow and train robust networks in-situ, i.e., completely on the Edge device. Starting with a low-complexity initial backbone network, GEARnn employs One-Shot Growth (OSG) to grow a network satisfying the memory constraints of the Edge device using clean data, and robustifies the network using Efficient Robust Augmentation (ERA) to obtain the final network. We demonstrate results on a NVIDIA Jetson Xavier NX, and analyze the trade-offs between accuracy, robustness, model size, energy consumption, and training time. Our results demonstrate the construction of efficient, accurate, and robust networks entirely on an Edge device.

[CV-80] When the Small-Loss Trick is Not Enough: Multi-Label Image Classification with Noisy Labels Applied to CCTV Sewer Inspections

链接: https://arxiv.org/abs/2410.07689
作者: Keryan Chelouche,Marie Lachaize(VERI),Marine Bernard(VERI),Louise Olgiati,Remi Cuingnet
关键词-EN: efficient Closed-Circuit Television, Closed-Circuit Television, label noise, sewerage networks, heavily relies
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The maintenance of sewerage networks, with their millions of kilometers of pipe, heavily relies on efficient Closed-Circuit Television (CCTV) inspections. Many promising approaches based on multi-label image classification have leveraged databases of historical inspection reports to automate these inspections. However, the significant presence of label noise in these databases, although known, has not been addressed. While extensive research has explored the issue of label noise in singlelabel classification (SLC), little attention has been paid to label noise in multi-label classification (MLC). To address this, we first adapted three sample selection SLC methods (Co-teaching, CoSELFIE, and DISC) that have proven robust to label noise. Our findings revealed that sample selection based solely on the small-loss trick can handle complex label noise, but it is sub-optimal. Adapting hybrid sample selection methods to noisy MLC appeared to be a more promising approach. In light of this, we developed a novel method named MHSS (Multi-label Hybrid Sample Selection) based on CoSELFIE. Through an in-depth comparative study, we demonstrated the superior performance of our approach in dealing with both synthetic complex noise and real noise, thus contributing to the ongoing efforts towards effective automation of CCTV sewer pipe inspections.

[CV-81] PokeFlex: A Real-World Dataset of Deformable Objects for Robotics

链接: https://arxiv.org/abs/2410.07688
作者: Jan Obrist,Miguel Zamora,Hehui Zheng,Ronan Hinchet,Firat Ozdemir,Juan Zarate,Robert K. Katzschmann,Stelian Coros
关键词-EN: shown great potential, solving challenging manipulation, challenging manipulation tasks, Data-driven methods, shown great
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Data-driven methods have shown great potential in solving challenging manipulation tasks, however, their application in the domain of deformable objects has been constrained, in part, by the lack of data. To address this, we propose PokeFlex, a dataset featuring real-world paired and annotated multimodal data that includes 3D textured meshes, point clouds, RGB images, and depth maps. Such data can be leveraged for several downstream tasks such as online 3D mesh reconstruction, and it can potentially enable underexplored applications such as the real-world deployment of traditional control methods based on mesh simulations. To deal with the challenges posed by real-world 3D mesh reconstruction, we leverage a professional volumetric capture system that allows complete 360° reconstruction. PokeFlex consists of 18 deformable objects with varying stiffness and shapes. Deformations are generated by dropping objects onto a flat surface or by poking the objects with a robot arm. Interaction forces and torques are also reported for the latter case. Using different data modalities, we demonstrated a use case for the PokeFlex dataset in online 3D mesh reconstruction. We refer the reader to our website ( this https URL ) for demos and examples of our dataset.

[CV-82] Relational Diffusion Distillation for Efficient Image Generation

链接: https://arxiv.org/abs/2410.07679
作者: Weilun Feng,Chuanguang Yang,Zhulin An,Libo Huang,Boyu Diao,Fei Wang,Yongjun Xu
关键词-EN: scarce computing resources, high inference delay, inference delay hinders, achieved remarkable performance, image generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although the diffusion model has achieved remarkable performance in the field of image generation, its high inference delay hinders its wide application in edge devices with scarce computing resources. Therefore, many training-free sampling methods have been proposed to reduce the number of sampling steps required for diffusion models. However, they perform poorly under a very small number of sampling steps. Thanks to the emergence of knowledge distillation technology, the existing training scheme methods have achieved excellent results at very low step numbers. However, the current methods mainly focus on designing novel diffusion model sampling methods with knowledge distillation. How to transfer better diffusion knowledge from teacher models is a more valuable problem but rarely studied. Therefore, we propose Relational Diffusion Distillation (RDD), a novel distillation method tailored specifically for distilling diffusion models. Unlike existing methods that simply align teacher and student models at pixel level or feature distributions, our method introduces cross-sample relationship interaction during the distillation process and alleviates the memory constraints induced by multiple sample interactions. Our RDD significantly enhances the effectiveness of the progressive distillation framework within the diffusion model. Extensive experiments on several datasets (e.g., CIFAR-10 and ImageNet) demonstrate that our proposed RDD leads to 1.47 FID decrease under 1 sampling step compared to state-of-the-art diffusion distillation methods and achieving 256x speed-up compared to DDIM strategy. Code is available at this https URL.

[CV-83] Delta-ICM: Entropy Modeling with Delta Function for Learned Image Compression

链接: https://arxiv.org/abs/2410.07669
作者: Takahiro Shindo,Taiju Watanabe,Yui Tatsumi,Hiroshi Watanabe
关键词-EN: Image Coding, ICM, Image, computer vision progresses, Coding for Machines
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Image Coding for Machines (ICM) is becoming more important as research in computer vision progresses. ICM is a vital research field that pursues the use of images for image recognition models, facilitating efficient image transmission and storage. The demand for recognition models is growing rapidly among the general public, and their performance continues to improve. To meet these needs, exchanging image data between consumer devices and cloud AI using ICM technology could be one possible solution. In ICM, various image compression methods have adopted Learned Image Compression (LIC). LIC includes an entropy model for estimating the bitrate of latent features, and the design of this model significantly affects its performance. Typically, LIC methods assume that the distribution of latent features follows a normal distribution. This assumption is effective for compressing images intended for human vision. However, employing an entropy model based on normal distribution is inefficient in ICM due to the limitation of image parts that require precise decoding. To address this, we propose Delta-ICM, which uses a probability distribution based on a delta function. Assuming the delta distribution as a distribution of latent features reduces the entropy of image portions unnecessary for machines. We compress the remaining portions using an entropy model based on normal distribution, similar to existing methods. Delta-ICM selects between the entropy model based on the delta distribution and the one based on the normal distribution for each latent feature. Our method outperforms existing ICM methods in image compression performance aimed at machines.

[CV-84] MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion

链接: https://arxiv.org/abs/2410.07659
作者: Onkar Susladkar,Jishu Sen Gupta,Chirag Sehgal,Sparsh Mittal,Rekha Singhal
关键词-EN: presents significant challenges, Vector-Quantization Variational Autoencoder, combines Variational Autoencoders, spatio-temporal complexity, data presents significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under submission at a conference

点击查看摘要

Abstract:The spatio-temporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We present four key contributions to address the challenges of spatiotemporal video processing. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked token modeling to enhance spatiotemporal video compression. The model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality by employing a novel training strategy with full frame masking. Second, we present MotionAura, a text-to-video generation framework that utilizes vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain using the Fourier Transform. This method effectively captures global context and long-range dependencies for high-quality video generation and denoising. Lastly, we introduce a downstream task of Sketch Guided Video Inpainting. This task leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our models achieve SOTA performance on a range of benchmarks. Our work offers robust frameworks for spatiotemporal modeling and user-driven video content manipulation. We will release the code, datasets, and models in open-source.

[CV-85] SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors

链接: https://arxiv.org/abs/2410.07658
作者: Xiao Cai,Pengpeng Zeng,Lianli Gao,Junchen Zhu,Jiaxin Zhang,Sitong Su,Heng Tao Shen,Jingkuan Song
关键词-EN: Recent advancements, advancements in generic, remarkable by fine-tuning, multi-view consistency, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in generic 3D content generation from text prompts have been remarkable by fine-tuning text-to-image diffusion (T2I) models or employing these T2I models as priors to learn a general text-to-3D model. While fine-tuning-based methods ensure great alignment between text and generated views, i.e., semantic consistency, their ability to achieve multi-view consistency is hampered by the absence of 3D constraints, even in limited view. In contrast, prior-based methods focus on regressing 3D shapes with any view that maintains uniformity and coherence across views, i.e., multi-view consistency, but such approaches inevitably compromise visual-textual alignment, leading to a loss of semantic details in the generated objects. To achieve semantic and multi-view consistency simultaneously, we propose SeMv-3D, a novel framework for general text-to-3d generation. Specifically, we propose a Triplane Prior Learner (TPL) that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level, e.g., geometry and texture. Moreover, we design a Semantic-aligned View Synthesizer (SVS) that preserves the alignment between 3D spatial features and textual semantics in latent space. In SVS, we devise a simple yet effective batch sampling and rendering strategy that can generate arbitrary views in a single feed-forward inference. Extensive experiments present our SeMv-3D’s superiority over state-of-the-art performances with semantic and multi-view consistency in any view. Our code and more visual results are available at this https URL.

[CV-86] FLIER: Few-shot Language Image Models Embedded with Latent Representations

链接: https://arxiv.org/abs/2410.07648
作者: Zhinuo Zhou,Peng Zhou,Xiaoyong Pan
关键词-EN: Contrastive Language-Image Pre-training, low-data regimes scenes, Language-Image Pre-training, Contrastive Language-Image, shown impressive abilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages,3 figures

点击查看摘要

Abstract:As the boosting development of large vision-language models like Contrastive Language-Image Pre-training (CLIP), many CLIP-like methods have shown impressive abilities on visual recognition, especially in low-data regimes scenes. However, we have noticed that most of these methods are limited to introducing new modifications on text and image encoder. Recently, latent diffusion models (LDMs) have shown good ability on image generation. The potent capabilities of LDMs direct our focus towards the latent representations sampled by UNet. Inspired by the conjecture in CoOp that learned prompts encode meanings beyond the existing vocabulary, we assume that, for deep models, the latent representations are concise and accurate understanding of images, in which high-frequency, imperceptible details are abstracted away. In this paper, we propose a Few-shot Language Image model Embedded with latent Representations (FLIER) for image recognition by introducing a latent encoder jointly trained with CLIP’s image encoder, it incorporates pre-trained vision-language knowledge of CLIP and the latent representations from Stable Diffusion. We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3. With latent representations as “models-understandable pixels”, we introduce a flexible convolutional neural network with two convolutional layers to be the latent encoder, which is simpler than most encoders in vision-language models. The latent encoder is jointly trained with CLIP’s image encoder, transferring pre-trained knowledge to downstream tasks better. Experiments and extensive ablation studies on various visual classification tasks demonstrate that FLIER performs state-of-the-art on 11 datasets for most few-shot classification.

[CV-87] Shift and matching queries for video semantic segmentation

链接: https://arxiv.org/abs/2410.07635
作者: Tsubasa Mizuno,Toru Tamaki
关键词-EN: preserve temporal consistency, applying image segmentation, popular task, temporal consistency, image segmentation models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video segmentation is a popular task, but applying image segmentation models frame-by-frame to videos does not preserve temporal consistency. In this paper, we propose a method to extend a query-based image segmentation model to video using feature shift and query matching. The method uses a query-based architecture, where decoded queries represent segmentation masks. These queries should be matched before performing the feature shift to ensure that the shifted queries represent the same mask across different frames. Experimental results on CityScapes-VPS and VSPW show significant improvements from the baselines, highlighting the method’s effectiveness in enhancing segmentation quality while efficiently reusing pre-trained weights.

[CV-88] DPL: Cross-quality DeepFake Detection via Dual Progressive Learning ACCV2024

链接: https://arxiv.org/abs/2410.07633
作者: Dongliang Zhang,Yunfei Li,Jiaran Zhou,Yuezun Li
关键词-EN: Real-world DeepFake videos, Real-world DeepFake, cross-quality DeepFake detection, DeepFake detection, compression operations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV 2024

点击查看摘要

Abstract:Real-world DeepFake videos often undergo various compression operations, resulting in a range of video qualities. These varying qualities diversify the pattern of forgery traces, significantly increasing the difficulty of DeepFake detection. To address this challenge, we introduce a new Dual Progressive Learning (DPL) framework for cross-quality DeepFake detection. We liken this task to progressively drilling for underground water, where low-quality videos require more effort than high-quality ones. To achieve this, we develop two sequential-based branches to “drill waters” with different efforts. The first branch progressively excavates the forgery traces according to the levels of video quality, i.e., time steps, determined by a dedicated CLIP-based indicator. In this branch, a Feature Selection Module is designed to adaptively assign appropriate features to the corresponding time steps. Considering that different techniques may introduce varying forgery traces within the same video quality, we design a second branch targeting forgery identifiability as complementary. This branch operates similarly and shares the feature selection module with the first branch. Our design takes advantage of the sequential model where computational units share weights across different time steps and can memorize previous progress, elegantly achieving progressive learning while maintaining reasonable memory costs. Extensive experiments demonstrate the superiority of our method for cross-quality DeepFake detection.

[CV-89] MorCode: Face Morphing Attack Generation using Generative Codebooks

链接: https://arxiv.org/abs/2410.07625
作者: Aravinda Reddy PN,Raghavendra Ramachandra,Sushma Venkatesh,Krothapalli Sreenivasa Rao,Pabitra Mitra,Rakesh Krishna
关键词-EN: Generative Adversarial Networks, multiple facial images, Face recognition systems, morphing generation, face morphing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Face recognition systems (FRS) can be compromised by face morphing attacks, which blend textural and geometric information from multiple facial images. The rapid evolution of generative AI, especially Generative Adversarial Networks (GAN) or Diffusion models, where encoded images are interpolated to generate high-quality face morphing images. In this work, we present a novel method for the automatic face morphing generation method \textitMorCode, which leverages a contemporary encoder-decoder architecture conditioned on codebook learning to generate high-quality morphing images. Extensive experiments were performed on the newly constructed morphing dataset using five state-of-the-art morphing generation techniques using both digital and print-scan data. The attack potential of the proposed morphing generation technique, \textitMorCode, was benchmarked using three different face recognition systems. The obtained results indicate the highest attack potential of the proposed \textitMorCode when compared with five state-of-the-art morphing generation methods on both digital and print scan data.

[CV-90] Moyun: A Diffusion-Based Model for Style-Specific Chinese Calligraphy Generation

链接: https://arxiv.org/abs/2410.07618
作者: Kaiyuan Liu,Jiahao Mei,Hengyu Zhang,Yihuai Zhang,Xingjiao Wu,Daoguo Dong,Liang He
关键词-EN: Chinese calligraphy generation, achieved style transfer, style remains challenging, character style remains, Chinese calligraphy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Although Chinese calligraphy generation has achieved style transfer, generating calligraphy by specifying the calligrapher, font, and character style remains challenging. To address this, we propose a new Chinese calligraphy generation model ‘Moyun’ , which replaces the Unet in the Diffusion model with Vision Mamba and introduces the TripleLabel control mechanism to achieve controllable calligraphy generation. The model was tested on our large-scale dataset ‘Mobao’ of over 1.9 million images, and the results demonstrate that ‘Moyun’ can effectively control the generation process and produce calligraphy in the specified style. Even for calligraphy the calligrapher has not written, ‘Moyun’ can generate calligraphy that matches the style of the calligrapher.

[CV-91] Prototype-based Optimal Transport for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2410.07617
作者: Ao Ke,Wenlong Chen,Chuanwen Feng,Yukun Cao,Xike Xie,S.Kevin Zhou,Lei Feng
关键词-EN: deep neural networks, OOD, OOD inputs, OOD data, real-world deployment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Detecting Out-of-Distribution (OOD) inputs is crucial for improving the reliability of deep neural networks in the real-world deployment. In this paper, inspired by the inherent distribution shift between ID and OOD data, we propose a novel method that leverages optimal transport to measure the distribution discrepancy between test inputs and ID prototypes. The resulting transport costs are used to quantify the individual contribution of each test input to the overall discrepancy, serving as a desirable measure for OOD detection. To address the issue that solely relying on the transport costs to ID prototypes is inadequate for identifying OOD inputs closer to ID data, we generate virtual outliers to approximate the OOD region via linear extrapolation. By combining the transport costs to ID prototypes with the costs to virtual outliers, the detection of OOD data near ID data is emphasized, thereby enhancing the distinction between ID and OOD inputs. Experiments demonstrate the superiority of our method over state-of-the-art methods.

[CV-92] Explainability of Deep Neural Networks for Brain Tumor Detection

链接: https://arxiv.org/abs/2410.07613
作者: S.Park,J.Kim
关键词-EN: supporting healthcare professionals, Convolutional Neural Networks, Medical image classification, image classification, classification is crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 13 figures

点击查看摘要

Abstract:Medical image classification is crucial for supporting healthcare professionals in decision-making and training. While Convolutional Neural Networks (CNNs) have traditionally dominated this field, Transformer-based models are gaining attention. In this study, we apply explainable AI (XAI) techniques to assess the performance of various models on real-world medical data and identify areas for improvement. We compare CNN models such as VGG-16, ResNet-50, and EfficientNetV2L with a Transformer model: ViT-Base-16. Our results show that data augmentation has little impact, but hyperparameter tuning and advanced modeling improve performance. CNNs, particularly VGG-16 and ResNet-50, outperform ViT-Base-16 and EfficientNetV2L, likely due to underfitting from limited data. XAI methods like LIME and SHAP further reveal that better-performing models visualize tumors more effectively. These findings suggest that CNNs with shallower architectures are more effective for small datasets and can support medical decision-making.

[CV-93] CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

链接: https://arxiv.org/abs/2410.07610
作者: Po-han Li,Sandeep P. Chinchali,Ufuk Topcu
关键词-EN: cross-modal retrieval, CSA, excel in tasks, Multimodal, CLIP excel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring 300,000\times fewer multimodal data pairs and 6\times fewer unimodal data for ImageNet classification and misinformative news captions detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.

[CV-94] A Variational Bayesian Inference Theory of Elasticity and Its Mixed Probabilistic Finite Element Method for Inverse Deformation Solutions in Any Dimension

链接: https://arxiv.org/abs/2410.07605
作者: Chao Wang,Shaofan Li
关键词-EN: variational Bayesian inference, Bayesian inference theory, Bayesian inference, Bayesian inference Finite, Bayesian inference network
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this work, we have developed a variational Bayesian inference theory of elasticity, which is accomplished by using a mixed Variational Bayesian inference Finite Element Method (VBI-FEM) that can be used to solve the inverse deformation problems of continua. In the proposed variational Bayesian inference theory of continuum mechanics, the elastic strain energy is used as a prior in a Bayesian inference network, which can intelligently recover the detailed continuum deformation mappings with only given the information on the deformed and undeformed continuum body shapes without knowing the interior deformation and the precise actual boundary conditions, both traction as well as displacement boundary conditions, and the actual material constitutive relation. Moreover, we have implemented the related finite element formulation in a computational probabilistic mechanics framework. To numerically solve mixed variational problem, we developed an operator splitting or staggered algorithm that consists of the finite element (FE) step and the Bayesian learning (BL) step as an analogue of the well-known the Expectation-Maximization (EM) algorithm. By solving the mixed probabilistic Galerkin variational problem, we demonstrated that the proposed method is able to inversely predict continuum deformation mappings with strong discontinuity or fracture without knowing the external load conditions. The proposed method provides a robust machine intelligent solution for the long-sought-after inverse problem solution, which has been a major challenge in structure failure forensic pattern analysis in past several decades. The proposed method may become a promising artificial intelligence-based inverse method for solving general partial differential equations.

[CV-95] RNA: Video Editing with ROI-based Neural Atlas ACCV2024

链接: https://arxiv.org/abs/2410.07600
作者: Jaekyeong Lee,Geonung Kim,Sunghyun Cho
关键词-EN: Social Network Service, video-based Social Network, Network Service, Social Network, video-based Social
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV2024

点击查看摘要

Abstract:With the recent growth of video-based Social Network Service (SNS) platforms, the demand for video editing among common users has increased. However, video editing can be challenging due to the temporally-varying factors such as camera movement and moving objects. While modern atlas-based video editing methods have addressed these issues, they often fail to edit videos including complex motion or multiple moving objects, and demand excessive computational cost, even for very simple edits. In this paper, we propose a novel region-of-interest (ROI)-based video editing framework: ROI-based Neural Atlas (RNA). Unlike prior work, RNA allows users to specify editing regions, simplifying the editing process by removing the need for foreground separation and atlas modeling for foreground objects. However, this simplification presents a unique challenge: acquiring a mask that effectively handles occlusions in the edited area caused by moving objects, without relying on an additional segmentation model. To tackle this, we propose a novel mask refinement approach designed for this specific challenge. Moreover, we introduce a soft neural atlas model for video reconstruction to ensure high-quality editing results. Extensive experiments show that RNA offers a more practical and efficient editing solution, applicable to a wider range of videos with superior quality compared to prior methods.

[CV-96] Causal Image Modeling for Efficient Visual Understanding

链接: https://arxiv.org/abs/2410.07599
作者: Feng Wang,Timing Yang,Yaodong Yu,Sucheng Ren,Guoyizhe Wei,Angtian Wang,Wei Shao,Yuyin Zhou,Alan Yuille,Cihang Xie
关键词-EN: learn visual representations, employ uni-directional language, uni-directional language models, Adventurer series models, causal image modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we present a comprehensive analysis of causal image modeling and introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies demonstrate the significant efficiency and effectiveness of this causal image modeling paradigm. For example, our base-sized Adventurer model attains a competitive test accuracy of 84.0% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 5.3 times more efficient than vision transformers to achieve the same result.

[CV-97] Fine-detailed Neural Indoor Scene Reconstruction using multi-level importance sampling and multi-view consistency

链接: https://arxiv.org/abs/2410.07597
作者: Xinghui Li,Yuchen Ji,Xiansong Lai,Wanting Zhang
关键词-EN: impressive performance, indoor scenarios, simplicity and impressive, Recently, popular due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 3 figures, International Conference on Image Processing

点击查看摘要

Abstract:Recently, neural implicit 3D reconstruction in indoor scenarios has become popular due to its simplicity and impressive performance. Previous works could produce complete results leveraging monocular priors of normal or depth. However, they may suffer from over-smoothed reconstructions and long-time optimization due to unbiased sampling and inaccurate monocular priors. In this paper, we propose a novel neural implicit surface reconstruction method, named FD-NeuS, to learn fine-detailed 3D models using multi-level importance sampling strategy and multi-view consistency methodology. Specifically, we leverage segmentation priors to guide region-based ray sampling, and use piecewise exponential functions as weights to pilot 3D points sampling along the rays, ensuring more attention on important regions. In addition, we introduce multi-view feature consistency and multi-view normal consistency as supervision and uncertainty respectively, which further improve the reconstruction of details. Extensive quantitative and qualitative results show that FD-NeuS outperforms existing methods in various scenes.

[CV-98] A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks NEURIPS2024

链接: https://arxiv.org/abs/2410.07593
作者: Hoin Jung,Taeuk Jang,Xiaoqian Wang
关键词-EN: enabled complex multimodal, Recent advancements, image data simultaneously, complex multimodal tasks, data simultaneously
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024, the Thirty-Eighth Annual Conference on Neural Information Processing Systems

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have enabled complex multimodal tasks by processing text and image data simultaneously, significantly enhancing the field of artificial intelligence. However, these models often exhibit biases that can skew outputs towards societal stereotypes, thus necessitating debiasing strategies. Existing debiasing methods focus narrowly on specific modalities or tasks, and require extensive retraining. To address these limitations, this paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low confidence imputation (LCI) to effectively reduce biases in VLMs. SFID is versatile, maintaining the semantic integrity of outputs and costly effective by eliminating the need for retraining. Our experimental results demonstrate SFID’s effectiveness across various VLMs tasks including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation, by significantly reducing gender biases without compromising performance. This approach not only enhances the fairness of VLMs applications but also preserves their efficiency and utility across diverse scenarios.

[CV-99] urboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text

链接: https://arxiv.org/abs/2410.07590
作者: Songshuo Lu,Hua Wang,Yutian Rong,Zhi Chen,Yaohua Tang
关键词-EN: Current Retrieval-Augmented Generation, process numerous retrieved, current RAG system, numerous retrieved document, retrieved document chunks
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks for prefill which requires a large volume of computation, therefore leading to significant latency in time-to-first-token (TTFT). To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a novel RAG system that redesigns the inference paradigm of the current RAG system by first pre-computing and storing the key-value (KV) caches of documents offline, and then directly retrieving the saved KV cache for prefill. Hence, online computation of KV caches is eliminated during inference. In addition, we provide a number of insights into the mask matrix and positional embedding mechanisms, plus fine-tune a pretrained language model to maintain model accuracy of TurboRAG. Our approach is applicable to most existing large language models and their applications without any requirement in modification of models and inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to the conventional RAG systems (on an average of 8.6x), but reserving comparable performance to the standard RAG systems.

[CV-100] ddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching ECCV2024

链接: https://arxiv.org/abs/2410.07579
作者: Ruonan Yu,Songhua Liu,Jingwen Ye,Xinchao Wang
关键词-EN: enabling models trained, real data, condensation refers, refers to compressing, generalize effectively
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Dataset distillation or condensation refers to compressing a large-scale dataset into a much smaller one, enabling models trained on this synthetic dataset to generalize effectively on real data. Tackling this challenge, as defined, relies on a bi-level optimization algorithm: a novel model is trained in each iteration within a nested loop, with gradients propagated through an unrolled computation graph. However, this approach incurs high memory and time complexity, posing difficulties in scaling up to large datasets such as ImageNet. Addressing these concerns, this paper introduces Teddy, a Taylor-approximated dataset distillation framework designed to handle large-scale dataset and enhance efficiency. On the one hand, backed up by theoretical analysis, we propose a memory-efficient approximation derived from Taylor expansion, which transforms the original form dependent on multi-step gradients to a first-order one. On the other hand, rather than repeatedly training a novel model in each iteration, we unveil that employing a pre-cached pool of weak models, which can be generated from a single base model, enhances both time efficiency and performance concurrently, particularly when dealing with large-scale datasets. Extensive experiments demonstrate that the proposed Teddy attains state-of-the-art efficiency and performance on the Tiny-ImageNet and original-sized ImageNet-1K dataset, notably surpassing prior methods by up to 12.8%, while reducing 46.6% runtime. Our code will be available at this https URL.

[CV-101] 3D Vision-Language Gaussian Splatting

链接: https://arxiv.org/abs/2410.07577
作者: Qucheng Peng,Benjamin Planche,Zhongpai Gao,Meng Zheng,Anwesa Choudhuri,Terrence Chen,Chen Chen,Ziyan Wu
关键词-EN: Recent advancements, autonomous driving, augmented reality, scene understanding, applications in robotics
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: main paper + supplementary material

点击查看摘要

Abstract:Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a balance between visual and language modalities, which leads to unsatisfying semantic rasterization of translucent or reflective objects, as well as over-fitting on color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision-language Gaussian splatting model for scene understanding, to put emphasis on the representation learning of language modality. We propose a novel cross-modal rasterizer, using modality fusion along with a smoothed semantic indicator for enhancing semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.

[CV-102] How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

链接: https://arxiv.org/abs/2410.07571
作者: Seongyun Lee,Geewook Kim,Jiyeon Kim,Hyunji Lee,Hoyeon Chang,Sue Hyun Park,Minjoon Seo
关键词-EN: transforms Large Language, Large Language Models, Large Vision-Language Models, Large Language, Large Vision-Language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs) into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often compromises the inherent safety capabilities embedded in the original LLMs. Despite potential harmfulness due to weakened safety measures, in-depth analysis on the effects of VL adaptation on safety remains under-explored. This study examines how VL adaptation influences safety and evaluates the impact of safety fine-tuning methods. Our analysis reveals that safety degradation occurs during VL adaptation, even when the training data is safe. While safety tuning techniques like supervised fine-tuning with safety datasets or reinforcement learning from human feedback mitigate some risks, they still lead to safety degradation and a reduction in helpfulness due to over-rejection issues. Further analysis of internal model weights suggests that VL adaptation may impact certain safety-related layers, potentially lowering overall safety levels. Additionally, our findings demonstrate that the objectives of VL adaptation and safety tuning are divergent, which often results in their simultaneous application being suboptimal. To address this, we suggest the weight merging approach as an optimal solution effectively reducing safety degradation while maintaining helpfulness. These insights help guide the development of more reliable and secure LVLMs for real-world applications.

[CV-103] CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

链接: https://arxiv.org/abs/2410.07540
作者: Guankun Wang,Han Xiao,Huxin Gao,Renrui Zhang,Long Bai,Xiaoxiao Yang,Zhen Li,Hongsheng Li,Hongliang Ren
关键词-EN: minimizing recurrence rates, ESD, enables rapid resection, minimizing recurrence, long-term overall survival
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries high risks of complications, necessitating skilled surgeons and precise instruments. Recent advancements in Large Visual-Language Models (LVLMs) offer promising decision support and predictive planning capabilities for robotic systems, which can augment the accuracy of ESD and reduce procedural risks. However, existing datasets for multi-level fine-grained ESD surgical motion understanding are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training LVLMs as the robotic \textbfCo-\textbfPilot of \textbfEndoscopic \textbfSubmucosal \textbfDissection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, from over 35 hours of ESD videos for both robot-assisted and conventional surgeries. CoPESD enables granular analysis of ESD motions, focusing on the complex task of submucosal dissection. Extensive experiments on the LVLMs demonstrate the effectiveness of CoPESD in training LVLMs to predict following surgical robotic motions. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD instruction-following and surgical automation. The dataset is available at \hrefthis https URLthis https URL.

[CV-104] I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow

链接: https://arxiv.org/abs/2410.07536
作者: Ruoyi Du,Dongyang Liu,Le Zhuo,Qin Qi,Hongsheng Li,Zhanyu Ma,Peng Gao
关键词-EN: Rectified Flow Transformers, offer superior training, Rectified Flow, Flow Transformers, offer superior
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Rectified Flow Transformers (RFTs) offer superior training and inference efficiency, making them likely the most viable direction for scaling up diffusion models. However, progress in generation resolution has been relatively slow due to data quality and training costs. Tuning-free resolution extrapolation presents an alternative, but current methods often reduce generative stability, limiting practical application. In this paper, we review existing resolution extrapolation methods and introduce the I-Max framework to maximize the resolution potential of Text-to-Image RFTs. I-Max features: (i) a novel Projected Flow strategy for stable extrapolation and (ii) an advanced inference toolkit for generalizing model knowledge to higher resolutions. Experiments with Lumina-Next-2K and Flux.1-dev demonstrate I-Max’s ability to enhance stability in resolution extrapolation and show that it can bring image detail emergence and artifact correction, confirming the practical value of tuning-free resolution extrapolation.

[CV-105] CountMamba: Exploring Multi-directional Selective State-Space Models for Plant Counting

链接: https://arxiv.org/abs/2410.07528
作者: Hulingxiao He,Yaqi Zhang,Jinglin Xu,Yuxin Peng
关键词-EN: pollination yield estimation, including seed breeding, plant counting tasks, stage of agriculture, seed breeding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by PRCV 2024

点击查看摘要

Abstract:Plant counting is essential in every stage of agriculture, including seed breeding, germination, cultivation, fertilization, pollination yield estimation, and harvesting. Inspired by the fact that humans count objects in high-resolution images by sequential scanning, we explore the potential of handling plant counting tasks via state space models (SSMs) for generating counting results. In this paper, we propose a new counting approach named CountMamba that constructs multiple counting experts to scan from various directions simultaneously. Specifically, we design a Multi-directional State-Space Group to process the image patch sequences in multiple orders and aim to simulate different counting experts. We also design Global-Local Adaptive Fusion to adaptively aggregate global features extracted from multiple directions and local features extracted from the CNN branch in a sample-wise manner. Extensive experiments demonstrate that the proposed CountMamba performs competitively on various plant counting tasks, including maize tassels, wheat ears, and sorghum head counting.

[CV-106] O1O: Grouping of Known Classes to Identify Unknown Objects as Odd-One-Out ACCV2024

链接: https://arxiv.org/abs/2410.07514
作者: Mısra Yavuz,Fatma Güney
关键词-EN: detection methods trained, methods trained, fixed set, objects, classes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ACCV 2024 (Oral)

点击查看摘要

Abstract:Object detection methods trained on a fixed set of known classes struggle to detect objects of unknown classes in the open-world setting. Current fixes involve adding approximate supervision with pseudo-labels corresponding to candidate locations of objects, typically obtained in a class-agnostic manner. While previous approaches mainly rely on the appearance of objects, we find that geometric cues improve unknown recall. Although additional supervision from pseudo-labels helps to detect unknown objects, it also introduces confusion for known classes. We observed a notable decline in the model’s performance for detecting known objects in the presence of noisy pseudo-labels. Drawing inspiration from studies on human cognition, we propose to group known classes into superclasses. By identifying similarities between classes within a superclass, we can identify unknown classes through an odd-one-out scoring mechanism. Our experiments on open-world detection benchmarks demonstrate significant improvements in unknown recall, consistently across all tasks. Crucially, we achieve this without compromising known performance, thanks to better partitioning of the feature space with superclasses.

[CV-107] Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

链接: https://arxiv.org/abs/2410.07500
作者: Zhizheng Liu,Joe Lin,Wayne Wu,Bolei Zhou
关键词-EN: Understanding and modeling, pedestrian movements, modeling pedestrian movements, pedestrian, real world
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Understanding and modeling pedestrian movements in the real world is crucial for applications like motion forecasting and scene simulation. Many factors influence pedestrian movements, such as scene context, individual characteristics, and goals, which are often ignored by the existing human generation methods. Web videos contain natural pedestrian behavior and rich motion context, but annotating them with pre-trained predictors leads to noisy labels. In this work, we propose learning diverse pedestrian movements from web videos. We first curate a large-scale dataset called CityWalkers that captures diverse real-world pedestrian movements in urban scenes. Then, based on CityWalkers, we propose a generative model called PedGen for diverse pedestrian movement generation. PedGen introduces automatic label filtering to remove the low-quality labels and a mask embedding to train with partial labels. It also contains a novel context encoder that lifts the 2D scene context to 3D and can incorporate various context factors in generating realistic pedestrian movements in urban scenes. Experiments show that PedGen outperforms existing baseline methods for pedestrian movement generation by learning from noisy labels and incorporating the context factors. In addition, PedGen achieves zero-shot generalization in both real-world and simulated environments. The code, model, and data will be made publicly available at this https URL .

[CV-108] Dense Optimizer : An Information Entropy-Guided Structural Search Method for Dense-like Neural Network Design

链接: https://arxiv.org/abs/2410.07499
作者: Liu Tianyuan,Hou Libin,Wang Linyuan,Song Xiyu,Yan Bin
关键词-EN: Dense Convolutional Network, Dense Optimizer, Dense Convolutional, efficient structure, Convolutional Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages,3 figures

点击查看摘要

Abstract:Dense Convolutional Network has been continuously refined to adopt a highly efficient and compact architecture, owing to its lightweight and efficient structure. However, the current Dense-like architectures are mainly designed manually, it becomes increasingly difficult to adjust the channels and reuse level based on past experience. As such, we propose an architecture search method called Dense Optimizer that can search high-performance dense-like network automatically. In Dense Optimizer, we view the dense network as a hierarchical information system, maximize the network’s information entropy while constraining the distribution of the entropy across each stage via a power law, thereby constructing an optimization problem. We also propose a branch-and-bound optimization algorithm, tightly integrates power-law principle with search space scaling to solve the optimization problem efficiently. The superiority of Dense Optimizer has been validated on different computer vision benchmark datasets. Specifically, Dense Optimizer completes high-quality search but only costs 4 hours with one CPU. Our searched model DenseNet-OPT achieved a top 1 accuracy of 84.3% on CIFAR-100, which is 5.97% higher than the original one.

[CV-109] Progressive Multi-Modal Fusion for Robust 3D Object Detection

链接: https://arxiv.org/abs/2410.07475
作者: Rohit Mohan,Daniele Cattaneo,Florian Drews,Abhinav Valada
关键词-EN: Bird Eye View, Multi-sensor fusion, crucial for accurate, autonomous driving, cameras and LiDAR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities either in Bird’s Eye View (BEV) or Perspective View (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised mask modeling pre-training strategy to improve multi-modal representation learning and data efficiency through three novel objectives. Extensive experiments on nuScenes and Argoverse2 datasets conclusively demonstrate the efficacy of ProFusion3D. Moreover, ProFusion3D is robust to sensor failure, demonstrating strong performance when only one modality is available.

[CV-110] Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation ACCV2024

链接: https://arxiv.org/abs/2410.07463
作者: Susan Liang,Chao Huang,Yapeng Tian,Anurag Kumar,Chenliang Xu
关键词-EN: task called language-guided, called language-guided joint, editing, called language-guided, audio-visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV 2024

点击查看摘要

Abstract:In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add new sounds contextualized to the visual content. To address this task, we propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas. Firstly, we propose a one-shot adaptation approach to tailor generative diffusion models for audio-visual content editing. With as few as one audio-visual sample, we jointly transfer the audio and vision diffusion models to the target domain. After fine-tuning, our model enables consistent generation of this audio-visual sample. Secondly, we introduce a cross-modal semantic enhancement approach. We observe that when using language as content editing guidance, the vision branch may overlook editing requirements. This phenomenon, termed catastrophic neglect, hampers audio-visual alignment during content editing. We therefore enhance semantic consistency between language and vision to mitigate this issue. Extensive experiments validate the effectiveness of our method in language-based audio-visual editing and highlight its superiority over several baseline approaches. We recommend that readers visit our project page for more details: this https URL.

[CV-111] Generalizing Segmentation Foundation Model Under Sim-to-real Domain-shift for Guidewire Segmentation in X-ray Fluoroscopy

链接: https://arxiv.org/abs/2410.07460
作者: Yuxuan Wen,Evgenia Roussinova,Olivier Brina,Paolo Machi,Mohamed Bouri
关键词-EN: enhance procedural accuracy, complex vascular pathways, endovascular interventions holds, significantly enhance procedural, providing critical feedback
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Guidewire segmentation during endovascular interventions holds the potential to significantly enhance procedural accuracy, improving visualization and providing critical feedback that can support both physicians and robotic systems in navigating complex vascular pathways. Unlike supervised segmentation networks, which need many expensive expert-annotated labels, sim-to-real domain adaptation approaches utilize synthetic data from simulations, offering a cost-effective solution. The success of models like Segment-Anything (SAM) has driven advancements in image segmentation foundation models with strong zero/few-shot generalization through prompt engineering. However, they struggle with medical images like X-ray fluoroscopy and the domain-shifts of the data. Given the challenges of acquiring annotation and the accessibility of labeled simulation data, we propose a sim-to-real domain adaption framework with a coarse-to-fine strategy to adapt SAM to X-ray fluoroscopy guidewire segmentation without any annotation on the target domain. We first generate the pseudo-labels by utilizing a simple source image style transfer technique that preserves the guidewire structure. Then, we develop a weakly supervised self-training architecture to fine-tune an end-to-end student SAM with the coarse labels by imposing consistency regularization and supervision from the teacher SAM network. We validate the effectiveness of the proposed method on a publicly available Cardiac dataset and an in-house Neurovascular dataset, where our method surpasses both pre-trained SAM and many state-of-the-art domain adaptation techniques by a large margin. Our code will be made public on GitHub soon.

[CV-112] nyLidarNet: 2D LiDAR-based End-to-End Deep Learning Model for F1TENTH Autonomous Racing

链接: https://arxiv.org/abs/2410.07447
作者: Mohammed Misbah Zarrar,Qitao Weng,Bakhbyergyen Yerjan,Ahmet Soyyigit,Heechul Yun
关键词-EN: raw sensory data, Prior research, sensory data, research has demonstrated, demonstrated the effectiveness
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prior research has demonstrated the effectiveness of end-to-end deep learning for robotic navigation, where the control signals are directly derived from raw sensory data. However, the majority of existing end-to-end navigation solutions are predominantly camera-based. In this paper, we introduce TinyLidarNet, a lightweight 2D LiDAR-based end-to-end deep learning model for autonomous racing. An F1TENTH vehicle using TinyLidarNet won 3rd place in the 12th F1TENTH Autonomous Grand Prix competition, demonstrating its competitive performance. We systematically analyze its performance on untrained tracks and computing requirements for real-time processing. We find that TinyLidarNet’s 1D Convolutional Neural Network (CNN) based architecture significantly outperforms widely used Multi-Layer Perceptron (MLP) based architecture. In addition, we show that it can be processed in real-time on low-end micro-controller units (MCUs).

[CV-113] Self-Supervised Learning for Real-World Object Detection: a Survey

链接: https://arxiv.org/abs/2410.07442
作者: Alina Ciocarlan,Sidonie Lefebvre,Sylvie Le Hégarat-Mascle,Arnaud Woiselle
关键词-EN: Masked Image Modeling, Self-Supervised Learning, SSL, object detection, small object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN and ViT-based architectures. Specifically, our benchmark is performed on the widely-used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited for handling uncurated data. Our findings highlight that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.07442 [cs.CV] (or arXiv:2410.07442v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2410.07442 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-114] Zero-Shot Generalization of Vision-Based RL Without Data Augmentation

链接: https://arxiv.org/abs/2410.07441
作者: Sumeet Batra,Gaurav S. Sukhatme
关键词-EN: Generalizing vision-based reinforcement, vision-based reinforcement learning, Generalizing vision-based, reinforcement learning, open challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Generalizing vision-based reinforcement learning (RL) agents to novel environments remains a difficult and open challenge. Current trends are to collect large-scale datasets or use data augmentation techniques to prevent overfitting and improve downstream generalization. However, the computational and data collection costs increase exponentially with the number of task variations and can destabilize the already difficult task of training RL agents. In this work, we take inspiration from recent advances in computational neuroscience and propose a model, Associative Latent DisentAnglement (ALDA), that builds on standard off-policy RL towards zero-shot generalization. Specifically, we revisit the role of latent disentanglement in RL and show how combining it with a model of associative memory achieves zero-shot generalization on difficult task variations without relying on data augmentation. Finally, we formally show that data augmentation techniques are a form of weak disentanglement and discuss the implications of this insight.

[CV-115] Robust infrared small target detection using self-supervised and a contrario paradigms

链接: https://arxiv.org/abs/2410.07437
作者: Alina Ciocarlan,Sylvie Le Hégarat-Mascle,Sidonie Lefebvre,Arnaud Woiselle
关键词-EN: Detecting small targets, defense applications due, Detecting small, Infrared Small Target, infrared images poses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Detecting small targets in infrared images poses significant challenges in defense applications due to the presence of complex backgrounds and the small size of the targets. Traditional object detection methods often struggle to balance high detection rates with low false alarm rates, especially when dealing with small objects. In this paper, we introduce a novel approach that combines a contrario paradigm with Self-Supervised Learning (SSL) to improve Infrared Small Target Detection (IRSTD). On the one hand, the integration of an a contrario criterion into a YOLO detection head enhances feature map responses for small and unexpected objects while effectively controlling false alarms. On the other hand, we explore SSL techniques to overcome the challenges of limited annotated data, common in IRSTD tasks. Specifically, we benchmark several representative SSL strategies for their effectiveness in improving small object detection performance. Our findings show that instance discrimination methods outperform masked image modeling strategies when applied to YOLO-based small object detection. Moreover, the combination of the a contrario and SSL paradigms leads to significant performance improvements, narrowing the gap with state-of-the-art segmentation methods and even outperforming them in frugal settings. This two-pronged approach offers a robust solution for improving IRSTD performance, particularly under challenging conditions.

[CV-116] Surgical Depth Anything: Depth Estimation for Surgical Scenes using Foundation Models

链接: https://arxiv.org/abs/2410.07434
作者: Ange Lou,Yamin Li,Yike Zhang,Jack Noble
关键词-EN: Monocular depth estimation, Monocular depth, reconstruction algorithms, crucial for tracking, tracking and reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Monocular depth estimation is crucial for tracking and reconstruction algorithms, particularly in the context of surgical videos. However, the inherent challenges in directly obtaining ground truth depth maps during surgery render supervised learning approaches impractical. While many self-supervised methods based on Structure from Motion (SfM) have shown promising results, they rely heavily on high-quality camera motion and require optimization on a per-patient basis. These limitations can be mitigated by leveraging the current state-of-the-art foundational model for depth estimation, Depth Anything. However, when directly applied to surgical scenes, Depth Anything struggles with issues such as blurring, bleeding, and reflections, resulting in suboptimal performance. This paper presents a fine-tuning of the Depth Anything model specifically for the surgical domain, aiming to deliver more accurate pixel-wise depth maps tailored to the unique requirements and challenges of surgical environments. Our fine-tuning approach significantly improves the model’s performance in surgical scenes, reducing errors related to blurring and reflections, and achieving a more reliable and precise depth estimation.

[CV-117] Segmenting objects with Bayesian fusion of active contour models and convnet priors

链接: https://arxiv.org/abs/2410.07421
作者: Przemyslaw Polewski,Jacquelyn Shelton,Wei Yao,Marco Heurich
关键词-EN: great practical significance, core computer vision, computer vision task, Convolutional Neural Network, Deep Shape Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Instance segmentation is a core computer vision task with great practical significance. Recent advances, driven by large-scale benchmark datasets, have yielded good general-purpose Convolutional Neural Network (CNN)-based methods. Natural Resource Monitoring (NRM) utilizes remote sensing imagery with generally known scale and containing multiple overlapping instances of the same class, wherein the object contours are jagged and highly irregular. This is in stark contrast with the regular man-made objects found in classic benchmark datasets. We address this problem and propose a novel instance segmentation method geared towards NRM imagery. We formulate the problem as Bayesian maximum a posteriori inference which, in learning the individual object contours, incorporates shape, location, and position priors from state-of-the-art CNN architectures, driving a simultaneous level-set evolution of multiple object contours. We employ loose coupling between the CNNs that supply the priors and the active contour process, allowing a drop-in replacement of new network architectures. Moreover, we introduce a novel prior for contour shape, namely, a class of Deep Shape Models based on architectures from Generative Adversarial Networks (GANs). These Deep Shape Models are in essence a non-linear generalization of the classic Eigenshape formulation. In experiments, we tackle the challenging, real-world problem of segmenting individual dead tree crowns and delineating precise contours. We compare our method to two leading general-purpose instance segmentation methods - Mask R-CNN and K-net - on color infrared aerial imagery. Results show our approach to significantly outperform both methods in terms of reconstruction quality of tree crown contours. Furthermore, use of the GAN-based deep shape model prior yields significant improvement of all results over the vanilla Eigenshape prior.

[CV-118] NeRF-Accelerated Ecological Monitoring in Mixed-Evergreen Redwood Forest

链接: https://arxiv.org/abs/2410.07418
作者: Adam Korycki,Cory Yeaton,Gregory S. Gilbert,Colleen Josephson,Steve McGuire
关键词-EN: critical observational data, observational data needed, critical observational, observational data, data needed
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Forest mapping provides critical observational data needed to understand the dynamics of forest environments. Notably, tree diameter at breast height (DBH) is a metric used to estimate forest biomass and carbon dioxide (CO _2 ) sequestration. Manual methods of forest mapping are labor intensive and time consuming, a bottleneck for large-scale mapping efforts. Automated mapping relies on acquiring dense forest reconstructions, typically in the form of point clouds. Terrestrial laser scanning (TLS) and mobile laser scanning (MLS) generate point clouds using expensive LiDAR sensing, and have been used successfully to estimate tree diameter. Neural radiance fields (NeRFs) are an emergent technology enabling photorealistic, vision-based reconstruction by training a neural network on a sparse set of input views. In this paper, we present a comparison of MLS and NeRF forest reconstructions for the purpose of trunk diameter estimation in a mixed-evergreen Redwood forest. In addition, we propose an improved DBH-estimation method using convex-hull modeling. Using this approach, we achieved 1.68 cm RMSE, which consistently outperformed standard cylinder modeling approaches. Our code contributions and forest datasets are freely available at this https URL.

[CV-119] 3D2M Dataset: A 3-Dimension diverse Mesh Dataset

链接: https://arxiv.org/abs/2410.07415
作者: Sankarshan Dasgupta
关键词-EN: attracting significant attention, area of research, attracting significant, industry alike, prominent area
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 6 pages, 1 figures, 2 tables

点击查看摘要

Abstract:Three-dimensional (3D) reconstruction has emerged as a prominent area of research, attracting significant attention from academia and industry alike. Among the various applications of 3D reconstruction, facial reconstruction poses some of the most formidable challenges. Additionally, each individuals facial structure is unique, requiring algorithms to be robust enough to handle this variability while maintaining fidelity to the original features. This article presents a comprehensive dataset of 3D meshes featuring a diverse range of facial structures and corresponding facial landmarks. The dataset comprises 188 3D facial meshes, including 73 from female candidates and 114 from male candidates. It encompasses a broad representation of ethnic backgrounds, with contributions from 45 different ethnicities, ensuring a rich diversity in facial characteristics. Each facial mesh is accompanied by key points that accurately annotate the relevant features, facilitating precise analysis and manipulation. This dataset is particularly valuable for applications such as facial re targeting, the study of facial structure components, and real-time person representation in video streams. By providing a robust resource for researchers and developers, it aims to advance the field of 3D facial reconstruction and related technologies.

[CV-120] Aligning Motion-Blurred Images Using Contrastive Learning on Overcomplete Pixels

链接: https://arxiv.org/abs/2410.07410
作者: Leonid Pogorelyuk,Stefan T. Radev
关键词-EN: learning overcomplete pixel-level, motion blur, invariant to motion, overcomplete pixel-level features, contrastive objective
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures

点击查看摘要

Abstract:We propose a new contrastive objective for learning overcomplete pixel-level features that are invariant to motion blur. Other invariances (e.g., pose, illumination, or weather) can be learned by applying the corresponding transformations on unlabeled images during self-supervised training. We showcase that a simple U-Net trained with our objective can produce local features useful for aligning the frames of an unseen video captured with a moving camera under realistic and challenging conditions. Using a carefully designed toy example, we also show that the overcomplete pixels can encode the identity of objects in an image and the pixel coordinates relative to these objects.

[CV-121] Exploring Efficient Foundational Multi-modal Models for Video Summarization

链接: https://arxiv.org/abs/2410.07405
作者: Karan Samel,Apoorva Beedu,Nitish Sontakke,Irfan Essa
关键词-EN: generate text outputs, models, model, language model, Foundational models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Foundational models are able to generate text outputs given prompt instructions and text, audio, or image inputs. Recently these models have been combined to perform tasks on video, such as video summarization. Such video foundation models perform pre-training by aligning outputs from each modality-specific model into the same embedding space. Then the embeddings from each model are used within a language model, which is fine-tuned on a desired instruction set. Aligning each modality during pre-training is computationally expensive and prevents rapid testing of different base modality models. During fine-tuning, evaluation is carried out within in-domain videos where it is hard to understand the generalizability and data efficiency of these methods. To alleviate these issues we propose a plug-and-play video language model. It directly uses the texts generated from each input modality into the language model, avoiding pre-training alignment overhead. Instead of fine-tuning we leverage few-shot instruction adaptation strategies. We compare the performance versus the computational costs for our plug-and-play style method and baseline tuning methods. Finally, we explore the generalizability of each method during domain shift and present insights on what data is useful when training data is limited. Through this analysis, we present practical insights on how to leverage multi-modal foundational models for effective results given realistic compute and data limitations.

[CV-122] Enhancing Soccer Camera Calibration Through Keypoint Exploitation

链接: https://arxiv.org/abs/2410.07401
作者: Nikolay S. Falaleev,Ruilong Chen
关键词-EN: enabling precise scene, precise scene geometry, scene geometry interpretation, supporting sports analytics, sports analytics tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7th ACM International Workshop on Multimedia Content Analysis in Sports

点击查看摘要

Abstract:Accurate camera calibration is essential for transforming 2D images from camera sensors into 3D world coordinates, enabling precise scene geometry interpretation and supporting sports analytics tasks such as player tracking, offside detection, and performance analysis. However, obtaining a sufficient number of high-quality point pairs remains a significant challenge for both traditional and deep learning-based calibration methods. This paper introduces a multi-stage pipeline that addresses this challenge by leveraging the structural features of the football pitch. Our approach significantly increases the number of usable points for calibration by exploiting line-line and line-conic intersections, points on the conics, and other geometric features. To mitigate the impact of imperfect annotations, we employ data fitting techniques. Our pipeline utilizes deep learning for keypoint and line detection and incorporates geometric constraints based on real-world pitch dimensions. A voter algorithm iteratively selects the most reliable keypoints, further enhancing calibration accuracy. We evaluated our approach on the largest football broadcast camera calibration dataset available, and secured the top position in the SoccerNet Camera Calibration Challenge 2023 [arXiv:2309.06006], which demonstrates the effectiveness of our method in real-world scenarios. The project code is available at this https URL .

[CV-123] Structured Spatial Reasoning with Open Vocabulary Object Detectors

链接: https://arxiv.org/abs/2410.07394
作者: Negar Nejatishahidin,Madhukar Reddy Vongala,Jana Kosecka
关键词-EN: Language Models, Active Vision Dataset, spatial reasoning tasks, object rearrangement, object search
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reasoning about spatial relationships between objects is essential for many real-world robotic tasks, such as fetch-and-delivery, object rearrangement, and object search. The ability to detect and disambiguate different objects and identify their location is key to successful completion of these tasks. Several recent works have used powerful Vision and Language Models (VLMs) to unlock this capability in robotic agents. In this paper we introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors to enhance spatial reasoning for robotic perception. The approach is evaluated and compared against zero-shot performance of the state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks. To enable this comparison, we annotate spatial clauses in real-world RGB-D Active Vision Dataset [1] and conduct experiments on this and the synthetic Semantic Abstraction [2] dataset. Results demonstrate the effectiveness of the proposed method, showing superior performance of grounding spatial relations over state of the art open-source VLMs by more than 20%.

[CV-124] En masse scanning and automated surfacing of small objects using Micro-CT

链接: https://arxiv.org/abs/2410.07385
作者: Riley C. W. O’Neill,Katrina Yezzi Woodley,Jeff Calder,Peter J. Olver
关键词-EN: computationally intensive analyses, Modern archaeological methods, high resolution scanning, Modern archaeological, large datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 36 pages, 12 figures, 2 tables. Source code available at this https URL

点击查看摘要

Abstract:Modern archaeological methods increasingly utilize 3D virtual representations of objects, computationally intensive analyses, high resolution scanning, large datasets, and machine learning. With higher resolution scans, challenges surrounding computational power, memory, and file storage quickly arise. Processing and analyzing high resolution scans often requires memory-intensive workflows, which are infeasible for most computers and increasingly necessitate the use of super-computers or innovative methods for processing on standard computers. Here we introduce a novel protocol for en-masse micro-CT scanning of small objects with a \em mostly-automated processing workflow that functions in memory-limited settings. We scanned 1,112 animal bone fragments using just 10 micro-CT scans, which were post-processed into individual PLY files. Notably, our methods can be applied to any object (with discernible density from the packaging material) making this method applicable to a variety of inquiries and fields including paleontology, geology, electrical engineering, and materials science. Further, our methods may immediately be adopted by scanning institutes to pool customer orders together and offer more affordable scanning. The work presented herein is part of a larger program facilitated by the international and multi-disciplinary research consortium known as Anthropological and Mathematical Analysis of Archaeological and Zooarchaeological Evidence (AMAAZE). AMAAZE unites experts in anthropology, mathematics, and computer science to develop new methods for mass-scale virtual archaeological research. Overall, our new scanning method and processing workflows lay the groundwork and set the standard for future mass-scale, high resolution scanning studies.

[CV-125] Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

链接: https://arxiv.org/abs/2410.07336
作者: Sara Sarto,Nicholas Moratelli,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
关键词-EN: significant advancements, fail to capture, capture the full, fine-grained details, existing evaluation metrics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: this https URL.

[CV-126] Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

链接: https://arxiv.org/abs/2410.07303
作者: Fu-Yun Wang,Ling Yang,Zhaoyang Huang,Mengdi Wang,Hongsheng Li
关键词-EN: solving generative ODEs, computationally intensive nature, improved visual generation, slow generation speed, generation speed due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have greatly improved visual generation but are hindered by slow generation speed due to the computationally intensive nature of solving generative ODEs. Rectified flow, a widely recognized solution, improves generation speed by straightening the ODE path. Its key components include: 1) using the diffusion form of flow-matching, 2) employing \boldsymbol v -prediction, and 3) performing rectification (a.k.a. reflow). In this paper, we argue that the success of rectification primarily lies in using a pretrained diffusion model to obtain matched pairs of noise and samples, followed by retraining with these matched noise-sample pairs. Based on this, components 1) and 2) are unnecessary. Furthermore, we highlight that straightness is not an essential training target for rectification; rather, it is a specific case of flow-matching models. The more critical training target is to achieve a first-order approximate ODE path, which is inherently curved for models like DDPM and Sub-VP. Building on this insight, we propose Rectified Diffusion, which generalizes the design space and application scope of rectification to encompass the broader category of diffusion models, rather than being restricted to flow-matching models. We validate our method on Stable Diffusion v1-5 and Stable Diffusion XL. Our method not only greatly simplifies the training procedure of rectified flow-based previous works (e.g., InstaFlow) but also achieves superior performance with even lower training cost. Our code is available at this https URL.

[CV-127] owards Generalisable Time Series Understanding Across Domains

链接: https://arxiv.org/abs/2410.07299
作者: Özgün Turgut,Philip Müller,Martin J. Menten,Daniel Rueckert
关键词-EN: datasets unlocks foundational, natural language processing, large datasets unlocks, time series, unlocks foundational model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In natural language processing and computer vision, self-supervised pre-training on large datasets unlocks foundational model capabilities across domains and tasks. However, this potential has not yet been realised in time series analysis, where existing methods disregard the heterogeneous nature of time series characteristics. Time series are prevalent in many domains, including medicine, engineering, natural sciences, and finance, but their characteristics vary significantly in terms of variate count, inter-variate relationships, temporal dynamics, and sampling frequency. This inherent heterogeneity across domains prevents effective pre-training on large time series corpora. To address this issue, we introduce OTiS, an open model for general time series analysis, that has been specifically designed to handle multi-domain heterogeneity. We propose a novel pre-training paradigm including a tokeniser with learnable domain-specific signatures, a dual masking strategy to capture temporal causality, and a normalised cross-correlation loss to model long-range dependencies. Our model is pre-trained on a large corpus of 640,187 samples and 11 billion time points spanning 8 distinct domains, enabling it to analyse time series from any (unseen) domain. In comprehensive experiments across 15 diverse applications - including classification, regression, and forecasting - OTiS showcases its ability to accurately capture domain-specific data characteristics and demonstrates its competitiveness against state-of-the-art baselines. Our code and pre-trained weights are publicly available at this https URL.

[CV-128] Enhancing Performance of Point Cloud Completion Networks with Consistency Loss

链接: https://arxiv.org/abs/2410.07298
作者: Kevin Tirta Wijaya,Christofel Rio Goenawan,Seung-Hyun Kong
关键词-EN: Point cloud completion, proposed consistency loss, Point cloud, consistency loss, point completion network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: First version of Paper “Enhancing Performance of Point Cloud Completion Networks with Consistency Loss” by Kevin Tirta Wijaya and Christofel Rio Goenawan. In process submission to Neurocomputing Journal 2024

点击查看摘要

Abstract:Point cloud completion networks are conventionally trained to minimize the disparities between the completed point cloud and the ground-truth counterpart. However, an incomplete object-level point cloud can have multiple valid completion solutions when it is examined in isolation. This one-to-many mapping issue can cause contradictory supervision signals to the network because the loss function may produce different values for identical input-output pairs of the network. In many cases, this issue could adversely affect the network optimization process. In this work, we propose to enhance the conventional learning objective using a novel completion consistency loss to mitigate the one-to-many mapping problem. Specifically, the proposed consistency loss ensure that a point cloud completion network generates a coherent completion solution for incomplete objects originating from the same source point cloud. Experimental results across multiple well-established datasets and benchmarks demonstrated the proposed completion consistency loss have excellent capability to enhance the completion performance of various existing networks without any modification to the design of the networks. The proposed consistency loss enhances the performance of the point completion network without affecting the inference speed, thereby increasing the accuracy of point cloud completion. Notably, a state-of-the-art point completion network trained with the proposed consistency loss can achieve state-of-the-art accuracy on the challenging new MVP dataset. The code and result of experiment various point completion models using proposed consistency loss will be available at: this https URL .

[CV-129] ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model WACV2025

链接: https://arxiv.org/abs/2410.07296
作者: Gaoge Han,Mingjiang Liang,Jinglei Tang,Yongkang Cheng,Wei Liu,Shaoli Huang
关键词-EN: Generating human motion, Generating human, challenging task, motion diffusion model, diffusion model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV 2025 in Round 1

点击查看摘要

Abstract:Generating human motion from textual descriptions is a challenging task. Existing methods either struggle with physical credibility or are limited by the complexities of physics simulations. In this paper, we present \emphReinDiffuse that combines reinforcement learning with motion diffusion model to generate physically credible human motions that align with textual descriptions. Our method adapts Motion Diffusion Model to output a parameterized distribution of actions, making them compatible with reinforcement learning paradigms. We employ reinforcement learning with the objective of maximizing physically plausible rewards to optimize motion generation for physical fidelity. Our approach outperforms existing state-of-the-art models on two major datasets, HumanML3D and KIT-ML, achieving significant improvements in physical plausibility and motion quality. Project: \urlthis https URL

[CV-130] Retrieval Replace Reduction: An effective visual token reduction method via semantic match

链接: https://arxiv.org/abs/2410.07278
作者: Yingen Liu,Fan Wu,Ruihui Li,Zhuo Tang,Kenli Li
关键词-EN: large language models, Multimodal large language, demonstrated strong performance, language models, training from scratch
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2 figures,3 tables

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have demonstrated strong performance across various tasks without requiring training from scratch. However, they face significant computational and memory constraints, particularly when processing multimodal inputs that exceed context length, limiting their scalability. In this paper, we introduce a new approach, \textbfTRSM (\textbfToken \textbfReduction via \textbfSemantic \textbfMatch), which effectively reduces the number of visual tokens without compromising MLLM performance. Inspired by how humans process multimodal tasks, TRSM leverages semantic information from one modality to match relevant semantics in another, reducing the number of visual this http URL, to retain task relevant visual tokens, we use the text prompt as a query vector to retrieve the most similar vectors from the visual prompt and merge them with the text tokens. Based on experimental results, when applied to LLaVA-1.5\citeliu2023, our approach compresses the visual tokens by 20%, achieving comparable performance across diverse visual question-answering and reasoning tasks.

[CV-131] Mitigation of gender bias in automatic facial non-verbal behaviors generation

链接: https://arxiv.org/abs/2410.07274
作者: Alice Delbosc(TALEP, LIS, AMU),Magalie Ochs(LIS, AMU, R2I),Nicolas Sabouret(CPU, LISN),Brian Ravenet(CPU, LISN),Stephane Ayache(AMU, LIS, QARMA)
关键词-EN: interactive agents focuses, social interactive agents, social interactive, believability and synchronization, Research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Research on non-verbal behavior generation for social interactive agents focuses mainly on the believability and synchronization of non-verbal cues with speech. However, existing models, predominantly based on deep learning architectures, often perpetuate biases inherent in the training data. This raises ethical concerns, depending on the intended application of these agents. This paper addresses these issues by first examining the influence of gender on facial non-verbal behaviors. We concentrate on gaze, head movements, and facial expressions. We introduce a classifier capable of discerning the gender of a speaker from their non-verbal cues. This classifier achieves high accuracy on both real behavior data, extracted using state-of-the-art tools, and synthetic data, generated from a model developed in previous this http URL upon this work, we present a new model, FairGenderGen, which integrates a gender discriminator and a gradient reversal layer into our previous behavior generation model. This new model generates facial non-verbal behaviors from speech features, mitigating gender sensitivity in the generated behaviors. Our experiments demonstrate that the classifier, developed in the initial phase, is no longer effective in distinguishing the gender of the speaker from the generated non-verbal behaviors.

[CV-132] BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models NEURIPS

链接: https://arxiv.org/abs/2410.07273
作者: Fangyikang Wang,Hubery Yin,Yuejiang Dong,Huminhao Zhu,Chao Zhang,Hanbin Zhao,Hui Qian,Chen Li
关键词-EN: exact inversion samplers, exact inversion, diffusion model sampling, heuristic exact inversion, inversion samplers
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted paper by NeurIPS

点击查看摘要

Abstract:The inversion of diffusion model sampling, which aims to find the corresponding initial noise of a sample, plays a critical role in various tasks. Recently, several heuristic exact inversion samplers have been proposed to address the inexact inversion issue in a training-free manner. However, the theoretical properties of these heuristic samplers remain unknown and they often exhibit mediocre sampling quality. In this paper, we introduce a generic formulation, \emphBidirectional Explicit Linear Multi-step (BELM) samplers, of the exact inversion samplers, which includes all previously proposed heuristic exact inversion samplers as special cases. The BELM formulation is derived from the variable-stepsize-variable-formula linear multi-step method via integrating a bidirectional explicit constraint. We highlight this bidirectional explicit constraint is the key of mathematically exact inversion. We systematically investigate the Local Truncation Error (LTE) within the BELM framework and show that the existing heuristic designs of exact inversion samplers yield sub-optimal LTE. Consequently, we propose the Optimal BELM (O-BELM) sampler through the LTE minimization approach. We conduct additional analysis to substantiate the theoretical stability and global convergence property of the proposed optimal sampler. Comprehensive experiments demonstrate our O-BELM sampler establishes the exact inversion property while achieving high-quality sampling. Additional experiments in image editing and image interpolation highlight the extensive potential of applying O-BELM in varying applications.

[CV-133] Learning Content-Aware Multi-Modal Joint Input Pruning via Birds-Eye-View Representation

链接: https://arxiv.org/abs/2410.07268
作者: Yuxin Li,Yiheng Li,Xulei Yang,Mengying Yu,Zihang Huang,Xiaojun Wu,Chai Kiat Yeo
关键词-EN: substantial academic attention, recently garnered substantial, garnered substantial academic, autonomous driving, representation has recently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the landscape of autonomous driving, Bird’s-Eye-View (BEV) representation has recently garnered substantial academic attention, serving as a transformative framework for the fusion of multi-modal sensor inputs. This BEV paradigm effectively shifts the sensor fusion challenge from a rule-based methodology to a data-centric approach, thereby facilitating more nuanced feature extraction from an array of heterogeneous sensors. Notwithstanding its evident merits, the computational overhead associated with BEV-based techniques often mandates high-capacity hardware infrastructures, thus posing challenges for practical, real-world implementations. To mitigate this limitation, we introduce a novel content-aware multi-modal joint input pruning technique. Our method leverages BEV as a shared anchor to algorithmically identify and eliminate non-essential sensor regions prior to their introduction into the perception model’s backbone. We validatethe efficacy of our approach through extensive experiments on the NuScenes dataset, demonstrating substantial computational efficiency without sacrificing perception accuracy. To the best of our knowledge, this work represents the first attempt to alleviate the computational burden from the input pruning point.

[CV-134] Spiking GS: Towards High-Accuracy and Low-Cost Surface Reconstruction via Spiking Neuron-based Gaussian Splatting

链接: https://arxiv.org/abs/2410.07266
作者: Weixing Zhang,Zongrui Li,De Ma,Huajin Tang,Xudong Jiang,Qian Zheng,Gang Pan
关键词-EN: Gaussian Splatting, scenes in minutes, Gaussian Splatting pipeline, capable of reconstructing, Gaussians
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting is capable of reconstructing 3D scenes in minutes. Despite recent advances in improving surface reconstruction accuracy, the reconstructed results still exhibit bias and suffer from inefficiency in storage and training. This paper provides a different observation on the cause of the inefficiency and the reconstruction bias, which is attributed to the integration of the low-opacity parts (LOPs) of the generated Gaussians. We show that LOPs consist of Gaussians with overall low-opacity (LOGs) and the low-opacity tails (LOTs) of Gaussians. We propose Spiking GS to reduce such two types of LOPs by integrating spiking neurons into the Gaussian Splatting pipeline. Specifically, we introduce global and local full-precision integrate-and-fire spiking neurons to the opacity and representation function of flattened 3D Gaussians, respectively. Furthermore, we enhance the density control strategy with spiking neurons’ thresholds and an new criterion on the scale of Gaussians. Our method can represent more accurate reconstructed surfaces at a lower cost. The code is available at \urlthis https URL.

[CV-135] Neural Contrast: Leveraging Generative Editing for Graphic Design Recommendations PRICAI2024

链接: https://arxiv.org/abs/2410.07211
作者: Marian Lupascu,Ionut Mironica,Mihai-Sorin Stupariu
关键词-EN: Creating visually appealing, visually appealing composites, appealing composites requires, composites requires optimizing, Creating visually
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, Paper sent and accepted as a poster at PRICAI 2024

点击查看摘要

Abstract:Creating visually appealing composites requires optimizing both text and background for compatibility. Previous methods have focused on simple design strategies, such as changing text color or adding background shapes for contrast. These approaches are often destructive, altering text color or partially obstructing the background image. Another method involves placing design elements in non-salient and contrasting regions, but this isn’t always effective, especially with patterned backgrounds. To address these challenges, we propose a generative approach using a diffusion model. This method ensures the altered regions beneath design assets exhibit low saliency while enhancing contrast, thereby improving the visibility of the design asset.

[CV-136] SpaRG: Sparsely Reconstructed Graphs for Generalizable fMRI Analysis

链接: https://arxiv.org/abs/2410.07201
作者: Camila González,Yanis Miraoui,Yiran Fan,Ehsan Adeli,Kilian M. Pohl
关键词-EN: Magnetic Resonance Imaging, functional Magnetic Resonance, resting-state functional Magnetic, Resonance Imaging, Magnetic Resonance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning can help uncover patterns in resting-state functional Magnetic Resonance Imaging (rs-fMRI) associated with psychiatric disorders and personal traits. Yet the problem of interpreting deep learning findings is rarely more evident than in fMRI analyses, as the data is sensitive to scanning effects and inherently difficult to visualize. We propose a simple approach to mitigate these challenges grounded on sparsification and self-supervision. Instead of extracting post-hoc feature attributions to uncover functional connections that are important to the target task, we identify a small subset of highly informative connections during training and occlude the rest. To this end, we jointly train a (1) sparse input mask, (2) variational autoencoder (VAE), and (3) downstream classifier in an end-to-end fashion. While we need a portion of labeled samples to train the classifier, we optimize the sparse mask and VAE with unlabeled data from additional acquisition sites, retaining only the input features that generalize well. We evaluate our method - Sparsely Reconstructed Graphs (SpaRG) - on the public ABIDE dataset for the task of sex classification, training with labeled cases from 18 sites and adapting the model to two additional out-of-distribution sites with a portion of unlabeled samples. For a relatively coarse parcellation (64 regions), SpaRG utilizes only 1% of the original connections while improving the classification accuracy across domains. Our code can be found at this http URL.

[CV-137] chnical Report: Competition Solution For Modelscope-Sora

链接: https://arxiv.org/abs/2410.07194
作者: Shengfu Chen,Hailong Liu,Wenzhao Wei
关键词-EN: presents the approach, approach adopted, focuses on fine-tuning, video generation models, Modelscope-Sora challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This report presents the approach adopted in the Modelscope-Sora challenge, which focuses on fine-tuning data for video generation models. The challenge evaluates participants’ ability to analyze, clean, and generate high-quality datasets for video-based text-to-video tasks under specific computational constraints. The provided methodology involves data processing techniques such as video description generation, filtering, and acceleration. This report outlines the procedures and tools utilized to enhance the quality of training data, ensuring improved performance in text-to-video generation models.

[CV-138] Margin-bounded Confidence Scores for Out-of-Distribution Detection

链接: https://arxiv.org/abs/2410.07185
作者: Lakpa D. Tamang,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal
关键词-EN: Machine Learning applications, critical Machine Learning, accurately classifying in-distribution, critical Machine, medical image diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures, IEEE Conference in Data Mining 2024

点击查看摘要

Abstract:In many critical Machine Learning applications, such as autonomous driving and medical image diagnosis, the detection of out-of-distribution (OOD) samples is as crucial as accurately classifying in-distribution (ID) inputs. Recently Outlier Exposure (OE) based methods have shown promising results in detecting OOD inputs via model fine-tuning with auxiliary outlier data. However, most of the previous OE-based approaches emphasize more on synthesizing extra outlier samples or introducing regularization to diversify OOD sample space, which is rather unquantifiable in practice. In this work, we propose a novel and straightforward method called Margin bounded Confidence Scores (MaCS) to address the nontrivial OOD detection problem by enlarging the disparity between ID and OOD scores, which in turn makes the decision boundary more compact facilitating effective segregation with a simple threshold. Specifically, we augment the learning objective of an OE regularized classifier with a supplementary constraint, which penalizes high confidence scores for OOD inputs compared to that of ID and significantly enhances the OOD detection performance while maintaining the ID classification accuracy. Extensive experiments on various benchmark datasets for image classification tasks demonstrate the effectiveness of the proposed method by significantly outperforming state-of-the-art (S.O.T.A) methods on various benchmarking metrics. The code is publicly available at this https URL

[CV-139] Does Spatial Cognition Emerge in Frontier Models?

链接: https://arxiv.org/abs/2410.06468
作者: Santhosh Kumar Ramakrishnan,Erik Wijmans,Philipp Kraehenbuehl,Vladlen Koltun
关键词-EN: present SPACE, Abstract, models, benchmark, spatial
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.

[CV-140] ICPR 2024 Competition on Multiple Sclerosis Lesion Segmentation – Methods and Results

链接: https://arxiv.org/abs/2410.07924
作者: Alessia Rondinella,Francesco Guarnera,Elena Crispino,Giulia Russo,Clara Di Lorenzo,Davide Maimone,Francesco Pappalardo,Sebastiano Battiato
关键词-EN: multiple sclerosis lesions, Multiple Sclerosis, segmenting multiple sclerosis, Sclerosis Lesion Segmentation, sclerosis lesions
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This report summarizes the outcomes of the ICPR 2024 Competition on Multiple Sclerosis Lesion Segmentation (MSLesSeg). The competition aimed to develop methods capable of automatically segmenting multiple sclerosis lesions in MRI scans. Participants were provided with a novel annotated dataset comprising a heterogeneous cohort of MS patients, featuring both baseline and follow-up MRI scans acquired at different hospitals. MSLesSeg focuses on developing algorithms that can independently segment multiple sclerosis lesions of an unexamined cohort of patients. This segmentation approach aims to overcome current benchmarks by eliminating user interaction and ensuring robust lesion detection at different timepoints, encouraging innovation and promoting methodological advances.

[CV-141] ONCOPILOT: A Promptable CT Foundation Model For Solid Tumor Evaluation

链接: https://arxiv.org/abs/2410.07908
作者: Léo Machado,Hélène Philippe,Élodie Ferreres,Julien Khlaut,Julie Dupuis,Korentin Le Floch,Denis Habip Gatenyo,Pascal Roux,Jules Grégory,Maxime Ronot,Corentin Dancette,Daniel Tordjman,Pierre Manceron,Paul Hérent
关键词-EN: diverse shapes, proteiform phenomenon, displaying complex, locations and displaying, tumors emerging
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Carcinogenesis is a proteiform phenomenon, with tumors emerging in various locations and displaying complex, diverse shapes. At the crucial intersection of research and clinical practice, it demands precise and flexible assessment. However, current biomarkers, such as RECIST 1.1’s long and short axis measurements, fall short of capturing this complexity, offering an approximate estimate of tumor burden and a simplistic representation of a more intricate process. Additionally, existing supervised AI models face challenges in addressing the variability in tumor presentations, limiting their clinical utility. These limitations arise from the scarcity of annotations and the models’ focus on narrowly defined tasks. To address these challenges, we developed ONCOPILOT, an interactive radiological foundation model trained on approximately 7,500 CT scans covering the whole body, from both normal anatomy and a wide range of oncological cases. ONCOPILOT performs 3D tumor segmentation using visual prompts like point-click and bounding boxes, outperforming state-of-the-art models (e.g., nnUnet) and achieving radiologist-level accuracy in RECIST 1.1 measurements. The key advantage of this foundation model is its ability to surpass state-of-the-art performance while keeping the radiologist in the loop, a capability that previous models could not achieve. When radiologists interactively refine the segmentations, accuracy improves further. ONCOPILOT also accelerates measurement processes and reduces inter-reader variability, facilitating volumetric analysis and unlocking new biomarkers for deeper insights. This AI assistant is expected to enhance the precision of RECIST 1.1 measurements, unlock the potential of volumetric biomarkers, and improve patient stratification and clinical care, while seamlessly integrating into the radiological workflow. Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2410.07908 [eess.IV] (or arXiv:2410.07908v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2410.07908 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-142] FDDM: Frequency-Decomposed Diffusion Model for Rectum Cancer Dose Prediction in Radiotherapy

链接: https://arxiv.org/abs/2410.07876
作者: Xin Liao,Zhenghao Feng,Jianghong Xiao,Xingchen Peng,Yan Wang
关键词-EN: Accurate dose distribution, dose distribution prediction, Accurate dose, dose map, coarse dose map
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate dose distribution prediction is crucial in the radiotherapy planning. Although previous methods based on convolutional neural network have shown promising performance, they have the problem of over-smoothing, leading to prediction without important high-frequency details. Recently, diffusion model has achieved great success in computer vision, which excels in generating images with more high-frequency details, yet suffers from time-consuming and extensive computational resource consumption. To alleviate these problems, we propose Frequency-Decomposed Diffusion Model (FDDM) that refines the high-frequency subbands of the dose map. To be specific, we design a Coarse Dose Prediction Module (CDPM) to first predict a coarse dose map and then utilize discrete wavelet transform to decompose the coarse dose map into a low-frequency subband and three high?frequency subbands. There is a notable difference between the coarse predicted results and ground truth in high?frequency subbands. Therefore, we design a diffusion-based module called High-Frequency Refinement Module (HFRM) that performs diffusion operation in the high?frequency components of the dose map instead of the original dose map. Extensive experiments on an in-house dataset verify the effectiveness of our approach.

[CV-143] Breaking the curse of dimensionality in structured density estimation NEURIPS2024

链接: https://arxiv.org/abs/2410.07685
作者: Robert A. Vandermeulen,Wai Ming Tai,Bryon Aragam
关键词-EN: Markov conditions implied, structured multivariate density, curse of dimensionality, estimating a structured, structured multivariate
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Work accepted to NeurIPS 2024

点击查看摘要

Abstract:We consider the problem of estimating a structured multivariate density, subject to Markov conditions implied by an undirected graph. In the worst case, without Markovian assumptions, this problem suffers from the curse of dimensionality. Our main result shows how the curse of dimensionality can be avoided or greatly alleviated under the Markov property, and applies to arbitrary graphs. While existing results along these lines focus on sparsity or manifold assumptions, we introduce a new graphical quantity called “graph resilience” and show how it controls the sample complexity. Surprisingly, although one might expect the sample complexity of this problem to scale with local graph parameters such as the degree, this turns out not to be the case. Through explicit examples, we compute uniform deviation bounds and illustrate how the curse of dimensionality in density estimation can thus be circumvented. Notable examples where the rate improves substantially include sequential, hierarchical, and spatial data.

[CV-144] DDSR: Single-Step Diffusion with Two Discriminators for Super Resolution

链接: https://arxiv.org/abs/2410.07663
作者: Sohwi Kim,Tae-Kyun Kim
关键词-EN: increasingly being specialized, Super-resolution, diffusion-based super-resolution, Abstract, diffusion-based super-resolution method
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Super-resolution methods are increasingly being specialized for both real-world and face-specific tasks. However, many existing approaches rely on simplistic degradation models, which limits their ability to handle complex and unknown degradation patterns effectively. While diffusion-based super-resolution techniques have recently shown impressive results, they are still constrained by the need for numerous inference steps. To address this, we propose TDDSR, an efficient single-step diffusion-based super-resolution method. Our method, distilled from a pre-trained teacher model and based on a diffusion network, performs super-resolution in a single step. It integrates a learnable downsampler to capture diverse degradation patterns and employs two discriminators, one for high-resolution and one for low-resolution images, to enhance the overall performance. Experimental results demonstrate its effectiveness across real-world and face-specific SR tasks, achieving performance comparable to, or even surpassing, another single-step method, previous state-of-the-art models, and the teacher model.

[CV-145] Calibration of 3D Single-pixel Imaging Systems with a Calibration Field

链接: https://arxiv.org/abs/2410.07545
作者: Xinyue Ma,Chenxing Wang
关键词-EN: ffexibly applied, SPI, promising imaging technique, Abstract, SPI systems
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D single-pixel imaging (SPI) is a promising imaging technique that can be ffexibly applied to various wavebands. The main challenge in 3D SPI is that the calibration usually requires a large number of standard points as references, which are tricky to capture using single-pixel detectors. Conventional solutions involve sophisticated device deployment and cumbersome operations, resulting in hundreds of images needed for calibration. In our work, we construct a Calibration Field (CaliF) to efffciently generate the standard points from one single image. A high accuracy of the CaliF is guaranteed by the technique of deep learning and digital twin. We perform experiments with our new method to verify its validity and accuracy. We believe our work holds great potential in 3D SPI systems or even general imaging systems.

[CV-146] Modeling Alzheimers Disease: From Memory Loss to Plaque Tangles Formation

链接: https://arxiv.org/abs/2410.07503
作者: Sai Nag Anurag Nangunoori,Akshara Karthic Mahadevan
关键词-EN: employ the Hopfield, Hopfield model, biochemical processes characteristic, Alzheimer disease, Alzheimer
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:We employ the Hopfield model as a simplified framework to explore both the memory deficits and the biochemical processes characteristic of Alzheimer’s disease. By simulating neuronal death and synaptic degradation through increasing the number of stored patterns and introducing noise into the synaptic weights, we demonstrate hallmark symptoms of dementia, including memory loss, confusion, and delayed retrieval times. As the network’s capacity is exceeded, retrieval errors increase, mirroring the cognitive confusion observed in Alzheimer’s patients. Additionally, we simulate the impact of synaptic degradation by varying the sparsity of the weight matrix, showing impaired memory recall and reduced retrieval success as noise levels increase. Furthermore, we extend our model to connect memory loss with biochemical processes linked to Alzheimer’s. By simulating the role of reduced insulin sensitivity over time, we show how it can trigger increased calcium influx into mitochondria, leading to misfolded proteins and the formation of amyloid plaques. These findings, modeled over time, suggest that both neuronal degradation and metabolic factors contribute to the progressive decline seen in Alzheimer’s disease. Our work offers a computational framework for understanding the dual impact of synaptic and metabolic dysfunction in neurodegenerative diseases.

[CV-147] Deep Learning for Surgical Instrument Recognition and Segmentation in Robotic-Assisted Surgeries: A Systematic Review

链接: https://arxiv.org/abs/2410.07269
作者: Fatimaelzahraa Ali Ahmed,Mahmoud Yousef,Mariam Ali Ahmed,Hasan Omar Ali,Anns Mahboob,Hazrat Ali,Zubair Shah,Omar Aboumarzouk,Abdulla Al Ansari,Shidin Balakrishnan
关键词-EN: Applying deep learning, minimally invasive surgeries, robot-assisted minimally invasive, Applying deep, surgical
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 57 pages, 9 figures, Accepted for publication in Artificial Intelligence Reviews journal this https URL

点击查看摘要

Abstract:Applying deep learning (DL) for annotating surgical instruments in robot-assisted minimally invasive surgeries (MIS) represents a significant advancement in surgical technology. This systematic review examines 48 studies that and advanced DL methods and architectures. These sophisticated DL models have shown notable improvements in the precision and efficiency of detecting and segmenting surgical tools. The enhanced capabilities of these models support various clinical applications, including real-time intraoperative guidance, comprehensive postoperative evaluations, and objective assessments of surgical skills. By accurately identifying and segmenting surgical instruments in video data, DL models provide detailed feedback to surgeons, thereby improving surgical outcomes and reducing complication risks. Furthermore, the application of DL in surgical education is transformative. The review underscores the significant impact of DL on improving the accuracy of skill assessments and the overall quality of surgical training programs. However, implementing DL in surgical tool detection and segmentation faces challenges, such as the need for large, accurately annotated datasets to train these models effectively. The manual annotation process is labor-intensive and time-consuming, posing a significant bottleneck. Future research should focus on automating the detection and segmentation process and enhancing the robustness of DL models against environmental variations. Expanding the application of DL models across various surgical specialties will be essential to fully realize this technology’s potential. Integrating DL with other emerging technologies, such as augmented reality (AR), also offers promising opportunities to further enhance the precision and efficacy of surgical procedures.

机器学习

[LG-0] Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

链接: https://arxiv.org/abs/2410.08209
作者: Shengcao Cao,Liang-Yan Gui,Yu-Xiong Wang
关键词-EN: Current large multimodal, relate language components, Current large, large multimodal models, face challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an “attend-and-segment” method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: this https URL.

[LG-1] SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

链接: https://arxiv.org/abs/2410.08208
作者: Haoyi Zhu,Honghui Yang,Yating Wang,Jiange Yang,Limin Wang,Tong He
关键词-EN: vanilla Vision Transformer, embodied representation learning, framework that emphasizes, emphasizes the importance, Vision Transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios. These results highlight the critical role of 3D spatial awareness for embodied representation learning. Our strongest model takes more than 6000 GPU hours to train and we are committed to open-sourcing all code and model weights to foster future research in embodied representation learning. Project Page: this https URL.

[LG-2] DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

链接: https://arxiv.org/abs/2410.08207
作者: Xiaoxiao He,Ligong Han,Quan Dao,Song Wen,Minhao Bai,Di Liu,Han Zhang,Martin Renqiang Min,Felix Juefei-Xu,Chaowei Tan,Bo Liu,Kang Li,Hongdong Li,Junzhou Huang,Faez Ahmed,Akash Srivastava,Dimitris Metaxas
关键词-EN: masked language modeling, Discrete diffusion models, achieved success, success in tasks, language modeling
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces. For project webpage, see this https URL.

[LG-3] Efficient Dictionary Learning with Switch Sparse Autoencoders

链接: https://arxiv.org/abs/2410.08201
作者: Anish Mudide,Joshua Engels,Eric J. Michaud,Max Tegmark,Christian Schroeder de Witt
关键词-EN: decomposing neural network, Switch Sparse Autoencoders, Switch SAEs, neural network activations, Sparse autoencoders
类目: Machine Learning (cs.LG)
*备注: Code available at this https URL

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to scale them up to very high width, posing a computational challenge. In this work, we introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller “expert” SAEs, enabling SAEs to efficiently scale to many more features. We present experiments comparing Switch SAEs with other SAE architectures, and find that Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget. We also study the geometry of features across experts, analyze features duplicated across experts, and verify that Switch SAE features are as interpretable as features found by other SAE architectures.

[LG-4] Adam Exploits ell_infty-geometry of Loss Landscape via Coordinate-wise Adaptivity

链接: https://arxiv.org/abs/2410.08198
作者: Shuo Xie,Mohamad Amin Mohamadi,Zhiyuan Li
关键词-EN: Adam outperforms SGD, training language models, Adam, ell, training language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically – previous convergence analysis for Adam and SGD mainly focuses on the number of steps T and is already minimax-optimal in non-convex cases, which are both \widetildeO(T^-1/4) . In this work, we argue that the exploitation of nice \ell_\infty -geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under \ell_\infty -geometry rather than the more common \ell_2 -geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable \ell_\infty -geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.

[LG-5] Poison-splat: Computation Cost Attack on 3D Gaussian Splatting

链接: https://arxiv.org/abs/2410.08190
作者: Jiahao Lu,Yifan Zhang,Qiuhong Shen,Xinchao Wang,Shuicheng Yan
关键词-EN: Gaussian splatting, vision tasks, performance and efficiency, representation and brought, groundbreaking performance
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Our code is available at this https URL

点击查看摘要

Abstract:3D Gaussian splatting (3DGS), known for its groundbreaking performance and efficiency, has become a dominant 3D representation and brought progress to many 3D vision tasks. However, in this work, we reveal a significant security vulnerability that has been largely overlooked in 3DGS: the computation cost of training 3DGS could be maliciously tampered by poisoning the input data. By developing an attack named Poison-splat, we reveal a novel attack surface where the adversary can poison the input images to drastically increase the computation memory and time needed for 3DGS training, pushing the algorithm towards its worst computation complexity. In extreme cases, the attack can even consume all allocable memory, leading to a Denial-of-Service (DoS) that disrupts servers, resulting in practical damages to real-world 3DGS service vendors. Such a computation cost attack is achieved by addressing a bi-level optimization problem through three tailored strategies: attack objective approximation, proxy model rendering, and optional constrained optimization. These strategies not only ensure the effectiveness of our attack but also make it difficult to defend with simple defensive measures. We hope the revelation of this novel attack surface can spark attention to this crucial yet overlooked vulnerability of 3DGS systems.

[LG-6] Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models

链接: https://arxiv.org/abs/2410.08174
作者: Qingni Wang,Tiantian Geng,Zhiyuan Wang,Teng Wang,Bo Fu,Feng Zheng
关键词-EN: Multimodal Large Language, Multimodal Large, Large Language Models, significant trustworthiness issues, encounter significant trustworthiness
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) exhibit promising advancements across various tasks, yet they still encounter significant trustworthiness issues. Prior studies apply Split Conformal Prediction (SCP) in language modeling to construct prediction sets with statistical guarantees. However, these methods typically rely on internal model logits or are restricted to multiple-choice settings, which hampers their generalizability and adaptability in dynamic, open-ended environments. In this paper, we introduce TRON, a two-step framework for risk control and assessment, applicable to any MLLM that supports sampling in both open-ended and closed-ended scenarios. TRON comprises two main components: (1) a novel conformal score to sample response sets of minimum size, and (2) a nonconformity score to identify high-quality responses based on self-consistency theory, controlling the error rates by two specific risk levels. Furthermore, we investigate semantic redundancy in prediction sets within open-ended contexts for the first time, leading to a promising evaluation metric for MLLMs based on average set size. Our comprehensive experiments across four Video Question-Answering (VideoQA) datasets utilizing eight MLLMs show that TRON achieves desired error rates bounded by two user-specified risk levels. Additionally, deduplicated prediction sets maintain adaptiveness while being more efficient and stable for risk assessment under different risk levels.

[LG-7] On the Evaluation of Generative Robotic Simulations ALT

链接: https://arxiv.org/abs/2410.08172
作者: Feng Chen,Botian Xu,Pu Hua,Peiqi Duan,Yanchao Yang,Yi Ma,Huazhe Xu
关键词-EN: acquiring extensive real-world, scalable simulated robotic, extensive real-world data, simulated robotic tasks, highlighting the importance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capacities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For single-task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision-language models. In terms of diversity, we measure both task and data diversity through text similarity of task descriptions and world model loss trained on collected task trajectories. For task-level generalization, we assess the zero-shot generalization ability on unseen tasks of a policy trained with multiple generated tasks. Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics. Additionally, our analysis further highlights the common challenge of low generalization capability faced by current works. Our anonymous website: this https URL.

[LG-8] Visual Scratchpads: Enabling Global Reasoning in Vision

链接: https://arxiv.org/abs/2410.08165
作者: Aryo Lotfi,Enrico Fini,Samy Bengio,Moin Nabi,Emmanuel Abbe
关键词-EN: achieved remarkable success, features provide critical, local features provide, provide critical information, Modern vision models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in solving tasks that require more global reasoning, where local features offer no significant information. These tasks are reminiscent of the connectivity tasks discussed by Minsky and Papert in 1969, which exposed the limitations of the perceptron model and contributed to the first AI winter. In this paper, we revisit such tasks by introducing four global visual benchmarks involving path findings and mazes. We show that: (1) although today’s large vision models largely surpass the expressivity limitations of the early models, they still struggle with the learning efficiency; we put forward the “globality degree” notion to understand this limitation; (2) we then demonstrate that the picture changes and global reasoning becomes feasible with the introduction of “visual scratchpads”; similarly to the text scratchpads and chain-of-thoughts used in language models, visual scratchpads help break down global tasks into simpler ones; (3) we finally show that some scratchpads are better than others, in particular, “inductive scratchpads” that take steps relying on less information afford better out-of-distribution generalization and succeed for smaller model sizes.

[LG-9] DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

链接: https://arxiv.org/abs/2410.08159
作者: Jiatao Gu,Yuyang Wang,Yizhe Zhang,Qihang Zhang,Dinghuai Zhang,Navdeep Jaitly,Josh Susskind,Shuangfei Zhai
关键词-EN: DART, image, Markovian, visual generation, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 23 pages

点击查看摘要

Abstract:Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the models ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.

[LG-10] Progressive Autoregressive Video Diffusion Models

链接: https://arxiv.org/abs/2410.08151
作者: Desai Xie,Zhan Xu,Yicong Hong,Hao Tan,Difan Liu,Feng Liu,Arie Kaufman,Yang Zhou
关键词-EN: Current frontier video, Current frontier, demonstrated remarkable results, generating high-quality videos, demonstrated remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 15 pages, 5 figures. Our video results and code are available at this https URL

点击查看摘要

Abstract:Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. In this work, we show that existing models can be naturally extended to autoregressive video diffusion models without changing the architectures. Our key idea is to assign the latent frames with progressively increasing noise levels rather than a single noise level, which allows for fine-grained condition among the latents and large overlaps between the attention windows. Such progressive video denoising allows our models to autoregressively generate video frames without quality degradation or abrupt scene changes. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Videos from this paper are available at this https URL.

[LG-11] Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

链接: https://arxiv.org/abs/2410.08146
作者: Amrith Setlur,Chirag Nagpal,Adam Fisch,Xinyang Geng,Jacob Eisenstein,Rishabh Agarwal,Alekh Agarwal,Jonathan Berant,Aviral Kumar
关键词-EN: large language models, promising approach, large language, reward models, language models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: “How should we design process rewards?”. Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is 8% more accurate, and 1.5-5\times more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with 5-6\times gain in sample efficiency, and 6% gain in accuracy, over ORMs.

[LG-12] Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

链接: https://arxiv.org/abs/2410.08134
作者: Jarrid Rector-Brooks,Mohsin Hasan,Zhangzhi Peng,Zachary Quinn,Chenghao Liu,Sarthak Mittal,Nouha Dziri,Michael Bronstein,Yoshua Bengio,Pranam Chatterjee,Alexander Tong,Avishek Joey Bose
关键词-EN: data underlies important, underlies important applications, important applications spanning, discrete data underlies, spanning text-based agents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process - typically via RLHF - to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text-based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.

[LG-13] Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks

链接: https://arxiv.org/abs/2410.08133
作者: Mathis Pink,Vy A. Vo,Qinyuan Wu,Jianing Mu,Javier S. Turek,Uri Hasson,Kenneth A. Norman,Sebastian Michelmann,Alexander Huth,Mariya Toneva
关键词-EN: primarily assessing semantic, Current LLM benchmarks, Current LLM, assessing semantic aspects, semantic relations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current LLM benchmarks focus on evaluating models’ memory of facts and semantic relations, primarily assessing semantic aspects of long-term memory. However, in humans, long-term memory also includes episodic memory, which links memories to their contexts, such as the time and place they occurred. The ability to contextualize memories is crucial for many cognitive tasks and everyday functions. This form of memory has not been evaluated in LLMs with existing benchmarks. To address the gap in evaluating memory in LLMs, we introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology. SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is both easily extendable and does not require any additional annotations. We present an initial evaluation dataset, Book-SORT, comprising 36k pairs of segments extracted from 9 books recently added to the public domain. Based on a human experiment with 155 participants, we show that humans can recall sequence order based on long-term memory of a book. We find that models can perform the task with high accuracy when relevant text is given in-context during the SORT evaluation. However, when presented with the book text only during training, LLMs’ performance on SORT falls short. By allowing to evaluate more aspects of memory, we believe that SORT will aid in the emerging development of memory-augmented models.

[LG-14] hink Beyond Size: Dynamic Prompting for More Effective Reasoning ICLR2025

链接: https://arxiv.org/abs/2410.08130
作者: Kamesh R
关键词-EN: Large Language Models, Large Language, paper presents Dynamic, presents Dynamic Prompting, capabilities of Large
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: Submitted to ICLR 2025. This is a preprint version. Future revisions will include additional evaluations and refinements

点击查看摘要

Abstract:This paper presents Dynamic Prompting, a novel framework aimed at improving the reasoning capabilities of Large Language Models (LLMs). In contrast to conventional static prompting methods, Dynamic Prompting enables the adaptive modification of prompt sequences and step counts based on real-time task complexity and model performance. This dynamic adaptation facilitates more efficient problem-solving, particularly in smaller models, by reducing hallucinations and repetitive cycles. Our empirical evaluations demonstrate that Dynamic Prompting allows smaller LLMs to perform competitively with much larger models, thereby challenging the conventional emphasis on model size as the primary determinant of reasoning efficacy.

[LG-15] Mars: Situated Inductive Reasoning in an Open-World Environment

链接: https://arxiv.org/abs/2410.08126
作者: Xiaojuan Tang,Jiaqi Li,Yitao Liang,Song-chun Zhu,Muhan Zhang,Zilong Zheng
关键词-EN: Large Language Models, Large Language, Language Models, shown remarkable success, inductive reasoning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) trained on massive corpora have shown remarkable success in knowledge-intensive tasks. Yet, most of them rely on pre-stored knowledge. Inducing new general knowledge from a specific environment and performing reasoning with the acquired knowledge – \textitsituated inductive reasoning, is crucial and challenging for machine intelligence. In this paper, we design Mars, an interactive environment devised for situated inductive reasoning. It introduces counter-commonsense game mechanisms by modifying terrain, survival setting and task dependency while adhering to certain principles. In Mars, agents need to actively interact with their surroundings, derive useful rules and perform decision-making tasks in specific contexts. We conduct experiments on various RL-based and LLM-based methods, finding that they all struggle on this challenging situated inductive reasoning benchmark. Furthermore, we explore \textitInduction from Reflection, where we instruct agents to perform inductive reasoning from history trajectory. The superior performance underscores the importance of inductive reasoning in Mars. Through Mars, we aim to galvanize advancements in situated inductive reasoning and set the stage for developing the next generation of AI systems that can reason in an adaptive and context-sensitive way.

[LG-16] Generalizing Stochastic Smoothing for Differentiation and Gradient Estimation

链接: https://arxiv.org/abs/2410.08125
作者: Felix Petersen,Christian Borgelt,Aashwin Mishra,Stefano Ermon
关键词-EN: gradient estimation, differentiable, full support, estimation, gradient
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We deal with the problem of gradient estimation for stochastic differentiable relaxations of algorithms, operators, simulators, and other non-differentiable functions. Stochastic smoothing conventionally perturbs the input of a non-differentiable function with a differentiable density distribution with full support, smoothing it and enabling gradient estimation. Our theory starts at first principles to derive stochastic smoothing with reduced assumptions, without requiring a differentiable density nor full support, and we present a general framework for relaxation and gradient estimation of non-differentiable black-box functions f:\mathbbR^n\to\mathbbR^m . We develop variance reduction for gradient estimation from 3 orthogonal perspectives. Empirically, we benchmark 6 distributions and up to 24 variance reduction strategies for differentiable sorting and ranking, differentiable shortest-paths on graphs, differentiable rendering for pose estimation, as well as differentiable cryo-ET simulations.

[LG-17] Heterogeneous Graph Auto-Encoder for CreditCard Fraud Detection

链接: https://arxiv.org/abs/2410.08121
作者: Moirangthem Tiken Singh,Rabinder Kumar Prasad,Gurumayum Robert Michael,N K Kaphungkui,N.Hemarjit Singh
关键词-EN: credit card usage, digital revolution, notable increase, Graph Neural Networks, fraud
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The digital revolution has significantly impacted financial transactions, leading to a notable increase in credit card usage. However, this convenience comes with a trade-off: a substantial rise in fraudulent activities. Traditional machine learning methods for fraud detection often struggle to capture the inherent interconnectedness within financial data. This paper proposes a novel approach for credit card fraud detection that leverages Graph Neural Networks (GNNs) with attention mechanisms applied to heterogeneous graph representations of financial data. Unlike homogeneous graphs, heterogeneous graphs capture intricate relationships between various entities in the financial ecosystem, such as cardholders, merchants, and transactions, providing a richer and more comprehensive data representation for fraud analysis. To address the inherent class imbalance in fraud data, where genuine transactions significantly outnumber fraudulent ones, the proposed approach integrates an autoencoder. This autoencoder, trained on genuine transactions, learns a latent representation and flags deviations during reconstruction as potential fraud. This research investigates two key questions: (1) How effectively can a GNN with an attention mechanism detect and prevent credit card fraud when applied to a heterogeneous graph? (2) How does the efficacy of the autoencoder with attention approach compare to traditional methods? The results are promising, demonstrating that the proposed model outperforms benchmark algorithms such as Graph Sage and FI-GRL, achieving a superior AUC-PR of 0.89 and an F1-score of 0.81. This research significantly advances fraud detection systems and the overall security of financial transactions by leveraging GNNs with attention mechanisms and addressing class imbalance through an autoencoder.

[LG-18] On Barycenter Computation: Semi-Unbalanced Optimal Transport-based Method on Gaussians

链接: https://arxiv.org/abs/2410.08117
作者: Ngoc-Hai Nguyen,Dung Le,Hoang-Phi Nguyen,Tung Pham,Nhat Ho
关键词-EN: Semi-Unbalanced Optimal Transport, Geodesic Gradient Descent, Exact Geodesic Gradient, Hybrid Gradient Descent, centered Gaussian probability
类目: Machine Learning (cs.LG)
*备注: Ngoc-Hai Nguyen and Dung Le contributed equally to this work. 44 pages, 5 figures

点击查看摘要

Abstract:We explore a robust version of the barycenter problem among n centered Gaussian probability measures, termed Semi-Unbalanced Optimal Transport (SUOT)-based Barycenter, wherein the barycenter remains fixed while the others are relaxed using Kullback-Leibler divergence. We develop optimization algorithms on Bures-Wasserstein manifold, named the Exact Geodesic Gradient Descent and Hybrid Gradient Descent algorithms. While the Exact Geodesic Gradient Descent method is based on computing the exact closed form of the first-order derivative of the objective function of the barycenter along a geodesic on the Bures manifold, the Hybrid Gradient Descent method utilizes optimizer components when solving the SUOT problem to replace outlier measures before applying the Riemannian Gradient Descent. We establish the theoretical convergence guarantees for both methods and demonstrate that the Exact Geodesic Gradient Descent algorithm attains a dimension-free convergence rate. Finally, we conduct experiments to compare the normal Wasserstein Barycenter with ours and perform an ablation study.

[LG-19] Active Fourier Auditor for Estimating Distributional Properties of ML Models

链接: https://arxiv.org/abs/2410.08111
作者: Ayoub Ajarra,Bishwamittra Ghosh,Debabrota Basu
关键词-EN: Machine Learning, deployment of Machine, real-world applications, central concern, pervasive deployment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:With the pervasive deployment of Machine Learning (ML) models in real-world applications, verifying and auditing properties of ML models have become a central concern. In this work, we focus on three properties: robustness, individual fairness, and group fairness. We discuss two approaches for auditing ML model properties: estimation with and without reconstruction of the target model under audit. Though the first approach is studied in the literature, the second approach remains unexplored. For this purpose, we develop a new framework that quantifies different properties in terms of the Fourier coefficients of the ML model under audit but does not parametrically reconstruct it. We propose the Active Fourier Auditor (AFA), which queries sample points according to the Fourier coefficients of the ML model, and further estimates the properties. We derive high probability error bounds on AFA’s estimates, along with the worst-case lower bounds on the sample complexity to audit them. Numerically we demonstrate on multiple datasets and models that AFA is more accurate and sample-efficient to estimate the properties of interest than the baselines.

[LG-20] A Closer Look at Machine Unlearning for Large Language Models

链接: https://arxiv.org/abs/2410.08109
作者: Xiaojian Yuan,Tianyu Pang,Chao Du,Kejiang Chen,Weiming Zhang,Min Lin
关键词-EN: Large language models, Large language, raising privacy, legal concerns, memorize sensitive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at this https URL.

[LG-21] Noethers razor: Learning Conserved Quantities

链接: https://arxiv.org/abs/2410.08087
作者: Tycho F. A. van der Ouderaa,Mark van der Wilk,Pim de Haan
关键词-EN: machine learning models, conserved quantities, learning dynamical systems, machine learning, Noether theorem
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Symmetries have proven useful in machine learning models, improving generalisation and overall performance. At the same time, recent advancements in learning dynamical systems rely on modelling the underlying Hamiltonian to guarantee the conservation of energy. These approaches can be connected via a seminal result in mathematical physics: Noether’s theorem, which states that symmetries in a dynamical system correspond to conserved quantities. This work uses Noether’s theorem to parameterise symmetries as learnable conserved quantities. We then allow conserved quantities and associated symmetries to be learned directly from train data through approximate Bayesian model selection, jointly with the regular training procedure. As training objective, we derive a variational lower bound to the marginal likelihood. The objective automatically embodies an Occam’s Razor effect that avoids collapse of conservation laws to the trivial constant, without the need to manually add and tune additional regularisers. We demonstrate a proof-of-principle on n -harmonic oscillators and n -body systems. We find that our method correctly identifies the correct conserved quantities and U( n ) and SE( n ) symmetry groups, improving overall performance and predictive accuracy on test data.

[LG-22] Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning

链接: https://arxiv.org/abs/2410.08081
作者: Shuhe Wang,Guoyin Wang,Jiwei Li,Eduard Hovy,Chen Guo
关键词-EN: maximum input length, optimization technique designed, maximize hardware resource, model maximum input, hardware resource efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model’s maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: this https URL. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2410.08081 [cs.LG] (or arXiv:2410.08081v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.08081 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-23] Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

链接: https://arxiv.org/abs/2410.08074
作者: Vinith M. Suriyakumar,Rohan Alur,Ayush Sekhari,Manish Raghavan,Ashia C. Wilson
关键词-EN: web-scale datasets, rely on massive, diffusion models rely, diffusion models, diffusion
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 13 figures

点击查看摘要

Abstract:Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with “unlearning” steps (to “forget” existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to “relearn” concepts that were previously “unlearned.” We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments which compose “mass concept erasure” (the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024)) with subsequent fine-tuning of Stable Diffusion v1.4. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.

[LG-24] Gaussian Process Thompson Sampling via Rootfinding NEURIPS2024

链接: https://arxiv.org/abs/2410.08071
作者: Taiwo A. Adebiyi,Bach Do,Ruda Zhang
关键词-EN: effective stochastic policy, Thompson sampling, effective stochastic, stochastic policy, Bayesian decision making
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Paper accepted at the NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty for an oral presentation

点击查看摘要

Abstract:Thompson sampling (TS) is a simple, effective stochastic policy in Bayesian decision making. It samples the posterior belief about the reward profile and optimizes the sample to obtain a candidate decision. In continuous optimization, the posterior of the objective function is often a Gaussian process (GP), whose sample paths have numerous local optima, making their global optimization challenging. In this work, we introduce an efficient global optimization strategy for GP-TS that carefully selects starting points for gradient-based multi-start optimizers. It identifies all local optima of the prior sample via univariate global rootfinding, and optimizes the posterior sample using a differentiable, decoupled representation. We demonstrate remarkable improvement in the global optimization of GP posterior samples, especially in high dimensions. This leads to dramatic improvements in the overall performance of Bayesian optimization using GP-TS acquisition functions, surprisingly outperforming alternatives like GP-UCB and EI.

[LG-25] Unlearning-based Neural Interpretations

链接: https://arxiv.org/abs/2410.08069
作者: Ching Lam Choi,Alexandre Duplessis,Serge Belongie
关键词-EN: computing feature importance, Gradient-based interpretations, require an anchor, comparison to avoid, avoid saturation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions–constant mapping, averaging or blurring–inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust interpretations.

[LG-26] Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

链接: https://arxiv.org/abs/2410.08067
作者: Shenao Zhang,Zhihan Liu,Boyi Liu,Yufeng Zhang,Yingxiang Yang,Yongfei Liu,Liyu Chen,Tao Sun,Zhaoran Wang
关键词-EN: Large Language Models, Large Language, Language Models, instructions and intentions, existing direct alignment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to responses with the highest rewards, which are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. The experimental results across instruction-following benchmarks including AlpacaEval, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. Additionally, our method improves the average accuracy on various academic benchmarks. When applying our method to on-policy data, the resulting DPO model achieves SOTA results on AlpacaEval. Through ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion. Our code is available at this https URL.

[LG-27] Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions EMNLP2024

链接: https://arxiv.org/abs/2410.08058
作者: Inderjeet Nair,Jiaye Tan,Xiaotian Su,Anne Gere,Xu Wang,Lu Wang
关键词-EN: Providing feedback, widely recognized, recognized as crucial, crucial for refining, students’ writing skills
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Providing feedback is widely recognized as crucial for refining students’ writing skills. Recent advances in language models (LMs) have made it possible to automatically generate feedback that is actionable and well-aligned with human-specified attributes. However, it remains unclear whether the feedback generated by these models is truly effective in enhancing the quality of student revisions. Moreover, prompting LMs with a precise set of instructions to generate feedback is nontrivial due to the lack of consensus regarding the specific attributes that can lead to improved revising performance. To address these challenges, we propose PROF that PROduces Feedback via learning from LM simulated student revisions. PROF aims to iteratively optimize the feedback generator by directly maximizing the effectiveness of students’ overall revising performance as simulated by LMs. Focusing on an economic essay assignment, we empirically test the efficacy of PROF and observe that our approach not only surpasses a variety of baseline methods in effectiveness of improving students’ writing but also demonstrates enhanced pedagogical values, even though it was not explicitly trained for this aspect.

[LG-28] Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

链接: https://arxiv.org/abs/2410.08049
作者: Yiyuan Zhang,Xiaohan Ding,Xiangyu Yue
关键词-EN: Convolutional Neural Networks, modern Convolutional Neural, designing modern Convolutional, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This is the journal version of arXiv:2203.06717 and arXiv:2311.15599

点击查看摘要

Abstract:This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematical architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4% but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets with faster inference speed compared with vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs. All codes and models are publicly available at this https URL promoting further research and development in the community.

[LG-29] VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

链接: https://arxiv.org/abs/2410.08048
作者: Jianing Qi,Hao Tang,Zhigang Zhu
关键词-EN: test time compute, Large Language Models, Large Language, Language Models, verifier models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced Q-value estimation. Our method enables parallel Q-value computation and improving training efficiency. While recent work has explored RL techniques like MCTS for generators, VerifierQ is among the first to investigate the verifier (critic) aspect in LLMs through Q-learning. This integration of RL principles into verifier models complements existing advancements in generator techniques, potentially enabling more robust and adaptive reasoning in LLMs. Experimental results on mathematical reasoning tasks demonstrate VerifierQ’s superior performance compared to traditional supervised fine-tuning approaches, with improvements in efficiency, accuracy and robustness. By enhancing the synergy between generation and evaluation capabilities, VerifierQ contributes to the ongoing evolution of AI systems in addressing complex cognitive tasks across various domains.

[LG-30] On the Convergence of (Stochastic) Gradient Descent for Kolmogorov–Arnold Networks

链接: https://arxiv.org/abs/2410.08041
作者: Yihang Gao,Vincent Y. F. Tan
关键词-EN: Arnold Networks, neural network architecture, gained significant attention, proposed neural network, deep learning community
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Kolmogorov–Arnold Networks (KANs), a recently proposed neural network architecture, have gained significant attention in the deep learning community, due to their potential as a viable alternative to multi-layer perceptrons (MLPs) and their broad applicability to various scientific tasks. Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss in various machine learning (e.g., regression, classification, and time series forecasting, etc.) and scientific tasks (e.g., solving partial differential equations). In this paper, we provide a theoretical explanation for the empirical success by conducting a rigorous convergence analysis of gradient descent (GD) and SGD for two-layer KANs in solving both regression and physics-informed tasks. For regression problems, we establish using the neural tangent kernel perspective that GD achieves global linear convergence of the objective function when the hidden dimension of KANs is sufficiently large. We further extend these results to SGD, demonstrating a similar global convergence in expectation. Additionally, we analyze the global convergence of GD and SGD for physics-informed KANs, which unveils additional challenges due to the more complex loss structure. This is the first work establishing the global convergence guarantees for GD and SGD applied to optimize KANs and physics-informed KANs.

[LG-31] Composite Learning Units: Generalized Learning Beyond Parameter Updates to Transform LLMs into Adaptive Reasoners

链接: https://arxiv.org/abs/2410.08037
作者: Santosh Kumar Radha,Oktay Goktas
关键词-EN: Human learning thrives, Large Language Models, Composite Learning Units, static machine learning, Human learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Human learning thrives on the ability to learn from mistakes, adapt through feedback, and refine understanding-processes often missing in static machine learning models. In this work, we introduce Composite Learning Units (CLUs) designed to transform reasoners, such as Large Language Models (LLMs), into learners capable of generalized, continuous learning without conventional parameter updates while enhancing their reasoning abilities through continual interaction and feedback. CLUs are built on an architecture that allows a reasoning model to maintain and evolve a dynamic knowledge repository: a General Knowledge Space for broad, reusable insights and a Prompt-Specific Knowledge Space for task-specific learning. Through goal-driven interactions, CLUs iteratively refine these knowledge spaces, enabling the system to adapt dynamically to complex tasks, extract nuanced insights, and build upon past experiences autonomously. We demonstrate CLUs’ effectiveness through a cryptographic reasoning task, where they continuously evolve their understanding through feedback to uncover hidden transformation rules. While conventional models struggle to grasp underlying logic, CLUs excel by engaging in an iterative, goal-oriented process. Specialized components-handling knowledge retrieval, prompt generation, and feedback analysis-work together within a reinforcing feedback loop. This approach allows CLUs to retain the memory of past failures and successes, adapt autonomously, and apply sophisticated reasoning effectively, continually learning from mistakes while also building on breakthroughs.

[LG-32] Strategic Classification With Externalities

链接: https://arxiv.org/abs/2410.08032
作者: Yiling Chen,Safwan Hossain,Evi Micha,Ariel Procaccia
关键词-EN: strategic classification problem, pure Nash Equilibrium, possibly manipulated, classification problem, principal reveals
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We propose a new variant of the strategic classification problem: a principal reveals a classifier, and n agents report their (possibly manipulated) features to be classified. Motivated by real-world applications, our model crucially allows the manipulation of one agent to affect another; that is, it explicitly captures inter-agent externalities. The principal-agent interactions are formally modeled as a Stackelberg game, with the resulting agent manipulation dynamics captured as a simultaneous game. We show that under certain assumptions, the pure Nash Equilibrium of this agent manipulation game is unique and can be efficiently computed. Leveraging this result, PAC learning guarantees are established for the learner: informally, we show that it is possible to learn classifiers that minimize loss on the distribution, even when a random number of agents are manipulating their way to a pure Nash Equilibrium. We also comment on the optimization of such classifiers through gradient-based approaches. This work sets the theoretical foundations for a more realistic analysis of classifiers that are robust against multiple strategic actors interacting in a common environment.

[LG-33] Private Language Models via Truncated Laplacian Mechanism EMNLP2024

链接: https://arxiv.org/abs/2410.08027
作者: Tianhao Huang,Tao Yang,Ivan Habernal,Lijie Hu,Di Wang
关键词-EN: Deep learning models, Deep learning, models for NLP, truncated Laplacian mechanism, NLP tasks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by EMNLP 2024, Main Track

点击查看摘要

Abstract:Deep learning models for NLP tasks are prone to variants of privacy attacks. To prevent privacy leakage, researchers have investigated word-level perturbations, relying on the formal guarantees of differential privacy (DP) in the embedding space. However, many existing approaches either achieve unsatisfactory performance in the high privacy regime when using the Laplacian or Gaussian mechanism, or resort to weaker relaxations of DP that are inferior to the canonical DP in terms of privacy strength. This raises the question of whether a new method for private word embedding can be designed to overcome these limitations. In this paper, we propose a novel private embedding method called the high dimensional truncated Laplacian mechanism. Specifically, we introduce a non-trivial extension of the truncated Laplacian mechanism, which was previously only investigated in one-dimensional space cases. Theoretically, we show that our method has a lower variance compared to the previous private word embedding methods. To further validate its effectiveness, we conduct comprehensive experiments on private embedding and downstream tasks using three datasets. Remarkably, even in the high privacy regime, our approach only incurs a slight decrease in utility compared to the non-private scenario.

[LG-34] Generalization Bounds and Model Complexity for Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2410.08026
作者: Xianyang Zhang,Huijuan Zhou
关键词-EN: network structure recently, structure recently proposed, proposed by Liu, Kolmogorov-Arnold Network, network structure
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Network (KAN) is a network structure recently proposed by Liu et al. (2024) that offers improved interpretability and a more parsimonious design in many science-oriented tasks compared to multi-layer perceptrons. This work provides a rigorous theoretical analysis of KAN by establishing generalization bounds for KAN equipped with activation functions that are either represented by linear combinations of basis functions or lying in a low-rank Reproducing Kernel Hilbert Space (RKHS). In the first case, the generalization bound accommodates various choices of basis functions in forming the activation functions in each layer of KAN and is adapted to different operator norms at each layer. For a particular choice of operator norms, the bound scales with the l_1 norm of the coefficient matrices and the Lipschitz constants for the activation functions, and it has no dependence on combinatorial parameters (e.g., number of nodes) outside of logarithmic factors. Moreover, our result does not require the boundedness assumption on the loss function and, hence, is applicable to a general class of regression-type loss functions. In the low-rank case, the generalization bound scales polynomially with the underlying ranks as well as the Lipschitz constants of the activation functions in each layer. These bounds are empirically investigated for KANs trained with stochastic gradient descent on simulated and real data sets. The numerical results demonstrate the practical relevance of these bounds.

[LG-35] Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling

链接: https://arxiv.org/abs/2410.08024
作者: Alessio Fallani,Ramil Nugmanov,Jose Arjona-Medina,Jörg Kurt Wegner,Alexandre Tkatchenko,Kostiantyn Chernichenko
关键词-EN: Graph Transformer architectures, pretraining Graph Transformer, Transformer architectures, atom-level quantum-mechanical features, Data Commons ADMET
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features for the modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of drug-like compounds. We compare this pretraining strategy with two others: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and one using a self-supervised atom masking technique. After fine-tuning on Therapeutic Data Commons ADMET datasets, we evaluate the performance improvement in the different models observing that models pretrained with atomic quantum mechanical properties produce in general better results. We then analyse the latent representations and observe that the supervised strategies preserve the pretraining information after finetuning and that different pretrainings produce different trends in latent expressivity across layers. Furthermore, we find that models pretrained on atomic quantum mechanical properties capture more low-frequency laplacian eigenmodes of the input graph via the attention weights and produce better representations of atomic environments within the molecule. Application of the analysis to a much larger non-public dataset for microsomal clearance illustrates generalizability of the studied indicators. In this case the performances of the models are in accordance with the representation analysis and highlight, especially for the case of masking pretraining and atom-level quantum property pretraining, how model types with similar performance on public benchmarks can have different performances on large scale pharmaceutical data.

[LG-36] Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs

链接: https://arxiv.org/abs/2410.08020
作者: Jonas Hübotter,Sascha Bongni,Ido Hakimi,Andreas Krause
关键词-EN: Nearest Neighbor retrieval, Recent efforts, Nearest Neighbor, Neighbor retrieval, automatic data selection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbors retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, limiting its effectiveness or even hurting performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model’s response given a prompt, which unifies ideas from retrieval and active learning. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT accounts for information duplication and optimizes the overall information gain of the selected examples. We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. We provide the \textttactiveft (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval.

[LG-37] Non-transferable Pruning ECCV2024

链接: https://arxiv.org/abs/2410.08015
作者: Ruyi Ding,Lili Su,Aidong Adam Ding,Yunsi Fei
关键词-EN: Deep Neural Networks, Pretrained Deep Neural, Neural Networks, Deep Neural, valuable intellectual property
类目: Machine Learning (cs.LG)
*备注: Accepted in ECCV 2024

点击查看摘要

Abstract:Pretrained Deep Neural Networks (DNNs), developed from extensive datasets to integrate multifaceted knowledge, are increasingly recognized as valuable intellectual property (IP). To safeguard these models against IP infringement, strategies for ownership verification and usage authorization have emerged. Unlike most existing IP protection strategies that concentrate on restricting direct access to the model, our study addresses an extended DNN IP issue: applicability authorization, aiming to prevent the misuse of learned knowledge, particularly in unauthorized transfer learning scenarios. We propose Non-Transferable Pruning (NTP), a novel IP protection method that leverages model pruning to control a pretrained DNN’s transferability to unauthorized data domains. Selective pruning can deliberately diminish a model’s suitability on unauthorized domains, even with full fine-tuning. Specifically, our framework employs the alternating direction method of multipliers (ADMM) for optimizing both the model sparsity and an innovative non-transferable learning loss, augmented with Fisher space discriminative regularization, to constrain the model’s generalizability to the target dataset. We also propose a novel effective metric to measure the model non-transferability: Area Under the Sample-wise Learning Curve (SLC-AUC). This metric facilitates consideration of full fine-tuning across various sample sizes. Experimental results demonstrate that NTP significantly surpasses the state-of-the-art non-transferable learning methods, with an average SLC-AUC at -0.54 across diverse pairs of source and target domains, indicating that models trained with NTP do not suit for transfer learning to unauthorized target domains. The efficacy of NTP is validated in both supervised and self-supervised learning contexts, confirming its applicability in real-world scenarios.

[LG-38] me Can Invalidate Algorithmic Recourse

链接: https://arxiv.org/abs/2410.08007
作者: Giovanni De Toni,Stefano Teso,Bruno Lepri,Andrea Passerini
关键词-EN: machine learning predictors, overturn unfavourable decisions, unfavourable decisions made, aims to provide, learning predictors
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Algorithmic Recourse (AR) aims to provide users with actionable steps to overturn unfavourable decisions made by machine learning predictors. However, these actions often take time to implement (e.g., getting a degree can take years), and their effects may vary as the world evolves. Thus, it is natural to ask for recourse that remains valid in a dynamic environment. In this paper, we study the robustness of algorithmic recourse over time by casting the problem through the lens of causality. We demonstrate theoretically and empirically that (even robust) causal AR methods can fail over time except in the - unlikely - case that the world is stationary. Even more critically, unless the world is fully deterministic, counterfactual AR cannot be solved optimally. To account for this, we propose a simple yet effective algorithm for temporal AR that explicitly accounts for time. Our simulations on synthetic and realistic datasets show how considering time produces more resilient solutions to potential trends in the data distribution.

[LG-39] More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing

链接: https://arxiv.org/abs/2410.08003
作者: Sagi Shaier,Francisco Pereira,Katharina von der Wense,Lawrence E Hunter,Matt Jones
关键词-EN: biological neural systems, energy usage, evolution of biological, systems has led, enables efficiency
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The evolution of biological neural systems has led to both modularity and sparse coding, which enables efficiency in energy usage, and robustness across the diversity of tasks in the lifespan. In contrast, standard neural networks rely on dense, non-specialized architectures, where all model parameters are simultaneously updated to learn multiple tasks, leading to representation interference. Current sparse neural network approaches aim to alleviate this issue, but are often hindered by limitations such as 1) trainable gating functions that cause representation collapse; 2) non-overlapping experts that result in redundant computation and slow learning; and 3) reliance on explicit input or task IDs that impose significant constraints on flexibility and scalability. In this paper we propose Conditionally Overlapping Mixture of ExperTs (COMET), a general deep learning method that addresses these challenges by inducing a modular, sparse architecture with an exponential number of overlapping experts. COMET replaces the trainable gating function used in Sparse Mixture of Experts with a fixed, biologically inspired random projection applied to individual input representations. This design causes the degree of expert overlap to depend on input similarity, so that similar inputs tend to share more parameters. This facilitates positive knowledge transfer, resulting in faster learning and improved generalization. We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression, using several popular deep learning architectures.

[LG-40] AHA: Human-Assisted Out-of-Distribution Generalization and Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.08000
作者: Haoyue Bai,Jifan Zhang,Robert Nowak
关键词-EN: Modern machine learning, Modern machine, encounter distribution shifts, OOD, OOD generalization
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Modern machine learning models deployed often encounter distribution shifts in real-world applications, manifesting as covariate or semantic out-of-distribution (OOD) shifts. These shifts give rise to challenges in OOD generalization and OOD detection. This paper introduces a novel, integrated approach AHA (Adaptive Human-Assisted OOD learning) to simultaneously address both OOD generalization and detection through a human-assisted framework by labeling data in the wild. Our approach strategically labels examples within a novel maximum disambiguation region, where the number of semantic and covariate OOD data roughly equalizes. By labeling within this region, we can maximally disambiguate the two types of OOD data, thereby maximizing the utility of the fixed labeling budget. Our algorithm first utilizes a noisy binary search algorithm that identifies the maximal disambiguation region with high probability. The algorithm then continues with annotating inside the identified labeling region, reaping the full benefit of human feedback. Extensive experiments validate the efficacy of our framework. We observed that with only a few hundred human annotations, our method significantly outperforms existing state-of-the-art methods that do not involve human assistance, in both OOD generalization and OOD detection. Code is publicly available at \urlthis https URL.

[LG-41] Neuroplastic Expansion in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2410.07994
作者: Jiashun Liu,Johan Obando-Ceron,Aaron Courville,Ling Pan
关键词-EN: significantly impedes learning, biological brains, significantly impedes, non-stationary nature, solidification of neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The loss of plasticity in learning agents, analogous to the solidification of neural pathways in biological brains, significantly impedes learning and adaptation in reinforcement learning due to its non-stationary nature. To address this fundamental challenge, we propose a novel approach, Neuroplastic Expansion (NE), inspired by cortical expansion in cognitive science. NE maintains learnability and adaptability throughout the entire training process by dynamically growing the network from a smaller initial size to its full dimension. Our method is designed with three key components: (1) elastic neuron generation based on potential gradients, (2) dormant neuron pruning to optimize network expressivity, and (3) neuron consolidation via experience review to strike a balance in the plasticity-stability dilemma. Extensive experiments demonstrate that NE effectively mitigates plasticity loss and outperforms state-of-the-art methods across various tasks in MuJoCo and DeepMind Control Suite environments. NE enables more adaptive learning in complex, dynamic environments, which represents a crucial step towards transitioning deep reinforcement learning from static, one-time training paradigms to more flexible, continually adapting models.

[LG-42] Machine Learning-based feasibility estimation of digital blocks in BCD technology

链接: https://arxiv.org/abs/2410.07989
作者: Gabriele Faraone,Francesco Daghero,Eugenio Serianni,Dario Licastro,Nicola Di Carolo,Michelangelo Grosso,Giovanna Antonella Franchino,Daniele Jahier Pagliari
关键词-EN: Mixed Signal, Integrated Circuit, process predominantly carried, time-consuming process predominantly, process predominantly
类目: Machine Learning (cs.LG)
*备注: Author’s version

点击查看摘要

Abstract:Analog-on-Top Mixed Signal (AMS) Integrated Circuit (IC) design is a time-consuming process predominantly carried out by hand. Within this flow, usually, some area is reserved by the top-level integrator for the placement of digital blocks. Specific features of the area, such as size and shape, have a relevant impact on the possibility of implementing the digital logic with the required functionality. We present a Machine Learning (ML)-based evaluation methodology for predicting the feasibility of digital implementation using a set of high-level features. This approach aims to avoid time-consuming Place-and-Route trials, enabling rapid feedback between Digital and Analog Back-End designers during top-level placement.

[LG-43] MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.07981
作者: Andrei Manolache,Dragos Tantaru,Mathias Niepert
关键词-EN: molecular representation learning, multimodal molecular representation, simple transformer-based baseline, SMILES strings, molecular representation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Machine Learning for Structural Biology Workshop, NeurIPS 2024

点击查看摘要

Abstract:In this work, we propose a simple transformer-based baseline for multimodal molecular representation learning, integrating three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules. A key aspect of our approach is the aggregation of 3D conformers, allowing the model to account for the fact that molecules can adopt multiple conformations-an important factor for accurate molecular representation. The tokens for each modality are extracted using modality-specific encoders: a transformer for SMILES strings, a message-passing neural network for 2D graphs, and an equivariant neural network for 3D conformers. The flexibility and modularity of this framework enable easy adaptation and replacement of these encoders, making the model highly versatile for different molecular tasks. The extracted tokens are then combined into a unified multimodal sequence, which is processed by a downstream transformer for prediction tasks. To efficiently scale our model for large multimodal datasets, we utilize Flash Attention 2 and bfloat16 precision. Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets, demonstrating its effectiveness as a strong baseline for multimodal molecular representation learning.

[LG-44] Doobs Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling NEURIPS2024

链接: https://arxiv.org/abs/2410.07974
作者: Yuanqi Du,Michael Plainer,Rob Brekelmans,Chenru Duan,Frank Noé,Carla P. Gomes,Alan Apsuru-Guzik,Kirill Neklyudov
关键词-EN: poses significant computational, significant computational challenges, computational challenges due, fundamental problem arising, exponentially large space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
*备注: Accepted as Spotlight at Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob’s h-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob’s h -transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.

[LG-45] Learning Equivariant Non-Local Electron Density Functionals

链接: https://arxiv.org/abs/2410.07972
作者: Nicholas Gao,Eike Eberhard,Stephan Günnemann
关键词-EN: functional theory hinges, Graph Exchange Correlation, theory hinges, Equivariant Graph Exchange, density functional theory
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:The accuracy of density functional theory hinges on the approximation of non-local contributions to the exchange-correlation (XC) functional. To date, machine-learned and human-designed approximations suffer from insufficient accuracy, limited scalability, or dependence on costly reference data. To address these issues, we introduce Equivariant Graph Exchange Correlation (EG-XC), a novel non-local XC functional based on equivariant graph neural networks. EG-XC combines semi-local functionals with a non-local feature density parametrized by an equivariant nuclei-centered point cloud representation of the electron density to capture long-range interactions. By differentiating through a self-consistent field solver, we train EG-XC requiring only energy targets. In our empirical evaluation, we find EG-XC to accurately reconstruct `gold-standard’ CCSD(T) energies on MD17. On out-of-distribution conformations of 3BPA, EG-XC reduces the relative MAE by 35% to 50%. Remarkably, EG-XC excels in data efficiency and molecular size extrapolation on QM9, matching force fields trained on 5 times more and larger molecules. On identical training sets, EG-XC yields on average 51% lower MAEs.

[LG-46] Neural Reasoning Networks: Efficient Interpretable Neural Networks With Automatic Textual Explanations

链接: https://arxiv.org/abs/2410.07966
作者: Stephen Carrow,Kyle Harper Erwin,Olga Vilenskaia,Parikshit Ram,Tim Klinger,Naweed Aghmad Khan,Ndivhuwo Makondo,Alexander Gray
关键词-EN: Neural Reasoning Networks, Recent advances, ensure fairness, legal compliance, advances in machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning have led to a surge in adoption of neural networks for various tasks, but lack of interpretability remains an issue for many others in which an understanding of the features influencing the prediction is necessary to ensure fairness, safety, and legal compliance. In this paper we consider one class of such tasks, tabular dataset classification, and propose a novel neuro-symbolic architecture, Neural Reasoning Networks (NRN), that is scalable and generates logically sound textual explanations for its predictions. NRNs are connected layers of logical neurons which implement a form of real valued logic. A training algorithm (R-NRN) learns the weights of the network as usual using gradient descent optimization with backprop, but also learns the network structure itself using a bandit-based optimization. Both are implemented in an extension to PyTorch (this https URL) that takes full advantage of GPU scaling and batched training. Evaluation on a diverse set of 22 open-source datasets for tabular classification demonstrates performance (measured by ROC AUC) which improves over multi-layer perceptron (MLP) and is statistically similar to other state-of-the-art approaches such as Random Forest, XGBoost and Gradient Boosted Trees, while offering 43% faster training and a more than 2 orders of magnitude reduction in the number of parameters required, on average. Furthermore, R-NRN explanations are shorter than the compared approaches while producing more accurate feature importance scores.

[LG-47] COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act

链接: https://arxiv.org/abs/2410.07959
作者: Philipp Guldimann,Alexander Spiridonov,Robin Staab,Nikola Jovanović,Mark Vero,Velko Vechev,Anna Gueorguieva,Mislav Balunović,Nikola Konstantinov,Pavol Bielik,Petar Tsankov,Martin Vechev
关键词-EN: Artificial Intelligence Act, assess models’ compliance, Artificial Intelligence, lacks clear technical, clear technical interpretation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The EU’s Artificial Intelligence Act (AI Act) is a significant step towards responsible AI development, but lacks clear technical interpretation, making it difficult to assess models’ compliance. This work presents COMPL-AI, a comprehensive framework consisting of (i) the first technical interpretation of the EU AI Act, translating its broad regulatory requirements into measurable technical requirements, with the focus on large language models (LLMs), and (ii) an open-source Act-centered benchmarking suite, based on thorough surveying and implementation of state-of-the-art LLM benchmarks. By evaluating 12 prominent LLMs in the context of COMPL-AI, we reveal shortcomings in existing models and benchmarks, particularly in areas like robustness, safety, diversity, and fairness. This work highlights the need for a shift in focus towards these aspects, encouraging balanced development of LLMs and more comprehensive regulation-aligned benchmarks. Simultaneously, COMPL-AI for the first time demonstrates the possibilities and difficulties of bringing the Act’s obligations to a more concrete, technical level. As such, our work can serve as a useful first step towards having actionable recommendations for model providers, and contributes to ongoing efforts of the EU to enable application of the Act, such as the drafting of the GPAI Code of Practice.

[LG-48] Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions

链接: https://arxiv.org/abs/2410.07951
作者: Kuleen Sasse,Shinjitha Vadlakonda,Richard E. Kennedy,John D. Osborne
关键词-EN: Knowledge Graphs, clinical named entity, named entity recognition, Disease Entity Recognition, entity recognition
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 21 pages, 3 figures, 7 tables

点击查看摘要

Abstract:Background: Machine learning methods for clinical named entity recognition and entity normalization systems can utilize both labeled corpora and Knowledge Graphs (KGs) for learning. However, infrequently occurring concepts may have few mentions in training corpora and lack detailed descriptions or synonyms, even in large KGs. For Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), this can result in fewer high quality training examples relative to the number of known diseases. Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks. Methods: We fine-tuned a LLaMa-2 13B Chat LLM to generate a synthetic corpus containing normalized mentions of concepts from the Unified Medical Language System (UMLS) Disease Semantic Group. We measured overall and Out of Distribution (OOD) performance for DER and DEN, with and without synthetic data augmentation. We evaluated performance on 3 different disease corpora using 4 different data augmentation strategies, assessed using BioBERT for DER and SapBERT and KrissBERT for DEN. Results: Our synthetic data yielded a substantial improvement for DEN, in all 3 training corpora the top 1 accuracy of both SapBERT and KrissBERT improved by 3-9 points in overall performance and by 20-55 points in OOD data. A small improvement (1-2 points) was also seen for DER in overall performance, but only one dataset showed OOD improvement. Conclusion: LLM generation of normalized disease mentions can improve DEN relative to normalization approaches that do not utilize LLMs to augment data with synthetic mentions. Ablation studies indicate that performance gains for DEN were only partially attributable to improvements in OOD performance. The same approach has only a limited ability to improve DER. We make our software and dataset publicly available. Comments: 21 pages, 3 figures, 7 tables Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) ACMclasses: I.2.7; J.3 Cite as: arXiv:2410.07951 [cs.CL] (or arXiv:2410.07951v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2410.07951 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: John Osborne [view email] [v1] Thu, 10 Oct 2024 14:18:34 UTC (1,574 KB)

[LG-49] Offline Hierarchical Reinforcement Learning via Inverse Optimization

链接: https://arxiv.org/abs/2410.07933
作者: Carolin Schmidt,Daniele Gammelli,James Harrison,Marco Pavone,Filipe Rodrigues
关键词-EN: requiring long-horizon planning, enable strong performance, high-dimensional action spaces, policies enable strong, long-horizon planning
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.

[LG-50] Efficient Reinforcement Learning with Large Language Model Priors

链接: https://arxiv.org/abs/2410.07927
作者: Xue Yan,Yan Song,Xidong Feng,Mengyue Yang,Haifeng Zhang,Haitham Bou Ammar,Jun Wang
关键词-EN: made notable advances, sequential decision-making, specific cases, heuristic search, search have made
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and face challenges in generalizing across diverse environments due to their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, due to their capacity to maintain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques, e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios.

[LG-51] Meta-Learning Integration in Hierarchical Reinforcement Learning for Advanced Task Complexity

链接: https://arxiv.org/abs/2410.07921
作者: Arash Khajooeinejad,Masoumeh Chapariniya
关键词-EN: Hierarchical Reinforcement Learning, Hierarchical Reinforcement, effectively tackles complex, Reinforcement Learning, effectively tackles
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hierarchical Reinforcement Learning (HRL) effectively tackles complex tasks by decomposing them into structured policies. However, HRL agents often face challenges with efficient exploration and rapid adaptation. To address this, we integrate meta-learning into HRL to enhance the agent’s ability to learn and adapt hierarchical policies swiftly. Our approach employs meta-learning for rapid task adaptation based on prior experience, while intrinsic motivation mechanisms encourage efficient exploration by rewarding novel state visits. Specifically, our agent uses a high-level policy to select among multiple low-level policies operating within custom grid environments. We utilize gradient-based meta-learning with differentiable inner-loop updates, enabling optimization across a curriculum of increasingly difficult tasks. Experimental results demonstrate that our meta-learned hierarchical agent significantly outperforms traditional HRL agents without meta-learning and intrinsic motivation. The agent exhibits accelerated learning, higher cumulative rewards, and improved success rates in complex grid environments. These findings suggest that integrating meta-learning with HRL, alongside curriculum learning and intrinsic motivation, substantially enhances the agent’s capability to handle complex tasks.

[LG-52] Robustness Auditing for Linear Regression: To Singularity and Beyond

链接: https://arxiv.org/abs/2410.07916
作者: Ittai Rubinstein,Samuel B. Hopkins
关键词-EN: highly influential econometrics, influential econometrics studies, recently been discovered, highly influential, overturned by removing
类目: Machine Learning (cs.LG)
*备注: 65 pages, 2 figures

点击查看摘要

Abstract:It has recently been discovered that the conclusions of many highly influential econometrics studies can be overturned by removing a very small fraction of their samples (often less than 0.5% ). These conclusions are typically based on the results of one or more Ordinary Least Squares (OLS) regressions, raising the question: given a dataset, can we certify the robustness of an OLS fit on this dataset to the removal of a given number of samples? Brute-force techniques quickly break down even on small datasets. Existing approaches which go beyond brute force either can only find candidate small subsets to remove (but cannot certify their non-existence) [BGM20, KZC21], are computationally intractable beyond low dimensional settings [MR22], or require very strong assumptions on the data distribution and too many samples to give reasonable bounds in practice [BP21, FH23]. We present an efficient algorithm for certifying the robustness of linear regressions to removals of samples. We implement our algorithm and run it on several landmark econometrics datasets with hundreds of dimensions and tens of thousands of samples, giving the first non-trivial certificates of robustness to sample removal for datasets of dimension 4 or greater. We prove that under distributional assumptions on a dataset, the bounds produced by our algorithm are tight up to a 1 + o(1) multiplicative factor. Comments: 65 pages, 2 figures Subjects: Machine Learning (cs.LG) MSC classes: 62F35, 68W99, 62J05 ACMclasses: G.3; I.2.6; F.2.2 Cite as: arXiv:2410.07916 [cs.LG] (or arXiv:2410.07916v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.07916 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-53] Stress Detection Using PPG Signal and Combined Deep CNN-MLP Network

链接: https://arxiv.org/abs/2410.07911
作者: Yasin Hasanpoor,Koorosh Motaman,Bahram Tarvirdizadeh,Khalil Alipour,Mohammad Ghamari
关键词-EN: people lives, fact in people, Stress, PPG signals, PPG
类目: Machine Learning (cs.LG)
*备注: 5 figures , 2 tables

点击查看摘要

Abstract:Stress has become a fact in people’s lives. It has a significant effect on the function of body systems and many key systems of the body including respiratory, cardiovascular, and even reproductive systems are impacted by stress. It can be very helpful to detect stress episodes in early steps of its appearance to avoid damages it can cause to body systems. Using physiological signals can be useful for stress detection as they reflect very important information about the human body. PPG signal due to its advantages is one of the mostly used signal in this field. In this research work, we take advantage of PPG signals to detect stress events. The PPG signals used in this work are collected from one of the newest publicly available datasets named as UBFC-Phys and a model is developed by using CNN-MLP deep learning algorithm. The results obtained from the proposed model indicate that stress can be detected with an accuracy of approximately 82 percent.

[LG-54] CL3: A Collaborative Learning Framework for the Medical Data Ensuring Data Privacy in the Hyperconnected Environment

链接: https://arxiv.org/abs/2410.07900
作者: Mohamamd Zavid Parvez,Rafiqul Islam,Md Zahidul Islam
关键词-EN: transmitting sensitive patient, intercept sensitive information, hyperconnected environment, transmitting sensitive, intercept sensitive
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a hyperconnected environment, medical institutions are particularly concerned with data privacy when sharing and transmitting sensitive patient information due to the risk of data breaches, where malicious actors could intercept sensitive information. A collaborative learning framework, including transfer, federated, and incremental learning, can generate efficient, secure, and scalable models while requiring less computation, maintaining patient data privacy, and ensuring an up-to-date model. This study aims to address the detection of COVID-19 using chest X-ray images through a proposed collaborative learning framework called CL3. Initially, transfer learning is employed, leveraging knowledge from a pre-trained model as the starting global model. Local models from different medical institutes are then integrated, and a new global model is constructed to adapt to any data drift observed in the local models. Additionally, incremental learning is considered, allowing continuous adaptation to new medical data without forgetting previously learned information. Experimental results demonstrate that the CL3 framework achieved a global accuracy of 89.99% when using Xception with a batch size of 16 after being trained for six federated communication rounds.

[LG-55] A Comprehensive Survey on Joint Resource Allocation Strategies in Federated Edge Learning

链接: https://arxiv.org/abs/2410.07881
作者: Jingbo Zhang,Qiong Wu,Pingyi Fan,Qiang Fan
关键词-EN: emerging distributed Machine, distributed Machine Learning, enables model training, Federated Edge Learning, distributed Machine
类目: Machine Learning (cs.LG)
*备注: This paper has been submitted to CMC-Computers Materials Continua

点击查看摘要

Abstract:Federated Edge Learning (FEL), an emerging distributed Machine Learning (ML) paradigm, enables model training in a distributed environment while ensuring user privacy by using physical separation for each user data. However, with the development of complex application scenarios such as the Internet of Things (IoT) and Smart Earth, the conventional resource allocation schemes can no longer effectively support these growing computational and communication demands. Therefore, joint resource optimization may be the key solution to the scaling problem. This paper simultaneously addresses the multifaceted challenges of computation and communication, with the growing multiple resource demands. We systematically review the joint allocation strategies for different resources (computation, data, communication, and network topology) in FEL, and summarize the advantages in improving system efficiency, reducing latency, enhancing resource utilization and enhancing robustness. In addition, we present the potential ability of joint optimization to enhance privacy preservation by reducing communication requirements, indirectly. This work not only provides theoretical support for resource management in federated learning (FL) systems, but also provides ideas for potential optimal deployment in multiple real-world scenarios. By thoroughly discussing the current challenges and future research directions, it also provides some important insights into multi-resource optimization in complex application environments.

[LG-56] Unsupervised Data Validation Methods for Efficient Model Training

链接: https://arxiv.org/abs/2410.07880
作者: Yurii Paniv
关键词-EN: low-resource languages, potential solutions, solutions for improving, systems for low-resource, machine learning systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages. State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and vision-language models (VLM) rely heavily on large datasets, which are often unavailable for low-resource languages. This research explores key areas such as defining “quality data,” developing methods for generating appropriate data and enhancing accessibility to model training. A comprehensive review of current methodologies, including data augmentation, multilingual transfer learning, synthetic data generation, and data selection techniques, highlights both advancements and limitations. Several open research questions are identified, providing a framework for future studies aimed at optimizing data utilization, reducing the required data quantity, and maintaining high-quality model performance. By addressing these challenges, the paper aims to make advanced machine learning models more accessible for low-resource languages, enhancing their utility and impact across various sectors.

[LG-57] Benchmarking Agent ic Workflow Generation

链接: https://arxiv.org/abs/2410.07869
作者: Shuofei Qiao,Runnan Fang,Zhisong Qiu,Xiaobin Wang,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen
关键词-EN: Large Language Models, Large Language, driven significant advancements, decomposing complex problems, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent’s workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset will be available at this https URL.

[LG-58] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

链接: https://arxiv.org/abs/2410.07864
作者: Songming Liu,Lingxuan Wu,Bangguo Li,Hengkai Tan,Huayu Chen,Zhengyi Wang,Ke Xu,Hang Su,Jun Zhu
关键词-EN: extremely challenging due, developing foundation models, multi-modal action distributions, Robotics Diffusion Transformer, diffusion foundation model
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, conference

点击查看摘要

Abstract:Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to this https URL for the code and videos.

[LG-59] From Logits to Hierarchies: Hierarchical Clustering made Simple

链接: https://arxiv.org/abs/2410.07858
作者: Emanuele Palumbo,Moritz Vandenhirtz,Alain Ryser,Imant Daunhawer,Julia E. Vogt
关键词-EN: supervised machine learning, making the modeling, machine learning, intrinsically hierarchical, critical objective
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The structure of many real-world datasets is intrinsically hierarchical, making the modeling of such hierarchies a critical objective in both unsupervised and supervised machine learning. Recently, novel approaches for hierarchical clustering with deep architectures have been proposed. In this work, we take a critical perspective on this line of research and demonstrate that many approaches exhibit major limitations when applied to realistic datasets, partly due to their high computational complexity. In particular, we show that a lightweight procedure implemented on top of pre-trained non-hierarchical clustering models outperforms models designed specifically for hierarchical clustering. Our proposed approach is computationally efficient and applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our findings, we illustrate how our method can also be applied in a supervised setup, recovering meaningful hierarchies from a pre-trained ImageNet classifier.

[LG-60] Scalable Representation Learning for Multimodal Tabular Transactions

链接: https://arxiv.org/abs/2410.07851
作者: Natraj Raman,Sumitra Ganesh,Manuela Veloso
关键词-EN: understand unstructured text, Large language models, primarily designed, designed to understand, understand unstructured
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are primarily designed to understand unstructured text. When directly applied to structured formats such as tabular data, they may struggle to discern inherent relationships and overlook critical patterns. While tabular representation learning methods can address some of these limitations, existing efforts still face challenges with sparse high-cardinality fields, precise numerical reasoning, and column-heavy tables. Furthermore, leveraging these learned representations for downstream tasks through a language based interface is not apparent. In this paper, we present an innovative and scalable solution to these challenges. Concretely, our approach introduces a multi-tier partitioning mechanism that utilizes power-law dynamics to handle large vocabularies, an adaptive quantization mechanism to impose priors on numerical continuity, and a distinct treatment of core-columns and meta-information columns. To facilitate instruction tuning on LLMs, we propose a parameter efficient decoder that interleaves transaction and text modalities using a series of adapter layers, thereby exploiting rich cross-task knowledge. We validate the efficacy of our solution on a large-scale dataset of synthetic payments transactions.

[LG-61] Protect Before Generate: Error Correcting Codes within Discrete Deep Generative Models

链接: https://arxiv.org/abs/2410.07840
作者: María Martínez-García,Grace Villacrés,David Mitchell,Pablo M. Olmos
关键词-EN: learning low-dimensional discrete, deep probabilistic models, Error Correcting Codes, leveraging Error Correcting, latent representations remains
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite significant advancements in deep probabilistic models, learning low-dimensional discrete latent representations remains a challenging task. In this paper, we introduce a novel method that enhances variational inference in discrete latent variable models by leveraging Error Correcting Codes (ECCs) to introduce redundancy in the latent representations. This redundancy is then exploited by the variational posterior to yield more accurate estimates, thereby narrowing the variational gap. Inspired by ECCs commonly used in digital communications and data storage, we demonstrate proof-of-concept using a Discrete Variational Autoencoder (DVAE) with binary latent variables and block repetition codes. We further extend this idea to a hierarchical structure based on polar codes, where certain latent bits are more robustly protected. Our method improves generation quality, data reconstruction, and uncertainty calibration compared to the uncoded DVAE, even when trained with tighter bounds such as the Importance Weighted Autoencoder (IWAE) objective. In particular, we demonstrate superior performance on MNIST, FMNIST, CIFAR10, and Tiny ImageNet datasets. The general approach of integrating ECCs into variational inference is compatible with existing techniques to boost variational inference, such as importance sampling or Hamiltonian Monte Carlo. We also outline the key properties ECCs must have to effectively enhance discrete variational inference.

[LG-62] MinorityPrompt: Text to Minority Image Generation via Prompt Optimization

链接: https://arxiv.org/abs/2410.07838
作者: Soobin Um,Jong Chul Ye
关键词-EN: latent diffusion models, diffusion models, latent diffusion, minority samples, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures

点击查看摘要

Abstract:We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of text-conditional data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for producing high-quality generations. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that can encourage the emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes the generation of minority features by incorporating a carefully-crafted likelihood objective. Our comprehensive experiments, conducted across various types of T2I models, demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.

[LG-63] Masked Generative Priors Improve World Models Sequence Modelling Capabilities

链接: https://arxiv.org/abs/2410.07836
作者: Cristian Meo,Mircea Lica,Zarif Ikram,Akihiro Nakano,Vedant Shah,Aniket Rajiv Didolkar,Dianbo Liu,Anirudh Goyal,Justin Dauwels
关键词-EN: Deep Reinforcement Learning, Transformer-based World Models, creating artificial agents, world models, Deep Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.

[LG-64] A note on the VC dimension of 1-dimensional GNNs

链接: https://arxiv.org/abs/2410.07829
作者: Noah Daniëls,Floris Geerts
关键词-EN: Graph Neural Networks, Neural Networks, analyzing graph-structured data, complex relational information, capture complex relational
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become an essential tool for analyzing graph-structured data, leveraging their ability to capture complex relational information. While the expressivity of GNNs, particularly their equivalence to the Weisfeiler-Leman (1-WL) isomorphism test, has been well-documented, understanding their generalization capabilities remains critical. This paper focuses on the generalization of GNNs by investigating their Vapnik-Chervonenkis (VC) dimension. We extend previous results to demonstrate that 1-dimensional GNNs with a single parameter have an infinite VC dimension for unbounded graphs. Furthermore, we show that this also holds for GNNs using analytic non-polynomial activation functions, including the 1-dimensional GNNs that were recently shown to be as expressive as the 1-WL test. These results suggest inherent limitations in the generalization ability of even the most simple GNNs, when viewed from the VC dimension perspective.

[LG-65] Simple ReFlow: Improved Techniques for Fast Flow Models

链接: https://arxiv.org/abs/2410.07815
作者: Beomsu Kim,Yu-Guan Hsieh,Michal Klein,Marco Cuturi,Jong Chul Ye,Bahjat Kawar,James Thornton
关键词-EN: remarkable generative performance, Diffusion and flow-matching, flow-matching models achieve, models achieve remarkable, achieve remarkable generative
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many sampling steps, this slows inference and limits applicability to time-critical tasks. The ReFlow procedure can accelerate sampling by straightening generation trajectories. However, ReFlow is an iterative procedure, typically requiring training on simulated data, and results in reduced sample quality. To mitigate sample deterioration, we examine the design space of ReFlow and highlight potential pitfalls in prior heuristic practices. We then propose seven improvements for training dynamics, learning and inference, which are verified with thorough ablation studies on CIFAR10 32 \times 32 , AFHQv2 64 \times 64 , and FFHQ 64 \times 64 . Combining all our techniques, we achieve state-of-the-art FID scores (without / with guidance, resp.) for fast generation via neural ODEs: 2.23 / 1.98 on CIFAR10, 2.30 / 1.91 on AFHQv2, 2.84 / 2.67 on FFHQ, and 3.49 / 1.74 on ImageNet-64, all with merely 9 neural function evaluations.

[LG-66] mporal-Difference Variational Continual Learning

链接: https://arxiv.org/abs/2410.07812
作者: Luckeciano C. Melo,Alessandro Abate,Yarin Gal
关键词-EN: Machine Learning models, capability of Machine, Machine Learning, crucial capability, real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks. This adaptability allows them to respond to potentially inevitable shifts in the data-generating distribution over time. However, in Continual Learning (CL) settings, models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. Variational Continual Learning methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution and enforces it to stay close to the latest posterior estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. We evaluate the proposed objectives on challenging versions of popular CL benchmarks, demonstrating that they outperform standard Variational CL methods and non-variational baselines, effectively alleviating Catastrophic Forgetting.

[LG-67] Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

链接: https://arxiv.org/abs/2410.07809
作者: Gürkan Soykan,Gözde Gül Şahin
关键词-EN: limited generalization capabilities, languages, Instruction tuning, perform unevenly, due to limited
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 31 pages, 6 figures

点击查看摘要

Abstract:Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the “curse of multilinguality” – the performance drop per language after adding many languages. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random. We suggest a new and simple way of enhancing multilingual models by selecting diverse languages based on linguistic features that could help develop better multilingual systems and guide dataset creation efforts. All resources, including the code for language selection and multilingual instruction tuning, are made available in our official repository at this https URL enabling reproducibility and further research in this area.

[LG-68] Deep and Probabilistic Solar Irradiance Forecast at the Arctic Circle

链接: https://arxiv.org/abs/2410.07806
作者: Niklas Erdmann,Lars Ø. Bentsen,Roy Stenbro,Heine N. Riise,Narada Warakagoda,Paal Engelstad
关键词-EN: changing weather conditions, Johnson, weather conditions, dynamic and unreliable, unreliable due
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures. To be published in the 2024 IEEE Conference Photovoltaic Specialists (PVSC) proceedings

点击查看摘要

Abstract:Solar irradiance forecasts can be dynamic and unreliable due to changing weather conditions. Near the Arctic circle, this also translates into a distinct set of further challenges. This work is forecasting solar irradiance with Norwegian data using variations of Long-Short-Term Memory units (LSTMs). In order to gain more trustworthiness of results, the probabilistic approaches Quantile Regression (QR) and Maximum Likelihood (MLE) are optimized on top of the LSTMs, providing measures of uncertainty for the results. MLE is further extended by using a Johnson’s SU distribution, a Johnson’s SB distribution, and a Weibull distribution in addition to a normal Gaussian to model parameters. Contrary to a Gaussian, Weibull, Johnson’s SU and Johnson’s SB can return skewed distributions, enabling it to fit the non-normal solar irradiance distribution more optimally. The LSTMs are compared against each other, a simple Multi-layer Perceptron (MLP), and a smart-persistence estimator. The proposed LSTMs are found to be more accurate than smart persistence and the MLP for a multi-horizon, day-ahead (36 hours) forecast. The deterministic LSTM showed better root mean squared error (RMSE), but worse mean absolute error (MAE) than a MLE with Johnson’s SB distribution. Probabilistic uncertainty estimation is shown to fit relatively well across the distribution of observed irradiance. While QR shows better uncertainty estimation calibration, MLE with Johnson’s SB, Johnson’s SU, or Gaussian show better performance in the other metrics employed. Optimizing and comparing the models against each other reveals a seemingly inherent trade-off between point-prediction and uncertainty estimation calibration.

[LG-69] MGMD-GAN: Generalization Improvement of Generative Adversarial Networks with Multiple Generator Multiple Discriminator Framework Against Membership Inference Attacks

链接: https://arxiv.org/abs/2410.07803
作者: Nirob Arefin
关键词-EN: Generative Adversarial Networks, Adversarial Networks, Generative Adversarial, Membership Inference Attacks, Inference Attacks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative Adversarial Networks (GAN) are among the widely used Generative models in various applications. However, the original GAN architecture may memorize the distribution of the training data and, therefore, poses a threat to Membership Inference Attacks. In this work, we propose a new GAN framework that consists of Multiple Generators and Multiple Discriminators (MGMD-GAN). Disjoint partitions of the training data are used to train this model and it learns the mixture distribution of all the training data partitions. In this way, our proposed model reduces the generalization gap which makes our MGMD-GAN less vulnerable to Membership Inference Attacks. We provide an experimental analysis of our model and also a comparison with other GAN frameworks.

[LG-70] Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers

链接: https://arxiv.org/abs/2410.07799
作者: Alireza Naderi,Thiziri Nait Saada,Jared Tanner
关键词-EN: neural network architecture, textit, core component, rank collapse, Attention
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. However, \softmaxx-based attention puts transformers’ trainability at risk. Even \textitat initialisation, the propagation of signals and gradients through the random network can be pathological, resulting in known issues such as (i) vanishing/exploding gradients and (ii) \textitrank collapse, i.e. when all tokens converge to a single representation \textitwith depth. This paper examines signal propagation in \textitattention-only transformers from a random matrix perspective, illuminating the origin of such issues, as well as unveiling a new phenomenon – (iii) rank collapse \textitin width. Modelling \softmaxx-based attention at initialisation with Random Markov matrices, our theoretical analysis reveals that a \textitspectral gap between the two largest singular values of the attention matrix causes (iii), which, in turn, exacerbates (i) and (ii). Building on this insight, we propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap. Moreover, we validate our findings and discuss the training benefits of the proposed fix through experiments that also motivate a revision of some of the default parameter scaling. Our attention model accurately describes the standard key-query attention in a single-layer transformer, making this work a significant first step towards a better understanding of the initialisation dynamics in the multi-layer case.

[LG-71] owards Quantifying The Privacy Of Redacted Text ECIR’23

链接: https://arxiv.org/abs/2410.07772
作者: Vaibhav Gusain,Douglas Leith
关键词-EN: redacted text, approach for evaluating, paper we propose, redacted, text
类目: Machine Learning (cs.LG)
*备注: Accepted in ECIR’23

点击查看摘要

Abstract:In this paper we propose use of a k-anonymity-like approach for evaluating the privacy of redacted text. Given a piece of redacted text we use a state of the art transformer-based deep learning network to reconstruct the original text. This generates multiple full texts that are consistent with the redacted text, i.e. which are grammatical, have the same non-redacted words etc, and represents each of these using an embedding vector that captures sentence similarity. In this way we can estimate the number, diversity and quality of full text consistent with the redacted text and so evaluate privacy.

[LG-72] Dialectical Behavior Therapy Approach to LLM Prompting

链接: https://arxiv.org/abs/2410.07768
作者: Oxana Vitman,Nika Amaglobeli,Paul Plachinda
关键词-EN: Large language models, Large language, language models demonstrated, Dialectical Behavioral Therapy, CoT prompting guides
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models demonstrated state-of-the-art results on various reasoning tasks when applying the chain-of-thought (CoT) prompting technique. CoT prompting guides the model into breaking tasks into a few intermediate steps and provides step-by-step demonstrations. However, solving complex reasoning tasks remains a challenge. In this paper, we propose a novel prompting strategy inspired by Dialectical Behavioral Therapy (DBT). DBT, a form of cognitive-behavioral therapy, aims to help individuals cope with stress by developing a system of reasoning. We applied DBT’s basic concepts of shaping dialog to construct prompts and conducted experiments on different datasets and LLMs with various numbers of parameters. Our results show that prompts crafted with DBT techniques significantly improve results on smaller models, achieving a 7% increase in accuracy on the StrategyQA, 4.8% on Aqua dataset using 8b parameters model, and a 16.2% increase on the StrategyQA, 5.3% on GSM8K dataset with 14b parameters model.

[LG-73] Explaining Hypergraph Neural Networks: From Local Explanations to Global Concepts

链接: https://arxiv.org/abs/2410.07764
作者: Shiye Su,Iulia Duta,Lucie Charlotte Magister,Pietro Liò
关键词-EN: message passing paradigm, describing relational data, Hypergraph neural networks, higher-order interactions, class of powerful
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hypergraph neural networks are a class of powerful models that leverage the message passing paradigm to learn over hypergraphs, a generalization of graphs well-suited to describing relational data with higher-order interactions. However, such models are not naturally interpretable, and their explainability has received very limited attention. We introduce SHypX, the first model-agnostic post-hoc explainer for hypergraph neural networks that provides both local and global explanations. At the instance-level, it performs input attribution by discretely sampling explanation subhypergraphs optimized to be faithful and concise. At the model-level, it produces global explanation subhypergraphs using unsupervised concept extraction. Extensive experiments across four real-world and four novel, synthetic hypergraph datasets demonstrate that our method finds high-quality explanations which can target a user-specified balance between faithfulness and concision, improving over baselines by 25 percent points in fidelity on average.

[LG-74] QoS-Nets: Adaptive Approximate Neural Network Inference

链接: https://arxiv.org/abs/2410.07762
作者: Elias Trommer,Bernd Waschneck,Akash Kumar
关键词-EN: neural network applications, neural network, neural network layer, network layer computations, approximate multiplier instances
类目: Machine Learning (cs.LG)
*备注: unpublished, currently under peer review

点击查看摘要

Abstract:In order to vary the arithmetic resource consumption of neural network applications at runtime, this work proposes the flexible reuse of approximate multipliers for neural network layer computations. We introduce a search algorithm that chooses an appropriate subset of approximate multipliers of a user-defined size from a larger search space and enables retraining to maximize task performance. Unlike previous work, our approach can output more than a single, static assignment of approximate multiplier instances to layers. These different operating points allow a system to gradually adapt its Quality of Service (QoS) to changing environmental conditions by increasing or decreasing its accuracy and resource consumption. QoS-Nets achieves this by reassigning the selected approximate multiplier instances to layers at runtime. To combine multiple operating points with the use of retraining, we propose a fine-tuning scheme that shares the majority of parameters between operating points, with only a small amount of additional parameters required per operating point. In our evaluation on MobileNetV2, QoS-Nets is used to select four approximate multiplier instances for three different operating points. These operating points result in power savings for multiplications between 15.3% and 42.8% at a Top-5 accuracy loss between 0.3 and 2.33 percentage points. Through our fine-tuning scheme, all three operating points only increase the model’s parameter count by only 2.75%.

[LG-75] textitJump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

链接: https://arxiv.org/abs/2410.07761
作者: Yong-Hyun Park,Chieh-Hsin Lai,Satoshi Hayakawa,Yuhta Takida,Yuki Mitsufuji
关键词-EN: discrete diffusion models, Diffusion models, Compounding Decoding Error, continuous domains, notable success
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like \tau -leaping accelerate this process, they introduce \textitCompounding Decoding Error (CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we present \textitJump Your Steps (JYS), a novel approach that optimizes the allocation of discrete sampling timesteps by minimizing CDE without extra computational cost. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling.

[LG-76] Synthesizing Multi-Class Surgical Datasets with Anatomy-Aware Diffusion Models

链接: https://arxiv.org/abs/2410.07753
作者: Danush Kumar Venkatesh,Dominik Rivoir,Micha Pfeiffer,Fiona Kolbinger,Stefanie Speidel
关键词-EN: providing intraoperative assistance, automatically recognizing anatomical, computer-assisted surgery, automatically recognizing, intraoperative assistance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In computer-assisted surgery, automatically recognizing anatomical organs is crucial for understanding the surgical scene and providing intraoperative assistance. While machine learning models can identify such structures, their deployment is hindered by the need for labeled, diverse surgical datasets with anatomical annotations. Labeling multiple classes (i.e., organs) in a surgical scene is time-intensive, requiring medical experts. Although synthetically generated images can enhance segmentation performance, maintaining both organ structure and texture during generation is challenging. We introduce a multi-stage approach using diffusion models to generate multi-class surgical datasets with annotations. Our framework improves anatomy awareness by training organ specific models with an inpainting objective guided by binary segmentation masks. The organs are generated with an inference pipeline using pre-trained ControlNet to maintain the organ structure. The synthetic multi-class datasets are constructed through an image composition step, ensuring structural and textural consistency. This versatile approach allows the generation of multi-class datasets from real binary datasets and simulated surgical masks. We thoroughly evaluate the generated datasets on image quality and downstream segmentation, achieving a 15% improvement in segmentation scores when combined with real images. Our codebase this https URL

[LG-77] Learning Low-Level Causal Relations using a Simulated Robotic Arm ICANN

链接: https://arxiv.org/abs/2410.07751
作者: Miroslav Cibula,Matthias Kerzel,Igor Farkaš
关键词-EN: complex actions, humans to predict, plan the execution, actions, causal effects
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 3 tables. Appeared in 2024 International Conference on Artificial Neural Networks (ICANN) proceedings. Published version copyrighted by Springer. This work was funded by the Horizon Europe Twinning project TERAIS, G.A. number 101079338 and in part by the Slovak Grant Agency for Science (VEGA), project 1/0373/23

点击查看摘要

Abstract:Causal learning allows humans to predict the effect of their actions on the known environment and use this knowledge to plan the execution of more complex actions. Such knowledge also captures the behaviour of the environment and can be used for its analysis and the reasoning behind the behaviour. This type of knowledge is also crucial in the design of intelligent robotic systems with common sense. In this paper, we study causal relations by learning the forward and inverse models based on data generated by a simulated robotic arm involved in two sensorimotor tasks. As a next step, we investigate feature attribution methods for the analysis of the forward model, which reveals the low-level causal effects corresponding to individual features of the state vector related to both the arm joints and the environment features. This type of analysis provides solid ground for dimensionality reduction of the state representations, as well as for the aggregation of knowledge towards the explainability of causal effects at higher levels.

[LG-78] Benign Overfitting in Single-Head Attention

链接: https://arxiv.org/abs/2410.07746
作者: Roey Magen,Shuning Shang,Zhiwei Xu,Spencer Frei,Wei Hu,Gal Vardi
关键词-EN: near-optimal test performance, trained neural network, neural network perfectly, network perfectly fits, perfectly fits noisy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.

[LG-79] SLIM: Let LLM Learn More and Forget Less with Soft LoRA and Identity Mixture

链接: https://arxiv.org/abs/2410.07739
作者: Jiayi Han,Liang Du,Hongwei Du,Xiangguo Zhou,Yiwen Wu,Weibo Zheng,Donghong Han
关键词-EN: downstream tasks, challenge to balance, general capabilities, training budget, downstream performance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 11 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Although many efforts have been made, it is still a challenge to balance the training budget, downstream performance, and the general capabilities of the LLMs in many applications. Training the whole model for downstream tasks is expensive, and could easily result in catastrophic forgetting. By introducing parameter-efficient fine-tuning (PEFT), the training cost could be reduced, but it still suffers from forgetting, and limits the learning on the downstream tasks. To efficiently fine-tune the LLMs with less limitation to their downstream performance while mitigating the forgetting of general capabilities, we propose a novel mixture of expert (MoE) framework based on Soft LoRA and Identity Mixture (SLIM), that allows dynamic routing between LoRA adapters and skipping connection, enables the suppression of forgetting. We adopt weight-yielding with sliding clustering for better out-of-domain distinguish to enhance the routing. We also propose to convert the mixture of low-rank adapters to the model merging formulation and introduce fast dynamic merging of LoRA adapters to keep the general capabilities of the base model. Extensive experiments demonstrate that the proposed SLIM is comparable to the state-of-the-art PEFT approaches on the downstream tasks while achieving the leading performance in mitigating catastrophic forgetting.

[LG-80] Enhancing Federated Domain Adaptation with Multi-Domain Prototype-Based Federated Fine-Tuning

链接: https://arxiv.org/abs/2410.07738
作者: Jingyuan Zhang,Yiyang Duan,Shuaicheng Niu,Yang Cao,Wei Yang Bryan Lim
关键词-EN: Federated Domain Adaptation, unique data domains, Federated Domain, transmitting private data, shared category space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated Domain Adaptation (FDA) is a Federated Learning (FL) scenario where models are trained across multiple clients with unique data domains but a shared category space, without transmitting private data. The primary challenge in FDA is data heterogeneity, which causes significant divergences in gradient updates when using conventional averaging-based aggregation methods, reducing the efficacy of the global model. This further undermines both in-domain and out-of-domain performance (within the same federated system but outside the local client). To address this, we propose a novel framework called \textbfMulti-domain \textbfPrototype-based \textbfFederated Fine-\textbfTuning (MPFT). MPFT fine-tunes a pre-trained model using multi-domain prototypes, i.e., pretrained representations enriched with domain-specific information from category-specific local data. This enables supervised learning on the server to derive a globally optimized adapter that is subsequently distributed to local clients, without the intrusion of data privacy. Empirical results show that MPFT significantly improves both in-domain and out-of-domain accuracy over conventional methods, enhancing knowledge preservation and adaptation in FDA. Notably, MPFT achieves convergence within a single communication round, greatly reducing computation and communication costs. To ensure privacy, MPFT applies differential privacy to protect the prototypes. Additionally, we develop a prototype-based feature space hijacking attack to evaluate robustness, confirming that raw data samples remain unrecoverable even after extensive training epochs. The complete implementation of MPFL is available at \urlthis https URL.

[LG-81] Plug-and-Play Performance Estimation for LLM Services without Relying on Labeled Data

链接: https://arxiv.org/abs/2410.07737
作者: Can Wang,Dianbo Sui,Hongliang Sun,Hao Ding,Bolin Zhang,Zhiying Tu
关键词-EN: Large Language Model, Large Language, Language Model, exhibit impressive capability, LLM services
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Model (LLM) services exhibit impressive capability on unlearned tasks leveraging only a few examples by in-context learning (ICL). However, the success of ICL varies depending on the task and context, leading to heterogeneous service quality. Directly estimating the performance of LLM services at each invocation can be laborious, especially requiring abundant labeled data or internal information within the LLM. This paper introduces a novel method to estimate the performance of LLM services across different tasks and contexts, which can be “plug-and-play” utilizing only a few unlabeled samples like ICL. Our findings suggest that the negative log-likelihood and perplexity derived from LLM service invocation can function as effective and significant features. Based on these features, we utilize four distinct meta-models to estimate the performance of LLM services. Our proposed method is compared against unlabeled estimation baselines across multiple LLM services and tasks. And it is experimentally applied to two scenarios, demonstrating its effectiveness in the selection and further optimization of LLM services.

[LG-82] On the Detection of Aircraft Single Engine Taxi using Deep Learning Models

链接: https://arxiv.org/abs/2410.07727
作者: Gabriel Jarry,Philippe Very,Ramon Dalmau,Daniel Delahaye,Arthur Houdant
关键词-EN: faces increasing pressure, Single Engine Taxiing, aviation industry, industry is vital, vital for global
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The aviation industry is vital for global transportation but faces increasing pressure to reduce its environmental footprint, particularly CO2 emissions from ground operations such as taxiing. Single Engine Taxiing (SET) has emerged as a promising technique to enhance fuel efficiency and sustainability. However, evaluating SET’s benefits is hindered by the limited availability of SET-specific data, typically accessible only to aircraft operators. In this paper, we present a novel deep learning approach to detect SET operations using ground trajectory data. Our method involves using proprietary Quick Access Recorder (QAR) data of A320 flights to label ground movements as SET or conventional taxiing during taxi-in operations, while using only trajectory features equivalent to those available in open-source surveillance systems such as Automatic Dependent Surveillance-Broadcast (ADS-B) or ground radar. This demonstrates that SET can be inferred from ground movement patterns, paving the way for future work with non-proprietary data sources. Our results highlight the potential of deep learning to improve SET detection and support more comprehensive environmental impact assessments.

[LG-83] owards Trustworthy Web Attack Detection: An Uncertainty-Aware Ensemble Deep Kernel Learning Model

链接: https://arxiv.org/abs/2410.07725
作者: Yonghang Zhou,Hongyi Zhu,Yidong Chai,Yuanchun Jiang,Yezheng Liu
关键词-EN: bring huge costs, web application-based businesses, Deep Kernel Learning, model uncertainty, Web attacks
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Web attacks are one of the major and most persistent forms of cyber threats, which bring huge costs and losses to web application-based businesses. Various detection methods, such as signature-based, machine learning-based, and deep learning-based, have been proposed to identify web attacks. However, these methods either (1) heavily rely on accurate and complete rule design and feature engineering, which may not adapt to fast-evolving attacks, or (2) fail to estimate model uncertainty, which is essential to the trustworthiness of the prediction made by the model. In this study, we proposed an Uncertainty-aware Ensemble Deep Kernel Learning (UEDKL) model to detect web attacks from HTTP request payload data with the model uncertainty captured from the perspective of both data distribution and model parameters. The proposed UEDKL utilizes a deep kernel learning model to distinguish normal HTTP requests from different types of web attacks with model uncertainty estimated from data distribution perspective. Multiple deep kernel learning models were trained as base learners to capture the model uncertainty from model parameters perspective. An attention-based ensemble learning approach was designed to effectively integrate base learners’ predictions and model uncertainty. We also proposed a new metric named High Uncertainty Ratio-F Score Curve to evaluate model uncertainty estimation. Experiments on BDCI and SRBH datasets demonstrated that the proposed UEDKL framework yields significant improvement in both web attack detection performance and uncertainty estimation quality compared to benchmark models.

[LG-84] Understanding Adversarially Robust Generalization via Weight-Curvature Index

链接: https://arxiv.org/abs/2410.07719
作者: Yuelin Xu,Xiao Zhang
关键词-EN: remain largely unknown, adversarially robust generalization, robust generalization, remain largely, largely unknown
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite extensive research on adversarial examples, the underlying mechanisms of adversarially robust generalization, a critical yet challenging task for deep learning, remain largely unknown. In this work, we propose a novel perspective to decipher adversarially robust generalization through the lens of the Weight-Curvature Index (WCI). The proposed WCI quantifies the vulnerability of models to adversarial perturbations using the Frobenius norm of weight matrices and the trace of Hessian matrices. We prove generalization bounds based on PAC-Bayesian theory and second-order loss function approximations to elucidate the interplay between robust generalization gap, model parameters, and loss landscape curvature. Our theory and experiments show that WCI effectively captures the robust generalization performance of adversarially trained models. By offering a nuanced understanding of adversarial robustness based on the scale of model parameters and the curvature of the loss landscape, our work provides crucial insights for designing more resilient deep learning models, enhancing their reliability and security.

[LG-85] On the Generalization Properties of Deep Learning for Aircraft Fuel Flow Estimation Models

链接: https://arxiv.org/abs/2410.07717
作者: Gabriel Jarry,Ramon Dalmau,Philippe Very,Junzi Sun
关键词-EN: Accurately estimating aircraft, current aviation practices, Accurately estimating, designing next-generation aircraft, aircraft
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurately estimating aircraft fuel flow is essential for evaluating new procedures, designing next-generation aircraft, and monitoring the environmental impact of current aviation practices. This paper investigates the generalization capabilities of deep learning models in predicting fuel consumption, focusing particularly on their performance for aircraft types absent from the training data. We propose a novel methodology that integrates neural network architectures with domain generalization techniques to enhance robustness and reliability across a wide range of aircraft. A comprehensive dataset containing 101 different aircraft types, separated into training and generalization sets, with each aircraft type set containing 1,000 flights. We employed the base of aircraft data (BADA) model for fuel flow estimates, introduced a pseudo-distance metric to assess aircraft type similarity, and explored various sampling strategies to optimize model performance in data-sparse regions. Our results reveal that for previously unseen aircraft types, the introduction of noise into aircraft and engine parameters improved model generalization. The model is able to generalize with acceptable mean absolute percentage error between 2% and 10% for aircraft close to existing aircraft, while performance is below 1% error for known aircraft in the training set. This study highlights the potential of combining domain-specific insights with advanced machine learning techniques to develop scalable, accurate, and generalizable fuel flow estimation models.

[LG-86] Rethinking the Principle of Gradient Smooth Methods in Model Explanation

链接: https://arxiv.org/abs/2410.07711
作者: Linjiang Zhou,Chao Ma,Zepeng Wang,Xiaochuan Shi
关键词-EN: gradient-based model explanation, model explanation method, model explanation, Gaussian noise, noise
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient Smoothing is an efficient approach to reducing noise in gradient-based model explanation method. SmoothGrad adds Gaussian noise to mitigate much of these noise. However, the crucial hyper-parameter in this method, the variance \sigma of Gaussian noise, is set manually or with heuristic approach. However, it results in the smoothed gradients still containing a certain amount of noise. In this paper, we aim to interpret SmoothGrad as a corollary of convolution, thereby re-understanding the gradient noise and the role of \sigma from the perspective of confidence level. Furthermore, we propose an adaptive gradient smoothing method, AdaptGrad, based on these insights. Through comprehensive experiments, both qualitative and quantitative results demonstrate that AdaptGrad could effectively reduce almost all the noise in vanilla gradients compared with baselines methods. AdaptGrad is simple and universal, making it applicable for enhancing gradient-based interpretability methods for better visualization.

[LG-87] Learning Tree Pattern Transformations

链接: https://arxiv.org/abs/2410.07708
作者: Daniel Neider,Leif Sabellek,Johannes Schmidt,Fabian Vehlken,Thomas Zeume
关键词-EN: XML or JSON, understanding tree-structured data, JSON data, structurally differs, computer science
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Explaining why and how a tree t structurally differs from another tree t^* is a question that is encountered throughout computer science, including in understanding tree-structured data such as XML or JSON data. In this article, we explore how to learn explanations for structural differences between pairs of trees from sample data: suppose we are given a set (t_1, t_1^),\dots, (t_n, t_n^)\ of pairs of labelled, ordered trees; is there a small set of rules that explains the structural differences between all pairs (t_i, t_i^*) ? This raises two research questions: (i) what is a good notion of “rule” in this context?; and (ii) how can sets of rules explaining a data set be learnt algorithmically? We explore these questions from the perspective of database theory by (1) introducing a pattern-based specification language for tree transformations; (2) exploring the computational complexity of variants of the above algorithmic problem, e.g. showing NP-hardness for very restricted variants; and (3) discussing how to solve the problem for data from CS education research using SAT solvers. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Databases (cs.DB) Cite as: arXiv:2410.07708 [cs.LG] (or arXiv:2410.07708v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2410.07708 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-88] MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting NEURIPS2024

链接: https://arxiv.org/abs/2410.07707
作者: Ruijie Zhu,Yanzhe Liang,Hanzhi Chang,Jiacheng Deng,Jiahao Lu,Wenfei Yang,Tianzhu Zhang,Yongdong Zhang
关键词-EN: Gaussian Splatting, Dynamic scene reconstruction, long-term challenge, Gaussian splatting framework, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024. 21 pages, 14 figures,7 tables

点击查看摘要

Abstract:Dynamic scene reconstruction is a long-term challenge in the field of 3D vision. Recently, the emergence of 3D Gaussian Splatting has provided new insights into this problem. Although subsequent efforts rapidly extend static 3D Gaussian to dynamic scenes, they often lack explicit constraints on object motion, leading to optimization difficulties and performance degradation. To address the above issues, we propose a novel deformable 3D Gaussian splatting framework called MotionGS, which explores explicit motion priors to guide the deformation of 3D Gaussians. Specifically, we first introduce an optical flow decoupling module that decouples optical flow into camera flow and motion flow, corresponding to camera movement and object motion respectively. Then the motion flow can effectively constrain the deformation of 3D Gaussians, thus simulating the motion of dynamic objects. Additionally, a camera pose refinement module is proposed to alternately optimize 3D Gaussians and camera poses, mitigating the impact of inaccurate camera poses. Extensive experiments in the monocular dynamic scenes validate that MotionGS surpasses state-of-the-art methods and exhibits significant superiority in both qualitative and quantitative results. Project page: this https URL

[LG-89] A Generalization Result for Convergence in Learning-to-Optimize

链接: https://arxiv.org/abs/2410.07704
作者: Michael Sucker,Peter Ochs
关键词-EN: conventional convergence guarantees, applied easily, geometric arguments, transferring geometric arguments, Convergence
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Convergence in learning-to-optimize is hardly studied, because conventional convergence guarantees in optimization are based on geometric arguments, which cannot be applied easily to learned algorithms. Thus, we develop a probabilistic framework that resembles deterministic optimization and allows for transferring geometric arguments into learning-to-optimize. Our main theorem is a generalization result for parametric classes of potentially non-smooth, non-convex loss functions and establishes the convergence of learned optimization algorithms to stationary points with high probability. This can be seen as a statistical counterpart to the use of geometric safeguards to ensure convergence. To the best of our knowledge, we are the first to prove convergence of optimization algorithms in such a probabilistic framework.

[LG-90] Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

链接: https://arxiv.org/abs/2410.07698
作者: Yiming Chen,Yuan Zhang,Liyuan Cao,Kun Yuan,Zaiwen Wen
关键词-EN: adapting large language, significantly reduces memory, Parameter-efficient fine-tuning, significantly reduces, large language models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) significantly reduces memory costs when adapting large language models (LLMs) for downstream applications. However, traditional first-order (FO) fine-tuning algorithms incur substantial memory overhead due to the need to store activation values for back-propagation during gradient computation, particularly in long-context fine-tuning tasks. Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values, thus eliminating the need for activation storage. Nevertheless, existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO gradient estimator and introduces a novel low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs. We provide convergence guarantees for LOZO by framing it as a subspace optimization method. Additionally, its low-rank nature enables LOZO to integrate with momentum techniques while incurring negligible extra memory costs. Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms.

[LG-91] Growing Efficient Accurate and Robust Neural Networks on the Edge

链接: https://arxiv.org/abs/2410.07691
作者: Vignesh Sundaresha,Naresh Shanbhag
关键词-EN: occurring common corruptions, deep learning systems, computational complexity coupled, naturally occurring common, resource-constrained Edge devices
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:The ubiquitous deployment of deep learning systems on resource-constrained Edge devices is hindered by their high computational complexity coupled with their fragility to out-of-distribution (OOD) data, especially to naturally occurring common corruptions. Current solutions rely on the Cloud to train and compress models before deploying to the Edge. This incurs high energy and latency costs in transmitting locally acquired field data to the Cloud while also raising privacy concerns. We propose GEARnn (Growing Efficient, Accurate, and Robust neural networks) to grow and train robust networks in-situ, i.e., completely on the Edge device. Starting with a low-complexity initial backbone network, GEARnn employs One-Shot Growth (OSG) to grow a network satisfying the memory constraints of the Edge device using clean data, and robustifies the network using Efficient Robust Augmentation (ERA) to obtain the final network. We demonstrate results on a NVIDIA Jetson Xavier NX, and analyze the trade-offs between accuracy, robustness, model size, energy consumption, and training time. Our results demonstrate the construction of efficient, accurate, and robust networks entirely on an Edge device.

[LG-92] When the Small-Loss Trick is Not Enough: Multi-Label Image Classification with Noisy Labels Applied to CCTV Sewer Inspections

链接: https://arxiv.org/abs/2410.07689
作者: Keryan Chelouche,Marie Lachaize(VERI),Marine Bernard(VERI),Louise Olgiati,Remi Cuingnet
关键词-EN: efficient Closed-Circuit Television, Closed-Circuit Television, label noise, sewerage networks, heavily relies
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The maintenance of sewerage networks, with their millions of kilometers of pipe, heavily relies on efficient Closed-Circuit Television (CCTV) inspections. Many promising approaches based on multi-label image classification have leveraged databases of historical inspection reports to automate these inspections. However, the significant presence of label noise in these databases, although known, has not been addressed. While extensive research has explored the issue of label noise in singlelabel classification (SLC), little attention has been paid to label noise in multi-label classification (MLC). To address this, we first adapted three sample selection SLC methods (Co-teaching, CoSELFIE, and DISC) that have proven robust to label noise. Our findings revealed that sample selection based solely on the small-loss trick can handle complex label noise, but it is sub-optimal. Adapting hybrid sample selection methods to noisy MLC appeared to be a more promising approach. In light of this, we developed a novel method named MHSS (Multi-label Hybrid Sample Selection) based on CoSELFIE. Through an in-depth comparative study, we demonstrated the superior performance of our approach in dealing with both synthetic complex noise and real noise, thus contributing to the ongoing efforts towards effective automation of CCTV sewer pipe inspections.

[LG-93] Learning to Compress: Local Rank and Information Compression in Deep Neural Networks NEURIPS2024

链接: https://arxiv.org/abs/2410.07687
作者: Niket Patel,Ravid Shwartz-Ziv
关键词-EN: implicitly learning low-dimensional, neural networks tend, Deep neural networks, tend to exhibit, exhibit a bias
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Accepted to Compression Workshop @ NeurIPS 2024

点击查看摘要

Abstract:Deep neural networks tend to exhibit a bias toward low-rank solutions during training, implicitly learning low-dimensional feature representations. This paper investigates how deep multilayer perceptrons (MLPs) encode these feature manifolds and connects this behavior to the Information Bottleneck (IB) theory. We introduce the concept of local rank as a measure of feature manifold dimensionality and demonstrate, both theoretically and empirically, that this rank decreases during the final phase of training. We argue that networks that reduce the rank of their learned representations also compress mutual information between inputs and intermediate layers. This work bridges the gap between feature manifold rank and information compression, offering new insights into the interplay between information bottlenecks and representation learning.

[LG-94] FedEP: Tailoring Attention to Heterogeneous Data Distribution with Entropy Pooling for Decentralized Federated Learning

链接: https://arxiv.org/abs/2410.07678
作者: Chao Feng,Hongjie Guan,Alberto Huertas Celdrán,Jan von der Assen,Gérôme Bovet,Burkhard Stiller
关键词-EN: Identically Distributed, Federated Learning, non-Independent and Identically, Federated Entropy Pooling, data distribution
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) performance is highly influenced by data distribution across clients, and non-Independent and Identically Distributed (non-IID) leads to a slower convergence of the global model and a decrease in model effectiveness. The existing algorithms for solving the non-IID problem are focused on the traditional centralized FL (CFL), where a central server is used for model aggregation. However, in decentralized FL (DFL), nodes lack the overall vision of the federation. To address the non-IID problem in DFL, this paper proposes a novel DFL aggregation algorithm, Federated Entropy Pooling (FedEP). FedEP mitigates the client drift problem by incorporating the statistical characteristics of local distributions instead of any actual data. Prior to training, each client conducts a local distribution fitting using a Gaussian Mixture Model (GMM) and shares the resulting statistical characteristics with its neighbors. After receiving the statistical characteristics shared by its neighbors, each node tries to fit the global data distribution. In the aggregation phase, each node calculates the Kullback-Leibler (KL) divergences of the local data distribution over the fitted global data distribution, giving the weights to generate the aggregated model. Extensive experiments have demonstrated that FedEP can achieve faster convergence and show higher test performance than state-of-the-art approaches.

[LG-95] Adversarial Robustness Overestimation and Instability in TRADES

链接: https://arxiv.org/abs/2410.07675
作者: Jonathan Weiping Li,Ren-Wei Liang,Cheng-Han Yeh,Cheng-Chang Tsai,Kuanchun Yu,Chun-Shien Lu,Shang-Tse Chen
关键词-EN: paper examines, probabilistic robustness overestimation, PGD validation accuracy, overestimation, TRADES
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task. This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking. We further analyze the parameters contributing to unstable models that lead to overestimation. Our findings indicate that smaller batch sizes, lower beta values (which control the weight of the robust loss term in TRADES), larger learning rates, and higher class complexity (e.g., CIFAR-100 versus CIFAR-10) are associated with an increased likelihood of robustness overestimation. By examining metrics such as the First-Order Stationary Condition (FOSC), inner-maximization, and gradient information, we identify the underlying cause of this phenomenon as gradient masking and provide insights into it. Furthermore, our experiments show that certain unstable training instances may return to a state without robust overestimation, inspiring our attempts at a solution. In addition to adjusting parameter settings to reduce instability or retraining when overestimation occurs, we recommend incorporating Gaussian noise in inputs when the FOSC score exceed the threshold. This method aims to mitigate robustness overestimation of TRADES and other similar methods at its source, ensuring more reliable representation of adversarial robustness during evaluation.

[LG-96] Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference

链接: https://arxiv.org/abs/2410.07673
作者: Jianxing Yu,Shiqi Wang,Han Yin,Zhenlong Sun,Ruobing Xie,Bo Zhang,Yanghui Rao
关键词-EN: detecting clickbait posts, paper focuses, focuses on detecting, detecting clickbait, Web
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper focuses on detecting clickbait posts on the Web. These posts often use eye-catching disinformation in mixed modalities to mislead users to click for profit. That affects the user experience and thus would be blocked by content provider. To escape detection, malicious creators use tricks to add some irrelevant non-bait content into bait posts, dressing them up as legal to fool the detector. This content often has biased relations with non-bait labels, yet traditional detectors tend to make predictions based on simple co-occurrence rather than grasping inherent factors that lead to malicious behavior. This spurious bias would easily cause misjudgments. To address this problem, we propose a new debiased method based on causal inference. We first employ a set of features in multiple modalities to characterize the posts. Considering these features are often mixed up with unknown biases, we then disentangle three kinds of latent factors from them, including the invariant factor that indicates intrinsic bait intention; the causal factor which reflects deceptive patterns in a certain scenario, and non-causal noise. By eliminating the noise that causes bias, we can use invariant and causal factors to build a robust model with good generalization ability. Experiments on three popular datasets show the effectiveness of our approach.

[LG-97] Scalable and Resource-Efficient Second-Order Federated Learning via Over-the-Air Aggregation

链接: https://arxiv.org/abs/2410.07662
作者: Abdulmomen Ghalkha,Chaouki Ben Issaid,Mehdi Bennis
关键词-EN: offer faster convergence, Second-order federated learning, leveraging curvature information, algorithms offer faster, federated learning
类目: Machine Learning (cs.LG)
*备注: 5 pages, 1 figure, 4 subfigures, letter

点击查看摘要

Abstract:Second-order federated learning (FL) algorithms offer faster convergence than their first-order counterparts by leveraging curvature information. However, they are hindered by high computational and storage costs, particularly for large-scale models. Furthermore, the communication overhead associated with large models and digital transmission exacerbates these challenges, causing communication bottlenecks. In this work, we propose a scalable second-order FL algorithm using a sparse Hessian estimate and leveraging over-the-air aggregation, making it feasible for larger models. Our simulation results demonstrate more than 67% of communication resources and energy savings compared to other first and second-order baselines.

[LG-98] Mechanistic Permutability: Match Features Across Layers

链接: https://arxiv.org/abs/2410.07656
作者: Nikita Balagansky,Ian Maksimov,Daniil Gavrilov
关键词-EN: deep neural networks, Sparse Autoencoders, fundamental challenge, due to polysemanticity, layers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.

[LG-99] Almost Minimax Optimal Best Arm Identification in Piecewise Stationary Linear Bandits NEURIPS2024

链接: https://arxiv.org/abs/2410.07638
作者: Yunlong Hou,Vincent Y. F. Tan,Zixin Zhong
关键词-EN: varepsilon, BAI, piecewise stationary linear, stationary linear bandit, environment randomly samples
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 69 pages. Accepted to NeurIPS 2024

点击查看摘要

Abstract:We propose a \em novel piecewise stationary linear bandit (PSLB) model, where the environment randomly samples a context from an unknown probability distribution at each changepoint, and the quality of an arm is measured by its return averaged over all contexts. The contexts and their distribution, as well as the changepoints are unknown to the agent. We design \em Piecewise-Stationary \varepsilon -Best Arm Identification ^+ (PS \varepsilon BAI ^+ ), an algorithm that is guaranteed to identify an \varepsilon -optimal arm with probability \ge 1-\delta and with a minimal number of samples. PS \varepsilon BAI ^+ consists of two subroutines, PS \varepsilon BAI and \sc Naïve \varepsilon -BAI (N \varepsilon BAI), which are executed in parallel. PS \varepsilon BAI actively detects changepoints and aligns contexts to facilitate the arm identification process. When PS \varepsilon BAI and N \varepsilon BAI are utilized judiciously in parallel, PS \varepsilon BAI ^+ is shown to have a finite expected sample complexity. By proving a lower bound, we show the expected sample complexity of PS \varepsilon BAI ^+ is optimal up to a logarithmic factor. We compare PS \varepsilon BAI ^+ to baseline algorithms using numerical experiments which demonstrate its efficiency. Both our analytical and numerical results corroborate that the efficacy of PS \varepsilon BAI ^+ is due to the delicate change detection and context alignment procedures embedded in PS \varepsilon BAI.

[LG-100] Provable Privacy Attacks on Trained Shallow Neural Networks

链接: https://arxiv.org/abs/2410.07632
作者: Guy Smorodinsky,Gal Vardi,Itay Safran
关键词-EN: ReLU neural networks, shown on trained, provable privacy attacks, ReLU neural, neural networks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We study what provable privacy attacks can be shown on trained, 2-layer ReLU neural networks. We explore two types of attacks; data reconstruction attacks, and membership inference attacks. We prove that theoretical results on the implicit bias of 2-layer neural networks can be used to provably reconstruct a set of which at least a constant fraction are training points in a univariate setting, and can also be used to identify with high probability whether a given point was used in the training set in a high dimensional setting. To the best of our knowledge, our work is the first to show provable vulnerabilities in this setting.

[LG-101] Automatic Curriculum Expert Iteration for Reliable LLM Reasoning

链接: https://arxiv.org/abs/2410.07627
作者: Zirui Zhao,Hanze Dong,Amrita Saha,Caiming Xiong,Doyen Sahoo
关键词-EN: generating plausible, inaccurate content, excessive refusals, persist as major, plausible but inaccurate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: 20 pages

点击查看摘要

Abstract:Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e. excessive refusals or defaulting to “I don’t know”) persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conservative, limiting their problem-solving capabilities. To mitigate hallucination and laziness in reasoning tasks, we propose Automatic Curriculum Expert Iteration (Auto-CEI) to enhance LLM reasoning and align responses to the model’s capabilities–assertively answering within its limits and declining when tasks exceed them. In our method, Expert Iteration explores the reasoning trajectories near the LLM policy, guiding incorrect paths back on track to reduce compounding errors and improve robustness; it also promotes appropriate “I don’t know” responses after sufficient reasoning attempts. The curriculum automatically adjusts rewards, incentivizing extended reasoning before acknowledging incapability, thereby pushing the limits of LLM reasoning and aligning its behaviour with these limits. We compare Auto-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where Auto-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness.

[LG-102] he Plug-in Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis

链接: https://arxiv.org/abs/2410.07616
作者: Matthew Zurek,Yudong Chen
关键词-EN: Markov decision processes, average-reward Markov decision, plug-in approach, Markov decision, average-reward Markov
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the sample complexity of the plug-in approach for learning \varepsilon -optimal policies in average-reward Markov decision processes (MDPs) with a generative model. The plug-in approach constructs a model estimate then computes an average-reward optimal policy in the estimated model. Despite representing arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed. Unlike the more well-studied discounted MDP reduction method, the plug-in approach requires no prior problem information or parameter tuning. Our results fill this gap and address the limitations of prior approaches, as we show that the plug-in approach is optimal in several well-studied settings without using prior knowledge. Specifically it achieves the optimal diameter- and mixing-based sample complexities of \widetildeO\left(SA \fracD\varepsilon^2\right) and \widetildeO\left(SA \frac\tau_\mathrmunif\varepsilon^2\right) , respectively, without knowledge of the diameter D or uniform mixing time \tau_\mathrmunif . We also obtain span-based bounds for the plug-in approach, and complement them with algorithm-specific lower bounds suggesting that they are unimprovable. Our results require novel techniques for analyzing long-horizon problems which may be broadly useful and which also improve results for the discounted plug-in approach, removing effective-horizon-related sample size restrictions and obtaining the first optimal complexity bounds for the full range of sample sizes without reward perturbation.

[LG-103] Parallel Digital Twin-driven Deep Reinforcement Learning for User Association and Load Balancing in Dynamic Wireless Networks

链接: https://arxiv.org/abs/2410.07611
作者: Zhenyu Tao,Wei Xu,Xiaohu You
关键词-EN: densely deployed heterogeneous, deployed heterogeneous cellular, heterogeneous cellular network, DRL, user
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: arXiv admin note: text overlap with arXiv:2407.19765

点击查看摘要

Abstract:Optimization of user association in a densely deployed heterogeneous cellular network is usually challenging and even more complicated due to the dynamic nature of user mobility and fluctuation in user counts. While deep reinforcement learning (DRL) emerges as a promising solution, its application in practice is hindered by high trial-and-error costs in real world and unsatisfactory physical network performance during training. In addition, existing DRL-based user association methods are usually only applicable to scenarios with a fixed number of users due to convergence and compatibility challenges. In this paper, we propose a parallel digital twin (DT)-driven DRL method for user association and load balancing in networks with both dynamic user counts, distribution, and mobility patterns. Our method employs a distributed DRL strategy to handle varying user numbers and exploits a refined neural network structure for faster convergence. To address these DRL training-related challenges, we devise a high-fidelity DT construction technique, featuring a zero-shot generative user mobility model, named Map2Traj, based on a diffusion model. Map2Traj estimates user trajectory patterns and spatial distributions solely from street maps. Armed with this DT environment, DRL agents are enabled to be trained without the need for interactions with the physical network. To enhance the generalization ability of DRL models for dynamic scenarios, a parallel DT framework is further established to alleviate strong correlation and non-stationarity in single-environment training and improve the training efficiency. Numerical results show that the proposed parallel DT-driven DRL method achieves closely comparable performance to real environment training, and even outperforms those trained in a single real-world environment with nearly 20% gain in terms of cell-edge user performance.

[LG-104] CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

链接: https://arxiv.org/abs/2410.07610
作者: Po-han Li,Sandeep P. Chinchali,Ufuk Topcu
关键词-EN: cross-modal retrieval, CSA, excel in tasks, Multimodal, CLIP excel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring 300,000\times fewer multimodal data pairs and 6\times fewer unimodal data for ImageNet classification and misinformative news captions detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.

[LG-105] A Variational Bayesian Inference Theory of Elasticity and Its Mixed Probabilistic Finite Element Method for Inverse Deformation Solutions in Any Dimension

链接: https://arxiv.org/abs/2410.07605
作者: Chao Wang,Shaofan Li
关键词-EN: variational Bayesian inference, Bayesian inference theory, Bayesian inference, Bayesian inference Finite, Bayesian inference network
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this work, we have developed a variational Bayesian inference theory of elasticity, which is accomplished by using a mixed Variational Bayesian inference Finite Element Method (VBI-FEM) that can be used to solve the inverse deformation problems of continua. In the proposed variational Bayesian inference theory of continuum mechanics, the elastic strain energy is used as a prior in a Bayesian inference network, which can intelligently recover the detailed continuum deformation mappings with only given the information on the deformed and undeformed continuum body shapes without knowing the interior deformation and the precise actual boundary conditions, both traction as well as displacement boundary conditions, and the actual material constitutive relation. Moreover, we have implemented the related finite element formulation in a computational probabilistic mechanics framework. To numerically solve mixed variational problem, we developed an operator splitting or staggered algorithm that consists of the finite element (FE) step and the Bayesian learning (BL) step as an analogue of the well-known the Expectation-Maximization (EM) algorithm. By solving the mixed probabilistic Galerkin variational problem, we demonstrated that the proposed method is able to inversely predict continuum deformation mappings with strong discontinuity or fracture without knowing the external load conditions. The proposed method provides a robust machine intelligent solution for the long-sought-after inverse problem solution, which has been a major challenge in structure failure forensic pattern analysis in past several decades. The proposed method may become a promising artificial intelligence-based inverse method for solving general partial differential equations.

[LG-106] Imitation Learning with Limited Actions via Diffusion Planners and Deep Koopman Controllers

链接: https://arxiv.org/abs/2410.07584
作者: Jianxin Bi,Kelvin Lim,Kaiqi Chen,Yifei Huang,Harold Soh
关键词-EN: demonstrated significant potential, Recent advances, diffusion-based robot policies, imitating multi-modal behaviors, advances in diffusion-based
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in diffusion-based robot policies have demonstrated significant potential in imitating multi-modal behaviors. However, these approaches typically require large quantities of demonstration data paired with corresponding robot action labels, creating a substantial data collection burden. In this work, we propose a plan-then-control framework aimed at improving the action-data efficiency of inverse dynamics controllers by leveraging observational demonstration data. Specifically, we adopt a Deep Koopman Operator framework to model the dynamical system and utilize observation-only trajectories to learn a latent action representation. This latent representation can then be effectively mapped to real high-dimensional continuous actions using a linear action decoder, requiring minimal action-labeled data. Through experiments on simulated robot manipulation tasks and a real robot experiment with multi-modal expert demonstrations, we demonstrate that our approach significantly enhances action-data efficiency and achieves high task success rates with limited action data.

[LG-107] Detecting Training Data of Large Language Models via Expectation Maximization

链接: https://arxiv.org/abs/2410.07582
作者: Gyuwan Kim,Yang Li,Evangelia Spiliopoulou,Jie Ma,Miguel Ballesteros,William Yang Wang
关键词-EN: large language models, remains undisclosed, impressive advancements, widespread deployment, deployment of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:The widespread deployment of large language models (LLMs) has led to impressive advancements, yet information about their training data, a critical factor in their performance, remains undisclosed. Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model’s training data. MIAs can offer insights into LLM outputs and help detect and address concerns such as data contamination and compliance with privacy and copyright standards. However, applying MIAs to LLMs presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership. Additionally, creating appropriate benchmarks to evaluate MIA methods is not straightforward, as training and test data distributions are often unknown. In this paper, we introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm, leveraging the duality that the estimates of these scores can be improved by each other. Membership scores and prefix scores assess how each instance is likely to be a member and discriminative as a prefix, respectively. Our method achieves state-of-the-art results on the WikiMIA dataset. To further evaluate EM-MIA, we present OLMoMIA, a benchmark built from OLMo resources, which allows us to control the difficulty of MIA tasks with varying degrees of overlap between training and test data distributions. We believe that EM-MIA serves as a robust MIA method for LLMs and that OLMoMIA provides a valuable resource for comprehensively evaluating MIA approaches, thereby driving future research in this critical area.

[LG-108] Boosting Deep Ensembles with Learning Rate Tuning

链接: https://arxiv.org/abs/2410.07564
作者: Hongpeng Jin,Yanzhao Wu
关键词-EN: Deep Neural Network, Neural Network, deep, high impact, deep learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Learning Rate (LR) has a high impact on deep learning training performance. A common practice is to train a Deep Neural Network (DNN) multiple times with different LR policies to find the optimal LR policy, which has been widely recognized as a daunting and costly task. Moreover, multiple times of DNN training has not been effectively utilized. In practice, often only the optimal LR is adopted, which misses the opportunities to further enhance the overall accuracy of the deep learning system and results in a huge waste of both computing resources and training time. This paper presents a novel framework, LREnsemble, to effectively leverage effective learning rate tuning to boost deep ensemble performance. We make three original contributions. First, we show that the LR tuning with different LR policies can produce highly diverse DNNs, which can be supplied as base models for deep ensembles. Second, we leverage different ensemble selection algorithms to identify high-quality deep ensembles from the large pool of base models with significant accuracy improvements over the best single base model. Third, we propose LREnsemble, a framework that utilizes the synergy of LR tuning and deep ensemble techniques to enhance deep learning performance. The experiments on multiple benchmark datasets have demonstrated the effectiveness of LREnsemble, generating up to 2.34% accuracy improvements over well-optimized baselines.

[LG-109] PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

链接: https://arxiv.org/abs/2410.07563
作者: Kenshin Abe,Kaizaburo Chubachi,Yasuhiro Fujita,Yuta Hirokawa,Kentaro Imajo,Toshiki Kataoka,Hiroyoshi Komatsu,Hiroaki Mikami,Tsuguo Mogami,Shogo Murai,Kosuke Nakago,Daisuke Nishino,Toru Ogawa,Daisuke Okanohara,Yoshihiko Ozaki,Shotaro Sano,Shuji Suzuki,Tianqi Xu,Toshihiko Yanase(Preferred Elements, Inc.)
关键词-EN: Japanese proficiency, designed for Japanese, large-scale language model, language model designed, Direct Preference Optimization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch using 2 trillion tokens, with architecture such as QK Normalization and Z-Loss to ensure training stability during the training process. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model’s performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4.

[LG-110] Conditional Lagrangian Wasserstein Flow for Time Series Imputation

链接: https://arxiv.org/abs/2410.07550
作者: Weizhu Qian,Dalin Zhang,Yan Zhao
关键词-EN: Time series imputation, numerous real-world applications, Time series, Conditional Lagrangian Wasserstein, Lagrangian Wasserstein Flow
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:Time series imputation is important for numerous real-world applications. To overcome the limitations of diffusion model-based imputation methods, e.g., slow convergence in inference, we propose a novel method for time series imputation in this work, called Conditional Lagrangian Wasserstein Flow. The proposed method leverages the (conditional) optimal transport theory to learn the probability flow in a simulation-free manner, in which the initial noise, missing data, and observations are treated as the source distribution, target distribution, and conditional information, respectively. According to the principle of least action in Lagrangian mechanics, we learn the velocity by minimizing the corresponding kinetic energy. Moreover, to incorporate more prior information into the model, we parameterize the derivative of a task-specific potential function via a variational autoencoder, and combine it with the base estimator to formulate a Rao-Blackwellized sampler. The propose model allows us to take less intermediate steps to produce high-quality samples for inference compared to existing diffusion methods. Finally, the experimental results on the real-word datasets show that the proposed method achieves competitive performance on time series imputation compared to the state-of-the-art methods.

[LG-111] Rank Aggregation in Crowdsourcing for Listwise Annotations

链接: https://arxiv.org/abs/2410.07538
作者: Wenshui Luo,Haoyu Liu,Yongliang Ding,Tao Zhou,Sheng wan,Runze Wu,Minmin Lin,Cong Zhang,Changjie Fan,Chen Gong
关键词-EN: gained significant attention, recently gained significant, Listwise rank Aggregation, Rank aggregation, significant attention
类目: Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:Rank aggregation through crowdsourcing has recently gained significant attention, particularly in the context of listwise ranking annotations. However, existing methods primarily focus on a single problem and partial ranks, while the aggregation of listwise full ranks across numerous problems remains largely unexplored. This scenario finds relevance in various applications, such as model quality assessment and reinforcement learning with human feedback. In light of practical needs, we propose LAC, a Listwise rank Aggregation method in Crowdsourcing, where the global position information is carefully measured and included. In our design, an especially proposed annotation quality indicator is employed to measure the discrepancy between the annotated rank and the true rank. We also take the difficulty of the ranking problem itself into consideration, as it directly impacts the performance of annotators and consequently influences the final results. To our knowledge, LAC is the first work to directly deal with the full rank aggregation problem in listwise crowdsourcing, and simultaneously infer the difficulty of problems, the ability of annotators, and the ground-truth ranks in an unsupervised way. To evaluate our method, we collect a real-world business-oriented dataset for paragraph ranking. Experimental results on both synthetic and real-world benchmark datasets demonstrate the effectiveness of our proposed LAC method.

[LG-112] Corruption-Robust Linear Bandits: Minimax Optimality and Gap-Dependent Misspecification NEURIPS2024

链接: https://arxiv.org/abs/2410.07533
作者: Haolin Liu,Artin Tajdini,Andrew Wagenmaker,Chen-Yu Wei
关键词-EN: learner effectively learn, facing corrupted rewards, linear bandits, effectively learn, learn when facing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:In linear bandits, how can a learner effectively learn when facing corrupted rewards? While significant work has explored this question, a holistic understanding across different adversarial models and corruption measures is lacking, as is a full characterization of the minimax regret bounds. In this work, we compare two types of corruptions commonly considered: strong corruption, where the corruption level depends on the action chosen by the learner, and weak corruption, where the corruption level does not depend on the action chosen by the learner. We provide a unified framework to analyze these corruptions. For stochastic linear bandits, we fully characterize the gap between the minimax regret under strong and weak corruptions. We also initiate the study of corrupted adversarial linear bandits, obtaining upper and lower bounds with matching dependencies on the corruption level. Next, we reveal a connection between corruption-robust learning and learning with gap-dependent mis-specification, a setting first studied by Liu et al. (2023a), where the misspecification level of an action or policy is proportional to its suboptimality. We present a general reduction that enables any corruption-robust algorithm to handle gap-dependent misspecification. This allows us to recover the results of Liu et al. (2023a) in a black-box manner and significantly generalize them to settings like linear MDPs, yielding the first results for gap-dependent misspecification in reinforcement learning. However, this general reduction does not attain the optimal rate for gap-dependent misspecification. Motivated by this, we develop a specialized algorithm that achieves optimal bounds for gap-dependent misspecification in linear bandits, thus answering an open question posed by Liu et al. (2023a).

[LG-113] Enhanced physics-informed neural networks (PINNs) for high-order power grid dynamics NEURIPS2024

链接: https://arxiv.org/abs/2410.07527
作者: Vineet Jagadeesan Nair
关键词-EN: physics-informed neural networks, ordinary differential equations, develop improved physics-informed, improved physics-informed neural, high-dimensional power system
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted to the Tackling Climate Change with Machine Learning workshop at NeurIPS 2024

点击查看摘要

Abstract:We develop improved physics-informed neural networks (PINNs) for high-order and high-dimensional power system models described by nonlinear ordinary differential equations. We propose some novel enhancements to improve PINN training and accuracy and also implement several other recently proposed ideas from the literature. We successfully apply these to study the transient dynamics of synchronous generators. We also make progress towards applying PINNs to advanced inverter models. Such enhanced PINNs can allow us to accelerate high-fidelity simulations needed to ensure a stable and reliable renewables-rich future grid.

[LG-114] Offline Inverse Constrained Reinforcement Learning for Safe-Critical Decision Making in Healthcare

链接: https://arxiv.org/abs/2410.07525
作者: Nan Fang,Guiliang Liu,Wei Gong
关键词-EN: Constrained Reinforcement Learning, Reinforcement Learning, agents overlooking common-sense, Inverse Constrained Reinforcement, Constrained Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) applied in healthcare can lead to unsafe medical decisions and treatment, such as excessive dosages or abrupt changes, often due to agents overlooking common-sense constraints. Consequently, Constrained Reinforcement Learning (CRL) is a natural choice for safe decisions. However, specifying the exact cost function is inherently difficult in healthcare. Recent Inverse Constrained Reinforcement Learning (ICRL) is a promising approach that infers constraints from expert demonstrations. ICRL algorithms model Markovian decisions in an interactive environment. These settings do not align with the practical requirement of a decision-making system in healthcare, where decisions rely on historical treatment recorded in an offline dataset. To tackle these issues, we propose the Constraint Transformer (CT). Specifically, 1) we utilize a causal attention mechanism to incorporate historical decisions and observations into the constraint modeling, while employing a Non-Markovian layer for weighted constraints to capture critical states. 2) A generative world model is used to perform exploratory data augmentation, enabling offline RL methods to simulate unsafe decision sequences. In multiple medical scenarios, empirical results demonstrate that CT can capture unsafe states and achieve strategies that approximate lower mortality rates, reducing the occurrence probability of unsafe behaviors.

[LG-115] Upcycling Large Language Models into Mixture of Experts

链接: https://arxiv.org/abs/2410.07524
作者: Ethan He,Abhinav Khattar,Ryan Prenger,Vijay Korthikanti,Zijie Yan,Tong Liu,Shiqing Fan,Ashwath Aithal,Mohammad Shoeybi,Bryan Catanzaro
关键词-EN: pre-trained dense language, Upcycling pre-trained dense, Upcycling, language models, Upcycling pre-trained
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel “virtual group” initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.

[LG-116] DemoShapley: Valuation of Demonstrations for In-Context Learning

链接: https://arxiv.org/abs/2410.07523
作者: Shan Xie,Man Luo,Chadly Daniel Stern,Mengnan Du,Lu Cheng
关键词-EN: needing task-specific fine-tuning, Large language models, Large language, leveraging in-context learning, task-specific fine-tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) leveraging in-context learning (ICL) have set new benchmarks in few-shot learning across various tasks without needing task-specific fine-tuning. However, extensive research has demonstrated that the effectiveness of ICL is significantly influenced by the selection and ordering of demonstrations. Considering the critical role of demonstration selection in ICL, we introduce DemoShapley which is inspired by the Data Shapley valuation theorem. This approach assesses the influence of individual demonstration instances, distinguishing between those that contribute positively and those that may hinder performance. Our findings reveal that DemoShapley not only enhances model performance in terms of accuracy and fairness but also generalizes queries from domains distinct from those of the in-context demonstrations, highlighting its versatility and effectiveness in optimizing ICL demonstration selection. Last but not least, DemoShapley demonstrates its ability to aid in identifying noisy data within the demonstration set.

[LG-117] MEMS Gyroscope Multi-Feature Calibration Using Machine Learning Technique

链接: https://arxiv.org/abs/2410.07519
作者: Yaoyao Long,Zhenming Liu,Cong Hao,Farrokh Ayazi
关键词-EN: accurate angular velocity, angular velocity measurements, MEMS gyroscopes offer, measurements in navigation, control systems
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Gyroscopes are crucial for accurate angular velocity measurements in navigation, stabilization, and control systems. MEMS gyroscopes offer advantages like compact size and low cost but suffer from errors and inaccuracies that are complex and time varying. This study leverages machine learning (ML) and uses multiple signals of the MEMS resonator gyroscope to improve its calibration. XGBoost, known for its high predictive accuracy and ability to handle complex, non-linear relationships, and MLP, recognized for its capability to model intricate patterns through multiple layers and hidden dimensions, are employed to enhance the calibration process. Our findings show that both XGBoost and MLP models significantly reduce noise and enhance accuracy and stability, outperforming the traditional calibration techniques. Despite higher computational costs, DL models are ideal for high-stakes applications, while ML models are efficient for consumer electronics and environmental monitoring. Both ML and DL models demonstrate the potential of advanced calibration techniques in enhancing MEMS gyroscope performance and calibration efficiency.

[LG-118] Evolutionary Contrastive Distillation for Language Model Alignment

链接: https://arxiv.org/abs/2410.07513
作者: Julian Katz-Samuels,Zheng Li,Hyokun Yun,Priyanka Nigam,Yi Xu,Vaclav Petricek,Bing Yin,Trishul Chilimbi
关键词-EN: Evolutionary Contrastive Distillation, real-world applications, complex instructions, large language models, execute complex instructions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to execute complex instructions is essential for their real-world applications. However, several recent studies indicate that LLMs struggle with challenging instructions. In this paper, we propose Evolutionary Contrastive Distillation (ECD), a novel method for generating high-quality synthetic preference data designed to enhance the complex instruction-following capability of language models. ECD generates data that specifically illustrates the difference between a response that successfully follows a set of complex instructions and a response that is high-quality, but nevertheless makes some subtle mistakes. This is done by prompting LLMs to progressively evolve simple instructions to more complex instructions. When the complexity of an instruction is increased, the original successful response to the original instruction becomes a “hard negative” response for the new instruction, mostly meeting requirements of the new instruction, but barely missing one or two. By pairing a good response with such a hard negative response, and employing contrastive learning algorithms such as DPO, we improve language models’ ability to follow complex instructions. Empirically, we observe that our method yields a 7B model that exceeds the complex instruction-following performance of current SOTA 7B models and is competitive even with open-source 70B models.

[LG-119] CSGDN: Contrastive Signed Graph Diffusion Network for Predicting Crop Gene-Trait Associations

链接: https://arxiv.org/abs/2410.07511
作者: Yiru Pan,Xingyu Ji,Jiaqi You,Lu Li,Zhenping Liu,Xianlong Zhang,Zeyu Zhang,Maojun Wang
关键词-EN: complex physiological functions, perform complex physiological, negative association preidiction, Signed Graph Diffusion, perform complex
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Positive and negative association preidiction between gene and trait help studies for crops to perform complex physiological functions. The transcription and regulation activity of specific genes will be adjusted accordingly in different cell types, developmental stages, and physiological states to meet the needs of organisms. Determing gene-trait associations can resolve the mechanism of trait formation and benefit the improvement of crop yield and quality. There are the following two problems in obtaining the positive/negative associations between gene and trait: 1) High-throughput DNA/RNA sequencing and trait data collection are expensive and time-consuming due to the need to process large sample sizes; 2) experiments introduce both random and systematic errors, and, at the same time, calculations or predictions using software or models may produce noise. To address these two issues, we propose a Contrastive Signed Graph Diffusion Network, CSGDN, to learn robust node representations with fewer training samples to achieve higher link prediction accuracy. CSGDN employs a signed graph diffusion method to uncover the underlying regulatory associations between genes and traits. Then, stochastic perterbation strategies are used to create two views for both original and diffusive graphs. At last, a multi-view contrastive learning paradigm loss is designed to unify the node presentations learned from the two views to resist interference and reduce noise. We conduct experiments to validate the performance of CSGDN on three crop datasets: Gossypium hirsutum, Brassica napus, and Triticum turgidum. The results demonstrate that the proposed model outperforms state-of-the-art methods by up to 9.28% AUC for link sign prediction in G. hirsutum dataset.

[LG-120] MOLA: Enhancing Industrial Process Monitoring Using Multi-Block Orthogonal Long Short-Term Memory Autoencoder

链接: https://arxiv.org/abs/2410.07508
作者: Fangyuan Ma,Cheng Ji,Jingde Wang,Wei Sun,Xun Tang,Zheyu Jiang
关键词-EN: Orthogonal Long short-term, Multi-block Orthogonal Long, memory Autoencoder paradigm, dynamic orthogonal features, Long short-term memory
类目: Machine Learning (cs.LG)
*备注: 21 pages, 9 figures, 9 tables. Submitted to Processes

点击查看摘要

Abstract:In this work, we introduce MOLA: a Multi-block Orthogonal Long short-term memory Autoencoder paradigm, to conduct accurate, reliable fault detection of industrial processes. To achieve this, MOLA effectively extracts dynamic orthogonal features by introducing an orthogonality-based loss function to constrain the latent space output. This helps eliminate the redundancy in the features identified, thereby improving the overall monitoring performance. On top of this, a multi-block monitoring structure is proposed, which categorizes the process variables into multiple blocks by leveraging expert process knowledge about their associations with the overall process. Each block is associated with its specific Orthogonal Long short-term memory Autoencoder model, whose extracted dynamic orthogonal features are monitored by distance-based Hotelling’s T^2 statistics and quantile-based cumulative sum (CUSUM) designed for multivariate data streams that are nonparametric, heterogeneous in nature. Compared to having a single model accounting for all process variables, such a multi-block structure improves the overall process monitoring performance significantly, especially for large-scale industrial processes. Finally, we propose an adaptive weight-based Bayesian fusion (W-BF) framework to aggregate all block-wise monitoring statistics into a global statistic that we monitor for faults, with the goal of improving fault detection speed by assigning weights to blocks based on the sequential order where alarms are raised. We demonstrate the efficiency and effectiveness of our MOLA framework by applying it to the Tennessee Eastman Process and comparing the performance with various benchmark methods.

[LG-121] CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression

链接: https://arxiv.org/abs/2410.07505
作者: Wenyuan Liu,Xindian Ma,Peng Zhang,Yan Wang
关键词-EN: compressing Large Language, Large Language Models, compressing Large, quantization kernel, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs). While many studies focus on quantizing both weights and activations, it is still a challenge to maintain the accuracy of LLM after activating quantization. To investigate the primary cause, we extend the concept of kernel from linear algebra to quantization functions to define a new term, “quantization kernel”, which refers to the set of elements in activations that are quantized to zero. Through quantitative analysis of the quantization kernel, we find that these elements are crucial for maintaining the accuracy of quantized LLMs. With the decrease of quantization kernel, the precision of quantized LLMs increases. If the quantization kernel proportion is kept below 19% for OPT models and below 1% for LLaMA models, the precision loss from quantizing activations to INT8 becomes negligible. Motivated by the goal of developing a quantization method with small quantization kernel, we propose CrossQuant: a simple yet effective method for quantizing activations. CrossQuant cross-quantizes elements using row and column-wise absolute maximum vectors, achieving a quantization kernel of approximately 16% for OPT models and less than 0.1% for LLaMA models. Experimental results on LLMs (LLaMA, OPT) ranging from 6.7B to 70B parameters demonstrate that CrossQuant improves or maintains perplexity and accuracy in language modeling, zero-shot, and few-shot tasks.

[LG-122] Adaptive Batch Size for Privately Finding Second-Order Stationary Points

链接: https://arxiv.org/abs/2410.07502
作者: Daogao Liu,Kunal Talwar
关键词-EN: first-order stationary point, second-order stationary point, stationary point, differential privacy constraints, first-order stationary
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:There is a gap between finding a first-order stationary point (FOSP) and a second-order stationary point (SOSP) under differential privacy constraints, and it remains unclear whether privately finding an SOSP is more challenging than finding an FOSP. Specifically, Ganesh et al. (2023) demonstrated that an \alpha -SOSP can be found with \alpha=O(\frac1n^1/3+(\frac\sqrtdn\epsilon)^3/7) , where n is the dataset size, d is the dimension, and \epsilon is the differential privacy parameter. Building on the SpiderBoost algorithm framework, we propose a new approach that uses adaptive batch sizes and incorporates the binary tree mechanism. Our method improves the results for privately finding an SOSP, achieving \alpha=O(\frac1n^1/3+(\frac\sqrtdn\epsilon)^1/2) . This improved bound matches the state-of-the-art for finding an FOSP, suggesting that privately finding an SOSP may be achievable at no additional cost.

[LG-123] Inferring biological processes with intrinsic noise from cross-sectional data

链接: https://arxiv.org/abs/2410.07501
作者: Suryanarayana Maddu,Victor Chardès,Michael. J. Shelley
关键词-EN: Inferring dynamical models, computational biology, dynamical models, significant challenge, challenge in computational
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Inferring dynamical models from data continues to be a significant challenge in computational biology, especially given the stochastic nature of many biological processes. We explore a common scenario in omics, where statistically independent cross-sectional samples are available at a few time points, and the goal is to infer the underlying diffusion process that generated the data. Existing inference approaches often simplify or ignore noise intrinsic to the system, compromising accuracy for the sake of optimization ease. We circumvent this compromise by inferring the phase-space probability flow that shares the same time-dependent marginal distributions as the underlying stochastic process. Our approach, probability flow inference (PFI), disentangles force from intrinsic stochasticity while retaining the algorithmic ease of ODE inference. Analytically, we prove that for Ornstein-Uhlenbeck processes the regularized PFI formalism yields a unique solution in the limit of well-sampled distributions. In practical applications, we show that PFI enables accurate parameter and force estimation in high-dimensional stochastic reaction networks, and that it allows inference of cell differentiation dynamics with molecular noise, outperforming state-of-the-art approaches.

[LG-124] Dense Optimizer : An Information Entropy-Guided Structural Search Method for Dense-like Neural Network Design

链接: https://arxiv.org/abs/2410.07499
作者: Liu Tianyuan,Hou Libin,Wang Linyuan,Song Xiyu,Yan Bin
关键词-EN: Dense Convolutional Network, Dense Optimizer, Dense Convolutional, efficient structure, Convolutional Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages,3 figures

点击查看摘要

Abstract:Dense Convolutional Network has been continuously refined to adopt a highly efficient and compact architecture, owing to its lightweight and efficient structure. However, the current Dense-like architectures are mainly designed manually, it becomes increasingly difficult to adjust the channels and reuse level based on past experience. As such, we propose an architecture search method called Dense Optimizer that can search high-performance dense-like network automatically. In Dense Optimizer, we view the dense network as a hierarchical information system, maximize the network’s information entropy while constraining the distribution of the entropy across each stage via a power law, thereby constructing an optimization problem. We also propose a branch-and-bound optimization algorithm, tightly integrates power-law principle with search space scaling to solve the optimization problem efficiently. The superiority of Dense Optimizer has been validated on different computer vision benchmark datasets. Specifically, Dense Optimizer completes high-quality search but only costs 4 hours with one CPU. Our searched model DenseNet-OPT achieved a top 1 accuracy of 84.3% on CIFAR-100, which is 5.97% higher than the original one.

[LG-125] Gem: Gaussian Mixture Model Embeddings for Numerical Feature Distributions

链接: https://arxiv.org/abs/2410.07485
作者: Hafiz Tayyab Rauf,Alex Bogatu,Norman W. Paton,Andre Freitas
关键词-EN: including entity resolution, Gaussian mixture model, including entity, entity resolution, Gaussian mixture
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Embeddings are now used to underpin a wide variety of data management tasks, including entity resolution, dataset search and semantic type detection. Such applications often involve datasets with numerical columns, but there has been more emphasis placed on the semantics of categorical data in embeddings than on the distinctive features of numerical data. In this paper, we propose a method called Gem (Gaussian mixture model embeddings) that creates embeddings that build on numerical value distributions from columns. The proposed method specializes a Gaussian Mixture Model (GMM) to identify and cluster columns with similar value distributions. We introduce a signature mechanism that generates a probability matrix for each column, indicating its likelihood of belonging to specific Gaussian components, which can be used for different applications, such as to determine semantic types. Finally, we generate embeddings for three numerical data properties: distributional, statistical, and contextual. Our core method focuses solely on numerical columns without using table names or neighboring columns for context. However, the method can be combined with other types of evidence, and we later integrate attribute names with the Gaussian embeddings to evaluate the method’s contribution to improving overall performance. We compare Gem with several baseline methods for numeric only and numeric + context tasks, showing that Gem consistently outperforms the baselines on four benchmark datasets.

[LG-126] Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations

链接: https://arxiv.org/abs/2410.07476
作者: Wilson Wu,Louis Jaburi,Jacob Drori,Jason Gross
关键词-EN: neural networks trained, networks trained, neural networks, finite groups, recent line
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 4 figures

点击查看摘要

Abstract:A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups. We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure and producing a more complete description of such models that unifies the explanations of previous works. Notably, these models approximate equivariance in each input argument. We verify that our explanation applies to a large fraction of networks trained on this task by translating it into a compact proof of model performance, a quantitative evaluation of model understanding. In particular, our explanation yields a guarantee of model accuracy that runs in 30% the time of brute force and gives a =95% accuracy bound for 45% of the models we trained. We were unable to obtain nontrivial non-vacuous accuracy bounds using only explanations from previous works.

[LG-127] Exploring the design space of deep-learning-based weather forecasting systems

链接: https://arxiv.org/abs/2410.07472
作者: Shoaib Ahmed Siddiqui,Jean Kossaifi,Boris Bonev,Christopher Choy,Jan Kautz,David Krueger,Kamyar Azizzadenesheli
关键词-EN: progress in developing, tremendous progress, models, architectures, choices including architecture
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite tremendous progress in developing deep-learning-based weather forecasting systems, their design space, including the impact of different design choices, is yet to be well understood. This paper aims to fill this knowledge gap by systematically analyzing these choices including architecture, problem formulation, pretraining scheme, use of image-based pretrained models, loss functions, noise injection, multi-step inputs, additional static masks, multi-step finetuning (including larger stride models), as well as training on a larger dataset. We study fixed-grid architectures such as UNet, fully convolutional architectures, and transformer-based models, along with grid-invariant architectures, including graph-based and operator-based models. Our results show that fixed-grid architectures outperform grid-invariant architectures, indicating a need for further architectural developments in grid-invariant models such as neural operators. We therefore propose a hybrid system that combines the strong performance of fixed-grid models with the flexibility of grid-invariant architectures. We further show that multi-step fine-tuning is essential for most deep-learning models to work well in practice, which has been a common practice in the past. Pretraining objectives degrade performance in comparison to supervised training, while image-based pretrained models provide useful inductive biases in some cases in comparison to training the model from scratch. Interestingly, we see a strong positive effect of using a larger dataset when training a smaller model as compared to training on a smaller dataset for longer. Larger models, on the other hand, primarily benefit from just an increase in the computational budget. We believe that these results will aid in the design of better weather forecasting systems in the future.

[LG-128] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection

链接: https://arxiv.org/abs/2410.07471
作者: Han Shen,Pin-Yu Chen,Payel Das,Tianyi Chen
关键词-EN: leveraging Large Language, Large Language Models, Large Language, boost downstream performance, leveraging Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning the models on several adversarial samples or even benign data can greatly comprise the model’s pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on the bilevel optimization to up rank the safe and high-quality fine-tuning data and down rank the unsafe or low-quality ones. Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win rate increase compared to random selection respectively on Llama-3-8b-Instruct and Merlinite-7b models. Our code is available on github this https URL.

[LG-129] Systematic Feature Design for Cycle Life Prediction of Lithium-Ion Batteries During Formation

链接: https://arxiv.org/abs/2410.07458
作者: Jinwook Rhyu,Joachim Schaeffer,Michael L. Li,Xiao Cui,William C. Chueh,Martin Z. Bazant,Richard D. Braatz
关键词-EN: long testing time, lithium-ion battery manufacturing, solid electrolyte interphase, electrolyte interphase formation, cycle life prediction
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Main: 27 pages, 6 figures. SI: 13 pages, 9 figures

点击查看摘要

Abstract:Optimization of the formation step in lithium-ion battery manufacturing is challenging due to limited physical understanding of solid electrolyte interphase formation and the long testing time (~100 days) for cells to reach the end of life. We propose a systematic feature design framework that requires minimal domain knowledge for accurate cycle life prediction during formation. Two simple Q(V) features designed from our framework, extracted from formation data without any additional diagnostic cycles, achieved a median of 9.20% error for cycle life prediction, outperforming thousands of autoML models using pre-defined features. We attribute the strong performance of our designed features to their physical origins - the voltage ranges identified by our framework capture the effects of formation temperature and microscopic particle resistance heterogeneity. By designing highly interpretable features, our approach can accelerate formation research, leveraging the interplay between data-driven feature design and mechanistic understanding.

[LG-130] SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders

链接: https://arxiv.org/abs/2410.07456
作者: Constantin Venhoff,Anisoara Calinescu,Philip Torr,Christian Schroeder de Witt
关键词-EN: ground truth features, ground truth, SAEs, key challenge, truth features
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A key challenge in interpretability is to decompose model activations into meaningful features. Sparse autoencoders (SAEs) have emerged as a promising tool for this task. However, a central problem in evaluating the quality of SAEs is the absence of ground truth features to serve as an evaluation gold standard. Current evaluation methods for SAEs are therefore confronted with a significant trade-off: SAEs can either leverage toy models or other proxies with predefined ground truth features; or they use extensive prior knowledge of realistic task circuits. The former limits the generalizability of the evaluation results, while the latter limits the range of models and tasks that can be used for evaluations. We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, a ground truth evaluation framework for SAEs that scales to large state-of-the-art SAEs and models. We demonstrate that our method can automatically identify task-specific activations and compute ground truth features at these points. Compared to previous methods we reduce the training overhead by introducing a novel reconstruction method that allows to apply residual stream SAEs to sublayer activations. This eliminates the need for SAEs trained on every task-specific activation location. Then we validate the scalability of our framework, by evaluating SAEs on novel tasks on Pythia70M, GPT-2 Small, and Gemma-2-2. Our framework therefore paves the way for generalizable, large-scale evaluations of SAEs in interpretability research.

[LG-131] Collective variables of neural networks: empirical time evolution and scaling laws

链接: https://arxiv.org/abs/2410.07451
作者: Samuel Tovey,Sven Krippendorf,Michael Spannowsky,Konstantin Nikolaou,Christian Holm
关键词-EN: neural networks, neural, understanding learning dynamics, neural network architectures, neural network representations
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:This work presents a novel means for understanding learning dynamics and scaling relations in neural networks. We show that certain measures on the spectrum of the empirical neural tangent kernel, specifically entropy and trace, yield insight into the representations learned by a neural network and how these can be improved through architecture scaling. These results are demonstrated first on test cases before being shown on more complex networks, including transformers, auto-encoders, graph neural networks, and reinforcement learning studies. In testing on a wide range of architectures, we highlight the universal nature of training dynamics and further discuss how it can be used to understand the mechanisms behind learning in neural networks. We identify two such dominant mechanisms present throughout machine learning training. The first, information compression, is seen through a reduction in the entropy of the NTK spectrum during training, and occurs predominantly in small neural networks. The second, coined structure formation, is seen through an increasing entropy and thus, the creation of structure in the neural network representations beyond the prior established by the network at initialization. Due to the ubiquity of the latter in deep neural network architectures and its flexibility in the creation of feature-rich representations, we argue that this form of evolution of the network’s entropy be considered the onset of a deep learning regime.

[LG-132] nyLidarNet: 2D LiDAR-based End-to-End Deep Learning Model for F1TENTH Autonomous Racing

链接: https://arxiv.org/abs/2410.07447
作者: Mohammed Misbah Zarrar,Qitao Weng,Bakhbyergyen Yerjan,Ahmet Soyyigit,Heechul Yun
关键词-EN: raw sensory data, Prior research, sensory data, research has demonstrated, demonstrated the effectiveness
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prior research has demonstrated the effectiveness of end-to-end deep learning for robotic navigation, where the control signals are directly derived from raw sensory data. However, the majority of existing end-to-end navigation solutions are predominantly camera-based. In this paper, we introduce TinyLidarNet, a lightweight 2D LiDAR-based end-to-end deep learning model for autonomous racing. An F1TENTH vehicle using TinyLidarNet won 3rd place in the 12th F1TENTH Autonomous Grand Prix competition, demonstrating its competitive performance. We systematically analyze its performance on untrained tracks and computing requirements for real-time processing. We find that TinyLidarNet’s 1D Convolutional Neural Network (CNN) based architecture significantly outperforms widely used Multi-Layer Perceptron (MLP) based architecture. In addition, we show that it can be processed in real-time on low-end micro-controller units (MCUs).

[LG-133] KACQ-DCNN: Uncertainty-Aware Interpretable Kolmogorov-Arnold Classical-Quantum Dual-Channel Neural Network for Heart Disease Detection

链接: https://arxiv.org/abs/2410.07446
作者: Md Abrar Jahin,Md. Akmol Masud,M. F. Mridha,Zeyar Aung,Nilanjan Dey
关键词-EN: global health challenge, million annual deaths, improved diagnostic tools, major global health, Heart failure remains
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heart failure remains a major global health challenge, contributing significantly to the 17.8 million annual deaths from cardiovascular disease, highlighting the need for improved diagnostic tools. Current heart disease prediction models based on classical machine learning face limitations, including poor handling of high-dimensional, imbalanced data, limited performance on small datasets, and a lack of uncertainty quantification, while also being difficult for healthcare professionals to interpret. To address these issues, we introduce KACQ-DCNN, a novel classical-quantum hybrid dual-channel neural network that replaces traditional multilayer perceptrons and convolutional layers with Kolmogorov-Arnold Networks (KANs). This approach enhances function approximation with learnable univariate activation functions, reducing model complexity and improving generalization. The KACQ-DCNN 4-qubit 1-layered model significantly outperforms 37 benchmark models across multiple metrics, achieving an accuracy of 92.03%, a macro-average precision, recall, and F1 score of 92.00%, and an ROC-AUC score of 94.77%. Ablation studies demonstrate the synergistic benefits of combining classical and quantum components with KAN. Additionally, explainability techniques like LIME and SHAP provide feature-level insights, improving model transparency, while uncertainty quantification via conformal prediction ensures robust probability estimates. These results suggest that KACQ-DCNN offers a promising path toward more accurate, interpretable, and reliable heart disease predictions, paving the way for advancements in cardiovascular healthcare.

[LG-134] Zero-Shot Generalization of Vision-Based RL Without Data Augmentation

链接: https://arxiv.org/abs/2410.07441
作者: Sumeet Batra,Gaurav S. Sukhatme
关键词-EN: Generalizing vision-based reinforcement, vision-based reinforcement learning, Generalizing vision-based, reinforcement learning, open challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Generalizing vision-based reinforcement learning (RL) agents to novel environments remains a difficult and open challenge. Current trends are to collect large-scale datasets or use data augmentation techniques to prevent overfitting and improve downstream generalization. However, the computational and data collection costs increase exponentially with the number of task variations and can destabilize the already difficult task of training RL agents. In this work, we take inspiration from recent advances in computational neuroscience and propose a model, Associative Latent DisentAnglement (ALDA), that builds on standard off-policy RL towards zero-shot generalization. Specifically, we revisit the role of latent disentanglement in RL and show how combining it with a model of associative memory achieves zero-shot generalization on difficult task variations without relying on data augmentation. Finally, we formally show that data augmentation techniques are a form of weak disentanglement and discuss the implications of this insight.

[LG-135] oward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap

链接: https://arxiv.org/abs/2410.07436
作者: Georgia Channing,Juil Sock,Ronald Clark,Philip Torr,Christian Schroeder de Witt
关键词-EN: election security, generated audio deepfakes, audio deepfakes poses, audio deepfake detectors, rapid proliferation
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The rapid proliferation of AI-manipulated or generated audio deepfakes poses serious challenges to media integrity and election security. Current AI-driven detection solutions lack explainability and underperform in real-world settings. In this paper, we introduce novel explainability methods for state-of-the-art transformer-based audio deepfake detectors and open-source a novel benchmark for real-world generalizability. By narrowing the explainability gap between transformer-based audio deepfake detectors and traditional methods, our results not only build trust with human experts, but also pave the way for unlocking the potential of citizen intelligence to overcome the scalability issue in audio deepfake detection.

[LG-136] Can Transformers Reason Logically? A Study in SAT Solving

链接: https://arxiv.org/abs/2410.07432
作者: Leyan Pan,Vijay Ganesh,Jacob Abernethy,Chris Esposo,Wenke Lee
关键词-EN: Boolean satisfiability, logical reasoning capabilities, study the logical, capabilities of LLMs, solve SAT
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 29 pages, 4 Figures

点击查看摘要

Abstract:We theoretically and empirically study the logical reasoning capabilities of LLMs in the context of the Boolean satisfiability (SAT) problem. First, we construct a decoder-only Transformer that can solve SAT using backtracking and deduction via Chain-of-Thought (CoT). We prove its correctness by showing trace equivalence to the well-known DPLL SAT-solving algorithm. Second, to support the implementation of this abstract construction, we design a compiler \textttPARAT that takes as input a procedural specification and outputs a transformer model implementing this specification. Third, rather than \textitprogramming a transformer to reason, we evaluate empirically whether it can be \textittrained to do so by learning directly from algorithmic traces (“reasoning paths”) of the DPLL algorithm.

[LG-137] EventFlow: Forecasting Continuous-Time Event Data with Flow Matching

链接: https://arxiv.org/abs/2410.07430
作者: Gavin Kerrigan,Kai Nelson,Padhraic Smyth
关键词-EN: Continuous-time event sequences, irregular intervals, scientific domains, Continuous-time event, occur at irregular
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Continuous-time event sequences, in which events occur at irregular intervals, are ubiquitous across a wide range of industrial and scientific domains. The contemporary modeling paradigm is to treat such data as realizations of a temporal point process, and in machine learning it is common to model temporal point processes in an autoregressive fashion using a neural network. While autoregressive models are successful in predicting the time of a single subsequent event, their performance can be unsatisfactory in forecasting longer horizons due to cascading errors. We propose EventFlow, a non-autoregressive generative model for temporal point processes. Our model builds on the flow matching framework in order to directly learn joint distributions over event times, side-stepping the autoregressive process. EventFlow is likelihood-free, easy to implement and sample from, and either matches or surpasses the performance of state-of-the-art models in both unconditional and conditional generation tasks on a set of standard benchmarks

[LG-138] A Generalization Bound for a Family of Implicit Networks

链接: https://arxiv.org/abs/2410.07427
作者: Samy Wu Fung,Benjamin Berkels
关键词-EN: Implicit networks, fixed point operators, neural networks, implicit networks defined, fixed point
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Implicit networks are a class of neural networks whose outputs are defined by the fixed point of a parameterized operator. They have enjoyed success in many applications including natural language processing, image processing, and numerous other applications. While they have found abundant empirical success, theoretical work on its generalization is still under-explored. In this work, we consider a large family of implicit networks defined parameterized contractive fixed point operators. We show a generalization bound for this class based on a covering number argument for the Rademacher complexity of these architectures.

[LG-139] CAFEEN: A Cooperative Approach for Energy Efficient NoCs with Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2410.07426
作者: Kamil Khan,Sudeep Pasricha
关键词-EN: efficient power management, emerging high-performance, efficient power, power management, management is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:In emerging high-performance Network-on-Chip (NoC) architectures, efficient power management is crucial to minimize energy consumption. We propose a novel framework called CAFEEN that employs both heuristic-based fine-grained and machine learning-based coarse-grained power-gating for energy-efficient NoCs. CAFEEN uses a fine-grained method to activate only essential NoC buffers during lower network loads. It switches to a coarse-grained method at peak loads to minimize compounding wake-up overhead using multi-agent reinforcement learning. Results show that CAFEEN adaptively balances power-efficiency with performance, reducing total energy by 2.60x for single application workloads and 4.37x for multi-application workloads, compared to state-of-the-art NoC power-gating frameworks.

[LG-140] Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions

链接: https://arxiv.org/abs/2410.07409
作者: Isaac Remy,David Fridovich-Keil,Karen Leung
关键词-EN: package delivery, contextual cues, efficient multi-agent interaction, driving to package, dynamics are influenced
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:From autonomous driving to package delivery, ensuring safe yet efficient multi-agent interaction is challenging as the interaction dynamics are influenced by hard-to-model factors such as social norms and contextual cues. Understanding these influences can aid in the design and evaluation of socially-aware autonomous agents whose behaviors are aligned with human values. In this work, we seek to codify factors governing safe multi-agent interactions via the lens of responsibility, i.e., an agent’s willingness to deviate from their desired control to accommodate safe interaction with others. Specifically, we propose a data-driven modeling approach based on control barrier functions and differentiable optimization that efficiently learns agents’ responsibility allocation from data. We demonstrate on synthetic and real-world datasets that we can obtain an interpretable and quantitative understanding of how much agents adjust their behavior to ensure the safety of others given their current environment.

[LG-141] Fostering Intrinsic Motivation in Reinforcement Learning with Pretrained Foundation Models

链接: https://arxiv.org/abs/2410.07404
作者: Alain Andres,Javier Del Ser
关键词-EN: sparse or non-existent, remains a significant, significant challenge, challenge in reinforcement, environments where extrinsic
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Exploration remains a significant challenge in reinforcement learning, especially in environments where extrinsic rewards are sparse or non-existent. The recent rise of foundation models, such as CLIP, offers an opportunity to leverage pretrained, semantically rich embeddings that encapsulate broad and reusable knowledge. In this work we explore the potential of these foundation models not just to drive exploration, but also to analyze the critical role of the episodic novelty term in enhancing exploration effectiveness of the agent. We also investigate whether providing the intrinsic module with complete state information – rather than just partial observations – can improve exploration, despite the difficulties in handling small variations within large state spaces. Our experiments in the MiniGrid domain reveal that intrinsic modules can effectively utilize full state information, significantly increasing sample efficiency while learning an optimal policy. Moreover, we show that the embeddings provided by foundation models are sometimes even better than those constructed by the agent during training, further accelerating the learning process, especially when coupled with the episodic novelty term to enhance exploration.

[LG-142] Aligning AI-driven discovery with human intuition

链接: https://arxiv.org/abs/2410.07397
作者: Kevin Zhang,Hod Lipson
关键词-EN: challenge is emerging, making these models, models more compatible, data-driven modeling, existing human knowledge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As data-driven modeling of physical dynamical systems becomes more prevalent, a new challenge is emerging: making these models more compatible and aligned with existing human knowledge. AI-driven scientific modeling processes typically begin with identifying hidden state variables, then deriving governing equations, followed by predicting and analyzing future behaviors. The critical initial step of identification of an appropriate set of state variables remains challenging for two reasons. First, finding a compact set of meaningfully predictive variables is mathematically difficult and under-defined. A second reason is that variables found often lack physical significance, and are therefore difficult for human scientists to interpret. We propose a new general principle for distilling representations that are naturally more aligned with human intuition, without relying on prior physical knowledge. We demonstrate our approach on a number of experimental and simulated system where the variables generated by the AI closely resemble those chosen independently by human scientists. We suggest that this principle can help make human-AI collaboration more fruitful, as well as shed light on how humans make scientific modeling choices.

[LG-143] LLM Embeddings Improve Test-time Adaptation to Tabular Y|X-Shifts

链接: https://arxiv.org/abs/2410.07395
作者: Yibo Zeng,Jiashuo Liu,Henry Lam,Hongseok Namkoong
关键词-EN: label and covariates, missing variables, common due, due to missing, tabular datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:For tabular datasets, the change in the relationship between the label and covariates ( Y|X -shifts) is common due to missing variables (a.k.a. confounders). Since it is impossible to generalize to a completely new and unknown domain, we study models that are easy to adapt to the target domain even with few labeled examples. We focus on building more informative representations of tabular data that can mitigate Y|X -shifts, and propose to leverage the prior world knowledge in LLMs by serializing (write down) the tabular data to encode it. We find LLM embeddings alone provide inconsistent improvements in robustness, but models trained on them can be well adapted/finetuned to the target domain even using 32 labeled observations. Our finding is based on a comprehensive and systematic study consisting of 7650 source-target pairs and benchmark against 261,000 model configurations trained by 22 algorithms. Our observation holds when ablating the size of accessible target data and different adaptation strategies. The code is available at this https URL.

[LG-144] Learning-Based Shielding for Safe Autonomy under Unknown Dynamics

链接: https://arxiv.org/abs/2410.07359
作者: Robert Reed,Morteza Lahijanian
关键词-EN: Markov Decision Processes, neural network controller, deep reinforcement learning, neural network, Deep Kernel Learning
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures

点击查看摘要

Abstract:Shielding is a common method used to guarantee the safety of a system under a black-box controller, such as a neural network controller from deep reinforcement learning (DRL), with simpler, verified controllers. Existing shielding methods rely on formal verification through Markov Decision Processes (MDPs), assuming either known or finite-state models, which limits their applicability to DRL settings with unknown, continuous-state systems. This paper addresses these limitations by proposing a data-driven shielding methodology that guarantees safety for unknown systems under black-box controllers. The approach leverages Deep Kernel Learning to model the systems’ one-step evolution with uncertainty quantification and constructs a finite-state abstraction as an Interval MDP (IMDP). By focusing on safety properties expressed in safe linear temporal logic (safe LTL), we develop an algorithm that computes the maximally permissive set of safe policies on the IMDP, ensuring avoidance of unsafe states. The algorithms soundness and computational complexity are demonstrated through theoretical proofs and experiments on nonlinear systems, including a high-dimensional autonomous spacecraft scenario.

[LG-145] Generating Origin-Destination Matrices in Neural Spatial Interaction Models

链接: https://arxiv.org/abs/2410.07352
作者: Ioannis Zachos,Mark Girolami,Theodoros Damoulas
关键词-EN: Agent-based models, areas in transportation, proliferating as decision-making, decision-making tools, tools across policy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Agent-based models (ABMs) are proliferating as decision-making tools across policy areas in transportation, economics, and epidemiology. In these models, a central object of interest is the discrete origin-destination matrix which captures spatial interactions and agent trip counts between locations. Existing approaches resort to continuous approximations of this matrix and subsequent ad-hoc discretisations in order to perform ABM simulation and calibration. This impedes conditioning on partially observed summary statistics, fails to explore the multimodal matrix distribution over a discrete combinatorial support, and incurs discretisation errors. To address these challenges, we introduce a computationally efficient framework that scales linearly with the number of origin-destination pairs, operates directly on the discrete combinatorial space, and learns the agents’ trip intensity through a neural differential equation that embeds spatial interactions. Our approach outperforms the prior art in terms of reconstruction error and ground truth matrix coverage, at a fraction of the computational cost. We demonstrate these benefits in large-scale spatial mobility ABMs in Cambridge, UK and Washington, DC, USA.

[LG-146] MoE: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

链接: https://arxiv.org/abs/2410.07348
作者: Peng Jin,Bo Zhu,Li Yuan,Shuicheng Yan
关键词-EN: MoE, aim to simultaneously, simultaneously enhance, enhance the effectiveness, effectiveness and efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 23 pages, Code: this https URL

点击查看摘要

Abstract:In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods. To achieve this, we propose MoE++, a general and heterogeneous MoE framework that integrates both Feed-Forward Network~(FFN) and zero-computation experts. Specifically, we introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. This design offers three key advantages: (i) Low Computing Overhead: Unlike the uniform mixing mechanism for all tokens within vanilla MoE, MoE++ allows each token to engage with a dynamic number of FFNs, be adjusted by constant vectors, or even skip the MoE layer entirely. (ii) High Performance: By enabling simple tokens to utilize fewer FFN experts, MoE++ allows more experts to focus on challenging tokens, thereby unlocking greater performance potential than vanilla MoE. (iii) Deployment Friendly: Given that zero-computation experts have negligible parameters, we can deploy all zero-computation experts on each GPU, eliminating the significant communication overhead and expert load imbalance associated with FFN experts distributed across different GPUs. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts. Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size, which lays a solid foundation for developing advanced and efficient MoE-related models.

[LG-147] owards Generalisable Time Series Understanding Across Domains

链接: https://arxiv.org/abs/2410.07299
作者: Özgün Turgut,Philip Müller,Martin J. Menten,Daniel Rueckert
关键词-EN: datasets unlocks foundational, natural language processing, large datasets unlocks, time series, unlocks foundational model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In natural language processing and computer vision, self-supervised pre-training on large datasets unlocks foundational model capabilities across domains and tasks. However, this potential has not yet been realised in time series analysis, where existing methods disregard the heterogeneous nature of time series characteristics. Time series are prevalent in many domains, including medicine, engineering, natural sciences, and finance, but their characteristics vary significantly in terms of variate count, inter-variate relationships, temporal dynamics, and sampling frequency. This inherent heterogeneity across domains prevents effective pre-training on large time series corpora. To address this issue, we introduce OTiS, an open model for general time series analysis, that has been specifically designed to handle multi-domain heterogeneity. We propose a novel pre-training paradigm including a tokeniser with learnable domain-specific signatures, a dual masking strategy to capture temporal causality, and a normalised cross-correlation loss to model long-range dependencies. Our model is pre-trained on a large corpus of 640,187 samples and 11 billion time points spanning 8 distinct domains, enabling it to analyse time series from any (unseen) domain. In comprehensive experiments across 15 diverse applications - including classification, regression, and forecasting - OTiS showcases its ability to accurately capture domain-specific data characteristics and demonstrates its competitiveness against state-of-the-art baselines. Our code and pre-trained weights are publicly available at this https URL.

[LG-148] IterGen: Iterative Structured LLM Generation

链接: https://arxiv.org/abs/2410.07295
作者: Shubham Ugare,Rohan Gumaste,Tarun Suresh,Gagandeep Singh,Sasa Misailovic
关键词-EN: Large Language Models, Large Language, Language Models, Large, Models
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used for tasks such as natural language and code generation. Still, their outputs often suffer from issues like privacy violations, and semantically inaccurate code generation. Current libraries for LLM generation rely on left-to-right decoding without systematic support for backtracking, limiting the ability to correct or refine outputs mid-generation. To address this issue, we introduce IterGen, an intuitive framework for iterative, grammar-guided LLM generation that enables users to move both forward and backward within the generated output based on grammar symbols. By leveraging a symbol-to-position mapping, IterGen ensures efficient and structured generation while allowing for corrections during the process. We demonstrate IterGen’s effectiveness in two important applications: reducing privacy leakage in LLM outputs and improving the accuracy of LLM-generated SQL queries. Our code is available at this https URL Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL) Cite as: arXiv:2410.07295 [cs.SE] (or arXiv:2410.07295v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2410.07295 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-149] Principal Orthogonal Latent Components Analysis (POLCA Net)

链接: https://arxiv.org/abs/2410.07289
作者: Jose Antonio Martin H.,Freddy Perozo,Manuel Lopez
关键词-EN: Components Analysis Network, raw data, POLCA Net, pivotal area, field of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Representation learning is a pivotal area in the field of machine learning, focusing on the development of methods to automatically discover the representations or features needed for a given task from raw data. Unlike traditional feature engineering, which requires manual crafting of features, representation learning aims to learn features that are more useful and relevant for tasks such as classification, prediction, and clustering. We introduce Principal Orthogonal Latent Components Analysis Network (POLCA Net), an approach to mimic and extend PCA and LDA capabilities to non-linear domains. POLCA Net combines an autoencoder framework with a set of specialized loss functions to achieve effective dimensionality reduction, orthogonality, variance-based feature sorting, high-fidelity reconstructions, and additionally, when used with classification labels, a latent representation well suited for linear classifiers and low dimensional visualization of class distribution as well.

[LG-150] Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning NEURIPS’24

链接: https://arxiv.org/abs/2410.07286
作者: Zhilong Li,Xiaohu Wu,Xiaoli Tang,Tiantian He,Yew-Soon Ong,Mengmeng Chen,Qiqi Liu,Qicheng Lao,Xiaoxiao Li,Han Yu
关键词-EN: clients’ local datasets, growing research interest, local datasets, interest in measuring, measuring the statistical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to FL@FM-NeurIPS’24

点击查看摘要

Abstract:There is growing research interest in measuring the statistical heterogeneity of clients’ local datasets. Such measurements are used to estimate the suitability for collaborative training of personalized federated learning (PFL) models. Currently, these research endeavors are taking place in silos and there is a lack of a unified benchmark to provide a fair and convenient comparison among various approaches in common settings. We aim to bridge this important gap in this paper. The proposed benchmarking framework currently includes six representative approaches. Extensive experiments have been conducted to compare these approaches under five standard non-IID FL settings, providing much needed insights into which approaches are advantageous under which settings. The proposed framework offers useful guidance on the suitability of various data divergence measures in FL systems. It is beneficial for keeping related research activities on the right track in terms of: (1) designing PFL schemes, (2) selecting appropriate data heterogeneity evaluation approaches for specific FL application scenarios, and (3) addressing fairness issues in collaborative model training. The code is available at this https URL.

[LG-151] A Utility-Mining-Driven Active Learning Approach for Analyzing Clickstream Sequences

链接: https://arxiv.org/abs/2410.07282
作者: Danny Y. C. Wang,Lars Arne Jordanger,Jerry Chun-Wei Lin
关键词-EN: evolving e-commerce industry, Sequential Pattern Mining, rapidly evolving e-commerce, selecting high-quality data, High-Utility Sequential Pattern
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, preprint version

点击查看摘要

Abstract:In rapidly evolving e-commerce industry, the capability of selecting high-quality data for model training is essential. This study introduces the High-Utility Sequential Pattern Mining using SHAP values (HUSPM-SHAP) model, a utility mining-based active learning strategy to tackle this challenge. We found that the parameter settings for positive and negative SHAP values impact the model’s mining outcomes, introducing a key consideration into the active learning framework. Through extensive experiments aimed at predicting behaviors that do lead to purchases or not, the designed HUSPM-SHAP model demonstrates its superiority across diverse scenarios. The model’s ability to mitigate labeling needs while maintaining high predictive performance is highlighted. Our findings demonstrate the model’s capability to refine e-commerce data processing, steering towards more streamlined, cost-effective prediction modeling.

[LG-152] Mitigation of gender bias in automatic facial non-verbal behaviors generation

链接: https://arxiv.org/abs/2410.07274
作者: Alice Delbosc(TALEP, LIS, AMU),Magalie Ochs(LIS, AMU, R2I),Nicolas Sabouret(CPU, LISN),Brian Ravenet(CPU, LISN),Stephane Ayache(AMU, LIS, QARMA)
关键词-EN: interactive agents focuses, social interactive agents, social interactive, believability and synchronization, Research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Research on non-verbal behavior generation for social interactive agents focuses mainly on the believability and synchronization of non-verbal cues with speech. However, existing models, predominantly based on deep learning architectures, often perpetuate biases inherent in the training data. This raises ethical concerns, depending on the intended application of these agents. This paper addresses these issues by first examining the influence of gender on facial non-verbal behaviors. We concentrate on gaze, head movements, and facial expressions. We introduce a classifier capable of discerning the gender of a speaker from their non-verbal cues. This classifier achieves high accuracy on both real behavior data, extracted using state-of-the-art tools, and synthetic data, generated from a model developed in previous this http URL upon this work, we present a new model, FairGenderGen, which integrates a gender discriminator and a gradient reversal layer into our previous behavior generation model. This new model generates facial non-verbal behaviors from speech features, mitigating gender sensitivity in the generated behaviors. Our experiments demonstrate that the classifier, developed in the initial phase, is no longer effective in distinguishing the gender of the speaker from the generated non-verbal behaviors.

[LG-153] BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models NEURIPS

链接: https://arxiv.org/abs/2410.07273
作者: Fangyikang Wang,Hubery Yin,Yuejiang Dong,Huminhao Zhu,Chao Zhang,Hanbin Zhao,Hui Qian,Chen Li
关键词-EN: exact inversion samplers, exact inversion, diffusion model sampling, heuristic exact inversion, inversion samplers
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted paper by NeurIPS

点击查看摘要

Abstract:The inversion of diffusion model sampling, which aims to find the corresponding initial noise of a sample, plays a critical role in various tasks. Recently, several heuristic exact inversion samplers have been proposed to address the inexact inversion issue in a training-free manner. However, the theoretical properties of these heuristic samplers remain unknown and they often exhibit mediocre sampling quality. In this paper, we introduce a generic formulation, \emphBidirectional Explicit Linear Multi-step (BELM) samplers, of the exact inversion samplers, which includes all previously proposed heuristic exact inversion samplers as special cases. The BELM formulation is derived from the variable-stepsize-variable-formula linear multi-step method via integrating a bidirectional explicit constraint. We highlight this bidirectional explicit constraint is the key of mathematically exact inversion. We systematically investigate the Local Truncation Error (LTE) within the BELM framework and show that the existing heuristic designs of exact inversion samplers yield sub-optimal LTE. Consequently, we propose the Optimal BELM (O-BELM) sampler through the LTE minimization approach. We conduct additional analysis to substantiate the theoretical stability and global convergence property of the proposed optimal sampler. Comprehensive experiments demonstrate our O-BELM sampler establishes the exact inversion property while achieving high-quality sampling. Additional experiments in image editing and image interpolation highlight the extensive potential of applying O-BELM in varying applications.

[LG-154] Boosting the Performance of Decentralized Federated Learning via Catalyst Acceleration

链接: https://arxiv.org/abs/2410.07272
作者: Qinglun Li,Miao Zhang,Yingqi Liu,Quanjun Yin,Li Shen,Xiaochun Cao
关键词-EN: Decentralized Federated Learning, Federated Learning, Decentralized Federated, centralized architectures due, reduced communication overhead
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2410.06482

点击查看摘要

Abstract:Decentralized Federated Learning has emerged as an alternative to centralized architectures due to its faster training, privacy preservation, and reduced communication overhead. In decentralized communication, the server aggregation phase in Centralized Federated Learning shifts to the client side, which means that clients connect with each other in a peer-to-peer manner. However, compared to the centralized mode, data heterogeneity in Decentralized Federated Learning will cause larger variances between aggregated models, which leads to slow convergence in training and poor generalization performance in tests. To address these issues, we introduce Catalyst Acceleration and propose an acceleration Decentralized Federated Learning algorithm called DFedCata. It consists of two main components: the Moreau envelope function, which primarily addresses parameter inconsistencies among clients caused by data heterogeneity, and Nesterov’s extrapolation step, which accelerates the aggregation phase. Theoretically, we prove the optimization error bound and generalization error bound of the algorithm, providing a further understanding of the nature of the algorithm and the theoretical perspectives on the hyperparameter choice. Empirically, we demonstrate the advantages of the proposed algorithm in both convergence speed and generalization performance on CIFAR10/100 with various non-iid data distributions. Furthermore, we also experimentally verify the theoretical properties of DFedCata.

[LG-155] A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

链接: https://arxiv.org/abs/2410.07265
作者: Cong Guo,Feng Cheng,Zhixu Du,James Kiessling,Jonathan Ku,Shiyu Li,Ziru Li,Mingyuan Ma,Tergel Molom-Ochir,Benjamin Morris,Haoxuan Shan,Jingwei Sun,Yitu Wang,Chiyue Wei,Xueying Wu,Yuhao Wu,Hao Frank Yang,Jingyang Zhang,Junyao Zhang,Qilin Zheng,Guanglei Zhou, Hai (Helen)Li,Yiran Chen
关键词-EN: demonstrating remarkable capabilities, natural language processing, large language models, artificial intelligence, demonstrating remarkable
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted by IEEE Circuits and Systems Magazine

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has significantly transformed the field of artificial intelligence, demonstrating remarkable capabilities in natural language processing and moving towards multi-modal functionality. These models are increasingly integrated into diverse applications, impacting both research and industry. However, their development and deployment present substantial challenges, including the need for extensive computational resources, high energy consumption, and complex software optimizations. Unlike traditional deep learning systems, LLMs require unique optimization strategies for training and inference, focusing on system-level efficiency. This paper surveys hardware and software co-design approaches specifically tailored to address the unique characteristics and constraints of large language models. This survey analyzes the challenges and impacts of LLMs on hardware and algorithm research, exploring algorithm optimization, hardware design, and system-level innovations. It aims to provide a comprehensive understanding of the trade-offs and considerations in LLM-centric computing systems, guiding future advancements in AI. Finally, we summarize the existing efforts in this space and outline future directions toward realizing production-grade co-design methodologies for the next generation of large language models and AI systems.

[LG-156] Memory-augmented Transformers can implement Linear First-Order Optimization Methods

链接: https://arxiv.org/abs/2410.07263
作者: Sanchayan Dutta(UC Davis),Suvrit Sra(TU Munich)
关键词-EN: linearly combine past, combine past gradients, conjugate gradient descent, gradient descent, preconditioned gradient descent
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We show that memory-augmented Transformers (Memformers) can implement linear first-order optimization methods such as conjugate gradient descent, momentum methods, and more generally, methods that linearly combine past gradients. Building on prior work that demonstrates how Transformers can simulate preconditioned gradient descent, we provide theoretical and empirical evidence that Memformers can learn more advanced optimization algorithms. Specifically, we analyze how memory registers in Memformers store suitable intermediate attention values allowing them to implement algorithms such as conjugate gradient. Our results show that Memformers can efficiently learn these methods by training on random linear regression tasks, even learning methods that outperform conjugate gradient. This work extends our knowledge about the algorithmic capabilities of Transformers, showing how they can learn complex optimization methods.

[LG-157] Similarity Learning with neural networks

链接: https://arxiv.org/abs/2410.07214
作者: Gabriel Sanfins,Fabio Ramos,Danilo Naiff
关键词-EN: automatically identify similarity, neural network algorithm, network algorithm designed, identify similarity relations, similarity relations
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:In this work, we introduce a neural network algorithm designed to automatically identify similarity relations from data. By uncovering these similarity relations, our network approximates the underlying physical laws that relate dimensionless quantities to their dimensionless variables and coefficients. Additionally, we develop a linear algebra framework, accompanied by code, to derive the symmetry groups associated with these similarity relations. While our approach is general, we illustrate its application through examples in fluid mechanics, including laminar Newtonian and non-Newtonian flows in smooth pipes, as well as turbulent flows in both smooth and rough pipes. Such examples are chosen to highlight the framework’s capability to handle both simple and intricate cases, and further validates its effectiveness in discovering underlying physical laws from data.

[LG-158] Neural Contrast: Leveraging Generative Editing for Graphic Design Recommendations PRICAI2024

链接: https://arxiv.org/abs/2410.07211
作者: Marian Lupascu,Ionut Mironica,Mihai-Sorin Stupariu
关键词-EN: Creating visually appealing, visually appealing composites, appealing composites requires, composites requires optimizing, Creating visually
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, Paper sent and accepted as a poster at PRICAI 2024

点击查看摘要

Abstract:Creating visually appealing composites requires optimizing both text and background for compatibility. Previous methods have focused on simple design strategies, such as changing text color or adding background shapes for contrast. These approaches are often destructive, altering text color or partially obstructing the background image. Another method involves placing design elements in non-salient and contrasting regions, but this isn’t always effective, especially with patterned backgrounds. To address these challenges, we propose a generative approach using a diffusion model. This method ensures the altered regions beneath design assets exhibit low saliency while enhancing contrast, thereby improving the visibility of the design asset.

[LG-159] An Analysis of Minimum Error Entropy Loss Functions in Wireless Communications

链接: https://arxiv.org/abs/2410.07208
作者: Rumeshika Pallewela,Eslam Eldeeb,Hirley Alves
关键词-EN: minimum error entropy, advanced information-theoretic loss, MEE criterion, paper introduces, introduces the minimum
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the minimum error entropy (MEE) criterion as an advanced information-theoretic loss function tailored for deep learning applications in wireless communications. The MEE criterion leverages higher-order statistical properties, offering robustness in noisy scenarios like Rayleigh fading and impulsive interference. In addition, we propose a less computationally complex version of the MEE function to enhance practical usability in wireless communications. The method is evaluated through simulations on two critical applications: over-the-air regression and indoor localization. Results indicate that the MEE criterion outperforms conventional loss functions, such as mean squared error (MSE) and mean absolute error (MAE), achieving significant performance improvements in terms of accuracy, over 20 % gain over traditional methods, and convergence speed across various channel conditions. This work establishes MEE as a promising alternative for wireless communication tasks in deep learning models, enabling better resilience and adaptability.

[LG-160] SpaRG: Sparsely Reconstructed Graphs for Generalizable fMRI Analysis

链接: https://arxiv.org/abs/2410.07201
作者: Camila González,Yanis Miraoui,Yiran Fan,Ehsan Adeli,Kilian M. Pohl
关键词-EN: Magnetic Resonance Imaging, functional Magnetic Resonance, resting-state functional Magnetic, Resonance Imaging, Magnetic Resonance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning can help uncover patterns in resting-state functional Magnetic Resonance Imaging (rs-fMRI) associated with psychiatric disorders and personal traits. Yet the problem of interpreting deep learning findings is rarely more evident than in fMRI analyses, as the data is sensitive to scanning effects and inherently difficult to visualize. We propose a simple approach to mitigate these challenges grounded on sparsification and self-supervision. Instead of extracting post-hoc feature attributions to uncover functional connections that are important to the target task, we identify a small subset of highly informative connections during training and occlude the rest. To this end, we jointly train a (1) sparse input mask, (2) variational autoencoder (VAE), and (3) downstream classifier in an end-to-end fashion. While we need a portion of labeled samples to train the classifier, we optimize the sparse mask and VAE with unlabeled data from additional acquisition sites, retaining only the input features that generalize well. We evaluate our method - Sparsely Reconstructed Graphs (SpaRG) - on the public ABIDE dataset for the task of sex classification, training with labeled cases from 18 sites and adapting the model to two additional out-of-distribution sites with a portion of unlabeled samples. For a relatively coarse parcellation (64 regions), SpaRG utilizes only 1% of the original connections while improving the classification accuracy across domains. Our code can be found at this http URL.

[LG-161] PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

链接: https://arxiv.org/abs/2410.07192
作者: Daiyaan Arfeen,Zhen Zhang,Xinwei Fu,Gregory R. Ganger,Yida Wang
关键词-EN: Deep Neural Networks, Training Deep Neural, Neural Networks, Deep Neural, generally involves pipeline-parallel
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which are often 15-30% and can exceed 60% of the training job’s GPU allocation. To improve the GPU utilization of PP model training, this paper describes PipeFill, which fills pipeline bubbles with execution of other pending jobs. By leveraging bubble GPU time, PipeFill reduces the GPU utilization sacrifice associated with scaling-up of large-model training. To context-switch between fill jobs and the main training job with minimal overhead to the main job, and maximize fill job efficiency, PipeFill carefully fits fill job work to measured bubble durations and GPU memory availability, introduces explicit pipeline-bubble instructions, and orchestrates placement and execution of fill jobs in pipeline bubbles. Experiments show that PipeFill can increase overall utilization by up to 63% for GPUs used in large-scale LLM training, with 2% slowdown of the training job, and 5-15% even for low-scale LLM training. For large-scale LLM training on 8K GPUs, the 63% increase translates to up to 2.6K additional GPUs worth of work completed.

[LG-162] Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving

链接: https://arxiv.org/abs/2410.07191
作者: Ehsan Ahmadi,Ray Mercurius,Soheil Alizadeh,Kasra Rezaee,Amir Rasouli
关键词-EN: Causal Discovery Network, ego-agent behavior, Causal Attention Gating, Trajectory prediction, agents whose actions
类目: Robotics (cs.RO); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 6 pages with 3 figures

点击查看摘要

Abstract:Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent’s behavior. Such perturbations can lead to incorrect predictions of other agents’ trajectories, potentially compromising the safety and efficiency of the ego-vehicle’s decision-making process. Motivated by this challenge, we propose \textitCausal tRajecTory predICtion \textbf(CRiTIC) , a novel model that utilizes a \textitCausal Discovery Network to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel \textitCausal Attention Gating mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to \textbf54% without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to \textbf29% improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: this https URL.

[LG-163] he trade-off between data minimization and fairness in collaborative filtering

链接: https://arxiv.org/abs/2410.07182
作者: Nasim Sonboli,Sipei Li,Mehdi Elahi,Asia Biega
关键词-EN: General Data Protection, Data Protection Regulations, Protection Regulations, safeguard individuals’ personal, individuals’ personal information
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:General Data Protection Regulations (GDPR) aim to safeguard individuals’ personal information from harm. While full compliance is mandatory in the European Union and the California Privacy Rights Act (CPRA), it is not in other places. GDPR requires simultaneous compliance with all the principles such as fairness, accuracy, and data minimization. However, it overlooks the potential contradictions within its principles. This matter gets even more complex when compliance is required from decision-making systems. Therefore, it is essential to investigate the feasibility of simultaneously achieving the goals of GDPR and machine learning, and the potential tradeoffs that might be forced upon us. This paper studies the relationship between the principles of data minimization and fairness in recommender systems. We operationalize data minimization via active learning (AL) because, unlike many other methods, it can preserve a high accuracy while allowing for strategic data collection, hence minimizing the amount of data collection. We have implemented several active learning strategies (personalized and non-personalized) and conducted a comparative analysis focusing on accuracy and fairness on two publicly available datasets. The results demonstrate that different AL strategies may have different impacts on the accuracy of recommender systems with nearly all strategies negatively impacting fairness. There has been no to very limited work on the trade-off between data minimization and fairness, the pros and cons of active learning methods as tools for implementing data minimization, and the potential impacts of AL on fairness. By exploring these critical aspects, we offer valuable insights for developing recommender systems that are GDPR compliant.

[LG-164] Does Spatial Cognition Emerge in Frontier Models?

链接: https://arxiv.org/abs/2410.06468
作者: Santhosh Kumar Ramakrishnan,Erik Wijmans,Philipp Kraehenbuehl,Vladlen Koltun
关键词-EN: present SPACE, Abstract, models, benchmark, spatial
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.

[LG-165] Interpreting Deep Neural Network-Based Receiver Under Varying Signal-To-Noise Ratios

链接: https://arxiv.org/abs/2409.16768
作者: Marko Tuononen,Dani Korpi,Ville Hautamäki
关键词-EN: convolutional neural network-based, focusing on convolutional, network-based receiver model, interpreting neural networks, neural network-based receiver
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 7+1 pages, 8 figures

点击查看摘要

Abstract:We propose a novel method for interpreting neural networks, focusing on convolutional neural network-based receiver model. The method identifies which unit or units of the model contain most (or least) information about the channel parameter(s) of the interest, providing insights at both global and local levels – with global explanations aggregating local ones. Experiments on link-level simulations demonstrate the method’s effectiveness in identifying units that contribute most (and least) to signal-to-noise ratio processing. Although we focus on a radio receiver model, the method generalizes to other neural network architectures and applications, offering robust estimation even in high-dimensional settings.

[LG-166] Features are fate: a theory of transfer learning in high-dimensional regression

链接: https://arxiv.org/abs/2410.08194
作者: Javan Tahir,Surya Ganguli,Grant M. Rotskoff
关键词-EN: data-limited downstream tasks, large-scale pre-trained neural, transfer learning, methods to adapt, emergence of large-scale
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages, 7 figures

点击查看摘要

Abstract:With the emergence of large-scale pre-trained neural networks, methods to adapt such “foundation” models to data-limited downstream tasks have become a necessity. Fine-tuning, preference optimization, and transfer learning have all been successfully employed for these purposes when the target task closely resembles the source task, but a precise theoretical understanding of “task similarity” is still lacking. While conventional wisdom suggests that simple measures of similarity between source and target distributions, such as \phi -divergences or integral probability metrics, can directly predict the success of transfer, we prove the surprising fact that, in general, this is not the case. We adopt, instead, a feature-centric viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks.

[LG-167] Deconstructing equivariant representations in molecular systems NEURIPS2024

链接: https://arxiv.org/abs/2410.08131
作者: Kin Long Kelvin Lee,Mikhail Galkin,Santiago Miret
关键词-EN: shown significant progress, Recent equivariant models, chemical property prediction, molecules and materials, Recent equivariant
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: Accepted in the Findings track at the AI4Mat workshop, NeurIPS 2024 Vancouver, BC

点击查看摘要

Abstract:Recent equivariant models have shown significant progress in not just chemical property prediction, but as surrogates for dynamical simulations of molecules and materials. Many of the top performing models in this category are built within the framework of tensor products, which preserves equivariance by restricting interactions and transformations to those that are allowed by symmetry selection rules. Despite being a core part of the modeling process, there has not yet been much attention into understanding what information persists in these equivariant representations, and their general behavior outside of benchmark metrics. In this work, we report on a set of experiments using a simple equivariant graph convolution model on the QM9 dataset, focusing on correlating quantitative performance with the resulting molecular graph embeddings. Our key finding is that, for a scalar prediction task, many of the irreducible representations are simply ignored during training – specifically those pertaining to vector ( l=1 ) and tensor quantities ( l=2 ) – an issue that does not necessarily make itself evident in the test metric. We empirically show that removing some unused orders of spherical harmonics improves model performance, correlating with improved latent space structure. We provide a number of recommendations for future experiments to try and improve efficiency and utilization of equivariant features based on these observations.

[LG-168] Variational Inequality Methods for Multi-Agent Reinforcement Learning: Performance and Stability Gains

链接: https://arxiv.org/abs/2410.07976
作者: Baraah A. M. Sidahmed,Tatjana Chavdarova
关键词-EN: Multi-agent reinforcement learning, presents unique challenges, reinforcement learning, unique challenges, solving Variational Inequalities
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) presents unique challenges as agents learn strategies through experiences. Gradient-based methods are often sensitive to hyperparameter selection and initial random seed variations. Concurrently, significant advances have been made in solving Variational Inequalities (VIs) which include equilibrium-finding problems particularly in addressing the non-converging rotational dynamics that impede convergence of traditional gradient based optimization methods. This paper explores the potential of leveraging VI-based techniques to improve MARL training. Specifically, we study the performance of VI method namely, Nested-Lookahead VI (nLA-VI) and Extragradient (EG) in enhancing the multi-agent deep deterministic policy gradient (MADDPG) algorithm. We present a VI reformulation of the actor-critic algorithm for both single- and multi-agent settings. We introduce three algorithms that use nLA-VI, EG, and a combination of both, named LA-MADDPG, EG-MADDPG, and LA-EG-MADDPG, respectively. Our empirical results demonstrate that these VI-based approaches yield significant performance improvements in benchmark environments, such as the zero-sum games: rock-paper-scissors and matching pennies, where equilibrium strategies can be quantitatively assessed, and the Multi-Agent Particle Environment: Predator prey benchmark, where VI-based methods also yield balanced participation of agents from the same team.

[LG-169] QCircuitNet: A Large-Scale Hierarchical Dataset for Quantum Algorithm Design

链接: https://arxiv.org/abs/2410.07961
作者: Rui Yang,Yuntian Gu,Ziruo Wang,Yitao Liang,Tongyang Li
关键词-EN: emerging field recognized, quantum algorithms, Quantum, implementing quantum algorithms, classical computing
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 35 pages, 7 figures, 4 tables, GitHub repository: this https URL

点击查看摘要

Abstract:Quantum computing is an emerging field recognized for the significant speedup it offers over classical computing through quantum algorithms. However, designing and implementing quantum algorithms pose challenges due to the complex nature of quantum mechanics and the necessity for precise control over quantum states. Despite the significant advancements in AI, there has been a lack of datasets specifically tailored for this purpose. In this work, we introduce QCircuitNet, the first benchmark and test dataset designed to evaluate AI’s capability in designing and implementing quantum algorithms in the form of quantum circuit codes. Unlike using AI for writing traditional codes, this task is fundamentally different and significantly more complicated due to highly flexible design space and intricate manipulation of qubits. Our key contributions include: 1. A general framework which formulates the key features of quantum algorithm design task for Large Language Models. 2. Implementation for a wide range of quantum algorithms from basic primitives to advanced applications, with easy extension to more quantum algorithms. 3. Automatic validation and verification functions, allowing for iterative evaluation and interactive reasoning without human inspection. 4. Promising potential as a training dataset through primitive fine-tuning results. We observed several interesting experimental phenomena: fine-tuning does not always outperform few-shot learning, and LLMs tend to exhibit consistent error patterns. QCircuitNet provides a comprehensive benchmark for AI-driven quantum algorithm design, offering advantages in model evaluation and improvement, while also revealing some limitations of LLMs in this domain.

[LG-170] Decision-Aware Predictive Model Selection for Workforce Allocation

链接: https://arxiv.org/abs/2410.07932
作者: Eric G. Stratman,Justin J. Boutilier,Laura A. Albert
关键词-EN: make subjective decisions, subjective decisions, information is scarce, organizations depend, depend on human
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many organizations depend on human decision-makers to make subjective decisions, especially in settings where information is scarce. Although workers are often viewed as interchangeable, the specific individual assigned to a task can significantly impact outcomes due to their unique decision-making processes and risk tolerance. In this paper, we introduce a novel framework that utilizes machine learning to predict worker behavior and employs integer optimization to strategically assign workers to tasks. Unlike traditional methods that treat machine learning predictions as static inputs for optimization, in our approach, the optimal predictive model used to represent a worker’s behavior is determined by how that worker is allocated within the optimization process. We present a decision-aware optimization framework that integrates predictive model selection with worker allocation. Collaborating with an auto-insurance provider and using real-world data, we evaluate the effectiveness of our proposed method by applying three different techniques to predict worker behavior. Our findings show the proposed decision-aware framework outperforms traditional methods and offers context-sensitive and data-responsive strategies for workforce management.

[LG-171] Cost-aware Simulation-based Inference

链接: https://arxiv.org/abs/2410.07930
作者: Ayush Bharti,Daolang Huang,Samuel Kaski,François-Xavier Briol
关键词-EN: Simulation-based inference, SBI methods, preferred framework, framework for estimating, cost-aware SBI methods
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Simulation-based inference (SBI) is the preferred framework for estimating parameters of intractable models in science and engineering. A significant challenge in this context is the large computational cost of simulating data from complex models, and the fact that this cost often depends on parameter values. We therefore propose \textitcost-aware SBI methods which can significantly reduce the cost of existing sampling-based SBI methods, such as neural SBI and approximate Bayesian computation. This is achieved through a combination of rejection and self-normalised importance sampling, which significantly reduces the number of expensive simulations needed. Our approach is studied extensively on models from epidemiology to telecommunications engineering, where we obtain significant reductions in the overall cost of inference.

[LG-172] Identifying latent disease factors differently expressed in patient subgroups using group factor analysis

链接: https://arxiv.org/abs/2410.07890
作者: Fabio S. Ferreira,John Ashburner,Arabella Bouzigues,Chatrin Suksasilp,Lucy L. Russell,Phoebe H. Foster,Eve Ferry-Bolder,John C. van Swieten,Lize C. Jiskoot,Harro Seelaar,Raquel Sanchez-Valle,Robert Laforce,Caroline Graff,Daniela Galimberti,Rik Vandenberghe,Alexandre de Mendonca,Pietro Tiraboschi,Isabel Santana,Alexander Gerhard,Johannes Levin,Sandro Sorbi,Markus Otto,Florence Pasquier,Simon Ducharme,Chris R. Butler,Isabelle Le Ber,Elizabeth Finger,Maria C. Tartaglia,Mario Masellis,James B. Rowe,Matthis Synofzik,Fermin Moreno,Barbara Borroni,Samuel Kaski,Jonathan D. Rohrer,Janaina Mourao-Miranda
关键词-EN: hinder disease understanding, latent disease factors, latent factors, Group Factor Analysis, sparse GFA
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 38 pages, 14 figures

点击查看摘要

Abstract:In this study, we propose a novel approach to uncover subgroup-specific and subgroup-common latent factors addressing the challenges posed by the heterogeneity of neurological and mental disorders, which hinder disease understanding, treatment development, and outcome prediction. The proposed approach, sparse Group Factor Analysis (GFA) with regularised horseshoe priors, was implemented with probabilistic programming and can uncover associations (or latent factors) among multiple data modalities differentially expressed in sample subgroups. Synthetic data experiments showed the robustness of our sparse GFA by correctly inferring latent factors and model parameters. When applied to the Genetic Frontotemporal Dementia Initiative (GENFI) dataset, which comprises patients with frontotemporal dementia (FTD) with genetically defined subgroups, the sparse GFA identified latent disease factors differentially expressed across the subgroups, distinguishing between “subgroup-specific” latent factors within homogeneous groups and “subgroup common” latent factors shared across subgroups. The latent disease factors captured associations between brain structure and non-imaging variables (i.e., questionnaires assessing behaviour and disease severity) across the different genetic subgroups, offering insights into disease profiles. Importantly, two latent factors were more pronounced in the two more homogeneous FTD patient subgroups (progranulin (GRN) and microtubule-associated protein tau (MAPT) mutation), showcasing the method’s ability to reveal subgroup-specific characteristics. These findings underscore the potential of sparse GFA for integrating multiple data modalities and identifying interpretable latent disease factors that can improve the characterization and stratification of patients with neurological and mental health disorders.

[LG-173] Orthogonal Nonnegative Matrix Factorization with the Kullback-Leibler divergence

链接: https://arxiv.org/abs/2410.07786
作者: Jean Pacifique Nkurunziza,Fulgence Nahayo,Nicolas Gillis
关键词-EN: Orthogonal nonnegative matrix, nonnegative matrix factorization, Orthogonal nonnegative, matrix factorization, approach for clustering
类目: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 10 pages

点击查看摘要

Abstract:Orthogonal nonnegative matrix factorization (ONMF) has become a standard approach for clustering. As far as we know, most works on ONMF rely on the Frobenius norm to assess the quality of the approximation. This paper presents a new model and algorithm for ONMF that minimizes the Kullback-Leibler (KL) divergence. As opposed to the Frobenius norm which assumes Gaussian noise, the KL divergence is the maximum likelihood estimator for Poisson-distributed data, which can model better vectors of word counts in document data sets and photo counting processes in imaging. We have developed an algorithm based on alternating optimization, KL-ONMF, and show that it performs favorably with the Frobenius-norm based ONMF for document classification and hyperspectral image unmixing.

[LG-174] On the grid-sampling limit SDE

链接: https://arxiv.org/abs/2410.07778
作者: Christian Bender,Nguyen Tran Thuan
关键词-EN: continuous-time reinforcement learning, recent work, reinforcement learning, introduced the grid-sampling, proxy for modeling
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: This note provides supplementary materials to arXiv:2409.17200 in a self-contained way

点击查看摘要

Abstract:In our recent work [3] we introduced the grid-sampling SDE as a proxy for modeling exploration in continuous-time reinforcement learning. In this note, we provide further motivation for the use of this SDE and discuss its wellposedness in the presence of jumps.

[LG-175] Meta-Learning from Learning Curves for Budget-Limited Algorithm Selection

链接: https://arxiv.org/abs/2410.07696
作者: Manh Hung Nguyen,Lisheng Sun-Hosoya(LISN),Isabelle Guyon
关键词-EN: learning curves, computationally wasteful, learning, machine learning algorithms, large set
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Training a large set of machine learning algorithms to convergence in order to select the best-performing algorithm for a dataset is computationally wasteful. Moreover, in a budget-limited scenario, it is crucial to carefully select an algorithm candidate and allocate a budget for training it, ensuring that the limited budget is optimally distributed to favor the most promising candidates. Casting this problem as a Markov Decision Process, we propose a novel framework in which an agent must select in the process of learning the most promising algorithm without waiting until it is fully trained. At each time step, given an observation of partial learning curves of algorithms, the agent must decide whether to allocate resources to further train the most promising algorithm (exploitation), to wake up another algorithm previously put to sleep, or to start training a new algorithm (exploration). In addition, our framework allows the agent to meta-learn from learning curves on past datasets along with dataset meta-features and algorithm hyperparameters. By incorporating meta-learning, we aim to avoid myopic decisions based solely on premature learning curves on the dataset at hand. We introduce two benchmarks of learning curves that served in international competitions at WCCI’22 and AutoML-conf’22, of which we analyze the results. Our findings show that both meta-learning and the progression of learning curves enhance the algorithm selection process, as evidenced by methods of winning teams and our DDQN baseline, compared to heuristic baselines or a random search. Interestingly, our cost-effective baseline, which selects the best-performing algorithm w.r.t. a small budget, can perform decently when learning curves do not intersect frequently.

[LG-176] Breaking the curse of dimensionality in structured density estimation NEURIPS2024

链接: https://arxiv.org/abs/2410.07685
作者: Robert A. Vandermeulen,Wai Ming Tai,Bryon Aragam
关键词-EN: Markov conditions implied, structured multivariate density, curse of dimensionality, estimating a structured, structured multivariate
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Work accepted to NeurIPS 2024

点击查看摘要

Abstract:We consider the problem of estimating a structured multivariate density, subject to Markov conditions implied by an undirected graph. In the worst case, without Markovian assumptions, this problem suffers from the curse of dimensionality. Our main result shows how the curse of dimensionality can be avoided or greatly alleviated under the Markov property, and applies to arbitrary graphs. While existing results along these lines focus on sparsity or manifold assumptions, we introduce a new graphical quantity called “graph resilience” and show how it controls the sample complexity. Surprisingly, although one might expect the sample complexity of this problem to scale with local graph parameters such as the degree, this turns out not to be the case. Through explicit examples, we compute uniform deviation bounds and illustrate how the curse of dimensionality in density estimation can thus be circumvented. Notable examples where the rate improves substantially include sequential, hierarchical, and spatial data.

[LG-177] heoretical limits of descending ell_0 sparse-regression ML algorithms

链接: https://arxiv.org/abs/2410.07651
作者: Mihailo Stojnic
关键词-EN: sparse regression problems, solving classical compressed, classical compressed sensing, norm based optimization, based optimization algorithms
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the theoretical limits of the \ell_0 (quasi) norm based optimization algorithms when employed for solving classical compressed sensing or sparse regression problems. Considering standard contexts with deterministic signals and statistical systems, we utilize \emphFully lifted random duality theory (Fl RDT) and develop a generic analytical program for studying performance of the \emphmaximum-likelihood (ML) decoding. The key ML performance parameter, the residual \emphroot mean square error ( \textbfRMSE ), is uncovered to exhibit the so-called \emphphase-transition (PT) phenomenon. The associated aPT curve, which separates the regions of systems dimensions where \emphan \ell_0 based algorithm succeeds or fails in achieving small (comparable to the noise) ML optimal \textbfRMSE is precisely determined as well. In parallel, we uncover the existence of another dPT curve which does the same separation but for practically feasible \emphdescending \ell_0 ( d\ell_0 ) algorithms. Concrete implementation and practical relevance of the Fl RDT typically rely on the ability to conduct a sizeable set of the underlying numerical evaluations which reveal that for the ML decoding the Fl RDT converges astonishingly fast with corrections in the estimated quantities not exceeding \sim 0.1% already on the third level of lifting. Analytical results are supplemented by a sizeable set of numerical experiments where we implement a simple variant of d\ell_0 and demonstrate that its practical performance very accurately matches the theoretical predictions. Completely surprisingly, a remarkably precise agreement between the simulations and the theory is observed for fairly small dimensions of the order of 100.

[LG-178] Rethinking Adversarial Inverse Reinforcement Learning: From the Angles of Policy Imitation and Transferable Reward Recovery

链接: https://arxiv.org/abs/2410.07643
作者: Yangchun Zhang,Wang Zhou,Yirui Zhou
关键词-EN: transferable task descriptions, adversarial inverse reinforcement, inverse reinforcement learning, adversarial inverse, inverse reinforcement
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2403.14593

点击查看摘要

Abstract:In scenarios of inverse reinforcement learning (IRL) with a single expert, adversarial inverse reinforcement learning (AIRL) serves as a foundational approach to providing comprehensive and transferable task descriptions by restricting the reward class, e.g., to state-only rewards. However, AIRL faces practical challenges, primarily stemming from the difficulty of verifying the unobservable transition matrix - often encountered in practice - under the specific conditions necessary for effective transfer. This paper reexamines AIRL in light of the unobservable transition matrix or limited informative priors. By applying random matrix theory (RMT), we demonstrate that AIRL can disentangle rewards for effective transfer with high probability, irrespective of specific conditions. This perspective reframes inadequate transfer in certain contexts. Specifically, it is attributed to the selection problem of the reinforcement learning algorithm employed by AIRL, which is characterized by training variance. Based on this insight, we propose a hybrid framework that integrates on-policy proximal policy optimization (PPO) in the source environment with off-policy soft actor-critic (SAC) in the target environment, leading to significant improvements in reward transfer effectiveness.

[LG-179] Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

链接: https://arxiv.org/abs/2410.07574
作者: Zhong Zheng,Haochen Zhang,Lingzhou Xue
关键词-EN: Markov Decision Processes, tabular Markov Decision, episodic tabular Markov, Decision Processes, Markov Decision
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the gap-dependent bounds of two important algorithms for on-policy Q-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the almost optimal \sqrtT -type regret bound in the worst-case scenario, where T is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for Q-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for Q-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in T and improve upon existing ones for Q-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for Q-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for Q-learning.

[LG-180] Hybrid Summary Statistics NEURIPS2024

链接: https://arxiv.org/abs/2410.07548
作者: T. Lucas Makinen,Ce Sui,Benjamin D. Wandelt,Natalia Porqueres,Alan Heavens
关键词-EN: capture high-information posteriors, robust simulation-based inference, high-information posteriors, sparsely sampled, capture high-information
类目: Machine Learning (stat.ML); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Information Theory (cs.IT); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 7 pages, 4 figures. Accepted to ML4PS2024 at NeurIPS 2024

点击查看摘要

Abstract:We present a way to capture high-information posteriors from training sets that are sparsely sampled over the parameter space for robust simulation-based inference. In physical inference problems, we can often apply domain knowledge to define traditional summary statistics to capture some of the information in a dataset. We show that augmenting these statistics with neural network outputs to maximise the mutual information improves information extraction compared to neural summaries alone or their concatenation to existing summaries and makes inference robust in settings with low training data. We introduce 1) two loss formalisms to achieve this and 2) apply the technique to two different cosmological datasets to extract non-Gaussian parameter information.

[LG-181] Representation-Enhanced Neural Knowledge Integration with Application to Large-Scale Medical Ontology Learning

链接: https://arxiv.org/abs/2410.07454
作者: Suqi Liu,Tianxi Cai,Xiaoou Li
关键词-EN: ensures consistent interpretation, biomedical data discovery, graph enhances reproducibility, knowledge graph, large-scale knowledge graph
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:A large-scale knowledge graph enhances reproducibility in biomedical data discovery by providing a standardized, integrated framework that ensures consistent interpretation across diverse datasets. It improves generalizability by connecting data from various sources, enabling broader applicability of findings across different populations and conditions. Generating reliable knowledge graph, leveraging multi-source information from existing literature, however, is challenging especially with a large number of node sizes and heterogeneous relations. In this paper, we propose a general theoretically guaranteed statistical framework, called RENKI, to enable simultaneous learning of multiple relation types. RENKI generalizes various network models widely used in statistics and computer science. The proposed framework incorporates representation learning output into initial entity embedding of a neural network that approximates the score function for the knowledge graph and continuously trains the model to fit observed facts. We prove nonasymptotic bounds for in-sample and out-of-sample weighted MSEs in relation to the pseudo-dimension of the knowledge graph function class. Additionally, we provide pseudo-dimensions for score functions based on multilayer neural networks with ReLU activation function, in the scenarios when the embedding parameters either fixed or trainable. Finally, we complement our theoretical results with numerical studies and apply the method to learn a comprehensive medical knowledge graph combining a pretrained language model representation with knowledge graph links observed in several medical ontologies. The experiments justify our theoretical findings and demonstrate the effect of weighting in the presence of heterogeneous relations and the benefit of incorporating representation learning in nonparametric models.

[LG-182] Siamese networks for Poincare embeddings and the reconstruction of evolutionary trees

链接: https://arxiv.org/abs/2410.07387
作者: Ciro Carvallo,Hernán Bocaccio,Gabriel B. Mindlin,Pablo Groisman
关键词-EN: bird song spectrograms, reconstructing evolutionary trees, reconstructing evolutionary, specific application, application to bird
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:We present a method for reconstructing evolutionary trees from high-dimensional data, with a specific application to bird song spectrograms. We address the challenge of inferring phylogenetic relationships from phenotypic traits, like vocalizations, without predefined acoustic properties. Our approach combines two main components: Poincaré embeddings for dimensionality reduction and distance computation, and the neighbor joining algorithm for tree reconstruction. Unlike previous work, we employ Siamese networks to learn embeddings from only leaf node samples of the latent tree. We demonstrate our method’s effectiveness on both synthetic data and spectrograms from six species of finches.

[LG-183] Learning to learn ecosystems from limited data – a meta-learning approach

链接: https://arxiv.org/abs/2410.07368
作者: Zheng-Meng Zhai,Bryan Glaz,Mulugeta Haile,Ying-Cheng Lai
关键词-EN: developing data-driven approaches, fundamental challenge, challenge in developing, developing data-driven, data-driven approaches
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 16 pages, 13 figures

点击查看摘要

Abstract:A fundamental challenge in developing data-driven approaches to ecological systems for tasks such as state estimation and prediction is the paucity of the observational or measurement data. For example, modern machine-learning techniques such as deep learning or reservoir computing typically require a large quantity of data. Leveraging synthetic data from paradigmatic nonlinear but non-ecological dynamical systems, we develop a meta-learning framework with time-delayed feedforward neural networks to predict the long-term behaviors of ecological systems as characterized by their attractors. We show that the framework is capable of accurately reconstructing the ``dynamical climate’’ of the ecological system with limited data. Three benchmark population models in ecology, namely the Hastings-Powell model, a three-species food chain, and the Lotka-Volterra system, are used to demonstrate the performance of the meta-learning based prediction framework. In all cases, enhanced accuracy and robustness are achieved using five to seven times less training data as compared with the corresponding machine-learning method trained solely from the ecosystem data. A number of issues affecting the prediction performance are addressed.

[LG-184] Unlocking Real-Time Fluorescence Lifetime Imaging: Multi-Pixel Parallelism for FPGA-Accelerated Processing

链接: https://arxiv.org/abs/2410.07364
作者: Ismail Erbas,Aporva Amarnath,Vikas Pandey,Karthik Swaminathan,Naigang Wang,Xavier Intes
关键词-EN: Fluorescence lifetime imaging, Fluorescence lifetime, protein interactions, lifetime imaging, fluorescent molecules
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Fluorescence lifetime imaging (FLI) is a widely used technique in the biomedical field for measuring the decay times of fluorescent molecules, providing insights into metabolic states, protein interactions, and ligand-receptor bindings. However, its broader application in fast biological processes, such as dynamic activity monitoring, and clinical use, such as in guided surgery, is limited by long data acquisition times and computationally demanding data processing. While deep learning has reduced post-processing times, time-resolved data acquisition remains a bottleneck for real-time applications. To address this, we propose a method to achieve real-time FLI using an FPGA-based hardware accelerator. Specifically, we implemented a GRU-based sequence-to-sequence (Seq2Seq) model on an FPGA board compatible with time-resolved cameras. The GRU model balances accurate processing with the resource constraints of FPGAs, which have limited DSP units and BRAM. The limited memory and computational resources on the FPGA require efficient scheduling of operations and memory allocation to deploy deep learning models for low-latency applications. We address these challenges by using STOMP, a queue-based discrete-event simulator that automates and optimizes task scheduling and memory management on hardware. By integrating a GRU-based Seq2Seq model and its compressed version, called Seq2SeqLite, generated through knowledge distillation, we were able to process multiple pixels in parallel, reducing latency compared to sequential processing. We explore various levels of parallelism to achieve an optimal balance between performance and resource utilization. Our results indicate that the proposed techniques achieved a 17.7x and 52.0x speedup over manual scheduling for the Seq2Seq model and the Seq2SeqLite model, respectively.

[LG-185] Efficient representation learning of scintillation signal characteristics with spectrum-inspired temporal neural networks

链接: https://arxiv.org/abs/2410.07267
作者: Pengcheng Ai,Xiangming Sun,Zhi Deng,Xinchi Ran
关键词-EN: nuclear medicine imaging, energy physics experiments, high energy physics, Nuclear radiation detectors, Nuclear radiation
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:Nuclear radiation detectors based on scintillators are widely used in particle and high energy physics experiments, nuclear medicine imaging, industrial and environmental detection, etc. Precisely extracting scintillation signal characteristics at the event level is important for these applications, not only in respect of understanding the scintillator itself, but also kinds and physical property of incident particles. Recent researches demonstrate data-driven neural networks are superior to traditional statistical methods, especially when the analytical form of signals is hard to obtain, or noise is significant. However, most densely connected or convolution-based networks fail to fully exploit the spectral and temporal structure of scintillation signals, leaving large space for performance improvement. In this paper, we propose a network architecture specially tailored for scintillation signal characterization based on previous works on time series analysis. By directly applying Fast Fourier Transform on original signals without data embedding, including the zero-frequency component, adjusting convolution scheme for low-frequency components, and unbiasedly re-weighting features from different frequencies, the proposed network architecture can serve as a lightweight and enhanced representation learning backbone. We prove our idea on simulation data generated with the setting of the LUX dark matter detector, and on experimental electrical signals with fast electronics to emulate scintillation variations. The proposed model achieves significantly better results than the reference model in literature and densely connected models without representation learning.

[LG-186] Precision Cancer Classification and Biomarker Identification from mRNA Gene Expression via Dimensionality Reduction and Explainable AI

链接: https://arxiv.org/abs/2410.07260
作者: Farzana Tabassum,Sabrina Islam,Siana Rizwan,Masrur Sobhan,Tasnim Ahmed,Sabbir Ahmed,Tareque Mohmud Chowdhury
关键词-EN: enabling precise diagnoses, unique molecular signatures, Gene expression, enabling precise, critical method
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 37 pages, 2 figures, 8 tables, Submitted to Journal of Computational Science

点击查看摘要

Abstract:Gene expression analysis is a critical method for cancer classification, enabling precise diagnoses through the identification of unique molecular signatures associated with various tumors. Identifying cancer-specific genes from gene expression values enables a more tailored and personalized treatment approach. However, the high dimensionality of mRNA gene expression data poses challenges for analysis and data extraction. This research presents a comprehensive pipeline designed to accurately identify 33 distinct cancer types and their corresponding gene sets. It incorporates a combination of normalization and feature selection techniques to reduce dataset dimensionality effectively while ensuring high performance. Notably, our pipeline successfully identifies a substantial number of cancer-specific genes using a reduced feature set of just 500, in contrast to using the full dataset comprising 19,238 features. By employing an ensemble approach that combines three top-performing classifiers, a classification accuracy of 96.61% was achieved. Furthermore, we leverage Explainable AI to elucidate the biological significance of the identified cancer-specific genes, employing Differential Gene Expression (DGE) analysis.

[LG-187] A Dynamic Approach to Stock Price Prediction: Comparing RNN and Mixture of Experts Models Across Different Volatility Profiles

链接: https://arxiv.org/abs/2410.07234
作者: Diego Vallarino
关键词-EN: Recurrent Neural Network, Mixture of Experts, Recurrent Neural, Neural Network, stock price prediction
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:This study evaluates the effectiveness of a Mixture of Experts (MoE) model for stock price prediction by comparing it to a Recurrent Neural Network (RNN) and a linear regression model. The MoE framework combines an RNN for volatile stocks and a linear model for stable stocks, dynamically adjusting the weight of each model through a gating network. Results indicate that the MoE approach significantly improves predictive accuracy across different volatility profiles. The RNN effectively captures non-linear patterns for volatile companies but tends to overfit stable data, whereas the linear model performs well for predictable trends. The MoE model’s adaptability allows it to outperform each individual model, reducing errors such as Mean Squared Error (MSE) and Mean Absolute Error (MAE). Future work should focus on enhancing the gating mechanism and validating the model with real-world datasets to optimize its practical applicability.

[LG-188] RFBoost: Understanding and Boosting Deep WiFi Sensing via Physical Data Augmentation

链接: https://arxiv.org/abs/2410.07230
作者: Weiying Hou,Chenshu Wu
关键词-EN: learning shows promising, DWS, shows promising performance, Deep learning shows, shows promising
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 8, 2, Article 58 (June 2024), 26 pages

点击查看摘要

Abstract:Deep learning shows promising performance in wireless sensing. However, deep wireless sensing (DWS) heavily relies on large datasets. Unfortunately, building comprehensive datasets for DWS is difficult and costly, because wireless data depends on environmental factors and cannot be labeled offline. Despite recent advances in few-shot/cross-domain learning, DWS is still facing data scarcity issues. In this paper, we investigate a distinct perspective of radio data augmentation (RDA) for WiFi sensing and present a data-space solution. Our key insight is that wireless signals inherently exhibit data diversity, contributing more information to be extracted for DWS. We present RFBoost, a simple and effective RDA framework encompassing novel physical data augmentation techniques. We implement RFBoost as a plug-and-play module integrated with existing deep models and evaluate it on multiple datasets. Experimental results demonstrate that RFBoost achieves remarkable average accuracy improvements of 5.4% on existing models without additional data collection or model modifications, and the best-boosted performance outperforms 11 state-of-the-art baseline models without RDA. RFBoost pioneers the study of RDA, an important yet currently underexplored building block for DWS, which we expect to become a standard DWS component of WiFi sensing and beyond. RFBoost is released at this https URL.

[LG-189] Distilling Analysis from Generative Models for Investment Decisions

链接: https://arxiv.org/abs/2410.07225
作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词-EN: decisions, Professionals’, stock analysts’ decisions, professionals’ decision-making processes, decision-making processes
类目: atistical Finance (q-fin.ST); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Professionals’ decisions are the focus of every field. For example, politicians’ decisions will influence the future of the country, and stock analysts’ decisions will impact the market. Recognizing the influential role of professionals’ perspectives, inclinations, and actions in shaping decision-making processes and future trends across multiple fields, we propose three tasks for modeling these decisions in the financial market. To facilitate this, we introduce a novel dataset, A3, designed to simulate professionals’ decision-making processes. While we find current models present challenges in forecasting professionals’ behaviors, particularly in making trading decisions, the proposed Chain-of-Decision approach demonstrates promising improvements. It integrates an opinion-generator-in-the-loop to provide subjective analysis based on each news item, further enhancing the proposed tasks’ performance.

[LG-190] Computing Systemic Risk Measures with Graph Neural Networks

链接: https://arxiv.org/abs/2410.07222
作者: Lukas Gonon,Thilo Meyer-Brandis,Niklas Weber
关键词-EN: modelled bilateral liabilities, explicitly modelled bilateral, systemic risk measures, paper investigates systemic, investigates systemic risk
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
*备注: 45 pages

点击查看摘要

Abstract:This paper investigates systemic risk measures for stochastic financial networks of explicitly modelled bilateral liabilities. We extend the notion of systemic risk measures from Biagini, Fouque, Fritelli and Meyer-Brandis (2019) to graph structured data. In particular, we focus on an aggregation function that is derived from a market clearing algorithm proposed by Eisenberg and Noe (2001). In this setting, we show the existence of an optimal random allocation that distributes the overall minimal bailout capital and secures the network. We study numerical methods for the approximation of systemic risk and optimal random allocations. We propose to use permutation equivariant architectures of neural networks like graph neural networks (GNNs) and a class that we name (extended) permutation equivariant neural networks ((X)PENNs). We compare their performance to several benchmark allocations. The main feature of GNNs and (X)PENNs is that they are permutation equivariant with respect to the underlying graph data. In numerical experiments we find evidence that these permutation equivariant methods are superior to other approaches.

[LG-191] Stock Price Prediction and Traditional Models: An Approach to Achieve Short- Medium- and Long-Term Goals

链接: https://arxiv.org/abs/2410.07220
作者: Opeyemi Sheu Alamu,Md Kamrul Siam
关键词-EN: Nigerian stock exchange, Autoregressive Moving Average, Integrated Moving Average, Autoregressive Integrated Moving, Gated Recurrent Units
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 20 pages

点击查看摘要

Abstract:A comparative analysis of deep learning models and traditional statistical methods for stock price prediction uses data from the Nigerian stock exchange. Historical data, including daily prices and trading volumes, are employed to implement models such as Long Short Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), Autoregressive Integrated Moving Average (ARIMA), and Autoregressive Moving Average (ARMA). These models are assessed over three-time horizons: short-term (1 year), medium-term (2.5 years), and long-term (5 years), with performance measured by Mean Squared Error (MSE) and Mean Absolute Error (MAE). The stability of the time series is tested using the Augmented Dickey-Fuller (ADF) test. Results reveal that deep learning models, particularly LSTM, outperform traditional methods by capturing complex, nonlinear patterns in the data, resulting in more accurate predictions. However, these models require greater computational resources and offer less interpretability than traditional approaches. The findings highlight the potential of deep learning for improving financial forecasting and investment strategies. Future research could incorporate external factors such as social media sentiment and economic indicators, refine model architectures, and explore real-time applications to enhance prediction accuracy and scalability.

[LG-192] Evaluating Financial Relational Graphs: Interpretation Before Prediction

链接: https://arxiv.org/abs/2410.07216
作者: Yingjie Niu,Lanxin Lu,Rian Dolphin,Valerio Poti,Ruihai Dong
关键词-EN: Accurate and robust, robust stock trend, stock trend forecasting, relationship graphs, stock relationship graphs
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by 2024 ACM International Conference on AI in Finance

点击查看摘要

Abstract:Accurate and robust stock trend forecasting has been a crucial and challenging task, as stock price changes are influenced by multiple factors. Graph neural network-based methods have recently achieved remarkable success in this domain by constructing stock relationship graphs that reflect internal factors and relationships between stocks. However, most of these methods rely on predefined factors to construct static stock relationship graphs due to the lack of suitable datasets, failing to capture the dynamic changes in stock relationships. Moreover, the evaluation of relationship graphs in these methods is often tied to the performance of neural network models on downstream tasks, leading to confusion and imprecision. To address these issues, we introduce the SPNews dataset, collected based on S\P 500 Index stocks, to facilitate the construction of dynamic relationship graphs. Furthermore, we propose a novel set of financial relationship graph evaluation methods that are independent of downstream tasks. By using the relationship graph to explain historical financial phenomena, we assess its validity before constructing a graph neural network, ensuring the graph’s effectiveness in capturing relevant financial relationships. Experimental results demonstrate that our evaluation methods can effectively differentiate between various financial relationship graphs, yielding more interpretable results compared to traditional approaches. We make our source code publicly available on GitHub to promote reproducibility and further research in this area.

[LG-193] Analysis and Optimization of Seismic Monitoring Networks with Bayesian Optimal Experiment Design

链接: https://arxiv.org/abs/2410.07215
作者: Jake Callahan,Kevin Monogue,Ruben Villarreal,Tommie Catanach
关键词-EN: Bayesian OED, Toggle, networks increasingly aim, Monitoring networks increasingly, diverse sensors covering
类目: Applications (stat.AP); Machine Learning (cs.LG); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
*备注: 38 pages, 19 figures. Submitted to Geophysical Journal International

点击查看摘要

Abstract:Monitoring networks increasingly aim to assimilate data from a large number of diverse sensors covering many sensing modalities. Bayesian optimal experimental design (OED) seeks to identify data, sensor configurations, or experiments which can optimally reduce uncertainty and hence increase the performance of a monitoring network. Information theory guides OED by formulating the choice of experiment or sensor placement as an optimization problem that maximizes the expected information gain (EIG) about quantities of interest given prior knowledge and models of expected observation data. Therefore, within the context of seismo-acoustic monitoring, we can use Bayesian OED to configure sensor networks by choosing sensor locations, types, and fidelity in order to improve our ability to identify and locate seismic sources. In this work, we develop the framework necessary to use Bayesian OED to optimize a sensor network’s ability to locate seismic events from arrival time data of detected seismic phases at the regional-scale. Bayesian OED requires four elements: 1) A likelihood function that describes the distribution of detection and travel time data from the sensor network, 2) A Bayesian solver that uses a prior and likelihood to identify the posterior distribution of seismic events given the data, 3) An algorithm to compute EIG about seismic events over a dataset of hypothetical prior events, 4) An optimizer that finds a sensor network which maximizes EIG. Once we have developed this framework, we explore many relevant questions to monitoring such as: how to trade off sensor fidelity and earth model uncertainty; how sensor types, number, and locations influence uncertainty; and how prior models and constraints influence sensor placement. Comments: 38 pages, 19 figures. Submitted to Geophysical Journal International Subjects: Applications (stat.AP); Machine Learning (cs.LG); Geophysics (physics.geo-ph); Machine Learning (stat.ML) Cite as: arXiv:2410.07215 [stat.AP] (or arXiv:2410.07215v1 [stat.AP] for this version) https://doi.org/10.48550/arXiv.2410.07215 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Jake Callahan [view email] [v1] Fri, 27 Sep 2024 04:45:27 UTC (9,787 KB) Full-text links: Access Paper: View a PDF of the paper titled Analysis and Optimization of Seismic Monitoring Networks with Bayesian Optimal Experiment Design, by Jake Callahan and Kevin Monogue and Ruben Villarreal and Tommie CatanachView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: stat.AP prev | next new | recent | 2024-10 Change to browse by: cs cs.LG physics physics.geo-ph stat stat.ML References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-194] owards Explainable Graph Neural Networks for Neurological Evaluation on EEG Signals

链接: https://arxiv.org/abs/2410.07199
作者: Andrea Protani,Lorenzo Giusti,Chiara Iacovelli,Albert Sund Aillet,Diogo Reis Santos,Giuseppe Reale,Aurelia Zauli,Marco Moci,Marta Garbuglia,Pierpaolo Brutti,Pietro Caliandro,Luigi Serio
关键词-EN: accurately estimating stroke, accurately estimating, effectively manage patient, crucial for healthcare, healthcare professionals
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 7 pages, 7 figures

点击查看摘要

Abstract:After an acute stroke, accurately estimating stroke severity is crucial for healthcare professionals to effectively manage patient’s treatment. Graph theory methods have shown that brain connectivity undergoes frequency-dependent reorganization post-stroke, adapting to new conditions. Traditional methods often rely on handcrafted features that may not capture the complexities of clinical phenomena. In this study, we propose a novel approach using Graph Neural Networks (GNNs) to predict stroke severity, as measured by the NIH Stroke Scale (NIHSS). We analyzed electroencephalography (EEG) recordings from 71 patients at the time of hospitalization. For each patient, we generated five graphs weighted by Lagged Linear Coherence (LLC) between signals from distinct Brodmann Areas, covering \delta (2-4 Hz), \theta (4-8 Hz), \alpha_1 (8-10.5 Hz), \alpha_2 (10.5-13 Hz), and \beta_1 (13-20 Hz) frequency bands. To emphasize key neurological connections and maintain sparsity, we applied a sparsification process based on structural and functional brain network properties. We then trained a graph attention model to predict the NIHSS. By examining its attention coefficients, our model reveals insights into brain reconfiguration, providing clinicians with a valuable tool for diagnosis, personalized treatment, and early intervention in neurorehabilitation.

[LG-195] EEGUnity: Open-Source Tool in Facilitating Unified EEG Datasets Towards Large-Scale EEG Model

链接: https://arxiv.org/abs/2410.07196
作者: Chengxuan Qin,Rui Yang,Wenlong You,Zhige Chen,Longsheng Zhu,Mengjie Huang,Zidong Wang
关键词-EN: manage diverse EEG, dispersed EEG dataset, EEG dataset publications, diverse EEG datasets, large-scale EEG model
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing number of dispersed EEG dataset publications and the advancement of large-scale Electroencephalogram (EEG) models have increased the demand for practical tools to manage diverse EEG datasets. However, the inherent complexity of EEG data, characterized by variability in content data, metadata, and data formats, poses challenges for integrating multiple datasets and conducting large-scale EEG model research. To tackle the challenges, this paper introduces EEGUnity, an open-source tool that incorporates modules of ‘EEG Parser’, ‘Correction’, ‘Batch Processing’, and ‘Large Language Model Boost’. Leveraging the functionality of such modules, EEGUnity facilitates the efficient management of multiple EEG datasets, such as intelligent data structure inference, data cleaning, and data unification. In addition, the capabilities of EEGUnity ensure high data quality and consistency, providing a reliable foundation for large-scale EEG data research. EEGUnity is evaluated across 25 EEG datasets from different sources, offering several typical batch processing workflows. The results demonstrate the high performance and flexibility of EEGUnity in parsing and data processing. The project code is publicly available at this http URL.

[LG-196] Designing Pre-training Datasets from Unlabeled Data for EEG Classification with Transformers

链接: https://arxiv.org/abs/2410.07190
作者: Tim Bary,Benoit Macq
关键词-EN: neural networks require, Transformer neural networks, train effectively, neural networks, networks require
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 5 tables, 22nd IEEE Mediterranean Electrotechnical Conference (MELECON 2024)

点击查看摘要

Abstract:Transformer neural networks require a large amount of labeled data to train effectively. Such data is often scarce in electroencephalography, as annotations made by medical experts are costly. This is why self-supervised training, using unlabeled data, has to be performed beforehand. In this paper, we present a way to design several labeled datasets from unlabeled electroencephalogram (EEG) data. These can then be used to pre-train transformers to learn representations of EEG signals. We tested this method on an epileptic seizure forecasting task on the Temple University Seizure Detection Corpus using a Multi-channel Vision Transformer. Our results suggest that 1) Models pre-trained using our approach demonstrate significantly faster training times, reducing fine-tuning duration by more than 50% for the specific task, and 2) Pre-trained models exhibit improved accuracy, with an increase from 90.93% to 92.16%, as well as a higher AUC, rising from 0.9648 to 0.9702 when compared to non-pre-trained models.

[LG-197] Dual Stream Graph Transformer Fusion Networks for Enhanced Brain Decoding

链接: https://arxiv.org/abs/2410.07189
作者: Lucas Goene,Siamak Mehrkanoon
关键词-EN: classifying task-based Magnetoencephalography, Stream Graph-Transformer Fusion, Dual Stream Graph-Transformer, architecture designed specifically, Graph-Transformer Fusion
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 6 pages

点击查看摘要

Abstract:This paper presents the novel Dual Stream Graph-Transformer Fusion (DS-GTF) architecture designed specifically for classifying task-based Magnetoencephalography (MEG) data. In the spatial stream, inputs are initially represented as graphs, which are then passed through graph attention networks (GAT) to extract spatial patterns. Two methods, TopK and Thresholded Adjacency are introduced for initializing the adjacency matrix used in the GAT. In the temporal stream, the Transformer Encoder receives concatenated windowed input MEG data and learns new temporal representations. The learned temporal and spatial representations from both streams are fused before reaching the output layer. Experimental results demonstrate an enhancement in classification performance and a reduction in standard deviation across multiple test subjects compared to other examined models.

信息检索

[IR-0] Rewriting Conversational Utterances with Instructed Large Language Models

链接: https://arxiv.org/abs/2410.07797
作者: Elnara Galimzhanova,Cristina Ioana Muntean,Franco Maria Nardini,Raffaele Perego,Guido Rocchietti
关键词-EN: large language models, text summarization, NLP tasks, recent studies, studies have shown
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Many recent studies have shown the ability of large language models (LLMs) to achieve state-of-the-art performance on many NLP tasks, such as question answering, text summarization, coding, and translation. In some cases, the results provided by LLMs are on par with those of human experts. These models’ most disruptive innovation is their ability to perform tasks via zero-shot or few-shot prompting. This capability has been successfully exploited to train instructed LLMs, where reinforcement learning with human feedback is used to guide the model to follow the user’s requests directly. In this paper, we investigate the ability of instructed LLMs to improve conversational search effectiveness by rewriting user questions in a conversational setting. We study which prompts provide the most informative rewritten utterances that lead to the best retrieval performance. Reproducible experiments are conducted on publicly-available TREC CAST datasets. The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.

[IR-1] DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities

链接: https://arxiv.org/abs/2410.07722
作者: Thong Nguyen,Shubham Chatterjee,Sean MacAvaney,Ian Mackie,Jeff Dalton,Andrew Yates
关键词-EN: Learned Sparse Retrieval, Learned Sparse, pre-trained transformers, nonsensical fragments, Sparse Retrieval
类目: Information Retrieval (cs.IR)
*备注: this https URL

点击查看摘要

Abstract:Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, which often split entities into nonsensical fragments. Splitting entities can reduce retrieval accuracy and limits the model’s ability to incorporate up-to-date world knowledge not included in the training data. In this work, we enhance the LSR vocabulary with Wikipedia concepts and entities, enabling the model to resolve ambiguities more effectively and stay current with evolving knowledge. Central to our approach is a Dynamic Vocabulary (DyVo) head, which leverages existing entity embeddings and an entity retrieval component that identifies entities relevant to a query or document. We use the DyVo head to generate entity weights, which are then merged with word piece weights to create joint representations for efficient indexing and retrieval using an inverted index. In experiments across three entity-rich document ranking datasets, the resulting DyVo model substantially outperforms state-of-the-art baselines.

[IR-2] DISCO: A Hierarchical Disentangled Cognitive Diagnosis Framework for Interpretable Job Recommendation ICDM2024

链接: https://arxiv.org/abs/2410.07671
作者: Xiaoshan Yu,Chuan Qin,Qi Zhang,Chen Zhu,Haiping Ma,Xingyi Zhang,Hengshu Zhu
关键词-EN: created unprecedented opportunities, accurately pinpointing positions, online recruitment platforms, job seekers, skills and preferences
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by ICDM 2024. 10 pages

点击查看摘要

Abstract:The rapid development of online recruitment platforms has created unprecedented opportunities for job seekers while concurrently posing the significant challenge of quickly and accurately pinpointing positions that align with their skills and preferences. Job recommendation systems have significantly alleviated the extensive search burden for job seekers by optimizing user engagement metrics, such as clicks and applications, thus achieving notable success. In recent years, a substantial amount of research has been devoted to developing effective job recommendation models, primarily focusing on text-matching based and behavior modeling based methods. While these approaches have realized impressive outcomes, it is imperative to note that research on the explainability of recruitment recommendations remains profoundly unexplored. To this end, in this paper, we propose DISCO, a hierarchical Disentanglement based Cognitive diagnosis framework, aimed at flexibly accommodating the underlying representation learning model for effective and interpretable job recommendations. Specifically, we first design a hierarchical representation disentangling module to explicitly mine the hierarchical skill-related factors implied in hidden representations of job seekers and jobs. Subsequently, we propose level-aware association modeling to enhance information communication and robust representation learning both inter- and intra-level, which consists of the interlevel knowledge influence module and the level-wise contrastive learning. Finally, we devise an interaction diagnosis module incorporating a neural diagnosis function for effectively modeling the multi-level recruitment interaction process between job seekers and jobs, which introduces the cognitive measurement theory.

[IR-3] Firzen: Firing Strict Cold-Start Items with Frozen Heterogeneous and Homogeneous Graphs for Recommendation ICDE2024

链接: https://arxiv.org/abs/2410.07654
作者: Hulingxiao He,Xiangteng He,Yuxin Peng,Zifei Shan,Xin Su
关键词-EN: utilizing unique identities, represent distinct users, recommender systems literature, strict cold-start item, models utilizing unique
类目: Information Retrieval (cs.IR)
*备注: Accepted by ICDE 2024. The code is available at this https URL

点击查看摘要

Abstract:Recommendation models utilizing unique identities (IDs) to represent distinct users and items have dominated the recommender systems literature for over a decade. Since multi-modal content of items (e.g., texts and images) and knowledge graphs (KGs) may reflect the interaction-related users’ preferences and items’ characteristics, they have been utilized as useful side information to further improve the recommendation quality. However, the success of such methods often limits to either warm-start or strict cold-start item recommendation in which some items neither appear in the training data nor have any interactions in the test stage: (1) Some fail to learn the embedding of a strict cold-start item since side information is only utilized to enhance the warm-start ID representations; (2) The others deteriorate the performance of warm-start recommendation since unrelated multi-modal content or entities in KGs may blur the final representations. In this paper, we propose a unified framework incorporating multi-modal content of items and KGs to effectively solve both strict cold-start and warm-start recommendation termed Firzen, which extracts the user-item collaborative information over frozen heterogeneous graph (collaborative knowledge graph), and exploits the item-item semantic structures and user-user behavioral association over frozen homogeneous graphs (item-item relation graph and user-user co-occurrence graph). Furthermore, we build four unified strict cold-start evaluation benchmarks based on publicly available Amazon datasets and a real-world industrial dataset from Weixin Channels via rearranging the interaction data and constructing KGs. Extensive empirical results demonstrate that our model yields significant improvements for strict cold-start recommendation and outperforms or matches the state-of-the-art performance in the warm-start scenario.

[IR-4] CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

链接: https://arxiv.org/abs/2410.07610
作者: Po-han Li,Sandeep P. Chinchali,Ufuk Topcu
关键词-EN: cross-modal retrieval, CSA, excel in tasks, Multimodal, CLIP excel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring 300,000\times fewer multimodal data pairs and 6\times fewer unimodal data for ImageNet classification and misinformative news captions detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.

[IR-5] No Free Lunch: Retrieval-Augmented Generation Undermines Fairness in LLMs Even for Vigilant Users

链接: https://arxiv.org/abs/2410.07589
作者: Mengxuan Hu,Hongyi Wu,Zihan Guan,Ronghang Zhu,Dongliang Guo,Daiqing Qi,Sheng Li
关键词-EN: domain-specific generation capabilities, Retrieval-Augmented Generation, large language models, domain-specific generation, generation capabilities
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is widely adopted for its effectiveness and cost-efficiency in mitigating hallucinations and enhancing the domain-specific generation capabilities of large language models (LLMs). However, is this effectiveness and cost-efficiency truly a free lunch? In this study, we comprehensively investigate the fairness costs associated with RAG by proposing a practical three-level threat model from the perspective of user awareness of fairness. Specifically, varying levels of user fairness awareness result in different degrees of fairness censorship on the external dataset. We examine the fairness implications of RAG using uncensored, partially censored, and fully censored datasets. Our experiments demonstrate that fairness alignment can be easily undermined through RAG without the need for fine-tuning or retraining. Even with fully censored and supposedly unbiased external datasets, RAG can lead to biased outputs. Our findings underscore the limitations of current alignment methods in the context of RAG-based LLMs and highlight the urgent need for new strategies to ensure fairness. We propose potential mitigations and call for further research to develop robust fairness safeguards in RAG-based LLMs.

[IR-6] he trade-off between data minimization and fairness in collaborative filtering

链接: https://arxiv.org/abs/2410.07182
作者: Nasim Sonboli,Sipei Li,Mehdi Elahi,Asia Biega
关键词-EN: General Data Protection, Data Protection Regulations, Protection Regulations, safeguard individuals’ personal, individuals’ personal information
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:General Data Protection Regulations (GDPR) aim to safeguard individuals’ personal information from harm. While full compliance is mandatory in the European Union and the California Privacy Rights Act (CPRA), it is not in other places. GDPR requires simultaneous compliance with all the principles such as fairness, accuracy, and data minimization. However, it overlooks the potential contradictions within its principles. This matter gets even more complex when compliance is required from decision-making systems. Therefore, it is essential to investigate the feasibility of simultaneously achieving the goals of GDPR and machine learning, and the potential tradeoffs that might be forced upon us. This paper studies the relationship between the principles of data minimization and fairness in recommender systems. We operationalize data minimization via active learning (AL) because, unlike many other methods, it can preserve a high accuracy while allowing for strategic data collection, hence minimizing the amount of data collection. We have implemented several active learning strategies (personalized and non-personalized) and conducted a comparative analysis focusing on accuracy and fairness on two publicly available datasets. The results demonstrate that different AL strategies may have different impacts on the accuracy of recommender systems with nearly all strategies negatively impacting fairness. There has been no to very limited work on the trade-off between data minimization and fairness, the pros and cons of active learning methods as tools for implementing data minimization, and the potential impacts of AL on fairness. By exploring these critical aspects, we offer valuable insights for developing recommender systems that are GDPR compliant.

[IR-7] Orthogonal Nonnegative Matrix Factorization with the Kullback-Leibler divergence

链接: https://arxiv.org/abs/2410.07786
作者: Jean Pacifique Nkurunziza,Fulgence Nahayo,Nicolas Gillis
关键词-EN: Orthogonal nonnegative matrix, nonnegative matrix factorization, Orthogonal nonnegative, matrix factorization, approach for clustering
类目: Machine Learning (stat.ML); Information Retrieval (cs.IR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 10 pages

点击查看摘要

Abstract:Orthogonal nonnegative matrix factorization (ONMF) has become a standard approach for clustering. As far as we know, most works on ONMF rely on the Frobenius norm to assess the quality of the approximation. This paper presents a new model and algorithm for ONMF that minimizes the Kullback-Leibler (KL) divergence. As opposed to the Frobenius norm which assumes Gaussian noise, the KL divergence is the maximum likelihood estimator for Poisson-distributed data, which can model better vectors of word counts in document data sets and photo counting processes in imaging. We have developed an algorithm based on alternating optimization, KL-ONMF, and show that it performs favorably with the Frobenius-norm based ONMF for document classification and hyperspectral image unmixing.

附件下载

点击下载今日全部论文列表