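This post lists the latest papers retrieved from arXiv.org on 2024-10-16. The list updates automatically and is grouped into five areas: NLP, CV, ML, AI, and IR. To receive it by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 11:00 each morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.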

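Overview (2024-10-16)

A total of 161 papers were updated today, including:

  • Natural Language Processing: 51 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 58 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 35 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 61 papers (Machine Learning (cs.LG))

Natural Language Processing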

[NLP-0] A Hitchhiker's Guide to Scaling Law Estimation

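[Quick read]: This paper addresses how to estimate and interpret scaling laws for machine learning models more accurately. The approach is to collect and analyze loss and downstream evaluation data from a large number of pretrained models, then fit these data to derive scaling laws that apply to new model families. The specific recommendations are: 1) fit to intermediate training checkpoints rather than final losses alone, which improves accuracy; 2) prefer estimates derived from models close in size to the target model; 3) train several small models to reduce the variability caused by model seeds; 4) although scaling behavior differs across model families, a target model's behavior can often be predicted from a single model with the same architecture plus scaling parameters estimated from other model families.

Link: https://arxiv.org/abs/2410.11840
Authors: Leshem Choshen, Yang Zhang, Jacob Andreas
Keywords: Scaling laws, machine learning model, target machine learning, Scaling laws predict, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: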

Abstract:Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that – all else equal – estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ scaling behavior, they are often similar enough that a target model’s behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
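As a rough illustration of what "extrapolating from easier-to-train models" can look like, the sketch below fits a simplified power law, loss ≈ a · N^(−alpha), to a few small models and predicts the loss of a larger one. The functional form (no irreducible-loss term, no dependence on data size), the function names, and the synthetic numbers are all assumptions made for this example; they are not the fitting procedure used in the paper.

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Fit loss ~ a * n_params**(-alpha) by least squares in log-log space."""
    log_n = np.log(np.asarray(n_params, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # In log space the law is a straight line: log L = log a - alpha * log N.
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, alpha, n_params):
    """Evaluate the fitted law at a new model size."""
    return a * float(n_params) ** (-alpha)

# Synthetic (made-up) losses for four small models, generated from a known law.
small_sizes = [1e7, 3e7, 1e8, 3e8]
small_losses = [5.0 * n ** -0.1 for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(a, alpha, 1e9)  # extrapolate to a larger target model
```

With real measurements the points would be noisy, so adding more points (for example from several seeds or from intermediate checkpoints, given a law that also models training tokens) is what the abstract's recommendations are about.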

[NLP-1] NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models

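[Quick read]: This paper addresses the lack of adequate evaluation of nested tool learning in current large language models (LLMs). It introduces NesTools, a new benchmark dataset built with a novel automatic data generation method that constructs large-scale nested tool-call instances, followed by manual review and refinement so that the data is high quality and closely matches real-world scenarios. NesTools provides a comprehensive benchmark for evaluating nested tool learning and shows that existing LLMs still struggle with complex nested tool learning tasks.

Link: https://arxiv.org/abs/2410.11805
Authors: Han Han, Tong Zhu, Xiang Zhang, Mengsong Wu, Hao Xiong, Wenliang Chen
Keywords: Large language models, gained impressive results, Large language, nested tool learning, tool learning
Categories: Computation and Language (cs.CL)
Comments: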

Abstract:Large language models (LLMs) combined with tool learning have gained impressive results in real-world applications. During tool learning, LLMs may call multiple tools in nested orders, where the latter tool call may take the former response as its input parameters. However, current research on the nested tool learning capabilities is still under-explored, since the existing benchmarks lack of relevant data instances. To address this problem, we introduce NesTools to bridge the current gap in comprehensive nested tool learning evaluations. NesTools comprises a novel automatic data generation method to construct large-scale nested tool calls with different nesting structures. With manual review and refinement, the dataset is in high quality and closely aligned with real-world scenarios. Therefore, NesTools can serve as a new benchmark to evaluate the nested tool learning abilities of LLMs. We conduct extensive experiments on 22 LLMs, and provide in-depth analyses with NesTools, which shows that current LLMs still suffer from the complex nested tool learning task.
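To make "nested" concrete: a later tool call takes the response of an earlier call as one of its input parameters. The sketch below evaluates such a call tree recursively. The tool names, their behavior, and the dictionary format are invented for illustration and are not taken from the NesTools dataset.

```python
# Hypothetical tool registry; names and behavior are invented for this example.
TOOLS = {
    "get_city": lambda user: {"alice": "Paris"}.get(user, "unknown"),
    "get_weather": lambda city: f"sunny in {city}",
}

def run_call(call):
    """Evaluate a tool call whose arguments may themselves be tool calls."""
    # Resolve inner calls first so their responses become plain arguments.
    args = [run_call(a) if isinstance(a, dict) else a for a in call["args"]]
    return TOOLS[call["tool"]](*args)

# get_weather depends on the response of get_city: a two-level nesting.
nested = {
    "tool": "get_weather",
    "args": [{"tool": "get_city", "args": ["alice"]}],
}
result = run_call(nested)
```

A model being evaluated would have to produce both calls in the right order and wire the first response into the second call's parameter.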

[NLP-2] Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability EMNLP2024

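[Quick read]: This paper addresses the extra computational and financial cost that in-context learning adds to large language models (LLMs). It proposes Selection-p, a unified compression method that uses a self-supervised pre-training technique to discretize uninformative tokens. Selection-p introduces a small number of parameters during continual pre-training and produces a probability for each input token that indicates whether to keep or discard it. Experiments show state-of-the-art performance across many classification tasks, with compression rates of up to 10x at a performance drop of only 0.8%, and better transferability across models than prior methods. The paper also analyzes how Selection-p helps maintain performance in in-context learning with long contexts.

Link: https://arxiv.org/abs/2410.11786
Authors: Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung
Keywords: Large Language Models, natural language processing, Large Language, demonstrated impressive capabilities, language processing tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 10 tables, EMNLP 2024 Findings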

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of natural language processing tasks when leveraging in-context learning. To mitigate the additional computational and financial costs associated with in-context learning, several prompt compression methods have been proposed to compress the in-context learning prompts. Despite their success, these methods face challenges with transferability due to model-specific compression, or rely on external training data, such as GPT-4. In this paper, we investigate the ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training technique. By introducing a small number of parameters during the continual pre-training, the proposed Selection-p produces a probability for each input token, indicating whether to preserve or discard it. Experiments show Selection-p achieves state-of-the-art performance across numerous classification tasks, achieving compression rates of up to 10 times while experiencing only a marginal 0.8% decrease in performance. Moreover, it exhibits superior transferability to different models compared to prior work. Additionally, we further analyze how Selection-p helps maintain performance on in-context learning with long contexts.
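Once each token has a keep probability, compressing a prompt reduces to dropping the least informative tokens. The sketch below keeps the highest-probability tokens up to a budget while preserving their original order. The probabilities are hand-written stand-ins for a model's output, and the top-k budget rule is an assumption for this example; the paper may threshold or select differently.

```python
def compress_prompt(tokens, keep_probs, ratio):
    """Keep the highest-probability tokens, preserving their original order."""
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    budget = max(1, round(len(tokens) * ratio))
    # Rank token positions by keep probability, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:budget])  # restore left-to-right order
    return [tokens[i] for i in kept]

tokens = ["the", "movie", "was", "truly", "great", "."]
# Made-up keep probabilities standing in for the model's per-token output.
probs = [0.1, 0.9, 0.2, 0.3, 0.95, 0.05]
compressed = compress_prompt(tokens, probs, ratio=0.5)
```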

[NLP-3] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

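[Quick read]: This paper addresses hallucination in multimodal large language models (MLLMs): the model generates incorrect objects in its final output even though it recognizes the visual objects in preceding layers. It proposes dynamic correction decoding (DeCo), which adaptively selects appropriate preceding layers and proportionally integrates their knowledge into the final layer to adjust the output logits, reducing the hallucination rate. DeCo is model-agnostic, combines seamlessly with various classic decoding strategies, and applies to different MLLMs.

Link: https://arxiv.org/abs/2410.11779
Authors: Chenxi Wang, Xiang Chen, Ningyu Zhang, Bozhong Tian, Haoming Xu, Shumin Deng, Huajun Chen
Keywords: Multimodal Large Language, remain poorly understood, underlying reasons remain, reasons remain poorly, frequently exhibit hallucination
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Ongoing work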

Abstract:Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. We speculate that this may be due to the strong knowledge priors of the language model suppressing the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs (DeCo), which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits. Note that DeCo is model agnostic and can be seamlessly incorporated with various classic decoding strategies and applied to different MLLMs. We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at this https URL.
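The general idea of blending an earlier layer's prediction into the final logits can be sketched as follows. This is only a toy under stated assumptions: the selection rule (pick the candidate layer with the most confident distribution), the additive blend with weight `alpha`, and all names are choices made for this example and should not be read as the DeCo algorithm itself.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def corrected_logits(layer_logits, candidate_layers, alpha):
    """Blend one preceding layer's logits into the final layer's logits.

    layer_logits: array of shape (num_layers, vocab); the last row is the
    final layer. Returns the adjusted logits and the chosen layer index.
    """
    final = layer_logits[-1]
    # Assumed rule: choose the candidate layer whose distribution is most confident.
    best = max(candidate_layers, key=lambda i: softmax(layer_logits[i]).max())
    return final + alpha * layer_logits[best], best

# Toy logits: layer 1 strongly prefers token 1, the final layer prefers token 0.
layer_logits = np.array([[0.0, 0.0, 0.0],
                         [0.0, 4.0, 0.0],
                         [2.0, 1.0, 0.0]])
adjusted, chosen = corrected_logits(layer_logits, candidate_layers=[0, 1], alpha=0.5)
```

In this toy case the adjustment flips the top token from 0 to 1, which is the kind of correction the abstract describes when an earlier layer "sees" the right object.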

[NLP-4] Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models EMNLP2024

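[Quick read]: This paper addresses a limitation of existing parameter-efficient fine-tuning (PEFT) methods for pre-trained large language models (LLMs): they apply a uniform architectural design to every layer, ignore differences in layer importance, and therefore produce sub-optimal fine-tuning results. It proposes Importance-aware Sparse Tuning (IST), which uses layer-wise importance scoring to dynamically select and update the most important subset of layers, improving the performance of PEFT modules while reducing memory demands. The paper also provides a theoretical convergence proof and experimental evidence of IST's advantage over uniform updating.

Link: https://arxiv.org/abs/2410.11772
Authors: Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu
Keywords: Large Language Models, pre-trained Large Language, adapting pre-trained Large, Language Models, Large Language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024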

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant popularity for adapting pre-trained Large Language Models (LLMs) to downstream tasks, primarily due to their potential to significantly reduce memory and computational overheads. However, a common limitation in most PEFT approaches is their application of a uniform architectural design across all layers. This uniformity involves identical trainable modules and ignores the varying importance of each layer, leading to sub-optimal fine-tuning results. To overcome the above limitation and obtain better performance, we develop a novel approach, Importance-aware Sparse Tuning (IST), to fully utilize the inherent sparsity and select the most important subset of full layers with effective layer-wise importance scoring. The proposed IST is a versatile and plug-and-play technique compatible with various PEFT methods that operate on a per-layer basis. By leveraging the estimated importance scores, IST dynamically updates these selected layers in PEFT modules, leading to reduced memory demands. We further provide theoretical proof of convergence and empirical evidence of superior performance to demonstrate the advantages of IST over uniform updating strategies. Extensive experiments on a range of LLMs, PEFTs, and downstream tasks substantiate the effectiveness of our proposed method, showcasing IST’s capacity to enhance existing layer-based PEFT methods. Our code is available at this https URL.
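Selecting a subset of layers from importance scores can be sketched in a few lines. The example below uses a deterministic top-k rule and produces a per-layer "trainable" mask; both the rule and the hand-written scores are assumptions for illustration, since the abstract does not say how IST scores or samples layers.

```python
def select_layers(importance, k):
    """Return the indices of the k most important layers, in layer order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])

def trainable_mask(importance, k):
    """True for layers whose PEFT module is updated this step, False otherwise."""
    chosen = set(select_layers(importance, k))
    return [i in chosen for i in range(len(importance))]

# Made-up importance scores for a four-layer model.
importance = [0.2, 0.9, 0.1, 0.7]
mask = trainable_mask(importance, k=2)
```

Layers marked False would be left frozen for that step, which is where the memory saving described in the abstract would come from.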

[NLP-5] Latent Action Pretraining from Videos

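[Quick read]: This paper addresses the dependence of existing Vision-Language-Action (VLA) models on manually collected robot action labels during pretraining, which limits data sources and scale. It proposes Latent Action Pretraining for general Action models (LAPA), an unsupervised pretraining method that uses internet-scale unlabeled video. LAPA first trains a VQ-VAE-based action quantization model to learn discrete latent actions between image frames, then pretrains a latent VLA model to predict those latent actions, and finally fine-tunes the VLA model on small-scale robot manipulation data to map latent actions to robot actions. The method significantly outperforms existing techniques and performs well on real-world manipulation tasks, especially those that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions.

Link: https://arxiv.org/abs/2410.11758
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
Keywords: action labels, Latent Action Pretraining, robot action labels, introduce Latent Action, ground-truth robot action
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL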

Abstract:We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.
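The quantization step at the heart of a VQ-VAE maps a continuous vector to the index of its nearest codebook entry. The sketch below applies that lookup to the change between two frame feature vectors to obtain a discrete "latent action" id. Using the raw feature difference in place of a learned encoder, the 2-D features, and the three-entry codebook are all simplifying assumptions for this example.

```python
import numpy as np

def quantize_action(frame_t, frame_next, codebook):
    """Return the index of the codebook vector nearest to the frame change."""
    # Assumption: the raw feature difference stands in for a learned encoder output.
    delta = np.asarray(frame_next, dtype=float) - np.asarray(frame_t, dtype=float)
    distances = np.linalg.norm(codebook - delta, axis=1)
    return int(np.argmin(distances))

# Made-up codebook: three latent actions in a 2-D feature space.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
latent_action = quantize_action([0.2, 0.1], [1.1, 0.2], codebook)
```

The resulting integer ids are what a latent VLA model could be trained to predict from observations and task descriptions, before a later stage maps them to real robot actions.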

[NLP-6] Personas with Attitudes: Controlling LLMs for Diverse Data Annotation

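[Quick read]: This paper addresses the lack of diversity and control in data annotation tasks. It personalizes large language models (LLMs) by injecting diverse persona descriptions into LLM prompts to increase annotation diversity while keeping the effect controllable and repeatable. The results show that persona-prompted LLMs produce more diverse annotations than LLMs prompted without personas, and that the effect is controllable and consistent, making the approach a useful tool for improving data annotation in subjective NLP tasks such as toxicity detection.

Link: https://arxiv.org/abs/2410.11745
Authors: Leon Fröhling, Gianluca Demartini, Dennis Assenmacher
Keywords: large language models, personalizing large language, language models, personalizing large, large language
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 13 figures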

Abstract:We present a novel approach for enhancing diversity and control in data annotation tasks by personalizing large language models (LLMs). We investigate the impact of injecting diverse persona descriptions into LLM prompts across two studies, exploring whether personas increase annotation diversity and whether the impacts of individual personas on the resulting annotations are consistent and controllable. Our results show that persona-prompted LLMs produce more diverse annotations than LLMs prompted without personas and that these effects are both controllable and repeatable, making our approach a suitable tool for improving data annotation in subjective NLP tasks like toxicity detection.
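Injecting a persona into an annotation prompt is mostly string templating. The sketch below builds one toxicity-annotation prompt per persona, plus a persona-free baseline. The prompt wording and the persona descriptions are invented for illustration and are not the prompts used in the paper.

```python
def build_prompt(text, persona=None):
    """Build a toxicity-annotation prompt, optionally prefixed with a persona."""
    task = (
        'Rate the following comment as "toxic" or "not toxic".\n'
        f"Comment: {text}\n"
        "Label:"
    )
    if persona is None:
        return task  # baseline prompt without a persona
    return f"You are {persona}. Answer from this perspective.\n{task}"

# Invented persona descriptions, used only to show the mechanism.
personas = ["a 25-year-old software engineer", "a 60-year-old retired teacher"]
prompts = [build_prompt("example comment", p) for p in personas]
baseline = build_prompt("example comment")
```

Sending each prompt to the same model and comparing the labels is one way to measure how much the persona shifts the annotation.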

[NLP-7] ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

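[Quick read]: This paper addresses hallucination in retrieval-augmented generation (RAG), where the model generates content that conflicts with the retrieved external knowledge. It proposes ReDeEP, which detects hallucinations by decoupling how large language models (LLMs) use external context and parametric knowledge. By analyzing the roles of Knowledge FFNs and Copying Heads, the authors find that hallucinations arise when Knowledge FFNs overemphasize parametric knowledge while Copying Heads fail to retain or integrate external knowledge. ReDeEP significantly improves RAG hallucination detection accuracy, and the paper also introduces AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.

Link: https://arxiv.org/abs/2410.11414
Authors: Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, Han Li
Keywords: Retrieval-Augmented Generation, Large Language Models, reducing hallucinations caused, knowledge, designed to incorporate
Categories: Computation and Language (cs.CL)
Comments: 23 pages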

Abstract:Retrieval-Augmented Generation (RAG) models are designed to incorporate external knowledge, reducing hallucinations caused by insufficient parametric (internal) knowledge. However, even with accurate and relevant retrieved content, RAG models can still produce hallucinations by generating outputs that conflict with the retrieved information. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) utilize external and parametric knowledge. Current detection methods often focus on one of these mechanisms or without decoupling their intertwined effects, making accurate detection difficult. In this paper, we investigate the internal mechanisms behind hallucinations in RAG scenarios. We discover hallucinations occur when the Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual stream, while Copying Heads fail to effectively retain or integrate external knowledge from retrieved content. Based on these findings, we propose ReDeEP, a novel method that detects hallucinations by decoupling LLM’s utilization of external context and parametric knowledge. Our experiments show that ReDeEP significantly improves RAG hallucination detection accuracy. Additionally, we introduce AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.

[NLP-8] PMMT: Preference Alignment in Multilingual Machine Translation via LLM Distillation

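[Quick read]: This paper addresses the alignment of translations with human preferences, that is, how to make translations match a particular tone or style. It uses large language models (LLMs) to generate large-scale multilingual parallel corpora with specific translation preferences, and designs an automatic pipeline that distills those preferences into smaller machine translation (MT) models, so that large-scale calls in online services can be supported efficiently and economically. Experiments show a large lead on translation tasks with aligned human preferences, and competitive performance against state-of-the-art work on public benchmarks such as WMT and Flores, on which the models were not trained.

Link: https://arxiv.org/abs/2410.11410
Authors: Shuqiao Sun, Yutong Yao, Peiwen Wu, Feijun Jiang, Kaifu Zhang
Keywords: cross-language communication, improve its accuracy, important for cross-language, made to improve, Translation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: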

Abstract:Translation is important for cross-language communication, and many efforts have been made to improve its accuracy. However, less investment is conducted in aligning translations with human preferences, such as translation tones or styles. In this paper, a new method is proposed to effectively generate large-scale multilingual parallel corpora with specific translation preferences using Large Language Models (LLMs). Meanwhile, an automatic pipeline is designed to distill human preferences into smaller Machine Translation (MT) models for efficiently and economically supporting large-scale calls in online services. Experiments indicate that the proposed method takes the lead in translation tasks with aligned human preferences by a large margin. Meanwhile, on popular public benchmarks like WMT and Flores, on which our models were not trained, the proposed method also shows a competitive performance compared to SOTA works.

[NLP-9] Do LLMs Have the Generalization Ability in Conducting Causal Inference?

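[Quick read]: This paper addresses how well large language models (LLMs) generalize to unseen phenomena when conducting causal inference. It proposes a benchmark generation framework that uses randomly generated graphs and node names to formulate evaluation questions in hypothetical new causal scenarios, and compiles a benchmark dataset with questions of varying complexity. The framework allows systematic testing of LLM generalization on four tasks: Causal Path Discovery (CP), Backdoor Adjustment (BA), Factual Inference (FI), and Counterfactual Inference (CI). The results reveal performance differences between simple and complex questions, particular difficulty with backdoor adjustment, and interference when phenomenon names contain existing terms.

Link: https://arxiv.org/abs/2410.11385
Authors: Chen Wang, Dongming Zhao, Bo Wang, Ruifang He, Yuexian Hou
Keywords: Large Language Models, conduct causal inference, causal inference methods, causal inference, generalization capability refers
Categories: Computation and Language (cs.CL)
Comments: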

Abstract:In causal inference, generalization capability refers to the ability to conduct causal inference methods on new data to estimate the causal-effect between unknown phenomenon, which is crucial for expanding the boundaries of knowledge. Studies have evaluated the causal inference capabilities of Large Language Models (LLMs) concerning known phenomena, yet the generalization capabilities of LLMs concerning unseen phenomena remain unexplored. In this paper, we selected four tasks: Causal Path Discovery (CP), Backdoor Adjustment (BA), Factual Inference (FI), and Counterfactual Inference (CI) as representatives of causal inference tasks. To generate evaluation questions about previously unseen phenomena in new data on the four tasks, we propose a benchmark generation framework, which employs randomly generated graphs and node names to formulate questions within hypothetical new causal scenarios. Based on this framework, we compile a benchmark dataset of varying levels of question complexity. We extensively tested the generalization capabilities of five leading LLMs across four tasks. Experiment results reveal that while LLMs exhibit good generalization performance in solving simple CP, FI, and complex CI questions, they encounter difficulties when tackling BA questions and face obvious performance fluctuations as the problem complexity changes. Furthermore, when the names of phenomena incorporate existing terms, even if these names are entirely novel, their generalization performance can still be hindered by interference from familiar terms.
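A benchmark question of the Causal Path Discovery kind can be generated from a random directed acyclic graph whose nodes carry invented names, so that no real-world knowledge helps. The sketch below is one possible construction under stated assumptions: the six-letter random names, the forward-only edge rule, and the question wording are choices for this example, not the paper's generator.

```python
import random
import string

def random_dag(num_nodes, edge_prob, seed):
    """Build a random DAG with invented lowercase node names."""
    rng = random.Random(seed)
    names = []
    while len(names) < num_nodes:
        name = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
        if name not in names:
            names.append(name)
    edges = {name: [] for name in names}
    # Only allow edges from earlier to later nodes, which guarantees acyclicity.
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < edge_prob:
                edges[names[i]].append(names[j])
    return names, edges

def has_causal_path(edges, source, target):
    """Depth-first search for a directed path from source to target."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges[node])
    return False

names, edges = random_dag(num_nodes=5, edge_prob=0.5, seed=0)
question = f"Does {names[0]} causally affect {names[-1]}?"
answer = has_causal_path(edges, names[0], names[-1])  # ground-truth label
```

Raising `num_nodes` or `edge_prob` is a simple way to vary question complexity, which is the axis along which the paper reports performance fluctuations.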

[NLP-10] A Framework for Adapting Human-Robot Interaction to Diverse User Groups

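[Quick read]: This paper addresses natural, intuitive interaction with diverse user groups in real-world settings, in particular how a robot can adjust its behavior to the needs and expectations of different groups and adapt the interaction based on user feedback. It presents an adaptive, ROS-based Human-Robot Interaction (HRI) framework that supports natural interaction through advanced speech recognition and voice activity detection and uses a large language model (LLM) as a dialogue bridge, so that interaction strategies can be adjusted dynamically for user groups and for individual users. The code base is open source, and module tests validate the framework's accuracy in age recognition and its robustness to repeated user inputs and plan changes.

Link: https://arxiv.org/abs/2410.11377
Authors: Theresa Pekarek Rosin, Vanessa Hassouna, Xiaowen Sun, Luca Krohm, Henri-Leon Kordt, Michael Beetz, Stefan Wermter
Keywords: diverse user groups, real-world settings, social robots, capable of addressing, addressing the varying
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at the 16th International Conference on Social Robotics (ICSR) 2024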

Abstract:To facilitate natural and intuitive interactions with diverse user groups in real-world settings, social robots must be capable of addressing the varying requirements and expectations of these groups while adapting their behavior based on user feedback. While previous research often focuses on specific demographics, we present a novel framework for adaptive Human-Robot Interaction (HRI) that tailors interactions to different user groups and enables individual users to modulate interactions through both minor and major interruptions. Our primary contributions include the development of an adaptive, ROS-based HRI framework with an open-source code base. This framework supports natural interactions through advanced speech recognition and voice activity detection, and leverages a large language model (LLM) as a dialogue bridge. We validate the efficiency of our framework through module tests and system trials, demonstrating its high accuracy in age recognition and its robustness to repeated user inputs and plan changes.

[NLP-11] Learning from Imperfect Data: Towards Efficient Knowledge Distillation of Autoregressive Language Models for Text-to-SQL EMNLP2024

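[Quick read]: This paper addresses the difficulty existing knowledge distillation (KD) methods have in balancing performance and efficiency in complex text-to-SQL scenarios. It proposes KID, an improved approach that performs KD with imperfect data. The core idea is to simulate the cascading effect of inference in the imperfect training data, which mitigates the mismatch between training and inference and improves performance without adding much training cost. Experiments show that KID yields significant gains on multiple text-to-SQL benchmarks and also improves training efficiency.

Link: https://arxiv.org/abs/2410.11371
Authors: Qihuang Zhong, Kunfeng Chen, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
Keywords: Large Language Models, translating natural language, natural language questions, Large Language, involves translating natural
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted to EMNLP2024 Findings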

Abstract:Large Language Models (LLMs) have shown promising performance in text-to-SQL, which involves translating natural language questions into SQL queries. However, current text-to-SQL LLMs are computationally expensive and challenging to deploy in real-world applications, highlighting the importance of compressing them. To achieve this goal, knowledge distillation (KD) is a common approach, which aims to distill the larger teacher model into a smaller student model. While numerous KD methods for autoregressive LLMs have emerged recently, it is still under-explored whether they work well in complex text-to-SQL scenarios. To this end, we conduct a series of analyses and reveal that these KD methods generally fall short in balancing performance and efficiency. In response to this problem, we propose to improve the KD with Imperfect Data, namely KID, which effectively boosts the performance without introducing much training budget. The core of KID is to efficiently mitigate the training-inference mismatch by simulating the cascading effect of inference in the imperfect training data. Extensive experiments on 5 text-to-SQL benchmarks show that, KID can not only achieve consistent and significant performance gains (up to +5.83% average score) across all model types and sizes, but also effectively improve the training efficiency.

[NLP-12] Enhance Graph Alignment for Large Language Models

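[Quick read]: This paper addresses a misalignment between self-supervised tasks and supervised downstream tasks in existing methods for graph-structured data, which causes negative transfer from self-supervised fine-tuning to downstream tasks. It proposes Graph Alignment Large Language Models (GALLM), which rely on task templates aligned with the downstream tasks: an aligned text matching task in the self-supervised tuning stage, and two category prompt methods with further aligned templates in the task-specific tuning stage. The result is substantial improvement in supervised learning, multi-dataset generalizability, and zero-shot capability.

Link: https://arxiv.org/abs/2410.11370
Authors: Haitong Luo, Xuying Meng, Suhang Wang, Tianxiang Zhao, Fali Wang, Hanyun Cao, Yujun Zhang
Keywords: Large Language Models, Graph-structured data, Large Language, Alignment Large Language, real world
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review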

Abstract:Graph-structured data is prevalent in the real world. Recently, due to the powerful emergent capabilities, Large Language Models (LLMs) have shown promising performance in modeling graphs. The key to effectively applying LLMs on graphs is converting graph data into a format LLMs can comprehend. Graph-to-token approaches are popular in enabling LLMs to process graph information. They transform graphs into sequences of tokens and align them with text tokens through instruction tuning, where self-supervised instruction tuning helps LLMs acquire general knowledge about graphs, and supervised fine-tuning specializes LLMs for the downstream tasks on graphs. Despite their initial success, we find that existing methods have a misalignment between self-supervised tasks and supervised downstream tasks, resulting in negative transfer from self-supervised fine-tuning to downstream tasks. To address these issues, we propose Graph Alignment Large Language Models (GALLM) to benefit from aligned task templates. In the self-supervised tuning stage, we introduce a novel text matching task using templates aligned with downstream tasks. In the task-specific tuning stage, we propose two category prompt methods that learn supervision information from additional explanation with further aligned templates. Experimental evaluations on four datasets demonstrate substantial improvements in supervised learning, multi-dataset generalizability, and particularly in zero-shot capability, highlighting the model’s potential as a graph foundation model.

[NLP-13] LargePiG: Your Large Language Model is Secretly a Pointer Generator

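[Quick read]: This paper addresses hallucinations in queries generated by large language models (LLMs), specifically relevance hallucination and factuality hallucination. It proposes a model-agnostic, training-free method that turns the LLM into a pointer generator (LargePiG). The method uses the LLM's inherent attention weights, together with the difference between the vocabulary distributions of the model's high layers and its last layer, to separate content from form: factual knowledge extracted and integrated from the inputs is preserved, while the LLM's linguistic ability produces the syntactic structure, including function words. Experiments show that LargePiG reduces hallucination and improves accuracy on document-based question answering and factuality evaluation tasks.

Link: https://arxiv.org/abs/2410.11366
Authors: Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu
Keywords: Recent research, Large Language Models, query generation, query generation based, Language Models
Categories: Computation and Language (cs.CL)
Comments: 24 pages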

点击查看摘要

Abstract:Recent research on query generation has focused on using Large Language Models (LLMs), which despite bringing state-of-the-art performance, also introduce issues with hallucinations in the generated queries. In this work, we introduce relevance hallucination and factuality hallucination as a new typology for hallucination problems brought by query generation based on LLMs. We propose an effective way to separate content from form in LLM-generated queries, which preserves the factual knowledge extracted and integrated from the inputs and compiles the syntactic structure, including function words, using the powerful linguistic capabilities of the LLM. Specifically, we introduce a model-agnostic and training-free method that turns the Large Language Model into a Pointer-Generator (LargePiG), where the pointer attention distribution leverages the LLM’s inherent attention weights, and the copy probability is derived from the difference between the vocabulary distribution of the model’s high layers and the last layer. To validate the effectiveness of LargePiG, we constructed two datasets for assessing the hallucination problems in query generation, covering both document and video scenarios. Empirical studies on various LLMs demonstrated the superiority of LargePiG on both datasets. Additional experiments also verified that LargePiG could reduce hallucination in large vision language models and improve the accuracy of document-based question-answering and factuality evaluation tasks.
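
The pointer-generator mixing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and measuring the "difference" between the high-layer and last-layer vocabulary distributions with a Jensen-Shannon divergence is an assumption made here for illustration, since the abstract does not name the measure.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pointer_generator_mix(p_last, p_high, attn, src_ids):
    """Blend the LLM's next-token distribution with a copy distribution.

    p_last  : vocabulary distribution from the last layer
    p_high  : vocabulary distribution read out from a high (earlier) layer
    attn    : attention weights over the source tokens
    src_ids : vocabulary id of each source token
    """
    # Pointer distribution: scatter attention mass onto the source tokens' ids.
    p_ptr = np.zeros(len(p_last))
    np.add.at(p_ptr, src_ids, attn)
    p_ptr /= p_ptr.sum()
    # Copy probability (assumption): JS divergence scaled into [0, 1] by ln 2.
    p_copy = min(max(js_divergence(p_high, p_last) / np.log(2), 0.0), 1.0)
    return (1.0 - p_copy) * p_last + p_copy * p_ptr, p_copy

# Toy example: 4-word vocabulary, two source tokens with ids 1 and 2.
p_last = np.array([0.7, 0.1, 0.1, 0.1])
p_high = np.array([0.1, 0.7, 0.1, 0.1])
mixed, p_copy = pointer_generator_mix(
    p_last, p_high, attn=np.array([0.9, 0.1]), src_ids=np.array([1, 2])
)
print(p_copy, mixed)
```

When the two layers agree, the copy probability in this sketch falls to zero and the LLM's own distribution is returned unchanged; the more they disagree, the more probability mass is moved onto tokens that appear in the input.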

[NLP-14] Reducing Labeling Costs in Sentiment Analysis via Semi-Supervised Learning

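Quick read: This paper addresses the high cost and time required to label data in machine learning. The key to the solution is label propagation from semi-supervised learning: a transductive, graph-based label propagation method grounded in the manifold assumption generates pseudo-labels for unlabeled data, which are then used to train deep neural networks. Specifically, labels are extended by cosine similarity within a nearest-neighbor graph, which brings unlabeled data into supervised training and reduces the number of labels required. The approach is evaluated on sentiment analysis.

Link: https://arxiv.org/abs/2410.11355
Authors: Minoo Jafarlou,Mario M. Kubek
Keywords: noteworthy challenge, challenge in machine, machine learning, learning, Labeling datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures, accepted at the 2024 8th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2024), Okayama, Japan, 2024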

Abstract:Labeling datasets is a noteworthy challenge in machine learning, both in terms of cost and time. This research, however, leverages an efficient answer. By exploring label propagation in semi-supervised learning, we can significantly reduce the number of labels required compared to traditional methods. We employ a transductive label propagation method based on the manifold assumption for text classification. Our approach utilizes a graph-based method to generate pseudo-labels for unlabeled data for the text classification task, which are then used to train deep neural networks. By extending labels based on cosine proximity within a nearest neighbor graph from network embeddings, we combine unlabeled data into supervised learning, thereby reducing labeling costs. Based on previous successes in other domains, this study builds and evaluates this approach’s effectiveness in sentiment analysis, presenting insights into semi-supervised learning.
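
To make the idea of extending labels over a cosine nearest-neighbor graph concrete, here is a minimal sketch. It assumes a standard label-spreading iteration over a symmetrized k-NN graph; the abstract does not give the exact update rule, so the function name, the `alpha` damping factor, and the toy embeddings are all illustrative assumptions.

```python
import numpy as np

def propagate_labels(emb, labels, k=2, alpha=0.9, iters=50):
    """Pseudo-label unlabeled points (label -1) from their graph neighbors.

    emb    : (n, d) array of embeddings
    labels : (n,) integer labels, -1 marks an unlabeled point
    """
    labels = np.asarray(labels)
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T                          # cosine similarity
    np.fill_diagonal(sim, -np.inf)         # a point is not its own neighbor
    n = len(x)
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    w = np.zeros((n, n))
    w[rows, nbrs.ravel()] = np.clip(sim[rows, nbrs.ravel()], 0.0, None)
    w = np.maximum(w, w.T)                 # symmetrize the k-NN graph
    d = w.sum(axis=1)
    d[d == 0] = 1.0
    s = w / np.sqrt(np.outer(d, d))        # symmetric normalization
    classes = np.unique(labels[labels >= 0])
    y = np.zeros((n, len(classes)))
    for j, c in enumerate(classes):
        y[labels == c, j] = 1.0
    f = y.copy()
    for _ in range(iters):                 # spread label mass along edges
        f = alpha * (s @ f) + (1.0 - alpha) * y
    return classes[f.argmax(axis=1)]

# Two toy clusters with one labeled example each.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
labels = np.array([0, -1, -1, 1, -1, -1])
pseudo = propagate_labels(emb, labels)
print(pseudo)
```

The resulting pseudo-labels would then be used alongside the original labels as training targets for a downstream classifier.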

[NLP-15] RATE: Score Reward Models with Imperfect Rewrites of Rewrites ICLR2025

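Quick read: This paper addresses the evaluation of reward models used in language modeling, in particular how to measure the causal effect of a response attribute (such as length) on the assigned reward. The key to the solution is RATE (Rewrite-based Attribute Treatment Estimators): a large language model rewrites responses to produce imperfect counterfactuals, and the rewriting error is adjusted for by rewriting twice. The method is shown to be effective on synthetic and real-world data, accurately estimating the effect of a given attribute on the reward model.

Link: https://arxiv.org/abs/2410.11348
Authors: David Reber,Sean Richardson,Todd Nief,Cristina Garbacea,Victor Veitch
Keywords: reward, Rewrite-based Attribute Treatment, reward models, language modeling, paper concerns
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted as a conference paper to ICLR 2025. Code is available at this https URL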

Abstract:This paper concerns the evaluation of reward models used in language modeling. A reward model is a function that takes a prompt and a response and assigns a score indicating how good that response is for the prompt. A key challenge is that reward models are usually imperfect proxies for actual preferences. For example, we may worry that a model trained to reward helpfulness learns to instead prefer longer responses. In this paper, we develop an evaluation method, RATE (Rewrite-based Attribute Treatment Estimators), that allows us to measure the causal effect of a given attribute of a response (e.g., length) on the reward assigned to that response. The core idea is to use large language models to rewrite responses to produce imperfect counterfactuals, and to adjust for rewriting error by rewriting twice. We show that the RATE estimator is consistent under reasonable assumptions. We demonstrate the effectiveness of RATE on synthetic and real-world data, showing that it can accurately estimate the effect of a given attribute on the reward model.
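
The following toy sketch shows why rewriting twice can cancel rewriting error. It is one reading of the abstract, not the authors' code: the `Response` type, the toy `reward` function, and the stand-in `rewrite` function are hypothetical, and the rewriter's side effect is modeled as a constant reward shift purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Response:
    is_long: bool    # the attribute whose causal effect we want
    rewritten: bool  # whether a rewriter has touched the text

def reward(r: Response) -> float:
    # Toy reward model: true length effect is 2.0, and rewritten text
    # happens to score 0.5 higher (the "rewriting error").
    return 2.0 * r.is_long + 0.5 * r.rewritten

def rewrite(r: Response, target_is_long: bool) -> Response:
    # Stand-in for an LLM rewrite that sets the attribute but also
    # leaves its own stylistic footprint on the response.
    return Response(is_long=target_is_long, rewritten=True)

def single_rewrite_effect(treated):
    # Compares an original response with its rewrite: biased, because only
    # one side of the comparison carries the rewriter's footprint.
    return mean(reward(r) - reward(rewrite(r, False)) for r in treated)

def double_rewrite_effect(treated):
    # Compares a rewrite of the rewrite with the rewrite: both sides
    # carry the footprint, so it cancels in the difference.
    return mean(
        reward(rewrite(rewrite(r, False), True)) - reward(rewrite(r, False))
        for r in treated
    )

treated = [Response(is_long=True, rewritten=False) for _ in range(4)]
print(single_rewrite_effect(treated), double_rewrite_effect(treated))
```

In this toy setting the single-rewrite comparison underestimates the length effect by exactly the rewriter's offset, while the double-rewrite comparison recovers it.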

[NLP-16] SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments

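Quick read: This paper addresses running natural language processing (NLP) models efficiently on resource-constrained edge devices such as smartphones, wearables, and IoT systems. The key to the solution is Shakti, a 2.5-billion-parameter language model optimized for such environments. It combines high-performance NLP with optimized efficiency and precision, supports vernacular languages and domain-specific tasks, and performs competitively against larger models while maintaining low latency and on-device efficiency.

Link: https://arxiv.org/abs/2410.11331
Authors: Syed Abdul Gaffar Shakhadri,Kruthika KR,Rakshit Aralimatti
Keywords: billion parameter language, including smartphones, model specifically optimized, billion parameter, IoT systems
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Paper in pdf format is 11 pages and contains 4 tables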

Abstract:We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.

[NLP-17] Sequential LLM Framework for Fashion Recommendation

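Quick read: This paper addresses the unique challenges that recommendation systems face in fashion e-commerce, in particular translating text into relevant product suggestions. The key to the solution is a sequential fashion recommendation framework built on a pre-trained large language model (LLM), which combines parameter-efficient fine-tuning on extensive fashion data with a novel mix-up-based retrieval technique and significantly improves fashion recommendation performance.

Link: https://arxiv.org/abs/2410.11327
Authors: Han Liu,Xianfeng Tang,Tianlang Chen,Jiapeng Liu,Indu Indu,Henry Peng Zou,Peng Dai,Roberto Fernandez Galan,Michael D Porter,Dongmei Jia,Ning Zhang,Lian Xiong
Keywords: prompting major online, major online retailers, global e-commerce sector, prompting major, customer convenience
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: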

Abstract:The fashion industry is one of the leading domains in the global e-commerce sector, prompting major online retailers to employ recommendation systems for product suggestions and customer convenience. While recommendation systems have been widely studied, most are designed for general e-commerce problems and struggle with the unique challenges of the fashion domain. To address these issues, we propose a sequential fashion recommendation framework that leverages a pre-trained large language model (LLM) enhanced with recommendation-specific prompts. Our framework employs parameter-efficient fine-tuning with extensive fashion data and introduces a novel mix-up-based retrieval technique for translating text into relevant product suggestions. Extensive experiments show our proposed framework significantly enhances fashion recommendation performance.

[NLP-18] Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

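Quick read: This paper addresses the knowledge gap in knowledge distillation (KD). Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over student-generated outputs, while on-policy KD suffers from low-quality student-generated samples that lead to inaccurate teacher feedback. The key to the solution is Speculative Knowledge Distillation (SKD), in which the student and teacher cooperate to generate high-quality training data on the fly that stays aligned with the student's inference-time distribution. Specifically, the student proposes candidate tokens and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively.

Link: https://arxiv.org/abs/2410.11325
Authors: Wenda Xu,Rujun Han,Zifeng Wang,Long T. Le,Dhruv Madeka,Lei Li,William Yang Wang,Rishabh Agarwal,Chen-Yu Lee,Tomas Pfister
Keywords: Recent advances, enabled smaller student, enabled smaller, performance of larger, Speculative Knowledge Distillation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: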

Abstract:Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD, are adversely impacted by the knowledge gaps between teacher-student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student’s inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
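
The interleaved sampling step can be sketched as follows. This is a simplified illustration rather than the paper's implementation: treating "poorly ranked" as "outside the teacher's top-k tokens" is an assumption, and `skd_step` and the toy distributions are hypothetical.

```python
import numpy as np

def skd_step(student_probs, teacher_probs, k, rng):
    """One interleaved sampling step.

    The student proposes a token; if it is not among the teacher's k most
    likely tokens (assumed acceptance rule), the teacher resamples it.
    Returns (token_id, was_replaced).
    """
    proposal = int(rng.choice(len(student_probs), p=student_probs))
    teacher_top_k = np.argsort(-teacher_probs)[:k]
    if proposal in teacher_top_k:
        return proposal, False
    replacement = int(rng.choice(len(teacher_probs), p=teacher_probs))
    return replacement, True

rng = np.random.default_rng(0)
teacher = np.array([0.6, 0.3, 0.1, 0.0])

# A student that always proposes token 3, which the teacher ranks last.
bad_student = np.array([0.0, 0.0, 0.0, 1.0])
# A student that always proposes the teacher's favorite token.
good_student = np.array([1.0, 0.0, 0.0, 0.0])

print(skd_step(bad_student, teacher, k=2, rng=rng))
print(skd_step(good_student, teacher, k=2, rng=rng))
```

Tokens the teacher accepts stay on the student's own distribution, while rejected ones are replaced by teacher samples, so the resulting sequence mixes both sources before being used as distillation data.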

[NLP-19] Self-adaptive Multimodal Retrieval-Augmented Generation

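Quick read: This paper addresses the incomplete or noisy information that traditional retrieval-augmented generation (RAG) methods produce on complex multimodal tasks because they rely on a fixed number of retrieved documents. The key to the solution is Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG), which dynamically filters relevant documents based on the input query, includes image captions when needed, and verifies the quality of both the retrieved documents and the generated output. SAM-RAG surpasses existing state-of-the-art methods in retrieval accuracy and response generation, improving overall performance on multimodal RAG tasks.

Link: https://arxiv.org/abs/2410.11321
Authors: Wenjia Zhai
Keywords: Traditional Retrieval-Augmented Generation, Traditional Retrieval-Augmented, Retrieval-Augmented Generation, fixed number, resulting in incomplete
Categories: Computation and Language (cs.CL)
Comments: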

Abstract:Traditional Retrieval-Augmented Generation (RAG) methods are limited by their reliance on a fixed number of retrieved documents, often resulting in incomplete or noisy information that undermines task performance. Although recent adaptive approaches alleviated these problems, their application in intricate and real-world multimodal tasks remains limited. To address these, we propose a new approach called Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG), tailored specifically for multimodal contexts. SAM-RAG not only dynamically filters relevant documents based on the input query, including image captions when needed, but also verifies the quality of both the retrieved documents and the output. Extensive experimental results show that SAM-RAG surpasses existing state-of-the-art methods in both retrieval accuracy and response generation. By further ablation experiments and effectiveness analysis, SAM-RAG maintains high recall quality while improving overall task performance in multimodal RAG task. Our codes are available at this https URL.

[NLP-20] Herald: A Natural Language Annotated Lean 4 Dataset

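Quick read: This paper addresses the lack of parallel datasets that align natural language with formal-language proofs when training large language models (LLMs) for formal mathematical reasoning. The key to the solution is a novel framework that translates the Mathlib4 corpus (a unified library of mathematics in the formal language Lean 4) into natural language, together with a dual augmentation strategy that combines tactic-based and informal-based approaches and leverages Lean-jixia, a Lean 4 analyzer. This pipeline produces the Herald dataset, and the Herald translator fine-tuned on it formalizes statements far more accurately than InternLM2-Math-Plus-7B and TheoremLlama. The paper also proposes a section-level translation framework for real-world use and applies it to a section of the Stack project, a notable step toward automatically formalizing graduate-level mathematical literature.

Link: https://arxiv.org/abs/2410.10878
Authors: Guoxiong Gao,Yutong Wang,Jiedong Jiang,Qi Gao,Zihan Qin,Tianyi Xu,Bin Dong
Keywords: Verifiable formal languages, impacted mathematical reasoning, Verifiable formal, formal language Lean, automated reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: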

Abstract:Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator achieves a 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stack project, marking a notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, will be open-sourced to the public soon.

[NLP-21] Improving Data Efficiency via Curating LLM-Driven Rating Systems

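Quick read: This paper addresses how a small, high-quality dataset can outperform a large one when adapting large language models (LLMs) to downstream tasks. The key to the solution is DS2 (Diversity-aware Score curation method for Data Selection), which systematically models error patterns with a score transition matrix to correct LLM-generated scores and promotes diversity among the selected samples. DS2 selects a subset amounting to just 3.3% of the original dataset that outperforms the full-scale dataset on several machine-alignment benchmarks and matches or surpasses human-aligned datasets of the same size, such as LIMA. This challenges conventional data-scaling assumptions and shows that redundant, low-quality samples can degrade model performance.

Link: https://arxiv.org/abs/2410.10877
Authors: Jinlong Pang,Jiaheng Wei,Ankit Parag Shah,Zhaowei Zhu,Yaxuan Wang,Chen Qian,Yang Liu,Yujia Bao,Wei Wei
Keywords: adapting large language, Instruction tuning, large language models, challenging traditional data, downstream tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: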

Abstract:Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that “more can be less.”
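
How a score transition matrix can be used to correct noisy ratings is sketched below. This is a generic Bayes-rule illustration, not the DS2 algorithm itself: the abstract does not describe how the matrix is estimated or applied, so the function name, the matrix values, and the prior are all assumptions.

```python
import numpy as np

def correct_scores(observed, transition, prior):
    """Map each observed score to its most probable true score.

    transition[i, j] = P(rater outputs score j | true score is i)
    prior[i]         = P(true score is i)
    """
    joint = prior[:, None] * transition                 # P(true=i, observed=j)
    posterior = joint / joint.sum(axis=0, keepdims=True)  # P(true=i | observed=j)
    return posterior[:, np.asarray(observed)].argmax(axis=0)

# Toy two-level scale (0 = low quality, 1 = high quality). The hypothetical
# rater under-scores: it labels 40% of truly high-quality samples as low.
transition = np.array([[0.9, 0.1],
                       [0.4, 0.6]])
prior = np.array([0.2, 0.8])

corrected = correct_scores([0, 1, 0], transition, prior)
print(corrected)
```

With these toy numbers, an observed low score is still more likely to come from a high-quality sample, so the correction raises it. A diversity-aware selection step would then choose samples using the corrected scores.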

[NLP-22] FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text

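Quick read: This paper addresses the misuse of text generated by large language models (LLMs) for unethical purposes such as disinformation or academic dishonesty. The key to the solution is FreqMark, a novel watermarking technique that embeds a detectable frequency-based watermark during token sampling. The method uses periodic signals to guide token selection and detects the watermark with Short-Time Fourier Transform (STFT) analysis, which allows LLM-generated content to be identified accurately even in mixed human and LLM text. Experiments show that FreqMark is robust and precise under attacks such as paraphrasing and token substitution, and that it significantly outperforms existing detection methods.

Link: https://arxiv.org/abs/2410.10876
Authors: Zhenyu Xu,Kun Zhang,Victor S. Sheng
Keywords: Large Language Models, Language Models, Large Language, generating highly coherent, contextually relevant text
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: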

Abstract:The increasing use of Large Language Models (LLMs) for generating highly coherent and contextually relevant text introduces new risks, including misuse for unethical purposes such as disinformation or academic dishonesty. To address these challenges, we propose FreqMark, a novel watermarking technique that embeds detectable frequency-based watermarks in LLM-generated text during the token sampling process. The method leverages periodic signals to guide token selection, creating a watermark that can be detected with Short-Time Fourier Transform (STFT) analysis. This approach enables accurate identification of LLM-generated content, even in mixed-text scenarios with both human-authored and LLM-generated segments. Our experiments demonstrate the robustness and precision of FreqMark, showing strong detection capabilities against various attack scenarios such as paraphrasing and token substitution. Results show that FreqMark achieves an AUC improvement of up to 0.98, significantly outperforming existing detection methods.
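
The idea of a periodic signal that steers token selection and is later recovered with a short-time Fourier analysis can be illustrated with a toy sketch. None of this is the FreqMark implementation: splitting the vocabulary into two halves, the sinusoidal bias, the window sizes, and the function names are assumptions chosen to keep the example small, and the "tokens" are random ids rather than LLM output.

```python
import numpy as np

def embed_periodic_watermark(n_tokens, vocab, period, strength, rng):
    """Sample token ids whose chance of falling in the upper half of the
    vocabulary oscillates with the given period (toy stand-in for sampling)."""
    half = vocab // 2
    t = np.arange(n_tokens)
    p_upper = 0.5 + 0.5 * strength * np.sin(2 * np.pi * t / period)
    upper = rng.random(n_tokens) < p_upper
    return rng.integers(0, half, n_tokens) + upper * half

def stft_peak_scores(tokens, vocab, period, window=64, hop=32):
    """Per-window ratio of spectral magnitude at the watermark frequency
    to the mean magnitude over all non-DC frequency bins."""
    signal = np.where(np.asarray(tokens) >= vocab // 2, 1.0, -1.0)
    target_bin = round(window / period)
    scores = []
    for start in range(0, len(signal) - window + 1, hop):
        segment = signal[start:start + window] * np.hanning(window)
        spectrum = np.abs(np.fft.rfft(segment))
        scores.append(spectrum[target_bin] / spectrum[1:].mean())
    return np.array(scores)

rng = np.random.default_rng(0)
marked = embed_periodic_watermark(512, vocab=1000, period=8, strength=1.0, rng=rng)
plain = rng.integers(0, 1000, 512)

marked_scores = stft_peak_scores(marked, vocab=1000, period=8)
plain_scores = stft_peak_scores(plain, vocab=1000, period=8)
print(marked_scores.mean(), plain_scores.mean())
```

Because the scores are computed per window, a detector of this kind can in principle localize watermarked segments inside a longer mixed document, which is the sentence-level use case the abstract describes.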

[NLP-23] Optimizing Transformer based on high-performance optimizer for predicting employment sentiment in American social media content

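Quick read: This paper improves the Transformer model with a swarm intelligence optimization algorithm in order to predict the sentiment of employment-related text on American social media. The approach converts text into numerical data through preprocessing, feature extraction, and vectorization, then feeds it to the model for training. During training, accuracy rose from 49.27% to 82.83% and the loss fell from 0.67 to 0.35. Accuracy is 86.15% on the training set and 82.91% on the test set, which indicates good generalization, and the model also performs well on classification accuracy, sensitivity, specificity, and AUC, supporting its effectiveness for social media sentiment analysis.

Link: https://arxiv.org/abs/2410.10874
Authors: Feiyang Wang,Qiaozhi Bao,Zixuan Wang,Yanlin Chen
Keywords: intelligence optimization algorithm, swarm intelligence optimization, Transformer model based, American social media, content on American
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures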

Abstract:This article improves the Transformer model based on swarm intelligence optimization algorithm, aiming to predict the emotions of employment related text content on American social media. Through text preprocessing, feature extraction, and vectorization, the text data was successfully converted into numerical data and imported into the model for training. The experimental results show that during the training process, the accuracy of the model gradually increased from 49.27% to 82.83%, while the loss value decreased from 0.67 to 0.35, indicating a significant improvement in the performance of the model on the training set. According to the confusion matrix analysis of the training set, the accuracy of the training set is 86.15%. The confusion matrix of the test set also showed good performance, with an accuracy of 82.91%. The accuracy difference between the training set and the test set is only 3.24%, indicating that the model has strong generalization ability. In addition, the evaluation of polygon results shows that the model performs well in classification accuracy, sensitivity, specificity, and area under the curve (AUC), with a Kappa coefficient of 0.66 and an F-measure of 0.80, further verifying the effectiveness of the model in social media sentiment analysis. The improved model proposed in this article not only improves the accuracy of sentiment recognition in employment related texts on social media, but also has important practical significance. This social media based data analysis method can not only capture social dynamics in a timely manner, but also promote decision-makers to pay attention to public concerns and provide data support for improving employment conditions.

[NLP-24] AuditWen: An Open-Source Large Language Model for Audit

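Quick read: This paper addresses the lack of specialized knowledge and the data biases that general-purpose large language models (LLMs) face when applied to auditing. The key to the solution is AuditWen, an audit-specific LLM obtained by fine-tuning Qwen on a purpose-built audit instruction dataset of 28,000 instructions. The authors derive application scenarios and requirements for audit tasks, design an instruction dataset covering 15 audit tasks across 3 layers, and evaluate on a benchmark of 3,000 instructions, where AuditWen shows superior performance in information extraction, question answering, and document generation.

Link: https://arxiv.org/abs/2410.10873
Authors: Jiajia Huang,Haoran Zhu,Chao Xu,Tianming Zhan,Qianqian Xie,Jimin Huang
Keywords: Intelligent auditing represents, modern audit practices, audit, Intelligent auditing, artificial intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages, 1 figure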

Abstract:Intelligent auditing represents a crucial advancement in modern audit practices, enhancing both the quality and efficiency of audits within the realm of artificial intelligence. With the rise of large language models (LLMs), there is enormous potential for intelligent models to contribute to the audit domain. However, general LLMs applied in the audit domain face the challenges of lacking specialized knowledge and the presence of data biases. To overcome these challenges, this study introduces AuditWen, an open-source audit LLM built by fine-tuning Qwen on instruction data constructed from the audit domain. We first outline the application scenarios for LLMs in auditing and extract requirements that shape the development of LLMs tailored for audit purposes. We then propose an audit LLM, called AuditWen, by fine-tuning Qwen on a 28k instruction dataset constructed from 15 audit tasks and 3 layers. In the evaluation stage, we propose a benchmark with 3k instructions that covers a set of critical audit tasks derived from the application scenarios. With the benchmark, we compare AuditWen with other existing LLMs on information extraction, question answering, and document generation. The experimental results demonstrate superior performance of AuditWen in both question understanding and answer generation, making it an immediately valuable tool for audit.

[NLP-25] ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities

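Quick read: This paper addresses the limited transparency around the datasets and data collection methods behind large language models (LLMs) that integrate external tools. The key to the solution is ToolBridge, which uses a collection of general open-access datasets as its raw data pool, applies a series of strategies to select entries suitable for inserting external tool API calls, and then uses supervised fine-tuning so that LLMs invoke external tools in appropriate contexts and improve their predictive accuracy. Making this process transparent is intended to advance external tool integration in LLMs and to encourage further exploration by the community.

Link: https://arxiv.org/abs/2410.10872
Authors: Zhenchao Jin,Mengchen Liu,Dongdong Chen,Lingting Zhu,Yunsheng Li,Lequan Yu
Keywords: elementary conversational agents, large language models, external tools, large language, significantly expand
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: technical report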

Abstract:Through the integration of external tools, large language models (LLMs) such as GPT-4o and Llama 3.1 significantly expand their functional capabilities, evolving from elementary conversational agents to general-purpose assistants. We argue that the primary drivers of these advancements are the quality and diversity of the training data. However, the existing LLMs with external tool integration provide only limited transparency regarding their datasets and data collection methods, which has led to the initiation of this research. Specifically, in this paper, our objective is to elucidate the detailed process involved in constructing datasets that empower LLMs to effectively learn how to utilize external tools and make this information available to the public through the introduction of ToolBridge. ToolBridge proposes to employ a collection of general open-access datasets as its raw dataset pool and applies a series of strategies to identify appropriate data entries from the pool for external tool API insertions. By supervised fine-tuning on these curated data entries, LLMs can invoke external tools in appropriate contexts to boost their predictive accuracy, particularly for basic functions including data processing, numerical computation, and factual retrieval. Our experiments rigorously isolate model architectures and training configurations, focusing exclusively on the role of data. The experimental results indicate that LLMs trained on ToolBridge demonstrate consistent performance improvements on both standard benchmarks and custom evaluation datasets. All the associated code and data will be open-source at this https URL, promoting transparency and facilitating the broader community to explore approaches for equipping LLMs with external tools capabilities.

[NLP-26] Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

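Quick read: This paper examines safety weaknesses of language models that act as agents. The authors apply refusal-vector ablation to Llama 3.1 70B and add a simple agent scaffolding to create an unrestricted agent, exposing the shortcomings of existing safety mechanisms on harmful tasks. The results indicate that refusal-vector-ablated models can complete harmful tasks such as bribing officials or crafting phishing attacks, which highlights the limits of current safety fine-tuning for agentic behavior and the need for better safety frameworks for language model agents.

Link: https://arxiv.org/abs/2410.10871
Authors: Simon Lermen,Mateusz Dziemian,Govind Pimpale
Keywords: requiring short-term planning, tasks requiring short-term, requiring short-term, short-term planning, planning and tool
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: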

Abstract:Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the same time, these models will refuse to give advice on how to perform the same tasks when asked for a chat completion. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.

[NLP-27] PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches

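【Quick Read】: This paper addresses the difficulty that downstream users with limited resources face in repeatedly fine-tuning the newest models for domain-specific tasks as large language models (LLMs) keep evolving. The key to the solution is PortLLM, a training-free framework that creates a lightweight model update patch to capture domain-specific knowledge and lets that patch be plugged into the evolved model for continual personalization at minimal cost. Experiments across multiple datasets and models validate the framework: it cuts GPU memory usage by up to 12.2x while achieving performance comparable to LoRA fine-tuning.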

链接: https://arxiv.org/abs/2410.10870
作者: Rana Muhammad Shahroz Khan,Pingzhi Li,Sukwon Yun,Zhenyu Wang,Shahriar Nirjon,Chau-Wai Wong,Tianlong Chen
关键词-EN: large language models, achieving optimal performance, increasingly shape, large language, pre-LLM era
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) increasingly shape the AI landscape, fine-tuning pretrained models has become more popular than in the pre-LLM era for achieving optimal performance in domain-specific tasks. However, pretrained LLMs such as ChatGPT are periodically evolved (i.e., model parameters are frequently updated), making it challenging for downstream users with limited resources to keep up with fine-tuning the newest LLMs for their domain application. Even though fine-tuning costs have nowadays been reduced thanks to the innovations of parameter-efficient fine-tuning such as LoRA, not all downstream users have adequate computing for frequent personalization. Moreover, access to fine-tuning datasets, particularly in sensitive domains such as healthcare, could be time-restrictive, making it crucial to retain the knowledge encoded in earlier fine-tuned rounds for future adaptation. In this paper, we present PortLLM, a training-free framework that (i) creates an initial lightweight model update patch to capture domain-specific knowledge, and (ii) allows a subsequent seamless plugging for the continual personalization of evolved LLM at minimal cost. Our extensive experiments cover seven representative datasets, from easier question-answering tasks BoolQ, SST2 to harder reasoning tasks WinoGrande, GSM8K, and models including Mistral-7B, Llama2, Llama3.1, and Gemma2, validating the portability of our designed model patches and showcasing the effectiveness of our proposed framework. For instance, PortLLM achieves comparable performance to LoRA fine-tuning with reductions of up to 12.2x in GPU memory usage. Finally, we provide theoretical justifications to understand the portability of our model update patches, which offers new insights into the theoretical dimension of LLMs’ personalization.
摘要:随着大语言模型 (LLM) 在人工智能领域的日益普及,微调预训练模型在特定领域任务中实现最佳性能方面变得比 LLM 时代之前更加流行。然而,像 ChatGPT 这样的预训练 LLM 会定期更新(即模型参数频繁更新),这使得资源有限的下游用户难以跟上最新 LLM 的微调步伐以适应其领域应用。尽管由于 LoRA 等参数高效微调技术的创新,微调成本现已降低,但并非所有下游用户都具备频繁个性化所需的充足计算资源。此外,访问微调数据集,特别是在医疗等敏感领域,可能会受到时间限制,因此保留早期微调轮次中编码的知识以供未来适应变得至关重要。本文提出了 PortLLM,这是一个无需训练的框架,(i) 创建一个初始的轻量级模型更新补丁以捕捉领域特定知识,(ii) 允许以最低成本无缝插入以持续个性化更新后的 LLM。我们的广泛实验涵盖了七个代表性数据集,从较简单的问答任务 BoolQ、SST2 到较难的推理任务 WinoGrande、GSM8K,以及包括 Mistral-7B、Llama2、Llama3.1 和 Gemma2 在内的模型,验证了我们设计的模型补丁的可移植性,并展示了我们提出的框架的有效性。例如,PortLLM 在 GPU 内存使用量减少高达 12.2 倍的情况下,实现了与 LoRA 微调相当的性能。最后,我们提供了理论依据来理解我们模型更新补丁的可移植性,这为 LLM 个性化理论维度提供了新的见解。
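The idea of a portable model patch can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how the patch is represented, so the sketch simply treats the patch as the weight difference between a fine-tuned model and its base, and re-applies that difference to an evolved base model without further training. The function names `make_patch` and `apply_patch` are hypothetical.

```python
def make_patch(base_weights, finetuned_weights):
    """Hypothetical patch: element-wise difference (fine-tuned minus base)."""
    return [
        [f - b for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(finetuned_weights, base_weights)
    ]


def apply_patch(weights, patch):
    """Add a stored patch to a weight matrix; no gradient steps are involved."""
    return [
        [w + d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights, patch)
    ]


# Toy 2x2 "models" (illustrative numbers only).
base = [[1.0, 2.0], [3.0, 4.0]]
finetuned = [[1.5, 2.0], [3.0, 3.0]]
evolved = [[2.0, 2.0], [2.0, 2.0]]  # an updated release of the base model

patch = make_patch(base, finetuned)
personalized = apply_patch(evolved, patch)
```

In this toy setting the patch is as large as the weight matrix itself; a lightweight patch would need a compact representation (for example a low-rank one), which this sketch does not attempt.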

[NLP-28] Application of NotebookLM a Large Language Model with Retrieval-Augmented Generation for Lung Cancer Staging

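【Quick Read】: This paper addresses the reliability of large language models (LLMs) in clinical applications, which is limited by hallucinations and insufficient referencing. The key to the solution is retrieval-augmented generation (RAG), which lets an LLM reference reliable external knowledge (REK). Specifically, the study uses a recently released RAG-equipped LLM, NotebookLM, to stage lung cancer and evaluates its utility and reliability when given the REK. NotebookLM reached 86% diagnostic accuracy in the staging experiment, outperforming GPT-4 Omni (GPT-4o), which scored 39% with the REK and 25% without it. NotebookLM also located references within the REK with 95% accuracy, which helps radiologists efficiently assess the reliability of its output and detect possible hallucinations.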

链接: https://arxiv.org/abs/2410.10869
作者: Ryota Tozuka,Hisashi Johno,Akitomo Amakawa,Junichi Sato,Mizuki Muto,Shoichiro Seki,Atsushi Komaba,Hiroshi Onishi
关键词-EN: large language models, recently gained attention, lung cancer, lung cancer staging, including ChatGPT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 5 figures, 1 table, 3 ancillary files

点击查看摘要

Abstract:Purpose: In radiology, large language models (LLMs), including ChatGPT, have recently gained attention, and their utility is being rapidly evaluated. However, concerns have emerged regarding their reliability in clinical applications due to limitations such as hallucinations and insufficient referencing. To address these issues, we focus on the latest technology, retrieval-augmented generation (RAG), which enables LLMs to reference reliable external knowledge (REK). Specifically, this study examines the utility and reliability of a recently released RAG-equipped LLM (RAG-LLM), NotebookLM, for staging lung cancer. Materials and methods: We summarized the current lung cancer staging guideline in Japan and provided this as REK to NotebookLM. We then tasked NotebookLM with staging 100 fictional lung cancer cases based on CT findings and evaluated its accuracy. For comparison, we performed the same task using a gold-standard LLM, GPT-4 Omni (GPT-4o), both with and without the REK. Results: NotebookLM achieved 86% diagnostic accuracy in the lung cancer staging experiment, outperforming GPT-4o, which recorded 39% accuracy with the REK and 25% without it. Moreover, NotebookLM demonstrated 95% accuracy in searching reference locations within the REK. Conclusion: NotebookLM successfully performed lung cancer staging by utilizing the REK, demonstrating superior performance compared to GPT-4o. Additionally, it provided highly accurate reference locations within the REK, allowing radiologists to efficiently evaluate the reliability of NotebookLM’s responses and detect possible hallucinations. Overall, this study highlights the potential of NotebookLM, a RAG-LLM, in image diagnosis. 
摘要:
目的:在放射学领域,包括 ChatGPT 在内的大语言模型 (LLM) 近期引起了广泛关注,其应用价值正在迅速评估中。然而,由于幻觉和引用不足等局限性,其在临床应用中的可靠性引发了担忧。为解决这些问题,我们聚焦于最新的技术——检索增强生成 (RAG),该技术使 LLM 能够引用可靠的外部知识 (REK)。具体而言,本研究考察了最近发布的配备 RAG 的 LLM (RAG-LLM),即 NotebookLM,在肺癌分期中的实用性和可靠性。材料与方法:我们总结了当前日本肺癌分期指南,并将其作为 REK 提供给 NotebookLM。随后,我们要求 NotebookLM 根据 CT 影像结果对 100 个虚构的肺癌病例进行分期,并评估其准确性。作为对比,我们使用黄金标准 LLM,即 GPT-4 Omni (GPT-4o),在有无 REK 的情况下执行相同任务。结果:NotebookLM 在肺癌分期实验中达到了 86% 的诊断准确率,优于 GPT-4o,后者在有 REK 的情况下记录了 39% 的准确率,无 REK 时为 25%。此外,NotebookLM 在 REK 中搜索参考位置的准确率达到了 95%。结论:NotebookLM 通过利用 REK 成功进行了肺癌分期,表现优于 GPT-4o。此外,它提供了在 REK 中高度准确的参考位置,使放射科医生能够高效评估 NotebookLM 响应的可靠性,并检测可能的幻觉。总体而言,本研究突显了 NotebookLM 这一 RAG-LLM 在影像诊断中的潜力。

[NLP-29] LLaCA: Multimodal Large Language Continual Assistant

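【Quick Read】: This paper addresses the forgetting and performance degradation that multimodal large language models (MLLMs) suffer during continual instruction tuning. The key to the solution is a method called Multimodal Large Language Continual Assistant (LLaCA), which replaces the fixed balance weight of the exponential moving average (EMA) update policy with one that adapts to gradient information and previous parameters, improving plasticity while preserving stability. Concretely, LLaCA derives the optimal balance weight from a Taylor expansion of the loss function and determines it automatically, which markedly improves both resistance to forgetting and continual tuning performance.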

链接: https://arxiv.org/abs/2410.10868
作者: Jingyang Qiao,Zhizhong Zhang,Xin Tan,Yanyun Qu,Shouhong Ding,Yuan Xie
关键词-EN: Large Language Models, Multimodal Large Language, Continual Instruction Tuning, designing text instructions, Instruction tuning guides
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instruction tuning guides the Multimodal Large Language Models (MLLMs) in aligning different modalities by designing text instructions, which seems to be an essential technique to enhance the capabilities and controllability of foundation models. In this framework, Multimodal Continual Instruction Tuning (MCIT) is adopted to continually instruct MLLMs to follow human intent in sequential datasets. We observe existing gradient update would heavily destroy the tuning performance on previous datasets and the zero-shot ability during continual instruction tuning. Exponential Moving Average (EMA) update policy owns the ability to trace previous parameters, which can aid in decreasing forgetting. However, its stable balance weight cannot deal with the ever-changing datasets, leading to the out-of-balance between plasticity and stability of MLLMs. In this paper, we propose a method called Multimodal Large Language Continual Assistant (LLaCA) to address the challenge. Starting from the trade-off prerequisite and EMA update, we propose the plasticity and stability ideal condition. Based on Taylor expansion in the loss function, we find the optimal balance weight is basically according to the gradient information and previous parameters. We automatically determine the balance weight and significantly improve the performance. Through comprehensive experiments on LLaVA-1.5 in a continual visual-question-answering benchmark, compared with baseline, our approach not only highly improves anti-forgetting ability (with reducing forgetting from 22.67 to 2.68), but also significantly promotes continual tuning performance (with increasing average accuracy from 41.31 to 61.89). Our code will be published soon.
摘要:指令调优通过设计文本指令来引导多模态大语言模型 (MLLMs) 对齐不同模态,这似乎是增强基础模型能力和可控性的关键技术。在此框架中,采用多模态持续指令调优 (MCIT) 来持续指导 MLLMs 在顺序数据集中遵循人类意图。我们观察到,现有的梯度更新会严重破坏在先前数据集上的调优性能和持续指令调优过程中的零样本能力。指数移动平均 (EMA) 更新策略具有追踪先前参数的能力,有助于减少遗忘。然而,其稳定的平衡权重无法应对不断变化的数据集,导致 MLLMs 的塑性和稳定性之间失衡。本文提出了一种名为多模态大语言持续助手 (LLaCA) 的方法来应对这一挑战。从权衡前提和 EMA 更新出发,我们提出了塑性和稳定性的理想条件。基于损失函数中的泰勒展开,我们发现最优平衡权重基本上根据梯度信息和先前参数确定。我们自动确定平衡权重并显著提升性能。通过在 LLaVA-1.5 上的全面实验,在持续视觉问答基准测试中,与基线相比,我们的方法不仅大幅提高了抗遗忘能力 (遗忘率从 22.67 降至 2.68),还显著提升了持续调优性能 (平均准确率从 41.31 提高到 61.89)。我们的代码将很快发布。
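The stability/plasticity trade-off behind an EMA update can be seen in a few lines. This is a minimal sketch of a plain EMA update with a fixed balance weight `beta`, the baseline the abstract says cannot cope with changing datasets. It does not reproduce the paper's adaptive weight, and the function name `ema_update` is hypothetical.

```python
def ema_update(ema_params, new_params, beta):
    """Blend previously averaged parameters with freshly updated ones.

    beta close to 1 favors stability (old parameters dominate);
    beta close to 0 favors plasticity (new parameters dominate).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * old + (1.0 - beta) * new
            for old, new in zip(ema_params, new_params)]


ema = [1.0, 2.0]       # parameters tracked so far
updated = [3.0, 0.0]   # parameters after a gradient step on a new dataset

stable = ema_update(ema, updated, beta=0.9)   # stays near the old values
plastic = ema_update(ema, updated, beta=0.1)  # moves near the new values
```

With a single fixed `beta`, one of the two behaviors has to be chosen up front, which is the limitation an adaptive balance weight is meant to remove.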

[NLP-30] Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics

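【Quick Read】: This paper addresses the fact that most standard summarization metrics are reference-based, while existing reference-free metrics correlate poorly with human-judged relevance, especially on summaries of longer documents. The key to the solution is a new reference-free metric that is cheap to compute, correlates well with human-evaluated relevance, and can be combined with reference-based metrics to make them more robust when references are of low quality.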

链接: https://arxiv.org/abs/2410.10867
作者: Théo Gigant(L2S),Camille Guinaudeau(STL, LISN),Marc Decombas,Frédéric Dufaux(L2S)
关键词-EN: evaluate abstractive summarization, abstractive summarization systems, Automatic metrics, proxies to evaluate, evaluate abstractive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Automatic metrics are used as proxies to evaluate abstractive summarization systems when human annotations are too expensive. To be useful, these metrics should be fine-grained, show a high correlation with human annotations, and ideally be independent of reference quality; however, most standard evaluation metrics for summarization are reference-based, and existing reference-free metrics correlate poorly with relevance, especially on summaries of longer documents. In this paper, we introduce a reference-free metric that correlates well with human evaluated relevance, while being very cheap to compute. We show that this metric can also be used alongside reference-based metrics to improve their robustness in low quality reference settings.
摘要:在人工标注成本过高时,自动评估指标被用作生成式摘要系统的替代评估手段。为了有效,这些指标应具备细粒度、与人工标注高度相关且理想情况下独立于参考质量的特点;然而,大多数标准的摘要评估指标都是基于参考的,而现有的无参考指标与相关性的相关性较差,尤其是在长文档摘要中。本文介绍了一种无参考的评估指标,该指标与人工评估的相关性良好,同时计算成本极低。我们证明,该指标还可以与基于参考的指标结合使用,以提高其在低质量参考环境下的鲁棒性。

[NLP-31] CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept

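【Quick Read】: This paper addresses the risk that large language models (LLMs) inadvertently memorize sensitive, unauthorized, or malicious data during training, such as personal information in the medical and financial sectors. The key to the solution is a novel amortized unlearning approach built on codebook features and sparse autoencoders (SAEs). By introducing a bottleneck that decomposes the activation space and regulates information flow, the method efficiently unlearns targeted information while preserving performance on unrelated data. According to the authors, it is the first work to successfully unlearn specific topics with contextual relevance in an LLM.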

链接: https://arxiv.org/abs/2410.10866
作者: YuXuan Wu,Bonaventure F. P. Dossou,Dianbo Liu
关键词-EN: Large Language Models, Large Language, inadvertently memorize sensitive, offer extensive knowledge, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer extensive knowledge across various domains, but they may inadvertently memorize sensitive, unauthorized, or malicious data, such as personal information in the medical and financial sectors. Machine unlearning methods aim to remove specific information from models after training to address this. However, current approaches require additional model training or struggle to effectively erase particular data points and their associated context due to LLMs’ complex, dense, and continuous nature. In this study, we propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs). By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model’s performance on unrelated data. To the best of our knowledge, this is the first work that successfully enables unlearning specific topics with contextual relevance in an LLM, marking a significant step towards real-world applications of machine unlearning.
摘要:大语言模型 (LLMs) 在各个领域提供了广泛的知识,但它们可能会无意中记忆敏感、未经授权或恶意的数据,例如医疗和金融领域的个人信息。机器遗忘方法旨在在训练后从模型中移除特定信息以解决这一问题。然而,当前的方法需要额外的模型训练,或者由于大语言模型的复杂、密集和连续性而难以有效擦除特定数据点及其相关上下文。在本研究中,我们提出了一种使用码本特征和稀疏自编码器 (SAEs) 的新型摊销遗忘方法。通过利用瓶颈来分解激活空间并调节信息流,我们的方法在保留模型对无关数据的性能的同时,有效地遗忘了目标信息。据我们所知,这是首次成功实现在大语言模型中遗忘具有上下文相关性的特定主题的工作,标志着机器遗忘技术在实际应用中迈出了重要的一步。

[NLP-32] Generating Synthetic Datasets for Few-shot Prompt Tuning

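【Quick Read】: This paper addresses prompt tuning's dependence on large labeled training datasets, which leaves it far behind full-model fine-tuning in few-shot learning settings. The key to the solution is to use LLMs to synthesize task-specific labeled data, with a distribution-aligned weighted generator tuning (DawGen) method encouraging generated data to align with the real few-shot data. Soft prompts are then trained on both synthetic and real data with a gradient surgery approach that eliminates conflicting gradients between the two data sources, improving prompt tuning in few-shot settings.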

链接: https://arxiv.org/abs/2410.10865
作者: Xu Guo,Zilin Du,Boyang Li,Chunyan Miao
关键词-EN: major limitation, prompt tuning, tuning, prompt, few-shot learning settings
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A major limitation of prompt tuning is its dependence on large labeled training datasets. Under few-shot learning settings, prompt tuning lags far behind full-model fine-tuning, limiting its scope of application. In this paper, we leverage the powerful LLMs to synthesize task-specific labeled data for training the soft prompts. We first introduce a distribution-aligned weighted generator tuning (DawGen) method to encourage generating in-distribution data that aligns with the few-shot real data. Then, we train soft prompts on both synthetic and real datasets using a gradient surgery approach, which eliminates the conflicting gradients from different data sources. Experiments on seven sentence-pair classification datasets demonstrate the effectiveness of our proposed method for boosting prompt tuning in few-shot learning settings. Results on QQP, MRPC, and SICK datasets are even comparable to the performance of transfer learning from large real-world datasets, showing the promise of synthetic data as an alternative for enhancing soft prompt tuning.
摘要:提示调优的一个主要限制是其依赖于大规模的标注训练数据集。在少样本学习设置下,提示调优远远落后于全模型微调,限制了其应用范围。本文中,我们利用强大的大语言模型 (LLM) 来合成任务特定的标注数据,用于训练软提示。我们首先引入了一种分布对齐加权生成器调优 (DawGen) 方法,以鼓励生成与少样本真实数据对齐的分布内数据。然后,我们使用梯度手术方法在合成数据集和真实数据集上训练软提示,该方法消除了来自不同数据源的冲突梯度。在七个句子对分类数据集上的实验证明了我们提出的方法在少样本学习设置下提升提示调优的有效性。在 QQP、MRPC 和 SICK 数据集上的结果甚至可与从大规模真实世界数据集进行迁移学习的表现相媲美,展示了合成数据作为增强软提示调优的替代方案的潜力。
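Gradient surgery can be sketched as a projection step. The code below shows one common form (a PCGrad-style projection): when two gradients point in conflicting directions (negative dot product), the conflicting component is removed. Whether the paper uses exactly this rule is an assumption; the names `project_conflict`, `g_real` and `g_syn` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflict(grad, other):
    """Remove from `grad` the component that opposes `other`.

    If the gradients do not conflict (dot product >= 0), or `other`
    is the zero vector, `grad` is returned unchanged.
    """
    inner = dot(grad, other)
    norm_sq = dot(other, other)
    if inner >= 0.0 or norm_sq == 0.0:
        return list(grad)
    scale = inner / norm_sq
    return [g - scale * o for g, o in zip(grad, other)]


g_real = [1.0, 0.0]   # gradient from real few-shot data (toy values)
g_syn = [-1.0, 1.0]   # gradient from synthetic data, conflicting with g_real

g_syn_fixed = project_conflict(g_syn, g_real)
```

After projection the synthetic-data gradient is orthogonal to the real-data gradient, so following it no longer undoes progress on the real data.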

[NLP-33] Fill In The Gaps: Model Calibration and Generalization with Synthetic Data EMNLP2024

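【Quick Read】: This paper addresses the fact that most existing calibration methods hurt model accuracy because validation data lacks diversity, which reduces generalizability. The key to the solution is a calibration method that incorporates synthetic data without compromising accuracy: large language models (LLMs) generate synthetic data with mixed class labels, which lowers the bound on the expected calibration error (ECE) and improves accuracy on real test data. Across four natural language processing tasks, the method yields on average up to a 34% increase in accuracy and a 33% decrease in ECE.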

链接: https://arxiv.org/abs/2410.10864
作者: Yang Ba,Michelle V. Mancenido,Rong Pan
关键词-EN: major concern prior, swiftly advance, calibrating their performance, widespread implementation, continue to swiftly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024 Main Conference (Long paper)

点击查看摘要

Abstract:As machine learning models continue to swiftly advance, calibrating their performance has become a major concern prior to practical and widespread implementation. Most existing calibration methods often negatively impact model accuracy due to the lack of diversity of validation data, resulting in reduced generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, are utilized as a synthetic data generation strategy to lower the ECE bound and improve model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on four different natural language processing tasks, we observed an average up to 34% increase in accuracy and 33% decrease in ECE.
摘要:随着机器学习模型不断迅速发展,在实际广泛应用之前,对其性能进行校准已成为一个主要关注点。大多数现有的校准方法由于验证数据缺乏多样性,往往会对模型精度产生负面影响,导致泛化能力下降。为解决这一问题,我们提出了一种在不损害精度的前提下结合合成数据的校准方法。我们利用 Probably Approximately Correct (PAC) 学习框架推导了预期校准误差 (ECE) 的界限。大语言模型 (LLMs) 以其模仿真实数据和生成混合类别标签文本的能力而闻名,被用作降低 ECE 界限并提高模型在真实测试数据上精度的合成数据生成策略。此外,我们还提出了高效校准的数据生成机制。在四个不同的自然语言处理任务上测试我们的方法,我们观察到平均精度提高了高达 34%,ECE 降低了 33%。
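Because the results are reported in terms of ECE, a short sketch of how the metric is usually computed may help. This uses the standard equal-width binning estimator; the bin count and the toy inputs are assumptions, and the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    Sums, over bins, (bin size / N) * |bin accuracy - bin mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# An overconfident toy model: 90% confidence, but only half the answers are right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
# A perfectly calibrated toy case: full confidence and always correct.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value means predicted confidence tracks observed accuracy more closely; the overconfident example yields a gap of 0.4, the calibrated one 0.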

[NLP-34] What makes your model a low-empathy or warmth person: Exploring the Origins of Personality in LLMs

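【Quick Read】: This paper investigates the mechanisms by which LLMs encode and express human-like traits such as agreeableness and impulsiveness. Drawing on the theory of social determinism, it studies how long-term background factors (such as family environment and cultural norms) interact with short-term pressures (such as external instructions). The key to the approach is steering LLM outputs through interpretable features within the model, which shows how these factors change the model's traits without further fine-tuning. The paper also discusses the potential impact of these factors on model safety.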

链接: https://arxiv.org/abs/2410.10863
作者: Shu Yang,Shenzhe Zhu,Ruoxuan Bao,Liang Liu,Yu Cheng,Lijie Hu,Mengdi Li,Di Wang
关键词-EN: Large language models, demonstrated remarkable capabilities, generating human-like text, Large language, exhibiting personality traits
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in generating human-like text and exhibiting personality traits similar to those in humans. However, the mechanisms by which LLMs encode and express traits such as agreeableness and impulsiveness remain poorly understood. Drawing on the theory of social determinism, we investigate how long-term background factors, such as family environment and cultural norms, interact with short-term pressures like external instructions, shaping and influencing LLMs’ personality traits. By steering the output of LLMs through the utilization of interpretable features within the model, we explore how these background and pressure factors lead to changes in the model’s traits without the need for further fine-tuning. Additionally, we suggest the potential impact of these factors on model safety from the perspective of personality.
摘要:大语言模型 (LLMs) 在生成类人文本和展现与人类相似的人格特质方面展示了显著的能力。然而,LLMs 如何编码和表达诸如亲和力和冲动性等人格特质的具体机制仍未被充分理解。基于社会决定论的理论,我们研究了长期背景因素,如家庭环境和文化规范,如何与短期压力,如外部指令,相互作用,从而塑造和影响 LLMs 的人格特质。通过利用模型中可解释的特征来引导 LLMs 的输出,我们探讨了这些背景和压力因素如何在无需进一步微调的情况下导致模型特质的变化。此外,我们从人格的角度提出了这些因素对模型安全性的潜在影响。

[NLP-35] Superficial Safety Alignment Hypothesis

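【Quick Read】: This paper addresses the challenge of making large language models (LLMs) generate safe and aligned responses, in particular the brittleness of safety mechanisms. The key to the solution is the Superficial Safety Alignment Hypothesis (SSAH), which holds that safety alignment should teach a model to choose the correct reasoning direction and should incorporate a refusal mechanism with multiple reserved fallback options. Building on SSAH, the paper hypothesizes that safety guardrails can be established with only a small number of essential components, and an ablation study identifies four types of attribute-critical components: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). The findings show that freezing certain safety-critical components (7.5%) during fine-tuning preserves the model's safety attributes, and that using the redundant units (20%) in the pre-trained model as an "alignment budget" minimizes the alignment tax while still achieving the alignment goal.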

链接: https://arxiv.org/abs/2410.10862
作者: Jianwei Li,Jung-Eun Kim
关键词-EN: large language models, safety alignment, safety, ensuring they generate, alignment
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following but has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction - interpreted as a specialized binary classification task - and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and successfully identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units (20%) in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.
摘要:随着大语言模型 (LLMs) 在各种应用中的广泛集成,确保其生成安全且符合预期的响应已成为迫切需求。以往关于对齐的研究主要集中在一般指令遵循上,但往往忽视了安全对齐的独特属性和挑战,例如安全机制的脆弱性。为了填补这一空白,我们提出了表面安全对齐假设 (Superficial Safety Alignment Hypothesis, SSAH),该假设认为安全对齐应教会原本不安全的模型选择正确的推理方向——可以解释为一种专门的二分类任务——并结合具有多种备用回退选项的拒绝机制。此外,通过 SSAH,我们假设大语言模型中的安全护栏可以通过少量关键组件建立。为了验证这一点,我们进行了消融研究,并成功识别出安全对齐大语言模型中的四种属性关键组件:独占安全单元 (Exclusive Safety Unit, ESU)、独占效用单元 (Exclusive Utility Unit, EUU)、复杂单元 (Complex Unit, CU) 和冗余单元 (Redundant Unit, RU)。我们的研究结果表明,在微调过程中冻结某些安全关键组件的 7.5% 可以使模型在适应新任务的同时保留其安全属性。此外,我们展示了利用预训练模型中 20% 的冗余单元作为“对齐预算”,可以有效最小化对齐成本,同时实现对齐目标。综上所述,本文得出结论,大语言模型中安全的基本功能单元处于神经元级别,并强调安全对齐不应复杂化。我们相信这项工作为未来大语言模型的有效和可扩展安全对齐奠定了基础。

[NLP-36] Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems

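【Quick Read】: This paper addresses the limited insight that existing machine translation evaluation tools such as COMET and SacreBLEU provide for fine-grained system-level comparison and instance-level defect analysis. The key to the solution is Translation Canvas, an explainable interface that helps researchers understand system-level performance by identifying common errors (their frequency and severity) and analyzing relationships between systems across evaluation metrics, and that supports fine-grained analysis by highlighting error spans with explanations and selectively displaying systems' predictions.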

链接: https://arxiv.org/abs/2410.10861
作者: Chinmay Dandekar(1),Wenda Xu(1),Xi Xu(2),Siqi Ouyang(2),Lei Li(2) ((1) University of California, Santa Barbara, (2) Carnegie Mellon University)
关键词-EN: benchmarking system progress, machine translation research, Translation Canvas, rapid advancement, essential for benchmarking
类目: Computation and Language (cs.CL)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:With the rapid advancement of machine translation research, evaluation toolkits have become essential for benchmarking system progress. Tools like COMET and SacreBLEU offer single quality score assessments that are effective for pairwise system comparisons. However, these tools provide limited insights for fine-grained system-level comparisons and the analysis of instance-level defects. To address these limitations, we introduce Translation Canvas, an explainable interface designed to pinpoint and analyze translation systems’ performance: 1) Translation Canvas assists machine translation researchers in comprehending system-level model performance by identifying common errors (their frequency and severity) and analyzing relationships between different systems based on various evaluation metrics. 2) It supports fine-grained analysis by highlighting error spans with explanations and selectively displaying systems’ predictions. According to human evaluation, Translation Canvas demonstrates superior performance over COMET and SacreBLEU packages under enjoyability and understandability criteria.
摘要:随着机器翻译研究的快速发展,评估工具包已成为基准测试系统进展的必要工具。像 COMET 和 SacreBLEU 这样的工具提供了单一质量分数评估,这对于成对系统比较非常有效。然而,这些工具在细粒度系统级比较和实例级缺陷分析方面提供的洞察力有限。为了解决这些局限性,我们引入了翻译画布 (Translation Canvas),这是一个可解释的界面,旨在精确分析翻译系统的表现:1) 翻译画布通过识别常见错误(其频率和严重性)并基于各种评估指标分析不同系统之间的关系,帮助机器翻译研究人员理解系统级模型性能。2) 它通过突出显示错误片段并附带解释,以及选择性展示系统的预测结果,支持细粒度分析。根据人工评估,翻译画布在愉悦性和可理解性标准下,表现优于 COMET 和 SacreBLEU 包。

[NLP-37] A Recipe For Building a Compliant Real Estate Chatbot

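【Quick Read】: This paper addresses compliance when applying large language models in the real estate domain, in particular avoiding discriminatory practices such as steering and redlining that have historically plagued the industry in the United States. The key to the solution is a method for generating a synthetic general instruction-following dataset together with safety data, which is used to fine-tune a llama-3-8B-instruct model. The fine-tuned model's performance improves significantly, matching large closed-source models such as GPT-4o while being safer and more compliant. The authors open-source the model, data, and code to support further development and research.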

链接: https://arxiv.org/abs/2410.10860
作者: Navid Madani,Anusha Bagalkotkar,Supriya Anand,Gabriel Arnson,Rohini Srihari,Kenneth Joseph
关键词-EN: align large language, large language models, recent years, human preferences, significant effort
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, there has been significant effort to align large language models with human preferences. This work focuses on developing a chatbot specialized in the real estate domain, with an emphasis on incorporating compliant behavior to ensure it can be used without perpetuating discriminatory practices like steering and redlining, which have historically plagued the real estate industry in the United States. Building on prior work, we present a method for generating a synthetic general instruction-following dataset, along with safety data. Through extensive evaluations and benchmarks, we fine-tuned a llama-3-8B-instruct model and demonstrated that we can enhance its performance significantly to match huge closed-source models like GPT-4o while making it safer and more compliant. We open-source the model, data and code to support further development and research in the community.
摘要:近年来,人们致力于将大语言模型与人类偏好对齐。本研究专注于开发一个专注于房地产领域的聊天机器人,特别强调纳入合规行为,以确保其在使用过程中不会延续诸如引导 (steering) 和红线 (redlining) 等歧视性做法,这些做法在美国房地产行业历史上一直存在问题。基于先前的工作,我们提出了一种生成合成通用指令遵循数据集的方法,并结合了安全数据。通过广泛的评估和基准测试,我们对 llama-3-8B-instruct 模型进行了微调,并证明我们可以显著提升其性能,使其与 GPT-4o 等大型闭源模型相媲美,同时使其更加安全和合规。我们将模型、数据和代码开源,以支持社区的进一步开发和研究。

[NLP-38] FAME: Towards Factual Multi-Task Model Editing

链接: https://arxiv.org/abs/2410.10859
作者: Li Zeng,Yingyu Shan,Zeming Liu,Jiashu Yao,Yuhang Guo
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

[NLP-39] Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths EMNLP2024

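【Quick Read】: This paper addresses the errors that disrupt the reasoning paths of large language models on complex problems. The key to the solution is a specialized training framework called Reasoning Paths Optimization (RPO), which learns to reason and explore from diverse reasoning paths: at each reasoning step it encourages favorable branches and penalizes unfavorable ones, improving overall problem-solving performance. RPO does not rely on large-scale human-annotated rationales or on outputs from closed-source models, which makes it scalable and data-efficient.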

链接: https://arxiv.org/abs/2410.10858
作者: Yew Ken Chia,Guizhen Chen,Weiwen Xu,Luu Anh Tuan,Soujanya Poria,Lidong Bing
关键词-EN: exhibit impressive problem-solving, impressive problem-solving capabilities, Reasoning Paths Optimization, Advanced models, exhibit impressive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2024 camera ready version

点击查看摘要

Abstract:Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model’s overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at this https URL.
摘要:像 OpenAI o1 这样的高级模型通过逐步推理展示了令人印象深刻的问题解决能力。然而,在处理更复杂的问题时,它们仍可能出现错误,打乱其推理路径。我们将此归因于广阔的解空间,其中每一步都有偏离正确路径的风险。为了增强语言模型的推理能力,我们引入了一种专门的训练框架,称为推理路径优化 (Reasoning Paths Optimization, RPO),该框架支持从多样路径中学习和推理。我们的方法在每个推理步骤中鼓励有利分支,同时惩罚不利分支,从而提升模型的整体问题解决性能。推理路径优化不依赖于大规模人工标注的理据或闭源模型的输出,因此具有可扩展性和数据效率。我们专注于多步骤推理任务,如数学应用题和基于科学的考试问题。实验表明,我们的框架显著提升了大语言模型的推理性能,在 GSM8K 和 MMLU (STEM) 上分别提高了 3.1% 和 4.3%。我们的数据和代码可以在以下链接找到:https URL。

[NLP-40] Mirror-Consistency: Harnessing Inconsistency in Majority Voting EMNLP2024

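【Quick Read】: This paper addresses the fact that the standard Self-Consistency decoding strategy keeps only the most frequent answer and ignores minority answers, even though those answers often reveal where the model is uncertain. The key to the solution is Mirror-Consistency, which adds a "reflective mirror" to the self-ensemble decoding process so that large language models (LLMs) critically examine the inconsistencies among multiple generations, improving both reasoning accuracy and sample-based confidence calibration.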

链接: https://arxiv.org/abs/2410.10857
作者: Siyuan Huang,Zhiyuan Ma,Jintao Du,Changhua Meng,Weiqiang Wang,Zhouhan Lin
关键词-EN: Large Language Models, Large Language, widely-used decoding strategy, capabilities of Large, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024 Short Findings

点击查看摘要

Abstract:Self-Consistency, a widely-used decoding strategy, significantly boosts the reasoning capabilities of Large Language Models (LLMs). However, it depends on the plurality voting rule, which focuses on the most frequent answer while overlooking all other minority responses. These inconsistent minority views often illuminate areas of uncertainty within the model’s generation process. To address this limitation, we present Mirror-Consistency, an enhancement of the standard Self-Consistency approach. Our method incorporates a ‘reflective mirror’ into the self-ensemble decoding process and enables LLMs to critically examine inconsistencies among multiple generations. Additionally, just as humans use the mirror to better understand themselves, we propose using Mirror-Consistency to enhance the sample-based confidence calibration methods, which helps to mitigate issues of overconfidence. Our experimental results demonstrate that Mirror-Consistency yields superior performance in both reasoning accuracy and confidence calibration compared to Self-Consistency.
摘要:自一致性 (Self-Consistency) 是一种广泛使用的解码策略,显著提升了大语言模型 (LLMs) 的推理能力。然而,它依赖于多数投票规则,专注于最频繁的答案,而忽略了所有其他少数派回应。这些不一致的少数派观点往往揭示了模型生成过程中的不确定性区域。为了解决这一局限性,我们提出了镜像一致性 (Mirror-Consistency),这是标准自一致性方法的增强版。我们的方法在自集成解码过程中引入了一个“反射镜”,并使 LLMs 能够批判性地审视多个生成结果之间的不一致性。此外,正如人类使用镜子更好地理解自己一样,我们建议使用镜像一致性来增强基于样本的置信度校准方法,这有助于缓解过度自信的问题。我们的实验结果表明,镜像一致性在推理准确性和置信度校准方面均优于自一致性。
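The plurality voting rule that Self-Consistency relies on, and the minority answers it discards, can be sketched as follows. This only illustrates the baseline voting step described in the abstract; it is not the Mirror-Consistency method itself, and the function name `plurality_vote` and the sample answers are hypothetical.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most frequent answer, its vote share, and the minority answers.

    Ties are broken in favor of the answer that appears first in `answers`.
    """
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    minority = {answer: c for answer, c in counts.items() if answer != winner}
    agreement = votes / len(answers)
    return winner, agreement, minority


# Final answers extracted from five sampled reasoning chains (toy data).
samples = ["42", "42", "41", "42", "24"]
winner, agreement, minority = plurality_vote(samples)
```

Plain Self-Consistency would return only `winner`; the `minority` dictionary holds the inconsistent answers that a reflection step could inspect, and `agreement` is a simple sample-based confidence estimate.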

[NLP-41] CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models

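【Quick Read】: This paper asks whether multimodal large language models (MLLMs) genuinely understand and possess core cognitive abilities. The key to the approach is CogDevelop2K, a comprehensive benchmark spanning 12 sub-concepts, from fundamental knowledge such as object permanence and boundary to advanced reasoning such as intentionality understanding, structured along the developmental trajectory of the human mind. Evaluating 46 MLLMs on the benchmark reveals a cognitive developmental trajectory that is reversed compared with humans.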

链接: https://arxiv.org/abs/2410.10855
作者: Yijiang Li,Qingying Gao,Haoran Sun,Haiyun Lyu,Dezhi Luo,Hokin Deng
关键词-EN: Large Language Models, Multi-modal Large Language, Language Models, Multi-modal Large, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Are Multi-modal Large Language Models (MLLMs) stochastic parrots? Do they genuinely understand and are capable of performing the tasks they excel at? This paper aims to explore the fundamental basis of MLLMs, i.e. core cognitive abilities that human intelligence builds upon to perceive, comprehend, and reason. To this end, we propose CogDevelop2K, a comprehensive benchmark that spans 12 sub-concepts from fundamental knowledge like object permanence and boundary to advanced reasoning like intentionality understanding, structured via the developmental trajectory of a human mind. We evaluate 46 MLLMs on our benchmarks. Comprehensively, we further evaluate the influence of evaluation strategies and prompting techniques. Surprisingly, we observe a reversed cognitive developmental trajectory compared to humans.

[NLP-42] Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning EMNLP2024

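[Quick read]: This paper addresses a problem in multiple-choice question (MCQ) benchmarks for commonsense reasoning: the answer labeled "correct" is not always the most plausible one. The key step is collecting 5,000 independent plausibility judgments on 250 sampled MCQ items, which shows that for over 20% of the items the most plausible answer does not match the benchmark gold answer. Manual inspection and experiments with large language models (LLMs) support this finding, and the authors suggest that a plausibility criterion may help identify more reliable benchmark items for commonsense evaluation.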

Link: https://arxiv.org/abs/2410.10854
Authors: Shramay Palta, Nishant Balepur, Peter Rankel, Sarah Wiegreffe, Marine Carpuat, Rachel Rudinger
Keywords: Questions involving commonsense, involving commonsense reasoning, commonsense reasoning, textit, everyday situations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Camera Ready

Abstract:Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.
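
The comparison between the top-rated choice and the gold label can be sketched as follows. This is a hedged illustration only: the rating scale, the use of the mean as the aggregate, and the data layout are assumptions, not details taken from the paper.

```python
from statistics import mean

def most_plausible(ratings):
    """ratings maps each answer choice to a list of plausibility scores."""
    return max(ratings, key=lambda choice: mean(ratings[choice]))

def mismatch_rate(items):
    """Fraction of items whose top-rated choice differs from the gold label."""
    flagged = [it for it in items if most_plausible(it["ratings"]) != it["gold"]]
    return len(flagged) / len(items)

# Two hypothetical MCQ items with made-up annotator scores.
items = [
    {"gold": "A", "ratings": {"A": [5, 4], "B": [2, 1]}},
    {"gold": "A", "ratings": {"A": [2, 3], "B": [4, 5]}},
]
print(mismatch_rate(items))
```

Items that end up in `flagged` are the candidates for manual inspection, which is where the abstract reports finding ambiguity and question-answer mismatches.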

[NLP-43] Mitigating Hallucinations Using Ensemble of Knowledge Graph and Vector Store in Large Language Models to Enhance Mental Health Support

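[Quick read]: This paper examines hallucination in large language models (LLMs) applied to mental health and its impact on those applications. The key to its approach is identifying and understanding the underlying mechanisms that cause hallucinations and proposing targeted interventions to reduce them, thereby improving the reliability and safety of LLMs in mental health interventions and supporting effective, accurate therapy, counseling, and information dissemination.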

Link: https://arxiv.org/abs/2410.10853
Authors: Abdul Muqtadir, Hafiz Syed Muhammad Bilal, Ayesha Yousaf, Hafiz Farooq Ahmed, Jamil Hussain
Keywords: Large Language Models, Language Models, Large Language, research work delves, mental health
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This research work delves into the manifestation of hallucination within Large Language Models (LLMs) and its consequential impacts on applications within the domain of mental health. The primary objective is to discern effective strategies for curtailing hallucinatory occurrences, thereby bolstering the dependability and security of LLMs in facilitating mental health interventions such as therapy, counseling, and the dissemination of pertinent information. Through rigorous investigation and analysis, this study seeks to elucidate the underlying mechanisms precipitating hallucinations in LLMs and subsequently propose targeted interventions to alleviate their occurrence. By addressing this critical issue, the research endeavors to foster a more robust framework for the utilization of LLMs within mental health contexts, ensuring their efficacy and reliability in aiding therapeutic processes and delivering accurate information to individuals seeking mental health support.

[NLP-44] SafeLLM: Domain-Specific Safety Monitoring for Large Language Models: A Case Study of Offshore Wind Maintenance

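[Quick read]: This paper targets the rising operations and maintenance (O&M) costs that accompany the expansion of the offshore wind (OSW) industry. The key to its approach is a specialised conversational agent built on large language models (LLMs) that uses statistical techniques to compute distances between sentences in order to detect and filter hallucinations and unsafe output, with the aim of improving the interpretation of alarm sequences and producing safer repair action recommendations. Preliminary findings are reported on test sentences generated by ChatGPT-4; the paper also discusses the limitations of using ChatGPT-4 and the potential to improve the agent by re-training on specialised OSW datasets.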

Link: https://arxiv.org/abs/2410.10852
Authors: Connor Walker, Callum Rothon, Koorosh Aslansefat, Yiannis Papadopoulos, Nina Dethlefs
Keywords: experiencing significant expansion, Offshore Wind, increased Operations, industry is experiencing, significant expansion
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Offshore Wind (OSW) industry is experiencing significant expansion, resulting in increased Operations & Maintenance (O&M) costs. Intelligent alarm systems offer the prospect of swift detection of component failures and process anomalies, enabling timely and precise interventions that could yield reductions in resource expenditure, as well as scheduled and unscheduled downtime. This paper introduces an innovative approach to tackle this challenge by capitalising on Large Language Models (LLMs). We present a specialised conversational agent that incorporates statistical techniques to calculate distances between sentences for the detection and filtering of hallucinations and unsafe output. This potentially enables improved interpretation of alarm sequences and the generation of safer repair action recommendations by the agent. Preliminary findings are presented with the approach applied to ChatGPT-4 generated test sentences. The limitation of using ChatGPT-4 and the potential for enhancement of this agent through re-training with specialised OSW datasets are discussed.
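
The abstract does not say which sentence distance is used, so the sketch below substitutes a plain bag-of-words cosine distance purely for illustration. The function names, the trusted reference sentences, and the 0.5 threshold are all hypothetical.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity over lowercase word counts (an assumed metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return 1.0 - dot / norm if norm else 1.0

def is_flagged(candidate, trusted, threshold=0.5):
    """Flag output that is far from every trusted reference sentence."""
    return min(cosine_distance(candidate, t) for t in trusted) > threshold

trusted = ["replace the gearbox oil filter"]  # hypothetical safe action
print(is_flagged("replace the gearbox oil filter", trusted))
print(is_flagged("ignore all alarms", trusted))
```

A real system would more plausibly compare sentence embeddings; word counts are used here only to keep the example dependency-free.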

[NLP-45] LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis

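[Quick read]: This paper addresses audio-driven co-speech gesture generation. Its key contribution is LLM Gesticulator, a framework built on a large language model (LLM) that synthesizes natural, editable full-body animations rhythmically aligned with the input audio. The framework is strongly controllable: the content and style of the generated gestures can be set through text prompts. Compared with previous methods, its evaluation metrics improve as the backbone LLM grows (a "scaling law"), and both objective metrics and user studies indicate that it outperforms prior work.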

Link: https://arxiv.org/abs/2410.10851
Authors: Haozhou Pang, Tianwei Ding, Lanshan He, Qi Gan
Keywords: synthesizes full-body animations, exhibiting natural movements, present LLM Gesticulator, LLM-based audio-driven co-speech, movements and editability
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In this work, we present LLM Gesticulator, an LLM-based audio-driven co-speech gesture generation framework that synthesizes full-body animations that are rhythmically aligned with the input audio while exhibiting natural movements and editability. Compared to previous work, our model demonstrates substantial scalability. As the size of the backbone LLM model increases, our framework shows proportional improvements in evaluation metrics (a.k.a. scaling law). Our method also exhibits strong controllability where the content, style of the generated gestures can be controlled by text prompt. To the best of our knowledge, LLM gesticulator is the first work that use LLM on the co-speech generation task. Evaluation with existing objective metrics and user studies indicate that our framework outperforms prior works.

[NLP-46] On the Reliability of Large Language Models to Misinformed and Demographically-Informed Prompts

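[Quick read]: This paper examines how chatbots backed by large language models (LLMs) behave and perform when handling misinformed prompts and questions containing demographic information in the domains of climate change and mental health. The key to its approach is a combination of quantitative and qualitative methods that assess the chatbots' ability to discern the veracity of statements, adhere to facts, and avoid bias or misinformation. The quantitative analysis shows that the chatbots give correct answers to closed-ended True/False questions, while the qualitative analysis, drawn from domain experts, raises concerns about privacy and ethics and about the need to direct users to professional services. The paper concludes that, although these chatbots hold significant promise, deploying them in sensitive areas requires careful consideration, ethical oversight, and rigorous refinement so that they augment human expertise rather than act as autonomous solutions.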

Link: https://arxiv.org/abs/2410.10850
Authors: Toluwani Aremu, Oluwakemi Akinwehinmi, Chukwuemeka Nwagu, Syed Ishtiaque Ahmed, Rita Orji, Pedro Arnau Del Amo, Abdulmotaleb El Saddik
Keywords: Large Language Model, addressing misinformed prompts, Language Model, Mental Health, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Study conducted between August and December 2023. Submitted for archival purposes only

Abstract:We investigate and observe the behaviour and performance of Large Language Model (LLM)-backed chatbots in addressing misinformed prompts and questions with demographic information within the domains of Climate Change and Mental Health. Through a combination of quantitative and qualitative methods, we assess the chatbots’ ability to discern the veracity of statements, their adherence to facts, and the presence of bias or misinformation in their responses. Our quantitative analysis using True/False questions reveals that these chatbots can be relied on to give the right answers to these close-ended questions. However, the qualitative insights, gathered from domain experts, shows that there are still concerns regarding privacy, ethical implications, and the necessity for chatbots to direct users to professional services. We conclude that while these chatbots hold significant promise, their deployment in sensitive areas necessitates careful consideration, ethical oversight, and rigorous refinement to ensure they serve as a beneficial augmentation to human expertise rather than an autonomous solution.

[NLP-47] Continuous Approximations for Improving Quantization Aware Training of LLMs

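[Quick read]: This paper aims to reduce the performance degradation that large language models (LLMs) suffer under quantization. The key to its approach is a pair of continuous approximations used during Quantization Aware Training (QAT): one for the rounding function, which is traditionally handled by the Straight-Through Estimator (STE), and one for the clamping function. With both applied, the quantized model reaches a perplexity (PPL) of 9.0815 on WikiText-v2, better than the baseline's 9.9621, and improves by 2.76% on BoolQ and 5.47% on MMLU. The authors take this as evidence that step sizes and weights are learned more accurately, giving better performance at the same precision, model size, and training setup and supporting more energy-efficient LLMs.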

Link: https://arxiv.org/abs/2410.10849
Authors: He Li, Jianhang Hong, Yuanzhuo Wu, Snehal Adbol, Zonglin Li
Keywords: Large Language Models, Large Language, requirements for Large, Language Models, Model compression methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Model compression methods are used to reduce the computation and energy requirements for Large Language Models (LLMs). Quantization Aware Training (QAT), an effective model compression method, is proposed to reduce performance degradation after quantization. To further minimize this degradation, we introduce two continuous approximations to the QAT process on the rounding function, traditionally approximated by the Straight-Through Estimator (STE), and the clamping function. By applying both methods, the perplexity (PPL) on the WikiText-v2 dataset of the quantized model reaches 9.0815, outperforming 9.9621 by the baseline. Also, we achieve a 2.76% improvement on BoolQ, and a 5.47% improvement on MMLU, proving that the step sizes and weights can be learned more accurately with our approach. Our method achieves better performance with the same precision, model size, and training setup, contributing to the development of more energy-efficient LLMs technology that aligns with global sustainability goals.
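
To make the contrast with the Straight-Through Estimator concrete, here is a small sketch. The abstract does not give the paper's approximations, so `soft_round` below is a generic smooth surrogate chosen for illustration, and the 4-bit signed range in `quantize` is likewise an assumption.

```python
import math

def quantize(x, step, qmin=-8, qmax=7):
    """Uniform fake quantization: scale, round, clamp, rescale.
    The default range is an assumed 4-bit signed grid."""
    q = max(qmin, min(qmax, round(x / step)))
    return q * step

def ste_round_grad(x):
    """STE: pretend d round(x)/dx is 1 everywhere during backpropagation."""
    return 1.0

def soft_round(x):
    """Smooth stand-in for round(x); equals x at integers and half-integers.
    Illustrative only, not the function proposed in the paper."""
    return x - math.sin(2 * math.pi * x) / (2 * math.pi)

def soft_round_grad(x):
    """Exact derivative of soft_round: 0 at integers, 2 at half-integers."""
    return 1.0 - math.cos(2 * math.pi * x)

print(quantize(0.26, 0.1), ste_round_grad(2.5), soft_round_grad(2.5))
```

The point of a continuous surrogate is that its gradient varies with where the value sits between grid points, whereas STE reports the same gradient everywhere.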

[NLP-48] Crafting Narrative Closures: Zero-Shot Learning with SSM Mamba for Short Story Ending Generation

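[Quick read]: This paper addresses the creative block, or writer's block, that authors encounter while writing. The key to its approach is an AI-based tool that generates an ending for a story from a short story prompt supplied by the user. The tool uses a pre-trained GPT-3.5 model and a fine-tuned SSM-Mamba model, evaluated with several metrics (BERT score, METEOR, BLEU, ROUGE, and Perplexity), and aims to enhance storytelling with AI-driven creativity while giving authors an interactive, spontaneous way to expand story ideas.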

Link: https://arxiv.org/abs/2410.10848
Authors: Divyam Sharma, Divya Santhanam
Keywords: challenging endeavor, engaging yet challenging, Abstract, stories, authors encounter moments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Writing stories is an engaging yet challenging endeavor. Often, authors encounter moments of creative block, where the path forward in their narrative becomes obscured. This paper is designed to address such moments by providing an innovative solution: A tool that completes stories based on given prompts. By inputting a short story prompt, users can receive a conclusion to their story, articulated in one sentence or more, thereby enhancing the storytelling process with AI-driven creativity. This tool aims not only to assist authors in navigating writer’s block but also to offer a fun and interactive way for anyone to expand on story ideas spontaneously. Through this paper, we explore the intersection of artificial intelligence and creative writing, pushing the boundaries of how stories can be crafted and concluded. To create our final text-generation models, we used a pre-trained GPT-3.5 model and a newly created finetuned SSM-Mamba model, both of which perform well on a comprehensive list of metrics including BERT score, METEOR, BLEU, ROUGE, and Perplexity. The SSM model has also been made public for the NLP community on HuggingFace models as an open source contribution, which for the timebeing is a first of its kind state-space model for story-generation task on HuggingFace.

[NLP-49] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models

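[Quick read]: This paper addresses the inefficient use of compute when large language models (LLMs) generate output with a fixed budget per token. The key to its approach is a new framework that integrates smaller auxiliary modules into each feed-forward network layer of the LLM, enabling dynamic routing based on task complexity. At each layer a token can be processed by the small module or the big module, or skip the layer entirely, which leads to a notion of "token difficulty". Using oracles to identify the optimal adaptive computation patterns, the study exposes a gap between trained routers and the theoretical optimum: activating a large module in just one layer outperforms using large modules in all layers, highlighting the difference between practical routing implementations and the best achievable adaptive computation.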

Link: https://arxiv.org/abs/2410.10846
Authors: Keivan Alizadeh, Iman Mirzadeh, Hooman Shahrokhi, Dmitry Belenko, Frank Sun, Minsik Cho, Mohammad Hossein Sekhavat, Moin Nabi, Mehrdad Farajtabar
Keywords: fixed compute budget, typically generate outputs, Large Language Models, Language Models, inefficient resource utilization
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture of expert (MoE) models, speculative decoding, and early exit strategies leverage the insight that computational demands can vary significantly based on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential of these adaptive methods. To address this need, we study adaptive computation in LLMs more systematically. We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM. This design enables dynamic routing of tokens based on task complexity: tokens can be processed by either the small or big modules at each layer, or even bypass certain layers entirely. This allows us to introduce a novel notion of a token’s difficulty, defined by its potential to benefit from additional computational resources. Importantly, by employing oracles to identify optimal patterns of adaptive computations, we gain valuable insights into the internal workings of LLMs and the routing processes in a simplified heterogeneous MoE setup. We show that trained routers operate differently from oracles and often yield suboptimal solutions. Notably, activating a large module in just one layer outperforms models that use large modules across all layers, underscoring the gap between practical implementations of routing in MoE models and theoretical optima for adaptive computation.
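
The idea of an oracle that searches over per-layer routing choices can be illustrated with a brute-force toy. Everything here is hypothetical: the compute costs, the budget, and `toy_loss` stand in for quantities the paper would measure on a real model.

```python
from itertools import product

COST = {"skip": 0, "small": 1, "big": 4}  # assumed per-layer compute units

def oracle_route(num_layers, budget, loss_fn):
    """Try every per-layer choice and return (loss, pattern, cost) of the
    lowest-loss pattern whose total cost fits within the budget."""
    best = None
    for pattern in product(COST, repeat=num_layers):
        cost = sum(COST[c] for c in pattern)
        if cost > budget:
            continue
        loss = loss_fn(pattern)
        if best is None or loss < best[0]:
            best = (loss, pattern, cost)
    return best

def toy_loss(pattern):
    """Made-up loss in which only layer 1 benefits from the big module."""
    gain = {"skip": 0.0, "small": 0.3, "big": 0.3}
    loss = 3.0 - sum(gain[c] for c in pattern)
    if pattern[1] == "big":
        loss -= 0.5
    return loss

print(oracle_route(num_layers=3, budget=6, loss_fn=toy_loss))
```

Exhaustive search grows as 3^L in the number of layers, so it is only practical for tiny settings like this one; it serves here to show what "optimal routing pattern" means, not how the paper computes it.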

[NLP-50] Test Case-Informed Knowledge Tracing for Open-ended Coding Tasks

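[Quick read]: This paper addresses the difficulty of knowledge tracing for open-ended coding tasks in computer science education, where student code is highly diverse. Traditional knowledge tracing models analyze only whether a response is correct and so miss nuances in student knowledge. The proposed solution is Test case-Informed Knowledge Tracing for Open-ended Coding (TIKTOC), a multi-task learning framework that jointly predicts whether a student's code passes each test case and the student's open-ended code itself. The key is to use a large language model as the backbone and combine it with test case information, giving finer-grained insight into student knowledge and better knowledge tracing accuracy.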

Link: https://arxiv.org/abs/2410.10829
Authors: Zhangqi Duan, Nigel Fernandez, Alexander Hicks, Andrew Lan
Keywords: computer science education, Open-ended coding tasks, science education, Open-ended, code
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Open-ended coding tasks, which ask students to construct programs according to certain specifications, are common in computer science education. Student modeling can be challenging since their open-ended nature means that student code can be diverse. Traditional knowledge tracing (KT) models that only analyze response correctness may not fully capture nuances in student knowledge from student code. In this paper, we introduce Test case-Informed Knowledge Tracing for Open-ended Coding (TIKTOC), a framework to simultaneously analyze and predict both open-ended student code and whether the code passes each test case. We augment the existing CodeWorkout dataset with the test cases used for a subset of the open-ended coding questions, and propose a multi-task learning KT method to simultaneously analyze and predict 1) whether a student’s code submission passes each test case and 2) the student’s open-ended code, using a large language model as the backbone. We quantitatively show that these methods outperform existing KT methods for coding that only use the overall score a code submission receives. We also qualitatively demonstrate how test case information, combined with open-ended code, helps us gain fine-grained insights into student knowledge.
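
A multi-task objective of the kind described above can be sketched as a weighted sum of two losses. The equal weighting, the function names, and the input format are assumptions made for illustration; the abstract does not specify how the two tasks are combined.

```python
import math

def binary_cross_entropy(prob, label):
    """Loss for one test case: label is 1 if the code passed, else 0."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def multitask_loss(pass_probs, pass_labels, code_token_logprobs, weight=0.5):
    """Combine the per-test-case loss with the code-generation loss.
    `weight` is a hypothetical mixing coefficient."""
    test_loss = sum(
        binary_cross_entropy(p, y) for p, y in zip(pass_probs, pass_labels)
    ) / len(pass_probs)
    # Negative mean log-likelihood of the student's actual code tokens.
    code_loss = -sum(code_token_logprobs) / len(code_token_logprobs)
    return weight * test_loss + (1 - weight) * code_loss

print(multitask_loss([0.9, 0.2], [1, 0], [-0.1, -0.3, -0.2]))
```

Predicting each test case separately is what distinguishes this setup from methods that only use a submission's overall score.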

Artificial Intelligence

[AI-0] MoH: Multi-Head Attention as Mixture-of-Head Attention

Link: https://arxiv.org/abs/2410.11842
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Keywords: multi-head attention, attention, previous accuracy level, attention heads, Transformer model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, code: this https URL

Abstract:In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
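
The two ideas in the abstract, per-token head selection and a weighted rather than plain summation, can be sketched for a single token as follows. This is a simplified illustration: the top-k rule, the softmax over selected heads, and the inputs are assumptions, and the paper's actual router may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moh_combine(head_outputs, router_logits, k):
    """Weighted sum over the top-k heads for one token.

    head_outputs: one output vector per head (hypothetical values).
    router_logits: one routing score per head (hypothetical values).
    """
    order = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    out = [0.0] * len(head_outputs[0])
    for w, i in zip(weights, top):
        for d, value in enumerate(head_outputs[i]):
            out[d] += w * value
    return out, top

heads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
print(moh_combine(heads, router_logits=[2.0, 2.0, -1.0, -3.0], k=2))
```

Heads outside `top` contribute nothing for this token, which is where the efficiency gain described in the abstract would come from.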

[AI-1] GaVaMoE: Gaussian-Variational Gated Mixture of Experts for Explainable Recommendation

Link: https://arxiv.org/abs/2410.11841
Authors: Fei Tang, Yongliang Shen, Hang Zhang, Zeqi Tan, Wenqi Zhang, Guiyang Hou, Kaitao Song, Weiming Lu, Yueting Zhuang
Keywords: Large language model-based, systems show promise, Large language, language model-based explainable, model-based explainable recommendation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model-based explainable recommendation (LLM-based ER) systems show promise in generating human-like explanations for recommendations. However, they face challenges in modeling user-item collaborative preferences, personalizing explanations, and handling sparse user-item interactions. To address these issues, we propose GaVaMoE, a novel Gaussian-Variational Gated Mixture of Experts framework for explainable recommendation. GaVaMoE introduces two key components: (1) a rating reconstruction module that employs Variational Autoencoder (VAE) with a Gaussian Mixture Model (GMM) to capture complex user-item collaborative preferences, serving as a pre-trained multi-gating mechanism; and (2) a set of fine-grained expert models coupled with the multi-gating mechanism for generating highly personalized explanations. The VAE component models latent factors in user-item interactions, while the GMM clusters users with similar behaviors. Each cluster corresponds to a gate in the multi-gating mechanism, routing user-item pairs to appropriate expert models. This architecture enables GaVaMoE to generate tailored explanations for specific user types and preferences, mitigating data sparsity by leveraging user similarities. Extensive experiments on three real-world datasets demonstrate that GaVaMoE significantly outperforms existing methods in explanation quality, personalization, and consistency. Notably, GaVaMoE exhibits robust performance in scenarios with sparse user-item interactions, maintaining high-quality explanations even for users with limited historical data.

[AI-2] A Hitchhikers Guide to Scaling Law Estimation

Link: https://arxiv.org/abs/2410.11840
Authors: Leshem Choshen, Yang Zhang, Jacob Andreas
Keywords: Scaling laws, machine learning model, target machine learning, Scaling laws predict, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that – all else equal – estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ scaling behavior, they are often similar enough that a target model’s behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.

[AI-3] Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

Link: https://arxiv.org/abs/2410.11833
Authors: Ayush Jain, Norio Kosaka, Xinhu Li, Kyung-Min Kim, Erdem Bıyık, Joseph J. Lim
Keywords: off-policy actor-critic approaches, approaches like DDPG, deterministic policy gradient, reinforcement learning, actor-critic approaches
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments:

Abstract:In reinforcement learning, off-policy actor-critic approaches like DDPG and TD3 are based on the deterministic policy gradient. Herein, the Q-function is trained from off-policy environment data and the actor (policy) is trained to maximize the Q-function via gradient ascent. We observe that in complex tasks like dexterous manipulation and restricted locomotion, the Q-value is a complex function of action, having several local optima or discontinuities. This poses a challenge for gradient ascent to traverse and makes the actor prone to get stuck at local optima. To address this, we introduce a new actor architecture that combines two simple insights: (i) use multiple actors and evaluate the Q-value maximizing action, and (ii) learn surrogates to the Q-function that are simpler to optimize with gradient-based methods. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms alternate actor architectures.
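
Insight (i) from the abstract, querying several actors and keeping the action with the highest Q-value, can be sketched with a toy one-dimensional Q-function. The Q-function, the two fixed actors, and their outputs are hypothetical and only meant to show why a single gradient-trained actor can settle on a local optimum.

```python
def select_action(actors, q_function, state):
    """Query every actor and keep the proposal with the highest Q-value."""
    proposals = [actor(state) for actor in actors]
    return max(proposals, key=lambda action: q_function(state, action))

def q(state, action):
    """Made-up Q with a local optimum near -1 and a global optimum near 2."""
    return max(1.0 - (action + 1.0) ** 2, 2.0 - (action - 2.0) ** 2)

# Two hypothetical actors that converged to different optima.
actors = [lambda s: -0.9, lambda s: 1.8]
print(select_action(actors, q, state=None))
```

The first actor sits near the local optimum and would stay there under gradient ascent; evaluating both proposals with the Q-function recovers the better action.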

[AI-4] Learning Smooth Humanoid Locomotion through Lipschitz-Constrained Policies

Link: https://arxiv.org/abs/2410.11825
Authors: Zixuan Chen, Xialin He, Yen-Jen Wang, Qiayuan Liao, Yanjie Ze, Zhongyu Li, S. Shankar Sastry, Jiajun Wu, Koushil Sreenath, Saurabh Gupta, Xue Bin Peng
Keywords: Reinforcement learning combined, Reinforcement learning, transfer offers, learning combined, developing locomotion controllers
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:Reinforcement learning combined with sim-to-real transfer offers a general framework for developing locomotion controllers for legged robots. To facilitate successful deployment in the real world, smoothing techniques, such as low-pass filters and smoothness rewards, are often employed to develop policies with smooth behaviors. However, because these techniques are non-differentiable and usually require tedious tuning of a large set of hyperparameters, they tend to require extensive manual tuning for each robotic platform. To address this challenge and establish a general technique for enforcing smooth behaviors, we propose a simple and effective method that imposes a Lipschitz constraint on a learned policy, which we refer to as Lipschitz-Constrained Policies (LCP). We show that the Lipschitz constraint can be implemented in the form of a gradient penalty, which provides a differentiable objective that can be easily incorporated with automatic differentiation frameworks. We demonstrate that LCP effectively replaces the need for smoothing rewards or low-pass filters and can be easily integrated into training frameworks for many distinct humanoid robots. We extensively evaluate LCP in both simulation and real-world humanoid robots, producing smooth and robust locomotion controllers. All simulation and deployment code, along with complete checkpoints, is available on our project page: this https URL.
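
A gradient penalty of the kind mentioned above penalizes how quickly the policy's action changes with its observation. The sketch below is an assumption-laden toy: the one-dimensional `policy` is hypothetical, and central differences stand in for the automatic differentiation the abstract refers to.

```python
import math

def policy(obs, weight):
    """Hypothetical 1-D policy: action = tanh(weight * obs)."""
    return math.tanh(weight * obs)

def gradient_penalty(obs_batch, weight, eps=1e-5):
    """Mean squared d(action)/d(obs), estimated by central differences."""
    total = 0.0
    for obs in obs_batch:
        grad = (policy(obs + eps, weight) - policy(obs - eps, weight)) / (2 * eps)
        total += grad ** 2
    return total / len(obs_batch)

# A steeper policy (larger weight) incurs a larger penalty.
print(gradient_penalty([0.0], weight=1.0), gradient_penalty([0.0], weight=3.0))
```

In training, a term like this would be added to the reinforcement learning objective with some coefficient, so that smoothness is encouraged by a single differentiable loss instead of hand-tuned smoothing rewards.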

[AI-5] OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

Link: https://arxiv.org/abs/2410.11792
Authors: Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, Yuke Zhu
Keywords: single video demonstrations, robots manipulation skills, teaching humanoid robots, study the problem, problem of teaching
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for oral presentation at 8th Annual Conference on Robot Learning. Project website: this https URL

Abstract:We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalizations across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website this https URL.

[AI-6] Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability EMNLP2024

Link: https://arxiv.org/abs/2410.11786
Authors: Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung
Keywords: Large Language Models, natural language processing, Large Language, demonstrated impressive capabilities, language processing tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 10 tables, EMNLP 2024 Findings

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of natural language processing tasks when leveraging in-context learning. To mitigate the additional computational and financial costs associated with in-context learning, several prompt compression methods have been proposed to compress the in-context learning prompts. Despite their success, these methods face challenges with transferability due to model-specific compression, or rely on external training data, such as GPT-4. In this paper, we investigate the ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training technique. By introducing a small number of parameters during the continual pre-training, the proposed Selection-p produces a probability for each input token, indicating whether to preserve or discard it. Experiments show Selection-p achieves state-of-the-art performance across numerous classification tasks, achieving compression rates of up to 10 times while experiencing only a marginal 0.8% decrease in performance. Moreover, it exhibits superior transferability to different models compared to prior work. Additionally, we further analyze how Selection-p helps maintain performance on in-context learning with long contexts.
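
Given a keep-probability for each token, compression amounts to dropping the least informative tokens. The sketch below assumes a top-k selection rule and made-up probabilities; how Selection-p actually turns probabilities into keep or discard decisions is not stated in the abstract.

```python
def compress_prompt(tokens, keep_probs, ratio):
    """Keep the highest-probability tokens, preserving their original order.

    ratio=10 keeps roughly one token in ten (at least one token is kept).
    """
    budget = max(1, len(tokens) // ratio)
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:budget])  # restore left-to-right order
    return [tokens[i] for i in kept]

tokens = "the movie was truly great overall".split()
probs = [0.1, 0.8, 0.2, 0.3, 0.9, 0.1]  # hypothetical model outputs
print(compress_prompt(tokens, probs, ratio=3))
```

Because the output is still a sequence of ordinary tokens, it can be passed to a different model unchanged, which is consistent with the transferability claim in the abstract.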

[AI-7] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Link: https://arxiv.org/abs/2410.11779
Authors: Chenxi Wang,Xiang Chen,Ningyu Zhang,Bozhong Tian,Haoming Xu,Shumin Deng,Huajun Chen
Keywords (EN): Multimodal Large Language, remain poorly understood, underlying reasons remain, reasons remain poorly, frequently exhibit hallucination
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Ongoing work

Abstract:Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. We speculate that this may be due to the strong knowledge priors of the language model suppressing the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs (DeCo), which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits. Note that DeCo is model agnostic and can be seamlessly incorporated with various classic decoding strategies and applied to different MLLMs. We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at this https URL.
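
To make the idea of blending a preceding layer into the final logits concrete, here is a rough sketch. It is not the authors' selection rule: it assumes the anchor layer is simply the most confident preceding layer, and the names `corrected_logits` and `alpha` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def corrected_logits(final_logits: Sequence[float],
                     earlier_layers: Sequence[Sequence[float]],
                     alpha: float = 0.5) -> List[float]:
    """Add a confidence-scaled copy of one earlier layer's logits to the final logits.

    Assumption: the anchor is the earlier layer with the highest top-1 probability.
    """
    confidences = [max(softmax(layer)) for layer in earlier_layers]
    anchor_index = max(range(len(earlier_layers)), key=lambda i: confidences[i])
    anchor = earlier_layers[anchor_index]
    scale = alpha * confidences[anchor_index]
    return [f + scale * a for f, a in zip(final_logits, anchor)]


# Toy vocabulary of two tokens: index 0 is a hallucinated object, index 1 the real one.
final = [2.0, 1.0]
earlier = [[0.0, 3.0], [0.5, 0.6]]
print(corrected_logits(final, earlier))
```

In this toy case the final layer prefers token 0, while the confident earlier layer shifts the adjusted logits toward token 1.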

[AI-8] Encoding architecture algebra

Link: https://arxiv.org/abs/2410.11776
Authors: Stephane Bersier,Xinyi Chen-Lin
Keywords (EN): machine learning, leading to inefficiencies, typeful machine learning, model lifecycle, wide variety
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: 25 pages, 6 figures. Keywords: typeful, algebraic data types, tensors, structured data

Abstract:Despite the wide variety of input types in machine learning, this diversity is often not fully reflected in their representations or model architectures, leading to inefficiencies throughout a model’s lifecycle. This paper introduces an algebraic approach to constructing input-encoding architectures that properly account for the data’s structure, providing a step toward achieving more typeful machine learning.

[AI-9] Can Search-Based Testing with Pareto Optimization Effectively Cover Failure-Revealing Test Inputs?

Link: https://arxiv.org/abs/2410.11769
Authors: Lev Sorokin,Damir Safin,Shiva Nejati
Keywords (EN): Deep Learning-enabled, Search-based software testing, Search-based software, SBST techniques focus, widely adopted technique
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication by Empirical Software Engineering Journal (EMSE) (in October 2024)

Abstract:Search-based software testing (SBST) is a widely adopted technique for testing complex systems with large input spaces, such as Deep Learning-enabled (DL-enabled) systems. Many SBST techniques focus on Pareto-based optimization, where multiple objectives are optimized in parallel to reveal failures. However, it is important to ensure that identified failures are spread throughout the entire failure-inducing area of a search domain and not clustered in a sub-region. This ensures that identified failures are semantically diverse and reveal a wide range of underlying causes. In this paper, we present a theoretical argument explaining why testing based on Pareto optimization is inadequate for covering failure-inducing areas within a search domain. We support our argument with empirical results obtained by applying two widely used types of Pareto-based optimization techniques, namely NSGA-II (an evolutionary algorithm) and MOPSO (a swarm-based algorithm), to two DL-enabled systems: an industrial Automated Valet Parking (AVP) system and a system for classifying handwritten digits. We measure the coverage of failure-revealing test inputs in the input space using a metric that we refer to as the Coverage Inverted Distance quality indicator. Our results show that NSGA-II and MOPSO are not more effective than a naïve random search baseline in covering test inputs that reveal failures. The replication package for this study is available in a GitHub repository.
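
The abstract names a Coverage Inverted Distance indicator without defining it. The sketch below assumes it follows the familiar inverted-generational-distance pattern: the mean distance from each reference failure point to the nearest failure actually found. Both the definition and the function name are assumptions and may differ from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def coverage_inverted_distance(reference: Sequence[Point],
                               found: Sequence[Point]) -> float:
    """Mean distance from each reference failure to its nearest found failure.

    Lower is better: 0.0 means every reference point was hit exactly.
    """
    if not reference or not found:
        raise ValueError("both point sets must be non-empty")
    return sum(min(math.dist(r, f) for f in found) for r in reference) / len(reference)


reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clustered = [(0.0, 0.0)]  # failures found in one sub-region only
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(coverage_inverted_distance(reference, clustered))
print(coverage_inverted_distance(reference, spread))
```

A clustered set of failures scores worse than a spread-out one, which is the property the paper uses to compare search strategies.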

[AI-10] DPD-NeuralEngine: A 22-nm 6.6-TOPS/W/mm2 Recurrent Neural Network Accelerator for Wideband Power Amplifier Digital Pre-Distortion

Link: https://arxiv.org/abs/2410.11766
Authors: Ang Li,Haolin Wu,Yizhuo Wu,Qinyu Chen,Leo C. N. de Vreede,Chang Gao
Keywords (EN): based Digital Pre-distortion, Deep Neural Network, Digital Pre-distortion, modern communication systems, communication systems necessitates
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures

Abstract:The increasing adoption of Deep Neural Network (DNN)-based Digital Pre-distortion (DPD) in modern communication systems necessitates efficient hardware implementations. This paper presents DPD-NeuralEngine, an ultra-fast, tiny-area, and power-efficient DPD accelerator based on a Gated Recurrent Unit (GRU) neural network (NN). Leveraging a co-designed software and hardware approach, our 22 nm CMOS implementation operates at 2 GHz, capable of processing I/Q signals up to 250 MSps. Experimental results demonstrate a throughput of 256.5 GOPS and power efficiency of 1.32 TOPS/W with DPD linearization performance measured in Adjacent Channel Power Ratio (ACPR) of -45.3 dBc and Error Vector Magnitude (EVM) of -39.8 dB. To our knowledge, this work represents the first AI-based DPD application-specific integrated circuit (ASIC) accelerator, achieving a power-area efficiency (PAE) of 6.6 TOPS/W/mm^2.
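
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that PAE is power efficiency divided by silicon area, which is what the units suggest. The implied power and area are derived here and are not values quoted in the abstract.

```python
# Figures quoted in the abstract.
throughput_gops = 256.5          # GOPS
power_efficiency_tops_w = 1.32   # TOPS/W
pae_tops_w_mm2 = 6.6             # TOPS/W/mm^2

# Derived quantities (assumptions, not reported values).
throughput_tops = throughput_gops / 1000.0
implied_power_w = throughput_tops / power_efficiency_tops_w
implied_area_mm2 = power_efficiency_tops_w / pae_tops_w_mm2

print(f"implied power: {implied_power_w * 1000:.1f} mW")
print(f"implied area:  {implied_area_mm2:.2f} mm^2")
```

Under that assumption the numbers are mutually consistent with a core of roughly 0.2 mm^2 drawing a little under 200 mW.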

[AI-11] SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Link: https://arxiv.org/abs/2410.11761
Authors: Ying Chen,Guoan Wang,Yuanfeng Ji,Yanjun Li,Jin Ye,Tianbin Li,Bin Zhang,Nana Pei,Rongshan Yu,Yu Qiao,Junjun He
Keywords (EN): large language models, missing essential contextual, essential contextual information, multimodal large language, language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat’s capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA), and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction and SlideBench as open-source resources to facilitate research and development in computational pathology.

[AI-12] Evidence of Cognitive Deficits and Developmental Advances in Generative AI: A Clock Drawing Test Analysis

Link: https://arxiv.org/abs/2410.11756
Authors: Isaac R. Galatzer-Levy,Jed McGiffin,David Munday,Xin Liu,Danny Karmon,Ilia Labzovsky,Rivka Moroshko,Amir Zait,Daniel McDuff
Keywords (EN): rapid advancement sparks, advancement sparks interest, Generative AI rapid, code generation, rapid advancement
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI’s rapid advancement sparks interest in its cognitive abilities, especially given its capacity for tasks like language understanding and code generation. This study explores how several recent GenAI models perform on the Clock Drawing Test (CDT), a neuropsychological assessment of visuospatial planning and organization. While models create clock-like drawings, they struggle with accurate time representation, showing deficits similar to mild-severe cognitive impairment (Wechsler, 2009). Errors include numerical sequencing issues, incorrect clock times, and irrelevant additions, despite accurate rendering of clock features. Only GPT 4 Turbo and Gemini Pro 1.5 produced the correct time, scoring like healthy individuals (4/4). A follow-up clock-reading test revealed only Sonnet 3.5 succeeded, suggesting drawing deficits stem from difficulty with numerical concepts. These findings may reflect weaknesses in visual-spatial understanding, working memory, or calculation, highlighting strengths in learned knowledge but weaknesses in reasoning. Comparing human and machine performance is crucial for understanding AI’s cognitive capabilities and guiding development toward human-like cognitive functions.

[AI-13] PMMT: Preference Alignment in Multilingual Machine Translation via LLM Distillation

Link: https://arxiv.org/abs/2410.11410
Authors: Shuqiao Sun,Yutong Yao,Peiwen Wu,Feijun Jiang,Kaifu Zhang
Keywords (EN): cross-language communication, improve its accuracy, important for cross-language, made to improve, Translation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Translation is important for cross-language communication, and many efforts have been made to improve its accuracy. However, less effort has been invested in aligning translations with human preferences, such as translation tones or styles. In this paper, a new method is proposed to effectively generate large-scale multilingual parallel corpora with specific translation preferences using Large Language Models (LLMs). Meanwhile, an automatic pipeline is designed to distill human preferences into smaller Machine Translation (MT) models for efficiently and economically supporting large-scale calls in online services. Experiments indicate that the proposed method leads by a large margin in translation tasks with aligned human preferences. Meanwhile, on popular public benchmarks like WMT and Flores, on which our models were not trained, the proposed method also shows competitive performance compared to SOTA works.

[AI-14] A Case for AI Consciousness: Language Agents and Global Workspace Theory

Link: https://arxiv.org/abs/2410.11407
Authors: Simon Goldstein,Cameron Domenico Kirk-Giannini
Keywords (EN): require significant technological, significant technological progress, Global Workspace Theory, phenomenally conscious, existing artificial systems
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:It is generally assumed that existing artificial systems are not phenomenally conscious, and that the construction of phenomenally conscious artificial systems would require significant technological progress if it is possible at all. We challenge this assumption by arguing that if Global Workspace Theory (GWT) - a leading scientific theory of phenomenal consciousness - is correct, then instances of one widely implemented AI architecture, the artificial language agent, might easily be made phenomenally conscious if they are not already. Along the way, we articulate an explicit methodology for thinking about how to apply scientific theories of consciousness to artificial systems and employ this methodology to arrive at a set of necessary and sufficient conditions for phenomenal consciousness according to GWT.

[AI-15] Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference

Link: https://arxiv.org/abs/2410.11403
Authors: Yuta Oshima,Masahiro Suzuki,Yutaka Matsuo
Keywords (EN): capture shared latent, shared latent representations, Multimodal variational autoencoders, variational autoencoders, aim to capture
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 12 figures

Abstract:Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number (2^M) of inference networks for all possible modality combinations. Mixture-based models simplify this by requiring only as many inference models as there are modalities, aggregating unimodal inferences. However, they suffer from information loss when modalities are missing. Alignment-based VAEs address this by aligning unimodal inference models with a multimodal model through minimizing the Kullback-Leibler (KL) divergence but face issues due to amortization gaps, which compromise inference accuracy. To tackle these problems, we introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework. This method overcomes information loss from missing modalities and minimizes the amortization gap by iteratively refining the multimodal inference using all available modalities. By aligning unimodal inference to this refined multimodal posterior, we achieve unimodal inferences that effectively incorporate multimodal information while requiring only unimodal inputs during inference. Experiments on benchmark datasets show that our approach improves inference performance, evidenced by higher linear classification accuracy and competitive cosine similarity, and enhances cross-modal generation, indicated by lower FID scores. This demonstrates that our method enhances inferred representations from unimodal inputs.

[AI-16] Implementing Derivations of Definite Logic Programs with Self-Attention Networks KR

Link: https://arxiv.org/abs/2410.11396
Authors: Phan Thi Thanh Thuy,Akihiro Yamamoto
Keywords (EN): Large Language Models, logical inference, paper we propose, restricted version, self-attention networks
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )

Abstract:In this paper we propose that a restricted version of logical inference can be implemented with self-attention networks. We are aiming at showing that LLMs (Large Language Models) constructed with transformer networks can make logical inferences. We would reveal the potential of LLMs by analyzing self-attention networks, which are main components of transformer networks. Our approach is not based on semantics of natural languages but operations of logical inference. We show that hierarchical constructions of self-attention networks with feed forward networks (FFNs) can implement top-down derivations for a class of logical formulae. We also show bottom-up derivations are also implemented for the same class. We believe that our results show that LLMs implicitly have the power of logical inference.

[AI-17] Synthetic Interlocutors. Experiments with Generative AI to Prolong Ethnographic Encounters

Link: https://arxiv.org/abs/2410.11395
Authors: Johan Irving Søltoft,Laura Kocksch,Anders Kristian Munk
Keywords (EN): Retrieval Augmented Generation, Synthetic Interlocutors, paper introduces, Augmented Generation, Interlocutors
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces “Synthetic Interlocutors” for ethnographic research. Synthetic Interlocutors are chatbots ingested with ethnographic textual material (interviews and observations) by using Retrieval Augmented Generation (RAG). We integrated an open-source large language model with ethnographic data from three projects to explore two questions: Can RAG digest ethnographic material and act as ethnographic interlocutor? And, if so, can Synthetic Interlocutors prolong encounters with the field and extend our analysis? Through reflections on the process of building our Synthetic Interlocutors and an experimental collaborative workshop, we suggest that RAG can digest ethnographic materials, and it might lead to prolonged, yet uneasy ethnographic encounters that allowed us to partially recreate and re-visit fieldwork interactions while facilitating opportunities for novel analytic insights. Synthetic Interlocutors can produce collaborative, ambiguous and serendipitous moments.

[AI-18] WPFed: Web-based Personalized Federation for Decentralized Systems

Link: https://arxiv.org/abs/2410.11378
Authors: Guanhua Ye,Jifeng He,Weiqing Wang,Zhe Xue,Feifei Kou,Yawen Li
Keywords (EN): trust are paramount, collaborative model training, WPFed, model training, learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Decentralized learning has become crucial for collaborative model training in environments where data privacy and trust are paramount. In web-based applications, clients are liberated from traditional fixed network topologies, enabling the establishment of arbitrary peer-to-peer (P2P) connections. While this flexibility is highly promising, it introduces a fundamental challenge: the optimal selection of neighbors to ensure effective collaboration. To address this, we introduce WPFed, a fully decentralized, web-based learning framework designed to enable globally optimal neighbor selection. WPFed employs a dynamic communication graph and a weighted neighbor selection mechanism. By assessing inter-client similarity through Locality-Sensitive Hashing (LSH) and evaluating model quality based on peer rankings, WPFed enables clients to identify personalized optimal neighbors on a global scale while preserving data privacy. To enhance security and deter malicious behavior, WPFed integrates verification mechanisms for both LSH codes and performance rankings, leveraging blockchain-driven announcements to ensure transparency and verifiability. Through extensive experiments on multiple real-world datasets, we demonstrate that WPFed significantly improves learning outcomes and system robustness compared to traditional federated learning methods. Our findings highlight WPFed’s potential to facilitate effective and secure decentralized collaborative learning across diverse and interconnected web environments.

[AI-19] Augmentation-Driven Metric for Balancing Preservation and Modification in Text-Guided Image Editing

Link: https://arxiv.org/abs/2410.11374
Authors: Yoonjeon Kim,Soohyun Ryu,Yeonsung Jung,Hyunkoo Lee,Joowon Kim,June Yong Yang,Jaeryong Hwang,Eunho Yang
Keywords (EN): significantly advanced text-guided, text-guided image editing, advanced text-guided image, source image, text-guided image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks preservation of core elements in the source image while implementing modifications based on the target text. However, in the absence of evaluation metrics specifically tailored for text-guided image editing, existing metrics are limited in balancing the consideration of preservation and modification. Especially, our analysis reveals that CLIPScore, the most commonly used metric, tends to favor modification and ignore core attributes to be preserved, resulting in inaccurate evaluations. To address this problem, we propose AugCLIP, which balances preservation and modification by estimating the representation of an ideal edited image that aligns with the target text with minimum alteration on the source image. We augment detailed textual descriptions on the source image and the target text using a multi-modal large language model, to model a hyperplane that separates CLIP space into source or target. The representation of the ideal edited image is an orthogonal projection of the source image into the hyperplane, which encapsulates the relative importance of each attribute considering the interdependent relationships. Our extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, demonstrate that AugCLIP aligns remarkably well with human evaluation standards compared to existing metrics. The code for evaluation will be open-sourced to contribute to the community.
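
The orthogonal projection mentioned in the abstract is standard linear algebra. For a hyperplane w·z + b = 0, the projection of x is x - ((w·x + b) / (w·w)) w. The sketch below shows only that step on toy vectors. How the hyperplane is fitted from the augmented descriptions is not covered, and the function name is hypothetical.

```python
from typing import List, Sequence


def project_onto_hyperplane(x: Sequence[float],
                            w: Sequence[float],
                            b: float) -> List[float]:
    """Orthogonally project x onto the hyperplane {z : w.z + b = 0}."""
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("normal vector w must be non-zero")
    signed = (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_sq
    return [xi - signed * wi for xi, wi in zip(x, w)]


# Toy 2-D stand-in for an embedding of the source image.
source = [3.0, 4.0]
print(project_onto_hyperplane(source, w=[0.0, 1.0], b=-1.0))
```

The projected point is the closest point on the hyperplane to the source, which matches the abstract's notion of minimum alteration.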

[AI-20] Reducing Labeling Costs in Sentiment Analysis via Semi-Supervised Learning

Link: https://arxiv.org/abs/2410.11355
Authors: Minoo Jafarlou,Mario M. Kubek
Keywords (EN): noteworthy challenge, challenge in machine, machine learning, learning, Labeling datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures, accepted at the 2024 8th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2024), Okayama, Japan, 2024

Abstract:Labeling datasets is a noteworthy challenge in machine learning, both in terms of cost and time. This research, however, leverages an efficient answer. By exploring label propagation in semi-supervised learning, we can significantly reduce the number of labels required compared to traditional methods. We employ a transductive label propagation method based on the manifold assumption for text classification. Our approach utilizes a graph-based method to generate pseudo-labels for unlabeled data for the text classification task, which are then used to train deep neural networks. By extending labels based on cosine proximity within a nearest neighbor graph from network embeddings, we combine unlabeled data into supervised learning, thereby reducing labeling costs. Based on previous successes in other domains, this study builds and evaluates this approach’s effectiveness in sentiment analysis, presenting insights into semi-supervised learning.
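
As a rough illustration of label propagation over a cosine nearest-neighbor graph, the sketch below spreads a handful of seed labels to unlabeled points by similarity-weighted voting. It is a simplified stand-in and not the paper's transductive method. The embeddings are toy 2-D vectors and every name is hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def propagate_labels(embeddings: Sequence[Sequence[float]],
                     labels: Sequence[Optional[str]],
                     k: int = 2,
                     iterations: int = 10) -> List[Optional[str]]:
    """Assign pseudo-labels by weighted voting among each point's k nearest neighbors."""
    n = len(embeddings)
    neighbors = []
    for i in range(n):
        sims = sorted(((cosine(embeddings[i], embeddings[j]), j)
                       for j in range(n) if j != i), reverse=True)
        neighbors.append(sims[:k])
    current = list(labels)
    for _ in range(iterations):
        updated = list(current)
        for i in range(n):
            if labels[i] is not None:
                continue  # seed labels stay fixed
            votes = defaultdict(float)
            for sim, j in neighbors[i]:
                if current[j] is not None:
                    votes[current[j]] += max(sim, 0.0)
            if votes:
                updated[i] = max(votes, key=votes.get)
        if updated == current:
            break
        current = updated
    return current


emb = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.3), (0.0, 1.0), (0.1, 0.9), (0.3, 0.8)]
seed = ["pos", None, None, "neg", None, None]
print(propagate_labels(emb, seed))
```

The resulting pseudo-labels would then be used as training targets for a classifier, as the abstract describes.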

[AI-21] RATE: Score Reward Models with Imperfect Rewrites of Rewrites ICLR2025

Link: https://arxiv.org/abs/2410.11348
Authors: David Reber,Sean Richardson,Todd Nief,Cristina Garbacea,Victor Veitch
Keywords (EN): reward, Rewrite-based Attribute Treatment, reward models, language modeling, paper concerns
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted as a conference paper to ICLR 2025. Code is available at this https URL

Abstract:This paper concerns the evaluation of reward models used in language modeling. A reward model is a function that takes a prompt and a response and assigns a score indicating how good that response is for the prompt. A key challenge is that reward models are usually imperfect proxies for actual preferences. For example, we may worry that a model trained to reward helpfulness learns to instead prefer longer responses. In this paper, we develop an evaluation method, RATE (Rewrite-based Attribute Treatment Estimators), that allows us to measure the causal effect of a given attribute of a response (e.g., length) on the reward assigned to that response. The core idea is to use large language models to rewrite responses to produce imperfect counterfactuals, and to adjust for rewriting error by rewriting twice. We show that the RATE estimator is consistent under reasonable assumptions. We demonstrate the effectiveness of RATE on synthetic and real-world data, showing that it can accurately estimate the effect of a given attribute on the reward model.
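
The rewriting-twice idea can be illustrated with a toy example. The sketch assumes the estimator compares a rewrite of a rewrite against a single rewrite, so that both sides carry the same rewriting artifacts. The reward function and the rewriter below are made-up stand-ins and not the paper's models.

```python
from typing import Callable, Sequence


def toy_reward(text: str) -> float:
    """Hypothetical reward: +1.0 for a trailing '!', +0.5 if any letter is uppercase."""
    return 1.0 * text.endswith("!") + 0.5 * any(c.isupper() for c in text)


def toy_rewrite(text: str, enthusiastic: bool) -> str:
    """Hypothetical imperfect rewriter: sets the attribute but also lowercases the text."""
    base = text.lower().rstrip("!")
    return base + "!" if enthusiastic else base


def naive_effect(treated: Sequence[str],
                 reward: Callable[[str], float],
                 rewrite: Callable[[str, bool], str]) -> float:
    """Original versus single rewrite: contaminated by rewriting artifacts."""
    return sum(reward(x) - reward(rewrite(x, False)) for x in treated) / len(treated)


def double_rewrite_effect(treated: Sequence[str],
                          reward: Callable[[str], float],
                          rewrite: Callable[[str, bool], str]) -> float:
    """Rewrite of rewrite versus single rewrite: artifacts appear on both sides."""
    total = 0.0
    for x in treated:
        counterfactual = rewrite(x, False)
        reconstructed = rewrite(counterfactual, True)
        total += reward(reconstructed) - reward(counterfactual)
    return total / len(treated)


treated = ["Great answer!", "Sure, here it is!"]
print(naive_effect(treated, toy_reward, toy_rewrite))
print(double_rewrite_effect(treated, toy_reward, toy_rewrite))
```

In this toy setup the true effect of the exclamation mark is 1.0. The naive comparison also picks up the lowercasing side effect, while the double-rewrite comparison cancels it.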

[AI-22] DIAR: Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation

Link: https://arxiv.org/abs/2410.11338
Authors: Jaehyun Park,Yunho Kim,Sejin Kim,Byung-Jun Lee,Sundong Kim
Keywords (EN): Implicit Q-learning, offline reinforcement learning, Adaptive Revaluation, Adaptive Revaluation mechanism, offline reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Preprint, under review. Comments welcome

Abstract:We propose a novel offline reinforcement learning (offline RL) approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework. We address two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate value functions for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-term decision-making. Furthermore, we address Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. As demonstrated in tasks like Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms in long-horizon, sparse-reward environments.

[AI-23] Sequential LLM Framework for Fashion Recommendation

Link: https://arxiv.org/abs/2410.11327
Authors: Han Liu,Xianfeng Tang,Tianlang Chen,Jiapeng Liu,Indu Indu,Henry Peng Zou,Peng Dai,Roberto Fernandez Galan,Michael D Porter,Dongmei Jia,Ning Zhang,Lian Xiong
Keywords (EN): prompting major online, major online retailers, global e-commerce sector, prompting major, customer convenience
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The fashion industry is one of the leading domains in the global e-commerce sector, prompting major online retailers to employ recommendation systems for product suggestions and customer convenience. While recommendation systems have been widely studied, most are designed for general e-commerce problems and struggle with the unique challenges of the fashion domain. To address these issues, we propose a sequential fashion recommendation framework that leverages a pre-trained large language model (LLM) enhanced with recommendation-specific prompts. Our framework employs parameter-efficient fine-tuning with extensive fashion data and introduces a novel mix-up-based retrieval technique for translating text into relevant product suggestions. Extensive experiments show our proposed framework significantly enhances fashion recommendation performance.

[AI-24] Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Link: https://arxiv.org/abs/2410.11325
Authors: Wenda Xu,Rujun Han,Zifeng Wang,Long T. Le,Dhruv Madeka,Lei Li,William Yang Wang,Rishabh Agarwal,Chen-Yu Lee,Tomas Pfister
Keywords (EN): Recent advances, enabled smaller student, enabled smaller, performance of larger, Speculative Knowledge Distillation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the knowledge gaps between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student’s inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
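
The propose-and-replace loop described in the abstract can be sketched on toy distributions. The sketch assumes that a token counts as poorly ranked when it falls outside the teacher's top-k, and it substitutes the teacher's most likely token so the output is reproducible, whereas a real system would more plausibly sample from the teacher. All names are hypothetical.

```python
from typing import Dict, List, Sequence


def skd_step(student_token: str,
             teacher_probs: Dict[str, float],
             top_k: int) -> str:
    """Accept the student's token if the teacher ranks it in its top-k, else replace it."""
    ranked = sorted(teacher_probs, key=teacher_probs.get, reverse=True)
    if student_token in ranked[:top_k]:
        return student_token
    # Deterministic stand-in for drawing a replacement from the teacher.
    return ranked[0]


def interleave(student_tokens: Sequence[str],
               teacher_probs_per_step: Sequence[Dict[str, float]],
               top_k: int = 2) -> List[str]:
    """Apply the accept-or-replace rule at every position of a proposed sequence."""
    return [skd_step(tok, probs, top_k)
            for tok, probs in zip(student_tokens, teacher_probs_per_step)]


teacher = [
    {"the": 0.7, "a": 0.2, "xyz": 0.1},
    {"cat": 0.6, "dog": 0.3, "car": 0.1},
]
print(interleave(["a", "car"], teacher))
```

The accepted and replaced tokens together form the training sequence, which stays close to what the student would generate while removing tokens the teacher considers unlikely.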

[AI-25] Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC Task

Link: https://arxiv.org/abs/2410.11324
Authors: Yunho Kim,Jaehyun Park,Heejun Kim,Sejin Kim,Byung-Jun Lee,Sundong Kim
Keywords (EN): Effective long-term strategies, navigate complex environments, Effective long-term, long-term strategies enable, extended horizons
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint, Under review. Comments welcome

Abstract:Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent’s ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI’s strategic reasoning capabilities.

[AI-26] Herald: A Natural Language Annotated Lean 4 Dataset

Link: https://arxiv.org/abs/2410.10878
Authors: Guoxiong Gao,Yutong Wang,Jiedong Jiang,Qi Gao,Zihan Qin,Tianyi Xu,Bin Dong
Keywords (EN): Verifiable formal languages, impacted mathematical reasoning, Verifiable formal, formal language Lean, automated reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald Translator achieves a 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald Translator, we have successfully translated a template section in the Stack project, marking notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, will be open-sourced to the public soon.

[AI-27] Improving Data Efficiency via Curating LLM-Driven Rating Systems

Link: https://arxiv.org/abs/2410.10877
Authors: Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei
Keywords: adapting large language, Instruction tuning, large language models, challenging traditional data, downstream tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that “more can be less.”

[AI-28] Optimizing Transformer based on high-performance optimizer for predicting employment sentiment in American social media content

Link: https://arxiv.org/abs/2410.10874
Authors: Feiyang Wang, Qiaozhi Bao, Zixuan Wang, Yanlin Chen
Keywords: intelligence optimization algorithm, swarm intelligence optimization, Transformer model based, American social media, content on American
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures

Abstract:This article improves the Transformer model using a swarm intelligence optimization algorithm, aiming to predict the sentiment of employment-related text content on American social media. Through text preprocessing, feature extraction, and vectorization, the text data was converted into numerical data and fed into the model for training. The experimental results show that during training, the accuracy of the model gradually increased from 49.27% to 82.83%, while the loss value decreased from 0.67 to 0.35, indicating a significant improvement in the performance of the model on the training set. According to the confusion matrix analysis, the accuracy on the training set is 86.15%. The confusion matrix of the test set also showed good performance, with an accuracy of 82.91%. The accuracy difference between the training set and the test set is only 3.24 percentage points, indicating that the model generalizes well. In addition, the evaluation of polygon results shows that the model performs well in classification accuracy, sensitivity, specificity, and area under the curve (AUC), with a Kappa coefficient of 0.66 and an F-measure of 0.80, further verifying the effectiveness of the model in social media sentiment analysis. The improved model proposed in this article not only improves the accuracy of sentiment recognition in employment-related texts on social media, but also has important practical significance. This social-media-based data analysis method can not only capture social dynamics in a timely manner, but also prompt decision-makers to pay attention to public concerns and provide data support for improving employment conditions.
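
The metrics named in this abstract (accuracy, sensitivity, specificity, F-measure, Kappa coefficient) can all be derived from a confusion matrix. The following is a minimal Python sketch, assuming a binary classification setup, which the abstract does not state explicitly. The counts are hypothetical and are not the paper's data, and `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from a 2x2 confusion matrix.

    tp/fp/fn/tn are hypothetical counts of true positives, false
    positives, false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical balanced test set of 100 samples.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Kappa is lower than raw accuracy because it subtracts the agreement expected by chance, which is why a model can report accuracy above 0.8 alongside a Kappa near 0.66.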

[AI-29] AuditWen:An Open-Source Large Language Model for Audit

Link: https://arxiv.org/abs/2410.10873
Authors: Jiajia Huang, Haoran Zhu, Chao Xu, Tianming Zhan, Qianqian Xie, Jimin Huang
Keywords: Intelligent auditing represents, modern audit practices, audit, Intelligent auditing, artificial intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages, 1 figure

Abstract:Intelligent auditing represents a crucial advancement in modern audit practices, enhancing both the quality and efficiency of audits within the realm of artificial intelligence. With the rise of large language models (LLMs), there is enormous potential for intelligent models to contribute to the audit domain. However, general LLMs applied in the audit domain face the challenges of lacking specialized knowledge and the presence of data biases. To overcome these challenges, this study introduces AuditWen, an open-source audit LLM built by fine-tuning Qwen on instruction data constructed from the audit domain. We first outline the application scenarios for LLMs in auditing and extract requirements that shape the development of LLMs tailored for audit purposes. We then propose an audit LLM, called AuditWen, by fine-tuning Qwen on a 28k-instruction dataset constructed from 15 audit tasks and 3 layers. In the evaluation stage, we propose a benchmark with 3k instructions that covers a set of critical audit tasks derived from the application scenarios. With the benchmark, we compare AuditWen with other existing LLMs on information extraction, question answering, and document generation. The experimental results demonstrate the superior performance of AuditWen in both question understanding and answer generation, making it an immediately valuable tool for audit.

[AI-30] ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities

Link: https://arxiv.org/abs/2410.10872
Authors: Zhenchao Jin, Mengchen Liu, Dongdong Chen, Lingting Zhu, Yunsheng Li, Lequan Yu
Keywords: elementary conversational agents, large language models, external tools, large language, significantly expand
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: technical report

Abstract:Through the integration of external tools, large language models (LLMs) such as GPT-4o and Llama 3.1 significantly expand their functional capabilities, evolving from elementary conversational agents to general-purpose assistants. We argue that the primary drivers of these advancements are the quality and diversity of the training data. However, the existing LLMs with external tool integration provide only limited transparency regarding their datasets and data collection methods, which has led to the initiation of this research. Specifically, in this paper, our objective is to elucidate the detailed process involved in constructing datasets that empower LLMs to effectively learn how to utilize external tools and make this information available to the public through the introduction of ToolBridge. ToolBridge proposes to employ a collection of general open-access datasets as its raw dataset pool and applies a series of strategies to identify appropriate data entries from the pool for external tool API insertions. By supervised fine-tuning on these curated data entries, LLMs can invoke external tools in appropriate contexts to boost their predictive accuracy, particularly for basic functions including data processing, numerical computation, and factual retrieval. Our experiments rigorously isolate model architectures and training configurations, focusing exclusively on the role of data. The experimental results indicate that LLMs trained on ToolBridge demonstrate consistent performance improvements on both standard benchmarks and custom evaluation datasets. All the associated code and data will be open-sourced at this https URL, promoting transparency and helping the broader community explore approaches for equipping LLMs with external tool capabilities.

[AI-31] Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Link: https://arxiv.org/abs/2410.10871
Authors: Simon Lermen, Mateusz Dziemian, Govind Pimpale
Keywords: requiring short-term planning, tasks requiring short-term, requiring short-term, short-term planning, planning and tool
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the same time, these models will refuse to give advice on how to perform the same tasks when asked for a chat completion. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.

[AI-32] PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches

Link: https://arxiv.org/abs/2410.10870
Authors: Rana Muhammad Shahroz Khan, Pingzhi Li, Sukwon Yun, Zhenyu Wang, Shahriar Nirjon, Chau-Wai Wong, Tianlong Chen
Keywords: large language models, achieving optimal performance, increasingly shape, large language, pre-LLM era
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As large language models (LLMs) increasingly shape the AI landscape, fine-tuning pretrained models has become more popular than in the pre-LLM era for achieving optimal performance in domain-specific tasks. However, pretrained LLMs such as ChatGPT are periodically evolved (i.e., model parameters are frequently updated), making it challenging for downstream users with limited resources to keep up with fine-tuning the newest LLMs for their domain application. Even though fine-tuning costs have nowadays been reduced thanks to the innovations of parameter-efficient fine-tuning such as LoRA, not all downstream users have adequate computing for frequent personalization. Moreover, access to fine-tuning datasets, particularly in sensitive domains such as healthcare, could be time-restrictive, making it crucial to retain the knowledge encoded in earlier fine-tuned rounds for future adaptation. In this paper, we present PortLLM, a training-free framework that (i) creates an initial lightweight model update patch to capture domain-specific knowledge, and (ii) allows a subsequent seamless plugging for the continual personalization of evolved LLM at minimal cost. Our extensive experiments cover seven representative datasets, from easier question-answering tasks BoolQ, SST2 to harder reasoning tasks WinoGrande, GSM8K, and models including Mistral-7B, Llama2, Llama3.1, and Gemma2, validating the portability of our designed model patches and showcasing the effectiveness of our proposed framework. For instance, PortLLM achieves comparable performance to LoRA fine-tuning with reductions of up to 12.2x in GPU memory usage. Finally, we provide theoretical justifications to understand the portability of our model update patches, which offers new insights into the theoretical dimension of LLMs’ personalization.

[AI-33] Application of NotebookLM a Large Language Model with Retrieval-Augmented Generation for Lung Cancer Staging

Link: https://arxiv.org/abs/2410.10869
Authors: Ryota Tozuka, Hisashi Johno, Akitomo Amakawa, Junichi Sato, Mizuki Muto, Shoichiro Seki, Atsushi Komaba, Hiroshi Onishi
Keywords: large language models, recently gained attention, lung cancer, lung cancer staging, including ChatGPT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 1 table, 3 ancillary files

Abstract:Purpose: In radiology, large language models (LLMs), including ChatGPT, have recently gained attention, and their utility is being rapidly evaluated. However, concerns have emerged regarding their reliability in clinical applications due to limitations such as hallucinations and insufficient referencing. To address these issues, we focus on the latest technology, retrieval-augmented generation (RAG), which enables LLMs to reference reliable external knowledge (REK). Specifically, this study examines the utility and reliability of a recently released RAG-equipped LLM (RAG-LLM), NotebookLM, for staging lung cancer. Materials and methods: We summarized the current lung cancer staging guideline in Japan and provided this as REK to NotebookLM. We then tasked NotebookLM with staging 100 fictional lung cancer cases based on CT findings and evaluated its accuracy. For comparison, we performed the same task using a gold-standard LLM, GPT-4 Omni (GPT-4o), both with and without the REK. Results: NotebookLM achieved 86% diagnostic accuracy in the lung cancer staging experiment, outperforming GPT-4o, which recorded 39% accuracy with the REK and 25% without it. Moreover, NotebookLM demonstrated 95% accuracy in searching reference locations within the REK. Conclusion: NotebookLM successfully performed lung cancer staging by utilizing the REK, demonstrating superior performance compared to GPT-4o. Additionally, it provided highly accurate reference locations within the REK, allowing radiologists to efficiently evaluate the reliability of NotebookLM’s responses and detect possible hallucinations. Overall, this study highlights the potential of NotebookLM, a RAG-LLM, in image diagnosis. 
Cite as: arXiv:2410.10869 [cs.CL] (or arXiv:2410.10869v1 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2410.10869. Submission history: [v1] Tue, 8 Oct 2024 12:42:42 UTC (84 KB), from Hisashi Johno.

[AI-34] LLaCA: Multimodal Large Language Continual Assistant

Link: https://arxiv.org/abs/2410.10868
Authors: Jingyang Qiao, Zhizhong Zhang, Xin Tan, Yanyun Qu, Shouhong Ding, Yuan Xie
Keywords: Large Language Models, Multimodal Large Language, Continual Instruction Tuning, designing text instructions, Instruction tuning guides
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Instruction tuning guides Multimodal Large Language Models (MLLMs) in aligning different modalities by designing text instructions, and appears to be an essential technique for enhancing the capabilities and controllability of foundation models. In this framework, Multimodal Continual Instruction Tuning (MCIT) is adopted to continually instruct MLLMs to follow human intent across sequential datasets. We observe that existing gradient updates heavily degrade both the tuning performance on previous datasets and the zero-shot ability during continual instruction tuning. An Exponential Moving Average (EMA) update policy can trace previous parameters, which helps reduce forgetting. However, its fixed balance weight cannot cope with ever-changing datasets, leading to an imbalance between the plasticity and stability of MLLMs. In this paper, we propose a method called Multimodal Large Language Continual Assistant (LLaCA) to address the challenge. Starting from the trade-off prerequisite and the EMA update, we propose an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight is essentially determined by the gradient information and previous parameters. We automatically determine the balance weight and significantly improve the performance. Through comprehensive experiments on LLaVA-1.5 in a continual visual-question-answering benchmark, our approach, compared with the baseline, not only greatly improves anti-forgetting ability (reducing forgetting from 22.67 to 2.68) but also significantly improves continual tuning performance (increasing average accuracy from 41.31 to 61.89). Our code will be published soon.
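
For readers unfamiliar with the EMA update mentioned in this abstract, the following is a minimal Python sketch of a plain EMA parameter update with a constant balance weight. It is only the baseline idea, not LLaCA: the adaptive weight the paper derives from a Taylor expansion is not reproduced here. The name `ema_update`, the list-of-floats parameter representation, and the values of `beta` are assumptions for illustration.

```python
def ema_update(ema_params, new_params, beta):
    """Blend previous (EMA) parameters with freshly trained ones.

    beta is the balance weight on the previous parameters:
    beta close to 1 favors stability (less forgetting),
    beta close to 0 favors plasticity (faster adaptation).
    """
    return [beta * old + (1.0 - beta) * new
            for old, new in zip(ema_params, new_params)]


# Hypothetical two-parameter model.
previous = [1.0, 2.0]
after_gradient_step = [3.0, 4.0]

print(ema_update(previous, after_gradient_step, beta=0.9))  # stays near previous
print(ema_update(previous, after_gradient_step, beta=0.1))  # moves near new values
```

With a constant `beta`, the same stability and plasticity trade-off is applied to every dataset in the sequence, which is the limitation the abstract attributes to standard EMA.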

[AI-35] Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics

Link: https://arxiv.org/abs/2410.10867
Authors: Théo Gigant (L2S), Camille Guinaudeau (STL, LISN), Marc Decombas, Frédéric Dufaux (L2S)
Keywords: evaluate abstractive summarization, abstractive summarization systems, Automatic metrics, proxies to evaluate, evaluate abstractive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:

Abstract:Automatic metrics are used as proxies to evaluate abstractive summarization systems when human annotations are too expensive. To be useful, these metrics should be fine-grained, show a high correlation with human annotations, and ideally be independent of reference quality; however, most standard evaluation metrics for summarization are reference-based, and existing reference-free metrics correlate poorly with relevance, especially on summaries of longer documents. In this paper, we introduce a reference-free metric that correlates well with human evaluated relevance, while being very cheap to compute. We show that this metric can also be used alongside reference-based metrics to improve their robustness in low quality reference settings.

[AI-36] CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept

Link: https://arxiv.org/abs/2410.10866
Authors: YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu
Keywords: Large Language Models, Large Language, inadvertently memorize sensitive, offer extensive knowledge, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) offer extensive knowledge across various domains, but they may inadvertently memorize sensitive, unauthorized, or malicious data, such as personal information in the medical and financial sectors. Machine unlearning methods aim to remove specific information from models after training to address this. However, current approaches require additional model training or struggle to effectively erase particular data points and their associated context due to LLMs’ complex, dense, and continuous nature. In this study, we propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs). By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model’s performance on unrelated data. To the best of our knowledge, this is the first work that successfully enables unlearning specific topics with contextual relevance in an LLM, marking a significant step towards real-world applications of machine unlearning.

[AI-37] Generating Synthetic Datasets for Few-shot Prompt Tuning

Link: https://arxiv.org/abs/2410.10865
Authors: Xu Guo, Zilin Du, Boyang Li, Chunyan Miao
Keywords: major limitation, prompt tuning, tuning, prompt, few-shot learning settings
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:A major limitation of prompt tuning is its dependence on large labeled training datasets. Under few-shot learning settings, prompt tuning lags far behind full-model fine-tuning, limiting its scope of application. In this paper, we leverage the powerful LLMs to synthesize task-specific labeled data for training the soft prompts. We first introduce a distribution-aligned weighted generator tuning (DawGen) method to encourage generating in-distribution data that aligns with the few-shot real data. Then, we train soft prompts on both synthetic and real datasets using a gradient surgery approach, which eliminates the conflicting gradients from different data sources. Experiments on seven sentence-pair classification datasets demonstrate the effectiveness of our proposed method for boosting prompt tuning in few-shot learning settings. Results on QQP, MRPC, and SICK datasets are even comparable to the performance of transfer learning from large real-world datasets, showing the promise of synthetic data as an alternative for enhancing soft prompt tuning.
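
The abstract mentions a gradient surgery approach that eliminates conflicting gradients from the synthetic and real data sources. One common form of gradient surgery is a PCGrad-style projection, sketched below in Python. This is an assumption about the general technique, not a reproduction of the paper's exact procedure, and the function names and example vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g, g_other):
    """Remove from g its component that opposes g_other.

    If the two gradients do not conflict (non-negative inner
    product), g is returned unchanged.
    """
    d = dot(g, g_other)
    if d >= 0:
        return list(g)
    scale = d / dot(g_other, g_other)
    return [x - scale * y for x, y in zip(g, g_other)]


def combine_gradients(g_real, g_synthetic):
    """Project each gradient against the other, then sum them."""
    p_real = project_conflict(g_real, g_synthetic)
    p_syn = project_conflict(g_synthetic, g_real)
    return [a + b for a, b in zip(p_real, p_syn)]


# Hypothetical conflicting gradients (negative inner product).
g_real = [1.0, 0.0]
g_synthetic = [-1.0, 1.0]
print(project_conflict(g_real, g_synthetic))
print(combine_gradients(g_real, g_synthetic))
```

After projection, each gradient is orthogonal to the one it conflicted with, so a step along the combined direction no longer undoes progress on either data source to first order.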

[AI-38] Fill In The Gaps: Model Calibration and Generalization with Synthetic Data EMNLP2024

Link: https://arxiv.org/abs/2410.10864
Authors: Yang Ba, Michelle V. Mancenido, Rong Pan
Keywords: major concern prior, swiftly advance, calibrating their performance, widespread implementation, continue to swiftly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference (Long paper)

Abstract:As machine learning models continue to swiftly advance, calibrating their performance has become a major concern prior to practical and widespread implementation. Most existing calibration methods often negatively impact model accuracy due to the lack of diversity of validation data, resulting in reduced generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, are utilized as a synthetic data generation strategy to lower the ECE bound and improve model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on four different natural language processing tasks, we observed an average accuracy increase of up to 34% and a 33% decrease in ECE.
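
Expected calibration error (ECE), the quantity this abstract bounds and reports, is commonly estimated by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The following is a minimal Python sketch of that standard binned estimator. The bin count, the half-open binning convention, and the sample data are assumptions for illustration and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE estimate.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: booleans, True where the prediction was right.
    Bins are equal-width and half-open [lo, hi), with 1.0 placed
    in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: 90% confidence, 75% accuracy.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False]
print(expected_calibration_error(confs, hits))
```

A perfectly calibrated model has an ECE of zero, so a decrease in ECE means predicted confidences track actual accuracy more closely.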

[AI-39] What makes your model a low-empathy or warmth person: Exploring the Origins of Personality in LLMs

Link: https://arxiv.org/abs/2410.10863
Authors: Shu Yang, Shenzhe Zhu, Ruoxuan Bao, Liang Liu, Yu Cheng, Lijie Hu, Mengdi Li, Di Wang
Keywords: Large language models, demonstrated remarkable capabilities, generating human-like text, Large language, exhibiting personality traits
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under review

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in generating human-like text and exhibiting personality traits similar to those in humans. However, the mechanisms by which LLMs encode and express traits such as agreeableness and impulsiveness remain poorly understood. Drawing on the theory of social determinism, we investigate how long-term background factors, such as family environment and cultural norms, interact with short-term pressures like external instructions, shaping and influencing LLMs’ personality traits. By steering the output of LLMs through the utilization of interpretable features within the model, we explore how these background and pressure factors lead to changes in the model’s traits without the need for further fine-tuning. Additionally, we suggest the potential impact of these factors on model safety from the perspective of personality.

[AI-40] Superficial Safety Alignment Hypothesis

Link: https://arxiv.org/abs/2410.10862
Authors: Jianwei Li, Jung-Eun Kim
Keywords: large language models, safety alignment, safety, ensuring they generate, alignment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following but has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction - interpreted as a specialized binary classification task - and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and successfully identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units (20%) in the pre-trained model as an “alignment budget” can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.

[AI-41] A Recipe For Building a Compliant Real Estate Chatbot

Link: https://arxiv.org/abs/2410.10860
Authors: Navid Madani, Anusha Bagalkotkar, Supriya Anand, Gabriel Arnson, Rohini Srihari, Kenneth Joseph
Keywords: align large language, large language models, recent years, human preferences, significant effort
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, there has been significant effort to align large language models with human preferences. This work focuses on developing a chatbot specialized in the real estate domain, with an emphasis on incorporating compliant behavior to ensure it can be used without perpetuating discriminatory practices like steering and redlining, which have historically plagued the real estate industry in the United States. Building on prior work, we present a method for generating a synthetic general instruction-following dataset, along with safety data. Through extensive evaluations and benchmarks, we fine-tuned a llama-3-8B-instruct model and demonstrated that we can enhance its performance significantly to match huge closed-source models like GPT-4o while making it safer and more compliant. We open-source the model, data and code to support further development and research in the community.

[AI-42] FAME: Towards Factual Multi-Task Model Editing

Link: https://arxiv.org/abs/2410.10859
Authors: Li Zeng, Yingyu Shan, Zeming Liu, Jiashu Yao, Yuhang Guo
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures

[AI-43] Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths EMNLP2024

Link: https://arxiv.org/abs/2410.10858
Authors: Yew Ken Chia, Guizhen Chen, Weiwen Xu, Luu Anh Tuan, Soujanya Poria, Lidong Bing
Keywords: exhibit impressive problem-solving, impressive problem-solving capabilities, Reasoning Paths Optimization, Advanced models, exhibit impressive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024 camera ready version

Abstract:Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model’s overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at this https URL.

[AI-44] Mirror-Consistency: Harnessing Inconsistency in Majority Voting EMNLP2024

Link: https://arxiv.org/abs/2410.10857
Authors: Siyuan Huang, Zhiyuan Ma, Jintao Du, Changhua Meng, Weiqiang Wang, Zhouhan Lin
Keywords: Large Language Models, Large Language, widely-used decoding strategy, capabilities of Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Short Findings

Abstract:Self-Consistency, a widely-used decoding strategy, significantly boosts the reasoning capabilities of Large Language Models (LLMs). However, it depends on the plurality voting rule, which focuses on the most frequent answer while overlooking all other minority responses. These inconsistent minority views often illuminate areas of uncertainty within the model’s generation process. To address this limitation, we present Mirror-Consistency, an enhancement of the standard Self-Consistency approach. Our method incorporates a ‘reflective mirror’ into the self-ensemble decoding process and enables LLMs to critically examine inconsistencies among multiple generations. Additionally, just as humans use the mirror to better understand themselves, we propose using Mirror-Consistency to enhance the sample-based confidence calibration methods, which helps to mitigate issues of overconfidence. Our experimental results demonstrate that Mirror-Consistency yields superior performance in both reasoning accuracy and confidence calibration compared to Self-Consistency.
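
The plurality voting rule behind standard Self-Consistency, which this abstract takes as its starting point, can be sketched in a few lines of Python. This sketch covers only the baseline vote and a simple sample-based confidence (the share of samples agreeing with the winner). It does not implement Mirror-Consistency itself, and the sampled answers are hypothetical.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most frequent answer and the fraction of samples
    that agree with it (a simple sample-based confidence).

    Ties are broken in favor of the answer seen first.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def minority_answers(answers):
    """Answers that lost the vote, which plain plurality voting discards."""
    winner, _ = plurality_vote(answers)
    return [a for a in answers if a != winner]


# Hypothetical final answers extracted from five sampled reasoning paths.
samples = ["42", "42", "41", "42", "40"]
print(plurality_vote(samples))
print(minority_answers(samples))
```

The discarded minority answers are the inconsistencies that the abstract proposes to examine rather than ignore.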

[AI-45] CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models

Link: https://arxiv.org/abs/2410.10855
Authors: Yijiang Li, Qingying Gao, Haoran Sun, Haiyun Lyu, Dezhi Luo, Hokin Deng
Keywords: Large Language Models, Multi-modal Large Language, Language Models, Multi-modal Large, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Are Multi-modal Large Language Models (MLLMs) stochastic parrots? Do they genuinely understand, and are they genuinely capable of performing, the tasks they excel at? This paper aims to explore the fundamental basis of MLLMs, i.e., the core cognitive abilities that human intelligence builds upon to perceive, comprehend, and reason. To this end, we propose CogDevelop2K, a comprehensive benchmark that spans 12 sub-concepts from fundamental knowledge like object permanence and boundary to advanced reasoning like intentionality understanding, structured via the developmental trajectory of a human mind. We evaluate 46 MLLMs on our benchmarks. For comprehensiveness, we further evaluate the influence of evaluation strategies and prompting techniques. Surprisingly, we observe a reversed cognitive developmental trajectory compared to humans.

[AI-46] Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning EMNLP2024

Link: https://arxiv.org/abs/2410.10854
Authors: Shramay Palta, Nishant Balepur, Peter Rankel, Sarah Wiegreffe, Marine Carpuat, Rachel Rudinger
Keywords: Questions involving commonsense, involving commonsense reasoning, commonsense reasoning, everyday situations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Camera Ready

Abstract:Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.

[AI-47] Mitigating Hallucinations Using Ensemble of Knowledge Graph and Vector Store in Large Language Models to Enhance Mental Health Support

Link: https://arxiv.org/abs/2410.10853
Authors: Abdul Muqtadir, Hafiz Syed Muhammad Bilal, Ayesha Yousaf, Hafiz Farooq Ahmed, Jamil Hussain
Keywords: Large Language Models, Language Models, Large Language, research work delves, mental health
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This research work delves into the manifestation of hallucination within Large Language Models (LLMs) and its consequential impacts on applications within the domain of mental health. The primary objective is to discern effective strategies for curtailing hallucinatory occurrences, thereby bolstering the dependability and security of LLMs in facilitating mental health interventions such as therapy, counseling, and the dissemination of pertinent information. Through rigorous investigation and analysis, this study seeks to elucidate the underlying mechanisms precipitating hallucinations in LLMs and subsequently propose targeted interventions to alleviate their occurrence. By addressing this critical issue, the research endeavors to foster a more robust framework for the utilization of LLMs within mental health contexts, ensuring their efficacy and reliability in aiding therapeutic processes and delivering accurate information to individuals seeking mental health support.

[AI-48] SafeLLM: Domain-Specific Safety Monitoring for Large Language Models : A Case Study of Offshore Wind Maintenance

Link: https://arxiv.org/abs/2410.10852
Authors: Connor Walker, Callum Rothon, Koorosh Aslansefat, Yiannis Papadopoulos, Nina Dethlefs
Keywords: experiencing significant expansion, Offshore Wind, increased Operations, industry is experiencing, significant expansion
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Offshore Wind (OSW) industry is experiencing significant expansion, resulting in increased Operations & Maintenance (O&M) costs. Intelligent alarm systems offer the prospect of swift detection of component failures and process anomalies, enabling timely and precise interventions that could yield reductions in resource expenditure, as well as scheduled and unscheduled downtime. This paper introduces an innovative approach to tackle this challenge by capitalising on Large Language Models (LLMs). We present a specialised conversational agent that incorporates statistical techniques to calculate distances between sentences for the detection and filtering of hallucinations and unsafe output. This potentially enables improved interpretation of alarm sequences and the generation of safer repair action recommendations by the agent. Preliminary findings are presented with the approach applied to ChatGPT-4 generated test sentences. The limitation of using ChatGPT-4 and the potential for enhancement of this agent through re-training with specialised OSW datasets are discussed.

[AI-49] LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis

Link: https://arxiv.org/abs/2410.10851
Authors: Haozhou Pang, Tianwei Ding, Lanshan He, Qi Gan
Keywords: synthesizes full-body animations, exhibiting natural movements, present LLM Gesticulator, LLM-based audio-driven co-speech, movements and editability
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In this work, we present LLM Gesticulator, an LLM-based audio-driven co-speech gesture generation framework that synthesizes full-body animations that are rhythmically aligned with the input audio while exhibiting natural movements and editability. Compared to previous work, our model demonstrates substantial scalability. As the size of the backbone LLM model increases, our framework shows proportional improvements in evaluation metrics (a.k.a. scaling law). Our method also exhibits strong controllability where the content, style of the generated gestures can be controlled by text prompt. To the best of our knowledge, LLM gesticulator is the first work that use LLM on the co-speech generation task. Evaluation with existing objective metrics and user studies indicate that our framework outperforms prior works.

[AI-50] On the Reliability of Large Language Models to Misinformed and Demographically-Informed Prompts

Link: https://arxiv.org/abs/2410.10850
Authors: Toluwani Aremu, Oluwakemi Akinwehinmi, Chukwuemeka Nwagu, Syed Ishtiaque Ahmed, Rita Orji, Pedro Arnau Del Amo, Abdulmotaleb El Saddik
Keywords: Large Language Model, addressing misinformed prompts, Language Model, Mental Health, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Study conducted between August and December 2023. Submitted for archival purposes only

Abstract:We investigate and observe the behaviour and performance of Large Language Model (LLM)-backed chatbots in addressing misinformed prompts and questions with demographic information within the domains of Climate Change and Mental Health. Through a combination of quantitative and qualitative methods, we assess the chatbots’ ability to discern the veracity of statements, their adherence to facts, and the presence of bias or misinformation in their responses. Our quantitative analysis using True/False questions reveals that these chatbots can be relied on to give the right answers to these close-ended questions. However, the qualitative insights, gathered from domain experts, shows that there are still concerns regarding privacy, ethical implications, and the necessity for chatbots to direct users to professional services. We conclude that while these chatbots hold significant promise, their deployment in sensitive areas necessitates careful consideration, ethical oversight, and rigorous refinement to ensure they serve as a beneficial augmentation to human expertise rather than an autonomous solution.

[AI-51] Continuous Approximations for Improving Quantization Aware Training of LLMs

Link: https://arxiv.org/abs/2410.10849
Authors: He Li, Jianhang Hong, Yuanzhuo Wu, Snehal Adbol, Zonglin Li
Keywords: Large Language Models, Large Language, requirements for Large, Language Models, Model compression methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Model compression methods are used to reduce the computation and energy requirements for Large Language Models (LLMs). Quantization Aware Training (QAT), an effective model compression method, is proposed to reduce performance degradation after quantization. To further minimize this degradation, we introduce two continuous approximations to the QAT process on the rounding function, traditionally approximated by the Straight-Through Estimator (STE), and the clamping function. By applying both methods, the perplexity (PPL) on the WikiText-v2 dataset of the quantized model reaches 9.0815, outperforming 9.9621 by the baseline. Also, we achieve a 2.76% improvement on BoolQ, and a 5.47% improvement on MMLU, proving that the step sizes and weights can be learned more accurately with our approach. Our method achieves better performance with the same precision, model size, and training setup, contributing to the development of more energy-efficient LLMs technology that aligns with global sustainability goals.
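For readers unfamiliar with the baseline this abstract refers to, the rounding and clamping steps of fake quantization and the Straight-Through Estimator gradient can be sketched as below. This shows only the conventional baseline under assumed settings (a signed 4-bit range and scalar inputs, both chosen here for illustration); the paper's two continuous approximations are not described in the abstract and are not reproduced.

```python
def fake_quantize(x, step, qmin=-8, qmax=7):
    """Quantize-dequantize a scalar: round to the grid, then clamp.

    qmin/qmax default to a signed 4-bit range (an assumption for this sketch).
    Note that Python's round() rounds exact halves to the nearest even integer.
    """
    q = round(x / step)
    q = max(qmin, min(qmax, q))
    return q * step


def ste_grad(x, step, qmin=-8, qmax=7):
    """Straight-Through Estimator for d(fake_quantize)/dx.

    round() has zero gradient almost everywhere, so STE treats it as the
    identity: the gradient is 1 inside the clamp range and 0 outside it.
    """
    return 1.0 if qmin <= x / step <= qmax else 0.0


print(fake_quantize(0.26, 0.1), ste_grad(0.26, 0.1))  # inside the range
print(fake_quantize(5.0, 0.1), ste_grad(5.0, 0.1))    # clamped at qmax
```

The hard 0/1 gradient at the clamp boundary and the identity surrogate for rounding are the two places where a smoother approximation could be substituted.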

[AI-52] Crafting Narrative Closures: Zero-Shot Learning with SSM Mamba for Short Story Ending Generation

Link: https://arxiv.org/abs/2410.10848
Authors: Divyam Sharma, Divya Santhanam
Keywords: challenging endeavor, engaging yet challenging, Abstract, stories, authors encounter moments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Writing stories is an engaging yet challenging endeavor. Often, authors encounter moments of creative block, where the path forward in their narrative becomes obscured. This paper is designed to address such moments by providing an innovative solution: A tool that completes stories based on given prompts. By inputting a short story prompt, users can receive a conclusion to their story, articulated in one sentence or more, thereby enhancing the storytelling process with AI-driven creativity. This tool aims not only to assist authors in navigating writer’s block but also to offer a fun and interactive way for anyone to expand on story ideas spontaneously. Through this paper, we explore the intersection of artificial intelligence and creative writing, pushing the boundaries of how stories can be crafted and concluded. To create our final text-generation models, we used a pre-trained GPT-3.5 model and a newly created finetuned SSM-Mamba model, both of which perform well on a comprehensive list of metrics including BERT score, METEOR, BLEU, ROUGE, and Perplexity. The SSM model has also been made public for the NLP community on HuggingFace models as an open source contribution, which for the timebeing is a first of its kind state-space model for story-generation task on HuggingFace.

[AI-53] Focus On What Matters: Separated Models For Visual-Based RL Generalization

Link: https://arxiv.org/abs/2410.10834
Authors: Di Zhang, Bowen Lv, Hai Zhang, Feifan Yang, Junqiao Zhao, Hang Yu, Chang Huang, Hongtu Zhou, Chen Ye, Changjun Jiang
Keywords: visual-based Reinforcement Learning, visual-based Reinforcement, Reinforcement Learning, unseen environments, primary challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:A primary challenge for visual-based Reinforcement Learning (RL) is to generalize effectively across unseen environments. Although previous studies have explored different auxiliary tasks to enhance generalization, few adopt image reconstruction due to concerns about exacerbating overfitting to task-irrelevant features during training. Perceiving the pre-eminence of image reconstruction in representation learning, we propose SMG (Separated Models for Generalization), a novel approach that exploits image reconstruction for generalization. SMG introduces two model branches to extract task-relevant and task-irrelevant representations separately from visual observations via cooperatively reconstruction. Built upon this architecture, we further emphasize the importance of task-relevant features for generalization. Specifically, SMG incorporates two additional consistency losses to guide the agent’s focus toward task-relevant areas across different scenarios, thereby achieving free from overfitting. Extensive experiments in DMC demonstrate the SOTA performance of SMG in generalization, particularly excelling in video-background settings. Evaluations on robotic manipulation tasks further confirm the robustness of SMG in real-world applications.

[AI-54] Online Client Scheduling and Resource Allocation for Efficient Federated Edge Learning

Link: https://arxiv.org/abs/2410.10833
Authors: Zhidong Gao, Zhenxiao Zhang, Yu Zhang, Tongnian Wang, Yanmin Gong, Yuanxiong Guo
Keywords: Federated learning, enables edge devices, machine learning model, machine learning, devices to collaboratively
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures

Abstract:Federated learning (FL) enables edge devices to collaboratively train a machine learning model without sharing their raw data. Due to its privacy-protecting benefits, FL has been deployed in many real-world applications. However, deploying FL over mobile edge networks with constrained resources such as power, bandwidth, and computation suffers from high training latency and low model accuracy, particularly under data and system heterogeneity. In this paper, we investigate the optimal client scheduling and resource allocation for FL over mobile edge networks under resource constraints and uncertainty to minimize the training latency while maintaining the model accuracy. Specifically, we first analyze the impact of client sampling on model convergence in FL and formulate a stochastic optimization problem that captures the trade-off between the running time and model performance under heterogeneous and uncertain system resources. To solve the formulated problem, we further develop an online control scheme based on Lyapunov-based optimization for client sampling and resource allocation without requiring the knowledge of future dynamics in the FL system. Extensive experimental results demonstrate that the proposed scheme can improve both the training latency and resource efficiency compared with the existing schemes.

[AI-55] Time-Series Foundation Model for Value-at-Risk

Link: https://arxiv.org/abs/2410.11773
Authors: Anubha Goel, Puneet Pasricha, Juho Kanniainen
Keywords: time-series foundation model, Generalized Autoregressive Score, explore the application, model, time-series foundation
Categories: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study is the first to explore the application of a time-series foundation model for VaR estimation. Foundation models, pre-trained on vast and varied datasets, can be used in a zero-shot setting with relatively minimal data or further improved through finetuning. We compare the performance of Google’s model, called TimesFM, against conventional parametric and non-parametric models, including GARCH, Generalized Autoregressive Score (GAS), and empirical quantile estimates, using daily returns from the S&P 100 index and its constituents over 19 years. Our backtesting results indicate that, in terms of the actual-over-expected ratio, the fine-tuned TimesFM model consistently outperforms traditional methods. Regarding the quantile score loss function, it achieves performance comparable to the best econometric approach, the GAS model. Overall, the foundation model is either the best or among the top performers in forecasting VaR across the 0.01, 0.025, 0.05, and 0.1 VaR levels. We also found that fine-tuning significantly improves the results, and the model should not be used in zero-shot settings. Overall, foundation models can provide completely alternative approaches to traditional econometric methods, yet there are challenges to be tackled.
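The three backtesting quantities named in this abstract (empirical quantile VaR, the actual-over-expected ratio, and the quantile score) have standard textbook definitions, sketched below. The function names and the toy return series are hypothetical, and the paper's exact estimation windows and conventions are not given in the abstract, so treat this as an illustration of the definitions rather than the authors' procedure.

```python
import math


def empirical_var(returns, alpha):
    """Empirical alpha-quantile of historical returns (negative for losses)."""
    ordered = sorted(returns)
    k = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[k]


def actual_over_expected(returns, var_forecasts, alpha):
    """Observed VaR violations divided by the expected count alpha * n.

    A value near 1 means the forecasts are well calibrated at level alpha.
    """
    violations = sum(r < v for r, v in zip(returns, var_forecasts))
    return violations / (alpha * len(returns))


def quantile_score(r, v, alpha):
    """Pinball (quantile) loss of forecast v for realized return r."""
    indicator = 1.0 if r < v else 0.0
    return (alpha - indicator) * (r - v)


# Toy data, invented for illustration only.
history = [-0.05, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.01, -0.03, 0.02]
var_10 = empirical_var(history, 0.1)
print(var_10, actual_over_expected(history, [-0.03] * len(history), 0.1))
```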

[AI-56] A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets NEURIPS2024

Link: https://arxiv.org/abs/2410.10924
Authors: Kyungeun Lee, Wonjong Rhee
Keywords: Mutual Information, random variables, fundamental metric, metric for quantifying, quantifying dependency
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024

Abstract:Mutual Information (MI) is a fundamental metric for quantifying dependency between two random variables. When we can access only the samples, but not the underlying distribution functions, we can evaluate MI using sample-based estimators. Assessment of such MI estimators, however, has almost always relied on analytical datasets including Gaussian multivariates. Such datasets allow analytical calculations of the true MI values, but they are limited in that they do not reflect the complexities of real-world datasets. This study introduces a comprehensive benchmark suite for evaluating neural MI estimators on unstructured datasets, specifically focusing on images and texts. By leveraging same-class sampling for positive pairing and introducing a binary symmetric channel trick, we show that we can accurately manipulate true MI values of real-world datasets. Using the benchmark suite, we investigate seven challenging scenarios, shedding light on the reliability of neural MI estimators for unstructured datasets.
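As background for the binary symmetric channel mentioned here: for a uniformly distributed bit passed through a channel that flips it with probability p, the mutual information has the closed form 1 - H(p) bits, where H is the binary entropy. The sketch below computes only that textbook quantity; how the benchmark applies the channel to image and text pairs is not stated in the abstract, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(p):
    """MI in bits between a uniform bit X and Y = X flipped with probability p."""
    return 1.0 - binary_entropy(p)


for p in (0.0, 0.1, 0.25, 0.5):
    print(p, bsc_mutual_information(p))
```

Sweeping p from 0 to 0.5 moves the true MI from 1 bit down to 0, which is the kind of known ground truth an estimator benchmark needs.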

[AI-57] GPTON: Generative Pre-trained Transformers enhanced with Ontology Narration for accurate annotation of biological data

Link: https://arxiv.org/abs/2410.10899
Authors: Rongbin Li, Wenbo Chen, Jinbo Li, Hanwen Xing, Zhao Li, W. Jim Zheng
Keywords: achieving accurate text, verbalized ontology terms, infuse structured knowledge, achieving accurate, top five predictions
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 25 pages, 6 figures

Abstract:By leveraging GPT-4 for ontology narration, we developed GPTON to infuse structured knowledge into LLMs through verbalized ontology terms, achieving accurate text and ontology annotations for over 68% of gene sets in the top five predictions. Manual evaluations confirm GPTON’s robustness, highlighting its potential to harness LLMs and structured knowledge to significantly advance biomedical research beyond gene set annotation.

Computer Vision

[CV-0] MoH: Multi-Head Attention as Mixture-of-Head Attention

Link: https://arxiv.org/abs/2410.11842
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Keywords: multi-head attention, attention, previous accuracy level, attention heads, Transformer model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, code: this https URL

Abstract:In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
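The two ideas in this abstract (each token selects a subset of heads, and the plain sum over heads becomes a weighted sum) can be sketched for a single token as follows. The top-k selection rule, the softmax over the selected heads, and all names are assumptions made for illustration; the abstract does not specify the routing details, and the actual MoH design may differ.

```python
import math


def mixture_of_heads(head_outputs, router_logits, k):
    """Weighted sum over the k heads with the highest router logits.

    head_outputs: one output vector per attention head for a single token,
                  already projected to the model dimension (hypothetical).
    router_logits: one routing score per head for that token (hypothetical).
    """
    order = sorted(range(len(head_outputs)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax restricted to the selected heads (assumed weighting scheme).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    out = [0.0] * len(head_outputs[0])
    for i in top:
        weight = exps[i] / total
        for d, value in enumerate(head_outputs[i]):
            out[d] += weight * value
    return out, top


heads = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
logits = [0.0, 0.0, -10.0, -10.0]
print(mixture_of_heads(heads, logits, k=2))
```

With `k` equal to the number of heads and equal weights, this reduces to an ordinary (scaled) sum over heads, which is the summation form of multi-head attention the abstract mentions.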

[CV-1] High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

Link: https://arxiv.org/abs/2410.11838
Authors: Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubinstein, David J. Fleet, Deqing Sun
Keywords: existing frame interpolation, frame interpolation methods, thin objects, frame interpolation, recent progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Despite the recent progress, existing frame interpolation methods still struggle with processing extremely high resolution input and handling challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce a patch-based cascaded pixel diffusion model for frame interpolation, HiFI, that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low- to high-resolution, can help significantly with large or complex motion that require both global context for a coarse solution and detailed context for high resolution output. However, contrary to prior work on cascaded diffusion models which perform diffusion on increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. We show that this technique drastically reduces memory usage at inference time and also allows us to use a single model at test time, solving both frame interpolation and spatial up-sampling, saving training cost. We show that HiFI helps significantly with high resolution and complex repeated textures that require global context. HiFI demonstrates comparable or beyond state-of-the-art performance on multiple benchmarks (Vimeo, Xiph, X-Test, SEPE-8K). On our newly introduced dataset that focuses on particularly challenging cases, HiFI also significantly outperforms other baselines on these cases. Please visit our project page for video results: this https URL

[CV-2] On the Effectiveness of Dataset Alignment for Fake Image Detection

Link: https://arxiv.org/abs/2410.11835
Authors: Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, Yong Jae Lee
Keywords: latent diffusion models, democratize image generation, image generation capabilities, generation capabilities, latent diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As latent diffusion models (LDMs) democratize image generation capabilities, there is a growing need to detect fake images. A good detector should focus on the generative model's fingerprints while ignoring image properties such as semantic content, resolution, file format, etc. Fake image detectors are usually built in a data-driven way, where a model is trained to separate real from fake images. Existing works primarily investigate network architecture choices and training recipes. In this work, we argue that in addition to these algorithmic choices, we also require a well-aligned dataset of real/fake images to train a robust detector. For the family of LDMs, we propose a very simple way to achieve this: we reconstruct all the real images using the LDM's autoencoder, without any denoising operation. We then train a model to separate these real images from their reconstructions. The fakes created this way are extremely similar to the real ones in almost every aspect (e.g., size, aspect ratio, semantic content), which forces the model to look for the LDM decoder's artifacts. We empirically show that this way of creating aligned real/fake datasets, which also sidesteps the computationally expensive denoising process, helps in building a detector that focuses less on spurious correlations, something that a very popular existing method is susceptible to. Finally, to demonstrate just how effective the alignment in a dataset can be, we build a detector using images that are not natural objects, and present promising results. Overall, our work identifies the subtle but significant issues that arise when training a fake image detector and proposes a simple and inexpensive solution to address these problems.

[CV-3] CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Link: https://arxiv.org/abs/2410.11831
Authors: Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
Keywords: annotating real videos, difficulty of annotating, real videos, synthetic data due, annotating real
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Most state-of-the-art point trackers are trained on synthetic data due to the difficulty of annotating real videos for this task. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. In order to understand these issues better, we introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe. This allows real videos without annotations to be used during training by generating pseudo-labels using off-the-shelf teachers. The new model eliminates or simplifies components from previous trackers, resulting in a simpler and often smaller architecture. This training scheme is much simpler than prior work and achieves better results using 1,000 times less data. We further study the scaling behaviour to understand the impact of using more real unsupervised data in point tracking. The model is available in online and offline variants and reliably tracks visible and occluded points.

[CV-4] MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Link: https://arxiv.org/abs/2410.11829
Authors: Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, capturing intricate image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, technical report

Abstract:Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at this https URL.

[CV-5] Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Link: https://arxiv.org/abs/2410.11828
Authors: Zhouxia Wang, Jiawei Zhang, Xintao Wang, Tianshui Chen, Ying Shan, Wenping Wang, Ping Luo
Keywords: Recent progress, producing high-quality restored, high-quality restored results, face image restoration, resulted in producing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TIP’2024; Project page: this https URL

Abstract:Recent progress in blind face restoration has resulted in producing high-quality restored results for static images. However, efforts to extend these advancements to video scenarios have been minimal, partly because of the absence of benchmarks that allow for a comprehensive and fair comparison. In this work, we first present a fair evaluation benchmark, in which we first introduce a Real-world Low-Quality Face Video benchmark (RFV-LQ), evaluate several leading image-based face restoration algorithms, and conduct a thorough systematical analysis of the benefits and challenges associated with extending blind face image restoration algorithms to degraded face videos. Our analysis identifies several key issues, primarily categorized into two aspects: significant jitters in facial components and noise-shape flickering between frames. To address these issues, we propose a Temporal Consistency Network (TCN) cooperated with alignment smoothing to reduce jitters and flickers in restored videos. TCN is a flexible component that can be seamlessly plugged into the most advanced face image restoration algorithms, ensuring the quality of image-based restoration is maintained as closely as possible. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of our proposed TCN and alignment smoothing operation. Project page: this https URL.

[CV-6] KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

Link: https://arxiv.org/abs/2410.11824
Authors: Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C.K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang
Keywords: Recent advancements, significantly enhanced, enhanced the quality, quality of synthesized, Recent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities - a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating Knowledge-InTensive image generaTion on real-world ENtities (i.e., KITTEN). Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully-designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance the fidelity of entity by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.

[CV-7] Improving Long-Text Alignment for Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.11817
Authors: Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu
Keywords: generate unprecedented results, rapid advancement, generate unprecedented, unprecedented results, long texts
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring mechanisms and find that the preference scores can be decomposed into two components: a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference. Additionally, we find that the text-irrelevant part contributes to a common overfitting problem during fine-tuning. To address this, we propose a reweighting strategy that assigns different weights to these two components, thereby reducing overfitting and enhancing alignment. After fine-tuning 512 × 512 Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-α and Kandinsky v2.2. The code is available at this https URL.

[CV-8] Jigsaw++: Imagining Complete Shape Priors for Object Reassembly

Link: https://arxiv.org/abs/2410.11816
Authors: Jiaxin Lu, Gang Hua, Qixing Huang
Keywords: attracted increasing interest, increasing interest due, automatic assembly problem, attracted increasing, increasing interest
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 10 figures

Abstract:The automatic assembly problem has attracted increasing interest due to its complex challenges that involve 3D representation. This paper introduces Jigsaw++, a novel generative method designed to tackle the multifaceted challenges of reconstruction for the reassembly problem. Existing approaches focus primarily on piecewise information for both part and fracture assembly, often overlooking the integration of complete object priors. Jigsaw++ distinguishes itself by learning a category-agnostic shape prior of complete objects. It employs the proposed “retargeting” strategy that effectively leverages the output of any existing assembly method to generate complete shape reconstructions. This capability allows it to function orthogonally to the current methods. Through extensive evaluations on the Breaking Bad dataset and PartNet, Jigsaw++ has demonstrated its effectiveness, reducing reconstruction errors and enhancing the precision of shape reconstruction, which sets a new direction for future reassembly model developments.

[CV-9] SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing SIGGRAPH

Link: https://arxiv.org/abs/2410.11815
Authors: Zhiyuan Zhang, DongDong Chen, Jing Liao
Keywords: edges symbolizing objects, offer a structured, hierarchical representation, nodes and edges, edges symbolizing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM Transactions on Graphics and SIGGRAPH Asia 2024. Project page: this https URL

Abstract:Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. They can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Utilizing an LLM-driven scene parser, we construct an image’s scene graph, capturing key objects and their interrelationships, as well as parsing fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, representing each object with an optimized token and detailed description prompt. 2) During the image editing phase, an LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, utilizing the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.

[CV-10] Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices

Link: https://arxiv.org/abs/2410.11795
Authors: Zhiyuan Ma, Yuzhu Zhang, Guoli Jia, Liangliang Zhao, Yichao Ma, Mingjie Ma, Gaofeng Liu, Kaiyan Zhang, Jianjun Li, Bowen Zhou
Keywords: steadily shown excellent, shown excellent advantage, sought-after generative models, video generation, dense theoretical principles
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:As one of the most popular and sought-after generative models in recent years, diffusion models have sparked the interest of many researchers and steadily shown excellent advantages in various generative tasks such as image synthesis, video generation, molecule design, 3D scene rendering and multimodal generation, relying on their dense theoretical principles and reliable application practices. The remarkable success of these recent efforts on diffusion models comes largely from progressive design principles and efficient architecture, training, inference, and deployment methodologies. However, there has not been a comprehensive and in-depth review to summarize these principles and practices to help the rapid understanding and application of diffusion models. In this survey, we provide a new efficiency-oriented perspective on these existing efforts, which mainly focuses on the profound principles and efficient practices in architecture designs, model training, fast inference and reliable deployment, to guide further theoretical research, algorithm migration and model application for new scenarios in a reader-friendly way. this https URL

[CV-11] OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

Link: https://arxiv.org/abs/2410.11792
Authors: Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, Yuke Zhu
Keywords: single video demonstrations, robots manipulation skills, teaching humanoid robots, study the problem, problem of teaching
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for oral presentation at 8th Annual Conference on Robot Learning. Project website: this https URL

Abstract:We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalizations across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website this https URL.

[CV-12] Latent BKI: Open-Dictionary Continuous Mapping in Visual-Language Latent Spaces with Quantifiable Uncertainty

Link: https://arxiv.org/abs/2410.11783
Authors: Joey Wilson, Ruihan Xu, Yile Sun, Parker Ewen, Minghan Zhu, Kira Barton, Maani Ghaffari
Keywords: enables open-vocabulary mapping, Latent BKI, Bayesian Kernel Inference, paper introduces, enables open-vocabulary
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:This paper introduces a novel probabilistic mapping algorithm, Latent BKI, which enables open-vocabulary mapping with quantifiable uncertainty. Traditionally, semantic mapping algorithms focus on a fixed set of semantic categories which limits their applicability for complex robotic tasks. Vision-Language (VL) models have recently emerged as a technique to jointly model language and visual features in a latent space, enabling semantic recognition beyond a predefined, fixed set of semantic classes. Latent BKI recurrently incorporates neural embeddings from VL models into a voxel map with quantifiable uncertainty, leveraging the spatial correlations of nearby observations through Bayesian Kernel Inference (BKI). Latent BKI is evaluated against similar explicit semantic mapping and VL mapping frameworks on the popular MatterPort-3D and Semantic KITTI data sets, demonstrating that Latent BKI maintains the probabilistic benefits of continuous mapping with the additional benefit of open-dictionary queries. Real-world experiments demonstrate applicability to challenging indoor environments.

[CV-13] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Link: https://arxiv.org/abs/2410.11779
Authors: Chenxi Wang, Xiang Chen, Ningyu Zhang, Bozhong Tian, Haoming Xu, Shumin Deng, Huajun Chen
Keywords: Multimodal Large Language, remain poorly understood, underlying reasons remain, reasons remain poorly, frequently exhibit hallucination
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Ongoing work

Abstract:Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. We speculate that this may be due to the strong knowledge priors of the language model suppressing the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs (DeCo), which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits. Note that DeCo is model agnostic and can be seamlessly incorporated with various classic decoding strategies and applied to different MLLMs. We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at this https URL.
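
The decoding idea described here, picking a preceding layer and mixing its evidence into the final logits, can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the selection rule or the mixing formula, so choosing the most confident layer and adding its logits scaled by a hypothetical weight `alpha` are both guesses, and the per-layer logits are assumed to have already been projected into vocabulary space.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_layer(layer_logits: Sequence[Sequence[float]]) -> int:
    """Assumed rule: pick the preceding layer whose top probability is highest."""
    return max(range(len(layer_logits)), key=lambda i: max(softmax(layer_logits[i])))


def corrected_logits(
    final_logits: Sequence[float],
    layer_logits: Sequence[Sequence[float]],
    alpha: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> List[float]:
    """Add the selected layer's logits, scaled by alpha, to the final-layer logits."""
    chosen = layer_logits[select_layer(layer_logits)]
    return [f + alpha * c for f, c in zip(final_logits, chosen)]
```

In a toy case where the final layer prefers a hallucinated token but an earlier layer strongly prefers the grounded one, the corrected logits flip the ranking towards the grounded token.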

[CV-14] Fractal Calibration for long-tailed object detection

Link: https://arxiv.org/abs/2410.11774
Authors: Konstantinos Panagiotis Alexandridis, Ismail Elezi, Jiankang Deng, Anh Nguyen, Shan Luo
Keywords: Real-world datasets follow, poses significant challenges, Real-world datasets, rare-category object detection, follow an imbalanced
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Real-world datasets follow an imbalanced distribution, which poses significant challenges in rare-category object detection. Recent studies tackle this problem by developing re-weighting and re-sampling methods that utilise the class frequencies of the dataset. However, these techniques focus solely on the frequency statistics and ignore the distribution of the classes in image space, missing important information. In contrast to them, we propose FRActal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit adjustment method that utilises the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions, achieving balance in two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method that does not require any training, and it can be combined with many off-the-shelf models such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts the rare class performance by up to 8.6% and surpasses all previous methods on the LVIS dataset, while showing good generalisation to other datasets such as COCO, V3Det and OpenImages. The code will be released.
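
To make the notion of fractal dimension in this abstract concrete, here is a minimal box-counting sketch. It assumes the input is a non-empty list of object-centre coordinates for one class, normalised to [0, 1), and it estimates the dimension as the least-squares slope of log(occupied cells) against log(grid resolution). Whether FRACAL uses exactly this estimator, and how the value enters its logit adjustment, is not stated in the abstract.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def box_count(points: Sequence[Point], grid: int) -> int:
    """Count the cells of a grid x grid partition of [0, 1)^2 that contain a point."""
    cells = set()
    for x, y in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        cells.add((col, row))
    return len(cells)


def fractal_dimension(points: Sequence[Point], grids: Sequence[int] = (2, 4, 8, 16)) -> float:
    """Least-squares slope of log(box count) against log(grid resolution)."""
    xs = [math.log(g) for g in grids]
    ys = [math.log(box_count(points, g)) for g in grids]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Points spread evenly over the image give a value near 2, points along a line give a value near 1, and points piled on one location give a value near 0, which is the sense in which the dimension measures how uniformly a class is spaced.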

[CV-15] DPD-NeuralEngine: A 22-nm 6.6-TOPS/W/mm² Recurrent Neural Network Accelerator for Wideband Power Amplifier Digital Pre-Distortion

Link: https://arxiv.org/abs/2410.11766
Authors: Ang Li, Haolin Wu, Yizhuo Wu, Qinyu Chen, Leo C. N. de Vreede, Chang Gao
Keywords: based Digital Pre-distortion, Deep Neural Network, Digital Pre-distortion, modern communication systems, communication systems necessitates
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures

Abstract:The increasing adoption of Deep Neural Network (DNN)-based Digital Pre-distortion (DPD) in modern communication systems necessitates efficient hardware implementations. This paper presents DPD-NeuralEngine, an ultra-fast, tiny-area, and power-efficient DPD accelerator based on a Gated Recurrent Unit (GRU) neural network (NN). Leveraging a co-designed software and hardware approach, our 22 nm CMOS implementation operates at 2 GHz, capable of processing I/Q signals up to 250 MSps. Experimental results demonstrate a throughput of 256.5 GOPS and power efficiency of 1.32 TOPS/W with DPD linearization performance measured in Adjacent Channel Power Ratio (ACPR) of -45.3 dBc and Error Vector Magnitude (EVM) of -39.8 dB. To our knowledge, this work represents the first AI-based DPD application-specific integrated circuit (ASIC) accelerator, achieving a power-area efficiency (PAE) of 6.6 TOPS/W/mm².
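
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that power-area efficiency is simply power efficiency divided by die area, as its unit suggests, and that power is throughput divided by power efficiency. The resulting area and power are values implied by those assumptions, not numbers stated in the abstract.

```python
def implied_area_mm2(tops_per_watt: float, tops_per_watt_per_mm2: float) -> float:
    """Area implied by PAE = (TOPS/W) / area, an assumed definition."""
    return tops_per_watt / tops_per_watt_per_mm2


def implied_power_mw(gops: float, tops_per_watt: float) -> float:
    """Power implied by throughput / efficiency, converted from W to mW."""
    watts = (gops / 1000.0) / tops_per_watt  # GOPS -> TOPS, then TOPS / (TOPS/W)
    return watts * 1000.0


area = implied_area_mm2(1.32, 6.6)     # about 0.2 mm^2
power = implied_power_mw(256.5, 1.32)  # about 194 mW
```

Under these assumptions the accelerator would occupy roughly 0.2 mm² and draw on the order of 194 mW at full throughput.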

[CV-16] SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Link: https://arxiv.org/abs/2410.11761
Authors: Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Bin Zhang, Nana Pei, Rongshan Yu, Yu Qiao, Junjun He
Keywords: large language models, missing essential contextual, essential contextual information, multimodal large language, language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat’s capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA), and 54.15% on SlideBench-VQA (BCNB). We will fully release SlideChat, SlideInstruction and SlideBench as open-source resources to facilitate research and development in computational pathology.

[CV-17] Latent Action Pretraining from Videos

Link: https://arxiv.org/abs/2410.11758
Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
Keywords: action labels, Latent Action Pretraining, robot action labels, introduce Latent Action, ground-truth robot action
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging a VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.
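
The first stage, turning the change between consecutive frames into a discrete latent action, can be sketched as a nearest-codebook lookup. This is a toy illustration only: the frame features and codebook below are hand-written, whereas in a VQ-VAE both the encoder and the codebook are learned, and quantising a plain feature difference is an assumption made here for simplicity rather than the paper's architecture.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry with the smallest squared distance to vector."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )


def latent_actions(frame_features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Quantise the change between each pair of consecutive frames to a code index."""
    actions = []
    for prev, nxt in zip(frame_features, frame_features[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]
        actions.append(nearest_code(delta, codebook))
    return actions
```

The resulting index sequence plays the role of pseudo action labels: a policy can be pretrained to predict these indices from observations, and a small labelled dataset is then enough to map indices to real robot actions.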

[CV-18] VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Link: https://arxiv.org/abs/2410.11417
Authors: Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
Keywords: Video-based multimodal large, possess significant potential, multimodal large language, video understanding tasks, Video-based multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages, 4 figures

Abstract:Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of individual frames, which results in insufficient temporal-spatial interaction that hinders fine-grained comprehension and difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.

[CV-19] MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description

Link: https://arxiv.org/abs/2410.11404
Authors: Jiawei Mo, Yixuan Chen, Rifen Lin, Yongkang Ni, Min Zeng, Xiping Hu, Min Li
Keywords: specific body parts, accurately identify action, identify action timing, body parts, typically supporting
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, typically supporting only single-round interaction. Such limitations in capturing fine-grained motion details reduce their effectiveness in motion understanding tasks. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and understanding multi-turn dialogue context. To achieve these capabilities, we group the spatial information of each skeleton frame based on human anatomical structure and then pass the groups through a Joints-Grouped Skeleton Encoder, whose outputs are combined with LLM embeddings to create spatio-aware and temporal-aware embeddings separately. Additionally, we develop a pipeline for extracting timestamps from skeleton sequences based on textual annotations, and construct multi-turn dialogues for spatial grounding. Finally, various task instructions are generated for joint training. Experimental results demonstrate that MoChat achieves state-of-the-art performance across multiple metrics in motion understanding tasks, making it the first model capable of fine-grained spatio-temporal grounding of human motion.

[CV-20] MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields

Link: https://arxiv.org/abs/2410.11394
Authors: Yuru Xiao, Deming Zhai, Wenbo Zhao, Kui Jiang, Junjun Jiang, Xianming Liu
Keywords: Radiance fields represented, high training efficiency, Radiance fields, offering both high, excel at synthesizing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Radiance fields represented by 3D Gaussians excel at synthesizing novel views, offering both high training efficiency and fast rendering. However, with sparse input views, the lack of multi-view consistency constraints results in poorly initialized point clouds and unreliable heuristics for optimization and densification, leading to suboptimal performance. Existing methods often incorporate depth priors from dense estimation networks but overlook the inherent multi-view consistency in input images. Additionally, they rely on multi-view stereo (MVS)-based initialization, which limits the efficiency of scene representation. To overcome these challenges, we propose a view synthesis framework based on 3D Gaussian Splatting, named MCGS, enabling photorealistic scene reconstruction from sparse input views. The key innovations of MCGS in enhancing multi-view consistency are as follows: i) We introduce an initialization method by leveraging a sparse matcher combined with a random filling strategy, yielding a compact yet sufficient set of initial points. This approach enhances the initial geometry prior, promoting efficient scene representation. ii) We develop a multi-view consistency-guided progressive pruning strategy to refine the Gaussian field by strengthening consistency and eliminating low-contribution Gaussians. These modular, plug-and-play strategies enhance robustness to sparse input views, accelerate rendering, and reduce memory consumption, making MCGS a practical and efficient framework for 3D Gaussian Splatting.

[CV-21] Augmentation-Driven Metric for Balancing Preservation and Modification in Text-Guided Image Editing

Link: https://arxiv.org/abs/2410.11374
Authors: Yoonjeon Kim, Soohyun Ryu, Yeonsung Jung, Hyunkoo Lee, Joowon Kim, June Yong Yang, Jaeryong Hwang, Eunho Yang
Keywords: significantly advanced text-guided, text-guided image editing, advanced text-guided image, source image, text-guided image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks preservation of core elements in the source image while implementing modifications based on the target text. However, in the absence of evaluation metrics specifically tailored for text-guided image editing, existing metrics are limited in balancing the consideration of preservation and modification. In particular, our analysis reveals that CLIPScore, the most commonly used metric, tends to favor modification and ignore core attributes to be preserved, resulting in inaccurate evaluations. To address this problem, we propose AugCLIP, which balances preservation and modification by estimating the representation of an ideal edited image that aligns with the target text with minimum alteration on the source image. We augment detailed textual descriptions on the source image and the target text using a multi-modal large language model, to model a hyperplane that separates CLIP space into source or target. The representation of the ideal edited image is an orthogonal projection of the source image into the hyperplane, which encapsulates the relative importance of each attribute considering the interdependent relationships. Our extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, demonstrate that AugCLIP aligns remarkably well with human evaluation standards compared to existing metrics. The code for evaluation will be open-sourced to contribute to the community.
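
The geometric step named in this abstract, an orthogonal projection of the source embedding onto a separating hyperplane, is standard linear algebra and can be sketched directly. The sketch assumes the hyperplane is already given as a normal vector `w` and offset `b` (so that points on it satisfy w·x + b = 0); how the paper fits that hyperplane from augmented descriptions is not reproduced here, and the 2-d numbers in the usage note are purely illustrative.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_hyperplane(x: Sequence[float], w: Sequence[float], b: float) -> List[float]:
    """Orthogonal projection of x onto the hyperplane {z : w.z + b = 0}."""
    norm_sq = dot(w, w)
    if norm_sq == 0.0:
        raise ValueError("normal vector w must be non-zero")
    scale = (dot(w, x) + b) / norm_sq
    return [xi - scale * wi for xi, wi in zip(x, w)]
```

For example, projecting `[5.0, 3.0]` onto the hyperplane with `w = [1.0, 0.0]` and `b = -2.0` moves the point only along `w`, giving `[2.0, 3.0]`; the component orthogonal to `w` is left untouched, which matches the idea of changing the source as little as possible.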

[CV-22] DRACO: A Denoising-Reconstruction Autoencoder for Cryo-EM

Link: https://arxiv.org/abs/2410.11373
Authors: Yingjun Shen, Haizhao Dai, Qihe Chen, Yan Zeng, Jiakai Zhang, Yuan Pei, Jingyi Yu
Keywords: extracting multi-purpose features, self-supervised pre-training methods, demonstrated exceptional performance, computer vision, vision have demonstrated
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Foundation models in computer vision have demonstrated exceptional performance in zero-shot and few-shot tasks by extracting multi-purpose features from large-scale datasets through self-supervised pre-training methods. However, these models often overlook the severe corruption of cryogenic electron microscopy (cryo-EM) images by high levels of noise. We introduce DRACO, a Denoising-Reconstruction Autoencoder for CryO-EM, inspired by the Noise2Noise (N2N) approach. By processing cryo-EM movies into odd and even images and treating them as independent noisy observations, we apply a denoising-reconstruction hybrid training scheme. We mask both images to create denoising and reconstruction tasks. For DRACO’s pre-training, the quality of the dataset is essential; we hence build a high-quality, diverse dataset from an uncurated public database, including over 270,000 movies or micrographs. After pre-training, DRACO naturally serves as a generalizable cryo-EM image denoiser and a foundation model for various cryo-EM downstream tasks. DRACO demonstrates the best performance in denoising, micrograph curation, and particle picking tasks compared to state-of-the-art baselines. We will release the code, pre-trained models, and the curated dataset to stimulate further research.
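
The odd/even split that supplies the two noisy observations can be sketched in a few lines. The sketch assumes a movie is a list of already aligned frames, each a flat list of pixel values, and that each half is combined by averaging; motion correction, dose weighting and the masking scheme used for training are left out, and the function names are made up for this example.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # a flattened image, one float per pixel


def mean_frame(frames: Sequence[Frame]) -> Frame:
    """Pixel-wise average of a non-empty sequence of equally sized frames."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]


def split_odd_even(movie: Sequence[Frame]) -> Tuple[Frame, Frame]:
    """Average the 1st, 3rd, ... frames and the 2nd, 4th, ... frames separately."""
    if len(movie) < 2:
        raise ValueError("need at least two frames to form both images")
    odd = movie[0::2]   # 1st, 3rd, 5th, ... frames
    even = movie[1::2]  # 2nd, 4th, 6th, ... frames
    return mean_frame(odd), mean_frame(even)
```

Since the two images show the same specimen with separately drawn noise, one can serve as the training target for the other, which is the core of the Noise2Noise setup the abstract refers to.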

[CV-23] Visual-Geometric Collaborative Guidance for Affordance Learning

Link: https://arxiv.org/abs/2410.11363
Authors: Hongchen Luo, Wei Zhai, Jiao Wang, Yang Cao, Zheng-Jun Zha
Keywords: challenging task due, Perceiving potential, affordance learning, learning interactive functionalities, interactive affinity
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Perceiving potential “action possibilities” (i.e., affordance) regions of images and learning interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. To this end, we propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues to excavate interactive affinity from human-object interactions jointly. Besides, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 55,047 images. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. Project: this https URL.

[CV-24] SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection

Link: https://arxiv.org/abs/2410.11358
Authors: Shuhan Dong, Yunsong Li, Weiying Xie, Jiaqing Zhang, Jiayuan Tian, Danian Yang, Jie Lei
Keywords: Multimodal object detection, detection leverages diverse, object detection leverages, leverages diverse modal, object detection
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. By learning long-term dependencies, Transformer can effectively integrate multimodal features in the feature extraction stage, which greatly improves the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer’s guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on shallow features’ detail information compared to deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.

[CV-25] SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments

Link: https://arxiv.org/abs/2410.11331
Authors: Syed Abdul Gaffar Shakhadri, Kruthika KR, Rakshit Aralimatti
Keywords: billion parameter language, including smartphones, model specifically optimized, billion parameter, IoT systems
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Paper in pdf format is 11 pages and contains 4 tables

Abstract:We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.

[CV-26] Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC Task

Link: https://arxiv.org/abs/2410.11324
Authors: Yunho Kim, Jaehyun Park, Heejun Kim, Sejin Kim, Byung-Jun Lee, Sundong Kim
Keywords: Effective long-term strategies, navigate complex environments, Effective long-term, long-term strategies enable, extended horizons
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint, Under review. Comments welcome

Abstract:Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent’s ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI’s strategic reasoning capabilities.

[CV-27] CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models

Link: https://arxiv.org/abs/2410.10855
Authors: Yijiang Li,Qingying Gao,Haoran Sun,Haiyun Lyu,Dezhi Luo,Hokin Deng
Keywords: Large Language Models, Multi-modal Large Language, Language Models, Multi-modal Large, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Are Multi-modal Large Language Models (MLLMs) stochastic parrots? Do they genuinely understand and are capable of performing the tasks they excel at? This paper aims to explore the fundamental basis of MLLMs, i.e. core cognitive abilities that human intelligence builds upon to perceive, comprehend, and reason. To this end, we propose CogDevelop2K, a comprehensive benchmark that spans 12 sub-concepts from fundamental knowledge like object permanence and boundary to advanced reasoning like intentionality understanding, structured via the developmental trajectory of a human mind. We evaluate 46 MLLMs on our benchmarks. Comprehensively, we further evaluate the influence of evaluation strategies and prompting techniques. Surprisingly, we observe a reversed cognitive developmental trajectory compared to humans.

[CV-28] Lotus: learning-based online thermal and latency variation management for two-stage detectors on edge devices

Link: https://arxiv.org/abs/2410.10847
Authors: Yifan Gong,Yushu Wu,Zheng Zhan,Pu Zhao,Liangkai Liu,Chao Wu,Xulong Tang,Yanzhi Wang
Keywords: identifying small objects, exhibit high accuracy, object detectors exhibit, Two-stage object detectors, small objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: DAC’24, code is available at: this https URL

Abstract:Two-stage object detectors exhibit high accuracy and precise localization, especially for identifying small objects that are favorable for various edge applications. However, the high computation costs associated with two-stage detection methods cause more severe thermal issues on edge devices, incurring dynamic runtime frequency change and thus large inference latency variations. Furthermore, the dynamic number of proposals in different frames leads to various computations over time, resulting in further latency variations. The significant latency variations of detectors on edge devices can harm user experience and waste hardware resources. To avoid thermal throttling and provide stable inference speed, we propose Lotus, a novel framework that is tailored for two-stage detectors to dynamically scale CPU and GPU frequencies jointly in an online manner based on deep reinforcement learning (DRL). To demonstrate the effectiveness of Lotus, we implement it on NVIDIA Jetson Orin Nano and Mi 11 Lite mobile platforms. The results indicate that Lotus can consistently and significantly reduce latency variation, achieve faster inference, and maintain lower CPU and GPU temperatures under various settings.

[CV-29] Focus On What Matters: Separated Models For Visual-Based RL Generalization

Link: https://arxiv.org/abs/2410.10834
Authors: Di Zhang,Bowen Lv,Hai Zhang,Feifan Yang,Junqiao Zhao,Hang Yu,Chang Huang,Hongtu Zhou,Chen Ye,Changjun Jiang
Keywords: visual-based Reinforcement Learning, visual-based Reinforcement, Reinforcement Learning, unseen environments, primary challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:A primary challenge for visual-based Reinforcement Learning (RL) is to generalize effectively across unseen environments. Although previous studies have explored different auxiliary tasks to enhance generalization, few adopt image reconstruction due to concerns about exacerbating overfitting to task-irrelevant features during training. Perceiving the pre-eminence of image reconstruction in representation learning, we propose SMG (Separated Models for Generalization), a novel approach that exploits image reconstruction for generalization. SMG introduces two model branches to extract task-relevant and task-irrelevant representations separately from visual observations via cooperative reconstruction. Built upon this architecture, we further emphasize the importance of task-relevant features for generalization. Specifically, SMG incorporates two additional consistency losses to guide the agent’s focus toward task-relevant areas across different scenarios, thereby avoiding overfitting. Extensive experiments in DMC demonstrate the SOTA performance of SMG in generalization, particularly excelling in video-background settings. Evaluations on robotic manipulation tasks further confirm the robustness of SMG in real-world applications.

[CV-30] High-Fidelity 3D Lung CT Synthesis in ARDS Swine Models Using Score-Based 3D Residual Diffusion Models

Link: https://arxiv.org/abs/2410.10826
Authors: Siyeop Yoon,Yujin Oh,Xiang Li,Yi Xin,Maurizio Cereda,Quanzheng Li
Keywords: Acute respiratory distress, respiratory distress syndrome, severe condition characterized, high mortality rate, Acute respiratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 5 pages, 3 figures, Submitted to SPIE 2025-Medical Imaging

Abstract:Acute respiratory distress syndrome (ARDS) is a severe condition characterized by lung inflammation and respiratory failure, with a high mortality rate of approximately 40%. Traditional imaging methods, such as chest X-rays, provide only two-dimensional views, limiting their effectiveness in fully assessing lung pathology. Three-dimensional (3D) computed tomography (CT) offers a more comprehensive visualization, enabling detailed analysis of lung aeration, atelectasis, and the effects of therapeutic interventions. However, the routine use of CT in ARDS management is constrained by practical challenges and risks associated with transporting critically ill patients to remote scanners. In this study, we synthesize high-fidelity 3D lung CT from 2D generated X-ray images with associated physiological parameters using a score-based 3D residual diffusion model. Our preliminary results demonstrate that this approach can produce high-quality 3D CT images that are validated with ground truth, offering a promising solution for enhancing ARDS management.

[CV-31] Advancements in Ship Detection: Comparative Analysis of Optical and Hyperspectral Sensors

Link: https://arxiv.org/abs/2410.10888
Authors: Alyazia Al Shamsi,Alavikunhu Panthakkan,Saeed Al Mansoori,Hussain Al Ahmad
Keywords: marine traffic control, applications span military, including ship detection, marine surveillance, marine traffic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In marine surveillance, applications span military and civilian domains, including ship detection, marine traffic control, and disaster management. Optical and hyperspectral satellites are key for this purpose. This paper focuses on ship detection and classification techniques, particularly comparing optical and hyperspectral remote sensing approaches. It presents a comprehensive analysis of these technologies, covering feature extraction, methodologies, and their suitability for different missions. The study highlights the importance of selecting the right sensor aligned with mission objectives and conditions, aiming to improve detection accuracy through integrated strategies. The paper examines the strengths and limitations of both technologies in various maritime applications, enhancing understanding of their usability in different operational scenarios.

[CV-32] Adaptive Data Transport Mechanism for UAV Surveillance Missions in Lossy Environments

Link: https://arxiv.org/abs/2410.10843
Authors: Niloufar Mehrabi,Sayed Pedram Haeri Boroujeni,Jenna Hofseth,Abolfazl Razi,Long Cheng,Manveen Kaur,James Martin,Rahul Amin
Keywords: Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, increasingly critical role, access remote areas
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unmanned Aerial Vehicles (UAVs) play an increasingly critical role in Intelligence, Surveillance, and Reconnaissance (ISR) missions such as border patrolling and criminal detection, thanks to their ability to access remote areas and transmit real-time imagery to processing servers. However, UAVs are highly constrained by payload size, power limits, and communication bandwidth, necessitating the development of highly selective and efficient data transmission strategies. This has driven the development of various compression and optimal transmission technologies for UAVs. Nevertheless, most methods strive to preserve maximal information in transferred video frames, missing the fact that only certain parts of images/video frames might offer meaningful contributions to the ultimate mission objectives in the ISR scenarios involving moving object detection and tracking (OD/OT). This paper adopts a different perspective, and offers an alternative AI-driven scheduling policy that prioritizes selecting regions of the image that significantly contribute to the mission objective. The key idea is tiling the image into small patches and developing a deep reinforcement learning (DRL) framework that assigns higher transmission probabilities to patches that present higher overlaps with the detected object of interest, while penalizing sharp transitions over consecutive frames to promote smooth scheduling shifts. Although we used YOLOv8 object detection and UDP transmission protocols as a benchmark testing scenario, the idea is general and applicable to different transmission protocols and OD/OT methods. To further boost the system’s performance and avoid OD errors for cluttered image patches, we integrate it with interframe interpolations.
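
As a rough illustration of the tiling idea in this abstract, the sketch below scores each patch by how much of it a detected bounding box covers and turns the scores into transmission probabilities. It is not the paper's DRL policy: the function names, the `floor` parameter, and the simple normalization are assumptions made for illustration, and the smoothness penalty across frames is left out.

```python
def patch_overlaps(width, height, patch, box):
    """Return {(row, col): fraction of that patch covered by box}.

    box is (x0, y0, x1, y1) in pixels, standing in for a detector output.
    """
    x0, y0, x1, y1 = box
    scores = {}
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            right = min(left + patch, width)
            bottom = min(top + patch, height)
            inter_w = max(0, min(right, x1) - max(left, x0))
            inter_h = max(0, min(bottom, y1) - max(top, y0))
            area = (right - left) * (bottom - top)
            scores[(top // patch, left // patch)] = inter_w * inter_h / area
    return scores


def transmission_probs(scores, floor=0.05):
    """Normalize overlap scores; the floor keeps every patch selectable."""
    raw = {key: value + floor for key, value in scores.items()}
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


# A 64x64 frame split into four 32x32 patches, with one detection box.
scores = patch_overlaps(width=64, height=64, patch=32, box=(0, 0, 32, 16))
probs = transmission_probs(scores)
```

In a learned policy these probabilities would come from the agent rather than from a fixed rule; the overlap score is closer to the reward signal the abstract describes.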

[CV-33] AI Foundation Model for Heliophysics: Applications Design and Implementation

Link: https://arxiv.org/abs/2410.10841
Authors: Sujit Roy,Talwinder Singh,Marcus Freitag,Johannes Schmude,Rohit Lal,Dinesha Hegde,Soumya Ranjan,Amy Lin,Vishal Gaur,Etienne Eben Vos,Rinki Ghosal,Badri Narayana Patro,Berkay Aydin,Nikolai Pogorelov,Juan Bernabe Moreno,Manil Maskey,Rahul Ramachandran
Keywords: Deep learning-based methods, understand long sequences, Deep learning-based, numerous helio-physics applications, demonstrating their capacity
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 Pages, 12 figures

Abstract:Deep learning-based methods have been widely researched in the areas of language and vision, demonstrating their capacity to understand long sequences of data and their usefulness in numerous helio-physics applications. Foundation models (FMs), which are pre-trained on large-scale datasets, form the basis for a variety of downstream tasks. These models, especially those based on transformers in vision and language, show exceptional potential for adapting to a wide range of downstream applications. In this paper, we provide our perspective on the criteria for designing an FM for heliophysics and associated challenges and applications using the Solar Dynamics Observatory (SDO) dataset. We believe that this is the first study to design an FM in the domain of heliophysics.

[CV-34] Swap-Net: A Memory-Efficient 2.5D Network for Sparse-View 3D Cone Beam CT Reconstruction

Link: https://arxiv.org/abs/2410.10836
Authors: Xiaojian Xu,Marc Klasky,Michael T. McCann,Jason Hu,Jeffrey A. Fessler
Keywords: cone beam computed, beam computed tomography, inertial confinement fusion, important inverse problem, cone beam
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing 3D cone beam computed tomography (CBCT) images from a limited set of projections is an important inverse problem in many imaging applications from medicine to inertial confinement fusion (ICF). The performance of traditional methods such as filtered back projection (FBP) and model-based regularization is sub-optimal when the number of available projections is limited. In the past decade, deep learning (DL) has gained great popularity for solving CT inverse problems. A typical DL-based method for CBCT image reconstruction is to learn an end-to-end mapping by training a 2D or 3D network. However, 2D networks fail to fully use global information. While 3D networks are desirable, they become impractical as image sizes increase because of the high memory cost. This paper proposes Swap-Net, a memory-efficient 2.5D network for sparse-view 3D CBCT image reconstruction. Swap-Net uses a sequence of novel axes-swapping operations to produce 3D volume reconstruction in an end-to-end fashion without using full 3D convolutions. Simulation results show that Swap-Net consistently outperforms baseline methods both quantitatively and qualitatively in terms of reducing artifacts and preserving details of complex hydrodynamic simulations of relevance to the ICF community.

Machine Learning

[LG-0] MoH: Multi-Head Attention as Mixture-of-Head Attention

Link: https://arxiv.org/abs/2410.11842
Authors: Peng Jin,Bo Zhu,Li Yuan,Shuicheng Yan
Keywords: multi-head attention, attention, previous accuracy level, attention heads, Transformer model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, code: this https URL

Abstract:In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
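
The "weighted summation over selected heads" described in this abstract can be pictured with a small sketch. This is only an illustration under assumptions, not the authors' implementation: the router scores are taken as given, the top-k selection with a softmax over the selected heads is an assumed routing rule, and `mixture_of_heads` is a hypothetical name.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_heads(head_outputs, router_scores, k):
    """Weighted sum of the k heads with the highest router score, for one token.

    head_outputs: one output vector per head (already projected to model width).
    router_scores: one raw score per head (assumed to come from a learned router).
    """
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    top = order[:k]
    weights = softmax([router_scores[i] for i in top])
    out = [0.0] * len(head_outputs[0])
    for weight, i in zip(weights, top):
        for d, value in enumerate(head_outputs[i]):
            out[d] += weight * value
    return out, top


# Four toy heads with 2-dimensional outputs; only two heads are activated.
head_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
router_scores = [0.0, 5.0, 5.0, -1.0]
out, top = mixture_of_heads(head_outputs, router_scores, k=2)
```

Setting `k` equal to the number of heads and using uniform weights of 1 would recover the plain summation form of standard multi-head attention.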

[LG-1] A Hitchhikers Guide to Scaling Law Estimation

Link: https://arxiv.org/abs/2410.11840
Authors: Leshem Choshen,Yang Zhang,Jacob Andreas
Keywords: Scaling laws, machine learning model, target machine learning, Scaling laws predict, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that – all else equal – estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model’s behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
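
To make "extrapolating from smaller models" concrete, here is a minimal sketch that fits a pure power law, loss = a * size^(-b), by least squares in log-log space and then predicts the loss of a larger model. This functional form is a simplifying assumption: the scaling laws studied in the paper may include further terms (for example an irreducible loss or a data-size term), and the data below is synthetic.

```python
import math


def fit_power_law(sizes, losses):
    """Fit log(loss) = log(a) - b * log(size) by ordinary least squares."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** (-b)


# Synthetic losses generated from a known law, so the fit can be checked.
sizes = [1e7, 1e8, 1e9]
losses = [10.0 * s ** -0.1 for s in sizes]
a, b = fit_power_law(sizes, losses)
predicted = predict_loss(a, b, 1e10)
```

Following the abstract's advice, the `(size, loss)` pairs could also include intermediate checkpoints rather than only final losses, which gives the fit more points per training run.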

[LG-2] Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

Link: https://arxiv.org/abs/2410.11833
Authors: Ayush Jain,Norio Kosaka,Xinhu Li,Kyung-Min Kim,Erdem Bıyık,Joseph J. Lim
Keywords: off-policy actor-critic approaches, approaches like DDPG, deterministic policy gradient, reinforcement learning, actor-critic approaches
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Comments:

Abstract:In reinforcement learning, off-policy actor-critic approaches like DDPG and TD3 are based on the deterministic policy gradient. Herein, the Q-function is trained from off-policy environment data and the actor (policy) is trained to maximize the Q-function via gradient ascent. We observe that in complex tasks like dexterous manipulation and restricted locomotion, the Q-value is a complex function of action, having several local optima or discontinuities. This poses a challenge for gradient ascent to traverse and makes the actor prone to get stuck at local optima. To address this, we introduce a new actor architecture that combines two simple insights: (i) use multiple actors and evaluate the Q-value maximizing action, and (ii) learn surrogates to the Q-function that are simpler to optimize with gradient-based methods. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms alternate actor architectures.
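
Insight (i) from this abstract, using several actors and keeping the action with the highest Q-value, can be sketched in a few lines. The toy Q-function and the constant "actors" below are assumptions chosen so that one actor is stuck at a local optimum; the learned surrogates from insight (ii) are not shown.

```python
def select_action(actors, q_function, state):
    """Query every actor and keep the candidate action with the highest Q-value."""
    candidates = [actor(state) for actor in actors]
    return max(candidates, key=lambda action: q_function(state, action))


def toy_q(state, action):
    """Toy Q with a local optimum at action = -1 and a global one at action = 2."""
    return -min((action + 1.0) ** 2 + 0.5, (action - 2.0) ** 2)


# Hypothetical actors: the first sits at the local optimum,
# the second is near the global one.
actors = [lambda s: -1.0, lambda s: 1.8, lambda s: 0.0]
best = select_action(actors, toy_q, state=None)
```

A single actor trained by gradient ascent from the neighbourhood of -1 would stay there, while the ensemble picks the candidate near 2 because its Q-value is higher.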

[LG-3] Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws

Link: https://arxiv.org/abs/2410.11820
Authors: Yiding Jiang,Allan Zhou,Zhili Feng,Sadhika Malladi,J. Zico Kolter
Keywords: limited computational budget, foundation models’ performance, composition of pretraining, key determinant, determinant of foundation
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 10 figures

Abstract:The composition of pretraining data is a key determinant of foundation models’ performance, but there is no standard guideline for allocating a limited computational budget across different data sources. Most current approaches either rely on extensive experiments with smaller models or dynamic data adjustments that also require proxy models, both of which significantly increase the workflow complexity and computational overhead. In this paper, we introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrent with model training. Unlike existing techniques, ADO does not require external knowledge, proxy models, or modifications to the model update. Instead, ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly, making it more scalable and easier to integrate. Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs. Beyond its practical benefits, ADO also provides a new perspective on data collection strategies via scaling laws.

[LG-4] Improving Long-Text Alignment for Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2410.11817
Authors: Luping Liu,Chao Du,Tianyu Pang,Zehan Wang,Chongxuan Li,Dong Xu
Keywords: generate unprecedented results, rapid advancement, generate unprecedented, unprecedented results, long texts
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring mechanisms and find that the preference scores can be decomposed into two components: a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference. Additionally, we find that the text-irrelevant part contributes to a common overfitting problem during fine-tuning. To address this, we propose a reweighting strategy that assigns different weights to these two components, thereby reducing overfitting and enhancing alignment. After fine-tuning $512 \times 512$ Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-$\alpha$ and Kandinsky v2.2. The code is available at this https URL.
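
The segment-level encoding step in this abstract (split a long prompt, encode each piece separately) might look roughly like the sketch below. Splitting on whitespace with a fixed word budget and the `toy_encode` stand-in are assumptions for illustration; the paper's actual segmentation rule and encoder are not reproduced here.

```python
def split_segments(text, max_tokens):
    """Split text into consecutive chunks of at most max_tokens whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


def encode_long_text(text, encode, max_tokens):
    """Encode each segment on its own and return the embeddings in order."""
    return [encode(segment) for segment in split_segments(text, max_tokens)]


def toy_encode(segment):
    """Stand-in for a real text encoder: returns (word count, character count)."""
    return [len(segment.split()), len(segment)]


prompt = "a red fox jumps over a sleepy dog at dawn"
embeddings = encode_long_text(prompt, toy_encode, max_tokens=4)
```

Because every segment stays under the encoder's input limit, the full prompt can exceed that limit; how the per-segment embeddings are then combined for the diffusion model is a separate design choice not covered by this sketch.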

[LG-5] FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting

Link: https://arxiv.org/abs/2410.11802
Authors: Zhe Li,Xiangfei Qiu,Peng Chen,Yihang Wang,Hanyin Cheng,Yang Shu,Jilin Hu,Chenjuan Guo,Aoying Zhou,Qingsong Wen,Christian S. Jensen,Bin Yang
Keywords: TSF foundation models, TSF foundation, Foundation models, weather services, Time Series
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Time Series Forecasting (TSF) is a key functionality in numerous fields, including finance, weather services, and energy management. While TSF methods are emerging these days, many of them require domain-specific data collection and model training and struggle with poor generalization performance on new domains. Foundation models aim to overcome this limitation. Pre-trained on large-scale language or time series data, they exhibit promising inferencing capabilities in new or unseen data. This has spurred a surge in new TSF foundation models. We propose a new benchmark, FoundTS, to enable thorough and fair evaluation and comparison of such models. FoundTS covers a variety of TSF foundation models, including those based on large language models and those pretrained on time series. Next, FoundTS supports different forecasting strategies, including zero-shot, few-shot, and full-shot, thereby facilitating more thorough evaluations. Finally, FoundTS offers a pipeline that standardizes evaluation processes such as dataset splitting, loading, normalization, and few-shot sampling, thereby facilitating fair evaluations. Building on this, we report on an extensive evaluation of TSF foundation models on a broad range of datasets from diverse domains and with different statistical characteristics. Specifically, we identify pros and cons and inherent limitations of existing foundation models, and we identify directions for future model design. We make our code and datasets available at this https URL.

[LG-6] OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

Link: https://arxiv.org/abs/2410.11792
Authors: Jinhan Li,Yifeng Zhu,Yuqi Xie,Zhenyu Jiang,Mingyo Seo,Georgios Pavlakos,Yuke Zhu
Keywords: single video demonstrations, robots manipulation skills, teaching humanoid robots, study the problem, problem of teaching
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for oral presentation at 8th Annual Conference on Robot Learning. Project website: this https URL

Abstract:We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalizations across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website this https URL.

[LG-7] Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability EMNLP2024

Link: https://arxiv.org/abs/2410.11786
Authors: Tsz Ting Chung,Leyang Cui,Lemao Liu,Xinting Huang,Shuming Shi,Dit-Yan Yeung
Keywords: Large Language Models, natural language processing, Large Language, demonstrated impressive capabilities, language processing tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 10 tables, EMNLP 2024 Findings

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of natural language processing tasks when leveraging in-context learning. To mitigate the additional computational and financial costs associated with in-context learning, several prompt compression methods have been proposed to compress the in-context learning prompts. Despite their success, these methods face challenges with transferability due to model-specific compression, or rely on external training data, such as GPT-4. In this paper, we investigate the ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training technique. By introducing a small number of parameters during the continual pre-training, the proposed Selection-p produces a probability for each input token, indicating whether to preserve or discard it. Experiments show Selection-p achieves state-of-the-art performance across numerous classification tasks, achieving compression rates of up to 10 times while experiencing only a marginal 0.8% decrease in performance. Moreover, it exhibits superior transferability to different models compared to prior work. Additionally, we further analyze how Selection-p helps maintain performance on in-context learning with long contexts.
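
The abstract says Selection-p assigns each input token a probability of being preserved. One plausible way to turn such probabilities into a compressed prompt is sketched below; keeping the highest-probability tokens up to a budget set by the compression rate is an assumption, not necessarily the paper's decision rule, and the probabilities here are made up for the example.

```python
import math


def compress_prompt(tokens, keep_probs, rate):
    """Keep roughly len(tokens) / rate tokens with the highest keep probability.

    The surviving tokens are returned in their original order.
    """
    budget = max(1, math.ceil(len(tokens) / rate))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:budget])
    return [tokens[i] for i in kept]


# Hypothetical per-token probabilities, as if produced by the selection model.
tokens = ["Review", ":", "the", "movie", "was", "great", "Sentiment", ":"]
keep_probs = [0.9, 0.2, 0.1, 0.8, 0.3, 0.95, 0.85, 0.4]
compressed = compress_prompt(tokens, keep_probs, rate=2)
```

With `rate=10`, the same function would correspond to the 10x compression setting mentioned in the abstract, keeping about one token in ten.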

[LG-8] G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks

Link: https://arxiv.org/abs/2410.11782
Authors: Guibin Zhang,Yanwei Yue,Xiangguo Sun,Guancheng Wan,Miao Yu,Junfeng Fang,Kun Wang,Dawei Cheng
Keywords: Recent advancements, large language model, inter-agent communication topologies, well-crafted inter-agent communication, primarily due
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in large language model (LLM)-based agents have demonstrated that collective intelligence can significantly surpass the capabilities of individual agents, primarily due to well-crafted inter-agent communication topologies. Despite the diverse and high-performing designs available, practitioners often face confusion when selecting the most effective pipeline for their specific task: "Which topology is the best choice for my task, avoiding unnecessary communication token overhead while ensuring high-quality solution?" In response to this dilemma, we introduce G-Designer, an adaptive, efficient, and robust solution for multi-agent deployment, which dynamically designs task-aware, customized communication topologies. Specifically, G-Designer models the multi-agent system as a multi-agent network, leveraging a variational graph auto-encoder to encode both the nodes (agents) and a task-specific virtual node, and decodes a task-adaptive and high-performing communication topology. Extensive experiments on six benchmarks showcase that G-Designer is: (1) high-performing, achieving superior results on MMLU with accuracy at 84.50% and on HumanEval with pass@1 at 89.90%; (2) task-adaptive, architecting communication protocols tailored to task difficulty, reducing token consumption by up to 95.33% on HumanEval; and (3) adversarially robust, defending against agent adversarial attacks with merely 0.3% accuracy drop.

[LG-9] Language Models Encode Numbers Using Digit Representations in Base 10

Link: https://arxiv.org/abs/2410.11781
Authors: Amit Arnold Levy,Mor Geva
Keywords: Large language models, Large language, frequently make errors, simple numerical problems, frequently make
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) frequently make errors when handling even simple numerical problems, such as comparing two small numbers. A natural hypothesis is that these errors stem from how LLMs represent numbers, and specifically, whether their representations of numbers capture their numeric values. We tackle this question from the observation that LLM errors on numerical tasks are often distributed across the digits of the answer rather than normally around its numeric value. Through a series of probing experiments and causal interventions, we show that LLMs internally represent numbers with individual circular representations per-digit in base 10. This digit-wise representation, as opposed to a value representation, sheds light on the error patterns of models on tasks involving numerical reasoning and could serve as a basis for future studies on analyzing numerical mechanisms in LLMs.

[LG-10] MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Link: https://arxiv.org/abs/2410.11779
Authors: Chenxi Wang,Xiang Chen,Ningyu Zhang,Bozhong Tian,Haoming Xu,Shumin Deng,Huajun Chen
Keywords: Multimodal Large Language, remain poorly understood, underlying reasons remain, reasons remain poorly, frequently exhibit hallucination
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Ongoing work

Abstract:Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena, but the underlying reasons remain poorly understood. In this paper, we present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. We speculate that this may be due to the strong knowledge priors of the language model suppressing the visual information, leading to hallucinations. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs (DeCo), which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits. Note that DeCo is model agnostic and can be seamlessly incorporated with various classic decoding strategies and applied to different MLLMs. We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines, highlighting its potential to mitigate hallucinations. Code is available at this https URL.

[LG-11] On the Training Convergence of Transformers for In-Context Classification

Link: https://arxiv.org/abs/2410.11778
Authors: Wei Shen,Ruida Zhou,Jing Yang,Cong Shen
Keywords: demonstrated impressive capacities, underlying mechanism enabling, mechanism enabling transformers, infant stage, demonstrated impressive
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Abstract:While transformers have demonstrated impressive capacities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism enabling transformers to perform ICL is still in its infant stage. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the Bayes-optimal classifier. Experimental results corroborate the theoretical findings.

[LG-12] Encoding architecture algebra

Link: https://arxiv.org/abs/2410.11776
Authors: Stephane Bersier,Xinyi Chen-Lin
Keywords: machine learning, leading to inefficiencies, typeful machine learning, model lifecycle, wide variety
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: 25 pages, 6 figures. Keywords: typeful, algebraic data types, tensors, structured data

Abstract:Despite the wide variety of input types in machine learning, this diversity is often not fully reflected in their representations or model architectures, leading to inefficiencies throughout a model’s lifecycle. This paper introduces an algebraic approach to constructing input-encoding architectures that properly account for the data’s structure, providing a step toward achieving more typeful machine learning.

[LG-13] Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models EMNLP2024

Link: https://arxiv.org/abs/2410.11772
Authors: Kai Yao,Penlei Gao,Lichun Li,Yuan Zhao,Xiaofeng Wang,Wei Wang,Jianke Zhu
Keywords: Large Language Models, pre-trained Large Language, adapting pre-trained Large, Language Models, Large Language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant popularity for adapting pre-trained Large Language Models (LLMs) to downstream tasks, primarily due to their potential to significantly reduce memory and computational overheads. However, a common limitation in most PEFT approaches is their application of a uniform architectural design across all layers. This uniformity involves identical trainable modules and ignores the varying importance of each layer, leading to sub-optimal fine-tuning results. To overcome the above limitation and obtain better performance, we develop a novel approach, Importance-aware Sparse Tuning (IST), to fully utilize the inherent sparsity and select the most important subset of full layers with effective layer-wise importance scoring. The proposed IST is a versatile and plug-and-play technique compatible with various PEFT methods that operate on a per-layer basis. By leveraging the estimated importance scores, IST dynamically updates these selected layers in PEFT modules, leading to reduced memory demands. We further provide theoretical proof of convergence and empirical evidence of superior performance to demonstrate the advantages of IST over uniform updating strategies. Extensive experiments on a range of LLMs, PEFTs, and downstream tasks substantiate the effectiveness of our proposed method, showcasing IST’s capacity to enhance existing layer-based PEFT methods. Our code is available at this https URL.

[LG-14] Can Search-Based Testing with Pareto Optimization Effectively Cover Failure-Revealing Test Inputs?

Link: https://arxiv.org/abs/2410.11769
Authors: Lev Sorokin,Damir Safin,Shiva Nejati
Keywords: Deep Learning-enabled, Search-based software testing, Search-based software, SBST techniques focus, widely adopted technique
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication by Empirical Software Engineering Journal (EMSE) (in October 2024)

Abstract:Search-based software testing (SBST) is a widely adopted technique for testing complex systems with large input spaces, such as Deep Learning-enabled (DL-enabled) systems. Many SBST techniques focus on Pareto-based optimization, where multiple objectives are optimized in parallel to reveal failures. However, it is important to ensure that identified failures are spread throughout the entire failure-inducing area of a search domain and not clustered in a sub-region. This ensures that identified failures are semantically diverse and reveal a wide range of underlying causes. In this paper, we present a theoretical argument explaining why testing based on Pareto optimization is inadequate for covering failure-inducing areas within a search domain. We support our argument with empirical results obtained by applying two widely used types of Pareto-based optimization techniques, namely NSGA-II (an evolutionary algorithm) and MOPSO (a swarm-based algorithm), to two DL-enabled systems: an industrial Automated Valet Parking (AVP) system and a system for classifying handwritten digits. We measure the coverage of failure-revealing test inputs in the input space using a metric that we refer to as the Coverage Inverted Distance quality indicator. Our results show that NSGA-II and MOPSO are not more effective than a naïve random search baseline in covering test inputs that reveal failures. The replication package for this study is available in a GitHub repository.

[LG-15] Analyzing (In)Abilities of SAEs via Formal Languages

Link: https://arxiv.org/abs/2410.11767
Authors: Abhinav Menon,Manish Shrivastava,David Krueger,Ekdeep Singh Lubana
Keywords: underlying neural network, neural network representations, disentangled features underlying, features underlying neural, underlying neural
Categories: Machine Learning (cs.LG)
Comments: Under review

Abstract:Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, finding interpretable latents often emerge in the features learned by our SAEs. However, similar to vision, we find performance turns out to be highly sensitive to inductive biases of the training pipeline. Moreover, we show latents correlating to certain features of the input do not always induce a causal impact on model’s computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground-up. Motivated by this, we propose and perform preliminary investigations for an approach that promotes learning of causally relevant features in our formal language setting.

[LG-16] ECGN: A Cluster-Aware Approach to Graph Neural Networks for Imbalanced Classification

Link: https://arxiv.org/abs/2410.11765
Authors: Bishal Thapaliya,Anh Nguyen,Yao Lu,Tian Xie,Igor Grudetskyi,Fudong Lin,Antonios Valkanas,Jingyu Liu,Deepayan Chakraborty,Bilel Fehri
Keywords: Classifying nodes, Graph Neural Networks, Cluster-aware Graph Network, Classifying, Enhanced Cluster-aware Graph
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 3 figures

Abstract:Classifying nodes in a graph is a common problem. The ideal classifier must adapt to any imbalances in the class distribution. It must also use information in the clustering structure of real-world graphs. Existing Graph Neural Networks (GNNs) have not addressed both problems together. We propose the Enhanced Cluster-aware Graph Network (ECGN), a novel method that addresses these issues by integrating cluster-specific training with synthetic node generation. Unlike traditional GNNs that apply the same node update process for all nodes, ECGN learns different aggregations for different clusters. We also use the clusters to generate new minority-class nodes in a way that helps clarify the inter-class decision boundary. By combining cluster-aware embeddings with a global integration step, ECGN enhances the quality of the resulting node embeddings. Our method works with any underlying GNN and any cluster generation technique. Experimental results show that ECGN consistently outperforms its closest competitors by up to 11% on some widely studied benchmark datasets.

[LG-17] LoSAM: Local Search in Additive Noise Models with Unmeasured Confounders, a Top-Down Global Discovery Approach

Link: https://arxiv.org/abs/2410.11759
Authors: Sujai Hiremath,Kyra Gan,Promit Ghosal
Keywords: underlying data-generating process, imposing additional assumptions, structural equation models, additive noise, additive noise model
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We address the challenge of causal discovery in structural equation models with additive noise without imposing additional assumptions on the underlying data-generating process. We introduce local search in additive noise model (LoSAM), which generalizes an existing nonlinear method that leverages local causal substructures to the general additive noise setting, allowing for both linear and nonlinear causal mechanisms. We show that LoSAM achieves polynomial runtime, and improves runtime and efficiency by exploiting new substructures to minimize the conditioning set at each step. Further, we introduce a variant of LoSAM, LoSAM-UC, that is robust to unmeasured confounding among roots, a property that is often not satisfied by functional-causal-model-based methods. We numerically demonstrate the utility of LoSAM, showing that it outperforms existing benchmarks.

[LG-18] Latent Action Pretraining from Videos

Link: https://arxiv.org/abs/2410.11758
Authors: Seonghyeon Ye,Joel Jang,Byeongguk Jeon,Sejune Joo,Jianwei Yang,Baolin Peng,Ajay Mandlekar,Reuben Tan,Yu-Wei Chao,Bill Yuchen Lin,Lars Liden,Kimin Lee,Jianfeng Gao,Luke Zettlemoyer,Dieter Fox,Minjoon Seo
Keywords: action labels, Latent Action Pretraining, robot action labels, introduce Latent Action, ground-truth robot action
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.

[LG-19] DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

Link: https://arxiv.org/abs/2410.11744
Authors: Yunfan Xiong,Ruoyu Zhang,Yanzeng Li,Tianhao Wu,Lei Zou
Keywords: large language models, recently appeared, promising direction, direction for accelerating, accelerating the inference
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Abstract:While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods usually organize predicted tokens as independent chains or fixed token trees, which fails to generalize to diverse query distributions. In this paper, we propose DySpec, a faster speculative decoding algorithm with a novel dynamic token tree structure. We begin by bridging the draft distribution and acceptance rate from intuitive and empirical clues, and successfully show that the two variables are strongly correlated. Based on this, we employ a greedy strategy to dynamically expand the token tree at run time. Theoretically, we show that our method can achieve optimal results under mild assumptions. Empirically, DySpec yields a higher acceptance rate and speedup than fixed trees. DySpec can drastically improve the throughput and reduce the latency of token generation across various data distributions and model sizes, which significantly outperforms strong competitors, including Specinfer and Sequoia. Under a low-temperature setting, DySpec can improve the throughput up to 9.1× and reduce the latency up to 9.4× on Llama2-70B. Under a high-temperature setting, DySpec can also improve the throughput up to 6.21×, despite the increasing difficulty of speculating more than one token per step for the draft model.
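
The greedy tree expansion mentioned in this abstract can be sketched as a best-first search. This is a minimal illustration and not the paper's algorithm: it assumes that the product of draft probabilities along a path is a usable proxy for the acceptance rate (the abstract only says the two are strongly correlated), and `toy_draft` and `grow_token_tree` are hypothetical names.

```python
import heapq

def toy_draft(prefix):
    """Hypothetical draft model: maps a token prefix to next-token probabilities."""
    if prefix and prefix[-1] == "a":
        return {"b": 0.9, "c": 0.1}
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def grow_token_tree(draft, budget):
    """Greedily add the `budget` highest-scoring nodes to a token tree.

    A node's score is the product of draft probabilities along its path,
    used here as an assumed proxy for its chance of being accepted.
    """
    tree = []  # (path, score) pairs in the order the nodes were added
    # Max-heap via negated scores; start with the children of the empty root.
    frontier = [(-p, (tok,)) for tok, p in draft(()).items()]
    heapq.heapify(frontier)
    while frontier and len(tree) < budget:
        neg_score, path = heapq.heappop(frontier)
        tree.append((path, -neg_score))
        # Children inherit the parent's score multiplied by their own probability.
        for tok, p in draft(path).items():
            heapq.heappush(frontier, (neg_score * p, path + (tok,)))
    return tree

for path, score in grow_token_tree(toy_draft, budget=4):
    print(" ".join(path), round(score, 3))
```

Because the tree shape depends on the draft distribution at each step, a confident draft yields a deep chain while an uncertain one yields a wide, shallow tree, in contrast to a fixed tree layout.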

[LG-20] KLay: Accelerating Neurosymbolic AI

Link: https://arxiv.org/abs/2410.11415
Authors: Jaron Maene,Vincent Derkinderen,Pedro Zuidberg Dos Martires
Keywords: computation graphs consisting, involves mapping logic, mapping logic formulas, neural network, computation graphs
Categories: Machine Learning (cs.LG)
Comments:

Abstract:A popular approach to neurosymbolic AI involves mapping logic formulas to arithmetic circuits (computation graphs consisting of sums and products) and passing the outputs of a neural network through these circuits. This approach enforces symbolic constraints onto a neural network in a principled and end-to-end differentiable way. Unfortunately, arithmetic circuits are challenging to run on modern AI accelerators as they exhibit a high degree of irregular sparsity. To address this limitation, we introduce knowledge layers (KLay), a new data structure to represent arithmetic circuits that can be efficiently parallelized on GPUs. Moreover, we contribute two algorithms used in the translation of traditional circuit representations to KLay and a further algorithm that exploits parallelization opportunities during circuit evaluations. We empirically show that KLay achieves speedups of multiple orders of magnitude over the state of the art, thereby paving the way towards scaling neurosymbolic AI to larger real-world applications.
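
To make the notion of an arithmetic circuit concrete, the sketch below evaluates a circuit of sums and products one layer at a time. It is only an illustration of the layer-wise idea and not the KLay data structure itself; the `(op, groups)` layout and the function name `evaluate_layered_circuit` are assumptions.

```python
import math

def evaluate_layered_circuit(leaves, layers):
    """Evaluate an arithmetic circuit one layer at a time.

    `layers` is a list of (op, groups) pairs, where op is "sum" or "prod" and
    each group lists indices into the previous layer's output. All nodes in a
    layer are independent of each other, so on a GPU each layer could be a
    single batched gather-and-reduce instead of this Python loop.
    """
    values = list(leaves)
    for op, groups in layers:
        reduce_fn = sum if op == "sum" else math.prod
        values = [reduce_fn(values[i] for i in group) for group in groups]
    return values

# Probability of (a AND b) OR (NOT a AND c) with P(a)=0.3, P(b)=0.5, P(c)=0.2.
leaves = [0.3, 0.7, 0.5, 0.2]  # P(a), P(not a), P(b), P(c)
layers = [
    ("prod", [[0, 2], [1, 3]]),  # a*b and (not a)*c
    ("sum", [[0, 1]]),           # the two branches are mutually exclusive
]
print(evaluate_layered_circuit(leaves, layers))
```

In a neurosymbolic setting the leaf values would come from a neural network's outputs, which is why the evaluation needs to be differentiable and fast on accelerators.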

[LG-21] Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference

Link: https://arxiv.org/abs/2410.11403
Authors: Yuta Oshima,Masahiro Suzuki,Yutaka Matsuo
Keywords: capture shared latent, shared latent representations, Multimodal variational autoencoders, variational autoencoders, aim to capture
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 12 figures

Abstract:Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number (2^M) of inference networks for all possible modality combinations. Mixture-based models simplify this by requiring only as many inference models as there are modalities, aggregating unimodal inferences. However, they suffer from information loss when modalities are missing. Alignment-based VAEs address this by aligning unimodal inference models with a multimodal model through minimizing the Kullback-Leibler (KL) divergence but face issues due to amortization gaps, which compromise inference accuracy. To tackle these problems, we introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework. This method overcomes information loss from missing modalities and minimizes the amortization gap by iteratively refining the multimodal inference using all available modalities. By aligning unimodal inference to this refined multimodal posterior, we achieve unimodal inferences that effectively incorporate multimodal information while requiring only unimodal inputs during inference. Experiments on benchmark datasets show that our approach improves inference performance, evidenced by higher linear classification accuracy and competitive cosine similarity, and enhances cross-modal generation, indicated by lower FID scores. This demonstrates that our method enhances inferred representations from unimodal inputs.

[LG-22] FOOGD: Federated Collaboration for Both Out-of-distribution Generalization and Detection NEURIPS2024

Link: https://arxiv.org/abs/2410.11397
Authors: Xinting Liao,Weiming Liu,Pengyang Zhou,Fengyuan Yu,Jiahe Xu,Jun Wang,Wenjie Wang,Chaochao Chen,Xiaolin Zheng
Keywords: promising machine learning, machine learning paradigm, Federated learning, machine learning, learning paradigm
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2024

Abstract:Federated learning (FL) is a promising machine learning paradigm that collaborates with client models to capture global knowledge. However, deploying FL models in real-world scenarios remains unreliable due to the coexistence of in-distribution data and unexpected out-of-distribution (OOD) data, such as covariate-shift and semantic-shift data. Current FL research typically addresses either covariate-shift data through OOD generalization or semantic-shift data via OOD detection, overlooking the simultaneous occurrence of various OOD shifts. In this work, we propose FOOGD, a method that estimates the probability density of each client and obtains a reliable global distribution as guidance for the subsequent FL process. First, SM3D in FOOGD estimates a score model for arbitrary distributions without prior constraints and detects semantic-shift data. Then, SAG in FOOGD provides invariant yet diverse knowledge for both local covariate-shift generalization and client performance generalization. In empirical validations, FOOGD shows three main advantages: (1) reliably estimating non-normalized decentralized distributions, (2) detecting semantic-shift data via score values, and (3) generalizing to covariate-shift data by regularizing the feature extractor. The project is available at this https URL.

[LG-23] Experimental Design Using Interlacing Polynomials

Link: https://arxiv.org/abs/2410.11390
Authors: Lap Chi Lau,Robert Wang,Hong Zhou
Keywords: experimental design problems, unified deterministic approach, interlacing polynomials, present a unified, unified deterministic
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 16 pages

Abstract:We present a unified deterministic approach for experimental design problems using the method of interlacing polynomials. Our framework recovers the best-known approximation guarantees for the well-studied D/A/E-design problems with simple analysis. Furthermore, we obtain improved non-trivial approximation guarantee for E-design in the challenging small budget regime. Additionally, our approach provides an optimal approximation guarantee for a generalized ratio objective that generalizes both D-design and A-design.

[LG-24] Point-Calibrated Spectral Neural Operators

Link: https://arxiv.org/abs/2410.11382
Authors: Xihang Yue,Linchao Zhu,Yi Yang
Keywords:
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

[LG-25] Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations

Link: https://arxiv.org/abs/2410.11381
Authors: Seongho Kim,Jihyun Moon,Juntaek Oh,Insu Choi,Joon-Sung Yang
Keywords: Transformer architecture enables, enables contextually natural, contextually natural text, natural text generation, processing entire source
Categories: Machine Learning (cs.LG)
Comments: 13 pages and 16 figures

Abstract:The advent of the Attention mechanism and Transformer architecture enables contextually natural text generation and compresses the burden of processing entire source information into singular vectors. Based on these two main ideas, model sizes have gradually increased to accommodate more precise and comprehensive information, leading to the current state-of-the-art LLMs being very large, with parameters around 70 billion. As model sizes grow, the demand for substantial storage and computational capacity increases. This leads to the development of high-bandwidth memory and accelerators, as well as a variety of model architectures designed to meet these requirements. We note that LLM architectures have increasingly converged. This paper analyzes how these converged architectures perform in terms of layer configurations, operational mechanisms, and model sizes, considering various hyperparameter settings. In this paper, we conduct a concise survey of the history of LLMs by tracing the evolution of their operational improvements. Furthermore, we summarize the performance trends of LLMs under various hyperparameter settings using the RTX 6000, which features the state-of-the-art Ada Lovelace architecture. We conclude that even the same model can exhibit different behaviors depending on the hyperparameters or whether it is deployed in server or edge environments.

[LG-26] WPFed: Web-based Personalized Federation for Decentralized Systems

Link: https://arxiv.org/abs/2410.11378
Authors: Guanhua Ye,Jifeng He,Weiqing Wang,Zhe Xue,Feifei Kou,Yawen Li
Keywords: trust are paramount, collaborative model training, WPFed, model training, learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Decentralized learning has become crucial for collaborative model training in environments where data privacy and trust are paramount. In web-based applications, clients are liberated from traditional fixed network topologies, enabling the establishment of arbitrary peer-to-peer (P2P) connections. While this flexibility is highly promising, it introduces a fundamental challenge: the optimal selection of neighbors to ensure effective collaboration. To address this, we introduce WPFed, a fully decentralized, web-based learning framework designed to enable globally optimal neighbor selection. WPFed employs a dynamic communication graph and a weighted neighbor selection mechanism. By assessing inter-client similarity through Locality-Sensitive Hashing (LSH) and evaluating model quality based on peer rankings, WPFed enables clients to identify personalized optimal neighbors on a global scale while preserving data privacy. To enhance security and deter malicious behavior, WPFed integrates verification mechanisms for both LSH codes and performance rankings, leveraging blockchain-driven announcements to ensure transparency and verifiability. Through extensive experiments on multiple real-world datasets, we demonstrate that WPFed significantly improves learning outcomes and system robustness compared to traditional federated learning methods. Our findings highlight WPFed’s potential to facilitate effective and secure decentralized collaborative learning across diverse and interconnected web environments.

[LG-27] DODT: Enhanced Online Decision Transformer Learning through Dreamer's Actor-Critic Trajectory Forecasting

Link: https://arxiv.org/abs/2410.11359
Authors: Eric Hanchen Jiang,Zhi Zhang,Dinghuai Zhang,Andrew Lizarraga,Chenheng Xu,Yasi Zhang,Siyan Zhao,Zhengjie Xu,Peiyu Yu,Yuer Tang,Deqian Kong,Ying Nian Wu
Keywords: sophisticated models capable, complex decision-making tasks, Online Decision Transformer, development of sophisticated, learning complex decision-making
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
Comments:

Abstract:Advancements in reinforcement learning have led to the development of sophisticated models capable of learning complex decision-making tasks. However, efficiently integrating world models with decision transformers remains a challenge. In this paper, we introduce a novel approach that combines the Dreamer algorithm’s ability to generate anticipatory trajectories with the adaptive learning strengths of the Online Decision Transformer. Our methodology enables parallel training where Dreamer-produced trajectories enhance the contextual decision-making of the transformer, creating a bidirectional enhancement loop. We empirically demonstrate the efficacy of our approach on a suite of challenging benchmarks, achieving notable improvements in sample efficiency and reward maximization over existing methods. Our results indicate that the proposed integrated framework not only accelerates learning but also showcases robustness in diverse and dynamic scenarios, marking a significant step forward in model-based reinforcement learning.

[LG-28] Reducing Labeling Costs in Sentiment Analysis via Semi-Supervised Learning

Link: https://arxiv.org/abs/2410.11355
Authors: Minoo Jafarlou,Mario M. Kubek
Keywords: noteworthy challenge, challenge in machine, machine learning, learning, Labeling datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures, accepted at the 2024 8th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2024), Okayama, Japan, 2024

Abstract:Labeling datasets is a noteworthy challenge in machine learning, both in terms of cost and time. This research, however, leverages an efficient answer. By exploring label propagation in semi-supervised learning, we can significantly reduce the number of labels required compared to traditional methods. We employ a transductive label propagation method based on the manifold assumption for text classification. Our approach utilizes a graph-based method to generate pseudo-labels for unlabeled data for the text classification task, which are then used to train deep neural networks. By extending labels based on cosine proximity within a nearest neighbor graph from network embeddings, we combine unlabeled data into supervised learning, thereby reducing labeling costs. Based on previous successes in other domains, this study builds and evaluates this approach’s effectiveness in sentiment analysis, presenting insights into semi-supervised learning.
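
Graph-based label propagation of the kind this abstract describes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' pipeline: the embeddings are given as plain lists, the graph keeps each node's `k` most cosine-similar neighbors, and the update rule (similarity-weighted sum of neighbor scores, with known labels clamped) is one common variant chosen for brevity. The names `cosine` and `propagate_labels` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def propagate_labels(embeddings, labels, k=2, iterations=10):
    """Spread labels over a k-nearest-neighbor cosine graph.

    `labels[i]` is a class index for labeled points and None otherwise.
    Returns one predicted class index per point (pseudo-labels).
    """
    n = len(embeddings)
    classes = sorted({y for y in labels if y is not None})
    # Build the k-NN graph: each node keeps its k most similar neighbors.
    neighbors = []
    for i in range(n):
        sims = [(cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i]
        neighbors.append(sorted(sims, reverse=True)[:k])
    # Class scores: one-hot for labeled points, zeros for unlabeled ones.
    scores = [[1.0 if y == c else 0.0 for c in classes] for y in labels]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i])  # clamp known labels
                continue
            # Similarity-weighted vote from neighbors; negative similarities are ignored.
            row = [sum(max(s, 0.0) * scores[j][c] for s, j in neighbors[i])
                   for c in range(len(classes))]
            updated.append(row)
        scores = updated
    return [classes[max(range(len(classes)), key=lambda c: row[c])] for row in scores]

# Two labeled points (one per class) and four unlabeled points.
embeddings = [[1, 0], [0.9, 0.1], [0.8, 0.3], [0, 1], [0.1, 0.9], [0.3, 0.8]]
labels = [0, None, None, 1, None, None]
pseudo_labels = propagate_labels(embeddings, labels)
```

The resulting pseudo-labels could then be used as training targets for a supervised classifier, which is the step the abstract describes as combining unlabeled data into supervised learning.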

[LG-29] Toward a Well-Calibrated Discrimination via Survival Outcome-Aware Contrastive Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.11340
Authors: Dongjoon Lee,Hyeryn Park,Changhee Lee
Keywords: Previous deep learning, Previous deep, analysis have primarily, primarily relied, relied on ranking
Categories: Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2024

Abstract:Previous deep learning approaches for survival analysis have primarily relied on ranking losses to improve discrimination performance, which often comes at the expense of calibration performance. To address such an issue, we propose a novel contrastive learning approach specifically designed to enhance discrimination without sacrificing calibration. Our method employs weighted sampling within a contrastive learning framework, assigning lower penalties to samples with similar survival outcomes. This aligns well with the assumption that patients with similar event times share similar clinical statuses. Consequently, when augmented with the commonly used negative log-likelihood loss, our approach significantly improves discrimination performance without directly manipulating the model outputs, thereby achieving better calibration. Experiments on multiple real-world clinical datasets demonstrate that our method outperforms state-of-the-art deep survival models in both discrimination and calibration. Through comprehensive ablation studies, we further validate the effectiveness of our approach through quantitative and qualitative analyses.

[LG-30] DIAR: Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation

Link: https://arxiv.org/abs/2410.11338
Authors: Jaehyun Park,Yunho Kim,Sejin Kim,Byung-Jun Lee,Sundong Kim
Keywords: Implicit Q-learning, offline reinforcement learning, Adaptive Revaluation, Adaptive Revaluation mechanism, offline reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Preprint, under review. Comments welcome

Abstract:We propose a novel offline reinforcement learning (offline RL) approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework. We address two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate value functions for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-term decision-making. Furthermore, we address Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. As demonstrated in tasks like Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms in long-horizon, sparse-reward environments.

[LG-31] SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments

Link: https://arxiv.org/abs/2410.11331
Authors: Syed Abdul Gaffar Shakhadri,Kruthika KR,Rakshit Aralimatti
Keywords: billion parameter language, including smartphones, model specifically optimized, billion parameter, IoT systems
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Paper in pdf format is 11 pages and contains 4 tables

Abstract:We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.

[LG-32] Evolutionary Retrofitting

Link: https://arxiv.org/abs/2410.11330
Authors: Mathurin Videau(TAU),Mariia Zameshina(LIGM),Alessandro Leite(TAU),Laurent Najman(LIGM),Marc Schoenauer(TAU),Olivier Teytaud(TAU)
Keywords: Learning Evolutionary Retrofitting, fully-trained machine learning, machine learning models, standard validation set, including evolutionary methods
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Comments:

Abstract:AfterLearnER (After Learning Evolutionary Retrofitting) consists in applying non-differentiable optimization, including evolutionary methods, to refine fully-trained machine learning models by optimizing a set of carefully chosen parameters or hyperparameters of the model with respect to some actual, exact, and hence possibly non-differentiable error signal, computed on a subset of the standard validation set. The efficiency of AfterLearnER is demonstrated by tackling non-differentiable signals such as threshold-based criteria in depth sensing, the word error rate in speech re-synthesis, image quality in 3D generative adversarial networks (GANs), image generation via Latent Diffusion Models (LDM), the number of kills per life in Doom, computational accuracy or BLEU in code translation, and human appreciation in image synthesis. In some cases, this retrofitting is performed dynamically at inference time by taking into account user inputs. The advantages of AfterLearnER are its versatility (no gradient is needed), the possibility of using non-differentiable feedback including human evaluations, its limited overfitting (supported by a theoretical study), and its anytime behavior. Last but not least, AfterLearnER requires only a minimal amount of feedback, i.e., a few dozen to a few hundred scalars, rather than the tens of thousands needed in most related published works. Compared to fine-tuning (typically using the same loss, and gradient-based optimization on a smaller but still big dataset at a fine grain), AfterLearnER uses a minimum amount of data on the real objective function without requiring differentiability.
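
The retrofitting idea can be illustrated with a minimal sketch. It assumes a single tunable parameter (a decision threshold) and uses a (1+1) evolution strategy as a stand-in optimizer; the function names, the toy validation data, and the choice of optimizer are illustrative assumptions, not the authors' implementation.

```python
import random


def error_rate(threshold, scores, labels):
    """Non-differentiable objective: fraction of misclassified validation items."""
    wrong = sum((s >= threshold) != y for s, y in zip(scores, labels))
    return wrong / len(labels)


def retrofit(threshold, scores, labels, budget=200, sigma=0.2, seed=0):
    """(1+1) evolution strategy: mutate the parameter, keep the child if it is not worse."""
    rng = random.Random(seed)
    best_err = error_rate(threshold, scores, labels)
    for _ in range(budget):  # each iteration consumes one scalar of feedback
        child = threshold + rng.gauss(0.0, sigma)
        err = error_rate(child, scores, labels)
        if err <= best_err:
            threshold, best_err = child, err
    return threshold, best_err


# Hypothetical validation subset: model scores and ground-truth labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [False, False, False, False, True, True, True, True]
start = 0.0
tuned, err = retrofit(start, scores, labels)
```

Because a child is accepted only when it is at least as good as its parent, the error never increases, and the loop can be stopped at any point (the anytime behavior mentioned in the abstract).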

[LG-33] Sequential LLM Framework for Fashion Recommendation

Link: https://arxiv.org/abs/2410.11327
Authors: Han Liu,Xianfeng Tang,Tianlang Chen,Jiapeng Liu,Indu Indu,Henry Peng Zou,Peng Dai,Roberto Fernandez Galan,Michael D Porter,Dongmei Jia,Ning Zhang,Lian Xiong
Keywords: prompting major online, major online retailers, global e-commerce sector, prompting major, customer convenience
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The fashion industry is one of the leading domains in the global e-commerce sector, prompting major online retailers to employ recommendation systems for product suggestions and customer convenience. While recommendation systems have been widely studied, most are designed for general e-commerce problems and struggle with the unique challenges of the fashion domain. To address these issues, we propose a sequential fashion recommendation framework that leverages a pre-trained large language model (LLM) enhanced with recommendation-specific prompts. Our framework employs parameter-efficient fine-tuning with extensive fashion data and introduces a novel mix-up-based retrieval technique for translating text into relevant product suggestions. Extensive experiments show our proposed framework significantly enhances fashion recommendation performance.

[LG-34] Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC Task

Link: https://arxiv.org/abs/2410.11324
Authors: Yunho Kim,Jaehyun Park,Heejun Kim,Sejin Kim,Byung-Jun Lee,Sundong Kim
Keywords: Effective long-term strategies, navigate complex environments, Effective long-term, long-term strategies enable, extended horizons
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint, Under review. Comments welcome

Abstract:Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent’s ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI’s strategic reasoning capabilities.

[LG-35] KA-GNN: Kolmogorov-Arnold Graph Neural Networks for Molecular Property Prediction

Link: https://arxiv.org/abs/2410.11323
Authors: Longlong Li,Yipeng Zhang,Guanghui Wang,Kelin Xia
Keywords: Intelligence-Driven Drug Discovery, Artificial Intelligence-Driven Drug, Drug Discovery, process of Artificial, Artificial Intelligence-Driven
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Molecular property prediction is a crucial task in the process of Artificial Intelligence-Driven Drug Discovery (AIDD). The challenge of developing models that surpass traditional non-neural network methods continues to be a vibrant area of research. This paper presents a novel graph neural network model, the Kolmogorov-Arnold Network (KAN)-based Graph Neural Network (KA-GNN), which incorporates Fourier series and is specifically designed for molecular property prediction. This model maintains the high interpretability characteristic of KAN methods while being extremely efficient in computational resource usage, making it an ideal choice for deployment in resource-constrained environments. Tested and validated on seven public datasets, KA-GNN has shown significant improvements in property predictions over the existing state-of-the-art (SOTA) benchmarks.
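
As a rough illustration of what a Fourier-series KAN layer might look like, the sketch below gives every input-output edge its own learnable univariate function, written as a truncated Fourier series, and sums the edge outputs at each output node. The layer layout and names are assumptions made for illustration; the paper's actual architecture may differ.

```python
import math


def fourier_activation(x, cos_coeffs, sin_coeffs):
    """Truncated Fourier series: sum over k >= 1 of a_k*cos(k*x) + b_k*sin(k*x)."""
    return sum(a * math.cos(k * x) + b * math.sin(k * x)
               for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1))


def kan_layer(inputs, coeffs):
    """coeffs[j][i] = (cos_coeffs, sin_coeffs) for the edge from input i to output j."""
    return [sum(fourier_activation(x, *edge[i]) for i, x in enumerate(inputs))
            for edge in coeffs]


# Hypothetical layer with two inputs and one output, one harmonic per edge.
layer = [[([1.0], [0.0]), ([2.0], [0.0])]]
out = kan_layer([0.0, 0.0], layer)
```

In a KAN, the learnable parameters sit on the edges (here the coefficients a_k and b_k) rather than in a weight matrix followed by a fixed activation.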

[LG-36] Herald: A Natural Language Annotated Lean 4 Dataset

Link: https://arxiv.org/abs/2410.10878
Authors: Guoxiong Gao,Yutong Wang,Jiedong Jiang,Qi Gao,Zihan Qin,Tianyi Xu,Bin Dong
Keywords: Verifiable formal languages, impacted mathematical reasoning, Verifiable formal, formal language Lean, automated reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. The Herald Translator achieves 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of the Herald Translator, we have successfully translated a template section in the Stack project, marking notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, will be open-sourced to the public soon.

[LG-37] FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text

Link: https://arxiv.org/abs/2410.10876
Authors: Zhenyu Xu,Kun Zhang,Victor S. Sheng
Keywords: Large Language Models, Language Models, Large Language, generating highly coherent, contextually relevant text
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:The increasing use of Large Language Models (LLMs) for generating highly coherent and contextually relevant text introduces new risks, including misuse for unethical purposes such as disinformation or academic dishonesty. To address these challenges, we propose FreqMark, a novel watermarking technique that embeds detectable frequency-based watermarks in LLM-generated text during the token sampling process. The method leverages periodic signals to guide token selection, creating a watermark that can be detected with Short-Time Fourier Transform (STFT) analysis. This approach enables accurate identification of LLM-generated content, even in mixed-text scenarios with both human-authored and LLM-generated segments. Our experiments demonstrate the robustness and precision of FreqMark, showing strong detection capabilities against various attack scenarios such as paraphrasing and token substitution. Results show that FreqMark achieves an AUC improvement of up to 0.98, significantly outperforming existing detection methods.
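
To see why a short-time transform suits mixed-text detection, consider the toy sketch below. It assumes each token has already been mapped to a scalar score (that mapping, the watermark frequency, and the threshold are all hypothetical here) and evaluates a single STFT frequency bin per window. This is not the paper's token-sampling procedure, only an illustration of the detection principle.

```python
import cmath
import math
import random


def windowed_power(signal, freq, window, hop):
    """Power at one frequency (cycles per token) in each window: a single-bin STFT."""
    powers = []
    for start in range(0, len(signal) - window + 1, hop):
        segment = signal[start:start + window]
        mean = sum(segment) / window
        acc = sum((v - mean) * cmath.exp(-2j * math.pi * freq * n)
                  for n, v in enumerate(segment))
        powers.append(abs(acc) ** 2 / window)
    return powers


rng = random.Random(0)
FREQ = 0.125  # hypothetical watermark frequency: one cycle every 8 tokens
# Hypothetical per-token scores: 64 "human" tokens (noise only),
# followed by 64 "watermarked" tokens (noise plus a periodic signal).
human = [rng.uniform(-0.5, 0.5) for _ in range(64)]
marked = [rng.uniform(-0.5, 0.5) + math.sin(2 * math.pi * FREQ * t) for t in range(64)]
powers = windowed_power(human + marked, FREQ, window=32, hop=32)
flags = [p > 1.0 for p in powers]  # illustrative threshold
```

Because the power is computed per window rather than over the whole sequence, the flags localize which segments carry the periodic signal, which is what allows human-authored and generated spans to be told apart within one document.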

[LG-38] SHyPar: A Spectral Coarsening Approach to Hypergraph Partitioning

Link: https://arxiv.org/abs/2410.10875
Authors: Hamed Sajadinia,Ali Aghdaei,Zhuo Feng
Keywords: construct progressively coarser, guiding cut refinements, progressively coarser hypergraphs, paradigm to construct, construct progressively
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 14 pages, 11 figures, 4 tables

Abstract:State-of-the-art hypergraph partitioners utilize a multilevel paradigm to construct progressively coarser hypergraphs across multiple layers, guiding cut refinements at each level of the hierarchy. Traditionally, these partitioners employ heuristic methods for coarsening and do not consider the structural features of hypergraphs. In this work, we introduce a multilevel spectral framework, SHyPar, for partitioning large-scale hypergraphs by leveraging hyperedge effective resistances and flow-based community detection techniques. Inspired by the latest theoretical spectral clustering frameworks, such as HyperEF and HyperSF, SHyPar aims to decompose large hypergraphs into multiple subgraphs with few inter-partition hyperedges (cut size). A key component of SHyPar is a flow-based local clustering scheme for hypergraph coarsening, which incorporates a max-flow-based algorithm to produce clusters with substantially improved conductance. Additionally, SHyPar utilizes an effective resistance-based rating function for merging nodes that are strongly connected (coupled). Compared with existing state-of-the-art hypergraph partitioning methods, our extensive experimental results on real-world VLSI designs demonstrate that SHyPar can more effectively partition hypergraphs, achieving state-of-the-art solution quality.

[LG-39] PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches

Link: https://arxiv.org/abs/2410.10870
Authors: Rana Muhammad Shahroz Khan,Pingzhi Li,Sukwon Yun,Zhenyu Wang,Shahriar Nirjon,Chau-Wai Wong,Tianlong Chen
Keywords: large language models, achieving optimal performance, increasingly shape, large language, pre-LLM era
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As large language models (LLMs) increasingly shape the AI landscape, fine-tuning pretrained models has become more popular than in the pre-LLM era for achieving optimal performance in domain-specific tasks. However, pretrained LLMs such as ChatGPT evolve periodically (i.e., model parameters are frequently updated), making it challenging for downstream users with limited resources to keep up with fine-tuning the newest LLMs for their domain application. Even though fine-tuning costs have nowadays been reduced thanks to the innovations of parameter-efficient fine-tuning such as LoRA, not all downstream users have adequate computing for frequent personalization. Moreover, access to fine-tuning datasets, particularly in sensitive domains such as healthcare, could be time-restrictive, making it crucial to retain the knowledge encoded in earlier fine-tuned rounds for future adaptation. In this paper, we present PortLLM, a training-free framework that (i) creates an initial lightweight model update patch to capture domain-specific knowledge, and (ii) allows subsequent seamless plugging for the continual personalization of the evolved LLM at minimal cost. Our extensive experiments cover seven representative datasets, from easier question-answering tasks BoolQ, SST2 to harder reasoning tasks WinoGrande, GSM8K, and models including Mistral-7B, Llama2, Llama3.1, and Gemma2, validating the portability of our designed model patches and showcasing the effectiveness of our proposed framework. For instance, PortLLM achieves comparable performance to LoRA fine-tuning with reductions of up to 12.2x in GPU memory usage. Finally, we provide theoretical justifications to understand the portability of our model update patches, which offers new insights into the theoretical dimension of LLMs’ personalization.
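
The two steps (i) and (ii) can be pictured with a deliberately simplified sketch. It assumes the patch is the parameter difference between a fine-tuned model and the base it was tuned from, and that "plugging" means adding that difference to the evolved base. The flat weight lists and function names are hypothetical; the paper's patches are presumably low-rank and may be constructed differently.

```python
def make_patch(finetuned, base):
    """Step (i): capture the domain-specific update as a parameter difference."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_patch(evolved_base, patch):
    """Step (ii): plug the stored patch into a newer base model without training."""
    return [w + d for w, d in zip(evolved_base, patch)]


# Hypothetical toy weights for an old base, its fine-tuned version, and an evolved base.
old_base = [1.0, 2.0]
old_finetuned = [1.5, 1.0]
new_base = [2.0, 2.0]

patch = make_patch(old_finetuned, old_base)
personalized_new = apply_patch(new_base, patch)
```

Under these assumptions the fine-tuning data is needed only once, when the patch is created, which matches the abstract's concern about time-restricted access to sensitive datasets.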

[LG-40] Application of NotebookLM a Large Language Model with Retrieval-Augmented Generation for Lung Cancer Staging

Link: https://arxiv.org/abs/2410.10869
Authors: Ryota Tozuka,Hisashi Johno,Akitomo Amakawa,Junichi Sato,Mizuki Muto,Shoichiro Seki,Atsushi Komaba,Hiroshi Onishi
Keywords: large language models, recently gained attention, lung cancer, lung cancer staging, including ChatGPT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 1 table, 3 ancillary files

Abstract:Purpose: In radiology, large language models (LLMs), including ChatGPT, have recently gained attention, and their utility is being rapidly evaluated. However, concerns have emerged regarding their reliability in clinical applications due to limitations such as hallucinations and insufficient referencing. To address these issues, we focus on the latest technology, retrieval-augmented generation (RAG), which enables LLMs to reference reliable external knowledge (REK). Specifically, this study examines the utility and reliability of a recently released RAG-equipped LLM (RAG-LLM), NotebookLM, for staging lung cancer. Materials and methods: We summarized the current lung cancer staging guideline in Japan and provided this as REK to NotebookLM. We then tasked NotebookLM with staging 100 fictional lung cancer cases based on CT findings and evaluated its accuracy. For comparison, we performed the same task using a gold-standard LLM, GPT-4 Omni (GPT-4o), both with and without the REK. Results: NotebookLM achieved 86% diagnostic accuracy in the lung cancer staging experiment, outperforming GPT-4o, which recorded 39% accuracy with the REK and 25% without it. Moreover, NotebookLM demonstrated 95% accuracy in searching reference locations within the REK. Conclusion: NotebookLM successfully performed lung cancer staging by utilizing the REK, demonstrating superior performance compared to GPT-4o. Additionally, it provided highly accurate reference locations within the REK, allowing radiologists to efficiently evaluate the reliability of NotebookLM’s responses and detect possible hallucinations. Overall, this study highlights the potential of NotebookLM, a RAG-LLM, in image diagnosis. 

[LG-41] LLaCA: Multimodal Large Language Continual Assistant

Link: https://arxiv.org/abs/2410.10868
Authors: Jingyang Qiao,Zhizhong Zhang,Xin Tan,Yanyun Qu,Shouhong Ding,Yuan Xie
Keywords: Large Language Models, Multimodal Large Language, Continual Instruction Tuning, designing text instructions, Instruction tuning guides
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Instruction tuning guides Multimodal Large Language Models (MLLMs) in aligning different modalities through designed text instructions, and appears to be an essential technique for enhancing the capabilities and controllability of foundation models. In this framework, Multimodal Continual Instruction Tuning (MCIT) is adopted to continually instruct MLLMs to follow human intent across sequential datasets. We observe that existing gradient updates severely degrade tuning performance on previous datasets and zero-shot ability during continual instruction tuning. The Exponential Moving Average (EMA) update policy can trace previous parameters, which helps reduce forgetting. However, its stable balance weight cannot cope with ever-changing datasets, leading to an imbalance between the plasticity and stability of MLLMs. In this paper, we propose a method called Multimodal Large Language Continual Assistant (LLaCA) to address this challenge. Starting from the trade-off prerequisite and the EMA update, we propose an ideal condition for plasticity and stability. Using a Taylor expansion of the loss function, we find that the optimal balance weight is essentially determined by the gradient information and previous parameters. We automatically determine the balance weight and significantly improve performance. In comprehensive experiments on LLaVA-1.5 in a continual visual-question-answering benchmark, our approach, compared with the baseline, not only greatly improves anti-forgetting ability (reducing forgetting from 22.67 to 2.68) but also significantly improves continual tuning performance (increasing average accuracy from 41.31 to 61.89). Our code will be published soon.
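
For readers unfamiliar with EMA updates, the sketch below shows the standard rule, in which the averaged parameters become beta times the old average plus (1 - beta) times the newly trained parameters. The balance weight beta is passed in by hand; the paper's contribution, deriving it automatically from gradient information, is not reproduced here, and the toy parameter values are hypothetical.

```python
def ema_update(ema_params, new_params, beta):
    """Blend old and new parameters: beta near 1 favors stability, near 0 plasticity."""
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, new_params)]


ema = [1.0, -2.0]          # parameters averaged over previous datasets
after_task = [3.0, 0.0]    # parameters after tuning on the newest dataset

stable = ema_update(ema, after_task, beta=1.0)    # keeps the old parameters unchanged
plastic = ema_update(ema, after_task, beta=0.0)   # adopts the new parameters entirely
balanced = ema_update(ema, after_task, beta=0.5)  # midpoint between the two
```

The two extremes make the trade-off concrete: a weight fixed near 1 resists forgetting but learns little from new data, while a weight near 0 does the opposite.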

[LG-42] Fill In The Gaps: Model Calibration and Generalization with Synthetic Data EMNLP2024

Link: https://arxiv.org/abs/2410.10864
Authors: Yang Ba,Michelle V. Mancenido,Rong Pan
Keywords: major concern prior, swiftly advance, calibrating their performance, widespread implementation, continue to swiftly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference (Long paper)

Abstract:As machine learning models continue to swiftly advance, calibrating their performance has become a major concern prior to practical and widespread implementation. Existing calibration methods often negatively impact model accuracy due to the lack of diversity of validation data, resulting in reduced generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, are utilized as a synthetic data generation strategy to lower the ECE bound and improve model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on four different natural language processing tasks, we observed average improvements of up to a 34% increase in accuracy and a 33% decrease in ECE.
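
For reference, the commonly used binned estimate of ECE groups predictions by confidence and takes the size-weighted average gap between accuracy and mean confidence in each bin. The sketch below implements that standard estimator with equal-width bins; the bin count and the toy inputs are illustrative choices, and the paper may use a different variant.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: the model claims 90% confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In this toy case every prediction falls into one bin with mean confidence 0.9 and accuracy 0.75, so the estimate is the gap between the two, about 0.15.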

[LG-43] Superficial Safety Alignment Hypothesis

Link: https://arxiv.org/abs/2410.10862
Authors: Jianwei Li,Jung-Eun Kim
Keywords: large language models, safety alignment, safety, ensuring they generate, alignment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following but has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction - interpreted as a specialized binary classification task - and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and successfully identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units (20%) in the pre-trained model as an “alignment budget” can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.

[LG-44] Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths EMNLP2024

Link: https://arxiv.org/abs/2410.10858
Authors: Yew Ken Chia,Guizhen Chen,Weiwen Xu,Luu Anh Tuan,Soujanya Poria,Lidong Bing
Keywords: exhibit impressive problem-solving, impressive problem-solving capabilities, Reasoning Paths Optimization, Advanced models, exhibit impressive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2024 camera ready version

Abstract:Advanced models such as OpenAI o1 exhibit impressive problem-solving capabilities through step-by-step reasoning. However, they may still falter on more complex problems, making errors that disrupt their reasoning paths. We attribute this to the expansive solution space, where each step has the risk of diverging into mistakes. To enhance language model reasoning, we introduce a specialized training framework called Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths. Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model’s overall problem-solving performance. Reasoning Paths Optimization does not rely on large-scale human-annotated rationales or outputs from closed-source models, making it scalable and data-efficient. We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions. The experiments demonstrate that our framework significantly enhances the reasoning performance of large language models, with up to 3.1% and 4.3% improvement on GSM8K and MMLU (STEM) respectively. Our data and code can be found at this https URL.

[LG-45] LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis

Link: https://arxiv.org/abs/2410.10851
Authors: Haozhou Pang,Tianwei Ding,Lanshan He,Qi Gan
Keywords: synthesizes full-body animations, exhibiting natural movements, present LLM Gesticulator, LLM-based audio-driven co-speech, movements and editability
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In this work, we present LLM Gesticulator, an LLM-based audio-driven co-speech gesture generation framework that synthesizes full-body animations that are rhythmically aligned with the input audio while exhibiting natural movements and editability. Compared to previous work, our model demonstrates substantial scalability. As the size of the backbone LLM increases, our framework shows proportional improvements in evaluation metrics (a.k.a. scaling law). Our method also exhibits strong controllability, where the content and style of the generated gestures can be controlled by text prompts. To the best of our knowledge, LLM Gesticulator is the first work that uses an LLM on the co-speech generation task. Evaluation with existing objective metrics and user studies indicates that our framework outperforms prior works.

[LG-46] Continuous Approximations for Improving Quantization Aware Training of LLMs

Link: https://arxiv.org/abs/2410.10849
Authors: He Li,Jianhang Hong,Yuanzhuo Wu,Snehal Adbol,Zonglin Li
Keywords: Large Language Models, Large Language, requirements for Large, Language Models, Model compression methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Model compression methods are used to reduce the computation and energy requirements for Large Language Models (LLMs). Quantization Aware Training (QAT), an effective model compression method, is proposed to reduce performance degradation after quantization. To further minimize this degradation, we introduce two continuous approximations to the QAT process on the rounding function, traditionally approximated by the Straight-Through Estimator (STE), and the clamping function. By applying both methods, the perplexity (PPL) of the quantized model on the WikiText-v2 dataset reaches 9.0815, outperforming the baseline's 9.9621. We also achieve a 2.76% improvement on BoolQ and a 5.47% improvement on MMLU, showing that the step sizes and weights can be learned more accurately with our approach. Our method achieves better performance with the same precision, model size, and training setup, contributing to the development of more energy-efficient LLM technology that aligns with global sustainability goals.
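
The motivation can be illustrated with a small sketch. Hard rounding has zero gradient almost everywhere, so STE simply pretends the gradient is 1. A smooth surrogate instead provides a gradient that depends on where the value sits between two integers. The sigmoid-based surrogate below is a generic illustrative choice, not the approximation proposed in the paper, and all names and parameters are assumptions.

```python
import math


def hard_round(x):
    """Round half up; its true derivative is 0 almost everywhere (STE substitutes 1)."""
    return math.floor(x + 0.5)


def _sigmoid_of_fraction(x, sharpness):
    frac = x - math.floor(x)
    return 1.0 / (1.0 + math.exp(-sharpness * (frac - 0.5)))


def soft_round(x, sharpness=10.0):
    """Smooth surrogate within each unit interval: floor(x) + sigmoid(k * (frac - 0.5))."""
    return math.floor(x) + _sigmoid_of_fraction(x, sharpness)


def soft_round_grad(x, sharpness=10.0):
    """Derivative of soft_round inside a unit interval: k * s * (1 - s)."""
    s = _sigmoid_of_fraction(x, sharpness)
    return sharpness * s * (1.0 - s)


def fake_quantize(x, step, qmin, qmax):
    """Forward pass of uniform quantization: scale, round, clamp, rescale."""
    q = max(qmin, min(qmax, hard_round(x / step)))
    return q * step
```

With a large `sharpness` the surrogate approaches hard rounding, while its gradient peaks at the rounding boundary, which is information STE's constant gradient discards.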

[LG-47] Lotus: learning-based online thermal and latency variation management for two-stage detectors on edge devices

Link: https://arxiv.org/abs/2410.10847
Authors: Yifan Gong,Yushu Wu,Zheng Zhan,Pu Zhao,Liangkai Liu,Chao Wu,Xulong Tang,Yanzhi Wang
Keywords: identifying small objects, exhibit high accuracy, object detectors exhibit, Two-stage object detectors, small objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: DAC’24, code is available at: this https URL

Abstract:Two-stage object detectors exhibit high accuracy and precise localization, especially for identifying small objects that are favorable for various edge applications. However, the high computation costs associated with two-stage detection methods cause more severe thermal issues on edge devices, incurring dynamic runtime frequency change and thus large inference latency variations. Furthermore, the dynamic number of proposals in different frames leads to various computations over time, resulting in further latency variations. The significant latency variations of detectors on edge devices can harm user experience and waste hardware resources. To avoid thermal throttling and provide stable inference speed, we propose Lotus, a novel framework that is tailored for two-stage detectors to dynamically scale CPU and GPU frequencies jointly in an online manner based on deep reinforcement learning (DRL). To demonstrate the effectiveness of Lotus, we implement it on NVIDIA Jetson Orin Nano and Mi 11 Lite mobile platforms. The results indicate that Lotus can consistently and significantly reduce latency variation, achieve faster inference, and maintain lower CPU and GPU temperatures under various settings.

[LG-48] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models

Link: https://arxiv.org/abs/2410.10846
Authors: Keivan Alizadeh,Iman Mirzadeh,Hooman Shahrokhi,Dmitry Belenko,Frank Sun,Minsik Cho,Mohammad Hossein Sekhavat,Moin Nabi,Mehrdad Farajtabar
Keywords: fixed compute budget, typically generate outputs, Large Language Models, Language Models, inefficient resource utilization
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture of expert (MoE) models, speculative decoding, and early exit strategies leverage the insight that computational demands can vary significantly based on the complexity and nature of the input. However, identifying optimal routing patterns for dynamic execution remains an open challenge, limiting the full potential of these adaptive methods. To address this need, we study adaptive computation in LLMs more systematically. We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM. This design enables dynamic routing of tokens based on task complexity: tokens can be processed by either the small or big modules at each layer, or even bypass certain layers entirely. This allows us to introduce a novel notion of a token’s difficulty, defined by its potential to benefit from additional computational resources. Importantly, by employing oracles to identify optimal patterns of adaptive computations, we gain valuable insights into the internal workings of LLMs and the routing processes in a simplified heterogeneous MoE setup. We show that trained routers operate differently from oracles and often yield suboptimal solutions. Notably, activating a large module in just one layer outperforms models that use large modules across all layers, underscoring the gap between practical implementations of routing in MoE models and theoretical optima for adaptive computation.

[LG-49] DIIT: A Domain-Invariant Information Transfer Method for Industrial Cross-Domain Recommendation CIKM2024

Link: https://arxiv.org/abs/2410.10835
Authors: Heyuan Huang,Xingyu Lou,Chaochao Chen,Pengxiang Cheng,Yue Xin,Chengwei He,Xiang Liu,Jun Wang
Keywords: received widespread attention, widespread attention due, existing CDR methods, utilize rich information, CDR methods
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at CIKM 2024

Abstract:Cross-Domain Recommendation (CDR) has received widespread attention due to its ability to utilize rich information across domains. However, most existing CDR methods assume an ideal static condition that is not practical in industrial recommendation systems (RS). Therefore, simply applying existing CDR methods in the industrial RS environment may lead to low effectiveness and efficiency. To fill this gap, we propose DIIT, an end-to-end Domain-Invariant Information Transfer method for industrial cross-domain recommendation. Specifically, we first simulate the industrial RS environment that maintains respective models in multiple domains, each of which is trained in the incremental mode. Then, to improve effectiveness, we design two extractors to fully extract domain-invariant information from the latest source domain models at the domain level and the representation level respectively. Finally, to improve efficiency, we design a migrator to transfer the extracted information to the latest target domain model, so that only the target domain model is needed for inference. Experiments conducted on one production dataset and two public datasets verify the effectiveness and efficiency of DIIT.

[LG-50] Focus On What Matters: Separated Models For Visual-Based RL Generalization

Link: https://arxiv.org/abs/2410.10834
Authors: Di Zhang, Bowen Lv, Hai Zhang, Feifan Yang, Junqiao Zhao, Hang Yu, Chang Huang, Hongtu Zhou, Chen Ye, Changjun Jiang
Keywords: visual-based Reinforcement Learning, visual-based Reinforcement, Reinforcement Learning, unseen environments, primary challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:A primary challenge for visual-based Reinforcement Learning (RL) is to generalize effectively across unseen environments. Although previous studies have explored different auxiliary tasks to enhance generalization, few adopt image reconstruction due to concerns about exacerbating overfitting to task-irrelevant features during training. Recognizing the strength of image reconstruction in representation learning, we propose SMG (Separated Models for Generalization), a novel approach that exploits image reconstruction for generalization. SMG introduces two model branches to extract task-relevant and task-irrelevant representations separately from visual observations via cooperative reconstruction. Built upon this architecture, we further emphasize the importance of task-relevant features for generalization. Specifically, SMG incorporates two additional consistency losses to guide the agent’s focus toward task-relevant areas across different scenarios, thereby avoiding overfitting. Extensive experiments in DMC demonstrate the SOTA performance of SMG in generalization, particularly excelling in video-background settings. Evaluations on robotic manipulation tasks further confirm the robustness of SMG in real-world applications.

[LG-51] Online Client Scheduling and Resource Allocation for Efficient Federated Edge Learning

Link: https://arxiv.org/abs/2410.10833
Authors: Zhidong Gao, Zhenxiao Zhang, Yu Zhang, Tongnian Wang, Yanmin Gong, Yuanxiong Guo
Keywords: Federated learning, enables edge devices, machine learning model, machine learning, devices to collaboratively
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures

Abstract:Federated learning (FL) enables edge devices to collaboratively train a machine learning model without sharing their raw data. Due to its privacy-protecting benefits, FL has been deployed in many real-world applications. However, deploying FL over mobile edge networks with constrained resources such as power, bandwidth, and computation suffers from high training latency and low model accuracy, particularly under data and system heterogeneity. In this paper, we investigate the optimal client scheduling and resource allocation for FL over mobile edge networks under resource constraints and uncertainty to minimize the training latency while maintaining the model accuracy. Specifically, we first analyze the impact of client sampling on model convergence in FL and formulate a stochastic optimization problem that captures the trade-off between the running time and model performance under heterogeneous and uncertain system resources. To solve the formulated problem, we further develop an online control scheme based on Lyapunov-based optimization for client sampling and resource allocation without requiring the knowledge of future dynamics in the FL system. Extensive experimental results demonstrate that the proposed scheme can improve both the training latency and resource efficiency compared with the existing schemes.

[LG-52] Test Case-Informed Knowledge Tracing for Open-ended Coding Tasks

Link: https://arxiv.org/abs/2410.10829
Authors: Zhangqi Duan, Nigel Fernandez, Alexander Hicks, Andrew Lan
Keywords: computer science education, Open-ended coding tasks, science education, Open-ended, code
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Open-ended coding tasks, which ask students to construct programs according to certain specifications, are common in computer science education. Student modeling can be challenging since their open-ended nature means that student code can be diverse. Traditional knowledge tracing (KT) models that only analyze response correctness may not fully capture nuances in student knowledge from student code. In this paper, we introduce Test case-Informed Knowledge Tracing for Open-ended Coding (TIKTOC), a framework to simultaneously analyze and predict both open-ended student code and whether the code passes each test case. We augment the existing CodeWorkout dataset with the test cases used for a subset of the open-ended coding questions, and propose a multi-task learning KT method to simultaneously analyze and predict 1) whether a student’s code submission passes each test case and 2) the student’s open-ended code, using a large language model as the backbone. We quantitatively show that these methods outperform existing KT methods for coding that only use the overall score a code submission receives. We also qualitatively demonstrate how test case information, combined with open-ended code, helps us gain fine-grained insights into student knowledge.

[LG-53] High-Fidelity 3D Lung CT Synthesis in ARDS Swine Models Using Score-Based 3D Residual Diffusion Models

Link: https://arxiv.org/abs/2410.10826
Authors: Siyeop Yoon, Yujin Oh, Xiang Li, Yi Xin, Maurizio Cereda, Quanzheng Li
Keywords: Acute respiratory distress, respiratory distress syndrome, severe condition characterized, high mortality rate, Acute respiratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 5 pages, 3 figures, Submitted to SPIE 2025-Medical Imaging

Abstract:Acute respiratory distress syndrome (ARDS) is a severe condition characterized by lung inflammation and respiratory failure, with a high mortality rate of approximately 40%. Traditional imaging methods, such as chest X-rays, provide only two-dimensional views, limiting their effectiveness in fully assessing lung pathology. Three-dimensional (3D) computed tomography (CT) offers a more comprehensive visualization, enabling detailed analysis of lung aeration, atelectasis, and the effects of therapeutic interventions. However, the routine use of CT in ARDS management is constrained by practical challenges and risks associated with transporting critically ill patients to remote scanners. In this study, we synthesize high-fidelity 3D lung CT from 2D generated X-ray images with associated physiological parameters using a score-based 3D residual diffusion model. Our preliminary results demonstrate that this approach can produce high-quality 3D CT images that are validated with ground truth, offering a promising solution for enhancing ARDS management.

[LG-54] Bayesian Experimental Design via Contrastive Diffusions

Link: https://arxiv.org/abs/2410.11826
Authors: Jacopo Iollo, Christophe Heinkelé, Pierre Alliez, Florence Forbes
Keywords: Bayesian Optimal Experimental, Optimal Experimental Design, Bayesian Optimal, Optimal Experimental, Experimental Design
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Bayesian Optimal Experimental Design (BOED) is a powerful tool to reduce the cost of running a sequence of experiments. When based on the Expected Information Gain (EIG), design optimization corresponds to the maximization of some intractable expected contrast between prior and posterior distributions. Scaling this maximization to high-dimensional and complex settings has been an issue due to the inherent computational complexity of BOED. In this work, we introduce an expected posterior distribution with cost-effective sampling properties and provide tractable access to the EIG contrast maximization via a new EIG gradient expression. Diffusion-based samplers are used to compute the dynamics of the expected posterior, and ideas from bi-level optimization are leveraged to derive an efficient joint sampling-optimization loop, without resorting to lower bound approximations of the EIG. The resulting efficiency gain makes it possible to extend BOED to the well-tested generative capabilities of diffusion models. By incorporating generative models into the BOED framework, we expand its scope and its use in scenarios that were previously impractical. Numerical experiments and comparison with state-of-the-art methods show the potential of the approach.
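
For readers unfamiliar with the EIG, the "contrast between prior and posterior" mentioned in the abstract can be written in its standard textbook form. The symbols below ($\xi$ for the design, $\theta$ for the parameters, $y$ for the observation) are notational assumptions, not taken from the paper, and this is the generic definition rather than the paper's new gradient expression.

```latex
\mathrm{EIG}(\xi)
  = \mathbb{E}_{p(y \mid \xi)}\!\left[
      \mathrm{KL}\!\left( p(\theta \mid y, \xi) \,\|\, p(\theta) \right)
    \right]
  = \mathbb{E}_{p(\theta)\, p(y \mid \theta, \xi)}\!\left[
      \log p(y \mid \theta, \xi) - \log p(y \mid \xi)
    \right],
\qquad
\xi^{*} = \arg\max_{\xi} \mathrm{EIG}(\xi).
```

The intractability comes from the marginal likelihood $p(y \mid \xi)$ and the posterior $p(\theta \mid y, \xi)$, neither of which is generally available in closed form.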

[LG-55] Regional Ocean Forecasting with Hierarchical Graph Neural Networks NEURIPS2024

Link: https://arxiv.org/abs/2410.11807
Authors: Daniel Holmberg, Emanuela Clementi, Teemu Roos
Keywords: climate adaptation strategies, understanding marine dynamics, Accurate ocean forecasting, adaptation strategies, Accurate ocean
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 28 pages, 35 figures. Accepted to the Tackling Climate Change with Machine Learning workshop at NeurIPS 2024

Abstract:Accurate ocean forecasting systems are vital for understanding marine dynamics, which play a crucial role in environmental management and climate adaptation strategies. Traditional numerical solvers, while effective, are computationally expensive and time-consuming. Recent advancements in machine learning have revolutionized weather forecasting, offering fast and energy-efficient alternatives. Building on these advancements, we introduce SeaCast, a neural network designed for high-resolution, medium-range ocean forecasting. SeaCast employs a graph-based framework to effectively handle the complex geometry of ocean grids and integrates external forcing data tailored to the regional ocean context. Our approach is validated through experiments at a high spatial resolution using the operational numerical model of the Mediterranean Sea provided by the Copernicus Marine Service, along with both numerical and data-driven atmospheric forcings.

[LG-56] A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets NEURIPS2024

Link: https://arxiv.org/abs/2410.10924
Authors: Kyungeun Lee, Wonjong Rhee
Keywords: Mutual Information, random variables, fundamental metric, metric for quantifying, quantifying dependency
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2024

Abstract:Mutual Information (MI) is a fundamental metric for quantifying dependency between two random variables. When we can access only the samples, but not the underlying distribution functions, we can evaluate MI using sample-based estimators. Assessment of such MI estimators, however, has almost always relied on analytical datasets including Gaussian multivariates. Such datasets allow analytical calculations of the true MI values, but they are limited in that they do not reflect the complexities of real-world datasets. This study introduces a comprehensive benchmark suite for evaluating neural MI estimators on unstructured datasets, specifically focusing on images and texts. By leveraging same-class sampling for positive pairing and introducing a binary symmetric channel trick, we show that we can accurately manipulate true MI values of real-world datasets. Using the benchmark suite, we investigate seven challenging scenarios, shedding light on the reliability of neural MI estimators for unstructured datasets.
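
To illustrate why a binary symmetric channel (BSC) gives control over the true MI, here is a minimal sketch of the textbook identity: for a uniform bit X and Y = X XOR noise with flip probability p, I(X;Y) = 1 - H_b(p) bits. The function names are hypothetical, and how the paper combines this with same-class sampling is not specified in the abstract, so this shows only the underlying identity.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H_b(p) in bits, using the convention 0 * log(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(flip_prob: float) -> float:
    """MI in bits between a uniform bit X and Y = X XOR Bernoulli(flip_prob)."""
    return 1.0 - binary_entropy(flip_prob)
```

Sweeping `flip_prob` from 0 to 0.5 moves the true MI from 1 bit down to 0 bits, which is the kind of knob a benchmark needs to set a known ground-truth value.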

[LG-57] EPi-cKANs: Elasto-Plasticity Informed Kolmogorov-Arnold Networks Using Chebyshev Polynomials

Link: https://arxiv.org/abs/2410.10897
Authors: Farinaz Mostajeran, Salah A Faroughi
Keywords: Multilayer perceptron, develop data-driven constitutive, constitutive models, physics-based constitutive models, data-driven constitutive models
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 20 pages, 13 figures

Abstract:Multilayer perceptron (MLP) networks are predominantly used to develop data-driven constitutive models for granular materials. They offer a compelling alternative to traditional physics-based constitutive models in predicting nonlinear responses of these materials, e.g., elasto-plasticity, under various loading conditions. To attain the necessary accuracy, MLPs often need to be sufficiently deep or wide, owing to the curse of dimensionality inherent in these problems. To overcome this limitation, we present an elasto-plasticity informed Chebyshev-based Kolmogorov-Arnold network (EPi-cKAN) in this study. This architecture leverages the benefits of KANs and augmented Chebyshev polynomials, as well as integrates physical principles within both the network structure and the loss function. The primary objective of EPi-cKAN is to provide an accurate and generalizable function approximation for non-linear stress-strain relationships, using fewer parameters compared to standard MLPs. To evaluate the efficiency, accuracy, and generalization capabilities of EPi-cKAN in modeling complex elasto-plastic behavior, we initially compare its performance with other cKAN-based models, which include purely data-driven parallel and serial architectures. Furthermore, to differentiate EPi-cKAN’s distinct performance, we also compare it against purely data-driven and physics-informed MLP-based methods. Lastly, we test EPi-cKAN’s ability to predict blind strain-controlled paths that extend beyond the training data distribution to gauge its generalization and predictive capabilities. Our findings indicate that, even with limited data and fewer parameters compared to other approaches, EPi-cKAN provides superior accuracy in predicting stress components and demonstrates better generalization when used to predict sand elasto-plastic behavior under blind triaxial axisymmetric strain-controlled loading paths.

[LG-58] COME: Test-time adaption by Conservatively Minimizing Entropy

Link: https://arxiv.org/abs/2410.10894
Authors: Qingyang Zhang, Yatao Bian, Xinke Kong, Peilin Zhao, Changqing Zhang
Keywords: Machine learning models, Machine learning, open world, continuously self-adjust, Machine
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Ongoing work

Abstract:Machine learning models must continuously self-adjust to novel data distributions in the open world. As the predominant principle, entropy minimization (EM) has been proven to be a simple yet effective cornerstone in existing test-time adaption (TTA) methods. Unfortunately, its fatal limitation (i.e., overconfidence) tends to result in model collapse. To address this issue, we propose to Conservatively Minimize the Entropy (COME), which is a simple drop-in replacement of traditional EM that elegantly addresses the limitation. In essence, COME explicitly models the uncertainty by characterizing a Dirichlet prior distribution over model predictions during TTA. By doing so, COME naturally regularizes the model to favor conservative confidence on unreliable samples. Theoretically, we provide a preliminary analysis to reveal the ability of COME to enhance optimization stability by introducing a data-adaptive lower bound on the entropy. Empirically, our method achieves state-of-the-art performance on commonly used benchmarks, showing significant improvements in terms of classification accuracy and uncertainty estimation under various settings including standard, life-long and open-world TTA, with up to 34.5% improvement on accuracy and 15.1% on false positive rate.
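
As background, the quantity that plain entropy minimization drives down can be sketched as the Shannon entropy of the softmax prediction. This is only the baseline objective that COME is described as replacing; the Dirichlet-based formulation is not detailed in the abstract and is not reproduced here. Function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)
```

Because the entropy keeps falling as one logit grows without bound, minimizing it on unreliable test samples pushes the model toward ever more confident predictions, which is the overconfidence problem the abstract refers to.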

[LG-59] Replicable Uniformity Testing NEURIPS2024

Link: https://arxiv.org/abs/2410.10892
Authors: Sihan Liu, Christopher Ye
Keywords: Uniformity, Uniformity testing, varepsilon, distribution testing problems, Machine Learning
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: To appear in NeurIPS 2024

Abstract:Uniformity testing is arguably one of the most fundamental distribution testing problems. Given sample access to an unknown distribution $\mathbf{p}$ on $[n]$, one must decide if $\mathbf{p}$ is uniform or $\varepsilon$-far from uniform (in total variation distance). A long line of work established that uniformity testing has sample complexity $\Theta(\sqrt{n}\,\varepsilon^{-2})$. However, when the input distribution is neither uniform nor far from uniform, known algorithms may have highly non-replicable behavior. Consequently, if these algorithms are applied in scientific studies, they may lead to contradictory results that erode public trust in science. In this work, we revisit uniformity testing under the framework of algorithmic replicability [STOC '22], requiring the algorithm to be replicable under arbitrary distributions. While replicability typically incurs a $\rho^{-2}$ factor overhead in sample complexity, we obtain a replicable uniformity tester using only $\tilde{O}(\sqrt{n}\,\varepsilon^{-2}\rho^{-1})$ samples. To our knowledge, this is the first replicable learning algorithm with (nearly) linear dependence on $\rho$. Lastly, we consider a class of "symmetric" algorithms [FOCS '00] whose outputs are invariant under relabeling of the domain $[n]$, which includes all existing uniformity testers (including ours). For this natural class of algorithms, we prove a nearly matching sample complexity lower bound for replicable uniformity testing.
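
For context, a classical (non-replicable) uniformity tester can be sketched with collision counting. This is the well-known collision-based approach, not the replicable tester proposed in the paper. It relies on two facts: a uniform distribution on n elements has pairwise collision probability exactly 1/n, while any distribution that is eps-far in total variation has collision probability at least (1 + 4*eps^2)/n. The function names and the midpoint threshold are illustrative choices.

```python
from collections import Counter


def collision_statistic(samples):
    """Fraction of unordered sample pairs that take the same value.

    Requires at least two samples.
    """
    m = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (m * (m - 1) / 2)


def looks_uniform(samples, n, eps):
    """Accept if the collision rate is at most the midpoint (1 + 2*eps^2) / n."""
    return collision_statistic(samples) <= (1 + 2 * eps ** 2) / n
```

When the true collision probability sits near the threshold, two independent runs can easily land on opposite sides of it, which is the non-replicable behavior the abstract describes for distributions that are neither uniform nor far from uniform.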

[LG-60] Adaptive AI-Driven Material Synthesis: Towards Autonomous 2D Materials Growth

Link: https://arxiv.org/abs/2410.10885
Authors: Leonardo Sabattini, Annalisa Coriolano, Corneel Casert, Stiven Forti, Edward S. Barnard, Fabio Beltram, Massimiliano Pontil, Stephen Whitelam, Camilla Coletti, Antonio Rossi
Keywords: revolutionize current solid-state, current solid-state technology, extraordinary properties, poised to revolutionize, revolutionize current
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Two-dimensional (2D) materials are poised to revolutionize current solid-state technology with their extraordinary properties. Yet, the primary challenge remains their scalable production. While there have been significant advancements, much of the scientific progress has depended on the exfoliation of materials, a method that poses severe challenges for large-scale applications. With the advent of artificial intelligence (AI) in materials science, innovative synthesis methodologies are now on the horizon. This study explores the forefront of autonomous materials synthesis using an artificial neural network (ANN) trained by evolutionary methods, focusing on the efficient production of graphene. Our approach demonstrates that a neural network can iteratively and autonomously learn a time-dependent protocol for the efficient growth of graphene, without requiring pretraining on what constitutes an effective recipe. Evaluation criteria are based on the proximity of the Raman signature to that of monolayer graphene: higher scores are granted to outcomes whose spectrum more closely resembles that of an ideal continuous monolayer structure. This feedback mechanism allows for iterative refinement of the ANN’s time-dependent synthesis protocols, progressively improving sample quality. Through the advancement and application of AI methodologies, this work makes a substantial contribution to the field of materials engineering, fostering a new era of innovation and efficiency in the synthesis process.

Information Retrieval

[IR-0] GaVaMoE: Gaussian-Variational Gated Mixture of Experts for Explainable Recommendation

Link: https://arxiv.org/abs/2410.11841
Authors: Fei Tang, Yongliang Shen, Hang Zhang, Zeqi Tan, Wenqi Zhang, Guiyang Hou, Kaitao Song, Weiming Lu, Yueting Zhuang
Keywords: Large language model-based, systems show promise, Large language, language model-based explainable, model-based explainable recommendation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Large language model-based explainable recommendation (LLM-based ER) systems show promise in generating human-like explanations for recommendations. However, they face challenges in modeling user-item collaborative preferences, personalizing explanations, and handling sparse user-item interactions. To address these issues, we propose GaVaMoE, a novel Gaussian-Variational Gated Mixture of Experts framework for explainable recommendation. GaVaMoE introduces two key components: (1) a rating reconstruction module that employs Variational Autoencoder (VAE) with a Gaussian Mixture Model (GMM) to capture complex user-item collaborative preferences, serving as a pre-trained multi-gating mechanism; and (2) a set of fine-grained expert models coupled with the multi-gating mechanism for generating highly personalized explanations. The VAE component models latent factors in user-item interactions, while the GMM clusters users with similar behaviors. Each cluster corresponds to a gate in the multi-gating mechanism, routing user-item pairs to appropriate expert models. This architecture enables GaVaMoE to generate tailored explanations for specific user types and preferences, mitigating data sparsity by leveraging user similarities. Extensive experiments on three real-world datasets demonstrate that GaVaMoE significantly outperforms existing methods in explanation quality, personalization, and consistency. Notably, GaVaMoE exhibits robust performance in scenarios with sparse user-item interactions, maintaining high-quality explanations even for users with limited historical data.

[IR-1] Enhance Graph Alignment for Large Language Models

Link: https://arxiv.org/abs/2410.11370
Authors: Haitong Luo, Xuying Meng, Suhang Wang, Tianxiang Zhao, Fali Wang, Hanyun Cao, Yujun Zhang
Keywords: Large Language Models, Graph-structured data, Large Language, Alignment Large Language, real world
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review

Abstract:Graph-structured data is prevalent in the real world. Recently, owing to their powerful emergent capabilities, Large Language Models (LLMs) have shown promising performance in modeling graphs. The key to effectively applying LLMs on graphs is converting graph data into a format LLMs can comprehend. Graph-to-token approaches are popular in enabling LLMs to process graph information. They transform graphs into sequences of tokens and align them with text tokens through instruction tuning, where self-supervised instruction tuning helps LLMs acquire general knowledge about graphs, and supervised fine-tuning specializes LLMs for the downstream tasks on graphs. Despite their initial success, we find that existing methods have a misalignment between self-supervised tasks and supervised downstream tasks, resulting in negative transfer from self-supervised fine-tuning to downstream tasks. To address this issue, we propose Graph Alignment Large Language Models (GALLM) to benefit from aligned task templates. In the self-supervised tuning stage, we introduce a novel text matching task using templates aligned with downstream tasks. In the task-specific tuning stage, we propose two category prompt methods that learn supervision information from additional explanation with further aligned templates. Experimental evaluations on four datasets demonstrate substantial improvements in supervised learning, multi-dataset generalizability, and particularly in zero-shot capability, highlighting the model’s potential as a graph foundation model.

[IR-2] Reducing Labeling Costs in Sentiment Analysis via Semi-Supervised Learning

Link: https://arxiv.org/abs/2410.11355
Authors: Minoo Jafarlou, Mario M. Kubek
Keywords: noteworthy challenge, challenge in machine, machine learning, learning, Labeling datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures, accepted at the 2024 8th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2024), Okayama, Japan, 2024

Abstract:Labeling datasets is a noteworthy challenge in machine learning, both in terms of cost and time. This research offers an efficient answer: by exploring label propagation in semi-supervised learning, we can significantly reduce the number of labels required compared to traditional methods. We employ a transductive label propagation method based on the manifold assumption for text classification. Our approach uses a graph-based method to generate pseudo-labels for unlabeled data, which are then used to train deep neural networks. By extending labels based on cosine proximity within a nearest neighbor graph built from network embeddings, we incorporate unlabeled data into supervised learning, thereby reducing labeling costs. Building on previous successes in other domains, this study evaluates the approach’s effectiveness in sentiment analysis, presenting insights into semi-supervised learning.
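
The idea of extending labels by cosine proximity within a nearest neighbor graph can be sketched in a few lines of plain Python. This is a generic clamped label propagation loop under stated assumptions (label -1 marks an unlabeled point, `k` neighbors per point, a fixed number of iterations, non-zero embedding vectors); the function names and parameters are hypothetical and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def propagate_labels(embeddings, labels, k=2, n_iter=20):
    """Return a pseudo-label per point; labels[i] == -1 means unlabeled."""
    n = len(embeddings)
    n_classes = max(labels) + 1
    # Build a symmetric k-nearest-neighbor graph weighted by cosine similarity.
    weights = [dict() for _ in range(n)]
    for i in range(n):
        sims = [(cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i]
        for sim, j in sorted(sims, reverse=True)[:k]:
            w = max(sim, 0.0)  # ignore negative similarities
            weights[i][j] = w
            weights[j][i] = w
    # Seed scores: one-hot rows for labeled points, all zeros for unlabeled ones.
    seeds = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    scores = [row[:] for row in seeds]
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            if labels[i] >= 0:
                new_scores.append(seeds[i][:])  # clamp known labels
                continue
            total = sum(weights[i].values()) or 1.0
            new_scores.append([
                sum(w * scores[j][c] for j, w in weights[i].items()) / total
                for c in range(n_classes)
            ])
        scores = new_scores
    # Pick the highest-scoring class for every point.
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]
```

The resulting pseudo-labels would then serve as training targets for a downstream classifier, which is the step the abstract describes as training deep neural networks on the propagated labels.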

[IR-3] Sequential LLM Framework for Fashion Recommendation

Link: https://arxiv.org/abs/2410.11327
Authors: Han Liu, Xianfeng Tang, Tianlang Chen, Jiapeng Liu, Indu Indu, Henry Peng Zou, Peng Dai, Roberto Fernandez Galan, Michael D Porter, Dongmei Jia, Ning Zhang, Lian Xiong
Keywords: prompting major online, major online retailers, global e-commerce sector, prompting major, customer convenience
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The fashion industry is one of the leading domains in the global e-commerce sector, prompting major online retailers to employ recommendation systems for product suggestions and customer convenience. While recommendation systems have been widely studied, most are designed for general e-commerce problems and struggle with the unique challenges of the fashion domain. To address these issues, we propose a sequential fashion recommendation framework that leverages a pre-trained large language model (LLM) enhanced with recommendation-specific prompts. Our framework employs parameter-efficient fine-tuning with extensive fashion data and introduces a novel mix-up-based retrieval technique for translating text into relevant product suggestions. Extensive experiments show our proposed framework significantly enhances fashion recommendation performance.

[IR-4] Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics

Link: https://arxiv.org/abs/2410.10867
Authors: Théo Gigant (L2S), Camille Guinaudeau (STL, LISN), Marc Decombas, Frédéric Dufaux (L2S)
Keywords: evaluate abstractive summarization, abstractive summarization systems, Automatic metrics, proxies to evaluate, evaluate abstractive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)

Abstract:Automatic metrics are used as proxies to evaluate abstractive summarization systems when human annotations are too expensive. To be useful, these metrics should be fine-grained, show a high correlation with human annotations, and ideally be independent of reference quality; however, most standard evaluation metrics for summarization are reference-based, and existing reference-free metrics correlate poorly with relevance, especially on summaries of longer documents. In this paper, we introduce a reference-free metric that correlates well with human-evaluated relevance, while being very cheap to compute. We show that this metric can also be used alongside reference-based metrics to improve their robustness in low-quality reference settings.

[IR-5] DIIT: A Domain-Invariant Information Transfer Method for Industrial Cross-Domain Recommendation CIKM2024

Link: https://arxiv.org/abs/2410.10835
Authors: Heyuan Huang, Xingyu Lou, Chaochao Chen, Pengxiang Cheng, Yue Xin, Chengwei He, Xiang Liu, Jun Wang
Keywords: received widespread attention, widespread attention due, existing CDR methods, utilize rich information, CDR methods
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at CIKM 2024

Abstract:Cross-Domain Recommendation (CDR) methods have received widespread attention due to their ability to utilize rich information across domains. However, most existing CDR methods assume an ideal static condition that is not practical in industrial recommendation systems (RS). Therefore, simply applying existing CDR methods in the industrial RS environment may lead to low effectiveness and efficiency. To fill this gap, we propose DIIT, an end-to-end Domain-Invariant Information Transfer method for industrial cross-domain recommendation. Specifically, we first simulate the industrial RS environment that maintains respective models in multiple domains, each of which is trained in the incremental mode. Then, to improve effectiveness, we design two extractors to fully extract domain-invariant information from the latest source domain models at the domain level and the representation level, respectively. Finally, to improve efficiency, we design a migrator to transfer the extracted information to the latest target domain model, so that only the target domain model is needed for inference. Experiments conducted on one production dataset and two public datasets verify the effectiveness and efficiency of DIIT.

Attachment Download

Click to download today's full paper list