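This post lists the latest papers retrieved from arXiv.org on 2024-10-10. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. To receive the daily paper list by email, leave your email address in the comments.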
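Contents
Overview (2024-10-10)
1059 papers were updated today, including:
- Natural Language Processing: 173 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 300 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 227 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 414 papers (Machine Learning (cs.LG))
Natural Language Processing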
[NLP-0] Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
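[Quick read]: This paper addresses the irrelevant, misleading, or malicious information that imperfect retrieval introduces into Retrieval-Augmented Generation (RAG) systems. The key to the solution is a new method, Astute RAG, which adaptively elicits essential information from the LLM's internal knowledge, iteratively consolidates internal and external knowledge with source awareness, and finalizes the answer according to information reliability. By resolving conflicts between the LLM's internal knowledge and external knowledge, Astute RAG improves the robustness, reliability, and trustworthiness of RAG systems.
Link: https://arxiv.org/abs/2410.07176
Authors: Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık
Keywords: large language models, Retrieval-Augmented Generation, Astute RAG, RAG, imperfect retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint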
Abstract:Retrieval-Augmented Generation (RAG), while effective in integrating external knowledge to address the limitations of large language models (LLMs), can be undermined by imperfect retrieval, which may introduce irrelevant, misleading, or even malicious information. Despite its importance, previous studies have rarely explored the behavior of RAG through joint analysis on how errors from imperfect retrieval attribute and propagate, and how potential conflicts arise between the LLMs’ internal knowledge and external sources. We find that imperfect retrieval augmentation might be inevitable and quite harmful, through controlled analysis under realistic conditions. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome in the post-retrieval stage of RAG. To render LLMs resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach that adaptively elicits essential information from LLMs’ internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.
[NLP-1] Do better language models have crisper vision?
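[Quick read]: This paper evaluates how well text-only large language models (LLMs) understand the visual world, particularly for computer vision applications. The key to the solution is the Visual Text Representation Benchmark (ViTeRB), which isolates the properties that make language models well aligned with the visual world. The paper further finds that large-scale decoder-based LLMs are better suited to representing text in vision-centric contexts than the text encoders in common use. Building on this, it proposes ShareLock, an ultra-lightweight CLIP-like model that uses precomputed frozen features to cut training cost substantially while reaching 51% accuracy on ImageNet.
Link: https://arxiv.org/abs/2410.07173
Authors: Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano
Keywords: text-only Large Language, text-only Large, Large Language Models, Large Language, visual world
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)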
Abstract:How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, existing studies have primarily focused on limited scenarios, such as their ability to generate visual content or cluster multimodal data. To this end, we propose the Visual Text Representation Benchmark (ViTeRB) to isolate key properties that make language models well-aligned with the visual world. With this, we identify large-scale decoder-based LLMs as ideal candidates for representing text in vision-centric contexts, counter to the current practice of utilizing text encoders. Building on these findings, we propose ShareLock, an ultra-lightweight CLIP-like model. By leveraging precomputable frozen features from strong vision and language models, ShareLock achieves an impressive 51% accuracy on ImageNet despite utilizing just 563k image-caption pairs. Moreover, training requires only 1 GPU hour (or 10 hours including the precomputation of features) - orders of magnitude less than prior methods. Code will be released.
[NLP-2] One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
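[Quick read]: This paper addresses the slow convergence and sub-optimal performance caused by how low-rank adaptation (LoRA) initializes its weight matrices when fine-tuning pre-trained models. The key to the solution is a new data-driven initialization, Explained Variance Adaptation (EVA): it computes a singular value decomposition (SVD) on minibatches of activation vectors, initializes the LoRA matrices with the resulting right-singular vectors, and redistributes ranks across weight matrices to maximize explained variance, which speeds up convergence and improves performance.
Link: https://arxiv.org/abs/2410.07170
Authors: Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter
Keywords: Foundation models, specific application, large-scale datasets, Foundation, uniform rank distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 10 pages + references and appendix, code available at this https URL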
Abstract:Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across model weights. Recent works focus on weight-driven initialization or learning of adaptive ranks during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, in turn leading to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and re-distribute ranks among all weight matrices to explain the maximal amount of variance and continue the standard LoRA fine-tuning procedure. This results in our new method Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and attains the highest average score across a multitude of tasks per domain.
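The initialization described in the abstract (SVD on a minibatch of activations, right-singular vectors as LoRA weights, ranks redistributed by explained variance) can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the function name `eva_style_init`, the single-minibatch SVD, and the global top-k rank allocation are assumptions made for the example.

```python
import numpy as np

def eva_style_init(activations, total_rank):
    """Hypothetical sketch of an explained-variance-driven LoRA init.

    activations: dict mapping layer name -> (n_samples, d_in) minibatch of
                 activation vectors fed into that layer.
    total_rank:  rank budget shared across all layers.
    Returns a dict mapping layer name -> matrix of shape (r_layer, d_in)
    whose rows are the top right-singular vectors for that layer.
    """
    candidates = []   # (explained-variance ratio, layer name)
    vt_by_layer = {}
    for name, x in activations.items():
        # Rows of vt are right-singular vectors, sorted by singular value.
        _, s, vt = np.linalg.svd(x, full_matrices=False)
        ratios = s ** 2 / np.sum(s ** 2)
        vt_by_layer[name] = vt
        candidates.extend((ratio, name) for ratio in ratios)

    # Assumption: redistribute ranks by keeping the components with the
    # largest explained-variance ratio across all layers.
    candidates.sort(key=lambda c: c[0], reverse=True)
    ranks = {name: 0 for name in activations}
    for _, name in candidates[:total_rank]:
        ranks[name] += 1

    # Singular values are sorted, so the kept components are the first rows.
    return {name: vt_by_layer[name][: ranks[name]] for name in activations}
```

A layer whose activations concentrate variance in a few directions receives few, highly informative components, while a layer with flatter spectrum can receive more of the budget. Standard LoRA fine-tuning would then continue from these matrices.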
[NLP-3] Sylber: Syllabic Embedding Representation of Speech from Raw Audio
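[Quick read]: This paper addresses the lack of structure in current neural speech representations, which yields dense token sequences that are costly to process. The key to the solution is a new model, Sylber: a self-supervised model that regresses features on syllabic segments distilled from a teacher model, where the teacher is an exponential moving average of the model in training, producing speech representations with clean and robust syllabic structure. The approach enables efficient syllable segmentation and tokenization, yields syllabic units better suited to lexical and syntactic understanding, and lets categorical perception emerge naturally, making the embedding space more categorical and sparse.
Link: https://arxiv.org/abs/2410.07168
Authors: Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli
Keywords: play a crucial, crucial role, role in human, speech, human speech perception
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)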
Abstract:Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.
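The abstract describes the teacher as an exponential moving average (EMA) of the model in training. A generic EMA parameter update looks roughly like the sketch below; the function name `ema_update`, the dict-of-arrays parameter layout, and the decay value are assumptions for illustration, and the paper's actual schedule is not given here.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Return new teacher parameters as an EMA of the student's parameters.

    teacher, student: dicts mapping parameter name -> np.ndarray of equal shape.
    decay: how much of the old teacher is kept at each step (assumed value).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

Called once per training step, this makes the teacher a slowly moving, smoothed copy of the student, which is what gives the regression targets their stability.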
[NLP-4] Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
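[Quick read]: This paper addresses the lack of effective metrics for evaluating the multi-modal pre-training stage of Large Vision Language Models (LVLMs). The key to the solution is the Modality Integration Rate (MIR), a metric that assesses pre-training quality from the perspective of inter-modal distribution distance. MIR has three key properties: 1) effectiveness: it represents pre-training quality and correlates positively with benchmark performance after supervised fine-tuning; 2) robustness: it is stable across different training and evaluation data; 3) generality: it applies across training configurations and architecture choices. A series of pre-training experiments shows that MIR can guide training data selection, training strategy scheduling, and model architecture design.
Link: https://arxiv.org/abs/2410.07167
Authors: Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
Keywords: Large Vision Language, Vision Language Models, Modality Integration Rate, Large Language Models, Vision Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project page: this https URL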
Abstract:We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, while evaluating its training quality without the costly supervised fine-tuning stage is under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), while we observed that these metrics are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, the research of LVLMs in the critical pre-training stage is hindered greatly, including the training data choice, efficient module design, etc. In this paper, we propose evaluating the pre-training quality from the inter-modal distribution distance perspective and present MIR, the Modality Integration Rate, which is 1) Effective to represent the pre-training quality and show a positive relation with the benchmark performance after supervised fine-tuning. 2) Robust toward different training/evaluation data. 3) Generalize across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results that MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results. We hope MIR could be a helpful metric for building capable LVLMs and inspire the following research about modality alignment in different areas. Our code is at: this https URL.
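The abstract says MIR looks at the distance between the distributions of the two modalities but does not spell out the formula. As one common example of such a distance (an assumption for illustration, not necessarily the paper's definition), the sketch below computes the Fréchet distance between Gaussians fitted to two sets of token features. The names `frechet_distance`, `vision_feats`, and `text_feats` are hypothetical.

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two feature sets.

    x, y: arrays of shape (n_tokens, dim), e.g. vision-token and text-token
          hidden states taken from the same layer (assumed setup).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

# Usage: a smaller value means the two feature distributions are closer.
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(200, 4))
text_feats = vision_feats + 3.0   # same shape of distribution, shifted mean
gap = frechet_distance(vision_feats, text_feats)
```

A metric of this kind needs only forward passes over a small sample, which is why it can be evaluated without the supervised fine-tuning stage.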
[NLP-5] Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making NEURIPS2024
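[Quick read]: This paper addresses the lack of systematic evaluation of large language models (LLMs) on embodied decision-making tasks. The key to the solution is a generalized interface (Embodied Agent Interface) that standardizes the formalization of various embodied decision-making tasks and the input-output specifications of LLM-based modules, and evaluates LLMs on different subtasks with fine-grained metrics (such as hallucination errors, affordance errors, and planning errors). This exposes the strengths and weaknesses of LLM-powered embodied AI systems and offers guidance for using LLMs effectively and selectively.
Link: https://arxiv.org/abs/2410.07166
Authors: Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu
Keywords: Large Language Models, evaluate Large Language, Language Models, Large Language, evaluate Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for oral presentation at NeurIPS 2024 in the Datasets and Benchmarks track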
Abstract:We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs’ performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making.
[NLP-6] Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
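[Quick read]: This paper addresses unlearning in large language models (LLMs): removing unwanted data influences and the associated capabilities (such as copyrighted data or harmful content generation) without retraining from scratch, while preserving essential model utility. The key to the solution is SimNPO, a simple yet effective unlearning optimization framework that removes the reliance on a reference model, which improves unlearning, particularly on forget data of varying difficulty. The advantages of SimNPO are further supported by an analysis using mixtures of Markov chains; experiments on benchmarks such as TOFU and MUSE show that it outperforms existing unlearning baselines and is robust against relearning attacks.
Link: https://arxiv.org/abs/2410.07163
Authors: Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu
Keywords: harmful content generation, essential model utilities, remove unwanted data, unwanted data influences, large language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)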
Abstract:In this work, we address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities (e.g., copyrighted data or harmful content generation) while preserving essential model utilities, without the need for retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework remains lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO’s effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that ‘simplicity’ in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO’s advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO’s superiority over existing unlearning baselines in benchmarks like TOFU and MUSE, and robustness against relearning attacks. Codes are available at this https URL.
[NLP-7] InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
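[Quick read]: This paper addresses the Graph2Image task: generating images from multimodal attributed graphs (MMAGs). The key to the solution is InstructG2I, a graph context-conditioned diffusion model. It performs informative neighbor sampling by combining personalized PageRank with re-ranking based on vision-language features, then uses a Graph-QFormer encoder to adaptively encode graph nodes into an auxiliary set of graph prompts that guide the diffusion denoising process. In addition, the proposed graph classifier-free guidance makes generation controllable by varying the strength of graph guidance and of multiple connected edges to a node.
Link: https://arxiv.org/abs/2410.07157
Authors: Bowen Jin, Ziqi Pang, Bingjun Guo, Yu-Xiong Wang, Jiaxuan You, Jiawei Han
Keywords: generating images, overlooked yet critical, graph, multimodal attributed graphs, critical task
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 16 pages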
Abstract:In this paper, we approach an overlooked yet critical task Graph2Image: generating images from multimodal attributed graphs (MMAGs). This task poses significant challenges due to the explosion in graph size, dependencies among graph entities, and the need for controllability in graph conditions. To address these challenges, we propose a graph context-conditioned diffusion model called InstructG2I. InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling by combining personalized page rank and re-ranking based on vision-language features. Then, a Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process of diffusion. Finally, we propose graph classifier-free guidance, enabling controllable generation by varying the strength of graph guidance and multiple connected edges to a node. Extensive experiments conducted on three datasets from different domains demonstrate the effectiveness and controllability of our approach. The code is available at this https URL.
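The first stage of the pipeline above, neighbor sampling with personalized PageRank, can be illustrated with a standard power-iteration sketch. This covers only the textbook algorithm; the re-ranking by vision-language features is not shown, and the function names, the restart probability `alpha`, and the handling of dangling nodes are assumptions for the example rather than details from the paper.

```python
import numpy as np

def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Personalized PageRank scores for a random walk that restarts at `seed`.

    adj:   (n, n) adjacency matrix, adj[i, j] > 0 if there is an edge i -> j.
    seed:  index of the target node.
    alpha: restart probability (assumed value).
    """
    n = adj.shape[0]
    restart = np.zeros(n)
    restart[seed] = 1.0
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize; nodes without out-edges jump back to the seed.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), restart)
    scores = restart.copy()
    for _ in range(iters):
        scores = alpha * restart + (1.0 - alpha) * (scores @ trans)
    return scores

def top_k_neighbors(adj, seed, k):
    """Return the k nodes (excluding the seed) with the highest scores."""
    scores = personalized_pagerank(adj, seed)
    order = [int(i) for i in np.argsort(-scores) if i != seed]
    return order[:k]
```

Sampling by these scores instead of taking all neighbors keeps the conditioning set small, which matters given the explosion in graph size that the abstract mentions.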
[NLP-8] Taking a turn for the better: Conversation redirection throughout the course of mental-health therapy EMNLP
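[Quick read]: This paper asks how conversation redirection by patients and therapists during mental-health therapy relates to the development and quality of their relationship. The key to the solution is a probabilistic measure that quantifies how much an utterance immediately redirects the flow of the conversation, accounting for both the intention and the actual realization of the change. Using this measure, the paper analyzes how patient-therapist relationships develop over multiple sessions and finds that patient control of the conversation's direction generally increases relative to the therapist's as the relationship progresses, and that patients with less control in the first few sessions are significantly more likely to eventually express dissatisfaction with their therapist and terminate the relationship.
Link: https://arxiv.org/abs/2410.07147
Authors: Vivian Nguyen, Sang Min Jung, Lillian Lee, Thomas D. Hull, Cristian Danescu-Niculescu-Mizil
Keywords: Mental-health therapy involves, therapists continuously negotiate, complex conversation flow, Mental-health therapy, involves a complex
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: To appear in the Proceedings of EMNLP (Findings) 2024. Code available at this https URL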
Abstract:Mental-health therapy involves a complex conversation flow in which patients and therapists continuously negotiate what should be talked about next. For example, therapists might try to shift the conversation’s direction to keep the therapeutic process on track and avoid stagnation, or patients might push the discussion towards issues they want to focus on. How do such patient and therapist redirections relate to the development and quality of their relationship? To answer this question, we introduce a probabilistic measure of the extent to which a certain utterance immediately redirects the flow of the conversation, accounting for both the intention and the actual realization of such a change. We apply this new measure to characterize the development of patient-therapist relationships over multiple sessions in a very large, widely-used online therapy platform. Our analysis reveals that (1) patient control of the conversation’s direction generally increases relative to that of the therapist as their relationship progresses; and (2) patients who have less control in the first few sessions are significantly more likely to eventually express dissatisfaction with their therapist and terminate the relationship.
[NLP-9] Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling
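[Quick read]: This paper addresses the poor performance of existing recurrent neural networks (RNNs) on long sequences, specifically their failure to extrapolate to inputs longer than the training length and the upper bound on their memory capacity. The key to the solution is identifying and mitigating "state collapse", which the paper attributes to overfitting caused by the recurrent state being overparameterized for the training length. The paper proposes three mitigation methods that improve the length generalization of Mamba-2, allowing it to process more than 1M tokens, and shows empirically that recurrent state capacity in passkey retrieval scales exponentially with state size, suggesting that RNNs hold promise for long-context modeling.
Link: https://arxiv.org/abs/2410.07145
Authors: Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
Keywords: linear computational complexity, recurrent neural networks, handling long sequences, neural networks, Mamba and RWKV
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 18 figures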
Abstract:One essential advantage of recurrent neural networks (RNNs) over transformer-based language models is their linear computational complexity concerning the sequence length, which makes them much faster in handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences with less than 10K tokens, and their effectiveness in longer contexts remains largely unsatisfying so far. In this paper, we study the cause of the inability to process long context for RNNs and suggest critical mitigations. We examine two practical concerns when applying state-of-the-art RNNs to long contexts: (1) the inability to extrapolate to inputs longer than the training length and (2) the upper bound of memory capacity. Addressing the first concern, we first investigate state collapse (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training. With controlled experiments, we attribute this to overfitting due to the recurrent state being overparameterized for the training length. For the second concern, we train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval. Then, three SC mitigation methods are proposed to improve Mamba-2’s length generalizability, allowing the model to process more than 1M tokens without SC. We also find that the recurrent state capacity in passkey retrieval scales exponentially to the state size, and we empirically train a Mamba-2 370M with near-perfect passkey retrieval accuracy on 256K context length. This suggests a promising future for RNN-based long-context modeling.
[NLP-10] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
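[Quick read]: This paper addresses cheating on automatic LLM benchmarks such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench. It shows experimentally that even a "null model", which always outputs a constant response irrelevant to the input instructions, can achieve top-ranked win rates on these benchmarks (for example, an 86.5% LC win rate on AlpacaEval 2.0), exposing the fragility of existing benchmarks. The paper calls for the development of anti-cheating mechanisms to keep automatic benchmarks reliable and fair.
Link: https://arxiv.org/abs/2410.07137
Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
Keywords: language models due, evaluating language models, win rates, human evaluation, popular for evaluating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)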
Abstract:Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a “null model” that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at this https URL.
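To make the idea of a "null model" concrete, the sketch below shows a model whose output never depends on the instruction. The class name `NullModel`, the `generate` method, and the placeholder response are all hypothetical; the crafted response used in the paper is not reproduced here.

```python
class NullModel:
    """A stand-in 'model' that ignores its input entirely (illustrative only)."""

    def __init__(self, constant_response: str):
        # The single response returned for every instruction.
        self.constant_response = constant_response

    def generate(self, instruction: str) -> str:
        # The instruction is deliberately unused: the output is constant.
        return self.constant_response


# Usage: every benchmark instruction receives the same answer.
model = NullModel("<placeholder response>")
answers = [model.generate(q) for q in ["Write a poem.", "Explain TCP."]]
```

Because the output carries no information about the instruction, any win rate such a model earns reflects a weakness in the automatic judge and not any capability of the model.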
[NLP-11] Mental Disorders Detection in the Era of Large Language Models
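[Quick read]: This paper compares traditional machine learning methods, encoder-based models, and large language models (LLMs) on detecting depression and anxiety. The approach tests AutoML models based on linguistic features, several encoder-based Transformer variants such as BERT, and state-of-the-art LLMs as pathology classification models. The results show that LLMs outperform traditional methods on noisy, small datasets, but psycholinguistic features and encoder-based models can match language models when trained on texts from individuals with clinically confirmed depression, indicating their potential in targeted clinical applications.
Link: https://arxiv.org/abs/2410.07129
Authors: Gleb Kuzmin, Petr Strepetov, Maksim Stankevich, Ivan Smirnov, Artem Shelmanov
Keywords: machine learning methods, traditional machine learning, paper compares, machine learning, task of detecting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)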
Abstract:This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five datasets were considered, each differing in format and the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
[NLP-12] End-Cloud Collaboration Framework for Advanced AI Customer Service in E-commerce
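[Quick read]: This paper addresses the challenges that AI-driven customer service in e-commerce faces in latency, personalized service, and privacy protection, as well as the limited computational resources of end devices. The key to the solution is an End-Cloud Collaboration (ECC) framework in which a large cloud model guides the learning of mid/small-sized end models, reducing reliance on large-scale, high-quality data, while an online evolutive learning strategy lets the end model iterate and upgrade continuously, delivering personalized service while protecting privacy.
Link: https://arxiv.org/abs/2410.07122
Authors: Liangyu Teng, Yang Liu, Jing Liu, Liang Song
Keywords: customer service solutions, end model, AI-driven customer service, model, advanced AI-driven customer
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by 2024 IEEE 10th World Forum on Internet of Things (WF-IoT)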
Abstract:In recent years, the e-commerce industry has seen a rapid increase in the demand for advanced AI-driven customer service solutions. Traditional cloud-based models face limitations in terms of latency, personalized services, and privacy concerns. Furthermore, end devices often lack the computational resources to deploy large AI models effectively. In this paper, we propose an innovative End-Cloud Collaboration (ECC) framework for advanced AI customer service in e-commerce. This framework integrates the advantages of large cloud models and mid/small-sized end models by deeply exploring the generalization potential of cloud models and effectively utilizing the computing power resources of terminal chips, alleviating the strain on computing resources to some extent. Specifically, the large cloud model acts as a teacher, guiding and promoting the learning of the end model, which significantly reduces the end model’s reliance on large-scale, high-quality data and thereby addresses the data bottleneck in traditional end model training, offering a new paradigm for the rapid deployment of industry applications. Additionally, we introduce an online evolutive learning strategy that enables the end model to continuously iterate and upgrade based on guidance from the cloud model and real-time user feedback. This strategy ensures that the model can flexibly adapt to the rapid changes in application scenarios while avoiding the uploading of sensitive information by performing local fine-tuning, achieving the dual goals of privacy protection and personalized service. To conclude, we implement in-depth corpus collection (e.g., data organization, cleaning, and preprocessing) and train an ECC-based industry-specific model for e-commerce customer service.
[NLP-13] Exploring the Readiness of Prominent Small Language Models for the Democratization of Financial Literacy
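[Quick read]: This paper asks how small language models (SLMs) can broaden access to financial knowledge, particularly for people who are financially undereducated. The approach is to evaluate SLMs suited to resource-constrained, privacy-sensitive settings, such as Apple’s OpenELM, Microsoft’s Phi, Google’s Gemma, and the Tinyllama project, in zero-shot and few-shot variants, analyzing their memory usage, inference time, similarity to ground-truth answers, and output readability to determine which models are best suited to democratizing access to financial information.
Link: https://arxiv.org/abs/2410.07118
Authors: Tagore Rao Kosireddy, Jeffrey D. Wall, Evan Lucas
Keywords: small language models, billion parameters, small language, language models, financial
Categories: Computation and Language (cs.CL)
Comments: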
Abstract:The use of small language models (SLMs), herein defined as models with fewer than three billion parameters, is increasing across various domains and applications. Due to their ability to run on more accessible hardware and preserve user privacy, SLMs possess the potential to democratize access to language models for individuals of different socioeconomic status and with different privacy preferences. This study assesses several state-of-the-art SLMs (e.g., Apple’s OpenELM, Microsoft’s Phi, Google’s Gemma, and the Tinyllama project) for use in the financial domain to support the development of financial literacy LMs. Democratizing access to quality financial information for those who are financially undereducated is greatly needed in society, particularly as new financial markets and products emerge and participation in financial markets increases due to ease of access. We are the first to examine the use of open-source SLMs to democratize access to financial question answering capabilities for individuals and students. To this end, we provide an analysis of the memory usage, inference time, similarity comparisons to ground-truth answers, and output readability of prominent SLMs to determine which models are most accessible and capable of supporting access to financial information. We analyze zero-shot and few-shot learning variants of the models. The results suggest that some off-the-shelf SLMs merit further exploration and fine-tuning to prepare them for individual use, while others may have limits to their democratization.
[NLP-14] System 2 thinking in OpenAI’s o1-preview model: Near-perfect performance on a mathematics exam
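[Quick read]: This paper sets out to independently validate how well OpenAI’s newly introduced O1 model series handles complex, analytical tasks (akin to System 2 in human cognition). The approach is a practical test: evaluating the O1-preview model on the Dutch ‘Mathematics B’ final exam. O1-preview scored 76 and 73 out of 76 points in two runs, well above both the GPT-4o model and the Dutch student average, showing its potential on System 2-type tasks. However, the model occasionally made mistakes under repeated prompting, which suggests that the self-consistency method (selecting the consensus output) could further improve accuracy.
Link: https://arxiv.org/abs/2410.07114
Authors: Joost de Winter, Dimitra Dodou, Yke Bauke Eisma
Keywords: processes underlying human, underlying human cognition, involves fast, involves slow, intuitive thinking
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: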
Abstract:The processes underlying human cognition are often divided into two systems: System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the O1 model series, specifically designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the O1-preview model twice on the Dutch ‘Mathematics B’ final exam. It scored a near-perfect 76 and 73 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 61 out of 76, well above the Dutch average of 40.63 points. The O1-preview model completed the exam in around 10 minutes, while GPT-4o took 3 minutes, and neither model had access to the exam figures. Although O1-preview had the ability to achieve a perfect score, its performance showed some variability, as it made occasional mistakes with repeated prompting. This suggests that the self-consistency method, where the consensus output is selected, could improve accuracy. We conclude that while OpenAI’s new model series holds great potential, certain risks must be considered.
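The self-consistency idea mentioned in the abstract, where the consensus output across repeated prompts is selected, can be sketched as a simple majority vote. This is a minimal illustration and not the authors’ code: the function name `consensus_answer` and the sample answers are hypothetical, and breaking ties by first appearance is an assumption.

```python
from collections import Counter


def consensus_answer(answers):
    """Return the most frequent answer among repeated model outputs.

    Ties are broken in favor of the answer that appeared first
    (an assumption made for this sketch).
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Walk the answers in their original order so ties are deterministic.
    for answer in answers:
        if counts[answer] == best:
            return answer


# Hypothetical final answers from five repeated prompts of one exam question.
samples = ["x = 3", "x = 3", "x = 2", "x = 3", "x = 5"]
print(consensus_answer(samples))
```

In practice the answers would first need to be normalized (for example, by stripping whitespace or comparing numeric values) so that equivalent outputs count as the same vote.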
[NLP-15] I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy
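[Quick read]: This paper studies persuasion and anti-social behavior arising in interactions between large language model (LLM) agents, specifically in a context of strict social hierarchy. The approach simulates conversations between a guard agent and a prisoner agent in a prison scenario and compares how different LLMs behave. Using 200 experimental scenarios and 2,000 machine-machine conversations, the study documents differences in how models handle power dynamics in a multi-agent setup, and shows that the assigned goal mainly affects an agent’s persuasiveness while having a negligible effect on its anti-social behavior. It also highlights how agent personas, particularly the guard’s personality, drive both the likelihood of successful persuasion and the emergence of anti-social behavior, and shows that role assignment alone can elicit anti-social behavior even without explicitly prompting for specific personalities. These findings bear on the development of interactive LLM agents and the debate on their societal impact.
Link: https://arxiv.org/abs/2410.07109
Authors: Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano
Keywords: Large Language Model, Large Language, Stanford Prison Experiment, anticipate emergent phenomena, Language Model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: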
Abstract:As Large Language Model (LLM)-based agents become increasingly autonomous and will more freely interact with each other, studying interactions between them becomes crucial to anticipate emergent phenomena and potential risks. Drawing inspiration from the widely popular Stanford Prison Experiment, we contribute to this line of research by studying interaction patterns of LLM agents in a context characterized by strict social hierarchy. We do so by specifically studying two types of phenomena: persuasion and anti-social behavior in simulated scenarios involving a guard and a prisoner agent who seeks to achieve a specific goal (i.e., obtaining additional yard time or escape from prison). Leveraging 200 experimental scenarios for a total of 2,000 machine-machine conversations across five different popular LLMs, we provide a set of noteworthy findings. We first document how some models consistently fail in carrying out a conversation in our multi-agent setup where power dynamics are at play. Then, for the models that were able to engage in successful interactions, we empirically show how the goal that an agent is set to achieve impacts primarily its persuasiveness, while having a negligible effect with respect to the agent’s anti-social behavior. Third, we highlight how agents’ personas, and particularly the guard’s personality, drive both the likelihood of successful persuasion from the prisoner and the emergence of anti-social behaviors. Fourth, we show that even without explicitly prompting for specific personalities, anti-social behavior emerges by simply assigning agents’ roles. These results bear implications for the development of interactive LLM agents as well as the debate on their societal impact.
[NLP-16] Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context
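[Quick read]: This paper addresses the performance drop that large language models (LLMs) suffer on multi-hop reasoning tasks when supporting documents are presented in an unfavorable order, which the authors call the misordered context problem. The key to the solution is a method called context repetition (CoRe), which repeatedly presents the context so that the supporting documents appear in the optimal order for the model. CoRe improves the F1 score by up to 30 percentage points on multi-hop QA tasks and accuracy by up to 70 percentage points on a synthetic task, and it helps mitigate the well-known “lost-in-the-middle” problem in LLMs.
Link: https://arxiv.org/abs/2410.07103
Authors: Sangwon Yu, Ik-hwan Kim, Jongyoon Song, Saehyung Lee, Junsung Park, Sungroh Yoon
Keywords: supporting documents, large language models, requires multi-step reasoning, multi-step reasoning based, remains challenging
Categories: Computation and Language (cs.CL)
Comments: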
Abstract:Multi-hop reasoning, which requires multi-step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs’ performance is also sensitive to the order in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context to ensure the supporting documents are presented in the optimal order for the model. Using CoRe, we improve the F1 score by up to 30%p on multi-hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well-known “lost-in-the-middle” problem in LLMs and can be effectively combined with retrieval-based approaches utilizing Chain-of-Thought (CoT) reasoning.
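Context repetition as described in the abstract can be sketched as a prompt builder that presents the same document list several times. This is only an illustration under stated assumptions: the function name `build_core_prompt`, the `Document i:` / `Question:` template, and the default of two repetitions are hypothetical and not taken from the paper.

```python
def build_core_prompt(documents, question, repeats=2):
    """Build a prompt that presents the full context `repeats` times.

    The template below is an assumed format, not the paper's exact prompt.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    # Number the documents once, then repeat the whole block.
    context = "\n".join(
        f"Document {i + 1}: {doc}" for i, doc in enumerate(documents)
    )
    blocks = [context] * repeats
    return "\n\n".join(blocks) + f"\n\nQuestion: {question}\nAnswer:"


# Hypothetical two-hop case: the second document is needed before the first.
docs = ["The capital of Freedonia is Fredville.", "Ada was born in Freedonia."]
print(build_core_prompt(docs, "In which country's capital could Ada vote?"))
```

One way to see why repetition may help: if reasoning needs document 2 before document 1, the sequence 1, 2, 1, 2 contains that order as a subsequence, so a left-to-right reader encounters the documents in a usable order at least once.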
[NLP-17] MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
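[Quick read]: This paper addresses how to evaluate AI agents on machine learning engineering (MLE) tasks. The key is a benchmark called MLE-bench, built from 75 ML engineering-related competitions curated from Kaggle, which tests practical skills such as training models, preparing datasets, and running experiments. Evaluating frontier language models with open-source agent scaffolds, the paper finds that the best-performing setup (OpenAI’s o1-preview with AIDE scaffolding) achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. The study also investigates resource scaling for AI agents and the impact of contamination from pre-training, and it open-sources the benchmark code to facilitate future research.
Link: https://arxiv.org/abs/2410.07095
Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
Keywords: machine learning engineering, introduce MLE-bench, perform at machine, machine learning, learning engineering
Categories: Computation and Language (cs.CL)
Comments: 10 pages. Plus 17 pages appendix. 8 figures. Equal contribution by first seven authors. Authors randomized. Work by Neil Chowdhury done while at OpenAI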
Abstract:We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle’s publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup–OpenAI’s o1-preview with AIDE scaffolding–achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (this http URL) to facilitate future research in understanding the ML engineering capabilities of AI agents.
[NLP-18] An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots
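[Quick read]: This paper addresses the scarcity of high-quality labeled data for training the Natural Language Understanding platforms (NLUs) behind software engineering (SE) chatbots. The key to the solution is an approach that automatically generates labeling functions (LFs) by extracting patterns from labeled user queries. The authors argue that this can save the time and resources spent on manual annotation, letting practitioners focus on core chatbot functionality. In the experiments, the generated LFs label data with AUC scores of up to 85.3%, and NLU performance improves by up to 27.2%.
Link: https://arxiv.org/abs/2410.07094
Authors: Ebube Alor, Ahmad Abdellatif, SayedHassan Khatoonabadi, Emad Shihab
Keywords: enhancing development processes, increasingly gaining attention, Software engineering, Natural Language Understanding, development processes
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Software Engineering for review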
Abstract:Software engineering (SE) chatbots are increasingly gaining attention for their role in enhancing development processes. At the core of chatbots are the Natural Language Understanding platforms (NLUs), which enable them to comprehend and respond to user queries. Before deploying NLUs, there is a need to train them with labeled data. However, acquiring such labeled data for SE chatbots is challenging due to the scarcity of high-quality datasets. This challenge arises because training SE chatbots requires specialized vocabulary and phrases not found in typical language datasets. Consequently, chatbot developers often resort to manually annotating user queries to gather the data necessary for training effective chatbots, a process that is both time-consuming and resource-intensive. Previous studies propose approaches to support chatbot practitioners in annotating users’ posed queries. However, these approaches require human intervention to generate rules, called labeling functions (LFs), that identify and categorize user queries based on specific patterns in the data. To address this issue, we propose an approach to automatically generate LFs by extracting patterns from labeled user queries. We evaluate the effectiveness of our approach by applying it to the queries of four diverse SE datasets (namely AskGit, MSA, Ask Ubuntu, and Stack Overflow) and measure the performance improvement gained from training the NLU on the queries labeled by the generated LFs. We find that the generated LFs effectively label data with AUC scores of up to 85.3%, and NLU’s performance improvement of up to 27.2% across the studied datasets. Furthermore, our results show that the number of LFs used to generate LFs affects the labeling performance. We believe that our approach can save time and resources in labeling users’ queries, allowing practitioners to focus on core chatbot functionalities.
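To make the idea of generating labeling functions from labeled queries concrete, here is a minimal sketch. It assumes a very simple notion of "pattern": a single token that occurs in at least `min_support` labeled queries, all sharing one intent. The paper’s actual pattern extraction may well differ; `generate_lfs`, `label_query`, `ABSTAIN`, and the sample queries are all hypothetical.

```python
from collections import Counter, defaultdict

ABSTAIN = None  # returned by an LF when its pattern does not match


def generate_lfs(labeled_queries, min_support=2):
    """Derive keyword-based labeling functions from (query, intent) pairs.

    A token becomes a pattern when it appears in at least `min_support`
    queries and all of those queries share a single intent.
    """
    token_intents = defaultdict(list)
    for query, intent in labeled_queries:
        for token in set(query.lower().split()):
            token_intents[token].append(intent)

    lfs = []
    for token, intents in sorted(token_intents.items()):
        if len(intents) >= min_support and len(set(intents)) == 1:
            # Bind token and label as defaults so each LF keeps its own pattern.
            def lf(query, token=token, label=intents[0]):
                return label if token in query.lower().split() else ABSTAIN
            lfs.append(lf)
    return lfs


def label_query(lfs, query):
    """Majority vote over the LFs that fire; ABSTAIN if none does."""
    votes = [v for v in (lf(query) for lf in lfs) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN


# Hypothetical labeled seed queries.
train = [
    ("how do I revert a commit", "revert"),
    ("revert the last commit please", "revert"),
    ("who fixed the login bug", "blame"),
    ("who fixed issue 42", "blame"),
]
lfs = generate_lfs(train)
print(label_query(lfs, "please revert my commit"))
print(label_query(lfs, "who broke the build"))
```

The token "the" appears under both intents, so it yields no labeling function; this is the kind of filtering that keeps ambiguous patterns from voting.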
[NLP-19] Stanceformer: Target-Aware Transformer for Stance Detection
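[Quick read]: This paper addresses the inability of existing Transformer models to prioritize target information effectively in stance detection. The key to the solution is Stanceformer, a target-aware Transformer that uses a Target Awareness matrix to increase the self-attention scores assigned to the targets, giving them enhanced attention during both training and inference. The approach performs strongly on stance detection and also generalizes to other domains, such as Aspect-based Sentiment Analysis.
Link: https://arxiv.org/abs/2410.07083
Authors: Krishna Garg, Cornelia Caragea
Keywords: Detection involves discerning, Stance Detection involves, involves discerning, specific subject, Stance Detection
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures, 14 tables including Appendix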
Abstract:The task of Stance Detection involves discerning the stance expressed in a text towards a specific subject or target. Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively. Consequently, these models yield similar performance regardless of whether we utilize or disregard target information, undermining the task’s significance. To address this challenge, we introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference. Specifically, we design a Target Awareness matrix that increases the self-attention scores assigned to the targets. We demonstrate the efficacy of the Stanceformer with various BERT-based models, including state-of-the-art models and Large Language Models (LLMs), and evaluate its performance across three stance detection datasets, alongside a zero-shot dataset. Our approach Stanceformer not only provides superior performance but also generalizes even to other domains, such as Aspect-based Sentiment Analysis. We make the code publicly available (this https URL).
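The abstract says the Target Awareness matrix increases the self-attention scores assigned to the targets. A minimal sketch of that idea, assuming the boost is an additive constant `alpha` applied to the target key positions before the softmax, is shown below. The additive form, the name `target_aware_attention`, and the toy scores are assumptions for illustration; the paper’s exact construction may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def target_aware_attention(scores, target_positions, alpha=1.0):
    """Add `alpha` to the scores of target key positions, then softmax each row.

    `scores[i][j]` is the raw attention score of query token i on key token j.
    """
    targets = set(target_positions)
    boosted = [
        [s + (alpha if j in targets else 0.0) for j, s in enumerate(row)]
        for row in scores
    ]
    return [softmax(row) for row in boosted]


# Toy 3x3 score matrix; token 2 plays the role of the (hypothetical) target.
scores = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]
plain = target_aware_attention(scores, [], alpha=0.0)
aware = target_aware_attention(scores, [2], alpha=1.0)
print(plain[0][2], aware[0][2])
```

Because the boost is applied before normalization, every query token shifts probability mass toward the target position while each row still sums to one.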
[NLP-20] MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
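[Quick read]: This paper asks whether large language models (LLMs) can automatically discover novel and valid chemistry research hypotheses given only a chemistry research background. The key is an assumption that a majority of chemistry hypotheses can result from a research background and several inspirations, which breaks the central question into three sub-questions: (1) whether LLMs can retrieve good inspirations given a background question; (2) whether LLMs can produce a hypothesis from the background and inspirations; and (3) whether LLMs can identify good hypotheses and rank them higher. By constructing a benchmark of 51 high-level chemistry papers and developing an LLM-based multi-agent framework, the paper shows that LLMs can rediscover many hypotheses with very high similarity to the ground-truth ones, covering the main innovations.
Link: https://arxiv.org/abs/2410.07076
Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
Keywords: Scientific discovery contributes, human society prosperity, discovery contributes largely, recent progress shows, Scientific discovery
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code and Benchmark are available at this https URL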
Abstract:Scientific discovery contributes largely to human society’s prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can result from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can arrive at a hypothesis; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or venues of a similar level in 2024 (all papers have been available online only since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus containing the ground-truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity to the ground-truth ones, covering the main innovations.
[NLP-21] Pixtral 12B
[Quick read]: This paper targets the difficulty of building multimodal language models that handle natural images and documents well while preserving text understanding. The key is Pixtral-12B, a 12-billion-parameter model that achieves leading performance on multimodal benchmarks and surpasses a number of larger models. Pixtral-12B uses a new vision encoder that ingests images at their natural resolution and aspect ratio, and it can process any number of images within its long context window of 128K tokens, excelling at multimodal tasks without compromising natural language performance.
Link: https://arxiv.org/abs/2410.07073
Authors: Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick Von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang
Keywords: billion-parameter multimodal language, Pixtral, billion-parameter multimodal, multimodal, models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B and Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under Apache 2.0 license.
[NLP-22] ReIFE: Re-evaluating Instruction-Following Evaluation
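[Quick read]: This paper addresses the lack of a comprehensive evaluation of large language model (LLM)-based evaluators of instruction following across base LLMs and evaluation protocols. The key is a thorough meta-evaluation covering 25 base LLMs and 15 evaluation protocols on 4 human-annotated datasets, measuring the evaluation accuracy of the LLM-evaluators. The large-scale evaluation shows that base LLM performance rankings are largely consistent across protocols, that protocol effectiveness can depend on the base LLM used, and that rigorous evaluation requires multiple datasets. The authors release the meta-evaluation suite ReIFE to support future research on instruction-following evaluation.
Link: https://arxiv.org/abs/2410.07069
Authors: Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan
Keywords: large language models, assess response quality, base LLMs, evaluation, evaluation protocols
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: GitHub Repo: this https URL, Evaluation Result Collection: this https URL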
Abstract:The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently proposed evaluation protocols, on 4 human-annotated datasets, assessing the evaluation accuracy of the LLM-evaluators. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness. Moreover, our large-scale evaluation reveals: (1) Base LLM performance ranking remains largely consistent across evaluation protocols, with less capable LLMs showing greater improvement from protocol enhancements; (2) Robust evaluation of evaluation protocols requires many base LLMs with varying capability levels, as protocol effectiveness can depend on the base LLM used; (3) Evaluation results on different datasets are not always consistent, so a rigorous evaluation requires multiple datasets with distinctive features. We release our meta-evaluation suite ReIFE, which provides the codebase and evaluation result collection for more than 500 LLM-evaluator configurations, to support future research in instruction-following evaluation.
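The "evaluation accuracy" of an LLM-evaluator can be illustrated as its agreement rate with human annotations, computed per (base LLM, protocol) configuration and then ranked. The sketch below assumes a pairwise format in which each item is labeled "A" or "B" for the preferred response; that format, the function names, and the data are hypothetical and not taken from the ReIFE codebase.

```python
def evaluation_accuracy(evaluator_prefs, human_prefs):
    """Fraction of items where the evaluator picks the human-preferred response."""
    if len(evaluator_prefs) != len(human_prefs):
        raise ValueError("preference lists must have the same length")
    if not human_prefs:
        raise ValueError("need at least one annotated item")
    matches = sum(e == h for e, h in zip(evaluator_prefs, human_prefs))
    return matches / len(human_prefs)


def rank_configurations(config_prefs, human_prefs):
    """Sort (base LLM, protocol) configurations by accuracy, best first."""
    scored = {
        name: evaluation_accuracy(prefs, human_prefs)
        for name, prefs in config_prefs.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: "A"/"B" marks which of two responses is preferred.
human = ["A", "B", "A", "A"]
configs = {
    ("model-x", "protocol-1"): ["A", "B", "B", "A"],
    ("model-x", "protocol-2"): ["A", "B", "A", "A"],
    ("model-y", "protocol-1"): ["B", "B", "B", "A"],
}
for name, acc in rank_configurations(configs, human):
    print(name, acc)
```

Repeating this ranking on several datasets is what would reveal the inconsistencies the abstract describes, since a configuration that leads on one dataset need not lead on another.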
[NLP-23] Data Selection via Optimal Control for Language Models
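[Quick read]: This paper addresses the selection of high-quality pre-training data from massive corpora to enhance language models’ (LMs’) performance on downstream tasks. The key is to formulate data selection as a generalized Optimal Control problem and solve it theoretically with Pontryagin’s Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these results, the paper proposes PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. Experiments on CommonCrawl show that PDS accelerates LM learning and consistently improves performance across model sizes and downstream tasks. When pre-training data is limited, it also reduces data demand by 1.8 times, improving data utilization.
Link: https://arxiv.org/abs/2410.07064
Authors: Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang
Keywords: enhance LMs’ capabilities, Pontryagin Maximum Principle, optimal data selection, data selection, Optimal Control problem
Categories: Computation and Language (cs.CL)
Comments: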
Abstract:This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs’ capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin’s Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which mitigates the quick exhaustion of available web-crawled corpora. Our code, data, and model checkpoints can be found in this https URL.
[NLP-24] Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing EMNLP’2024
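Quick read: This paper addresses language mismatch and repetition errors, which are common when large language models (LLMs) are used for machine translation. The key idea is to use model editing to locate and adjust the feed-forward network (FFN) components responsible for these errors, and to take the intersection of the locating results obtained under different language settings so that information unrelated to the target errors is filtered out. This reduces language mismatch and repetition while keeping or improving overall translation quality.
Link: https://arxiv.org/abs/2410.07054
Authors: Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song, Ying Wei
Keywords (EN): Large Language Models, specific down-stream tasks, NLP field, revolutionized the NLP, Large Language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages, EMNLP’2024 Main Conference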
Abstract: Large Language Models (LLMs) have recently revolutionized the NLP field, while they still fall short in some specific down-stream tasks. In this work, we focus on utilizing LLMs to perform machine translation, where we observe that two patterns of errors frequently occur and drastically affect the translation quality: language mismatch and repetition. The work sets out to explore the potential for mitigating these two issues by leveraging model editing methods, e.g., by locating Feed-Forward Network (FFN) neurons or other components that are responsible for the errors and deactivating them at inference time. We find that directly applying such methods either has limited effect on the targeted errors or has a significant negative side-effect on the general translation quality, indicating that the located components may also be crucial for keeping machine translation with LLMs on the rails. To this end, we propose to refine the located components by fetching the intersection of the locating results under different language settings, filtering out the aforementioned information that is irrelevant to targeted errors. The experimental results empirically demonstrate that our methods can effectively reduce the language mismatch and repetition ratios and meanwhile enhance or keep the general translation quality in most cases.
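The refinement step described in the abstract (intersecting the locating results from different language settings) can be illustrated with plain set operations. This is a minimal sketch, not the authors' implementation: the `(layer, neuron)` addressing scheme, the function name `refine_located_neurons`, and the example locating results are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical addressing scheme: (layer index, neuron index inside the FFN).
Neuron = Tuple[int, int]


def refine_located_neurons(located: Dict[str, Set[Neuron]]) -> Set[Neuron]:
    """Keep only the neurons that were located under every language setting."""
    if not located:
        return set()
    return set.intersection(*located.values())


# Made-up locating results for three translation directions.
located_by_setting = {
    "en-de": {(3, 17), (5, 42), (9, 8)},
    "en-zh": {(3, 17), (5, 42), (7, 1)},
    "de-zh": {(3, 17), (5, 42), (2, 30)},
}

shared = refine_located_neurons(located_by_setting)
print(sorted(shared))
```

Neurons that show up in only one setting are dropped, on the assumption that they encode something specific to that language pair rather than the targeted error.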
[NLP-25] Robots in the Middle: Evaluating LLMs in Dispute Resolution
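Quick read: This paper evaluates how well large language models (LLMs) can act as mediators in dispute resolution, specifically whether they can analyze dispute conversations, select suitable intervention types, and generate appropriate intervention messages. The key is a blind evaluation on a novel, manually created dataset of 50 dispute scenarios, comparing LLMs with human annotators across several key metrics. The LLMs perform strongly at selecting intervention types and writing intervention messages, outperforming the human annotators on some dimensions, which points to the potential of integrating AI into online dispute resolution (ODR) platforms.
Link: https://arxiv.org/abs/2410.07053
Authors: Jinzhe Tan, Hannes Westermann, Nikhil Reddy Pottanigari, Jaromír Šavelka, Sébastien Meeùs, Mia Godet, Karim Benyekhlef
Keywords (EN): resolution method featuring, neutral third-party, method featuring, featuring a neutral, individuals resolve
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)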
Abstract: Mediation is a dispute resolution method featuring a neutral third-party (mediator) who intervenes to help the individuals resolve their dispute. In this paper, we investigate to what extent large language models (LLMs) are able to act as mediators. We investigate whether LLMs are able to analyze dispute conversations, select suitable intervention types, and generate appropriate intervention messages. Using a novel, manually created dataset of 50 dispute scenarios, we conduct a blind evaluation comparing LLMs with human annotators across several key metrics. Overall, the LLMs showed strong performance, even outperforming our human annotators across dimensions. Specifically, in 62% of the cases, the LLMs chose intervention types that were rated as better than or equivalent to those chosen by humans. Moreover, in 84% of the cases, the intervention messages generated by the LLMs were rated as better than or equal to the intervention messages written by humans. LLMs likewise performed favourably on metrics such as impartiality, understanding and contextualization. Our results demonstrate the potential of integrating AI in online dispute resolution (ODR) platforms.
[NLP-26] PositionID: LLMs can Control Lengths Copy and Paste with Explicit Positional Awareness
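Quick read: This paper addresses the difficulty large language models (LLMs) have in strictly controlling the length of generated text, which it attributes to a lack of positional awareness that leaves models unable to follow specific length constraints. The key contributions are PositionID Prompting and PositionID Fine-Tuning, which strengthen the model's ability to continuously monitor and manage text length during generation. The paper also introduces PositionID CP Prompting, which lets models carry out copy and paste operations accurately, and builds two benchmarks to evaluate these abilities. Experiments show clear gains in length control and copy-paste accuracy without loss of response quality.
Link: https://arxiv.org/abs/2410.07035
Authors: Zekun Wang, Feiyu Duan, Yibo Zhang, Wangchunshu Zhou, Ke Xu, Wenhao Huang, Jie Fu
Keywords (EN): Large Language Models, Large Language, demonstrate impressive capabilities, including role-playing, creative writing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 39 pages. CP-Bench and LenCtrl-Bench are available in this https URL and this https URL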
Abstract:Large Language Models (LLMs) demonstrate impressive capabilities across various domains, including role-playing, creative writing, mathematical reasoning, and coding. Despite these advancements, LLMs still encounter challenges with length control, frequently failing to adhere to specific length constraints due to their token-level operations and insufficient training on data with strict length limitations. We identify this issue as stemming from a lack of positional awareness and propose novel approaches–PositionID Prompting and PositionID Fine-Tuning–to address it. These methods enhance the model’s ability to continuously monitor and manage text length during generation. Additionally, we introduce PositionID CP Prompting to enable LLMs to perform copy and paste operations accurately. Furthermore, we develop two benchmarks for evaluating length control and copy-paste abilities. Our experiments demonstrate that our methods significantly improve the model’s adherence to length constraints and copy-paste accuracy without compromising response quality.
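To make the idea of explicit positional awareness concrete, the sketch below tags every word with a running index so that the current length is visible in the text itself. The abstract does not specify the tagging format used by PositionID Prompting, so the `word[i]` notation and all function names here are assumptions for illustration only.

```python
import re


def add_position_ids(text: str) -> str:
    """Tag each whitespace-separated word with a 1-based position ID."""
    words = text.split()
    return " ".join(f"{word}[{i}]" for i, word in enumerate(words, start=1))


def last_position_id(tagged: str) -> int:
    """Read the current length back from the final position tag (0 if none)."""
    ids = re.findall(r"\[(\d+)\]", tagged)
    return int(ids[-1]) if ids else 0


def strip_position_ids(tagged: str) -> str:
    """Remove the tags to recover the plain text."""
    return re.sub(r"\[\d+\]", "", tagged)


tagged = add_position_ids("keep the answer short")
print(tagged)
print(last_position_id(tagged))
```

With tags like these in the context, a length limit can be checked by reading the last ID instead of counting words implicitly. The sketch would misread text that already contains bracketed numbers such as `[3]`.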
[NLP-27] Clean Evaluations on Contaminated Visual Language Models
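Quick read: This paper addresses clean evaluation of visual language models (VLMs), that is, how to measure VLM performance accurately when the model may have been contaminated by benchmark data. The key idea is a data augmentation method, BGR colour channel switching. It is simple and effective at reducing the effect of contamination, and because it hurts performance when used as training-time augmentation, malicious trainers cannot easily fold it into training, which makes it a promising technique for clean VLM evaluation.
Link: https://arxiv.org/abs/2410.07030
Authors: Hongyuan Lu, Shujie Miao, Wai Lam
Keywords (EN): important research era, possibly contaminated LLMs, evaluate large language, large language models, important research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)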
Abstract: How to evaluate large language models (LLMs) cleanly has been established as an important research area to genuinely report the performance of possibly contaminated LLMs. Yet, how to cleanly evaluate the visual language models (VLMs) is an under-studied problem. We propose a novel approach to achieve such goals through data augmentation methods on the visual input information. We then craft a new visual clean evaluation benchmark with thousands of data instances. Through extensive experiments, we found that the traditional visual data augmentation methods are useful, but they are at risk of being used as a part of the training data as a workaround. We further propose using BGR augmentation to switch the colour channel of the visual information. We found that it is a simple yet effective method for reducing the effect of data contamination and, fortunately, it is also harmful when used as a data augmentation method during training. It means that it is hard for malicious trainers to integrate such data augmentation into training, and it could be a promising technique to cleanly evaluate visual LLMs. Our code, data, and model weights will be released upon publication.
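The channel switch itself is a small operation. The following is a minimal pure-Python sketch that assumes an image stored as rows of `(R, G, B)` tuples; this representation and the function name `rgb_to_bgr` are choices made here for illustration, not taken from the paper.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def rgb_to_bgr(image: Image) -> Image:
    """Return a copy of the image with the red and blue channels swapped."""
    return [[(b, g, r) for (r, g, b) in row] for row in image]


# A 1x2 toy image: one pure red pixel and one mixed pixel.
toy_image: Image = [[(255, 0, 0), (10, 20, 30)]]
switched = rgb_to_bgr(toy_image)
print(switched)
```

With a NumPy array of shape `(height, width, 3)`, the same swap is usually written as `array[..., ::-1]`.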
[NLP-28] Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback
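Quick read: Motivated by radiologist staffing shortages and rising workloads, this paper proposes a scalable automated preference alignment technique for vision-language models (VLMs) that generate chest X-ray (CXR) reports. The key is to combine publicly available datasets with an LLM-as-a-Judge mechanism, so no additional expert radiologist feedback is needed, and to benchmark five direct alignment algorithms (DAAs). Compared with the SFT baseline, the approach improves average GREEN scores by up to 57.4% and the average across six metrics by 9.2%. The paper also studies reward overoptimization through length exploitation and finds no significant alignment tax on six additional tasks.
Link: https://arxiv.org/abs/2410.07025
Authors: Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, Akshay S Chaudhari
Keywords (EN): translating medical images, translating medical, medical images, play a crucial, crucial role
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)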
Abstract: Radiologists play a crucial role by translating medical images into medical reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning (SFT). Meanwhile, in the general domain, additional preference fine-tuning has become standard practice. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback. We propose a scalable automated preference alignment technique for VLMs in radiology, focusing on chest X-ray (CXR) report generation. Our method leverages publicly available datasets with an LLM-as-a-Judge mechanism, eliminating the need for additional expert radiologist feedback. We evaluate and benchmark five direct alignment algorithms (DAAs). Our results show up to a 57.4% improvement in average GREEN scores, an LLM-based metric for evaluating CXR reports, and a 9.2% increase in an average across six metrics (domain specific and general), compared to the SFT baseline. We study reward overoptimization via length exploitation, with reports lengthening by up to 3.2x. To assess a potential alignment tax, we benchmark on six additional diverse tasks, finding no significant degradations. A reader study involving four board-certified radiologists indicates win rates of up to 0.62 over the SFT baseline, while significantly penalizing verbosity. Our analysis provides actionable insights for the development of VLMs in high-stakes fields like radiology.
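The abstract does not name the five direct alignment algorithms it benchmarks. As a point of reference, the sketch below computes the loss of Direct Preference Optimization (DPO), a widely known member of that family, for a single preference pair. It is included only as an example of what a DAA objective looks like; the log-probability values and the default `beta` are made up.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen report than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = -beta * margin
    # Numerically stable softplus, i.e. log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up sequence log-probabilities for a judged report pair.
loss_no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)
loss_prefers_chosen = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(loss_no_preference, loss_prefers_chosen)
```

In the setting the abstract describes, the chosen and rejected reports would come from an LLM judge rather than from radiologists. The loss falls as the policy assigns relatively more probability to the chosen report.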
[NLP-29] Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation
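Quick read: This paper addresses generation of the description section of patents, which makes up more than 90% of a patent document, in the setting where an academic paper supplies the technical specification and an outline conveys the desired patent structure. The key contributions are a new task, outline-guided paper-to-patent generation, and PAP2PAT, a benchmark of 1.8k patent-paper pairs with document outlines. Experiments with current open-weight LLMs and outline-guided chunk-based generation show that the models use information from the paper effectively but struggle with repetition, likely because patent language is inherently repetitive.
Link: https://arxiv.org/abs/2410.07009
Authors: Valentin Knappich, Simon Razniewski, Anna Hätty, Annemarie Friedrich
Keywords (EN): offering practical applications, large language models, natural language processing, providing challenging benchmarks, offering practical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)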
Abstract:The patent domain is gaining attention in natural language processing research, offering practical applications in streamlining the patenting process and providing challenging benchmarks for large language models (LLMs). However, the generation of the description sections of patents, which constitute more than 90% of the patent document, has not been studied to date. We address this gap by introducing the task of outline-guided paper-to-patent generation, where an academic paper provides the technical specification of the invention and an outline conveys the desired patent structure. We present PAP2PAT, a new challenging benchmark of 1.8k patent-paper pairs with document outlines, collected using heuristics that reflect typical research lab practices. Our experiments with current open-weight LLMs and outline-guided chunk-based generation show that they can effectively use information from the paper but struggle with repetitions, likely due to the inherent repetitiveness of patent language. We release our data and code.
[NLP-30] CursorCore: Assist Programming through Aligning Anything
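Quick read: This paper addresses the fact that existing LLM applications for programming assistance are insufficiently automated and struggle to integrate the different kinds of information present during programming, such as coding history, current code, and user instructions. The key is a new conversational framework that integrates these sources, together with APEval, a benchmark for assessing models on programming assistance tasks, and Programming-Instruct, a data generation pipeline that synthesizes training data from diverse sources. Using the pipeline, the authors generate 219K samples and fine-tune multiple models to produce the CursorCore series, which outperforms other models of comparable size.
Link: https://arxiv.org/abs/2410.07002
Authors: Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
Keywords (EN): Large language models, Large language, programming assistance tasks, Assist Programming Eval, successfully applied
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)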
Abstract: Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. Firstly, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, and contributes to the advancement of coding assistants. Code, models and data are freely available at this https URL.
[NLP-31] Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
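Quick read: This paper studies feature universality in large language models (LLMs), that is, how different models represent concepts similarly in the latent spaces of their intermediate layers. The key is dictionary learning with sparse autoencoders (SAEs), which transforms LLM activations into a more interpretable space spanned by neurons that correspond to individual features. Feature neurons are matched across models by activation correlation, and representational space similarity metrics such as Singular Value Canonical Correlation Analysis are then applied to the SAE features. The analysis finds significant similarity between the SAE feature spaces of different LLMs, which is new evidence for feature universality.
Link: https://arxiv.org/abs/2410.06981
Authors: Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
Keywords (EN): similarly represent concepts, large language models, models similarly represent, investigate feature universality, intermediate layers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)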
Abstract:We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones. This makes it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics like Singular Value Canonical Correlation Analysis to analyze these SAE features across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
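The matching step ("matching feature neurons across models via activation correlation") can be sketched as follows. This is an illustrative reading, assuming each SAE feature is described by its activations on a shared set of inputs and is paired with the most correlated feature of the other model; the function names and toy activations are hypothetical, and the paper's exact procedure may differ.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long activation series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def match_features(acts_a: List[Sequence[float]],
                   acts_b: List[Sequence[float]]) -> List[int]:
    """For each feature of model A, return the index of the most correlated feature of model B."""
    matches = []
    for feature_a in acts_a:
        scores = [pearson(feature_a, feature_b) for feature_b in acts_b]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


# Toy activations: one row per SAE feature, one column per shared input.
model_a = [[0.0, 1.0, 0.0, 2.0], [3.0, 0.0, 1.0, 0.0]]
model_b = [[6.0, 0.0, 2.0, 0.0], [0.0, 2.0, 0.0, 4.0]]
pairs = match_features(model_a, model_b)
print(pairs)
```

This greedy version can map several features of model A to the same feature of model B. A one-to-one assignment would need an extra step.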
[NLP-32] Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara
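Quick read: This paper addresses the inadequacy of high-resource language models for the specific needs of Malay languages when computational and data resources are limited. The key is a Personal Intelligence System that efficiently integrates on-device and server-based models: SLiM-34M handles on-device processing and is optimized for low memory and power usage, while MANYAK-1.3B handles server-based tasks for high-performance language processing. The models report strong results on tasks such as machine translation, question answering, and translated IndoMMLU, and the work challenges the assumption that large-scale computational resources are necessary to build effective language models.
Link: https://arxiv.org/abs/2410.06973
Authors: Azree Nazri, Olalekan Agbolade, Faisal Aziz
Keywords (EN): Personal Intelligence System, prove inadequate, contexts with limited, addressing the specific, high-resource language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 tables, 4 figures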
Abstract: In contexts with limited computational and data resources, high-resource language models often prove inadequate, particularly when addressing the specific needs of Malay languages. This paper introduces a Personal Intelligence System designed to efficiently integrate both on-device and server-based models. The system incorporates SLiM-34M for on-device processing, optimized for low memory and power usage, and MANYAK-1.3B for server-based tasks, allowing for scalable, high-performance language processing. The models achieve significant results across various tasks, such as machine translation, question-answering, and translated IndoMMLU. Particularly noteworthy is SLiM-34M’s ability to achieve a high improvement in accuracy compared to other LLMs while using 2 times fewer pre-training tokens. This work challenges the prevailing assumption that large-scale computational resources are necessary to build effective language models, contributing to the development of resource-efficient models for the Malay language with the unique orchestration between SLiM-34M and MANYAK-1.3B.
[NLP-33] Uncovering Factor Level Preferences to Improve Human-Model Alignment
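Quick read: This paper addresses the lack of explainability in evaluating preference alignment for large language models (LLMs). The key is PROFILE, a framework that uncovers and quantifies the influence of specific factors that drive preferences (for example writing style or output verbosity), and so explains why a model aligns with or diverges from human preferences. Its factor-level analysis reveals a substantial discrepancy between human and LLM preferences in generation tasks but strong alignment in evaluation tasks, which gives direction for model improvement.
Link: https://arxiv.org/abs/2410.06965
Authors: Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh
Keywords (EN): Large Language Model, Large Language, advancements in Large, Language Model, preferences remains crucial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)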
Abstract:Despite advancements in Large Language Model (LLM) alignment, understanding the reasons behind LLM preferences remains crucial for bridging the gap between desired and actual behavior. LLMs often exhibit biases or tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. However, current methods for evaluating preference alignment often lack explainability, relying on coarse-grained comparisons. To address this, we introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences. PROFILE’s factor level analysis explains the ‘why’ behind human-model alignment and misalignment, offering insights into the direction of model improvement. We apply PROFILE to analyze human and LLM preferences across three tasks: summarization, helpful response generation, and document-based question-answering. Our factor level analysis reveals a substantial discrepancy between human and LLM preferences in generation tasks, whereas LLMs show strong alignment with human preferences in evaluation tasks. We demonstrate how leveraging factor level insights, including addressing misaligned factors or exploiting the generation-evaluation gap, can improve alignment with human preferences. This work underscores the importance of explainable preference analysis and highlights PROFILE’s potential to provide valuable training signals, driving further improvements in human-model alignment.
[NLP-34] Self-Boosting Large Language Models with Synthetic Preference Data
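Quick read: This paper addresses the cost of collecting high-quality preference data, which is resource-intensive and demands creativity, especially for the continual improvement of large language models (LLMs). The key is SynPO, a self-boosting paradigm that aligns models with synthetic preference data. SynPO works iteratively: a self-prompt generator creates diverse prompts and a response improver progressively refines model responses, so the LLM learns generative rewards for its own outputs and the need for large-scale annotation of prompts and human preferences is removed.
Link: https://arxiv.org/abs/2410.06961
Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
Keywords (EN): Large Language Models, Large Language, Language Models, generating honest, advanced significantly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)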
Abstract:Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
[NLP-35] Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach ICLR2025
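Quick read: This paper addresses the fragility and unreliability of code caused by improper or missing exception handling in software development. The key is Seeker, a multi-agent framework inspired by expert developers' exception handling strategies. Five agents (Scanner, Detector, Predator, Ranker, and Handler) assist large language models (LLMs) in detecting, capturing, and resolving exceptions more effectively. The authors describe the work as the first systematic study of using LLMs to improve exception handling practices.
Link: https://arxiv.org/abs/2410.06949
Authors: Xuanming Zhang, Yuxuan Chen, Yuan Yuan, Minlie Huang
Keywords (EN): exception handling, improper or missing, missing exception handling, handling, world software development
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 26 pages, 7 figures. Submitted ICLR 2025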
Abstract: In real-world software development, improper or missing exception handling can severely impact the robustness and reliability of code. Exception handling mechanisms require developers to detect, capture, and manage exceptions according to high standards, but many developers struggle with these tasks, leading to fragile code. This problem is particularly evident in open source projects and impacts the overall quality of the software ecosystem. To address this challenge, we explore the use of large language models (LLMs) to improve exception handling in code. Through extensive analysis, we identify three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Types, and Distorted Handling Solutions. These problems are widespread across real-world repositories, suggesting that robust exception handling practices are often overlooked or mishandled. In response, we propose Seeker, a multi-agent framework inspired by expert developer strategies for exception handling. Seeker uses five agents (Scanner, Detector, Predator, Ranker, and Handler) to assist LLMs in detecting, capturing, and resolving exceptions more effectively. Our work is the first systematic study on leveraging LLMs to enhance exception handling practices, providing valuable insights for future improvements in code reliability.
[NLP-36] CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages EMNLP2024
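Quick read: This paper addresses the robustness of dependency parsing models to word order variation in low-resource, morphologically rich languages. The key is a contrastive self-supervised learning method, examined alongside modifications such as data augmentation and the removal of position encoding, that makes the model robust to word order changes. On 7 relatively free word order languages, the method gains an average of 3.03/2.95 points in UAS/LAS over the best-performing baseline.
Link: https://arxiv.org/abs/2410.06944
Authors: Pretam Ray, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal
Keywords (EN): free word order, Neural dependency parsing, morphologically rich languages, word order, low resource morphologically
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024 Main (Short), 9 pages, 3 figures, 4 Tables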
Abstract:Neural dependency parsing has achieved remarkable performance for low resource morphologically rich languages. It has also been well-studied that morphologically rich languages exhibit relatively free word order. This prompts a fundamental investigation: Is there a way to enhance dependency parsing performance, making the model robust to word order variations utilizing the relatively free word order nature of morphologically rich languages? In this work, we examine the robustness of graph-based parsing architectures on 7 relatively free word order languages. We focus on scrutinizing essential modifications such as data augmentation and the removal of position encoding required to adapt these architectures accordingly. To this end, we propose a contrastive self-supervised learning method to make the model robust to word order variations. Furthermore, our proposed modification demonstrates a substantial average gain of 3.03/2.95 points in 7 relatively free word order languages, as measured by the UAS/LAS Score metric when compared to the best performing baseline.
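The reported gains are measured in UAS and LAS, the standard dependency parsing metrics. UAS is the share of tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. A minimal sketch of both follows, with a made-up four-token sentence; the function name `uas_las` and the example arcs are not from the paper.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def uas_las(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence or a flattened corpus."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), labeled_hits / len(gold)


gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
uas, las = uas_las(gold_arcs, pred_arcs)
print(uas, las)
```

Here three of four heads are correct (UAS 0.75), but only two tokens also carry the right label (LAS 0.5).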
[NLP-37] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
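Quick read: This paper addresses the fact that existing speculative decoding (SD) methods for accelerating large language model (LLM) inference need additional parameters or extensive training. The key is SWIFT, a plug-and-play, on-the-fly self-speculative decoding algorithm that uses the target LLM with some intermediate layers skipped as its own draft model, so no auxiliary model or extra training is required. SWIFT adaptively selects which intermediate layers to skip during inference, accelerates inference across diverse input data streams while preserving the original distribution of the generated text, and achieves a 1.3x-1.6x speedup in experiments.
Link: https://arxiv.org/abs/2410.06916
Authors: Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li
Keywords (EN): compromising generation quality, large language models, Speculative decoding, generation quality, widely used paradigm
Categories: Computation and Language (cs.CL)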
Abstract:Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches necessitate either additional parameters or extensive training to construct effective draft models, thereby restricting their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution with layer-skipping, which skips intermediate layers of the target LLM as the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. Building on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. SWIFT does not require auxiliary models or additional training, making it a plug-and-play solution for accelerating LLM inference across diverse input data streams. Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
摘要:推测性解码 (Speculative Decoding, SD) 作为一种广泛使用的范式,在不牺牲生成质量的前提下,加速了大语言模型 (Large Language Models, LLMs) 的推理过程。其工作原理是首先利用一个紧凑模型高效地起草多个 Token,然后使用目标 LLM 并行验证这些 Token。尽管这一技术已显著提升了推理速度,但大多数现有方法需要额外的参数或大量训练来构建有效的草稿模型,从而限制了其在不同 LLM 和任务中的适用性。为解决这一限制,我们探索了一种新颖的基于层跳过的即插即用 SD 解决方案,即跳过目标 LLM 的中间层,将其作为紧凑的草稿模型。我们的分析表明,LLM 借助层稀疏性展现出巨大的自我加速潜力,且这种稀疏性具有任务特定性。基于这些见解,我们引入了 SWIFT,一种即时 (on-the-fly) 自推测性解码算法,该算法在推理过程中自适应地选择要跳过的 LLM 中间层。SWIFT 不需要辅助模型或额外训练,使其成为加速各种输入数据流中 LLM 推理的即插即用解决方案。我们在广泛模型和下游任务上的大量实验表明,SWIFT 能够在保持生成文本原始分布的同时,实现 1.3 倍至 1.6 倍的加速。
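The draft-then-verify loop described in the abstract can be illustrated with a minimal greedy sketch. This is not the SWIFT implementation: `draft_next` and `target_next` are hypothetical toy functions standing in for the layer-skipped draft model and the full target LLM, and verification is done token by token here rather than in one parallel forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: emit the target's token and stop this step.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # All drafts accepted: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix, draft_next, target_next, k, max_new):
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]


# Hypothetical toy models: the target counts upward modulo 10,
# the draft makes a mistake whenever the last token is 2, 5 or 8.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10


print(generate([0], draft_next, target_next, k=3, max_new=8))
```

Because every emitted token is either a draft token the target agrees with or the target's own token, the output equals plain greedy decoding with the target alone; the draft only changes how many target calls are needed per emitted token.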
[NLP-38] Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning
【速读】: 该论文试图解决Refusal-Aware Instruction Tuning (RAIT)过程中出现的“过度拒绝”问题,即大型语言模型(LLMs)在面对本可以正确回答的问题时也选择拒绝回答。解决方案的关键在于引入“确定性表示的知识流用于拒绝感知指令构建(CRaFT)”。CRaFT通过两个主要贡献来解决这一问题:首先,结合响应的确定性来选择性地过滤和修改数据,减少静态冲突;其次,实施初步的排练训练以表征LLM知识状态的变化,从而在微调过程中缓解动态冲突。
链接: https://arxiv.org/abs/2410.06913
作者: Runchuan Zhu,Zhipeng Ma,Jiang Wu,Junyuan Gao,Jiaqi Wang,Dahua Lin,Conghui He
关键词-EN: Large Language Models, enables Large Language, Language Models, Large Language, Refusal-Aware Instruction Tuning
类目: Computation and Language (cs.CL)
备注: Equal contribution: Runchuan Zhu, Zhipeng Ma, Jiang Wu; Corresponding author: Conghui He
点击查看摘要
Abstract:Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. By modifying responses of unknown questions in the training data to refusal responses such as “I don’t know”, RAIT enhances the reliability of LLMs and reduces their hallucination. Generally, RAIT modifies training samples based on the correctness of the initial LLM’s response. However, this crude approach can cause LLMs to excessively refuse answering questions they could have correctly answered, the problem we call over-refusal. In this paper, we explore two primary causes of over-refusal: Static conflict emerges when the RAIT data is constructed solely on correctness criteria, causing similar samples in the LLM’s feature space to be assigned different labels (original vs. modified “I don’t know”). Dynamic conflict occurs due to the changes of LLM’s knowledge state during fine-tuning, which transforms previous unknown questions into knowns, while the training data, which is constructed based on the initial LLM, remains unchanged. These conflicts cause the trained LLM to misclassify known questions as unknown, resulting in over-refusal. To address this issue, we introduce Certainty Represented Knowledge Flow for Refusal-Aware Instructions Construction (CRaFT). CRaFT centers on two main contributions: First, we additionally incorporate response certainty to selectively filter and modify data, reducing static conflicts. Second, we implement preliminary rehearsal training to characterize changes in the LLM’s knowledge state, which helps mitigate dynamic conflicts during the fine-tuning process. We conducted extensive experiments on open-ended question answering and multiple-choice question task. Experiment results show that CRaFT can improve LLM’s overall performance during the RAIT process. Source code and training data will be released at Github.
摘要:拒绝感知指令调优 (Refusal-Aware Instruction Tuning, RAIT) 使大语言模型 (Large Language Models, LLMs) 能够拒绝回答未知问题。通过将训练数据中未知问题的响应修改为拒绝响应,如“我不知道”,RAIT 提高了 LLMs 的可靠性并减少了其幻觉。通常,RAIT 根据初始 LLM 响应的正确性修改训练样本。然而,这种粗糙的方法可能导致 LLMs 过度拒绝回答它们本可以正确回答的问题,我们称之为过度拒绝问题。本文探讨了过度拒绝的两个主要原因:静态冲突在 RAIT 数据仅基于正确性标准构建时出现,导致 LLM 特征空间中的相似样本被分配不同的标签(原始 vs. 修改后的“我不知道”)。动态冲突是由于 LLM 在微调过程中知识状态的变化,将之前未知的问题转化为已知问题,而基于初始 LLM 构建的训练数据保持不变。这些冲突导致训练后的 LLM 将已知问题错误分类为未知,从而导致过度拒绝。为解决这一问题,我们引入了拒绝感知指令构建的确定性表示知识流 (Certainty Represented Knowledge Flow for Refusal-Aware Instructions Construction, CRaFT)。CRaFT 围绕两个主要贡献展开:首先,我们额外引入响应确定性来选择性过滤和修改数据,减少静态冲突。其次,我们实施初步排练训练以表征 LLM 知识状态的变化,这有助于在微调过程中缓解动态冲突。我们在开放式问答和多项选择题任务上进行了广泛的实验。实验结果表明,CRaFT 可以在 RAIT 过程中提高 LLM 的整体性能。源代码和训练数据将在 Github 上发布。
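To make the idea of combining correctness with response certainty concrete, here is a minimal sketch of a data-construction rule. The thresholds, the majority-vote definition of certainty, and the decision to drop confidently wrong samples are all assumptions made for illustration; the abstract does not specify CRaFT's actual criteria.

```python
from collections import Counter


def correctness_and_certainty(samples, gold):
    """Estimate both quantities from repeated answers to the same question."""
    counts = Counter(samples)
    correctness = counts[gold] / len(samples)
    # Assumed proxy for certainty: share of the most frequent answer.
    certainty = counts.most_common(1)[0][1] / len(samples)
    return correctness, certainty


def build_rait_example(question, gold, samples, tau_correct=0.5, tau_certain=0.5):
    """Return a (question, target) pair, or None if the sample is filtered out."""
    correctness, certainty = correctness_and_certainty(samples, gold)
    if correctness >= tau_correct:
        return (question, gold)            # treated as known: keep the answer
    if certainty < tau_certain:
        return (question, "I don't know")  # wrong and unsure: relabel as refusal
    return None                            # wrong but confident: dropped (assumed rule)


print(build_rait_example("q1", "A", ["A", "A", "B", "A"]))
print(build_rait_example("q2", "A", ["B", "C", "D", "A"]))
print(build_rait_example("q3", "A", ["B", "B", "B", "A"]))
```

A correctness-only rule would relabel both `q2` and `q3` as refusals; adding certainty lets the builder treat them differently.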
[NLP-39] Generative Model for Less-Resourced Language with 1 billion parameters
【速读】: 该论文试图解决低资源语言(如斯洛文尼亚语)的大规模生成语言模型(LLMs)的开发问题。解决方案的关键在于通过继续预训练现有的英语OPT模型,并开发适应斯洛文尼亚语、克罗地亚语和英语的新分词器,以及使用FOCUS和WECHSEL方法迁移嵌入向量,从而创建具有10亿参数的GaMS 1B模型。该模型在斯洛文尼亚语的分类任务和句子简化任务中表现出色,尤其是在句子简化任务中,其性能与GPT-3.5-Turbo模型相当或更优。
链接: https://arxiv.org/abs/2410.06898
作者: Domen Vreš,Martin Božič,Aljaž Potočnik,Tomaž Martinčič,Marko Robnik-Šikonja
关键词-EN: English OPT model, natural language processing, modern natural language, English OPT, English
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are a basic infrastructure for modern natural language processing. Many commercial and open-source LLMs exist for English, e.g., ChatGPT, Llama, Falcon, and Mistral. As these models are trained on mostly English texts, their fluency and knowledge of low-resource languages and societies are superficial. We present the development of large generative language models for a less-resourced language. GaMS 1B - Generative Model for Slovene with 1 billion parameters was created by continuing pretraining of the existing English OPT model. We developed a new tokenizer adapted to Slovene, Croatian, and English languages and used embedding initialization methods FOCUS and WECHSEL to transfer the embeddings from the English OPT model. We evaluate our models on several classification datasets from the Slovene suite of benchmarks and generative sentence simplification task SENTA. We only used a few-shot in-context learning of our models, which are not yet instruction-tuned. For classification tasks, in this mode, the generative models lag behind the existing Slovene BERT-type models fine-tuned for specific tasks. On a sentence simplification task, the GaMS models achieve comparable or better performance than the GPT-3.5-Turbo model.
摘要:大语言模型 (LLMs) 是现代自然语言处理的基础设施。许多商业和开源的 LLMs 针对英语存在,例如 ChatGPT、Llama、Falcon 和 Mistral。由于这些模型主要在英语文本上进行训练,它们对低资源语言和社会的流畅性和知识较为表面。我们介绍了为一种低资源语言开发的大型生成式语言模型。GaMS 1B - 拥有 10 亿参数的斯洛文尼亚语生成模型,是通过继续预训练现有的英语 OPT 模型创建的。我们开发了一种新的 Tokenizer,适用于斯洛文尼亚语、克罗地亚语和英语,并使用嵌入初始化方法 FOCUS 和 WECHSEL 从英语 OPT 模型中转移嵌入。我们在斯洛文尼亚语基准套件中的几个分类数据集和生成句子简化任务 SENTA 上评估了我们的模型。我们仅使用了少样本上下文学习,这些模型尚未进行指令微调。对于分类任务,在这种模式下,生成模型落后于针对特定任务微调的现有斯洛文尼亚语 BERT 类型模型。在句子简化任务上,GaMS 模型达到了与 GPT-3.5-Turbo 模型相当或更好的性能。
[NLP-40] FltLM: An Intergrated Long-Context Large Language Model for Effective Context Filtering and Understanding ECAI-2024
【速读】: 该论文试图解决长上下文大语言模型(Long-Context LLMs)在处理长文档和多语料库时面临的两个关键问题:中间信息丢失现象和注意力分散问题。解决方案的关键在于提出了Context Filtering Language Model (FltLM),通过引入带有软掩码机制的上下文过滤器,动态识别并排除无关内容,聚焦于相关信息,从而增强模型在多文档问答任务中的理解和推理能力。这一方法不仅缓解了上述两个问题,还使得模型能够在单次前向传递中高效运作,显著提升了复杂问答场景下的性能。
链接: https://arxiv.org/abs/2410.06886
作者: Jingyang Deng,Zhengyang Shen,Boyang Wang,Lixin Su,Suqi Cheng,Ying Nie,Junfeng Wang,Dawei Yin,Jinwen Ma
关键词-EN: Large Language Models, Long-Context Large Language, markedly advanced natural, Filtering Language Model, Large Language
类目: Computation and Language (cs.CL)
备注: Accepted by the 27th European Conference on Artificial Intelligence (ECAI-2024), this is the full version of the paper including technical appendices. This final version features enhanced formatting and corrections to errors present in other online versions. We regret any inconvenience this may have caused our readers
点击查看摘要
Abstract:The development of Long-Context Large Language Models (LLMs) has markedly advanced natural language processing by facilitating the process of textual data across long documents and multiple corpora. However, Long-Context LLMs still face two critical challenges: The lost in the middle phenomenon, where crucial middle-context information is likely to be missed, and the distraction issue that the models lose focus due to overly extended contexts. To address these challenges, we propose the Context Filtering Language Model (FltLM), a novel integrated Long-Context LLM which enhances the ability of the model on multi-document question-answering (QA) tasks. Specifically, FltLM innovatively incorporates a context filter with a soft mask mechanism, identifying and dynamically excluding irrelevant content to concentrate on pertinent information for better comprehension and reasoning. Our approach not only mitigates these two challenges, but also enables the model to operate conveniently in a single forward pass. Experimental results demonstrate that FltLM significantly outperforms supervised fine-tuning and retrieval-based methods in complex QA scenarios, suggesting a promising solution for more accurate and reliable long-context natural language understanding applications.
摘要:长上下文大语言模型 (Long-Context Large Language Models, LLMs) 的发展显著推动了自然语言处理技术,通过促进跨长文档和多语料库的文本数据处理。然而,长上下文 LLMs 仍面临两个关键挑战:中间信息丢失现象 (lost in the middle phenomenon),即关键的中间上下文信息可能被忽略;以及分散注意力问题 (distraction issue),模型因上下文过长而失去焦点。为应对这些挑战,我们提出了上下文过滤语言模型 (Context Filtering Language Model, FltLM),这是一种新颖的集成长上下文 LLM,增强了模型在多文档问答 (QA) 任务中的能力。具体而言,FltLM 创新性地引入了一种带有软掩码机制的上下文过滤器,识别并动态排除无关内容,专注于相关信息以实现更好的理解和推理。我们的方法不仅缓解了这两个挑战,还使模型能够在单次前向传递中便捷地运行。实验结果表明,FltLM 在复杂 QA 场景中显著优于监督微调和基于检索的方法,为更准确和可靠的长上下文自然语言理解应用提供了有前景的解决方案。
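One way to picture a soft mask is as a per-token relevance score that scales attention weights down smoothly instead of removing tokens outright. The sketch below is an illustrative assumption, not FltLM's architecture: the abstract does not say where the mask is applied or how the relevance scores are produced.

```python
import math


def soft_masked_attention(logits, relevance):
    """Softmax over attention logits, softly down-weighted by relevance.

    logits: raw attention logits, one per context token.
    relevance: scores in (0, 1]; 1 keeps a token, values near 0 suppress it.
    Adding log(relevance) to a logit multiplies its unnormalised weight by relevance.
    """
    masked = [l + math.log(r) for l, r in zip(logits, relevance)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Three equally scored tokens; the third is judged irrelevant by the filter.
print(soft_masked_attention([0.0, 0.0, 0.0], [1.0, 1.0, 1e-6]))
```

Because the mask is a continuous weight, filtering and answering can in principle happen in the same forward pass, which matches the single-pass property claimed in the abstract.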
[NLP-41] Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
【速读】: 该论文试图解决在非文本领域中,由于缺乏大型预训练模型,难以将Transformer模型替换为线性时间复杂度架构(如Linformer和Mamba)的问题。解决方案的关键是提出了一种跨架构逐层蒸馏(Cross-Architecture Layerwise Distillation, CALD)方法,该方法能够在将Transformer模型转换为线性时间替代模型的同时,对其进行目标任务的微调,以保留原始模型的推理能力。论文还探讨了不同的微调引导策略,以优化保留原始模型推理能力的效果,并通过一系列实验验证了CALD方法的有效性。
链接: https://arxiv.org/abs/2410.06846
作者: Mutian He,Philip N. Garner
关键词-EN: Linformer and Mamba, Mamba have recently, linear time replacements, competitive linear time, recently emerged
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 15 pages, 4 figures
点击查看摘要
Abstract:Architectures such as Linformer and Mamba have recently emerged as competitive linear time replacements for transformers. However, corresponding large pretrained models are often unavailable, especially in non-text domains. To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task. We also compare several means to guide the fine-tuning to optimally retain the desired inference capability from the original model. The methods differ in their use of the target model and the trajectory of the parameters. In a series of empirical studies on language processing, language modeling, and speech processing, we show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result. Some reasons for the variation are suggested.
摘要:诸如 Linformer 和 Mamba 等架构近期作为 Transformer 的线性时间替代方案崭露头角。然而,相应的大型预训练模型往往不可用,尤其是在非文本领域。为解决这一问题,我们提出了一种跨架构逐层蒸馏 (Cross-Architecture Layerwise Distillation, CALD) 方法,该方法能够将 Transformer 模型转换为线性时间替代模型,并针对目标任务进行微调。我们还比较了几种引导微调的方法,以最佳地保留原始模型的期望推理能力。这些方法在目标模型的使用和参数轨迹方面有所不同。在一系列关于语言处理、语言建模和语音处理的实证研究中,我们展示了 CALD 能够有效恢复原始模型的结果,并且引导策略对结果有所贡献。我们提出了一些导致结果差异的原因。
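Jointly converting and fine-tuning suggests an objective that combines a task loss with a per-layer matching term between the transformer (teacher) and its linear-time substitute (student). The sketch below assumes a mean-squared-error match on hidden states and a weight `alpha`; both are illustrative choices, and the paper's actual loss and guiding strategies may differ.

```python
def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_distillation_loss(teacher_hiddens, student_hiddens, task_loss, alpha=1.0):
    """Task loss plus the average per-layer hidden-state mismatch."""
    if len(teacher_hiddens) != len(student_hiddens):
        raise ValueError("teacher and student must expose the same number of layers")
    per_layer = [mse(t, s) for t, s in zip(teacher_hiddens, student_hiddens)]
    distill = sum(per_layer) / len(per_layer)
    return task_loss + alpha * distill


teacher = [[1.0, 2.0], [3.0, 4.0]]  # hidden states of two teacher layers
student = [[1.0, 2.0], [3.0, 6.0]]  # the second student layer deviates
print(layerwise_distillation_loss(teacher, student, task_loss=0.5))
```

Matching each layer separately gives every replaced block its own training signal, rather than relying only on the final output.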
[NLP-42] MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
【速读】: 该论文试图解决心理健康领域中个性化治疗数据隐私受限,导致难以构建强大诊断和治疗模型的问题。解决方案的关键在于提出了MentalArena自对弈框架,通过生成领域特定的个性化数据来训练语言模型。该框架的核心创新包括Symptom Encoder,用于从认知和行为角度模拟真实患者,以及Symptom Decoder,用于比较诊断症状与编码症状,并根据识别到的偏差动态管理患者与治疗师之间的对话。通过这些方法,模型能够进行个性化的诊断和治疗,并在多个基准测试中显著优于现有先进模型。
链接: https://arxiv.org/abs/2410.06845
作者: Cheng Li,May Fung,Qingyun Wang,Chi Han,Manling Li,Jindong Wang,Heng Ji
关键词-EN: Mental health disorders, Mental health, health disorders, Mental, health
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Technical Report; 27 pages
点击查看摘要
Abstract:Mental health disorders are one of the most serious diseases in the world. Most people with such a disease lack access to adequate care, which highlights the importance of training models for the diagnosis and treatment of mental health disorders. However, in the mental health domain, privacy concerns limit the accessibility of personalized treatment data, making it challenging to build powerful models. In this paper, we introduce MentalArena, a self-play framework to train language models by generating domain-specific personalized data, where we obtain a better model capable of making a personalized diagnosis and treatment (as a therapist) and providing information (as a patient). To accurately model human-like mental health patients, we devise Symptom Encoder, which simulates a real patient from both cognition and behavior perspectives. To address intent bias during patient-therapist interactions, we propose Symptom Decoder to compare diagnosed symptoms with encoded symptoms, and dynamically manage the dialogue between patient and therapist according to the identified deviations. We evaluated MentalArena against 6 benchmarks, including biomedicalQA and mental health tasks, compared to 6 advanced models. Our models, fine-tuned on both GPT-3.5 and Llama-3-8b, significantly outperform their counterparts, including GPT-4o. We hope that our work can inspire future research on personalized care. Code is available via the link on the arXiv abstract page.
摘要:心理健康障碍是全球最严重的疾病之一。大多数患有此类疾病的人缺乏足够的护理,这凸显了训练用于心理健康障碍诊断和治疗的模型的重要性。然而,在心理健康领域,隐私问题限制了个性化治疗数据的可用性,使得构建强大的模型变得具有挑战性。本文中,我们介绍了 MentalArena,一个通过生成特定领域个性化数据来训练语言模型的自对弈框架,我们由此获得了一个更好的模型,它既能(作为治疗师)进行个性化诊断和治疗,也能(作为患者)提供信息。为了准确模拟类人心理健康患者,我们设计了 Symptom Encoder,该编码器从认知和行为两个角度模拟真实患者。为了解决患者与治疗师互动中的意图偏差问题,我们提出了 Symptom Decoder,用于比较诊断症状与编码症状,并根据识别到的偏差动态管理患者与治疗师之间的对话。我们在包括 biomedicalQA 和心理健康任务在内的 6 个基准上评估了 MentalArena,并与 6 个先进模型进行了比较。我们基于 GPT-3.5 和 Llama-3-8b 微调得到的模型显著优于其对应模型,包括 GPT-4o。我们希望我们的工作能够启发未来在个性化护理方面的研究。代码链接见 arXiv 摘要页。
[NLP-43] Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
【速读】: 该论文旨在解决大语言模型(LLMs)在面对恶意或错误指令时产生有害输出的风险问题。解决方案的关键在于设计了一种基于解码阶段的逐步防御机制,通过引入推测性解码(speculative decoding)来直接修正有害查询,而非简单地拒绝这些查询。这种方法不仅提高了模型的安全性,还保持了其推理速度和有用性,有效利用了模型对先前标记危险性的识别能力。
链接: https://arxiv.org/abs/2410.06809
作者: Xinyi Zeng,Yuying Shang,Yutao Zhu,Jiawei Chen,Yu Tian
关键词-EN: Large language models, demonstrated immense utility, Large language, demonstrated immense, immense utility
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 19 pages, 9 figures
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful responses from the prefill-level lacks utilization of the model’s decoding outputs, leading to relatively lower effectiveness and robustness. 2) Rejecting potentially harmful responses based on a single evaluation can significantly impair the model’s helpfulness. This paper examines the LLMs’ capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Motivated by pilot experiment results, we design a robust defense mechanism at the decoding level. Our novel decoder-oriented, step-by-step defense architecture corrects harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost secure decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model’s ability to discern hazardous information, maintaining its helpfulness compared to existing methods.
摘要:大语言模型 (LLMs) 在各个行业中展示了巨大的实用性。然而,随着 LLMs 的进步,由于不正确或恶意的指令提示,有害输出的风险也随之增加。尽管当前的方法有效地解决了越狱风险,但它们存在共同的局限性:1) 从预填充级别判断有害响应缺乏对模型解码输出的利用,导致相对较低的有效性和鲁棒性。2) 基于单一评估拒绝潜在有害响应可能会显著损害模型的性能。本文研究了 LLMs 识别有害输出的能力,揭示并量化了其在评估先前 Token 危险性方面的熟练程度。受试点实验结果的启发,我们设计了一种在解码级别的鲁棒防御机制。我们新颖的面向解码器的逐步防御架构直接纠正有害查询,而不是直接拒绝它们。我们引入了推测性解码以增强可用性并促进部署,以提高安全解码速度。广泛的实验表明,我们的方法在不牺牲推理速度的情况下提高了模型安全性。值得注意的是,我们的方法利用了模型辨别危险信息的能力,与现有方法相比保持了其有用性。
[NLP-44] Seg2Act: Global Context-aware Action Generation for Document Logical Structuring EMNLP2024
【速读】: 该论文试图解决文档逻辑结构提取的问题,特别是在处理长文档的复杂性和多样性时,传统方法表现不佳的挑战。解决方案的关键在于引入Seg2Act,这是一种端到端的生成式方法,将文档逻辑结构提取重新定义为动作生成任务。Seg2Act通过全局上下文感知的生成模型,迭代地生成动作序列,并根据生成的动作同时更新全局上下文和当前的逻辑结构,从而在ChCatExt和HierDoc数据集上展示了在监督学习和迁移学习设置中的优越性能。
链接: https://arxiv.org/abs/2410.06802
作者: Zichao Li,Shaojie He,Meng Liao,Xuanang Chen,Yaojie Lu,Hongyu Lin,Yanxiong Lu,Xianpei Han,Le Sun
关键词-EN: underlying hierarchical structure, Document logical structuring, logical structuring aims, aims to extract, extract the underlying
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024 Main Conference
点击查看摘要
Abstract:Document logical structuring aims to extract the underlying hierarchical structure of documents, which is crucial for document intelligence. Traditional approaches often fall short in handling the complexity and the variability of lengthy documents. To address these issues, we introduce Seg2Act, an end-to-end, generation-based method for document logical structuring, revisiting logical structure extraction as an action generation task. Specifically, given the text segments of a document, Seg2Act iteratively generates the action sequence via a global context-aware generative model, and simultaneously updates its global context and current logical structure based on the generated actions. Experiments on ChCatExt and HierDoc datasets demonstrate the superior performance of Seg2Act in both supervised and transfer learning settings.
摘要:文档逻辑结构化旨在提取文档的底层层次结构,这对于文档智能至关重要。传统方法在处理长文档的复杂性和多样性方面往往表现不佳。为了解决这些问题,我们提出了 Seg2Act,一种端到端的生成式方法,将文档逻辑结构提取重新定义为动作生成任务。具体而言,给定文档的文本片段,Seg2Act 通过一个全局上下文感知的生成模型迭代生成动作序列,并根据生成的动作同时更新其全局上下文和当前逻辑结构。在 ChCatExt 和 HierDoc 数据集上的实验表明,Seg2Act 在监督学习和迁移学习设置下均表现出优越的性能。
[NLP-45] From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
【速读】: 该论文试图解决大型视觉-语言模型(LVLMs)中的幻觉问题,即模型生成视觉输入中不存在的对象,从而影响其可靠性。论文认为幻觉的主要原因是模型在视觉特征提取和解耦方面的不足,而非仅仅是对视觉输入的理解问题。解决方案的关键在于提出了一种名为PATCH的新型调优策略,该策略通过使用自适应虚拟标记从边界框中提取对象特征,从而有效解决视觉特征解耦不足导致的幻觉问题。PATCH方法具有即插即用的特性,可集成到多种LVLMs中,并在多个多模态幻觉数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2410.06795
作者: Yuying Shang,Xinyi Zeng,Yutao Zhu,Xiao Yang,Zhengwei Fang,Jingyuan Zhang,Jiawei Chen,Zinan Liu,Yu Tian
关键词-EN: large vision-language models, visual input, significant challenge, impairs their reliability, large vision-language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not presented in the visual input, which impairs their reliability. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model’s inability to effectively extract or decouple visual features. In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling). Motivated by our findings on the preliminary investigation, we propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs. This plug-and-play method can be integrated into various LVLMs, utilizing adaptive virtual tokens to extract object features from bounding boxes, thereby addressing hallucinations caused by insufficient decoupling of visual features. PATCH achieves state-of-the-art performance on multiple multi-modal hallucination datasets. We hope this approach provides researchers with deeper insights into the underlying causes of hallucinations in LVLMs, fostering further advancements and innovation in this field.
摘要:大型视觉-语言模型 (LVLMs) 中的幻觉现象是一个重大挑战,即生成视觉输入中未呈现的对象,这损害了模型的可靠性。近期研究往往将幻觉归因于对视觉输入理解不足,却忽视了一个更根本的问题:模型无法有效提取或解耦视觉特征。本文从架构角度重新审视 LVLMs 中的幻觉现象,探讨其主要原因在于视觉编码器 (特征提取) 还是模态对齐模块 (特征解耦)。基于初步调查的发现,我们提出了一种新颖的调优策略 PATCH,以缓解 LVLMs 中的幻觉问题。这种即插即用的方法可以集成到各种 LVLMs 中,利用自适应虚拟 Token 从边界框中提取对象特征,从而解决因视觉特征解耦不足导致的幻觉问题。PATCH 在多个多模态幻觉数据集上达到了最先进的性能。我们希望这种方法能为研究人员提供更深入的洞察,了解 LVLMs 中幻觉的根本原因,从而促进该领域的进一步发展和创新。
[NLP-46] To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models EMNLP2024
【速读】: 该论文试图解决多模态大语言模型(MLLM)架构中连接器选择对感知任务性能的影响问题。解决方案的关键在于系统地分类和评估不同类型的连接器(特征保留型和特征压缩型)在不同粒度感知任务(粗粒度、细粒度和推理任务)中的表现。研究发现,特征保留型连接器在细粒度感知任务中表现优异,因其能保留详细的视觉信息;而特征压缩型连接器在粗粒度感知和推理任务中表现相当,并具有显著的速度优势。这些发现为MLLM架构设计和优化提供了重要指导。
链接: https://arxiv.org/abs/2410.06765
作者: Junyan Lin,Haoran Chen,Dawei Zhu,Xiaoyu Shen
关键词-EN: multimodal large language, large language models, garnered significant attention, recent years, multimodal large
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to EMNLP 2024 Main Conference
点击查看摘要
Abstract:In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate on constructing MLLM architectures, particularly regarding the selection of appropriate connectors for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance. Our findings reveal that feature-preserving connectors excel in fine-grained perception tasks due to their ability to retain detailed visual information. In contrast, feature-compressing connectors, while less effective in fine-grained perception tasks, offer significant speed advantages and perform comparably in coarse-grained perception and reasoning tasks. These insights are crucial for guiding MLLM architecture design and advancing the optimization of MLLM architectures.
摘要:近年来,多模态大语言模型 (MLLM) 在工业界和学术界引起了广泛关注。然而,关于如何构建 MLLM 架构,尤其是如何选择适合不同粒度感知任务的连接器,仍存在较大争议。本文系统地研究了连接器对 MLLM 性能的影响。具体而言,我们将连接器分为特征保留型和特征压缩型两类。利用统一的分类标准,我们将来自三个综合基准 MMBench、MME 和 SEED-Bench 的子任务分为粗粒度感知、细粒度感知和推理三种任务类型,并评估了其性能。我们的研究结果表明,特征保留型连接器在细粒度感知任务中表现优异,这得益于其能够保留详细的视觉信息。相比之下,特征压缩型连接器虽然在细粒度感知任务中效果较差,但在速度上具有显著优势,并且在粗粒度感知和推理任务中表现相当。这些见解对于指导 MLLM 架构设计以及推进 MLLM 架构的优化至关重要。
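The distinction between the two connector families comes down to whether the number of visual tokens passed to the LLM is kept or reduced. The sketch below uses two deliberately simple stand-ins, a per-token scaling and non-overlapping average pooling; these are illustrative assumptions and not the specific connectors evaluated in the paper.

```python
def preserve(tokens, scale=1.0):
    """Feature-preserving stand-in: transform each token, keep the token count."""
    return [[scale * x for x in tok] for tok in tokens]


def compress(tokens, window):
    """Feature-compressing stand-in: average-pool non-overlapping token windows."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
    return pooled


visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(len(preserve(visual_tokens)))  # same number of tokens as the input
print(compress(visual_tokens, 2))    # half as many tokens, each an average
```

Fewer tokens mean a shorter LLM input and hence faster inference, at the cost of averaging away some fine-grained detail, which is consistent with the trade-off reported in the abstract.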
[NLP-47] CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models EMNLP2024
【速读】: 该论文试图解决多任务学习(MTL)在大语言模型(LLMs)微调过程中面临的任务收敛不平衡和计算资源消耗大的问题。解决方案的关键在于提出了一种名为CoBa的新MTL方法,通过引入相对收敛分数(RCS)、绝对收敛分数(ACS)和发散因子(DF),动态调整任务权重,确保所有任务的验证损失以均匀的步伐趋向收敛,同时避免个别任务的发散,从而在最小化计算开销的同时提升模型性能。
链接: https://arxiv.org/abs/2410.06741
作者: Zi Gong,Hang Yu,Cong Liao,Bingchang Liu,Chaoyu Chen,Jianguo Li
关键词-EN: large language models, developing separate models, Multi-task learning, Absolute Convergence Scores, language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, main conference of EMNLP 2024
点击查看摘要
Abstract:Multi-task learning (MTL) benefits the fine-tuning of large language models (LLMs) by providing a single model with improved performance and generalization ability across tasks, presenting a resource-efficient alternative to developing separate models for each task. Yet, existing MTL strategies for LLMs often fall short by either being computationally intensive or failing to ensure simultaneous task convergence. This paper presents CoBa, a new MTL approach designed to effectively manage task convergence balance with minimal computational overhead. Utilizing Relative Convergence Scores (RCS), Absolute Convergence Scores (ACS), and a Divergence Factor (DF), CoBa dynamically adjusts task weights during the training process, ensuring that the validation loss of all tasks progress towards convergence at an even pace while mitigating the issue of individual task divergence. The results of our experiments involving three disparate datasets underscore that this approach not only fosters equilibrium in task improvement but enhances the LLMs’ performance by up to 13% relative to the second-best baselines. Code is open-sourced; see the link on the arXiv abstract page.
摘要:多任务学习 (Multi-task learning, MTL) 通过提供一个在多个任务上具有改进性能和泛化能力的单一模型,为大语言模型 (Large Language Models, LLMs) 的微调带来了益处,成为为每个任务开发单独模型的资源高效替代方案。然而,现有面向 LLMs 的 MTL 策略往往存在不足,要么计算量庞大,要么无法确保任务同时收敛。本文提出了 CoBa,一种新的 MTL 方法,旨在以最小的计算开销有效管理任务收敛平衡。CoBa 利用相对收敛分数 (Relative Convergence Scores, RCS)、绝对收敛分数 (Absolute Convergence Scores, ACS) 和发散因子 (Divergence Factor, DF),在训练过程中动态调整任务权重,确保所有任务的验证损失以均匀的速度向收敛推进,同时缓解个别任务发散的问题。我们在三个不同数据集上的实验结果表明,该方法不仅促进了任务改进的平衡,还使 LLMs 的性能相对于次优基线提升了高达 13%。代码已开源,链接见 arXiv 摘要页。
[NLP-48] Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?
【速读】: 该论文试图解决的问题是验证在预训练过程中,哪些编程语言及其特征对逻辑推理性能有显著影响。解决方案的关键在于通过从头开始预训练基于解码器的语言模型,使用十种编程语言和三种自然语言数据集,并在相同的条件下进行训练。随后,在不需要常识或世界知识的逻辑推理任务(如FLD和bAbi)上进行少样本上下文学习评估。结果表明,使用编程语言训练的模型在逻辑推理任务上表现优于使用自然语言训练的模型,且编程语言的抽象语法树深度对逻辑推理性能有影响。这一发现为提升大型语言模型的基础能力提供了关键见解。
链接: https://arxiv.org/abs/2410.06735
作者: Fumiya Uchiyama,Takeshi Kojima,Andrew Gambardella,Qi Cao,Yusuke Iwasawa,Yutaka Matsuo
关键词-EN: Recent large language, demonstrated remarkable generalization, Recent large, remarkable generalization abilities, programming languages
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks. Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on logical reasoning tasks: FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logic inference performance. In addition, we found that models trained with programming languages exhibit a better ability to follow instructions compared to those trained with natural languages. Further analysis reveals that the depth of Abstract Syntax Trees representing parsed results of programs also affects logical reasoning performance. These findings will offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.
摘要:近期的大语言模型 (LLMs) 在数学和逻辑推理任务中展示了显著的泛化能力。先前研究表明,使用编程语言数据预训练的 LLM 表现出较高的数学和推理能力;然而,这种因果关系尚未经过严格验证。我们的研究旨在验证在预训练过程中,哪些编程语言及其特征影响了逻辑推理性能。具体而言,我们在相同条件下从头开始预训练了基于解码器的语言模型,使用了来自十种编程语言(如 Python, C, Java)和三种自然语言数据集(Wikipedia, Fineweb, C4)的数据。随后,我们在少样本上下文学习设置下评估了这些模型在逻辑推理任务(FLD 和 bAbi)上的表现,这些任务不需要常识或世界知识。结果显示,几乎所有使用编程语言训练的模型在逻辑推理任务上的表现均优于使用自然语言训练的模型,这表明编程语言中包含激发逻辑推理性能的因素。此外,我们发现使用编程语言训练的模型在遵循指令方面的能力优于使用自然语言训练的模型。进一步分析表明,表示程序解析结果的抽象语法树的深度也影响逻辑推理性能。这些发现将为获取 LLM 基础能力所需的预训练要素提供见解。
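The abstract singles out the depth of the Abstract Syntax Tree as a feature that matters. As a minimal illustration of what that quantity is, the sketch below measures AST depth for Python snippets with the standard `ast` module. How the authors measured depth across ten languages is not described here, so the helper `ast_depth` and the Python-only setting are assumptions for illustration.

```python
import ast


def ast_depth(node):
    """Number of nodes on the longest root-to-leaf path of an AST."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)


flat = ast.parse("x = 1")
nested = ast.parse("if a:\n    if b:\n        x = f(g(1))")

print(ast_depth(flat), ast_depth(nested))
```

Nested control flow and nested calls both add levels to the tree, so the second snippet is deeper than the first even though both perform a single assignment.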
[NLP-49] Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles NEURIPS2024
【速读】: 该论文试图解决大语言模型(LLMs)在横向思维能力评估方面的挑战,特别是由于创造性思维过程的复杂性和相关数据的稀缺性。解决方案的关键在于引入了一个名为SPLAT的基准测试,该基准利用情景谜题来评估和激发LLMs的横向思维能力。SPLAT包含975个分级情景谜题,采用了一种新的多轮玩家-裁判框架,取代了传统的基于模型的评估方法。这种框架模拟了一个互动游戏,模型(玩家)通过向评估模型(裁判)提问来推断完整情景,裁判根据详细参考情景或评估玩家的预测是否与参考情景一致来回答问题。这种方法减少了对更强大评估模型的依赖,从而能够有效评估最先进的LLMs的横向思维能力。实验结果表明,强大的评估模型如WizardLM-2在中间问答和最终情景准确性方面与人类判断高度一致,达到了超过80%的一致性。此外,将该基准的数据和推理过程应用于其他横向思维相关的基准测试,如RiddleSense和BrainTeaser,也带来了性能提升。
链接: https://arxiv.org/abs/2410.06733
作者: Qi Chen,Bowen Zhang,Gang Wang,Qi Wu
关键词-EN: Large Language Models, Large Language, tasks requiring vertical, capabilities remain under-explored, assessing creative thought
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player’s predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available via the link on the arXiv abstract page.
摘要:尽管自然语言处理 (NLP) 的进步显著提升了大语言模型 (LLM) 在需要垂直思考任务上的表现,但其横向思考能力仍未得到充分探索,且由于评估创造性思维过程的复杂性和相关数据的稀缺性,这一能力难以衡量。为应对这些挑战,我们引入了 SPLAT,这是一个利用情境谜题来评估和激发 LLM 横向思考能力的基准。该基准包含 975 个分级情境谜题,涵盖三个难度级别,采用了一种新的多轮玩家-裁判框架,而非传统的基于模型的评估方法,后者通常需要更强大的评估模型。此框架模拟了一个互动游戏,其中模型 (玩家) 向评估模型 (裁判) 询问关于不完整故事的问题,以推断完整情景。裁判根据详细的参考情景回答问题,或评估玩家的预测是否与参考情景一致。这种方法减少了对更强大评估模型的依赖,使得能够评估最先进的 LLM。实验表明,如 WizardLM-2 这样的强大评估模型,在中间问答和最终情景准确性上与人类判断高度一致,达成超过 80% 的一致性,与人类之间的一致性水平相当。此外,将我们基准中的数据和推理过程应用于其他与横向思考相关的基准,例如 RiddleSense 和 BrainTeaser,可带来性能提升。这表明我们的基准有效地评估和激发了 LLM 的横向思考能力。代码链接见 arXiv 摘要页。
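The multi-turn player-judge game can be pictured as a simple loop: the player asks, the judge answers from the reference scenario, and the judge also checks any final guess. The sketch below uses hypothetical rule-based stubs in place of the two LLMs; the function names, the toy puzzle, and the stopping rule are assumptions for illustration, not the benchmark's code.

```python
def play_puzzle(player_ask, player_guess, judge_answer, judge_match, max_turns):
    """Run one puzzle; return whether it was solved and how many turns it took."""
    history = []
    for turn in range(max_turns):
        question = player_ask(history)
        history.append((question, judge_answer(question)))
        guess = player_guess(history)
        if guess is not None and judge_match(guess):
            return {"solved": True, "turns": turn + 1}
    return {"solved": False, "turns": max_turns}


# Hypothetical toy puzzle: the hidden explanation is that the weapon was an icicle.
REFERENCE = "icicle"
QUESTIONS = ["Was a gun used?", "Did the weapon melt?"]


def player_ask(history):
    return QUESTIONS[len(history) % len(QUESTIONS)]


def judge_answer(question):
    return "yes" if "melt" in question else "no"


def player_guess(history):
    # The toy player commits to a guess only after hearing a "yes".
    return "icicle" if any(answer == "yes" for _, answer in history) else None


def judge_match(guess):
    return guess == REFERENCE


print(play_puzzle(player_ask, player_guess, judge_answer, judge_match, max_turns=5))
```

The judge only compares against a reference it already holds, which is why it does not need to be stronger than the player being evaluated.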
[NLP-50] Scaling Laws for Mixed quantization in Large Language Models
【速读】: 该论文试图解决在大规模语言模型(LLMs)的训练后量化过程中,如何在保持特定精度或困惑度目标的前提下,确定需要保留的高精度数值或计算的数量。解决方案的关键在于引入了一个名为“量化比率”的指标,该指标比较了低精度算术量化的参数数量与总参数数量的比例。通过在不同模型家族、算术类型和量化粒度(如层级、矩阵乘法级)上的广泛实验,论文发现:1) 模型越大,随着量化比率的增加,性能保持得越好;2) 混合精度量化的粒度越细(如矩阵乘法级),模型能增加的量化比率越大。这些发现为未来AI硬件设计和高效AI算法的发展提供了重要见解。
链接: https://arxiv.org/abs/2410.06722
作者: Zeyu Cao,Cheng Zhang,Pedro Gimenes,Jianqiao Lu,Jianyi Cheng,Yiren Zhao
关键词-EN: Large Language Models, Large Language, Post-training quantization, Language Models, proven effective
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models. In this study, we focus on a straightforward question: When aiming for a specific accuracy or perplexity target for low-precision quantization, how many high-precision numbers or calculations are required to preserve as we scale LLMs to larger sizes? We first introduce a critical metric named the quantization ratio, which compares the number of parameters quantized to low-precision arithmetic against the total parameter count. Through extensive and carefully controlled experiments across different model families, arithmetic types, and quantization granularities (e.g. layer-wise, matmul-wise), we identify two central phenomenons. 1) The larger the models, the better they can preserve performance with an increased quantization ratio, as measured by perplexity in pre-training tasks or accuracy in downstream tasks. 2) The finer the granularity of mixed-precision quantization (e.g., matmul-wise), the more the model can increase the quantization ratio. We believe these observed phenomena offer valuable insights for future AI hardware design and the development of advanced Efficient AI algorithms.
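As a rough illustration of the quantization ratio described above, the sketch below divides the number of parameters assigned to low precision by the total parameter count. The layer names and sizes are made up for the example, and the function name `quantization_ratio` is hypothetical, not taken from the paper.

```python
def quantization_ratio(param_counts, low_precision):
    """Fraction of parameters that are quantized to low precision.

    param_counts: dict mapping a (hypothetical) layer name to its parameter count.
    low_precision: set of layer names that are quantized to low precision.
    """
    total = sum(param_counts.values())
    quantized = sum(n for name, n in param_counts.items() if name in low_precision)
    return quantized / total


# Toy model: 1000 parameters in total, 700 of them in low-precision layers.
layers = {"attn.qkv": 300, "attn.out": 100, "mlp.up": 400, "mlp.down": 200}
ratio = quantization_ratio(layers, {"attn.qkv", "mlp.up"})
print(ratio)
```

A ratio closer to 1 means fewer values are kept in high precision; the paper's question is how high this ratio can go for a given accuracy or perplexity target.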
[NLP-51] MatMamba: A Matryoshka State Space Model
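[Quick read]: This paper addresses how to train and run inference efficiently when deploying large models, without sacrificing performance. The key is MatMamba, a state space model that combines Matryoshka-style learning with Mamba2. By introducing nested dimensions into its block, MatMamba enables joint training and adaptive inference, so it can be deployed efficiently and adaptively across model sizes. Training a single large MatMamba model yields several nested smaller models for free, while matching or improving on baseline small models trained from scratch. On language and image models the approach scales comparably to Transformers while offering more efficient inference.
Link: https://arxiv.org/abs/2410.06718
Authors: Abhinav Shukla, Sai Vemprala, Aditya Kusupati, Ashish Kapoor
Keywords: long context lengths, Matryoshka Representation Learning, faster theoretical training, State Space, State Space Models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures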
Abstract: State Space Models (SSMs) like Mamba2 are a promising alternative to Transformers, with faster theoretical training and inference times – especially for long context lengths. Recent work on Matryoshka Representation Learning – and its application to Transformer backbones in works like MatFormer – showed how to introduce nested granularities of smaller submodels in one universal elastic model. In this work, we present MatMamba: a state space model which combines Matryoshka-style learning with Mamba2, by modifying the block to contain nested dimensions to enable joint training and adaptive inference. MatMamba allows for efficient and adaptive deployment across various model sizes. We train a single large MatMamba model and are able to get a number of smaller nested models for free – while maintaining or improving upon the performance of a baseline smaller model trained from scratch. We train language and image models at a variety of parameter sizes from 35M to 1.4B. Our results on ImageNet and FineWeb show that MatMamba models scale comparably to Transformers, while having more efficient inference characteristics. This makes MatMamba a practically viable option for deploying large-scale models in an elastic way based on the available inference compute. Code and models are open sourced at this https URL.
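The nested-dimension idea can be pictured with a single linear layer whose smaller variants are prefix slices of the full weight matrix. This is only a sketch of Matryoshka-style weight sharing under that assumption; the source does not specify how the MatMamba block is sliced, and `nested_linear` is a hypothetical helper.

```python
def linear(weight, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def nested_linear(weight, x, width):
    """Run the sub-layer made of the first `width` output rows.

    Every smaller layer is a prefix of the full one, so all sizes share weights.
    """
    return linear(weight[:width], x)


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.5, -0.5]]
x = [2.0, 3.0]
full = nested_linear(W, x, 4)  # the full-width layer
half = nested_linear(W, x, 2)  # a nested, half-width layer
print(full, half)
```

Because the half-width output is exactly the prefix of the full-width output, a joint training loss summed over several widths updates one shared set of weights, and a width can be picked at inference time to fit the available compute.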
[NLP-52] Guaranteed Generation from Large Language Models
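[Quick read]: This paper addresses how to strictly satisfy specified constraints when generating text with large language models (LLMs) while keeping the output distribution as close as possible to that of the original model. The key is GUARD, which combines an autoregressive proposal distribution with rejection sampling. Controlling the KL divergence between the proposal and the ideal target distribution jointly optimizes inference speed and distributional closeness, so constraints are guaranteed with almost no loss in generation quality or efficiency.
Link: https://arxiv.org/abs/2410.06716
Authors: Minbeom Kim, Thibaut Thonet, Jos Rozen, Hwaran Lee, Kyomin Jung, Marc Dymetman
Keywords: large language models, large language, satisfy specific constraints, control text generation, original model
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 11 figures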
Abstract:As large language models (LLMs) are increasingly used across various applications, there is a growing need to control text generation to satisfy specific constraints or requirements. This raises a crucial question: Is it possible to guarantee strict constraint satisfaction in generated outputs while preserving the distribution of the original model as much as possible? We first define the ideal distribution - the one closest to the original model, which also always satisfies the expressed constraint - as the ultimate goal of guaranteed generation. We then state a fundamental limitation, namely that it is impossible to reach that goal through autoregressive training alone. This motivates the necessity of combining training-time and inference-time methods to enforce such guarantees. Based on this insight, we propose GUARD, a simple yet effective approach that combines an autoregressive proposal distribution with rejection sampling. Through GUARD’s theoretical properties, we show how controlling the KL divergence between a specific proposal and the target ideal distribution simultaneously optimizes inference speed and distributional closeness. To validate these theoretical concepts, we conduct extensive experiments on two text generation settings with hard-to-satisfy constraints: a lexical constraint scenario and a sentiment reversal scenario. These experiments show that GUARD achieves perfect constraint satisfaction while almost preserving the ideal distribution with highly improved inference efficiency. GUARD provides a principled approach to enforcing strict guarantees for LLMs without compromising their generative capabilities.
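The combination of a proposal distribution with rejection sampling can be sketched as a loop that keeps drawing until the constraint holds. The proposal here is a toy sampler over fixed sentences standing in for an LLM, and `guaranteed_sample` is a hypothetical name; this is not the authors' implementation.

```python
import random


def guaranteed_sample(propose, satisfies, rng, max_tries=1000):
    """Draw from `propose` until `satisfies` accepts; return (sample, tries used)."""
    for tries in range(1, max_tries + 1):
        candidate = propose(rng)
        if satisfies(candidate):
            return candidate, tries
    raise RuntimeError("no candidate satisfied the constraint")


# Toy stand-in for an autoregressive proposal model.
SENTENCES = [
    "The film was fine.",
    "The film was amazing from start to finish.",
    "I left halfway through.",
    "An amazing cast, a thin plot.",
]


def toy_proposal(rng):
    return rng.choice(SENTENCES)


def contains_amazing(text):
    """Lexical constraint: the output must contain the word 'amazing'."""
    return "amazing" in text


text, tries = guaranteed_sample(toy_proposal, contains_amazing, random.Random(0))
print(text, tries)
```

Accepted samples follow the proposal conditioned on the constraint, and the expected number of draws is one over the acceptance probability. That is why a proposal that already sits close to the constrained target gives both faster inference and a distribution nearer the ideal one.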
[NLP-53] Calibrating Verbalized Probabilities for Large Language Models
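[Quick read]: This paper addresses the calibration of probabilities output by black-box large language models (LLMs), in particular verbalized probability distributions for classification tasks. The key is to identify the "re-softmax" problem that arises when verbalized probabilities are rescaled during calibration, and to propose the "invert softmax trick", which approximates the "logit" by inverting the verbalized probabilities so that the distribution can be calibrated properly. Extensive evaluation on public datasets shows that LLMs are robustly able to generate class distributions and that the invert softmax trick is effective at estimating logits, which in turn facilitates post-calibration adjustments.
Link: https://arxiv.org/abs/2410.06707
Authors: Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, Patrick Ernst
Keywords: Large Language Models, black-box Large Language, Language Models, Large Language, Calibrating verbalized probabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages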
Abstract:Calibrating verbalized probabilities presents a novel approach for reliably assessing and leveraging outputs from black-box Large Language Models (LLMs). Recent methods have demonstrated improved calibration by applying techniques like Platt scaling or temperature scaling to the confidence scores generated by LLMs. In this paper, we explore the calibration of verbalized probability distributions for discriminative tasks. First, we investigate the capability of LLMs to generate probability distributions over categorical labels. We theoretically and empirically identify the issue of re-softmax arising from the scaling of verbalized probabilities, and propose using the invert softmax trick to approximate the “logit” by inverting verbalized probabilities. Through extensive evaluation on three public datasets, we demonstrate: (1) the robust capability of LLMs in generating class distributions, and (2) the effectiveness of the invert softmax trick in estimating logits, which, in turn, facilitates post-calibration adjustments.
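A small sketch of why inverting the softmax matters: taking the logarithm of a probability vector recovers logits up to an additive constant, so temperature scaling can then be applied without a second softmax over raw probabilities. The constant is assumed to be zero here, and the helper names are hypothetical; the paper's exact formulation may differ.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def invert_softmax(probs, eps=1e-12):
    """Approximate logits as log-probabilities (additive constant assumed to be 0)."""
    return [math.log(max(p, eps)) for p in probs]


def temperature_scale(probs, temperature):
    """Temperature scaling applied to the recovered logits, not to the probabilities."""
    return softmax([z / temperature for z in invert_softmax(probs)])


verbalized = [0.7, 0.2, 0.1]                   # probabilities stated by the model
naive = softmax(verbalized)                    # "re-softmax": probabilities treated as logits
identity = temperature_scale(verbalized, 1.0)  # recovers the original distribution
softened = temperature_scale(verbalized, 2.0)  # flatter, but ordering is preserved
print(naive, identity, softened)
```

With temperature 1 the inverted version returns the input unchanged, whereas the naive re-softmax already flattens the distribution before any calibration parameter has been fitted.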
[NLP-54] PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs
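[Quick read]: This paper addresses the fact that existing single-query attacks underestimate PII (personally identifiable information) leakage. The key is the PII-Scope benchmark, which studies the hyperparameters of PII extraction attacks (e.g. demonstration selection) and extends to more realistic attack scenarios that use advanced adversarial strategies (repeated and diverse querying, iterative learning). With a limited query budget, these strategies raise PII extraction rates by up to fivefold against the pretrained model, and fine-tuned models prove more vulnerable to leakage than pretrained ones. The work provides a rigorous empirical benchmark for PII extraction attacks and a foundation for developing effective mitigation strategies.
Link: https://arxiv.org/abs/2410.06704
Authors: Krishna Kanth Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, Xuebing Zhou
Keywords: PII extraction, comprehensive benchmark designed, PII extraction attacks, PII, introduce PII-Scope
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: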
Abstract:In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.
[NLP-55] Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
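[Quick read]: This paper addresses generating detailed and accurate natural-language descriptions of video content. The key is video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA), trained through directed preference optimization (DPO). A multi-round DPO (mrDPO) procedure periodically updates the DPO reference model and merges and re-initializes the LoRA module, with guidance from ground-truth video captions, which substantially improves caption accuracy and completeness. Rebirth tuning then guards against forgetting non-captioning abilities during mrDPO, so that the model, despite its small parameter count, surpasses leading models on video captioning.
Link: https://arxiv.org/abs/2410.06682
Authors: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zujun Ma, Chao Zhang
Keywords: wealth of information, generating detailed, detailed and accurate, key aspect, natural language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Comments: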
Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training, we introduce a novel multi-round DPO (mrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initializing the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilize the process. To address potential catastrophic forgetting of non-captioning abilities due to mrDPO, we propose rebirth tuning, which finetunes the pre-DPO LLM by using the captions generated by the mrDPO-trained model as supervised labels. Experiments show that mrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing global and local error rates by 40% and 20%, respectively, while decreasing the repetition rate by 35%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining competitive performance to the state-of-the-art on widely used video question-answering benchmark among models of similar size. Upon acceptance, we will release the code, model checkpoints, and training and test data. Demos are available at this https URL.
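For readers unfamiliar with DPO, the sketch below computes the commonly used DPO loss for one preference pair from sequence log-probabilities under the policy and a frozen reference model. It is the standard textbook form, assumed here for illustration; the abstract does not spell out the paper's exact objective, and `beta=0.1` is an arbitrary choice.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# A policy that has moved toward the chosen caption gets a lower loss.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(at_reference, improved)
```

In a multi-round scheme like the one described, the reference log-probabilities would be recomputed from the updated reference model at the start of each round, so the margin is always measured against the most recent reference.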
[NLP-56] Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
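[Quick read]: This paper examines the Universality hypothesis in interpretability: whether different neural networks converge to similar algorithms on similar tasks. The key is to use sparse autoencoders (SAEs) to isolate and compare interpretable features in two mainstream language model architectures, Transformers and Mamba. Most features are similar across the two, and feature similarity correlates with Universality. A circuit-level analysis of Mamba finds that its induction circuits are structurally analogous to those in Transformers, with one nuanced difference termed the "Off-by-One motif": the information of a token is written into the SSM state at its next position, whereas token interactions in Transformers show no such trend.
Link: https://arxiv.org/abs/2410.06672
Authors: Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, Xipeng Qiu
Keywords: implement similar algorithms, interpretability suggests, neural networks, networks may converge, converge to implement
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 13 figures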
Abstract: The hypothesis of Universality in interpretability suggests that different neural networks may converge to implement similar algorithms on similar tasks. In this work, we investigate two mainstream architectures for language modeling, namely Transformers and Mambas, to explore the extent of their mechanistic similarity. We propose to use Sparse Autoencoders (SAEs) to isolate interpretable features from these models and show that most features are similar in these two models. We also validate the correlation between feature similarity and Universality. We then delve into the circuit-level analysis of Mamba models and find that the induction circuits in Mamba are structurally analogous to those in Transformers. We also identify a nuanced difference that we call the Off-by-One motif: the information of one token is written into the SSM state at its next position, whereas the interaction between tokens in Transformers does not exhibit such a trend.
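One simple way to compare SAE features across two models is to correlate their activations over the same tokens and keep the best match for each feature. The source does not state which similarity measure the authors use, so Pearson correlation is an assumption here, and the activation vectors are invented for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length activation vectors (non-constant inputs)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_match(feature, other_features):
    """Highest correlation between one feature and any feature of the other model."""
    return max(pearson(feature, other) for other in other_features)


# Activations of one feature in model A and two features in model B on five tokens.
feature_a = [0.0, 1.0, 0.0, 2.0, 0.0]
features_b = [[0.0, 2.0, 0.0, 4.0, 0.0], [1.0, 0.0, 1.0, 0.0, 1.0]]
score = best_match(feature_a, features_b)
print(score)
```

A feature whose best match is close to 1 fires on the same tokens in both architectures, up to scale, which is the kind of evidence a cross-architecture similarity claim would rest on.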
[NLP-57] Large Language Models as Code Executors: An Exploratory Study
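[Quick read]: This paper explores extending large language models (LLMs) to code execution, i.e. feeding code snippets directly to LLMs and having them return the output. It is the first systematic evaluation of the feasibility and accuracy of several LLMs (OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder) on this task. It also introduces Iterative Instruction Prompting (IIP), which processes code snippets line by line and raises the accuracy of weaker models by 7.22% on average (up to 18.96%); compared with chain-of-thought (CoT) prompting, the absolute average improvement is 3.86% (up to 19.46%).
Link: https://arxiv.org/abs/2410.06667
Authors: Chenyang Lyu, Lecheng Yan, Rui Xing, Wenxi Li, Younes Samih, Tianbo Ji, Longyue Wang
Keywords: Large Language Models, natural language processing, Large Language, natural language, language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: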
Abstract:The capabilities of Large Language Models (LLMs) have significantly evolved, extending from natural language processing to complex tasks like code understanding and generation. We expand the scope of LLMs’ capabilities to a broader context, using LLMs to execute code snippets to obtain the output. This paper pioneers the exploration of LLMs as code executors, where code snippets are directly fed to the models for execution, and outputs are returned. We are the first to comprehensively examine this feasibility across various LLMs, including OpenAI’s o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder. Notably, the o1 model achieved over 90% accuracy in code execution, while others demonstrated lower accuracy levels. Furthermore, we introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22% (with the highest improvement of 18.96%) and an absolute average improvement of 3.86% against CoT prompting (with the highest improvement of 19.46%). Our study not only highlights the transformative potential of LLMs in coding but also lays the groundwork for future advancements in automated programming and the completion of complex tasks.
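The line-by-line idea behind IIP can be sketched as a loop that sends one line at a time together with the state reported so far. The prompt wording and the `ask_model` callable are hypothetical; to keep the example runnable, a stub that really executes each line stands in for the LLM, and it only handles single-line statements.

```python
def run_line_by_line(code, ask_model):
    """Feed `code` to `ask_model` one line at a time, carrying the reported state forward."""
    state = "{}"
    for line in code.splitlines():
        if not line.strip():
            continue
        prompt = (
            f"State so far: {state}\n"
            "Execute this line and return the updated state:\n"
            f"{line}"
        )
        state = ask_model(prompt)
    return state


def make_stub_model():
    """Stand-in for an LLM: executes the last prompt line and reports the variables."""
    env = {}

    def ask(prompt):
        exec(prompt.splitlines()[-1], {}, env)
        return repr(env)

    return ask


snippet = "x = 2\ny = x * 3\n"
final_state = run_line_by_line(snippet, make_stub_model())
print(final_state)
```

With a real model, `ask_model` would wrap an API call, and each step would ask for a much smaller reasoning jump than predicting the output of the whole snippet at once.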
[NLP-58] Subtle Errors Matter: Preference Learning via Error-injected Self-editing
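[Quick read]: This paper addresses the subtle errors, such as miscalculations or incorrect substitutions, that large language models (LLMs) frequently make in mathematical reasoning and that limit their mathematical potential. The key is a new preference learning framework, eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into a few tokens of correct solutions to construct hard pairs for error mitigation. Specifically, RISE uses the model itself to edit a small number of tokens in a solution to inject the designed errors, then combines pairs of self-edited and corresponding correct solutions with sampled correct/incorrect pairs for subtle-error-aware DPO training. Compared with other preference learning methods, RISE sharpens the training objective to focus on the predefined errors and their tokens, without fine-grained sampling or preference annotation. Preference learning on Qwen2-7B-Instruct yields improvements of 3.0% on GSM8K and 7.9% on MATH.
Link: https://arxiv.org/abs/2410.06638
Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Keywords: Large Language Models, Large Language, tackling tasks ranging, advanced competition-level problems, exhibited strong mathematical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: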
Abstract:Large Language Models (LLMs) have exhibited strong mathematical reasoning and computational prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle errors, such as miscalculations or incorrect substitutions, limit the models’ full mathematical potential. Existing studies to improve mathematical ability typically involve distilling reasoning skills from stronger LLMs or applying preference learning to step-wise response pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook the frequently occurring subtle errors. A major reason is that sampled preference pairs involve differences unrelated to the errors, which may distract the model from focusing on subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. In detail, RISE uses the model itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective to focus on predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
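To make the notion of a hard pair concrete, the sketch below builds a rejected solution that differs from the correct one in a single number. RISE has the model itself perform the edit; the regular-expression rule used here is a deliberately simple substitute, and `inject_numeric_error` is a hypothetical helper.

```python
import re


def inject_numeric_error(solution, offset=1):
    """Return a copy of `solution` whose last integer is shifted by `offset`.

    Returns None when the solution contains no integer to perturb.
    """
    matches = list(re.finditer(r"\d+", solution))
    if not matches:
        return None
    last = matches[-1]
    wrong = str(int(last.group()) + offset)
    return solution[: last.start()] + wrong + solution[last.end():]


chosen = "3 + 4 = 7, so the answer is 7"
rejected = inject_numeric_error(chosen)
pair = {"chosen": chosen, "rejected": rejected}
print(pair)
```

Because the two solutions are identical except for the injected token, a preference loss on this pair has nothing else to latch onto, which is the motivation given for focusing training on subtle errors.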
[NLP-59] Tree of Problems: Improving structured problem solving with compositionality
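[Quick read]: This paper addresses the limitations of large language models (LLMs) on complex reasoning tasks that require step-by-step thinking. The key is Tree of Problems (ToP), a simpler version of Tree of Thoughts (ToT) that aims to improve performance by decomposing a complex problem into identical subtasks. Experiments show that ToP outperforms ToT and Graph of Thoughts (GoT) on complex reasoning tasks and also performs better than Chain-of-Thought (CoT) prompting.
Link: https://arxiv.org/abs/2410.06634
Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable performance, in-context learning
Categories: Computation and Language (cs.CL)
Comments: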
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Graph of Thoughts (GoT) emerged as alternatives, dividing the complex problem into paths of subproblems. In this paper, we propose Tree of Problems (ToP), a simpler version of ToT, which we hypothesise can work better for complex tasks that can be divided into identical subtasks. Our empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks. All code for this paper is publicly available here: this https URL.
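Splitting a task into identical subtasks and merging the partial answers can be sketched as a short recursion. The leaf solver and the merge step below are plain functions standing in for LLM calls, and last-letter concatenation is used only as a toy task; none of this is taken from the authors' code.

```python
def tree_of_problems(items, solve_leaf, merge, leaf_size=2):
    """Recursively split `items`, solve small leaves, and merge results bottom-up."""
    if len(items) <= leaf_size:
        return solve_leaf(items)
    mid = len(items) // 2
    left = tree_of_problems(items[:mid], solve_leaf, merge, leaf_size)
    right = tree_of_problems(items[mid:], solve_leaf, merge, leaf_size)
    return merge(left, right)


def last_letters(words):
    """Leaf subtask: concatenate the last letter of each word."""
    return "".join(word[-1] for word in words)


def concatenate(left, right):
    """Merge step: join the answers of two sibling subproblems."""
    return left + right


words = ["apple", "banana", "cherry", "date", "elder"]
answer = tree_of_problems(words, last_letters, concatenate)
print(answer)
```

Each leaf is an instance of the same easy problem, so a model that is reliable on short inputs can be composed into a solver for long ones.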
[NLP-60] ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
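[Quick read]: This paper addresses safety challenges for vision language models (VLMs) in multimodal applications, in particular the threat posed by adversarial visual inputs. The key is a two-phase inference-time alignment framework, Evaluating Then Aligning (ETA). First, ETA evaluates input visual content and output responses to establish robust safety awareness. Second, it aligns unsafe behaviors at both shallow and deep levels by conditioning the VLM's generative distribution with an interference prefix and running a sentence-level best-of-N search for the most harmless and helpful generation path. Experiments show that ETA outperforms baselines in harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation.
Link: https://arxiv.org/abs/2410.06625
Authors: Yi Ding, Bolian Li, Ruqi Zhang
Keywords: Vision Language Models, Vision Language, Language Models, significant safety challenges, safety challenges limit
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 27 pages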
Abstract:Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs’ generative distribution with an interference prefix and performing sentence-level best-of-N to search the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at this https URL.
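Sentence-level best-of-N can be sketched as scoring each candidate sentence for safety and helpfulness and keeping the highest-scoring one. The two scoring functions below are crude keyword and length heuristics invented for the example; in ETA they would come from learned evaluators, whose details are not given in the source.

```python
def best_of_n(candidates, safety, helpfulness, weight=1.0):
    """Pick the candidate with the highest combined helpfulness and safety score."""
    return max(candidates, key=lambda c: helpfulness(c) + weight * safety(c))


def toy_safety(sentence):
    """Hypothetical safety score: 0 for a sentence that complies with the unsafe request."""
    return 0.0 if "here is how" in sentence else 1.0


def toy_helpfulness(sentence):
    """Hypothetical helpfulness score: longer answers score higher, capped at 1."""
    return min(len(sentence), 40) / 40


candidates = [
    "Sure, here is how to pick the lock.",
    "I can't help with that, but a locksmith can.",
    "No.",
]
best = best_of_n(candidates, toy_safety, toy_helpfulness)
print(best)
```

Repeating this choice for every sentence, with the text chosen so far as context, gives a search over generation paths rather than a single accept-or-refuse decision on the whole response.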
[NLP-61] Learning Evolving Tools for Large Language Models
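[Quick read]: This paper addresses the problem that, because external environments change, the tools and APIs that large language models (LLMs) interact with can become outdated. The key is ToolEVO, a framework that uses Monte Carlo Tree Search to strengthen the adaptive and reflective capabilities of LLMs in dynamic environments. ToolEVO lets LLMs actively explore and interact with dynamic environments and autonomously reflect on and update their tool usage based on environmental feedback, improving adaptability and stability under tool variability.
Link: https://arxiv.org/abs/2410.06617
Authors: Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, Yasheng Wang
Keywords: large language models, enables large language, learning enables large, language models, greatly expanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Ongoing Work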
Abstract:Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability. Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning.
[NLP-62] β-calibration of Language Model Confidence Scores for Generative QA
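[Quick read]: This paper addresses how generative question-answering (QA) systems can provide accurate confidence scores for decision-making and critical applications. Existing calibration methods mainly ensure that confidence scores reflect answer correctness on average, but this average-case notion is hard to interpret for decision-making in generative QA. The paper therefore proposes a generalized notion, β-calibration, which ensures that calibration holds across different question-and-answer groups. The key is a set of discretized post-hoc calibration schemes that achieve β-calibration, improving the reliability and interpretability of confidence scores.
Link: https://arxiv.org/abs/2410.06615
Authors: Putra Manggala, Atalanti Mastakouri, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Aaditya Ramdas
Keywords: provide well-calibrated confidence, well-calibrated confidence scores, critical application, provide well-calibrated, reflect the correctness
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: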
Abstract: To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is on average indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce β-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving β-calibration.
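A discretized post-hoc scheme that calibrates per group can be sketched as histogram binning with one table per group: each (group, confidence bin) cell stores the observed accuracy of the calibration examples that fall in it. The grouping, the number of bins, and the function names are assumptions for illustration; the abstract does not specify the authors' construction.

```python
from collections import defaultdict


def bin_index(confidence, n_bins):
    return min(int(confidence * n_bins), n_bins - 1)


def fit_group_binning(records, n_bins=2):
    """records: iterable of (group, confidence, correct) with correct in {0, 1}."""
    cells = defaultdict(lambda: [0, 0])  # (group, bin) -> [number correct, count]
    for group, confidence, correct in records:
        cell = cells[(group, bin_index(confidence, n_bins))]
        cell[0] += correct
        cell[1] += 1
    return {key: hits / count for key, (hits, count) in cells.items()}


def calibrate(table, group, confidence, n_bins=2):
    """Look up the cell accuracy; fall back to the raw score for unseen cells."""
    return table.get((group, bin_index(confidence, n_bins)), confidence)


records = [
    ("geography", 0.9, 1), ("geography", 0.8, 1),
    ("medicine", 0.9, 0), ("medicine", 0.8, 1),
]
table = fit_group_binning(records)
geo = calibrate(table, "geography", 0.85)
med = calibrate(table, "medicine", 0.85)
print(geo, med)
```

The same raw confidence maps to different calibrated values in the two groups, which is the behaviour that average-case calibration cannot express.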
[NLP-63] Dissecting Fine-Tuning Unlearning in Large Language Models EMNLP2024
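[Quick read]: This paper examines how well fine-tuning-based unlearning methods actually prevent harmful, sensitive, or copyrighted information embedded in large language models from surfacing. Activation patching and parameter restoration experiments reveal that these methods alter the model's knowledge retrieval process rather than genuinely erasing the problematic knowledge embedded in the parameters. Behavioral tests further show that unlearning inevitably affects the model's global behavior and can damage unrelated knowledge or capabilities. The paper therefore advocates developing more robust unlearning techniques that truly erase knowledge.
Link: https://arxiv.org/abs/2410.06606
Authors: Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, Haiqin Yang
Keywords: preventing targeted harmful, large language models, targeted harmful, unlearning methods prevail, prevail for preventing
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in EMNLP 2024 Main (Short paper)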
Abstract:Fine-tuning-based unlearning methods prevail for preventing targeted harmful, sensitive, or copyrighted information within large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this paper, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model’s knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters. Furthermore, behavioral tests demonstrate that the unlearning mechanisms inevitably impact the global behavior of the models, affecting unrelated knowledge or capabilities. Our work advocates the development of more resilient unlearning techniques for truly erasing knowledge. Our code is released at this https URL.
[NLP-64] Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
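[Quick read]: This paper addresses the high computational cost of softmax attention in Transformer-based large language models (LLMs), specifically the O(T) per-token generation complexity, where T is the context length. The key is Rodimus and its enhanced version Rodimus+. Rodimus uses a linear-attention, purely recurrent design with a data-dependent tempered selection (DDTS) mechanism, improving compute efficiency and reducing memory use while maintaining performance. Rodimus+ further combines Rodimus with Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach that integrates semantic, token, and head compression, improving downstream performance while reducing complexity.
Link: https://arxiv.org/abs/2410.06577
Authors: Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin
Keywords: Transformer-based large language, natural language processing, Recent advancements, advancements in Transformer-based, Transformer-based large
Categories: Computation and Language (cs.CL)
Comments: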
Abstract: Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to an O(T) complexity for per-token generation, where T represents the context length. This work explores reducing LLMs’ complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus+. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus+ combines Rodimus with the innovative Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach, effectively leveraging the complementary semantic, token, and head compression techniques. Our experiments demonstrate that Rodimus+-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. Model code and pre-trained checkpoints will be available soon.
摘要:基于 Transformer 的大语言模型 (Large Language Model, LLM) 的最新进展在自然语言处理领域树立了新的标准。然而,经典的 softmax 注意力机制带来了显著的计算成本,导致每个 Token 生成的复杂度为 O(T),其中 T 代表上下文长度。本文探讨了在保持性能的同时降低 LLM 复杂度的方法,引入了 Rodimus 及其增强版本 Rodimus +。Rodimus 采用了一种创新的数据依赖型温度选择 (Data-Dependent Tempered Selection, DDTS) 机制,在一个基于线性注意力的纯循环框架内,实现了显著的准确性,同时大幅减少了与循环模型相关的内存使用。这种方法通过保持固定大小的隐藏状态来保留关键的输入信息,展示了语义压缩的特性。在此基础上,Rodimus + 结合了 Rodimus 与创新的滑动窗口共享键注意力 (Sliding Window Shared-Key Attention, SW-SKA) 的混合方法,有效利用了互补的语义、Token 和头压缩技术。我们的实验表明,在训练了 1 万亿 Token 的 Rodimus + -1.6B 模型在下游任务中表现优于训练了更多 Token 的模型,包括 Qwen2-1.5B 和 RWKV6-1.6B,突显了其在重新定义 LLM 中准确性与效率平衡方面的潜力。模型代码和预训练的检查点将很快提供。
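To illustrate why a purely recurrent, linear-attention formulation keeps per-token cost independent of the context length T, here is a minimal sketch of a generic gated linear-attention recurrence. It is not Rodimus itself: the scalar `gate` is supplied by the caller as a stand-in, whereas the paper's DDTS mechanism computes data-dependent gating that is not reproduced here. All function names are illustrative assumptions.

```python
def linear_attention_step(state, q, k, v, gate):
    """One recurrent step: S <- gate * S + k v^T, then o = q^T S."""
    new_state = [
        [gate * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]
    output = [
        sum(q[i] * new_state[i][j] for i in range(len(k)))
        for j in range(len(v))
    ]
    return new_state, output


def run_recurrence(queries, keys, values, gates):
    """Process a sequence token by token with a fixed-size (d_k x d_v) state."""
    state = [[0.0] * len(values[0]) for _ in range(len(keys[0]))]
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        state, out = linear_attention_step(state, q, k, v, g)
        outputs.append(out)
    return state, outputs


# Two tokens with gate = 1.0 (no decay): the state is a running sum of k v^T.
final_state, outputs = run_recurrence(
    queries=[[1.0, 0.0], [1.0, 1.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
    gates=[1.0, 1.0],
)
print(outputs)  # [[1.0, 2.0], [4.0, 6.0]]
```

The state stays 2 x 2 no matter how many tokens are processed, which is the "fixed-size hidden state" property the abstract refers to; softmax attention would instead revisit all previous keys and values at every step.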
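[NLP-65] Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare
[Quick Read]: This paper aims to address the bias and misdiagnosis that AI can introduce into medical decision-making, in order to improve patient safety. The key to the solution is two new datasets: BiasMD, for evaluating and mitigating bias in health-related LLM outputs, and DiseaseMatcher, for assessing symptom-based diagnostic accuracy. Using these datasets, the authors developed EthiClinician, a fine-tuned model that outperforms GPT-4 in both ethical reasoning and clinical judgment, setting a new benchmark for safer, more reliable healthcare AI.
Link: https://arxiv.org/abs/2410.06566
Authors: Pardis Sadat Zahraei, Zahra Shakeri
Keywords-EN: Biased AI-generated medical, AI-generated medical advice, Biased AI-generated, jeopardize patient safety, Large Language Models
Categories: Computation and Language (cs.CL)
Comments: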
Abstract:Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety, making the integrity of AI in healthcare more critical than ever. As Large Language Models (LLMs) take on a growing role in medical decision-making, addressing their biases and enhancing their accuracy is key to delivering safe, reliable care. This study addresses these challenges head-on by introducing new resources designed to promote ethical and precise AI in healthcare. We present two datasets: BiasMD, featuring 6,007 question-answer pairs crafted to evaluate and mitigate biases in health-related LLM outputs, and DiseaseMatcher, with 32,000 clinical question-answer pairs spanning 700 diseases, aimed at assessing symptom-based diagnostic accuracy. Using these datasets, we developed the EthiClinician, a fine-tuned model built on the ChatDoctor framework, which outperforms GPT-4 in both ethical reasoning and clinical judgment. By exposing and correcting hidden biases in existing models for healthcare, our work sets a new benchmark for safer, more reliable patient outcomes.
摘要:偏见的 AI 生成的医疗建议和误诊可能危及患者安全,使得 AI 在医疗领域的完整性变得比以往任何时候都更加关键。随着大语言模型 (LLM) 在医疗决策中扮演越来越重要的角色,解决其偏见并提高其准确性是提供安全、可靠护理的关键。本研究直接面对这些挑战,引入了旨在促进医疗领域伦理和精确 AI 的新资源。我们提出了两个数据集:BiasMD,包含 6,007 个问题-答案对,旨在评估和缓解与健康相关的大语言模型输出中的偏见;以及 DiseaseMatcher,包含 32,000 个临床问题-答案对,涵盖 700 种疾病,旨在评估基于症状的诊断准确性。利用这些数据集,我们开发了 EthiClinician,这是一个在 ChatDoctor 框架上微调的模型,其在伦理推理和临床判断方面均优于 GPT-4。通过揭示和纠正现有医疗模型中的隐藏偏见,我们的工作为实现更安全、更可靠的患者结果设定了新的基准。
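[NLP-66] ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
[Quick Read]: This paper addresses the lack of focused evaluation, in existing multimodal benchmarks, of multi-step planning based on spatial relationships in images. The key to the solution is ING-VP, an interactive game-based vision planning benchmark designed to evaluate the spatial imagination and multi-step reasoning abilities of multimodal large language models (MLLMs). ING-VP comprises 6 distinct games with 300 levels, each with 6 unique configurations, and a single model engages in over 60,000 rounds of interaction. It supports multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, enabling in-depth analysis of model capabilities.
Link: https://arxiv.org/abs/2410.06555
Authors: Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang
Keywords-EN: demonstrate increasingly competitive, increasingly competitive performance, multimodal large language, large language models, continue to demonstrate
Categories: Computation and Language (cs.CL)
Comments: 49 pages, 12 figures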
Abstract:As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model’s capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs’ capacity for complex spatial reasoning and planning. The code is publicly available at this https URL.
摘要:随着多模态大语言模型 (MLLMs) 在广泛任务中展现出日益竞争力的表现,评估这些尖端模型的复杂且全面的基准测试也随之发展。这些基准测试为感知、推理和规划等核心能力引入了新的挑战。然而,现有的多模态基准测试在基于图像空间关系的多步规划评估方面存在不足。为了填补这一空白,我们提出了 ING-VP,即首个基于互动游戏的视觉规划基准测试,专门用于评估 MLLMs 的空间想象力和多步推理能力。ING-VP 包含 6 种不同的游戏,涵盖 300 个关卡,每个关卡有 6 种独特的配置。单个模型参与超过 60,000 轮互动。该基准框架允许进行多种比较设置,包括图像-文本与纯文本输入、单步与多步推理、以及有历史记录与无历史记录条件,从而为模型的能力提供有价值的见解。我们评估了众多最先进的 MLLMs,其中表现最佳的模型 Claude-3.5 Sonnet 的平均准确率仅为 3.37%,远低于预期标准。本工作旨在提供一个专门的评估框架,推动 MLLMs 在复杂空间推理和规划能力方面的进步。代码已公开,可访问此 https URL。
[NLP-67] The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models EMNLP2024
[Quick Read]: This paper asks whether stronger reward models always lead to better language models. Its key finding, obtained experimentally, is that language models trained with moderately accurate reward models outperform those trained with highly accurate ones. This challenges the conventional belief that stronger reward models necessarily yield better language models, and opens new directions for research on how to choose the most suitable reward model.
Link: https://arxiv.org/abs/2410.06554
Authors: Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen
Keywords-EN: Human Feedback significantly, Natural Language Processing, significantly enhances Natural, Feedback significantly enhances, Reinforcement Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 27 figures (including 18 in the appendix), submitted to EMNLP 2024
Abstract:Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether stronger reward models invariably lead to better language models. In this paper, through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional details are available at [this https URL](this https URL).
摘要:通过将语言模型与人类期望对齐,从人类反馈中进行强化学习显著提升了自然语言处理的效果。在这一对齐过程中,奖励模型的强度是一个关键因素。本研究探讨了是否更强的奖励模型必然会带来更好的语言模型。本文通过在 QA-FEEDBACK 数据集和基于 Longformer 的奖励模型上进行的关联性、事实性和完整性任务实验,揭示了一个令人惊讶的悖论:使用中等准确度奖励模型训练的语言模型,其表现优于那些由高度准确奖励模型引导的模型。这一发现挑战了广泛持有的观点,即更强的奖励模型总是会带来更好的语言模型,并为未来研究驱动模型性能的关键因素以及如何选择最合适的奖励模型开辟了新的途径。代码和更多细节可在 [this https URL](this https URL) 获取。
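[NLP-68] Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis
[Quick Read]: This paper addresses how to balance the trade-off, when producing training data, between higher-quality but expensive human-annotated data and lower-quality but much cheaper LLM-generated data. The key result, validated experimentally across budget levels, is that optimal cost-efficiency is achieved by combining human-annotated and LLM-generated data. In particular, as the budget decreases, a higher proportion of LLM-generated data becomes preferable.
Link: https://arxiv.org/abs/2410.06550
Authors: Shiho Matta, Yin Jou Huang, Fei Cheng, Hirokazu Kiyomaru, Yugo Murawaki
Keywords-EN: Recent studies, low cost, LLM-generated data, studies have demonstrated, demonstrated that few-shot
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages including 4 pages of references and appendix. 7 figures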
Abstract:Recent studies have demonstrated that few-shot learning allows LLMs to generate training data for supervised models at a low cost. However, the quality of LLM-generated data may not entirely match that of human-labeled data. This raises a crucial question: how should one balance the trade-off between the higher quality but more expensive human data and the lower quality yet substantially cheaper LLM-generated data? In this paper, we synthesized training data for conversational semantic frame analysis using GPT-4 and examined how to allocate budgets optimally to achieve the best performance. Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining both human and LLM-generated data across a wide range of budget levels. Notably, as the budget decreases, a higher proportion of LLM-generated data becomes more preferable.
摘要:最近的研究表明,少样本学习使得大语言模型 (LLM) 能够以低成本生成监督模型的训练数据。然而,LLM 生成的数据质量可能无法完全匹配人类标注的数据。这引发了一个关键问题:如何在高质量但昂贵的人类数据与低质量但成本显著较低的 LLM 生成数据之间取得平衡?本文中,我们使用 GPT-4 合成了对话语义框架分析的训练数据,并探讨了如何在不同预算水平下最优地分配资源以达到最佳性能。我们的实验在多个预算水平上进行,结果显示,通过在广泛的预算范围内结合人类和 LLM 生成的数据,可以实现最佳的成本效益。值得注意的是,随着预算的减少,更高比例的 LLM 生成数据变得更为可取。
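[NLP-69] TuringQ: Benchmarking AI Comprehension in Theory of Computation EMNLP
[Quick Read]: This paper addresses the problem of evaluating the reasoning capabilities of large language models (LLMs) in the theory of computation. The key to the solution is the TuringQ benchmark, a dataset of 4,006 undergraduate and graduate-level question-answer pairs covering seven core theoretical areas and categorized into four difficulty levels. Using Chain of Thought prompting with expert human assessment, together with an automated LLM-based evaluation system, the paper evaluates LLMs on complex computational reasoning tasks. In addition, fine-tuning a Llama3-8B model on TuringQ yields measurable improvements in reasoning ability and on out-of-domain tasks such as algebra.
Link: https://arxiv.org/abs/2410.06547
Authors: Pardis Sadat Zahraei, Ehsaneddin Asgari
Keywords-EN: large language models, theory of computation, large language, Chain of Thought, language models
Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: Accepted to EMNLP Findings 2024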
Abstract:We present TuringQ, the first benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) in the theory of computation. TuringQ consists of 4,006 undergraduate and graduate-level question-answer pairs, categorized into four difficulty levels and covering seven core theoretical areas. We evaluate several open-source LLMs, as well as GPT-4, using Chain of Thought prompting and expert human assessment. Additionally, we propose an automated LLM-based evaluation system that demonstrates competitive accuracy when compared to human evaluation. Fine-tuning a Llama3-8B model on TuringQ shows measurable improvements in reasoning ability and out-of-domain tasks such as algebra. TuringQ serves as both a benchmark and a resource for enhancing LLM performance in complex computational reasoning tasks. Our analysis offers insights into LLM capabilities and advances in AI comprehension of theoretical computer science.
摘要:我们提出了 TuringQ,这是首个用于评估大语言模型 (LLM) 在计算理论中推理能力的基准。TuringQ 包含 4,006 个本科和研究生级别的问题-答案对,分为四个难度等级,涵盖七个核心理论领域。我们使用思维链提示和专家人工评估,对多个开源 LLM 以及 GPT-4 进行了评估。此外,我们还提出了一种基于 LLM 的自动化评估系统,该系统在比较人类评估时表现出竞争性的准确性。在 TuringQ 上微调 Llama3-8B 模型显示出在推理能力和代数等领域外任务中的可测量改进。TuringQ 既是一个基准,也是一个用于提升 LLM 在复杂计算推理任务中性能的资源。我们的分析提供了对 LLM 能力和 AI 对理论计算机科学理解的深入见解。
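[NLP-70] Chip-Tuning: Classify Before Language Models Say
[Quick Read]: This paper addresses the growth in model size, and the resulting increase in training and inference cost, that accompanies performance gains in large language models (LLMs). The key to the solution is using the probing technique to identify and remove redundant layers: tiny probing classifiers (chips) are attached to different layers of the LLM and trained with the backbone frozen; once a chip is selected for classification, all layers after the layer it is attached to are removed, giving an efficient structured pruning method. The approach outperforms previous baselines in both accuracy and pruning ratio, achieves a pruning ratio of up to 50%, and is compatible with multimodal models and with model fine-tuning.
Link: https://arxiv.org/abs/2410.06541
Authors: Fangwei Zhu, Dian Li, Jiajun Huang, Gang Liu, Hui Wang, Zhifang Sui
Keywords-EN: training and inference, rapid development, increasing cost, large language models, LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: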
Abstract:The rapid development in the performance of large language models (LLMs) is accompanied by the escalation of model size, leading to the increasing cost of model training and inference. Previous research has discovered that certain layers in LLMs exhibit redundancy, and removing these layers brings only marginal loss in model performance. In this paper, we adopt the probing technique to explain the layer redundancy in LLMs and demonstrate that language models can be effectively pruned with probing classifiers. We propose chip-tuning, a simple and effective structured pruning framework specialized for classification problems. Chip-tuning attaches tiny probing classifiers named chips to different layers of LLMs, and trains chips with the backbone model frozen. After selecting a chip for classification, all layers subsequent to the attached layer could be removed with marginal performance loss. Experimental results on various LLMs and datasets demonstrate that chip-tuning significantly outperforms previous state-of-the-art baselines in both accuracy and pruning ratio, achieving a pruning ratio of up to 50%. We also find that chip-tuning could be applied on multimodal models, and could be combined with model finetuning, proving its excellent compatibility.
摘要:大语言模型 (LLM) 性能的快速发展伴随着模型规模的扩大,导致模型训练和推理成本的增加。先前的研究表明,LLM 中的某些层表现出冗余性,移除这些层仅会导致模型性能的微小损失。本文采用探针技术来解释 LLM 中的层冗余现象,并证明通过探针分类器可以有效地对语言模型进行剪枝。我们提出了芯片调优 (chip-tuning),这是一种针对分类问题的简单且有效的结构化剪枝框架。芯片调优将名为芯片的小型探针分类器附加到 LLM 的不同层上,并在骨干模型冻结的情况下训练这些芯片。在选择用于分类的芯片后,附加层之后的所有层都可以被移除,而仅带来微小的性能损失。在各种 LLM 和数据集上的实验结果表明,芯片调优在准确性和剪枝比例方面显著优于先前的最先进基线,实现了高达 50% 的剪枝比例。我们还发现,芯片调优可以应用于多模态模型,并且可以与模型微调结合使用,证明了其出色的兼容性。
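The selection-and-prune step described above can be sketched as follows. This is only an illustration under stated assumptions: the per-layer accuracies are made-up numbers, and the rule "pick the earliest layer within a tolerance of the best chip" is a hypothetical selection criterion, since the abstract does not say how the chip is chosen. Training the chips themselves is not shown.

```python
def select_chip_layer(probe_accuracy, tolerance=0.01):
    """Return the earliest 1-indexed layer whose chip is within `tolerance` of the best."""
    best = max(probe_accuracy)
    for layer, accuracy in enumerate(probe_accuracy, start=1):
        if accuracy >= best - tolerance:
            return layer


def pruning_ratio(selected_layer, num_layers):
    """Fraction of layers removed when everything after `selected_layer` is dropped."""
    return (num_layers - selected_layer) / num_layers


# Hypothetical validation accuracies of chips attached to an 8-layer model.
accuracies = [0.52, 0.61, 0.70, 0.78, 0.80, 0.805, 0.80, 0.79]
layer = select_chip_layer(accuracies)
print(layer, pruning_ratio(layer, len(accuracies)))  # 5 0.375
```

With a zero tolerance the sketch falls back to the single best chip (layer 6 here), trading a smaller pruning ratio for the highest probe accuracy.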
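[NLP-71] Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA EMNLP2024
[Quick Read]: This paper addresses how to quantitatively assess and compare the problem-solving abilities of humans and AI systems in question answering. The key to the solution is CAIMIRA, a framework rooted in item response theory (IRT) that enables quantitative analysis of proficiency across knowledge domains and reasoning skills. By analyzing over 300,000 responses from about 70 AI systems and 155 humans, CAIMIRA reveals distinct proficiency patterns: humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning. The paper argues that future QA tasks should focus on questions that challenge higher-order reasoning and scientific thinking and that demand nuanced linguistic interpretation and cross-contextual knowledge application, to advance AI that better emulates or complements human cognitive abilities in real-world problem-solving.
Link: https://arxiv.org/abs/2410.06524
Authors: Maharshi Gor, Hal Daumé III, Tianyi Zhou, Jordan Boyd-Graber
Keywords-EN: large language models, natural language processing, Recent advancements, language models, language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at EMNLP 2024 (Main)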
Abstract:Recent advancements of large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that challenge not only higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.
摘要:近期大语言模型 (Large Language Models, LLMs) 的进展引发了关于 AI 在自然语言处理 (Natural Language Processing, NLP) 任务中超越人类的言论,这些任务包括文本理解和推理。本研究通过引入 CAIMIRA,一种基于项目反应理论 (Item Response Theory, IRT) 的新框架,来探讨这些言论,该框架能够对问答 (Question-Answering, QA) 智能体(包括人类和 AI 系统)的问题解决能力进行定量评估和比较。通过对来自约 70 个 AI 系统和 155 名人类在数千个测验问题上的超过 30 万条回答进行分析,CAIMIRA 揭示了在知识领域和推理技能方面的不同熟练度模式。人类在基于知识的溯因推理和概念推理方面优于 AI 系统,而像 GPT-4 和 LLaMA 这样的最先进大语言模型在目标信息检索和基于事实的推理方面表现出色,尤其是在信息差距明确且可通过模式匹配或数据检索解决的情况下。这些发现强调了未来 QA 任务需要关注那些不仅挑战高阶推理和科学思维,还要求细致的语言解释和跨情境知识应用的问题,以推动 AI 发展,使其更好地模拟或补充人类在现实世界问题解决中的认知能力。
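For readers unfamiliar with item response theory, the sketch below shows the standard two-parameter logistic (2PL) model, in which the probability of a correct answer depends on the gap between an agent's skill and an item's difficulty. This is textbook IRT, not CAIMIRA's own parameterization, which the abstract does not specify; the function and parameter names are illustrative.

```python
import math


def p_correct(skill, difficulty, discrimination=1.0):
    """2PL IRT: probability that an agent with `skill` answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (skill - difficulty)))


# Hypothetical agents and items on a shared latent scale.
for skill in (-1.0, 0.0, 1.0):
    row = [round(p_correct(skill, difficulty), 3) for difficulty in (-1.0, 0.0, 1.0)]
    print(skill, row)
```

When skill equals difficulty the probability is exactly 0.5, and a larger `discrimination` makes the transition around that point sharper.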
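[NLP-72] A Novel LLM-based Two-stage Summarization Approach for Long Dialogues
[Quick Read]: This paper addresses long-document summarization, particularly when the document length exceeds the input limit of existing pre-trained language models. The key to the solution is a hierarchical framework that splits a long document into segments using unsupervised topic segmentation and condenses each segment with an unsupervised generation model (ChatGPT v3.5 in the current experiments). An abstractive summarization model is then fine-tuned on the condensed data to generate the final summary. This sidesteps the model's input-length limit and reduces the time and computational resources needed for training, making the approach suitable for settings with constrained computational resources.
Link: https://arxiv.org/abs/2410.06520
Authors: Yuan-Jhe Yin, Bo-Yu Chen, Berlin Chen
Keywords-EN: natural language processing, language processing due, pre-trained language models, pre-trained language, document summarization poses
Categories: Computation and Language (cs.CL)
Comments: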
Abstract:Long document summarization poses a significant challenge in natural language processing due to input lengths that exceed the capacity of most state-of-the-art pre-trained language models. This study proposes a hierarchical framework that segments and condenses information from long documents, subsequently fine-tuning the processed text with an abstractive summarization model. Unsupervised topic segmentation methods identify semantically appropriate breakpoints. The condensation stage utilizes an unsupervised generation model to generate condensed data, and our current experiments employ ChatGPT(v3.5). The summarization stage fine-tunes the abstractive summarization model on the condensed data to generate the final results. This framework enables long documents to be processed on models even when the document length exceeds the model’s maximum input size. The exclusion of the entire document from the summarization model reduces the time and computational resources required for training, making the framework suitable for contexts with constrained local computational resources.
摘要:长文档摘要生成在自然语言处理中面临显著挑战,因为输入长度通常超过大多数最先进的预训练语言模型的处理能力。本研究提出了一种分层框架,该框架通过分割和压缩长文档中的信息,随后对处理后的文本进行抽象摘要模型的微调。无监督的主题分割方法识别语义上合适的断点。压缩阶段利用无监督生成模型生成压缩数据,当前实验采用 ChatGPT (v3.5)。摘要阶段在压缩数据上微调抽象摘要模型以生成最终结果。该框架使得即使文档长度超过模型的最大输入尺寸,也能处理长文档。由于摘要模型不需处理整个文档,从而减少了训练所需的时间和计算资源,使得该框架适用于计算资源受限的本地环境。
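The segment, condense, summarize flow can be sketched as a small pipeline. The three stage functions below are deliberately trivial, hypothetical stand-ins (fixed-size chunks, first-sentence extraction, word truncation); they are not the paper's topic segmenter, ChatGPT-based condenser, or fine-tuned summarizer, and only show how the stages compose.

```python
def summarize_long_document(sentences, segment, condense, summarize):
    """Segment the input, condense each segment, then summarize the condensed text."""
    segments = segment(sentences)
    condensed = [condense(" ".join(seg)) for seg in segments]
    return summarize(" ".join(condensed))


# Toy stand-ins (hypothetical), used only to make the sketch runnable.
def fixed_size_segments(sentences, size=2):
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def keep_first_sentence(text):
    return text.split(". ")[0].rstrip(".") + "."


def truncate_words(text, limit=12):
    return " ".join(text.split()[:limit])


dialogue = [
    "A met B.",
    "They discussed the budget.",
    "C joined late.",
    "The meeting ended.",
]
summary = summarize_long_document(
    dialogue, fixed_size_segments, keep_first_sentence, truncate_words
)
print(summary)  # A met B. C joined late.
```

Because the final stage only ever sees the condensed text, its input length is bounded by the number of segments rather than by the length of the original document.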
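[NLP-73] SEGMENT+: Long Text Processing with Short-Context Language Models EMNLP2024
[Quick Read]: This paper addresses the weak performance of language models on long-input tasks, especially understanding extensive documents and extracting detailed information from lengthy and noisy data. The key to the solution is the SEGMENT+ framework, which uses structured notes and a filtering module to manage information flow, making it possible to handle extended inputs within a limited context window and improving performance on long-document question answering and Needle-in-a-Haystack tasks.
Link: https://arxiv.org/abs/2410.06519
Authors: Wei Shi, Shuang Li, Kerun Yu, Jinglei Chen, Zujie Liang, Xinhui Wu, Yuxi Qian, Feng Wei, Bo Zheng, Jiaqing Liang, Jiangjie Chen, Yanghua Xiao
Keywords-EN: growing interest, interest in expanding, capacity of language, input capacity, SEGMENT
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024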
Abstract:There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increasing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive documents and extracting detailed information from lengthy and noisy data. In response, we introduce SEGMENT+, a general framework that enables LMs to handle extended inputs within limited context windows efficiently. SEGMENT+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable. Our extensive experiments across various model sizes, focusing on long-document question-answering and Needle-in-a-Haystack tasks, demonstrate the effectiveness of SEGMENT+ in improving performance.
摘要:在各个领域中,扩展语言模型 (Language Models, LMs) 的输入容量正逐渐引起广泛关注。然而,仅仅增加上下文窗口并不能保证在处理多样化的长输入任务(如理解大量文档和从冗长且嘈杂的数据中提取详细信息)时具备稳健的性能。为此,我们引入了 SEGMENT+,这是一个通用框架,能够在有限的上下文窗口内高效处理扩展输入。SEGMENT+ 利用结构化笔记和过滤模块来管理信息流,从而形成一个既可控又可解释的系统。我们在不同模型规模上进行了广泛的实验,重点关注长文档问答和“大海捞针”任务,结果表明 SEGMENT+ 在提升性能方面具有显著效果。
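[NLP-74] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
[Quick Read]: This paper addresses the complexity, lack of interoperability, and maintenance burden of distributed systems for training large language models (LLMs). The key to the solution is TorchTitan, an open-source, PyTorch-native distributed training system that unifies state-of-the-art techniques and streamlines their integration, enabling modular 3D parallelism with elastic scaling. TorchTitan provides comprehensive logging, checkpointing, and debugging tools, and incorporates hardware-software co-designed solutions such as Float8 training and SymmetricMemory. In this way it simplifies the training workflow and improves training efficiency, particularly on large GPU clusters.
Link: https://arxiv.org/abs/2410.06511
Authors: Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos
Keywords-EN: large language models, language processing applications, natural language processing, instrumental in advancing, processing applications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: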
Abstract:The development of large language models (LLMs) has been instrumental in advancing state-of-the-art natural language processing applications. Training LLMs with billions of parameters and trillions of tokens require sophisticated distributed systems that enable composing and comparing several state-of-the-art techniques in order to efficiently scale across thousands of accelerators. However, existing solutions are complex, scattered across multiple libraries/repositories, lack interoperability, and are cumbersome to maintain. Thus, curating and empirically comparing training recipes require non-trivial engineering effort. This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system that unifies state-of-the-art techniques, streamlining integration and reducing overhead. TorchTitan enables 3D parallelism in a modular manner with elastic scaling, providing comprehensive logging, checkpointing, and debugging tools for production-ready training. It also incorporates hardware-software co-designed solutions, leveraging features like Float8 training and SymmetricMemory. As a flexible test bed, TorchTitan facilitates custom recipe curation and comparison, allowing us to develop optimized training recipes for Llama 3.1 and provide guidance on selecting techniques for maximum efficiency based on our experiences. We thoroughly assess TorchTitan on the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its exceptional performance, modular composability, and elastic scalability. By stacking training optimizations, we demonstrate accelerations of 65.08% with 1D parallelism at the 128-GPU scale (Llama 3.1 8B), an additional 12.59% with 2D parallelism at the 256-GPU scale (Llama 3.1 70B), and an additional 30% with 3D parallelism at the 512-GPU scale (Llama 3.1 405B) on NVIDIA H100 GPUs over optimized baselines. 
摘要:大语言模型 (LLM) 的发展在推动最先进的自然语言处理应用方面发挥了关键作用。训练具有数十亿参数和数万亿 Token 的 LLM 需要复杂的分布式系统,这些系统能够组合和比较多种最先进的技术,以便在数千个加速器上高效扩展。然而,现有解决方案复杂、分散在多个库/存储库中、缺乏互操作性,并且维护起来繁琐。因此,策划和经验性地比较训练方案需要非同小可的工程努力。本文介绍了 TorchTitan,一个开源的、基于 PyTorch 的分布式训练系统,它统一了最先进的技术,简化了集成并减少了开销。TorchTitan 以模块化方式实现 3D 并行,并具有弹性扩展功能,为生产就绪的训练提供了全面的日志记录、检查点和调试工具。它还集成了硬件与软件协同设计的解决方案,利用了 Float8 训练和对称内存 (SymmetricMemory) 等功能。作为一个灵活的测试平台,TorchTitan 促进了自定义训练方案的策划和比较,使我们能够为 Llama 3.1 开发优化的训练方案,并根据我们的经验提供选择技术以实现最大效率的指导。我们对 TorchTitan 在 Llama 3.1 系列 LLM 上进行了全面评估,涵盖了从 80 亿到 4050 亿参数的范围,并展示了其卓越的性能、模块化组合性和弹性可扩展性。通过叠加训练优化,我们在 128 GPU 规模下使用 1D 并行 (Llama 3.1 8B) 实现了 65.08% 的加速,在 256 GPU 规模下使用 2D 并行 (Llama 3.1 70B) 额外实现了 12.59% 的加速,在 512 GPU 规模下使用 3D 并行 (Llama 3.1 405B) 在 NVIDIA H100 GPU 上相对于优化基线额外实现了 30% 的加速。
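[NLP-75] Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
[Quick Read]: This paper addresses the fact that existing MCTS behavior distillation methods underutilize the rich trajectory information generated by MCTS, which limits improvements in LLM reasoning. The key to the solution is the AlphaLLM-CPL framework, which leverages MCTS trajectories through two innovations: first, it constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree, providing finer-grained, step-level information for MCTS behavior distillation; second, it introduces curriculum preference learning, which dynamically adjusts the training order of trajectory pairs to prioritize critical learning steps, mitigating overfitting and improving the LLM's reasoning ability.
Link: https://arxiv.org/abs/2410.06508
Authors: Xiyao Wang, Linfeng Song, Ye Tian, Dian Yu, Baolin Peng, Haitao Mi, Furong Huang, Dong Yu
Keywords-EN: Monte Carlo Tree, Monte Carlo, Carlo Tree Search, MCTS behavior distillation, Carlo Tree
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: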
Abstract:Monte Carlo Tree Search (MCTS) has recently emerged as a powerful technique for enhancing the reasoning capabilities of LLMs. Techniques such as SFT or DPO have enabled LLMs to distill high-quality behaviors from MCTS, improving their reasoning performance. However, existing distillation methods underutilize the rich trajectory information generated by MCTS, limiting the potential for improvements in LLM reasoning. In this paper, we propose AlphaLLM-CPL, a novel pairwise training framework that enables LLMs to self-improve through MCTS behavior distillation. AlphaLLM-CPL efficiently leverages MCTS trajectories via two key innovations: (1) AlphaLLM-CPL constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree, providing step-level information for more effective MCTS behavior distillation. (2) AlphaLLM-CPL introduces curriculum preference learning, dynamically adjusting the training sequence of trajectory pairs in each offline training epoch to prioritize critical learning steps and mitigate overfitting. Experimental results on mathematical reasoning tasks demonstrate that AlphaLLM-CPL significantly outperforms previous MCTS behavior distillation methods, substantially boosting the reasoning capabilities of LLMs.
摘要:蒙特卡洛树搜索 (Monte Carlo Tree Search, MCTS) 最近作为一种增强大语言模型 (Large Language Model, LLM) 推理能力的技术而崭露头角。诸如 SFT 或 DPO 等技术使得 LLM 能够从 MCTS 中提炼出高质量的行为,从而提升其推理性能。然而,现有的提炼方法未能充分利用 MCTS 生成的丰富轨迹信息,限制了 LLM 推理能力的进一步提升。本文中,我们提出了 AlphaLLM-CPL,一种新颖的成对训练框架,通过 MCTS 行为提炼实现 LLM 的自我改进。AlphaLLM-CPL 通过两大创新高效利用 MCTS 轨迹:(1) AlphaLLM-CPL 构建了从搜索树中共享同一父节点的子节点出发的逐步轨迹对,为更有效的 MCTS 行为提炼提供了步骤级信息。(2) AlphaLLM-CPL 引入了课程偏好学习,动态调整每个离线训练周期中轨迹对的训练顺序,优先处理关键学习步骤并减轻过拟合。在数学推理任务上的实验结果表明,AlphaLLM-CPL 显著优于之前的 MCTS 行为提炼方法,大幅提升了 LLM 的推理能力。
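[NLP-76] On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task EMNLP2024
[Quick Read]: This paper investigates how universal the internal circuits of Gemma 2B are across languages (English and Spanish) when solving the subject-verb agreement task. The key finding is a particular attention head that writes a 'subject number' signal to the last residual stream; the signal is represented as a direction in residual stream space and is language-independent. The paper shows that this direction has a causal effect on model predictions: intervening with the direction found in English flips the predicted verb number in Spanish, indicating that the circuits are highly consistent across the two languages.
Link: https://arxiv.org/abs/2410.06496
Authors: Javier Ferrando, Marta R. Costa-jussà
Keywords-EN: successfully reversed-engineered, recently been successfully, algorithms implemented, Gemma, circuits
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024 Findings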
Abstract:Several algorithms implemented by language models have recently been successfully reverse-engineered. However, these findings have been concentrated on specific tasks and models, leaving it unclear how universal circuits are across different settings. In this paper, we study the circuits implemented by Gemma 2B for solving the subject-verb agreement task across two different languages, English and Spanish. We discover that both circuits are highly consistent, being mainly driven by a particular attention head writing a 'subject number' signal to the last residual stream, which is read by a small set of neurons in the final MLPs. Notably, this subject number signal is represented as a direction in the residual stream space, and is language-independent. We demonstrate that this direction has a causal effect on the model predictions, effectively flipping the Spanish predicted verb number by intervening with the direction found in English. Finally, we present evidence of similar behavior in other models within the Gemma 1 and Gemma 2 families.
摘要:近期,语言模型中实现的若干算法已被成功逆向工程。然而,这些发现主要集中在特定任务和模型上,尚不清楚这些电路在不同设置中的通用性如何。本文研究了Gemma 2B在解决英语和西班牙语主谓一致任务时所实现的电路。我们发现,这两种电路高度一致,主要由一个特定的注意力头将“主语数量”信号写入最后一个残差流,该信号由最终MLP中的一小组神经元读取。值得注意的是,这个主语数量信号在残差流空间中表现为一个方向,并且是语言无关的。我们证明,这个方向对模型预测具有因果效应,通过干预在英语中发现的该方向,能够有效地翻转西班牙语预测动词的数量。最后,我们提供了Gemma 1和Gemma 2系列中其他模型表现出类似行为的证据。
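What "a signal represented as a direction" and "intervening with the direction" can mean is sketched below on toy two-dimensional activations. The difference-of-means estimate of the direction and the reflection-style intervention are generic assumptions chosen for illustration; the abstract does not state how the authors find or apply the direction, and no real model activations are used.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def number_direction(singular_acts, plural_acts):
    """Unit vector from the singular mean to the plural mean, plus their midpoint."""
    mu_s, mu_p = mean_vector(singular_acts), mean_vector(plural_acts)
    diff = [p - s for s, p in zip(mu_s, mu_p)]
    norm = sum(d * d for d in diff) ** 0.5
    unit = [d / norm for d in diff]
    midpoint = [(s + p) / 2 for s, p in zip(mu_s, mu_p)]
    return unit, midpoint


def flip_along(activation, unit, midpoint):
    """Reflect the activation across the midpoint along `unit`; other components stay."""
    offset = sum((a - m) * u for a, m, u in zip(activation, midpoint, unit))
    return [a - 2 * offset * u for a, u in zip(activation, unit)]


# Toy activations (hypothetical): the first coordinate encodes subject number.
unit, midpoint = number_direction(
    singular_acts=[[1.0, 0.0], [1.0, 2.0]],
    plural_acts=[[3.0, 0.0], [3.0, 2.0]],
)
print(flip_along([1.0, 5.0], unit, midpoint))  # [3.0, 5.0]
```

In the cross-lingual setting described in the abstract, the direction would be estimated from English inputs and the intervention applied to activations from Spanish inputs.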
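[NLP-77] LLM Compression with Neural Architecture Search
[Quick Read]: This paper addresses the high inference cost of large language models (LLMs), which grows as models scale and accumulates over their life cycle. The key to the solution is using Neural Architecture Search (NAS) to compress pre-trained LLMs by pruning structural components such as attention heads, neurons, and layers, aiming for a Pareto-optimal balance between performance and efficiency. Compared to structural pruning baselines, NAS extended to LLMs improves performance by up to 3.4% on MMLU, with an on-device latency speedup.
Link: https://arxiv.org/abs/2410.06479
Authors: Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein
Keywords-EN: exhibit remarkable reasoning, remarkable reasoning abilities, Large language models, reasoning abilities, Large language
Categories: Computation and Language (cs.CL)
Comments: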
Abstract:Large language models (LLMs) exhibit remarkable reasoning abilities, allowing them to generalize across a wide range of downstream tasks, such as commonsense reasoning or instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. This poses the question: Can we compress pre-trained LLMs to meet diverse size and latency requirements? We leverage Neural Architecture Search (NAS) to compress LLMs by pruning structural components, such as attention heads, neurons, and layers, aiming to achieve a Pareto-optimal balance between performance and efficiency. While NAS already achieved promising results on small language models in previous work, in this paper we propose various extensions that allow us to scale to LLMs. Compared to structural pruning baselines, we show that NAS improves performance up to 3.4% on MMLU with an on-device latency speedup.
摘要:大语言模型 (LLMs) 展现出卓越的推理能力,使其能够在常识推理或指令跟随等广泛的下游任务中进行泛化。然而,随着 LLMs 规模的扩大,推理成本变得愈发高昂,在其生命周期内显著累积。这引发了一个问题:我们能否压缩预训练的 LLMs 以满足多样化的尺寸和延迟需求?我们利用神经架构搜索 (NAS) 通过修剪结构组件(如注意力头、神经元和层)来压缩 LLMs,旨在实现性能与效率之间的帕累托最优平衡。尽管 NAS 在先前的工作中已经在小型语言模型上取得了有前景的结果,本文我们提出了多种扩展方法,使其能够扩展至 LLMs。与结构修剪基线相比,我们展示了 NAS 在 MMLU 上提升了高达 3.4% 的性能,并实现了设备上的延迟加速。
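"Pareto-optimal balance between performance and efficiency" can be made concrete with a small filter that keeps only non-dominated sub-networks. The candidate names, latencies, and accuracies below are invented for illustration, and the search procedure that would produce such candidates is not shown.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (lower latency, higher accuracy)."""
    front = []
    for name, latency, accuracy in candidates:
        dominated = any(
            other_latency <= latency
            and other_accuracy >= accuracy
            and (other_latency < latency or other_accuracy > accuracy)
            for _, other_latency, other_accuracy in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical pruned sub-networks: (name, latency in ms, accuracy).
candidates = [
    ("full", 100, 0.70),
    ("half-layers", 60, 0.66),
    ("few-heads", 80, 0.65),
    ("tiny", 30, 0.50),
]
print(pareto_front(candidates))  # ['full', 'half-layers', 'tiny']
```

Here "few-heads" is dropped because "half-layers" is both faster and more accurate; every remaining candidate represents a distinct trade-off that a user could pick to fit a size or latency budget.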
[NLP-78] LLM Self-Correction with DeCRIM: Decompose, Critique and Refine for Enhanced Following of Instructions with Multiple Constraints EMNLP2024
[Quick Read]: This paper addresses the poor performance of large language models (LLMs) when following complex instructions that contain multiple constraints. The key to the solution is the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which decomposes the original instruction into a list of constraints and uses a Critic model to decide when and where the LLM's response needs refinement. DeCRIM improves the open-source model Mistral on both the RealInstruct and IFEval benchmarks, and with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4.
Link: https://arxiv.org/abs/2410.06458
Authors: Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
Keywords-EN: key capability, instructions, Abstract, DeCRIM, LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at EMNLP 2024
Abstract:Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post “in a funny tone” with “no hashtag”). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
摘要:指令遵循是大语言模型 (LLM) 的关键能力。然而,最近的研究表明,大语言模型在处理包含多个约束的指令时常常遇到困难(例如,要求创建一个“幽默风格”且“不带标签”的社交媒体帖子)。尽管如此,大多数评估仅依赖于合成数据。为了解决这一问题,我们引入了 RealInstruct,这是首个通过利用真实用户向 AI 助手提出的查询来评估大语言模型遵循真实世界多约束指令能力的基准。我们还探讨了基于模型的评估作为人类标注的经济有效替代方案。我们的研究发现,即使是专有的 GPT-4 模型,在超过 21% 的指令中至少未能满足一个约束,这突显了当前最先进模型的局限性。为了缩小开源模型与专有模型之间的性能差距,我们提出了分解、批判和精炼 (DeCRIM) 自我修正流程,该流程增强了大语言模型遵循约束的能力。DeCRIM 通过将原始指令分解为一系列约束,并使用批判模型来决定何时以及何处需要对大语言模型的响应进行精炼。我们的结果显示,即使在弱反馈的情况下,DeCRIM 也能将 Mistral 在 RealInstruct 上的表现提升 7.3%,在 IFEval 上提升 8.0%。此外,我们证明,在强反馈的情况下,配备 DeCRIM 的开源大语言模型在两个基准测试中均能超越 GPT-4。
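The decompose, critique and refine loop described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions: `decrim`, `toy_decompose`, `toy_critic`, `toy_generate`, the prompt wording and the `max_rounds` cap are all hypothetical stand-ins, not the authors' implementation, and the toy functions replace real LLM calls.

```python
from typing import Callable, List


def decrim(instruction: str,
           generate: Callable[[str], str],
           decompose: Callable[[str], List[str]],
           critic: Callable[[str, str], bool],
           max_rounds: int = 3) -> str:
    """Illustrative decompose -> critique -> refine loop (hypothetical API)."""
    constraints = decompose(instruction)      # split instruction into constraints
    response = generate(instruction)          # first draft
    for _ in range(max_rounds):
        # The critic reports which constraints the current response violates.
        violated = [c for c in constraints if not critic(response, c)]
        if not violated:
            break                             # nothing to refine
        prompt = (f"{instruction}\nPrevious response: {response}\n"
                  f"Fix these constraints: {'; '.join(violated)}")
        response = generate(prompt)           # refine only when needed
    return response


# Toy stand-ins for the LLM, the decomposer and the Critic model (assumptions).
def toy_decompose(instruction: str) -> List[str]:
    return ["no hashtag", "at most 40 characters"]


def toy_critic(response: str, constraint: str) -> bool:
    if constraint == "no hashtag":
        return "#" not in response
    if constraint == "at most 40 characters":
        return len(response) <= 40
    return True


def toy_generate(prompt: str) -> str:
    if "Fix these constraints" in prompt:
        return "Coffee first, adulting later."
    return "Coffee first, adulting later. #MondayMood #coffee"


result = decrim("Write a funny post about Mondays with no hashtag.",
                toy_generate, toy_decompose, toy_critic)
```

In this toy run the first draft violates both constraints, so one refinement round is triggered; a response that already satisfies every constraint would be returned unchanged.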
[NLP-79] Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models
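Quick read: This paper addresses the excessive memory required to fine-tune language models (LMs) with the Adam optimizer. The core of the solution is Addax, which integrates "in-place" Stochastic Gradient Descent (IP-SGD) with the Memory-Efficient Zeroth-order Optimizer (MeZO). For each data point in a minibatch, Addax computes either a zeroth-order or a first-order gradient depending on the data point's memory consumption, then combines these estimates into the update direction; this avoids both MeZO's slow convergence and IP-SGD's high memory requirement. The zeroth-order gradient also acts as a regularizer for the first-order gradient, further improving final performance. In the experiments, Addax outperforms MeZO in accuracy and convergence speed with a comparable memory footprint.
Link: https://arxiv.org/abs/2410.06441
Authors: Zeman Li,Xinwei Zhang,Peilin Zhong,Yuan Deng,Meisam Razaviyayn,Vahab Mirrokni
Keywords: Stochastic Gradient Descent, Addax, limiting accessibility, MeZO, memory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: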
Abstract:Fine-tuning language models (LMs) with the Adam optimizer often demands excessive memory, limiting accessibility. The “in-place” version of Stochastic Gradient Descent (IP-SGD) and Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed to address this. However, IP-SGD still requires substantial memory, and MeZO suffers from slow convergence and degraded final performance due to its zeroth-order nature. This paper introduces Addax, a novel method that improves both memory efficiency and performance of IP-SGD by integrating it with MeZO. Specifically, Addax computes zeroth- or first-order gradients of data points in the minibatch based on their memory consumption, combining these gradient estimates to update directions. By computing zeroth-order gradients for data points that require more memory and first-order gradients for others, Addax overcomes the slow convergence of MeZO and the excessive memory requirement of IP-SGD. Additionally, the zeroth-order gradient acts as a regularizer for the first-order gradient, further enhancing the model’s final performance. Theoretically, we establish the convergence of Addax under mild assumptions, demonstrating faster convergence and less restrictive hyper-parameter choices than MeZO. Our experiments with diverse LMs and tasks show that Addax consistently outperforms MeZO regarding accuracy and convergence speed while having a comparable memory footprint. When fine-tuning OPT-13B with one A100 GPU, on average, Addax outperforms MeZO in accuracy/F1 score by 14% and runs 15x faster while using memory similar to MeZO. In our experiments on the larger OPT-30B model, on average, Addax outperforms MeZO in terms of accuracy/F1 score by 16 and runs 30x faster on a single H100 GPU. Moreover, Addax surpasses the performance of standard fine-tuning approaches, such as IP-SGD and Adam, in most tasks with significantly less memory requirement.
摘要:使用 Adam 优化器微调语言模型 (LMs) 通常需要大量内存,限制了其可用性。为此,提出了“原地”版本的随机梯度下降 (IP-SGD) 和内存高效的零阶优化器 (MeZO) 来解决这一问题。然而,IP-SGD 仍然需要大量内存,而 MeZO 由于其零阶特性,存在收敛速度慢和最终性能下降的问题。本文介绍了 Addax,这是一种通过将 IP-SGD 与 MeZO 结合,从而在提高内存效率的同时提升性能的新方法。具体而言,Addax 根据数据点在 minibatch 中的内存消耗,计算其零阶或一阶梯度,并将这些梯度估计结合起来更新方向。通过为需要更多内存的数据点计算零阶梯度,为其他数据点计算一阶梯度,Addax 克服了 MeZO 的慢收敛问题和 IP-SGD 的内存需求过高问题。此外,零阶梯度作为一阶梯度的正则化器,进一步提升了模型的最终性能。理论上,我们在温和假设下证明了 Addax 的收敛性,表明其收敛速度比 MeZO 更快,且超参数选择更为宽松。我们在多种语言模型和任务上的实验表明,Addax 在准确性和收敛速度方面始终优于 MeZO,同时内存占用相当。在单个 A100 GPU 上微调 OPT-13B 时,平均而言,Addax 在准确率/F1 分数上比 MeZO 高出 14%,运行速度快 15 倍,同时内存使用与 MeZO 相当。在我们对更大的 OPT-30B 模型的实验中,平均而言,Addax 在单个 H100 GPU 上的准确率/F1 分数比 MeZO 高出 16,运行速度快 30 倍。此外,Addax 在大多数任务中超越了标准的微调方法,如 IP-SGD 和 Adam,且内存需求显著减少。
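The idea of mixing zeroth-order and first-order gradient estimates inside one minibatch can be illustrated on a toy linear model. This is a sketch under assumptions, not the Addax algorithm itself: the per-example `cost` field, the `cost_limit` threshold, the mixing weight `alpha` and the two-point random-direction estimator are all illustrative choices made here.

```python
import random


def loss(w, x, y):
    """Squared error of a linear model on one example."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (pred - y) ** 2


def first_order_grad(w, x, y):
    """Exact gradient of the per-example loss."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [(pred - y) * xi for xi in x]


def zeroth_order_grad(w, x, y, rng, eps=1e-3):
    """Two-point estimate along a random direction; uses only loss values."""
    z = [rng.gauss(0.0, 1.0) for _ in w]
    w_plus = [wi + eps * zi for wi, zi in zip(w, z)]
    w_minus = [wi - eps * zi for wi, zi in zip(w, z)]
    slope = (loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps)
    return [slope * zi for zi in z]


def mixed_step(w, batch, cost_limit, lr, alpha, rng):
    """Zeroth-order for 'expensive' examples, first-order for the rest."""
    fo_sum, zo_sum = [0.0] * len(w), [0.0] * len(w)
    n_fo = n_zo = 0
    for x, y, cost in batch:
        if cost > cost_limit:                 # hypothetical memory proxy
            g = zeroth_order_grad(w, x, y, rng)
            zo_sum = [a + b for a, b in zip(zo_sum, g)]
            n_zo += 1
        else:
            g = first_order_grad(w, x, y)
            fo_sum = [a + b for a, b in zip(fo_sum, g)]
            n_fo += 1
    # Weighted combination of the two averaged estimates (alpha is assumed).
    direction = [alpha * z / max(n_zo, 1) + (1 - alpha) * f / max(n_fo, 1)
                 for z, f in zip(zo_sum, fo_sum)]
    return [wi - lr * di for wi, di in zip(w, direction)]


# Toy data generated by w* = [2, -1]; the third field is a made-up cost.
batch = [((1.0, 0.0), 2.0, 10), ((0.0, 1.0), -1.0, 12),
         ((1.0, 1.0), 1.0, 90), ((1.0, -1.0), 3.0, 95)]


def total_loss(w):
    return sum(loss(w, x, y) for x, y, _ in batch)


rng = random.Random(0)
w = [0.0, 0.0]
initial = total_loss(w)
for _ in range(300):
    w = mixed_step(w, batch, cost_limit=50, lr=0.05, alpha=0.5, rng=rng)
final = total_loss(w)
```

The zeroth-order branch needs only two forward evaluations and no stored activations, which is the memory motivation given in the abstract; the first-order branch supplies lower-variance updates for the cheaper examples.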
[NLP-80] Stress Detection on Code-Mixed Texts in Dravidian Languages using Machine Learning
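Quick read: This paper addresses stress identification in code-mixed texts in Dravidian languages (Tamil and Telugu). The approach uses uncleaned text as a benchmark and trains a Random Forest classifier on three textual representations: TF-IDF, word uni-grams, and a composite of character (1+2+3)-grams. It achieves a Macro F1-score of 0.734 on the Tamil dataset and 0.727 on the Telugu dataset, surpassing results obtained with more complex techniques such as FastText and Transformer models. The results underline both the value of uncleaned data for mental state detection and the difficulty of classifying code-mixed texts.
Link: https://arxiv.org/abs/2410.06428
Authors: L. Ramos,M. Shahiki-Tash,Z. Ahani,A. Eponon,O. Kolesnikova,H. Calvo
Keywords: affect mental well-being, daily life, common feeling, feeling in daily, development of robust
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: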
Abstract:Stress is a common feeling in daily life, but it can affect mental well-being in some situations, the development of robust detection models is imperative. This study introduces a methodical approach to the stress identification in code-mixed texts for Dravidian languages. The challenge encompassed two datasets, targeting Tamil and Telugu languages respectively. This proposal underscores the importance of using uncleaned text as a benchmark to refine future classification methodologies, incorporating diverse preprocessing techniques. Random Forest algorithm was used, featuring three textual representations: TF-IDF, Uni-grams of words, and a composite of (1+2+3)-Grams of characters. The approach achieved a good performance for both linguistic categories, achieving a Macro F1-score of 0.734 in Tamil and 0.727 in Telugu, overpassing results achieved with different complex techniques such as FastText and Transformer models. The results underscore the value of uncleaned data for mental state detection and the challenges classifying code-mixed texts for stress, indicating the potential for improved performance through cleaning data, other preprocessing techniques, or more complex models.
摘要:压力是日常生活中常见的感受,但在某些情况下,它会影响心理健康,因此开发稳健的检测模型势在必行。本研究提出了一种系统的方法,用于识别德拉威语系中代码混合文本中的压力。该挑战涉及两个数据集,分别针对泰米尔语和泰卢固语。本研究强调了使用未清洗文本作为基准以改进未来分类方法的重要性,并结合了多种预处理技术。采用随机森林算法,特征包括三种文本表示:TF-IDF、单词的单字词以及字符的 (1+2+3)-Grams 组合。该方法在两种语言类别中均表现良好,泰米尔语的宏 F1 分数为 0.734,泰卢固语为 0.727,超过了使用 FastText 和 Transformer 模型等不同复杂技术所取得的结果。研究结果强调了未清洗数据在心理状态检测中的价值,以及分类代码混合文本中压力的挑战,表明通过清洗数据、其他预处理技术或更复杂的模型,性能有望进一步提升。
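Two ingredients named in the abstract, the composite of character (1+2+3)-grams and the Macro F1-score, are easy to show in isolation. The sketch below uses only the standard library; the function names and the toy labels are assumptions made here, and the paper's actual feature pipeline and classifier are not reproduced.

```python
from collections import Counter


def char_ngrams(text, n_values=(1, 2, 3)):
    """Count all character n-grams for each n in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


features = char_ngrams("aba")
score = macro_f1(["stressed", "stressed", "not", "not"],
                 ["stressed", "not", "not", "not"])
```

Because Macro F1 averages the per-class scores without weighting by class size, a classifier cannot score well by favouring only the majority class.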
[NLP-81] NLP Case Study on Predicting the Before and After of the Ukraine-Russia and Hamas-Israel Conflicts
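Quick read: This paper addresses the prediction of toxicity and other textual attributes of social media text before and after a conflict, using natural language processing (NLP) techniques. The authors compile Twitter and Reddit datasets for the Ukraine-Russia and Hamas-Israel conflicts, split into before and after periods, and apply both supervised and unsupervised NLP techniques to analyse how discussion differs across that split. The reported prediction error is nearly 1.2 percent for both conflicts.
Link: https://arxiv.org/abs/2410.06427
Authors: Jordan Miner,John E. Ortega
Keywords: natural language processing, social media, recent events, Twitter and Reddit, propose a method
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The clusters created using topic modeling can be viewed at this https URL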
Abstract:We propose a method to predict toxicity and other textual attributes through the use of natural language processing (NLP) techniques for two recent events: the Ukraine-Russia and Hamas-Israel conflicts. This article provides a basis for exploration in future conflicts with hopes to mitigate risk through the analysis of social media before and after a conflict begins. Our work compiles several datasets from Twitter and Reddit for both conflicts in a before and after separation with an aim of predicting a future state of social media for avoidance. More specifically, we show that: (1) there is a noticeable difference in social media discussion leading up to and following a conflict and (2) social media discourse on platforms like Twitter and Reddit is useful in identifying future conflicts before they arise. Our results show that through the use of advanced NLP techniques (both supervised and unsupervised) toxicity and other attributes about language before and after a conflict is predictable with a low error of nearly 1.2 percent for both conflicts.
摘要:我们提出了一种通过自然语言处理 (NLP) 技术预测乌克兰-俄罗斯和哈马斯-以色列冲突期间文本毒性及其他属性的方法。本文为未来冲突的探索提供了基础,希望通过分析冲突前后的社交媒体数据来降低风险。我们的研究收集了来自 Twitter 和 Reddit 的多个数据集,分别在冲突前和冲突后进行分类,旨在预测未来社交媒体的状态以避免风险。更具体地说,我们发现:(1) 冲突前后的社交媒体讨论存在显著差异;(2) 像 Twitter 和 Reddit 这样的平台上的社交媒体话语有助于在冲突发生前识别潜在的冲突。我们的结果表明,通过使用先进的 NLP 技术(包括监督学习和无监督学习),可以在冲突前后以接近 1.2% 的低误差率预测文本的毒性及其他属性。
[NLP-82] ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments EMNLP2024
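Quick read: Motivated by the global shortage of healthcare workers and the resulting demand for smart healthcare assistants that can monitor and alert staff, this paper examines the healthcare knowledge of existing Large Vision Language Models (LVLMs). Its core contribution is the Emergency Room Visual Question Answering (ERVQA) dataset, a benchmark of image, question, answer triplets covering diverse emergency room scenarios. Through a detailed error taxonomy and an analysis of answer trends, the paper shows how complex the task is and argues for specialized, domain-specific solutions.
Link: https://arxiv.org/abs/2410.06420
Authors: Sourjyadip Ray,Kushal Gupta,Soumi Kundu,Payal Arvind Kasat,Somak Aditya,Pawan Goyal
Keywords: Visual Question Answering, alert healthcare workers, smart healthcare assistants, Large Vision Language, Room Visual Question
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EMNLP 2024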
Abstract:The global shortage of healthcare workers has demanded the development of smart healthcare assistants, which can help monitor and alert healthcare workers when necessary. We examine the healthcare knowledge of existing Large Vision Language Models (LVLMs) via the Visual Question Answering (VQA) task in hospital settings through expert annotated open-ended questions. We introduce the Emergency Room Visual Question Answering (ERVQA) dataset, consisting of image, question, answer triplets covering diverse emergency room scenarios, a seminal benchmark for LVLMs. By developing a detailed error taxonomy and analyzing answer trends, we reveal the nuanced nature of the task. We benchmark state-of-the-art open-source and closed LVLMs using traditional and adapted VQA metrics: Entailment Score and CLIPScore Confidence. Analyzing errors across models, we infer trends based on properties like decoder type, model size, and in-context examples. Our findings suggest the ERVQA dataset presents a highly complex task, highlighting the need for specialized, domain-specific solutions.
摘要:全球医疗工作者的短缺促使了智能医疗助手的开发,这些助手能够在必要时帮助监控并提醒医疗工作者。我们通过在医院环境中使用专家标注的开放式问题进行视觉问答 (VQA) 任务,来考察现有大视觉语言模型 (LVLMs) 的医疗知识。我们引入了急诊室视觉问答 (ERVQA) 数据集,该数据集由涵盖多种急诊室场景的图像、问题、答案三元组组成,是大视觉语言模型的一个开创性基准。通过开发详细的错误分类法并分析答案趋势,我们揭示了该任务的细微差别。我们使用传统的和适应的 VQA 指标(包括蕴涵得分和 CLIPScore 置信度)对最先进的开源和闭源大视觉语言模型进行了基准测试。通过分析不同模型中的错误,我们根据解码器类型、模型大小和上下文示例等属性推断出趋势。我们的研究结果表明,ERVQA 数据集呈现了一个高度复杂的任务,突显了专门化、领域特定解决方案的必要性。
[NLP-83] MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks EMNLP2024
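Quick read: This paper addresses the failure of language models on tasks that require repetitive use of simple rules, even on sequences much shorter than those seen during training. The core contribution is MLissard, a multilingual benchmark that evaluates models' ability to process and generate texts of varied lengths and provides a mechanism for controlling sequence complexity. The evaluation shows a consistent decline in performance across all models and languages as sequence complexity increases, while in-context examples in languages other than English significantly improve extrapolation performance.
Link: https://arxiv.org/abs/2410.06396
Authors: Mirelle Bueno,Roberto Lotufo,Rodrigo Nogueira
Keywords: long sequences consisting, thousands of tokens, tasks that require, capable of solving, dealing with long
Categories: Computation and Language (cs.CL)
Comments: GenBench Workshop by EMNLP 2024: Camera-ready version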
Abstract:Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce MLissard, a multilingual benchmark designed to evaluate models’ abilities to process and generate texts of varied lengths and offers a mechanism for controlling sequence complexity. Our evaluation of open-source and proprietary models show a consistent decline in performance across all models and languages as the complexity of the sequence increases. Surprisingly, the use of in-context examples in languages other than English helps increase extrapolation performance significantly. The datasets and code are available at this https URL.
摘要:语言模型现在能够解决需要处理包含数十万 Token 的长序列任务。然而,它们在需要重复使用简单规则的任务上往往表现不佳,即使这些任务涉及的序列长度远小于训练时所见的长度。例如,最先进的大语言模型 (LLM) 可以在两个最多包含 20 个项目的列表中找到共同项目,但当列表包含 80 个项目时则会失败。在本文中,我们介绍了 MLissard,这是一个多语言基准测试,旨在评估模型处理和生成不同长度文本的能力,并提供了一种控制序列复杂性的机制。我们对开源和专有模型的评估显示,随着序列复杂性的增加,所有模型和语言的性能均呈现一致的下降趋势。令人惊讶的是,在非英语语言中使用上下文示例可以显著提高外推性能。数据集和代码可在以下链接获取:https URL。
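The list-intersection example from the abstract (common items in two lists of 20 versus 80 items) suggests how a length-controlled task instance might be generated and scored. The generator below is a hypothetical sketch written for this digest; `make_intersection_task`, its parameters and the exact-match scoring are assumptions and do not reproduce the MLissard code.

```python
import random


def make_intersection_task(length, n_common, seed=0):
    """Build two lists of `length` items sharing exactly `n_common` items."""
    if n_common > length:
        raise ValueError("n_common must not exceed length")
    rng = random.Random(seed)
    # Draw distinct items so the only overlap is the chosen common part.
    pool = rng.sample(range(10_000), 2 * length - n_common)
    common = pool[:n_common]
    list_a = common + pool[n_common:length]
    list_b = common + pool[length:]
    rng.shuffle(list_a)
    rng.shuffle(list_b)
    return list_a, list_b, sorted(common)


def exact_match(predicted, answer):
    """Score 1 only if the predicted common items match exactly."""
    return int(sorted(predicted) == answer)


short_task = make_intersection_task(length=20, n_common=5)
long_task = make_intersection_task(length=80, n_common=5)
```

Here `length` plays the role of the complexity knob: the rule to apply stays the same while the sequence grows, which is the setting in which the abstract reports a performance decline.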
[NLP-84] Counterfactual Causal Inference in Natural Language with Large Language Models
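Quick read: This paper addresses recovering causal structure from unstructured natural language data such as news articles. The approach uses large language models (LLMs) to extract instantiated causal variables from text and build a causal graph, merges the causal graphs from multiple data sources to represent the most exhaustive set of causes possible, and then performs counterfactual inference on the estimated graph. Conditioning on the causal graph reduces LLM biases and better represents the causal estimands. The authors use the method to show that the limitations of LLMs in counterfactual causal reasoning come from prediction errors, and they propose directions to mitigate them.
Link: https://arxiv.org/abs/2410.06392
Authors: Gaël Gendron,Jože M. Rožanec,Michael Witbrock,Gillian Dobbie
Keywords: Causal, Causal structure discovery, Causal structure, commonly applied, applied to structured
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 22 pages, 10 pages for the main paper, 12 pages for the references and appendix, 5 figures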
Abstract:Causal structure discovery methods are commonly applied to structured data where the causal variables are known and where statistical testing can be used to assess the causal relationships. By contrast, recovering a causal structure from unstructured natural language data such as news articles contains numerous challenges due to the absence of known variables or counterfactual data to estimate the causal links. Large Language Models (LLMs) have shown promising results in this direction but also exhibit limitations. This work investigates LLM’s abilities to build causal graphs from text documents and perform counterfactual causal inference. We propose an end-to-end causal structure discovery and causal inference method from natural language: we first use an LLM to extract the instantiated causal variables from text data and build a causal graph. We merge causal graphs from multiple data sources to represent the most exhaustive set of causes possible. We then conduct counterfactual inference on the estimated graph. The causal graph conditioning allows reduction of LLM biases and better represents the causal estimands. We use our method to show that the limitations of LLMs in counterfactual causal reasoning come from prediction errors and propose directions to mitigate them. We demonstrate the applicability of our method on real-world news articles.
摘要:因果结构发现方法通常应用于结构化数据,其中因果变量已知,并且可以使用统计测试来评估因果关系。相比之下,从新闻文章等非结构化自然语言数据中恢复因果结构面临着诸多挑战,因为缺乏已知的变量或反事实数据来估计因果关系。大语言模型 (LLM) 在这方面展示了有前景的结果,但也存在局限性。本文研究了 LLM 从文本文档构建因果图并进行反事实因果推断的能力。我们提出了一种从自然语言中进行端到端因果结构发现和因果推断的方法:首先,我们使用 LLM 从文本数据中提取实例化的因果变量并构建因果图。我们将来自多个数据源的因果图合并,以表示尽可能全面的因果集合。然后,我们在估计的图上进行反事实推断。因果图的条件化允许减少 LLM 的偏差,并更好地表示因果估计量。我们使用我们的方法展示了 LLM 在反事实因果推理中的局限性源于预测错误,并提出了缓解这些问题的方向。我们展示了该方法在真实世界新闻文章中的适用性。
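The step of merging causal graphs from several documents can be pictured with a plain adjacency-set representation. This is a minimal sketch under assumptions: the graphs below are invented examples, the `cause -> effects` dictionary layout and both function names are hypothetical, and the LLM extraction and counterfactual inference stages are not shown.

```python
def merge_graphs(graphs):
    """Union the edges of several cause -> effects graphs."""
    merged = {}
    for graph in graphs:
        for cause, effects in graph.items():
            merged.setdefault(cause, set()).update(effects)
    return merged


def causes_of(graph, target):
    """Return every direct or indirect cause (ancestor) of `target`."""
    parents = {}
    for cause, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Invented graphs standing in for what an LLM might extract from two articles.
article_1 = {"drought": {"crop failure"}}
article_2 = {"crop failure": {"price rise"}, "export ban": {"price rise"}}

merged = merge_graphs([article_1, article_2])
price_causes = causes_of(merged, "price rise")
```

Merging exposes the indirect cause "drought", which neither article links to "price rise" on its own.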
[NLP-85] Validation of the Scientific Literature via Chemputation Augmented by Large Language Models
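Quick read: This paper addresses the automatic validation and execution of experimental procedures reported in the synthetic chemistry literature. The core of the solution is an LLM-based agent workflow that autonomously extracts synthetic procedures and analytical data from extensive documents, translates them into universal XDL code, simulates execution in a hardware-specific setup, and finally executes the procedure on an XDL-controlled robotic system. The authors expect the workflow to increase automation and to improve the safety, scalability and reproducibility of synthetic chemistry.
Link: https://arxiv.org/abs/2410.06384
Authors: Sebastian Pagel,Michael Jirasek,Leroy Cronin
Keywords: Large Language Models, universal symbolic language, programming chemical robots, symbolic language, Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 22 pages, 7 figures, 34 references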
Abstract:Chemputation is the process of programming chemical robots to do experiments using a universal symbolic language, but the literature can be error prone and hard to read due to ambiguities. Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including natural language processing, robotic control, and more recently, chemistry. Despite significant advancements in standardizing the reporting and collection of synthetic chemistry data, the automatic reproduction of reported syntheses remains a labour-intensive task. In this work, we introduce an LLM-based chemical research agent workflow designed for the automatic validation of synthetic literature procedures. Our workflow can autonomously extract synthetic procedures and analytical data from extensive documents, translate these procedures into universal XDL code, simulate the execution of the procedure in a hardware-specific setup, and ultimately execute the procedure on an XDL-controlled robotic system for synthetic chemistry. This demonstrates the potential of LLM-based workflows for autonomous chemical synthesis with Chemputers. Due to the abstraction of XDL this approach is safe, secure, and scalable since hallucinations will not be chemputable and the XDL can be both verified and encrypted. Unlike previous efforts, which either addressed only a limited portion of the workflow, relied on inflexible hard-coded rules, or lacked validation in physical systems, our approach provides four realistic examples of syntheses directly executed from synthetic literature. We anticipate that our workflow will significantly enhance automation in robotically driven synthetic chemistry research, streamline data extraction, improve the reproducibility, scalability, and safety of synthetic and experimental chemistry.
摘要:化学计算 (Chemputation) 是指通过一种通用的符号语言编程化学机器人进行实验的过程,但由于存在歧义,文献可能容易出错且难以阅读。大语言模型 (Large Language Models, LLMs) 在多个领域展示了卓越的能力,包括自然语言处理、机器人控制,以及最近在化学领域的应用。尽管在标准化合成化学数据的报告和收集方面取得了显著进展,但自动重现报告的合成过程仍然是一项劳动密集型任务。本文中,我们介绍了一种基于 LLM 的化学研究智能体工作流程,旨在自动验证合成文献程序。我们的工作流程能够自主地从大量文档中提取合成程序和分析数据,将这些程序翻译成通用的 XDL 代码,模拟在特定硬件设置下的执行过程,并最终在 XDL 控制的机器人系统上执行合成化学程序。这展示了基于 LLM 的工作流程在化学计算机 (Chemputers) 上实现自主化学合成的潜力。由于 XDL 的抽象性,这种方法安全、可靠且可扩展,因为幻觉 (hallucinations) 不会被化学计算,且 XDL 可以被验证和加密。与以往仅处理工作流程的有限部分、依赖于不灵活的硬编码规则或缺乏物理系统验证的方法不同,我们的方法提供了四个直接从合成文献中执行的合成实例。我们预计,我们的工作流程将显著提升机器人驱动合成化学研究的自动化水平,简化数据提取,提高合成和实验化学的可重复性、可扩展性和安全性。
[NLP-86] HumVI: A Multilingual Dataset for Detecting Violent Incidents Impacting Humanitarian Aid
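Quick read: This paper addresses the difficulty humanitarian organizations face in obtaining data about violent incidents that directly affect aid operations. The core contribution is HumVI, a dataset built with an automatic data collection and NLP-backed classification framework. It comprises news articles in English, French and Arabic that contain different types of violent incidents, categorized by the humanitarian sector they impact, e.g., aid security, education, food security, health, and protection. Reliable labels were obtained by partnering with Insecurity Insight, a data-backed humanitarian organization, and the paper provides benchmarks with several deep learning architectures and techniques to address task-related challenges such as domain expansion.
Link: https://arxiv.org/abs/2410.06370
Authors: Hemank Lamba,Anton Abilov,Ke Zhang,Elizabeth M. Olson,Henry k. Dambanemuya,João c. Bárcia,David S. Batista,Christina Wille,Aoife Cahill,Joel Tetreault,Alex Jaimes
Keywords: gather aggregated insights, support decision-making, discover trends, gather aggregated, funding proposals
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: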
Abstract:Humanitarian organizations can enhance their effectiveness by analyzing data to discover trends, gather aggregated insights, manage their security risks, support decision-making, and inform advocacy and funding proposals. However, data about violent incidents with direct impact and relevance for humanitarian aid operations is not readily available. An automatic data collection and NLP-backed classification framework aligned with humanitarian perspectives can help bridge this gap. In this paper, we present HumVI - a dataset comprising news articles in three languages (English, French, Arabic) containing instances of different types of violent incidents categorized by the humanitarian sector they impact, e.g., aid security, education, food security, health, and protection. Reliable labels were obtained for the dataset by partnering with a data-backed humanitarian organization, Insecurity Insight. We provide multiple benchmarks for the dataset, employing various deep learning architectures and techniques, including data augmentation and mask loss, to address different task-related challenges, e.g., domain expansion. The dataset is publicly available at this https URL.
摘要:人道主义组织可以通过分析数据来发现趋势、收集汇总见解、管理安全风险、支持决策制定以及为倡导和资金提案提供信息,从而提高其效率。然而,关于直接影响和与人道主义援助行动相关的暴力事件的数据并不容易获得。一个与人文视角相一致的自动数据收集和自然语言处理(NLP)支持的分类框架可以帮助填补这一空白。在本文中,我们介绍了 HumVI——一个包含三种语言(英语、法语、阿拉伯语)新闻文章的数据集,这些文章包含了不同类型的暴力事件实例,这些事件根据其影响的人道主义部门进行分类,例如援助安全、教育、食品安全、健康和保护。通过与数据支持的人道主义组织 Insecurity Insight 合作,我们为数据集获取了可靠的标签。我们为该数据集提供了多个基准测试,采用了多种深度学习架构和技术,包括数据增强和掩码损失,以应对不同的任务相关挑战,例如领域扩展。该数据集可通过此 https URL 公开获取。
[NLP-87] Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?
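Quick read: This paper asks whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without reference translations. The authors compare several LLMs with fine-tuned baseline models under in-context learning and parameter-efficient fine-tuning (PEFT). They find that PEFT of LLMs predicts quality scores better than the fine-tuned models and also yields human-interpretable explanations, but that the LLMs still show problems such as refusing to reply to a prompt and producing unstable output.
Link: https://arxiv.org/abs/2410.06338
Authors: Shenbin Qian,Constantin Orăsan,Diptesh Kanojia,Félix do Carmo
Keywords: Multi-dimensional Quality Metrics, large language models, user-generated content, emotional expressions, paper investigates
Categories: Computation and Language (cs.CL)
Comments: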
Abstract:This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.
摘要:本文探讨了大语言模型 (LLMs) 是否能在不使用参考翻译的情况下,成为用户生成内容 (UGC) 机器翻译质量评估的最新技术。这些 UGC 包含情感表达。为此,我们采用了一个现有的情感相关数据集,该数据集包含人工标注的错误,并基于多维质量指标计算质量评估分数。我们在上下文学习和参数高效微调 (PEFT) 场景下,比较了几种 LLMs 与我们的微调基线模型的准确性。我们发现,LLMs 的 PEFT 在分数预测方面表现优于微调模型,并且提供了人类可解释的解释。然而,对 LLM 输出的手动分析显示,在评估 UGC 的机器翻译时,它们仍然存在拒绝回复提示和不稳定输出的问题。
[NLP-88] Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing
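Quick read: This paper addresses the weak performance of existing knowledge editing methods on multi-hop factual recall tasks. The authors first observe that in multi-hop tasks LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, whereas existing methods mainly edit shallow layers. The core of the solution is IFMET, a locate-then-edit knowledge editing method that uses multi-hop editing prompts and supplementary sets to locate and modify knowledge in both shallow and deep MLP layers, covering different reasoning stages and significantly improving multi-hop factual recall.
Link: https://arxiv.org/abs/2410.06331
Authors: Zhuoran Zhang,Yongxiang Li,Zijian Kan,Keyuan Cheng,Lijie Hu,Di Wang
Keywords: Large Language Models, Language Models, Large Language, shown significant promise, paradigm has shown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages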
Abstract:The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.
摘要:定位-然后-编辑范式在大型语言模型 (LLM) 的知识编辑 (KE) 中显示出显著的前景。尽管先前的方法在单跳事实回忆任务中表现良好,但它们在涉及新编辑知识的多跳事实回忆任务中始终表现不佳。本文利用机制可解释性中的工具,首先发现,在多跳任务中,LLM 倾向于从更深的多层感知器 (MLP) 层中检索隐含的主体知识,这与依赖于较早层的单跳任务不同。这种区别解释了当前方法在多跳查询中表现不佳的原因,因为它们主要集中在编辑浅层,而深层保持不变。为了解决这个问题,我们提出了 IFMET,一种新颖的定位-然后-编辑 KE 方法,旨在编辑浅层和深层 MLP 层。IFMET 采用多跳编辑提示和补充集来定位和修改不同推理阶段的知识。实验结果表明,IFMET 显著提高了多跳事实回忆任务的性能,有效地克服了先前定位-然后-编辑方法的局限性。
[NLP-89] Auto-Evolve: Enhancing Large Language Models Performance via Self-Reasoning Framework EMNLP2024
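Quick read: This paper addresses the reliance of current prompting strategies for large language models (LLMs) on a single or fixed set of static reasoning modules when tackling complex problems. The core of the solution is the Auto-Evolve framework, which lets LLMs dynamically generate reasoning modules aligned with the human reasoning paradigm and iteratively refine the instruction guidance. By removing the need for predefined templates, Auto-Evolve adapts to different tasks and outperforms state-of-the-art (SOTA) prompting strategies on the four models evaluated.
Link: https://arxiv.org/abs/2410.06328
Authors: Krishna Aswani,Huilin Lu,Pranav Patankar,Priya Dhalwani,Iris Tan,Jayant Ganeshmohan,Simon Lacasse
Keywords: Large Language Models, Recent advancements, Large Language, demonstrated significant potential, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2024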
Abstract:Recent advancements in prompt engineering strategies, such as Chain-of-Thought (CoT) and Self-Discover, have demonstrated significant potential in improving the reasoning abilities of Large Language Models (LLMs). However, these state-of-the-art (SOTA) prompting strategies rely on single or fixed set of static seed reasoning modules like "think step by step" or "break down this problem" intended to simulate human approach to problem-solving. This constraint limits the flexibility of models in tackling diverse problems effectively. In this paper, we introduce Auto-Evolve, a novel framework that enables LLMs to self-create dynamic reasoning modules and downstream action plan, resulting in significant improvements over current SOTA methods. We evaluate Auto-Evolve on the challenging BigBench-Hard (BBH) dataset with Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT 4, where it consistently outperforms the SOTA prompt strategies. Auto-Evolve outperforms CoT by up to 10.4% and on an average by 7% across these four models. Our framework introduces two innovations: a) Auto-Evolve dynamically generates reasoning modules for each task while aligning with human reasoning paradigm, thus eliminating the need for predefined templates. b) We introduce an iterative refinement component, that incrementally refines instruction guidance for LLMs and helps boost performance by average 2.8% compared to doing it in a single step.
摘要:近年来,提示工程策略的进展,如思维链 (Chain-of-Thought, CoT) 和自我发现 (Self-Discover),展示了显著提升大语言模型 (Large Language Models, LLMs) 推理能力的潜力。然而,这些最先进的 (State-of-the-Art, SOTA) 提示策略依赖于单一或固定的静态种子推理模块,如“逐步思考”或“分解问题”,旨在模拟人类的问题解决方法。这种限制限制了模型在有效应对多样化问题时的灵活性。本文中,我们提出了自动进化 (Auto-Evolve),一种新颖的框架,使 LLMs 能够自我创建动态推理模块和下游行动计划,从而显著优于当前的 SOTA 方法。我们在具有挑战性的 BigBench-Hard (BBH) 数据集上评估了 Auto-Evolve,使用了 Claude 2.0、Claude 3 Sonnet、Mistral Large 和 GPT 4,结果显示其持续优于 SOTA 提示策略。Auto-Evolve 在四个模型上分别比 CoT 提升了高达 10.4%,平均提升 7%。我们的框架引入了两项创新:a) Auto-Evolve 动态生成与人类推理范式一致的推理模块,从而消除了对预定义模板的需求。b) 我们引入了一个迭代细化组件,该组件逐步细化对 LLMs 的指令指导,并帮助性能平均提升 2.8%,相比于一次性完成。
[NLP-90] Temporal Image Caption Retrieval Competition – Description and Results
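Quick read: This paper addresses multimodal Text-Image retrieval and introduces a new task that extends the modalities to include temporal data. The core contribution is the Temporal Image Caption Retrieval Competition (TICRC), based on the Chronicling America and Challenging America projects, which give access to an extensive collection of digitized historic American newspapers spanning 274 years. Besides the competition results, the paper analyses the delivered dataset and the process of its creation.
Link: https://arxiv.org/abs/2410.06314
Authors: Jakub Pokrywka,Piotr Wierzchoń,Kornel Weryszko,Krzysztof Jassem
Keywords: gained significant recognition, recently gained significant, textual information, significant recognition, Multimodal models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: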
Abstract:Multimodal models, which combine visual and textual information, have recently gained significant recognition. This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data. The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years. In addition to the competition results, we provide an analysis of the delivered dataset and the process of its creation.
摘要:多模态模型(Multimodal models),结合视觉和文本信息,近年来获得了显著的认可。本文针对文本-图像检索的多模态挑战,并引入了一项新任务,将模态扩展到包括时间数据。本文提出的时间图像描述检索竞赛(Temporal Image Caption Retrieval Competition, TICRC)基于“美国编年史”和“挑战美国”项目,这些项目提供了访问跨越274年的大量数字化历史美国报纸的途径。除了竞赛结果,我们还对提供的数据集及其创建过程进行了分析。
[NLP-91] Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning
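Quick read: This paper addresses hallucinations of large language models (LLMs) in complex multi-step reasoning, specifically mathematical problem-solving. It first introduces a taxonomy that divides common hallucinations in mathematical reasoning into six types: fabrication, factual inconsistency, context inconsistency, instruction inconsistency, logical inconsistency, and logical error. The core of the solution is FG-PRM (Fine-Grained Process Reward Model), which consists of six specialized Process Reward Models (PRMs), each tailored to detect one hallucination type at the step level. To avoid manual labeling, the authors use LLMs to inject hallucinations into the reasoning steps of correct solutions, producing a diverse and balanced synthetic training set. In the experiments, FG-PRM outperforms ChatGPT-3.5 and Claude-3 on fine-grained hallucination detection and substantially improves LLM performance on the GSM8K and MATH benchmarks.
Link: https://arxiv.org/abs/2410.06304
Authors: Ruosen Li,Ziming Luo,Xinya Du
Keywords: pose significant challenges, requiring complex multi-step, large language models, tasks requiring complex, complex multi-step reasoning
Categories: Computation and Language (cs.CL)
Comments: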
Abstract:Hallucinations in large language models (LLMs) pose significant challenges in tasks requiring complex multi-step reasoning, such as mathematical problem-solving. Existing approaches primarily detect the presence of hallucinations but lack a nuanced understanding of their types and manifestations. In this paper, we first introduce a comprehensive taxonomy that categorizes the common hallucinations in mathematical reasoning task into six types: fabrication, factual inconsistency, context inconsistency, instruction inconsistency, logical inconsistency, and logical error. We then propose FG-PRM (Fine-Grained Process Reward Model), an augmented model designed to detect and mitigate hallucinations in a fine-grained, step-level manner. To address the limitations of manually labeling training data, we propose an automated method for generating fine-grained hallucination data using LLMs. By injecting hallucinations into reasoning steps of correct solutions, we create a diverse and balanced synthetic dataset for training FG-PRM, which consists of six specialized Process Reward Models (PRMs), each tailored to detect a specific hallucination type. Our FG-PRM demonstrates superior performance across two key tasks: 1) Fine-grained hallucination detection: classifying hallucination types for each reasoning step; and 2) Verification: ranking multiple LLM-generated outputs to select the most accurate solution, mitigating reasoning hallucinations. Our experiments show that FG-PRM outperforms ChatGPT-3.5 and Claude-3 on fine-grained hallucination detection and substantially boosts the performance of LLMs on GSM8K and MATH benchmarks.
摘要:大语言模型 (LLM) 中的幻觉现象在需要复杂多步推理的任务中,如数学问题解决,带来了显著的挑战。现有方法主要检测幻觉的存在,但缺乏对其类型和表现形式的细致理解。本文首先引入了一个全面的分类法,将数学推理任务中的常见幻觉分为六种类型:捏造 (fabrication)、事实不一致 (factual inconsistency)、上下文不一致 (context inconsistency)、指令不一致 (instruction inconsistency)、逻辑不一致 (logical inconsistency) 和逻辑错误 (logical error)。接着,我们提出了细粒度过程奖励模型 (FG-PRM),这是一种增强模型,旨在以细粒度、步骤级别的方式检测和缓解幻觉。为了解决手动标注训练数据的局限性,我们提出了一种利用大语言模型自动生成细粒度幻觉数据的方法。通过将幻觉注入正确解决方案的推理步骤中,我们创建了一个多样且平衡的合成数据集,用于训练 FG-PRM,该数据集包含六个专门的过程奖励模型 (PRM),每个模型专门用于检测特定类型的幻觉。我们的 FG-PRM 在两个关键任务中展示了优越的性能:1) 细粒度幻觉检测:对每个推理步骤的幻觉类型进行分类;2) 验证:对多个大语言模型生成的输出进行排序,以选择最准确的解决方案,从而缓解推理幻觉。我们的实验表明,FG-PRM 在细粒度幻觉检测方面优于 ChatGPT-3.5 和 Claude-3,并显著提升了大语言模型在 GSM8K 和 MATH 基准测试中的表现。
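The verification task in the abstract, ranking several candidate solutions with step-level scores from six type-specific reward models, can be sketched as follows. Everything here is an assumption for illustration: the PRMs are toy functions returning a made-up probability that a step is free of one hallucination type, and summing log-probabilities is one plausible aggregation, not necessarily the one used in the paper.

```python
import math

HALLUCINATION_TYPES = (
    "fabrication", "factual inconsistency", "context inconsistency",
    "instruction inconsistency", "logical inconsistency", "logical error",
)


def solution_score(steps, prms):
    """Sum of log-probabilities that each step is clean under each PRM."""
    total = 0.0
    for step in steps:
        for name in HALLUCINATION_TYPES:
            p_clean = prms[name](step)
            total += math.log(max(p_clean, 1e-9))   # guard against log(0)
    return total


def pick_best(candidates, prms):
    """Return the candidate solution with the highest aggregated score."""
    return max(candidates, key=lambda steps: solution_score(steps, prms))


# Toy PRMs (assumptions): every detector is neutral except "logical error",
# which flags one specific wrong arithmetic step.
def neutral_prm(step):
    return 0.9


def toy_logical_error_prm(step):
    return 0.1 if "2 + 2 = 5" in step else 0.9


prms = {name: neutral_prm for name in HALLUCINATION_TYPES}
prms["logical error"] = toy_logical_error_prm

good = ["2 + 2 = 4", "4 * 3 = 12"]
bad = ["2 + 2 = 5", "5 * 3 = 15"]
best = pick_best([bad, good], prms)
```

Keeping one detector per hallucination type also makes it possible to report which type caused a candidate to be ranked lower, in addition to choosing the best one.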
[NLP-92] Accelerated Preference Optimization for Large Language Model Alignment
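Quick read: This paper asks whether Reinforcement Learning from Human Feedback (RLHF), used to align large language models (LLMs) with human preferences, can be accelerated by momentum. The core of the solution is the Accelerated Preference Optimization (APO) framework: it views iterative preference optimization as a proximal point method and applies Nesterov's momentum technique to speed up the optimization. The theoretical analysis shows that APO can achieve a faster convergence rate than standard iterative preference optimization methods such as DPO and SPPO, and experiments on the AlpacaEval 2.0 benchmark show that APO outperforms DPO, iterative DPO and other strong baselines.
Link: https://arxiv.org/abs/2410.06293
Authors: Jiafan He,Huizhuo Yuan,Quanquan Gu
Keywords: Reinforcement Learning, Human Feedback, large language models, aligning large language, Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 44 pages, 10 tables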
Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically involve first estimating the reward function and then optimizing the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well-known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov’s momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
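To make the momentum idea concrete, the sketch below compares plain gradient descent with Nesterov's accelerated gradient on a toy ill-conditioned quadratic. It is not APO itself, which applies momentum to iterative preference optimization over policies; the objective, step size, and momentum coefficient are textbook choices assumed for illustration.

```python
def grad(x, curvatures):
    """Gradient of f(x) = 0.5 * sum(c_i * x_i**2)."""
    return [c * xi for c, xi in zip(curvatures, x)]

def loss(x, curvatures):
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))

def gradient_descent(x0, curvatures, step, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad(x, curvatures)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def nesterov(x0, curvatures, step, beta, iters):
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        # Extrapolate along the previous displacement, then take a
        # gradient step from the extrapolated point.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad(y, curvatures)
        x_prev, x = x, [yi - step * gi for yi, gi in zip(y, g)]
    return x

curvatures = [1.0, 100.0]                    # condition number kappa = 100
step = 1.0 / max(curvatures)                 # 1 / L
kappa = max(curvatures) / min(curvatures)
beta = (kappa ** 0.5 - 1) / (kappa ** 0.5 + 1)

start = [1.0, 1.0]
loss_gd = loss(gradient_descent(start, curvatures, step, 100), curvatures)
loss_nag = loss(nesterov(start, curvatures, step, beta, 100), curvatures)
```

With the same step size and iteration budget, the momentum variant reaches a much lower loss on this problem, which is the kind of speed-up the abstract argues can carry over to preference optimization.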
[NLP-93] Non-Halting Queries: Exploiting Fixed Points in LLMs
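[Quick Read]: This paper addresses non-halting output in autoregressive models, where certain queries cause a large language model (LLM) to produce output that never terminates. The key is identifying and exploiting fixed points in the model's output: at temperature zero, if a repeating (cyclic) token sequence appears in the output beyond the context size, the model does not halt. The paper demonstrates the phenomenon experimentally and gives a simple prompt structure that drives aligned models into a non-halting state as well, underlining the need for further study and stronger alignment against non-halting anomalies.
Link: https://arxiv.org/abs/2410.06287
Authors: Ghaith Hammouri, Kemal Derya, Berk Sunar
Keywords: non-halting, vulnerability that exploits, non-halting anomaly, call non-halting queries, exploits fixed points
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: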
Abstract:We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt, i.e. an LLM output that does not terminate. More precisely, for what we call non-halting queries, the LLM never samples the end-of-string token (eos). We rigorously analyze the conditions under which the non-halting anomaly presents itself. In particular, at temperature zero, we prove that if a repeating (cyclic) sequence of tokens is observed at the output beyond the context size, then the LLM does not halt. We demonstrate the non-halting anomaly in a number of experiments performed in base (unaligned) models where repeating tokens immediately lead to a non-halting cyclic behavior as predicted by the analysis. Further, we develop a simple recipe that takes the same fixed points observed in the base model and creates a prompt structure to target aligned models. We study the recipe behavior in bypassing alignment in a number of LLMs including GPT-4o, llama-3-8b-instruct, and gemma-2-9b-it where all models are forced into a non-halting state. Further, we demonstrate the recipe’s success in sending most major models released over the past year into a non-halting state with the same simple prompt even at higher temperatures. Further, we study direct inversion based techniques to craft new short prompts to induce the non-halting state. Our experiments with the gradient search based inversion technique ARCA show that non-halting is prevalent across models and may be easily induced with a few input tokens. While its impact on the reliability of hosted systems can be mitigated by configuring a hard maximum token limit in the sampler, the non-halting anomaly still manages to break alignment. This underlines the need for further studies and stronger forms of alignment against non-halting anomalies. 
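The temperature-zero condition in the abstract above (a cycle observed beyond the context size implies the output never halts) suggests a simple runtime check, sketched below together with the hard token cap the abstract mentions as a mitigation. The "model" here is a hypothetical deterministic toy function, not a real LLM, and the detector is an assumed illustration rather than the paper's code.

```python
EOS = -1

def tail_cycle_length(tokens, context_size):
    """Return the smallest period p such that the last context_size + p
    tokens are periodic with period p, or None if no such p exists."""
    n = len(tokens)
    for p in range(1, n - context_size + 1):
        window = tokens[n - context_size - p:]
        if all(window[i] == window[i + p] for i in range(context_size)):
            return p
    return None

def toy_next_token(context):
    """Hypothetical deterministic next-token rule (stands in for greedy
    decoding): emit EOS after a 7, otherwise count modulo 3."""
    return EOS if context[-1] == 7 else (context[-1] + 1) % 3

def generate(prompt, context_size, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):          # hard cap keeps the loop finite
        nxt = toy_next_token(tokens[-context_size:])
        if nxt == EOS:
            return tokens, "halted"
        tokens.append(nxt)
        if tail_cycle_length(tokens, context_size) is not None:
            return tokens, "cycle detected"
    return tokens, "hit max_new_tokens"

tokens, status = generate([0], context_size=4, max_new_tokens=50)
```

The check relies on the decoder being deterministic and seeing only the last `context_size` tokens: if the context window one period ago is identical to the current one, the next token must repeat as well, so the cycle continues indefinitely. At non-zero temperature this is only a heuristic.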
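[NLP-94] The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning EMNLP2024
[Quick Read]: This paper addresses the limited compositional generalization of large language models (LLMs), specifically in graph-based commonsense reasoning. The key is a new evaluation framework, the Compositional Generalization Challenge for Graph-based Commonsense Reasoning (CGGC), which requires models to generate a natural-language sentence from given concepts and a reasoning graph that involves a previously unseen combination of relation types. By analyzing the difficulty of different reasoning-graph structures and ordering demonstrations from easy to hard, the paper finds that this ordering improves the compositional generalization of LLMs.
Link: https://arxiv.org/abs/2410.06272
Authors: Xiyan Fu, Anette Frank
Keywords: Graph-based Commonsense Reasoning, compositional generalization, compositional generalization capabilities, Graph-based Commonsense, reasoning tasks
Categories: Computation and Language (cs.CL)
Comments: Accepted Findings at EMNLP 2024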
Abstract:While LLMs have emerged as performant architectures for reasoning tasks, their compositional generalization capabilities have been questioned. In this work, we introduce a Compositional Generalization Challenge for Graph-based Commonsense Reasoning (CGGC) that goes beyond previous evaluations based on sequences or tree structures and instead involves a reasoning graph: it requires models to generate a natural sentence based on given concepts and a corresponding reasoning graph, where the presented graph involves a previously unseen combination of relation types. To master this challenge, models need to learn how to reason over relation tuples within the graph, and how to compose them when conceptualizing a verbalization. We evaluate seven well-known LLMs using in-context learning and find that performant LLMs still struggle in compositional generalization. We investigate potential causes of this gap by analyzing the structures of reasoning graphs, and find that different structures present varying levels of difficulty for compositional generalization. Arranging the order of demonstrations according to the structures’ difficulty shows that organizing samples in an easy-to-hard schema enhances the compositional generalization ability of LLMs.
[NLP-95] Probing the Robustness of Theory of Mind in Large Language Models
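[Quick Read]: This paper examines how well large language models (LLMs) perform on Theory of Mind (ToM), in particular how performance varies with task complexity. The key is a new dataset of 68 tasks, grouped into 10 complexity classes, for systematically evaluating ToM in LLMs. Comparing model performance across complexity classes reveals limitations on tasks involving automatic state changes in the environment and changes in the relationships between objects, and points to directions for further research on stabilizing and advancing ToM capabilities in LLMs.
Link: https://arxiv.org/abs/2410.06271
Authors: Christian Nickel, Laura Schrewe, Lucie Flek
Keywords: Theory of Mind, social reasoning capabilities, similarly sized SotA, claims of emergent, scientific literature
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: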
Abstract:With the success of ChatGPT and other similarly sized SotA LLMs, claims of emergent human-like social reasoning capabilities, especially Theory of Mind (ToM), in these models have appeared in the scientific literature. On the one hand, those ToM capabilities have been successfully tested using tasks styled similarly to those used in psychology (Kosinski, 2023). On the other hand, follow-up studies showed that those capabilities vanished when the tasks were slightly altered (Ullman, 2023). In this work we introduce a novel dataset of 68 tasks for probing ToM in LLMs, including potentially challenging variations which are assigned to 10 complexity classes. This provides novel insights into the challenges LLMs face with those task variations. We evaluate the ToM performance of four SotA open-source LLMs on our dataset and the dataset introduced by Kosinski (2023). The overall low goal accuracy across all evaluated models indicates only a limited degree of ToM capabilities. The LLMs’ performance on simple complexity class tasks from both datasets is similar. However, we find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment, even when those are spelled out to the model. For task complications that change the relationship between objects by replacing prepositions, we notice a performance drop in all models, with the strongest impact on the mixture-of-experts model. With our dataset of tasks grouped by complexity we offer directions for further research on how to stabilize and advance ToM capabilities in LLMs.
[NLP-96] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
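[Quick Read]: This paper addresses two key problems in Mixture-of-Experts large language models (MoE-LLMs): the memory consumption and loading latency caused by expert parameters, and redundancy among the currently activated experts. The key to the solution is MC-MoE (Mixture-Compressor for MoE-LLMs), a training-free compressor that uses the importance of both experts and tokens to achieve extreme compression. Specifically: 1) Pre-Loading Mixed-Precision Quantization formulates adaptive bit-width allocation as a linear programming problem that balances multiple factors reflecting each expert's importance; 2) Online Dynamic Pruning identifies important tokens and dynamically selects the activated experts to improve efficiency. MC-MoE combines static quantization with dynamic pruning to compress MoE-LLMs aggressively while maintaining a favorable trade-off between performance and efficiency.
Link: https://arxiv.org/abs/2410.06270
Authors: Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi
Keywords: significant step forward, considerable memory consumption, large language models, large language, language models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 18 pages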
Abstract:Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate the MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors on activation reconstruction error, routing scores, and activated frequencies, highlighting their differing importance, and b) not all tokens are equally important: only a small subset is critical. Building on these insights, we propose MC-MoE, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve an extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates the adaptive bit-width allocation as a Linear Programming problem, where the objective function balances multiple factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects activated experts for other tokens during inference to optimize efficiency while maintaining performance. Our MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
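The bit-width allocation step can be illustrated on a tiny instance. The paper formulates it as a linear program; the sketch below swaps in exhaustive search so that it needs only the standard library, and both the importance values and the error model (error halves with each extra bit) are hypothetical.

```python
from itertools import product

def allocate_bits(importance, choices=(1, 2, 3), avg_bits=2.0):
    """Pick one bit-width per expert to minimize importance-weighted error
    under an average bit-width budget, by exhaustive search."""
    budget = avg_bits * len(importance)
    best, best_cost = None, float("inf")
    for bits in product(choices, repeat=len(importance)):
        if sum(bits) > budget:
            continue  # violates the average bit-width constraint
        # Assumed error model: quantization error halves with every extra bit.
        cost = sum(w * 2.0 ** -b for w, b in zip(importance, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best

importance = [0.9, 0.5, 0.2, 0.1]   # hypothetical per-expert importance
bits = allocate_bits(importance)
```

Under these assumptions the two most important experts receive 3 bits and the other two receive 1 bit, which keeps the average at exactly 2 bits. Exhaustive search grows exponentially with the number of experts, which is one reason a linear-programming formulation is the more practical choice at real scale.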
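[NLP-97] Think While You Generate: Discrete Diffusion with Planned Denoising
[Quick Read]: This paper addresses the limited generation efficiency of conventional denoising diffusion methods. The key to the solution is Discrete Diffusion with Planned Denoising (DDPD), a framework that splits generation into two models, a planner and a denoiser. At inference time, the planner decides what to denoise next by identifying the positions most in need of denoising, which enables more efficient reconstruction. Beyond optimizing the denoising order, the method improves results on language modeling and image generation benchmarks and narrows the performance gap between diffusion and autoregressive models.
Link: https://arxiv.org/abs/2410.06264
Authors: Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, Rafael Gómez-Bombarelli
Keywords: Discrete diffusion, introduce Discrete Diffusion, outperforming or approaching, approaching autoregressive models, Discrete
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: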
Abstract:Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8, OpenWebText, and token-based generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at this https URL.
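The plan-and-denoise loop described above can be sketched with two toy stand-ins: a planner that scores how likely each position is to be corrupted, and a denoiser that predicts a clean token for the chosen position. Both functions below are hypothetical rules (clean data is assumed to be a constant sequence), not learned models and not the authors' code.

```python
from collections import Counter

MASK = "_"

def majority(tokens):
    """Most frequent unmasked token (assumes at least one exists)."""
    counts = Counter(t for t in tokens if t != MASK)
    return counts.most_common(1)[0][0]

def planner(tokens):
    """Toy planner: per-position estimate that the token is corrupted."""
    m = majority(tokens)
    return [1.0 if t == MASK else (0.8 if t != m else 0.0) for t in tokens]

def denoiser(tokens, position):
    """Toy denoiser: predict a clean token for one position.
    This toy rule ignores `position`; a learned denoiser would not."""
    return majority(tokens)

def plan_and_denoise(tokens, max_steps=100):
    tokens = list(tokens)
    order = []
    for _ in range(max_steps):
        scores = planner(tokens)
        pos = max(range(len(tokens)), key=scores.__getitem__)
        if scores[pos] == 0.0:
            break                      # planner sees nothing left to fix
        tokens[pos] = denoiser(tokens, pos)
        order.append(pos)
    return tokens, order

result, order = plan_and_denoise(list("aa_ab_a"))
```

In this run the planner first sends the two masked positions to the denoiser and then revisits the unmasked but inconsistent `b`, mirroring the abstract's point that the planner handles both initially corrupted positions and those needing additional refinement.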
[NLP-98] Unsupervised Model Diagnosis
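[Quick Read]: This paper addresses the explainability and robustness of deployed deep vision systems, in particular the high cost and insufficient coverage of current robustness evaluation, which relies on large annotated test sets. The proposed solution is Unsupervised Model Diagnosis (UMO), whose key idea is to use generative models to produce semantic counterfactual explanations without user guidance. UMO optimizes for the most counterfactual directions in the generative model's latent space, automatically identifies and visualizes semantic changes, and matches them to attributes from wide-ranging text sources, revealing spurious correlations and failure modes of the target model without human intervention.
Link: https://arxiv.org/abs/2410.06243
Authors: Yinong Oliver Wang, Eileen Li, Jinqi Luo, Zhaoning Wang, Fernando De la Torre
Keywords: Ensuring model explainability, deep vision systems, Ensuring model, essential for reliable, reliable deployment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 9 figures, 3 tables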
Abstract:Ensuring model explainability and robustness is essential for reliable deployment of deep vision systems. Current methods for evaluating robustness rely on collecting and annotating extensive test sets. While this is common practice, the process is labor-intensive and expensive with no guarantee of sufficient coverage across attributes of interest. Recently, model diagnosis frameworks have emerged leveraging user inputs (e.g., text) to assess the vulnerability of the model. However, such dependence on humans can introduce bias and limitations tied to the domain knowledge of particular users. This paper proposes Unsupervised Model Diagnosis (UMO), which leverages generative models to produce semantic counterfactual explanations without any user guidance. Given a differentiable computer vision model (i.e., the target model), UMO optimizes for the most counterfactual directions in a generative latent space. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources, such as dictionaries or language models. We validate the framework on multiple vision tasks (e.g., classification, segmentation, keypoint detection). Extensive experiments show that our unsupervised discovery of semantic directions can correctly highlight spurious correlations and visualize the failure mode of target models without any human intervention.
[NLP-99] EVOLvE: Evaluating and Optimizing LLMs For Exploration
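[Quick Read]: This paper addresses the weak performance of large language models (LLMs) in scenarios that require optimal decision-making under uncertainty. The key is a suite of environments, covering both context-free and contextual bandits, for evaluating LLM decision-making, together with methods for integrating knowledge of optimal exploration algorithms into LLMs: explicit algorithm-guided support during inference, and algorithm distillation via in-context demonstrations and fine-tuning on synthetic data generated by those algorithms. These techniques let smaller models achieve better exploration performance than larger ones, and an extensive ablation study analyzes how factors such as task difficulty and data representation affect the efficiency of LLM exploration.
Link: https://arxiv.org/abs/2410.06238
Authors: Allen Nie, Yi Su, Bo Chang, Jonathan N. Lee, Ed H. Chi, Quoc V. Le, Minmin Chen
Keywords: scenarios requiring optimal, requiring optimal decision-making, large language models, make optimal decisions, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages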
Abstract:Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs’ (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs’ performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM’s exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
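For readers unfamiliar with the bandit setting and the notion of regret used above, the sketch below runs UCB1, a classic exploration algorithm, on a context-free Bernoulli bandit. The abstract does not say which algorithms the authors use, so UCB1 is an assumed example; the arm means, horizon, and seed are arbitrary.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit.

    Returns the cumulative expected regret and the pull count per arm.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    totals = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                          # pull every arm once
        else:
            # Empirical mean plus an exploration bonus that shrinks as
            # an arm is pulled more often.
            arm = max(
                range(k),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - means[arm]              # gap to the best arm
    return regret, counts

means = [0.2, 0.5, 0.8]
regret, counts = ucb1(means, horizon=2000)
```

Regret here is the expected reward lost by not always pulling the best arm. Picking arms uniformly at random would accumulate an expected regret of 0.3 per step on this instance (600 over 2000 steps), so a working exploration strategy should end well below that.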
[NLP-100] DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
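[Quick Read]: This paper addresses the automation of training-data generation, in particular reducing the need for human intervention. The key is DataEnvGym, a testbed that frames data generation as a sequential decision-making task: an agent, made up of a data generation policy and a data generation engine, operates inside an environment that provides student feedback. The agent's goal is to improve the student model through iterative training and evaluation, adjusting its data generation strategy according to the student's feedback (such as errors or weak skills). DataEnvGym supports multiple tasks (math, code, and visual question answering) and instantiates environments at different levels of structure to test and improve data generation agents and their modules.
Link: https://arxiv.org/abs/2410.06215
Authors: Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Keywords: analyze model weaknesses, manually analyze model, data generation, data, data generation agents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL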
Abstract:The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Recent approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by creating autonomous data generation agents - or teachers - is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid and scalable testing for such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task, involving an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data), inside an environment that provides student feedback. The agent’s goal is to improve student performance. Students are iteratively trained and evaluated on generated data, with their feedback (in the form of errors or weak skills) being reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 3 diverse tasks (math, code, and VQA) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that environments teach different skill levels and test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.
[NLP-101] Round and Round We Go! What makes Rotary Positional Encodings useful?
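[Quick Read]: This paper examines what Rotary Positional Encodings (RoPE) actually do in Transformer-based Large Language Models (LLMs), and in particular whether RoPE is useful mainly because it decays token dependency with distance. By studying the internals of the Gemma 7B model, the paper finds that RoPE is in fact used to construct robust "positional" attention patterns, specifically by exploiting the highest frequencies. It also finds that Gemma generally prefers the lowest frequencies of RoPE, which may be used to carry semantic information. The paper supports these findings with mathematical proofs and experiments, and proposes a modification of RoPE that fixes some of the highlighted issues and improves performance. The key contribution is uncovering how RoPE is actually used inside the model and improving its design accordingly.
Link: https://arxiv.org/abs/2410.06205
Authors: Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
Keywords: Transformer-based Large Language, Rotary Positional Encodings, Large Language Models, component of Transformer-based, Positional Encodings
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: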
Abstract:Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs is Rotary Positional Encodings (RoPE), which rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust “positional” attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which holds crucial value for scaling LLMs to large sizes and context lengths.
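As background for the discussion of RoPE frequencies, the sketch below implements the standard rotation in pure Python and checks that the query-key dot product depends only on the relative distance between positions. The pairing of consecutive dimensions and the base of 10000 follow the common formulation and are assumptions here; Gemma's actual implementation may lay out dimensions differently.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Pair 0 uses the highest frequency (angle = position); later pairs
    rotate progressively more slowly.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # arbitrary query vector
k = [1.1, 0.4, -0.6, 0.9]   # arbitrary key vector

score_near = dot(rope(q, 5), rope(k, 2))       # relative distance 3
score_far = dot(rope(q, 105), rope(k, 102))    # same relative distance
```

Both scores agree up to floating-point error because each pair contributes a term that depends only on the difference of the two rotation angles. The rotation also preserves vector norms, so RoPE changes how queries and keys align without rescaling them.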
[NLP-102] Integrating Planning into Single-Turn Long-Form Text Generation
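[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in generating high-quality, in-depth textual documents such as academic papers, news articles, Wikipedia entries, and books. The key is an auxiliary task that trains the LLM to plan, reason, and structure before generating the final text. The novelty is that this auxiliary task requires no multiple rounds of prompting or planning, and that the scarcity of training data is overcome by using LLMs to generate synthetic intermediate writing data (outlines, key information, and summaries) from existing full articles. Experiments show that LLMs fine-tuned with the auxiliary task generate higher-quality documents, with a 2.5% improvement in ROUGE-Lsum and an overall win/loss ratio of 3.60 in human side-by-side evaluation, with clear wins in organization, relevance, and verifiability.
Link: https://arxiv.org/abs/2410.06203
Authors: Yi Liang, You Wu, Honglei Zhuang, Li Chen, Jiaming Shen, Yiling Jia, Zhen Qin, Sumit Sanghai, Xuanhui Wang, Carl Yang, Michael Bendersky
Keywords: Large Language Models, Language Models, Large Language, in-depth textual documents, challenge for Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: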
Abstract:Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long-form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data such as outlines, key information and summaries from existing full articles. Our experiments on datasets from two different domains, namely the scientific news dataset SciNews and the Wikipedia datasets in KILT-Wiki and FreshWiki, demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents. We observed a +2.5% improvement in ROUGE-Lsum, and a strong 3.60 overall win/loss ratio via human side-by-side (SxS) evaluation, with clear wins in organization, relevance, and verifiability.
[NLP-103] Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective
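[Quick Read]: This paper asks how to evaluate the Theory of Mind (ToM) and socialization capabilities of large language models (LLMs) from a first-person perspective, and whether those capabilities are enough for AI models to truly enter and navigate the real social world. The key is EgoSocialArena, a framework with two evaluation environments (static and interactive), seven scenarios, and 2,195 data entries in total, used to evaluate nine advanced LLMs from a first-person perspective. It fills a gap left by existing work, which evaluates LLMs as passive observers from a third-person perspective.
Link: https://arxiv.org/abs/2410.06195
Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Zeqi Tan, Sihao Shen, Weiming Lu
Keywords: Theory of Mind, socialization capabilities, mental states, mental states evolve, infer and reason
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures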
Abstract:In the social world, humans possess the capability to infer and reason about others' mental states (such as emotions, beliefs, and intentions), known as the Theory of Mind (ToM). Simultaneously, humans' own mental states evolve in response to social situations, a capability we refer to as socialization. Together, these capabilities form the foundation of human social interaction. In the era of artificial intelligence (AI), especially with the development of large language models (LLMs), we raise an intriguing question: How do LLMs perform in terms of ToM and socialization capabilities? And more broadly, can these AI models truly enter and navigate the real social world? Existing research evaluates LLMs' ToM and socialization capabilities by positioning LLMs as passive observers from a third-person perspective, rather than as active participants. However, compared to the third-person perspective, observing and understanding the world from an egocentric first-person perspective is a natural approach for both humans and AI agents. The ToM and socialization capabilities of LLMs from a first-person perspective, a crucial attribute for advancing embodied AI agents, remain unexplored. To answer the aforementioned questions and bridge the research gap, we introduce EgoSocialArena, a novel framework designed to evaluate and investigate the ToM and socialization capabilities of LLMs from a first-person perspective. It encompasses two evaluation environments: static environment and interactive environment, with seven scenarios: Daily Life, Counterfactual, New World, Blackjack, Number Guessing, and Limit Texas Hold'em, totaling 2,195 data entries. With EgoSocialArena, we have conducted a comprehensive evaluation of nine advanced LLMs and observed some key insights regarding the future development of LLMs as well as the capability levels of the most advanced LLMs currently available.
摘要:在社会世界中,人类具备推断和推理他人心理状态(如情感、信念和意图)的能力,这一能力被称为心智理论(Theory of Mind, ToM)。同时,人类自身的心理状态也会随着社会情境的变化而演变,这种能力我们称之为社会化。这两者共同构成了人类社会互动的基础。在人工智能(AI)时代,尤其是随着大语言模型(Large Language Models, LLMs)的发展,我们提出一个引人深思的问题:LLMs在心智理论和社会化能力方面表现如何?更广泛地说,这些AI模型能否真正进入并适应现实社会世界?现有研究通过将LLMs定位为第三人称视角的被动观察者来评估其心智理论和社会化能力,而非作为主动参与者。然而,与第三人称视角相比,从自我中心的第一人称视角观察和理解世界,对于人类和AI智能体来说都是一种自然的方式。从第一人称视角评估LLMs的心智理论和社会化能力,对于推进具身AI智能体的发展至关重要,但这一领域仍未被探索。为了回答上述问题并填补研究空白,我们引入了EgoSocialArena,这是一个新颖的框架,旨在从第一人称视角评估和研究LLMs的心智理论和社会化能力。该框架包含两个评估环境:静态环境和交互环境,涵盖七个场景:日常生活、反事实、新世界、二十一点、数字猜测、以及限注德州扑克,总计2,195个数据条目。通过EgoSocialArena,我们对九个先进的大语言模型进行了全面评估,并观察到一些关于LLMs未来发展以及当前最先进LLMs能力水平的关键见解。
[NLP-104] Neural-Bayesian Program Learning for Few-shot Dialogue Intent Parsing
【速读】: 该论文试图解决在客户服务对话中识别意图的问题,特别是在数据稀缺的情况下。解决方案的关键在于提出了一种名为Dialogue-Intent Parser (DI-Parser)的新型神经贝叶斯程序学习模型,该模型通过“学习如何学习”的方式有效利用多源数据,并通过少样本学习能力从人类标注的数据集中汲取“群体智慧”。DI-Parser在实验中表现优于现有的深度学习模型,并具有工业规模应用的实际优势。
链接: https://arxiv.org/abs/2410.06190
作者: Mengze Hong,Di Jiang,Yuanfeng Song,Chen Jason Zhang
关键词-EN: contemporary business, success of enterprises, customer service, growing importance, importance of customer
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the growing importance of customer service in contemporary business, recognizing the intents behind service dialogues has become essential for the strategic success of enterprises. However, the nature of dialogue data varies significantly across different scenarios, and implementing an intent parser for a specific domain often involves tedious feature engineering and a heavy workload of data labeling. In this paper, we propose a novel Neural-Bayesian Program Learning model named Dialogue-Intent Parser (DI-Parser), which specializes in intent parsing under data-hungry settings and offers promising performance improvements. DI-Parser effectively utilizes data from multiple sources in a “Learning to Learn” manner and harnesses the “wisdom of the crowd” through few-shot learning capabilities on human-annotated datasets. Experimental results demonstrate that DI-Parser outperforms state-of-the-art deep learning models and offers practical advantages for industrial-scale applications.
摘要:随着客户服务在现代商业中的重要性日益增加,识别服务对话背后的意图对于企业的战略成功至关重要。然而,对话数据的性质在不同场景中存在显著差异,为特定领域实现意图解析器通常涉及繁琐的特征工程和大量的数据标注工作。本文提出了一种名为对话意图解析器 (Dialogue-Intent Parser, DI-Parser) 的新型神经贝叶斯程序学习模型,该模型专门针对数据匮乏环境下的意图解析,并展现出显著的性能提升。DI-Parser 通过“学会学习”的方式有效利用来自多个数据源的信息,并通过在人类标注数据集上的少样本学习能力,汲取“群体智慧”。实验结果表明,DI-Parser 在性能上超越了当前最先进的深度学习模型,并为工业级应用提供了实际优势。
[NLP-105] Manual Verbalizer Enrichment for Few-Shot Text Classification
【速读】: 该论文试图解决在零样本或少样本学习场景下,如何有效利用预训练语言模型进行文本分类的问题。解决方案的关键在于提出了一种名为MAVE的方法,通过在词嵌入空间中利用类别标签的邻域关系来丰富类别标签,从而构建更有效的verbalizer(标签解释器)。这种方法在极少监督数据的情况下表现尤为出色,显著提升了模型的性能,同时减少了资源消耗。
链接: https://arxiv.org/abs/2410.06173
作者: Quang Anh Nguyen,Nadi Tomeh,Mustapha Lebbah,Thierry Charnois,Hanene Azzag,Santiago Cordoba Muñoz
关键词-EN: natural language processing, language processing tasks, pre-trained language models, prompt-based training, pre-trained language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the continuous development of pre-trained language models, prompt-based training becomes a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also shows great performance compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the number of annotated data is limited. In this framework, the role of verbalizers is essential, as an interpretation from masked word distributions into output predictions. In this work, we propose MAVE, an approach for verbalizer construction by enrichment of class labels using neighborhood relation in the embedding space of words for the text classification task. In addition, we elaborate a benchmarking procedure to evaluate typical baselines of verbalizers for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.
摘要:随着预训练语言模型的不断发展,基于提示的训练方法已成为一种广泛采用的范式,显著提升了模型在多种自然语言处理任务中的应用效果。在零样本或少样本场景下,提示方法相较于传统的微调方法展现出更优异的性能,尤其是在标注数据有限的情况下。在这一框架中,verbalizer 的作用至关重要,它负责将掩码词分布解释为输出预测。本文提出了一种名为 MAVE 的方法,通过在文本分类任务中利用词嵌入空间中的邻域关系来丰富类别标签,从而构建 verbalizer。此外,我们还详细阐述了一种基准测试流程,用于评估少样本学习情境下文档分类的典型 verbalizer 基线。我们的模型在显著减少资源使用的情况下,达到了最先进的结果。实验表明,我们的方法在监督数据极其有限的情况下尤为有效。
[NLP-106] Multimodal Situational Safety
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在复杂情境下的安全问题,特别是如何根据语言查询及其对应的视觉上下文来评估和响应安全问题。解决方案的关键在于开发了多模态情境安全基准(MSSBench),通过1,820个语言查询-图像对的数据集,评估MLLMs在不同情境下的安全表现,并提出了一种多代理协作的安全挑战解决流程,以提升模型在处理复杂安全问题时的表现。
链接: https://arxiv.org/abs/2410.06172
作者: Kaiwen Zhou,Chengzhi Liu,Xuandong Zhao,Anderson Compalas,Dawn Song,Xin Eric Wang
关键词-EN: Large Language Models, Multimodal Large Language, demonstrating impressive capabilities, Multimodal Situational Safety, Multimodal Large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query-image pairs, half of which the image context is safe, and the other half is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and struggle to tackle these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines to coordinately solve safety challenges, which shows consistent improvement in safety over the original MLLM response. Code and data: this http URL.
摘要:多模态大语言模型 (Multimodal Large Language Models, MLLMs) 正在迅速发展,展现出作为多模态助手的显著能力,能够与人类及其环境进行互动。然而,这种复杂性的增加也带来了重大的安全问题。本文首次对一种新型安全挑战——多模态情境安全 (Multimodal Situational Safety) 进行了评估和分析,探讨了安全考虑如何根据用户或智能体所处的具体情境而变化。我们认为,为了使 MLLM 能够安全地响应,无论是通过语言还是行动,通常需要在相应的视觉上下文中评估语言查询的安全影响。为了评估这一能力,我们开发了多模态情境安全基准 (Multimodal Situational Safety benchmark, MSSBench),以评估当前 MLLMs 的情境安全性能。该数据集包含 1,820 个语言查询-图像对,其中一半的图像上下文是安全的,另一半是不安全的。我们还开发了一个评估框架,分析了关键的安全方面,包括显式安全推理、视觉理解和至关重要的情境安全推理。我们的研究结果表明,当前的 MLLMs 在指令跟随设置中难以应对这种细微的安全问题,并且在同时解决这些情境安全挑战方面存在困难,这突显了未来研究的一个关键领域。此外,我们开发了多智能体管道来协调解决安全挑战,这显示了在安全性方面对原始 MLLM 响应的持续改进。代码和数据:this http URL。
[NLP-107] Temporal Reasoning Transfer from Text to Video
【速读】: 该论文试图解决视频大语言模型(Video LLMs)在时间推理能力上的不足问题。研究发现,问题并非源自视觉输入的时间编码不足,而是底层语言模型对时间概念的理解困难。解决方案的关键在于引入文本时间推理迁移(T3)方法,通过合成纯文本格式的时间推理任务,增强模型对时间概念的理解。这一方法在不使用任何视频数据的情况下,显著提升了模型在时间推理任务上的表现,如在TempCompass基准测试中提高了5.3%的绝对准确率,并在视频综合基准测试中取得了竞争性成绩,验证了从文本到视频领域时间推理能力迁移的有效性。
链接: https://arxiv.org/abs/2410.06166
作者: Lei Li,Yuanxin Liu,Linli Yao,Peiyuan Zhang,Chenxin An,Lean Wang,Xu Sun,Lingpeng Kong,Qi Liu
关键词-EN: Video Large Language, Large Language Models, Large Language, shown promising capabilities, temporal reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project page: this https URL
点击查看摘要
Abstract:Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs’ temporal reasoning capability stems from the underlying LLM’s inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B’s temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.
摘要:视频大语言模型 (Video LLMs) 在视频理解方面展现出有前景的能力,但在跟踪时间变化和推理时间关系方面仍存在困难。尽管先前研究将这一局限性归因于视觉输入的时间编码效率低下,但我们的诊断研究揭示,视频表示包含足够的信息,即使是小型探测分类器也能达到完美的准确率。令人惊讶的是,我们发现视频大语言模型在时间推理能力上的关键瓶颈源于底层大语言模型 (LLM) 对时间概念的固有困难,这一点在文本时间问答任务中的表现不佳得到了证明。基于这一发现,我们引入了文本时间推理迁移 (Textual Temporal reasoning Transfer, T3)。T3 从现有的图像-文本数据集中综合了多种纯文本格式的时间推理任务,解决了复杂时间场景下视频样本稀缺的问题。值得注意的是,在不使用任何视频数据的情况下,T3 增强了 LongVA-7B 的时间理解能力,在具有挑战性的 TempCompass 基准测试中实现了 5.3 的绝对准确率提升,使我们的模型超越了在 28,000 个视频样本上训练的 ShareGPT4Video-8B。此外,增强后的 LongVA-7B 模型在综合视频基准测试中也取得了竞争性的表现。例如,在 Video-MME 的时间推理任务中,其准确率达到 49.7,超过了强大的大规模模型如 InternVL-Chat-V1.5-20B 和 VILA1.5-40B。进一步分析表明,文本和视频时间任务表现之间存在强相关性,验证了从文本领域向视频领域迁移时间推理能力的有效性。
[NLP-108] AgentSquare: Automatic LLM Agent Search in Modular Design Space
【速读】: 该论文试图解决当前大型语言模型(LLM)代理系统在处理复杂任务时依赖手动、任务特定设计的局限性,导致其适应新任务的能力受限的问题。解决方案的关键在于提出了一个名为Modularized LLM Agent Search (MoLAS)的新研究问题,并设计了一个模块化的LLM代理搜索框架AgentSquare。该框架通过抽象现有LLM代理设计为四个基本模块(规划、推理、工具使用和记忆),并引入模块进化和重组两大核心机制,以高效搜索优化的LLM代理。此外,通过设计性能预测器使用上下文代理模型跳过不具前景的设计,进一步加速搜索过程。实验结果表明,AgentSquare在多个基准测试中显著优于手工设计的代理,平均性能提升17.2%,并能生成可解释的设计洞察,深化对代理架构及其对任务性能影响的理解。
链接: https://arxiv.org/abs/2410.06153
作者: Yu Shang,Yu Li,Keyu Zhao,Likai Ma,Jiahe Liu,Fengli Xu,Yong Li
关键词-EN: Large Language Models, Large Language, Recent advancements, advancements in Large, LLM Agent Search
类目: Computation and Language (cs.CL)
备注: 26 pages
点击查看摘要
Abstract:Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidating the collective efforts of research community. Code repo is available at this https URL.
摘要:近年来,大语言模型 (LLM) 的进步推动了能够处理广泛复杂任务的智能体系统的快速发展。然而,当前的研究主要依赖于手动、任务特定的设计,限制了其对新任务的适应性。本文中,我们引入了一个新的研究问题:模块化 LLM 智能体搜索 (MoLAS)。我们提出了一种模块化设计空间,将现有的 LLM 智能体设计抽象为四个具有统一输入输出接口的基本模块:规划、推理、工具使用和记忆。基于这一设计空间,我们提出了一种名为 AgentSquare 的新型 LLM 智能体搜索框架,该框架引入了两种核心机制,即模块进化和重组,以高效地搜索优化的 LLM 智能体。为进一步加速这一过程,我们设计了一个性能预测器,该预测器使用上下文代理模型来跳过不具前景的智能体设计。在涵盖网页、实体、工具使用和游戏应用等多样化场景的六个基准测试中进行的广泛实验表明,AgentSquare 显著优于手工设计的智能体,相对于已知最佳人类设计,平均性能提升了 17.2%。此外,AgentSquare 能够生成可解释的设计见解,从而更深入地理解智能体架构及其对任务性能的影响。我们相信,模块化设计空间和 AgentSquare 搜索框架为充分挖掘先前成功设计的潜力并整合研究社区的集体努力提供了一个平台。代码仓库可在以下链接获取:https URL。
[NLP-109] Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA EMNLP2024
【速读】: 该论文试图解决在大语言模型(LLMs)中如何高效地从知识图谱(KGs)中检索相关子图的问题。解决方案的关键在于将子图检索任务建模为一个条件生成任务,并利用小型语言模型来处理这一任务。具体来说,论文定义了子图标识符为一系列关系序列,每个关系由语言模型中的特殊标记表示。通过这种方式,仅包含220M参数的小型语言模型在检索性能上能够与依赖7B参数的现有最先进模型相媲美,证明了小型语言模型在子图检索任务中的有效性。此外,当与大语言模型阅读器结合时,论文提出的3B模型在WebQSP和CWQ基准测试中达到了新的最先进水平。
链接: https://arxiv.org/abs/2410.06121
作者: Wenyu Huang,Guancheng Zhou,Hongru Wang,Pavlos Vougiouklis,Mirella Lapata,Jeff Z. Pan
关键词-EN: inject external non-parametric, external non-parametric knowledge, Knowledge Graphs, valuable external knowledge, non-parametric knowledge
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024 Findings
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) is widely used to inject external non-parametric knowledge into large language models (LLMs). Recent works suggest that Knowledge Graphs (KGs) contain valuable external knowledge for LLMs. Retrieving information from KGs differs from extracting it from document sets. Most existing approaches seek to directly retrieve relevant subgraphs, thereby eliminating the need for extensive SPARQL annotations, traditionally required by semantic parsing methods. In this paper, we model the subgraph retrieval task as a conditional generation task handled by small language models. Specifically, we define a subgraph identifier as a sequence of relations, each represented as a special token stored in the language models. Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models relying on 7B parameters, demonstrating that small language models are capable of performing the subgraph retrieval task. Furthermore, our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model and data will be made available online: this https URL.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 广泛用于将外部非参数知识注入到大语言模型 (Large Language Models, LLMs) 中。最近的研究表明,知识图谱 (Knowledge Graphs, KGs) 包含对 LLMs 有价值的外部知识。从 KGs 中检索信息与从文档集中提取信息有所不同。大多数现有方法试图直接检索相关子图,从而消除了传统上由语义解析方法所需的广泛 SPARQL 注释。在本文中,我们将子图检索任务建模为一个小语言模型处理的条件生成任务。具体来说,我们将子图标识符定义为一系列关系,每个关系表示为语言模型中存储的特殊 Token。我们的基础生成子图检索模型,仅由 220M 参数组成,与依赖 7B 参数的最先进模型相比,实现了具有竞争力的检索性能,证明了小语言模型能够执行子图检索任务。此外,我们的最大 3B 模型,在与 LLM 阅读器结合时,在 WebQSP 和 CWQ 基准测试中均创下了新的最先进端到端性能。我们的模型和数据将在线提供:this https URL。
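The abstract describes a subgraph identifier as a sequence of relations. A minimal sketch of how such a relation sequence could be expanded back into a subgraph is given below, under stated assumptions: the toy triples, the function names `build_index` and `expand_relation_path`, and the expansion procedure itself are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("Barack Obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
    ("Honolulu", "located_in", "United States"),
    ("Barack Obama", "spouse", "Michelle Obama"),
]


def build_index(triples):
    """Map (head, relation) to the list of tails reachable in one hop."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[(head, relation)].append(tail)
    return index


def expand_relation_path(index, topic_entity, relation_path):
    """Follow a relation sequence from the topic entity and collect visited triples."""
    frontier = {topic_entity}
    subgraph = []
    for relation in relation_path:
        next_frontier = set()
        for head in sorted(frontier):
            for tail in index.get((head, relation), []):
                subgraph.append((head, relation, tail))
                next_frontier.add(tail)
        frontier = next_frontier
    return subgraph


index = build_index(TRIPLES)
# A generated identifier is assumed here to be the relation sequence below.
subgraph = expand_relation_path(index, "Barack Obama", ["born_in", "located_in"])
```

In this sketch the language model would only need to generate the short relation sequence; the graph lookup that turns it into triples is a deterministic step.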
[NLP-110] Optimizing the Training Schedule of Multilingual NMT using Reinforcement Learning
【速读】: 该论文试图解决多语言神经机器翻译(NMT)中低资源语言(LRLs)翻译质量受训练顺序影响的问题。解决方案的关键在于采用强化学习算法优化训练顺序,具体提出了两种方法:(1) 教师-学生课程学习,通过指数平滑估计每个动作的回报基于单语或多语开发子集的损失;(2) 深度Q网络,利用额外的神经网络从系统不同状态下的历史动作和接收的奖励中估计奖励。实验结果表明,第二种方法通过调整低资源语言和高资源语言批次的出现次数,显著提升了BLEU和COMET评分。
链接: https://arxiv.org/abs/2410.06118
作者: Alexis Allemann,Àlex R. Atrio,Andrei Popescu-Belis
关键词-EN: translating low-resource languages, viable solution, solution for translating, translating low-resource, data from high-resource
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multilingual NMT is a viable solution for translating low-resource languages (LRLs) when data from high-resource languages (HRLs) from the same language family is available. However, the training schedule, i.e. the order of presentation of languages, has an impact on the quality of such systems. Here, in a many-to-one translation setting, we propose to apply two algorithms that use reinforcement learning to optimize the training schedule of NMT: (1) Teacher-Student Curriculum Learning and (2) Deep Q Network. The former uses an exponentially smoothed estimate of the returns of each action based on the loss on monolingual or multilingual development subsets, while the latter estimates rewards using an additional neural network trained from the history of actions selected in different states of the system, together with the rewards received. On an 8-to-1 translation dataset with LRLs and HRLs, our second method improves BLEU and COMET scores with respect to both random selection of monolingual batches and shuffled multilingual batches, by adjusting the number of presentations of LRL vs. HRL batches.
摘要:多语言神经机器翻译 (NMT) 是翻译低资源语言 (LRL) 的可行解决方案,当同一语系的高资源语言 (HRL) 的数据可用时。然而,训练计划,即语言呈现的顺序,对这类系统的质量有影响。在这里,在多对一翻译设置中,我们提出应用两种使用强化学习来优化 NMT 训练计划的算法:(1) 教师-学生课程学习 (Teacher-Student Curriculum Learning) 和 (2) 深度 Q 网络 (Deep Q Network)。前者基于单语或多语开发子集的损失,使用每个动作回报的指数平滑估计;后者使用从系统不同状态中选择的动作历史和接收到的奖励训练的额外神经网络来估计奖励。在一个包含 LRL 和 HRL 的 8 对 1 翻译数据集上,我们的第二种方法通过调整 LRL 和 HRL 批次的呈现次数,相对于随机选择单语批次和打乱的多语批次,提高了 BLEU 和 COMET 评分。
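The first algorithm above keeps an exponentially smoothed estimate of the return of each action (here, which language's batch to present next). The sketch below illustrates that idea only. The class name `SmoothedScheduler`, the epsilon-greedy selection rule, and the values of `alpha` and `epsilon` are assumptions for illustration; the paper's reward definition and selection policy are not given in the abstract.

```python
import random


class SmoothedScheduler:
    """Pick the next training language from exponentially smoothed reward estimates."""

    def __init__(self, languages, alpha=0.3, epsilon=0.1, seed=0):
        self.q = {lang: 0.0 for lang in languages}  # smoothed return per action
        self.alpha = alpha                          # smoothing factor (assumed value)
        self.epsilon = epsilon                      # exploration rate (assumed value)
        self.rng = random.Random(seed)

    def choose(self):
        """Epsilon-greedy choice over the current estimates."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.q))
        return max(sorted(self.q), key=lambda lang: self.q[lang])

    def update(self, lang, reward):
        """Blend the new reward (e.g. a drop in dev loss) into the running estimate."""
        self.q[lang] = self.alpha * reward + (1 - self.alpha) * self.q[lang]


scheduler = SmoothedScheduler(["hrl", "lrl"], alpha=0.3, epsilon=0.0)
scheduler.update("lrl", 1.0)  # estimate becomes 0.3
scheduler.update("lrl", 1.0)  # estimate becomes 0.3 + 0.7 * 0.3 = 0.51
next_language = scheduler.choose()
```

With exploration switched off as in the usage lines, the scheduler keeps presenting whichever language currently has the highest smoothed reward.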
[NLP-111] Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
【速读】: 该论文试图解决大型语言模型(LLMs)在开放式文本生成任务中,解码策略中超参数选择对文本质量的影响问题。解决方案的关键在于通过大规模、全面的分析,研究不同超参数设置对文本生成质量的影响,并提供实用的超参数调优指南。论文通过在多个LLMs、数据集和评估指标上的敏感性分析,展示了超参数选择对文本质量的显著影响,并强调了其在不同模型和任务中的差异性。解决方案的核心在于深入分析这些影响,结合人工评估和广泛使用的自动评估指标,为超参数调优提供具体指导。
链接: https://arxiv.org/abs/2410.06097
作者: Esteban Garces Arias,Meimingwei Li,Christian Heumann,Matthias Aßenmacher
关键词-EN: large language models, strategies for large, large language, underexplored aspect, Decoding strategies
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Decoding strategies for large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Since LLMs produce probability distributions over the entire vocabulary, various decoding methods have been developed to transform these probabilities into coherent and fluent text, each with its own set of hyperparameters. In this study, we present a large-scale, comprehensive analysis of how hyperparameter selection affects text quality in open-ended text generation across multiple LLMs, datasets, and evaluation metrics. Through an extensive sensitivity analysis, we provide practical guidelines for hyperparameter tuning and demonstrate the substantial influence of these choices on text quality. Using three established datasets, spanning factual domains (e.g., news) and creative domains (e.g., fiction), we show that hyperparameter tuning significantly impacts generation quality, though its effects vary across models and tasks. We offer in-depth insights into these effects, supported by both human evaluations and a synthesis of widely-used automatic evaluation metrics.
摘要:大语言模型 (LLM) 的解码策略是文本生成任务中至关重要但往往未被充分探索的方面。由于 LLM 生成整个词汇的概率分布,因此开发了各种解码方法,将这些概率转化为连贯且流畅的文本,每种方法都有其自身的超参数集。在本研究中,我们进行了大规模、全面的分析,探讨了超参数选择如何影响多个 LLM、数据集和评估指标下的开放式文本生成质量。通过广泛的敏感性分析,我们提供了实用的超参数调优指南,并展示了这些选择对文本质量的显著影响。我们使用了三个成熟的基准数据集,涵盖事实领域(例如新闻)和创意领域(例如小说),结果表明超参数调优显著影响生成质量,尽管其效果在不同模型和任务之间有所不同。我们通过人类评估和广泛使用的自动评估指标的综合分析,深入探讨了这些影响。
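As a concrete reference for the hyperparameters the study varies, here is a minimal sketch of three widely used decoding knobs: temperature, top-k and top-p (nucleus) filtering. It is a generic illustration written for this digest, with hypothetical function names; it is not the paper's code, and the paper may cover other decoding methods as well.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=1.0):
    """Return renormalized {token_id: prob} after top-k, then nucleus filtering."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


logits = [2.0, 1.0, 0.0]
probs = softmax(logits, temperature=1.0)
nucleus = filter_top_k_top_p(probs, top_p=0.9)
greedy_like = filter_top_k_top_p(probs, top_k=1)
```

Sampling from the returned dictionary instead of the full distribution is what makes these settings change the fluency and diversity of the generated text.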
[NLP-112] Listen to the Patient: Enhancing Medical Dialogue Generation with Patient Hallucination Detection and Mitigation
【速读】: 该论文试图解决医疗对话系统中患者表达与实际健康状况之间的差异问题,即患者幻觉(Patient Hallucination)。解决方案的关键在于提出了MedPH方法,该方法通过检测对话实体图的一维结构熵来识别幻觉,并利用幻觉相关信息引导患者表达实际状况,从而在医疗实体预测和响应生成任务中有效减少幻觉现象。
链接: https://arxiv.org/abs/2410.06094
作者: Lang Qin,Yao Zhang,Hongru Liang,Adam Jatowt,Zhenglu Yang
关键词-EN: provide medical services, patient-agent conversations, dialogue systems aim, aim to provide, services through patient-agent
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Medical dialogue systems aim to provide medical services through patient-agent conversations. Previous methods typically regard patients as ideal users, focusing mainly on common challenges in dialogue systems, while neglecting the potential biases or misconceptions that might be introduced by real patients, who are typically non-experts. This study investigates the discrepancy between patients’ expressions during medical consultations and their actual health conditions, defined as patient hallucination. Such phenomena often arise from patients’ lack of knowledge and comprehension, concerns, and anxieties, resulting in the transmission of inaccurate or wrong information during consultations. To address this issue, we propose MedPH, a Medical dialogue generation method for mitigating the problem of Patient Hallucinations designed to detect and cope with hallucinations. MedPH incorporates a detection method that utilizes one-dimensional structural entropy over a temporal dialogue entity graph, and a mitigation strategy based on hallucination-related information to guide patients in expressing their actual conditions. Experimental results indicate the high effectiveness of MedPH when compared to existing approaches in both medical entity prediction and response generation tasks, while also demonstrating its effectiveness in mitigating hallucinations within interactive scenarios.
摘要:医疗对话系统旨在通过患者与智能体之间的对话提供医疗服务。以往的方法通常将患者视为理想用户,主要关注对话系统中的常见挑战,而忽略了真实患者(通常是非专家)可能引入的潜在偏见或误解。本研究探讨了患者在医疗咨询过程中表达与其实际健康状况之间的差异,定义为患者幻觉 (Patient Hallucination)。这种现象通常源于患者缺乏知识与理解、担忧和焦虑,导致在咨询过程中传递不准确或错误的信息。为解决这一问题,我们提出了 MedPH,一种用于缓解患者幻觉问题的医疗对话生成方法。MedPH 结合了一种检测方法,该方法利用时间对话实体图上的一维结构熵,以及基于幻觉相关信息的缓解策略,以引导患者表达其实际状况。实验结果表明,与现有方法相比,MedPH 在医疗实体预测和响应生成任务中均表现出高效性,同时在交互场景中也显示出其在缓解幻觉方面的有效性。
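The detection step relies on one-dimensional structural entropy over a dialogue entity graph. For an undirected, unweighted graph this quantity is commonly defined as the Shannon entropy of the degree distribution normalized by the graph volume, and the sketch below computes that definition. How MedPH builds the temporal entity graph and turns the entropy into a hallucination signal is not stated in the abstract, so the example graphs are purely hypothetical.

```python
import math
from collections import Counter


def one_dim_structural_entropy(edges):
    """H1(G) = -sum_v (d_v / vol) * log2(d_v / vol), with vol = sum of degrees."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    volume = sum(degree.values())  # equals 2 * number of edges
    return -sum((d / volume) * math.log2(d / volume) for d in degree.values())


# Hypothetical entity graphs: a 4-cycle and a star with three leaves.
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
cycle_entropy = one_dim_structural_entropy(cycle)
star_entropy = one_dim_structural_entropy(star)
```

The cycle spreads degree evenly and so has the higher entropy of the two; a graph dominated by one hub entity scores lower.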
[NLP-113] TOWER: Tree Organized Weighting for Evaluating Complex Instructions EMNLP2024
【速读】: 该论文试图解决现有大型语言模型(LLMs)在遵循复杂人类指令时的评估问题,特别是现有基准测试方法(如Chatbot Arena)资源密集且耗时,而使用LLMs作为评判者的替代方法(如AlpacaEval、MT Bench等)未能充分考虑某些复杂指令的重要性。论文提出的解决方案是引入一种名为TOWER的新评估指标,该指标通过树形结构表示复杂指令,并结合人类评判的重要性权重,从而更准确地评估模型对复杂指令的遵循能力。关键在于利用树形结构和人类评判的一致性来提升评估的准确性和效率。
链接: https://arxiv.org/abs/2410.06089
作者: Noah Ziems,Zhihan Zhang,Meng Jiang
关键词-EN: Evaluating the ability, large language models, real-world applications, follow complex human-written, ability of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024
点击查看摘要
Abstract:Evaluating the ability of large language models (LLMs) to follow complex human-written instructions is essential for their deployment in real-world applications. While benchmarks like Chatbot Arena use human judges to assess model performance, they are resource-intensive and time-consuming. Alternative methods using LLMs as judges, such as AlpacaEval, MT Bench, WildBench, and InFoBench offer improvements but still do not capture that certain complex instruction aspects are more important than others to follow. To address this gap, we propose a novel evaluation metric, TOWER, that incorporates human-judged importance into the assessment of complex instruction following. We show that human annotators agree with tree-based representations of these complex instructions nearly as much as they agree with other human annotators. We release tree-based annotations of the InFoBench dataset and the corresponding evaluation code to facilitate future research.
摘要:评估大语言模型 (LLMs) 遵循复杂人类编写指令的能力对于其在实际应用中的部署至关重要。尽管像 Chatbot Arena 这样的基准测试使用人类评判来评估模型性能,但这些方法资源密集且耗时。使用 LLMs 作为评判的替代方法,如 AlpacaEval、MT Bench、WildBench 和 InFoBench,虽然有所改进,但仍未能捕捉到某些复杂指令方面比其他方面更重要的特点。为了填补这一空白,我们提出了一种新的评估指标 TOWER,该指标将人类评判的重要性纳入复杂指令遵循的评估中。我们展示了人类标注者对这些复杂指令的树形表示的认同度几乎与他们对其他人类标注者的认同度一样高。我们发布了 InFoBench 数据集的树形标注以及相应的评估代码,以促进未来研究。
[NLP-114] Training-free LLM-generated Text Detection by Mining Token Probability Sequences
【速读】: 该论文试图解决大语言模型(LLMs)生成文本的检测问题,特别是在跨领域、跨模型和跨语言场景下的检测难题。解决方案的关键在于提出了一种名为Lastde的新型无训练检测器,该检测器通过结合局部和全局统计特征来增强检测效果。具体来说,Lastde首次将时间序列分析引入LLM生成文本的检测中,捕捉词元概率序列的时间动态,从而揭示人类和LLM生成文本之间的显著差异。此外,论文还提出了Lastde++作为实时检测的高效替代方案,通过在多种复杂场景下的实验验证,展示了其优越的性能和对抗释义攻击的鲁棒性。
链接: https://arxiv.org/abs/2410.06072
作者: Yihuai Xu,Yongwei Wang,Yifei Bi,Huangsen Cao,Zhouhan Lin,Yu Zhao,Fei Wu
关键词-EN: Large language models, Large language, generating high-quality texts, demonstrated remarkable capabilities, language models
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in generating high-quality texts across diverse domains. However, the potential misuse of LLMs has raised significant concerns, underscoring the urgent need for reliable detection of LLM-generated texts. Conventional training-based detectors often struggle with generalization, particularly in cross-domain and cross-model scenarios. In contrast, training-free methods, which focus on inherent discrepancies through carefully designed statistical features, offer improved generalization and interpretability. Despite this, existing training-free detection methods typically rely on global text sequence statistics, neglecting the modeling of local discriminative features, thereby limiting their detection efficacy. In this work, we introduce a novel training-free detector, termed Lastde, that synergizes local and global statistics for enhanced detection. For the first time, we introduce time series analysis to LLM-generated text detection, capturing the temporal dynamics of token probability sequences. By integrating these local statistics with global ones, our detector reveals significant disparities between human and LLM-generated texts. We also propose an efficient alternative, Lastde++, to enable real-time detection. Extensive experiments on six datasets involving cross-domain, cross-model, and cross-lingual detection scenarios, under both white-box and black-box settings, demonstrated that our method consistently achieves state-of-the-art performance. Furthermore, our approach exhibits greater robustness against paraphrasing attacks compared to existing baseline methods.
摘要:大语言模型 (LLM) 在生成高质量文本方面展现了显著的能力,涵盖了多个领域。然而,LLM 的潜在滥用引发了重大担忧,突显了可靠检测 LLM 生成文本的迫切需求。传统的基于训练的检测器在泛化方面往往表现不佳,尤其是在跨领域和跨模型的场景中。相比之下,无需训练的方法通过精心设计的统计特征关注内在差异,提供了更好的泛化和可解释性。尽管如此,现有的无需训练的检测方法通常依赖于全局文本序列统计,忽略了局部判别特征的建模,从而限制了其检测效果。在本研究中,我们引入了一种新型无需训练的检测器,称为 Lastde,它协同利用局部和全局统计信息以增强检测能力。首次将时间序列分析引入 LLM 生成文本检测,捕捉 Token 概率序列的时间动态。通过将这些局部统计信息与全局统计信息整合,我们的检测器揭示了人类和 LLM 生成文本之间显著的差异。我们还提出了一种高效的替代方案 Lastde++,以实现实时检测。在涉及跨领域、跨模型和跨语言检测场景的六个数据集上进行的广泛实验,无论是在白盒还是黑盒设置下,均表明我们的方法始终达到最先进的性能。此外,与现有的基线方法相比,我们的方法在应对释义攻击时表现出更强的鲁棒性。
[NLP-115] Jet Expansions of Residual Computation
【速读】: 该论文试图解决模型预测中不同计算路径贡献的解耦问题,提出了一种基于jets(广义截断泰勒级数算子)扩展残差计算图的框架。解决方案的关键在于利用jets对模型内部的计算路径进行系统性分解,无需依赖数据、训练或模型采样,从而实现对模型计算过程的无数据分析,为模型解释性、开发和评估提供了新的方法。
链接: https://arxiv.org/abs/2410.06024
作者: Yihong Chen,Xiangxiang Xu,Yao Lu,Pontus Stenetorp,Luca Franceschi
关键词-EN: truncated Taylor series, generalize truncated Taylor, Taylor series, truncated Taylor, graphs using jets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注:
点击查看摘要
Abstract:We introduce a framework for expanding residual computational graphs using jets, operators that generalize truncated Taylor series. Our method provides a systematic approach to disentangle contributions of different computational paths to model predictions. In contrast to existing techniques such as distillation, probing, or early decoding, our expansions rely solely on the model itself and require no data, training, or sampling from the model. We demonstrate how our framework grounds and subsumes logit lens, reveals a (super-)exponential path structure in the recursive residual depth and opens up several applications. These include sketching a transformer large language model with n-gram statistics extracted from its computations, and indexing the models’ levels of toxicity knowledge. Our approach enables data-free analysis of residual computation for model interpretability, development, and evaluation.
摘要:我们提出了一种利用 jets(一种泛化截断泰勒级数算子)扩展残差计算图的框架。该方法提供了一种系统化的途径,用于解耦不同计算路径对模型预测的贡献。与现有的技术如蒸馏、探针或早期解码相比,我们的扩展完全依赖于模型本身,无需数据、训练或从模型中采样。我们展示了该框架如何奠定并包含 logit lens 的基础,揭示了递归残差深度中的(超)指数路径结构,并开辟了多种应用。这些应用包括利用从模型计算中提取的 n-gram 统计数据绘制大语言模型草图,以及索引模型对毒性知识的掌握程度。我们的方法实现了无需数据的残差计算分析,为模型的可解释性、开发和评估提供了支持。
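To make "jets generalize truncated Taylor series" concrete, the block below writes out the standard order-k truncated Taylor expansion and applies a zeroth-order expansion to a two-block residual network. The notation (J^k, f_0, f_1, h_0) is chosen for this note and is an assumption, not necessarily the paper's; the second identity is exact and only illustrates how one residual computation separates into per-path terms.

```latex
% Order-k jet (truncated Taylor expansion) of f around a base point x_0
\mathrm{J}^{k} f(x_0)(x) \;=\; \sum_{j=0}^{k} \frac{1}{j!}\,
    \mathrm{D}^{j} f(x_0)\big[(x - x_0)^{\otimes j}\big]

% Two residual blocks: h_1 = h_0 + f_0(h_0), \quad h_2 = h_1 + f_1(h_1).
% Expanding f_1 around h_0 to order zero splits h_2 into four path terms:
h_2 \;=\; \underbrace{h_0}_{\text{skip}}
    \;+\; \underbrace{f_0(h_0)}_{\text{block 0 only}}
    \;+\; \underbrace{f_1(h_0)}_{\text{block 1 only}}
    \;+\; \underbrace{f_1\big(h_0 + f_0(h_0)\big) - f_1(h_0)}_{\text{remainder: block 0 then block 1}}
```

Repeating the same split at every layer doubles the number of terms each time, which is consistent with the exponential path structure the abstract mentions.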
[NLP-116] Can Language Models Induce Grammatical Knowledge from Indirect Evidence? EMNLP2024
【速读】: 该论文试图解决的问题是:语言模型在判断句子可接受性时,是否能像人类一样高效地利用间接数据(间接证据)来推导语法知识。解决方案的关键在于引入Wug InDirect Evidence Test (WIDET)数据集,通过在预训练数据中插入合成实例并评估模型对这些实例的反应,来研究模型在不同间接程度和数量的实例暴露下,是否能有效推导出语法知识。实验结果表明,现有语言模型在某些语言现象中,即使多次暴露于结构相同但词汇不同的实例,仍未能有效推导语法知识,这为未来研究开发能利用潜在间接证据推导语法知识的模型提供了方向。
链接: https://arxiv.org/abs/2410.06022
作者: Miyu Oba,Yohei Oseki,Akiyo Fukatsu,Akari Haga,Hiroki Ouchi,Taro Watanabe,Saku Sugawara
关键词-EN: indirect evidence, language, judge sentence acceptability, data, language models
类目: Computation and Language (cs.CL)
备注: This paper is accepted at EMNLP 2024 Main
点击查看摘要
Abstract:What kinds of and how much data is necessary for language models to induce grammatical knowledge to judge sentence acceptability? Recent language models still have much room for improvement in their data efficiency compared to humans. This paper investigates whether language models efficiently use indirect data (indirect evidence), from which they infer sentence acceptability. In contrast, humans use indirect evidence efficiently, which is considered one of the inductive biases contributing to efficient language acquisition. To explore this question, we introduce the Wug InDirect Evidence Test (WIDET), a dataset consisting of training instances inserted into the pre-training data and evaluation instances. We inject synthetic instances with newly coined wug words into pretraining data and explore the model’s behavior on evaluation data that assesses grammatical acceptability regarding those words. We prepare the injected instances by varying their levels of indirectness and quantity. Our experiments surprisingly show that language models do not induce grammatical knowledge even after repeated exposure to instances with the same structure but differing only in lexical items from evaluation instances in certain language phenomena. Our findings suggest a potential direction for future research: developing models that use latent indirect evidence to induce grammatical knowledge.
摘要:语言模型需要什么样的数据以及多少数据才能推导出语法知识来判断句子的可接受性?与人类相比,当前的语言模型在数据效率方面仍有很大提升空间。本文探讨了语言模型是否能高效利用间接数据(间接证据)来推断句子的可接受性。相比之下,人类能够高效利用间接证据,这被认为是促进语言高效习得的归纳偏置之一。为了探究这一问题,我们引入了 Wug 间接证据测试 (WIDET),这是一个由插入预训练数据中的训练实例和评估实例组成的数据集。我们将带有新造 wug 词的合成实例注入预训练数据,并探索模型在评估数据上的表现,这些评估数据用于评估这些词的语法可接受性。我们通过改变间接程度和数量来准备注入的实例。我们的实验结果出乎意料地显示,即使语言模型在反复接触到与评估实例结构相同但仅在词汇项上不同的实例后,在某些语言现象上仍未能推导出语法知识。我们的研究结果为未来的研究指出了一个潜在方向:开发能够利用潜在间接证据来推导语法知识的模型。
[NLP-117] Unveiling Transformer Perception by Exploring Input Manifolds
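Quick read: This paper addresses the exploration of equivalence classes in the input space of Transformer models. The approach rests on a mathematical theory that describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Eigendecomposition of the pullback of the output-space distance metric is then used to reconstruct equivalence classes in the input space, giving insight into how a Transformer sees its input space and providing local, task-agnostic explainability for computer vision and natural language processing tasks.
Link: https://arxiv.org/abs/2410.06019
Authors: Alessandro Benfenati,Alfio Ferrara,Alessio Marta,Davide Riva,Elisabetta Rocchetti
Keywords: input space, paper introduces, introduces a general, equivalence classes, Transformer models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 4 figures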
Abstract:This paper introduces a general method for the exploration of equivalence classes in the input space of Transformer models. The proposed approach is based on sound mathematical theory which describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using eigendecomposition of the pullback of the distance metric defined on the output space through the Jacobian of the model, we are able to reconstruct equivalence classes in the input space and navigate across them. We illustrate how this method can be used as a powerful tool for investigating how a Transformer sees the input space, facilitating local and task-agnostic explainability in Computer Vision and Natural Language Processing tasks.
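The pullback construction named in the abstract can be written out as follows. This is only a sketch in notation introduced here, not taken from the paper: $f$ stands for the model as a map from input space to output space, $J(x)$ for its Jacobian at an input $x$, and $g$ for a metric on the output space.

```latex
\[
  \tilde g(x) \;=\; J(x)^{\top}\, g\bigl(f(x)\bigr)\, J(x),
  \qquad
  \tilde g(x) \;=\; \sum_{i} \lambda_i \, v_i v_i^{\top},
  \quad \lambda_i \ge 0 .
\]
```

Under these assumptions, eigenvectors $v_i$ with $\lambda_i = 0$ are directions in which a small step away from $x$ leaves the output unchanged to first order, so following them would trace out inputs the model treats as equivalent. Eigenvectors with large $\lambda_i$ are the directions the output is most sensitive to.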
[NLP-118] Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
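Quick read: This paper examines a phenomenon in long-context large language models (LLMs): as the number of retrieved passages grows, output quality first improves and then declines. The key is to identify and mitigate the negative impact of retrieved "hard negatives" on generation quality. The paper proposes two kinds of remedy: a training-free optimization, retrieval reordering; and training-based methods, namely RAG-specific implicit LLM fine-tuning and RAG-oriented fine-tuning with intermediate reasoning.
Link: https://arxiv.org/abs/2410.05983
Authors: Bowen Jin,Jinsung Yoon,Jiawei Han,Sercan O. Arik
Keywords: large language models, external knowledge sources, empowers large language, utilize external knowledge, Retrieval-augmented generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages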
Abstract:Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources. The increasing capacity of LLMs to process longer input sequences opens up avenues for providing more retrieved information, to potentially enhance the quality of generated outputs. It is plausible to assume that a larger retrieval set would contain more relevant information (higher recall), that might result in improved performance. However, our empirical findings demonstrate that for many long-context LLMs, the quality of generated output initially improves first, but then subsequently declines as the number of retrieved passages increases. This paper investigates this phenomenon, identifying the detrimental impact of retrieved “hard negatives” as a key contributor. To mitigate this and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-based approaches. We first showcase the effectiveness of retrieval reordering as a simple yet powerful training-free optimization. Furthermore, we explore training-based methods, specifically RAG-specific implicit LLM fine-tuning and RAG-oriented fine-tuning with intermediate reasoning, demonstrating their capacity for substantial performance gains. Finally, we conduct a systematic analysis of design choices for these training-based methods, including data distribution, retriever selection, and training context length.
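To make "retrieval reordering" concrete, here is a minimal sketch. The abstract does not spell out the ordering, so the scheme below is an assumption: passages with the highest retrieval scores go to the two ends of the context and the weakest ones end up in the middle. The function name `reorder_passages` and the sample scores are hypothetical.

```python
def reorder_passages(passages, scores):
    """Place high-scoring passages at both ends, low-scoring ones in the middle.

    Assumed scheme for illustration only; not taken from the paper.
    """
    # Rank passage indices from most to least relevant.
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    front, back = [], []
    for rank, idx in enumerate(ranked):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(passages[idx])
    # Reverse the back half so the second-best passage ends up last.
    return front + back[::-1]


docs = ["a", "b", "c", "d", "e"]
scores = [0.9, 0.1, 0.5, 0.7, 0.3]
print(reorder_passages(docs, scores))  # ['a', 'c', 'b', 'e', 'd']
```

The step needs no training and only changes prompt order, which is what makes it cheap to try before any fine-tuning.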
[NLP-119] PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Link: https://arxiv.org/abs/2410.05970
Authors: Xudong Xie,Liang Yin,Hao Yan,Yang Liu,Jing Ding,Minghui Liao,Yuliang Liu,Wei Chen,Xiang Bai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-120] Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning
Link: https://arxiv.org/abs/2410.05928
Authors: Ayush Singh,Mansi Gupta,Shivank Garg,Abhinav Kumar,Vansh Agrawal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-121] Give me a hint: Can LLMs take a hint to solve math problems?
Link: https://arxiv.org/abs/2410.05915
Authors: Vansh Agrawal,Pratham Singla,Amitoj Singh Miglani,Shivank Garg,Ayush Mangal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[NLP-122] Automatic Summarization of Long Documents ACL2023
Quick read: This paper addresses the input-size limit that constrains existing Transformer-based models on long texts. The key is three novel algorithms that let any large language model (LLM) overcome its input-size limit efficiently, without architectural changes. Experiments on texts of more than 70,000 words show a significant increase in BERTScore with competitive ROUGE scores.
Link: https://arxiv.org/abs/2410.05903
Authors: Naman Chhibbar,Jugal Kalita
Keywords: internet daily, making utilization, difficult and cumbersome, vast amount, amount of textual
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages (including bibliography) with 6 figures. ACL 2023 proceedings format
Abstract:A vast amount of textual data is added to the internet daily, making utilization and interpretation of such data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious reading time. Although many transformer-based models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.
[NLP-123] Edit Distances and Their Applications to Downstream Tasks in Research and Commercial Contexts
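Quick read: This tutorial discusses edit distances in research and commercial applications, in particular how metrics such as Translation Edit Rate (TER), Levenshtein, Damerau-Levenshtein, Longest Common Subsequence, and n-gram distances measure differences between text sequences. It argues that these statistical metrics are limited when comparing texts and cannot fully capture the details of post-editing work or the real errors that need correcting. The discussion breaks the distances down into their essential components, centering on four editing actions (inserting, deleting, replacing, and moving words), and shows their implementations in openly available packages and toolkits. It also covers the use of edit distances in downstream tasks and the implications for researchers and commercial applications.
Link: https://arxiv.org/abs/2410.05881
Authors: Félix do Carmo,Diptesh Kanojia
Keywords: Longest Common Subsequence, edit distances, edit distances applied, tutorial describes, describes the concept
Categories: Computation and Language (cs.CL)
Comments: Tutorial @ 16th AMTA Conference, 2024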
Abstract: The tutorial describes the concept of edit distances applied to research and commercial contexts. We use Translation Edit Rate (TER), Levenshtein, Damerau-Levenshtein, Longest Common Subsequence and n-gram distances to demonstrate the frailty of statistical metrics when comparing text sequences. Our discussion disassembles them into their essential components. We discuss the centrality of four editing actions: insert, delete, replace and move words, and show their implementations in openly available packages and toolkits. The application of edit distances in downstream tasks often assumes that these accurately represent work done by post-editors and real errors that need to be corrected in MT output. We discuss how imperfect edit distances are in capturing the details of this error correction work and the implications for researchers and for commercial applications, of these uses of edit distances. In terms of commercial applications, we discuss their integration in computer-assisted translation tools and how the perception of the connection between edit distances and post-editor effort affects the definition of translator rates.
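As a small illustration of why the "move" action matters, the sketch below implements plain Levenshtein distance over arbitrary sequences (characters or words). It is a generic textbook dynamic program, not code from the tutorial; the function name `levenshtein` and the sample sentences are chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free when x == y)
            ))
        prev = curr
    return prev[-1]


# Character level.
print(levenshtein("kitten", "sitting"))  # 3

# Word level: one word moved from the front to the end.
hyp = "d a b c".split()
ref = "a b c d".split()
print(levenshtein(hyp, ref))  # 2
```

Plain Levenshtein has no move operation, so relocating a single word costs two edits (one deletion plus one insertion), whereas TER counts a shift as one edit. Gaps of this kind are one reason an edit count may not reflect the effort a post-editor actually spends.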
[NLP-124] MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Link: https://arxiv.org/abs/2410.05873
Authors: Amir Hossein Kargaran,Ali Modarressi,Nafiseh Nikeghbal,Jana Diesner,François Yvon,Hinrich Schütze
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-125] From Tokens to Words: on the inner lexicon of LLMs
Link: https://arxiv.org/abs/2410.05864
Authors: Guy Kaplan,Matanel Oren,Yuval Reif,Roy Schwartz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-126] Communicating with Speakers and Listeners of Different Pragmatic Levels EMNLP2024
Link: https://arxiv.org/abs/2410.05851
Authors: Kata Naszadi,Frans A. Oliehoek,Christof Monz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 main
[NLP-127] Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy
Link: https://arxiv.org/abs/2410.05824
Authors: Hongbin Na,Tao Shen,Shumao Yu,Ling Chen
Categories: Computation and Language (cs.CL)
Comments: Under review
[NLP-128] A Zero-Shot approach to the Conversational Tree Search Task
Link: https://arxiv.org/abs/2410.05821
Authors: Dirk Väth,Ngoc Thang Vu
Categories: Computation and Language (cs.CL)
[NLP-129] Probing Language Models on Their Knowledge Source EMNLP2024
Link: https://arxiv.org/abs/2410.05817
Authors: Zineddine Tighidet,Andrea Mogini,Jiali Mei,Benjamin Piwowarski,Patrick Gallinari
Categories: Computation and Language (cs.CL)
Comments: Accepted at BlackBoxNLP@EMNLP2024
[NLP-130] Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models
Link: https://arxiv.org/abs/2410.05802
Authors: Bozhou Li,Hao Liang,Yang Li,Fangcheng Fu,Hongzhi Yin,Conghui He,Wentao Zhang
Categories: Computation and Language (cs.CL)
[NLP-131] Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation EMNLP2024
Link: https://arxiv.org/abs/2410.05801
Authors: Bolei He,Nuo Chen,Xinran He,Lingyong Yan,Zhenkai Wei,Jinchang Luo,Zhen-Hua Ling
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024 Findings. 9 pages, 4 figures, 7 tables
[NLP-132] CodeCipher: Learning to Obfuscate Source Code Against LLMs
Link: https://arxiv.org/abs/2410.05797
Authors: Yalan Lin,Chengcheng Wan,Yixiong Fang,Xiaodong Gu
Categories: Computation and Language (cs.CL)
[NLP-133] Song Emotion Classification of Lyrics with Out-of-Domain Data under Label Scarcity
Link: https://arxiv.org/abs/2410.05778
Authors: Jonathan Sakunkoo,Annabella Sakunkoo
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-134] Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes
Link: https://arxiv.org/abs/2410.05770
Authors: Tim Schopf,Alexander Blatzheim,Nektarios Machner,Florian Matthes
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)
[NLP-135] Information Discovery in e-Commerce
Link: https://arxiv.org/abs/2410.05763
Authors: Zhaochun Ren,Xiangnan He,Dawei Yin,Maarten de Rijke
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
[NLP-136] Label Confidence Weighted Learning for Target-level Sentence Simplification EMNLP2024
Link: https://arxiv.org/abs/2410.05748
Authors: Xinying Qiu,Jingshen Zhang
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024
[NLP-137] Enhancing Temporal Modeling of Video LLMs via Time Gating EMNLP2024
Link: https://arxiv.org/abs/2410.05714
Authors: Zi-Yuan Hu,Yiwu Zhong,Shijia Huang,Michael R. Lyu,Liwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2024 Findings (Short)
[NLP-138] A Two-Step Approach for Data-Efficient French Pronunciation Learning EMNLP2024
Link: https://arxiv.org/abs/2410.05698
Authors: Hoyeon Lee,Hyeeun Jang,Jong-Hwan Kim,Jae-Min Kim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2024 Main
[NLP-139] Unlocking the Boundaries of Thought: A Reasoning Granularity Framework to Quantify and Optimize Chain-of-Thought NEURIPS2024
Link: https://arxiv.org/abs/2410.05695
Authors: Qiguang Chen,Libo Qin,Jiaqi Wang,Jinxuan Zhou,Wanxiang Che
Categories: Computation and Language (cs.CL)
Comments: Accepted at NeurIPS2024 (Oral)
[NLP-140] Copiloting Diagnosis of Autism in Real Clinical Scenarios via LLMs
Link: https://arxiv.org/abs/2410.05684
Authors: Yi Jiang,Qingyang Shen,Shuzhong Lai,Shunyu Qi,Qian Zheng,Lin Yao,Yueming Wang,Gang Pan
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-141] Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective
Link: https://arxiv.org/abs/2410.05648
Authors: Xueying Bai,Yifan Sun,Niranjan Balasubramanian
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: COLM 2024
[NLP-142] DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
Link: https://arxiv.org/abs/2410.05639
Authors: Ranchi Zhao,Zhen Leng Thai,Yifan Zhang,Shengding Hu,Yunqi Ba,Jie Zhou,Jie Cai,Zhiyuan Liu,Maosong Sun
Categories: Computation and Language (cs.CL)
[NLP-143] Vector-ICL: In-context Learning with Continuous Vector Representations
Link: https://arxiv.org/abs/2410.05629
Authors: Yufan Zhuang,Chandan Singh,Liyuan Liu,Jingbo Shang,Jianfeng Gao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-144] Stereotype or Personalization? User Identity Biases Chatbot Recommendations
Link: https://arxiv.org/abs/2410.05613
Authors: Anjali Kantharuban,Jeremiah Milbauer,Emma Strubell,Graham Neubig
Categories: Computation and Language (cs.CL)
[NLP-145] Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
Link: https://arxiv.org/abs/2410.05608
Authors: Soyeon Caren Han,Feiqi Cao,Josiah Poon,Roberto Navigli
Categories: Computation and Language (cs.CL)
Comments: Accepted at ACM-MM 2024
[NLP-146] Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
Link: https://arxiv.org/abs/2410.05603
Authors: Zheyang Xiong,Ziyang Cai,John Cooper,Albert Ge,Vasilis Papageorgiou,Zack Sifakis,Angeliki Giannou,Ziqian Lin,Liu Yang,Saurabh Agarwal,Grigorios G Chrysos,Samet Oymak,Kangwook Lee,Dimitris Papailiopoulos
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-147] Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning EMNLP’24
Link: https://arxiv.org/abs/2410.05600
Authors: Ming Shan Hee,Aditi Kumaresan,Roy Ka-Wei Lee
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP’24 (Main)
[NLP-148] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Link: https://arxiv.org/abs/2410.05589
Authors: Zilin Xiao,Hongming Zhang,Tao Ge,Siru Ouyang,Vicente Ordonez,Dong Yu
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: work in progress
[NLP-149] Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
Link: https://arxiv.org/abs/2410.05584
Authors: Xueru Wen,Jie Lou,Yaojie Lu,Hongyu Lin,Xing Yu,Xinyu Lu,Ben He,Xianpei Han,Debing Zhang,Le Sun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-150] Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve? EMNLP2024
Link: https://arxiv.org/abs/2410.05581
Authors: Fırat Öncel,Matthias Bethge,Beyza Ermis,Mirco Ravanelli,Cem Subakan,Çağatay Yıldız
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 Main Conference
[NLP-151] ClaimBrush: A Novel Framework for Automated Patent Claim Refinement Based on Large Language Models
Link: https://arxiv.org/abs/2410.05575
Authors: Seiya Kawano,Hirofumi Nonaka,Koichiro Yoshino
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
[NLP-152] TaeBench: Improving Quality of Toxic Adversarial Examples
Link: https://arxiv.org/abs/2410.05573
Authors: Xuan Zhu,Dmitriy Bespalov,Liwen You,Ninad Kulkarni,Yanjun Qi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-153] Chain and Causal Attention for Efficient Entity Tracking EMNLP2024
Link: https://arxiv.org/abs/2410.05565
Authors: Erwan Fagnou,Paul Caillon,Blaise Delattre,Alexandre Allauzen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, EMNLP 2024 Main
[NLP-154] Rational Metareasoning for Large Language Models
Link: https://arxiv.org/abs/2410.05563
Authors: C. Nicolò De Sabbata,Theodore R. Sumers,Thomas L. Griffiths
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[NLP-155] Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification EMNLP
Link: https://arxiv.org/abs/2410.05559
Authors: Tao Meng,Ninareh Mehrabi,Palash Goyal,Anil Ramakrishna,Aram Galstyan,Richard Zemel,Kai-Wei Chang,Rahul Gupta,Charith Peris
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP Findings
[NLP-156] Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives EMNLP’24
Link: https://arxiv.org/abs/2410.05558
Authors: Xinliang Frederick Zhang,Nick Beauchamp,Lu Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP’24 Findings
[NLP-157] On Instruction-Finetuning Neural Machine Translation Models
Link: https://arxiv.org/abs/2410.05553
Authors: Vikas Raunak,Roman Grundkiewicz,Marcin Junczys-Dowmunt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: WMT’24
[NLP-158] Self-rationalization improves LLM as a fine-grained judge
Link: https://arxiv.org/abs/2410.05495
Authors: Prapti Trivedi,Aditya Gulati,Oliver Molenschot,Meghana Arakkal Rajeev,Rajkumar Ramamurthy,Keith Stevens,Tanveesh Singh Chaudhery,Jahnavi Jambholkar,James Zou,Nazneen Rajani
Categories: Computation and Language (cs.CL)
[NLP-159] Neural machine translation system for Lezgian, Russian and Azerbaijani languages
Link: https://arxiv.org/abs/2410.05472
Authors: Alidar Asvarov,Andrey Grabovoy
Categories: Computation and Language (cs.CL)
[NLP-160] From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency
Link: https://arxiv.org/abs/2410.05459
Authors: Kaiyue Wen,Huaqing Zhang,Hongzhou Lin,Jingzhao Zhang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 43 pages, 11 figures
[NLP-161] Interconnected Kingdoms: Comparing A Song of Ice and Fire Adaptations Across Media Using Complex Networks
Link: https://arxiv.org/abs/2410.05453
Authors: Arthur Amalvy,Madeleine Janickyj,Shane Mannion,Pádraig MacCarron,Vincent Labatut
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
[NLP-162] Task Diversity Shortens the ICL Plateau
Link: https://arxiv.org/abs/2410.05448
Authors: Jaeyeon Kim,Sehyun Kwon,Joo Young Choi,Jongho Park,Jaewoong Cho,Jason D. Lee,Ernest K. Ryu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
[NLP-163] Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation
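Quick read: This paper studies microtargeting strategies in climate change communication on social media, in particular how large language models (LLMs) can be used to analyze and explain such strategies and to assess their fairness. The key is a post-hoc analysis of Facebook advertisements with LLMs: evaluating how accurately they predict the intended demographic targets (such as gender and age group) and generating an explanation for each classification that reveals the thematic elements and strategies aimed at different audiences. The paper also runs a fairness analysis to identify potential biases in model predictions, providing a framework for future research on transparency, accountability, and inclusivity.
Link: https://arxiv.org/abs/2410.05401
Authors: Tunazzina Islam,Dan Goldwasser
Keywords: media increasingly employs, increasingly employs microtargeting, social media increasingly, Climate change communication, media increasingly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)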
Abstract:Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a post-hoc analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Facebook advertisements. Our analysis focuses on two key aspects: demographic targeting and fairness. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group, achieving an overall accuracy of 88.55%. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that young adults are primarily targeted through messages emphasizing activism and environmental consciousness, while women are engaged through themes related to caregiving roles and social advocacy. In addition to evaluating the effectiveness of LLMs in detecting microtargeted messaging, we conduct a comprehensive fairness analysis to identify potential biases in model predictions. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of senior citizens and male audiences. By showcasing the efficacy of LLMs in dissecting and explaining targeted communication strategies and by highlighting fairness concerns, this study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.
[NLP-164] LLMs Are In-Context Reinforcement Learners
Link: https://arxiv.org/abs/2410.05362
Authors: Giovanni Monea,Antoine Bosselut,Kianté Brantley,Yoav Artzi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[NLP-165] Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild NEURIPS2024
Link: https://arxiv.org/abs/2410.05357
Authors: Xinyu Zhao,Guoheng Sun,Ruisi Cai,Yukun Zhou,Pingzhi Li,Peihao Wang,Bowen Tan,Yexiao He,Li Chen,Yi Liang,Beidi Chen,Binhang Yuan,Hongyi Wang,Ang Li,Zhangyang Wang,Tianlong Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks Track
[NLP-166] Falcon Mamba: The First Competitive Attention-free 7B Language Model
Quick read: This paper addresses the inference-speed and memory bottlenecks of existing large language models and shows that a pure Mamba model can match or surpass Transformer-based and hybrid models. The key is Falcon Mamba 7B, a model built on the Mamba architecture that is significantly faster at inference and needs substantially less memory for long-sequence generation, while surpassing leading open-weight models such as Mistral 7B, Llama3.1 8B, and Falcon2 11B.
Link: https://arxiv.org/abs/2410.05355
Authors: Jingwei Zuo,Maksim Velikanov,Dhia Eddine Rhaiem,Ilyas Chahed,Younes Belkada,Guillaume Kunsch,Hakim Hacid
Keywords: Falcon Mamba, base large language, present Falcon Mamba, Mamba, large language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:In this technical report, we present Falcon Mamba 7B, a new base large language model based on the novel Mamba architecture. Falcon Mamba 7B is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, Falcon Mamba 7B surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B and RWKV-v6 Finch 7B/14B. Currently, Falcon Mamba 7B is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models, according to the Open LLM Leaderboard. Due to its architecture, Falcon Mamba 7B is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that hybrid Mamba-Transformer models outperform pure architecture designs, we demonstrate that even the pure Mamba design can achieve similar, or even superior results compared to the Transformer and hybrid designs. We make the weights of our implementation of Falcon Mamba 7B publicly available on this https URL, under a permissive license.
[NLP-167] EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts
Link: https://arxiv.org/abs/2410.05343
Authors: Yuto Haneji,Taichi Nishimura,Hirotaka Kameko,Keisuke Shirai,Tomoya Yoshida,Keiya Kajimura,Koki Yamamoto,Taiyu Cui,Tomohiro Nishimoto,Shinsuke Mori
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[NLP-168] Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion
Link: https://arxiv.org/abs/2410.05331
Authors: Guanchu Wang,Yu-Neng Chuang,Ruixiang Tang,Shaochen Zhong,Jiayi Yuan,Hongye Jin,Zirui Liu,Vipin Chaudhary,Shuai Xu,James Caverlee,Xia Hu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[NLP-169] Output Scouting: Auditing Large Language Models for Catastrophic Responses
Link: https://arxiv.org/abs/2410.05305
Authors: Andrew Bell,Joao Fonseca
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[NLP-170] Hate Speech Detection Using Cross-Platform Social Media Data In English and German Language
Link: https://arxiv.org/abs/2410.05287
Authors: Gautam Kishore Shahi,Tim A. Majchrzak
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
[NLP-171] Utility of Multimodal Large Language Models in Analyzing Chest X-ray with Incomplete Contextual Information
Link: https://arxiv.org/abs/2410.07111
Authors: Choonghan Kim,Seonhee Cho,Joo Heung Yoon
Categories: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[NLP-172] The OCON model: an old but gold solution for distributable supervised classification
Link: https://arxiv.org/abs/2410.05320
Authors: Stefano Giacomelli,Marco Giordano,Claudia Rinaldi
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at “2024 29th IEEE Symposium on Computers and Communications (ISCC): workshop on Next-Generation Multimedia Services at the Edge: Leveraging 5G and Beyond (NGMSE2024)”. arXiv admin note: text overlap with arXiv:2410.04098
Artificial Intelligence
[AI-0] MM-Ego: Towards Building Egocentric Multimodal LLMs
Link: https://arxiv.org/abs/2410.07177
Authors: Hanrong Ye,Haotian Zhang,Erik Daxberger,Lin Chen,Zongyu Lin,Yanghao Li,Bowen Zhang,Haoxuan You,Dan Xu,Zhe Gan,Jiasen Lu,Yinfei Yang
Keywords: comprehensively explore building, egocentric video understanding, research aims, aims to comprehensively, comprehensively explore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical Report
Abstract:This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models’ ability in recognizing and memorizing visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel “Memory Pointer Prompting” mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.
[AI-1] Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
Link: https://arxiv.org/abs/2410.07176
Authors: Fei Wang,Xingchen Wan,Ruoxi Sun,Jiefeng Chen,Sercan Ö. Arık
Keywords: large language models, Retrieval-Augmented Generation, Astute RAG, RAG, imperfect retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Retrieval-Augmented Generation (RAG), while effective in integrating external knowledge to address the limitations of large language models (LLMs), can be undermined by imperfect retrieval, which may introduce irrelevant, misleading, or even malicious information. Despite its importance, previous studies have rarely explored the behavior of RAG through joint analysis on how errors from imperfect retrieval attribute and propagate, and how potential conflicts arise between the LLMs’ internal knowledge and external sources. We find that imperfect retrieval augmentation might be inevitable and quite harmful, through controlled analysis under realistic conditions. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome in the post-retrieval stage of RAG. To render LLMs resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach that adaptively elicits essential information from LLMs’ internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.
[AI-2] Do better language models have crisper vision?
Link: https://arxiv.org/abs/2410.07173
Authors: Jona Ruthardt,Gertjan J. Burghouts,Serge Belongie,Yuki M. Asano
Keywords: text-only Large Language, text-only Large, Large Language Models, Large Language, visual world
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, existing studies have primarily focused on limited scenarios, such as their ability to generate visual content or cluster multimodal data. To this end, we propose the Visual Text Representation Benchmark (ViTeRB) to isolate key properties that make language models well-aligned with the visual world. With this, we identify large-scale decoder-based LLMs as ideal candidates for representing text in vision-centric contexts, counter to the current practice of utilizing text encoders. Building on these findings, we propose ShareLock, an ultra-lightweight CLIP-like model. By leveraging precomputable frozen features from strong vision and language models, ShareLock achieves an impressive 51% accuracy on ImageNet despite utilizing just 563k image-caption pairs. Moreover, training requires only 1 GPU hour (or 10 hours including the precomputation of features) - orders of magnitude less than prior methods. Code will be released.
[AI-3] One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
Link: https://arxiv.org/abs/2410.07170
Authors: Fabian Paischer,Lukas Hauzenberger,Thomas Schmied,Benedikt Alkin,Marc Peter Deisenroth,Sepp Hochreiter
Keywords: Foundation models, specific application, large-scale datasets, Foundation, uniform rank distribution
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 10 pages + references and appendix, code available at this https URL
Abstract:Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across model weights. Recent works focus on weight-driven initialization or learning of adaptive ranks during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, in turn leading to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and re-distribute ranks among all weight matrices to explain the maximal amount of variance and continue the standard LoRA fine-tuning procedure. This results in our new method Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and attains the highest average score across a multitude of tasks per domain.
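The initialization step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it runs a single SVD on one batch of activations (the abstract mentions minibatches), and the function names `top_right_singular_vectors`, `redistribute_ranks`, and `init_lora` are hypothetical. It assumes the usual LoRA convention of a down-projection `A` and a zero-initialized up-projection `B`.

```python
import numpy as np

def top_right_singular_vectors(activations):
    """SVD of an (n_samples, d_in) activation matrix.

    Returns the right-singular vectors (rows of vt) and the fraction of
    variance each one explains.
    """
    _, s, vt = np.linalg.svd(activations, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt, explained

def redistribute_ranks(explained_per_layer, total_rank):
    """Spend a global rank budget on the components that explain the most variance.

    explained_per_layer maps a layer name to its explained-variance ratios.
    """
    scored = [(ratio, layer)
              for layer, ratios in explained_per_layer.items()
              for ratio in ratios]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranks = {layer: 0 for layer in explained_per_layer}
    for _, layer in scored[:total_rank]:
        ranks[layer] += 1
    return ranks

def init_lora(activations, rank, d_out):
    """Initialize A with the top right-singular vectors and B with zeros."""
    vt, _ = top_right_singular_vectors(activations)
    A = vt[:rank].copy()          # shape (rank, d_in)
    B = np.zeros((d_out, rank))   # zero B keeps the initial update B @ A at zero
    return A, B
```

With `B` set to zero the adapted layer starts out identical to the pre-trained one, so only the direction of the first updates is affected by the data-driven choice of `A`.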
[AI-4] Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making NEURIPS2024
Link: https://arxiv.org/abs/2410.07166
Authors: Manling Li,Shiyu Zhao,Qineng Wang,Kangrui Wang,Yu Zhou,Sanjana Srivastava,Cem Gokmen,Tony Lee,Li Erran Li,Ruohan Zhang,Weiyu Liu,Percy Liang,Li Fei-Fei,Jiayuan Mao,Jiajun Wu
Keywords: Large Language Models, evaluate Large Language, Language Models, Large Language, evaluate Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for oral presentation at NeurIPS 2024 in the Datasets and Benchmarks track
Abstract:We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs’ performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making.
[AI-5] Complex Logical Query Answering by Calibrating Knowledge Graph Completion Models
Link: https://arxiv.org/abs/2410.07165
Authors: Changyi Xiao,Yixin Cao
Keywords: complex logical queries, KGC models, Complex logical query, Complex logical, involves finding answer
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Complex logical query answering (CLQA) is a challenging task that involves finding answer entities for complex logical queries over incomplete knowledge graphs (KGs). Previous research has explored the use of pre-trained knowledge graph completion (KGC) models, which can predict the missing facts in KGs, to answer complex logical queries. However, KGC models are typically evaluated using ranking evaluation metrics, which may result in values of predictions of KGC models that are not well-calibrated. In this paper, we propose a method for calibrating KGC models, namely CKGC, which enables KGC models to adapt to answering complex logical queries. Notably, CKGC is lightweight and effective. The adaptation function is simple, allowing the model to quickly converge during the adaptation process. The core concept of CKGC is to map the values of predictions of KGC models to the range [0, 1], ensuring that values associated with true facts are close to 1, while values linked to false facts are close to 0. Through experiments on three benchmark datasets, we demonstrate that our proposed calibration method can significantly boost model performance in the CLQA task. Moreover, our approach can enhance the performance of CLQA while preserving the ranking evaluation metrics of KGC models. The code is available at this https URL.
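The core idea stated above, mapping raw prediction values into [0, 1] without disturbing their ranking, can be illustrated with a small sketch. The abstract does not give the actual adaptation function, so the two-parameter sigmoid used here, and the names `calibrate` and `fit_calibration`, are assumptions chosen only because a monotone map preserves ranking metrics.

```python
import math

def calibrate(score, scale, shift):
    """Map a raw score to (0, 1) with a monotone sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-scale * (score - shift)))

def fit_calibration(scores, labels, lr=0.1, steps=2000):
    """Fit scale and shift by gradient descent on binary cross-entropy.

    labels are 1 for true facts and 0 for false facts.
    """
    scale, shift = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        g_scale = g_shift = 0.0
        for s, y in zip(scores, labels):
            err = calibrate(s, scale, shift) - y  # gradient w.r.t. the logit
            g_scale += err * (s - shift)
            g_shift += err * (-scale)
        scale -= lr * g_scale / n
        shift -= lr * g_shift / n
        scale = max(scale, 1e-6)  # a positive scale keeps the map increasing
    return scale, shift
```

Because the map is strictly increasing, sorting candidates by calibrated value gives the same order as sorting by raw score, which is consistent with the abstract's claim that ranking metrics are preserved.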
[AI-6] Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Link: https://arxiv.org/abs/2410.07163
Authors: Chongyu Fan,Jiancheng Liu,Licong Lin,Jinghan Jia,Ruiqi Zhang,Song Mei,Sijia Liu
Keywords: harmful content generation, essential model utilities, remove unwanted data, unwanted data influences, large language model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:In this work, we address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities (e.g., copyrighted data or harmful content generation) while preserving essential model utilities, without the need for retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework remains lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO’s effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that ‘simplicity’ in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO’s advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO’s superiority over existing unlearning baselines in benchmarks like TOFU and MUSE, and robustness against relearning attacks. Codes are available at this https URL.
[AI-7] Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond
Link: https://arxiv.org/abs/2410.07158
Authors: Dilyara Bareeva,Galip Ümit Yolcu,Anna Hedström,Niklas Schmolenski,Thomas Wiegand,Wojciech Samek,Sebastian Lapuschkin
Keywords: training data attribution, TDA methods, recent years, training data, neural networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:In recent years, training data attribution (TDA) methods have emerged as a promising direction for the interpretability of neural networks. While research around TDA is thriving, limited effort has been dedicated to the evaluation of attributions. Similar to the development of evaluation metrics for traditional feature attribution approaches, several standalone metrics have been proposed to evaluate the quality of TDA methods across various contexts. However, the lack of a unified framework that allows for systematic comparison limits trust in TDA methods and stunts their widespread adoption. To address this research gap, we introduce Quanda, a Python toolkit designed to facilitate the evaluation of TDA methods. Beyond offering a comprehensive set of evaluation metrics, Quanda provides a uniform interface for seamless integration with existing TDA implementations across different repositories, thus enabling systematic benchmarking. The toolkit is user-friendly, thoroughly tested, well-documented, and available as an open-source library on PyPi and under this https URL.
[AI-8] InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
Link: https://arxiv.org/abs/2410.07157
Authors: Bowen Jin,Ziqi Pang,Bingjun Guo,Yu-Xiong Wang,Jiaxuan You,Jiawei Han
Keywords: generating images, overlooked yet critical, graph, multimodal attributed graphs, critical task
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 16 pages
Abstract:In this paper, we approach an overlooked yet critical task Graph2Image: generating images from multimodal attributed graphs (MMAGs). This task poses significant challenges due to the explosion in graph size, dependencies among graph entities, and the need for controllability in graph conditions. To address these challenges, we propose a graph context-conditioned diffusion model called InstructG2I. InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling by combining personalized page rank and re-ranking based on vision-language features. Then, a Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process of diffusion. Finally, we propose graph classifier-free guidance, enabling controllable generation by varying the strength of graph guidance and multiple connected edges to a node. Extensive experiments conducted on three datasets from different domains demonstrate the effectiveness and controllability of our approach. The code is available at this https URL.
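The neighbor sampling step mentioned above (personalized PageRank followed by re-ranking) might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's code: the graph is a plain adjacency dict, the `similarity` argument stands in for the vision-language feature scores, and the names `personalized_pagerank`, `sample_neighbors`, and `pool_factor` are hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration for PageRank with restarts concentrated on `seed`.

    adj maps each node to a list of its out-neighbors.
    """
    nodes = list(adj)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: (alpha if n == seed else 0.0) for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                # A dangling node sends its mass back to the seed.
                nxt[seed] += (1 - alpha) * rank[n]
                continue
            share = (1 - alpha) * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

def sample_neighbors(adj, seed, k, similarity, pool_factor=2):
    """Take the top pool_factor*k nodes by PageRank, then re-rank by similarity."""
    rank = personalized_pagerank(adj, seed)
    candidates = sorted((n for n in rank if n != seed),
                        key=rank.get, reverse=True)[:pool_factor * k]
    return sorted(candidates, key=lambda n: similarity[n], reverse=True)[:k]
```

The two-stage design keeps the expensive feature comparison limited to a small structural candidate pool, which is one plausible reading of how the abstract combines graph structure with multimodal information.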
[AI-9] Graph Network Models To Detect Illicit Transactions In Block Chain
Link: https://arxiv.org/abs/2410.07150
Authors: Hrushyang Adloori,Vaishnavi Dasanapu,Abhijith Chandra Mergu
Keywords: traditional rule-based approaches, Elliptic Bitcoin Transaction, cryptocurrencies has led, traditional rule-based, rule-based approaches
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 7 figures
Abstract:The use of cryptocurrencies has led to an increase in illicit activities such as money laundering, with traditional rule-based approaches becoming less effective in detecting and preventing such activities. In this paper, we propose a novel approach to tackling this problem by applying graph attention networks with residual network-like architecture (GAT-ResNet) to detect illicit transactions related to anti-money laundering/combating the financing of terrorism (AML/CFT) in blockchains. We train various models on the Elliptic Bitcoin Transaction dataset, implementing logistic regression, Random Forest, XGBoost, GCN, GAT, and our proposed GAT-ResNet model. Our results demonstrate that the GAT-ResNet model has a potential to outperform the existing graph network models in terms of accuracy, reliability and scalability. Our research sheds light on the potential of graph related machine learning models to improve efforts to combat financial crime and lays the foundation for further research in this area.
[AI-10] Taking a turn for the better: Conversation redirection throughout the course of mental-health therapy EMNLP
Link: https://arxiv.org/abs/2410.07147
Authors: Vivian Nguyen,Sang Min Jung,Lillian Lee,Thomas D. Hull,Cristian Danescu-Niculescu-Mizil
Keywords: Mental-health therapy involves, therapists continuously negotiate, complex conversation flow, Mental-health therapy, involves a complex
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: To appear in the Proceedings of EMNLP (Findings) 2024. Code available at this https URL
Abstract:Mental-health therapy involves a complex conversation flow in which patients and therapists continuously negotiate what should be talked about next. For example, therapists might try to shift the conversation’s direction to keep the therapeutic process on track and avoid stagnation, or patients might push the discussion towards issues they want to focus on. How do such patient and therapist redirections relate to the development and quality of their relationship? To answer this question, we introduce a probabilistic measure of the extent to which a certain utterance immediately redirects the flow of the conversation, accounting for both the intention and the actual realization of such a change. We apply this new measure to characterize the development of patient-therapist relationships over multiple sessions in a very large, widely-used online therapy platform. Our analysis reveals that (1) patient control of the conversation’s direction generally increases relative to that of the therapist as their relationship progresses; and (2) patients who have less control in the first few sessions are significantly more likely to eventually express dissatisfaction with their therapist and terminate the relationship.
[AI-11] Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling
Link: https://arxiv.org/abs/2410.07145
Authors: Yingfa Chen,Xinrong Zhang,Shengding Hu,Xu Han,Zhiyuan Liu,Maosong Sun
Keywords: linear computational complexity, recurrent neural networks, handling long sequences, neural networks, Mamba and RWKV
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 18 figures
Abstract:One essential advantage of recurrent neural networks (RNNs) over transformer-based language models is their linear computational complexity concerning the sequence length, which makes them much faster in handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences with less than 10K tokens, and their effectiveness in longer contexts remains largely unsatisfying so far. In this paper, we study the cause of the inability to process long context for RNNs and suggest critical mitigations. We examine two practical concerns when applying state-of-the-art RNNs to long contexts: (1) the inability to extrapolate to inputs longer than the training length and (2) the upper bound of memory capacity. Addressing the first concern, we first investigate state collapse (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training. With controlled experiments, we attribute this to overfitting due to the recurrent state being overparameterized for the training length. For the second concern, we train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval. Then, three SC mitigation methods are proposed to improve Mamba-2’s length generalizability, allowing the model to process more than 1M tokens without SC. We also find that the recurrent state capacity in passkey retrieval scales exponentially to the state size, and we empirically train a Mamba-2 370M with near-perfect passkey retrieval accuracy on 256K context length. This suggests a promising future for RNN-based long-context modeling.
[AI-12] Natural Language Query Engine for Relational Databases using Generative AI
Link: https://arxiv.org/abs/2410.07144
Authors: Steve Tueno Fotso
Keywords: data-driven decision-making highlights, analyze information stored, growing reliance, reliance on data-driven, data-driven decision-making
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Artificial Intelligence, Machine Learning, Generative AI, SQL, Relational Database, SQL Correctness
Abstract:The growing reliance on data-driven decision-making highlights the need for more intuitive ways to access and analyze information stored in relational databases. However, the requirement of SQL knowledge has long been a significant barrier for non-technical users. This article introduces an innovative solution that leverages Generative AI to bridge this gap, enabling users to query databases using natural language. Our approach automatically translates natural language queries into SQL, ensuring both syntactic and semantic correctness, while also generating clear, natural language responses from the retrieved data. By streamlining the interaction between users and databases, this method empowers individuals without technical expertise to engage with data directly and efficiently, democratizing access to valuable insights and enhancing productivity.
[AI-13] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Link: https://arxiv.org/abs/2410.07137
Authors: Xiaosen Zheng,Tianyu Pang,Chao Du,Qian Liu,Jing Jiang,Min Lin
Keywords: language models due, evaluating language models, win rates, human evaluation, popular for evaluating
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a “null model” that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are transferable because we assume that the instructions of these benchmarks (e.g., 805 samples of AlpacaEval 2.0) are private and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at this https URL.
[AI-14] Mental Disorders Detection in the Era of Large Language Models
Link: https://arxiv.org/abs/2410.07129
Authors: Gleb Kuzmin,Petr Strepetov,Maksim Stankevich,Ivan Smirnov,Artem Shelmanov
Keywords: machine learning methods, traditional machine learning, paper compares, machine learning, task of detecting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five datasets were considered, each differing in format and the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
[AI-15] Cross-Task Pretraining for Cross-Organ Cross-Scanner Adenocarcinoma Segmentation MICCAI2024
Link: https://arxiv.org/abs/2410.07124
Authors: Adrian Galdran
Keywords: Cross-Scanner Adenocarcinoma Segmentation, short abstract describes, histopathological image patches, Cross-Scanner Adenocarcinoma, Adenocarcinoma Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI2024 COSAS Challenge - short abstract
Abstract:This short abstract describes a solution to the COSAS 2024 competition on Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation from histopathological image patches. The main challenge in the task of segmenting this type of cancer is a noticeable domain shift encountered when changing acquisition devices (microscopes) and also when tissue comes from different organs. The two tasks proposed in COSAS were to train on a dataset of images from three different organs, and then predict segmentations on data from unseen organs (dataset T1), and to train on a dataset of images acquired on three different scanners and then segment images acquired with another unseen microscope. We attempted to bridge the domain shift gap by experimenting with three different strategies: standard training for each dataset, pretraining on dataset T1 and then fine-tuning on dataset T2 (and vice-versa, a strategy we call Cross-Task Pretraining), and training on the combination of dataset A and B. Our experiments showed that Cross-Task Pretraining is a more promising approach to domain generalization.
[AI-16] End-Cloud Collaboration Framework for Advanced AI Customer Service in E-commerce
Link: https://arxiv.org/abs/2410.07122
Authors: Liangyu Teng,Yang Liu,Jing Liu,Liang Song
Keywords: customer service solutions, end model, AI-driven customer service, model, advanced AI-driven customer
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by 2024 IEEE 10th World Forum on Internet of Things (WF-IoT)
Abstract:In recent years, the e-commerce industry has seen a rapid increase in the demand for advanced AI-driven customer service solutions. Traditional cloud-based models face limitations in terms of latency, personalized services, and privacy concerns. Furthermore, end devices often lack the computational resources to deploy large AI models effectively. In this paper, we propose an innovative End-Cloud Collaboration (ECC) framework for advanced AI customer service in e-commerce. This framework integrates the advantages of large cloud models and mid/small-sized end models by deeply exploring the generalization potential of cloud models and effectively utilizing the computing power resources of terminal chips, alleviating the strain on computing resources to some extent. Specifically, the large cloud model acts as a teacher, guiding and promoting the learning of the end model, which significantly reduces the end model’s reliance on large-scale, high-quality data and thereby addresses the data bottleneck in traditional end model training, offering a new paradigm for the rapid deployment of industry applications. Additionally, we introduce an online evolutive learning strategy that enables the end model to continuously iterate and upgrade based on guidance from the cloud model and real-time user feedback. This strategy ensures that the model can flexibly adapt to the rapid changes in application scenarios while avoiding the uploading of sensitive information by performing local fine-tuning, achieving the dual goals of privacy protection and personalized service. To conclude, we implement in-depth corpus collection (e.g., data organization, cleaning, and preprocessing) and train an ECC-based industry-specific model for e-commerce customer service.
[AI-17] Transfer Learning for E-commerce Query Product Type Prediction
Link: https://arxiv.org/abs/2410.07121
Authors: Anna Tigunova,Thomas Ricatte,Ghadir Eraisha
Keywords: e-commerce search engines, good understanding, intent is essential, correct product type, customer intent
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Getting a good understanding of the customer intent is essential in e-commerce search engines. In particular, associating the correct product type to a search query plays a vital role in surfacing correct products to the customers. Query product type classification (Q2PT) is a particularly challenging task because search queries are short and ambiguous, and the number of existing product categories is extremely large, spanning thousands of values. Moreover, international marketplaces face additional challenges, such as language and dialect diversity and cultural differences, influencing the interpretation of the query. In this work we focus on Q2PT prediction in the global multilocale e-commerce markets. The common approach of training Q2PT models for each locale separately shows significant performance drops in low-resource stores. Moreover, this method does not allow for a smooth expansion to a new country, requiring to collect the data and train a new locale-specific Q2PT model from scratch. To tackle this, we propose to use transfer learning from the high-resource to the low-resource locales, to achieve global parity of Q2PT performance. We benchmark the per-locale Q2PT model against the unified one, which shares the training data and model structure across all worldwide stores. Additionally, we compare locale-aware and locale-agnostic Q2PT models, showing the task dependency on the country-specific traits. We conduct extensive quantitative and qualitative analysis of Q2PT models on the large-scale e-commerce dataset across 20 worldwide locales, which shows that unified locale-aware Q2PT model has superior performance over the alternatives.
[AI-18] Thing2Reality: Transforming 2D Content into Conditioned Multiviews and 3D Gaussian Objects for XR Communication
Link: https://arxiv.org/abs/2410.07119
Authors: Erzhen Hu,Mingyi Li,Jungtaek Hong,Xun Qian,Alex Olwal,David Kim,Seongkook Heo,Ruofei Du
Keywords: enhance mutual understanding, product designs, mutual understanding, digital assets, digital
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages (15 pages without references), 13 figures
Abstract:During remote communication, participants often share both digital and physical content, such as product designs, digital assets, and environments, to enhance mutual understanding. Recent advances in augmented communication have enabled users to swiftly create and share digital 2D copies of physical objects from video feeds into a shared space. However, conventional 2D representations of digital objects restrict users’ ability to spatially reference items in a shared immersive environment. To address this, we propose Thing2Reality, an Extended Reality (XR) communication platform that enhances spontaneous discussions of both digital and physical items during remote sessions. With Thing2Reality, users can quickly materialize ideas or physical objects in immersive environments and share them as conditioned multiview renderings or 3D Gaussians. Thing2Reality enables users to interact with remote objects or discuss concepts in a collaborative manner. Our user study revealed that the ability to interact with and manipulate 3D representations of objects significantly enhances the efficiency of discussions, with the potential to augment discussion of 2D artifacts.
[AI-19] System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam
Link: https://arxiv.org/abs/2410.07114
Authors: Joost de Winter,Dimitra Dodou,Yke Bauke Eisma
Keywords: processes underlying human, underlying human cognition, involves fast, involves slow, intuitive thinking
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:The processes underlying human cognition are often divided into two systems: System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the O1 model series, specifically designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the O1-preview model twice on the Dutch ‘Mathematics B’ final exam. It scored a near-perfect 76 and 73 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 61 out of 76, well above the Dutch average of 40.63 points. The O1-preview model completed the exam in around 10 minutes, while GPT-4o took 3 minutes, and neither model had access to the exam figures. Although O1-preview had the ability to achieve a perfect score, its performance showed some variability, as it made occasional mistakes with repeated prompting. This suggests that the self-consistency method, where the consensus output is selected, could improve accuracy. We conclude that while OpenAI’s new model series holds great potential, certain risks must be considered.
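The self-consistency method mentioned at the end of this abstract, where the consensus output across repeated prompts is selected, can be sketched as a simple majority vote. This is only an illustration: `ask_model` is a hypothetical callable standing in for whatever interface returns one answer per call, and the study itself does not describe an implementation.

```python
from collections import Counter

def self_consistent_answer(ask_model, question, n_samples=5):
    """Query the model several times and return the most common answer.

    Also returns the share of samples that agreed with that answer.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples
```

The agreement share gives a rough signal of how stable the answer is: a low value suggests that the question is one where repeated prompting produces the kind of variability the abstract reports.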
[AI-20] VHELM: A Holistic Evaluation of Vision Language Models NEURIPS2024
Link: https://arxiv.org/abs/2410.07112
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Somerville Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, Percy Liang
Keywords: assessing vision-language models, Current benchmarks, assessing vision-language, neglect other critical, Vision Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024. First three authors contributed equally
Abstract:Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (this https URL). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.
[AI-21] I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy
Link: https://arxiv.org/abs/2410.07109
Authors: Gian Maria Campedelli, Nicolò Penzo, Massimo Stefan, Roberto Dessì, Marco Guerini, Bruno Lepri, Jacopo Staiano
Keywords: Large Language Model, Large Language, Stanford Prison Experiment, anticipate emergent phenomena, Language Model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Abstract:As Large Language Model (LLM)-based agents become increasingly autonomous and interact more freely with each other, studying interactions between them becomes crucial to anticipate emergent phenomena and potential risks. Drawing inspiration from the widely popular Stanford Prison Experiment, we contribute to this line of research by studying interaction patterns of LLM agents in a context characterized by strict social hierarchy. We do so by specifically studying two types of phenomena: persuasion and anti-social behavior in simulated scenarios involving a guard and a prisoner agent who seeks to achieve a specific goal (i.e., obtaining additional yard time or escaping from prison). Leveraging 200 experimental scenarios for a total of 2,000 machine-machine conversations across five different popular LLMs, we provide a set of noteworthy findings. First, we document how some models consistently fail in carrying out a conversation in our multi-agent setup where power dynamics are at play. Second, for the models that were able to engage in successful interactions, we empirically show how the goal that an agent is set to achieve impacts primarily its persuasiveness, while having a negligible effect with respect to the agent’s anti-social behavior. Third, we highlight how agents’ personas, and particularly the guard’s personality, drive both the likelihood of successful persuasion from the prisoner and the emergence of anti-social behaviors. Fourth, we show that even without explicitly prompting for specific personalities, anti-social behavior emerges by simply assigning agents’ roles. These results bear implications for the development of interactive LLM agents as well as the debate on their societal impact.
[AI-22] FAIR GPT: A virtual consultant for research data management in ChatGPT
Link: https://arxiv.org/abs/2410.07108
Authors: Renat Shigapov, Irene Schumm
Keywords: FAIR GPT, virtual consultant, consultant in ChatGPT, ChatGPT designed, organizations make
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 4 pages, 2 figures, 1 table
Abstract:FAIR GPT is the first virtual consultant in ChatGPT designed to help researchers and organizations make their data and metadata compliant with the FAIR (Findable, Accessible, Interoperable, Reusable) principles. It provides guidance on metadata improvement, dataset organization, and repository selection. To ensure accuracy, FAIR GPT uses external APIs to assess dataset FAIRness, retrieve controlled vocabularies, and recommend repositories, minimizing hallucination and improving precision. It also assists in creating documentation (data and software management plans, README files, and codebooks) and selecting proper licenses. This paper describes its features, applications, and limitations.
[AI-23] Identifying and Addressing Delusions for Target-Directed Decision-Making
Link: https://arxiv.org/abs/2410.07096
Authors: Mingde Zhao, Tristan Sylvain, Doina Precup, Yoshua Bengio
Keywords: decision-time planning, behaviors and achieve, produce targets, behaviors
Categories: Artificial Intelligence (cs.AI)
Abstract:We are interested in target-directed agents, which produce targets during decision-time planning, to guide their behaviors and achieve better generalization during evaluation. Improper training of these agents can result in delusions: the agent may come to hold false beliefs about the targets, which cannot be properly rejected, leading to unwanted behaviors and damaging out-of-distribution generalization. We identify different types of delusions by using intuitive examples in carefully controlled environments, and investigate their causes. We demonstrate how delusions can be addressed for agents trained by hindsight relabeling, a mainstream approach for training target-directed RL agents. We validate empirically the effectiveness of the proposed solutions in correcting delusional behaviors and improving out-of-distribution generalization.
[AI-24] An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots
Link: https://arxiv.org/abs/2410.07094
Authors: Ebube Alor, Ahmad Abdellatif, SayedHassan Khatoonabadi, Emad Shihab
Keywords: enhancing development processes, increasingly gaining attention, Software engineering, Natural Language Understanding, development processes
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Software Engineering for review
Abstract:Software engineering (SE) chatbots are increasingly gaining attention for their role in enhancing development processes. At the core of chatbots are the Natural Language Understanding platforms (NLUs), which enable them to comprehend and respond to user queries. Before deploying NLUs, there is a need to train them with labeled data. However, acquiring such labeled data for SE chatbots is challenging due to the scarcity of high-quality datasets. This challenge arises because training SE chatbots requires specialized vocabulary and phrases not found in typical language datasets. Consequently, chatbot developers often resort to manually annotating user queries to gather the data necessary for training effective chatbots, a process that is both time-consuming and resource-intensive. Previous studies propose approaches to support chatbot practitioners in annotating users’ posed queries. However, these approaches require human intervention to generate rules, called labeling functions (LFs), that identify and categorize user queries based on specific patterns in the data. To address this issue, we propose an approach to automatically generate LFs by extracting patterns from labeled user queries. We evaluate the effectiveness of our approach by applying it to the queries of four diverse SE datasets (namely AskGit, MSA, Ask Ubuntu, and Stack Overflow) and measure the performance improvement gained from training the NLU on the queries labeled by the generated LFs. We find that the generated LFs effectively label data with AUC scores of up to 85.3%, and NLU’s performance improvement of up to 27.2% across the studied datasets. Furthermore, our results show that the number of LFs used to generate LFs affects the labeling performance. We believe that our approach can save time and resources in labeling users’ queries, allowing practitioners to focus on core chatbot functionalities.
[AI-25] MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Link: https://arxiv.org/abs/2410.07076
Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
Keywords: Scientific discovery contributes, human society prosperity, discovery contributes largely, recent progress shows, Scientific discovery
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code and Benchmark are available at this https URL
Abstract:Scientific discovery contributes largely to human society’s prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can result from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can arrive at a hypothesis; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or a similar level in 2024 (all papers have only been available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus containing the ground truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity to the ground truth ones, covering the main innovations.
[AI-26] Retrieval-Augmented Decision Transformer: External Memory for In-context RL
Link: https://arxiv.org/abs/2410.07071
Authors: Thomas Schmied, Fabian Paischer, Vihang Patil, Markus Hofmarcher, Razvan Pascanu, Sepp Hochreiter
Keywords: Reinforcement Learning, model to learn, task by observing, ICL, In-context learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP, this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent’s context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid-worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.
[AI-27] ReIFE: Re-evaluating Instruction-Following Evaluation
Link: https://arxiv.org/abs/2410.07069
Authors: Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan
Keywords: large language models, assess response quality, base LLMs, evaluation, evaluation protocols
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: GitHub Repo: this https URL, Evaluation Result Collection: this https URL
Abstract:The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently proposed evaluation protocols, on 4 human-annotated datasets, assessing the evaluation accuracy of the LLM-evaluators. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness. Moreover, our large-scale evaluation reveals: (1) Base LLM performance ranking remains largely consistent across evaluation protocols, with less capable LLMs showing greater improvement from protocol enhancements; (2) Robust evaluation of evaluation protocols requires many base LLMs with varying capability levels, as protocol effectiveness can depend on the base LLM used; (3) Evaluation results on different datasets are not always consistent, so a rigorous evaluation requires multiple datasets with distinctive features. We release our meta-evaluation suite ReIFE, which provides the codebase and evaluation result collection for more than 500 LLM-evaluator configurations, to support future research in instruction-following evaluation.
[AI-28] Emergent properties with repeated examples
Link: https://arxiv.org/abs/2410.07041
Authors: François Charton, Julia Kempe
Keywords: algorithmically generated datasets, algorithmically generated, outperform models trained, models trained, generated datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:We study the performance of transformers as a function of the number of repetitions of training examples with algorithmically generated datasets. On three problems of mathematics (the greatest common divisor, modular multiplication, and matrix eigenvalues), we show that for a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples. We also demonstrate that two-set training (repeated use of a small random subset of examples, along with normal sampling on the rest of the training set) provides for faster learning and better performance. This highlights that the benefits of repetition can outweigh those of data diversity. These datasets and problems provide a controlled setting to shed light on the still poorly understood interplay between generalization and memorization in deep learning.
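The two-set training idea described in the abstract above can be sketched as a batch sampler that mixes a small, frequently repeated subset with ordinary sampling from the remaining examples. This is only an illustration under stated assumptions: the function name `make_two_set_sampler` and the parameters `repeated_size` and `repeated_prob` are hypothetical, and the values used below are not taken from the paper.

```python
import random


def make_two_set_sampler(examples, repeated_size, repeated_prob, seed=0):
    """Build a batch sampler for two-set training (illustrative sketch).

    A random subset of `repeated_size` examples is set aside. Each batch item
    is drawn from that subset with probability `repeated_prob` (an assumed
    knob), otherwise from the rest of the training set.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    repeated = shuffled[:repeated_size]   # small set, seen many times
    rest = shuffled[repeated_size:]       # large set, sampled normally
    if not repeated or not rest:
        raise ValueError("both the repeated set and the rest must be non-empty")

    def sample_batch(batch_size):
        batch = []
        for _ in range(batch_size):
            pool = repeated if rng.random() < repeated_prob else rest
            batch.append(rng.choice(pool))
        return batch

    return sample_batch, repeated, rest


# Hypothetical setup: 1,000 examples, 10 of them repeated in about 25% of draws
sample_batch, repeated, rest = make_two_set_sampler(range(1000), 10, 0.25)
print(len(sample_batch(64)))
```

With these example settings, each of the 10 repeated examples is drawn far more often than any single example from the remaining 990, which is the asymmetry the abstract refers to.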
[AI-29] PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
Link: https://arxiv.org/abs/2410.07035
Authors: Zekun Wang, Feiyu Duan, Yibo Zhang, Wangchunshu Zhou, Ke Xu, Wenhao Huang, Jie Fu
Keywords: Large Language Models, Large Language, demonstrate impressive capabilities, including role-playing, creative writing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 39 pages. CP-Bench and LenCtrl-Bench are available in this https URL and this https URL
Abstract:Large Language Models (LLMs) demonstrate impressive capabilities across various domains, including role-playing, creative writing, mathematical reasoning, and coding. Despite these advancements, LLMs still encounter challenges with length control, frequently failing to adhere to specific length constraints due to their token-level operations and insufficient training on data with strict length limitations. We identify this issue as stemming from a lack of positional awareness and propose novel approaches, PositionID Prompting and PositionID Fine-Tuning, to address it. These methods enhance the model’s ability to continuously monitor and manage text length during generation. Additionally, we introduce PositionID CP Prompting to enable LLMs to perform copy and paste operations accurately. Furthermore, we develop two benchmarks for evaluating length control and copy-paste abilities. Our experiments demonstrate that our methods significantly improve the model’s adherence to length constraints and copy-paste accuracy without compromising response quality.
[AI-30] Tri-Level Navigator: LLM-Empowered Tri-Level Learning for Time Series OOD Generalization NEURIPS2024
Link: https://arxiv.org/abs/2410.07018
Authors: Chengtao Jian, Kai Yang, Yang Jiao
Keywords: area of study, Large Language Models, burgeoning area, machine learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2024
Abstract:Out-of-Distribution (OOD) generalization in machine learning is a burgeoning area of study. Its primary goal is to enhance the adaptability and resilience of machine learning models when faced with new, unseen, and potentially adversarial data that significantly diverges from their original training datasets. In this paper, we investigate time series OOD generalization via pre-trained Large Language Models (LLMs). We first propose a novel Tri-level learning framework for Time Series OOD generalization, termed TTSO, which considers both sample-level and group-level uncertainties. This formula offers a fresh theoretic perspective for formulating and analyzing the OOD generalization problem. In addition, we provide a theoretical analysis to justify that this method is well motivated. We then develop a stratified localization algorithm tailored for this tri-level optimization problem, theoretically demonstrating the guaranteed convergence of the proposed algorithm. Our analysis also reveals that the iteration complexity to obtain an $\epsilon$-stationary point is bounded by $O(\frac{1}{\epsilon^2})$. Extensive experiments on real-world datasets have been conducted to elucidate the effectiveness of the proposed method.
[AI-31] Pap2Pat: Towards Automated Paper-to-Patent Drafting using Chunk-based Outline-guided Generation
Link: https://arxiv.org/abs/2410.07009
Authors: Valentin Knappich, Simon Razniewski, Anna Hätty, Annemarie Friedrich
Keywords: offering practical applications, large language models, natural language processing, providing challenging benchmarks, offering practical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:The patent domain is gaining attention in natural language processing research, offering practical applications in streamlining the patenting process and providing challenging benchmarks for large language models (LLMs). However, the generation of the description sections of patents, which constitute more than 90% of the patent document, has not been studied to date. We address this gap by introducing the task of outline-guided paper-to-patent generation, where an academic paper provides the technical specification of the invention and an outline conveys the desired patent structure. We present PAP2PAT, a new challenging benchmark of 1.8k patent-paper pairs with document outlines, collected using heuristics that reflect typical research lab practices. Our experiments with current open-weight LLMs and outline-guided chunk-based generation show that they can effectively use information from the paper but struggle with repetitions, likely due to the inherent repetitiveness of patent language. We release our data and code.
[AI-32] CursorCore: Assist Programming through Aligning Anything
Link: https://arxiv.org/abs/2410.07002
Authors: Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
Keywords: Large language models, Large language, programming assistance tasks, Assist Programming Eval, successfully applied
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract:Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. First, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing and contributes to the advancement of coding assistants. Code, models and data are freely available at this https URL.
[AI-33] Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Link: https://arxiv.org/abs/2410.06981
Authors: Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
Keywords: similarly represent concepts, large language models, models similarly represent, investigate feature universality, intermediate layers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones. This makes it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics like Singular Value Canonical Correlation Analysis to analyze these SAE features across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
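The step of matching feature neurons across models via activation correlation, mentioned in the abstract above, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `match_features` is hypothetical, and it assumes both SAEs were run on the same token sequence so that the rows of the two activation matrices line up.

```python
import numpy as np


def match_features(acts_a, acts_b):
    """Pair each feature of model A with its most correlated feature of model B.

    acts_a: array of shape (n_tokens, n_features_a), SAE activations of model A
    acts_b: array of shape (n_tokens, n_features_b), SAE activations of model B
    Returns (best_index, best_corr), both of length n_features_a.
    """
    # Center and L2-normalize each column so that the dot product of two
    # columns equals their Pearson correlation.
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)
    b = b / (np.linalg.norm(b, axis=0) + 1e-12)
    corr = a.T @ b                      # shape (n_features_a, n_features_b)
    best = corr.argmax(axis=1)          # most correlated B feature per A feature
    return best, corr[np.arange(corr.shape[0]), best]


# Synthetic check: model B's features are a rescaled permutation of model A's
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 4))
acts_b = acts_a[:, [2, 0, 3, 1]] * 3.0 + 1.0
best, scores = match_features(acts_a, acts_b)
print(best, scores.round(3))
```

This greedy argmax can map several features of model A to the same feature of model B; whether and how the paper enforces one-to-one pairing is not stated in the abstract, so that choice is left open here.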
[AI-34] Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification
Link: https://arxiv.org/abs/2410.06977
Authors: Chenyue Li, Shuoyi Chen, Mang Ye
Keywords: holding significant importance, involves utilizing visual, utilizing visual technology, identify specific individuals, ReID involves utilizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Wildlife ReID involves utilizing visual technology to identify specific individuals of wild animals in different scenarios, holding significant importance for wildlife conservation, ecological research, and environmental monitoring. Existing wildlife ReID methods are predominantly tailored to specific species, exhibiting limited applicability. Although some approaches leverage extensively studied person ReID techniques, they struggle to address the unique challenges posed by wildlife. Therefore, in this paper, we present a unified, multi-species general framework for wildlife ReID. Given that high-frequency information is a consistent representation of unique features in various species, significantly aiding in identifying contours and details such as fur textures, we propose the Adaptive High-Frequency Transformer model with the goal of enhancing high-frequency information learning. To mitigate the inevitable high-frequency interference in the wilderness environment, we introduce an object-aware high-frequency selection strategy to adaptively capture more valuable high-frequency components. Notably, we unify the experimental settings of multiple wildlife datasets for ReID, achieving superior performance over state-of-the-art ReID methods. In domain generalization scenarios, our approach demonstrates robust generalization to unknown species.
[AI-35] Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara
Link: https://arxiv.org/abs/2410.06973
Authors: Azree Nazri, Olalekan Agbolade, Faisal Aziz
Keywords: Personal Intelligence System, prove inadequate, contexts with limited, addressing the specific, high-resource language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 tables, 4 figures
Abstract:In contexts with limited computational and data resources, high-resource language models often prove inadequate, particularly when addressing the specific needs of Malay languages. This paper introduces a Personal Intelligence System designed to efficiently integrate both on-device and server-based models. The system incorporates SLiM-34M for on-device processing, optimized for low memory and power usage, and MANYAK-1.3B for server-based tasks, allowing for scalable, high-performance language processing. The models achieve significant results across various tasks, such as machine translation, question-answering, and translate IndoMMLU. Particularly noteworthy is SLiM-34M’s ability to achieve a high improvement in accuracy compared to other LLMs while using 2 times fewer pre-training tokens. This work challenges the prevailing assumption that large-scale computational resources are necessary to build effective language models, contributing to the development of resource-efficient models for the Malay language with the unique orchestration between SLiM-34M and MANYAK-1.3B.
[AI-36] DLGNet: Hyperedge Classification through Directed Line Graphs for Chemical Reactions
Link: https://arxiv.org/abs/2410.06969
Authors: Stefano Fiorini, Giulia M. Bovolenta, Stefano Coniglio, Michele Ciavotta, Pietro Morerio, Michele Parrinello, Alessio Del Bue
Keywords: Directed Line Graph, Line Graph Laplacian, provide powerful abstractions, Line Graph, Directed Line
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Graphs and hypergraphs provide powerful abstractions for modeling interactions among a set of entities of interest and have been attracting a growing interest in the literature thanks to many successful applications in several fields. In particular, they are rapidly expanding in domains such as chemistry and biology, especially in the areas of drug discovery and molecule generation. One of the areas witnessing the fastest growth is the chemical reactions field, where chemical reactions can be naturally encoded as directed hyperedges of a hypergraph. In this paper, we address the chemical reaction classification problem by introducing the notion of a Directed Line Graph (DLG) associated with a given directed hypergraph. On top of it, we build the Directed Line Graph Network (DLGNet), the first spectral-based Graph Neural Network (GNN) expressly designed to operate on a hypergraph via its DLG transformation. The foundation of DLGNet is a novel Hermitian matrix, the Directed Line Graph Laplacian, which compactly encodes the directionality of the interactions taking place within the directed hyperedges of the hypergraph thanks to the DLG representation. The Directed Line Graph Laplacian enjoys many desirable properties, including admitting an eigenvalue decomposition and being positive semidefinite, which make it well-suited for its adoption within a spectral-based GNN. Through extensive experiments on chemical reaction datasets, we show that DLGNet significantly outperforms the existing approaches, achieving on a collection of real-world datasets an average relative-percentage-difference improvement of 33.01%, with a maximum improvement of 37.71%.
[AI-37] Uncovering Factor Level Preferences to Improve Human-Model Alignment
Link: https://arxiv.org/abs/2410.06965
Authors: Juhyun Oh, Eunsu Kim, Jiseon Kim, Wenda Xu, Inha Cha, William Yang Wang, Alice Oh
Keywords: Large Language Model, Large Language, advancements in Large, Language Model, preferences remains crucial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Despite advancements in Large Language Model (LLM) alignment, understanding the reasons behind LLM preferences remains crucial for bridging the gap between desired and actual behavior. LLMs often exhibit biases or tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. However, current methods for evaluating preference alignment often lack explainability, relying on coarse-grained comparisons. To address this, we introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences. PROFILE’s factor level analysis explains the ‘why’ behind human-model alignment and misalignment, offering insights into the direction of model improvement. We apply PROFILE to analyze human and LLM preferences across three tasks: summarization, helpful response generation, and document-based question-answering. Our factor level analysis reveals a substantial discrepancy between human and LLM preferences in generation tasks, whereas LLMs show strong alignment with human preferences in evaluation tasks. We demonstrate how leveraging factor level insights, including addressing misaligned factors or exploiting the generation-evaluation gap, can improve alignment with human preferences. This work underscores the importance of explainable preference analysis and highlights PROFILE’s potential to provide valuable training signals, driving further improvements in human-model alignment.
[AI-38] ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling SIGGRAPH
链接: https://arxiv.org/abs/2410.06963
作者: Deok-Kyeong Jang,Dongseok Yang,Deok-Yun Jang,Byeoli Choi,Donghoon Shin,Sung-hee Lee
关键词-EN: single LiDAR sensor, paper introduces ELMO, motion capture framework, capture framework designed, motion capture
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: published at ACM Transactions on Graphics (Proc. SIGGRAPH ASIA), 2024
点击查看摘要
Abstract:This paper introduces ELMO, a real-time upsampling motion capture framework designed for a single LiDAR sensor. Modeled as a conditional autoregressive transformer-based upsampling motion generator, ELMO achieves 60 fps motion capture from a 20 fps LiDAR point cloud sequence. The key feature of ELMO is the coupling of the self-attention mechanism with thoughtfully designed embedding modules for motion and point clouds, significantly elevating the motion quality. To facilitate accurate motion capture, we develop a one-time skeleton calibration model capable of predicting user skeleton offsets from a single-frame point cloud. Additionally, we introduce a novel data augmentation technique utilizing a LiDAR simulator, which enhances global root tracking to improve environmental understanding. To demonstrate the effectiveness of our method, we compare ELMO with state-of-the-art methods in both image-based and point cloud-based motion capture. We further conduct an ablation study to validate our design principles. ELMO’s fast inference time makes it well-suited for real-time applications, exemplified in our demo video featuring live streaming and interactive gaming scenarios. Furthermore, we contribute a high-quality LiDAR-mocap synchronized dataset comprising 20 different subjects performing a range of motions, which can serve as a valuable resource for future research. The dataset and evaluation code are available at this https URL
[AI-39] Self-Boosting Large Language Models with Synthetic Preference Data
链接: https://arxiv.org/abs/2410.06961
作者: Qingxiu Dong,Li Dong,Xingxing Zhang,Zhifang Sui,Furu Wei
关键词-EN: Large Language Models, Large Language, Language Models, generating honest, advanced significantly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
[AI-40] Support Vector Boosting Machine (SVBM): Enhancing Classification Performance with AdaBoost and Residual Connections
链接: https://arxiv.org/abs/2410.06957
作者: Junbo Jacob Lian
关键词-EN: training samples emphasizes, misclassified training samples, Support Vector Machine, Vector Boosting Machine, Support Vector Boosting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The MATLAB source code for SVBM can be accessed at this https URL
点击查看摘要
Abstract:In traditional boosting algorithms, the focus on misclassified training samples emphasizes their importance based on difficulty during the learning process. While using a standard Support Vector Machine (SVM) as a weak learner in an AdaBoost framework can enhance model performance by concentrating on error samples, this approach introduces significant challenges. Specifically, SVMs, characterized by their stability and robustness, may require destabilization to fit the boosting paradigm, which in turn can constrain performance due to reliance on the weighted results from preceding iterations. To address these challenges, we propose the Support Vector Boosting Machine (SVBM), which integrates a novel subsampling process with SVM algorithms and residual connection techniques. This method updates sample weights by considering both the current model’s predictions and the outputs from prior rounds, allowing for effective sparsity control. The SVBM framework enhances the ability to form complex decision boundaries, thereby improving classification performance. The MATLAB source code for SVBM can be accessed at this https URL.
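Note: as background for the abstract above, the sketch below shows one round of the classic discrete AdaBoost weight update that SVBM is described as building on. It is not the SVBM update itself (which, per the abstract, also uses outputs from prior rounds and a subsampling step). The function name `adaboost_round` and the conventions (labels in {-1, +1}, weights summing to 1) are assumptions made for illustration.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One round of classic discrete AdaBoost (not the SVBM variant).

    weights     -- current sample weights, assumed to sum to 1
    labels      -- true labels in {-1, +1}
    predictions -- weak-learner outputs in {-1, +1}
    Returns (alpha, new_weights).
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Clamp to avoid division by zero or log(0) for perfect/useless learners.
    err = min(max(err, 1e-12), 1.0 - 1e-12)
    # Vote strength of this weak learner.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (y * h == -1) are up-weighted, correct ones down-weighted.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, 1, -1, 1]
)
```

After the update, the misclassified samples together carry half of the total weight, which is what forces the next weak learner to concentrate on them.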
[AI-41] Faithful Interpretation for Graph Neural Networks
链接: https://arxiv.org/abs/2410.06950
作者: Lijie Hu,Tianhao Huang,Lu Yu,Wanyu Lin,Tianhang Zheng,Di Wang
关键词-EN: Graph Neural Networks, Neural Networks, Graph Attention Networks, Graph Transformers, garnered increasing attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages
点击查看摘要
Abstract:Currently, attention mechanisms have garnered increasing attention in Graph Neural Networks (GNNs), such as Graph Attention Networks (GATs) and Graph Transformers (GTs). This is due not only to the commendable boost in performance they offer but also to their capacity to provide a more lucid rationale for model behaviors, which are often viewed as inscrutable. However, Attention-based GNNs have demonstrated instability in interpretability when subjected to various sources of perturbations during both training and testing phases, including factors like additional edges or nodes. In this paper, we propose a solution to this problem by introducing a novel notion called Faithful Graph Attention-based Interpretation (FGAI). In particular, FGAI has four crucial properties regarding stability and sensitivity to interpretation and final output distribution. Built upon this notion, we propose an efficient methodology for obtaining FGAI, which can be viewed as an ad hoc modification to the canonical Attention-based GNNs. To validate our proposed solution, we introduce two novel metrics tailored for graph interpretation assessment. Experimental results demonstrate that FGAI exhibits superior stability and preserves the interpretability of attention under various forms of perturbations and randomness, which makes FGAI a more faithful and reliable explanation tool.
[AI-42] A Trilogy of AI Safety Frameworks: Paths from Facts and Knowledge Gaps to Reliable Predictions and New Knowledge
链接: https://arxiv.org/abs/2410.06946
作者: Simon Kasif
关键词-EN: vital front-line concern, vital front-line, front-line concern, machine learning systems, Safety
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:AI Safety has become a vital front-line concern of many scientists within and outside the AI community. There are many immediate and long term anticipated risks that range from existential risk to human existence to deep fakes and bias in machine learning systems [1-5]. In this paper, we reduce the full scope and immense complexity of AI safety concerns to a trilogy of three important but tractable opportunities for advances that have the short-term potential to improve AI safety and reliability without reducing AI innovation in critical domains. In this perspective, we discuss this vision based on several case studies that already produced proofs of concept in critical ML applications in biomedical science.
[AI-43] AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation
链接: https://arxiv.org/abs/2410.06943
作者: Huanxi Liu,Jiaqi Liao,Dawei Feng,Kele Xu,Huaimin Wang
关键词-EN: Large Language Models, API request generation, Large Language, Language Models, API request
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 17 pages
点击查看摘要
Abstract:Large Language Models (LLMs) leverage external tools primarily through generating the API request to enhance task completion efficiency. The accuracy of API request generation significantly determines the capability of LLMs to accomplish tasks. Due to the inherent hallucinations within the LLM, it is difficult to efficiently and accurately generate the correct API request. Current research uses prompt-based feedback to facilitate the LLM-based API request generation. However, existing methods lack factual information and are insufficiently detailed. To address these issues, we propose AutoFeedback, an LLM-based framework for efficient and accurate API request generation, with a Static Scanning Component (SSC) and a Dynamic Analysis Component (DAC). SSC incorporates errors detected in the API requests as pseudo-facts into the feedback, enriching the factual information. DAC retrieves information from API documentation, enhancing the level of detail in feedback. Based on these two components, AutoFeedback implements two feedback loops during the process of generating API requests by the LLM. Extensive experiments demonstrate that it significantly improves accuracy of API request generation and reduces the interaction cost. AutoFeedback achieves an accuracy of 100.00% on a real-world API dataset and reduces the cost of interaction with GPT-3.5 Turbo by 23.44%, and GPT-4 Turbo by 11.85%.
[AI-44] Compositional Entailment Learning for Hyperbolic Vision-Language Models
链接: https://arxiv.org/abs/2410.06912
作者: Avik Pal,Max van Spengler,Guido Maria D’Amely di Melendugno,Alessandro Flaborea,Fabio Galasso,Pascal Mettes
关键词-EN: shared embedding space, representation learning forms, forms a cornerstone, contrastively aligned, Image-text representation learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 23 pages, 12 figures, 8 tables
点击查看摘要
Abstract:Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using openly available localized grounding models. We show how to hierarchically organize images, image boxes, and their textual descriptions through contrastive and entailment-based objectives. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning, as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance.
[AI-45] Combining Planning and Diffusion for Mobility with Unknown Dynamics ICRA2025
链接: https://arxiv.org/abs/2410.06911
作者: Yajvan Ravan,Zhutian Yang,Tao Chen,Tomás Lozano-Pérez,Leslie Pack Kaelbling
关键词-EN: deployable robotic systems, robotic systems, essential skill, skill for deployable, deployable robotic
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to ICRA 2025
点击查看摘要
Abstract:Manipulation of large objects over long horizons (such as carts in a warehouse) is an essential skill for deployable robotic systems. Large objects require mobile manipulation which involves simultaneous manipulation, navigation, and movement with the object in tow. In many real-world situations, object dynamics are incredibly complex, such as the interaction of an office chair (with a rotating base and five caster wheels) and the ground. We present a hierarchical algorithm for long-horizon robot manipulation problems in which the dynamics are partially unknown. We observe that diffusion-based behavior cloning is highly effective for short-horizon problems with unknown dynamics, so we decompose the problem into an abstract high-level, obstacle-aware motion-planning problem that produces a waypoint sequence. We use a short-horizon, relative-motion diffusion policy to achieve the waypoints in sequence. We train mobile manipulation policies on a Spot robot that has to push and pull an office chair. Our hierarchical manipulation policy performs consistently better, especially when the horizon increases, compared to a diffusion policy trained on long-horizon demonstrations or motion planning assuming a rigidly-attached object (success rate of 8 (versus 0 and 5 respectively) out of 10 runs). Importantly, our learned policy generalizes to new layouts, grasps, chairs, and flooring that induces more friction, without any further training, showing promise for other complex mobile manipulation problems. Project Page: this https URL
[AI-46] Degree Distribution based Spiking Graph Networks for Domain Adaptation
链接: https://arxiv.org/abs/2410.06883
作者: Yingxu Wang,Siwei Liu,Mengzhu Wang,Shangsong Liang,Nan Yin
关键词-EN: Spiking Graph Networks, garnered significant attraction, Spiking Graph Domain, Graph Networks, Graph Domain Adaptation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Spiking Graph Networks (SGNs) have garnered significant attention from both researchers and industry due to their ability to address energy consumption challenges in graph classification. However, SGNs are only effective for in-distribution data and cannot tackle out-of-distribution data. In this paper, we first propose the domain adaptation problem in SGNs, and introduce a novel framework named Degree-aware Spiking Graph Domain Adaptation for Classification. The proposed DeSGDA addresses the spiking graph domain adaptation problem through three aspects: node degree-aware personalized spiking representation, adversarial feature distribution alignment, and pseudo-label distillation. First, we introduce the personalized spiking representation method for generating degree-dependent spiking signals. Specifically, the threshold of triggering a spike is determined by the node degree, allowing this personalized approach to capture more expressive information for classification. Then, we propose the graph feature distribution alignment module that is adversarially trained using membrane potential against a domain discriminator. Such an alignment module can efficiently maintain high performance and low energy consumption in the case of inconsistent distribution. Additionally, we extract consistent predictions across two spaces to create reliable pseudo-labels, effectively leveraging unlabeled data to enhance graph classification performance. Extensive experiments on benchmark datasets validate the superiority of the proposed DeSGDA compared with competitive baselines.
[AI-47] Students' Perceptions and Use of Generative AI Tools for Programming Across Different Computing Courses
链接: https://arxiv.org/abs/2410.06865
作者: Hieke Keuning,Isaac Alpizar-Chacon,Ioanna Lykourentzou,Lauren Beehler,Christian Köppe,Imke de Jong,Sergey Sosnovsky
关键词-EN: Investigation of students’, generative artificial intelligence, gaining much interest, students’ perceptions, topic gaining
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Accepted to Koli Calling 24
点击查看摘要
Abstract:Investigation of students’ perceptions and opinions on the use of generative artificial intelligence (GenAI) in education is a topic gaining much interest. Studies addressing this are typically conducted with large heterogeneous groups, at one moment in time. However, how students perceive and use GenAI tools can potentially depend on many factors, including their background knowledge, familiarity with the tools, and the learning goals and policies of the courses they are taking. In this study we explore how students following computing courses use GenAI for programming-related tasks across different programs and courses: Bachelor and Master, in courses in which learning programming is the learning goal, courses that require programming as a means to achieve another goal, and in courses in which programming is optional, but can be useful. We are also interested in changes over time, since GenAI capabilities are changing at a fast pace, and users are adopting GenAI increasingly. We conducted three consecutive surveys (fall '23, winter '23, and spring '24) among students of all computing programs of a large European research university. We asked questions on the use in education, ethics, and job prospects, and we included specific questions on the (dis)allowed use of GenAI tools in the courses they were taking at the time. We received 264 responses, which we quantitatively and qualitatively analyzed, to find out how students have employed GenAI tools across 59 different computing courses, and whether the opinion of an average student about these tools evolves over time. Our study contributes to the emerging discussion of how to differentiate GenAI use across different courses, and how to align its use with the learning goals of a computing course.
[AI-48] Understanding Model Ensemble in Transferable Adversarial Attack
链接: https://arxiv.org/abs/2410.06851
作者: Wei Yao,Zeliang Zhang,Huayi Tang,Yong Liu
关键词-EN: foundation remains underexplored, Model ensemble adversarial, theoretical foundation remains, Model ensemble, ensemble adversarial attack
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Model ensemble adversarial attack has become a powerful method for generating transferable adversarial examples that can target even unknown models, but its theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attack. We first define transferability error to measure the error in adversarial transferability, alongside concepts of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant, which rigorously explains the origin of transferability error in model ensemble attack: the vulnerability of an adversarial example to ensemble components, and the diversity of ensemble components. Furthermore, we apply the latest mathematical tools in information theory to bound the transferability error using complexity and generalization terms, contributing to three practical guidelines for reducing transferability error: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step forward in understanding transferable model ensemble adversarial attacks.
[AI-49] A Safety Modulator Actor-Critic Method in Model-Free Safe Reinforcement Learning and Application in UAV Hovering
链接: https://arxiv.org/abs/2410.06847
作者: Qihan Qi,Xinsong Yang,Gang Xia,Daniel W. C. Ho,Pengyang Tang
关键词-EN: safe reinforcement learning, model-free safe reinforcement, safety modulator actor-critic, address safety constraint, safety constraints
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:This paper proposes a safety modulator actor-critic (SMAC) method to address safety constraints and mitigate overestimation in model-free safe reinforcement learning (RL). A safety modulator is developed to satisfy safety constraints by modulating actions, allowing the policy to ignore the safety constraints and focus on maximizing reward. Additionally, a distributional critic with a theoretical update rule for SMAC is proposed to mitigate the overestimation of Q-values with safety constraints. Experiments on Unmanned Aerial Vehicle (UAV) hovering, in both simulation and real-world scenarios, confirm that SMAC can effectively maintain safety constraints and outperform mainstream baseline algorithms.
[AI-50] Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity
链接: https://arxiv.org/abs/2410.06846
作者: Mutian He,Philip N. Garner
关键词-EN: Linformer and Mamba, Mamba have recently, linear time replacements, competitive linear time, recently emerged
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 15 pages, 4 figures
点击查看摘要
Abstract:Architectures such as Linformer and Mamba have recently emerged as competitive linear time replacements for transformers. However, corresponding large pretrained models are often unavailable, especially in non-text domains. To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task. We also compare several means to guide the fine-tuning to optimally retain the desired inference capability from the original model. The methods differ in their use of the target model and the trajectory of the parameters. In a series of empirical studies on language processing, language modeling, and speech processing, we show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result. Some reasons for the variation are suggested.
[AI-51] MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
链接: https://arxiv.org/abs/2410.06845
作者: Cheng Li,May Fung,Qingyun Wang,Chi Han,Manling Li,Jindong Wang,Heng Ji
关键词-EN: Mental health disorders, Mental health, health disorders, Mental, health
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Technical Report; 27 pages
点击查看摘要
Abstract:Mental health disorders are among the most serious diseases in the world. Most people with such a disease lack access to adequate care, which highlights the importance of training models for the diagnosis and treatment of mental health disorders. However, in the mental health domain, privacy concerns limit the accessibility of personalized treatment data, making it challenging to build powerful models. In this paper, we introduce MentalArena, a self-play framework to train language models by generating domain-specific personalized data, where we obtain a better model capable of making a personalized diagnosis and treatment (as a therapist) and providing information (as a patient). To accurately model human-like mental health patients, we devise Symptom Encoder, which simulates a real patient from both cognition and behavior perspectives. To address intent bias during patient-therapist interactions, we propose Symptom Decoder to compare diagnosed symptoms with encoded symptoms, and dynamically manage the dialogue between patient and therapist according to the identified deviations. We evaluated MentalArena against 6 benchmarks, including biomedicalQA and mental health tasks, compared to 6 advanced models. Our models, fine-tuned on both GPT-3.5 and Llama-3-8b, significantly outperform their counterparts, including GPT-4o. We hope that our work can inspire future research on personalized care. Code is available at this https URL
[AI-52] Dynamic Neural Potential Field: Online Trajectory Optimization in Presence of Moving Obstacles
链接: https://arxiv.org/abs/2410.06819
作者: Aleksey Staroverov,Muhammad Alhaddad,Aditya Narendra,Konstantin Mironov,Aleksandr Panov
关键词-EN: Model Predictive Control, local trajectory planning, MPC local trajectory, local trajectory, Predictive Control
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We address the task of local trajectory planning for a mobile robot in the presence of static and dynamic obstacles. The local trajectory is obtained as a numerical solution of the Model Predictive Control (MPC) problem. Collision avoidance may be provided by adding the repulsive potential of the obstacles to the MPC cost function. We develop an approach in which the repulsive potential is estimated by a neural model. We propose and explore three possible strategies for handling dynamic obstacles. First, an environment with dynamic obstacles is treated as a sequence of static environments. Second, the neural model predicts a sequence of repulsive potentials at once. Third, the neural model predicts future repulsive potentials step by step in autoregressive mode. We implement these strategies and compare them with CIAO* and MPPI using the BenchMR framework. The first two strategies showed higher performance than CIAO* and MPPI while preserving safety constraints. The third strategy was a bit slower; however, it still satisfies the time limits. We deploy our approach on a Husky UGV mobile platform, which moves through office corridors under the proposed MPC local trajectory planner. The code and trained models are available at this https URL.
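Note: the idea of adding a repulsive obstacle potential to an MPC cost can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the hand-written potential stands in for the paper's neural model, and the names `repulsive_potential` and `trajectory_cost`, the quadratic goal term, and the `influence_radius` parameter are hypothetical. Passing one obstacle list per time step loosely mirrors the first strategy in the abstract (a dynamic scene treated as a sequence of static environments).

```python
import math


def repulsive_potential(point, obstacle, influence_radius=1.0):
    """Classic hand-written repulsive potential (a stand-in for a learned model).

    Zero outside the influence radius, growing rapidly as the point
    approaches the obstacle.
    """
    d = math.dist(point, obstacle)
    if d >= influence_radius:
        return 0.0
    d = max(d, 1e-6)  # avoid division by zero at the obstacle itself
    return 0.5 * (1.0 / d - 1.0 / influence_radius) ** 2


def trajectory_cost(trajectory, goal, obstacles_per_step, weight=1.0):
    """Goal-tracking cost plus weighted repulsive potential, summed over time steps.

    obstacles_per_step[t] lists obstacle positions at step t, so moving
    obstacles are handled as a sequence of static snapshots.
    """
    cost = 0.0
    for point, obstacles in zip(trajectory, obstacles_per_step):
        cost += math.dist(point, goal) ** 2
        cost += weight * sum(repulsive_potential(point, ob) for ob in obstacles)
    return cost


# Two-step trajectory; an obstacle is close at step 0 and far away at step 1.
cost = trajectory_cost(
    trajectory=[(0.0, 0.0), (1.0, 0.0)],
    goal=(1.0, 0.0),
    obstacles_per_step=[[(0.5, 0.0)], [(5.0, 5.0)]],
)
```

An MPC solver would minimize a cost of this shape over candidate trajectories; the solver itself is omitted here.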
[AI-53] An Improved Approach for Cardiac MRI Segmentation based on 3D UNet Combined with Papillary Muscle Exclusion
链接: https://arxiv.org/abs/2410.06818
作者: Narjes Benameur,Ramzi Mahmoudi,Mohamed Deriche,Amira fayouka,Imene Masmoudi,Nessrine Zoghlami
关键词-EN: Left ventricular ejection, ventricular ejection fraction, important clinical parameter, Cardiovascular Magnetic Resonance, Left ventricular
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Left ventricular ejection fraction (LVEF) is the most important clinical parameter of cardiovascular function. The accuracy in estimating this parameter is highly dependent upon the precise segmentation of the left ventricle (LV) structure at the end diastole and systole phases. Therefore, it is crucial to develop robust algorithms for the precise segmentation of the heart structure during different phases. Methodology: In this work, an improved 3D UNet model is introduced to segment the myocardium and LV, while excluding papillary muscles, as per the recommendation of the Society for Cardiovascular Magnetic Resonance. For the practical testing of the proposed framework, a total of 8,400 cardiac MRI images were collected and analysed from the military hospital in Tunis (HMPIT), as well as the popular ACDC public dataset. As performance metrics, we used the Dice coefficient and the F1 score for validation/testing of the LV and the myocardium segmentation. Results: The data was split into 70%, 10%, and 20% for training, validation, and testing, respectively. It is worth noting that the proposed segmentation model was tested across three axis views: basal, medio basal and apical at two different cardiac phases: end diastole and end systole instances. The experimental results showed a Dice index of 0.965 and 0.945, and an F1 score of 0.801 and 0.799, at the end diastolic and systolic phases, respectively. Additionally, clinical evaluation outcomes revealed a significant difference in the LVEF and other clinical parameters when the papillary muscles were included or excluded.
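Note: the two quantities the abstract above relies on have standard definitions, sketched below under simple assumptions. The Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|), and the ejection fraction is (EDV - ESV) / EDV expressed in percent. The function names, the flat 0/1 list representation of masks, and the empty-mask convention are illustrative choices, not the paper's implementation.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(edv_ml, esv_ml):
    """LVEF in percent from end-diastolic (EDV) and end-systolic (ESV) volumes."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


dice = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
lvef = ejection_fraction(edv_ml=120.0, esv_ml=48.0)
```

Because LVEF is derived from the segmented LV volumes at end diastole and end systole, any segmentation choice that changes those volumes (such as including or excluding papillary muscles) shifts the computed value.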
[AI-54] Multi-Neuron Unleashes Expressivity of ReLU Networks Under Convex Relaxation
链接: https://arxiv.org/abs/2410.06816
作者: Yuhao Mao,Yani Zhang,Martin Vechev
关键词-EN: Neural work certification, neural networks, crucial tool, tool for ensuring, Neural work
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Neural network certification has established itself as a crucial tool for ensuring the robustness of neural networks. Certification methods typically rely on convex relaxations of the feasible output set to provide sound bounds. However, complete certification requires exact bounds, which strongly limits the expressivity of ReLU networks: even for the simple $\max$ function in $\mathbb{R}^2$, there does not exist a ReLU network that expresses this function and can be exactly bounded by single-neuron relaxation methods. This raises the question whether there exists a convex relaxation that can provide exact bounds for general continuous piecewise linear functions in $\mathbb{R}^n$. In this work, we answer this question affirmatively by showing that (layer-wise) multi-neuron relaxation provides complete certification for general ReLU networks. Based on this novel result, we show that the expressivity of ReLU networks is no longer limited under multi-neuron relaxation. To the best of our knowledge, this is the first positive result on the completeness of convex relaxations, shedding light on the practice of certified robustness.
[AI-55] Defending Membership Inference Attacks via Privacy-aware Sparsity Tuning
Link: https://arxiv.org/abs/2410.06814
Authors: Qiang Hu, Hengxiang Zhang, Hongxin Wei
Keywords: Over-parameterized models, membership inference attacks, vulnerable to membership, membership inference, aim to determine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Over-parameterized models are typically vulnerable to membership inference attacks, which aim to determine whether a specific sample is included in the training of a given model. Previous weight regularization methods (e.g., L1 regularization) typically impose uniform penalties on all parameters, leading to a suboptimal tradeoff between model utility and privacy. In this work, we first show that only a small fraction of parameters substantially impact the privacy risk. In light of this, we propose Privacy-aware Sparsity Tuning (PAST), a simple fix to L1 regularization that applies adaptive penalties to different parameters. The key idea behind PAST is to promote sparsity in parameters that significantly contribute to privacy leakage. In particular, we construct the adaptive weight for each parameter based on its privacy sensitivity, i.e., the gradient of the loss gap with respect to the parameter. Using PAST, the network shrinks the loss gap between members and non-members, leading to strong resistance to privacy attacks. Extensive experiments demonstrate the superiority of PAST, achieving a state-of-the-art balance in the privacy-utility trade-off.
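The idea of a per-parameter L1 penalty can be sketched as follows. This is a hypothetical illustration, not the PAST implementation: the abstract only says the adaptive weight is based on the gradient of the loss gap, so the mean-normalization used here, the function name, and the `strength` parameter are all assumptions.

```python
def adaptive_l1_penalty(params, sensitivities, strength=1.0):
    """Weighted L1 penalty: parameters with larger |sensitivity| are penalized more.

    `sensitivities` stands in for the gradient of the member/non-member loss gap
    with respect to each parameter (assumed to be computed elsewhere).
    """
    if len(params) != len(sensitivities):
        raise ValueError("one sensitivity value is needed per parameter")
    magnitudes = [abs(s) for s in sensitivities]
    mean_magnitude = sum(magnitudes) / len(magnitudes)
    if mean_magnitude == 0.0:
        # No sensitivity signal: fall back to a uniform (plain L1) penalty.
        weights = [1.0] * len(params)
    else:
        # Assumed normalization: weights average to 1, so the overall scale matches plain L1.
        weights = [m / mean_magnitude for m in magnitudes]
    return strength * sum(w * abs(p) for w, p in zip(weights, params))


if __name__ == "__main__":
    theta = [1.0, -2.0]
    print(adaptive_l1_penalty(theta, [1.0, 1.0]))  # uniform sensitivities
    print(adaptive_l1_penalty(theta, [0.0, 4.0]))  # only the second parameter is sensitive
```

With uniform sensitivities the penalty reduces to ordinary L1, which makes the contrast with the uniform-penalty baseline described in the abstract easy to see.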
[AI-56] Diffuse or Confuse: A Diffusion Deepfake Speech Dataset
Link: https://arxiv.org/abs/2410.06796
Authors: Anton Firc, Kamil Malinka, Petr Hanáček
Keywords: Advancements in artificial, significantly improved synthetic, synthetic speech generation, improved synthetic speech, artificial intelligence
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Presented at International Conference of the Biometrics Special Interest Group (BIOSIG 2024)
Abstract:Advancements in artificial intelligence and machine learning have significantly improved synthetic speech generation. This paper explores diffusion models, a novel method for creating realistic synthetic speech. We create a diffusion dataset using available tools and pretrained models. Additionally, this study assesses the quality of diffusion-generated deepfakes versus non-diffusion ones and their potential threat to current deepfake detection systems. Findings indicate that the detection of diffusion-based deepfakes is generally comparable to non-diffusion deepfakes, with some variability based on detector architecture. Re-vocoding with diffusion vocoders shows minimal impact, and the overall speech quality is comparable to non-diffusion methods.
[AI-57] Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?
Link: https://arxiv.org/abs/2410.06735
Authors: Fumiya Uchiyama, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo
Keywords: Recent large language, demonstrated remarkable generalization, Recent large, remarkable generalization abilities, programming languages
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks. Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on logical reasoning tasks: FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logic inference performance. In addition, we found that models trained with programming languages exhibit a better ability to follow instructions compared to those trained with natural languages. Further analysis reveals that the depth of Abstract Syntax Trees representing parsed results of programs also affects logical reasoning performance. These findings will offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.
[AI-58] Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles NEURIPS2024
Link: https://arxiv.org/abs/2410.06733
Authors: Qi Chen, Bowen Zhang, Gang Wang, Qi Wu
Keywords: Large Language Models, Large Language, tasks requiring vertical, capabilities remain under-explored, assessing creative thought
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player’s predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at: this https URL.
[AI-59] Evaluating the Impact of Point Cloud Colorization on Semantic Segmentation Accuracy
Link: https://arxiv.org/abs/2410.06725
Authors: Qinfeng Zhu, Jiaze Cao, Yuanzhi Cai, Lei Fan
Keywords: scene understanding, predefined categories, process of classifying, RGB, RGB information
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by 2024 IEEE 8th International Conference on Vision, Image and Signal Processing
Abstract:Point cloud semantic segmentation, the process of classifying each point into predefined categories, is essential for 3D scene understanding. While image-based segmentation is widely adopted due to its maturity, methods relying solely on RGB information often suffer from degraded performance due to color inaccuracies. Recent advancements have incorporated additional features such as intensity and geometric information, yet RGB channels continue to negatively impact segmentation accuracy when errors in colorization occur. Despite this, previous studies have not rigorously quantified the effects of erroneous colorization on segmentation performance. In this paper, we propose a novel statistical approach to evaluate the impact of inaccurate RGB information on image-based point cloud segmentation. We categorize RGB inaccuracies into two types: incorrect color information and similar color information. Our results demonstrate that both types of color inaccuracies significantly degrade segmentation accuracy, with similar color errors particularly affecting the extraction of geometric features. These findings highlight the critical need to reassess the role of RGB information in point cloud segmentation and its implications for future algorithm design.
[AI-60] Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques
Link: https://arxiv.org/abs/2410.06719
Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Keywords: powerful generative models, content shift, Diffusion, diffusion feature, applied to discrimination
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.03558
Abstract:Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at this https URL.
[AI-61] Calibrating Verbalized Probabilities for Large Language Models
Link: https://arxiv.org/abs/2410.06707
Authors: Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, Patrick Ernst
Keywords: Large Language Models, black-box Large Language, Language Models, Large Language, Calibrating verbalized probabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages
Abstract:Calibrating verbalized probabilities presents a novel approach for reliably assessing and leveraging outputs from black-box Large Language Models (LLMs). Recent methods have demonstrated improved calibration by applying techniques like Platt scaling or temperature scaling to the confidence scores generated by LLMs. In this paper, we explore the calibration of verbalized probability distributions for discriminative tasks. First, we investigate the capability of LLMs to generate probability distributions over categorical labels. We theoretically and empirically identify the issue of re-softmax arising from the scaling of verbalized probabilities, and propose using the invert softmax trick to approximate the “logit” by inverting verbalized probabilities. Through extensive evaluation on three public datasets, we demonstrate: (1) the robust capability of LLMs in generating class distributions, and (2) the effectiveness of the invert softmax trick in estimating logits, which, in turn, facilitates post-calibration adjustments.
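The re-softmax issue and the invert softmax trick can be illustrated with a short sketch. It assumes the trick amounts to taking the logarithm of each verbalized probability, which recovers the logits up to an additive constant; the function names and the `eps` clamp are illustrative choices, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def invert_softmax(probs, eps=1e-12):
    """Approximate logits from probabilities: log p equals the logit up to a shared constant."""
    return [math.log(max(p, eps)) for p in probs]


def temperature_scale(probs, temperature):
    """Temperature scaling applied to the recovered logits rather than to the raw probabilities."""
    return softmax([z / temperature for z in invert_softmax(probs)])


if __name__ == "__main__":
    verbalized = [0.7, 0.2, 0.1]
    # Re-softmax: feeding probabilities straight into softmax flattens the distribution.
    print("re-softmax:      ", softmax(verbalized))
    # With temperature 1 the inverted version returns the original distribution.
    print("inverted, T=1.0: ", temperature_scale(verbalized, 1.0))
    print("inverted, T=2.0: ", temperature_scale(verbalized, 2.0))
```

Working on `log p` means a temperature of 1 leaves the verbalized distribution unchanged, so the calibration parameter only adjusts confidence and does not also undo a distortion introduced by the second softmax.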
[AI-62] PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs
Link: https://arxiv.org/abs/2410.06704
Authors: Krishna Kanth Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, Xuebing Zhou
Keywords: PII extraction, comprehensive benchmark designed, PII extraction attacks, PII, introduce PII-Scope
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.
[AI-63] ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
Link: https://arxiv.org/abs/2410.06703
Authors: Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov
Keywords: Recent advancements, autonomous web navigation, benchmarks showcasing progress, LLM-based web agents, web agents
Categories: Artificial Intelligence (cs.AI)
Abstract:Recent advancements in LLM-based web agents have introduced novel architectures and benchmarks showcasing progress in autonomous web navigation and interaction. However, most existing benchmarks prioritize effectiveness and accuracy, overlooking crucial factors like safety and trustworthiness which are essential for deploying web agents in enterprise settings. The risks of unsafe web agent behavior, such as accidentally deleting user accounts or performing unintended actions in critical business operations, pose significant barriers to widespread adoption. In this paper, we present ST-WebAgentBench, a new online benchmark specifically designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. This benchmark is grounded in a detailed framework that defines safe and trustworthy (ST) agent behavior, outlines how ST policies should be structured and introduces the Completion under Policies metric to assess agent performance. Our evaluation reveals that current SOTA agents struggle with policy adherence and cannot yet be relied upon for critical business applications. Additionally, we propose architectural principles aimed at improving policy awareness and compliance in web agents. We open-source this benchmark and invite the community to contribute, with the goal of fostering a new generation of safer, more trustworthy AI agents.
[AI-64] Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models
Link: https://arxiv.org/abs/2410.06699
Authors: Yubo Wang, Chaohu Liu, Yanqiu Qu, Haoyu Cao, Deqiang Jiang, Linli Xu
Keywords: Large vision-language models, large language models, showcasing remarkable multi-modal, multi-modal conversational capabilities, remarkable multi-modal conversational
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACMMM 2024
Abstract:Large vision-language models (LVLMs) integrate visual information into large language models, showcasing remarkable multi-modal conversational capabilities. However, the visual modules introduce new challenges in terms of robustness for LVLMs, as attackers can craft adversarial images that are visually clean but may mislead the model to generate incorrect answers. In general, LVLMs rely on vision encoders to transform images into visual tokens, which are crucial for the language models to perceive image contents effectively. Therefore, we are curious about one question: Can LVLMs still generate correct responses when the encoded visual tokens are attacked and the visual information is disrupted? To this end, we propose a non-targeted attack method referred to as VT-Attack (Visual Tokens Attack), which constructs adversarial examples from multiple perspectives, with the goal of comprehensively disrupting feature representations and inherent relationships as well as the semantic properties of visual tokens output by image encoders. Using only access to the image encoder in the proposed attack, the generated adversarial examples exhibit transferability across diverse LVLMs utilizing the same image encoder and generality across different tasks. Extensive experiments validate the superior attack performance of the VT-Attack over baseline methods, demonstrating its effectiveness in attacking LVLMs with image encoders, which in turn can provide guidance on the robustness of LVLMs, particularly in terms of the stability of the visual feature space.
[AI-65] AI, Climate, and Regulation: From Data Centers to the AI Act
Link: https://arxiv.org/abs/2410.06681
Authors: Kai Ebert, Nicolas Alder, Ralf Herbrich, Philipp Hacker
Keywords: potentially particularly consequentially, professional workspace, experiencing an unprecedented, unprecedented boom, increasingly penetrate
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 18 pages, 1 figure, preprint
Abstract:We live in a world that is experiencing an unprecedented boom of AI applications that increasingly penetrate and enhance all sectors of private and public life, from education, media, medicine, and mobility to the industrial and professional workspace, and, potentially particularly consequentially, robotics. As this world is simultaneously grappling with climate change, the climate and environmental implications of the development and use of AI have become an important subject of public and academic debate. In this paper, we aim to provide guidance on the climate-related regulation for data centers and AI specifically, and discuss how to operationalize these requirements. We also highlight challenges and room for improvement, and make a number of policy proposals to this end. In particular, we propose a specific interpretation of the AI Act to bring reporting on the previously unaddressed energy consumption from AI inferences back into the scope. We also find that the AI Act fails to address indirect greenhouse gas emissions from AI applications. Furthermore, for the purpose of energy consumption reporting, we compare levels of measurement within data centers and recommend measurement at the cumulative server level. We also argue for an interpretation of the AI Act that includes environmental concerns in the mandatory risk assessment (sustainability risk assessment, SIA), and provide guidance on its operationalization. The EU data center regulation proves to be a good first step but requires further development by including binding renewable energy and efficiency targets for data centers. Overall, we make twelve concrete policy proposals, in four main areas: Energy and Environmental Reporting Obligations; Legal and Regulatory Clarifications; Transparency and Accountability Mechanisms; and Future Far-Reaching Measures beyond Transparency.
[AI-66] M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes
Link: https://arxiv.org/abs/2410.06678
Authors: Zeyu Zhang, Sixu Yan, Muzhi Han, Zaijin Wang, Xinggang Wang, Song-Chun Zhu, Hangxin Liu
Keywords: whole-body motion trajectories, whole-body motion, coordinated whole-body motion, object rearrangement tasks, whole-body motion generation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:We propose M^3Bench, a new benchmark for whole-body motion generation for mobile manipulation tasks. Given a 3D scene context, M^3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives, then generate coordinated whole-body motion trajectories for object rearrangement tasks. M^3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M^3BenchMaker. This automatic data generation tool produces coordinated whole-body motion trajectories from high-level task instructions, requiring only basic scene and robot information. Our benchmark incorporates various task splits to assess generalization across different dimensions and leverages realistic physics simulation for trajectory evaluation. Through extensive experimental analyses, we reveal that state-of-the-art models still struggle with coordinated base-arm motion while adhering to environment-context and task-specific constraints, highlighting the need to develop new models that address this gap. Through M^3Bench, we aim to facilitate future robotics research towards more adaptive and capable mobile manipulation in diverse, real-world environments.
[AI-67] Large Language Models as Code Executors: An Exploratory Study
Link: https://arxiv.org/abs/2410.06667
Authors: Chenyang Lyu, Lecheng Yan, Rui Xing, Wenxi Li, Younes Samih, Tianbo Ji, Longyue Wang
Keywords: Large Language Models, natural language processing, Large Language, natural language, language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:The capabilities of Large Language Models (LLMs) have significantly evolved, extending from natural language processing to complex tasks like code understanding and generation. We expand the scope of LLMs’ capabilities to a broader context, using LLMs to execute code snippets to obtain the output. This paper pioneers the exploration of LLMs as code executors, where code snippets are directly fed to the models for execution, and outputs are returned. We are the first to comprehensively examine this feasibility across various LLMs, including OpenAI’s o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder. Notably, the o1 model achieved over 90% accuracy in code execution, while others demonstrated lower accuracy levels. Furthermore, we introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22% (with the highest improvement of 18.96%) and an absolute average improvement of 3.86% against CoT prompting (with the highest improvement of 19.46%). Our study not only highlights the transformative potential of LLMs in coding but also lays the groundwork for future advancements in automated programming and the completion of complex tasks.
[AI-68] Revisiting Multi-Permutation Equivariance through the Lens of Irreducible Representations
Link: https://arxiv.org/abs/2410.06665
Authors: Yonatan Sverdlov, Ido Springer, Nadav Dym
Keywords: paper explores, permutations and related, Deep Weight Space, equivariant linear layers, Schur lemma
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:This paper explores the characterization of equivariant linear layers for representations of permutations and related groups. Unlike traditional approaches, which address these problems using parameter-sharing, we consider an alternative methodology based on irreducible representations and Schur’s lemma. Using this methodology, we obtain an alternative derivation for existing models like DeepSets, 2-IGN graph equivariant networks, and Deep Weight Space (DWS) networks. The derivation for DWS networks is significantly simpler than that of previous results. Next, we extend our approach to unaligned symmetric sets, where equivariance to the wreath product of groups is required. Previous works have addressed this problem in a rather restrictive setting, in which almost all wreath equivariant layers are Siamese. In contrast, we give a full characterization of layers in this case and show that there is a vast number of additional non-Siamese layers in some settings. We also show empirically that these additional non-Siamese layers can improve performance in tasks like graph anomaly detection, weight space alignment, and learning Wasserstein distances. Our code is available on GitHub.
[AI-69] Decouple-Then-Merge: Towards Better Training for Diffusion Models
Link: https://arxiv.org/abs/2410.06664
Authors: Qianli Ma, Xuefei Ning, Dongrui Liu, Li Niu, Linfeng Zhang
Keywords: noise corruption, trained by learning, learning a sequence, reverse each step, step of noise
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a Decouple-then-Merge (DeMe) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in the parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements upon 6 benchmarks including Stable Diffusion on COCO30K, ImageNet1K, PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10.
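Merging several finetuned models "in the parameter space" can be pictured as a weighted average of their weights. The sketch below is only that picture: the abstract does not say which merging rule DeMe uses, so the averaging scheme, the dictionary-of-lists model format, and the function name are assumptions made for illustration.

```python
def merge_parameters(models, weights=None):
    """Weighted average of several models' parameters.

    Each model is a dict mapping a parameter name to a flat list of floats.
    All models are assumed to share the same names and shapes.
    """
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    if len(weights) != len(models):
        raise ValueError("one merge weight is needed per model")
    merged = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(w * model[name][i] for w, model in zip(weights, models))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    # Two hypothetical models finetuned for different timestep ranges.
    early = {"layer.weight": [1.0, 2.0]}
    late = {"layer.weight": [3.0, 4.0]}
    print(merge_parameters([early, late]))
```

Whatever the actual rule, the practical benefit named in the abstract is the same: inference runs a single merged network instead of switching between timestep-specific models.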
[AI-70] Task-oriented Time Series Imputation Evaluation via Generalized Representers NEURIPS2024
Link: https://arxiv.org/abs/2410.06652
Authors: Zhixian Wang, Linxiao Yang, Liang Sun, Qingsong Wen, Yi Wang
Keywords: time series imputation, Time series analysis, Time series, anomaly detection, power energy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 9 figures, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:Time series analysis is widely used in many fields such as power energy, economics, and transportation, including different tasks such as forecasting, anomaly detection, classification, etc. Missing values are widely observed in these tasks, often leading to unpredictable negative effects on existing methods and hindering their further application. In response to this situation, existing time series imputation methods mainly focus on restoring sequences based on their data characteristics, while ignoring the performance of the restored sequences in downstream tasks. Considering different requirements of downstream tasks (e.g., forecasting), this paper proposes an efficient downstream task-oriented time series imputation evaluation approach. By combining time series imputation with neural network models used for downstream tasks, the gain of different imputation strategies on downstream tasks is estimated without retraining, and the most favorable imputation value for downstream tasks is given by combining different imputation strategies according to the estimated gain.
[AI-71] Toward Physics-guided Time Series Embedding
Link: https://arxiv.org/abs/2410.06651
Authors: Jiaxi Hu, Bowen Zhang, Qingsong Wen, Fugee Tsung, Yuxuan Liang
Keywords: primary research areas, dynamical systems modeling, physics-based dynamical systems, time series analysis, data-driven time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:In various scientific and engineering fields, the primary research areas have revolved around physics-based dynamical systems modeling and data-driven time series analysis. According to the embedding theory, dynamical systems and time series can be mutually transformed using observation functions and physical reconstruction techniques. Based on this, we propose Embedding Duality Theory, where the parameterized embedding layer essentially provides a linear estimation of the non-linear time series dynamics. This theory enables us to bypass the parameterized embedding layer and directly employ physical reconstruction techniques to acquire a data embedding representation. Utilizing physical priors results in a 10X reduction in parameters, a 3X increase in speed, and maximum performance boosts of 18% in expert, 22% in few-shot, and 53% in zero-shot tasks without any hyper-parameter tuning. All methods are encapsulated as a plug-and-play module.
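A parameter-free embedding of this kind can be sketched with time-delay embedding, a standard reconstruction technique from embedding theory. Treat it as an assumption: the abstract does not name the specific reconstruction technique used, and the function name and the `dim` and `tau` parameters are chosen here for illustration.

```python
def delay_embed(series, dim, tau):
    """Time-delay embedding: map a scalar series to vectors
    [x[t], x[t + tau], ..., x[t + (dim - 1) * tau]] with no learned parameters."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    count = len(series) - (dim - 1) * tau
    if count <= 0:
        raise ValueError("series is too short for this dim and tau")
    return [[series[t + j * tau] for j in range(dim)] for t in range(count)]


if __name__ == "__main__":
    signal = [0, 1, 2, 3, 4, 5]
    for vector in delay_embed(signal, dim=2, tau=2):
        print(vector)
```

Each output vector plays the role that a learned embedding of a patch would otherwise play, which is one way to read the abstract's claim that the parameterized embedding layer can be bypassed.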
[AI-72] Subtle Errors Matter: Preference Learning via Error-injected Self-editing
Link: https://arxiv.org/abs/2410.06638
Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Keywords: Large Language Models, Large Language, tackling tasks ranging, advanced competition-level problems, exhibited strong mathematical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Large Language Models (LLMs) have exhibited strong mathematical reasoning and computational prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle errors, such as miscalculations or incorrect substitutions, limit the models’ full mathematical potential. Existing studies to improve mathematical ability typically involve distilling reasoning skills from stronger LLMs or applying preference learning to step-wise response pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook the frequently occurring subtle errors. A major reason is that sampled preference pairs involve differences unrelated to the errors, which may distract the model from focusing on subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. In detail, RISE uses the model itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective to focus on predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
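For orientation, the standard DPO loss on a single preference pair is sketched below. This is the plain DPO objective only, not RISE's refined, error-aware variant, which the abstract does not specify; the function name, argument names, and the default `beta` are assumptions. In RISE's setting the "chosen" response would be the correct solution and the "rejected" one the self-edited solution with an injected error.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed token log-probability of a full response under
    the trained policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-logit))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy now favors the correct solution over the error-injected one.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Because an error-injected solution differs from the correct one in only a few tokens, the two log-probabilities differ mainly at those tokens, which matches the abstract's motivation for building hard pairs that isolate the subtle error.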
[AI-73] Effective Exploration Based on the Structural Information Principles
链接: https://arxiv.org/abs/2410.06621
作者: Xianghua Zeng,Hao Peng,Angsheng Li
关键词-EN: Traditional information theory, Reinforcement Learning, foundation for Reinforcement, Traditional information, valuable foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages in main paper and 15 pages in appendix
点击查看摘要
Abstract:Traditional information theory provides a valuable foundation for Reinforcement Learning, particularly through representation learning and entropy maximization for agent exploration. However, existing methods primarily concentrate on modeling the uncertainty associated with RL’s random variables, neglecting the inherent structure within the state and action spaces. In this paper, we propose a novel Structural Information principles-based Effective Exploration framework, namely SI2E. Structural mutual information between two variables is defined to address the single-variable limitation in structural information, and an innovative embedding principle is presented to capture dynamics-relevant state-action representations. The SI2E analyzes value differences in the agent’s policy between state-action pairs and minimizes structural entropy to derive the hierarchical state-action structure, referred to as the encoding tree. Under this tree structure, value-conditional structural entropy is defined and maximized to design an intrinsic reward mechanism that avoids redundant transitions and promotes enhanced coverage in the state-action space. Theoretical connections are established between SI2E and classical information-theoretic methodologies, highlighting our framework’s rationality and advantage. Comprehensive evaluations in the MiniGrid, MetaWorld, and DeepMind Control Suite benchmarks demonstrate that SI2E significantly outperforms state-of-the-art exploration baselines regarding final performance and sample efficiency, with maximum improvements of 37.63% and 60.25%, respectively.
[AI-74] Learning Evolving Tools for Large Language Models
链接: https://arxiv.org/abs/2410.06617
作者: Guoxin Chen,Zhong Zhang,Xin Cong,Fangda Guo,Yesai Wu,Yankai Lin,Wenzheng Feng,Yasheng Wang
关键词-EN: large language models, enables large language, learning enables large, language models, greatly expanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Ongoing Work
点击查看摘要
Abstract:Tool learning enables large language models (LLMs) to interact with external tools and APIs, greatly expanding the application scope of LLMs. However, due to the dynamic nature of external environments, these tools and APIs may become outdated over time, preventing LLMs from correctly invoking tools. Existing research primarily focuses on static environments and overlooks this issue, limiting the adaptability of LLMs in real-world applications. In this paper, we propose ToolEVO, a novel framework designed to enhance the adaptive and reflective capabilities of LLMs against tool variability. By leveraging Monte Carlo Tree Search, ToolEVO facilitates active exploration and interaction of LLMs within dynamic environments, allowing for autonomous self-reflection and self-updating of tool usage based on environmental feedback. Additionally, we introduce ToolQA-D, a benchmark specifically designed to evaluate the impact of tool variability. Extensive experiments demonstrate the effectiveness and stability of our approach, highlighting the importance of adaptability to tool variability for effective tool learning.
[AI-75] Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers
链接: https://arxiv.org/abs/2410.06614
作者: Stephen Hausler,Peyman Moghadam
关键词-EN: Visual Place Recognition, Place Recognition, pair classifier, global descriptor, Vision Transformer components
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images are from the same place or not. The network only comprises Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, typically the network is initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, by using Siamese Masked Image Modelling as a pre-training task. We propose a Place-aware image sampling procedure from a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Mask Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders. The Pair-VPR website is: https://csiro-robotics.github.io/Pair-VPR.
[AI-76] Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS
链接: https://arxiv.org/abs/2410.06608
作者: Onkar Kishor Susladkar,Vishesh Tripathi,Biddwan Ahmed
关键词-EN: designed to enhance, introduces a comprehensive, enhance the quality, quality and versatility, versatility of synthetic
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:This research introduces a comprehensive Bahasa text-to-speech (TTS) dataset and a novel TTS model, EnGen-TTS, designed to enhance the quality and versatility of synthetic speech in the Bahasa language. The dataset, spanning ~55.0 hours and 52K audio recordings, integrates diverse textual sources, ensuring linguistic richness. A meticulous recording setup captures the nuances of Bahasa phonetics, employing professional equipment to ensure high-fidelity audio samples. Statistical analysis reveals the dataset’s scale and diversity, laying the foundation for model training and evaluation. The proposed EnGen-TTS model performs better than established baselines, achieving a Mean Opinion Score (MOS) of 4.45 ± 0.13. Additionally, our investigation on real-time factor and model size highlights EnGen-TTS as a compelling choice, with efficient performance. This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications. Link to Generated Samples: this https URL
[AI-77] Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching
链接: https://arxiv.org/abs/2410.06561
作者: Wenqi Niu,Yingchao Wang,Guohui Cai,Hanpo Hou
关键词-EN: neural network compression, Matching Knowledge Distillation, Knowledge Distillation, student model, Correlation Matching Knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 10 figures
点击查看摘要
Abstract:Knowledge Distillation (KD) has emerged as a pivotal technique for neural network compression and performance enhancement. Most KD methods aim to transfer dark knowledge from a cumbersome teacher model to a lightweight student model based on Kullback-Leibler (KL) divergence loss. However, the student performance improvements achieved through KD exhibit diminishing marginal returns, where a stronger teacher model does not necessarily lead to a proportionally stronger student model. To address this issue, we empirically find that the KL-based KD method may implicitly change the inter-class relationships learned by the student model, resulting in a more complex and ambiguous decision boundary, which in turn reduces the model’s accuracy and generalization ability. Therefore, this study argues that the student model should learn not only the probability values from the teacher’s output but also the relative ranking of classes, and proposes a novel Correlation Matching Knowledge Distillation (CMKD) method that combines the Pearson and Spearman correlation coefficients-based KD loss to achieve more efficient and robust distillation from a stronger teacher model. Moreover, considering that samples vary in difficulty, CMKD dynamically adjusts the weights of the Pearson-based loss and Spearman-based loss. CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet, and adapts well to various teacher architectures, sizes, and other KD methods.
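To make the distinction between matching probability values and matching class rankings concrete, here is a minimal sketch that only evaluates the two correlation coefficients on a toy teacher/student pair. It is not the authors' CMKD loss: the function names, the fixed weight `alpha`, and the example vectors are assumptions for illustration, and hard ranks are not differentiable, so a real training loss would need a differentiable ranking approximation. The abstract also states that CMKD adjusts the two weights dynamically per sample, which this sketch does not attempt.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """Rank of each element (0 = smallest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def correlation_gap(teacher, student, alpha=0.5):
    """Hypothetical combined score: 0 when values and ranking both agree.

    alpha is an illustrative fixed weight, not a value from the paper.
    """
    return alpha * (1.0 - pearson(teacher, student)) + (1.0 - alpha) * (
        1.0 - spearman(teacher, student)
    )


# Toy class probabilities: the student swaps the 2nd and 3rd classes.
teacher = [0.70, 0.20, 0.06, 0.04]
student = [0.70, 0.06, 0.20, 0.04]

print("pearson :", round(pearson(teacher, student), 3))   # roughly 0.93
print("spearman:", round(spearman(teacher, student), 3))  # roughly 0.8
print("gap     :", round(correlation_gap(teacher, student), 3))
```

In this toy case the Pearson coefficient stays fairly high because the dominant class agrees, while the Spearman coefficient drops more because two classes change rank, which is the kind of ranking information the abstract argues a student should also learn.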
[AI-78] Mitigating Time Discretization Challenges with WeatherODE: A Sandwich Physics-Driven Neural ODE for Weather Forecasting
链接: https://arxiv.org/abs/2410.06560
作者: Peiyuan Liu,Tian Zhou,Liang Sun,Rong Jin
关键词-EN: time-dependent source discrepancies, weather forecasting, grapple with discretization, limit their predictive, weather forecasting accuracy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In the field of weather forecasting, traditional models often grapple with discretization errors and time-dependent source discrepancies, which limit their predictive performance. In this paper, we present WeatherODE, a novel one-stage, physics-driven ordinary differential equation (ODE) model designed to enhance weather forecasting accuracy. By leveraging wave equation theory and integrating a time-dependent source model, WeatherODE effectively addresses the challenges associated with time-discretization error and dynamic atmospheric processes. Moreover, we design a CNN-ViT-CNN sandwich structure, facilitating efficient learning dynamics tailored for distinct yet interrelated tasks with varying optimization biases in advection equation estimation. Through rigorous experiments, WeatherODE demonstrates superior performance in both global and regional weather forecasting tasks, outperforming recent state-of-the-art approaches by significant margins of over 40.0% and 31.8% in root mean square error (RMSE), respectively. The source code is available at this https URL.
[AI-79] The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models EMNLP2024
链接: https://arxiv.org/abs/2410.06554
作者: Yanjun Chen,Dawei Zhu,Yirong Sun,Xinghao Chen,Wei Zhang,Xiaoyu Shen
关键词-EN: Human Feedback significantly, Natural Language Processing, significantly enhances Natural, Feedback significantly enhances, Reinforcement Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, 27 figures (including 18 in the appendix), submitted to EMNLP 2024
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether stronger reward models invariably lead to better language models. In this paper, through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional details are available at this https URL.
[AI-80] InstantIR: Blind Image Restoration with Instant Generative Reference
链接: https://arxiv.org/abs/2410.06551
作者: Jen-Yuan Huang,Haofan Wang,Qixun Wang,Xu Bai,Hao Ai,Peng Xing,Jen-Tse Huang
关键词-EN: Handling test-time unknown, Blind Image Restoration, Handling test-time, high model generalization, necessitating high model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Handling test-time unknown degradation is the major challenge in Blind Image Restoration (BIR), necessitating high model generalization. An effective strategy is to incorporate prior knowledge, either from human input or from a generative model. In this paper, we introduce Instant-reference Image Restoration (InstantIR), a novel diffusion-based BIR method which dynamically adjusts the generation condition during inference. We first extract a compact representation of the input via a pre-trained vision encoder. At each generation step, this representation is used to decode the current diffusion latent and instantiate it in the generative prior. The degraded image is then encoded with this reference, providing a robust generation condition. We observe that the variance of generative references fluctuates with degradation intensity, which we further leverage as an indicator for developing a sampling algorithm adaptive to input quality. Extensive experiments demonstrate that InstantIR achieves state-of-the-art performance and offers outstanding visual quality. Through modulating generative references with textual description, InstantIR can restore extreme degradation and additionally feature creative restoration.
[AI-81] Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis
链接: https://arxiv.org/abs/2410.06550
作者: Shiho Matta,Yin Jou Huang,Fei Cheng,Hirokazu Kiyomaru,Yugo Murawaki
关键词-EN: Recent studies, low cost, LLM-generated data, studies have demonstrated, demonstrated that few-shot
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages including 4 pages of references and appendix. 7 figures
点击查看摘要
Abstract:Recent studies have demonstrated that few-shot learning allows LLMs to generate training data for supervised models at a low cost. However, the quality of LLM-generated data may not entirely match that of human-labeled data. This raises a crucial question: how should one balance the trade-off between the higher quality but more expensive human data and the lower quality yet substantially cheaper LLM-generated data? In this paper, we synthesized training data for conversational semantic frame analysis using GPT-4 and examined how to allocate budgets optimally to achieve the best performance. Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining both human and LLM-generated data across a wide range of budget levels. Notably, as the budget decreases, a higher proportion of LLM-generated data becomes more preferable.
[AI-82] DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector
链接: https://arxiv.org/abs/2410.06549
作者: Jinghan Li,Yuan Gao,Jinda Lu,Junfeng Fang,Congcong Wen,Hui Lin,Xiang Wang
关键词-EN: Graph Anomaly Detection, garnering significant attention, Anomaly Detection, Diffusion-based Graph Anomaly, suboptimal anomaly detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Graph Anomaly Detection (GAD) is crucial for identifying abnormal entities within networks, garnering significant attention across various fields. Traditional unsupervised methods, which decode encoded latent representations of unlabeled data with a reconstruction focus, often fail to capture critical discriminative content, leading to suboptimal anomaly detection. To address these challenges, we present a Diffusion-based Graph Anomaly Detector (DiffGAD). At the heart of DiffGAD is a novel latent space learning paradigm, meticulously designed to enhance its proficiency by guiding it with discriminative content. This innovative approach leverages diffusion sampling to infuse the latent space with discriminative content and introduces a content-preservation mechanism that retains valuable information across different scales, significantly improving its adeptness at identifying anomalies with limited time and space complexity. Our comprehensive evaluation of DiffGAD, conducted on six real-world and large-scale datasets with various metrics, demonstrated its exceptional performance.
[AI-83] Chip-Tuning: Classify Before Language Models Say
链接: https://arxiv.org/abs/2410.06541
作者: Fangwei Zhu,Dian Li,Jiajun Huang,Gang Liu,Hui Wang,Zhifang Sui
关键词-EN: training and inference, rapid development, increasing cost, large language models, LLMs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid development in the performance of large language models (LLMs) is accompanied by the escalation of model size, leading to the increasing cost of model training and inference. Previous research has discovered that certain layers in LLMs exhibit redundancy, and removing these layers brings only marginal loss in model performance. In this paper, we adopt the probing technique to explain the layer redundancy in LLMs and demonstrate that language models can be effectively pruned with probing classifiers. We propose chip-tuning, a simple and effective structured pruning framework specialized for classification problems. Chip-tuning attaches tiny probing classifiers named chips to different layers of LLMs, and trains chips with the backbone model frozen. After selecting a chip for classification, all layers subsequent to the attached layer could be removed with marginal performance loss. Experimental results on various LLMs and datasets demonstrate that chip-tuning significantly outperforms previous state-of-the-art baselines in both accuracy and pruning ratio, achieving a pruning ratio of up to 50%. We also find that chip-tuning could be applied on multimodal models, and could be combined with model finetuning, proving its excellent compatibility.
[AI-84] TopoTune: A Framework for Generalized Combinatorial Complex Neural Networks
链接: https://arxiv.org/abs/2410.06530
作者: Mathilde Papillon,Guillermo Bernárdez,Claudio Battiloro,Nina Miolane
关键词-EN: Complex Neural Networks, relational datasets, processing node, Graph Neural Networks, Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) excel in learning from relational datasets, processing node and edge features in a way that preserves the symmetries of the graph domain. However, many complex systems–such as biological or social networks–involve multiway complex interactions that are more naturally represented by higher-order topological spaces. The emerging field of Topological Deep Learning (TDL) aims to accommodate and leverage these higher-order structures. Combinatorial Complex Neural Networks (CCNNs), fairly general TDL models, have been shown to be more expressive and better performing than GNNs. However, differently from the graph deep learning ecosystem, TDL lacks a principled and standardized framework for easily defining new architectures, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a novel simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software that allows practitioners to define, build, and train GCCNs with unprecedented flexibility and ease.
[AI-85] The Sampling-Gaussian for stereo matching ATC
链接: https://arxiv.org/abs/2410.06527
作者: Baiyu Pan,Jichao Jiao,Bowen Yao,Jianxin Pang,Jun Cheng
关键词-EN: enable differentiable regression, neural network-based stereo, network-based stereo matching, regression of disparity, operation is widely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: TL;DR: A novel Gaussian distribution-based supervision method for stereo matching. Implemented with five baseline methods and achieves notable improvement. Main content, 10 pages. conference submission
点击查看摘要
Abstract:The soft-argmax operation is widely adopted in neural network-based stereo matching methods to enable differentiable regression of disparity. However, a network trained with soft-argmax is prone to producing multimodal distributions because there is no explicit constraint on the shape of the probability distribution. Previous methods leverage the Laplacian distribution and cross-entropy for training but fail to effectively improve accuracy and even compromise the efficiency of the network. In this paper, we conduct a detailed analysis of the previous distribution-based methods and propose a novel supervision method for stereo matching, Sampling-Gaussian. We sample from the Gaussian distribution for supervision. Moreover, we interpret the training as minimizing the distance in vector space and propose a combined loss of L1 loss and cosine similarity loss. Additionally, we leverage bilinear interpolation to upsample the cost volume. Our method can be directly applied to any soft-argmax-based stereo matching method without a reduction in efficiency. We have conducted comprehensive experiments to demonstrate the superior performance of our Sampling-Gaussian. The experimental results show that we have achieved better accuracy on five baseline methods and two datasets. Our method is easy to implement, and the code is available online.
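As a small illustration of why an unconstrained distribution is a problem for soft-argmax, the sketch below computes the soft-argmax disparity (the expected disparity index under a softmax over matching scores) for a unimodal and a bimodal score vector. The score values and function names are made up for illustration and are not taken from the paper; the sketch shows only the standard soft-argmax operation, not the Sampling-Gaussian supervision itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmax(scores):
    """Expected disparity index under the softmax distribution."""
    probs = softmax(scores)
    return sum(index * p for index, p in enumerate(probs))


# Hypothetical matching scores over 7 candidate disparities (0..6).
unimodal = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]  # one clear peak at d = 2
bimodal = [0.0, 8.0, 0.0, 0.0, 0.0, 8.0, 0.0]   # equal peaks at d = 1 and d = 5

print("unimodal estimate:", round(soft_argmax(unimodal), 3))  # close to 2
print("bimodal estimate :", round(soft_argmax(bimodal), 3))   # lands near 3
```

With two equal peaks the estimate falls midway between them, at a disparity that neither peak supports, which is the failure mode that motivates constraining the shape of the distribution during training.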
[AI-86] Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA EMNLP2024
链接: https://arxiv.org/abs/2410.06524
作者: Maharshi Gor,Hal Daumé III,Tianyi Zhou,Jordan Boyd-Graber
关键词-EN: large language models, natural language processing, Recent advancements, language models, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear at EMNLP 2024 (Main)
点击查看摘要
Abstract:Recent advancements of large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that challenge not only higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.
[AI-87] QuadBEV: An Efficient Quadruple-Task Perception Framework via Bird's-Eye-View Representation
链接: https://arxiv.org/abs/2410.06516
作者: Yuxin Li,Yiheng Li,Xulei Yang,Mengying Yu,Zihang Huang,Xiaojun Wu,Chai Kiat Yeo
关键词-EN: integrate multiple sensor, multiple sensor inputs, driving systems due, autonomous driving systems, unified representation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Bird’s-Eye-View (BEV) perception has become a vital component of autonomous driving systems due to its ability to integrate multiple sensor inputs into a unified representation, enhancing performance in various downstream tasks. However, the computational demands of BEV models pose challenges for real-world deployment in vehicles with limited resources. To address these limitations, we propose QuadBEV, an efficient multitask perception framework that leverages the shared spatial and contextual information across four key tasks: 3D object detection, lane detection, map segmentation, and occupancy prediction. QuadBEV not only streamlines the integration of these tasks using a shared backbone and task-specific heads but also addresses common multitask learning challenges such as learning rate sensitivity and conflicting task objectives. Our framework reduces redundant computations, thereby enhancing system efficiency, making it particularly suited for embedded systems. We present comprehensive experiments that validate the effectiveness and robustness of QuadBEV, demonstrating its suitability for real-world applications.
[AI-88] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
链接: https://arxiv.org/abs/2410.06511
作者: Wanchao Liang,Tianyu Liu,Less Wright,Will Constable,Andrew Gu,Chien-Chin Huang,Iris Zhang,Wei Feng,Howard Huang,Junjie Wang,Sanket Purandare,Gokul Nadathur,Stratos Idreos
关键词-EN: large language models, language processing applications, natural language processing, instrumental in advancing, processing applications
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The development of large language models (LLMs) has been instrumental in advancing state-of-the-art natural language processing applications. Training LLMs with billions of parameters and trillions of tokens requires sophisticated distributed systems that enable composing and comparing several state-of-the-art techniques in order to efficiently scale across thousands of accelerators. However, existing solutions are complex, scattered across multiple libraries/repositories, lack interoperability, and are cumbersome to maintain. Thus, curating and empirically comparing training recipes requires non-trivial engineering effort. This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system that unifies state-of-the-art techniques, streamlining integration and reducing overhead. TorchTitan enables 3D parallelism in a modular manner with elastic scaling, providing comprehensive logging, checkpointing, and debugging tools for production-ready training. It also incorporates hardware-software co-designed solutions, leveraging features like Float8 training and SymmetricMemory. As a flexible test bed, TorchTitan facilitates custom recipe curation and comparison, allowing us to develop optimized training recipes for Llama 3.1 and provide guidance on selecting techniques for maximum efficiency based on our experiences. We thoroughly assess TorchTitan on the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its exceptional performance, modular composability, and elastic scalability. By stacking training optimizations, we demonstrate accelerations of 65.08% with 1D parallelism at the 128-GPU scale (Llama 3.1 8B), an additional 12.59% with 2D parallelism at the 256-GPU scale (Llama 3.1 70B), and an additional 30% with 3D parallelism at the 512-GPU scale (Llama 3.1 405B) on NVIDIA H100 GPUs over optimized baselines.
[AI-89] Chemistry-Inspired Diffusion with Non-Differentiable Guidance
链接: https://arxiv.org/abs/2410.06502
作者: Yuchen Shen,Chenhao Zhang,Sijie Fu,Chenghui Zhou,Newell Washburn,Barnabás Póczos
关键词-EN: shown remarkable potential, Recent advances, shown remarkable, remarkable potential, diffusion models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: preprint
点击查看摘要
Abstract:Recent advances in diffusion models have shown remarkable potential in the conditional generation of novel molecules. These models can be guided in two ways: (i) explicitly, through additional features representing the condition, or (ii) implicitly, using a property predictor. However, training property predictors or conditional diffusion models requires an abundance of labeled data and is inherently challenging in real-world applications. We propose a novel approach that attenuates the limitations of acquiring large labeled datasets by leveraging domain knowledge from quantum chemistry as a non-differentiable oracle to guide an unconditional diffusion model. Instead of relying on neural networks, the oracle provides accurate guidance in the form of estimated gradients, allowing the diffusion process to sample from a conditional distribution specified by quantum chemistry. We show that this results in more precise conditional generation of novel and stable molecular structures. Our experiments demonstrate that our method: (1) significantly reduces atomic forces, enhancing the validity of generated molecules when used for stability optimization; (2) is compatible with both explicit and implicit guidance in diffusion models, enabling joint optimization of molecular properties and stability; and (3) generalizes effectively to molecular optimization tasks beyond stability optimization.
[AI-90] ERCache: An Efficient and Reliable Caching Framework for Large-Scale User Representations in Meta's Ads System
链接: https://arxiv.org/abs/2410.06497
作者: Fang Zhou,Yaning Huang,Dong Liang,Dai Li,Zhongke Zhang,Kai Wang,Xiao Xin,Abdallah Aboelela,Zheliang Jiang,Yang Wang,Jeff Song,Wei Zhang,Chen Liang,Huayu Li,ChongLin Sun,Hang Yang,Lei Qu,Zhan Shu,Mindi Yuan,Emanuele Maccherani,Taha Hayat,John Guo,Varna Puvvada,Uladzimir Pashkevich
关键词-EN: strict service-level agreements, presents significant challenges, deep learning models, representations presents significant, calculating user representations
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The increasing complexity of deep learning models used for calculating user representations presents significant challenges, particularly with limited computational resources and strict service-level agreements (SLAs). Previous research efforts have focused on optimizing model inference but have overlooked a critical question: is it necessary to perform user model inference for every ad request in large-scale social networks? To address this question and these challenges, we first analyze user access patterns at Meta and find that most user model inferences occur within a short timeframe. This observation reveals a triangular relationship among model complexity, embedding freshness, and service SLAs. Building on this insight, we designed, implemented, and evaluated ERCache, an efficient and robust caching framework for large-scale user representations in ads recommendation systems on social networks. ERCache categorizes cache into direct and failover types and applies customized settings and eviction policies for each model, effectively balancing model complexity, embedding freshness, and service SLAs, even considering the staleness introduced by caching. ERCache has been deployed at Meta for over six months, supporting more than 30 ranking models while efficiently conserving computational resources and complying with service SLA requirements.
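The abstract gives no implementation details, so the following is only a rough sketch of one plausible reading of a "direct" versus "failover" cache: a fresh entry is served directly and skips inference, while an older entry is served only when inference fails. The class name, the two TTL parameters, the injected clock, and the returned source labels are all hypothetical and should not be taken as ERCache's actual design.

```python
import time


class TwoTierEmbeddingCache:
    """Hypothetical direct/failover cache for user embeddings (illustration only)."""

    def __init__(self, compute, direct_ttl, failover_ttl, clock=time.monotonic):
        self.compute = compute            # callable: user_id -> embedding (may raise)
        self.direct_ttl = direct_ttl      # max age (seconds) to skip inference
        self.failover_ttl = failover_ttl  # max age (seconds) to serve when inference fails
        self.clock = clock
        self.store = {}                   # user_id -> (embedding, stored_at)

    def get(self, user_id):
        now = self.clock()
        entry = self.store.get(user_id)
        # Direct hit: the cached embedding is fresh enough, so inference is skipped.
        if entry is not None and now - entry[1] <= self.direct_ttl:
            return entry[0], "direct"
        try:
            embedding = self.compute(user_id)
        except Exception:
            # Failover hit: inference failed, serve a staler entry if it is not too old.
            if entry is not None and now - entry[1] <= self.failover_ttl:
                return entry[0], "failover"
            raise
        self.store[user_id] = (embedding, now)
        return embedding, "fresh"


# Usage with a fake clock and a fake model so the behaviour is deterministic.
fake_now = [0.0]
model_down = [False]


def fake_model(user_id):
    if model_down[0]:
        raise RuntimeError("inference unavailable")
    return [0.1, 0.2, 0.3]


cache = TwoTierEmbeddingCache(
    fake_model, direct_ttl=60, failover_ttl=3600, clock=lambda: fake_now[0]
)
print(cache.get("user-1")[1])  # first request runs inference
fake_now[0] = 30.0
print(cache.get("user-1")[1])  # within direct_ttl, served from cache
fake_now[0] = 120.0
model_down[0] = True
print(cache.get("user-1")[1])  # inference fails, stale entry is served
```

The two TTLs expose the trade-off the abstract describes: a longer direct TTL saves more inference at the cost of embedding freshness, and the failover TTL bounds how stale an embedding may be when it is used to keep the service within its SLA.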
[AI-91] BiC-MPPI: Goal-Pursuing Sampling-Based Bidirectional Rollout Clustering Path Integral for Trajectory Optimization
Link: https://arxiv.org/abs/2410.06493
Authors: Minchan Jung, Kwangki Kim
Keywords: Predictive Path Integral, Model Predictive Path, Bidirectional Clustered MPPI, enhancing goal-directed guidance, Model Predictive
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 7 pages, 1 figure. MSC classes: 68T40, 13P25. ACM classes: I.2.9; I.2.8; G.1.6; G.4
Abstract:This paper introduces the Bidirectional Clustered MPPI (BiC-MPPI) algorithm, a novel trajectory optimization method aimed at enhancing goal-directed guidance within the Model Predictive Path Integral (MPPI) framework. BiC-MPPI incorporates bidirectional dynamics approximations and a new guide cost mechanism, improving both trajectory planning and goal-reaching performance. By leveraging forward and backward rollouts, the bidirectional approach ensures effective trajectory connections between initial and terminal states, while the guide cost helps discover dynamically feasible paths. Experimental results demonstrate that BiC-MPPI outperforms existing MPPI variants in both 2D and 3D environments, achieving higher success rates and competitive computation times across 900 simulations on a modified BARN dataset for autonomous navigation. GitHub: this https URL
[AI-92] Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Link: https://arxiv.org/abs/2410.06491
Authors: Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni
Keywords: modifying task checklists, Previous work, egregious specification gaming, reinforcement learning, specification gaming
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 9 figures
Abstract:Previous work has shown that training “helpful-only” LLMs with reinforcement learning on a curriculum of gameable environments can lead models to generalize to egregious specification gaming, such as editing their own reward function or modifying task checklists to appear more successful. We show that gpt-4o, gpt-4o-mini, o1-preview, and o1-mini - frontier models trained to be helpful, harmless, and honest - can engage in specification gaming without training on a curriculum of tasks, purely from in-context iterative reflection (which we call in-context reinforcement learning, “ICRL”). We also show that using ICRL to generate highly-rewarded outputs for expert iteration (compared to the standard expert iteration reinforcement learning algorithm) may increase gpt-4o-mini’s propensity to learn specification-gaming policies, generalizing (in very rare cases) to the most egregious strategy where gpt-4o-mini edits its own reward function. Our results point toward the strong ability of in-context reflection to discover rare specification-gaming strategies that models might not exhibit zero-shot or with normal training, highlighting the need for caution when relying on alignment of LLMs in zero-shot settings.
[AI-93] FedL2G: Learning to Guide Local Training in Heterogeneous Federated Learning
Link: https://arxiv.org/abs/2410.06490
Authors: Jianqing Zhang, Yang Liu, Yang Hua, Jian Cao, Qiang Yang
Keywords: Heterogeneous Federated Learning, Federated Learning, heterogeneous model architectures, core issues, heterogeneous model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Data and model heterogeneity are two core issues in Heterogeneous Federated Learning (HtFL). In scenarios with heterogeneous model architectures, aggregating model parameters becomes infeasible, leading to the use of prototypes (i.e., class representative feature vectors) for aggregation and guidance. However, they still experience a mismatch between the extra guiding objective and the client’s original local objective when aligned with global prototypes. Thus, we propose a Federated Learning-to-Guide (FedL2G) method that adaptively learns to guide local training in a federated manner and ensures the extra guidance is beneficial to clients’ original tasks. With theoretical guarantees, FedL2G efficiently implements the learning-to-guide process using only first-order derivatives w.r.t. model parameters and achieves a non-convex convergence rate of O(1/T). We conduct extensive experiments on two data heterogeneity and six model heterogeneity settings using 14 heterogeneous model architectures (e.g., CNNs and ViTs) to demonstrate FedL2G’s superior performance compared to six counterparts.
[AI-94] OledFL: Unleashing the Potential of Decentralized Federated Learning via Opposite Lookahead Enhancement
Link: https://arxiv.org/abs/2410.06482
Authors: Qinglun Li, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen
Keywords: Decentralized Federated Learning, surpasses Centralized Federated, Federated Learning, Decentralized Federated, surpasses Centralized
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Decentralized Federated Learning (DFL) surpasses Centralized Federated Learning (CFL) in terms of faster training, privacy preservation, and light communication, making it a promising alternative in the field of federated learning. However, DFL still exhibits significant disparities with CFL in terms of generalization ability such as rarely theoretical understanding and degraded empirical performance due to severe inconsistency. In this paper, we enhance the consistency of DFL by developing an opposite lookahead enhancement technique (Ole), yielding OledFL to optimize the initialization of each client in each communication round, thus significantly improving both the generalization and convergence speed. Moreover, we rigorously establish its convergence rate in non-convex setting and characterize its generalization bound through uniform stability, which provides concrete reasons why OledFL can achieve both the fast convergence speed and high generalization ability. Extensive experiments conducted on the CIFAR10 and CIFAR100 datasets with Dirichlet and Pathological distributions illustrate that our OledFL can achieve up to 5% performance improvement and 8× speedup, compared to the most popular DFedAvg optimizer in DFL.
[AI-95] Grounding Robot Policies with Visuomotor Language Guidance
Link: https://arxiv.org/abs/2410.06473
Authors: Arthur Bucker, Pablo Ortega, Jonathan Francis, Jean Oh
Keywords: large-scale internet data, Recent advances, natural language processing, shown great potential, fields of natural
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures, 1 table
Abstract:Recent advances in the fields of natural language processing and computer vision have shown great potential in understanding the underlying dynamics of the world from large-scale internet data. However, translating this knowledge into robotic systems remains an open challenge, given the scarcity of human-robot interactions and the lack of large-scale datasets of real-world robotic data. Previous robot learning approaches such as behavior cloning and reinforcement learning have shown great capabilities in learning robotic skills from human demonstrations or from scratch in specific environments. However, these approaches often require task-specific demonstrations or designing complex simulation environments, which limits the development of generalizable and robust policies for new settings. Aiming to address these limitations, we propose an agent-based framework for grounding robot policies to the current context, considering the constraints of a current robot and its environment using visuomotor-grounded language guidance. The proposed framework is composed of a set of conversational agents designed for specific roles – namely, high-level advisor, visual grounding, monitoring, and robotic agents. Given a base policy, the agents collectively generate guidance at run time to shift the action distribution of the base policy towards more desirable future states. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates both in simulation and in real-world experiments without the need for additional human demonstrations or extensive exploration. Project videos at this https URL.
[AI-96] Enabling Novel Mission Operations and Interactions with ROSA: The Robot Operating System Agent
Link: https://arxiv.org/abs/2410.06472
Authors: Rob Royce, Marcel Kaufmann, Jonathan Becktor, Sangwoo Moon, Kalind Carpenter, Kai Pak, Amanda Towler, Rohan Thakker, Shehryar Khattak
Keywords: Robot Operating System, revolutionized numerous industries, specialized technical knowledge, demands specialized technical, Operating System Agent
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Under review for IEEE Aerospace Conference, 20 pages, 20 figures
Abstract:The advancement of robotic systems has revolutionized numerous industries, yet their operation often demands specialized technical knowledge, limiting accessibility for non-expert users. This paper introduces ROSA (Robot Operating System Agent), an AI-powered agent that bridges the gap between the Robot Operating System (ROS) and natural language interfaces. By leveraging state-of-the-art language models and integrating open-source frameworks, ROSA enables operators to interact with robots using natural language, translating commands into actions and interfacing with ROS through well-defined tools. ROSA’s design is modular and extensible, offering seamless integration with both ROS1 and ROS2, along with safety mechanisms like parameter validation and constraint enforcement to ensure secure, reliable operations. While ROSA is originally designed for ROS, it can be extended to work with other robotics middle-wares to maximize compatibility across missions. ROSA enhances human-robot interaction by democratizing access to complex robotic systems, empowering users of all expertise levels with multi-modal capabilities such as speech integration and visual perception. Ethical considerations are thoroughly addressed, guided by foundational principles like Asimov’s Three Laws of Robotics, ensuring that AI integration promotes safety, transparency, privacy, and accountability. By making robotic technology more user-friendly and accessible, ROSA not only improves operational efficiency but also sets a new standard for responsible AI use in robotics and potentially future mission operations. This paper introduces ROSA’s architecture and showcases initial mock-up operations in JPL’s Mars Yard, a laboratory, and a simulation using three different robots. The core ROSA library is available as open-source.
[AI-97] Does Spatial Cognition Emerge in Frontier Models?
Link: https://arxiv.org/abs/2410.06468
Authors: Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, Vladlen Koltun
Keywords: present SPACE, Abstract, models, benchmark, spatial
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.
[AI-98] Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
Link: https://arxiv.org/abs/2410.06462
Authors: David Noever, Forrest McKee
Keywords: research builds, builds and evaluates, evaluates the adversarial, adversarial potential, potential to introduce
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:The research builds and evaluates the adversarial potential to introduce copied code or hallucinated AI recommendations for malicious code in popular code repositories. While foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings, previous work on math solutions that embed harmful prompts demonstrate that the guardrails may differ between expert contexts. These loopholes would appear in mixture of expert’s models when the context of the question changes and may offer fewer malicious training examples to filter toxic comments or recommended offensive actions. The present work demonstrates that foundational models may refuse to propose destructive actions correctly when prompted overtly but may unfortunately drop their guard when presented with a sudden change of context, like solving a computer programming challenge. We show empirical examples with trojan-hosting repositories like GitHub, NPM, NuGet, and popular content delivery networks (CDN) like jsDelivr which amplify the attack surface. In the LLM’s directives to be helpful, example recommendations propose application programming interface (API) endpoints which a determined domain-squatter could acquire and setup attack mobile infrastructure that triggers from the naively copied code. We compare this attack to previous work on context-shifting and contrast the attack surface as a novel version of “living off the land” attacks in the malware literature. In the latter case, foundational language models can hijack otherwise innocent user prompts to recommend actions that violate their owners’ safety policies when posed directly without the accompanying coding support request.
[AI-99] LLM Self-Correction with DeCRIM: Decompose Critique and Refine for Enhanced Following of Instructions with Multiple Constraints EMNLP2024
Link: https://arxiv.org/abs/2410.06458
Authors: Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
Keywords: key capability, instructions, Abstract, DeCRIM, LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at EMNLP 2024
Abstract:Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post “in a funny tone” with “no hashtag”). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
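The decompose, critique, and refine steps described in the abstract can be sketched as a small control loop. This is a minimal illustration under assumptions, not the authors' pipeline: `decrim_loop`, its callable parameters, `max_rounds`, and the rule-based `toy_*` stand-ins are all hypothetical names introduced here; in practice each callable would presumably wrap an LLM or Critic model call.

```python
from typing import Callable, List


def decrim_loop(
    instruction: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    critique: Callable[[str, str], bool],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,  # hypothetical cap on refinement rounds
) -> str:
    """Refine a response until every decomposed constraint passes the critic."""
    response = generate(instruction)
    constraints = decompose(instruction)
    for _ in range(max_rounds):
        # The critic decides which constraints the current response violates.
        failed = [c for c in constraints if not critique(response, c)]
        if not failed:
            break  # nothing left to fix
        response = refine(instruction, response, failed)
    return response


# Rule-based stand-ins so the sketch runs without any model.
def toy_generate(instruction: str) -> str:
    return "Big News #launch"


def toy_decompose(instruction: str) -> List[str]:
    return ["lowercase", "no hashtag"]


def toy_critique(response: str, constraint: str) -> bool:
    if constraint == "lowercase":
        return response == response.lower()
    return "#" not in response


def toy_refine(instruction: str, response: str, failed: List[str]) -> str:
    if "lowercase" in failed:
        response = response.lower()
    if "no hashtag" in failed:
        response = response.replace("#", "")
    return response


result = decrim_loop(
    "Write a post in lowercase with no hashtag",
    toy_generate, toy_decompose, toy_critique, toy_refine,
)
```

Passing only the failed constraints to `refine` mirrors the abstract's idea that the Critic decides when and where refinement is needed.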
[AI-100] Modeling chaotic Lorenz ODE System using Scientific Machine Learning
Link: https://arxiv.org/abs/2410.06452
Authors: Sameera S Kashyap, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Keywords: face significant challenges, significant challenges due, data efficiency crucial, prediction face significant, making data efficiency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures, 3 tables
Abstract:In climate science, models for global warming and weather prediction face significant challenges due to the limited availability of high-quality data and the difficulty in obtaining it, making data efficiency crucial. In the past few years, Scientific Machine Learning (SciML) models have gained tremendous traction as they can be trained in a data-efficient manner, making them highly suitable for real-world climate applications. Despite this, very little attention has been paid to chaotic climate system modeling utilizing SciML methods. In this paper, we have integrated SciML methods into foundational weather models, where we have enhanced large-scale climate predictions with a physics-informed approach that achieves high accuracy with reduced data. We successfully demonstrate that by combining the interpretability of physical climate models with the computational power of neural networks, SciML models can prove to be a reliable tool for modeling climate. This indicates a shift from the traditional black box-based machine learning modeling of climate systems to physics-informed decision-making, leading to effective climate policy implementation.
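The abstract does not spell out the system named in the title. For orientation, the classic Lorenz equations are dx/dt = σ(y − x), dy/dt = x(ρ − z) − y, dz/dt = xy − βz. The sketch below integrates them with a fixed-step RK4 scheme, assuming the textbook parameters σ = 10, ρ = 28, β = 8/3; the paper's own parameters, solver, and SciML models are not given in this listing, and the helper names `lorenz_rhs`, `rk4_step`, and `integrate` are hypothetical.

```python
from typing import Tuple

State = Tuple[float, float, float]


def lorenz_rhs(s: State, sigma: float = 10.0, rho: float = 28.0,
               beta: float = 8.0 / 3.0) -> State:
    """Right-hand side of the Lorenz system (textbook parameters assumed)."""
    x, y, z = s
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(s: State, dt: float) -> State:
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base: State, k: State, h: float) -> State:
        return tuple(b + h * ki for b, ki in zip(base, k))

    k1 = lorenz_rhs(s)
    k2 = lorenz_rhs(shifted(s, k1, dt / 2))
    k3 = lorenz_rhs(shifted(s, k2, dt / 2))
    k4 = lorenz_rhs(shifted(s, k3, dt))
    return tuple(
        si + dt / 6 * (a + 2 * b + 2 * c + d)
        for si, a, b, c, d in zip(s, k1, k2, k3, k4)
    )


def integrate(s: State, dt: float, steps: int) -> State:
    """Apply rk4_step repeatedly and return the final state."""
    for _ in range(steps):
        s = rk4_step(s, dt)
    return s


# Two trajectories whose initial conditions differ by 1e-8 in z.
a = integrate((1.0, 1.0, 1.0), 0.01, 2500)
b = integrate((1.0, 1.0, 1.0 + 1e-8), 0.01, 2500)
separation = max(abs(p - q) for p, q in zip(a, b))
```

Comparing `separation` with the initial 1e-8 offset shows the sensitivity to initial conditions that makes long-horizon prediction of chaotic systems difficult.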
[AI-101] MaD-Scientist: AI-based Scientist solving Convection-Diffusion-Reaction Equations Using Massive PINN-Based Prior Data
Link: https://arxiv.org/abs/2410.06442
Authors: Mingu Kang, Dongseok Lee, Woojin Cho, Jaehyeon Park, Kookjin Lee, Anthony Gruber, Youngjoon Hong, Noseong Park
Keywords: Large language models, Large language, noisy prior data, in-context learning, approximated prior data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Large language models (LLMs), like ChatGPT, have shown that even trained with noisy prior data, they can generalize effectively to new tasks through in-context learning (ICL) and pre-training techniques. Motivated by this, we explore whether a similar approach can be applied to scientific foundation models (SFMs). Our methodology is structured as follows: (i) we collect low-cost physics-informed neural network (PINN)-based approximated prior data in the form of solutions to partial differential equations (PDEs) constructed through an arbitrary linear combination of mathematical dictionaries; (ii) we utilize Transformer architectures with self and cross-attention mechanisms to predict PDE solutions without knowledge of the governing equations in a zero-shot setting; (iii) we provide experimental evidence on the one-dimensional convection-diffusion-reaction equation, which demonstrate that pre-training remains robust even with approximated prior data, with only marginal impacts on test accuracy. Notably, this finding opens the path to pre-training SFMs with realistic, low-cost data instead of (or in conjunction with) numerical high-cost data. These results support the conjecture that SFMs can improve in a manner similar to LLMs, where fully cleaning the vast set of sentences crawled from the Internet is nearly impossible.
[AI-102] Stress Detection on Code-Mixed Texts in Dravidian Languages using Machine Learning
Link: https://arxiv.org/abs/2410.06428
Authors: L. Ramos, M. Shahiki-Tash, Z. Ahani, A. Eponon, O. Kolesnikova, H. Calvo
Keywords: affect mental well-being, daily life, common feeling, feeling in daily, development of robust
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract:Stress is a common feeling in daily life, but it can affect mental well-being in some situations, so the development of robust detection models is imperative. This study introduces a methodical approach to stress identification in code-mixed texts for Dravidian languages. The challenge encompassed two datasets, targeting the Tamil and Telugu languages respectively. This proposal underscores the importance of using uncleaned text as a benchmark to refine future classification methodologies, incorporating diverse preprocessing techniques. A Random Forest algorithm was used, featuring three textual representations: TF-IDF, word unigrams, and a composite of character (1+2+3)-grams. The approach achieved good performance for both languages, with a Macro F1-score of 0.734 in Tamil and 0.727 in Telugu, surpassing results achieved with different complex techniques such as FastText and Transformer models. The results underscore the value of uncleaned data for mental state detection and the challenges of classifying code-mixed texts for stress, indicating the potential for improved performance through data cleaning, other preprocessing techniques, or more complex models.
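To make the composite character (1+2+3)-gram representation concrete, the sketch below counts every character n-gram for n = 1, 2, 3 in a raw, uncleaned string. It covers only the feature extraction step and is an assumption-laden illustration: the helper name `char_ngrams` is hypothetical, and the paper's actual tokenization, weighting, and Random Forest setup are not described in this listing.

```python
from collections import Counter
from typing import Iterable


def char_ngrams(text: str, n_values: Iterable[int] = (1, 2, 3)) -> Counter:
    """Count all character n-grams of the given lengths in uncleaned text."""
    counts: Counter = Counter()
    for n in n_values:
        # Slide a window of width n across the string.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


features = char_ngrams("aba")
```

Counts like these could then be arranged into a fixed-vocabulary vector and passed to a classifier. Character n-grams need no word segmentation, which is one common reason to try them on code-mixed text.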
[AI-103] NLP Case Study on Predicting the Before and After of the Ukraine-Russia and Hamas-Israel Conflicts
Link: https://arxiv.org/abs/2410.06427
Authors: Jordan Miner, John E. Ortega
Keywords: natural language processing, social media, recent events, Twitter and Reddit, propose a method
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The clusters created using topic modeling can be viewed at this https URL
Abstract:We propose a method to predict toxicity and other textual attributes through the use of natural language processing (NLP) techniques for two recent events: the Ukraine-Russia and Hamas-Israel conflicts. This article provides a basis for exploration in future conflicts with hopes to mitigate risk through the analysis of social media before and after a conflict begins. Our work compiles several datasets from Twitter and Reddit for both conflicts in a before and after separation with an aim of predicting a future state of social media for avoidance. More specifically, we show that: (1) there is a noticeable difference in social media discussion leading up to and following a conflict and (2) social media discourse on platforms like Twitter and Reddit is useful in identifying future conflicts before they arise. Our results show that through the use of advanced NLP techniques (both supervised and unsupervised) toxicity and other attributes about language before and after a conflict is predictable with a low error of nearly 1.2 percent for both conflicts.
[AI-104] FAIREDU: A Multiple Regression-Based Method for Enhancing Fairness in Machine Learning Models for Educational Applications
Link: https://arxiv.org/abs/2410.06423
Authors: Nga Pham, Minh Kha Do, Tran Vu Dai, Pham Ngoc Hung, Anh Nguyen-Duc
Keywords: impact diverse groups, systems impact diverse, machine learning, critically important, diverse groups
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Fairness in artificial intelligence and machine learning (AI/ML) models is becoming critically important, especially as decisions made by these systems impact diverse groups. In education, a vital sector for all countries, the widespread application of AI/ML systems raises specific concerns regarding fairness. Current research predominantly focuses on fairness for individual sensitive features, which limits the comprehensiveness of fairness assessments. This paper introduces FAIREDU, a novel and effective method designed to improve fairness across multiple sensitive features. Through extensive experiments, we evaluate FAIREDU effectiveness in enhancing fairness without compromising model performance. The results demonstrate that FAIREDU addresses intersectionality across features such as gender, race, age, and other sensitive features, outperforming state-of-the-art methods with minimal effect on model accuracy. The paper also explores potential future research directions to enhance further the method robustness and applicability to various machine-learning models and datasets.
[AI-105] Biased AI can Influence Political Decision-Making
Link: https://arxiv.org/abs/2410.06415
Authors: Jillian Fisher, Shangbin Feng, Robert Aron, Thomas Richardson, Yejin Choi, Daniel W. Fisher, Jennifer Pan, Yulia Tsvetkov, Katharina Reinecke
Keywords: integral to everyday, biases influence human, inherent biases, everyday tasks, bias
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:As modern AI models become integral to everyday tasks, concerns about their inherent biases and their potential impact on human decision-making have emerged. While bias in models is well-documented, less is known about how these biases influence human decisions. This paper presents two interactive experiments investigating the effects of partisan bias in AI language models on political decision-making. Participants interacted freely with either a biased liberal, conservative, or unbiased control model while completing political decision-making tasks. We found that participants exposed to politically biased models were significantly more likely to adopt opinions and make decisions aligning with the AI’s bias, regardless of their personal political partisanship. However, we also discovered that prior knowledge about AI could lessen the impact of the bias, highlighting the possible importance of AI education for robust bias mitigation. Our findings not only highlight the critical effects of interacting with biased AI and its ability to impact public discourse and political conduct, but also highlight potential techniques for mitigating these risks in the future.
[AI-106] Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation Positions and Objects
Link: https://arxiv.org/abs/2410.06405
Authors: Wenhao Li, Yudong Xu, Scott Sanner, Elias Boutros Khalil
Keywords: Artificial Intelligence systems, Artificial Intelligence, Intelligence systems, popular benchmark focused, evaluation of Artificial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract:The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on visual reasoning in the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently popular data-driven approach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT – otherwise a state-of-the-art model for images – fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose ViTARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC. Specifically, we use a pixel-level input representation, design a spatially-aware tokenization scheme, and introduce a novel object-based positional encoding that leverages automatic segmentation, among other enhancements. Our task-specific ViTARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks strictly through supervised learning from input-output grids. This calls attention to the importance of imbuing the powerful (Vision) Transformer with the correct inductive biases for abstract visual reasoning that are critical even when the training data is plentiful and the mapping is noise-free. Hence, ViTARC provides a strong foundation for future research in visual reasoning using transformer-based architectures.
[AI-107] Multimodal Representation Learning using Adaptive Graph Construction
Link: https://arxiv.org/abs/2410.06395
Authors: Weichen Huang
Keywords: train neural networks, learning train neural, images and text, contrastive learning train, train neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Multimodal contrastive learning trains neural networks by leveraging data from heterogeneous sources such as images and text. Yet, many current multimodal learning architectures cannot generalize to an arbitrary number of modalities and need to be hand-constructed. We propose AutoBIND, a novel contrastive learning framework that can learn representations from an arbitrary number of modalities through graph optimization. We evaluate AutoBIND on Alzheimer’s disease detection because it has real-world medical applicability and it contains a broad range of data modalities. We show that AutoBIND outperforms previous methods on this task, highlighting the generalizability of the approach.
[AI-108] Validation of the Scientific Literature via Chemputation Augmented by Large Language Models
Link: https://arxiv.org/abs/2410.06384
Authors: Sebastian Pagel, Michael Jirasek, Leroy Cronin
Keywords: Large Language Models, universal symbolic language, programming chemical robots, symbolic language, Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 22 pages, 7 figures, 34 references
Abstract:Chemputation is the process of programming chemical robots to do experiments using a universal symbolic language, but the literature can be error prone and hard to read due to ambiguities. Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including natural language processing, robotic control, and more recently, chemistry. Despite significant advancements in standardizing the reporting and collection of synthetic chemistry data, the automatic reproduction of reported syntheses remains a labour-intensive task. In this work, we introduce an LLM-based chemical research agent workflow designed for the automatic validation of synthetic literature procedures. Our workflow can autonomously extract synthetic procedures and analytical data from extensive documents, translate these procedures into universal XDL code, simulate the execution of the procedure in a hardware-specific setup, and ultimately execute the procedure on an XDL-controlled robotic system for synthetic chemistry. This demonstrates the potential of LLM-based workflows for autonomous chemical synthesis with Chemputers. Due to the abstraction of XDL this approach is safe, secure, and scalable since hallucinations will not be chemputable and the XDL can be both verified and encrypted. Unlike previous efforts, which either addressed only a limited portion of the workflow, relied on inflexible hard-coded rules, or lacked validation in physical systems, our approach provides four realistic examples of syntheses directly executed from synthetic literature. We anticipate that our workflow will significantly enhance automation in robotically driven synthetic chemistry research, streamline data extraction, improve the reproducibility, scalability, and safety of synthetic and experimental chemistry.
[AI-109] Cooperative and Asynchronous Transformer-based Mission Planning for Heterogeneous Teams of Mobile Robots
Link: https://arxiv.org/abs/2410.06372
Authors: Milad Farjadnasab, Shahin Sirouspour
Keywords: Coordinating heterogeneous teams, Coordinating heterogeneous, highly challenging, mobile robots, robots for tasks
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Abstract:Coordinating heterogeneous teams of mobile robots for tasks such as search and rescue is highly challenging. This is due to the complexities of perception, decision making and planning in such environments, with agents’ non-synchronous operation, constrained communication, and limited computational resources. This paper presents the Cooperative and Asynchronous Transformer-based Mission Planning (CATMiP) framework, which leverages multi-agent reinforcement learning (MARL) to effectively coordinate agents with heterogeneous sensing, motion, and actuation capabilities. The framework introduces a Class-based Macro-Action Decentralized Partially Observable Markov Decision Process (CMD-POMDP) model to handle asynchronous decision-making among different agent classes via macro-actions. It also extends the Multi-Agent Transformer (MAT) architecture to facilitate distributed, ad hoc communication among the agents. CATMiP easily adapts to mission complexities and communication constraints, and scales to varying environment sizes and team compositions. Simulations demonstrate its scalability and ability to achieve cooperative mission objectives with two classes of explorer and rescuer agents, even under severe communication constraints. The code is available at this https URL.
[AI-110] HumVI: A Multilingual Dataset for Detecting Violent Incidents Impacting Humanitarian Aid
Link: https://arxiv.org/abs/2410.06370
Authors: Hemank Lamba,Anton Abilov,Ke Zhang,Elizabeth M. Olson,Henry k. Dambanemuya,João c. Bárcia,David S. Batista,Christina Wille,Aoife Cahill,Joel Tetreault,Alex Jaimes
Keywords: gather aggregated insights, support decision-making, discover trends, gather aggregated, funding proposals
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Notes:
Abstract:Humanitarian organizations can enhance their effectiveness by analyzing data to discover trends, gather aggregated insights, manage their security risks, support decision-making, and inform advocacy and funding proposals. However, data about violent incidents with direct impact and relevance for humanitarian aid operations is not readily available. An automatic data collection and NLP-backed classification framework aligned with humanitarian perspectives can help bridge this gap. In this paper, we present HumVI - a dataset comprising news articles in three languages (English, French, Arabic) containing instances of different types of violent incidents categorized by the humanitarian sector they impact, e.g., aid security, education, food security, health, and protection. Reliable labels were obtained for the dataset by partnering with a data-backed humanitarian organization, Insecurity Insight. We provide multiple benchmarks for the dataset, employing various deep learning architectures and techniques, including data augmentation and mask loss, to address different task-related challenges, e.g., domain expansion. The dataset is publicly available at this https URL.
[AI-111] Physics-Informed Regularization for Domain-Agnostic Dynamical System Modeling NEURIPS2024
Link: https://arxiv.org/abs/2410.06366
Authors: Zijie Huang,Wanjia Zhao,Jingdong Gao,Ziniu Hu,Xiao Luo,Yadi Cao,Yuanzhou Chen,Yizhou Sun,Wei Wang
Keywords: Learning complex physical, Learning complex, Hamiltonian Neural Networks, complex physical dynamics, physical dynamics purely
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted to The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:Learning complex physical dynamics purely from data is challenging due to the intrinsic properties of systems to be satisfied. Incorporating physics-informed priors, such as in Hamiltonian Neural Networks (HNNs), achieves high-precision modeling for energy-conservative systems. However, real-world systems often deviate from strict energy conservation and follow different physical priors. To address this, we present a framework that achieves high-precision modeling for a wide range of dynamical systems from the numerical aspect, by enforcing Time-Reversal Symmetry (TRS) via a novel regularization term. It helps preserve energies for conservative systems while serving as a strong inductive bias for non-conservative, reversible systems. While TRS is a domain-specific physical prior, we present the first theoretical proof that TRS loss can universally improve modeling accuracy by minimizing higher-order Taylor terms in ODE integration, which is numerically beneficial to various systems regardless of their properties, even for irreversible systems. By integrating the TRS loss within neural ordinary differential equation models, the proposed model TREAT demonstrates superior performance on diverse physical systems. It achieves a significant 11.5% MSE improvement in a challenging chaotic triple-pendulum scenario, underscoring TREAT’s broad applicability and effectiveness.
[AI-112] Context-Aware Command Understanding for Tabletop Scenarios
Link: https://arxiv.org/abs/2410.06355
Authors: Paul Gajewski,Antonio Galiza Cerdeira Gonzalez,Bipin Indurkhya
Keywords: hybrid algorithm designed, tabletop scenarios, paper presents, designed to interpret, interpret natural human
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:
Abstract:This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios. By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot, identifying relevant objects and actions. The system operates in a zero-shot fashion, without reliance on predefined object models, enabling flexible and adaptive use in various environments. We assess the integration of multiple deep learning models, evaluating their suitability for deployment in real-world robotic setups. Our algorithm performs robustly across different tasks, combining language processing with visual grounding. In addition, we release a small dataset of video recordings used to evaluate the system. This dataset captures real-world interactions in which a human provides instructions in natural language to a robot, a contribution to future research on human-robot interaction. We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation, and its ability to be integrated into symbolic robotic frameworks for safe and explainable decision-making.
[AI-113] Solving Multi-Goal Robotic Tasks with Decision Transformer
Link: https://arxiv.org/abs/2410.06347
Authors: Paul Gajewski,Dominik Żurek,Marcin Pietroń,Kamil Faber
Keywords: Artificial intelligence plays, reinforcement learning, plays a crucial, crucial role, promising approaches
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:
Abstract:Artificial intelligence plays a crucial role in robotics, with reinforcement learning (RL) emerging as one of the most promising approaches for robot control. However, several key challenges hinder its broader application. First, many RL methods rely on online learning, which requires either real-world hardware or advanced simulation environments, both of which can be costly, time-consuming, and impractical. Offline reinforcement learning offers a solution, enabling models to be trained without ongoing access to physical robots or simulations. A second challenge is learning multi-goal tasks, where robots must achieve multiple objectives simultaneously. This adds complexity to the training process, as the model must generalize across different goals. At the same time, transformer architectures have gained significant popularity across various domains, including reinforcement learning. Yet, no existing methods effectively combine offline training, multi-goal learning, and transformer-based architectures. In this paper, we address these challenges by introducing a novel adaptation of the decision transformer architecture for offline multi-goal reinforcement learning in robotics. Our approach integrates goal-specific information into the decision transformer, allowing it to handle complex tasks in an offline setting. To validate our method, we developed a new offline reinforcement learning dataset using the Panda robotic platform in simulation. Our extensive experiments demonstrate that the decision transformer can outperform state-of-the-art online reinforcement learning methods.
[AI-114] Boolean Nearest Neighbor Language in the Knowledge Compilation Map
Link: https://arxiv.org/abs/2410.06332
Authors: Ondřej Čepek,Jelena Glišić
Keywords: Boolean Nearest Neighbor, Liu and Turan, Nearest Neighbor, Boolean Nearest, introduced by Hajnal
Categories: Artificial Intelligence (cs.AI)
Notes: 19 pages, 5 figures, 2 tables
Abstract:The Boolean Nearest Neighbor (BNN) representation of Boolean functions was recently introduced by Hajnal, Liu and Turan. A BNN representation of f is a pair (P,N) of sets of Boolean vectors (called positive and negative prototypes) where f(x)=1 for every positive prototype x \in P, f(x)=0 for every negative prototype x \in N, and the value f(x) for x \not\in P \cup N is determined by the type of the closest prototype. The main aim of this paper is to determine the position of the BNN language in the Knowledge Compilation Map (KCM). To this end, we derive results which compare the succinctness of the BNN language to several standard languages from KCM, and determine the complexity status of most standard queries and transformations for BNN inputs.
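The definition in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function names are hypothetical, Hamming distance is assumed as the distance between Boolean vectors, and ties between the nearest positive and nearest negative prototype are assumed not to occur (the sketch raises an error if one does).

```python
from itertools import product


def hamming(a, b):
    """Number of coordinates in which two Boolean vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def bnn_eval(x, positives, negatives):
    """Evaluate f(x) for a BNN representation (P, N).

    Returns 1 if the closest prototype is positive and 0 if it is negative.
    Assumption: ties are not allowed, so a tie raises ValueError.
    """
    d_pos = min(hamming(x, p) for p in positives)
    d_neg = min(hamming(x, n) for n in negatives)
    if d_pos == d_neg:
        raise ValueError("tie between positive and negative prototypes")
    return 1 if d_pos < d_neg else 0


# Example: 3-bit majority with one positive and one negative prototype.
P = [(1, 1, 1)]
N = [(0, 0, 0)]
table = {x: bnn_eval(x, P, N) for x in product((0, 1), repeat=3)}
```

Here two prototypes are enough to represent the 3-bit majority function, which gives a sense of why the succinctness of BNN relative to other representation languages is a natural question.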
[AI-115] Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing
Link: https://arxiv.org/abs/2410.06331
Authors: Zhuoran Zhang,Yongxiang Li,Zijian Kan,Keyuan Cheng,Lijie Hu,Di Wang
Keywords: Large Language Models, Language Models, Large Language, shown significant promise, paradigm has shown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 21 pages
Abstract:The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.
[AI-116] Auto-Evolve: Enhancing Large Language Models Performance via Self-Reasoning Framework EMNLP2024
Link: https://arxiv.org/abs/2410.06328
Authors: Krishna Aswani,Huilin Lu,Pranav Patankar,Priya Dhalwani,Iris Tan,Jayant Ganeshmohan,Simon Lacasse
Keywords: Large Language Models, Recent advancements, Large Language, demonstrated significant potential, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at EMNLP 2024
Abstract:Recent advancements in prompt engineering strategies, such as Chain-of-Thought (CoT) and Self-Discover, have demonstrated significant potential in improving the reasoning abilities of Large Language Models (LLMs). However, these state-of-the-art (SOTA) prompting strategies rely on a single or fixed set of static seed reasoning modules like "think step by step" or "break down this problem" intended to simulate the human approach to problem-solving. This constraint limits the flexibility of models in tackling diverse problems effectively. In this paper, we introduce Auto-Evolve, a novel framework that enables LLMs to self-create dynamic reasoning modules and a downstream action plan, resulting in significant improvements over current SOTA methods. We evaluate Auto-Evolve on the challenging BigBench-Hard (BBH) dataset with Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT 4, where it consistently outperforms the SOTA prompt strategies. Auto-Evolve outperforms CoT by up to 10.4% and by 7% on average across these four models. Our framework introduces two innovations: a) Auto-Evolve dynamically generates reasoning modules for each task while aligning with the human reasoning paradigm, thus eliminating the need for predefined templates. b) We introduce an iterative refinement component that incrementally refines instruction guidance for LLMs and helps boost performance by an average of 2.8% compared to doing it in a single step.
[AI-117] Learning in complex action spaces without policy gradients
Link: https://arxiv.org/abs/2410.06317
Authors: Arash Tavakoli,Sina Ghiassian,Nemanja Rakićević
Keywords: Conventional wisdom suggests, Conventional wisdom, complex action spaces, action spaces, wisdom suggests
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes:
Abstract:Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small and finite action spaces (O’Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to serve similar functionality. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces with a controllable computational cost that is comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE demonstrates strong performance on the DeepMind Control Suite, even when compared to the state-of-the-art methods such as DMPO and D4PG.
[AI-118] A Comparative Study of Hybrid Models in Health Misinformation Text Classification
Link: https://arxiv.org/abs/2410.06311
Authors: Mkululi Sikosana,Oluwaseun Ajao,Sean Maudsley-Barton
Keywords: machine learning, deep learning, online social networks, Naive Bayes, Random Forest
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 8 pages, 4 tables presented at the OASIS workshop of the ACM Hypertext and Social Media Conference 2024
Abstract:This study evaluates the effectiveness of machine learning (ML) and deep learning (DL) models in detecting COVID-19-related misinformation on online social networks (OSNs), aiming to develop more effective tools for countering the spread of health misinformation during the pandemic. The study trained and tested various ML classifiers (Naive Bayes, SVM, Random Forest, etc.), DL models (CNN, LSTM, hybrid CNN+LSTM), and pretrained language models (DistilBERT, RoBERTa) on the “COVID19-FNIR DATASET”. These models were evaluated for accuracy, F1 score, recall, precision, and ROC, and used preprocessing techniques like stemming and lemmatization. The results showed SVM performed well, achieving a 94.41% F1-score. DL models with Word2Vec embeddings exceeded 98% in all performance metrics (accuracy, F1 score, recall, precision, ROC). The CNN+LSTM hybrid models also exceeded 98% across performance metrics, outperforming pretrained models like DistilBERT and RoBERTa. Our study concludes that DL and hybrid DL models are more effective than conventional ML algorithms for detecting COVID-19 misinformation on OSNs. The findings highlight the importance of advanced neural network approaches and large-scale pretraining in misinformation detection. Future research should optimize these models for various misinformation types and adapt to changing OSNs, aiding in combating health misinformation.
[AI-119] Compositional Risk Minimization
Link: https://arxiv.org/abs/2410.06303
Authors: Divyat Mahajan,Mohammad Pezeshki,Ioannis Mitliagkas,Kartik Ahuja,Pascal Vincent
Keywords: termed compositional shift, compositional shifts, challenging and extreme, tackle compositional shifts, shifts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Preprint. Under Review
Abstract:In this work, we tackle a challenging and extreme form of subpopulation shift, which is termed compositional shift. Under compositional shifts, some combinations of attributes are totally absent from the training distribution but present in the test distribution. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
[AI-120] A Taxonomy of Collectible Card Games from a Game-Playing AI Perspective CEC
Link: https://arxiv.org/abs/2410.06299
Authors: Ronaldo e Silva Vieira,Anderson Rocha Tavares,Luiz Chaimowicz
Keywords: received increasing attention, widely played games, Collectible card games, widely played, recent years
Categories: Artificial Intelligence (cs.AI)
Notes: 16 pages, accepted at the International Conference on Entertainment Computing (ICEC) 2024
Abstract:Collectible card games are challenging, widely played games that have received increasing attention from the AI research community in recent years. Despite important breakthroughs, the field still poses many unresolved challenges. This work aims to help further research on the genre by proposing a taxonomy of collectible card games by analyzing their rules, mechanics, and game modes from the perspective of game-playing AI research. To achieve this, we studied a set of popular games and provided a thorough discussion about their characteristics.
[AI-121] Accelerated Preference Optimization for Large Language Model Alignment
Link: https://arxiv.org/abs/2410.06293
Authors: Jiafan He,Huizhuo Yuan,Quanquan Gu
Keywords: Reinforcement Learning, Human Feedback, large language models, aligning large language, Preference Optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 44 pages, 10 tables
Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically involve first estimating the reward function and then optimizing the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well-known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov’s momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
[AI-122] Non-Halting Queries: Exploiting Fixed Points in LLMs
Link: https://arxiv.org/abs/2410.06287
Authors: Ghaith Hammouri,Kemal Derya,Berk Sunar
Keywords: non-halting, vulnerability that exploits, non-halting anomaly, call non-halting queries, exploits fixed points
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:
Abstract:We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt, i.e. an LLM output that does not terminate. More precisely, for what we call non-halting queries, the LLM never samples the end-of-string token (eos). We rigorously analyze the conditions under which the non-halting anomaly presents itself. In particular, at temperature zero, we prove that if a repeating (cyclic) sequence of tokens is observed at the output beyond the context size, then the LLM does not halt. We demonstrate the non-halting anomaly in a number of experiments performed in base (unaligned) models where repeating tokens immediately lead to a non-halting cyclic behavior as predicted by the analysis. Further, we develop a simple recipe that takes the same fixed points observed in the base model and creates a prompt structure to target aligned models. We study the recipe behavior in bypassing alignment in a number of LLMs including GPT-4o, llama-3-8b-instruct, and gemma-2-9b-it where all models are forced into a non-halting state. Further, we demonstrate the recipe’s success in sending most major models released over the past year into a non-halting state with the same simple prompt even at higher temperatures. Further, we study direct inversion based techniques to craft new short prompts to induce the non-halting state. Our experiments with the gradient search based inversion technique ARCA show that non-halting is prevalent across models and may be easily induced with a few input tokens. While its impact on the reliability of hosted systems can be mitigated by configuring a hard maximum token limit in the sampler, the non-halting anomaly still manages to break alignment. This underlines the need for further studies and stronger forms of alignment against non-halting anomalies. 
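The temperature-zero argument in this abstract can be sketched with a toy decoder. This is only an illustration under stated assumptions, not the paper's method or any real LLM API: `next_token` is a hypothetical deterministic function that depends only on the last `context_size` tokens, so once a context window repeats, the same continuation must repeat forever. The `max_new_tokens` cap plays the role of the hard token limit mentioned as a mitigation.

```python
def greedy_generate(next_token, prompt, context_size, max_new_tokens, eos="<eos>"):
    """Greedy (temperature-zero) decoding with cycle detection.

    Assumption: next_token(window) is deterministic and sees only the last
    `context_size` tokens. Returns (tokens, status), where status is
    "halted" (eos produced), "cycle" (a context window repeated, so the
    output can never reach eos), or "limit" (token cap reached).
    """
    tokens = list(prompt)
    seen = set()
    for _ in range(max_new_tokens):
        window = tuple(tokens[-context_size:])
        if window in seen:
            # Same window -> same next token -> same future windows.
            return tokens, "cycle"
        seen.add(window)
        tok = next_token(window)
        if tok == eos:
            return tokens, "halted"
        tokens.append(tok)
    return tokens, "limit"


# Hypothetical toy models: one alternates "a"/"b" forever, one stops after "b".
def alternating(window):
    return "a" if window[-1] == "b" else "b"


def stops_after_b(window):
    return "<eos>" if window[-1] == "b" else "b"


looping = greedy_generate(alternating, ["a"], context_size=2, max_new_tokens=100)
halting = greedy_generate(stops_after_b, ["a"], context_size=2, max_new_tokens=100)
```

With sampling at a nonzero temperature the repeated-window argument no longer gives a guarantee, which is why the abstract treats the higher-temperature results as an empirical finding.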
[AI-123] Is Pontryagins Maximum Principle all you need? Solving optimal control problems with PMP-inspired neural networks ICLR2025
Link: https://arxiv.org/abs/2410.06277
Authors: Kawisorn Kamtue,Jose M.F. Moura,Orathai Sangpetch
Keywords: Calculus of Variations, Pontryagin Maximum Principle, functional optimization, time interval, control
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Notes: 16 pages, 5 figures, under review at ICLR 2025
Abstract:Calculus of Variations is the mathematics of functional optimization, i.e., when the solutions are functions over a time interval. This is particularly important when the time interval is unknown like in minimum-time control problems, so that forward in time solutions are not possible. Calculus of Variations offers a robust framework for learning optimal control and inference. How can this framework be leveraged to design neural networks to solve challenges in control and inference? We propose the Pontryagin’s Maximum Principle Neural Network (PMP-net) that is tailored to estimate control and inference solutions, in accordance with the necessary conditions outlined by Pontryagin’s Maximum Principle. We assess PMP-net on two classic optimal control and inference problems: optimal linear filtering and minimum-time control. Our findings indicate that PMP-net can be effectively trained in an unsupervised manner to solve these problems without the need for ground-truth data, successfully deriving the classical “Kalman filter” and “bang-bang” control solution. This establishes a new approach for addressing general, possibly yet unsolved, optimal control problems.
[AI-124] PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories
Link: https://arxiv.org/abs/2410.06273
Authors: Stephane Aroca-Ouellette,Natalie Mackraz,Barry-John Theobald,Katherine Metcalf
Keywords: Accommodating human preferences, Accommodating human, essential for creating, creating AI agents, agents that deliver
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes:
Abstract:Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but they often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences. PREDICT incorporates three key elements: (1) iterative refinement of inferred preferences, (2) decomposition of preferences into constituent components, and (3) validation of preferences across multiple trajectories. We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment (PLUME). PREDICT more accurately infers nuanced human preferences improving over existing baselines by 66.2% (gridworld environment) and 41.0% (PLUME).
[AI-125] Probing the Robustness of Theory of Mind in Large Language Models
Link: https://arxiv.org/abs/2410.06271
Authors: Christian Nickel,Laura Schrewe,Lucie Flek
Keywords: Theory of Mind, social reasoning capabilities, similarly sized SotA, claims of emergent, scientific literature
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:
Abstract:With the success of ChatGPT and other similarly sized SotA LLMs, claims of emergent human-like social reasoning capabilities, especially Theory of Mind (ToM), in these models have appeared in the scientific literature. On the one hand, those ToM capabilities have been successfully tested using tasks styled similarly to those used in psychology (Kosinski, 2023). On the other hand, follow-up studies showed that those capabilities vanished when the tasks were slightly altered (Ullman, 2023). In this work we introduce a novel dataset of 68 tasks for probing ToM in LLMs, including potentially challenging variations which are assigned to 10 complexity classes. In this way, it provides novel insights into the challenges LLMs face with those task variations. We evaluate the ToM performance of four SotA open source LLMs on our dataset and the dataset introduced by (Kosinski, 2023). The overall low goal accuracy across all evaluated models indicates only a limited degree of ToM capabilities. The LLMs’ performance on simple complexity class tasks from both datasets is similar. However, we find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment, even when those are spelled out to the model. For task complications that change the relationship between objects by replacing prepositions, we notice a performance drop in all models, with the strongest impact on the mixture-of-experts model. With our dataset of tasks grouped by complexity we offer directions for further research on how to stabilize and advance ToM capabilities in LLMs.
[AI-126] Think While You Generate: Discrete Diffusion with Planned Denoising
Link: https://arxiv.org/abs/2410.06264
Authors: Sulin Liu,Juno Nam,Andrew Campbell,Hannes Stärk,Yilun Xu,Tommi Jaakkola,Rafael Gómez-Bombarelli
Keywords: Discrete diffusion, introduce Discrete Diffusion, outperforming or approaching, approaching autoregressive models, Discrete
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes:
Abstract:Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8, OpenWebText, and token-based generation on ImageNet 256 \times 256 . Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at this https URL.
[AI-127] Unsupervised Model Diagnosis
Link: https://arxiv.org/abs/2410.06243
Authors: Yinong Oliver Wang,Eileen Li,Jinqi Luo,Zhaoning Wang,Fernando De la Torre
Keywords: Ensuring model explainability, deep vision systems, Ensuring model, essential for reliable, reliable deployment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 9 pages, 9 figures, 3 tables
Abstract:Ensuring model explainability and robustness is essential for reliable deployment of deep vision systems. Current methods for evaluating robustness rely on collecting and annotating extensive test sets. While this is common practice, the process is labor-intensive and expensive with no guarantee of sufficient coverage across attributes of interest. Recently, model diagnosis frameworks have emerged leveraging user inputs (e.g., text) to assess the vulnerability of the model. However, such dependence on human can introduce bias and limitation given the domain knowledge of particular users. This paper proposes Unsupervised Model Diagnosis (UMO), that leverages generative models to produce semantic counterfactual explanations without any user guidance. Given a differentiable computer vision model (i.e., the target model), UMO optimizes for the most counterfactual directions in a generative latent space. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources, such as dictionaries or language models. We validate the framework on multiple vision tasks (e.g., classification, segmentation, keypoint detection). Extensive experiments show that our unsupervised discovery of semantic directions can correctly highlight spurious correlations and visualize the failure mode of target models without any human intervention.
[AI-128] Using Crank-Nikolson Scheme to Solve the Korteweg-de Vries (KdV) Equation
Link: https://arxiv.org/abs/2410.06240
Authors: Qiming Wu
Keywords: Korteweg-de Vries, fundamental partial differential, partial differential equation, models wave propagation, KdV equation
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Notes:
Abstract:The Korteweg-de Vries (KdV) equation is a fundamental partial differential equation that models wave propagation in shallow water and other dispersive media. Accurately solving the KdV equation is essential for understanding wave dynamics in physics and engineering applications. This project focuses on implementing the Crank-Nicolson scheme, a finite difference method known for its stability and accuracy, to solve the KdV equation. The Crank-Nicolson scheme’s implicit nature allows for a more stable numerical solution, especially in handling the dispersive and nonlinear terms of the KdV equation. We investigate the performance of the scheme through various test cases, analyzing its convergence and error behavior. The results demonstrate that the Crank-Nicolson method provides a robust approach for solving the KdV equation, with improved accuracy over traditional explicit methods. Code is available at the end of the paper.
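As a rough illustration of the scheme named in this abstract (not taken from the paper), assume the KdV equation in the common normalized form u_t + 6 u u_x + u_{xxx} = 0. The coefficient convention, the grid notation u_j^n \approx u(x_j, t_n), and the use of central differences in space are all assumptions made for this sketch. Crank-Nicolson averages the discretized spatial operator over the old and new time levels:

```latex
\frac{u_j^{n+1} - u_j^{n}}{\Delta t}
  = -\frac{1}{2}\Big[ F_j\big(u^{n+1}\big) + F_j\big(u^{n}\big) \Big],
\qquad
F_j(u) = 6\,u_j\,\frac{u_{j+1} - u_{j-1}}{2\,\Delta x}
       + \frac{u_{j+2} - 2u_{j+1} + 2u_{j-1} - u_{j-2}}{2\,\Delta x^{3}} .
```

Because F_j is nonlinear in the unknown u^{n+1}, each time step requires solving a nonlinear system, for example by Newton or fixed-point iteration. That extra cost is what the implicit treatment pays for its improved stability over explicit schemes.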
[AI-129] EVOLvE: Evaluating and Optimizing LLMs For Exploration
Link: https://arxiv.org/abs/2410.06238
Authors: Allen Nie, Yi Su, Bo Chang, Jonathan N. Lee, Ed H. Chi, Quoc V. Le, Minmin Chen
Keywords: scenarios requiring optimal, requiring optimal decision-making, large language models, make optimal decisions, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages
Abstract:Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs’ (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs’ performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM’s exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
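The abstract refers to "optimal exploration algorithms" for bandits and to regret as the yardstick. As a minimal sketch of what such an algorithm looks like, here is UCB1 on a context-free Bernoulli bandit. UCB1 is a standard textbook choice picked for illustration; the abstract does not say which algorithms the paper uses, and the function name, arm means, and horizon below are made up.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given (hidden) arm means.

    Returns the pull count per arm and the cumulative pseudo-regret,
    i.e. the sum over rounds of (best mean - mean of the arm pulled).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best_mean = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # Empirical mean plus an exploration bonus that shrinks
            # as an arm is pulled more often.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]

    return counts, regret
```

Calling `ucb1([0.2, 0.5, 0.8], horizon=2000)` returns how often each arm was pulled and the regret accumulated along the way; comparing that regret curve against another decision-maker's choices on the same arms is the kind of measurement the abstract describes.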
[AI-130] BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation
Link: https://arxiv.org/abs/2410.06237
Authors: Rutav Shah, Albert Yu, Yifeng Zhu, Yuke Zhu, Roberto Martín-Martín
Keywords: mobile manipulation, Building-wide Mobile Manipulation, service robots, everyday objects, long-horizon mobile manipulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 Figures, 2 Tables, 11 Pages
Abstract:To operate at a building scale, service robots must perform very long-horizon mobile manipulation tasks by navigating to different rooms, accessing different floors, and interacting with a wide and unseen range of everyday objects. We refer to these tasks as Building-wide Mobile Manipulation. To tackle these inherently long-horizon tasks, we introduce BUMBLE, a unified Vision-Language Model (VLM)-based framework integrating open-world RGBD perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Our extensive evaluation (90+ hours) indicates that BUMBLE outperforms multiple baselines in long-horizon building-wide tasks that require sequencing up to 12 ground truth skills spanning 15 minutes per trial. BUMBLE achieves 47.1% success rate averaged over 70 trials in different buildings, tasks, and scene layouts from different starting rooms and floors. Our user study demonstrates 22% higher satisfaction with our method than state-of-the-art mobile manipulation methods. Finally, we demonstrate the potential of using increasingly-capable foundation models to push performance further. For more information, see this https URL
[AI-131] TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Link: https://arxiv.org/abs/2410.06234
Authors: Jeremy Andrew Irvin, Emily Ruoyu Liu, Joyce Chuyi Chen, Ines Dormoy, Jinyoung Kim, Samar Khanna, Zhuo Zheng, Stefano Ermon
Keywords: interpreting natural images, Large vision, earth observation data, vision and language, interpreting natural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Large vision and language assistants have enabled new capabilities for interpreting natural images. These approaches have recently been adapted to earth observation data, but they are only able to handle single image inputs, limiting their use for many real-world tasks. In this work, we develop a new vision and language assistant called TEOChat that can engage in conversations about temporal sequences of earth observation data. To train TEOChat, we curate an instruction-following dataset composed of many single image and temporal tasks including building change and damage assessment, semantic change detection, and temporal scene classification. We show that TEOChat can perform a wide variety of spatial and temporal reasoning tasks, substantially outperforming previous vision and language assistants, and even achieving comparable or better performance than specialist models trained to perform these specific tasks. Furthermore, TEOChat achieves impressive zero-shot performance on a change detection and change question answering dataset, outperforms GPT-4o and Gemini 1.5 Pro on multiple temporal tasks, and exhibits stronger single image capabilities than a comparable single EO image instruction-following model. We publicly release our data, models, and code at this https URL .
[AI-132] A Timeline and Analysis for Representation Plasticity in Large Language Models
Link: https://arxiv.org/abs/2410.06225
Authors: Akshat Kannan
Keywords: long term dangerous, catastrophic potential, crucial to preventing, preventing its long, long term
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:The ability to steer AI behavior is crucial to preventing its long term dangerous and catastrophic potential. Representation Engineering (RepE) has emerged as a novel, powerful method to steer internal model behaviors, such as “honesty”, at a top-down level. Understanding the steering of representations should thus be placed at the forefront of alignment initiatives. Unfortunately, current efforts to understand plasticity at this level are highly neglected. This paper aims to bridge the knowledge gap and understand how LLM representation stability, specifically for the concept of “honesty”, and model plasticity evolve by applying steering vectors extracted at different fine-tuning stages, revealing differing magnitudes of shifts in model behavior. The findings are pivotal, showing that while early steering exhibits high plasticity, later stages have a surprisingly responsive critical window. This pattern is observed across different model architectures, signaling that there is a general pattern of model plasticity that can be used for effective intervention. These insights greatly contribute to the field of AI transparency, addressing a pressing lack of efficiency limiting our ability to effectively steer model behavior.
[AI-133] DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Link: https://arxiv.org/abs/2410.06215
Authors: Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Keywords: analyze model weaknesses, manually analyze model, data generation, data, data generation agents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL
Abstract:The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Recent approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by creating autonomous data generation agents - or teachers - is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid and scalable testing for such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task, involving an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data), inside an environment that provides student feedback. The agent’s goal is to improve student performance. Students are iteratively trained and evaluated on generated data, with their feedback (in the form of errors or weak skills) being reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 3 diverse tasks (math, code, and VQA) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that environments teach different skill levels and test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.
[AI-134] LeanAgent: Lifelong Learning for Formal Theorem Proving
Link: https://arxiv.org/abs/2410.06209
Authors: Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, Anima Anandkumar
Keywords: Large Language Models, Large Language, Language Models, interactive proof assistants, integrated with interactive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Abstract:Large Language Models (LLMs) have been successful in mathematical reasoning tasks such as formal theorem proving when integrated with interactive proof assistants like Lean. Existing approaches involve training or fine-tuning an LLM on a specific dataset to perform well on particular domains, such as undergraduate-level mathematics. These methods struggle with generalizability to advanced mathematics. A fundamental limitation is that these approaches operate on static domains, failing to capture how mathematicians often work across multiple domains and projects simultaneously or cyclically. We present LeanAgent, a novel lifelong learning framework for theorem proving that continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge. LeanAgent introduces several key innovations, including a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity. LeanAgent successfully proves 162 theorems previously unproved by humans across 23 diverse Lean repositories, many from advanced mathematics. It performs up to 11× better than the static LLM baseline, proving challenging theorems in domains like abstract algebra and algebraic topology while showcasing a clear progression of learning from basic concepts to advanced topics. In addition, we analyze LeanAgent’s superior performance on key lifelong learning metrics. LeanAgent achieves exceptional scores in stability and backward transfer, where learning new tasks improves performance on previously learned tasks. This emphasizes LeanAgent’s continuous generalizability and improvement, explaining its superior theorem proving performance.
[AI-135] Integrating Planning into Single-Turn Long-Form Text Generation
Link: https://arxiv.org/abs/2410.06203
Authors: Yi Liang, You Wu, Honglei Zhuang, Li Chen, Jiaming Shen, Yiling Jia, Zhen Qin, Sumit Sanghai, Xuanhui Wang, Carl Yang, Michael Bendersky
Keywords: Large Language Models, Language Models, Large Language, in-depth textual documents, challenge for Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract:Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data such as outlines, key information and summaries from existing full articles. Our experiments demonstrate on two datasets from different domains, namely the scientific news dataset SciNews and Wikipedia datasets in KILT-Wiki and FreshWiki, that LLMs fine-tuned with the auxiliary task generate higher quality documents. We observed +2.5% improvement in ROUGE-Lsum, and a strong 3.60 overall win/loss ratio via human SxS evaluation, with clear wins in organization, relevance, and verifiability.
[AI-136] Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective
Link: https://arxiv.org/abs/2410.06195
Authors: Guiyang Hou, Wenqi Zhang, Yongliang Shen, Zeqi Tan, Sihao Shen, Weiming Lu
Keywords: Theory of Mind, socialization capabilities, mental states, mental states evolve, infer and reason
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures
Abstract:In the social world, humans possess the capability to infer and reason about others’ mental states (such as emotions, beliefs, and intentions), known as the Theory of Mind (ToM). Simultaneously, humans’ own mental states evolve in response to social situations, a capability we refer to as socialization. Together, these capabilities form the foundation of human social interaction. In the era of artificial intelligence (AI), especially with the development of large language models (LLMs), we raise an intriguing question: How do LLMs perform in terms of ToM and socialization capabilities? And more broadly, can these AI models truly enter and navigate the real social world? Existing research evaluates LLMs’ ToM and socialization capabilities by positioning LLMs as passive observers from a third-person perspective, rather than as active participants. However, compared to the third-person perspective, observing and understanding the world from an egocentric first-person perspective is a natural approach for both humans and AI agents. The ToM and socialization capabilities of LLMs from a first-person perspective, a crucial attribute for advancing embodied AI agents, remain unexplored. To answer the aforementioned questions and bridge the research gap, we introduce EgoSocialArena, a novel framework designed to evaluate and investigate the ToM and socialization capabilities of LLMs from a first-person perspective. It encompasses two evaluation environments, a static environment and an interactive environment, with seven scenarios: Daily Life, Counterfactual, New World, Blackjack, Number Guessing, and Limit Texas Hold’em, totaling 2,195 data entries. With EgoSocialArena, we have conducted a comprehensive evaluation of nine advanced LLMs and observed some key insights regarding the future development of LLMs as well as the capability levels of the most advanced LLMs currently available.
[AI-137] Benign Overfitting for Regression with Trained Two-Layer ReLU Networks
Link: https://arxiv.org/abs/2410.06191
Authors: Junhyung Park, Patrick Bloebaum, Shiva Prasad Kasiviswanathan
Keywords: least-square regression problem, two-layer fully-connected neural, ReLU activation function, study the least-square, two-layer fully-connected
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 65 pages
Abstract:We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow. Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise other than that they are bounded. We operate in the neural tangent kernel regime, and our generalization result is developed via a decomposition of the excess risk into estimation and approximation errors, viewing gradient flow as an implicit regularizer. This decomposition in the context of neural networks is a novel perspective of gradient descent, and helps us avoid uniform convergence traps. In this work, we also establish that under the same setting, the trained network overfits to the data. Together, these results establish the first result on benign overfitting for finite-width ReLU networks for arbitrary regression functions.
[AI-138] CBIDR: A novel method for information retrieval combining image and data by means of TOPSIS applied to medical diagnosis
Link: https://arxiv.org/abs/2410.06180
Authors: Humberto Giuri, Renato A. Krohling
Keywords: Content-Based Image Retrieval, Image Retrieval, shown promising results, doctor or pathologist, medical professionals
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 28 pages
Abstract:Content-Based Image Retrieval (CBIR) has shown promising results in the field of medical diagnosis, where it aims to provide support to medical professionals (doctors or pathologists). However, the ultimate decision regarding the diagnosis is made by the medical professional, drawing upon their accumulated experience. In this context, we believe that artificial intelligence can play a pivotal role in addressing the challenges in medical diagnosis not by making the final decision but by assisting in the diagnosis process with the most relevant information. CBIR methods use similarity metrics to compare feature vectors generated from images using Convolutional Neural Networks (CNNs). In addition to the information contained in medical images, clinical data about the patient is often available and is also relevant in the final decision-making process by medical professionals. In this paper, we propose a novel method named CBIDR, which leverages both medical images and patient clinical data, combining them through the ranking algorithm TOPSIS. The goal is to aid medical professionals in their final diagnosis by retrieving the images and patient clinical data from the database that are most similar to the query data. As a case study, we illustrate CBIDR for the diagnosis of oral cancer using histopathological images and patient clinical data. In experiments, the method achieved 97.44% Top-1 and 100% Top-5 accuracy, showing the effectiveness of the proposed approach.
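The abstract names TOPSIS as the ranking step that merges image similarity with clinical-data similarity. The following is a minimal sketch of the standard TOPSIS procedure, not the paper's implementation: the two criteria (an image-feature distance and a clinical-data distance, both "lower is better"), the equal weights, and the function name are assumptions chosen for illustration.

```python
import numpy as np


def topsis(matrix, weights, benefit):
    """Score alternatives with standard TOPSIS (illustrative helper).

    matrix:  rows are alternatives (e.g. database cases), columns are criteria.
    weights: relative importance of each criterion.
    benefit: True where larger is better, False where smaller is better.
    Returns closeness scores in [0, 1]; higher means closer to the ideal.
    """
    X = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    benefit = np.asarray(benefit, dtype=bool)

    # Vector-normalise each criterion column, then apply the weights.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    V = X / norms * w

    # Ideal and anti-ideal points depend on the direction of each criterion.
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti_ideal = np.where(benefit, V.min(axis=0), V.max(axis=0))

    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti_ideal, axis=1)
    denom = d_pos + d_neg
    # If all alternatives coincide, fall back to a neutral score of 0.5.
    return np.divide(d_neg, denom, out=np.full_like(d_neg, 0.5), where=denom > 0)
```

For retrieval, each database case would get one row holding its distances to the query, and the cases would be returned in order of decreasing score, for example `np.argsort(-topsis(distances, [0.5, 0.5], [False, False]))`.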
[AI-139] SC-Bench: A Large-Scale Dataset for Smart Contract Auditing
Link: https://arxiv.org/abs/2410.06176
Authors: Shihao Xia, Mengting He, Linhai Song, Yiying Zhang
Keywords: demand to ensure, ensure the compliance, safety and economic, smart contracts listed, violations
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:There is a huge demand to ensure the compliance of smart contracts listed on blockchain platforms with safety and economic standards. Today, manual efforts in the form of auditing are commonly used to achieve this goal. ML-based automated techniques have the promise to alleviate human efforts and the resulting monetary costs. However, unlike other domains where ML techniques have had huge successes, no systematic ML techniques have been proposed or applied to smart contract auditing. We present SC-Bench, the first dataset for automated smart-contract auditing research. SC-Bench consists of 5,377 real-world smart contracts running on Ethereum, a widely used blockchain platform, and 15,975 violations of standards on Ethereum called ERCs. Out of these violations, 139 are real violations programmers made. The remaining are errors we systematically injected to reflect the violations of different ERC rules. We evaluate SC-Bench using GPT-4 by prompting it with both the contracts and ERC rules. In addition, we manually identify each violated rule and the corresponding code site (i.e., oracle) and prompt GPT-4 with this information, posing a true-or-false question. Our results show that without the oracle, GPT-4 can only detect 0.9% of violations, and with the oracle, it detects 22.9% of violations. These results show the potential room for improvement in ML-based techniques for smart-contract auditing.
[AI-140] Manual Verbalizer Enrichment for Few-Shot Text Classification
Link: https://arxiv.org/abs/2410.06173
Authors: Quang Anh Nguyen, Nadi Tomeh, Mustapha Lebbah, Thierry Charnois, Hanene Azzag, Santiago Cordoba Muñoz
Keywords: natural language processing, language processing tasks, pre-trained language models, prompt-based training, pre-trained language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:With the continuous development of pre-trained language models, prompt-based training becomes a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also shows great performance compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the amount of annotated data is limited. In this framework, the role of verbalizers is essential, as an interpretation from masked word distributions into output predictions. In this work, we propose MAVE, an approach for verbalizer construction by enrichment of class labels using neighborhood relation in the embedding space of words for the text classification task. In addition, we elaborate a benchmarking procedure to evaluate typical baselines of verbalizers for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.
[AI-141] Multimodal Situational Safety
Link: https://arxiv.org/abs/2410.06172
Authors: Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, Xin Eric Wang
Keywords: Large Language Models, Multimodal Large Language, demonstrating impressive capabilities, Multimodal Situational Safety, Multimodal Large
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract:Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query-image pairs, half of which the image context is safe, and the other half is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and struggle to tackle these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines to coordinately solve safety challenges, which shows consistent improvement in safety over the original MLLM response. Code and data: this http URL.
[AI-142] Quality Diversity Imitation Learning
Link: https://arxiv.org/abs/2410.06151
Authors: Zhenglin Wan, Xingrui Yu, David Mark Bossens, Yueming Lyu, Qing Guo, Flint Xiaofeng Fan, Ivor Tsang
Keywords: shown great potential, Imitation learning, Diversity Imitation Learning, shown great, great potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, conference paper
Abstract:Imitation learning (IL) has shown great potential in various applications, such as robot control. However, traditional IL methods are usually designed to learn only one specific type of behavior since demonstrations typically correspond to a single expert. In this work, we introduce the first generic framework for Quality Diversity Imitation Learning (QD-IL), which enables the agent to learn a broad range of skills from limited demonstrations. Our framework integrates the principles of quality diversity with adversarial imitation learning (AIL) methods, and can potentially improve any inverse reinforcement learning (IRL) method. Empirically, our framework significantly improves the QD performance of GAIL and VAIL on the challenging continuous control tasks derived from Mujoco environments. Moreover, our method even achieves 2x expert performance in the most challenging Humanoid environment.
[AI-143] ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution
Link: https://arxiv.org/abs/2410.06108
Authors: Corban Rivera, Grayson Byrd, William Paul, Tyler Feldman, Meghan Booker, Emma Holmes, David Handelman, Bethany Kemp, Andrew Badger, Aurora Schmidt, Krishna Murthy Jatavallabhula, Celso M de Melo, Lalithkumar Seenivasan, Mathias Unberath, Rama Chellappa
Keywords: Large Language Models, complex problem due, vast state spaces, Language Models, Carlo Tree Search
Categories: Artificial Intelligence (cs.AI)
Abstract:Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high- or low-levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self reflection. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy level embodied tasks outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search to enable more robust automation in complex state and action spaces.
[AI-144] Towards AI-Native Software Engineering (SE 3.0): A Vision and a Challenge Roadmap
Link: https://arxiv.org/abs/2410.06107
Authors: Ahmed E. Hassan, Gustavo A. Oliva, Dayi Lin, Boyuan Chen, Zhen Ming (Jack) Jiang
Keywords: improving developer productivity, http URL, AI-assisted software engineering, software engineering, Foundation Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Abstract:The rise of AI-assisted software engineering (SE 2.0), powered by Foundation Models (FMs) and FM-powered copilots, has shown promise in improving developer productivity. However, it has also exposed inherent limitations, such as cognitive overload on developers and inefficiencies. We propose a shift towards Software Engineering 3.0 (SE 3.0), an AI-native approach characterized by intent-first, conversation-oriented development between human developers and AI teammates. SE 3.0 envisions AI systems evolving beyond task-driven copilots into intelligent collaborators, capable of deeply understanding and reasoning about software engineering principles and intents. We outline the key components of the SE 3.0 technology stack, which includes this http URL for adaptive and personalized AI partnership, this http URL for intent-first conversation-oriented development, this http URL for multi-objective code synthesis, and this http URL for SLA-aware execution with edge-computing support. Our vision addresses the inefficiencies and cognitive strain of SE 2.0 by fostering a symbiotic relationship between human developers and AI, maximizing their complementary strengths. We also present a roadmap of challenges that must be overcome to realize our vision of SE 3.0. This paper lays the foundation for future discussions on the role of AI in the next era of software engineering.
[AI-145] Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2410.06101
Authors: Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, Min Chen
Keywords: large language models, fine-tuning large language, language models, specific tasks, pivotal technique
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 28 pages, 26 images
Abstract:Reinforcement learning (RL) has emerged as a pivotal technique for fine-tuning large language models (LLMs) on specific tasks. However, prevailing RL fine-tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine-tuning of LLMs. In this paper, we propose CORY, extending the RL fine-tuning of LLMs to a sequential cooperative multi-agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi-agent systems. In CORY, the LLM to be fine-tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer’s responses. The two agents are trained together. During training, the agents exchange roles periodically, fostering cooperation and coevolution between them. Experiments evaluate CORY’s performance by fine-tuning GPT-2 and Llama-2 under subjective and objective reward functions on the IMDB Review and GSM8K datasets, respectively. Results show that CORY outperforms PPO in terms of policy optimality, resistance to distribution collapse, and training robustness, thereby underscoring its potential as a superior methodology for refining LLMs in real-world applications.
[AI-146] TOWER: Tree Organized Weighting for Evaluating Complex Instructions (EMNLP 2024)
Link: https://arxiv.org/abs/2410.06089
Authors: Noah Ziems, Zhihan Zhang, Meng Jiang
Keywords: Evaluating the ability, large language models, real-world applications, follow complex human-written, ability of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2024
Abstract:Evaluating the ability of large language models (LLMs) to follow complex human-written instructions is essential for their deployment in real-world applications. While benchmarks like Chatbot Arena use human judges to assess model performance, they are resource-intensive and time-consuming. Alternative methods using LLMs as judges, such as AlpacaEval, MT Bench, WildBench, and InFoBench offer improvements but still do not capture that certain complex instruction aspects are more important than others to follow. To address this gap, we propose a novel evaluation metric, TOWER, that incorporates human-judged importance into the assessment of complex instruction following. We show that human annotators agree with tree-based representations of these complex instructions nearly as much as they agree with other human annotators. We release tree-based annotations of the InFoBench dataset and the corresponding evaluation code to facilitate future research.
[AI-147] Posets and Bounded Probabilities for Discovering Order-inducing Features in Event Knowledge Graphs
Link: https://arxiv.org/abs/2410.06065
Authors: Christoffer Olling Back,Jakob Grue Simonsen
Keywords: Event knowledge graphs, knowledge graphs, extend the classical, capture multiple, interacting views
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Event knowledge graphs (EKG) extend the classical notion of a trace to capture multiple, interacting views of a process execution. In this paper, we tackle the open problem of automating EKG discovery from uncurated data through a principled, probabilistic framing based on the outcome space resulting from feature-derived partial orders on events. From this, we derive an EKG discovery algorithm based upon statistical inference rather than an ad-hoc or heuristic-based strategy, or relying on manual analysis from domain experts. This approach comes at the computational cost of exploring a large, non-convex hypothesis space. In particular, solving the maximum likelihood term involves counting the number of linear extensions of posets, which in general is #P-complete. Fortunately, bound estimates suffice for model comparison, and admit incorporation into a bespoke branch-and-bound algorithm. We show that the posterior probability as defined is antitonic w.r.t. search depth for branching rules that are monotonic w.r.t. model inclusion. This allows pruning of large portions of the search space, which we show experimentally leads to rapid convergence toward optimal solutions that are consistent with manually built EKGs.
MSC classes: 06-08; ACM classes: G.3; I.2.6; I.5
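The counting problem named in the abstract can be made concrete on a toy scale. The sketch below is not the authors' algorithm: it is a standard exact bitmask dynamic program for counting the linear extensions of a small poset, and the function name and the example posets are assumptions made for illustration. Its cost grows exponentially with the number of elements, which hints at why bounds are attractive for larger instances.

```python
from functools import lru_cache


def count_linear_extensions(n, relations):
    """Count total orders of elements 0..n-1 consistent with `relations`.

    `relations` is a list of (a, b) pairs meaning "a must come before b".
    Hypothetical helper for illustration; exact but exponential in n.
    """
    # preds[b] is a bitmask of the elements that must precede b.
    preds = [0] * n
    for a, b in relations:
        preds[b] |= 1 << a

    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def extend(placed):
        # `placed` is a bitmask of the elements already put into the order.
        if placed == full:
            return 1
        total = 0
        for x in range(n):
            already_placed = (placed >> x) & 1
            predecessors_done = (preds[x] & ~placed) == 0
            if not already_placed and predecessors_done:
                total += extend(placed | (1 << x))
        return total

    return extend(0)


# Diamond poset: 0 < 1, 0 < 2, 1 < 3, 2 < 3.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(count_linear_extensions(4, diamond))
```

For the diamond poset only elements 1 and 2 can be swapped, so the count is 2; an antichain of three unrelated elements gives 3! = 6.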
[AI-148] LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs
Link: https://arxiv.org/abs/2410.06062
Authors: Vincent Emonet,Jerven Bolleman,Severine Duvaud,Tarcisio Mendes de Farias,Ana Claudia Sima
Keywords: Large Language Models, leveraging Large Language, accurate federated SPARQL, Language Models, bioinformatics knowledge graphs
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:We introduce a Retrieval-Augmented Generation (RAG) system for translating user questions into accurate federated SPARQL queries over bioinformatics knowledge graphs (KGs) leveraging Large Language Models (LLMs). To enhance accuracy and reduce hallucinations in query generation, our system utilises metadata from the KGs, including query examples and schema information, and incorporates a validation step to correct generated queries. The system is available online at this http URL.
[AI-149] Extracting Finite State Machines from Transformers ICML2024
Link: https://arxiv.org/abs/2410.06045
Authors: Rik Adriaensen,Jaron Maene
Keywords: deep learning, regular languages, architecture in deep, works have investigated, investigated what formal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for Workshop on Mechanistic Interpretability ICML 2024
Abstract:Fueled by the popularity of the transformer architecture in deep learning, several works have investigated what formal languages a transformer can learn. Nonetheless, existing results remain hard to compare and a fine-grained understanding of the trainability of transformers on regular languages is still lacking. We investigate transformers trained on regular languages from a mechanistic interpretability perspective. Using an extension of the L^* algorithm, we extract Moore machines from transformers. We empirically find tighter lower bounds on the trainability of transformers, when a finite number of symbols determine the state. Additionally, our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation. However, we also identify failure cases where the determining symbols get misrecognised due to saturation of the attention mechanism.
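For readers unfamiliar with the term, a Moore machine is a finite-state machine whose output depends only on the current state. The sketch below is a generic toy illustration, not a machine extracted by the paper: the class name, the state labels, and the "even number of a's" language are all assumptions chosen for the example.

```python
class MooreMachine:
    """Minimal Moore machine: the output is a function of the state alone."""

    def __init__(self, start, transitions, outputs):
        self.start = start
        self.transitions = transitions  # (state, symbol) -> next state
        self.outputs = outputs          # state -> output label

    def run(self, word):
        """Return the output emitted at the start state and after each symbol."""
        state = self.start
        trace = [self.outputs[state]]
        for symbol in word:
            state = self.transitions[(state, symbol)]
            trace.append(self.outputs[state])
        return trace

    def accepts(self, word):
        # Treat the final output as the accept/reject verdict.
        return self.run(word)[-1] == "accept"


# Hypothetical regular language: words over {a, b} with an even number of a's.
even_a = MooreMachine(
    start="even",
    transitions={
        ("even", "a"): "odd",
        ("even", "b"): "even",
        ("odd", "a"): "even",
        ("odd", "b"): "odd",
    },
    outputs={"even": "accept", "odd": "reject"},
)

print(even_a.accepts("abba"))
```

Here a single symbol ("a") determines every state change, which is loosely the kind of setting the abstract describes as "a finite number of symbols determine the state".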
[AI-150] Block Induced Signature Generative Adversarial Network (BISGAN): Signature Spoofing Using GANs and Their Evaluation
Link: https://arxiv.org/abs/2410.06041
Authors: Haadia Amjad,Kilian Goeller,Steffen Seitz,Carsten Knoll,Naseer Bajwa,Muhammad Imran Malik,Ronald Tetzlaff
Keywords: develop efficient identification, Deep learning, learning is actively, develop efficient, efficient identification
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning is actively being used in biometrics to develop efficient identification and verification systems. Handwritten signatures are a common subset of biometric data for authentication purposes. Generative adversarial networks (GANs) learn from original and forged signatures to generate forged signatures. While most GAN techniques create a strong signature verifier, which is the discriminator, there is a need to focus more on the quality of forgeries generated by the generator model. This work focuses on creating a generator that produces forged samples that achieve a benchmark in spoofing signature verification systems. We use CycleGANs infused with Inception model-like blocks with attention heads as the generator and a variation of the SigCNN model as the base Discriminator. We train our model with a new technique that results in 80% to 100% success in signature spoofing. Additionally, we create a custom evaluation technique to act as a goodness measure of the generated forgeries. Our work advocates generator-focused GAN architectures for spoofing data quality that aid in a better understanding of biometric data generation and evaluation.
[AI-151] Data Quality Issues in Vulnerability Detection Datasets
Link: https://arxiv.org/abs/2410.06030
Authors: Yuejun Guo,Seifeddine Bettaieb
Keywords: identify potential weaknesses, cyber security, crucial yet challenging, challenging task, task to identify
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
Abstract:Vulnerability detection is a crucial yet challenging task to identify potential weaknesses in software for cyber security. Recently, deep learning (DL) has made great progress in automating the detection process. Due to the complex multi-layer structure and a large number of parameters, a DL model requires massive labeled (vulnerable or secure) source code to gain knowledge to effectively distinguish between vulnerable and secure code. In the literature, many datasets have been created to train DL models for this purpose. However, these datasets suffer from several issues that will lead to low detection accuracy of DL models. In this paper, we define three critical issues (i.e., data imbalance, low vulnerability coverage, biased vulnerability distribution) that can significantly affect the model performance and three secondary issues (i.e., errors in source code, mislabeling, noisy historical data) that also affect the performance but can be addressed through a dedicated pre-processing procedure. In addition, we conduct a study of 14 papers along with 54 datasets for vulnerability detection to confirm these defined issues. Furthermore, we discuss good practices to use existing datasets and to create new ones.
[AI-152] Jet Expansions of Residual Computation
Link: https://arxiv.org/abs/2410.06024
Authors: Yihong Chen,Xiangxiang Xu,Yao Lu,Pontus Stenetorp,Luca Franceschi
Keywords: truncated Taylor series, generalize truncated Taylor, Taylor series, truncated Taylor, graphs using jets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments:
Abstract:We introduce a framework for expanding residual computational graphs using jets, operators that generalize truncated Taylor series. Our method provides a systematic approach to disentangle contributions of different computational paths to model predictions. In contrast to existing techniques such as distillation, probing, or early decoding, our expansions rely solely on the model itself and require no data, training, or sampling from the model. We demonstrate how our framework grounds and subsumes logit lens, reveals a (super-)exponential path structure in the recursive residual depth and opens up several applications. These include sketching a transformer large language model with n-gram statistics extracted from its computations, and indexing the models’ levels of toxicity knowledge. Our approach enables data-free analysis of residual computation for model interpretability, development, and evaluation.
[AI-153] Unveiling Transformer Perception by Exploring Input Manifolds
Link: https://arxiv.org/abs/2410.06019
Authors: Alessandro Benfenati,Alfio Ferrara,Alessio Marta,Davide Riva,Elisabetta Rocchetti
Keywords: input space, paper introduces, introduces a general, equivalence classes, Transformer models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 4 figures
Abstract:This paper introduces a general method for the exploration of equivalence classes in the input space of Transformer models. The proposed approach is based on sound mathematical theory which describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using eigendecomposition of the pullback of the distance metric defined on the output space through the Jacobian of the model, we are able to reconstruct equivalence classes in the input space and navigate across them. We illustrate how this method can be used as a powerful tool for investigating how a Transformer sees the input space, facilitating local and task-agnostic explainability in Computer Vision and Natural Language Processing tasks.
[AI-154] SplaTraj: Camera Trajectory Generation with Semantic Gaussian Splatting
Link: https://arxiv.org/abs/2410.06014
Authors: Xinyi Liu,Tianyi Zhang,Matthew Johnson-Roberson,Weiming Zhi
Keywords: photorealistic Gaussian Splatting, Gaussian Splatting models, recent developments, developments for robots, robots to represent
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Many recent developments for robots to represent environments have focused on photorealistic reconstructions. This paper particularly focuses on generating sequences of images from the photorealistic Gaussian Splatting models, that match instructions that are given by user-inputted language. We contribute a novel framework, SplaTraj, which formulates the generation of images within photorealistic environment representations as a continuous-time trajectory optimization problem. Costs are designed so that a camera following the trajectory poses will smoothly traverse through the environment and render the specified spatial information in a photogenic manner. This is achieved by querying a photorealistic representation with language embedding to isolate regions that correspond to the user-specified inputs. These regions are then projected to the camera’s view as it moves over time and a cost is constructed. We can then apply gradient-based optimization and differentiate through the rendering to optimize the trajectory for the defined cost. The resulting trajectory moves to photogenically view each of the specified objects. We empirically evaluate our approach on a suite of environments and instructions, and demonstrate the quality of generated image sequences.
[AI-155] A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications
Link: https://arxiv.org/abs/2410.06010
Authors: Jerven Bolleman,Vincent Emonet,Adrian Altenhoff,Amos Bairoch,Marie-Claude Blatter,Alan Bridge,Severine Duvaud,Elisabeth Gasteiger,Dmitry Kuznetsov,Sebastien Moretti,Pierre-Andre Michel,Anne Morgat,Marco Pagni,Nicole Redaschi,Monique Zahn-Zabal,Tarcisio Mendes de Farias,Ana Claudia Sima
Keywords: Knowledge graphs, Background, Knowledge, SPARQL, graphs
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Background. In the last decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, this http URL catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning, if a sufficiently large number of examples are provided and published in a common, machine-readable and standardized format across different resources. Findings. We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1000 example questions and queries, including 65 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology. Conclusions. We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. 
[AI-156] Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision
Link: https://arxiv.org/abs/2410.05991
Authors: Moritz Feuerpfeil,Marco Cipriano,Gerard de Melo
Keywords: Scalable Vector Graphics, Scalable Vector, design industry, popular format, Vector Graphics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:
Abstract:Scalable Vector Graphics (SVG) is a popular format on the web and in the design industry. However, despite the great strides made in generative modeling, SVG has remained underexplored due to the discrete and complex nature of such data. We introduce GRIMOIRE, a text-guided SVG generative model that comprises two modules: a Visual Shape Quantizer (VSQ), which learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an Auto-Regressive Transformer (ART), which models the joint probability distribution over shape tokens, positions and textual descriptions, allowing us to generate vector graphics from natural language. Unlike existing models that require direct supervision from SVG data, GRIMOIRE learns shape image patches using only raster image supervision, which opens up vector generative modeling to significantly more data. We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on MNIST and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and vector-supervised approaches in flexibility.
[AI-157] Utilizing Lyapunov Exponents in designing deep neural networks
Link: https://arxiv.org/abs/2410.05988
Authors: Tirthankar Mittra
Keywords: Training large deep, Training large, resource intensive, Lyapunov exponents, large deep neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Training large deep neural networks is resource intensive. This study investigates whether Lyapunov exponents can accelerate this process by aiding in the selection of hyperparameters. To study this I formulate an optimization problem using neural networks with different activation functions in the hidden layers. By initializing model weights with different random seeds, I calculate the Lyapunov exponent while performing traditional gradient descent on these model weights. The findings demonstrate that variations in the learning rate can induce chaotic changes in model weights. I also show that activation functions with more negative Lyapunov exponents exhibit better convergence properties. Additionally, the study also demonstrates that Lyapunov exponents can be utilized to select effective initial model weights for deep neural networks, potentially enhancing the optimization process.
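The abstract does not say how the exponent is computed, so the following is only a textbook-style sketch under stated assumptions: gradient descent on a one-dimensional loss is treated as the iterated map w -> w - lr * f'(w), and the Lyapunov exponent is estimated as the average of log|1 - lr * f''(w)| along the trajectory. The function name and the quadratic toy loss are hypothetical.

```python
import math


def lyapunov_exponent_gd(grad, grad2, w0, lr, steps=100):
    """Estimate the Lyapunov exponent of 1-D gradient descent.

    grad(w) is f'(w) and grad2(w) is f''(w). The derivative of the update
    map g(w) = w - lr * f'(w) is g'(w) = 1 - lr * f''(w).
    Note: math.log raises ValueError if g'(w) is exactly zero.
    """
    w = w0
    total = 0.0
    for _ in range(steps):
        total += math.log(abs(1.0 - lr * grad2(w)))
        w = w - lr * grad(w)
    return total / steps


# Toy loss f(w) = 0.5 * w**2, so f'(w) = w and f''(w) = 1.
small_lr = lyapunov_exponent_gd(lambda w: w, lambda w: 1.0, w0=1.0, lr=0.5)
large_lr = lyapunov_exponent_gd(lambda w: w, lambda w: 1.0, w0=1.0, lr=2.5)
print(small_lr, large_lr)
```

On this toy loss a learning rate of 0.5 gives a negative exponent (nearby trajectories contract and the iterate converges), while 2.5 gives a positive one (trajectories separate and the iterate diverges), which matches the abstract's qualitative point that the learning rate can change the dynamics of the weights.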
[AI-158] Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates
Link: https://arxiv.org/abs/2410.05985
Authors: Cabrel Teguemne Fokam,Khaleelulla Khan Nazeer,Lukas König,David Kappel,Anand Subramoney
Keywords: standard error backpropagation, deep learning models, error backpropagation algorithm, increasing size, size of deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 16 pages, 4 figures
Abstract:The increasing size of deep learning models has created the need for more efficient alternatives to the standard error backpropagation algorithm, that make better use of asynchronous, parallel and distributed computing. One major shortcoming of backpropagation is the interlocking between the forward phase of the algorithm, which computes a global loss, and the backward phase where the loss is backpropagated through all layers to compute the gradients, which are used to update the network parameters. To address this problem, we propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads. Furthermore, since we observe that the forward pass is often much faster than the backward pass, we use separate threads for the forward and backward pass calculations, which allows us to use a higher ratio of forward to backward threads than the usual 1:1 ratio, reducing the overall staleness of the parameters. Thus, our approach performs asynchronous stochastic gradient descent using separate threads for the loss (forward) and gradient (backward) computations and performs layer-wise partial updates to parameters in a distributed way. We show that this approach yields close to state-of-the-art results while running up to 2.97x faster than Hogwild! scaled on multiple devices (Locally-Partitioned-Asynchronous-Parallel SGD). We theoretically prove the convergence of the algorithm using a novel theoretical framework based on stochastic differential equations and the drift diffusion process, by modeling the asynchronous parameter updates as a stochastic process.
[AI-159] Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
Link: https://arxiv.org/abs/2410.05983
Authors: Bowen Jin,Jinsung Yoon,Jiawei Han,Sercan O. Arik
Keywords: large language models, external knowledge sources, empowers large language, utilize external knowledge, Retrieval-augmented generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages
Abstract:Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources. The increasing capacity of LLMs to process longer input sequences opens up avenues for providing more retrieved information, to potentially enhance the quality of generated outputs. It is plausible to assume that a larger retrieval set would contain more relevant information (higher recall), which might result in improved performance. However, our empirical findings demonstrate that for many long-context LLMs, the quality of generated output initially improves but then declines as the number of retrieved passages increases. This paper investigates this phenomenon, identifying the detrimental impact of retrieved “hard negatives” as a key contributor. To mitigate this and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-based approaches. We first showcase the effectiveness of retrieval reordering as a simple yet powerful training-free optimization. Furthermore, we explore training-based methods, specifically RAG-specific implicit LLM fine-tuning and RAG-oriented fine-tuning with intermediate reasoning, demonstrating their capacity for substantial performance gains. Finally, we conduct a systematic analysis of design choices for these training-based methods, including data distribution, retriever selection, and training context length.
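The abstract names retrieval reordering but does not spell out the ordering rule. One plausible reading, assumed here and not confirmed by the text above, is to place the highest-ranked passages at the two ends of the prompt and the lowest-ranked ones in the middle. The function name and the passage labels are hypothetical.

```python
def reorder_passages(passages_best_first):
    """Interleave a best-first ranking so the top passages sit at both ends.

    Assumption for illustration: even-ranked items fill the front in order,
    odd-ranked items fill the back in reverse, pushing weak items to the middle.
    """
    front, back = [], []
    for rank, passage in enumerate(passages_best_first):
        if rank % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


ranked = ["p1", "p2", "p3", "p4", "p5"]  # p1 is the most relevant
print(reorder_passages(ranked))
```

With five passages the two most relevant ones (p1 and p2) end up first and last, and the least relevant one (p5) lands in the middle. No retraining is involved, which is consistent with the "training-free" label in the abstract.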
[AI-160] PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Link: https://arxiv.org/abs/2410.05970
Authors: Xudong Xie,Liang Yin,Hao Yan,Yang Liu,Jing Ding,Minghui Liao,Yuliang Liu,Wei Chen,Xiang Bai
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[AI-161] FLOPS: Forward Learning with OPtimal Sampling
Link: https://arxiv.org/abs/2410.05966
Authors: Tao Ren,Zishi Zhang,Jinyang Jiang,Guanghao Li,Zeliang Zhang,Mingqian Feng,Yijie Peng
Keywords: recently gained focus, perturbation-based gradient computation, gradient computation methods, Monte Carlo sampling, limitations of backpropagation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained focus for learning with only forward passes, also referred to as queries. Conventional forward learning consumes an enormous number of queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance at minimum cost. For this, we propose to allocate the optimal number of queries to each data point in a batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. Theoretical results verify its optimality. We conduct extensive experiments for fine-tuning Vision Transformers on various datasets and further deploy the allocator to two black-box applications: prompt tuning and multimodal alignment for foundation models. All findings demonstrate that our proposed allocator significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.
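To make "queries" concrete, the sketch below shows a generic perturbation-based gradient estimator that uses only forward evaluations. It is not the paper's allocator: the function name, the Gaussian perturbation, the step size `eps`, and the toy objective are assumptions for illustration. It only demonstrates the baseline the abstract starts from, where each extra query is one more forward pass and more queries give a lower-variance estimate.

```python
import random


def forward_gradient(f, theta, num_queries, eps=1e-3, seed=0):
    """Monte Carlo gradient estimate from forward passes only.

    Each query perturbs theta along a random Gaussian direction u and uses
    the finite-difference slope (f(theta + eps*u) - f(theta)) / eps as the
    weight of u. The estimates are averaged over num_queries directions.
    """
    rng = random.Random(seed)
    base = f(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_queries):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        perturbed = [t + eps * ui for t, ui in zip(theta, u)]
        slope = (f(perturbed) - base) / eps
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_queries
    return grad


def sum_of_squares(x):
    # Toy objective whose true gradient is 2 * x.
    return sum(v * v for v in x)


estimate = forward_gradient(sum_of_squares, [1.0, -2.0], num_queries=20000)
print(estimate)
```

The true gradient at [1.0, -2.0] is [2.0, -4.0]; with many queries the estimate should land close to it, while a handful of queries gives a much noisier result. That cost/accuracy trade-off per data point is what a query allocator would tune.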
[AI-162] STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking
Link: https://arxiv.org/abs/2410.05964
Authors: Yidi Li,Hong Liu,Bing Yang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[AI-163] EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Link: https://arxiv.org/abs/2410.05938
Authors: Yifei Xing,Xiangyuan Lan,Ruiping Wang,Dongmei Jiang,Wenjun Huang,Qingfang Zheng,Yaowei Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[AI-164] Athanor: Local Search over Abstract Constraint Specifications
Link: https://arxiv.org/abs/2410.05937
Authors: Saad Attieh,Nguyen Dang,Christopher Jefferson,Ian Miguel,Peter Nightingale
Keywords:
Categories: Artificial Intelligence (cs.AI)
Comments: 48 pages
[AI-165] Fortify Your Foundations: Practical Privacy and Security for Foundation Model Deployments In The Cloud
Link: https://arxiv.org/abs/2410.05930
Authors: Marcin Chrapek,Anjo Vahldiek-Oberwagner,Marcin Spoczynski,Scott Constable,Mona Vij,Torsten Hoefler
Keywords: natural language processing, display exceptional performance, Foundation Models, display exceptional, range of disciplines
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Foundation Models (FMs) display exceptional performance in tasks such as natural language processing and are being applied across a growing range of disciplines. Although typically trained on large public datasets, FMs are often fine-tuned or integrated into Retrieval-Augmented Generation (RAG) systems, which rely on private data. This access, along with their size and costly training, heightens the risk of intellectual property theft. Moreover, multimodal FMs may expose sensitive information. In this work, we examine the FM threat model and discuss the practicality and comprehensiveness of various approaches for securing against them, such as ML-based methods and trusted execution environments (TEEs). We demonstrate that TEEs offer an effective balance between strong security properties, usability, and performance. Specifically, we present a solution achieving less than 10% overhead versus bare metal for the full Llama2 7B and 13B inference pipelines running inside Intel SGX and Intel TDX. We also share our configuration files and insights from our implementation. To our knowledge, our work is the first to show the practicality of TEEs for securing FMs.
[AI-166] Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning
Link: https://arxiv.org/abs/2410.05928
Authors: Ayush Singh,Mansi Gupta,Shivank Garg,Abhinav Kumar,Vansh Agrawal
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[AI-167] FINALLY: fast and universal speech enhancement with studio-like quality NEURIPS2024
Link: https://arxiv.org/abs/2410.05920
Authors: Nicholas Babaev,Kirill Tamogashev,Azat Saginbaev,Ivan Shchekotov,Hanbin Bae,Hosang Sung,WonJun Lee,Hoon-Young Cho,Pavel Andreev
Keywords:
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted to NeurIPS 2024
[AI-168] Give me a hint: Can LLMs take a hint to solve math problems?
Link: https://arxiv.org/abs/2410.05915
Authors: Vansh Agrawal,Pratham Singla,Amitoj Singh Miglani,Shivank Garg,Ayush Mangal
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[AI-169] Accelerating Error Correction Code Transformers
Link: https://arxiv.org/abs/2410.05911
Authors: Matan Levy,Yoni Choukroun,Lior Wolf
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:
[AI-170] Automatic Summarization of Long Documents ACL2023
Link: https://arxiv.org/abs/2410.05903
Authors: Naman Chhibbar,Jugal Kalita
Keywords: internet daily, making utilization, difficult and cumbersome, vast amount, amount of textual
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages (including bibliography) with 6 figures. ACL 2023 proceedings format
Abstract:A vast amount of textual data is added to the internet daily, making utilization and interpretation of such data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious reading time. Although many transformer-based models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.
[AI-171] Mini-Batch Kernel k-means
Link: https://arxiv.org/abs/2410.05902
Authors: Ben Jourdan,Gregory Schwartzman
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: arXiv admin note: text overlap with arXiv:2304.00419
[AI-172] Towards an Autonomous Surface Vehicle Prototype for Artificial Intelligence Applications of Water Quality Monitoring
Link: https://arxiv.org/abs/2410.05892
Authors: Luis Miguel Díaz,Samuel Yanes Luis,Alejandro Mendoza Barrionuevo,Dame Seck Diop,Manuel Perales,Alejandro Casado,Sergio Toral,Daniel Gutiérrez
Keywords:
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[AI-173] Deep learning-based fault identification in condition monitoring
Link: https://arxiv.org/abs/2410.05889
Authors: Hariom Dhungana,Suresh Kumar Mukhiya,Pragya Dhungana,Benjamin Karic
Keywords: Vibration-based condition monitoring, Vibration-based condition, condition monitoring techniques, condition monitoring, techniques are commonly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vibration-based condition monitoring techniques are commonly used to identify faults in rolling element bearings. Accuracy and speed of fault detection procedures are critical performance measures in condition monitoring. Delay is especially important in remote condition monitoring and time-sensitive industrial applications. While most existing methods focus on accuracy, little attention has been given to the inference time in the fault identification process. In this paper, we address this gap by presenting a Convolutional Neural Network (CNN) based approach for real-time fault identification in rolling element bearings. We encode raw vibration signals into two-dimensional images using various encoding methods and use these with a CNN to classify several categories of bearing fault types and sizes. We analyse the interplay between fault identification accuracy and processing time. For training and evaluation we use the CWRU bearing failure dataset.
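The abstract does not name the encoding methods it compares. As a minimal sketch of the general idea only, the code below uses the simplest encoding one could assume: min-max scale a signal segment to 0..255 and reshape it row by row into a square grayscale grid. The function name and the tiny example signal are hypothetical.

```python
def signal_to_image(signal, side):
    """Turn the first side*side samples of a 1-D signal into a side x side grid.

    Values are min-max scaled to integer gray levels in 0..255. This is an
    assumed, deliberately simple encoding, not one taken from the paper.
    """
    needed = side * side
    if len(signal) < needed:
        raise ValueError("signal is shorter than side * side samples")
    segment = signal[:needed]
    lo, hi = min(segment), max(segment)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant segment
    pixels = [round(255 * (x - lo) / span) for x in segment]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


image = signal_to_image([0, 1, 2, 3], side=2)
print(image)
```

A real pipeline would slide a window over the raw vibration recording and feed each resulting grid to the CNN; richer encodings change only the body of this function, which is where the accuracy versus processing-time trade-off discussed in the abstract would show up.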
[AI-174] MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Link: https://arxiv.org/abs/2410.05873
Authors: Amir Hossein Kargaran,Ali Modarressi,Nafiseh Nikeghbal,Jana Diesner,François Yvon,Hinrich Schütze
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[AI-175] A second-order-like optimizer with adaptive gradient scaling for deep learning
Link: https://arxiv.org/abs/2410.05871
Authors: Jérôme Bolte(TSE-R),Ryan Boustany(TSE-R),Edouard Pauwels(TSE-R, IRIT-ADRIA),Andrei Purica
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:
[AI-176] Heuristics for Partially Observable Stochastic Contingent Planning
Link: https://arxiv.org/abs/2410.05870
Authors: Guy Shani
Keywords:
Categories: Artificial Intelligence (cs.AI)
Comments:
[AI-177] Unobserved Object Detection using Generative Models
Link: https://arxiv.org/abs/2410.05869
Authors: Subhransu S. Bhattacharjee,Dylan Campbell,Rahul Shome
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 16 pages; 41 figures
[AI-178] From Tokens to Words: on the inner lexicon of LLMs
Link: https://arxiv.org/abs/2410.05864
Authors: Guy Kaplan,Matanel Oren,Yuval Reif,Roy Schwartz
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[AI-179] MelissaDL x Breed: Towards Data-Efficient On-line Supervised Training of Multi-parametric Surrogates with Active Learning
链接: https://arxiv.org/abs/2410.05860
作者: Sofya Dymchenko(DATAMOVE),Abhishek Purandare(DATAMOVE),Bruno Raffin(DATAMOVE)
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-180] Communicating with Speakers and Listeners of Different Pragmatic Levels EMNLP2024
链接: https://arxiv.org/abs/2410.05851
作者: Kata Naszadi,Frans A. Oliehoek,Christof Monz
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 main
[AI-181] Bottom-up Anytime Discovery of Generalised Multimodal Graph Patterns for Knowledge Graphs
链接: https://arxiv.org/abs/2410.05839
作者: Xander Wilcke,Rick Mourits,Auke Rijpma,Richard Zijdeman
关键词-EN:
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:
[AI-182] Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit
链接: https://arxiv.org/abs/2410.05838
作者: Oleg Filatov,Jan Ebert,Jiangtao Wang,Stefan Kesselheim
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-183] Towards an Operational Responsible AI Framework for Learning Analytics in Higher Education
链接: https://arxiv.org/abs/2410.05827
作者: Alba Morales Tirado,Paul Mulholland,Miriam Fernandez
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 16 pages, 1 figure, submitted to LAK 25
[AI-184] A Parameter Update Balancing Algorithm for Multi-task Ranking Models in Recommendation Systems ICDM’24
链接: https://arxiv.org/abs/2410.05806
作者: Jun Yuan,Guohao Cai,Zhenhua Dong
关键词-EN: Multi-task ranking, real-world recommendation systems, essential for modern, Multi-task, Multi-task ranking models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ICDM’24
点击查看摘要
Abstract:Multi-task ranking models have become essential for modern real-world recommendation systems. While most recommendation research focuses on designing sophisticated models for specific scenarios, achieving performance improvement for multi-task ranking models across various scenarios still remains a significant challenge. Training all tasks naively can result in inconsistent learning, highlighting the need for the development of multi-task optimization (MTO) methods to tackle this challenge. Conventional methods assume that the optimal joint gradient on shared parameters leads to optimal parameter updates. However, the actual update on model parameters may deviate significantly from gradients when using momentum-based optimizers such as Adam, and we design and execute statistical experiments to support the observation. In this paper, we propose a novel Parameter Update Balancing algorithm for multi-task optimization, denoted as PUB. In contrast to traditional MTO methods, which are based on gradient-level task fusion or loss-level task fusion, PUB is the first work to optimize multiple tasks through parameter update balancing. Comprehensive experiments on benchmark multi-task ranking datasets demonstrate that PUB consistently improves several multi-task backbones and achieves state-of-the-art performance. Additionally, experiments on benchmark computer vision datasets show the great potential of PUB in various multi-task learning scenarios. Furthermore, we deployed our method for an industrial evaluation on the real-world commercial platform, HUAWEI AppGallery, where PUB significantly enhances the online multi-task ranking model, efficiently managing the primary traffic of a crucial channel.
[AI-185] PostCast: Generalizable Postprocessing for Precipitation Nowcasting via Unsupervised Blurriness Modeling
链接: https://arxiv.org/abs/2410.05805
作者: Junchao Gong,Siwei Tu,Weidong Yang,Ben Fei,Kun Chen,Wenlong Zhang,Xiaokang Yang,Wanli Ouyang,Lei Bai
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-186] Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation EMNLP2024
链接: https://arxiv.org/abs/2410.05801
作者: Bolei He,Nuo Chen,Xinran He,Lingyong Yan,Zhenkai Wei,Jinchang Luo,Zhen-Hua Ling
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024 Findings. 9 pages, 4 figures, 7 tables
[AI-187] Core Tokensets for Data-efficient Sequential Training of Transformers
链接: https://arxiv.org/abs/2410.05800
作者: Subarnaduti Paul,Manuel Brack,Patrick Schramowski,Kristian Kersting,Martin Mundt
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-188] FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance SIGGRAPH
链接: https://arxiv.org/abs/2410.05791
作者: Ruocheng Wang,Pei Xu,Haochen Shi,Elizabeth Schumann,C. Karen Liu
关键词-EN:
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: SIGGRAPH Asia 2024. Project page: this https URL
[AI-189] LightRAG: Simple and Fast Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2410.05779
作者: Zirui Guo,Lianghao Xia,Yanhua Yu,Tu Ao,Chao Huang
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
[AI-190] Grounding is All You Need? Dual Temporal Grounding for Video Dialog
链接: https://arxiv.org/abs/2410.05767
作者: You Qin,Wei Ji,Xinze Lan,Hao Fei,Xun Yang,Dan Guo,Roger Zimmermann,Lizi Liao
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:
[AI-191] Training-free Diffusion Model Alignment with Sampling Demons
链接: https://arxiv.org/abs/2410.05760
作者: Po-Hung Yeh,Kuang-Huei Lee,Jun-Cheng Chen
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 36 pages
[AI-192] Learning the Generalizable Manipulation Skills on Soft-body Tasks via Guided Self-attention Behavior Cloning Policy
链接: https://arxiv.org/abs/2410.05756
作者: Xuetao Li,Fang Gao,Jun Yu,Shaodong Li,Feng Shuang
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
[AI-193] Polynomial Time Cryptanalytic Extraction of Deep Neural Networks in the Hard-Label Setting
链接: https://arxiv.org/abs/2410.05750
作者: Nicholas Carlini,Jorge Chávez-Saab,Anna Hambitzer,Francisco Rodríguez-Henríquez,Adi Shamir
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
[AI-194] Learning to Race in Extreme Turning Scene with Active Exploration and Gaussian Process Regression-based MPC
链接: https://arxiv.org/abs/2410.05740
作者: Guoqiang Wu,Cheng Hu,Wangjia Weng,Zhouheng Li,Yonghao Fu,Lei Xie,Hongye Su
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:
[AI-195] Array2BR: An End-to-End Noise-immune Binaural Audio Synthesis from Microphone-array Signals
链接: https://arxiv.org/abs/2410.05739
作者: Cheng Chi,Xiaoyu Li,Andong Li,Yuxuan Ke,Xiaodong Li,Chengshi Zheng
关键词-EN:
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
[AI-196] Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
链接: https://arxiv.org/abs/2410.05729
作者: Xueyang Kang,Zhaoliang Luan,Kourosh Khoshelham,Bing Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 18 main body pages, and 9 pages for supplementary part
[AI-197] Reducing fuzzy relation equations via concept lattices
链接: https://arxiv.org/abs/2410.05728
作者: David Lobo,Víctor López-Marchante,Jesús Medina
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-198] Less is more: Embracing sparsity and interpolation with Esiformer for time series forecasting
链接: https://arxiv.org/abs/2410.05726
作者: Yangyang Guo,Yanjun Zhao,Sizhe Dang,Tian Zhou,Liang Sun,Yi Qian
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-199] KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server EMNLP2024
链接: https://arxiv.org/abs/2410.05725
作者: Wenhao Wang,Xiaoyu Liang,Rui Ye,Jingyi Chai,Siheng Chen,Yanfeng Wang
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Main
[AI-200] Mero Nagarikta: Advanced Nepali Citizenship Data Extractor with Deep Learning-Powered Text Detection and OCR
链接: https://arxiv.org/abs/2410.05721
作者: Sisir Dhakal,Sujan Sigdel,Sandesh Prasad Paudel,Sharad Kumar Ranabhat,Nabin Lamichhane
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 8 figures
[AI-201] Enhancing Temporal Modeling of Video LLMs via Time Gating EMNLP2024
链接: https://arxiv.org/abs/2410.05714
作者: Zi-Yuan Hu,Yiwu Zhong,Shijia Huang,Michael R. Lyu,Liwei Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: EMNLP 2024 Findings (Short)
[AI-202] PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM
链接: https://arxiv.org/abs/2410.05710
作者: Stefan Stefanache,Lluís Pastor Pérez,Julen Costa Watanabe,Ernesto Sanchez Tejedor,Thomas Hofmann,Enis Simsar
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 35 pages (17 main paper, 18 appendix), 22 figures
[AI-203] A Two-Step Approach for Data-Efficient French Pronunciation Learning EMNLP2024
链接: https://arxiv.org/abs/2410.05698
作者: Hoyeon Lee,Hyeeun Jang,Jong-Hwan Kim,Jae-Min Kim
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at EMNLP 2024 Main
[AI-204] Copiloting Diagnosis of Autism in Real Clinical Scenarios via LLMs
链接: https://arxiv.org/abs/2410.05684
作者: Yi Jiang,Qingyang Shen,Shuzhong Lai,Shunyu Qi,Qian Zheng,Lin Yao,Yueming Wang,Gang Pan
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-205] T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
链接: https://arxiv.org/abs/2410.05677
作者: Jiachen Li,Qian Long,Jian Zheng,Xiaofeng Gao,Robinson Piramuthu,Wenhu Chen,William Yang Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL
[AI-206] ACPBench: Reasoning about Action, Change, and Planning
链接: https://arxiv.org/abs/2410.05669
作者: Harsha Kokel,Michael Katz,Kavitha Srinivas,Shirin Sohrabi
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-207] Diversity and Inclusion Index with Networks and Similarity: Analysis and its Application
链接: https://arxiv.org/abs/2410.05668
作者: Keita Kinjo
关键词-EN: attracted considerable attention, recent years, range of fields, encompassing both social, biological disciplines
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注: 20 pages
点击查看摘要
Abstract:In recent years, the concepts of "diversity" and "inclusion" have attracted considerable attention across a range of fields, encompassing both social and biological disciplines. To fully understand these concepts, it is critical to not only examine the number of categories but also the similarities and relationships among them. In this study, I introduce a novel index for diversity and inclusion that considers similarities and network connections. I analyzed the properties of these indices and investigated their mathematical relationships using established measures of diversity and networks. Moreover, I developed a methodology for estimating similarities based on the utility of diversity. I also created a method for visualizing proportions, similarities, and network connections. Finally, I evaluated the correlation with external metrics using real-world data, confirming that both the proposed indices and our index can be effectively utilized. This study contributes to a more nuanced understanding of diversity and inclusion analysis.
[AI-208] Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models
链接: https://arxiv.org/abs/2410.05661
作者: Siqi Wang,Zhengyu Chen,Bei Li,Keqing He,Min Zhang,Jingang Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-209] On the Modeling Capabilities of Large Language Models for Sequential Decision Making
链接: https://arxiv.org/abs/2410.05656
作者: Martin Klissarov,Devon Hjelm,Alexander Toshev,Bogdan Mazoure
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-210] ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
链接: https://arxiv.org/abs/2410.05651
作者: Serin Yang,Taesung Kwon,Jong Chul Ye
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL
[AI-211] Score-Based Variational Inference for Inverse Problems
链接: https://arxiv.org/abs/2410.05646
作者: Zhipeng Xue,Penghao Cai,Xiaojun Yuan,Xiqi Gao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 10 pages, 7 figures, conference
[AI-212] Federated Neural Nonparametric Point Processes
链接: https://arxiv.org/abs/2410.05637
作者: Hui Chen,Hengyu Liu,Yaqiong Li,Xuhui Fan,Zhilin Zhao,Feng Zhou,Christopher John Quinn,Longbing Cao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
[AI-213] Vector-ICL: In-context Learning with Continuous Vector Representations
链接: https://arxiv.org/abs/2410.05629
作者: Yufan Zhuang,Chandan Singh,Liyuan Liu,Jingbo Shang,Jianfeng Gao
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:
[AI-214] Versatile Motion Language Models for Multi-Turn Interactive Agents
链接: https://arxiv.org/abs/2410.05628
作者: Jeongeun Park,Sungjoon Choi,Sangdoo Yun
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-215] CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning ECCV2024
链接: https://arxiv.org/abs/2410.05627
作者: Junghun Oh,Sungyong Baik,Kyoung Mu Lee
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV2024
[AI-216] Understanding Gradient Boosting Classifier: Training, Prediction, and the Role of $\gamma_j$
链接: https://arxiv.org/abs/2410.05623
作者: Hung-Hsuan Chen
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-217] Chain-of-Thoughts for Molecular Understanding
链接: https://arxiv.org/abs/2410.05610
作者: Yunhui Jang,Jaehyung Kim,Sungsoo Ahn
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-218] Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
链接: https://arxiv.org/abs/2410.05603
作者: Zheyang Xiong,Ziyang Cai,John Cooper,Albert Ge,Vasilis Papageorgiou,Zack Sifakis,Angeliki Giannou,Ziqian Lin,Liu Yang,Saurabh Agarwal,Grigorios G Chrysos,Samet Oymak,Kangwook Lee,Dimitris Papailiopoulos
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-219] Training Stiff Neural Ordinary Differential Equations with Implicit Single-Step Methods
链接: https://arxiv.org/abs/2410.05592
作者: Colby Fronk,Linda Petzold
关键词-EN:
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:
[AI-220] TeaserGen: Generating Teasers for Long Documentaries
链接: https://arxiv.org/abs/2410.05586
作者: Weihan Xu,Paul Pu Liang,Haven Kim,Julian McAuley,Taylor Berg-Kirkpatrick,Hao-Wen Dong
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-221] Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
链接: https://arxiv.org/abs/2410.05584
作者: Xueru Wen,Jie Lou,Yaojie Lu,Hongyu Lin,Xing Yu,Xinyu Lu,Ben He,Xianpei Han,Debing Zhang,Le Sun
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:
[AI-222] NegMerge: Consensual Weight Negation for Strong Machine Unlearning
链接: https://arxiv.org/abs/2410.05583
作者: Hyoseo Kim,Dongyoon Han,Junsuk Choe
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-223] Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve? EMNLP2024
链接: https://arxiv.org/abs/2410.05581
作者: Fırat Öncel,Matthias Bethge,Beyza Ermis,Mirco Ravanelli,Cem Subakan,Çağatay Yıldız
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024 Main Conference
[AI-224] Swift Sampler: Efficient Learning of Sampler by 10 Parameters NEURIPS2024
链接: https://arxiv.org/abs/2410.05578
作者: Jiawei Yao,Chuming Li,Canran Xiao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024. Project page: this https URL
[AI-225] ClaimBrush: A Novel Framework for Automated Patent Claim Refinement Based on Large Language Models
链接: https://arxiv.org/abs/2410.05575
作者: Seiya Kawano,Hirofumi Nonaka,Koichiro Yoshino
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
[AI-226] TaeBench: Improving Quality of Toxic Adversarial Examples
链接: https://arxiv.org/abs/2410.05573
作者: Xuan Zhu,Dmitriy Bespalov,Liwen You,Ninad Kulkarni,Yanjun Qi
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:
[AI-227] Improved deep learning of chaotic dynamical systems with multistep penalty losses
链接: https://arxiv.org/abs/2410.05572
作者: Dibyajyoti Chakraborty,Seung Whan Chung,Ashesh Chattopadhyay,Romit Maulik
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
*备注: 7 pages, 5 Figures, Submitted to CASML2024
[AI-228] Rational Metareasoning for Large Language Models
链接: https://arxiv.org/abs/2410.05563
作者: C. Nicolò De Sabbata,Theodore R. Sumers,Thomas L. Griffiths
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-229] Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives EMNLP’24
链接: https://arxiv.org/abs/2410.05558
作者: Xinliang Frederick Zhang,Nick Beauchamp,Lu Wang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP’24 Findings
[AI-230] On Instruction-Finetuning Neural Machine Translation Models
链接: https://arxiv.org/abs/2410.05553
作者: Vikas Raunak,Roman Grundkiewicz,Marcin Junczys-Dowmunt
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: WMT’24
[AI-231] Online Dynamic Pricing for Electric Vehicle Charging Stations with Reservations
链接: https://arxiv.org/abs/2410.05538
作者: Jan Mrkos,Antonín Komenda,David Fiedler,Jiří Vokřínek
关键词-EN:
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 45 pages, 11 figures, prepared for submission to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
[AI-232] On Feature Decorrelation in Cloth-Changing Person Re-identification
链接: https://arxiv.org/abs/2410.05536
作者: Hongjun Wang,Jiyuan Chen,Renhe Jiang,Xuan Song,Yinqiang Zheng
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:
[AI-233] Optimizing Tensor Computation Graphs with Equality Saturation and Monte Carlo Tree Search
链接: https://arxiv.org/abs/2410.05534
作者: Jakob Hartmann,Guoliang He,Eiko Yoneki
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To be published in the 33rd International Conference on Parallel Architectures and Compilation Techniques (PACT '24), October 14-16, 2024, Long Beach, CA, USA
[AI-234] Toward General Object-level Mapping from Sparse Views with 3D Diffusion Priors
链接: https://arxiv.org/abs/2410.05514
作者: Ziwei Liao,Binbin Xu,Steven L. Waslander
关键词-EN: Object-level mapping, Object-level mapping builds, Neural Radiance Fields, General Object-level Mapping, mapping
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted by CoRL 2024
点击查看摘要
Abstract:Object-level mapping builds a 3D map of objects in a scene with detailed shapes and poses from multi-view sensor observations. Conventional methods struggle to build complete shapes and estimate accurate poses due to partial occlusions and sensor noise. They require dense observations to cover all objects, which is challenging to achieve in robotics trajectories. Recent work introduces generative shape priors for object-level mapping from sparse views, but is limited to single-category objects. In this work, we propose a General Object-level Mapping system, GOM, which leverages a 3D diffusion model as shape prior with multi-category support and outputs Neural Radiance Fields (NeRFs) for both texture and geometry for all objects in a scene. GOM includes an effective formulation to guide a pre-trained diffusion model with extra nonlinear constraints from sensor measurements without finetuning. We also develop a probabilistic optimization formulation to fuse multi-view sensor observations and diffusion priors for joint 3D object pose and shape estimation. Our GOM system demonstrates superior multi-category mapping performance from sparse views, and achieves more accurate mapping results compared to state-of-the-art methods on the real-world benchmarks. We will release our code: this https URL.
[AI-235] Residual Kolmogorov-Arnold Network for Enhanced Deep Learning
链接: https://arxiv.org/abs/2410.05500
作者: Ray Congrui Yu,Sherry Wu,Jiang Gui
关键词-EN: Convolutional Neural Networks, complex non-linear dependencies, computer vision tasks, efficiently capture long-range, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Code is available at this https URL
点击查看摘要
Abstract:Despite their strong performance in many computer vision tasks, Convolutional Neural Networks (CNNs) can sometimes struggle to efficiently capture long-range, complex non-linear dependencies in deeper layers of the network. We address this limitation by introducing Residual KAN, which incorporates the Kolmogorov-Arnold Network (KAN) within the CNN framework as a residual component. Our approach uses Chebyshev polynomials as the basis for KAN convolutions, which enables more expressive and adaptive feature representations while maintaining computational efficiency. The proposed RKAN blocks, when integrated into established architectures such as ResNet and DenseNet, offer consistent improvements over the baseline models on various well-known benchmarks. Our results demonstrate the potential of RKAN to enhance the capabilities of deep CNNs in visual data.
[AI-236] Intuitions of Compromise: Utilitarianism vs. Contractualism
链接: https://arxiv.org/abs/2410.05496
作者: Jared Moore,Yejin Choi,Sydney Levine
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:
[AI-237] Neural Networks Decoded: Targeted and Robust Analysis of Neural Network Decisions via Causal Explanations and Reasoning
链接: https://arxiv.org/abs/2410.05484
作者: Alec F. Diallo,Vaishak Belle,Paul Patras
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:
[AI-238] Ensured: Explanations for Decreasing the Epistemic Uncertainty in Predictions
链接: https://arxiv.org/abs/2410.05479
作者: Helena Löfström,Tuwe Löfström,Johan Hallberg Szabadvary
关键词-EN:
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 35 pages, 11 figures, journal
[AI-239] Image Watermarks are Removable Using Controllable Regeneration from Clean Noise
链接: https://arxiv.org/abs/2410.05470
作者: Yepeng Liu,Yiren Song,Hai Ci,Yu Zhang,Haofan Wang,Mike Zheng Shou,Yuheng Bu
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-240] Herd Mentality in Augmentation – Not a Good Idea! A Robust Multi-stage Approach towards Deepfake Detection
链接: https://arxiv.org/abs/2410.05466
作者: Monu,Rohan Raju Dhanakshirur
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-241] On the Expressive Power of Tree-Structured Probabilistic Circuits NEURIPS2024
链接: https://arxiv.org/abs/2410.05465
作者: Lang Yin,Han Zhao
关键词-EN: exact probabilistic inference, probabilistic inference, exact probabilistic, compactly represent probability, Probabilistic circuits
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper was accepted to NeurIPS 2024
点击查看摘要
Abstract:Probabilistic circuits (PCs) have emerged as a powerful framework to compactly represent probability distributions for efficient and exact probabilistic inference. It has been shown that PCs with a general directed acyclic graph (DAG) structure can be understood as a mixture of exponentially (in its height) many components, each of which is a product distribution over univariate marginals. However, existing structure learning algorithms for PCs often generate tree-structured circuits or use tree-structured circuits as intermediate steps to compress them into DAG-structured circuits. This leads to the intriguing question of whether there exists an exponential gap between DAGs and trees for the PC structure. In this paper, we provide a negative answer to this conjecture by proving that, for $n$ variables, there exists a sub-exponential upper bound $n^{O(\log n)}$ on the size of an equivalent tree computing the same probability distribution. On the other hand, we also show that given a depth restriction on the tree, there is a super-polynomial separation between tree and DAG-structured PCs. Our work takes an important step towards understanding the expressive power of tree-structured PCs, and our techniques may be